Rsync.net Technical Notes – Q3 2021 (rsync.net)
47 points by rsync on Sept 29, 2021 | 29 comments



Happy to answer any questions or discuss any comments here.

Once again, thank you to Allan Jude at Klara Systems for the advice and guidance with the new ZFS "special" vdev for metadata caching that is discussed this quarter ...


No questions, just wanted to say thank you for such a great service.

Trusting a service provider is really hard in most cases, but you make it easy to trust rsync.net with posts like this, and by backing them up with reliable service.


Thanks for your kind words.

I can't speak for everyone here but I know that many of us, especially me, consider rsync.net to be our life's work.


Your "industries" page has a broken "about" link in the footer—and maybe other issues, as it's very different from most of the rest of your site, and there a some other little not-broken-but-not-ideal things about it that I see with a quick once-over, like the button-look "sign up now" at the bottom only having the text clickable.

https://www.rsync.net/industries.html


God that "industries" page is so lame. I can't believe we ever had that there.

The solution is not to fix that page but to remove all links to it ... I'll get the scientists working on it immediately.


I found the issue once before, and I admit it took a lot of clicking to find a route to it again this time (I thought you had removed it, at first).


> We believe that the risk of "logical failure" of an SSD is higher than the risk of physical failure. This means that some pattern of usage or strange edge-case causes the SSD to die instead of a physical failure. If we are correct, and if we mirror an SSD, then it is possible the two (or three, or four) SSDs will experience identical lifetime usage patterns. To put it simply, it is possible they could all just fail at exactly the same time. The way we mitigate that risk is by building mirrors of SSDs out of similarly spec'd and sized but not identical parts.

This makes sense to me (and is a good example of looking at more abstract failure domains in addition to the basic ones we all know and love) -- I'm curious if there's data to support this. rsync.net is in a good position to possibly collect that data.


This is actually just a non-scientific rule of thumb that I personally developed the first time I ever used an SSD as a boot mirror.

I have heard smart people confirm that this is a reasonable practice, but I have never seen any data or supporting figures.

It's basically cost-free, and if you don't like other vendors, you can always pair a current-generation Intel drive with a one-generation-ago Intel drive.
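
If it helps to picture it, here is a minimal sketch of such a mirror (the device names are hypothetical, not our actual layout):

    # two similarly sized SSDs from different vendors (or different
    # generations of the same vendor) in a single mirror vdev
    zpool create bootpool mirror /dev/ada0 /dev/ada1

    # verify the layout
    zpool status bootpool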


I think the "Mix SSD vendors/batch numbers" is a hold over from the very early days of SSDs where a handful of people get seriously slimed by having 90%~ of their same-brand-same-batch drives in a single machine fail at once due to some SSD batch failure.

As a side effect, people generally now stagger SSDs a little to avoid something similar happening (of course, if you have multi-machine replication this is less of an issue, but a total machine loss can still hurt due to capacity loss or parts shortages in edge locations, etc.).

I've personally not seen a synchronised SSD array failure happen for a long time, but it's hard to know how much of that is because people now plan to avoid them.

[edit]

with the exception of: https://www.engadget.com/2020-03-25-hpe-ssd-bricked-firmware...


"I think the "Mix SSD vendors/batch numbers" is a hold over from the very early days of SSDs where a handful of people get seriously slimed by having 90%~ of their same-brand-same-batch drives in a single machine fail at once due to some SSD batch failure."

I want to clarify - there's the issue of a bad batch, wherein the drives' longevity is greatly reduced and they fail in a cluster, etc.

But that is not what we are guarding against ...

Instead, the risk we're thinking about is that there is an actual bug in the firmware that causes a particular workload to brick the drive or destroy it or whatever.

The critical point is that if the drives are mirrored then they experience an identical workload over their lifespan and they could fail literally simultaneously.

So by all means - do indeed guard against bad batches or manufacturing defects by mixing drives. Just understand we're talking about something slightly different here ...


Yup - the Intel failure I mentioned below was a firmware issue, not any actual failure of the flash modules.


I remember the same "mix vendors and drive batches" suggestion for HDD RAID arrays. Ahhh.


I can confirm that there has definitely been at least one batch of enterprise SSDs from Intel, a couple of years ago, which failed en masse after a certain amount of powered-on time.


From my limited experience: I did have a pair of Intel SSDs in RAID 1 fail within 2 days of each other in the same way. Thankfully the first was replaced and the array recovered before the second failed.


I noticed that a 'borg list' command executed way faster after the Zurich upgrade. Thanks!


There's more to that story ... in fact, the metadata special device was not the magic bullet we hoped it would be.

The real magic bullet was changing the "freezing" process by which we transform 'borg' the Python script into 'borg' the binary executable.

You'll see a full writeup of this in the Q4 technical notes :)
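
For context, "freezing" just means bundling the Python interpreter and the borg code into a single self-contained executable - the upstream borg project does this with PyInstaller for its standalone binaries. A rough illustration of the general idea (not our actual build process):

    # bundle a Python entry point and its dependencies into one binary
    # (illustrative only; the real borg build uses its own PyInstaller spec)
    pip install pyinstaller
    pyinstaller --onefile /usr/local/bin/borg
    # the resulting standalone executable lands in ./dist/borg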


I just re-ran my borg list commands - they're now back to the same speed as a couple of weeks ago. Strange.


Great writeup.

I have a ZFS account - how do those work under the hood? Is it a VM backed by a ZFS volume? How does the overhead compare to a normal account? I suppose using a VM eliminates some advantages of the special device.


If your account is enabled for zfs-send then it is a bhyve VM which has a zpool just for you (which is running on top of our zpool).

I will have to look into this - does your zpool benefit from the metadata cache? Perhaps not, since it is a different zpool ...
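
Roughly, the layering looks like this (names and sizes here are made up, not our actual configuration):

    # host side: a zvol on the main pool backs the customer's VM disk
    zfs create -V 1T tank/vms/customer0

    # the bhyve guest attaches /dev/zvol/tank/vms/customer0 as a
    # virtio-blk disk and builds its own zpool inside the VM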


The metadata of the zvol on the host pool does benefit from the metadata vdev (NOT A CACHE).


OK, that is good to know - I will edit the post to reflect that (and also remove the poor use of the word 'cache').


It sounds like you're still comfortable using HDDs for the bulk of your storage, and adding SSDs for fast caches.

Do you think you'll ever get to a point where you run zpools entirely of SSDs? If so, what criteria are important to you? (Raw price per gigabyte? Power usage? MTBF? Something else?)


In the current landscape I do not foresee rsync.net using all-flash zpools.

All of our access is over the WAN, so disk IO is not that important - it is raw price per GB that matters. Even with the "nice" SAS drives we buy ... ~$400 for 16 TB is a huge difference vs. ~$700 for 1.8 TB (which is, roughly, the price of the Intel part mentioned in the list of cache drives).
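
For scale, the rough per-TB arithmetic on those two figures:

    $400 / 16 TB   =  ~$25 per TB   (SAS HDD)
    $700 / 1.8 TB  = ~$389 per TB   (enterprise SSD)

That is roughly a 15x difference in raw price per TB.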


To be clear on the 'special' vdev: for future writes, it serves as the sole metadata _store_ for the pool, not a cache, correct?


Yes, that's correct. ALL of the metadata from that point forward gets placed on the cache.

That is, until you fill it up - which you could.

At that point new metadata goes onto the spinning disk vdevs, as it did prior to the cache. As files age in and out, space gets freed on the cache and some new metadata makes it back on there.

So if you fill it up, it behaves more like a cache.

ALSO, you can add multiple metadata caches to a pool ... so if you fill one up, you can add another ...


The metadata vdev is not a cache, it is the primary store.

Too many years of thinking of the aux vdev types like SLOG and L2ARC as caches has made people confuse this one frequently.
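
For reference, it gets added like any other top-level vdev, and since it is a primary store it needs the same kind of redundancy as the rest of the pool. A minimal sketch (pool and device names are hypothetical):

    # add a mirrored 'special' vdev to hold pool metadata for all
    # future writes -- it is not a cache, so losing it loses the pool
    zpool add tank special mirror /dev/da10 /dev/da11

    # optionally also store small file blocks on it, per dataset
    zfs set special_small_blocks=32K tank/somedataset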


I knew about Filecoin and Arweave, but I didn't know about Chia.


TL;DR?


- They added fast metadata storage devices to their ZFS pools. Since failure of these metadata device groups means the whole storage pool is lost, they've set them up with a similar level of redundancy to the rest of the system (i.e. it's fine as long as 4 devices don't all fail in a very short span of time).

- Chia is a cryptocurrency that made drives ~double in price for a while, but that's mostly stopped, though they're still a little more expensive than they were before. Also, reading between the lines, Chia looks really, really scammy/pyramid-schemey, even by cryptocurrency standards.

- A 3rd-party SQLite streaming replication project added SCP as a transport mechanism, so now it works with rsync.net.

- Rsync.net supports lots of server-side checksumming tools, and they'd really like it if you'd stop using computationally expensive ones when all you're doing is verifying data integrity (rough example below).
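
For example, something like the following, assuming 'md5' is among the remote commands rsync.net permits (check their docs for the actual list; the hostname is a placeholder):

    # verify integrity with a cheap checksum run on the rsync.net side,
    # instead of a heavier cryptographic digest
    ssh youruser@yourhost.rsync.net md5 backups/archive.tar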



