Rsync.net Technical Notes – Q3 2021 (rsync.net)
47 points by rsync on Sept 29, 2021 | 29 comments



Happy to answer any questions or discuss any comments here.

Once again, thank you to Allan Jude at Klara Systems for the advice and guidance with the new ZFS "special" vdev for metadata caching that is discussed this quarter ...


No questions, just wanted to say thank you for such a great service.

Trusting a service provider is really hard in most cases, but you make it easy to trust rsync.net with posts like this, and by backing them up with reliable service.


Thanks for your kind words.

I can't speak for everyone here but I know that many of us, especially me, consider rsync.net to be our life's work.


Your "industries" page has a broken "about" link in the footer—and maybe other issues, as it's very different from most of the rest of your site, and there a some other little not-broken-but-not-ideal things about it that I see with a quick once-over, like the button-look "sign up now" at the bottom only having the text clickable.

https://www.rsync.net/industries.html


God that "industries" page is so lame. I can't believe we ever had that there.

The solution is not to fix that page but to remove all links to it ... I'll get the scientists working on it immediately.


I found the issue once before, and I admit it took a lot of clicking to find a route to it again this time (I thought you had removed it, at first).


> We believe that the risk of "logical failure" of an SSD is higher than the risk of physical failure. This means that some pattern of usage or strange edge-case causes the SSD to die instead of a physical failure. If we are correct, and if we mirror an SSD, then it is possible the two (or three, or four) SSDs will experience identical lifetime usage patterns. To put it simply, it is possible they could all just fail at exactly the same time. The way we mitigate that risk is by building mirrors of SSDs out of similarly spec'd and sized but not identical parts.

This makes sense to me (and is a good example of looking at more abstract failure domains in addition to the basic ones we all know and love) -- I'm curious if there's data to support this. rsync.net is in a good position to possibly collect that data.


This is actually just a non-scientific rule of thumb that I personally developed the first time I ever used an SSD as a boot mirror.

I have heard smart people confirm that this is a reasonable practice, but I have never seen any data or supporting figures.

It's basically cost-free, and if you don't like other vendors, you can always pair a current-generation Intel drive with a one-generation-ago Intel drive.
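
If it helps to picture it, here is a minimal sketch of such a mirror (the device names are hypothetical, not our actual layout):

    # two similarly sized SSDs from different vendors (or different
    # generations of the same vendor) in a single mirror vdev
    zpool create bootpool mirror /dev/ada0 /dev/ada1

    # verify the layout
    zpool status bootpool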


I think the "Mix SSD vendors/batch numbers" is a hold over from the very early days of SSDs where a handful of people get seriously slimed by having 90%~ of their same-brand-same-batch drives in a single machine fail at once due to some SSD batch failure.

As a side effect, people generally now stagger SSDs a little to avoid something similar happening (of course, if you have multi-machine replication this is less of an issue, but a total machine loss can still hurt due to capacity loss or parts shortages in edge locations, etc.).

I've personally not seen a synchronised SSD array failure happen for a long time, but it's hard to know how much of that is because people now plan to avoid them.

[edit]

with the exception of: https://www.engadget.com/2020-03-25-hpe-ssd-bricked-firmware...


"I think the "Mix SSD vendors/batch numbers" is a hold over from the very early days of SSDs where a handful of people get seriously slimed by having 90%~ of their same-brand-same-batch drives in a single machine fail at once due to some SSD batch failure."

I want to clarify - there's the issue of a bad batch, wherein the drives' longevity is greatly reduced and they fail in a cluster, etc.

But that is not what we are guarding against ...

Instead, the risk we're thinking about is that there is an actual bug in the firmware that causes a particular workload to brick the drive or destroy it or whatever.

The critical point is that if the drives are mirrored then they experience an identical workload over their lifespan and they could fail literally simultaneously.

So by all means - do indeed guard against bad batches or manufacturing defects by mixing drives. Just understand we're talking about something slightly different here ...


Yup - the Intel failure I mentioned below was a firmware issue, not any actual failure of the flash modules.


I remember the same "mix vendors and drive batches" suggestion for HDD RAID arrays. Ahhh.


I can confirm that there has definitely been at least one batch of enterprise SSDs from Intel, a couple of years ago, which failed en masse after a certain amount of powered-on time.


From my limited experience: I did have a pair of Intel SSDs in RAID 1 fail within 2 days of each other in the same way. Thankfully the first was replaced and the array recovered before the second failed.


I noticed that a 'borg list' command executed way faster after the Zurich upgrade. Thanks!


There's more to that story ... in fact, the metadata special device was not the magic bullet we hoped it would be.

The real magic bullet was changing the "freezing" process by which we transform 'borg' the Python script into 'borg' the binary executable.

You'll see a full writeup of this in the Q4 technical notes :)
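
For context, "freezing" just means bundling the Python interpreter and the borg code into a single self-contained executable - the upstream borg project does this with PyInstaller for its standalone binaries. A rough illustration of the general idea (not our actual build process):

    # bundle a Python entry point and its dependencies into one binary
    # (illustrative only; the real borg build uses its own PyInstaller spec)
    pip install pyinstaller
    pyinstaller --onefile /usr/local/bin/borg
    # the resulting standalone executable lands in ./dist/borg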


I just re-ran my borg list commands - they're now back to the same speed as a couple of weeks ago. Strange.


Great writeup.

I have a ZFS account - how do those work under the hood? Is it a VM backed by a ZFS volume? How does the overhead compare to a normal account? I suppose using a VM eliminates some advantages of the special device.


If your account is enabled for zfs-send then it is a bhyve VM which has a zpool just for you (which is running on top of our zpool).

I will have to look into this - does your zpool benefit from the metadata cache? Perhaps not, since it is a different zpool ...
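
Roughly, the layering looks like this (names and sizes here are made up, not our actual configuration):

    # host side: a zvol on the main pool backs the customer's VM disk
    zfs create -V 1T tank/vms/customer0

    # the bhyve guest attaches /dev/zvol/tank/vms/customer0 as a
    # virtio-blk disk and builds its own zpool inside the VM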


The metadata of the zvol on the host pool does benefit from the metadata vdev (NOT A CACHE).


OK, that is good to know - I will edit the post to reflect that (and also remove the poor use of the word 'cache').


It sounds like you're still comfortable using HDDs for the bulk of your storage, and adding SSDs for fast caches.

Do you think you'll ever get to a point where you run zpools entirely of SSDs? If so, what criteria are important to you? (Raw price per gigabyte? Power usage? MTBF? Something else?)


In the current landscape I do not foresee rsync.net using all-flash zpools.

All of our access is over the WAN, so disk IO is not that important - it is raw price per GB that matters. Even with the "nice" SAS drives we buy ... ~$400 for 16 TB is a huge difference vs. ~$700 for 1.8 TB (which is, roughly, the price of the Intel part mentioned in the list of cache drives).
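
For scale, the rough per-TB arithmetic on those two figures:

    $400 / 16 TB   =  ~$25 per TB   (SAS HDD)
    $700 / 1.8 TB  = ~$389 per TB   (enterprise SSD)

That is roughly a 15x difference in raw price per TB.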


To be clear on the 'special' vdev: for future writes, it serves as the sole metadata _store_ for the pool, not a cache, correct?


Yes, that's correct. ALL of the metadata from that point forward gets placed on the cache.

That is, until you fill it up - which you could.

At that point new metadata goes onto the spinning disk vdevs, as it did prior to the cache. As files age in and out, space gets freed on the cache and some new metadata makes it back on there.

So if you fill it up, it behaves more like a cache.

ALSO, you can add multiple metadata caches to a pool ... so if you fill one up, you can add another ...


The metadata vdev is not a cache, it is the primary store.

Too many years of thinking of the aux vdev types like SLOG and L2ARC as caches has made people confuse this one frequently.
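
For reference, it gets added like any other top-level vdev, and since it is a primary store it needs the same kind of redundancy as the rest of the pool. A minimal sketch (pool and device names are hypothetical):

    # add a mirrored 'special' vdev to hold pool metadata for all
    # future writes -- it is not a cache, so losing it loses the pool
    zpool add tank special mirror /dev/da10 /dev/da11

    # optionally also store small file blocks on it, per dataset
    zfs set special_small_blocks=32K tank/somedataset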


I knew about Filecoin and Arweave, but I didn't know about Chia.


TL;DR?


- They added fast metadata storage devices to their ZFS pools. Since failure of these metadata device groups means the whole storage pool is lost, they've set them up with a similar level of redundancy to the rest of the system (i.e. it's fine as long as 4 devices don't all fail in a very short span of time).

- Chia is a cryptocurrency that made drives ~double in price for a while, but that's mostly stopped, though they're still a little more expensive than they were before. Also, reading between the lines, Chia looks really, really scammy/pyramid-schemey, even by cryptocurrency standards.

- A 3rd-party SQLite streaming replication project added SCP as a transport mechanism, so now it works with rsync.net.

- Rsync.net supports lots of server-side checksumming tools, and they'd really like it if you'd stop using computationally expensive ones when all you're doing is verifying data integrity (rough example below).
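
For example, something like the following, assuming 'md5' is among the remote commands rsync.net permits (check their docs for the actual list; the hostname is a placeholder):

    # verify integrity with a cheap checksum run on the rsync.net side,
    # instead of a heavier cryptographic digest
    ssh youruser@yourhost.rsync.net md5 backups/archive.tar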



