I only lost 10 minutes of data, thanks to ZFS (mastodon.social)
436 points by chromakode on Aug 23, 2023 | 280 comments




I had two drives in my mirrored zpool die within 8 minutes of one another.

Both HGST drives too. A very sad day.

Thankfully I had been regularly zfs sending my contents to another site and lost very little data.
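For anyone unfamiliar, the send side of that boils down to something like this (pool, dataset, and host names here are placeholders; tools like zrepl or syncoid automate the snapshot bookkeeping):

zfs snapshot tank/data@2023-08-23
zfs send -i tank/data@2023-08-22 tank/data@2023-08-23 | ssh backupbox zfs recv -u backup/data

The -i makes it an incremental stream relative to the previous snapshot, so only the blocks that changed since then go over the wire.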

ZFS is rad.


I also had two mirrored drives fail simultaneously in my zpool a few days ago. There was nothing on them so I wasn't worried. WD Reds in my case.

Using matched drives seems to be a very bad idea for mirrors. I'll probably replace them with two different brands.

I also have a matched pair of HGST SAS Helium drives in the same backplane, so hopefully I can catch those before they fail too if they're going to go at once; I _do_ have data on those.


Alternatively, you can buy your drives a few months apart, which will most likely get you different batches.


Yeah, that's an option; though since my pair failed at once, I need to buy at least two immediately to get the mirror back up.


Because of this, I always buy a different brand as well.


Any limitations, or am I good so long as they're the same label capacity? I assume I just lose a few MB or so from whichever is a little larger?


Is there a specific reason why the drives die at the same time? Electricity spike?


Buying two identical drives has high chances of them being from a single batch, which makes them physically almost identical. It’s a pretty well-known raid-related fact, but some people aren’t aware of it or don’t take it seriously.


Identical twins may both die of a heart attack, but not usually at the same time.

Normally, failures come from some amount of non-repeatability or randomness that the systems weren't robust to.

The drive industry is special (in a bad way) in that they can exactly reproduce their flaws, and most people's intuition isn't prepared for that.


If they're bought together, like mine were, and they have close serials, they'll be almost identical; if you then run them in a ZFS mirror like I was, they'll receive identical "load" as well.

Since mine had ~43000 hours, they didn't fail prematurely, they just aged out, and since they appear to have been built pretty well, they both aged out at the same time. Annoying for a ZFS mirror, but indicates good quality control in my opinion.


If they're ~identical construction and being mirrored so that they have the same write/read pattern history, it could trigger the same failure mode simultaneously.


More likely to be from the same bad batch too. There was a post with very detailed comments about this just a few days ago.


Why bad? What's considered a good/bad lifetime for these? Mine had ~43000 power on hours, I don't know if that's good or bad for a WD Red (CMR) drive, but they weren't particularly heavily loaded, and their temps were good, so I'm fairly happy with how long they lasted (though longer would have been nice).


You're right, it might be a natural end of life that happened to coincide too.


> ZFS is rad. Typo: RAID


Redundant Array of Disks (cost not specified)


I always thought it was Independent Disks, though how one disk could be dependent on another is beyond me. Perhaps the I is redundant?



Looking it up I think you are right. I thought it was inexpensive.


Repeat after me: RAID is not a backup.

Sounds like it was the backup which saved the day here, not the raid array.


He didn't have a RAID array on his laptop; that would have saved him as well, with not even a second's worth of data lost.

And he could have kept using his system without replacing any drive right away; he would have had a much quicker recovery and could have used the system during the rebuild.

Now, RAID on a laptop is mostly reserved for bigger units. The author's setup is awesome as well, but a single drive failure is one of the most common reasons (especially where you make use of snapshots) you'd end up needing your backup, and that is exactly the case RAID solves.

So try to do both :)


> I had two drives in my mirrored zpool die within 8 minutes of one another.

2 drives in a mirrored zpool, this is the equivalent of RAID1.


Missed that context, meant regarding the article.


zfs send was used, so the failed ZFS pool still had a hand in the positive outcome.


Only 1/4 of data lost! Meaning still recoverable!


I'd bet that was intentional, because https://www.urbandictionary.com/define.php?term=Rad


You deserve all my upvotes :)


That's extremely unlikely. Could it have been the controller instead? Which HGST drives?


I've frequently had drives in a RAID fail in rapid succession. If you buy a bunch of identical drives at the same time and put them in a RAID, then you can end up with:

* They were manufactured in the same batch, maybe even one right after another on the same line.

* As they were transported from manufacturer to OEM to you, they were exposed to the same environmental conditions, right down to vibrations, humidity, and ambient EM environment.

* As you use them, they continue to be exposed to the same environmental conditions, including power supply fluctuations and power inductively coupled into places it doesn't belong.

* They see the same usage patterns. Depending on the RAID specifics, that might be right down to the same disk locations seeing the same read and write volume.

It's then not surprising if they fail at about the same time.

The last machine I put together that I wanted to have high availability, I intentionally bought two different brand drives to put in the mirror to maximize the likelihood that they fail at very different times.

Many years ago (c. 2003) the group I was working in inherited a massive 6U storage server with an insane number of 10k SCSI (it was before SAS was a thing) drives. We named it "hurricane" for the sound it made. After a few weeks of using it, the first drive failed. It rebuilt to a hot spare and we ordered and eventually installed a replacement. A few weeks later, another drive failed, and this time, before it could finish rebuilding, two more drives in the RAID failed and its contents were lost (but we had a good backup). We never used it again. For a while I used it as a coffee table, but then someone convinced me that was too tacky, and it got ewasted.


> It's then not surprising if they fail at about the same time.

It is, but in a different way. It is a testament to the depth and precision of manufacturing process control that two insanely complex machines will behave nearly identically for years, up to the point of failing at about the same time, if they've been made in the same batch and exposed to about the same environment and usage patterns over those years. You'd expect any number of random factors to cause one drive to fail way before the other, but no - not only is there very little variation between drives in a batch, tiny variations in usage are damped down instead of amplified.

It truly is amazing.


> If you buy a bunch of identical drives at the same time and put them in a RAID

When setting up a new machine with zfs I intentionally buy drives from as many different brands and models as possible to spread the manufacturing defect risk.


Not extremely unlikely if they were identical drives from the same manufacturing batch. It's good practice to use diverse manufacturers, or at least batches, when adding disks to a RAID array for just this reason.


It's not unreasonable to believe that if you pick two identical products off the same shelf at the same time (as one would logically do when purchasing 2 of a single item), that the two products were manufactured at similar times and in similar conditions.

Your model isn't exactly bad, but there is an assumption being made that you haven't accounted for, which, to be fair, is frequently not stated. The assumption is that the drives' defects are independent of one another. This is a poor assumption when they were manufactured back to back.


https://news.ycombinator.com/item?id=32026606 Hacker News went down a while back because of the 40k-hour bug. Both the primary and backup servers were placed into service at the same time with SSDs that had an overflow after ~40k hours.


The drives themselves were toast. My hypothesis was a short in the RAID controller or something leading to an overcurrent in the drives.

I wasn’t using them in a RAID configuration, but they were attached to a raid controller.


Or it could be that the tolerances and environmental factors were so tightly matched between the drives.


Stop, you're scaring me! I have mirrored drives in a zpool. If a pair dies I lose 18 TB. My most important stuff is cloud-replicated, but still...


My pairs of mirrored drives come from 2 different manufacturers to prevent a common fault from hitting both drives at once.


Ah, don't be scared. You're at least starting to think about your data and replicating your most important stuff elsewhere. I'd still recommend a non-cloud copy also, but you're probably okay :)

Take it as a good impetus to catalog your data and find those extra replication options. Data protection does cost you a bit to do it "fully", but since I've worked on a backup solution before in a client facing way, trust me when I tell you that I've seen rather large businesses (a few you might even know as a household name) who have less consideration for their data than you've expressed in your 3 sentences :)

So just figure out which of your data _truly_ needs to survive at all costs, get a solid setup with personally owned storage for the backups in combination with cloud storage, and you're probably fine.


The thing that has been worrying me lately with tens of TB of files I need to look after is how do I know the files haven't silently got corrupted somehow? I feel like I need to periodically re-checksum everything and keep hashes in a database somewhere off the side.


I think rsync has a --checksum flag and also supports incremental backups, so you can probably build a routine around your access patterns. That is, if you know for sure that you personally will not touch the files after 20:00 and no one else will, it's _fairly_ reasonable to assume that if you run rsync after you're done working and schedule an incremental pass a bit before you start working again, anything that was "copied" to your new backup has likely changed in an unintended way, and the list should be small enough to check each morning, if there's anything at all. Keep in mind that OS deduplication may mess with this (Windows' Dedup would undoubtedly break this scheme, as it may show files as modified after the Optimization job runs).

Alternatively, consider just using a file system that does periodic integrity checks. I know ReFS has integrity streams, and I am pretty sure XFS has something almost exactly the same or better. It won't prevent corruption, but it will give you something you can monitor for when the filesystem reports an issue.

Some combination like this should work.

Similarly, you might be able to come up with a fast trick using stat; with some quick testing on a dummy file in a macOS shell, you can do something like:

stat -f %m somedir/*

and compare the resulting value by passing it to sum or something. I am not super familiar with stat in general, so likely I am missing elements that make this unreliable, but I'd consider looking into it further unless someone tells me it's 100% the wrong direction and explains why.
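If you do end up keeping hashes off to the side, a flat manifest gets you most of the way there; a rough sketch with GNU coreutils (paths are placeholders; on macOS, shasum -a 256 plays the same role):

cd /mnt/archive && find . -type f -print0 | xargs -0 sha256sum > ~/archive.sha256
sha256sum --quiet -c ~/archive.sha256    # later: prints only files whose contents no longer match

Re-run the second line periodically, and anything flagged that you know you didn't touch is a candidate for silent corruption.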


Note that rsync generally runs far, far more slowly with the --checksum flag. And I can't recall it ever saying something like "change(s) in $FileName were only noticed by checksum", which is what would have alerted me to quiet disk corruption.

ZFS has checksums, and the 'zpool scrub' command tells it to verify those (on all copies of your data, if you're using RAID).
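For reference, kicking one off and checking on it is just (pool name is a placeholder):

zpool scrub tank
zpool status -v tank    # shows scrub progress and lists any files with unrecoverable errors

Many distro packages already ship a monthly scrub cron job or timer, so it may be running without you ever having set it up.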


You should probably be using RAIDZ2 in that case, which survives a simultaneous 2-drive failure without a problem.


I actually had this happen, and RAID Z2 saved me from a very long recovery process.

I thought it might be the controller but a year on I've had no further issues. Sometimes drives do just go like lightbulbs.


3 2 1 backups, ddg it ;)


I hope that mixing models and purchase times (I have been using the raidz expansion branch) will keep that raidz alive.

It has become uneconomical / impractical for me to back up everything.


Tangentially related, and on topic with the Google search issues that The Verge complains about at https://www.theverge.com/22291828/sandisk-extreme-pro-portab...

> Google assumes you’re looking for product pages when you search for things like “4TB SanDisk SSD,” so news stories like ours and Ars Technica’s appear far down search results.

Well, I've been test-driving the paid Kagi search engine, and this was an excellent opportunity to see if a different class of web search could produce different results...

Sadly, I'm afraid that when the whole web is flooded with praising articles, a different set of prioritization rules will still struggle to show different results:

* First 5-6 results are from shops.

* Then come some reviews, from: easeus [1], techpowerup [2], anandtech [3], consumerreviews [4]. None of them contain the word "fail".

* Lastly, and this is the only actual improvement from google, there are some relevant search suggestions such as "sandisk 4tb ssd failure" and "sandisk 4tb ssd problems". Difficult to see (they are at the bottom of the page), but at least better than Google (where the word "fail" doesn't appear at all in the first results page).

[1]: https://www.easeus.com/knowledge-center/sandisk-4tb-extreme-...

[2]: https://www.techpowerup.com/review/sandisk-ultra-3d-4-tb-ssd...

[3]: https://www.anandtech.com/show/16892/sandisk-extreme-pro-cru...

[4]: https://consumerreviews.store/sandisk-4tb-extreme-portable-s...


Just before I exited the Linux world entirely, I was beginning to chip away at the iceberg known as btrfs, and it was fascinating. I saw so much promise in many of its features, for revolutionizing backups and organizing my disks and everything.

Now, btrfs isn't ZFS, but it has some feature parity and is perhaps the "poor man's ZFS". It's also much more reasonable to run on certain OSes, due to the licensing, packaging, and in-kernel status of ZFS being kind of weird.

One memorable time I was encouraged to use ZFS was when I mentioned to the Linux User's Group that I'd had to pull the power cord to reboot my computer, and I was roundly scorned for this foolish maneuver. But you may change your mind about the wisdom of doing either one when you consider that the system in question was a Raspberry Pi. Heh.


The Pi supposedly can get FS corruption, but I've never seen it, because every time I install an image, I run a script that puts tmpfses everywhere and turns off mostly useless logging. Those things just run forever, very reliably, as long as you don't hammer the SD card.

ZFS looks so cool! Unfortunately when I eventually get a NAS I doubt I'll want to pay for anything that can run it, so I suspect I'll just be doing RAID and ext4.

I always stayed away from BTRFS, because every few months I'd see a "BTRFS destroyed my data" post, followed by an argument about whether it was BTRFS's fault. I see them less now; perhaps it's time to revisit?


I had multiple pi systems “bricked” after a power outage. Presumably it was file system corruption? I never looked into the issues further. I just wiped the drives and reinstalled. These were vanilla raspbian installs at the time. It happened a few times when I was first trying out my pi. These were a mix of me cutting the power and actual power outages.

I only ever had these issues with SD cards. I quickly switched to running my pi off a USB external SSD, and haven’t had any problems since then. Now when the power goes out, it boots back up properly and all my services start. All of this on ext3 I think?

Planning to redo things for ZFS at some point, but haven’t gotten around to it yet.


This happened a lot to my Pi 3 back when I used the internal SD slot. I set it to boot off USB and used an old 8gb USB2 flash drive as the boot drive and never had the problem again.


Yeah, sd cards that you haven't put through extensive crash testing simply can't be trusted. I used to have a jar of sd cards that didn't survive testing.


The most noticeable raspberrypi SD card life lengthener for me has been to write logs to RAM (assuming you have a stable setup and don't count on them to survive a reboot!).

Our $job dashboards used to nuke an SD card every couple weeks/months, but since the move to logs-in-RAM we've been running the same SDs for years.

DIY via {fs,journalctl} config, or using https://github.com/azlux/log2ram

Also, mount the SD with the `noatime` flag of course: https://wiki.archlinux.org/title/Ext4#Disabling_access_time_...
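The DIY side of that is only a few fstab lines; a sketch assuming the usual Raspberry Pi partition layout (device names and sizes are examples, not a drop-in config):

/dev/mmcblk0p2  /         ext4   defaults,noatime           0  1
tmpfs           /var/log  tmpfs  defaults,noatime,size=64m  0  0
tmpfs           /tmp      tmpfs  defaults,noatime,size=128m 0  0

log2ram is nicer in practice because it syncs the RAM copy back to the card on shutdown, so you don't lose every log on a clean reboot.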


Noatime, disable swap, logs in RAM, /tmp in RAM, .xsession-errors in RAM (with logrotate or it will fill up! So many random problems can end up writing there), Chromium profile folder in RAM if you do kiosk work (browser makers must hate flash!).

From what I hear, Home Assistant is still not the easiest thing to run for years on a card. Not sure if that's fixed now, but it's one of the big factors blocking me from moving to HA.


There was a link a couple weeks ago to industrial SD cards on Digi-Key. They're pricey, but they will work for years without disabling anything on these sorts of boards.

https://www.digikey.ca/en/products/filter/memory-cards/501?s... is the brand I used to use, iirc


Any script/guide to do all this on a Raspberry Pi in one go? I'm very interested for some Pi Zero 2: every couple of months the microSD content gets corrupted and I have to rebuild the system + restore backups.


The section starting at 827 is what I'm currently using, but it just covers the basics. Somewhere in there is also the kiosk launch script to run Chromium without breaking the card. If you use Apache you'll need to add a service to create the logfile, though.

https://github.com/EternityForest/KaithemAutomation/blob/dev...


A guide I wrote for $job that covers what I mention in the parent-parent post, and a couple other things: https://unito.io/blog/better-raspberry-pi-dashboards/

Note: only a guide, not scripted yet. See footnote for why :)


Ooooh, excellent additions, thanks :)


If you automate an fsck at every boot you can make these kinds of issues go away on the Pi.
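Two common ways to get that, assuming an ext4 root on the usual Pi partition (values are examples):

sudo tune2fs -c 1 /dev/mmcblk0p2    # force a filesystem check on every mount
# or append to /boot/cmdline.txt:  fsck.mode=force fsck.repair=yes

Recent Raspberry Pi OS images already ship with fsck.repair=yes, so half of this may be in place out of the box.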


That won't help if your SD card got corrupted because of too many writes.


Vjjnba


Sorry about that. Absolutely no idea how it got typed and posted. Had the page open this morning but I don't see how either pocket typing or feline interference could have done it…


You made the first page of Google with a brand new word!


My kid would leave these for me.


ZFS doesn't need hugely resourced hardware. For example, it can run on a single disk and still get you cheap snapshots/rollback, on-disk checksums, the potential for easy replication via send/recv, highly efficient caching via the ARC, and transparent compression, amongst other things. You can also start with a single disk and then easily add a mirror later on.
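A minimal sketch of that single-disk-then-mirror path (pool name and device paths are placeholders):

zpool create tank /dev/sda                # start with one disk
zfs set compression=lz4 tank              # transparent compression from day one
zpool attach tank /dev/sda /dev/sdb       # later: resilver onto a second disk, turning it into a mirror

Using short device names like this is just for brevity; /dev/disk/by-id paths are the safer habit.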

In terms of CPU and RAM, most NASs now are perfectly capable of running it (but do also consider speccing out a small form factor PC with a case with lots of drive bays vs prebuilt NAS - you might be surprised at what you can get for a similar cost).


> I'll just be doing RAID and ext4

Why do you need/want RAID ?

RAID is for availability, but is your data really so important that you cannot wait for a restore? Most people would be much better off using those 1..N parity drives as versioned backups instead of running RAID.

I've run NAS boxes for years, but these days I'm only using single drives.

My setup these days consists of laptops that synchronize data (encrypted) to the cloud, and a small ARM machine that synchronizes the cloud contents locally and makes a versioned backup to a single drive as well as a versioned cloud backup.

As for cost, it's cheaper (for me) to store my data in the cloud, than the cost of electricity required to run my NAS.


> As for cost, it's cheaper (for me) to store my data in the cloud, than the cost of electricity required to run my NAS.

how much data are we talking about?


I have around 10TB data stored in the cloud, including backups. Total cloud bill is around €23/month (recent price increase included).

There is a turning point somewhere after 15TB where the cloud becomes a lot more expensive than local storage, but at 10TB it’s hard to do in a reliable manner locally.

And it doesn’t really matter how you do it. If you’re the DIY type, and buy a second hand server, or a small machine, or just run out and buy the latest Synology box, you’re still paying more for storing <15TB at home in RAID. One will be more expensive in electricity, the other in purchase price.


I was going to do RAID for disk failure tolerance, either with a consumer NAS box, or a disk enclosure that does hardware RAID, and something very low power like a Zero 2 W.

Isn't RAID parity slightly more space efficient than versioned backups? Or is there a better way to do redundancy that doesn't involve just replicating entire files to multiple disks? Or some kind of automated manager that puts each individual file on N different disks out of M?

I mostly do embedded so reliable data storage isn't generally something I deal with, we usually leave that to the cloud or to the user, and I'm not quite familiar with what's out there.


>Isn't RAID parity slightly more space efficient than versioned backups?

It depends on your storage array. The more drives, the more space efficient RAID becomes, but RAID is still only a single copy of your data.

>Or is there a better way to do redundancy that doesn't involve just replicating entire files to multiple disks

Most of the industry is using erasure coding these days (https://blog.min.io/erasure-coding/) which allows for spreading your parity and data across multiple sites. Erasure coding usually runs a layer above the filesystem, as opposed to RAID which typically runs below the filesystem (Snapraid, Mergerfs and others excluded).

My personal "backup vault" is a Raspberry Pi 4 with a single 4TB external drive attached. The RPi runs Minio, and all backups are done through the S3 interface or SFTP/SMB. It is not the fastest box in the world, but it backs up (incremental) ~2TB in 30 minutes, which is "fast enough".

It consumes on average 4W, which means even with worst case electricity prices of €1/kWh (which we saw last winter), it costs less than €3/month.

For comparison, my NAS consumed around 50W, and at €1/kWh, that would cost €37/month in electricity alone, and then you need to add the cost of the actual hardware itself.

I switched off the NAS, and purchased ~10TB of cloud storage (main storage and backup storage at two different locations) for €20/month, and keep sensitive stuff encrypted with Cryptomator.


> but RAID is still only a single copy of your data

Btrfs has 3-copy and 4-copy RAID1 (profiles raid1c3 and raid1c4), doesn't ZFS have something similar?


ZFS has RAIDZ1, RAIDZ2, RAIDZ3 and ditto blocks, which do much the same thing, although a bit differently.

My point was that even if you have 4 copies of your data, you still only have a single machine where your data is stored, and you're essentially just one flood/lightning strike/house fire/burglary away from all of it being gone. Or one bad power supply away from 4 dead drives.

With versioned backups, you have higher latency on restoring data in case a disk dies, but your data is also safer.

As I initially stated, RAID is for availability. It is great for making sure that data is available 24/7, but that is rarely what the average home user needs. Most home users access their data infrequently, and would be perfectly fine waiting a couple of hours while restoring data from a backup.


I had a FreeBSD box with data on a gmirror of two disks for a decade. Twice in that period I had one of the disks die. Each time I would buy a new disk, add it into the array, and go on with next to no effort. As for cost, I would have had the box anyway, so it wasn't like I was paying for an extra system, just an extra disk. Cloud backup would have cost me several orders of magnitude more; this was a media library of hundreds of gigabytes, eventually low terabytes. Plus all the complexity of working backup/restore procedures.


And how much power does that box consume?

The average price of power during a year here is around €0.45/kWh, so a box drawing 50W (not unlikely with a decade-old processor and two disks) will use 438 kWh during a year, meaning it would cost €197/year to keep it powered.

Add to that the ~€150 x 2 for new drives spread over that decade, and you end up with €227/year, or €19/month.

But of course, if you, as you stated, had the box running anyway, the math works out differently as you're essentially splitting the cost with whatever purpose the box already fulfills.


The system started its existence on a VIA EPIA board with a sata controller pci card, later upgraded to an Intel Atom-based Supermicro board. The Atom was 20W TDP iirc, the VIA less than that. The system was mostly idle.


> ZFS looks so cool! Unfortunately when I eventually get a NAS I doubt I'll want to pay for anything that can run it, so I suspect I'll just be doing RAID and ext4.

FreeBSD won't break your wallet.


I assumed they were talking about the commonly spread myth that you need a gig of RAM for every terabyte of storage. I think that's recommended when doing deduplication, but for a simple NAS, ZFS would use a comparable amount of memory as ext4 on RAID.


Oh yeah. That was always a sketchy recommendation, but it sure did make its way around. I think deduplication is remarkably seductive, but doesn't seem worth the cost for almost anyone, given how it's implemented. IIRC, btrfs has a dedup option where you can link up duplicates later, and then you don't have to hold a dedup table of everything all the time, and don't need to collect writes to check the dedup table, etc. But rewriting data isn't how ZFS rolls, and I get that.


Or the myth ZFS without ECC is more dangerous than anything else without ECC.


Not sure if a myth, iirc it was literally in the ZFS manual last time I looked into ZFS (which to be fair was 10+ years ago).


The documentation author refuted it 9 years ago.[1] Probably your understanding or memory was incorrect.

[1] https://news.ycombinator.com/item?id=8438239


https://openzfs.org/wiki/System_Administration#Data_Integrit...

>Misinformation has been circulated on the FreeNAS forums that ZFS data integrity features are somehow worse than those of other filesystems when ECC RAM is not used. That has been thoroughly debunked. All software needs ECC RAM for reliable operation and ZFS is no different from any other filesystem in that regard


Is that really the case? Can you point to some references backing that up? (Hehe, unintentional pun)


ZFS will happily use a large amount of RAM for caching if you have it, but it'll run fine on a recent Pi (3 or 4, or not raspberry at all).

It'll run fine but more sadly on older Pis running 32-bit kernels, since it does a looooooooot of 64-bit and wider operations, so you pay a nasty tax on that on 32-bit things. (Though the virtual address space limits might actually be sadder than the 64-bit operation penalty there, really...)


That's surprising. I may try it on my Pi. I have 40TB and am very happy with SnapRAID, but ZFS always seemed like the “correct but expensive” solution.


I'm running ZFS on the smallest AWS instance, running FreeBSD, and it does what I want it to do.


TrueNAS' general requirements are 8GB, and they spell out when you want more. Most of the situations you'd want more you'd also want more on RAID. https://www.truenas.com/docs/core/gettingstarted/corehardwar...


The only filesystem I've ever completely lost to corruption was btrfs, and that was about a year ago. btrfs-restore completely failed, so if I really needed that data I guess I'd have to do some manual surgery. I got to the point in the documentation where the only recourse was "idk go ask someone on IRC".

Of course if you have good backups, you can use whatever and not really worry too much about it.


I've used it extensively for many years, both professionally and personally. Historically it's been something where users needed to be paying attention to the mailing list and wikis.

For the most part though, sticking to standalone or mirrored disks is pretty rock solid and has been for a long time. Ditto for subvolumes, snapshots, and send/receive. My laptop has been snapshotted and sent from one piece of hardware to the next for many years now.

That said, I'm with you on the backups. Anyone who uses btrfs and doesn't have rock solid backups is a madman.


Well, I guess that resets my "Has it been long enough since a BTRFS horror story that I can trust it" counter.


I've had the same experience the two times I've tried btrfs, about two to four years ago, in my case with Linux VMs that would occasionally be abruptly terminated. In both cases, the corruption that couldn't be automatically repaired happened within the first few of these sudden terminations.

Ext4 seems to handle that scenario better. I can't think of a single instance of filesystem corruption that fsck couldn't fix, and some of those VMs have probably been abruptly terminated at least a hundred times over the years.


I had some ram fail in my laptop and that killed my btrfs filesystem. Though btrfs restore was able to recover almost everything eventually (which is good because my last backup was a few weeks before since I had been traveling). Decided to go back to boring old ext4.


> ZFS looks so cool! Unfortunately when I eventually get a NAS I doubt I'll want to pay for anything that can run it, so I suspect I'll just be doing RAID and ext4.

I ran ZFS on a Raspberry Pi 4 with 8GB of RAM just fine (under debian arm64), and I've used ZFS on a machine with 4GB of RAM for receiving snapshots.


Ubuntu ships with ZFS support, as does TrueNAS.

I personally run my NAS with Ubuntu and ZFS and love it.


Apparently, recent releases of Ubuntu have dropped ZFS filesystem support after the person driving that effort left. :/


Reading up on it, recent releases of Ubuntu have dropped using ZFS for your boot and root volumes, which isn't ideal; however, they still support ZFS for any other volumes, which I'd venture is its primary usage anyway, and they don't plan on removing support for anything other than zsys/ZFS root/ZFS boot.

But thanks for bringing this to my attention. I had missed the changes in 23.04.


There's been a recent series of PRs which appear to be adding ZFS root support to subiquity, the new Ubuntu installer:

https://github.com/canonical/subiquity/pull/1689


Cool, hopefully that means it's being kept as an option after all. :)


Ditto!


I am using USB disks with ZFS connected to a standard cheap office PC as my storage server. It still provides plenty of benefits over ext4.


Is that a script that is maintained and up to date and available on the web? Just curious. I've done this stuff manually and have a checklist, but I never wrote a script because I didn't trust myself enough, barely knowing a little bit of bash myself. Another survival technique is to get a very large SD card, like 64GB, because wear leveling should increase the life. I have one RPi that's been going nonstop for 4 years now and the SD card appears to be fine. Not sure why the RPi guys don't have a "choose long term reliability over kitchen sink" option for setup (or even an image!)


It's in the kaithem-kioskify up a little earlier in this thread.

Eventually I'll probably move it to a standalone thing, since there seems to be a whole lot of interest in the SD protection feature, but it's missing a few of the hacks for programs I don't really use anymore, like the Apache logfile thing, just because I got tired of maintaining stuff that didn't have much interest.


Are you willing to share your script? I'd like to compare it to mine to see if I've missed anything. Thanks!


The version I'm using now is all tangled up with an installer and setup script (when I'm doing interactive installations I tend to try to reuse the same setup for everything), but there's a big ASCII art banner for most of the relevant stuff for the SD card.

Note that this doesn't have the Apache logfile hack, so Apache probably won't run if you try this and don't add something to create its fussy logfile.

https://github.com/EternityForest/KaithemAutomation/blob/dev...


What's the point of using ext4 on a NAS, and not XFS?


XFS is not the default on most systems and I hardly ever hear about it in general, so I really never paid much attention to it.

Seems like people say it is more CPU-heavy than EXT4, so unless it's way more reliable, would it really be the best choice on a Pi/router/sub-GHz commercial NAS chip?


We used to test some installers on XFS servers for Red Hat stuff. XFS needed far fewer "corrupt drive" fixes than ext4 on those servers (probably 1/10th as much corruption), and we were constantly just doing straight power-offs on them (no soft shutdowns). I think XFS doesn't get the respect it deserves if you don't need a "fancy" file system like ZFS or btrfs.


Huh, that does sound pretty appealing for sure.


I always disable rsyslog too.

Tell me about tmpfs: where do you use it?


Alright, I'll bite... What led you to leaving the Linux world entirely? What can "the community" learn from your experience to make it better for others?


Linux is a fantastic experience, and I have no qualms or ill will about it. I simply had no use for it anymore, and I needed to simplify. As I've said before, I'm not a sysadmin anymore; I don't tinker with systems, I need stuff to be operational and in production.

I still love Linux and I'd use it for any given server or Raspi if that were part of my job. I do use it daily in my job, but to a very minimal extent.


Did you switch to a mac or to a windows machine?


They said "Operational and in production" :)


I honestly don't know which one you implied.


BSD


So Ubuntu Server LTS, got it ;)


The only thing about btrfs that I do not like at all is the fragmentation that it is prone to, especially with sparse VM images.

zvols are so much better for that


A few years ago we had the capacitor plague. Are we now living through the storage plague? It's getting ridiculous that all storage is getting worse and worse. WD is making HDDs crappy with SMR, manufacturers say that 3 years of operating time is already too much for SSDs and HDDs, and they aren't joking. I just had a Kingston SSD (okay, that one was like 8 years old) and a portable WD HDD (~2 years old) die just this year.

The internet has been full of reports of data loss and longevity issues lately.

I remember 20 years ago HDDs were not meant for eternity either, but they definitely outlived the usefulness of the computer that they were bought with...


I've had a lot more luck with hard disks these days than I did 20 years ago. Remember, 20 years ago was the era of the infamous IBM Deathstar drives, where the magnetic coating would literally start sprinkling off the platters. It was also the era of terrible, terrible Maxtor drives that died in 1-2 years, which Seagate then bought, making their drives also unstable for a while. I ran a server with around 8 drives and had to keep replacing disks at the rate of about one per year.

Meanwhile today I'm helping admin a ZFS server with 20+ drives and drives have about a 4-5 year lifespan.

> but they definitely outlived the usefulness of the computer that they were bought with

Computers also became obsolete much more quickly back then. Today a 6-year-old computer is totally usable; back then you really felt it if your machine was just 3 years old.


> Remember 20 years ago was the era of the infamous IBM Deathstar drives where the magnetic coating would literally start sprinkling off the platters.

I have even older memories of problematic IBM drives. During the early 90s the shop I briefly worked with, found a supplier for IBM SCSI drives at a very convenient price, so they ordered a good lot of them. They worked great on PCs, but some of us also had Amiga machines and of course would love to benefit from the offer. So we tried one, but it didn't work; then another, and another; nothing, they were normal SCSI drives but refused to work on any Amiga with a SCSI controller, although any other drive would work in there. In the end we abandoned all hopes and took the drives for a reformat to be used on PCs, but... they were all dead. Completely, not even detectable by any controller; the mere connection to an Amiga SCSI controller destroyed them instantly. We never discovered where the problem was; those drives worked perfectly on all PCs, while we could install any other drive on every Amiga and expect it to work, but no way to put those in an Amiga and expect it to survive. Good old times indeed:)


Seagate had 1.5TB drives I think about 15 years ago with really high failure rates. Somewhere I saw close to 33%. Anecdotally, mine failed after about a year and the refurb warranty replacement also failed after about a year.

I think it was about 10 years ago some Seagate drives had higher than industry failure rates. Iirc one of their factories was producing drives that failed much more frequently than others (there might have also been something to do with the platter counts/model)


4-5 years? My old backup server has been running for almost 15 years now: 6x 500GB hard drives, one even running on PATA, 8GB ECC RAM, an Athlon II. I could save some money replacing those 6 drives with 2x 14TB hard drives today. But as long as it is working fine I ain't gonna do anything before I run out of hard drive space.


I remember those IBM Deathstar drives...

Was at the Aussie Tribes 2 launch LAN and there was a guy who had one die on him.

At that time in the LAN scene there would always be someone who had a Deskstar die ... You could hear the clicking over the noise of the LAN.

I realised back then, I can only trust Seagate.


The memory of the Deathstar drives that stands out most in my mind was a coworker managing to destroy hardware with SQL. We were at a really ... frugal ... interactive advertising firm and our dev server had been slapped together with a RAID-1 array of cheap IBM drives. One day said coworker was testing conversion of a large database table in MySQL from MyISAM to InnoDB format (to see how long it'd take, what query perf afterwards was like, etc.) and all of a sudden the server went hard down. We went over to the server closet and discovered that the IO had been enough for both drives to grenade themselves at the same time. Good times. I'm just glad we had semi-decent backups and it wasn't a production machine.


Hah, I had a Deathstar die on me back in the early 00s too. Surprisingly, about a decade later I hammered it with ddrescue and was able to get almost all the data off it!


Don't worry, your chips will start glitching soon too.

We're hitting scaling limits. Exponential growth is slowing.


that does not make any sense if you are talking about the controllers.



> I remember 20 years ago HDDs were not meant for eternity either, but they definitely outlived the usefulness of the computer that they were bought with...

Anecdotally, I remember HDDs failing sometimes for me and my friends/relatives back in the day. Now it barely happens with SSDs. Hell, even a supposedly problematic old Intel SSD from the late 2000s still works fine in the same old MacBook I gave to my mother after using it for years.

I wonder what the actual data regarding this looks like.


All WD drives I bought in the last decade work as expected (0), including the NAS ones bought after the introduction of that SMR thing; I just made sure they are either Plus or Pro, not the plain Red ones, which are SMR-plagued. I was also lucky with SSDs, but especially on desktops I use small ones as I still prefer to keep /home dirs and RAID arrays on old rusty drives.

0: A couple exceptions: Two WD Red (before SMR) which I took out from my old NAS to put bigger drives in place, and put in a drawer while they were still perfectly healthy. After like 2.5 years in their anti static bags and normal conditions, no excessive heat, no moisture, no magnetic fields etc, I took them out because I needed a spare disk and checked them: both were not working, one completely dead and the other barely recognizable but unreadable. The first didn't even show up once connected; I tried to clean all contacts, including the pins on the controller pcb to no avail, and eventually had to ditch it; the 2nd one could be reused only after a full reformat; no way to recover old data, not even using testdisk. I never experienced nor expected anything like that, and frankly it worries me quite a lot.


~20 years ago was the IBM "Deathstar," another drive so bad there was a class action lawsuit filed. Disks have always randomly died; that's part of the reason Sun made ZFS.


Very quick summary: The mastodon thread refers to https://zrepl.github.io "zrepl is a one-stop, integrated solution for ZFS replication."


Does anyone know how zrepl compares to sanoid/syncoid other than that zrepl is written in Go and sanoid/syncoid are Perl scripts?


I use sanoid to do basically the same thing as this, and was interested in giving it a shot to see if it was more hands-off, but it's definitely a more complex setup to begin with, given you have to set up your own SSL certs etc. Not sure why they wouldn't just use SSH transport for this like everything else.


I use Wireguard to secure and authenticate the transport. Much easier to set up! SSH is also an option.


Thanks. Good to know that's possible, it's exactly what I use for sanoid also, so I guess the quickstart just assumes that layer isn't available.


Looks like it needs to speak to a daemon running on the storage server?

Would be cool if it could just use e.g. S3 for storage.


Zrepl is a big part of why I feel secure doing the digital nomad thing. A script, run nightly-ish, opens a separate-headered LUKS-protected ZFS pool and then copies all snapshots over. That NVMe enclosure lives in my "purse", which never leaves my sight/body.

Between this and NixOS, I can provision a new identical laptop in about 10 minutes.

I recently added off-site replication as well, so even if I get completely devastatingly mugged, there's still about zero chance of serious data loss.

Zrepl is absolutely brilliant software. Easy to run with, but incredibly sophisticated and powerful if you need all the knobs. I can't praise it enough.


Kudos. You're probably safer than most people who are only one theft, fire, flood, or other disast


... er away from major data loss.


Do you have any experience with sanoid/syncoid? What does Zrepl give you over them?


You should do a write up about how this works. It sounds very interesting.


It's on my list, but... to be honest if you Google "separate header Luks", you'll find it's trivial to create a LUKS device with a detached header. Then the default ZRepl quick start will get you going with the basic pool-to-pool local replication. That will get you almost all the way there. :) I used their docs/guide to do the remote replication too, though it would make a good write-up as I could throw in how I use sops-nix for securing the Zrepl TLS bits for the remote scenario too...


This does sound really cool- and like a way to ultimately set up a secure, not that complex backup method...


Could snapshotting the filesystem every 10 minutes have contributed to its death?


ZFS is very special, and it is cheap to make snapshots with ZFS, because ZFS uses copy-on-write.

Intuitively I would think that the amount of extra writes is pretty low, even if you snapshot very frequently.

But scientific measurements would be nice.

I used to do snapshots every minute, every hour and every day with ZFS on some servers I administered. I’d purge the minute snapshots after 60 minutes. And I had cron jobs on other machines to back up the hourly and daily snapshots. I had it set up so that hourly snapshots were kept for something like 72 hours. And the daily snapshots were kept forever.

The idea with the every minute snapshots being that they were for undoing manually made mistakes during SQL migrations etc.

It worked well for me.
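For anyone wanting to reproduce the per-minute part, it only takes a few lines of cron-driven shell (dataset name and retention are placeholders; sanoid or zfs-auto-snapshot do the same thing with less glue):

#!/bin/sh
# snap-minute.sh - take a per-minute snapshot and prune ones older than an hour
DS=tank/db
zfs snapshot "$DS@minute-$(date +%Y%m%d-%H%M)"
cutoff=$(( $(date +%s) - 3600 ))
zfs list -Hp -t snapshot -o name,creation -r "$DS" |
  awk -v c="$cutoff" 'index($1, "@minute-") && $2 < c { print $1 }' |
  while read -r snap; do zfs destroy "$snap"; done

Run it from cron every minute; the hourly and daily tiers work the same way with longer retention.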

I still use ZFS on my FreeBSD servers. But at the moment my projects are low traffic and the data only changes in important ways some rare times. So with my current personal servers I manually snapshot about once a week and manually trigger a backup of that from another server.

Another thing I’ve changed is that now I only snapshot the parts of the file system where I store PostgreSQL databases and other application data. I no longer care so much about snapshotting the operating system data and such. If I have a serious hardware malfunction I will do a fresh install of the OS, and I have a log of what important config values are used and so on, that my backup scripts copy when I run them, without copying all of the other things.


This sounds overkill even for production data, much less personal data, particularly the every-minute snapshots and the fact you keep dailies forever.

Unless you're a custodian of some secret society's files!


> sounds overkill

But it wasn’t. It was very useful in fact.


Copy-on-write and cheap snapshots was quite special when ZFS was created. It’s hardly special in 2023, when every single non-vintage iPhone, iPad and Mac has that.


ZFS is still very special. Compared to the limited file system capabilities of the machines in the world using FAT32, NTFS, ext2, ext3, ext4, etc.


Wouldn’t a filesystem backup of a running SQL database be corrupted when you try to restore it?


When the filesystem is snapshotted atomically, no. That would be effectively the same state as from a power cut at that exact moment.

It is correct, though, that trying to do it with something like 'cp' on a live database will most likely produce a corrupt copy.


Not if the filesystem itself can do atomic snapshots. The problem can happen if you try to copy the files or even the device when it's being written to. But if you create a snapshot and copy that, it would be consistent.

You can of course end up in a dirty state where some new transaction was started but not committed, but any non-toy database should be able to recover from that. (You'll lose the transaction of course)


it’s equivalent to backing up a server after a hard power off.

as long as the filesystem supports some kind of journaling (aka it’s not ancient) and the database is acid compliant, there shouldn’t be any major issues beyond a slow startup.

but keep in mind that you may lose acid compliance by fiddling with the disk flushing configuration, which is a common trick to raise write speed on write choked databases. if that’s the case you may lose some transactions

your database documentation should have this information.


Mainly if you restore a previous version of the on-disk data while the DBMS is still running I would think. Because then the idea that the running DBMS has of the data no longer matches what’s on disk.

But if you stop the running DBMS before you restore the previous version on disk, and then start the DBMS again it should be able to continue from there.

After all that’s one of the key selling points of a bonafide DBMS like PostgreSQL, that it’s supposedly very good at ensuring that the data on disk is always consistent, so that when your host computer suddenly stops running at any point in time (power outage, kernel crash, etc) the data on disk is never corrupted.

If data is corrupted by restoring from a random point in time in the past, that should be considered a serious bug in PostgreSQL.


It depends on the DBMS, its configuration, the host OS, its configuration, and many other details. Snapshotting running databases is possible and often "just works," but you should always verify that before relying on this functionality in production.


> you should always verify that before relying on this functionality in production

Yep :)

At said company where I was using ZFS snapshots for the servers I administered, I additionally had a nightly cron job to dump the db using the tools that shipped with the DBMS. Just in case something with the ZFS snapshots fricked itself :)
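For PostgreSQL, that belt-and-braces dump can be a single crontab line along these lines (database name, path, and schedule here are made-up placeholders, not the actual job):

0 3 * * * pg_dump -Fc mydb > /backups/mydb-$(date +\%F).dump

-Fc writes the custom archive format, which pg_restore can restore selectively; the \% escaping is needed because cron treats a bare % specially.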


Smart.

I did a daily dump of my prod database to my local workstation; I needed it for a prod corruption issue because ops was only doing weeklies.


Not likely.

The snapshot doesn't write much, and both SSDs and ZFS are copy on write. Which means the cost of writing after a snapshot is the same as before the snapshot.

On the other hand, context is missing. Both SSDs and ZFS don't like being full or even close to full. The working set was ~650GB; if the drive was 1TB, then those snapshots could have easily made the drive over 90% full. This could have made ZFS unhappy all by itself.


I agree that it was unlikely. The total size of all data and snapshots was 625 GiB on a 2 TB drive (which had seen less than 2 years of moderate use). It was a pretty unexpected failure.


> cost of writing after a snapshot is the same as before the snapshot

I didn't understand this, could you please clarify?

If there was no snapshot, there would be only one write operation, the actual write. However, with a snapshot in place, in addition to the actual write, there is a copy operation which copies the original data and writes it to the snapshot location. So, there should be two write operations (actual + copy).


ZFS never overwrites in place in either case; you're just not freeing the old block if it's in a snapshot. A snapshot is just a note saying "nickname this point in time 'mysnapshot', and don't clean up anything referenced at this point in time", so it's very cheap to make, and you just check it later when you would otherwise be cleaning things up.


Does it actually not update in place even for areas with a single reference? I haven't checked the source, but that sounds like fragmentation hell on spinning disks. That would absolutely kill the performance on zfs-hosted VM images / databases, which I didn't think actually happens... (Apart from the intent log, which sure, that's append only)


I promise you, it does not.

ZFS really deeply assumes that, when a region is in use, it will not change until it's no longer in use anywhere, and it also won't reuse things you just freed for a certain number of txgs afterward, so that you can roll back a couple of txgs in case of dire problems without excitement. (Since having enough writes will cause more txgs to happen faster, this isn't an issue people run into with being unable to use newly freed space in practice.)

Also in practice, defining what "sequential" means with multiple disks in nontrivial topologies becomes...exciting anyway, and for writes, you only care that things are relatively, not absolutely, sequential for spinning media, and on reads, prefetch is going to notice you doing heavily sequential IO and queue things up anyway. (IMO)

If you like, you could go check on your configurations, what the DVAs for the different data blocks in your VM images are - something like zdb -dbdbdbdbdbdb [dataset] [object id, which you can get from the "inode number" of the file, or if it's a zvol, I think it's always just 1 that all the data you think of as the "disk" goes in...]

You'll almost certainly find that the regions that changed more than a couple txgs apart (the "birth=" value is the logical/physical txg the record was created) are mostly not remotely sequential.

(Nit - the two exceptions that come to mind are, the uberblocks are basically a fixed position on disk relative to the disk's size, and a fixed size, and you get [fixed size]/[minimum allocation size] of them in a ring buffer, basically, before you overwrite the oldest one, and that happens by just overwriting it, since it's technically not in use any more, someone just might want to roll back to it in a "This Should Never Happen(tm)" case...or the newly added feature of corrective send/recv, to let you feed ZFS a send stream of an "intact" copy of something that had an uncorrectable data error and have it scribble over the mangled copy with the fixed one in-place, assuming it passes the checksums.)


So looking at various benchmarks, reports and tuning guides, it does look like the spinning disks performance really suffers from zfs fragmentation. I haven't seen those before, but also haven't dealt with databases on zfs either. Something to keep in mind I guess.

Edit: after reviewing a few benchmarks, the outcome seems to be - even on SSD, make sure you actually want the zfs features, because ext4 will be a lot faster.


Yeah, it's a tradeoff. Zfs gives you easy data integrity verification (and recovery if you have redundancy), easy snapshotting, easy send/recv. But you lose out on modify in place, and unified kernel memory management (at least on FreeBSD and Linux, maybe it's different on Solaris?); both of those can reduce performance, especially in certain use cases.

IMHO, zfs is a clear win for durable storage for documents and personal media. It's not a clear win for ephemeral storage for a messaging service or a CDN. If you don't mind running multiple filesystems, zfs probably makes sense for your OS and application software even if your application data should be on a different filesystem.


Do you have pointers?

Because there are various mitigations and configurations involved if you're trying to do lots of small random IO for ZFS, and I've not heard people giving the advice of "just don't" in most use cases.


Just search for "zfs ext4 postgresql benchmark" - you'll find many of them using different configurations.


There's no copy operation, the previous data isn't overwritten and the new data is written to a new block. It's "copy-on-write."


As I understand it, taking a snapshot with ZFS involves writing a metadata object and some data references. Assuming 100 GB of data, 128K block size, and 64 bit pointers, I'd guesstimate * that new data written during a snapshot would be in the ballpark of 5 MB. Is doing that 6 times per hour (52,560 times per year) enough to cause premature wear on the drive? That would be ~256 GB per year. This is likely under 1% of an SSD's write endurance. So, I'd be surprised if taking 10 minute snapshots was a significant causal factor.

* I could be wrong, I asked for some help from not the most reliable sources. Happy to be corrected. Still, if my estimate is higher than actual and yet still unlikely to affect drive longevity, it may be moot.


Not likely. A snapshot just marks the most recently written block and prevents previous blocks from being altered. (More or less.) Since ZFS is copy on write, any changes to files will involve the same writes and some previously written data will not be deleted.


Minimum write size of a modern Flash chip can be ~100MB(!) according to a comment found in a random orange website[1]. So 5MB write every 10 minutes can be 600MB/hr, which is 4.8TB/8-hr-day, which is 24TB/40-hour-week, which is 3.43 DWPD real time for a 1TB drive, and 2500 TBW in 2 years real time[2].

Official quoted specification for SN850 is 600 TBW of write endurance, likely after derating for obvious warranty implications. Incidentally, 2500TB is also a typical endurance figure for many SSDs in this market. Overall, to me, sounds not entirely impossible.

I kind of wonder what's the controller says in SMART data, if still alive. On Linux the command is `apt install smartmontools; smartctl -s on /dev/sda; smartctl -A /dev/sda`, and it shall print out a table[4]. On Windows, just install CrystalDiskInfo[3].

1: https://news.ycombinator.com/item?id=29165202

2: DWPD: drive writes per day, TBW: Total Bytes Written - in terabytes

3: https://crystalmark.info/en/software/crystaldiskinfo/

4: Note that "Pre-fail" means the value is supposed to change when about to fail and "Old_age" means the value is supposed to indicate age, NOT "this is bad and about to fail" and "this drive is old". It always says all Pre-fail and Old_age. Someone should have changed it to "somewhat_boolean" and "life_remain" long time ago in my opinion.


Unfortunately the drive didn't appear accessible at all via nvme-cli. Interestingly it shows up in lspci but doesn't get a /dev/nvme. It tends to hang the UEFIs of the two systems I tried it in when they try to read it.


Minimum write size is not erase block size.


My machines have always been just constantly writing logs, like every couple seconds (macOS does this), and the write wear has never been anywhere near that bad. The advertised endurance must take into account write amplification for typical loads.


> Minimum write size of a modern Flash chip can be ~100MB(!) according to a comment found in a random orange website[1]

Your reference says erase block size. That is not minimum write size. Sorry but you are clueless (and careless).


You were off by a factor of 1000: GB, not TB.


Nope. As others have mentioned, ZFS is CoW. Snapshots are "free" in that they (basically) point to a transaction group in the filesystem. They record a small amount of metadata to disk on each snapshot - on the order of a few MB. This is much much lower than an rclone/sync style backup.


That's also the least interesting part of comparing a ZFS backup to rsync/rclone: the rsync way is to crawl the entire tree being backed up, diff it over the network, then copy the differences. Because of snapshots, ZFS already knows all the changes that occurred between snapshot A and snapshot B, and (provided state up to A has already been backed up) can update the backup by pushing all changes between A and B as one big binary blob without having to scan or diff anything.
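A minimal sketch of what that looks like (pool, snapshot, and host names are made up):

    # initial full replication, then only the delta between snapshots
    zfs send tank/home@A | ssh backup-host zfs receive backup/home
    zfs send -i @A tank/home@B | ssh backup-host zfs receive -F backup/home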


No. Individual snapshots are a matter of kilobytes; desktop environments write considerably more.


I read somewhere that snapshots are actually around 5 MB. Still not a lot, but a lot more than a few KB. A year's worth of hourly snapshots comes to over 40 GB just in snapshot overhead.


Another factor to weigh in my case is this laptop probably spends at least 50% of its life suspended. The overhead should be measured in MB per hour of uptime.


No. ZFS implements normal writes to the disk as snapshots (just unnamed ones), so in fact you can only write to disk through the creation of a snapshot, or by writing to the Intent Log, which is a short-term log of data that will go into the next snapshot but was synced before the snapshot was written, so it's secure in case of power failure.


Nah, changes are COW so most snapshots are tiny.


I guess it could contribute somewhat, but I don't think it is that much additional work: for every new write (since the last snapshot), there is one additional read as the data is sent. It isn't reading the whole file system, just the incremental data.


Of course it contributed. But it probably wasn’t the main reason or a significant contributor.


I tend to use Apple’s Time Machine incremental backup to a Synology spinning rust server. I also have an external SSD that I’ll mirror the internal drive to, if I’ll be doing anything dodgy, or upgrading my machine.

That works. TM restores can be quite slow, but almost all my important data is in Git (and hosted storage), so it’s not really been an issue. I just use TM every now and then, if I have a single file I want to backtrack.

I also have one of the notorious[0] SanDisk drives. I don’t use it for anything important. It just has some game storage. Since I’m a Mac user, games aren’t really much of a factor for me, and I won’t cry, if they croak.

[0] https://arstechnica.com/gadgets/2023/08/sandisk-extreme-ssds...


I use TM as well. I got an app (https://tclementdev.com/timemachineeditor/) that will manually trigger TM backups whenever the machine is idle. Seems to work a lot better than the Apple automatic or timed backups.


I’ve given up on Time Machine. It never seems to work past a month or two for me on my Synology w/ atalk etc.


It was always breaking on my Synology too. I ended up just attaching an 8TB spinning rust directly to the Mac and it's been flawless since.

Time Machine really doesn't like using remote disks that aren't official Apple gear.


Even when I had an Apple Time Capsule, it would break about once a year. It's just a flaky system. Wish they'd add the equivalent of zfs send to APFS instead of using the weird "gigantic sparse disk image with hard links in it" system.


When backing up to an APFS Time Machine volume it does work a bit like that; at least no hard links are used any more:

- https://eclecticlight.co/2021/03/11/time-machine-to-apfs-und...

- https://eclecticlight.co/2021/04/16/time-machine-to-apfs-usi...


Oh awesome! I guess I need to delete and re-create my Time Machine backup to enable this.


Ditto. Once I bothered to set up Borg for my other boxes, it was overall much less pain to just use it on macOS too.


FWIW Apple has deprecated AFP/AppleTalk and you should disable it on the Synology. It's far more stable with SMB (but still not great)


My snapshots are encrypted by the original computer (this is cool because the NAS can’t read them!). So I also needed to restore the encryption “wrapper key” to be able to use the backups.

Not gonna lie, it was pretty terrifying until I had my first confirmation I could decrypt the data.

Note: no bespoke backup method should be assumed functional unless you actually periodically check that you can restore data from it.


Or non-bespoke.


I added that word because, at least with non-bespoke backups, you probably have thousands of users each day testing restoration out of necessity, and word might get around if the method failed to restore. But nevertheless, one should still test even then.


In short, backups aren't actually taken until they have been verifiably restored.


I always see this advice but... how do you even do that without having an entire additional set of disks to restore to?

You can't restore to production, obviously: aside from the downtime, if the test fails you've just destroyed your good copy and proved the other copy is also bad.

About the best I can think of is restoring a small part of the set as a sample, which isn't really testing the whole thing.
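With ZFS specifically, one middle ground is to receive the backup into a scratch dataset on whatever pool has free space, spot-check it, and throw it away; a sketch with made-up pool/dataset names:

    # receive the backup stream into a throwaway dataset, check it, destroy it
    zfs send backup/home@latest | zfs receive -u tank/restore-test
    zfs set mountpoint=/mnt/restore-test tank/restore-test
    zfs mount tank/restore-test
    # ...diff/spot-check a sample of files against the live copy or known checksums...
    zfs destroy -r tank/restore-test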


I always meant to keep a copy of my LUKS header on a separate disk, just in case, but..
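For reference, it's a one-liner (device path and destination are placeholders; store the backup anywhere other than the LUKS disk itself):

    cryptsetup luksHeaderBackup /dev/nvme0n1p2 \
        --header-backup-file /mnt/other-disk/luks-header.img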


bcachefs is the near future for Linux here. https://bcachefs.org/


What does bcachefs do better than BTRFS or ZFS?


Than both? Tiered storage. Than btrfs? Hopefully parity RAID, and performance. Than ZFS? Being GPL-compatible, so viable for being in-tree in the kernel.

(list most likely incomplete)


Near future? It's been in development for 7 years and still hasn't been accepted upstream (an upstreaming effort is in progress, though).


It's just silliness (attitudes, versus technical merit) blocking it now. The only thing stopping it is if something happens to Kent.

https://www.phoronix.com/news/Linux-Torvalds-Bcachefs-Review


Any tldr about why we should be looking forward to this?

Any cool things?


Erasure coding, when it lands, should be pretty solid.

Until then, per-directory data replicas are the killer feature for me (Music has 3, Documents has 5, Downloads has 1). Combined with full compression and encryption, that's something to be very excited about.


You can do that with ZFS at the cost of defining separate filesystems per directory.

I don't use multiple replicas, but I use that to tailor my backups per directory. ~/documents is snapshotted and backed up on the regular, with long-lived snapshots. Code is snapshotted regularly, but snapshots don't live too long, and they're not shipped to a different drive. I don't care for ~/tmp so no snapshots.
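A sketch of that, with made-up dataset names. One caveat: ZFS's copies= tops out at 3 and stores the extra copies within the same pool, so it isn't quite the same as bcachefs's per-directory replicas across devices.

    zfs create -o copies=3 tank/home/music       # extra in-pool redundancy
    zfs create -o copies=1 tank/home/downloads   # no extra copies
    zfs create tank/home/documents               # gets its own snapshot/backup policy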

---

edit: the cost, besides having to actually create the file systems, is that moving data between them isn't instant.


> the cost, besides having to actually create the file systems, is that moving data between them isn't instant.

That will no longer be the case once block cloning support goes into production: https://github.com/openzfs/zfs/pull/13392


Nice, thanks!


> Not gonna lie, it was pretty terrifying until I had my first confirmation I could decrypt the data.

Sounds like the process was a bit less tested and documented than optimal. For a home system or personal desktop that's not super unusual.

You don't want to be working out your restore procedure on the fly for production servers though. ;)


For sure. I knew I had all the right ingredients to restore, and had kicked the tires 6 months back, but I initially copied over the wrong key. When it failed to load I had a sad moment until I realized my mistake. In such fail moments there's a flash of clarity where every process gap becomes blindingly obvious.


For every ZFS fan, I can recommend zfs-auto-snapshot[1]. I use it on my proxmox server[2] to auto-manage snapshots, including throwing away old ones.

[1]: https://github.com/zfsonlinux/zfs-auto-snapshot

[2]: https://pilabor.com/series/proxmox/restore-virtual-machine-v...
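For reference, the cron entries it installs look roughly like this ('//' means all datasets that have the com.sun:auto-snapshot property enabled):

    zfs-auto-snapshot --quiet --syslog --label=frequent --keep=4  //
    zfs-auto-snapshot --quiet --syslog --label=hourly   --keep=24 //
    zfs-auto-snapshot --quiet --syslog --label=daily    --keep=31 //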


I actually made two scripts to automate sending and deleting old snapshots, along with one that calls auto-snapshot when a VM reboots or shuts down. I was thinking that they might be useful to other people.


For people who don't want to use ZFS but are okay with LVM: wyng-backup (formerly sparsebak)

https://github.com/tasket/wyng-backup


I wonder how much he gained from entirely restoring the system versus simply reprovisioning (gasp, even manually reinstalling) and restoring needed files at will. I'm not sure there's a lot of value in snapshotting and restoring, in a lost-SSD situation, stuff that's also available in mirrors across the world.


I've reflected similarly after this exercise.

I've lost data before, but it felt terrible to lose my context and working memory. While I make sure the most important stuff is in git, there's a bunch of momentum and working memory in my bash history and system configuration. It's also nice to not have to think very hard about a patchwork of backup plans.

It's nice to get a fresh start every now and then, but not under duress. I was in the middle of a multi day project and was gonna lose time either way. It was real nice to boot back into a machine that felt like home.


I do a version of this that doesn't require restore at all, I have three separate physical systems that all share a VPN across the world, and wherever I am at any given time zfs snapshots are syncing across that VPN to those three physical systems depending on which one is the primary I'm using at any given point in time. They also use the VPN layer to check if they have peer status on a faster local network like the wifi or LAN and use that instead for the snapshot transfers if so.

If for any reason any one of these systems either is destroyed or is no longer master, picking up from where I left off is as simple as picking another system up and marking it "master". No restore process, no changes, nothing at all, and it picks up from exactly where I left off when I was working on the other system. Means I can just grab my EDC laptop and stuff it in a bag not knowing how long I'll be out or where I'll be going and also know that it will be completely up to date with my datasets, or I can grab my desktop replacement laptop and its enormous external disk if I am going to be on a different continent for an extended period of time and want full geographic dataset locality. At no point in time does any of the above require the manual running of any process or replication or anything like that.

Reprovision and restore would take a whole lot longer than this, wouldn't give the abilities that it provides, and the above is only possible because of zfs snapshot replication.

I also use a USB-C external SSD that is a member of a bootable ZFS mirror and md RAID group, so even if my EDC laptop were to spontaneously combust, I could immediately get up and running on any similar laptop with roughly comparable hardware simply by putting that SSD in and booting from it, then adding that laptop's SSD to the ZFS mirror / md RAID group.


The biggest time save is in time spent recovering. It's so much faster to restore the entire system than to reinstall the OS, reconfigure the bootloader, resetup disk encryption, reconfigure user accounts, reinstall all software, manually reload configs, etc.

Or to put it more directly, full-disk backups are a great way to get RTO (recovery time objective) down.


In this case the author had to do arcane magick to restore his zfs snapshot. This wasn’t a routine raw dd restore.


Agreed. It was a large time investment that happened to pay off. If I ever have to do it again I'll be much faster. I hope that with wider ZFS adoption some of the routine tasks will be automated better in the future. I see no reason why, in a couple of years, this couldn't be a mature fire-and-forget user experience.


If it can be done manually, it can be automated and made more reliable!


I read this piece as someone who just got lucky. He never tested it until he needed it. 10 minutes is good. Don't let this story fool you into not taking backups.


I tested my backups about 6 months ago when I set up zrepl. When I mentioned it was scary until I could decrypt the data, that wasn't the whole story, actually: initially I restored the wrong wrapper key and it failed to load!

It's also scary in general to go from 2 copies to only 1 copy of data. A friend and I have been planning to trade replicas but haven't set it up yet. There's definitely still room for improvement in my setup.


> Don't let this story fool you into not taking backups.

Do you mean don't let this story fool you into not testing your backups? Because the whole point of the story is he was saved by having his backups. (Though you're right that he lucked out by having it work when he hadn't tested it.)


I just have rsync running in a cronjob. How is this significantly different?

I imagine it is, but I don't know how.


Mostly different in terms of performance and wear on the drives. If you rsync over and over, it has to scan basically the whole filesystem for changes each time. Zfs snapshots don't. The snapshot is ~instant and the calculation of what to send has no need to examine any files.

I don't think the performance and drive-lifetime hit of running rsync every 10 minutes would be good.

ZFS should have an edge in terms of atomicity as well, but in practice I'm not sure how much that matters. I _think_ it does matter but isn't perfect (ZFS can't trick applications into doing atomic writes if they aren't already, but it won't add _another_, worse layer of broken atomicity the way rsync must).


Depends what else the machines are doing, and how much ram, and the rsync settings.

If you read all the files on both sides, every time and don't have more ram than disk, it's going to be a lot of work. If you're just looking at directory entries most of the time, there's a good chance that's all cached and it's no disk load, other than the small changes.

I agree with you, though, that atomicity is a big difference, if it matters, and in most cases it probably doesn't.

Personally, I've mostly stopped doing rsync backups in favor of zfs send, but I've still got one I need to get around to changing. Sanoid/syncoid is pretty decent for less-effort snapshotting and syncing snapshots, but I haven't done anything with encrypted datasets. For most of my systems, I'd prefer recovery over security. For the one system in iffy hosting, it runs full-disk encryption as a layer below ZFS, so its zfs sends are cleartext, too. (The hosting facility has given me other customers' unwiped disks; better for me to assume my disks won't be wiped.)
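In case it's useful to anyone: a sanoid policy is roughly a stanza like this in /etc/sanoid/sanoid.conf (dataset name made up), and syncoid then replicates it with something like `syncoid tank/home backupbox:backup/home`:

    [tank/home]
        use_template = production

    [template_production]
        hourly = 24
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes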


> If you read all the files on both sides, every time and don't have more ram than disk, it's going to be a lot of work. If you're just looking at directory entries most of the time, there's a good chance that's all cached and it's no disk load, other than the small changes.

Yeah, that's a good point. I know that rsync is _quite_ clever, but with any incantations I've ever done, it still hits the drives a good amount. I'd ballpark-guess a couple of orders of magnitude better than just "cp -r" or something, but still a couple of orders of magnitude worse than ZFS snapshots.

Yeah, you're 100% right that it'll depend on a bunch of variables, though.

> Sanoid/syncoid is pretty decent for less-effort snapshotting and syncing snapshots, but I haven't done anything with encrypted datasets.

I'm not sure I'd recommend it, but I use both directly on encrypted datasets. I have tested recovery a couple times and it works fine, but I've read some cautionary tales too. I _think_ they're all old issues?


Regarding atomicity, do you rsync from the live filesystem or from the hidden .zfs/snapshot dir?

My impetus for swapping from an rsync-based approach (running an incremental from the normal, non-snapshot filesystem view, then taking a snap on the far side) to a send-based approach was encountering corrupted encrypted containers. If rsync ran while a container was mounted and being written to, it'd get an inconsistent view of the underlying file and produce a nonsense diff, resulting in an unmountable container in the backup. This doesn't happen with send because it has a consistent view of the blocks it needs to replicate, but as I was writing this I realized that running rsync from ZFS's view of the snapshot might get around that. Of course, at that point it's probably easiest to just use send.
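For the record, that variant would look something like this (paths and host are placeholders; the .zfs directory is reachable even when snapdir=hidden):

    snap="rsync-$(date +%F-%H%M)"
    zfs snapshot "tank/data@$snap"
    rsync -a "/tank/data/.zfs/snapshot/$snap/" backup-host:/backups/data/
    zfs destroy "tank/data@$snap"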


My rsync backup goes from the live filesystem (and then does some snapshot-like things on the other end with hard links and whatnot), but it's for my household's shared network drive, so atomicity isn't super important. Mostly there aren't many changes, and if there are, it's OK if it takes a couple of snapshots to settle.

I didn't originally have that area as its own ZFS filesystem, and it wasn't even originally on ZFS, but I moved things around when setting up a new, offsite backup system... Just haven't gotten around to redoing the old backups. I don't think I'd spend any more time on rsync-based backups, given how I use things now; incremental ZFS means no need to compare, which gives me good feelings.


Is reading a drive actually that big a hit on the drive's lifetime? At least for SSDs, I always see write endurance quoted, never read.

Sure, having to scan a whole tree of files can take a toll on the performance as perceived by other apps trying to use the drive.


For SSDs I'm not sure. For my data drives, SSDs are still too expensive, so I haven't switched. For spinning rust it definitely does matter (for drive lifetime and also contention).


I've always figured that for spinning rust, what kills them is power cycles. So, unless you're absolutely sure they won't be woken up, disable "sleep" mode. If you're sure nothing's going to wake them up, might as well turn them off completely, and the server also, saving a buck or two.


While rsync does incremental backups fine, it doesn't offer deduplication. If you want that, have a look around at newer options like restic or others such as borg (see a comprehensive list at https://github.com/restic/others)


rsync does have --link-dest which you can sometimes use to get a (file level) dedup effect.

zfs dedupe is pretty expensive and doesn't often work out like people might expect...
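A sketch of the --link-dest pattern (paths are placeholders): each run produces what looks like a full tree, with unchanged files hard-linked back to the previous run.

    rsync -a --link-dest=/backups/2023-08-22 /home/me/ /backups/2023-08-23/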


> rsync does have --link-dest which you can sometimes use to get a (file level) dedup effect.

Sadly, for safety's sake, directory hardlinks pretty much don't exist, so this doesn't save as much as it could. Apple hacked in an exception for Time Machine so they could get those additional savings.


For those that don't know, there are many wonderful incremental backup solutions that don't require ZFS.* For one, I personally recommend Restic (https://restic.net) because of its deduplication.

* People on macOS don't have ZFS, well... maybe they could? See https://github.com/spl/zfs-on-mac
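A minimal restic sketch (repository path is a placeholder):

    restic -r /mnt/backup/restic-repo init
    restic -r /mnt/backup/restic-repo backup ~/Documents
    restic -r /mnt/backup/restic-repo snapshots    # list what's stored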


bupstash.io was my favored option other than ZFS. It's a beautiful and performant solution. Being filesystem agnostic is an advantage in many contexts.

In the end I chose ZFS for the efficiency of snapshots (vs. a full disk scan) and atomicity. Both enable more frequent, smaller syncs, which is perfect for a laptop.


macOS Time Machine does incremental backups; is there any other reason you might need ZFS?


I see that his NAS is using an external hard drive enclosure over USB. I'm really curious: how wise is such a NAS setup?

In some ways it seems very attractive as you can just get a low-power SoC (like a Raspberry Pi or a NUC) and hook it up to the external drives.

But there also seem to be many potential pitfalls. Like, how slow will a resilver be over USB? Might it be unusably or dangerously slow? How reliable is the USB connection? Does it perform to spec, or might it cause weird issues? Can you get SMART info over the USB connection? Other issues?


> NAS is using an external harddrive enclosure over USB.

The N in NAS means "network", as in network-attached storage, so it's not a NAS.

> really curious about how wise such a NAS setup is?

For data use cases like this, USB 3 can be reasonably comparable to Thunderbolt 3, and that connection is generally faster than the media.

This use case seems to be using the external device as a continuous external backup rather than as network attached storage, which is a great use of USB-C dongle SSD enclosures that are the same size or larger than the SSD inside the laptop.

You effectively have mirroring as well, since you have both the internal SSD copy and the external SSD copy, in different makes and forms, unlikely to both fail.


How I interpret the setup that OP has is that he has some computer (a Pi, an old laptop, a NUC, etc.) which is connected to drives in an external drive bay through USB. This is a somewhat common setup and is definitely a NAS.

> For data use cases like this, USB 3 can be reasonably comparable to Thunderbolt 3, and that connection is generally faster than the media.

External HDD enclosures can often contain 4 drives. During a resilver, all of these could be heavily accessed. I'm not sure how a single USB 3 connection fares in this scenario. In a normal desktop you'd have four separate SATA connections, and even then resilvering a large RAID setup can take quite some time.


In the context of the grandparent, a "NAS" is a device, attached to the network, that provides storage. How the storage the NAS provides is connected is irrelevant.

On a related note, I can't think of any drive that has a network interface instead of something like USB, FireWire, Thunderbolt, SCSI, IDE, etc., so how exactly would you define a NAS device?


How could I backup using incremental atomic snapshots on Windows?


Look into the Windows Volume Shadow Copy Service.

Unfortunately, I haven't had luck with open-source backup software that uses it (the shadow copy snapshot would fail, the error code was no help, and finding no resources, I gave up), but the commercial software I've used was great. When I was at a big corp, the commercial backup software whose name escapes me at the moment would literally wait for files to be saved, then do an incremental backup.

As of now, I'm using Veeam on my personal machines, and it runs an incremental backup nightly and saves to an SMB share.


I've been using urbackup for years. It does disk image-based backups and/or file-based backups. Disk image-based backups are incremental at the block level, and file-based backups are incremental at the file level (so if a single byte of the file has changed, the entire file gets backed up). It uses the Volume Shadow Copy mechanism that the sibling comment mentioned to get atomicity and avoid file locking issues.


I wonder if zrepl could be run in WSL2 - it would be nice to back up Windows computers as well using this approach.

At the moment, I use Nextcloud to sync data to my server. It's a more selective approach, and Nextcloud is, per se, not a backup solution because not all files can be backed up... and live-sync is always a half-baked backup solution.


How would that work? Zrepl ships ZFS snapshots. You could probably wrangle WSL2 into installing its distro on ZFS. If so, I see no reason why zrepl wouldn't work within the Linux environment. But snapshotting the whole Windows drive? I don't think so.

If you want similar features, I think ReFS comes close. AFAICT it's not supported as a boot drive.


I thought maybe if Windows were installed on one ZFS volume and WSL2 on another, zrepl from Linux would be able to back up both: snapshots of the Linux and the Windows volumes. But a quick search on Google reveals that Windows on ZFS is not a thing yet.


You could flip that on its head, and run Windows in a VM on Linux on a ZFS volume. Depending on what you do on Windows and your particular hardware setup, this may or may not work well enough.

I see people use GPU pass-through to play games on Windows VMs. You could probably pass through practically all devices (GPU, sound, network, keyboard, etc.) and this could work well-enough if you don't need the absolute last drop of performance from your CPU and drives. And since KVM supports nested virtualization, you could run WSL2 in the Windows VM.

And if I'm not mistaken, the KVM agent in Windows can be told to ask the guest OS to sync its drives, and some Windows applications [0] can even cooperate with this and flush their buffers to disk. You could signal this before creating the ZFS snapshot.

[0] probably not most, but I think MSSQL does.
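A sketch of that flow using libvirt's guest-agent commands (guest name and dataset are made up; requires qemu-guest-agent inside the guest):

    virsh domfsfreeze win11                        # guest flushes and freezes its filesystems
    zfs snapshot tank/vm/win11@backup-$(date +%F)
    virsh domfsthaw win11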


Yes, this could be a possible direction. However, judging by reports from Reddit [1], Windows on ZFS still produces lots of BSODs. Maybe one needs a ZFS volume (zvol) formatted as NTFS?

Anyway, these days I work 60% of the time in my VMs and the other 40% in WSL. Windows is reduced to a graphical interface; maybe I should simply ditch that last mile, too.

[1]: https://www.reddit.com/r/zfs/comments/yxipyy/anyone_using_op...


I've read that ZFS is less safe than other Linux filesystems if you don't use ECC RAM, because it assumes that there are no memory errors and therefore doesn't provide a tool to repair a filesystem corrupted by such errors. Is this true?


It's not true. That's basically ancient forum myth, alongside the also incorrect "ZFS needs 1GB memory per TB of HDD" nonsense that has thankfully mostly died out finally. ZFS makes no additional assumptions when using ECC vs non-ECC memory.

It is theoretically possible to construct a scenario where evil RAM does exactly the right things needed to fool ZFS and corrupt your filesystem. Any pearl-clutching about this thing which has never happened somehow also ignores that bad RAM will corrupt any filesystem.

In reality, while ECC memory is always nice to have, it's no more required than for any other filesystem. Though personally, now that 32GB+ of memory is common, I generally prefer error correction/detection over ultimate speed these days. Ironically, ECC memory is actually really nice to overclock, because I can just check my logs and prove whether my system is stable.

There are so many actual dangers to your data in comparison that it's laughable. The biggest one is you, followed by hardware failure, malware, and genuine ZFS bugs. I'd stay far away from raw sends of encrypted datasets in ZFS for a while; there are edge cases that haven't been resolved yet.

Edit: a longer article saying the same thing: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...


"I don't backup my drives, I replicate them"

<sigh> I mean, sure, he recognizes the difference, which a lot of people don't, and I guess yay for ZFS saving him here, but this is just irresponsible if you value your data.


Spoiler: he has 10-minute incremental backups.


Its "backups" join "zfs makes snapshots easy" join "snapshots make incremental backups easy" join "backups on device aren't a backup" join "I had off-device backups"

which reduces to "I had backups" indeed.

3-2-1 forever!


> "backups on device aren't a backup"

That wasn't part of the article. It was a single drive failure, so RAID would have done fine.


Yes, RAID will get you over some failures. But, it still isn't a backup. Backup is what gets you over corrupted RAID, loss of both sides of the mirror stripe, entire disk failure when its not RAID.

What he does is run zrepl to make a backup; it covers his needs. A ZFS snapshot by itself is only transiently a "backup" for the immediacy of a change; it's the least safe form of backup if it remains on the same logical drive structure.


> Yes, RAID will get you over some failures. But, it still isn't a backup.

Backups can also fail; that zrepl setup can start failing after some OS/kernel update without notifying the owner. The question is one of failure probabilities, and I'd kinda trust industrial RAID more than some custom-made hobby solution.


I really figured we'd have super-easy hardware RAID1 even in consumer-level PCs by now, given how cheap (and unreliable) drives are.

My SSD boot drive makes me nervous as heck; I'm constantly backing it up.


Was he able to back up and restore his boot/ESP partition as well using ZFS, or did those need to be reprovisioned manually?


Am I the only one who doesn't have any important data? If I lost everything today I would just start afresh and move on.


> Am I the only one who doesn't have any important data?

Quite possibly.


At all? Like, anywhere? Or do you just have data living on cloud services instead of locally?


At all. What important data do you have? I really can't think of anything that I would miss if everything disappeared.


Address book, photographs/video of people I care about and holidays, personal diary, hobby projects, old letters/emails, stored passwords, archived bank statements/contracts/insurance and other important documents.

I think you are very unusual if you don't care about any of this.


Yes, I think it's unusual not to care about photographs/videos. I used to think that they were important and that I would want to revisit them some day. However, now that I'm old enough that I should, it simply hasn't happened. They are just some files that I will never open.


Most administrative stuff, especially if it's in digital form anyway, can probably be recovered without too much hassle if it's lost. But you're right that most people have a bunch of stuff (not all of which is admittedly digital) they wouldn't want to lose.



I would love to be able to do this on macOS, with the click of a button.


FWIW I've used Arq Backup[1] for several years now, and I've successfully restored at least twice after my MacBook died. It also encrypts the data before it leaves your computer, and supports tons of (cloud) storage solutions; I use Google Cloud Storage and spend about a dollar per month on storage costs with hourly backups.

[1] https://www.arqbackup.com/


tldr: my drive died, and I had a backup.

ZFS seems incidental to me. I could have had a 10-minute cron job rsyncing changes from ext4 and been just as well off.


Classic HN comment!


Is this so?


What's the equivalent setup for a Mac user?


I feel like a computer running ZFS and serving files is fine. And it should itself be treated as a storage device with a full parallel backup, even though this has a cost.

But your computer shouldn't run ZFS; that's for the big boys upstairs. The code's too big, it's too hungry.



