I only lost 10 minutes of data, thanks to ZFS (mastodon.social)
436 points by chromakode on Aug 23, 2023 | 280 comments




I had two drives in my mirrored zpool die within 8 minutes of one another.

Both HGST drives too. A very sad day.

Thankfully I had been regularly zfs sending my contents to another site and lost very little data.
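For anyone unfamiliar, the send side of that boils down to something like this (pool, dataset, and host names here are placeholders; tools like zrepl or syncoid automate the snapshot bookkeeping):

zfs snapshot tank/data@2023-08-23
zfs send -i tank/data@2023-08-22 tank/data@2023-08-23 | ssh backupbox zfs recv -u backup/data

The -i makes it an incremental stream relative to the previous snapshot, so only the blocks that changed since then go over the wire.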

ZFS is rad.


I also had two mirrored drives fail simultaneously in my zpool a few days ago. There was nothing on them so I wasn't worried. WD Reds in my case.

Using matched drives seems to be a very bad idea for mirrors. I'll probably replace them with two different brands.

I also have a matched pair of HGST SAS Helium drives in the same backplane, so hopefully I can catch those before they fail too if they're going to go at once; I _do_ have data on those.


Alternatively, you can buy your drives a few months apart, which will most likely get you different batches.


Yeah, that's an option; though since my pair failed at once, I need to buy at least two immediately to get the mirror back up.


Because of this, I always buy a different brand as well.


Any limitations, or am I good so long as they're the same label capacity? I assume I just lose a few MB or so from whichever is a little larger?


Is there a specific reason why the drives die at the same time? Electricity spike?


Buying two identical drives has high chances of them being from a single batch, which makes them physically almost identical. It’s a pretty well-known raid-related fact, but some people aren’t aware of it or don’t take it seriously.


Identical twins may both die of a heart attack, but not usually at the same time.

Normally, failures come from some amount of non-repeatability or randomness that the systems weren't robust to.

The drive industry is special (in a bad way) in that they can exactly reproduce their flaws, and most people's intuition isn't prepared for that.


If they're bought together, like mine were, and they have close serials, they'll be almost identical; if you then run them in a ZFS mirror like I was, they'll receive identical "load" as well.

Since mine had ~43000 hours, they didn't fail prematurely, they just aged out, and since they appear to have been built pretty well, they both aged out at the same time. Annoying for a ZFS mirror, but indicates good quality control in my opinion.


If they're ~identical construction and being mirrored so that they have the same write/read pattern history, it could trigger the same failure mode simultaneously.


More likely to be from the same bad batch too. There was a post with very detailed comments about this just a few days ago.


Why bad? What's considered a good/bad lifetime for these? Mine had ~43000 power on hours, I don't know if that's good or bad for a WD Red (CMR) drive, but they weren't particularly heavily loaded, and their temps were good, so I'm fairly happy with how long they lasted (though longer would have been nice).


You're right, it might be a natural end of life that happened to coincide too.


> ZFS is rad. Typo: RAID


Redundant Array of Disks (cost not specified)


I always thought it was Independent Disks, though how one disk could be dependent on another is beyond me. Perhaps the I is redundant?



Looking it up I think you are right. I thought it was inexpensive.


Repeat after me: RAID is not a backup.

Sounds like it was the backup which saved the day here, not the raid array.


He didn't have a RAID array on his laptop; that would have saved him as well, with not even a second's worth of data lost.

And he could have kept using his system without replacing any drive right away; he would have had a much quicker recovery and could have used the system during the rebuild.

Now, RAID on a laptop is mostly reserved for bigger units. The author's setup is awesome as well, but a single drive failure is one of the most common reasons (especially where you make use of snapshots) you'd end up needing your backup, and that is exactly the case RAID solves.

So try to do both :)


> I had two drives in my mirrored zpool die within 8 minutes of one another.

2 drives in a mirrored zpool, this is the equivalent of RAID1.


Missed that context, meant regarding the article.


zfs send was used, so the failed ZFS pool still had a hand in the positive outcome.


Only 1/4 of data lost! Meaning still recoverable!


I'd bet that was intentional, because https://www.urbandictionary.com/define.php?term=Rad


You deserve all my upvotes :)


That's extremely unlikely. Could it have been the controller instead? Which HGST drives?


I've frequently had drives in a RAID fail in rapid succession. If you buy a bunch of identical drives at the same time and put them in a RAID, then you can end up with:

* They were manufactured in the same batch, maybe even one right after another on the same line.

* As they were transported from manufacturer to OEM to you, they were exposed to the same environmental conditions, right down to vibrations, humidity, and ambient EM environment.

* As you use them, they continue to be exposed to the same environmental conditions, including power supply fluctuations and power inductively coupled into places it doesn't belong.

* They see the same usage patterns. Depending on the RAID specifics, that might be right down to the same disk locations seeing the same read and write volume.

It's then not surprising if they fail at about the same time.

The last machine I put together that I wanted to have high availability, I intentionally bought two different brand drives to put in the mirror to maximize the likelihood that they fail at very different times.

Many years ago (c. 2003) the group I was working in inherited a massive 6U storage server with an insane number of 10k SCSI (it was before SAS was a thing) drives. We named it "hurricane" for the sound it made. After a few weeks of using it, the first drive failed. It rebuilt to a hot spare and we ordered and eventually installed a replacement. A few weeks later, another drive failed, and this time, before it could finish rebuilding, two more drives in the RAID failed and its contents were lost (but we had a good backup). We never used it again. For a while I used it as a coffee table, but then someone convinced me that was too tacky, and it got ewasted.


> It's then not surprising if they fail at about the same time.

It is, but in a different way. It is a testament to the depth and precision of manufacturing process control that two insanely complex machines will behave nearly identically for years, up to the point of failing at about the same time, if they've been made in the same batch and exposed to about the same environment and usage patterns over those years. You'd expect any number of random factors to cause one drive to fail way before the other, but no - not only is there very little variation between drives in a batch, tiny variations in usage are damped down instead of amplified.

It truly is amazing.


> If you buy a bunch of identical drives at the same time and put them in a RAID

When setting up a new machine with zfs I intentionally buy drives from as many different brands and models as possible to spread the manufacturing defect risk.


Not extremely unlikely if they were identical drives from the same manufacturing batch. It's good practice to use diverse manufacturers, or at least batches, when adding disks to a RAID array for just this reason.


It's not unreasonable to believe that if you pick two identical products off the same shelf at the same time (as one would logically do when purchasing 2 of a single item), that the two products were manufactured at similar times and in similar conditions.

Your model isn't exactly bad, but there is an assumption being made that you haven't accounted for, which, to be fair, is frequently not stated. The assumption is that the drives' defects are independent of one another. This is a poor assumption when they were manufactured back to back.


https://news.ycombinator.com/item?id=32026606 Hacker News went down a while back because of the 40k-hour bug. Both the primary and backup servers were placed into service at the same time with SSDs that had an overflow after ~40k hours.


The drives themselves were toast. My hypothesis was a short in the RAID controller or something leading to an overcurrent in the drives.

I wasn’t using them in a RAID configuration, but they were attached to a raid controller.


Or it could be that the tolerances and environmental factors were so tightly matched between the drives.


Stop, you're scaring me! I have mirrored drives in a zpool. If a pair dies I lose 18 TB. My most important stuff is cloud-replicated, but still...


My pairs of mirrored drives come from 2 different manufacturers to prevent a common fault from hitting both drives at once.


Ah, don't be scared. You're at least starting to think about your data and replicating your most important stuff elsewhere. I'd still recommend a non-cloud copy also, but you're probably okay :)

Take it as a good impetus to catalog your data and find those extra replication options. Data protection does cost you a bit to do it "fully", but since I've worked on a backup solution before in a client facing way, trust me when I tell you that I've seen rather large businesses (a few you might even know as a household name) who have less consideration for their data than you've expressed in your 3 sentences :)

So just figure out which of your data _truly_ needs to survive at all costs, get a solid setup with personally owned storage for the backups in combination with cloud storage, and you're probably fine.


The thing that has been worrying me lately with tens of TB of files I need to look after is how do I know the files haven't silently got corrupted somehow? I feel like I need to periodically re-checksum everything and keep hashes in a database somewhere off the side.


I think rsync has a --checksum flag and also supports incremental backups, so you can probably build a routine around your access patterns. That is, if you know for sure that you personally will not touch the files after 20:00 and no one else will, it's _fairly_ reasonable to assume that if you run rsync after you're done working and schedule an incremental pass a bit before you start working again, anything that was "copied" to your new backup has likely changed in an unintended way, and the list should be small enough to check each morning, if there's anything at all. Keep in mind that OS deduplication may mess with this (Windows' Dedup would undoubtedly break this scheme, as it may show files as modified after the Optimization job runs).

Alternatively, consider just using a file system that does periodic integrity checks. I know ReFS has integrity streams, and I am pretty sure XFS has something almost exactly the same or better. It won't prevent corruption, but it will give you something you can monitor for when the filesystem reports an issue.

Some combination like this should work.

Similarly, you might be able to come up with a fast trick using stat; with some quick testing on a dummy file in a macOS shell, you can do something like:

stat -f %m somedir/*

and compare the resulting value by passing it to sum or something. I am not super familiar with stat in general, so likely I am missing elements that make this unreliable, but I'd consider looking into it further unless someone tells me it's 100% the wrong direction and explains why.
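If you do end up keeping hashes off to the side, a flat manifest gets you most of the way there; a rough sketch with GNU coreutils (paths are placeholders; on macOS, shasum -a 256 plays the same role):

cd /mnt/archive && find . -type f -print0 | xargs -0 sha256sum > ~/archive.sha256
sha256sum --quiet -c ~/archive.sha256    # later: prints only files whose contents no longer match

Re-run the second line periodically, and anything flagged that you know you didn't touch is a candidate for silent corruption.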


Note that rsync generally runs far, far more slowly with the --checksum flag. And I can't recall it ever saying something like "change(s) in $FileName were only noticed by checksum", which is what would have alerted me to quiet disk corruption.

ZFS has checksums, and the 'zpool scrub' command tells it to verify those (on all copies of your data, if you're using RAID).
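For reference, kicking one off and checking on it is just (pool name is a placeholder):

zpool scrub tank
zpool status -v tank    # shows scrub progress and lists any files with unrecoverable errors

Many distro packages already ship a monthly scrub cron job or timer, so it may be running without you ever having set it up.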


You should probably be using RAIDZ2 in that case, which survives a simultaneous 2-drive failure without a problem.


I actually had this happen, and RAID Z2 saved me from a very long recovery process.

I thought it might be the controller but a year on I've had no further issues. Sometimes drives do just go like lightbulbs.


3 2 1 backups, ddg it ;)


I hope that mixing models and purchase times (I have been using the raidz expansion branch) will keep that raidz alive.

It has become uneconomical / impractical for me to back up everything.


Tangentially related, and on topic with the Google search issues that The Verge complains about at https://www.theverge.com/22291828/sandisk-extreme-pro-portab...

> Google assumes you’re looking for product pages when you search for things like “4TB SanDisk SSD,” so news stories like ours and Ars Technica’s appear far down search results.

Well, I've been test-driving the paid Kagi search engine, and this was an excellent opportunity to see if a different class of web search could produce different results...

Sadly, I'm afraid that when the whole web is flooded with praising articles, a different set of prioritization rules will still struggle to show different results:

* First 5-6 results are from shops.

* Then come some reviews, from: easeus [1], techpowerup [2], anandtech [3], consumerreviews [4]. None of them contain the word "fail".

* Lastly, and this is the only actual improvement from google, there are some relevant search suggestions such as "sandisk 4tb ssd failure" and "sandisk 4tb ssd problems". Difficult to see (they are at the bottom of the page), but at least better than Google (where the word "fail" doesn't appear at all in the first results page).

[1]: https://www.easeus.com/knowledge-center/sandisk-4tb-extreme-...

[2]: https://www.techpowerup.com/review/sandisk-ultra-3d-4-tb-ssd...

[3]: https://www.anandtech.com/show/16892/sandisk-extreme-pro-cru...

[4]: https://consumerreviews.store/sandisk-4tb-extreme-portable-s...


Just before I exited the Linux world entirely, I was beginning to chip away at the iceberg known as btrfs, and it was fascinating. I saw so much promise in many of its features, for revolutionizing backups and organizing my disks and everything.

Now, btrfs isn't ZFS, but it has some feature parity and is perhaps the "poor man's ZFS". It's also much more reasonable to run on certain OSes, due to the licensing, packaging, and in-kernel status of ZFS being kind of weird.

One memorable time I was encouraged to use ZFS was when I mentioned to the Linux User's Group that I'd had to pull the power cord to reboot my computer, and I was roundly scorned for this foolish maneuver. But you may change your mind about the wisdom of doing either one when you consider that the system in question was a Raspberry Pi. Heh.


The Pi supposedly can get FS corruption, but I've never seen it, because every time I install an image, I run a script that puts tmpfses everywhere and turns off mostly useless logging. Those things just run forever, very reliably, as long as you don't hammer the SD card.

ZFS looks so cool! Unfortunately when I eventually get a NAS I doubt I'll want to pay for anything that can run it, so I suspect I'll just be doing RAID and ext4.

I always stayed away from BTRFS, because every few months I'd see a "BTRFS destroyed my data" post, followed by an argument about whether it was BTRFS's fault. I see them less now; perhaps it's time to revisit?


I had multiple pi systems “bricked” after a power outage. Presumably it was file system corruption? I never looked into the issues further. I just wiped the drives and reinstalled. These were vanilla raspbian installs at the time. It happened a few times when I was first trying out my pi. These were a mix of me cutting the power and actual power outages.

I only ever had these issues with SD cards. I quickly switched to running my pi off a USB external SSD, and haven’t had any problems since then. Now when the power goes out, it boots back up properly and all my services start. All of this on ext3 I think?

Planning to redo things for ZFS at some point, but haven’t gotten around to it yet.


This happened a lot to my Pi 3 back when I used the internal SD slot. I set it to boot off USB and used an old 8gb USB2 flash drive as the boot drive and never had the problem again.


Yeah, sd cards that you haven't put through extensive crash testing simply can't be trusted. I used to have a jar of sd cards that didn't survive testing.


The most noticeable raspberrypi SD card life lengthener for me has been to write logs to RAM (assuming you have a stable setup and don't count on them to survive a reboot!).

Our $job dashboards used to nuke an SD card every couple weeks/months, but since the move to logs-in-RAM we've been running the same SDs for years.

DIY via {fs,journalctl} config, or using https://github.com/azlux/log2ram

Also, mount the SD with the `noatime` flag of course: https://wiki.archlinux.org/title/Ext4#Disabling_access_time_...
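The DIY side of that is only a few fstab lines; a sketch assuming the usual Raspberry Pi partition layout (device names and sizes are examples, not a drop-in config):

/dev/mmcblk0p2  /         ext4   defaults,noatime           0  1
tmpfs           /var/log  tmpfs  defaults,noatime,size=64m  0  0
tmpfs           /tmp      tmpfs  defaults,noatime,size=128m 0  0

log2ram is nicer in practice because it syncs the RAM copy back to the card on shutdown, so you don't lose every log on a clean reboot.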


Noatime, disable swap, logs in RAM, /tmp in RAM, .xsession-errors in RAM (with logrotate or it will fill up! So many random problems can end up writing there), Chromium profile folder in RAM if you do kiosk work (browser makers must hate flash!).

From what I hear, Home Assistant is still not the easiest thing to run for years on a card. Not sure if that's fixed now, but it's one of the big factors blocking me from moving to HA.


There was a link a couple weeks ago to industrial SD cards on Digi-Key. They're pricey, but they will work for years without disabling anything on these sorts of boards.

https://www.digikey.ca/en/products/filter/memory-cards/501?s... is the brand I used to use, iirc


Any script/guide to do all this on a Raspberry Pi in one go? I'm very interested for some Pi Zero 2: every couple of months the microSD content gets corrupted and I have to rebuild the system + restore backups.


The section starting at 827 is what I'm currently using, but it just covers the basics. Somewhere in there is also the kiosk launch script to run Chromium without breaking the card. If you use Apache you'll need to add a service to create the logfile, though.

https://github.com/EternityForest/KaithemAutomation/blob/dev...


A guide I wrote for $job that covers what I mention in the parent-parent post, and a couple other things: https://unito.io/blog/better-raspberry-pi-dashboards/

Note: only a guide, not scripted yet. See footnote for why :)


Ooooh, excellent additions, thanks :)


If you automate an fsck at every boot you can make these kinds of issues go away on the Pi.
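Two common ways to get that, assuming an ext4 root on the usual Pi partition (values are examples):

sudo tune2fs -c 1 /dev/mmcblk0p2    # force a filesystem check on every mount
# or append to /boot/cmdline.txt:  fsck.mode=force fsck.repair=yes

Recent Raspberry Pi OS images already ship with fsck.repair=yes, so half of this may be in place out of the box.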


That won't help if your SD card got corrupted because of too many writes.


Vjjnba


Sorry about that. Absolutely no idea how it got typed and posted. Had the page open this morning but I don't see how either pocket typing or feline interference could have done it…


You made the first page of Google with a brand new word!


My kid would leave these for me.


ZFS doesn't need hugely resourced hardware. For example, it can run on a single disk and still get you cheap snapshots/rollback, on-disk checksums, the potential for easy replication via send/recv, highly efficient caching via the ARC, and transparent compression, amongst other things. You can also start with a single disk and then easily add a mirror later on.
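A minimal sketch of that single-disk-then-mirror path (pool name and device paths are placeholders):

zpool create tank /dev/sda                # start with one disk
zfs set compression=lz4 tank              # transparent compression from day one
zpool attach tank /dev/sda /dev/sdb       # later: resilver onto a second disk, turning it into a mirror

Using short device names like this is just for brevity; /dev/disk/by-id paths are the safer habit.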

In terms of CPU and RAM, most NASs now are perfectly capable of running it (but do also consider speccing out a small form factor PC with a case with lots of drive bays vs prebuilt NAS - you might be surprised at what you can get for a similar cost).


> I'll just be doing RAID and ext4

Why do you need/want RAID ?

RAID is for availability, but is your data really so important that you cannot wait for a restore? Most people would be much better off using those 1..N parity drives as versioned backups instead of running RAID.

I've run NAS boxes for years, but these days I'm only using single drives.

My setup these days consists of laptops that synchronize data (encrypted) to the cloud, and a small ARM machine that synchronizes the cloud contents locally and makes a versioned backup to a single drive as well as a versioned cloud backup.

As for cost, it's cheaper (for me) to store my data in the cloud, than the cost of electricity required to run my NAS.


> As for cost, it's cheaper (for me) to store my data in the cloud, than the cost of electricity required to run my NAS.

how much data are we talking about?


I have around 10TB data stored in the cloud, including backups. Total cloud bill is around €23/month (recent price increase included).

There is a turning point somewhere after 15TB where the cloud becomes a lot more expensive than local storage, but at 10TB it’s hard to do in a reliable manner locally.

And it doesn’t really matter how you do it. If you’re the DIY type, and buy a second hand server, or a small machine, or just run out and buy the latest Synology box, you’re still paying more for storing <15TB at home in RAID. One will be more expensive in electricity, the other in purchase price.


I was going to do RAID for disk failure tolerance, either with a consumer NAS box, or a disk enclosure that does hardware RAID, and something very low power like a Zero 2 W.

Isn't RAID parity slightly more space efficient than versioned backups? Or is there a better way to do redundancy that doesn't involve just replicating entire files to multiple disks? Or some kind of automated manager that puts each individual file on N different disks out of M?

I mostly do embedded so reliable data storage isn't generally something I deal with, we usually leave that to the cloud or to the user, and I'm not quite familiar with what's out there.


>Isn't RAID parity slightly more space efficient than versioned backups?

It depends on your storage array. The more drives, the more space efficient RAID becomes, but RAID is still only a single copy of your data.

>Or is there a better way to do redundancy that doesn't involve just replicating entire files to multiple disks

Most of the industry is using erasure coding these days (https://blog.min.io/erasure-coding/) which allows for spreading your parity and data across multiple sites. Erasure coding usually runs a layer above the filesystem, as opposed to RAID which typically runs below the filesystem (Snapraid, Mergerfs and others excluded).

My personal "backup vault" is a Raspberry Pi 4 with a single 4TB external drive attached. The RPi runs Minio, and all backups are done through the S3 interface or SFTP/SMB. It is not the fastest box in the world, but it backs up (incremental) ~2TB in 30 minutes, which is "fast enough".

It consumes on average 4W, which means even with worst case electricity prices of €1/kWh (which we saw last winter), it costs less than €3/month.

For comparison, my NAS consumed around 50W, and at €1/kWh, that would cost €37/month in electricity alone, and then you need to add the cost of the actual hardware itself.

I switched off the NAS, and purchased ~10TB of cloud storage (main storage and backup storage at two different locations) for €20/month, and keep sensitive stuff encrypted with Cryptomator.


> but RAID is still only a single copy of your data

Btrfs has 3-copy and 4-copy RAID1 (profiles raid1c3 and raid1c4), doesn't ZFS have something similar?


ZFS has RAIDZ1, RAIDZ2, RAIDZ3 and ditto blocks, which do much the same thing, although a bit differently.

My point was that even if you have 4 copies of your data, you still only have a single machine where your data is stored, and you're essentially just one flood/lightning strike/house fire/burglary away from all of it being gone. Or one bad power supply away from 4 dead drives.

With versioned backups, you have higher latency on restoring data in case a disk dies, but your data is also safer.

As I initially stated, RAID is for availability. It is great for making sure that data is available 24/7, but that is rarely what the average home user needs. Most home users access their data infrequently, and would be perfectly fine waiting a couple of hours while restoring data from a backup.


I had a FreeBSD box with data on a gmirror of two disks for a decade. Twice in that period I had one of the disks die. Each time I would buy a new disk, add it into the array, and go on with next to no effort. As for cost, I would have had the box anyway, so it wasn't like I was paying for an extra system, just an extra disk. Cloud backup would have cost me several orders of magnitude more; this was a media library of hundreds of gigabytes, eventually low terabytes. Plus all the complexity of working backup/restore procedures.


And how much power does that box consume?

The average price of power during a year here is around €0.45/kWh, so a box drawing 50W (not unlikely with a decade-old processor and two disks) will use 438 kWh during a year, meaning it would cost €197/year to keep it powered.

Add to that the ~€150 x 2 for new drives spread over that decade, and you end up with €227/year, or €19/month.

But of course, if you, as you stated, had the box running anyway, the math works out differently as you're essentially splitting the cost with whatever purpose the box already fulfills.


The system started its existence on a VIA EPIA board with a sata controller pci card, later upgraded to an Intel Atom-based Supermicro board. The Atom was 20W TDP iirc, the VIA less than that. The system was mostly idle.


> ZFS looks so cool! Unfortunately when I eventually get a NAS I doubt I'll want to pay for anything that can run it, so I suspect I'll just be doing RAID and ext4.

FreeBSD won't break your wallet.


I assumed they were talking about the commonly spread myth that you need a gig of RAM for every terabyte of storage. I think that's recommended when doing deduplication, but for a simple NAS, ZFS would use a comparable amount of memory as ext4 on RAID.


Oh yeah. That was always a sketchy recommendation, but it sure did make its way around. I think deduplication is remarkably seductive, but doesn't seem worth the cost for almost anyone, given how it's implemented. IIRC, btrfs has a dedup option where you can link up duplicates later, and then you don't have to hold a dedup table of everything all the time, and don't need to collect writes to check the dedup table, etc. But rewriting data isn't how ZFS rolls, and I get that.


Or the myth ZFS without ECC is more dangerous than anything else without ECC.


Not sure if a myth, iirc it was literally in the ZFS manual last time I looked into ZFS (which to be fair was 10+ years ago).


The documentation author refuted it 9 years ago.[1] Probably your understanding or memory was incorrect.

[1] https://news.ycombinator.com/item?id=8438239


https://openzfs.org/wiki/System_Administration#Data_Integrit...

>Misinformation has been circulated on the FreeNAS forums that ZFS data integrity features are somehow worse than those of other filesystems when ECC RAM is not used. That has been thoroughly debunked. All software needs ECC RAM for reliable operation and ZFS is no different from any other filesystem in that regard


Is that really the case? Can you point to some references backing that up? (Hehe, unintentional pun)


ZFS will happily use a large amount of RAM for caching if you have it, but it'll run fine on a recent Pi (3 or 4, or not raspberry at all).

It'll run fine but more sadly on older Pis running 32-bit kernels, since it does a looooooooot of 64-bit and wider operations, so you pay a nasty tax on that on 32-bit things. (Though the virtual address space limits might actually be sadder than the 64-bit operation penalty there, really...)


That's surprising. I may try it on my Pi. I have 40TB and am very happy with SnapRAID, but ZFS always seemed like the “correct but expensive” solution.


I'm running ZFS on the smallest AWS instance, running FreeBSD, and it does what I want it to do.


TrueNAS' general requirements are 8GB, and they spell out when you want more. Most of the situations you'd want more you'd also want more on RAID. https://www.truenas.com/docs/core/gettingstarted/corehardwar...


The only filesystem I've ever completely lost to corruption was btrfs, and that was about a year ago. btrfs-restore completely failed, so if I really needed that data I guess I'd have to do some manual surgery. I got to the point in the documentation where the only recourse was "idk go ask someone on IRC".

Of course if you have good backups, you can use whatever and not really worry too much about it.


I've used it extensively for many years, both professionally and personally. Historically it's been something where users needed to be paying attention to the mailing list and wikis.

For the most part though, sticking to standalone or mirrored disks is pretty rock solid and has been for a long time. Ditto for subvolumes, snapshots, and send/receive. My laptop has been snapshotted and sent from one piece of hardware to the next for many years now.

That said, I'm with you on the backups. Anyone who uses btrfs and doesn't have rock solid backups is a madman.


Well, I guess that resets my "Has it been long enough since a BTRFS horror story that I can trust it" counter.


I've had the same experience the two times I've tried btrfs, about two to four years ago, in my case with Linux VMs that would occasionally be abruptly terminated. In both cases, the corruption that couldn't be automatically repaired happened within the first few of these sudden terminations.

Ext4 seems to handle that scenario better. I can't think of a single instance of filesystem corruption that fsck couldn't fix, and some of those VMs have probably been abruptly terminated at least a hundred times over the years.


I had some ram fail in my laptop and that killed my btrfs filesystem. Though btrfs restore was able to recover almost everything eventually (which is good because my last backup was a few weeks before since I had been traveling). Decided to go back to boring old ext4.


> ZFS looks so cool! Unfortunately when I eventually get a NAS I doubt I'll want to pay for anything that can run it, so I suspect I'll just be doing RAID and ext4.

I ran ZFS on a Raspberry Pi 4 with 8GB of RAM just fine (under debian arm64), and I've used ZFS on a machine with 4GB of RAM for receiving snapshots.


Ubuntu ships with ZFS support, as does TrueNAS.

I personally run my NAS with Ubuntu and ZFS and love it.


Apparently, recent releases of Ubuntu have dropped ZFS filesystem support after the person driving that effort left. :/


Reading up on it, recent releases of Ubuntu have dropped using ZFS for your boot and root volumes, which isn't ideal; however, they still support ZFS for any other volumes, which I'd venture is its primary usage anyway, and they don't plan on removing support for anything other than zsys/ZFS root/ZFS boot.

But thanks for bringing this to my attention. I had missed the changes in 23.04.


There's been a recent series of PRs which appear to be adding ZFS root support to subiquity, the new Ubuntu installer:

https://github.com/canonical/subiquity/pull/1689


Cool, hopefully that means it's being kept as an option after all. :)


Ditto!


I am using USB disks with ZFS connected to a standard cheap office PC as my storage server. It still provides plenty of benefits over ext4.


Is that a script that is maintained and up to date and available on the web? Just curious. I've done this stuff manually and have a checklist, but I never wrote a script because I didn't trust myself enough, barely knowing a little bit of bash myself. Another survival technique is to get a very large SD card, like 64GB, because wear leveling should increase the life. I have one RPi that's been going nonstop for 4 years now and the SD card appears to be fine. Not sure why the RPi guys don't have a "choose long term reliability over kitchen sink" option for setup (or even an image!)


It's in the kaithem-kioskify up a little earlier in this thread.

Eventually I'll probably move it to a standalone thing, since there seems to be a whole lot of interest in the SD protection feature, but it's missing a few of the hacks for programs I don't really use anymore, like the Apache logfile thing, just because I got tired of maintaining stuff that didn't have much interest.


Are you willing to share your script? I'd like to compare it to mine to see if I've missed anything. Thanks!


The version I'm using now is all tangled up with an installer and setup script (when I'm doing interactive installations I tend to try to reuse the same setup for everything), but there's a big ASCII art banner for most of the relevant stuff for the SD card.

Note that this doesn't have the Apache logfile hack, so Apache probably won't run if you try this and don't add something to create its fussy logfile.

https://github.com/EternityForest/KaithemAutomation/blob/dev...


What's the point of using ext4 on a NAS, and not XFS?


XFS is not the default on most systems and I hardly ever hear about it in general, so I really never paid much attention to it.

Seems like people say it is more CPU-heavy than EXT4, so unless it's way more reliable, would it really be the best choice on a Pi/router/sub-GHz commercial NAS chip?


We used to test some installers on XFS servers for Red Hat stuff. XFS needed far fewer "corrupt drive" fixes than ext4 on those servers (probably 1/10th as much corruption), and we were constantly just doing straight power-offs on them (no soft shutdowns). I think XFS doesn't get the respect it deserves if you don't need a "fancy" file system like ZFS or btrfs.


Huh, that does sound pretty appealing for sure.


I always disable rsyslog too.

Tell me about tmpfs: where do you use it?


Alright, I'll bite... What led you to leaving the Linux world entirely? What can "the community" learn from your experience to make it better for others?


Linux is a fantastic experience, and I have no qualms or ill will about it. I simply had no use for it anymore, and I needed to simplify. As I've said before, I'm not a sysadmin anymore; I don't tinker with systems, I need stuff to be operational and in production.

I still love Linux and I'd use it for any given server or Raspi if that were part of my job. I do use it daily in my job, but to a very minimal extent.


Did you switch to a mac or to a windows machine?


They said "Operational and in production" :)


I honestly don't know which one you implied.


BSD


So Ubuntu Server LTS, got it ;)


The only thing about btrfs that I do not like at all is the fragmentation that it is prone to, especially with sparse VM images.

zvols are so much better for that


A few years ago we had the capacitor plague. Are we now living through the storage plague? It's getting ridiculous that all storage is getting worse and worse. WD is making HDDs crappy with SMR, manufacturers say that 3 years of operating time is already too much for SSDs and HDDs, and they aren't joking. I just had a Kingston SSD (okay, that one was like 8 years old) and a portable WD HDD (~2 years old) die just this year.

The internet has been full of reports of data loss and longevity issues lately.

I remember 20 years ago HDDs were not meant for eternity either, but they definitely outlived the usefulness of the computer that they were bought with...


I've had a lot more luck with hard disks these days than I did 20 years ago. Remember, 20 years ago was the era of the infamous IBM Deathstar drives, where the magnetic coating would literally start sprinkling off the platters. It was also the era of terrible, terrible Maxtor drives that died in 1-2 years, which Seagate then bought, making their drives also unstable for a while. I ran a server with around 8 drives and had to keep replacing disks at the rate of about one per year.

Meanwhile today I'm helping admin a ZFS server with 20+ drives and drives have about a 4-5 year lifespan.

> but they definitely outlived the usefulness of the computer that they were bought with

Computers also became obsolete much more quickly back then. Today a 6-year-old computer is totally usable; back then you really felt it if your machine was just 3 years old.


> Remember 20 years ago was the era of the infamous IBM Deathstar drives where the magnetic coating would literally start sprinkling off the platters.

I have even older memories of problematic IBM drives. During the early 90s the shop I briefly worked with, found a supplier for IBM SCSI drives at a very convenient price, so they ordered a good lot of them. They worked great on PCs, but some of us also had Amiga machines and of course would love to benefit from the offer. So we tried one, but it didn't work; then another, and another; nothing, they were normal SCSI drives but refused to work on any Amiga with a SCSI controller, although any other drive would work in there. In the end we abandoned all hopes and took the drives for a reformat to be used on PCs, but... they were all dead. Completely, not even detectable by any controller; the mere connection to an Amiga SCSI controller destroyed them instantly. We never discovered where the problem was; those drives worked perfectly on all PCs, while we could install any other drive on every Amiga and expect it to work, but no way to put those in an Amiga and expect it to survive. Good old times indeed:)


Seagate had 1.5TB drives I think about 15 years ago with really high failure rates. Somewhere I saw close to 33%. Anecdotally, mine failed after about a year and the refurb warranty replacement also failed after about a year.

I think it was about 10 years ago some Seagate drives had higher than industry failure rates. Iirc one of their factories was producing drives that failed much more frequently than others (there might have also been something to do with the platter counts/model)


4-5 years? My old backup server has been running for almost 15 years now: 6x 500GB hard drives, one even running on PATA, 8GB ECC RAM, an Athlon II. I could save some money replacing those 6 drives with 2x 14TB hard drives today. But as long as it is working fine I ain't gonna do anything before I run out of hard drive space.


I remember those IBM Deathstar drives...

Was at the Aussie Tribes 2 launch LAN and there was a guy who had one die on him.

At that time in the LAN scene there would always be someone who had a Deskstar die ... You could hear the clicking over the noise of the LAN.

I realised back then, I can only trust Seagate.


The memory of the Deathstar drives that stands out most in my mind was a coworker managing to destroy hardware with SQL. We were at a really ... frugal ... interactive advertising firm and our dev server had been slapped together with a RAID-1 array of cheap IBM drives. One day said coworker was testing conversion of a large database table in MySQL from MyISAM to InnoDB format (to see how long it'd take, what query perf afterwards was like, etc.) and all of a sudden the server went hard down. We went over to the server closet and discovered that the IO had been enough for both drives to grenade themselves at the same time. Good times. I'm just glad we had semi-decent backups and it wasn't a production machine.


Hah, I had a Deathstar die on me back in the early 00s too. Surprisingly, about a decade later I hammered it with ddrescue and was able to get almost all the data off it!


Don't worry, your chips will start glitching soon too.

We're hitting scaling limits. Exponential growth is slowing.


that does not make any sense if you are talking about the controllers.



> I remember 20 years ago HDDs were not meant for eternity either, but they definitely outlived the usefulness of the computer that they were bought with...

Anecdotally, I remember HDDs failing sometimes for me and my friends/relatives back in the day. Now it barely happens with SSDs. Hell, even a supposedly problematic old Intel SSD from the late 2000s still works fine in the same old MacBook I gave to my mother after using it for years.

I wonder what the actual data regarding this looks like.


All WD drives I bought in the last decade work as expected (0), including the NAS ones bought after the introduction of that SMR thing; I just made sure they are either Plus or Pro, not the plain Red ones, which are SMR-plagued. I was also lucky with SSDs, but especially on desktops I use small ones as I still prefer to keep /home dirs and RAID arrays on old rusty drives.

0: A couple exceptions: Two WD Red (before SMR) which I took out from my old NAS to put bigger drives in place, and put in a drawer while they were still perfectly healthy. After like 2.5 years in their anti static bags and normal conditions, no excessive heat, no moisture, no magnetic fields etc, I took them out because I needed a spare disk and checked them: both were not working, one completely dead and the other barely recognizable but unreadable. The first didn't even show up once connected; I tried to clean all contacts, including the pins on the controller pcb to no avail, and eventually had to ditch it; the 2nd one could be reused only after a full reformat; no way to recover old data, not even using testdisk. I never experienced nor expected anything like that, and frankly it worries me quite a lot.


~20 years ago was the IBM "Deathstar," another drive so bad there was a class action lawsuit filed. Disks have always randomly died; that's part of the reason Sun made ZFS.


Very quick summary: The mastodon thread refers to https://zrepl.github.io "zrepl is a one-stop, integrated solution for ZFS replication."


Does anyone know how zrepl compares to sanoid/syncoid other than that zrepl is written in Go and sanoid/syncoid are Perl scripts?


I use sanoid to do basically the same thing as this, and was interested in giving it a shot to see if it was more hands-off, but it's definitely a more complex setup to begin with, given you have to set up your own SSL certs etc. Not sure why they wouldn't just use SSH transport for this like everything else.


I use Wireguard to secure and authenticate the transport. Much easier to set up! SSH is also an option.


Thanks. Good to know that's possible, it's exactly what I use for sanoid also, so I guess the quickstart just assumes that layer isn't available.


Looks like it needs to speak to a daemon running on the storage server?

Would be cool if it could just use e.g. S3 for storage.


Zrepl is a big part of why I feel secure doing the digital nomad thing. A script, run nightly-ish, opens a separate-headered LUKS-protected ZFS pool and then copies all snapshots over. That NVMe enclosure lives in my "purse", which never leaves my sight/body.

Between this and NixOS, I can provision a new identical laptop in about 10 minutes.

I recently added off-site replication as well, so even if I get completely devastatingly mugged, there's still about zero chance of serious data loss.

Zrepl is absolutely brilliant software. Easy to run with, but incredibly sophisticated and powerful if you need all the knobs. I can't praise it enough.


Kudos. You're probably safer than most people who are only one theft, fire, flood, or other disast


... er away from major data loss.


Do you have any experience with sanoid/syncoid? What does Zrepl give you over them?


You should do a write up about how this works. It sounds very interesting.


It's on my list, but... to be honest if you Google "separate header Luks", you'll find it's trivial to create a LUKS device with a detached header. Then the default ZRepl quick start will get you going with the basic pool-to-pool local replication. That will get you almost all the way there. :) I used their docs/guide to do the remote replication too, though it would make a good write-up as I could throw in how I use sops-nix for securing the Zrepl TLS bits for the remote scenario too...


This does sound really cool- and like a way to ultimately set up a secure, not that complex backup method...


Could snapshotting the filesystem every 10 minutes have contributed to its death?


ZFS is very special, and it is cheap to make snapshots with ZFS, because ZFS uses copy-on-write.

Intuitively I would think that the amount of extra writes is pretty low, even if you snapshot very frequently.

But scientific measurements would be nice.

I used to do snapshots every minute, every hour and every day with ZFS on some servers I administered. I’d purge the minute snapshots after 60 minutes. And I had cron jobs on other machines to back up the hourly and daily snapshots. I had it set up so that hourly snapshots were kept for something like 72 hours. And the daily snapshots were kept forever.

The idea with the every minute snapshots being that they were for undoing manually made mistakes during SQL migrations etc.

It worked well for me.
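For anyone wanting to reproduce the per-minute part, it only takes a few lines of cron-driven shell (dataset name and retention are placeholders; sanoid or zfs-auto-snapshot do the same thing with less glue):

#!/bin/sh
# snap-minute.sh - take a per-minute snapshot and prune ones older than an hour
DS=tank/db
zfs snapshot "$DS@minute-$(date +%Y%m%d-%H%M)"
cutoff=$(( $(date +%s) - 3600 ))
zfs list -Hp -t snapshot -o name,creation -r "$DS" |
  awk -v c="$cutoff" 'index($1, "@minute-") && $2 < c { print $1 }' |
  while read -r snap; do zfs destroy "$snap"; done

Run it from cron every minute; the hourly and daily tiers work the same way with longer retention.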

I still use ZFS on my FreeBSD servers. But at the moment my projects are low traffic and the data only changes in important ways some rare times. So with my current personal servers I manually snapshot about once a week and manually trigger a backup of that from another server.

Another thing I’ve changed is that now I only snapshot the parts of the file system where I store PostgreSQL databases and other application data. I no longer care so much about snapshotting the operating system data and such. If I have a serious hardware malfunction I will do a fresh install of the OS, and I have a log of what important config values are used and so on, that my backup scripts copy when I run them, without copying all of the other things.


This sounds overkill even for production data, much less personal data, particularly the every-minute snapshots and the fact you keep dailies forever.

Unless you're a custodian of some secret society's files!


> sounds overkill

But it wasn’t. It was very useful in fact.


Copy-on-write and cheap snapshots was quite special when ZFS was created. It’s hardly special in 2023, when every single non-vintage iPhone, iPad and Mac has that.


ZFS is still very special. Compared to the limited file system capabilities of the machines in the world using FAT32, NTFS, ext2, ext3, ext4, etc.


Wouldn’t a filesystem backup of a running SQL database be corrupted when you try to restore it?


When the filesystem is snapshotted atomically, no. That would be effectively the same state as from a power cut at that exact moment.

It is correct, though, that trying to do it with something like 'cp' on a live database will most likely produce a corrupt copy.


Not if the filesystem itself can do atomic snapshots. The problem can happen if you try to copy the files or even the device when it's being written to. But if you create a snapshot and copy that, it would be consistent.

You can of course end up in a dirty state where some new transaction was started but not committed, but any non-toy database should be able to recover from that. (You'll lose the transaction of course)


it’s equivalent to backing up a server after a hard power off.

as long as the filesystem supports some kind of journaling (aka it’s not ancient) and the database is acid compliant, there shouldn’t be any major issues beyond a slow startup.

but keep in mind that you may lose acid compliance by fiddling with the disk flushing configuration, which is a common trick to raise write speed on write choked databases. if that’s the case you may lose some transactions

your database documentation should have this information.


Mainly if you restore a previous version of the on-disk data while the DBMS is still running I would think. Because then the idea that the running DBMS has of the data no longer matches what’s on disk.

But if you stop the running DBMS before you restore the previous version on disk, and then start the DBMS again it should be able to continue from there.

After all that’s one of the key selling points of a bonafide DBMS like PostgreSQL, that it’s supposedly very good at ensuring that the data on disk is always consistent, so that when your host computer suddenly stops running at any point in time (power outage, kernel crash, etc) the data on disk is never corrupted.

If data is corrupted by restoring from a random point in time in the past, that should be considered a serious bug in PostgreSQL.


It depends on the DBMS, its configuration, the host OS, its configuration, and many other details. Snapshotting running databases is possible and often "just works," but you should always verify that before relying on this functionality in production.


> you should always verify that before relying on this functionality in production

Yep :)

At said company where I was using ZFS snapshots for the servers I administered, I additionally had a nightly cron job to dump the db using the tools that shipped with the DBMS. Just in case something with the ZFS snapshots fricked itself :)
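For PostgreSQL, that belt-and-braces dump can be a single crontab line along these lines (database name, path, and schedule here are made-up placeholders, not the actual job):

0 3 * * * pg_dump -Fc mydb > /backups/mydb-$(date +\%F).dump

-Fc writes the custom archive format, which pg_restore can restore selectively; the \% escaping is needed because cron treats a bare % specially.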


Smart.

I did a daily dump of my prod database to my local workstation; I needed it for a prod corruption issue because ops was only doing weeklies.


Not likely.

The snapshot doesn't write much, and both SSDs and ZFS are copy on write. Which means the cost of writing after a snapshot is the same as before the snapshot.

On the other hand, context is missing. Both SSDs and ZFS don't like being full or even close to full. The working set was ~650GB; if the drive was 1TB, then those snapshots could have easily made the drive over 90% full. This could have made ZFS unhappy all by itself.


I agree that it was unlikely. The total size of all data and snapshots was 625 GiB on a 2 TB drive (which had seen less than 2 years of moderate use). It was a pretty unexpected failure.


> cost of writing after a snapshot is the same as before the snapshot

I didn't understand this, could you please clarify?

If there was no snapshot, there would be only one write operation, the actual write. However, with a snapshot in place, in addition to the actual write, there is a copy operation which copies the original data and writes it to the snapshot location. So, there should be two write operations (actual + copy).


ZFS never overwrites in place in either case; you're just not freeing the old block if it's in a snapshot. A snapshot is just a note saying "nickname this point in time 'mysnapshot', and don't clean up anything referenced at this point in time", so it's very cheap to make, and you just check it later when you would otherwise be cleaning things up.


Does it actually not update in place even for areas with a single reference? I haven't checked the source, but that sounds like fragmentation hell on spinning disks. That would absolutely kill the performance on zfs-hosted VM images / databases, which I didn't think actually happens... (Apart from the intent log, which sure, that's append only)


I promise you, it does not.

ZFS really deeply assumes that, when a region is in use, it will not change until it's no longer in use anywhere, and it also won't reuse things you just freed for a certain number of txgs afterward, so that you can roll back a couple of txgs in case of dire problems without excitement. (Since having enough writes will cause more txgs to happen faster, this isn't an issue people run into with being unable to use newly freed space in practice.)

Also in practice, defining what "sequential" means with multiple disks in nontrivial topologies becomes...exciting anyway, and for writes, you only care that things are relatively, not absolutely, sequential for spinning media, and on reads, prefetch is going to notice you doing heavily sequential IO and queue things up anyway. (IMO)

If you like, you could go check on your configurations, what the DVAs for the different data blocks in your VM images are - something like zdb -dbdbdbdbdbdb [dataset] [object id, which you can get from the "inode number" of the file, or if it's a zvol, I think it's always just 1 that all the data you think of as the "disk" goes in...]

You'll almost certainly find that the regions that changed more than a couple txgs apart (the "birth=" value is the logical/physical txg the record was created) are mostly not remotely sequential.

(Nit - the two exceptions that come to mind are, the uberblocks are basically a fixed position on disk relative to the disk's size, and a fixed size, and you get [fixed size]/[minimum allocation size] of them in a ring buffer, basically, before you overwrite the oldest one, and that happens by just overwriting it, since it's technically not in use any more, someone just might want to roll back to it in a "This Should Never Happen(tm)" case...or the newly added feature of corrective send/recv, to let you feed ZFS a send stream of an "intact" copy of something that had an uncorrectable data error and have it scribble over the mangled copy with the fixed one in-place, assuming it passes the checksums.)


So looking at various benchmarks, reports and tuning guides, it does look like the spinning disks performance really suffers from zfs fragmentation. I haven't seen those before, but also haven't dealt with databases on zfs either. Something to keep in mind I guess.

Edit: after reviewing a few benchmarks, the outcome seems to be - even on SSD, make sure you actually want the zfs features, because ext4 will be a lot faster.


Yeah, it's a tradeoff. Zfs gives you easy data integrity verification (and recovery if you have redundancy), easy snapshotting, easy send/recv. But you lose out on modify in place, and unified kernel memory management (at least on FreeBSD and Linux, maybe it's different on Solaris?); both of those can reduce performance, especially in certain use cases.

IMHO, zfs is a clear win for durable storage for documents and personal media. It's not a clear win for ephemeral storage for a messaging service or a CDN. If you don't mind running multiple filesystems, zfs probably makes sense for your OS and application software even if your application data should be on a different filesystem.


Do you have pointers?

Because there are various mitigations and configurations involved if you're trying to do lots of small random IO for ZFS, and I've not heard people giving the advice of "just don't" in most use cases.


Just search for "zfs ext4 postgresql benchmark" - you'll find many of them using different configurations.


There's no copy operation, the previous data isn't overwritten and the new data is written to a new block. It's "copy-on-write."


As I understand it, taking a snapshot with ZFS involves writing a metadata object and some data references. Assuming 100 GB of data, 128K block size, and 64 bit pointers, I'd guesstimate * that new data written during a snapshot would be in the ballpark of 5 MB. Is doing that 6 times per hour (52,560 times per year) enough to cause premature wear on the drive? That would be ~256 GB per year. This is likely under 1% of an SSD's write endurance. So, I'd be surprised if taking 10 minute snapshots was a significant causal factor.

* I could be wrong, I asked for some help from not the most reliable sources. Happy to be corrected. Still, if my estimate is higher than actual and yet still unlikely to affect drive longevity, it may be moot.


Not likely. A snapshot just marks the most recently written block and prevents previous blocks from being altered. (More or less.) Since ZFS is copy on write, any changes to files will involve the same writes and some previously written data will not be deleted.


Minimum write size of a modern Flash chip can be ~100MB(!) according to a comment found in a random orange website[1]. So 5MB write every 10 minutes can be 600MB/hr, which is 4.8TB/8-hr-day, which is 24TB/40-hour-week, which is 3.43 DWPD real time for a 1TB drive, and 2500 TBW in 2 years real time[2].

Official quoted specification for SN850 is 600 TBW of write endurance, likely after derating for obvious warranty implications. Incidentally, 2500TB is also a typical endurance figure for many SSDs in this market. Overall, to me, sounds not entirely impossible.

I kind of wonder what's the controller says in SMART data, if still alive. On Linux the command is `apt install smartmontools; smartctl -s on /dev/sda; smartctl -A /dev/sda`, and it shall print out a table[4]. On Windows, just install CrystalDiskInfo[3].

1: https://news.ycombinator.com/item?id=29165202

2: DWPD: drive writes per day, TBW: Total Bytes Written - in terabytes

3: https://crystalmark.info/en/software/crystaldiskinfo/

4: Note that "Pre-fail" means the value is supposed to change when about to fail and "Old_age" means the value is supposed to indicate age, NOT "this is bad and about to fail" and "this drive is old". It always says all Pre-fail and Old_age. Someone should have changed it to "somewhat_boolean" and "life_remain" long time ago in my opinion.


Unfortunately the drive didn't appear accessible at all via nvme-cli. Interestingly it shows up in lspci but doesn't get a /dev/nvme. It tends to hang the UEFIs of the two systems I tried it in when they try to read it.


Minimum write size is not erase block size.


My machines have always been just constantly writing logs, like every couple seconds (macOS does this), and the write wear has never been anywhere near that bad. The advertised endurance must take into account write amplification for typical loads.


> Minimum write size of a modern Flash chip can be ~100MB(!) according to a comment found in a random orange website[1]

Your reference says erase block size. That is not minimum write size. Sorry but you are clueless (and careless).


You were off by a factor of 1000: GB, not TB.


Nope. As others have mentioned, ZFS is CoW. Snapshots are "free" in that they (basically) point to a transaction group in the filesystem. They record a small amount of metadata to disk on each snapshot - on the order of a few MB. This is much much lower than an rclone/sync style backup.


That's also the least interesting part of comparing a ZFS backup to rsync/rclone: the rsync way is to crawl the entire tree being backed up, diff it over the network, then copy the differences. Because of snapshots, ZFS already knows all the changes that occurred between snapshot A and snapshot B, and (provided state up to A has already been backed up) can update the backup by pushing all changes between A and B as one big binary blob without having to scan or diff anything.
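A minimal sketch of what that looks like (pool, snapshot, and host names are made up):

    # initial full replication, then only the delta between snapshots
    zfs send tank/home@A | ssh backup-host zfs receive backup/home
    zfs send -i @A tank/home@B | ssh backup-host zfs receive -F backup/home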


No. Individual snapshots are a matter of kilobytes; desktop environments write considerably more.


I read somewhere that snapshots are actually around 5 MB. Still not a lot, but a lot more than a few KB. A year's worth of hourly snapshots comes to over 40 GB just in snapshot overhead.


Another factor to weigh in my case is this laptop probably spends at least 50% of its life suspended. The overhead should be measured in MB per hour of uptime.


No. ZFS implements normal writes to the disk as snapshots (just unnamed ones), so in fact you can only write to disk through the creation of a snapshot, or by writing to the Intent Log, which is a short-term log of data that will go into the next snapshot but was synced before the snapshot was written, so it's secure in case of power failure.


Nah, changes are COW so most snapshots are tiny.


I guess it could contribute somewhat, but I don't think it is that much additional work: for every new write (since the last snapshot), there is one additional read as the data is sent. It isn't reading the whole file system, just the incremental data.


Of course it contributed. But it probably wasn’t the main reason or a significant contributor.


I tend to use Apple’s Time Machine incremental backup to a Synology spinning rust server. I also have an external SSD that I’ll mirror the internal drive to, if I’ll be doing anything dodgy, or upgrading my machine.

That works. TM restores can be quite slow, but almost all my important data is in Git (and hosted storage), so it’s not really been an issue. I just use TM every now and then, if I have a single file I want to backtrack.

I also have one of the notorious[0] SanDisk drives. I don’t use it for anything important. It just has some game storage. Since I’m a Mac user, games aren’t really much of a factor for me, and I won’t cry, if they croak.

[0] https://arstechnica.com/gadgets/2023/08/sandisk-extreme-ssds...


I use TM as well. I got an app (https://tclementdev.com/timemachineeditor/) that will manually trigger TM backups whenever the machine is idle. Seems to work a lot better than the Apple automatic or timed backups.


I’ve given up on Time Machine. It never seems to work past a month or two for me on my Synology w/ atalk etc.


It was always breaking on my Synology too. I ended up just attaching an 8TB spinning rust directly to the Mac and it's been flawless since.

Time Machine really doesn't like using remote disks that aren't official Apple gear.


Even when I had an Apple Time Capsule, it would break about once a year. It's just a flaky system. Wish they'd add the equivalent of zfs send to APFS instead of using the weird "gigantic sparse disk image with hard links in it" system.


When backing up to an APFS Time Machine volume it does work a bit like that; at least no hard links are used any more:

- https://eclecticlight.co/2021/03/11/time-machine-to-apfs-und...

- https://eclecticlight.co/2021/04/16/time-machine-to-apfs-usi...


Oh awesome! I guess I need to delete and re-create my Time Machine backup to enable this.


Ditto. Once I bothered to set up Borg for my other boxes, it was overall much less pain to just use it on macOS too.


FWIW Apple has deprecated AFP/AppleTalk and you should disable it on the Synology. It's far more stable with SMB (but still not great)


My snapshots are encrypted by the original computer (this is cool because the NAS can’t read them!). So I also needed to restore the encryption “wrapper key” to be able to use the backups.

Not gonna lie, it was pretty terrifying until I had my first confirmation I could decrypt the data.

Note: no bespoke backup method should be assumed functional unless you actually periodically check that you can restore data from it.


Or non-bespoke.


I added that word because, at least with non-bespoke backups, you probably have thousands of users each day testing restoration out of necessity, and word might get around if the method failed to restore. But nevertheless, one should still test even then.


In short, backups aren't actually taken until they have been verifiably restored.


I always see this advice but... how do you even do that without having an entire additional set of disks to restore to?

You can't restore to production, obviously: aside from the downtime, if the test fails you've just destroyed your good copy and proved the other copy is also bad.

About the best I can think of is restoring a small part of the set as a sample, which isn't really testing the whole thing.
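With ZFS specifically, one middle ground is to receive the backup into a scratch dataset on whatever pool has free space, spot-check it, and throw it away; a sketch with made-up pool/dataset names:

    # receive the backup stream into a throwaway dataset, check it, destroy it
    zfs send backup/home@latest | zfs receive -u tank/restore-test
    zfs set mountpoint=/mnt/restore-test tank/restore-test
    zfs mount tank/restore-test
    # ...diff/spot-check a sample of files against the live copy or known checksums...
    zfs destroy -r tank/restore-test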


I always meant to keep a copy of my LUKS header on a separate disk, just in case, but..
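For reference, it's a one-liner (device path and destination are placeholders; store the backup anywhere other than the LUKS disk itself):

    cryptsetup luksHeaderBackup /dev/nvme0n1p2 \
        --header-backup-file /mnt/other-disk/luks-header.img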


bcachefs is the near future for Linux here. https://bcachefs.org/


What does bcachefs do better than BTRFS or ZFS?


Than both? Tiered storage. Than btrfs? Hopefully parity RAID, and performance. Than ZFS? Being GPL-compatible, so viable for being in-tree in the kernel.

(list most likely incomplete)


Near future? It's been in development for 7 years and still hasn't been accepted upstream (an upstreaming effort is in progress, though).


It's just silliness (attitudes, versus technical merit) blocking it now. The only thing stopping it is if something happens to Kent.

https://www.phoronix.com/news/Linux-Torvalds-Bcachefs-Review


Any tldr about why we should be looking forward to this?

Any cool things?


Erasure coding, when it lands, should be pretty solid.

Until then, per-directory data replicas are the killer feature for me (Music has 3, Documents has 5, Downloads has 1). Combined with full compression and encryption, that's something to be very excited about.


You can do that with ZFS at the cost of defining separate filesystems per directory.

I don't use multiple replicas, but I use that to tailor my backups per directory. ~/documents is snapshotted and backed up on the regular, with long-lived snapshots. Code is snapshotted regularly, but snapshots don't live too long, and they're not shipped to a different drive. I don't care for ~/tmp so no snapshots.
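A sketch of that, with made-up dataset names. One caveat: ZFS's copies= tops out at 3 and stores the extra copies within the same pool, so it isn't quite the same as bcachefs's per-directory replicas across devices.

    zfs create -o copies=3 tank/home/music       # extra in-pool redundancy
    zfs create -o copies=1 tank/home/downloads   # no extra copies
    zfs create tank/home/documents               # gets its own snapshot/backup policy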

---

edit: the cost, besides having to actually create the file systems, is that moving data between them isn't instant.


> the cost, besides having to actually create the file systems, is that moving data between them isn't instant.

That will no longer be the case once block cloning support goes into production: https://github.com/openzfs/zfs/pull/13392


Nice, thanks!


> Not gonna lie, it was pretty terrifying until I had my first confirmation I could decrypt the data.

Sounds like the process was a bit less tested and documented than optimal. For a home system or personal desktop that's not super unusual.

You don't want to be working out your restore procedure on the fly for production servers though. ;)


For sure. I knew I had all the right ingredients to restore, and had kicked the tires 6 months back, but I initially copied over the wrong key. When it failed to load I had a sad moment until I realized my mistake. In such fail moments there's a flash of clarity where every process gap becomes blindingly obvious.


For every ZFS fan, I can recommend zfs-auto-snapshot[1]. I use it on my proxmox server[2] to auto-manage snapshots, including throwing away old ones.

[1]: https://github.com/zfsonlinux/zfs-auto-snapshot

[2]: https://pilabor.com/series/proxmox/restore-virtual-machine-v...
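For reference, the cron entries it installs look roughly like this ('//' means all datasets that have the com.sun:auto-snapshot property enabled):

    zfs-auto-snapshot --quiet --syslog --label=frequent --keep=4  //
    zfs-auto-snapshot --quiet --syslog --label=hourly   --keep=24 //
    zfs-auto-snapshot --quiet --syslog --label=daily    --keep=31 //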


I actually made two scripts to automate sending and deleting old snapshots, along with one that calls auto-snapshot when a VM reboots or shuts down. I was thinking that they might be useful to other people.


For people who don't want to use ZFS but are okay with LVM: wyng-backup (formerly sparsebak)

https://github.com/tasket/wyng-backup


I wonder how much he gained from entirely restoring the system versus simply reprovisioning (gasp, even manually reinstalling) and restoring needed files at will. I'm not sure there's a lot of value in snapshotting and restoring, in a lost-SSD situation, stuff that's also available in mirrors across the world.


I've reflected similarly after this exercise.

I've lost data before, but it felt terrible to lose my context and working memory. While I make sure the most important stuff is in git, there's a bunch of momentum and working memory in my bash history and system configuration. It's also nice to not have to think very hard about a patchwork of backup plans.

It's nice to get a fresh start every now and then, but not under duress. I was in the middle of a multi day project and was gonna lose time either way. It was real nice to boot back into a machine that felt like home.


I do a version of this that doesn't require restore at all, I have three separate physical systems that all share a VPN across the world, and wherever I am at any given time zfs snapshots are syncing across that VPN to those three physical systems depending on which one is the primary I'm using at any given point in time. They also use the VPN layer to check if they have peer status on a faster local network like the wifi or LAN and use that instead for the snapshot transfers if so.

If for any reason any one of these systems either is destroyed or is no longer master, picking up from where I left off is as simple as picking another system up and marking it "master". No restore process, no changes, nothing at all, and it picks up from exactly where I left off when I was working on the other system. Means I can just grab my EDC laptop and stuff it in a bag not knowing how long I'll be out or where I'll be going and also know that it will be completely up to date with my datasets, or I can grab my desktop replacement laptop and its enormous external disk if I am going to be on a different continent for an extended period of time and want full geographic dataset locality. At no point in time does any of the above require the manual running of any process or replication or anything like that.

Reprovision and restore would take a whole lot longer than this, wouldn't give the abilities that it provides, and the above is only possible because of zfs snapshot replication.

I also use a USB-C external SSD that is a member of a bootable ZFS mirror and md RAID group, so even if my EDC laptop were to spontaneously combust, I could immediately get up and running on any similar laptop with roughly comparable hardware simply by putting that SSD in and booting from it, then adding that laptop's SSD to the ZFS mirror / md RAID group.


The biggest time save is in time spent recovering. It's so much faster to restore the entire system than to reinstall the OS, reconfigure the bootloader, resetup disk encryption, reconfigure user accounts, reinstall all software, manually reload configs, etc.

Or to put it more directly, full-disk backups are a great way to get RTO (recovery time objective) down.


In this case the author had to do arcane magick to restore his zfs snapshot. This wasn’t a routine raw dd restore.


Agreed. It was a large time investment that happened to pay off. If I ever have to do it again I'll be much faster. I hope that with wider ZFS adoption some of the routine tasks will be automated better in the future. I see no reason why, in a couple of years, this couldn't be a mature fire-and-forget user experience.


If it can be done manually, it can be automated and made more reliable!


I read this piece as someone who just got lucky. He never tested it until he needed it. 10 minutes is good. Don't let this story fool you into not taking backups.


I tested my backups about 6 months ago when I set up zrepl. When I mentioned it was scary until I could decrypt the data, that wasn't the whole story, actually: initially I restored the wrong wrapper key and it failed to load!

It's also scary in general to go from 2 copies to only 1 copy of data. A friend and I have been planning to trade replicas but haven't set it up yet. There's definitely still room for improvement in my setup.


> Don't let this story fool you into not taking backups.

Do you mean don't let this story fool you into not testing your backups? Because the whole point of the story is he was saved by having his backups. (Though you're right that he lucked out by having it work when he hadn't tested it.)


I just have rsync running in a cronjob. How is this significantly different?

I imagine it is, but I don't know how.


Mostly different in terms of performance and wear on the drives. If you rsync over and over, it has to scan basically the whole filesystem for changes each time. Zfs snapshots don't. The snapshot is ~instant and the calculation of what to send has no need to examine any files.

I don't think the performance and drive-lifetime hit of running rsync every 10 minutes would be good.

ZFS should have an edge in terms of atomicity as well, but in practice I'm not sure how much that matters. I _think_ it does matter but isn't perfect (ZFS can't trick applications into doing atomic writes if they aren't already, but it won't add _another_, worse layer of broken atomicity the way rsync must).


Depends what else the machines are doing, and how much ram, and the rsync settings.

If you read all the files on both sides, every time and don't have more ram than disk, it's going to be a lot of work. If you're just looking at directory entries most of the time, there's a good chance that's all cached and it's no disk load, other than the small changes.

I agree with you, though, that atomicity is a big difference, if it matters, and in most cases it probably doesn't.

Personally, I've mostly stopped doing rsync backups in favor of zfs send, but I've still got one I need to get around to changing. Sanoid/syncoid is pretty decent for less-effort snapshotting and syncing snapshots, but I haven't done anything with encrypted datasets. For most of my systems, I'd prefer recovery over security. For the one system in iffy hosting, it runs full-disk encryption as a layer below ZFS, so its zfs sends are cleartext, too. (The hosting facility has given me other customers' unwiped disks; better for me to assume my disks won't be wiped.)
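In case it's useful to anyone: a sanoid policy is roughly a stanza like this in /etc/sanoid/sanoid.conf (dataset name made up), and syncoid then replicates it with something like `syncoid tank/home backupbox:backup/home`:

    [tank/home]
        use_template = production

    [template_production]
        hourly = 24
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes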


> If you read all the files on both sides, every time and don't have more ram than disk, it's going to be a lot of work. If you're just looking at directory entries most of the time, there's a good chance that's all cached and it's no disk load, other than the small changes.

Yeah, that's a good point. I know that rsync is _quite_ clever, but with any incantations I've ever done, it still hits the drives a good amount. I'd ballpark-guess a couple of orders of magnitude better than just "cp -r" or something, but still a couple of orders of magnitude worse than ZFS snapshots.

Yeah, you're 100% right that it'll depend on a bunch of variables, though.

> Sanoid/syncoid is pretty decent for less-effort snapshotting and syncing snapshots, but I haven't done anything with encrypted datasets.

I'm not sure I'd recommend it, but I use both directly on encrypted datasets. I have tested recovery a couple times and it works fine, but I've read some cautionary tales too. I _think_ they're all old issues?


Regarding atomicity, do you rsync from the live filesystem or from the hidden .zfs/snapshot dir?

My impetus for swapping from an rsync-based approach (running an incremental from the normal, non-snapshot filesystem view, then taking a snap on the far side) to a send-based approach was encountering corrupted encrypted containers. If rsync ran while a container was mounted and being written to, it'd get an inconsistent view of the underlying file and produce a nonsense diff, resulting in an unmountable container in the backup. This doesn't happen with send because it has a consistent view of the blocks it needs to replicate, but as I was writing this I realized that running rsync from ZFS's view of the snapshot might get around that. Of course, at that point it's probably easiest to just use send.
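For the record, that variant would look something like this (paths and host are placeholders; the .zfs directory is reachable even when snapdir=hidden):

    snap="rsync-$(date +%F-%H%M)"
    zfs snapshot "tank/data@$snap"
    rsync -a "/tank/data/.zfs/snapshot/$snap/" backup-host:/backups/data/
    zfs destroy "tank/data@$snap"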


My rsync backup goes from the live filesystem (and then does some snapshot-like things on the other end with hard links and whatnot), but it's for my household's shared network drive, so atomicity isn't super important. Mostly there aren't many changes, and if there are, it's OK if it takes a couple of snapshots to settle.

I didn't originally have that area as its own ZFS filesystem, and it wasn't even originally on ZFS, but I moved things around when setting up a new, offsite backup system... Just haven't gotten around to redoing the old backups. I don't think I'd spend any more time on rsync-based backups, given how I use things now; incremental ZFS means no need to compare, which gives me good feelings.


Is reading a drive actually that big a hit on the drive's lifetime? At least for SSDs, I always see write endurance quoted, never read.

Sure, having to scan a whole tree of files can take a toll on the performance as perceived by other apps trying to use the drive.


For SSDs I'm not sure. For my data drives, SSDs are still too expensive, so I haven't switched. For spinning rust it definitely does matter (for drive lifetime and also contention).


I've always figured that for spinning rust, what kills them is power cycles. So, unless you're absolutely sure they won't be woken up, disable "sleep" mode. If you're sure nothing's going to wake them up, might as well turn them off completely, and the server also, saving a buck or two.


While rsync does incremental backups fine, it doesn't offer deduplication. If you want that, have a look around at newer options like restic or others such as borg (see a comprehensive list at https://github.com/restic/others)


rsync does have --link-dest which you can sometimes use to get a (file level) dedup effect.

zfs dedupe is pretty expensive and doesn't often work out like people might expect...
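A sketch of the --link-dest pattern (paths are placeholders): each run produces what looks like a full tree, with unchanged files hard-linked back to the previous run.

    rsync -a --link-dest=/backups/2023-08-22 /home/me/ /backups/2023-08-23/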


> rsync does have --link-dest which you can sometimes use to get a (file level) dedup effect.

Sadly, for safety's sake, directory hardlinks pretty much don't exist, so this doesn't save as much as it could. Apple hacked in an exception for Time Machine so they could get those additional savings.


For those that don't know, there are many wonderful incremental backup solutions that don't require ZFS.* For one, I personally recommend Restic (https://restic.net) because of its deduplication.

* People on macOS don't have ZFS, well... maybe they could? See https://github.com/spl/zfs-on-mac
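A minimal restic sketch (repository path is a placeholder):

    restic -r /mnt/backup/restic-repo init
    restic -r /mnt/backup/restic-repo backup ~/Documents
    restic -r /mnt/backup/restic-repo snapshots    # list what's stored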


bupstash.io was my favored option other than ZFS. It's a beautiful and performant solution. Being filesystem agnostic is an advantage in many contexts.

In the end I chose ZFS for the efficiency of snapshots (vs. a full disk scan) and atomicity. Both enable more frequent, smaller syncs, which is perfect for a laptop.


macOS Time Machine does incremental backups; is there any other reason you might need ZFS?


I see that his NAS is using an external hard drive enclosure over USB. I'm really curious: how wise is such a NAS setup?

In some ways it seems very attractive as you can just get a low-power SoC (like a Raspberry Pi or a NUC) and hook it up to the external drives.

But there also seem to be many potential pitfalls. Like, how slow will a resilver be over USB? Might it be unusably or dangerously slow? How reliable is the USB connection? Does it perform to spec, or might it cause weird issues? Can you get SMART info over the USB connection? Other issues?


> NAS is using an external harddrive enclosure over USB.

The N in NAS means "network", as in network-attached storage, so it's not a NAS.

> really curious about how wise such a NAS setup is?

For data use cases like this, USB 3 can be reasonably comparable to Thunderbolt 3, and that connection is generally faster than the media.

This use case seems to be using the external device as a continuous external backup rather than as network attached storage, which is a great use of USB-C dongle SSD enclosures that are the same size or larger than the SSD inside the laptop.

You effectively have mirroring as well, since you have both the internal SSD copy and the external SSD copy, in different makes and forms, unlikely to both fail.


How I interpret the setup that OP has is that he has some computer (a Pi, an old laptop, a NUC, etc.) which is connected to drives in an external drive bay through USB. This is a somewhat common setup and is definitely a NAS.

> For data use cases like this, USB 3 can be reasonably comparable to Thunderbolt 3, and that connection is generally faster than the media.

External HDD enclosures can often contain 4 drives. During a resilver, all of these could be heavily accessed. I'm not sure how a single USB 3 connection fares in this scenario. In a normal desktop you'd have four separate SATA connections, and even then resilvering a large RAID setup can take quite some time.


In the context of the grandparent, a "NAS" is a device, attached to the network, that provides storage. How the storage the NAS provides is connected is irrelevant.

On a related note, I can't think of any drive that has a network interface instead of something like USB, FireWire, Thunderbolt, SCSI, IDE, etc., so how exactly would you define a NAS device?


How could I backup using incremental atomic snapshots on Windows?


Look into the Windows Volume Shadow Copy Service.

Unfortunately, I haven't had luck with open-source backup software that uses it (the shadow copy snapshot would fail, the error code was no help, and finding no resources, I gave up), but the commercial software I've used was great. When I was at a big corp, the commercial backup software whose name escapes me at the moment would literally wait for files to be saved, then do an incremental backup.

As of now, I'm using Veeam on my personal machines, and it runs an incremental backup nightly and saves to an SMB share.


I've been using urbackup for years. It does disk image-based backups and/or file-based backups. Disk image-based backups are incremental at the block level, and file-based backups are incremental at the file level (so if a single byte of the file has changed, the entire file gets backed up). It uses the Volume Shadow Copy mechanism that the sibling comment mentioned to get atomicity and avoid file locking issues.


I wonder if zrepl could be run in WSL2 - it would be nice to back up Windows computers as well using this approach.

At the moment, I use Nextcloud to sync data to my server. It's a more selective approach, and Nextcloud is, per se, not a backup solution because not all files can be backed up... and live-sync is always a half-baked backup solution.


How would that work? Zrepl ships ZFS snapshots. You could probably wrangle WSL2 into installing its distro on ZFS. If so, I see no reason why zrepl wouldn't work within the Linux environment. But snapshotting the whole Windows drive? I don't think so.

If you want similar features, I think ReFS comes close. AFAICT it's not supported as a boot drive.


I thought maybe if Windows were installed on one ZFS volume and WSL2 on another, zrepl from Linux would be able to back up both: snapshots of the Linux and the Windows volumes. But a quick search on Google reveals that Windows on ZFS is not a thing yet.


You could flip that on its head, and run Windows in a VM on Linux on a ZFS volume. Depending on what you do on Windows and your particular hardware setup, this may or may not work well enough.

I see people use GPU pass-through to play games on Windows VMs. You could probably pass through practically all devices (GPU, sound, network, keyboard, etc.) and this could work well-enough if you don't need the absolute last drop of performance from your CPU and drives. And since KVM supports nested virtualization, you could run WSL2 in the Windows VM.

And if I'm not mistaken, the KVM agent in Windows can be told to ask the guest OS to sync its drives, and some Windows applications [0] can even cooperate with this and flush their buffers to disk. You could signal this before creating the ZFS snapshot.

[0] probably not most, but I think MSSQL does.
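A sketch of that flow using libvirt's guest-agent commands (guest name and dataset are made up; requires qemu-guest-agent inside the guest):

    virsh domfsfreeze win11                        # guest flushes and freezes its filesystems
    zfs snapshot tank/vm/win11@backup-$(date +%F)
    virsh domfsthaw win11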


Yes, this could be a possible direction. However, judging by reports from Reddit [1], Windows on ZFS still produces lots of BSODs. Maybe one needs a ZFS volume (zvol) formatted as NTFS?

Anyway, these days I work 60% of the time in my VMs and the other 40% in WSL. Windows is reduced to a graphical interface; maybe I should simply ditch that last mile, too.

[1]: https://www.reddit.com/r/zfs/comments/yxipyy/anyone_using_op...


I've read that ZFS is less safe than other Linux filesystems if you don't use ECC RAM, because it assumes that there are no memory errors and therefore doesn't provide a tool to repair a filesystem corrupted by such errors. Is this true?


It's not true. That's basically ancient forum myth, alongside the also incorrect "ZFS needs 1GB memory per TB of HDD" nonsense that has thankfully mostly died out finally. ZFS makes no additional assumptions when using ECC vs non-ECC memory.

It is theoretically possible to construct a scenario where evil RAM does exactly the right things needed to fool ZFS and corrupt your filesystem. Any pearl-clutching about this thing which has never happened somehow also ignores that bad RAM will corrupt any filesystem.

In reality, while ECC memory is always nice to have, it's no more required than for any other filesystem. Though personally, now that 32GB+ of memory is common, I generally prefer error correction/detection over ultimate speed these days. Ironically, ECC memory is actually really nice to overclock, because I can just check my logs and prove whether my system is stable.

There are so many actual dangers to your data in comparison that it's laughable. The biggest one is you, followed by hardware failure, malware, and genuine ZFS bugs. I'd stay far away from raw sends of encrypted datasets in ZFS for a while; there are edge cases that haven't been resolved yet.

Edit: a longer article saying the same thing: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...


"I don't backup my drives, I replicate them"

<sigh> I mean, sure, he recognizes the difference, which a lot of people don't, and I guess yay for ZFS saving him here, but this is just irresponsible if you value your data.


Spoiler: he has 10-minute incremental backups.


Its "backups" join "zfs makes snapshots easy" join "snapshots make incremental backups easy" join "backups on device aren't a backup" join "I had off-device backups"

which reduces to "I had backups" indeed.

3-2-1 forever!


> "backups on device aren't a backup"

That wasn't part of the article. It was a single drive failure, so RAID would have done fine.


Yes, RAID will get you over some failures. But, it still isn't a backup. Backup is what gets you over corrupted RAID, loss of both sides of the mirror stripe, entire disk failure when its not RAID.

What he does is run zrepl to make a backup; it covers his needs. A ZFS snapshot by itself is only transiently a "backup" for the immediacy of a change; it's the least safe form of backup if it remains on the same logical drive structure.


> Yes, RAID will get you over some failures. But, it still isn't a backup.

Backups can also fail; that zrepl setup can start failing after some OS/kernel update without notifying the owner. The question is one of failure probabilities, and I'd kinda trust industrial RAID more than some custom-made hobby solution.


I really figured we'd have super-easy hardware RAID1 even in consumer-level PCs by now, given how cheap (and unreliable) drives are.

My SSD boot drive makes me nervous as heck; I'm constantly backing it up.


Was he able to back up and restore his boot/ESP partition as well using ZFS, or did those need to be reprovisioned manually?


Am I the only one who doesn't have any important data? If I lost everything today I would just start afresh and move on.


> Am I the only one who doesn't have any important data?

Quite possibly.


At all? Like, anywhere? Or do you just have data living on cloud services instead of locally?


At all. What important data do you have? I really can't think of anything that I would miss if everything disappeared.


Address book, photographs/video of people I care about and holidays, personal diary, hobby projects, old letters/emails, stored passwords, archived bank statements/contracts/insurance and other important documents.

I think you are very unusual if you don't care about any of this.


Yes, I think it's unusual not to care about photographs/videos. I used to think that they were important and that I would want to revisit them some day. However, now that I'm old enough that I should, it simply hasn't happened. They are just some files that I will never open.


Most administrative stuff, especially if it's in digital form anyway, can probably be recovered without too much hassle if it's lost. But you're right that most people have a bunch of stuff (not all of which is admittedly digital) they wouldn't want to lose.



I would love to be able to do this on macOS, with the click of a button.


FWIW I've used Arq Backup[1] for several years now, and I've successfully restored at least twice after my MacBook died. It also encrypts the data before it leaves your computer, and supports tons of (cloud) storage solutions; I use Google Cloud Storage and spend about a dollar per month on storage costs with hourly backups.

[1] https://www.arqbackup.com/


tldr: my drive died, and I had a backup.

ZFS seems incidental to me. I could have had a 10-minute cron job rsyncing changes from ext4 and been just as well off.


Classic HN comment!


Is this so?


What's the equivalent setup for a Mac user?


I feel like a computer running ZFS and serving files is fine. And it should itself be treated as a storage device with a full parallel backup, even though this has a cost.

But your computer shouldn't run ZFS; that's for the big boys upstairs. The code's too big, it's too hungry.



