Why RAID 5 stops working in 2009 (2007) (zdnet.com)
61 points by pmoriarty on June 1, 2014 | 73 comments



Note: This article is from 2007 and is quite prescient.

It's completely shameful how bad the specified read error rates are now.

It's to the point that if you read an entire 4TB disk you have a 32% chance of one bit being wrong!

That means hard disks can no longer be considered reliable devices that return the data written to them; you now need a second layer in software checking checksums.

http://www.wolframalpha.com/input/?i=4TB+%2F+10^14+bit++in+%...

For extra money they sell hard disks with 10^15 reliability instead of 10^14 - this should be standard!
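
Roughly, the arithmetic behind that Wolfram Alpha link, treating the spec as an average per-bit rate (an assumption the replies below take issue with) - a quick sketch:

    # Sketch: how many bad bits does the quoted URE spec "allow" per full read
    # of a 4TB drive, if you read it as an average per-bit error rate?
    bits = 4e12 * 8                # 4TB in bits (decimal terabytes)
    for rate in (1e14, 1e15):
        print(f"1 error per {rate:.0e} bits read -> "
              f"{bits / rate:.2f} expected bad bits per full read")
    # 1e14 -> 0.32 (the "32%" above), 1e15 -> 0.03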


I think that it's more shameful that modern operating systems are only slowly adopting file systems that can even detect that silent data corruption has occurred. EXT4, HFS+, and NTFS all make no effort to ensure the data you read back is the data you wrote.

Btrfs on Linux and ReFS on Windows are slowly becoming more prevalent. Since storage density is growing faster than most home users can use it, hopefully in the near future consumer computers will come configured to store multiple copies of their data. Instead of one day noticing that a picture is corrupted or the computer won't boot, periodic scrubs will detect corruption before it's too late.


> It's to the point that if you read an entire 4TB disk you have a 32% chance of one bit being wrong!

If this were the case, forums would be filled to the brim with ZFS users whinging about how they get CKSUM errors practically every time they scrubbed their pools. They do happen, but nothing like that frequently.

> That means hard disks can no longer be considered reliable devices that return the data written to them

They were never such devices in the first place if you ever valued your data enough to care. Even if your disks are perfect, your cabling, backplanes, IO controllers, drivers or memory probably aren't.


> If this were the case, forums would be filled to the brim with ZFS users whinging about how they get CKSUM errors practically every time they scrubbed their pools. They do happen, but nothing like that frequently.

I'm going by what the manufacturers say - are you saying they are overestimating the error rate? Why would they do that?

> They were never such devices in the first place

No, I meant they would return false data instead of an error. Used to be you never got corruption, just success or fail. Now you have success, false success, or fail.


> I'm going by what the manufacturers say

Are you really? "Non-recoverable Read Errors per Bits Read, Max", says Seagate - that just means it couldn't correct, not that it couldn't detect that there was an error.

> are you saying they are overestimating the error rate?

Perhaps not, as a general order-of-magnitude upper-bound on a marginal drive in poor conditions, but certainly if you're interpreting it as an average. Too many people would be seeing it happen way too often to keep quiet about it otherwise.

> Used to be you never got corruption, just success or fail

I wish :/


I think we have very different definitions of 'completely shameful'. One bit wrong in four trillion, on a device that only costs a couple of hundred dollars?


It sounds good, doesn't it? But then you compare it to the size of the disk and realize that it's not actually that good.

The shameful part is that they actually sell the 10^15 drives - and not for a ton more money either. They simply bin drives and the better ones are rated as 10^15. Instead of doing that they should figure out what's different in the manufacturing of those and do it across the board.


Yeah... shameful. I am sure that there is something simple that all the drive manufacturers are just overlooking in their process that would take them to 10^-17 BER.

When was the last time you created a physical artifact for sale that had even just part-per-billion malfunction rates?

It isn't as easy as these guys make it look by a long stretch.


Even worse/better -- one bit wrong, not byte. So one bit wrong in thirty-two trillion.


lol @ "they should sell hard disks with 10^15 reliability instead of 10^14!" This is the funniest outrage I've seen in a while, just because the suggested fix is so specific and (relatively) low for the huge amount discussed.

It's like saying, "It's an outrage that this drink holds only 37 ounces! For this kind of money I expect it to hold at least 38 ounces. I would pay for that."

-

My 4 downvoters are clearly a lot worse at math than I am.

Let me spell it out for you:

"Reliability of 100000000000000 is unacceptable! I need exactly 1000000000000000 of reliability (a factor 10 difference only)."

If you don't see how that is funny, the problem is with you, not me. That's a hilariously specific, and small, change.

"Man, having a 30% chance of a bit flipping if I read 4 TB is unacceptable. I should have to read like...40 TB for a 30% chance of a bit flip!"

It's particularly funny because, in the face of this being unacceptable, I would expect a solution like error correction, which adds 5-10 orders of magnitude (many bit flips would all have to happen in the same sector, by coincidence, at the already low 10^14 rate, for an error to make it past the error correction layer).

The other thing is that given that we're talking about ERROR RATES, I cannot possibly believe that they have it down to such a specific number as 100000000000000. It would have to be a range. You should target more than 1 order of magnitude to get something more reliable (like ECC RAM versus regular RAM). [1]

Just like the error margin on 37 ounces must be like 37-40 ounces anyway.

[1] first reference I found: http://lambda-diode.com/opinion/ecc-memory-2

"I computed that if you have 4 GiB of memory, you have 96% chance of getting a bit flip in three days because of cosmic rays. SECDED ECC would reduce that to a negligible one chance in six billion." That is, for example, a change of

1 per 34359738368 bits (rounding 96% to 100%) to 1 per 206158430208000000000 bits (multiplying by 6 billion).

34359738368 bits is roughly 10^10.

206158430208000000000 is roughly 10^20.

That's the kind of change (give or take) that I was expecting parent to suggest - not one order of magnitude!
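
For what it's worth, the footnote arithmetic above does check out, using the same rough rounding - a quick sketch:

    # Quick check of the numbers above, with the comment's own rounding.
    bits_4gib = 4 * 2**30 * 8             # 4 GiB in bits
    with_ecc = bits_4gib * 6_000_000_000  # one flip per ~6 billion times as many bits
    print(bits_4gib)   # 34359738368, roughly 10^10
    print(with_ecc)    # 206158430208000000000, roughly 10^20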


Clearly math isn't your strong suit.

An order of magnitude change is nothing minor. It would take the 32% chance and make it a 3.2% chance.

Maybe you don't know that ^ means to the power of?


Hey, please remember to be kind to everyone. Even if you disagree with someone, insulting them adds no value for them and adds no value for the other readers on Hacker News. Insults are the rare type of comment that adds value for only a single user. Even if someone is wrong, remember: we're all human. :)


I know, you're right. I was just a bit annoyed at his tone so I reacted instead of just replied.

I tried to soften it a bit by asking if perhaps he just didn't recognize the notation, but I guess it wasn't enough.


I did recognize it, ars. I think it's hilarious that if 100000000000000 (1e14) isn't enough reliability then 1000000000000000 (1e15) should do it.

As mentioned, if you look at, for example, the change from normal bit flips (from cosmic rays) on normal RAM versus that on ECC RAM, you see a difference of many orders of magnitude. Given that we're speaking about errors (which have a huge standard deviation anyway) it is quite normal to see orders of magnitude of difference. A factor of 10x is just not that big of a wish, if the error levels are causing you problems!

I also don't think that multiplying by the size of the drive is relevant to anything. It doesn't increase the chance you have an error in a specific application, etc.

What can you fill 4 TB with, that you wouldn't use software error correction on, but for which a single bit flip is unacceptable?

I can think of things like movies, raw sensor data, and other media. I don't see how multiplying by the size of the drive gives you an interesting result. You've just fallen for marketing :)


> should do it.

They actually sell 10^15 disks, so there is no excuse for 10^14 ones to exist anymore. They aren't even using more error correction to get it - they are binning drives and the better ones get sold for more.

This would be fine if the lower-spec'd ones were still reasonable, but for a 4TB drive 10^14 is NOT OK, and they should not be selling them anymore.


I just disagree with you and think that your perspective on the matter and use of the numbers is funny. Again you repeat "For a 4TB drive 10^14 is NOT OK"... but 10^15 is. Well, if you say so.

Note: you shouldn't downvote just because you disagree with someone's perspective. I'm not going to delete these comments because I stand by them. If 1 bit flip every 4 TB is a problem for you, then don't move to 1 bit flip every 40 TB - get some kind of ECC etc.


He's not saying that 10^15 is good enough for all purposes. He's saying that it's closer to good enough without significantly increasing the cost. This should not be that hard for you to grasp.


Your tone with me is completely unwarranted, wtallis. This is what he wrote:

"

Note: This article is from 2007 and is quite prescient.

It's completely shameful how bad the specified read error rates are now.

It's to the point that if you read an entire 4TB disk you have a 32% chance of one bit being wrong!

That means hard disks can no longer be considered reliable devices that return the data written to them; you now need a second layer in software checking checksums.

http://www.wolframalpha.com/input/?i=4TB+%2F+10^14+bit++in+%...

For extra money they sell hard disks with 10^15 reliability instead of 10^14 - this should be standard!

"

Even after his (and your) clarification, it makes me chuckle to read that :) He started, "It's completely shameful how bad the specified read error rates are now" and then ended "For extra money they sell hard disks with 10^15 reliability instead of 10^14 - this should be standard!"

This is hilarious. I guess some people here have no sense of humor though.

Error rates are not nearly as exact as that - more like marketing figures.


Does this mean that a 10^13 reliability has 320% chance of failure?

EDIT: in any case the correct calculation should be http://www.wolframalpha.com/input/?i=%281-%2810^-14%29%29^%2... which gives a 30% chance of failure, versus http://www.wolframalpha.com/input/?i=%281-%2810^-15%29%29^%2... for a 3.5% chance.


320% would mean you find 3 wrong bits each time you read the disk.

I was calculating the expected number of wrong bits per full read of the drive.

You calculated the chance of a perfect read which is similar but not the same.
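
To spell out the two quantities side by side (a quick sketch, still assuming the spec is a uniform per-bit rate, which is itself questionable):

    # Expected number of bad bits vs. probability of at least one bad bit,
    # for a full read of a 4TB drive, assuming a uniform per-bit error rate.
    import math

    bits = 4e12 * 8                    # 4TB in bits
    for rate in (1e13, 1e14, 1e15):
        p = 1 / rate
        expected = bits * p            # can exceed 1, e.g. 3.2 at 10^13
        at_least_one = -math.expm1(bits * math.log1p(-p))   # never exceeds 1
        print(f"1 in {rate:.0e}: expected bad bits {expected:.2f}, "
              f"P(>=1 bad bit) {at_least_one:.1%}")

At 10^13 the expected count is 3.2 bits, but the chance of at least one error is about 96%, not 320%.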


Ah, I see. Could you edit your post to reflect this? As it stands it is very easy to misinterpret.


You can only edit for a little over an hour, and it's been three.

I thought "of one bit being wrong" said it, but I can see how you misread it. Hopefully people will read the thread and see these replies as well.


But that's only a few years of difference with Moore's Law... which was the point of the article.



I have now spelled it out for you. "Reliability of 100000000000000 is unacceptable! I need exactly 1000000000000000 of reliability (a factor 10 difference only)." It's hilariously specific. If your error rate is riding up against your usage so that you notice it, you wouldn't expect to just want to cut it by a small factor like that.


> It's hilariously specific.

The reason for the number is very simple: It actually exists.


I think it's probably just marketing. They're just engineering reliability targets; a specific batch could have far different rates.


And clearly it isn't yours either.


I had the same thought logicallee :)


This is already well known in enterprise storage, and one of the solutions is erasure coding. Some products implementing this exist today and others are currently in the works. Basically you go to something like 20+5, instead of 4+1. Of course this only makes sense when you have 50+ drives in a tray.

http://www.networkcomputing.com/storage/what-comes-after-rai...


Practically, yes, erasure coding solutions do tend to have a lot of hard drives. That's probably because their efficiency really makes them much cheaper for people who are buying lots of storage; for a small setup the cost of the drives themselves isn't that significant. But mathematically it works with smaller numbers. You could have an 8 of 12 setup and be able to sustain the loss of 4 drives while only storing an extra 50%.
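
The arithmetic behind those k+m figures, for a few layouts (just the math, not any particular product):

    # Storage overhead and fault tolerance for some k+m erasure-coding layouts,
    # next to the classic RAID5-style 4+1. k data shards, m parity shards.
    for k, m in [(4, 1), (8, 4), (20, 5)]:
        print(f"{k}+{m}: survives any {m} drive failures, "
              f"{m / k:.0%} extra storage, {k / (k + m):.0%} usable capacity")
    # 8+4 is the "8 of 12" example: any 4 failures, 50% extra storage.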


There was an update in 2013:

http://www.zdnet.com/has-raid5-stopped-working-7000019939/

And based on a sample size of 1, it looks like desktop 4TB drives still have a URE rate of 1 in 10^14:

http://www.seagate.com/www-content/product-content/barracuda...

Then there's of course SSDs:

https://www.youtube.com/watch?v=eULFf6F5Ri8

Yeah, no. I don't think it's fair to compare RAID0 and RAID5 for durability -- but SSDs seem to have very low URE rates, and you can now (almost) reasonably get 1TB SSDs. Not sure how long they can realistically be expected to last, though. Long enough (2-3 years) for when you'll probably want to replace those 14 1TB SSDs in RAID6 with three 16TB SSDs in RAID1 with one hot spare?


You could get a bunch of 840 Evo drives, say, 7 of them, for around $3500, for a 6TB RAID5 that would probably get at least a gigabyte a second even with parity calculation overhead.

Though ZFS and Btrfs solve the unreadable bit problem with extent checksums, so you really don't need to worry about mechanical disks yet. You would need two UREs in the same block to kill the checksumming.


I lost data to this kind of problem[1]. The Linux dm-raid handles these kinds of failures extremely poorly (or did at the time), even when one follows all available tutorials. (When I reported my experience, one developer said I should have set a cron job to recursively md5sum my / every week or so - not exactly user-friendly, and not mentioned in any of the tutorials). When you attempt to rebuild a dmraid array, even a RAID6 one, expect to lose all your data.

Now I use ZFS (on FreeBSD), which handles these kinds of errors much more gracefully; if there's an isolated URE you might lose data in that particular file, but it won't destroy the whole array.

[1] Yeah yeah, RAID is not a backup. I'm talking about data I didn't consider worth the cost of backing up, as a poor student at the time.
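
For anyone curious, the "cron job that md5sums everything" suggestion amounts to something like the sketch below (paths are made up; a checksumming filesystem does this per block and on every read, which is the real fix):

    # Minimal bit-rot scrub sketch: hash every file under a tree and compare
    # against a previously saved manifest. Paths are assumptions; run from cron.
    # Note: it will also flag files that were legitimately modified.
    import hashlib, json, os, sys

    ROOT = "/data"                              # tree to scrub (assumption)
    MANIFEST = "/var/lib/scrub/manifest.json"   # saved hashes (assumption)

    def hash_file(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    current = {}
    for dirpath, _, files in os.walk(ROOT):
        for name in files:
            p = os.path.join(dirpath, name)
            try:
                current[p] = hash_file(p)
            except OSError as e:                # an unreadable file is itself news
                print(f"READ ERROR {p}: {e}", file=sys.stderr)

    if os.path.exists(MANIFEST):
        with open(MANIFEST) as f:
            previous = json.load(f)
        for p, digest in previous.items():
            if p in current and current[p] != digest:
                print(f"CHECKSUM MISMATCH {p}", file=sys.stderr)

    os.makedirs(os.path.dirname(MANIFEST), exist_ok=True)
    with open(MANIFEST, "w") as f:
        json.dump(current, f)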


> I should have set a cronjob to recursively md5sum my / every week or so

If you use Debian then install the debsums program which will do that for you for non-user data and report any errors.

You should also install mdadm and set it to check the array every month.

And finally install smartmontools and have it do a short self check every day and a long one (i.e. a full disk read) every week.
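
The monthly mdadm check ultimately boils down to writing "check" to the array's sync_action file in sysfs; a sketch of doing it by hand, assuming an array named md0 and root privileges:

    # Sketch: trigger a scrub ("check") of a Linux md array and report the
    # mismatch count afterwards. Assumes the array is md0; needs root.
    import time
    from pathlib import Path

    md = Path("/sys/block/md0/md")                     # assumption: array is md0
    (md / "sync_action").write_text("check\n")         # start a read/compare pass
    while (md / "sync_action").read_text().strip() != "idle":
        time.sleep(60)                                 # poll until it finishes
    print("mismatch_cnt:", (md / "mismatch_cnt").read_text().strip())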


Yeah, now I have nearly 120TB of data under ZoL (ZFS on Linux, currently on master as of the time of writing), replicated at a 5-minute interval between two datacenters...

Zero corruption, zero problems.


The main Linux raid impl (md) probably handles failures more robustly than dm-raid.


I don't use any storage level redundancy at all. I found that misconfiguration (my fault, on two occasions) made it more unreliable than just a single disk. I rely on cloud backup, and I'll take the hit if/when I need to rebuild my machines.


RAID is no replacement for backups in any case!

For most people it's better to make a nightly copy to another HD. RAID is for saving downtime on HD failures; it doesn't save you from an accidental rm -r, a word processor corrupting your thesis, a GPU driver crash corrupting your filesystem, your box getting owned, etc.


I think what he means is that downtime isn't a big deal, since recovering from backup is quick enough in his case. I never found RAID 5 particularly helpful because: #1 hard disks tend to fail at the same time, #2 systems fail to give warnings on bad disks, i.e. everything is working, no red flashing lights, things go boom, restart, just kidding, lol you got 2 bad disks, #3 redundant servers > redundant disks for uptime, #4 cloud.

But that's just me and my use cases.

Edit: when I use RAID, I use RAID 10 or RAID 0.


This is why Time Machine was a much better solution for Apple to implement than their attempt to build ZFS into OSX.


The storage efficiency gains from RAID5 are not worth the risks, and when you go to RAID6, you lose even more efficiency.

You're better off with RAID10 (only need to read one drive to replace, not all the drives). Better performance all round too.


I'm not sure I fully understand why RAID10 is better than RAID5. Reading the RAID wikipedia article, they seem to imply that a typical RAID10 setup uses n*2 disks, where each block of data is written to two drives.

But how does that help in the case of drive failure? If a drive fails, then as size increases won't the exposure to a URE also increase? Is it better than RAID5?


When a drive in a RAID10 array dies, then the data on that drive is still directly available on a drive that's not protected by redundancy, and the other half of the data is still protected. Rebuilding the array requires reading a single drive without error. Rebuilding a RAID5 requires reading n-1 drives and computing parity without error.
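
A back-of-the-envelope comparison under the thread's per-bit URE assumption (real arrays also fail for plenty of other reasons, so treat it as illustrative only):

    # Probability of finishing a rebuild without hitting a URE, assuming the
    # spec is an independent per-bit error rate. RAID10 re-reads one surviving
    # mirror; RAID5 re-reads all n-1 surviving drives.
    import math

    def p_clean_read(tb_read, per_bit_rate=1e14):
        bits = tb_read * 8e12
        return math.exp(bits * math.log1p(-1 / per_bit_rate))

    drive_tb, n = 4, 6                 # hypothetical: six 4TB drives
    print("RAID10 rebuild:", f"{p_clean_read(drive_tb):.1%}")             # ~73%
    print("RAID5  rebuild:", f"{p_clean_read(drive_tb * (n - 1)):.1%}")   # ~20%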


Indeed. Reed-Solomon coding (maybe only available in system software, not disk hardware) gives you efficiently balanced striping + redundancy.


In a 4-disk system, RAID6 can survive any 2 disks failing, but RAID10 can only survive 2 disks failing if you get lucky. Once you go past 4 disks, RAID10 becomes quite expensive and yet can still be killed by a 2-disk failure event.


The window for losing two disks is reduced though, because rebuilding a single drive failure is much faster in a RAID 10 system where it's just a drive copy, as opposed to n drive reads and a parity calculation.


Work out the probabilities. The results may surprise you. They did me. I was like you once.


I work at a hosting company, and we're using RAID10 exclusively.


Can't wait for btrfs. Benchmark from today:

https://docs.google.com/spreadsheets/d/1L5bVGU95D0Cu1gJoQhBh...


Is BTRFS really still at the "do not use in production" phase? I'm surprised it's still considered unstable. Seems like a case of the "Google betas".


Russell Coker's reports of his experiences with BTRFS give me the screaming heebie-jeebies, no matter how up-beat and positive he stays about it: http://etbe.coker.com.au/tag/btrfs/


What about ZFS?


ZFS has been around a bit longer than Btrfs, which the ZFS proponents claim has allowed for it to become more stable. Watching the commit logs of Illumos (the open-source derivative of Solaris) most commits seem to be related to adding features or reducing IO latency variance. Problems with lost data are few and far between.

As an appeal to authority, a number of companies currently trust[1] their data to ZFS, Joyent probably being the most well known of them. I store my personal data[2] on ZFS, though my needs are modest.

[1]: http://open-zfs.org/wiki/Companies [2]: http://www.awise.us/2013/03/10/smartos-home-server.html


And the list at [1] doesn't include rsync.net, whose business is providing reliable off-site disk storage.


Never used anything but RAID 1 on my Synology NAS. Had a drive fail there once and rebuilding the array was as simple as swapping it for a new one (took a few hours but it was truly "plug and play" and no data was lost).

Do the things mentioned on the article imply that a system like the Promise Pegasus2 R6 12TB (6 by 2TB) Thunderbolt 2 RAID System (http://store.apple.com/us/product/HE152VC/A/promise-pegasus2...) is actually not guaranteed to survive 1 HD crash when configured in RAID 5? I'm a bit confused now, would appreciate any help there...


So my understanding is that the problem is that the raid driver throws an error when it fails to read a bit, asserting that it can't read the whole array.

Does this mean that since it's the RAID driver itself claiming "He's dead, Jim", most filesystem-level protections will be ineffective, since the array itself is "dead"? Of course, if your filesystem leveraged multiple RAID arrays as independent disks, it would have a chance.

For software raid at least, could it not leverage some kind of Hamming or Reed-Solomon code so that it doesn't fail hard like this?


Shrink the RAID group size? 7 x 1 TB disks in RAID5 gives you 6 TB usable space. With 2 TB drives use a 4 disk RAID group. You're still protecting the same 6 TB (at a small loss of efficiency), but eventually cost catches up (if 4 x 2TB isn't immediately cheaper than 7 x 1TB it won't take long till it is). This more than covers the reliability decrease of the higher capacity drives.
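
Running the same crude per-bit URE numbers for those group sizes (10^14 spec; the point is how much surviving data a rebuild has to read back cleanly):

    # Clean-rebuild odds for different RAID5 groups, same crude per-bit model.
    import math

    def p_clean(tb_read, rate=1e14):
        return math.exp(tb_read * 8e12 * math.log1p(-1 / rate))

    print("7 x 1TB RAID5, rebuild reads 6TB: ", f"{p_clean(6):.0%}")   # ~62%
    print("7 x 2TB RAID5, rebuild reads 12TB:", f"{p_clean(12):.0%}")  # ~38%
    print("4 x 2TB RAID5, rebuild reads 6TB: ", f"{p_clean(6):.0%}")   # ~62%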


I am not an expert on the topic, but nobody seems to have mentioned RAID-Z. https://blogs.oracle.com/bonwick/entry/raid_z Can anyone comment on this?


This is just looking at the physical layer for consumer drives. Any SAN is going to be using something with fewer errors. Also, every year storage costs continue to drop. Double parity, or eventually triple parity, is cheaper than ever to implement.


When I put together my home 8-disk server back in 2009 it was pretty clear to me I needed to protect against 2 failures. Even if you just look at the 12 hour recovery times.

How much does this change if you are using modern SSDs?


I believe that SSDs fail differently, in some respects. If your SSDs "wears out", you may not be able to reprogram it (write to it), but you may still be able to read from it.

Edit: Apparently this is wrong. Read comments below


My experience (we use SSDs for our embedded solutions and sometimes have very frequent writes) is that when one goes, the whole partition is pretty much hosed. I'm not sure what exactly fails, but it seems like the file system index or journal goes and then the whole partition is mostly useless. We use a write filter to avoid writes to the OS partition (writes are redirected to memory), so it is mostly read-only, but it still can fail, and when it does the system is unrecoverable. Other partitions may still be seemingly okay, but we just replace the drive once a failure is spotted. Disk-based drives usually don't fail as catastrophically in my experience, but do have more errors.

Having said that, we have been using them for the past 5 years and have only had a dozen failures in that time over probably 500 deployments. And I think several of those were just service replacing the drive because they could not determine a root cause. Fortunately, we don't really have to try to recover anything from the drives, but it does cause downtime when they fail.


In your experience, when you had those 12 failures, did the drives die suddenly, or was it to be expected because of high usage (high wear)?


what is your gut feeling on failure rate vs standard disks?


I think it probably depends on the quality of the SSD firmware.


unrecoverable? as in, not read-only?


When SSDs fail, they can leave the entire disk unreadable. Spinning disks will usually only corrupt or deny access to a small part. This is very bad if you do not have backups. Beware! http://www.gundersen.net/solid-state-drives-the-best-of-driv...


I had an SSD fail on a mac and the symptoms were that blocks would become intermittently unreadable the longer the drive was running. So when I took it to the genius bar, their tests passed initially. The logs of course showed the read failures, which is what I had to use to convince them the SSD was failing.


Well, in theory it should be true.


I wonder if Amazon has countermeasures for this? Or should I expect to see this kind of (un)reliability replicated in a VPS / "cloud" environment?


For now we can get away with other tricks like ZFS's ditto blocks, but at some point the whole redundant storage system will need to be re-thought.


This article brought to you by the cloud. ZDNet..



