Seagate Barracuda Ramp Weakness (2016) (dq-int.co.uk)
113 points by doener on July 10, 2021 | 28 comments



I'm not surprised it's about the infamous ST3000DM001, one of the few models of hard drives to have its own Wikipedia page and HN discussion: https://news.ycombinator.com/item?id=27419072

The broken ramps clearly use significantly less material and are thus weaker. If you look carefully you can also see what looks like stress whitening (slightly lighter colour) forming on the bottom half of the cracked one, as well as on both halves of the one below it.


Maybe this could be the reason why these drives have had so many failures?

https://en.wikipedia.org/wiki/ST3000DM001


I can't find the original source, but if you Google around a bit, you'll find this quote from a German data recovery company called "Datenrettung".

"We must assume that this is an error in the design of the Seagate Grenada hard drive installed in the Time Capsule (ST3000DM001 / ST2000DM001 2014-2018). The parking ramp of this hard drive consists of two different materials. Sooner or later, the parking ramp will break on this hard drive model, installed in a rather poorly ventilated Time Capsule."


I'd feel pretty aggrieved if two disks in my four-disk raid failed simultaneously. Is there a standard practice for allowable failures in disk arrays that would have allowed for recovery after this?


This was using RAID 5, which in terms of reliability isn't great. It can recover from one disk failure, but because the recovery process is so intensive, it's likely another disk will fail during recovery - especially if they are all the same age. Other RAID setups can recover from two disk failures - but that's a trade-off with performance and usable space.

Mixing drives from different manufacturers may help, but really you shouldn't rely on RAID alone. For commercial use you could duplicate the data on multiple NAS systems, but at home you probably aren't going to do that. Simplest is to understand RAID is not a backup, and to store the data somewhere else as well :-)
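To put a rough number on the "another disk may fail during recovery" point above, here's a back-of-the-envelope Python sketch; the URE rate, disk size and array layout are assumptions taken from typical consumer-drive spec sheets, not measurements:

    # Odds of hitting an unrecoverable read error (URE) while rebuilding a
    # degraded RAID 5 array. All figures are assumed spec-sheet values.
    URE_RATE_PER_BIT = 1e-14   # common consumer-drive spec: 1 error per 10^14 bits read
    DISK_SIZE_TB = 3           # e.g. an ST3000DM001
    SURVIVING_DISKS = 3        # a 4-disk RAID 5 with one disk already dead

    bits_to_read = SURVIVING_DISKS * DISK_SIZE_TB * 1e12 * 8
    p_clean = (1 - URE_RATE_PER_BIT) ** bits_to_read
    print(f"Chance of at least one URE during rebuild: {1 - p_clean:.0%}")

With these assumptions that prints roughly 51%, which is why "RAID is not a backup" gets repeated so often.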


Back during the dot-com era we lost a production array to the infamous "DeathStar" drives.

The system reported a failure, so we scheduled the drive to be replaced and brought up the hot spare and started the parity resync process. A little while later there was another drive failure and we told the data center folks to tell the tech to hurry up. While the tech was headed to our cage, there was a third drive failure and the array was toast. We were able to restore from backup, but the data was a day old.

Lessons were: Mix drives from different production batches (we couldn't mix manufacturers because of the leasing contract). Have a backup that you can restore from. Parity resync operations while the array is in use will put more stress on the drives than production use alone will, and will kill any (remaining) weak drives.


When I was selling SANs, the conventional wisdom was RAID10 for production/performance stuff, and RAID6 for archival/"colder" storage. Plus an external or offsite backup, as you said.


I build quite a few RAID1 arrays for workstations and home servers. Normal, rebuild, and even degraded performance are all great. In theory RAID5 is more space-efficient, but for me a mirrored pair of 14TB drives is affordable and compact.


At home an offsite backup is going to be best - the initial backup, though, especially on a cable internet connection, is going to take weeks.
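For a sense of scale, a quick calculation (the 4 TB backup size and 10 Mbit/s upload speed are assumptions, pick your own numbers):

    # How long the initial offsite upload would take.
    data_tb = 4          # size of the initial backup
    upload_mbps = 10     # typical cable-internet upload speed
    seconds = data_tb * 1e12 * 8 / (upload_mbps * 1e6)
    print(f"{seconds / 86400:.0f} days")   # ~37 days with these numbers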


> [...] the initial backup, though, especially on a cable internet connection, is going to take weeks.

Initially, I drove my HDDs offsite... and I keep another offsite backup in my car.


In the car is rather clever, I hadn't thought of doing that.

I've been maintaining an offsite at a friend's house that I rotate out whenever I visit. The car would be a good extra. Encryption required, but I'm already doing that.


> In the car is rather clever,

I'd give them credit, but I forgot where I got that idea from ;) I don't even encrypt mine, though, because it has a higher chance of recovery that way.


Are you not worried about the temperature variations?


I thought RAID 5 was a no-go nowadays due to the high probability of rebuild causing a second failure?


depends on your needs for redundancy


Adding more disks to it. The thing is that disks used in RAIDs tend to be the same model and even the same batch, and are thus susceptible to the same weaknesses. Home users typically have neither the need for the capacity/speed nor the budget for enough disks in the RAID to tolerate two disks failing at the same time. As disks get larger and larger, it's really unlikely that all disks will stay defect-free over the expected life span of the RAID. If one disk in the array shows signs of failure, the others are probably going to have the exact same issue soon. Given the size of the disks, it can take days to rebuild the array after replacing a single failed disk. Making it worse, the rebuild itself puts more pressure on the remaining disks, which may already be close to failing, and can lead to a cascading failure of multiple disks, in which case total loss of data is unavoidable.
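A rough rebuild-time estimate, with the disk size and throughput as assumptions (real rebuilds are usually slower, because the array keeps serving normal I/O at the same time):

    # Best-case time to rebuild a single replaced disk.
    disk_tb = 10
    rebuild_mb_per_s = 100   # optimistic sustained sequential rate
    hours = disk_tb * 1e6 / rebuild_mb_per_s / 3600
    print(f"~{hours:.0f} hours ({hours / 24:.1f} days)")   # ~28 hours, ~1.2 days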

So even though my NAS supports RAID5/6, I still go with single-disk volumes, so that in the event of a disk failure I only have to replace the failed disk and, in the worst case, lose only the data on it. In fact, I don't lose much that way, because my data doesn't have to be in the same volume, and the disks are already fast enough that my home network is the bottleneck.


Recently my small home server had a disk failure. One of the BTRFS RAID0 drives started having a lot of device errors, and shortly after the second drive started having errors too - but BTRFS was able to just take the healthy bits from both disks and I lost nothing.

However, since I now had to replace 2 disks I decided an upgrade was in order; the disks had been there for 10 years and I'd wanted more than 1TB for a while. Turns out it's two times cheaper to build a 4TB array with 3x 2TB disks than a 4TB array with 2x 4TB disks.


Well, I would say you had bad sectors on your disks. If there weren't too many of them, most modern file systems can handle that, but there is still a chance of losing data. Nonetheless, it's a signal to start migrating the data.

What I meant by disk failures was real disk failures, where there is no way to get data off the disk unless you open it up and read the platters directly; in that case all your data would have been a total loss, as RAID0 offers no redundancy and tolerates no disk failure at all. Your data set seems small, and it wouldn't be that hard to migrate a 4TB array. Things would be completely different if you had to rebuild, or even worse migrate, a much bigger array, e.g. 4x10TB. Even if everything goes well, it takes a few days just to copy the data somewhere else, and if there are cascading failures, there is really not much you can do. I can promise you that once you've been through that nightmare, you will never want to do it again. Taking that cost/effort into account, whenever I need to build a personal NAS, I always go with 2 or 3 of the biggest-capacity NAS/enterprise HDDs available and leave at least 1 vacant bay, so that I can easily add more space without fiddling with the existing drives. Even if the NAS eventually becomes full, I can still simply replace the smallest drive with a bigger one and copy the data back.


Mix drives and manufacturers. It is much more likely to see clustered failures if you buy all your drives at once, from the same distributor, and they're the exact same model. Conventional wisdom says you shouldn't mix for either hardware or software RAID, and I wouldn't do it with hardware RAID - but why would you use that anyway?


RAID 5 gives up one hard drive's worth of space and can survive the loss of one drive. You can go with RAID 6, which will tolerate two drives dying, but it comes at the cost of giving up two drives' worth of space.
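In concrete numbers, for an example 4x 3TB array (the sizes are just illustrative):

    # Usable capacity trade-off for a hypothetical 4x 3TB array.
    n_disks, disk_tb = 4, 3
    raid5_usable = (n_disks - 1) * disk_tb   # survives 1 drive failure
    raid6_usable = (n_disks - 2) * disk_tb   # survives 2 drive failures
    print(f"RAID 5: {raid5_usable} TB usable, RAID 6: {raid6_usable} TB usable")
    # -> RAID 5: 9 TB usable, RAID 6: 6 TB usable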


As you add more drives, the odds of failure go up. It’s N drives all playing Russian roulette independently. With enough drives even raid 6 is not enough.

People use RAID 10 because single drive failures are cheap to replace (you copy one disk to another), you can survive some two-drive failures, and the logical arrangement is simpler - it’s just stripes where each stripe member has a duplicate.
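A toy model of the "N drives playing Russian roulette" effect: the probability of seeing at least two drive failures in a year, assuming independent drives and a flat 3% annualised failure rate (both assumptions; real rates vary a lot by model, as the ST3000DM001 shows):

    from math import comb

    def p_at_least(n_drives, k, afr):
        # Binomial tail: P(at least k of n_drives fail within the year).
        return sum(comb(n_drives, i) * afr**i * (1 - afr)**(n_drives - i)
                   for i in range(k, n_drives + 1))

    for n in (4, 8, 24):
        print(n, "drives:", f"{p_at_least(n, 2, 0.03):.1%}")
    # ~0.5% at 4 drives, ~2.2% at 8, ~16% at 24

So the more drives you add, the more you have to lean on monitoring and prompt replacement rather than parity alone.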


I wonder if it increases the probability of drives dying.

When you connect multiple similar devices together mechanically, their vibrations can sum catastrophically.

Ah, it’s mechanical resonance I’m thinking of.


Incredible little detail. I'm pretty sure I have a bunch of disks with similar issues that I couldn't diagnose because I'd never consider this a critical failure point.


Not many of us buy disks at enough scale to see patterns emerge from their failures.

I personally have maybe 5 disks: two I purchased and three scavenged from systems they outlived. Not really gonna learn which ones are good and bad with that kind of sample, am I?


There have been some notoriously bad batches of hard drives. Circa 2000 the market was flooded with Maxtor drives; it was hard to find anything else at retail stores.

I bought 8 of them and had 5 fail in 2 years. Fortunately the failures were never synchronized enough to kill a RAID.


I've had 2/8 Seagate IronWolf drives fail on me slightly outside of the first year I had them. I'm using them in a Synology NAS. I've never had such problems with the WD Reds in my last Synology.


An article from 2016 makes the FP, but no updates about the huge WD hack that happened like two weeks ago??

Sheesh


What more updates do you want? AFAIK the last development was that WD offered some sort of trade-in program or data recovery service to the affected users.



