If systems are designed with the expectation that hardware can and will fail often, higher-reliability drives aren't worth the cost as long as the cheaper drives are roughly comparable. Beyond the cost savings, a system that is decoupled from hardware reliability is also more robust.

For example, Google's MapReduce paper has a section on fault tolerance that goes into detail about how they handle failing workers:

http://research.google.com/archive/mapreduce.html
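To make that concrete, here is a toy sketch of the core trick (my own illustration in Python, not Google's implementation): the master pings workers and simply re-executes the tasks of any worker that stops responding, so a dead machine or drive is a routine event rather than an emergency.

    import random

    def ping(worker):
        # Hypothetical health check; simulate a 10% chance that any
        # given worker has died since the last check.
        return random.random() > 0.1

    def execute(task, worker):
        return f"{task} done on {worker}"

    def run_job(tasks, workers):
        pending = list(tasks)
        results = []
        while pending and workers:
            task = pending.pop(0)
            worker = random.choice(workers)
            if ping(worker):
                results.append(execute(task, worker))
            else:
                workers.remove(worker)  # presume the worker is dead
                pending.append(task)    # re-execute on another machine
        return results

    print(run_job([f"map-{i}" for i in range(8)], ["w1", "w2", "w3"]))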




Deltaqueue just posted the prices and there is a $5.00 difference between the two. So you pay 4% more for 2.3 percentage points fewer annual failures. That sounds close enough that paying more for less operational expense is worth it. Obviously, Backblaze gets a better deal, or they would be buying up the Hitachi drives instead.
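Back-of-envelope version, since you can argue it either way depending on what a failure actually costs to handle (prices inferred from the $5.00 difference and the 4% figure, so roughly $125 vs $130; the per-failure handling cost is a pure guess):

    cheap, hitachi = 125.00, 130.00        # $/drive, inferred from "$5" and "4%"
    afr_cheap, afr_hitachi = 0.032, 0.009  # AFRs quoted elsewhere in the thread
    cost_per_failure = 50.00               # assumed labor/RMA cost per failure, $

    # Expected first-year cost per drive:
    print(cheap + afr_cheap * cost_per_failure)      # 126.60
    print(hitachi + afr_hitachi * cost_per_failure)  # 130.45

    # Breakeven: the Hitachi premium only pays off if handling one
    # failure costs more than price_delta / afr_delta:
    print((hitachi - cheap) / (afr_cheap - afr_hitachi))  # ~217 dollars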


The AFR difference is 2.3 percentage points (0.9% vs 3.2%), but in ratio terms that means a single unit of the inferior brand is about 3.5 times more likely to die during a full year of use. I'd love to see the calculation that justifies buying non-Hitachi drives.


I think percentage points is the right metric here. Spending a lot of money to cut down the frequency of a rare occurrence doesn't make sense, even if you can cut it down by 100x.


"Rare" is the key here, thanks. An AFR of 3.2% is already a pretty damn long MTBF. Makes sense now!


I have experience with data stores in the hundreds of terabytes. My opinion of the Hitachi 1TB and 3TB Deskstars is very high. The problem is that they are not generally available - there could be months when you just could not order them.


Are you taking into consideration that when a drive fails, it requires work to replace it?

This could be minimal, and in budgetary terms it might be negligible - but I'm not sure.


Backblaze employee here -> Yes, this gets a SMALL amount of allowance. The datacenter team begs us to buy the Hitachi drives even at twice the price, but that would bankrupt us. But if the Hitachis are only $2 or $3 more expensive per drive (including the failure rate in that calculation), then we're willing to buy them for the reduced hassle.

I think the calculation is that replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, that's 600 drives, or about 150 hours of replacement work - in other words, one employee for one month of 8-hour days. Getting the failure rate down to 1 percent means you save two weeks of employee salary, maybe $5,000 total. The 30,000 drives cost you $4 million, so who cares about $5k here or there?
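Rough numbers behind that, for anyone checking (the 15 minutes per swap is my estimate):

    drives = 30_000
    minutes_per_swap = 15
    for afr in (0.02, 0.01):
        swaps = drives * afr
        hours = swaps * minutes_per_swap / 60
        print(f"AFR {afr:.0%}: {swaps:.0f} swaps, {hours:.0f} hours, "
              f"{hours / 8:.1f} eight-hour days")
    # AFR 2%: 600 swaps, 150 hours, 18.8 eight-hour days
    # AFR 1%: 300 swaps,  75 hours,  9.4 eight-hour days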


> I think the calculation is that replacing one drive takes about 15 minutes of work.

Is that really the true cost of replacement? I would think there is also the cost of dealing with the warranty, plus testing and monitoring. Here is a quote from the blog post about unsuitable drives:

> When one drive goes bad, it takes a lot of work to get the RAID back on-line if the whole RAID is made up of unreliable drives. It’s just not worth the trouble.

I don't have the time to think about this fully, but it seems similar to calculating the present value of a future cash flow, because there are other costs beyond the first replacement effort:

> Their average age shows 0.8 years, but since these are warranty replacements, we believe that they are refurbished drives that were returned by other customers and erased, so they already had some usage when we got them.

It sounds like the total cost of a failed drive is actually at least 1.5x, because 50% of the replacement drives also fail.
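If that 50% holds for every round of replacements (a big assumption), the 1.5x only counts the first round; the full chain is a geometric series:

    # Expected swap events per original failure, if each replacement
    # itself fails with probability p: 1 + p + p^2 + ... = 1 / (1 - p)
    p = 0.5
    print(1 / (1 - p))  # 2.0, so the long-run multiplier is closer to 2x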


You can probably cut down the time spent on warranty handling if you do it in bulk. They seem to have ~50 failing drives per month (30,000 drives at a 2% AFR is 600 failures a year).


Yup. We used this paper as one of the justifications for why we could continue to rely on commodity hardware.



