Hard Drive Data Sets (backblaze.com)
193 points by epistasis on Feb 4, 2015 | 58 comments



At Backblaze, we've done a lot of our own analyses on our hard drives to look at which ones are most reliable, whether temperature affects reliability, etc. However, we kept being asked for the raw data so people could run their own analyses.

This data set releases 500 million data points on 41,000 drives. I imagine you guys here at YCombinator/Hacker News are the ones who'd find this data amongst the most useful. Enjoy and let us know what you find!


I've asked before, but as you're now moving in this direction I'll try again. Do you have any plans to release this data more frequently, maybe once a month?


Yev from Backblaze here -> Not monthly, but we're hoping to do it quarterly though that may change. That's the plan for now though!


Your website uses a terrible HTTPS config.

https://www.ssllabs.com/ssltest/analyze.html?d=www.backblaze...

No PFS, weak RC4, TLS 1.0 only, and SHA1.

It's not much effort to fix these, really. It makes me question the security of your product.


> Which brings us to the most important principle on HN: civility. Since long before the web, the anonymity of online conversation has lured people into being much ruder than they'd dare to be in person. So the principle here is: don't say anything you wouldn't say face to face. This doesn't mean you can't disagree. But disagree without calling the other person names. If you're right, your argument will be more convincing without them.


We got a B! That's above average! Totally not "terrible". Truth is the website and the service itself are fairly different, but I can tell you that we have a lot of folks using, let's say... "older" browsers that access our site, and so we try to make sure that they can still access their accounts, though we're constantly monitoring for ways to make it better. A lot of folks that use us have older operating systems and that hampers us a bit.


Even though it stops folks with updated browsers from accessing your site?

http://i.imgur.com/3GCKrzr.png


Brian from Backblaze here. Which browser was that? We often test with Mac/Windows and Safari/Chrome/IE/Netscape/Opera, etc. It all seems to work here. Did you tweak any browser settings or are you running stock?


Unable to Connect Securely

Firefox cannot guarantee the safety of your data on www.backblaze.com because it uses SSLv3, a broken security protocol. Advanced info: ssl_error_no_cypher_overlap

Firefox 35, rc4 disabled

As far as I understand, you should not use RC4 (and especially not only RC4): https://en.wikipedia.org/wiki/RC4 Even CloudFlare is trying to get rid of RC4: https://blog.cloudflare.com/killing-rc4/

> Fast-forward to 2013 and attacks on RC4 have been demonstrated; that makes the preference for RC4 problematic.


Looks like they killed 3DES, leaving only RC4.


FYI, 3DES should still be enabled by default in browsers for a while, but I wouldn't be surprised if at some point they get rid of it too.


Firefox Nightly. Oddly enough, the page loads fine on my laptop with the same browser. I'll diff my configs and see if anything interesting shows up.


This is a pretty weak excuse, are you claiming adding modern crypto support would make old browsers not work on your site?


If you look at the ssllabs analysis, they already fail in the IE 6 handshake simulation. Offhand I can't say if support for older browsers is incompatible with more modern ciphers and protocols.


Brian from Backblaze here. Why do we get a bad grade for not talking with IE6? I'm honestly curious... I thought anybody who supported IE6 should be downgraded, because IE6 was end of lifed, inherently using it is a bad bad bad thing. It is literally safer to not use it than use it, right?

I don't think it's even possible to run IE6 on Windows Vista or Windows 7 or Windows 8, but I'm not entirely sure of that.


Yea, that is a false claim and not what OP means.


The Mozilla wiki page on TLS configuration has good information on ciphersuites and supported browsers: https://news.ycombinator.com/item?id=9000235

The intermediate suite (default) says: "Firefox 1, Chrome 1, IE 7, Opera 5 and Safari 1. " The "Old backward compatibility" has IE6, but it also has SSL3, which is really not recommended.


It should not be incompatible. Servers will just select 3DES or RC4 suites for older browsers that only have these listed.
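
A toy sketch of that negotiation in Python, assuming the server picks the first suite in its own preference order that the client also offers. The suite names follow OpenSSL's naming, but the lists are made up for illustration and `select_cipher` is a hypothetical helper, not a library call.

```python
from typing import Optional, Sequence

# Hypothetical server preference order, strongest first.
# Illustrative only; this is not any real site's configuration.
SERVER_PREFERENCE = [
    "ECDHE-RSA-AES128-GCM-SHA256",  # forward secrecy + AES-GCM
    "ECDHE-RSA-AES128-SHA",         # forward secrecy, older CBC mode
    "AES128-SHA",                   # no forward secrecy
    "DES-CBC3-SHA",                 # 3DES, kept only for legacy clients
]


def select_cipher(server_prefs: Sequence[str],
                  client_offers: Sequence[str]) -> Optional[str]:
    """Return the first suite in the server's order that the client also offers."""
    offered = set(client_offers)
    for suite in server_prefs:
        if suite in offered:
            return suite
    return None  # no overlap: the handshake fails


modern_client = ["ECDHE-RSA-AES128-GCM-SHA256", "AES128-SHA", "DES-CBC3-SHA"]
legacy_client = ["RC4-SHA", "DES-CBC3-SHA"]
rc4_disabled_client = ["ECDHE-RSA-AES128-GCM-SHA256", "AES128-SHA"]

print(select_cipher(SERVER_PREFERENCE, modern_client))
print(select_cipher(SERVER_PREFERENCE, legacy_client))
# A server that only offers RC4 has nothing in common with a client that disabled it.
print(select_cipher(["RC4-SHA", "RC4-MD5"], rc4_disabled_client))
```

The modern client lands on the AES-GCM suite, the legacy client falls back to 3DES, and the last call returns `None`, which is the "no cipher overlap" situation Firefox reported above.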


Instead of patting yourself on the back based on your interpretation of an arbitrary letter grade, can you please do some research into the problems flagged by ssllabs and fix them? Supporting older browsers does not preclude supporting modern ciphersuites and TLS versions for new browsers.


Brian from Backblaze here. We are aware of the ssllabs flagged issues and we're working to keep on top of them. Sometimes these issues apply to Backblaze and we are vulnerable for a short time (like when HeartBleed came out we had to scramble to plug it as fast as humanly possible). But most of the time when we are slightly slow it is because it doesn't REALLY affect our customers or us.

Keep checking back, I swear you will see improvement over the next week.


I don't think there are any acute issues like heartbleed (other than the possibility that RC4 is completely broken against state-level adversaries, or that your ssl key might be compromised now or in the future), but the downvoted-to-hell points made higher up are that you should have been aware enough of best practices to improve your ssl config a long time ago.

It doesn't inspire confidence that you're still working at enabling TLS 1.2 or some AES ciphersuites with PFS. ... and at de-prioritizing RC4 if you need to keep it enabled. (A competitor, crashplan, seems to think 3DES is enough to support older clients; they've disabled RC4 completely. Mozilla thinks so too, even in their "old backward compatibility" recommendations: https://wiki.mozilla.org/Security/Server_Side_TLS )


3DES is enough to support most older clients, at least on desktop. The problem is that it is the slowest cipher suite.


Your tact could be better. Here they come with their raw data that we asked for, and you reward them by whinging about their SSL cert.


Their website doesn't have a lot to protect, in all honesty. Their backup product uses different security mechanisms.

If it was a site that had critical user information, like email or document storage, then they could solve this by making their website open to all protocols and then test the browser type and redirect the user to a specific subdomain with the appropriate security. For instance, https://sitename.com could be backwards compatible. Upon login, it redirects the user to lowsecurity.sitename.com or highsecurity.sitename.com depending on their browser type.


The login password is the primary backup password isn't it? That's what would enable them to offer downloads of backed-up files. While they have a zero-knowledge option (encrypt using secret key only known to the client), it can't realistically be enabled by default in a backup service that's intended to be easy to use. Same with crashplan and every other service that isn't zero-knowledge-mode only.

The fact that the actual backups may (don't know) be transferred over a channel with higher security is irrelevant if the passphrase for decrypting the backup (or decrypting the decryption key, depending on how it works) is transmitted over SSL to the website in question.

You don't need lowsecurity.site.com and highsecurity.site.com. You can support any even marginally secure and up-to-date client with one website ssl configuration. The big websites do it just fine. The other problem is that clients that don't support TLS 1.2 and good ciphersuites are pretty much by definition no longer maintained very well.
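
For what a single configuration can look like, here is a minimal sketch using Python's standard `ssl` module. The cipher string and the TLS 1.2 floor are assumptions chosen for illustration, not Backblaze's or Mozilla's recommended settings, and a real server would also load a certificate and key.

```python
import ssl

# One server-side context: TLS 1.2 as the floor, forward-secret AES suites
# first, RC4 excluded. The cipher string is an illustrative assumption.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
ctx.set_ciphers("ECDHE+AESGCM:ECDHE+AES:!aNULL:!RC4")

# Make sure the server's preference order wins during negotiation
# (recent Python versions already set this by default).
ctx.options |= ssl.OP_CIPHER_SERVER_PREFERENCE

enabled = [c["name"] for c in ctx.get_ciphers()]
print(len(enabled), "suites enabled")
print("RC4 enabled:", any("RC4" in name for name in enabled))
```

Clients that cannot speak TLS 1.2 are shut out by this sketch; lowering the floor to keep them is a separate policy decision.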


If the password for your Backblaze backups is the account password to the website, then there are bigger issues than having RC4 enabled.


If you haven't already come across Backblaze's series of blog posts based on these data sets, they're worth reading. Invaluable in helping me make an HDD purchasing decision and full of well presented data.

https://www.backblaze.com/blog/best-hard-drive/


Way better than Tom's Hardware, et al.


Thanks for this!

I'd be interested in whether shucking drives from external enclosures has any noticeable effect on drive life. But the data doesn't seem to capture whether the drives were shucked or not?

Is that something Backblaze has investigated? Or is the need for drives such that it doesn't matter if shucking does cause shorter life?


The tracking on that wasn't easy to align with these data, but from what we've seen the shucked drives seemed to perform similarly. At this point, the percentage of shucked drives in the data set is fairly small.



I merged the two years of data into a single R data file for convenience:

http://pyrovski.github.io/backblaze_data/


Nitpicking: It seems you might have an issue with font loading - for me (Firefox on Linux), it reverts to font-weight 100, making the text (which is missing <p> tags, by the way) almost completely unreadable.

Fig A.: http://i.imgur.com/9rLQwQQ.png


Brian from Backblaze here.

I keep complaining to the visual designer about this, I can't figure out why this is so hard to fix. What's really strange is it often looks GREAT in some web browsers nobody would ever use (IE) but in Chrome on Windows the lower case "g" characters are almost unreadable and disappear.

If only somebody knew how to fix this?

How did you detect it was missing < p > tags? Is there a tool the designers could error check against to see this error?


> How did you detect it was missing < p > tags?

Just opened it up in the inspector in Firefox, no black magic here ;-)

I would definitely look into your font names - it tells me that "Lato-Hairline" is used as: "Lato" and "Lato Regular" is used as: "@font-face:Lato". So perhaps the issue is that Lato-Hairline is the font-weight = 100 and Firefox picks it over the other one, only finds the single font-weight and sticks with it?

Just a guess though, webfonts can be weird. For instance: For Chrome, it can depend on what version you use. Just today I ran into a font rendering bug similar to this one where Chromium versions 37&38 had their font-weights switched so that 300 ended up as 500 and 500 was also picked for "normal". So the bug report "all fonts are bold and it looks terrible" resulted in "CANTFIX old Chrome be weird", basically.


Cool, thanks! I have forwarded the info, it's actually been driving me slightly bonkers. The designer uses a Macintosh, but my primary development box with a 30" monitor happens to be BOTH Macintosh OR Windows (KVM switch) and when I see our blog in Windows it looks terrible.


Yev from Backblaze here ->

WE'RE WORKING ON IT BRIAN!


Brian from Backblaze here. :-)


You're the worstest.


Firefox on Ubuntu 14.10, same effect here.


Working fine in my Ubuntu 14.10 VM (Firefox 35). It's inheriting font-weight: 300 from <body> and rendering as expected.

The use of <br></br><br></br> instead of <p> does feel distinctly mid-90s though. Especially since <br> isn't even supposed to have an end tag. If you're doing XHTML it's <br />, but for HTML all you need is <br>.


@brianwski [Offtopic] Is Backblaze going to implement delta copy or something similar in the near future? The last time I checked it definitely didn't. This becomes a real issue if I'm working on bigger binary files, since Backblaze syncs the whole file again instead of only its difference... PS: Found this interesting comparison of backup services: https://en.wikipedia.org/wiki/Comparison_of_online_backup_se...


We do transmit "changes" to large files in 10 MByte chunks. In other words, if 1 byte changes in a 50 MByte file it SHOULD only transmit one single 10 MByte chunk.

The absolute worst case for Backblaze is if you insert a single byte at the start of a large file. This "shifts" the entire file along by the one byte, effectively changing every single 10 MByte chunk.

The BEST case is if you append a single byte to a large file, because the final chunk then is probably less than 10 MBytes.

I actually thought we would be working on that area quite a bit over the years but it kind of worked well enough. :-) Most people don't edit large files, with the exception of Outlook.pst files, we see those appear as bandwidth burners.
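
A toy sketch of those three cases, assuming chunks are compared by hash against the chunk at the same index in the previous version. That comparison rule, the use of SHA-256, and the tiny chunk size are assumptions for illustration, not Backblaze's actual client.

```python
import hashlib

CHUNK_SIZE = 10  # bytes, for illustration; the comment above describes 10 MByte chunks


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Hash each fixed-size chunk of data (the last chunk may be shorter)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_upload(old: bytes, new: bytes, chunk_size: int = CHUNK_SIZE) -> int:
    """Count chunks of `new` that differ from the chunk at the same index in `old`."""
    old_hashes = chunk_hashes(old, chunk_size)
    new_hashes = chunk_hashes(new, chunk_size)
    return sum(
        1
        for index, digest in enumerate(new_hashes)
        if index >= len(old_hashes) or old_hashes[index] != digest
    )


original = bytes(range(45))  # 45 distinct bytes: four full chunks plus a short one

edited = bytearray(original)
edited[25] = 255                  # change one byte in the middle
prepended = b"\x00" + original    # insert one byte at the start
appended = original + b"\x00"     # append one byte at the end

print("edit in place:", chunks_to_upload(original, bytes(edited)))
print("insert at start:", chunks_to_upload(original, prepended))
print("append at end:", chunks_to_upload(original, appended))
```

With fixed chunk boundaries, the in-place edit and the append each touch one chunk, while the one-byte insert at the start shifts every boundary and dirties all five.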


Thanks for your reply, brianwski. That seems fair enough.


as someone who made previous comments on backblaze data analyses posted on HN, i wanted to say thanks. this is fantastic, and i'm looking forward to digging into these data! and even though i share some of the same sentiments from other commenters, i'm sorry you've gotten so many bike shed remarks from other commenters.


Enjoy ;-)


This is very much appreciated:

1. As mentioned by others, there really is very little data on HD failure rates.

2. When you first published your blog on failure rates across HD brands/models and SMART attributes many, myself included, suggested it might be more illuminating as a predictive modelling exercise. This data allows others to do that now, which is great!
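
As a starting point before any predictive modelling, a per-model annualized failure rate can be computed straight from daily rows. This is a minimal sketch that assumes each row is one drive on one day with a `model` column and a 0/1 `failure` column, and that the rate is failures divided by drive-years; the sample rows and model names are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the assumed layout: one row per drive per day.
SAMPLE = """date,serial_number,model,failure
2013-04-10,A1,MODEL_X,0
2013-04-11,A1,MODEL_X,0
2013-04-12,A1,MODEL_X,1
2013-04-10,B7,MODEL_Y,0
2013-04-11,B7,MODEL_Y,0
"""


def annualized_failure_rates(rows):
    """Return failures per drive-year for each model."""
    drive_days = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        drive_days[row["model"]] += 1               # each row is one drive-day
        failures[row["model"]] += int(row["failure"])
    return {
        model: failures[model] / (drive_days[model] / 365.0)
        for model in drive_days
    }


rates = annualized_failure_rates(csv.DictReader(io.StringIO(SAMPLE)))
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.2f} failures per drive-year")
```

With only a handful of drive-days the numbers are meaningless, which is exactly why the size of the released data set matters.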


Releasing this data is a real service, thanks.

I'm unable to explain the plethora of comments nearby about peripheral issues. Weird.


No worries! Glad you like our release! We don't mind the peripheral stuff, we're over like -> http://i.imgur.com/MYHLwt7.gif


thanks for sharing this! A few quick questions:

- Do you guys do any precise prediction on if a particular drive will fail soon and replace it?

- I notice a lot of sparsity in some rows, that is different than a 0 in that field I assume? Does that mean anything else interesting?

- Also under the "inconsistent fields" section you say "drive manufacturers don't generally disclose what their specific numbers mean," can you give a hint as to one of the drive models that has a minimally sparse smart readout and has information available from the manufacturer on what those smart numbers signify?

I figure if anyone has collected the references on what the metadata means, and for which models it is available, it's you guys :)


I'll try to get the datacenter techs to answer tomorrow, but here is my best off the cuff attempt:

> predict if a drive will fail

We have some heuristics (high numbers of time outs and high remapped sectors), but in the end most failures are sudden and catastrophic. It is more like statistical tendencies, the most obvious one being drive age.
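
A rough sketch of what a heuristic like that could look like over one row of the data. The thresholds are invented for illustration, and the `smart_5_raw` (reallocated sectors) and `smart_188_raw` (command timeouts) column names are assumptions about the CSV layout; this is not Backblaze's actual rule.

```python
from typing import Mapping, Optional

# Hypothetical thresholds, chosen only for illustration.
MAX_REALLOCATED_SECTORS = 100
MAX_COMMAND_TIMEOUTS = 50


def parse_raw(value: Optional[str]) -> Optional[int]:
    """Return a SMART field as an int, or None when it is blank or absent."""
    if value is None or value.strip() == "":
        return None
    return int(value)


def looks_at_risk(row: Mapping[str, str]) -> bool:
    """Flag a drive whose remapped-sector or timeout count exceeds a threshold."""
    reallocated = parse_raw(row.get("smart_5_raw"))
    timeouts = parse_raw(row.get("smart_188_raw"))
    return (
        (reallocated is not None and reallocated > MAX_REALLOCATED_SECTORS)
        or (timeouts is not None and timeouts > MAX_COMMAND_TIMEOUTS)
    )


print(looks_at_risk({"smart_5_raw": "250", "smart_188_raw": "0"}))
print(looks_at_risk({"smart_5_raw": "3", "smart_188_raw": ""}))
```

Blank fields are treated as "not reported" rather than zero here, which is also an assumption.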

> sparsely in rows

Others have noticed, I have to ask the OTHER Brian (Brian Beach did the lion's share of this drive stat collection and presentation).

> it's you guys

Aww shucks. :-) But remember, we do this as a guide for ourselves, and we spend most days working on backup features and scaling, so we don't have a lot of extra time. That's why we're sending this data out there: some smart grad student or PhD in Xerox PARC can hopefully figure out some good stuff we missed! Besides, I don't think math and statistics are our strength, we just happen to be sitting on one of the world's larger stockpiles of spinning drives with access to the computers with scripts. :-)


Minor nitpick: using LZMA to compress large text files like this before distribution is normally better; here 7z is LZMA, zip is DEFLATE:

    739M    2013_data
    37M     2013_data.7z
    78M     2013_data.zip

Using [1] for reference, if the download speed is less than ~20MB/s, LZMA is faster than DEFLATE. Though, the data there is a bit less compressible than these csv files, so the break-even point for transfer rate would be higher here; even so, in my case, the download speed was much slower than 20MB/s.

1. http://richg42.blogspot.com/2015/01/parallelized-downloaddec...
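
A small sketch of the same comparison using Python's standard `zlib` and `lzma` modules. The CSV here is synthetic (random serials repeated daily, with a made-up column set), so the ratio will not match the real files, and `lzma.compress` writes an .xz container rather than a .7z archive. The `break_even_bandwidth` helper is hypothetical and just rearranges "download time plus decompress time" for the two formats.

```python
import lzma
import random
import zlib

rng = random.Random(0)  # fixed seed so the synthetic data is repeatable
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Made-up stand-in for the daily snapshots: the same drives reappear every
# day with mostly unchanged fields. Not the real data set or its full schema.
serials = ["".join(rng.choice(ALPHABET) for _ in range(20)) for _ in range(2000)]
lines = ["date,serial_number,model,failure,smart_9_raw"]
for day in range(1, 21):
    for index, serial in enumerate(serials):
        lines.append(f"2013-04-{day:02d},{serial},MODEL_X,0,{index + 24 * day}")
raw = "\n".join(lines).encode("ascii")

deflated = zlib.compress(raw, 9)  # DEFLATE, the algorithm zip normally uses
xz = lzma.compress(raw)           # LZMA2 in an .xz container

print(f"raw     {len(raw):,} bytes")
print(f"deflate {len(deflated):,} bytes")
print(f"lzma    {len(xz):,} bytes")


def break_even_bandwidth(big_size, small_size, fast_decompress_s, slow_decompress_s):
    """Bandwidth at which download + decompress time is equal for both formats.

    Below this rate the smaller (slower to decompress) file wins overall.
    Sizes and the result share a unit, e.g. MB and MB/s.
    """
    return (big_size - small_size) / (slow_decompress_s - fast_decompress_s)


# Sizes are the 78M zip and 37M 7z listed above; the decompression times are
# made-up placeholders, so the result is illustrative only.
print(break_even_bandwidth(78, 37, fast_decompress_s=1.0, slow_decompress_s=2.0), "MB/s")
```

LZMA's much larger dictionary lets it match each drive's row against the previous day's copy, which DEFLATE's 32 KB window cannot reach once a day's worth of rows is bigger than that.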


Disclaimer: I work at Backblaze.

We tend to favor ZIP as a company because it unpacks on all platforms (like Macintosh and Windows) with no additional tools. No additional technology other than what the manufacturer provides.

In this case, we could provide the raw data in several formats in case bandwidth is a problem, plus I think we should provide the SHA2 or md5 of the resulting package just in case you are wondering if you got the correct download or whether somebody has messed with the contents.
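
Checking a download against a published digest takes only a few lines. A minimal sketch with Python's `hashlib`, using SHA-256 as one member of the SHA-2 family; the temporary file stands in for the downloaded archive, and no real published digest is assumed here.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Stand-in for a downloaded archive; a real check would point at the zip file
# and compare against the digest published alongside it.
data = b"example archive contents\n" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    demo_path = tmp.name

computed = sha256_of_file(demo_path)
os.remove(demo_path)
print(computed)
```

A digest served from the same site mainly catches corrupted or truncated downloads; it only helps against tampering if it is obtained over a channel the attacker does not control.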


Windows has the 7zip program which handles lzma just fine - it's overall a better program than zip anyway.


He said provided by the manufacturer by default, not something the user has to install themselves.


Nice. Glad to see more info out there for people to see and use. Backblaze has done a great service for everyone including the hard disk industry.


I'm really looking forward to playing with these; thanks for releasing them, especially since there is so little failure data out there.


This is a great resource, thank you!



