> The AFR for 2020 dropped below 1% down to 0.93%. In 2019, it stood at 1.89%. That’s over a 50% drop year over year... In other words, whether a drive was old or new, or big or small, they performed well in our environment in 2020.
If every drive type, new and old, big and small, did better this year, maybe they changed something in their environment this year? Better cooling, different access patterns, etc.
If this change doesn't have an obvious root cause, I'd be interested in finding out what it is if I were Backblaze. It could be something they could optimize around even more.
I was wondering if the state of the world in 2020 might have dramatically changed their business / throughput / access patterns in a meaningful enough way to cause this dip.
I'm not sure if they have a measure of the disk utilization or read/write load along with the failure rate.
Disclaimer: I work at Backblaze, but mostly on the client that runs on desktops and laptops.
> I'm not sure if Backblaze has a measure of the disk utilization or read/write load along with the failure rate.
We publish the complete hard drive SMART stats for anybody to attempt these analyses. Most of us in Backblaze engineering get giddy with excitement when a new article comes out that looks at correlating SMART stats and failures. :-) For example, this article circulated widely at Backblaze a few days ago: https://datto.engineering/post/predicting-hard-drive-failure...
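For anyone who wants to poke at that data, here's a minimal sketch of the kind of AFR-by-model calculation these analyses usually start from. It assumes the column layout I remember from the public CSVs (one row per drive per day, with model and failure columns); double-check against the actual files.

```python
# Minimal sketch: annualized failure rate (AFR) per model from the public
# drive-stats CSVs. Column names are my recollection of the published schema.
import glob
import pandas as pd

# One CSV per day; each row is one drive on that day, failure == 1 on its last day.
frames = [pd.read_csv(path, usecols=["model", "failure"])
          for path in glob.glob("data_Q4_2020/*.csv")]
days = pd.concat(frames, ignore_index=True)

per_model = days.groupby("model").agg(
    drive_days=("failure", "size"),  # one row per drive per day
    failures=("failure", "sum"),
)
# AFR (%) = failures / drive-years * 100, where drive-years = drive-days / 365
per_model["afr_pct"] = per_model["failures"] / (per_model["drive_days"] / 365) * 100
print(per_model.sort_values("afr_pct"))
```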
Wow- crazy to see people at Backblaze actually read that article!!
Thank you guys for putting out so much data and writing so much about your findings- it was HUGE in helping me come to conclusions about what's realistic to assume from SMART stats. Y'all are doing some really really cool stuff.
I anticipate these reports every year and have strong trust in the data - I want to make that clear - Backblaze has done a massive service to the entire industry by collecting and aggregating this kind of data.
I'm really super curious about the dip in errors over the past year :)
Whether intentional or not, it's also great word-of-mouth advertising. My preexisting experience with Backblaze's hard drive stats reporting definitely worked positively in their favor when I was looking for a new backup service.
> Whether intentional or not, it's also great word-of-mouth advertising.
Oh, this has really worked out for Backblaze and we know it.
The first time we published our drive failure rates (I think January of 2014?) a few people said, "Uh oh, now Backblaze will get sued by the drive manufacturers." And we cringed and waited. :-) But the lawsuit never came, in fact there were NO repercussions, only increased visibility. People who have never heard of our company before find the data interesting, and then they ask "hey, what does this company do to own this many drives?" And a few (like yourself) sign up for the service.
Existing customers seem to stick with us for a long time, and even recommend us to other friends and family from time to time. So one tech person who stumbles across these stats might ACTUALLY bring us 3 or 4 more customers over the next 5 years. That's real money to us.
Not only have the drive manufacturers not sued us, they are actually NICE to us beyond the scale of our actual drive purchases! In one amusing example, our drive stats were used in a lawsuit as evidence. To be clear Backblaze was not the plaintiff or the defendant in the court case, we had no skin in the game at all and didn't want to be involved, but our drive data (and internal emails) were subpoenaed to be entered into evidence. Before we were served, the drive manufacturer called us and apologized for the inconvenience and made it clear they had no beef with us. Yes, a multi-national company that makes BILLIONS of dollars per year called a 40 person company (at the time) that could barely make payroll each month to apologize for the inconvenience. :-) We thought it was very considerate of them, and a little amusing. I'm proud to be the one that "signed" the papers indicating Backblaze had been "served".
> noticed that you are a fellow Beaver. Go Beavs! ;-)
Ha! This is a silly novelty "Beaver Baseball Cap" I was given during my internship in 1988 at Hewlett-Packard in Corvallis: https://i.imgur.com/G0rPHGP.jpg For 32 years it has sat on my computer monitor or on a shelf nearby. Nobody really asks about it anymore.
I got pretty lucky on timing at OSU. The year before I was there they taught beginning programming on punch cards on a mainframe computer called a "CDC Cyber". In my freshman year I took Pascal on the brand new 1984 Macintoshes in the Computer Science department, while the engineering students still learned Fortran on the mainframe.
In 1992 I got a job as a Software Engineer at Apple in Cupertino, and you can trace that straight back to my blind luck of starting my programming education at OSU on the Mac in the exact correct year. Well, I'd rather be lucky than good.
If they’ve hit on a different access pattern that is more gentle, that might be something useful for posterity and I hope they dig into that possibility.
There’s also just the possibility that failure rates are bimodal and so they’ve hit the valley of stability.
Are they tracking wall clock time or activity for their failure data?
> If they've hit on a different access pattern that is more gentle, that might be something useful for posterity and I hope they dig into that possibility.
Internally at Backblaze, we're WAY more likely to be spending time trying to figure out why drives (or something else, like power supplies or power strips or any other pain point) are failing at a higher rate than looking into why something is going well. I'm totally serious: if something is going "the same as always or getting better" it just isn't going to get much of any attention.
You have to understand that with these stats, we're just reporting on what happened in our datacenter - the outcome of our operations. We don't really have much time to do more research and there isn't much more info than what you have. And if we stumbled upon something useful we would most likely blog about it. :-)
So we read all of YOUR comments looking for the insightful gems. We're all in this together desperate for the same information.
Seems to me that every drive failure causes read/write amplification, so a small decrease in failure rates would compound. Have you folks done any other work to reduce write amplification this year?
The bottleneck for HDDs in this scenario is bandwidth. What you do is split & spread files as much as possible, so your HDDs are all serving the same amount of bandwidth. A disk doing nothing is wasted potential bandwidth (unless it's turned off).
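A minimal sketch of that idea, splitting a file into fixed-size shards and rotating them across drives so no single drive becomes the bandwidth bottleneck (illustration only, not Backblaze's actual placement logic):

```python
# Toy shard placement: spread a file's shards round-robin across drives so
# every drive serves a similar share of the read/write bandwidth.
SHARD_SIZE = 8 * 1024 * 1024  # 8 MiB per shard, arbitrary for the example

def place_shards(file_size, drives, start=0):
    """Return (shard_index, drive) pairs, rotating the starting drive per file."""
    shard_count = (file_size + SHARD_SIZE - 1) // SHARD_SIZE
    return [(i, drives[(start + i) % len(drives)]) for i in range(shard_count)]

drives = [f"disk{i:02d}" for i in range(20)]
for shard, drive in place_shards(100 * 1024 * 1024, drives, start=7):
    print(f"shard {shard:02d} -> {drive}")
```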
But do they actively move around files to spread bandwidth after the initial write? If they don't, and if I am right that older files tend to be rarely accessed, I would expect entire disks to become unaccessed over time.
If they allow that to happen, they are leaving a ton of money on the table. It's typical in the industry to move hot and cold files around to take advantage of the IOPS you already paid for. See, for example, pages 22-23 of http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke...
I’m just assuming that folks doing archival storage aren’t using these kinds of spinning disks, as it would be super expensive compared to other media, right?
I do think access patterns in general should contribute to the numbers so that kind of thing can be determined.
Compared to what, exactly? Tape is cheaper per GB, but the drives and libraries tip that over the other way. Blu-Ray discs are now more expensive per GB than hard drives, thanks to SMR and He offerings.
Also note that Backblaze does backups -- by definition, these are infrequently accessed, usually write-once-read-never. I've personally been a customer for three years and performed a restore exactly once.
Despite claims to the contrary, tape isn't dead just yet. Tapes are still considerably cheaper than drives: an LTO-8 tape (12TB uncompressed capacity) can be had for about $100, while a 12TB HDD goes for some $300. Tape drives/libraries are quite expensive though, but that just shifts the break-even point out. For the largest sites, it's still economical. Not sure if Backblaze is big enough (I'm sure they did their numbers). backglacier anyone?
And a number of the library vendors' libraries last for decades with only drive/tape swaps along the way. The SL8500 is into its second decade of sales, for example. Usually what kills them is the vendor deciding not to release firmware updates to support the newer drives. The stock half-inch cartridge form factor dates from 1984 with DLT & 3480, and there have been libraries with grippers capable of moving a wide assortment of DLT/LTO/TXX/etc cartridges at the same time, so it's doubtful that will change anytime in the future. So if you buy one of the big libraries today it will likely last another decade or two, maybe three. There aren't many pieces of IT technology you can utilize that long.
I bought a pair of 12TB drives for $199 the other day and they often go cheaper. Now, admittedly, if you shuck externals you lose the warranty, but we are keeping them in the enclosures as these are for backups, and thus the ease of taking them off-site is great for us.
Are you asking for service for the drive itself or for the drive within an enclosure?
Let's say you buy a Ford Explorer SUV. You remove the engine and use it somewhere else for a few years. If that engine breaks within the warranty period of the SUV, can you take just the engine to your local Ford dealer and ask them to fix it? Probably not.
Arguably, you can put the engine back into the SUV and take the entire car to a Ford dealer for repair. But would that constitute fraud on your part?
I was specifically thinking of the SKUs - I assumed they were using faster disks rather than high volume disks that make trade-offs for costs. Just assumptions on my part - and I am mostly curious for more data, but given the historical trends, I'm not terribly suspicious of the actual results here.
Drive enclosures, RAID/etc interfaces, and motherboards burning electricity make it a lot more complex than raw HDDs vs raw tape. Tape libraries cost a fortune, but so do the 10+ racks of cases + power supplies + servers needed to maintain the disks of equal capacity.
Tape suffers from "enterprise", which means the major vendors price it so that it's just a bit cheaper than disk, and they lower their prices to keep that equation balanced, because fundamentally coated mylar/etc wrapped around a spindle in an injection molded case is super cheap.
That seems like a misleading aggregation. Their total AFR could have been affected just by a mix shift from early-death to mid-life drives. It looks that way to me from their tables.
> If every drive type, new and old, big and small, did better this year, maybe they changed something in their environment this year?
It can also be the case that newer drives this year are better than newer drives last year, while older drives are over a "hill" in the failure statistics, e.g. it could be the case that there are more 1st-year failures than 2nd-year failures (for a fixed number of drives starting the year).
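A toy illustration of that point: the fleet-wide AFR can drop even when the per-age failure rates don't change at all, purely because the age mix shifts. Every number below is made up.

```python
# Hypothetical per-age AFRs; the fleet AFR is just the drive-weighted average.
age_afr = {"year1": 0.02, "year2": 0.01, "year3+": 0.015}

def fleet_afr(drive_counts):
    total = sum(drive_counts.values())
    return sum(age_afr[age] * n for age, n in drive_counts.items()) / total

fleet_2019 = {"year1": 60_000, "year2": 30_000, "year3+": 30_000}  # young-heavy fleet
fleet_2020 = {"year1": 30_000, "year2": 60_000, "year3+": 60_000}  # same drives, a year older
print(f"2019 fleet AFR: {fleet_afr(fleet_2019):.2%}")  # ~1.62%
print(f"2020 fleet AFR: {fleet_afr(fleet_2020):.2%}")  # ~1.40%, with identical per-age rates
```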
What about air quality? There are actually air filters, in the form of a patch or packet similar to a miniature mustard packet, through which drives breathe. Supposedly those are super fine filters, but toxic gas molecules might still pass through them.
I guess these Hard Drive Stats posts cover disks used for their B2 service as well? Maybe the service mix is changing (a larger percentage being used for B2 versus their traditional backup service).
I'm not sure how a more B2-heavy access pattern would improve the stats, though.
> I guess these Hard Drive Stats post cover disks used for their B2 service as well?
Yes. The storage layer is storing both Backblaze Personal Backup files and B2 files. It's COMPLETELY interleaved, every other file might be one or the other. Same storage. And we are reporting the failure rates of drives in that storage layer.
We THINK (but don't know for certain) that the access patterns are relatively similar. For example, many of the 3rd party integrations that store files in B2 are backup programs and those will definitely have similar access patterns. However, B2 is used in some profoundly different applications, like the origin store for a Cloudflare fronted website. So that implies more "reads" than the average backup, and that could be changing the profile over time as that part of our business grows.
Perhaps the margin of error should be raised to accommodate this change of about 1%, although the set of drives under test is likely not the same between years.
I have always loved these posts, they paint a picture of smart people with good processes in place. It's confidence-building. Unfortunately we were evaluating Backblaze around the time they went down for a weekend and didn't even update their status page, which was a bit of a blow to that confidence.
Their marketing blog does a great job of painting them as smart people with good processes. Sadly, I learned the hard way (by trialling their software and immediately discovering a handful of dumb bugs that should’ve been caught by QA, plus serious security problems[0], and some OSS licence violations[1]) that it seems to not actually be the case. This situation where they continue to pump out blog posts about hard drive stats, yet don’t even have a status page for reporting outages, is another example of their marketing-driven approach to development.
I have mentioned this on HN a couple times now[2][3], including again just yesterday. I really dislike doing this because I feel like I am piling on—but as much as I hate it, I feel even more strongly that people deserve to be informed about these serious ongoing failures at Backblaze so that they can make more informed choices about what storage/backup provider to use. I also genuinely hope that I can incentivise them to actually start following software development best practices, since they do provide a valuable service and I’d like them to succeed. If they keep doing what they’re doing now, I absolutely expect to see a massive data breach and/or data loss event at some point, since they are clearly unwilling or unable to properly handle security or user privacy today—and I’ve gotten the impression over time that some prominent people at the company think these criticisms are invalid and they need not make any substantive changes to their processes.
Yev from Backblaze here -> rest assured that we do read what you're writing on these posts and they've spurred some internal process discussions. I believe the bugs you mentioned were cleared/fixed with version 7.0.0.439 which was released in Q1 of 2020. We did leave HackerOne and switched over to BugCrowd to handle our bug program. It's private at the moment, but easy enough to get invited (by emailing bounty@backblaze.com). While we spin that program up (it's a new vendor for us) we may stay private, but hopefully that's not a permanent state.
Edit -> I just noticed the Daniel Stenberg libcurl citation. Oof, yeah, that's certainly a whiff on our end. Luckily though we were able to make up for it (he has a write-up here: https://daniel.haxx.se/blog/2020/01/14/backblazed/).
> rest assured that we do read what you're writing on these posts and they've spurred some internal process discussions.
OK, that’s good to hear. Nobody from the company has reached out to me so this is the first time I’ve been made aware. The only public replies I’ve seen up until now seemed to focus exclusively on just the public bug bounty part, which is really the least important part of this whole thing.
> I believe the bugs you mentioned were cleared/fixed with version 7.0.0.439 which was released in Q1 of 2020.
It’s really critical to be transparent about this stuff and tell your users. You published release announcements for subsequent versions and there were no mentions of security issues being fixed. When you don’t do this, it looks like you’re intentionally trying to hide vulnerabilities from the public. This is not how any company should act, especially not one that promotes how radically transparent it is[0].
> I just noticed the Daniel Stenberg libcurl citation. Oof, yea that certainly a whiff on our end. Luckily though we were able to make up for it […]
I reported license violations in a ticket and nobody replied.
You did fix the libcurl violation, which is great, but it took a letter from the author, which is less great.
You are still violating the OpenSSL license.
It honestly baffles me that nobody at Backblaze thought to check the licenses of the other OSS libraries that you’re distributing after receiving a notice that one of them was being violated. It’s not like there’s a huge compliance burden or complicated dependency tree to evaluate—as far as I can tell, it’s zlib (which requires no acknowledgement), libcurl (which does), and OpenSSL (which also does, for the version you are using[1]). This would’ve taken like 30 seconds.
> as much as I hate it I'm following Backblaze around and posting incorrect information about them
I get the impression Backblaze did something to upset you. Can you let me know what it is so I can try to fix it?
If there wasn't a pandemic on I would invite you to come to our office and I could buy you lunch and I could try to make up for whatever we did to upset you.
I'm not the person you responded to and I'm a happy user of both Backblaze and B2, but I wanted you to know that this response by you reads as quite disingenuous. You seem to want to shift the reason for his disgruntlement with Backblaze from all the reasons he already mentioned to some other, imaginary slight that you indicate you'll do your very best to fix. How about just reading his very real gripes and responding to those?
Let's take his twitter thread for some highlights, these seem like very real reasons to get upset, maybe you "can try to fix" those?
* Backblaze changed their client to add an allowlist some time after my report, while also intentionally breaking their TLS code so it would accept INVALID TLS certificates. Thereafter, the local code execution vuln became a full blown RCE vuln.
* When I submitted my report 11 months ago, they told me they already knew about the problem, downplayed its severity, dodged follow-up questions, didn’t seem to understand how CVE IDs work and refused to issue one after being asked four times. It was not confidence-inspiring. The CVE ID for the vulnerability I gave to them is CVE-2019-19904. They should’ve announced it, but they never did. Actually, they never seem to voluntarily disclose any security bugs… there are a lot of verified, closed, undisclosed bugs on their HackerOne account.
* This is all in stark contrast to their security page (https://backblaze.com/security.html) which makes many claims about best practices, and their blog & social media which present a sense of radical openness. I used to like their blog, but it all feels so gross and dishonest to me now.
* Backblaze misled users about PEK. The decryption key is sent to their server, and so is your password. The only way to restore data is to decrypt it on their servers first. It is not a zero-knowledge system. PEK data is not ‘inaccessible’ to them. They don’t care.
At face value (and I haven't done any digging of my own) these all seem like valid reasons to distrust Backblaze. Not necessarily because they happened, but because of the way Backblaze has addressed them (read: apparently not)
> this response by you reads as quite disingenuous
It wasn't intended as such, I really meant it. I'd like to get to the bottom of this, understand what this person's true issue with Backblaze is.
> these seem like very real reasons to get upset, maybe you "can try to fix" those?
What the user is doing is called a "Gish gallop". This is a technique where somebody makes a rapid-fire list of unrelated half-truths or misrepresentations, each of which takes CONSIDERABLY longer to address than to claim. And I've repeatedly explained why they are invalid, but the user just shows up a day or two later and makes the same exact list of complaints. No edits, no admitting that even one of the complaints is invalid. Gish gallop.
This is not the behavior of somebody that is genuinely interested in having Backblaze address or fix that list of issues. There is something else going on, and I personally would like to know what it is. First of all because I'm curious what the issue is, second of all I hope I can fix whatever the real issue is.
I'm not going through the whole list because I've done that maybe 10 - 15 times so far? But let's take this one, because it's spectacularly false, this person KNOWS it's false, but this person repeatedly makes the claim over and over again:
> Backblaze misled users about PEK. The decryption key is sent to their server, and so is your password. It is not a zero-knowledge system. They don’t care.
Backblaze has 4 security levels, one of which is zero-knowledge, and we ENCOURAGE customers to pick the correct level for themselves. You can read my longer, in-depth answer to this same user just 2 days ago here: https://news.ycombinator.com/item?id=25904473 or you can read my longer, in depth answer 18 days ago here: https://www.reddit.com/r/backblaze/comments/kroqhn/private_e... or you can read my answer TWO YEARS AGO in the link this person supplied you (!!!!) or you can go back to the beginning, 13 years ago, when Backblaze started, where we explained EXACTLY how our encryption worked the same as the Microsoft Encrypted File System ("EFS") here: https://www.backblaze.com/blog/how-to-make-strong-encryption...
Now, despite it being a spectacularly false accusation that has been documented and explained so many times in so many forums, this user will undoubtedly show up in another couple days and make this claim again. All the user's claims are like this. Obviously something else is going on.
I just wish that user would tell me what the real issue is. I can't fix what I don't know about.
> I just wish that user would tell me what the real issue is. […] I hope I can fix whatever the real issue is.
It is exactly what the parent already told you. That’s it. That’s “the real issue”. There is nothing more. Everything I’ve said already is, in fact, what the issue is. Please stop trying to read between the lines.
That you refuse to accept mine and others’ arguments about PEK and ZKE and SSDs is one thing. It’s an entirely different and more alarming thing when you refuse to accept that these issues are the issues and insist on continuing to spin a story about how I must be really angry about something different when other people are telling you it’s not so. I also can’t even imagine how it would’ve seemed like a good idea to fabricate a quote and attribute it to me in the way you just did. You did this on some of your earlier posts, too.
As for me, I don’t use eristic techniques, I don’t tell intentional falsehoods, and I don’t do things as you imply in saying I’ll “show up in another couple days and make this claim again”. Anyone is free to look to my comment history and see that I’ve only made a handful of comments here about Backblaze[0], and in all cases, I try to make sure my comments are fair and well-researched and backed with citations whenever possible.
I understand how this company is like your baby and so it may feel emotionally like I’m trying to kill your baby with criticism, but please understand that that is not my goal. My goal is, and always has been, to keep users secure. If that means I can help a receptive vendor improve their software, excellent. If that means I have to warn users to stay away from a vendor who behaves poorly, that sucks, but I still feel an obligation to do that too. If some of the negative publicity gets a vendor to start doing the right thing, good. That’s the whole reason I have to talk about these things. It certainly doesn’t do me any good otherwise.
> you refuse to accept mine and others’ arguments about PEK
Not at all! I agree with you COMPLETELY. You want a zero knowledge backup product, Backblaze offers that and makes a large amount of money from it (millions of dollars annually actually), and we think you (csnover) should use that product because that's what you want. I understand the arguments, and you are COMPLETELY correct and I accept your arguments - that's why we built it and sell it, because you are correct.
One of our OTHER product offerings (in addition to the zero knowledge backup offering) is to host public websites. Public websites just can't have zero knowledge. We offer both products SEPARATELY, I've explained this over and over again, but you'll just post in two more days saying "Backblaze thinks zero knowledge is bad" (spectacularly false). Public websites (by definition) cannot be zero knowledge, do you comprehend this?
> I don’t tell intentional falsehoods
You keep saying we don't have a zero knowledge backup system (wow, spectacularly false) and say we think zero knowledge is bad (again, spectacularly false since I've stated repeatedly that zero knowledge systems are THE BEST security). Then you say you don't tell intentional falsehoods. Come on, it's obvious something is going on.
> I’m trying to kill your company with criticism
Is that it? Why do you care? Did we do something to you? I have a hard time believing you put in all this effort because you flipped a coin and decided you would try to kill Backblaze. Why us? I can't fix what I don't know about.
I look forward to your 200 future posts about how Backblaze doesn't offer zero knowledge backups.
> I also genuinely hope that I can incentivise them to actually start following software development best practices, since they do provide a valuable service and I’d like them to succeed.
I read thru your HN posts and twitter and I completely understand where you're coming from. Very good points.
But here's my response to you: You're paying them $6 per month. That's not enough money for them to do better than what they're doing now. They literally can't afford to do better, given their paltry income stream. (Just my guess, I have no insider info about them).
> If they keep doing what they’re doing now, I absolutely expect to see a massive data breach and/or data loss event at some point
I guess that's a chance they're prepared to take, though they probably haven't thought about it in such stark terms. I'm sure their development practices will improve, maybe in time to prevent a massive breach or loss.
Disclaimer: I work at Backblaze so I'm biased and you should keep me honest.
> they probably haven't thought about it in such stark terms
Backblaze takes security incredibly seriously, and I assure you we have thought about it in such stark terms.
Expounding on that: Backblaze never raised any significant VC funding, we survive entirely on sales of our products, and most of that is keeping our customer's data utterly private, confidential, and safe. As this is the business we are in, our reputation is INCREDIBLY important to us. If we have a major breach we'll most likely lose our customer's trust, lose all our customers, and go out of business.
Since this is all I've done for the last 14 years, and my income and life savings are all wrapped up in Backblaze, I take this issue as seriously as a heart attack. So do my business partners (Backblaze was founded by 5 equal partners). We're also up to around 200 employees who would all suffer greatly if we lose the trust of our customers.
Internally, Backblaze has a "Security Council" of software engineers and technical operations people with something like a combined 150 years of security experience and obsession. One of these council members got his CS degree from MIT and is both one of the smartest people I've ever worked with (we've worked together at 3 separate companies so far over 25 years) and also deeply paranoid and stressed out all the time. Another of the security council members has a PhD in computer science. And so on... They watch over all the design proposals, the APIs, the technical infrastructure, everything at Backblaze. They propose and implement new procedures, new programs like our BugCrowd program where we have external white hat hackers constantly trying to break in. We are also going through an internal security audit right now paying consultants to get yet another perspective.
In addition to the Security Council, all software engineers and all technical operations people are expected to worry about security all of the time. It is quite possibly our most talked about, most worried about, most important thing we do.
> I'm sure their development practices will improve
We always, ALWAYS strive to do better and do more.
Yev from Backblaze here -> sorry about that Thom - it's one of the things we're definitely working on hammering down. Right now we're growing quite a bit on our engineer and infrastructure teams and one of the projects we'd like to see is more automated status updates. We typically will throw updates onto Twitter or our blog if it's a large outage - or affecting many different people, but totally recognize that process can use some sprucing.
> Over the last year or so, we moved from using hard drives to SSDs as boot drives. We have a little over 1,200 SSDs acting as boot drives today. We are validating the SMART and failure data we are collecting on these SSD boot drives. We’ll keep you posted if we have anything worth publishing.
Would love to see SSD stats like this in the future. Recently was talking to some friends about what SSD to buy. I personally really like my HP EX950 - one friend said he'd never buy HP hardware. He said he was getting an Intel - I said I had an early Intel SSD fail on me, and I don't think QLC is the best option, but it is a nice value play. For performance, I do like Samsung, though they are expensive. Another friend said he'd never buy a Samsung SSD, as he had a reliability issue, and found lots of similar stories when he was researching it.
Of course these are all anecdotes and they aren't useful in making an informed choice. I suspect most SSDs are reliable "enough" for most consumer use, and not nearly reliable enough for certain kinds of critical storage needs. But it would still be nice to see the big picture, and be able to factor that into your SSD purchase decisions.
Apologies for being slightly off-topic, but presenting a table of text as an image is annoying to me. A table of text ought to be rendered in just plain old HTML, in my old-school opinion.
Andy from Backblaze here. Actually you can download a spreadsheet with all the data from the tables. There's a link at the end of the post. Better than parsing HTML for the data.
The images in the post are just screenshots of those spreadsheets. The images aren't responsive, aren't available to assistive technologies, provide no text alternative, do not respect a user's text settings, cannot be translated, etc. Why is that better than HTML?
I was reading the Q3 2020 stats yesterday because I'm looking for a new drive.
It was somewhat annoying to have to type the HDD-model into the Google search bar instead of just double-clicking and selecting search from the context menu. It irritated me that it was an image.
The stats look stable and consistent with common knowledge -- HGST > WD >> Seagate, Toshiba inconclusive.
Does anyone have anecdata on Toshiba MD/MG/MN 7K2 drives (excluding DT because those are HGST OEMs)? They are always less price competitive and thus always light on real-world stories, though they seem comparably reliable to HGST.
Following Backblaze's report on them a few years ago ("they appear to be great, but we buy in bulk and there is not enough volume of them there for us"), I decided to use Toshiba drives almost exclusively, as a sort of fun experiment.
Fewer than 400 drives deployed, used exclusively in NAS (RAID 1, RAID 10, and RAID 6) at companies with fewer than 50 employees. They appear to be insanely reliable and high performers, to the point that the +15% price premium for them in French stores seems highly justified to me.
Unsure about the >> Seagate in there; that info is over a decade old now(?). It's worth pointing out they have a 12% failure rate on their highest density units, but the other ones seem to do well outside of a DC environment.
It's still a separate factory making separate drives. This line even uses a different storage controller. But this is also true for luxury ranges, in general, so you may be asking for too fine of a distinction. (Their usual luxury range is the WD Red, however.)
LTO tape (specifically that which is rated for 15-30 years of archival storage) with the drive. The tape is usually rated for a couple hundred full passes, which should more than meet your needs if you're writing once and sticking them somewhere safe.
SSDs don't have this archival longevity yet, and hard drives are better when powered up and the data is always hot for scrubbing and migrating when indicators of drive failure present.
SSDs are not as bad as they used to be, but they're still not rated for long-term unpowered storage. An HDD would be better for that.
But HDD isn't your only other option. How important is the data, How often will you need to access it, and will you need to rewrite to the storage medium? You might want to consider Blu Ray. Or both, stored in different locations. Also look into LTO tape drives. LTO 6 drives should be cheaper than 7/8 (though still not cheap) and have a capacity around 6TB.
> Also look into LTO tape drives. LTO 6 drives should be cheaper than 7/8 (though still not cheap) and have a capacity around 6TB.
AFAIK a post on /r/datahoarders says that the breakeven point for tapes vs shucked hard drives from a pure storage perspective is around 50TB. Given the hassle associated with dealing with tapes, it's probably only really worth it if you have 100+TB of data to store.
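The arithmetic behind that rule of thumb is roughly the following. All prices here are my own assumptions (a used older-generation tape drive, plus per-TB figures in the ballpark of the cartridge and shucked-drive prices mentioned in this thread), so plug in current numbers before deciding.

```python
# Rough break-even sketch for tape vs. shucked disks. All inputs are assumptions.
USED_TAPE_DRIVE = 400   # assumed price for a used LTO-6 drive on the secondary market
TAPE_PER_TB = 11        # assumed ~$27 LTO-6 cartridge / 2.5 TB native
HDD_PER_TB = 17         # assumed ~$200 shucked 12 TB external

def break_even_tb(drive_cost, tape_per_tb, hdd_per_tb):
    """Capacity at which (tape drive + cartridges) undercuts buying disks."""
    return drive_cost / (hdd_per_tb - tape_per_tb)

print(f"break-even ~ {break_even_tb(USED_TAPE_DRIVE, TAPE_PER_TB, HDD_PER_TB):.0f} TB")
# ~67 TB with these inputs -- the same ballpark as the ~50 TB rule of thumb.
```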
What do you think the availability of LTO 6 drives will be in 10 years? The major benefit of SATA, and even Bluray, is the interface and drive will likely still exist in 10 years.
I'm still able to interface with an LTO 1 tape drive. It's all SCSI or SAS. Secondary markets like Ebay have made this surprisingly affordable (used drive, unopened older media).
LTO is nice in that they mandate backwards compatibility by two revisions, which come out once every 3 years or so. So that gives you time to roll forward to new media onto a new drive without breaking the bank, and giving time for the secondary market to settle.
Adding: This was a deliberate decision by the LTO Consortium; they wanted users to perceive LTO as the safest option for data retention standards.
LTO 6 is like 10 years old, so the availability in 10 years will probably be limited. That being said, LTO 7 drives are able to read LTO 6 so that might increase your chances.
I can vouch for the 50TB figure, it’s around there.
The amount of hassle depends on your workflow. If you create a backup every day and then bring the media off-site, tape is easier. Easy enough to put a tape in your drive, make the backup, and eject. Tape is not sensitive to shock and you can just chuck the tapes in your car or shove them in your backpack.
Depends on your archival needs. Consensus seems to be that tapes have a longer unpowered shelf life. In terms of speed it really is cold storage though. You can't just bring the tape over to a user's system and copy a file. And seek times for retrieval of arbitrary files are very slow compared to HDD.
If you really need it to last and re-writability isn't an issue, M Disc claims 1,000 years.
Why not b2 or glacier since you're encrypting anyway? If you don't have that much data then maybe M-DISC?
Personally I think a safe is... unnecessary. What is it protecting you from when your data is encrypted? If you put it in a safe then you probably care enough about the data not to have it in a single location, no matter how secure it seemingly is.
Ignoring for a moment how insecure most cheap locks are (including locks on safes), little safes are rarely effective vs a prybar + carrying them away to be cut into at the attacker's leisure. Larger safes have some of the same issues w.r.t. cutting, but you can make it less convenient for an adversary to do it (and make them spend more time where they might be caught).
The $50 safes are not fire-rated... and hardly break-in rated.
For fire safety you need something big and heavy, which will be costly (shipping/moving it alone adds up).
Break-ins are not in my threat model for a document safe. If they were, I'd get a deposit box at a bank. I just want some of my personal mementos and documents to survive a fire.
AFAIK, M Disc really only matters for DVDs due to their organic materials. (Non-LTH) BDs on the other hand have inorganic materials and last pretty well.
I think there was a French study that compared DVDs, M Discs and BDs and the HTL BDs fared very well. Can't find the document though.
I would imagine previous-generation tape drives (used) can be economical. Just need to find a reliable place that handles testing / refurbishing (cleaning, alignment, belts, etc.) of used drives. The other big item is needing the appropriate controller and cabling.
Tape drives are open about both their condition and the condition of tapes. It’s all there in the scsi log pages, more detailed than SMART on hard drives.
Mechanically and electrically, everything is rated to last several times longer than the head.
In other words, you just need to buy two used drives (one as spare) and verify they can write a full tape and their head hours and other error counters are sane. There is no reasonable need to refurbish a tape drive other than a head replacement, which is easy to do at home but so expensive (for older generations) that you might as well buy a new drive. All the testing you could hope for is done in POST and by LTT/equivalent (writing a tape and reading logs is good enough)
1. Backup to external SSD or NAS. This is the backup you will rely on if your PC loses all data. It will be fast to replicate to.
2. Mirror the external backup to a second external SSD. And sync it every week or month. Sync more often if your data is changing a lot.
3. The third layer is an external HDD mirror for the long term off-site backups. HDD are cheaper and more suited for being switched off long term.
4. If you can afford the expense of a fourth step, every year buy another external HDD and put the previous one aside as an archive to be brought into service if the current one fails to boot.
I recommend separating your data into some sort of hierarchy and choosing what needs to be backed up to what level. So if you have some software ISOs that you could repurchase/redownload, then have a separate drive for junk like that and don't have it go all the way through the backup steps listed above.
Figure out what you really need and print it on good paper. Put that in a safe place, away from direct light and dampness.
Save the rest on two of Google Drive, OneDrive, iCloud, some other cloud storage, a backup service or copy to a computer in your home. Make your selection based on things that you will "touch" in some way at least every 12-24 months. Everything else will fail in a few years.
Don’t save crap you don’t need. Don't futz around with optical media, tape or other nonsense. Don't buy safes or safe deposit boxes unless that's going to be part of your routine in some way.
I tend to agree with this although it can be hard to determine what you won't want/need in advance and it probably takes at least some effort to winnow things down.
That said, I'm in the middle of going through my photos right now and deleting a bunch of stuff. (Which is a big job.) It's not so much for the storage space as I'll "only" be deleting a few hundred GB. But it's a lot easier to look for stuff and manage it when you don't have reams of near-identical or just lousy pics. One of my takeaways from this exercise is that I should really be better at pruning when I ingest a new batch.
I think that effort is worth it. As it stands, we've all become digital hoarders as the up-front cost to accumulate stuff like photos and documents goes to zero. The problem is you're dumping LOTS of cost into the future.
Photos are a big thing for me.
Initially, I used applications (Picasa and later iPhoto) to tag photos with metadata to indicate importance, etc. Applications tend to have zero respect for, or commitment to, preserving metadata. So by the time my kids are going to college, my family is going to have 200,000+ photos alone. What's the point? Am I getting pleasure (or, to borrow from the Netflix organizational guru, "Does it bring joy?") from this data?
Personally, my new strategy, having been burned by the tools is a pyramid:
1. Print and Frame/preserve important or significant pictures (Say 20-30/year)
2. Curate others that we care about. (Say 500/year or 5-8%)
3. Purge stuff of no value. (Say 2500/year or ~25%)
Based on my "performance" today, if I keep at it, I'll be able to reduce the rate of growth.
How long is 'a few years'? Controlled environments shouldn't be necessary for unplugged drives, just keep them at or slightly below room temperature.
I've had three external hard drives for 7 years, and none have stopped working. I have one, and keep two somewhere else (office, family). I connect one for a few hours every week/month to update, then leave alone until needed, or rotated with one elsewhere.
A few years isn't archival quality. An HDD will last longer and is cheaper, and speed is much less of an issue for a drive that will be written to and then chucked in a safe.
Yes, HDDs should probably be considered medium-term storage. Tapes seem a little more robust, but it seems like M Disc, estimated at around 1,000 years, takes the crown. Unfortunately 100GB is the limit, so very large files will be difficult.
How does Backblaze source its hard drives? I think their reports are getting big enough to warrant the attention of the manufacturers marketing departments, who are incentivized to game the system by supplying non-representative disks (i.e. the "good ones") to get an edge over the competition, if they can. I suspect any model that looks good in the report will likely shift more numbers than it otherwise would have.
Yev from Backblaze here -> We work with the drive manufacturers and their distributors! One of the reasons we think it's important to state the drive model is so people know exactly what is getting deployed versus just using ___ 14TB or ___16TB. But we source our drives just like any other large cloud storage provider would!
Andy at Backblaze here. All the drives are in data centers with temps around the 75-78 degree mark. Vibrations are kept to a minimum via the chassis design. We publish the data, including the SMART stats for all of the drives, and there are attributes for temperature (SMART 194) and vibration (multiple). See https://en.wikipedia.org/wiki/S.M.A.R.T. for more info on SMART attributes.
Thanks Andy! Without me digging through the SMART data, has there been any difference in data center temps over the years for Backblaze? I ask because I've personally seen disk life vary greatly between "warm" datacenters (closer to 79F) and "cool" datacenters (closer to 72F). I don't have a huge dataset, only anecdotal evidence, but it seems to me temperature plays a pretty big role in drive longevity. Have you guys found the same, or is this a variable not controllable by Backblaze?
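For anyone who does want to dig, here's a rough sketch of one way to slice it from the public per-drive-per-day CSVs. It assumes smart_194_raw is the reported drive temperature in degrees C, which is my understanding but worth verifying.

```python
# Bucket drives by their average reported temperature (SMART 194 raw) and
# compare the share of drives in each bucket that failed. Schema assumed.
import glob
import pandas as pd

cols = ["serial_number", "failure", "smart_194_raw"]
days = pd.concat((pd.read_csv(p, usecols=cols) for p in glob.glob("data_Q4_2020/*.csv")),
                 ignore_index=True)

per_drive = days.groupby("serial_number").agg(
    mean_temp=("smart_194_raw", "mean"),
    failed=("failure", "max"),   # 1 if the drive failed at any point
).dropna()

per_drive["bucket"] = pd.cut(per_drive["mean_temp"], bins=[0, 25, 30, 35, 100],
                             labels=["<25C", "25-30C", "30-35C", ">35C"])
print(per_drive.groupby("bucket")["failed"].agg(["count", "mean"]))
```

Note this is confounded by drive model and age (hotter buckets may simply contain different drives), so it's a starting point rather than an answer.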
I'm always excited for this yearly post. Are there other vendors that provide this kind of insightful info for other types of infrastructure?
Also, kudos to Backblaze. I'm sure there's some side brand benefit to all the work that goes into making this public, but it's clear it's mostly just altruism.
What is the most reliable hard drive of all time, and why?
In other words, let's say I didn't care about capacity, and I didn't care about access time or data transfer rate, and I'd like a drive to store data 100, 200+ years into the future and have it have a reasonable chance of getting that data there -- then what kind of hard drive should I choose, and why?
It's a purely philosophical question...
Perhaps I, or someone else, should ask this question on Ask HN... I think the responses would be interesting...
> I'd like a drive to store data 100, 200+ years into the future and have it have a reasonable chance of getting that data
That's flat out impossible given current drives. So many mechanical parts. So much to go wrong, such as air seals, bearings, etc. Even if the drive survived, will you be able to connect to its interface? No!
If you want to store data for 100 years, you will need to copy that data over, approximately every 5 years, onto the then current storage medium.
> I wonder why Western Digital is almost absent, does anyone know why?
Most of the time the answer comes down to price/GByte. But it isn't QUITE as simple as that.
Backblaze tries to optimize for total cost most of the time. That isn't just the cost of the drive: a drive with twice the storage still takes the same amount of rack space and often the same electricity as the drive with half the storage. This means that we have a spreadsheet and calculate what the total cost over a 5-year expected lifespan will turn out to be. So, for example, even if the drive that is twice as large costs MORE than twice as much, it can still make sense to purchase it.
As to failure rates, Backblaze essentially doesn't care what the failure rate of a drive is, other than to factor that into the spreadsheet. If we think one particular drive fails 2% more of the time, we still buy it if it is 2% cheaper, make sense?
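A toy version of that spreadsheet logic, with every number invented for illustration:

```python
# Compare drives on total 5-year cost per TB, folding in slot overhead
# (rack space, power, cooling) and the expected failure rate.
YEARS = 5
SLOT_COST_PER_YEAR = 50   # assumed per-slot overhead, same for any drive size

def cost_per_tb(price, tb, afr):
    expected_replacements = afr * YEARS            # crude; ignores warranty coverage
    total = price * (1 + expected_replacements) + SLOT_COST_PER_YEAR * YEARS
    return total / tb

print(f"12 TB @ $300, 1.0% AFR: ${cost_per_tb(300, 12, 0.010):.2f}/TB over 5 years")
print(f"16 TB @ $450, 1.2% AFR: ${cost_per_tb(450, 16, 0.012):.2f}/TB over 5 years")
# The bigger drive wins here despite costing more per raw TB and failing
# slightly more often, because the slot overhead is spread over more storage.
```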
So that's the answer most of the time, although Backblaze is always making sure we have alternatives, so we're willing to purchase a small number of pretty much anybody's drives of pretty much any size in order to "qualify" them. It means we run one pod of 60 of them for a month or two, then we run a full vault of 1,200 of that drive type for a month or two, just in case a good deal floats by where we can buy a few thousand of that type of drive. We have some confidence they will work.
> Guessing shit like the ST3000DM001 is a whole different thing entirely.
:-) Yeah, there are times where the failure rate can rise so high it threatens the data durability. The WORST is when failures are time correlated. Let's say the same one capacitor dies on a particular model of drive after precisely 6 months of being powered up. So everything is all calm and happy and smooth in operations, and then our world starts going sideways 1,200 drives at a time (one "vault" - our minimum unit of deployment).
Internally we've talked some about staggering drive models and drive ages to make these moments less impactful. But at any one moment one drive model usually stands out at a good price point, and buying in bulk we get a little discount, so this hasn't come to be.
> Internally we've talked some about staggering drive models and drive ages to make these moments less impactful. But at any one moment one drive model usually stands out at a good price point, and buying in bulk we get a little discount, so this hasn't come to be.
I don't know what your software architecture looks like right now (after reading the 2019 Vault post) but at some point it probably makes sense to move file shard location to a metadata layer to support more flexible layouts to work around failure domains (age, manufacturer, network switch, rack, power bus, physical location, etc.), reduce hotspot disks, and allow flexible hardware maintenance. Durability and reliability can be improved with two levels of RS codes as well; low level (M of N) codes for bit rot and failed drives and a higher level of (M2 of N2) codes across failure domains. It costs the same (N/M)*(N2/M2) storage as a larger (M*M2 of N*N2) code but you can use faster codes and larger N on the (N,M) layer (e.g. sse-accelerated RAID6) and slower, larger codes across transient failure domains under the assumption that you'll rarely need to reconstruct from the top-level parity, and any 2nd-level shards that do need to be reconstructed will be using data from a much larger number of drives than N2 to reduce hotspots. This also lets you rewrite lost shards immediately without physical drive replacement which reduces the number of parities required for a given durability level.
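A quick sanity check on the storage-overhead claim in there; the 17-of-20 inner code matches the vault layout mentioned elsewhere in this thread, while the outer code parameters are purely hypothetical:

```python
# Stacking an (M, N) inner code with an (M2, N2) outer code costs the same
# raw-to-usable ratio as one big (M*M2, N*N2) code.
def overhead(data_shards, total_shards):
    return total_shards / data_shards

inner = (17, 20)   # 17-of-20, like the vault layout described downthread
outer = (8, 10)    # hypothetical cross-failure-domain code

stacked = overhead(*inner) * overhead(*outer)
flat = overhead(inner[0] * outer[0], inner[1] * outer[1])
print(f"stacked overhead: {stacked:.4f}, flat overhead: {flat:.4f}")  # identical
```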
Is it safe to say that Backblaze essentially has an O(log n) algorithm for labor due to drive installation and maintenance, so up-front costs and opportunity costs due to capacity weigh heavier in the equation?
The rest of us don’t have that, so a single disk loss can ruin a whole Saturday. Which is why we appreciate that you guys post the numbers as a public service/goodwill generator.
> algorithm for labor due to drive installation and maintenance ... the rest of us don't have that so a single disk loss can ruin Saturday
TOTALLY true. We staff our datacenters with our own datacenter technicians (Backblaze employees) 7 days a week. When they arrive in the morning the first thing they do is replace any drives that failed during the night. The last thing they do before going home is replacing the drives that failed during the day so the fleet is "whole".
Backblaze currently runs at 17 + 3. 17 data drives with 3 calculated parity drives, so we can lose ANY THREE drives out of a "tome" of 20 drives. Each of the 20 drives in one tome is in a different rack in the datacenter. You can read a little more about that in this blog post: https://www.backblaze.com/blog/vault-cloud-storage-architect...
So if 1 drive fails at night in one 20 drive tome we don't wake anybody up, and it's business as usual. That's totally normal, and the drive is replaced at around 8am. However, if 2 drives fail in one tome pagers start going off and employees wake up and start driving towards the datacenter to replace the drives. With 2 drives down we ALSO automatically stop writing new data to that particular tome (but customers can still read files from that tome), because we have noticed less drive activity can lighten failure rates. In the VERY unusual situation that 3 drives are down in one tome every single tech ops and datacenter tech and engineer at Backblaze is awake and working on THAT problem until the tome comes back from the brink. We do NOT like being in that position. In that situation we turn off all "cleanup jobs" on that vault to lighten load. The cleanup jobs are the things that are running around deleting files that customers no longer need, like if they age out due to lifecycle rules, etc.
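A back-of-envelope sketch of why three parities per 20-drive tome leaves so much margin under independent failures; the AFR and the detect-and-rebuild window below are assumptions for illustration:

```python
# Probability that 4 or more of a tome's 20 drives fail inside one repair window,
# assuming independent failures (the dangerous, time-correlated cases are exactly
# what the ST3000DM001-style discussion upthread is about).
from math import comb

AFR = 0.01                    # assumed 1% annualized failure rate
WINDOW_DAYS = 3               # assumed detect-and-rebuild window
p = AFR * WINDOW_DAYS / 365   # per-drive failure probability within the window

def p_at_least(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(f"P(>=4 of 20 fail in one window) ~ {p_at_least(4, 20, p):.1e}")  # ~2e-13
```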
The only exceptions to our datacenters having dedicated staff working 7 days a week are if a particular datacenter is small or just coming online. In that case we lean on "remote hands" to replace drives on weekends. That's more expensive per drive, but it isn't worth employing datacenter technicians that are just hanging out all day Saturday and Sunday bored out of their minds - instead we just pay the bill for remote hands.
Is it actually required to have employees wake up and replace that specific failed drive to restore full capacity of a tome? I would expect an automatic process - disable and remove failed drive completely from a tome, add SOME free reserve drive at SOME rack in the datacenter to the tome, and start populating it immediately. And originally failed drive can be replaced afterwards without hurry.
> I would expect an automatic process - disable and remove failed drive completely from a tome, add SOME free reserve drive at SOME rack in the datacenter to the tome
We have spec'ed this EXACT project a couple different ways but haven't built it yet. In one design, there would be a bank of spare blank drives plugged into a computer or pod somewhere, let's pretend that is on a workbench in the datacenter (it wouldn't be, but it helps the description). The code would auto-select a spare drive from the workbench and begin the rebuild in the middle of the night. At whatever time the datacenter technician arrives back in the datacenter they can move the drive from the workbench and insert it in place of the failed drive in the vault. It doesn't even matter if it is half way through the rebuild, the software as written today already handles that just fine. It would just continue from where the rebuild left off when it was pulled from the workbench.
The reason we don't really want to leave the drive in a random location across the datacenter in the long run is we prefer to isolate the network chatter that goes on INSIDE of one vault to be on one network switch (although a few vaults span two switches to not waste ports on the switches). Pods inside the vault rebuild the drive by talking with the OTHER 19 members of the vault, and we don't want that "network chatter" expanding across more and more switches. I'm certain the first few months of that would be FINE, maybe even for years. But we just don't want to worry about some random choke point getting created that throttles the rebuilds in some corner case without us in control of it. Each pod has a 10 Gbit/sec network port, and (I think) a 40 Gbit/sec uplink to the next level of switches upstream. The switches have some amount of spare capacity in every direction, but my gut feeling is in some corner case the capacity could choke out. We try to save money by barely provisioning everything to what it needs without much spare overhead.
Plus the time to walk a drive across the datacenter isn't that big of a deal. Some of the rebuilds of the largest drives can take a couple days ANYWAY. That 5 extra minutes isn't adding much durability.
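A quick sanity check on "a couple days", using an assumed sustained write speed for a modern high-capacity drive:

```python
# Rebuilding one big drive is bounded by how fast the replacement can be written
# (and how fast the 19 peers can feed it). Throughput figure is an assumption.
CAPACITY_TB = 16
SUSTAINED_WRITE_MBPS = 180    # assumed average sequential write speed, MB/s

seconds = (CAPACITY_TB * 1e12) / (SUSTAINED_WRITE_MBPS * 1e6)
print(f"~{seconds / 3600:.0f} hours (~{seconds / 86400:.1f} days) at full speed")
# ~25 hours best case; throttled so customer traffic isn't starved, "a couple
# days" is about right.
```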
> because we have noticed less drive activity can lighten failure rates
It's a rite of passage to experience a second drive failure during RAID rebuild/ZFS resilvering.
I got to experience this when I built a Synology box using drives I had around and ordering new ones.
One of the old drives ate itself and I had to start over. Then I did the math on how long the last drive was going to take, realized that since it was only 5% full it was going to be faster to kill the array and start over a 3rd time. Plus less wear and tear on the drives.
Yev from Backblaze here -> We just started deploying more of them! We added almost 6k WDC drives in Q4 - so we're getting more of them in the fleet! They have a pretty low AFR - but haven't been deployed for too long, so they'll be interesting to follow!
We do have some drives that we use as boot drives and for logging. We write about them a bit in the post - they're primarily SSDs, so not included in the overall mix of data drives!
Used or refurbs. Maybe some old stock, but at this point I doubt it...
You will see "UltraStar" branding but it will be "Western Digital UltraStar" not HGST UltraStar.
Some retailers may even still put HGST in the listing (e.g. Amazon still has several listings for new drives that refer to HGST), but the packaging, official documents, and everything that comes out of WD will not have any HGST branding on it; it will not say HGST on the drive, etc.
Andy from Backblaze here. Larger drives do take longer to rebuild, but to date we haven't changed the encoding algorithms we built. There are other strategies like cloning which can reduce rebuild time. We can also prioritize rebuilds or drop a drive into read-only mode as needed. The system was built expecting drive failures.
I think you should stick to SSDs for performance reasons. :-) I can't go back to spinning drives for my desktop computers.
But here is one idea for you... Backblaze started in my dive one-bedroom rental apartment in Palo Alto, California. Here is what that looks like with cubicle walls set up in my living room: https://i.imgur.com/twu5vKe.jpg
Ok, so my desk in that picture is the one on the far right. I positioned my desk in the living room against a wall carefully so that my gas furnace room (with a door) was DIRECTLY on the other side of the wall. Then I put the computer inside the furnace room, drilled a hole through the drywall large enough to run the monitor cables and USB for the keyboard and mouse, and my desk was SILENT.
I caught SO MUCH grief for putting my computer in a furnace room, but the silence was worth it. I still claim to this day that if the furnace pumped out all of its heat into the closet surrounding the furnace, that would be a terrible design; that's not how it works at all. :-) But if it bothers you deeply, find a wall where there is a coat closet on the other side.
Also wow you founded Backblaze, super exciting to meet you here. I am a happy customer and also got my family members to sign up so that they don't have to worry about losing pictures.
Heard about Backblaze from one of the professors at university. Can't imagine unsubscribing ever, it's definitely a part of essential software package for me.
Also, if the Backblaze founder recommends sticking to SSDs, I will stick to SSDs :)
Since moving to Fractal Design Define (i.e. soundproofed) version 4/5/6/7 cases over the past ~10 years, the only noise I ever hear anymore is from the fans, primarily GPU and case fans when running heavier jobs. In my main system I have 6 spinning rust drives (8-16TB from various manufacturers) running 24/7 and I never hear them, even during heavy read/writes... and I often sleep on the couch nearby ;)
I actually have a Fractal Define 7 XL but am hesitant to get HDDs because I haven't used them for some 10 years. Does the FD7 isolate HDD noise to the level of, for example, AIO noise?
I'd recommend checking out some YouTube reviews in regards to water-cooling, several have talked about AIO but I skipped over those parts. I don't have AIO but if you want to keep absolutely silent, I would suggest mounting AIO fan-kit in the front of the 7XL instead of up top since that requires removing some of the critical soundproofing.
Sadly, at least in some cases the "5400" drives are just relabeled 7200 RPM drives at a lower price point. Apparently it's cheaper than making actual 5400 RPM drives.
Au contraire, I feel like drive reliability has gone _down_... Especially for consumers: the big difference between Backblaze and regular users is that they have their disks spinning continuously, and the reliability numbers seem to only apply in that scenario. If you switch off and store a drive, my experience is that after a year there's a very high probability it won't switch on again. This is a big problem in academic labs, where grad students generate terabytes of data and professors and departments are too stingy to provide managed storage services at that scale, so it all sits in degrading drives in some drawer in the lab.
My experience has been the opposite. Every HDD drive I have works. The new ones are fine, even if I let them sit for months, as are the old ones after years.
My experience of data storage among academics has been disturbing. Masses and masses of work is stored on USB sticks and laptops. Hundreds of hours of work, maybe even thousands of hours and no backups. I’ve hit it multiple times and it blows my mind each time.
Yes, buying a basic backup solution is going to set you back a few hundred dollars minimum (or not, if you go for BB or similar) but it seems like a basic minimum.
I don’t know how you change the culture but it’s bad among those I have worked alongside.
I haven’t bought large drives in years and recently started doing so. I have been really impressed with how good they are and how well they perform in an always-on NAS. I’m so impressed with the Synology I got and can’t speak highly enough of it. I just wish I’d bought one with more bays.
This. I bought a hard drive docking station and the idea was to go through all 6 of my hard drives from the past 10 years which I hadn't used. Only the laptop drives worked.
Is the docking station supplied with enough current? 3.5" drives tend to take considerably more power (and use 12V for their motor) particularly when spinning up. I'd give those drives another try in a different docking station or connect them directly to a PC, RAID or NAS device.