Backblaze Vaults: Zettabyte-Scale Cloud Storage Architecture (backblaze.com)
142 points by nuriaion on March 11, 2015 | 69 comments



Reading the comments, is anyone else bothered by this reply from a Backblaze representative:

"Right now, Backblaze has only one datacenter, so the short answer is "no". :-)

The longer answer is that for online backup, there is one copy of your data on your laptop, and another copy in the Backblaze datacenter in Sacramento. If a meteor hits our datacenter in Sacramento pulverizing it into atoms, you STILL would not lose one single file, not one - because your laptop is still running just fine where ever you are with your copy of the data. In the case that occurs, we will alert our users they should make another backup of their data."

There are a million and one things other than a comet strike that can go wrong in a data center. I would not trust a backup provider that does not replicate my data across at least two data centers.


Yev from Backblaze here -> It's true, we just have the one. Backblaze was bootstrapped, so we can't over-expand and still maintain profitability, which is what allows us to stay in business. We're pretty up-front about having the one datacenter. We'd LOVE to add more in the future, but truthfully, it would at least double our costs, and we'd need to raise prices. We're thinking of ways of avoiding that while maintaining our current business model. If you are looking for something more geo-redundant, take a look at services like Amazon S3. They are great, but the downside there is they charge per GB to make up for the extra costs, so depending on the amount of data, it can get pricey. Either way, we recommend a 3-2-1 backup policy (3 copies of your data, 2 onsite but on different mediums and 1 offsite) as a good start to a backup strategy. We're just one solution of many, though; we like to think we're the easiest one!


How about an add-on cost/service that tags your data as needing datacenter redundancy, and replicates only that data to a new datacenter? It has the benefit of not requiring as much up-front investment; as it's used, it pays for itself, and you have a bunch of current customers you can upsell to. The architecture to segregate redundant from non-redundant backup customers could be a pain, but as long as you have tools to migrate data between systems (I imagine you do), then it could just be running two separate Backblaze clusters in the first datacenter, one which supports redundancy and one which doesn't, and then just migrating customer data between the clusters as they add/drop the redundancy service. That saves you from having to cherry-pick specific files/customers from the cluster to duplicate in the other datacenter; you just make sure one cluster is always redundant.


We're definitely looking at options like this, but the engineering work it would take to implement solutions like that is not insignificant, and a lot of our engineering muscle has been working to roll out the Vaults over the past year and change! It could certainly be another revenue stream for us, but building out a new datacenter is expensive, especially if you don't buy/guarantee build-out ahead of time, so we'd have to forecast how many people would want that service and prepare accordingly. Again, not insignificant stuff, but it is definitely possible in the future!


The nice thing about using separate clusters is that you can build them out in chunks. Build X new capacity in your main datacenter as a new cluster, and X new capacity in a different datacenter, and replicate. Need more redundant capacity? Build Y new capacity in your main datacenter, and Y new capacity in a different datacenter, not even necessarily the same backup datacenter as before. You end up with one main non-redundant cluster, and a bunch of smaller redundant clusters spread over one or more additional datacenters.

If you're really lucky, you siphon off customers from the non-redundant service at the same rate as, or faster than, they sign up for it, allowing you to avoid building that out much for a short while.


As an existing customer, I'd pay for geo-site redundancy at double the current pricing.


Brian from Backblaze here, I wrote that comment so if anybody has any questions, fire away!

I think the MOST IMPORTANT THING is that an online backup provider like Backblaze is totally transparent with its customers about the architecture and what we do and do NOT do. If our design is not reliable enough for your particular needs, then you are able to make the choice not to use us. I never want to mislead customers or hide exactly how durable their backup really is.

Finally - what I recommend to my very closest family members and trusted friends is that for data you feel would be catastrophic to lose, I recommend you have at least three copies including the primary copy. That is two separate backups with TWO SEPARATE VENDORS who did not share any code, hopefully managed by two separate UIs. For bonus points, one backup should be "offsite". For example, many Backblaze customers use Time Machine on the Macintosh for a local backup, and Backblaze for their remote backup, and that is what EVERYBODY should be doing. I can show you many support cases where Time Machine failed to restore a file and Backblaze saved the day, and vice versa. The fact is that users make mistakes, UIs are hard to use, your 14 year old son decided to unplug the Time Machine USB hard drive to free up a USB port, or your 14 year old daughter decided to uninstall the Backblaze agent - in other words, stuff happens!!


No, I'm not bothered by it, as I wouldn't expect replication across multiple data centers at a "$5/mo for unlimited" price point.

I routinely recommend Backblaze as a quick and easy cloud backup solution to family and friends. I view the risk of partial or complete data loss with Backblaze as extremely low compared to the local backup solutions that some people rely on (i.e. periodic backups to an external USB drive).

For myself and my own work files, I don't put complete trust in any single backup provider, regardless of what their replication policy is.


I really feel this is the correct answer.


You get what you pay for. This product is very squarely aimed at people that want backup cheap (and there's nothing wrong with that). Should you be backing up your enterprise with this (and only this)? Probably not. Is it good enough as a backup of your family photos and videos? Probably, but it depends on your risk aversion level. Additionally, you can always throw another backup service into the mix and get the redundancy you want, at the appropriate cost.


Time Machine + SuperDuper! clone + Backblaze + Arq (to S3 Glacier). I'm ready for meteors, bring it on.


Data Backup 3 [1] + SuperDuper! clone + Crashplan + Arq (to S3 Glacier) + DropBox + Tarsnap (Subset) + Aperture Vaults.

But I would still prefer not to be hit by any Meteors.

I have a fondness for Tarsnap and Arq, but I find Data Backup 3 from ProsoftEngineering to be one of the best for doing local backups. At the end of a day, I just toss in a 64 GB USB Key (with FileVault Encryption), and no matter how much data I've created, it usually is less than 10 seconds for a full versioned backup. SuperDuper Clone about once every couple weeks.

I originally quit BackBlaze because it fired up my CPU to 100%, no matter how much time I spent trying to tweak it, but now I'm running into the same issue with CrashPlan which, after a couple years, has started constantly scanning my filesystem, forcing me to sudo launchctl unload /Library/LaunchDaemons/com.crashplan.engine.plist, and then remembering to sudo launchctl load /Library/LaunchDaemons/com.crashplan.engine.plist before I go to bed to let it run overnight.

What is it about CrashPlan that it won't let you just shut everything down from within the application?

[1] http://www.prosofteng.com/databackup3/


I like your style!


If you are the type of customer for whom this is actually an issue, you are not backing up your data on BackBlaze. You are going (and paying) for someone like http://www.zetta.net/architecture.php.

This type of service starts at around $175 for 500 GB in SSAE16-audited datacenters, but you get all the things that consumer-based backup systems don't - such as plugins for SQL, Exchange, Hyper-V, and NetApp, as well as backup & DR software licenses for an unlimited number of servers - and, most importantly, 24x7 US-based engineer-level support.


"For Backblaze Vaults, we threw out the Linux RAID software we had been using and wrote a Reed-Solomon implementation from scratch. It was exciting to be able to use our group theory and matrix algebra from college. We’ll be talking more about this in an upcoming blog post."

I hope I'm not the only one uncomfortable about this. I mean, I understand the need for greater flexibility and features that MDRaid doesn't provide, but this wording stinks of NIH and reinventing the wheel "because it was fun", discarding the maturity and reliability of established software. And data storage is all about reliability.

I hope I'm just reading too much from this, and it isn't actually representative of Backblaze's engineering practices.


Brian Wilson from Backblaze here (not the "Brian Beach" who is author of the blog post) - this was not a case of NIH. We use lots and lots and lots of existing software like Debian, Java, Ext4. We use tools like ansible and Zabbix. But this one thing just didn't exist for us in the form we needed it. We looked, we really did.

We did write the Reed-Solomon ourselves in a "clean room" so we did not have to pay any licensing fees and we clearly didn't steal anybody else's source code, but that is a very small amount of code. Like 80 lines of Java. Seriously. We referenced the technical papers we read to implement it in that blog post, but here it is again: http://www.cs.cmu.edu/~guyb/realworld/reedsolomon/reed_solom... And we unit tested the living heck out of that code, plus we mathematically verified various parts.

But I'm open to an alternative solution if you can suggest one? Remember our three highest priorities are: 1) reliable, 2) low cost, 3) simple. The "low cost" includes things like we do not want to pay ongoing licensing fees to other companies.
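
For concreteness, here is a minimal sketch of what matrix-based Reed-Solomon parity generation over GF(2^8) can look like, written in Python rather than Java and using the Cauchy-matrix construction from the erasure-coding literature. The 17+3 shard counts come from the article; everything else (names, polynomial, matrix choice) is illustrative and is not Backblaze's actual code:

    # Minimal sketch of Cauchy Reed-Solomon parity generation over GF(2^8).
    # Illustration only: the 17+3 shard layout matches the article, but the
    # construction and every name here are assumptions, not Backblaze's code.

    DATA_SHARDS = 17
    PARITY_SHARDS = 3

    # Log/antilog tables for GF(2^8) with primitive polynomial 0x11d.
    GF_EXP = [0] * 512
    GF_LOG = [0] * 256
    x = 1
    for i in range(255):
        GF_EXP[i] = x
        GF_LOG[x] = i
        x <<= 1
        if x & 0x100:
            x ^= 0x11D
    for i in range(255, 512):
        GF_EXP[i] = GF_EXP[i - 255]

    def gf_mul(a, b):
        # Multiplication via log tables; addition in GF(2^8) is plain XOR.
        if a == 0 or b == 0:
            return 0
        return GF_EXP[GF_LOG[a] + GF_LOG[b]]

    def gf_inv(a):
        return GF_EXP[255 - GF_LOG[a]]

    # Cauchy rows for the parity shards: entry (i, j) = 1 / (x_i XOR y_j) with
    # x_i and y_j drawn from disjoint sets.  Stacking these rows under a 17x17
    # identity gives an encoding matrix in which any 17 rows are invertible,
    # i.e. any 17 surviving shards are enough to rebuild the rest.
    PARITY_MATRIX = [
        [gf_inv(i ^ (PARITY_SHARDS + j)) for j in range(DATA_SHARDS)]
        for i in range(PARITY_SHARDS)
    ]

    def encode(data_shards):
        # data_shards: 17 equal-length byte strings.  Returns 3 parity shards.
        length = len(data_shards[0])
        parity = [bytearray(length) for _ in range(PARITY_SHARDS)]
        for i, row in enumerate(PARITY_MATRIX):
            for j, shard in enumerate(data_shards):
                coeff = row[j]
                for k in range(length):
                    parity[i][k] ^= gf_mul(coeff, shard[k])
        return [bytes(p) for p in parity]

    # Reconstruction (not shown) inverts the 17x17 submatrix for whichever 17
    # shards survived and multiplies it through -- once the field arithmetic
    # exists, it really is an "80 lines" class of problem.

    if __name__ == "__main__":
        toy = [bytes([i] * 4) for i in range(DATA_SHARDS)]  # 17 tiny 4-byte shards
        print(encode(toy))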


I'm actually rather surprised you didn't just go with JErasure (http://jerasure.org/).


Thanks, that's basically what I wanted to hear. :)

My first thought was that you could've reused the R-S code from mdraid or dm-raid or ZFS, but on second thought 1) it may be too specialized to be reusable, and 2) it's GPL (or CDDL), so you can't just plonk it into your own code.

And yeah, if it's just 80 lines of Java, I'm worrying about the wrong things.


> 2) it's GPL

Backblaze is a web service, so for better or worse, the GPL doesn't apply here, since we never have access to their binaries. The AGPL would apply, but that's not the license used.


I doubt that Backblaze decided to go this route on a whim, particularly because it's really their only option for cost efficient large scale storage. MD is terrible for this use case. There's no mature, open source object store with R-S coding.

So everyone uses MD, but why does it suck for Backblaze-like storage? It only provides RAID levels, so you have partition-level redundancy but not necessarily object-level redundancy. You still need another layer to put objects in the right places to achieve the level of redundancy you want, and to re-balance as necessary. And a RAID array is local to one system, so achieving multi-host redundancy means duplicating the data as many times as necessary.

MD can also be finicky. Since it's part of the kernel, kernel upgrades can (rarely) produce weirdness. You can get stuck between a rock and a hard place, where you need to upgrade to fix a vulnerability but upgrading too quickly puts your data at risk.

According to Backblaze, they used to do 3 RAID6 arrays of 15 drives each per pod. This gives an overhead of 1.15:1, and two devices can fail before you lose data. Write performance is not going to be great, nor is rebuild performance. That is probably part of the reason why they could only push 950MB/s per host.

This only provides disk-level redundancy, and not host-level. So now you have to at minimum duplicate your array onto a second host, and your overhead is 2.30:1. A third host brings it to 3.46:1. I'm surprised that they were even using RAID for non-boot devices at all, given the overhead.

Erasure coding allows you to safely store data with far less than a 2:1 overhead. Their current design claims that they can lose 3 storage devices before they risk losing data permanently, with an overhead of 1.17:1. That is pretty compelling from a cost perspective.
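
The overhead ratios quoted above are easy to verify; a quick sketch of the arithmetic (drive and shard counts from this comment, nothing Backblaze-specific beyond that):

    # Quick check of the overhead ratios (raw capacity : usable data) above.

    def raid6_overhead(drives_per_array, hosts=1):
        # RAID6 gives up 2 drives per array to parity; mirroring whole arrays
        # onto extra hosts multiplies the raw storage required.
        return hosts * drives_per_array / (drives_per_array - 2)

    def erasure_overhead(data_shards, parity_shards):
        return (data_shards + parity_shards) / data_shards

    print(f"RAID6, 15 drives, 1 host : {raid6_overhead(15):.3f}:1")      # 1.154:1
    print(f"RAID6, 15 drives, 2 hosts: {raid6_overhead(15, 2):.3f}:1")   # 2.308:1
    print(f"RAID6, 15 drives, 3 hosts: {raid6_overhead(15, 3):.3f}:1")   # 3.462:1
    print(f"17+3 erasure coding      : {erasure_overhead(17, 3):.3f}:1") # 1.176:1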


Agreed. R-S encoding is hard. You should use something like zfec (extensively battle-tested by Tahoe-LAFS) or JErasure rather than rolling your own.

(I say this as someone who worked at a company where we rolled our own, but that was ten years ago and we didn't have those options.)


They don't say they did it because it was fun, only that it was fun. (To be fair, they also don't seem to say why they did do it.)


That was a great read. I was surprised that when a drive fails, its data goes back onto the replacement drive in the same slot. One of the things that GFS did (I presume they still do) and Blekko does is that when a drive fails, its data is reconstructed on other working drives, so replacement leaves no long-term degradation risk. If you don't do that, then while your drive is dead you have lost some data resiliency until it gets replaced, as opposed to just waiting until it has been successfully recreated elsewhere[1].

It's no wonder storage companies like EMC are hurting when you have innovators like these guys out there.

[1] Which, given a 3x replication system and a sharded (or chunked) file, can happen pretty quickly.


Brian from Backblaze here. We do angst over the idea of a "hot spare" where the very second we fail a drive it can begin rebuilding elsewhere. But that takes up redundancy even when it is not used (an extra drive waiting) which raises cost.

At our current scale it is becoming less and less of an open debate, because we now have 7-day-a-week staffing at our datacenter and the datacenter techs jump right in and replace failed drives, often within an hour or so. A "hot spare" would only save a couple hours of rebuild time. But remember, your mileage will vary - until you reach half our scale you cannot afford even a Monday-Friday datacenter tech, so you might only be able to replace failed drives on Mondays and Wednesdays, which widens your exposure.


Have you considered rebuilding into already available space in the cluster?

Something similar to how ceph or swift handles rebuilds? You get rid of the individual disk sitting around as a spare, though it would break the idea of a tome being a specific collection of disks. You would need to be able to identify and move a shard around your cluster into other vaults, and a shard would need to be smaller than the raw disk size.

This would increase network overhead as well (more movement).

I'm probably just rambling here so you can probably ignore me. (you have awesome tech there though)


> I'm probably just rambling here

:-) Not at all! Don't assume we're some perfect team of scientists who know all the correct solutions before we start coding. We often angst over these decisions and designs, knowing that once we write the code a lot will be set in stone (hard to change) for a number of years. The reason it becomes hard to change is we don't have a huge development team that can afford to rewrite the software every year, so we try to get it correct and then go on to work on new things or polish up corners that need polishing.


Two people for a year is a lot of hard drives you don't get to buy. One of the interesting things I got to experience at Google when I was there was the difference between drive economics at scale and single drive pricing. Next time you're in Sunnyvale we should chat over a beer.


I'm really looking forward to the Reed-Solomon article. It seems that very few RAID-like applications are built to handle arbitrary data and parity stripes.


Yev from Backblaze -> we are SO jazzed to write that one up soon. We're trying to find a good way to present it, otherwise it would have been included here, but that would have made for a very long read. Stay tuned!



Note that the Plank paper claimed to support arbitrary N+M, but was then amended six years later identifying a critical flaw that invalidates its claims. Nevertheless, the technique can be adapted to work for up to triple parity.

So better than most, but still not arbitrary. And the limiting factor seems to be write performance.


Getting rid of RAID makes things a lot easier since you don't have to suffer through rebuilds, which causes a lot of I/O for the entire RAID. You still have to repopulate the drive, but you have fine-grained control of when to do it and even which files have the highest priority.

For those looking to build something similar, check out ceph or gluster.

Is a single file spread across multiple data centers? At the claimed 99.99999% annual durability, doesn't the chance of a natural disaster that could take out the entire data center start being a major factor?

I realize that the customer also has a copy of the data so you don't have to take the same precautions as something like S3, but it'd be sad if a datacenter got taken out by a meteor or airplane crash the same day that the customer's laptop was stolen.

Finally, a question for backblaze devs. In your opinion, how often do you need to scrub a drive to check for problems?


Yev from Backblaze -> That meteor question comes up a lot (https://www.backblaze.com/blog/vault-cloud-storage-architect...). We currently do have one data center, but this design allows us to bring others online. If the datacenter was hit by a meteor all our customers would get an email blast urging them to create a local backup immediately. The chances that both the DC and the user would get hit by the same natural disaster are relatively small. Still, it's not a storage service like S3 so geo-redundancy plays a smaller role. We do plan on building out other datacenters in the future, but since we're bootstrapped, we have to do that when the time is right, otherwise it would be very easy to over-extend ourselves and start losing money.

edit -> I ignored your Backblaze dev question, sorry. We have multiple processes running at all times on the pods, and they go shard-by-shard. We're always optimizing, but the short answer is, we're always looking for errors.


It doesn't take a meteor. I got hit when EV1 had a fire in one of their DCs. Fortunately we had a local backup from which we restored and kept on running, but a lot of companies were not in that position and had a hard time surviving. EV1 was hurt badly by this; it's not just meteors. What happened was that a transformer on the floor exploded, took a dividing wall with it and caused a (surprisingly!) relatively minor fire.

What took down the DC for several weeks was the fire department's investigation. They took their time to figure out the root cause of the fire, which is their right, but the collateral damage of that was substantial.

So don't just plan for meteors.


The nice thing from backblaze's perspective and their customers is that downtime is far more tolerable than it would be for most businesses. Most disasters that are going to impact a data center aren't going to destroy the physical hard drives, assuming the data center has the usual safeguards in place.


Brian from Backblaze here-> this is definitely true. If you ask to have a 5 TByte restore prepared, it will take us a full 22 hours to get that all assembled for you. If you want us to FedEx the prepared restore on a USB hard drive, it will take ANOTHER 24 hours, and if you are in Europe it's more like 48 hours.

And what's luxurious about "backup" as a business is this doesn't bother many customers. As long as we keep communicating to them on the progress, and we assure them they are going to get every solitary bit/byte/jpeg/mp3/movie back - they often tell us to take our time and do it right. For "backup" accurate and durable is about a thousand times more important than "instant gratification".


At our current growth rate, Backblaze deploys a little over one Vault each month.

That's a full rack of storage pods a workday. Some back-of-the-envelope math says that's almost a tractor-trailer worth of hardware a month. Wow.


Yev from Backblaze here -> We're VERY proud of our datacenter techs, and you'd hear more about them if they weren't so shy. It IS a monumental achievement though, considering that a few years ago we only had two guys running our entire farm.


The client I'm working for failed to get a web server procured with 6 months lead time. And that's a company 100X bigger than Backblaze. I wish they understood how much money they are wasting by being stingy.


All you can do is keep sending them these posts :)


Did you do that math right? Shouldn't it be about one pod per day?


I find the work Backblaze is doing is wonderful, mostly because of how open they are with their data. Watching their numbers on certain HDD failures has really helped steer some of my recent purchasing decisions. I'm also really interested to see more details on their custom RAID replacement.

Essentially RAID is dead to me, and ZFS/BTRFS, etc. seem to be the only way forward, so I hope they GPL the code.

For anyone from Backblaze reading this, I'm curious though: have you found that backplanes are becoming a primary bottleneck? Because that seems to be the case (SATA 6Gb/s hurts after using things like Fusion-io or even Thunderbolt). Any insights into the future of backplanes?


Backblaze seems so forward thinking with the hardware, but if you've ever tried to restore a file using their web interface, it's an exercise in frustration.

If you want to pull a single file, you'll be navigating through a Windows 95-esque tree. They store snapshots, but if you want to change the snapshot, you wait a minute for each one while it loads. Even going back to a snapshot you were just looking at, you wait the whole load time.

Now if you actually need to restore something, you can download a zip file. They will only let you make the zip file so big, so you have to break up your restore into multiple zip files. You will have to do that manually; there is no way to have BB auto-generate the parts for you. And these are zip files, not one of the many archive formats that allow for parts.

Besides that, you will need double the amount of storage to recover this data since you'll need to store the zip and the extracted backup.

The way around this is to pay BB to put the data on a USB drive (flash up to 128GB or external up to 4TB) at $99 and $189 respectively. A 128GB external USB drive on Amazon, first result, is $120. They actually won't give you a 4TB drive unless your restore needs it, but the price is still $189. Labor, I guess? According to their own FAQ, you will wait 2-3 days for them to ship these drives, so hopefully you don't need that restore anytime soon.

I really liked Backblaze up to the point I needed to use it for its real purpose. It seems like nobody at BB cares about the restore process or that it doesn't sell new subscriptions.

--

Also, maybe someone from BB can explain why the secure.backblaze restore website loads tracking pixels from googleads.g.doubleclick, a.triggit, s.adroll, facebook, ads.yahoo, x.bidswitch, ib.adnxs and idsync.rlcdn. Are you selling my need for a new hard drive or something?


> It seems like nobody at BB cares about the restore process

Brian from Backblaze here -> I care! It just keeps getting bumped by something higher priority. I have a spec for how to speed up the restore tree browsing, it's just waiting for us to have a spare moment. For a while the Vaults took precedence.

Part of running Backblaze without VC funding is we can only hire programmers when we can afford it out of profits, and we're up to about 6 programmers (the result of a recent burst of hiring) which handle all of: Windows, Macintosh, iOS, Android, and in the datacenter they built the pods, the Vaults, and the web front end. But we'll get there, I swear.

> or that it doesn't sell new subscriptions.

This is unfortunately the heart of the problem. The most important thing to get smooth as glass is the BACKUP part; that sells new subscriptions. If we have your data safe, we can always hobble through a restore, and even if it is a little slow and clunky we can get all your files back after your laptop is stolen. If it held up sales, we'd jump over and do the one week of work to speed it up.


Yev from Backblaze here -> we're working on the restore process. It can DEFINITELY be smoother, and we're hoping to address that in the near (hopefully) future! As for the tracking pixels, a lot of those aren't actually in use anymore but are remnants of advertising that we've run in the past. That could use some cleaning up as well; I'll chat with the team to see if we can expunge the ones we're no longer using!


Always impressed by the write-ups the Backblaze team does. Very informative and very clear on what you guys are achieving and how you do it.

Crazy amount of hardware involved, and Backblaze is the "small kid on the block" in relation to FB, Google, and Amazon.


Yev from Backblaze here -> Yea, we hope they start doing more stuff like this too! More information = everyone wins!


Backblaze is fantastic for sharing all their hardware development efforts. We've built our own pod for onsite storage and it's amazingly awesome.


Yev from Backblaze here -> Nice! Glad it's working for you :)


What does 99.99999% annual durability mean in practical terms? That everyone should expect to lose a few bytes per year? Or that only one in a million customers will be affected by data loss?

(I've never been good at statistics.)


That number is probably the estimated probability that over the course of a year, a single vault loses every file contained on it (which is what would happen if a single vault had 4 drives simultaneously and irrecoverably fail).

As a consumer, the type of failure you would experience, and the probability of experiencing that failure, given that Backblaze has suffered a vault failure, depends on how they distribute your data amongst their vaults. They don't explicitly say how they do this, so it's impossible to know for sure, but we can consider the two extreme scenarios.

Scenario 1: Each customer is assigned a single vault, and all your files are on it. In this case, if Backblaze lost a vault, you would either luck out and have your files on another vault and be completely unaffected, or get really screwed and have all your files on the bad vault, and lose them all. They've got 150 PB of storage, and each vault stores 3.6 PB of data, so we can estimate that currently you may have something like a 1 in 40 chance of having your data on any given vault. So under this scenario, you would have a 1 in 400 million chance of losing all your files.

Scenario 2: Each customer's files are uniformly distributed across all vaults. In this case, if Backblaze lost a vault, all customers would lose a fraction of their files. Again, using our estimate that they might have 40 vaults, you would have a 1 in 10 million chance of losing 2.5% of your files.
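
A rough sketch of the arithmetic behind those two scenarios, using this comment's own estimates (the 99.99999% figure read as a per-vault annual loss probability, and roughly 40 vaults):

    # Back-of-the-envelope arithmetic for the two scenarios above.
    # Inputs are this comment's estimates, not official Backblaze numbers.

    p_vault_loss = 1e-7   # "99.99999% annual durability" read as per-vault loss odds
    vaults = 40           # the "1 in 40" estimate from 150 PB total / 3.6 PB per vault

    # Scenario 1: all of your data happens to sit on a single vault.
    p_all = p_vault_loss / vaults
    print(f"lose everything: about 1 in {1 / p_all:,.0f}")                      # ~1 in 400,000,000

    # Scenario 2: your data is spread evenly across every vault.
    p_slice = p_vault_loss   # any vault loss takes a slice of your files
    print(f"lose {100 / vaults:.1f}% of files: about 1 in {1 / p_slice:,.0f}")  # 2.5%, ~1 in 10,000,000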

So up to now, we're basically just doing the math without questioning the assumptions of the model. In reality, I think your practical risk is mostly concentrated in things outside of the model: ie, an event that affects all of their vaults simultaneously, like a fire, earthquake, meteor strike, etc. If I had to make a bet about what that number is, I'd put it in the 1/10,000 to 1/100,000 range. In other words, orders of magnitude higher than losing data because some hard drives failed, or a backblaze employee spilled his coffee, or something like that.


Thanks. IMHO the greatest risk of data loss is bugs in the homegrown software and/or operator error during maintenance operations, not a natural disaster. We infallible software engineers always underestimate that stuff, but it's usually the cause.

Also, I'm not worried. If that probability only concerns data loss on Backblaze's side, even if it's 1/10,000, then that's still not the probability of actual customer data loss. Because for that to happen there'd have to be a simultaneous loss of data on the customer side as well. That probably extends the durability considerably.


> We infallible software engineers always underestimate that stuff, but it's usually the cause.

My former boss used to say that 90% of all problems are cabling. His percentage may be off, but the sentiment certainly isn't.


Brian from Backblaze here - expanding on Yev's answer a little, a customer has an email address/password that they login to Backblaze's datacenter with. The email address is bound to a Backblaze "Cluster" for life (at least so far we have never migrated a customer between clusters). The cluster contains a variety of services, it's the unit of scaling for our company. A single cluster scales easily up to at least 200 Vaults (probably much more, we'll let you know), and any one customer's data will be spread across pretty much all the Vaults in that cluster with enough time. A customer backs up once per hour, and each time they are told to backup to a vault with spare space on it with no affinity. Theoretically it might be possible all your data lands on one vault, but it's really unlikely.

Vaults belong to a cluster - so your backup only puts data on the group of Vaults assigned to that one cluster, and the vaults in that cluster don't contain data from customers on OTHER clusters.
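
A toy sketch of the placement policy as described, purely for illustration (the structures and names are invented; Backblaze has not published its actual scheduler):

    # Toy model of the placement policy described above: a customer is pinned
    # to one cluster for life, and each hourly backup lands on any vault in
    # that cluster with spare space, with no affinity.  Names and structures
    # here are invented for illustration.
    import random

    class Cluster:
        def __init__(self, vaults):
            self.vaults = vaults  # e.g. [{"id": 0, "free_tb": 120}, ...]

        def pick_vault_for_backup(self):
            candidates = [v for v in self.vaults if v["free_tb"] > 0]
            return random.choice(candidates)  # no affinity: any vault with room

    cluster = Cluster([{"id": i, "free_tb": random.randint(1, 500)} for i in range(200)])
    print(cluster.pick_vault_for_backup())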


Yev from Backblaze -> Customers/files are striped across multiple vaults for added security, so the odds of all data from a single user going poof are fairly low; that would most likely require the datacenter falling into a surprise sinkhole.


It's really impossible to use that number in a meaningful way, since you also have to take into account that most instances of file loss on their end will be unnoticed anyway (you'll still have a local copy that will simply be uploaded again).


I know this is a hijack that doesn't have anything to do with the article, but every time I see a Backblaze article I can't help but say it: Give us a Linux Client already!!!!


Stay tuned.


Very much interested in this. I asked a while ago if I could build one myself, was initially told it was OK but was later told it's against your ToS.

Days, Weeks, Months, or Years?


Pretty sure we can't say much other than stay tuned at this point :)


Still no linux support?

Any plans for that or even just an API so I can write one?


Their website requires TLS but doesn't support TLS 1.2. What technology are they using to serve their website that cannot support a 7-year-old standard?


I'm happy their webserver has better cipher suites than it did last week.


Yev from Backblaze here -> we're working on it!


Brian from Backblaze here -> seriously, I just beat that team up today again, we'll get there.


FTA: "If one of the original data shards is unavailable, it can be re-computed from the other 16 original shards, plus one of the parity shards"

So you would have to read 17x the data to recreate it. Given disk latency, network bandwidth, etc. I'm guessing it'll take quite a while to recreate a 6TB HDD if it fails.


The drive operations (including waiting for the head to seek, etc) could take place in parallel. There's also the chance the data needed was in cache. So it might not be horribly bad.


We're talking about 102TB of data (6TB x 17), so a cache won't dent this figure too much (especially, given that each of those disks is one of 45 drives in the pod, which means the cache is shared across all of them). Then, each of the drives will also be serving files (or storing them), which means disk heads will be seeking around all over the place while rebuilding the failed drive...


Actually, it shouldn't take particularly long. The key is to distribute.

Sending 102TB of data to the pod that's rebuilding would take forever, this is true.

Instead you have each peer pod be responsible for 1/17th of the parity calculations.

1. 17 pods each read 17 megabytes and send them across the network.

2. Pod A gets megabyte 1 from all 17 peers, Pod B gets megabyte 2 from all 17 peers, etc.

3. Each pod calculates its megabyte of the replacement drive and sends it off.

4. Repeat until 6TB have been processed.

So this way each pod reads 6TB from disk, sends 6TB across the network, receives 6TB across the network, and calculates one nth of the data for the replacement drive.

It scales perfectly. It's no slower than doing a direct copy over the network.

Just make sure your switch can handle the traffic (which it already has to handle for filling the vault in the first place).
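
Rough numbers for that scheme, for illustration only: the 6TB drive size and 17-shard count come from the thread, while the per-drive and per-pod throughput figures below are assumptions, not Backblaze's:

    # Rough rebuild-time estimate for the distributed scheme above.  The 6TB
    # drive and 17-shard figures come from the thread; the throughput numbers
    # are assumptions for illustration, not Backblaze's.

    TB_MB = 1_000_000      # MB per TB (decimal, drive-vendor style)
    shard_mb = 6 * TB_MB   # each surviving pod reads and ships one 6TB shard
    disk_mb_s = 100        # assumed sustained read from a single drive
    net_mb_s = 500         # assumed usable per-pod network throughput

    # Distributed: every pod streams its shard once, so wall-clock time is set
    # by the slowest per-pod stage, not by the 102TB aggregate.
    distributed_h = shard_mb / min(disk_mb_s, net_mb_s) / 3600
    print(f"distributed rebuild: ~{distributed_h:.0f} hours")   # ~17 hours

    # Naive alternative: one rebuilding pod pulls all 17 shards (102TB) through
    # its own network interface.
    naive_h = 17 * shard_mb / net_mb_s / 3600
    print(f"single-pod rebuild:  ~{naive_h:.0f} hours")         # ~57 hours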



