Petabytes on a budget: How to build cheap cloud storage (backblaze.com)
401 points by bensummers on Sept 1, 2009 | 105 comments



One of the smart guys on our (Y!) cloud team just pointed out something to me that hadn't occurred to me.

This system is definitely optimized for backup. That makes total sense for Backblaze. However, it's important not to compare it like-for-like with something like S3, which is optimized for much better read/write performance.

At a basic level, the cooling on this system seems minimal. Those tightly packed drives would sure get hot if they were all spinning a lot. More than that, since they are using commodity consumer hardware and have already used up their PCIe slots for the SATA controllers, there isn't any place to add anything beyond the gigabit (I assume) Ethernet jack on the mobo. That means their throughput is limited.

Again, this is a great system for backup. Most of the data will just sit happily sipping little power. However, if you are thinking of this as equivalent to a filer, that's an unfair comparison.


Fast 120mm fans can move a pretty decent amount of air - up to 120 ft^3/min each, and they used three in parallel.

It would take about a dozen 80mm normal-speed computer fans to reach this.


I was thinking about the IOPS you can sustain with this type of device, and as mentioned here, this is intended for backup, so it may not be very good acting as a filer.

You can probably get high IOPS with all these drives in parallel, but probably not as high as the more expensive options (EMC, NetApp).

But I have to agree this is an awesome setup, optimized for a specific application. Way to go.


This is unbelievably awesome. Can you imagine what we could achieve if every company had this level of transparency?


Thinking the exact same thing! While the technology isn't more than any typical hacker could slap together in their apartment (minus maybe the case fab), I think the application is ingenious, and the full blog post about it, as dmillar pointed out, is rather awe-inspiring. And I was all happy because my new Core i7 box has 2TB of space and 6GB of RAM... sigh

That being said - did Seagate ever fix the firmware issue on their 1.5TB drives that would cause random corruption? (I heard about it maybe 6 months ago.)


> While the technology isn't more than any typical hacker could slap together in their apartment

I think you missed the part about testing a dozen SATA cards, etc.

The attention to detail here is a lot more than something you'd slap together in your apartment.


I would have some smart-ass comment here but I just read your HN profile. Good day, Sir.


smart move ;-)


This is what always gets in my way. It takes a lot of work and a lot of expertise to put together a home-grown system that works as well as one from the major vendors. If you're only going to be using one or two, you're much better off going to one of those major vendors because a large part of the price is their expertise and testing that went into it. For a large setup like Backblaze they can spread the cost of design over many systems, but for smaller companies it just isn't feasible.

We hacker types love to think that we could do the same thing in no time with little budget, and I'm sure we could get a first approximation. But the devil is in the details. Debugging the complex interaction of 20 different hardware components is not my idea of fun.

Hats off to them, particularly for sharing.


I did this as an in-house backup for a data warehousing app. Just slapping 4 IDE cards into a case and putting 16x250GB IDE drives on them resulted in a system that would copy about one disk's worth before hanging with some fault or suddenly dropping to 1% speed.

Just because you can in theory hook 40 drives to n cards doesn't mean it will work - well done to them.


Secondary revenue streams anyone? I know we'd be interested in becoming a customer if Backblaze at some point decided to monetize this part of their knowledge too.

This seems like an opportunity to disrupt from the low end of the market for some scrappy enterprising engineers. If they don't do it, someone should.


I don't think it takes quite as much effort as you imply, since a fairly basic storage array from one of the major vendors is $100k. (And then there's "maintenance" and software upgrades, but that's a separate matter).

Even if you got a $100/hr consultant, $100k would buy half a year of their time. I suggest it could be done in half that time by someone with a suitable background. Someone with the specific expertise should be able to do it in 6 weeks elapsed and 2 weeks of full-time work, tops. (Been there; done that.)


As for the firmware 'bug' - yeah, they fixed that one. However, consumer SATA disks are still not something you want in most RAID systems. See, a consumer disk, when it encounters an unrecoverable error, retries over and over, hanging your RAID until you fail the bad drive by hand. (Tested on a 3ware hardware RAID and with MD; I am given to understand that ZFS handles this in a sane manner.) Western Digital lets you change that behavior - search for WDTLER - but in general it's something to watch for.

I'm sure the Backblaze folks have built something that handles that sort of thing into the system, but just saying: it's something you have to think about before you slap consumer drives into a RAID that needs to keep working when one drive fails.
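
For anyone trying this at home, a minimal sketch of failing a hung drive out of a Linux MD array by hand (array and device names here are hypothetical):

  # A hung consumer drive can stall the whole array until you manually
  # mark it failed and detach it so the array degrades cleanly.
  mdadm /dev/md0 --fail /dev/sdc1
  mdadm /dev/md0 --remove /dev/sdc1
  cat /proc/mdstat                  # confirm the array is running degraded

  # After physically swapping the disk and repartitioning it:
  mdadm /dev/md0 --add /dev/sdc1    # kicks off the rebuild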


The cloud storage way is to avoid RAID 5 or 10 etc. completely and just use triple JBOD instead - store each file / chunk of data on at least three JBOD drives and manage replication and whatnot in software. I think Google first popularized this technique.
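
A minimal sketch of the idea, assuming three JBOD drives mounted at the hypothetical paths /mnt/d1 through /mnt/d3 - the software layer, not a RAID controller, owns replication:

  # Naive software replication: write each chunk to 3 independent drives.
  CHUNK=/incoming/chunk-0001
  for d in /mnt/d1 /mnt/d2 /mnt/d3; do
    mkdir -p "$d/chunks" && cp "$CHUNK" "$d/chunks/"
  done
  md5sum /mnt/d?/chunks/chunk-0001   # any single-drive failure leaves two replicas

In a real system the three copies would land on different machines (or datacenters) and a metadata service would track placement, but the principle is the same.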

The whole point of this exercise is reducing cost across the board, and upgrading consumer SATA disks to enterprise ones would make this setup a whole lot more expensive, since the drives make up most of the hardware cost.


But triple JBOD takes more space than RAID 5 - in terms of disk cost, it's worse than RAID 10, raidz, or mirroring. Enterprise drives are maybe 20%-30% more expensive than consumer disks, while replicating your data 3 times is 50% more expensive than a mirror or RAID 10.

ZFS, if it does deal with consumer drives as well as it claims to, would solve the problem at the same disk-space cost as RAID 5.


To get reliability without replication, it's not enough to use enterprise drives; you need redundant controllers that are multipathed to the disks. This is fairly expensive.

If you want a shared-nothing cluster with less than 3x overhead you can use erasure codes. The software complexity is significant, but at scale it should be the cheapest option. http://cleversafe.org/ or http://allmydata.org/trac/tahoe
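
As a toy single-machine illustration of the overhead advantage, the par2 tool (Reed-Solomon erasure coding over files) can protect data with, say, 30% redundancy instead of the 200% extra that triple replication costs; clustered systems like Tahoe spread the resulting shares across machines instead:

  # ~30% redundancy: recovery blocks can repair up to that much loss.
  par2 create -r30 chunk.par2 chunk-0001.bin
  # ...later, after corruption or a lost piece:
  par2 verify chunk.par2
  par2 repair chunk.par2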


One solution I'm looking at is essentially mirrored RAID 5 over iSCSI, but it gets around the RAID-card single point of failure. The idea is that I have two ZFS boxes (to begin with / for testing, these are OpenSolaris Xen guests that control the spare drive slots I have in boxes that are primarily doing other things). On each of those, I raidz the available drives and then export iSCSI LUNs of some standard size. The client box takes an iSCSI LUN from each OpenSolaris box and mirrors them with whatever software RAID stack the client likes (sketch below).
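
A minimal sketch of that plumbing, using the 2009-era OpenSolaris shareiscsi shortcut on the targets and Linux MD plus open-iscsi on the client (pool, zvol, host, and device names are all hypothetical):

  # On each OpenSolaris box: raidz the spare drives, carve out a LUN.
  zpool create tank raidz c1t1d0 c1t2d0 c1t3d0 c1t4d0
  zfs create -V 500G tank/lun0
  zfs set shareiscsi=on tank/lun0    # expose the zvol as an iSCSI target

  # On the client: log in to both targets, then mirror the two LUNs.
  iscsiadm -m discovery -t sendtargets -p solaris-a
  iscsiadm -m discovery -t sendtargets -p solaris-b
  iscsiadm -m node --login
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc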

The big win here is caching: each of the OpenSolaris boxes can use write-back caching, because all data is mirrored to the other OpenSolaris box. Much like a NetApp or EMC with dual heads, I don't have to worry about disk inconsistencies caused by the write-back cache unless both halves fail. (Of course, a 'real' dual-headed SAN will switch to write-through caching if one head dies, and doesn't require a full rebuild when the other head comes back online, but I can't afford a 'real' dual-headed SAN.)

Rebuild times, I imagine, will be quite significant.

Now, I am very leery of the performance of this system (these being Xen instances, it wouldn't surprise me if it were unsuitable for anything but tape replacement), but I haven't tried it yet. It seems, though, that it would work just fine if I weren't too cheap to use real hardware.

If you wanted to add complexity, you could have 3 or more of these OpenSolaris boxes and use software RAID 5 on the client to save space.


Are your target boxes virtual machines?

I always prefer the strategy where mirroring the data happens before the disk is presented to the VM. Have a look at GlusterFS and see if that might be an option? I know it wasn't playing nice with ZFS for a while there, though.


Virtual private servers, yes. I use Xen.

And I agree, giving the user two block devices and expecting them to mirror is a very bad idea. I know this from experience: in an earlier setup, I'd give each user two block devices, one from each disk in the box, the idea being that the user would mirror or not as appropriate. Well, you lose a customer's data, you lose the customer. Lesson learned.

But this is where my virtual setup gives me an edge. On every Xen server, I have a control guest or driver domain, the dom0. All disk and net I/O goes through the dom0 anyhow, so there wouldn't be much more overhead in doing the mirroring in the dom0 and passing the md device through to the domU (what the user controls). I'm not adding another point of failure, either; if the dom0 chokes, all guests on the system will crash regardless.


In a RAID 5 set you lose a lot of throughput when a drive fails, and with TB drives it takes many hours for a hot spare to replicate the missing data. During that time, another drive failure is catastrophic for your data, and the more disks in your RAID 5 set, the more likely this scenario becomes. When you get to petabytes of data, such catastrophic failures are just a matter of time.

With RAID 10, again, you have to build large arrays to beat triple JBOD for storage efficiency - a 12 x 1TB array yields 5TB of usable space while triple JBOD gives you 4TB. BUT you still have a single point of failure - your RAID card can give out and your array is unavailable. With triple JBOD you can have your data on different physical machines, or even in geographically separate data centers if you wish.

The advantages outweigh the reduction of storage density when you're dealing with petabytes of storage - why else would Google et al be doing it?


RAID 5 sucks hard when a drive fails, you're right. But see, for me, what I need is something that can be mounted as a block device by a Xen domU - something that performs reasonably well.

On top of that, I am kind of dumb compared to the sort of person I'd want writing my block device drivers. I want to take well-tested, open-source software components and plug them together in a clear manner. I don't know of anything off the shelf that will give me a filesystem with reasonable performance on a 'triple JBOD' system, unless you mean running md as a RAID 1 with 3 drives. (I'm actually doing that on my more remote servers; the idea is that I can wait longer before replacing a bad disk - see the sketch below.)
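
A minimal sketch of that 3-way md mirror (device names and mount point are hypothetical):

  # RAID 1 across three members: two drives can die before data loss,
  # so a failed disk can wait for the next scheduled trip to the rack.
  mdadm --create /dev/md0 --level=1 --raid-devices=3 \
      /dev/sda1 /dev/sdb1 /dev/sdc1
  mkfs.jfs /dev/md0 && mount /dev/md0 /srv/backup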

The system I proposed above basically exports drives that may fail to clients, who can then do their own redundancy. Because the drive is specified as 'may fail', write-back caching may be utilized. (God, RAM has become cheap.) The client, in my case, will be the dom0 of the domU that wants the space, but if I were selling this space to random people on the internets, the client could be some box running md that treats its iSCSI devices synchronously, meaning it waits for the write to return from both MD devices before it acknowledges the write. If the intermediary client device did no caching at all, it seems I might be able to set up IP failover. (Though that part sounds dangerous; I'd need to be very careful with 'fencing' it or what have you, so that the two nodes were never active at once.)


Yeah, I see your point. I think in the end it still comes down to scale: for a couple of racks it's not worth the risk and effort, while on a larger scale it might be worth brewing your own meta-filestorage system.


Yeah, that's my thought. I'm at a couple of racks now, but I want a solution that will scale beyond that. Thing is, to be honest, the Coraid gear will probably be more reliable (and cheaper, if you include my time) than stuff I build myself for the first few units, until I learn what I'm doing. But long-term, I could be significantly better off if my idea works as planned, as my units are potentially much cheaper and have a whole lot more cache (and cache redundancy) than the Coraid.

On the other hand, if I handle the 'beta' poorly, like taking it out of beta early, there won't be a long term to worry about.


Does anyone use RAID if they could use ZFS? These people could be on BSD and so could use ZFS for free.


Yes, they did a firmware update. Here is one article that talks about it: http://www.tomshardware.com/news/seagate-barracuda-1.5TB-fre...


Seagate took about 4 months to fully resolve the issue (their first fix had a bug!). I had a drive that bricked when applying the fix several months after the initial firmware updates - apparently a buffer overflow in the error log that locked the drive on reboot.

Seems dangerous to source all your hardware from one vendor over a small timeframe. Would be nice to have some redundancy over manufacturers too.


I have the 500GB model in the same series (with the corrupted firmware), and the fix did very nice things for it. Also, I have never had problems with the 7200.12 (the newer version of the same drive) in the 1TB model.


I should think so - I have 4 of them in my ZFS pool, and scrubs are turning up no errors.


I always thought it was best to focus on what I did best (application software) and leave the infrastructure to others. Until I saw this:

  Raw Drives      $81,000
  Backblaze      $117,000
  Dell           $826,000
  Sun          $1,000,000
  NetApp       $1,714,000
  Amazon       $2,806,000
  EMC          $2,860,000
I had no idea. Kinda makes one rethink what business they want to be in.


  Backblaze      $117,000
  [...]
  Amazon       $2,806,000
I cry foul. Backblaze's "67 TB" pods actually only hold 58.5 TB, so their hardware cost per PB of storage is $134k, not $117k; and that's without any high-level redundancy. Servers fail -- both catastrophically, and by silently corrupting bits -- and Backblaze's $134k / PB doesn't have any protection against that. Datacenters also fail -- power outages, cut fibre, FBI raids, etc. -- and any system which stores all of its data in a single datacenter lacks any protection against that. Store each file on two different servers in each of two different datacenters, and suddenly Backblaze's $134k turns into $536k. The price for Amazon, in contrast, is based on the assumption that their prices remain fixed for the next 3 years -- which seems a rather radical assumption.

Is backblaze's solution cheaper than S3? Absolutely. But they're also twisting the numbers a bit.


Interesting, but Amazon doesn't guarantee most of those things. They guarantee 99.99% uptime, but that isn't counting a complete datacenter failure. In fact, it sounds to me as if they have a similar setup to the Backblaze people.

99.99% uptime means roughly 1 hour per year of downtime. I don't know what the specific failure rates on the components are, but it seems reasonable that A) the data drives are hot-swappable, and will not cause downtime when they are replaced and B) the rest of the components fail once a year (or less) and take ~10 minutes to replace and reboot the system. With 4 main places of failure (PSU, Boot Drive, Motherboard/Ram/CPU, Drive controllers), as long as you have staff constantly on call and they can respond within 5 minutes of a failure, 99.99% uptime seems reasonable.

I don't know where you got the idea that your data is on 4 different servers when using S3. I can't find even the slightest amount of information on that. Yes, that would be nice, and it is cool to think, but it's rather doubtful that they're actually doing that (or they could probably add another 9 to their uptime).


Re. geographical redundancy, see comment http://news.ycombinator.com/item?id=422574 :

"Amazon keeps at least 3 copies of your data (which is what you need for high reliability) in at least 2 different geographical locations. "

I can't find the original source supporting that statement, but I also know it to be true based on direct contact with the AWS team. (I've been using AWS since the private alpha of EC2 in 2005.)


I can't find that anywhere either, and would be interested in seeing it. In fact, all I can find is: "A bucket can be located in the United States or in Europe. All objects within the bucket will be stored in the bucket's location, but the objects can be accessed from anywhere." That seems to imply that your data is located in one location, not two.

In addition, looking at the actual S3 contract, they're really only guaranteeing 99.9% uptime, which allows for up to 8 hours of downtime a year - more than enough to completely rebuild the outer server once a year, as long as they can keep the data intact (which they seem more than capable of with their setup, once again assuming their data center is not completely destroyed).


>but that isn't counting a complete datacenter failure

I believe that S3 replicates to multiple physical locations. So, while you might experience downtime, you probably won't experience data loss.


Another thing they aren't taking into account versus S3 is power costs, physical datacenter rent, and bandwidth hookup. It's still probably a lot cheaper, but it's important to note that they are comparing apples and apple cider.


On their graph it says they subtracted data center costs from the price of S3.


What does S3 charge for transfer? $0.10 per gigabyte? I bet that covers much of the data center costs.


That's just the cost of hardware. How much more do you think it would cost if they included all their research and development, installation and assembly hours, etc?

That is just the cost to buy all the components they have. If you budgeted that in you'd just have a bunch of boxes at your office that weren't even wired up. On the other hand, from Dell or Sun you at least have all the hardware in the chassis, if not some basic configuration and OS install done for you. Go with Amazon and you've already got it racked up in a data center with dedicated techs and replacement parts on hand.

It isn't a fair comparison.


So you're currently paying $800k+ per box to install Solaris?

Tell you what, since you're a fellow HNer, I'll save you some money and do it for half that. I'll even throw in some basic configuration. You're welcome!


It really depends on what they ascertained their needs to be. Perhaps they don't feel they need the additional protection, or they have another backup scheme in mind. If their solution works for them (which it seems to), then indeed, for their needs at least, the cost comparison is pretty accurate.

It might not add up the same for your needs (assuming you want multi-data center redundancy of all data, etc...).


Well, to be fair, Backblaze is using consumer-grade hardware here. No redundant PSUs. No mirrored OS drives. Cheaper components. Etc.

They're going Google's route, which is fine, it's just... a whole different direction than a Sun Thumper. I'd be interested to see how their costs per petabyte stack against Google's.


No, they aren't going Google's route!

Google's operation has at most half the disks-per-rack density of Backblaze, but with 4 or 8 times the server density. Google does almost all their computation on the same nodes where the data is stored. Google is also storing each tablet a minimum of 3 times within each cluster, with most systems having multiple replica clusters.

Backblaze's system is going to have several orders of magnitude less bandwidth:

  way too many drives in one server (some of them on the PCI bus!)
  use of port multipliers (causing 5 drives to share one SATA cable's 300MB/s)
  RAID6 with too many drives per array (15-way XOR is no fun)
  why use JFS?
  only access is via HTTPS, not clear if SSL is done on the pod


I'm guessing that backups are accessed rather infrequently and that someone recovering their data won't mind a few seconds in lag time. This isn't my area of expertise, but their trade off seems sensible. It's just a different scenario than what google is dealing with.


As a sysadmin for organizations of different sizes, I've been acutely aware of this for quite some time.

The more painful corollary is that, even if one measures not by total storage but by performance (a more critical aspect of, e.g., databases), the prices are still an order of magnitude or two apart.

I would argue that the reason "No One Sells Cheap Storage" in the backup/archive sense is that there isn't enough demand. Obviously, this is starting to change.

I am, however, startled that a driveless pod is so expensive ($2467 or $54.82 per drive), considering that external performance is, effectively, limited by gigabit Ethernet.

Two SuperMicro 846E1s would be $2k for 48 drive bays, and one could easily connect 10 of them to a single $300 SAS RAID card. Add another $1k for mobo and cables, and that's $11300 for 240 drives or $47.08 per drive, without having to do a custom case.

Granted, it takes up almost twice the rack space, but it's still a decent power density at 50W/U AC (assuming 6.5W per drive DC).

It takes 52 groups of 13 1.5TB disks (hmm... 13 pops up an awful lot in their setup) to make a PB using their pods. 51 groups means 17 pods, which is 0.9945PB for $134k ($134.5k/PB).

Using my pods, it would be 0.936PB for $120.3k ($128.5k/PB), but what's $6k/PB between friends?


Two SuperMicro 846E1s would be $2k for 48 drive bays, and one could easily connect 10 of them to a single $300 SAS RAID card.

I don't think a single SAS card can support more than 128 (or maybe 256 if you're lucky) disks. This doesn't affect your math much, but I wouldn't want someone to try this at home and be disappointed. If you're not willing to build a custom case and don't feel like suffering the vagaries of SATA expanders, LSI SAS cards plus Supermicro JBODs are the way to go.


From experience, I can safely say that luck has nothing to do with it. LSI MegaSASes support more than 240, so it doesn't affect my math at all.

Similarly, from "at home" experience, I can also safely say that luck factors prominently in the vagaries of SATA port multipliers. The reason I even made the attempt was a lower price per port, which is why I'm so startled that a solution based on them would end up being more expensive than a SAS-based one.

Of course, if the camel of performance gets its nose into the tent of ones requirements, it has to be the SAS way or the highway.


You might also want to check out Coraid - it's at a similar price point: http://www.coraid.com/PRODUCTS/SR2421 . Now, I've heard mixed experiences with it and have none of my own, but if you are talking about wiring something up yourself, it's definitely worth considering.

Personally, I'm working on a system whereby I just fill up otherwise unused slots on my chassis and use that for backup. The system isn't done yet, but I've got 6 disks out there waiting for me to figure it out.


Hmm... looks suspiciously like exactly the SuperMicro case, for about twice the price of rolling one's own.


From what I understand, the Coraid systems are Plan 9-based and only have a gigabyte of cache RAM.

If I only needed one, I'd seriously consider it, as my time for setting up one unit and working out the kinks would unquestionably cost more than that, and they have quite a bit of experience working out the problems by now. Even if I made my system redundant, their system would unquestionably be more reliable than my first iteration.

However, I plan on needing a bunch more, so it probably makes sense for me to roll my own, work out the kinks, and hopefully end up with something cheaper that has a lot more cache.


This is not apples vs apples. Some of those items are living cows, and some are cooked steaks served with a side of mashed potatoes. I'm sure the cow looks cheap (per pound) until you are tasked with turning it into a steak.


They might have missed a trick here.

To address vibration, acoustics, and gyroscopic effects, what I've seen done in highly dense enclosures is to rotate every second drive 180 degrees - a bit of a shotgun approach to balancing things out.

Still, awesome.


This is the trick, though. They want to improve their system. By sharing with you they may have solicited some really helpful advice.


This sounded OK to me too, but when I thought a bit about it I realized that the drives would have to be synchronized so that each one would "cancel" the others' vibrations. The vibrations generated by the drives might indeed have the same amplitude and (spatial and temporal) frequency, but I doubt they'll have the same phase offset. For that to happen, they would need to be started at _exactly_ the same time, which I don't think happens in practice.

Do you have some pictures of those enclosures?


I don't have pictures. I can describe, though.

Two disks were mounted in a frame linearly, both screwed to the frame, with the power/sata connectors toward the middle of the frame, and one drive upside-down.

These cassettes were removable as a unit for hot-swap, and were inserted linearly into a half-deep 19" rack enclosure.

That the two drives were physically connected to the same frame, and removed and replaced as a unit, would make it seem as if they would be started in phase. Now, I'm no physicist, but I'm not 100% sure that's so important - if you have two contra-rotating gyroscopes firmly connected and running at the same speed, surely they resist movement by sheer gyroscopic effect?


Unfortunately the SATA backplanes are only aligned one way, so they could only rotate a set of 5 drives at once.


67TB of storage with 4GB of cache. I'd really love to see some performance numbers versus the way-too-expensive competition. If the systems are being used as tape-drive replacements, I could see this working well, but as an actual NAS-like device, I can't imagine how it could perform acceptably. Of course, if those Intel motherboards have the dual 1Gb/s NICs that Intel boards generally do, it will probably take a while to fill the drives anyhow.


http://www.wolframalpha.com/input/?i=(67+TB)+%2F+(2*(1+Gb%2F...)

It'll take about 3 days at the theoretical max of the networking equipment to read/write the 67TB (67TB x 8 bits / 2Gb/s is roughly 268,000 seconds, or 3.1 days). The overhead of HTTPS constrains the network further, so that estimate is a lower bound.

I'd expect that their internet connection (i.e. in/out of the data center) is the real bottleneck.

I believe that the system is being used as a tape drive replacement.


If the load is relatively sequential, then yeah. But if the load is mostly random, then there is an argument for having more cache.


Actually, that's the problem with most NetApp/EMC gear, too: they are way light on the cache.

These guys only use 4GiB of RAM because they are saving money on the motherboard, I bet. Personally, if I were building it, I'd increase the cost by another grand or so and use dual low-power Opterons with 32GiB of RAM. (Of course, that would also increase the space taken by the motherboard, so it would require some case redesign. Still, Opterons and registered ECC DDR2 are both incredibly cheap right now.)


What if they used AFS (the Andrew File System) with SSD drives for cache on the front-end nodes?


This made the Beowulf list yesterday, and below is what I wrote in response:

"Seagate ST31500341AS 1.5TB Barracuda 7200.11 SATA 3Gb/s 3.5″ Aargh! Should be definitely substituted by 2 TByte WD RE4 drive.

Today I've built a 32 TByte raw storage Supermicro box with an X8DDAi (dual-socket Nehalem, 24 GByte RAM, IPMI) and two LSI SAS3081E-R cards, and OpenSolaris sees all the (WD2002FYPS) drives so far (the board refuses to boot from DVD when more than 12 drives are in, though, probably due to some BIOS brain damage, so you have to manually build a raidz2 with all 16 drives once Solaris has booted up). The drives are about 3170 EUR sans VAT total for all 16, the box itself around 3000 EUR sans VAT. I presume Linux with RAID 6 would work too (haven't checked yet), and if you need more you can use a cluster FS.

Maybe not as cheap as a Backblaze pod, but off-the-shelf (BTO), and you get what you pay for."


Wow.

Do they offer S3-like storage? They should. If they can offer something like S3, but at one third of a penny per gigabyte per month (heck, let's splash out - a whole penny per gigabyte per month) I know quite a few people who'll be interested in talking to them... (including myself)


It seems to me that their business model, their requirements, and their implementation solve one problem, and they do it well. The problem Backblaze is solving is backups. Their front-end software sends a pile of data to the datacenter, where it is written en masse. Should a restore ever be needed, the data is then transferred back en masse. Both of these operations have needs that are different from an S3-type service: backup/restore can tolerate built-in latencies, slow seek times, etc. This allows them to take advantage of lower-end hardware.

For example, each of these "pods" only has 4GB of RAM. If I were doing lots of random I/O on 67TB of drives, I would want a heck of a lot more RAM for caching efficiency.


Apparently they use HTTPS for input/output, using Tomcat and some custom application.

That's a strange choice; HTTPS would incur quite a bit of overhead for something that is essentially a (large) drive at the end of a network cable, used internally only. Why the encryption?

quote from the article:

"A Backblaze Storage Pod isn’t a complete building block until it boots and is on the network. The pods boot 64-bit Debian 4 Linux and the JFS file system, and they are self-contained appliances, where all access to and from the pods is through HTTPS. Below is a layer cake diagram.

Starting at the bottom, there are 45 hard drives exposed through the SATA controllers. We then use the fdisk tool on Linux to create one partition per drive. On top of that, we cluster 15 hard drives into a single RAID6 volume with two parity drives (out of the 15). The RAID6 is created with the mdadm utility. On top of that is the JFS file system, and the only access we then allow to this totally self-contained storage building block is through HTTPS running custom Backblaze application layer logic in Apache Tomcat 5.5."

That's an odd choice for a storage server protocol stack.
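
For reference, a minimal sketch of one of those self-contained 15-drive RAID 6 volumes as the article describes them (device names and mount point are hypothetical):

  # One partition per drive, then 15 partitions into RAID 6
  # (13 data + 2 parity), then JFS on top.
  mdadm --create /dev/md0 --level=6 --raid-devices=15 /dev/sd[b-p]1
  mkfs.jfs /dev/md0
  mount /dev/md0 /data/array0
  # Per the article, the only access above this is Tomcat over HTTPS.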


That's a strange choice; HTTPS would incur quite a bit of overhead for something that is essentially a (large) drive at the end of a network cable, used internally only. Why the encryption?

To help prevent a network compromise from resulting in storage management compromise? Just because something is internal doesn't mean it's safe. Once a host/network segment is compromised, you don't want it to be easy to jump to the next.

Otherwise, you've built M&M security. Hard candy shell on the outside, soft gooey chocolate insides. Mmm.


One thought I had regarding that: their client talks to a "main server" management system, which talks to each pod using the Tomcat layer. The "main server" then replies to the client with an IP or similar to send data to directly. Since this goes over the web, HTTPS is used for security - and since HTTPS is already in use there, it's not much more work to use it all the way through for authentication etc.


It's actually a fair bit harder to offer generic key/value large object storage than to offer a backup system that talks to a client you wrote yourself.

Of course these days there are well-understood solutions to those problems, compared to when S3 first started, but it's still not stuff you can pull off the shelf. (Although with the Cassandra distributed database, the metadata problem is close to that point now. </plug>)


The cost they are quoting is the hardware cost. It isn't including developer time, housing costs, maintenance, etc.

Hardware is cheap, power and developers aren't.


I have been searching high and low for something like this for months -- I have a number of clients that have been begging me for this, as well as something for my own needs.

I'm excitedly posting a link to this on my personal site, and today I have a lot of phone calls to make to clients.

Why is it that the best products and services are also the hardest to find when you're looking for them?


What terrified me was the number of low-level issues they had to address. SATA protocol problems, custom-designed SATA cards? Wow. I'd have no idea how to begin with that stuff. As other posters have noted, this company's business is storage. How many PB must one host for such specialization to make sense?


What a pity - I'd like a way to store an encrypted backup directly (via SSH) instead of installing an app which crawls my system. Oh, did I mention I use Linux? sigh

Does somebody know of another company which provides this? I don't trust other people to crawl my system and "encrypt" it for me.

EDIT: Added Question


Did you look at tarsnap (http://www.tarsnap.com/)? I don't use it, but read about it here quite a bit and it sounds really great.


I'm not entirely sure if this is a troll or not, but...

rsync does exactly what you're looking for. You can also pipe it over SSH for encryption on the wire; type "man rsync" at a shell for a description of how, or see the sketch below.
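
A minimal sketch, with a hypothetical host and paths:

  # Incremental backup of /home to a remote box, encrypted in transit.
  # -a preserves permissions/times, -z compresses, -e forces SSH transport.
  rsync -az --delete -e ssh /home/ user@backuphost:/backups/home/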


How about http://allmydata.com/? They offer a JSON-based RESTful API, an FTP interface, or you can use their open-source software, which includes encryption.


Yeah, rsync.net


Service Idea: S3 offsite backups. Run http://www.eucalyptus.com/open/ on this kind of hardware.


These guys are very very good at what they do. Btw folks, no funding and they're already profitable. Gleb and crew rock!


I wonder about the single point of failure posed by the power supplies. One failed box is not a big deal (since I assume the data is replicated over several). But, what if they get a bad batch of supplies and see a relatively high failure rate? I wonder how high a power supply failure rate they can handle.

The need to stagger the power-on of the two supplies poses a problem. What if power to a data center is lost? When power is restored, all the boxes will try to start at once, blowing the fuses. Granted, this is a catastrophic event, so its frequency should be very low. But this also seems like an area that could be automated.


You should only use 75% of the rated capacity of a circuit, which means you have enough headroom to turn them all on at once.

Some of the more expensive managed PDUs also support staggered power-on after a power failure. But I don't worry about it; only using 75% of the power circuit solves that problem for me.


Unfortunately, using 75% of the rated capacity may not be enough to handle the inrush. The article discusses this point: "...if you power up both PSUs at the same time, it can draw a large (14 amp) spike of 120V power from the socket." That would mean one pod per 20A circuit. Ugh. Under normal operating conditions, a 5.6A max load would allow three pods per circuit.

Addressing this would require a little design work, but the problem is relatively simple. If they wanted to get fancy, they could add a chaining feature: pods on the same circuit would be connected together so that they'd power on serially. That would get away from their goal of using off-the-shelf parts, though. It is, as with many things, an engineering trade-off.


The PDUs that support staggered power-on are off the shelf. If that's not an option (really, we're talking maybe $500 per 20A circuit, retail), the next thing I'd do is set 'power on after power fail' to off, then have some remotely accessible way to trip the power button. (I'm working on a solution to that particular problem, but it's not off the shelf - yeah, everything is on PDUs I can trip remotely, but there are reasons why it is much better to ungracefully reboot with the reset jumper than by cutting the power.)

From there it would be easy enough to have an automated process turn the servers on one at a time.


Great stuff. Would anybody care to clarify why access is through HTTPS and not HTTP?

I presume all accesses to these pods are from within their data center? Or do they directly expose these boxes to clients (whoa!)?


Backblaze runs online, off-site backups (like Mozy and Carbonite) at $5/month for unlimited storage. HTTPS is used to keep client data protected, I assume.


The price comparison with other solutions isn't really fair. The cost per PB of these storage building blocks is directly compared to complete storage solutions from companies like NetApp and EMC. In the end the cost of the complete solution they assemble these building blocks into may well be cheaper than solutions from other companies and that's the number they should be using for comparison.


I love how open they are about their solution, and I also want one for Christmas:)


Awesome. Brings up a few questions for me (sure there are answers, just curious really).

Why a Core 2? An Atom mobo would be lower-power and cheaper. Why 4GB? Seems like overkill. They are using a hard drive to boot; couldn't they boot off a USB key?


I don't know for sure, but there might not be an Atom-based board that has as many PCI/PCIe slots.


Good point. The one I've got only has one PCI in fact.


Very interesting insight! I may try building a scaled-down version just for fun.

The cost comparison between raw drives, their custom solution, Amazon S3, etc. was a little skewed. S3 is designed for pay-as-you-go storage, so you're not paying for capacity you don't need. If you just need a few dozen gigabytes, it's a much better deal. If you need terabytes or a petabyte, a dedicated storage solution is more economical.

It's the same argument as vacation house vs timeshare. If you lived in a timeshare all year, it would cost more than buying the house.


Seems along the same lines as Capricorn Tech. http://www.capricorn-tech.com/products.php


Not really. Backblaze's post is about how to get online storage up for the lowest cost per bit. They appear to have done a great job at quadrupling the density of standard 1U setups for a similar price.

Capricorn appears to be in the business of selling standard 4-drives-per-1U setups and support. The fact that you have to contact their sales department to even get close to a price seems to indicate they're not competing on price.


What happens when one of the PSUs fails catastrophically, pushing a big electrical surge through the hard disks and frying half of them?

I didn't see anything in here that discusses that eventuality - and when you have that many servers, it IS going to happen at some point...


It's addressed in the section "A Backblaze Storage Pod is a Building Block"

From the article:

When you run a datacenter with thousands of hard drives, CPUs, motherboards, and power supplies, you are going to have hardware failures - it's irrefutable. Backblaze Storage Pods are building blocks upon which a larger system can be organized that doesn't allow for a single point of failure.


Sounds to me like they haven't implemented this yet. I wouldn't want my data on that sort of solution; backed-up data surely needs a level of geographic redundancy.


In the next section they say they have implemented machine-level redundancy. As for geographic redundancy, I think the idea is that you use them for that - as in, you back up your data locally and then send it to them as the redundant copy.


It is called a fuse.


To get all the various logos and certifications on them, PSUs are required not to do that. Every line is voltage-limited.


I'd love to hear what they are doing to monitor their system; they didn't mention that. Three out of 15 drives failing and losing a volume seems quite likely at that scale.

I'd like to know what levels of warnings and alarms they use, and with which system - e.g. Nagios, etc.


How do you replace a drive? With so many drives, it seems like a lot of effort to pull the entire server out just to access one drive.


They have three RAID 6 volumes per machine, so they probably leave it in the rack until two of the volumes are unusable and then refurbish the machine.


I doubt it very much. When drives start failing out of a RAID 6 array, you lose its ability to automatically detect and correct single-bit errors on disk.

With that much data, single bit errors will happen predictably.

If their usage patterns are anything like ours, disk failures are well under normal rates. Most backup data is stored and ignored, so the drives just don't get much stress. Downing a machine for maintenance is probably acceptable for their usage pattern.


effing love this.

I wonder what the reliability stats on this setup are, though. Is it really cheaper to jam all those drives in one unit without redundant PSUs, mobos, or a boot drive?

I'd guess you'd have to build at least 2 of these units and mirror them to get any sort of reliability. And at that point, how long does it take to copy 58TB across HTTPS?

data is hard.


I would assume they don't care about any of the hardware, just the data. If you look at the setup and how the drives are RAIDed, the 45 drives are sub-divided into three 15-drive RAID 6 arrays, and 3 drives in the same array would have to die before they lost data. Essentially, they would need 20% of an array's drives to die simultaneously for them to lose data.

Now, for the rest of the hardware, it's not that important if it fails. If one of the other components dies, you're only looking at some downtime (and possibly a dead hard drive or two from a dying PSU, which I assume they monitor regularly). As long as the data is secure and in one piece, it doesn't really matter whether the pod is up or down until someone needs the data. Just send out your repair guy to replace the part and reboot it, and it's fine.


In sysadmin terms, they have aimed for data integrity over availability.

If a system goes down and they have to replace a disk, a power supply, or a motherboard, the data is still safe - and if by chance any $5/month users need to restore data held on that one unit, well, they can wait a few hours :-)


Great read! Why a separate boot disk? Why can't that be on the storage array?



