Amazon’s Glacier secret: BDXL (2014) (storagemojo.com)
213 points by another on Jan 15, 2017 | 71 comments



They're almost certainly doing something like Microsoft's Pelican: https://www.microsoft.com/en-us/research/wp-content/uploads/...

The first comment on TFA says as much.

Edit: This is the actual Pelican paper: https://www.microsoft.com/en-us/research/wp-content/uploads/...


My quick TL;DR for other readers (please let me know if I'm wrong):

The systems described are servers over-provisioned with an order of magnitude more hard drives than you would typically see in a standard server. This gives a much higher ratio of total storage capacity to other components, cooling, and power, at the cost of only being able to spin up a small fraction of the drives simultaneously. It then becomes a complicated scheduling problem: deciding which drives to spin up, and when, based on the total workload of everything you are trying to read and write at a given time. This system makes sense for data that is expected to be accessed on the order of once per year or less.
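For intuition, here's a toy sketch of that scheduling idea (nothing like Pelican's actual algorithm; the group/request shapes are made up). Pending requests get batched per drive group so each spin-up is amortized over as many reads and writes as possible:

    # Toy sketch: only `max_active` drive groups may be powered at once, so we
    # batch pending requests by group and serve the busiest groups first.
    from collections import defaultdict

    def schedule(requests, max_active):
        """requests: list of (group_id, blob_id); returns ordered batches."""
        by_group = defaultdict(list)
        for group_id, blob_id in requests:
            by_group[group_id].append(blob_id)
        # Busiest groups first, max_active at a time, so each spin-up
        # amortizes over as much queued work as possible.
        ordered = sorted(by_group.items(), key=lambda kv: len(kv[1]), reverse=True)
        return [ordered[i:i + max_active] for i in range(0, len(ordered), max_active)]

    reqs = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (1, "e"), (2, "f")]
    for batch in schedule(reqs, max_active=2):
        print("spin up groups:", [g for g, _ in batch])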


Yep. This is basically what they do. Source: I was around when they were designing and building Glacier.


This is very cool.


So Pelican fits 1152 WD WD6001F4PZ 6TB disks in 52U (Source: https://www.microsoft.com/en-us/research/video/rethinking-st...).

I wonder if Microsoft is going to publish specs (like Open Compute?) for the Pelican servers.


I highly doubt this, especially given the introduction of the nearline tiers.

It's probably just very widely striped, price-segmented data.

Also, see: https://news.ycombinator.com/item?id=4416065


I thought this was debunked at the time?

When I was running Exchange systems, our biggest challenge was delivering IOPS. We had to use a SAN, and we wasted significant storage because we'd spend our IOPS budget at 40-60% of storage capacity.

I figured at their scale they would have similar problems.


IOPS isn't important for Glacier. You just upload to some buffer and they eventually move it to the slow storage.

Reading is pretty slow from glacier.


He's asking whether EBS has the same issue as his Exchange servers. To explain in more detail: if you have 10 TB of disk space with 10,000 IOPS and your users buy 4 TB with 10,000 IOPS, then you have 6 TB of storage wasted.

If Amazon has that problem with EBS, then selling that storage capacity as Glacier and using just the idle IOPS (or leaving a small bit reserved) allows them to sell capacity that would otherwise just be useless.
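Back-of-the-envelope, using the numbers above (just the commenter's hypothetical, not real EBS figures):

    # IOPS sell out before capacity does, leaving stranded space.
    total_tb, total_iops = 10, 10_000
    sold_tb, sold_iops = 4, 10_000      # customers bought all the IOPS
    stranded_tb = total_tb - sold_tb    # 6 TB you can't sell as EBS any more
    print(f"stranded: {stranded_tb} TB")
    # That 6 TB could still be sold as an archive tier that promises almost no IOPS.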


Aren't IOPS incredibly expensive on Glacier? There was that guy who paid $150 for a retrieval. https://medium.com/@karppinen/how-i-ended-up-paying-150-for-...


That's the point. They aren't trying to sell IO with Glacier, since they've already saturated that with EBS. They just want to sell the spare storage capacity, ideally in a write-once, read-never use case. That way they can get 100% utilization out of the drives.

So if you use a lot of IO with Glacier, they are going to charge you like crazy, since you're potentially impacting EBS customers.


I'm that guy. I should update the post; Amazon "fixed" the retrieval fees in late 2016 and I would've paid less than a dollar had the current pricing scheme been in effect when I did the retrieval.


Sorry I didn't really finish the point.

With Exchange, we had all of this expensive, reliable SAN storage that would be perfect for a low-requirement, Glacier-like solution. Unfortunately, we lacked the ops mojo to pull it off.


Archive is not about IOPS; it's about streaming bandwidth.

For example, I used to look after a Quantum iScaler 24-drive robot; each drive was capable of kicking out ~100 megabytes a second. It was more than capable of saturating a 40-gig pipe.

However, random IO was shite; it could take up to 20 minutes to get to a random file. (Each tape is stored in a caddy of (from memory) 10 tapes, there is contention on the drives, and then spooling to the right place on the tape.)

Email is essentially random IO on a long tail. So, unless your users want a 20-minute delay in accessing last year's emails, I doubt it's the right fit.

The same applies to optical disc packs (although the spool-up time is much less).


I think that's the point - the e-mail is using up all of the IOPS. There would be a small amount of IOPS left over that could deal with streaming data; that data is unlikely to be accessed on a regular basis. The capacity not used by e-mail would then be used for the archive - data that's pretty much write-only.


It makes sense for email when you aren't giving your users access to their old email, but storing it for regulatory compliance purposes.


Why do you care how fast you can read it back when you're storing it for regulatory purposes? Isn't that a sunk cost? Buy high capacity, high reliability and don't care for the read speed?


With SANs, the IOPS budget is a function of your hardware config. If you want more IOPS, you get more RAM/SSD involved. More importantly, Amazon gets to sell EBS on their terms: a specific amount of IOPS with a specific amount of storage. If you want more IOPS, you have to buy more EBS. The "wasted storage" you're thinking of would be on your instance using EBS, not EBS itself.


Using BDXL seems like a pretty good solution. Most of this data is archival, and existing data is very unlikely to change. You can use HDD/SSD as a buffer as users upload data, and then optimize the packing to ensure you're using all available space on a disc, possibly encrypting each user's data block on the disc. The system itself would only need to track metadata (file metadata, cartridge/disc, key). Deleting a file would be deleting the key and marking the file as inactive. Once/if a cartridge is marked as completely deleted, it can just be recycled.
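A rough sketch of that metadata/crypto-delete idea (all structures here are made up for illustration, not anything Amazon has described):

    # "Delete = forget the key": only metadata and per-blob keys stay online;
    # the ciphertext sits immutably on a write-once cartridge.
    import secrets

    class ArchiveIndex:
        def __init__(self):
            self.records = {}  # blob_id -> {cartridge, offset, length, key}

        def store(self, blob_id, cartridge, offset, length):
            # One random key per blob; the blob is encrypted with it before burning.
            self.records[blob_id] = {"cartridge": cartridge, "offset": offset,
                                     "length": length, "key": secrets.token_bytes(32)}

        def delete(self, blob_id):
            # Crypto-erasure: drop the key and the burned bytes become useless.
            self.records[blob_id]["key"] = None

        def cartridge_reclaimable(self, cartridge):
            return all(r["key"] is None for r in self.records.values()
                       if r["cartridge"] == cartridge)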


I feel it will go the way of Data8.


Nice investigation. Also, Facebook has been using 50 GB Blu-rays (http://www.businessinsider.com/facebook-uses-10000-blu-rays-...) and is moving to 300 GB (http://www.businessinsider.com/ces-2016-facebook-uses-panaso...).


And Sony has been pushing a BD-based backup/archiving system for business use, IIRC, built around multiple discs in a cartridge.


Could you provide me some insight into why optical storage is a better solution than standard HDDs? Is it just the cost, or is cooling / form-factor a big part of it?


It is cost that provides the upside for Blu-ray over HDDs.


Optical discs cost cents when purchased in bulk; they are literally just plastic plus some coating, and no expensive metal or electronic parts are required.


And, to my knowledge, at least one specific principal-level engineer worked on both systems.


Why isn't anyone thinking tapes? You can get LTO-7 tapes for $0.008 per gigabyte that allow 100-300 writes before the tape should be destroyed. Quantum and HP make monstrous tape libraries that hold 5-10 petabytes per rack. You can also cartridge-ize your library for even denser storage on a literal warehouse rack somewhere.

Tapes also match the slow retrieval speeds as you have to read the data out onto a drive linearly.
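Rough math on a rack, assuming 6 TB native per LTO-7 cartridge and the $0.008/GB figure above:

    cost_per_gb = 0.008
    cartridge_tb = 6                              # LTO-7 native capacity
    rack_pb = 5                                   # low end of the 5-10 PB claim
    cartridges = rack_pb * 1000 / cartridge_tb    # ~833 cartridges
    media_cost = rack_pb * 1_000_000 * cost_per_gb
    print(f"{cartridges:.0f} cartridges, ~${media_cost:,.0f} in media per rack")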


Amazon already denied using tapes, which was mentioned in the third paragraph of the article.


There isn't a source mentioned for the denial of tapes but this article claims an Amazon insider verified that it is tapes:

http://www.theregister.co.uk/2012/10/18/ipexpo_2012/


Previous HN discussion about this: https://news.ycombinator.com/item?id=7647571


This is an extremely interesting deductive analysis. However, considering it is Amazon, there always exists that persistent "other" possibility: they're purposefully taking a loss.


Or at any rate, pricing based on net present value of archived data given decreasing costs over time for storage.

I do find it rather fascinating that AWS has managed to keep the technology used by Glacier, even at a high level (i.e. disks vs. tape vs. optical), so under wraps. My personal guess is that it's powered-down disk drives, on the grounds that that's the simplest long-term solution, but it's purely a guess.


Given the scale of Glacier, I'm surprised that Amazon is able to keep their underlying storage technology a secret.


This is one reason to assume it's unremarkable. If Amazon were buying and offloading truckloads of BDXL disks and drives, someone would eventually notice. A good explanation is, therefore, that the technology is unremarkable and boring.


Does anyone know about Google Nearline and Coldline storage? Google claims Coldline access within milliseconds.


> Google claims Coldline access within milliseconds

Well, that basically tells you almost all you need to know. It's disk in JBODs. The only question is SMR vs conventional. Anyone who knows that can't tell you in public.


If that's how Google is implementing it, and the prices are similar, then isn't that some evidence that Amazon might be implementing it the same way?


Amazon's latency is measured in hours. Whatever process they use involves literal cold storage: either disks that are completely switched off or some tape-like media archive system. But the article makes a good case for why it's almost certainly not tape.


It could still be normal disks, if we're feeling conspiratorial. It's easy to make faster storage slower; just add waits. Or maybe they make it take up to five hours so they can avoid peak traffic times in whatever data center your info happens to be located in.


I think it's because even for cold storage one copy is still persisted on disk.


I've got a USB3 BDXL writer attached at my desk and it is quite handy and not too expensive. I back up my whole data (work) partition to it every so often and occasionally take one over to a relative's house as my own home-grown "glacier" system.


Although your setup is interesting, isn't using 2.5" (*) SATA disks via a USB 3 adapter way cheaper and more flexible?

(*) So you don't need a power adapter.


Perhaps, but optical is lighter, much simpler, durable, can be mailed, and fits into CD cases for travel, etc.


And optical media is also immutable (read only) when you need it to be immune to any data changes.

I find that read only point in time backups gain value over time. Especially if you need to pull a file that would have been long rotated and replaced by newer backups on read write media (eg. HDDs).

Unfortunately, the market for this use case is not large, and this is reflected in the prices and in how hard it is to source high-quality optical media. For BD this would be (inorganic) HTL Panasonic media, which only has a market inside Japan itself. M-Disc is the other alternative, although it has only proven itself within the DVD market; classical HTL BD media is expected to be very similar in endurance to what M-Disc has on offer in the BD range.


Btw, I have one of those too.


Would you not be better off uploading it somewhere instead? If your relative lives close by, a natural disaster could destroy all your backups.


It's a huge state and if it fell into the ocean (wink, wink) data would not be a top-five concern of mine. ;)

Also, I prefer privacy. To each his own, however.


Regarding your last point, I am renting my own server in a datacenter on a different continent. So basically I have total control over my data etc.


> So basically I have total control over my data etc.

How would you stop someone who gains physical access to your server?


That is not really something I have to worry about if the server is in the datacenter of a reputable hoster.


If it's backups, you could just encrypt everything prior to uploading.
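For example, a minimal sketch with the Python `cryptography` package (key handling is up to you, and you'd want to stream rather than read whole files for big backups):

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()        # keep this offline; lose it, lose the backup
    f = Fernet(key)
    with open("backup.tar", "rb") as fh:
        token = f.encrypt(fh.read())
    with open("backup.tar.enc", "wb") as out:
        out.write(token)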


Do you burn BDXL with Linux? I have a burner and it's great at single layer; I haven't tried any BDXL media yet.


Yes, though now that I think about it, I am not sure I have broken into the third layer yet. The hardware (LG) says it is capable.


Ex-Glacier engineer... and no I'm not going to tell you what or how it's done. NDAs and all that jazz. These speculation threads always make for fascinating reading for people on the team.


Glacier in particular seems to attract the speculative fascination. Do people not realise the name is not in jest, it really is done with graphene-doped room-temperature ice crystals and laser interference lithography?


We used to joke among ourselves that actually it was done using vinyl records. Have you seen how many vinyl records you can fit into a single rack?

Added bonus, 9 out of 10 customers actually preferred the feel of their data when it is restored.


I liked the story that the truck-mounted Snowball came from Glacier tech. Amazon has been putting data on a truck and then driving it around Virginia. The delay in reading it back is the time it takes for the truck to arrive at a datacenter and plug in. :)


Smart man. :)


Are they packet-written?

That seems to have been the major stumbling block with higher-capacity optical media: one can't do the drag-and-drop writes that one has with spinning rust and flash chips.


Bulk transfer. They'd write it to a normal disk first, then slowly copy the data in bulk.


What if they are using 2.5-inch 5 TB drives like the ones I use (http://www.theverge.com/circuitbreaker/2016/11/15/13642078/s...)? They are nice, as we can plug them into a 15-port USB hub, and they auto power down when not in use. Amazon could have developed a box like what backblaze.com has done.


Facebook disclosed how they archive photos long term: they manage a 1-to-1.4 ratio with Reed-Solomon redundancy, so 4 disks of 14 can fail without losing data. https://code.facebook.com/posts/1433093613662262/-under-the-...
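The arithmetic behind that ratio, as a sketch (the (10, 4) split is what a 1:1.4 overhead implies):

    k, m = 10, 4          # 10 data blocks + 4 parity blocks per stripe
    n = k + m
    overhead = n / k      # 1.4x raw storage per logical byte
    print(f"{overhead:.1f}x overhead, survives any {m} of {n} disk failures")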


From what I recall, writable optical discs have a much shorter lifespan compared to tape (~15 years vs. 75 years).

Plus, if I were designing an archival system, it wouldn't be on Blu-ray, unless there was a requirement for magnetic resistance.


You are mistaken here. Tapes, unless stored with extreme care, usually last ~20 years safely. They tend to have a longer life, but not without a high risk of deterioration.

In my experience, written CDs and DVDs only last <10 years, if you're lucky. However, studies show you can get 30-45, even 45+, years out of them.

Most Blu-ray life expectancy exceeds this due to the different, non-organic-dye-based layering and coating.

The one mentioned above, M-Disc, was developed for DARPA and is supposed to last 1000 years in theory.

See:

http://www.mdisc.com/

http://loc.gov/preservation/resources/rt/NIST_LC_OpticalDisc...

http://www.zdnet.com/article/torture-testing-the-1000-year-d...


They make DVDs and BDs that are supposed to last 1000 years:

https://en.m.wikipedia.org/wiki/M-DISC


I wonder if Amazon is also deduplicating. It seems likely that a share of users would store large media files or software packages without encryption.
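Content-addressed dedup is cheap to sketch (this is just the generic technique, not a claim about how Glacier works), and it only helps on unencrypted or convergently-encrypted data, which is the point above:

    # Identical uploads hash to the same key, so the bytes are stored once.
    import hashlib

    store, refcount = {}, {}

    def put(data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in store:
            store[digest] = data               # first copy pays for the storage
        refcount[digest] = refcount.get(digest, 0) + 1
        return digest                          # caller keeps this as its object id

    put(b"popular-linux.iso")
    put(b"popular-linux.iso")                  # second upload stores nothing new
    print(len(store), "blobs,", sum(refcount.values()), "references")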


I remember reading this back around when it came out. Have there been any new pieces of the puzzle identified (or announcements...) since then?


Could it possibly just be compression on traditional HDDs?

- Cheaper storage because data is heavily compressed

- Slow retrieval time due to slow decompression


I seriously doubt it. If I have many TB of stuff I need to have saved off site, you can bet the very first thing I'm going to do is compress the hell out of it. So they certainly wouldn't be able to count on compressing it further.


Custom engineered SSD. Powered off at rest.

See: http://www.storagesearch.com/ssd-petabyte.html


SSD isn't quite ready for backup use cases (and certainly wasn't when Glacier was built a few years ago).

However, it is 2017, so we can say for sure that the extrapolation to 2016 in the linked article from 2010 was pretty good. It was too optimistic by a factor of ~2 in density and ~10x in cost, but it holds up well even compared to most predictions from a year or two ago.

Since any whacko can claim they made a prediction in 2010, I double-checked:

http://web.archive.org/web/20100322200343/http://www.storage...

Thanks for sharing the link!



