Amazon’s Glacier secret: BDXL (2014) (storagemojo.com)
213 points by another on Jan 15, 2017 | 71 comments



They're almost certainly doing something like Microsoft's Pelican: https://www.microsoft.com/en-us/research/wp-content/uploads/...

The first comment on TFA says as much.

Edit: This is the actual Pelican paper: https://www.microsoft.com/en-us/research/wp-content/uploads/...


My quick TL;DR for other readers (please let me know if I'm wrong):

The systems described are servers over-provisioned with an order of magnitude more hard drives than you would typically see in a standard server. This gives a much higher ratio of total storage capacity to other components, cooling, and power, at the cost of only being able to spin up a small fraction of the drives simultaneously. It then becomes a complicated scheduling problem: deciding which drives to spin up, and when, based on the total workload of everything you are trying to read and write at a given time. This system makes sense for data that is expected to be accessed on the order of once per year or less.
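For intuition, here's a toy sketch of that scheduling idea (nothing like Pelican's actual algorithm; the group/request shapes are made up). Pending requests get batched per drive group so each spin-up is amortized over as many reads and writes as possible:

    # Toy sketch: only `max_active` drive groups may be powered at once, so we
    # batch pending requests by group and serve the busiest groups first.
    from collections import defaultdict

    def schedule(requests, max_active):
        """requests: list of (group_id, blob_id); returns ordered batches."""
        by_group = defaultdict(list)
        for group_id, blob_id in requests:
            by_group[group_id].append(blob_id)
        # Busiest groups first, max_active at a time, so each spin-up
        # amortizes over as much queued work as possible.
        ordered = sorted(by_group.items(), key=lambda kv: len(kv[1]), reverse=True)
        return [ordered[i:i + max_active] for i in range(0, len(ordered), max_active)]

    reqs = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (1, "e"), (2, "f")]
    for batch in schedule(reqs, max_active=2):
        print("spin up groups:", [g for g, _ in batch])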


Yep. This is basically what they do. Source: I was around when they were designing and building Glacier.


This is very cool.


So Pelican fits 1152 WD WD6001F4PZ 6TB disks in 52U (Source: https://www.microsoft.com/en-us/research/video/rethinking-st...).

I wonder if Microsoft is going to publish specs (like Open Compute?) for the Pelican servers.


I highly doubt this, especially given the introduction of the nearline tiers.

It's probably just very widely striped, price-segmented data.

Also, see: https://news.ycombinator.com/item?id=4416065


I thought this was debunked at the time?

When I was running Exchange systems, our biggest challenge was delivering IOPS. We had to use a SAN, and we wasted significant storage because we'd spend our IOPS budget at 40-60% of storage capacity.

I figured at their scale they would have similar problems.


IOPS isn't important for Glacier. You just upload to some buffer and they eventually move it to the slow storage.

Reading is pretty slow from glacier.


He's asking whether EBS has the same issue as his Exchange servers. To explain in more detail: if you have 10 TB of disk space with 10,000 IOPS and your users buy 4 TB with 10,000 IOPS, then you have 6 TB of storage wasted.

If Amazon has that problem with EBS, then selling that storage capacity as Glacier and using just the idle IOPS (or leaving a small bit reserved) allows them to sell capacity that would otherwise just be useless.
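Back-of-the-envelope, using the numbers above (just the commenter's hypothetical, not real EBS figures):

    # IOPS sell out before capacity does, leaving stranded space.
    total_tb, total_iops = 10, 10_000
    sold_tb, sold_iops = 4, 10_000      # customers bought all the IOPS
    stranded_tb = total_tb - sold_tb    # 6 TB you can't sell as EBS any more
    print(f"stranded: {stranded_tb} TB")
    # That 6 TB could still be sold as an archive tier that promises almost no IOPS.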


Aren't IOPS incredibly expensive on Glacier? There was that guy who paid $150 for a retrieval. https://medium.com/@karppinen/how-i-ended-up-paying-150-for-...


That's the point. They aren't trying to sell IO with Glacier, since they've already saturated that with EBS. They just want to sell the spare storage capacity, ideally in a write-once, read-never use case. That way they can get 100% utilization out of the drives.

So if you use a lot of IO with Glacier, they are going to charge you like crazy, since you're potentially impacting EBS customers.


I'm that guy. I should update the post; Amazon "fixed" the retrieval fees in late 2016 and I would've paid less than a dollar had the current pricing scheme been in effect when I did the retrieval.


Sorry I didn't really finish the point.

With Exchange, we had all of this expensive, reliable SAN storage that would be perfect for a low-requirement, Glacier-like solution. Unfortunately, we lacked the ops mojo to pull it off.


Archive is not about IOPS; it's about streaming bandwidth.

For example, I used to look after a Quantum iScaler 24-drive robot; each drive was capable of kicking out ~100 megabytes a second. It was more than capable of saturating a 40-gig pipe.

However, random IO was shite; it could take up to 20 minutes to get to a random file. (Each tape is stored in a caddy of (from memory) 10 tapes, there is contention on the drives, and then spooling to the right place on the tape.)

Email is essentially random IO on a long tail. So, unless your users want a 20-minute delay in accessing last year's emails, I doubt it's the right fit.

The same applies to optical disc packs (although the spool-up time is much less).


I think that's the point - the e-mail is using up all of the IOPS. There would be a small amount of IOPS left over that could deal with streaming data; that data is unlikely to be accessed on a regular basis. The capacity not used by e-mail would then be used for the archive - data that's pretty much write-only.


It makes sense for email when you aren't giving your users access to their old email, but storing it for regulatory compliance purposes.


Why do you care how fast you can read it back when you're storing it for regulatory purposes? Isn't that a sunk cost? Buy high capacity, high reliability and don't care for the read speed?


With SANs, the IOPS budget is a function of your hardware config. If you want more IOPS, you get more RAM/SSD involved. More importantly, Amazon gets to sell EBS on their terms: a specific amount of IOPS with a specific amount of storage. If you want more IOPS, you have to buy more EBS. The "wasted storage" you're thinking of would be on your instance using EBS, not EBS itself.


Using BDXL seems like a pretty good solution. Most of this data is archival, and existing data is very unlikely to change. You can use HDD/SSD as a buffer as users upload data, and then optimize the packing to ensure you're using all available space on a disc, possibly encrypting each user's data block on the disc. The system itself would only need to track metadata (file metadata, cartridge/disc, key). Deleting a file would be deleting the key and marking the file as inactive. Once/if a cartridge is marked as completely deleted, it can just be recycled.
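A rough sketch of that metadata/crypto-delete idea (all structures here are made up for illustration, not anything Amazon has described):

    # "Delete = forget the key": only metadata and per-blob keys stay online;
    # the ciphertext sits immutably on a write-once cartridge.
    import secrets

    class ArchiveIndex:
        def __init__(self):
            self.records = {}  # blob_id -> {cartridge, offset, length, key}

        def store(self, blob_id, cartridge, offset, length):
            # One random key per blob; the blob is encrypted with it before burning.
            self.records[blob_id] = {"cartridge": cartridge, "offset": offset,
                                     "length": length, "key": secrets.token_bytes(32)}

        def delete(self, blob_id):
            # Crypto-erasure: drop the key and the burned bytes become useless.
            self.records[blob_id]["key"] = None

        def cartridge_reclaimable(self, cartridge):
            return all(r["key"] is None for r in self.records.values()
                       if r["cartridge"] == cartridge)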


I feel it will go the way of Data8.


Nice investigation. Also, Facebook has been using 50 GB Blu-rays (http://www.businessinsider.com/facebook-uses-10000-blu-rays-...) and is moving to 300 GB (http://www.businessinsider.com/ces-2016-facebook-uses-panaso...).


And Sony has been pushing a BD-based backup/archiving system for business use, IIRC, built around multiple discs in a cartridge.


Could you provide me some insight into why optical storage is a better solution than standard HDDs? Is it just the cost, or is cooling / form-factor a big part of it?


It is cost that provides the upside for Blu-ray over HDDs.


Optical discs cost cents when purchased in bulk; they are literally just plastic plus some coating, and no expensive metal or electronic parts are required.


And, to my knowledge, at least one specific principal-level engineer worked on both systems.


Why isn't anyone thinking tapes? You can get LTO-7 tapes for $0.008 per gigabyte that allow 100-300 writes before the tape should be destroyed. Quantum and HP make monstrous tape libraries that hold 5-10 petabytes per rack. You can also cartridge-ize your library for even denser storage on a literal warehouse rack somewhere.

Tapes also match the slow retrieval speeds as you have to read the data out onto a drive linearly.
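Rough math on a rack, assuming 6 TB native per LTO-7 cartridge and the $0.008/GB figure above:

    cost_per_gb = 0.008
    cartridge_tb = 6                              # LTO-7 native capacity
    rack_pb = 5                                   # low end of the 5-10 PB claim
    cartridges = rack_pb * 1000 / cartridge_tb    # ~833 cartridges
    media_cost = rack_pb * 1_000_000 * cost_per_gb
    print(f"{cartridges:.0f} cartridges, ~${media_cost:,.0f} in media per rack")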


Amazon already denied using tapes, which was mentioned in the third paragraph of the article.


There isn't a source mentioned for the denial of tapes but this article claims an Amazon insider verified that it is tapes:

http://www.theregister.co.uk/2012/10/18/ipexpo_2012/


Previous HN discussion about this: https://news.ycombinator.com/item?id=7647571


This is an extremely interesting deductive analysis. However, considering it is Amazon, there always exists that persistent "other" possibility: they're purposefully taking a loss.


Or at any rate, pricing based on net present value of archived data given decreasing costs over time for storage.

I do find it rather fascinating that AWS has managed to keep the technology used by Glacier, even at a high level (i.e. disks vs. tape vs. optical), so under wraps. My personal guess is that it's powered-down disk drives, on the grounds that that's the simplest long-term solution, but it's purely a guess.


Given the scale of Glacier, I'm surprised that Amazon is able to keep their underlying storage technology a secret.


This is one reason to assume it's unremarkable. If Amazon were buying and offloading truckloads of BDXL disks and drives, someone would eventually notice. A good explanation is, therefore, that the technology is unremarkable and boring.


Does anyone know about Google Nearline and Coldline storage? Google claims Coldline access within milliseconds.


> Google claims Coldline access within milliseconds

Well, that basically tells you almost all you need to know. It's disk in JBODs. The only question is SMR vs conventional. Anyone who knows that can't tell you in public.


If that's how Google is implementing it, and the prices are similar, then isn't that some evidence that Amazon might be implementing it the same way?


Amazon's latency is measured in hours. Whatever process they use involves literal cold storage: either disks that are completely switched off or some tape-like media archive system. But the article makes a good case for why it's almost certainly not tape.


It could still be normal disks, if we're feeling conspiratorial. It's easy to make faster storage slower; just add waits. Or maybe they make it take up to five hours so they can avoid peak traffic times in whatever data center your info happens to be located in.


I think it's because even for cold storage one copy is still persisted on disk.


I've got a USB3 BDXL writer attached at my desk and it is quite handy and not too expensive. I back up my whole data (work) partition to it every so often and occasionally take one over to a relative's house as my own home-grown "glacier" system.


Although your setup is interesting, isn't using 2.5" (*) SATA disks via a USB 3 adapter way cheaper and more flexible?

(*) So you don't need a power adapter.


Perhaps, but optical is lighter, much simpler, durable, can be mailed, and fits into CD cases for travel, etc.


And optical media is also immutable (read only) when you need it to be immune to any data changes.

I find that read only point in time backups gain value over time. Especially if you need to pull a file that would have been long rotated and replaced by newer backups on read write media (eg. HDDs).

Unfortunately, the market for this use case is not large, and this is reflected in the prices and in how hard it is to source high-quality optical media. For BD this would be (inorganic) HTL Panasonic media, which only has a market inside Japan itself. M-Disc is the other alternative, although it has only proven itself within the DVD market; classical HTL BD media is expected to be very similar in endurance to what M-Disc has on offer in the BD range.


Btw, I have one of those too.


Would you not be better off uploading it somewhere instead? If your relative lives close by, a natural disaster could destroy all your backups.


It's a huge state and if it fell into the ocean (wink, wink) data would not be a top-five concern of mine. ;)

Also, I prefer privacy. To each his own, however.


Regarding your last point, I am renting my own server in a datacenter on a different continent. So basically I have total control over my data etc.


> So basically I have total control over my data etc.

How would you stop someone who gains physical access to your server?


That is not really something I have to worry about if the server is in the datacenter of a reputable hoster.


If it's backups, you could just encrypt everything prior to uploading.
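For example, a minimal sketch with the Python `cryptography` package (key handling is up to you, and you'd want to stream rather than read whole files for big backups):

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()        # keep this offline; lose it, lose the backup
    f = Fernet(key)
    with open("backup.tar", "rb") as fh:
        token = f.encrypt(fh.read())
    with open("backup.tar.enc", "wb") as out:
        out.write(token)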


Do you burn BDXL with Linux? I have a burner and it's great at single layer; I haven't tried any BDXL media yet.


Yes, though now that I think about it, I am not sure I have broken into the third layer yet. The hardware (LG) says it is capable.


Ex-Glacier engineer... and no I'm not going to tell you what or how it's done. NDAs and all that jazz. These speculation threads always make for fascinating reading for people on the team.


Glacier in particular seems to attract the speculative fascination. Do people not realise the name is not in jest, it really is done with graphene-doped room-temperature ice crystals and laser interference lithography?


We used to joke among ourselves that actually it was done using vinyl records. Have you seen how many vinyl records you can fit into a single rack?

Added bonus, 9 out of 10 customers actually preferred the feel of their data when it is restored.


I liked the story that the truck-mounted Snowball came from Glacier tech. Amazon has been putting data on a truck and then driving it around Virginia. The delay in reading it back is the time it takes for the truck to arrive at a datacenter and plug in. :)


Smart man. :)


Are they packet-written?

That seems to have been the major stumbling block with higher-capacity optical media: one can't do the drag-and-drop writes that one has with spinning rust and flash chips.


Bulk transfer. They'd write it to a normal disk first, then slowly copy the data in bulk.


What if they are using 2.5-inch 5 TB drives like the ones I use (http://www.theverge.com/circuitbreaker/2016/11/15/13642078/s...)? They are nice, as we can plug them into a 15-port USB hub, and they auto power down when not in use. Amazon could have developed a box like what backblaze.com has done.


Facebook disclosed how they archive photos long term: they manage a 1-to-1.4 ratio with Reed-Solomon redundancy, so 4 disks of 14 can fail without losing data. https://code.facebook.com/posts/1433093613662262/-under-the-...
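The arithmetic behind that ratio, as a sketch (the (10, 4) split is what a 1:1.4 overhead implies):

    k, m = 10, 4          # 10 data blocks + 4 parity blocks per stripe
    n = k + m
    overhead = n / k      # 1.4x raw storage per logical byte
    print(f"{overhead:.1f}x overhead, survives any {m} of {n} disk failures")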


From what I recall, writable optical discs have a much shorter lifespan compared to tape (~15 years vs. 75 years).

Plus, if I were designing an archival system, it wouldn't be on Blu-ray, unless there was a requirement for magnetic resistance.


You are mistaken here. Tapes, unless stored with extreme care, usually last ~20 years safely. They tend to have a longer life, but not without a high risk of deterioration.

In my experience, written CDs and DVDs only last <10 years, if you're lucky. However, studies show you can get 30-45, even 45+, years out of them.

Most Blu-ray life expectancy exceeds this due to the different, non-organic-dye-based layering and coating.

The one mentioned above, M-Disc, was developed for DARPA and is supposed to last 1000 years in theory.

See:

http://www.mdisc.com/

http://loc.gov/preservation/resources/rt/NIST_LC_OpticalDisc...

http://www.zdnet.com/article/torture-testing-the-1000-year-d...


They make DVDs and BDs that are supposed to last 1000 years:

https://en.m.wikipedia.org/wiki/M-DISC


I wonder if Amazon is also deduplicating. It seems likely that a share of users would store large media files or software packages without encryption.
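Content-addressed dedup is cheap to sketch (this is just the generic technique, not a claim about how Glacier works), and it only helps on unencrypted or convergently-encrypted data, which is the point above:

    # Identical uploads hash to the same key, so the bytes are stored once.
    import hashlib

    store, refcount = {}, {}

    def put(data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in store:
            store[digest] = data               # first copy pays for the storage
        refcount[digest] = refcount.get(digest, 0) + 1
        return digest                          # caller keeps this as its object id

    put(b"popular-linux.iso")
    put(b"popular-linux.iso")                  # second upload stores nothing new
    print(len(store), "blobs,", sum(refcount.values()), "references")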


I remember reading this back around when it came out. Have there been any new pieces of the puzzle identified (or announcements...) since then?


Could it possibly just be compression on traditional HDDs?

- Cheaper storage because data is heavily compressed

- Slow retrieval time due to slow decompression


I seriously doubt it. If I have many TB of stuff I need to have saved off site, you can bet the very first thing I'm going to do is compress the hell out of it. So they certainly wouldn't be able to count on compressing it further.


Custom engineered SSD. Powered off at rest.

See: http://www.storagesearch.com/ssd-petabyte.html


SSD isn't quite ready for backup use cases (and certainly wasn't when Glacier was built a few years ago).

However, it is 2017, so we can say for sure that the extrapolation to 2016 in the linked article from 2010 was pretty good. It was too optimistic by a factor of ~2 in density and ~10x in cost, but it holds up well even compared to most predictions from a year or two ago.

Since any whacko can claim they made a prediction in 2010, I double-checked:

http://web.archive.org/web/20100322200343/http://www.storage...

Thanks for sharing the link!



