Amazon S3 and Glacier Price Reductions (amazon.com)
293 points by jeffbarr on Nov 22, 2016 | 179 comments



Would really like to see some massive reductions in the operation costs and most importantly, bandwidth costs.

The bandwidth costs are so far out of line with what the network transfer actually costs, it just feels like price fixing between the major cloud players that nobody is drastically reducing those prices, only storage prices.

Charging 5 cents per gigabyte (at their maximum published discount level) is equivalent to paying $16,000 per month for a 1 gigabit line. This does not count any operation costs either, which could add thousands more, depending on how you are using S3.
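
To sanity-check that figure, here's the arithmetic (a rough estimate at the quoted $0.05/GB rate, assuming a fully saturated line):

    # Approximate monthly cost of pushing a sustained 1 Gbit/s through S3 egress
    # at the $0.05/GB volume-discount rate.
    seconds_per_month = 30 * 24 * 3600            # 2,592,000 s
    gigabytes_per_month = seconds_per_month / 8   # 1 Gbit/s = 1/8 GB/s -> ~324,000 GB
    print(gigabytes_per_month * 0.05)             # ~16,200 USD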

There are several providers that offer an unmetered 1gbps line PLUS a dedicated server for ~600-750/mo. Providers like OVH offer the bandwidth for as little as 100/month. ( https://www.ovh.com/us/dedicated-servers/bandwidth-upgrade.x... ) I am just not sure how Amazon can justify a 160x price increase over OVH or a 30x increase over dedicated server + transfer.

For the time being, the best bet is to use S3 for your storage and then put a heavily caching non-Amazon CDN on top of it (like Cloudflare) to save the ridiculous bandwidth costs.


That's exactly why I started Ploid.

A consulting customer came to me a year ago whose data production was growing from 200TB/year to over 6PB/year, and their budget couldn't sustain that jump (or anywhere close to it).

Having come from the mass-facilities and data center space with MagicJack, I knew the wholesale costs of bandwidth, power, and drives were continuously falling.

There are certain clients and use cases that need access to their data all of the time and whose very bones are built on collaboration (genomics).

For example, this client is now storing 6PB of data with us, 3 copies in separate data centers. We are half the price of S3 on storage alone, and we include all the bandwidth for free, limited to 10GigE per PB stored. This has worked out extremely well - we were about 20% (!!!!) of the price of Amazon after you factor in bandwidth.

There were lots of challenges we faced, like overzealous neighbors in the environment, storing lots of small objects, and high usage of ancillary features like metadata, and those come up for customers of any size. By putting the "tax" on bandwidth, a lot of these business cases are solved. I see why Amazon does that.

AWS is truly great, but as you get into very high scale (specifically in storage - 2PB+), it becomes extremely cost prohibitive.


"By putting the "tax" on bandwidth, a lot of these business cases are solved. I see why Amazon does that."

However, S3 has the same egress pricing as EC2. Do you think it's really a "business case tax" they're applying across all services?


It makes a lot of sense to be able to run loss-making products. Otherwise everyone would use S3 together with Google compute engine and Azure databases (let's assume they'd be cheapest). In this scenario all providers would lose out.

In the current world, they can keep prices for some products below costs but make their money with bandwidth and the other services people are forced to use to avoid egress traffic.


"In the current world, they can keep prices for some products below costs but make their money with bandwidth and the other services people are forced to use to avoid egress traffic."

Which AWS products are loss leaders?

S3 storage pricing is not exactly cheap. Neither is EC2 instance pricing.

"Otherwise everyone would use S3 together with Google compute engine and Azure databases (let's assume they'd be cheapest). In this scenario all providers would lose out."

No, S3 would do well, GCE would do well, Azure would do well. Providers only lose out to the extent their products no longer compete on merit alone.


I can imagine that this is a good reason. Otherwise they could make bandwidth cheaper so that people who cannot move everything can at least move part of their applications.

I think the three providers are smart enough to know why they charge that much for bandwidth, and this is the only reason I could think of why all 3 of them charge that much. And I'm pretty sure that some products run at a loss; they do at nearly every company. But AWS won't tell us which ones.

It's reasonable to think that S3 is loss-making or about break-even on its own but recoups the costs through bandwidth charges.


There's still latency, you know ;).


I guess the latency between AWS Frankfurt and GC Belgium should be low enough (5-10ms) to use it for most applications, e.g. storing large amounts of data at one provider and renting compute instances for processing at the other one. The latency shouldn't be an issue there, as long as the throughput is high enough.


Can confirm this: storage for a lot of stuff is in S3 and compute is GCP preemptibles. Works if you have a small dataset which requires a large volume of compute.


Is that cheaper than using Google for storage as well? Or are there other reasons for that setup?


Bit of both, no point moving it as the automation/clients that dump data to S3 make it quite hard to change.


GCS supports the S3 API modulo resumable uploads (we do them differently): https://cloud.google.com/storage/docs/migrating

Feel free to send me a note, my contact info is in my profile (I helped build preemptible VMs and I'm sort of fascinated you're doing this).


Yes - I think it directly applies to EC2 as well. Still an underlying commodity


Could you link to Ploid please? Neither Google nor Bing can find it from that name.



Dead? I get either site not found or a domain parking page.


Maybe it's under a different domain name. I was hoping to draw @bkruse out to tell us where we can sign up for Ploid because I'm interested too.


(repost)

Sorry for the delay - yes it's ploid.io - nothing up there yet. We've been in stealth mode while we've been building the system for our first client - HudsonAlpha Institute for Biotechnology. Feel free to ping me at brandon at ploid.io - happy to share any insight we've gained!


Yes, you've tickled my curiosity too but I can't find you.


Sorry for the delay - yes it's ploid.io - nothing up there yet. We've been in stealth mode while we've been building the system for our first client - HudsonAlpha Institute for Biotechnology.

Feel free to ping me at brandon at ploid.io - happy to share any insight we've gained!


> The bandwidth costs are so far out of line with what the network transfer actually costs, it just feels like price fixing between the major cloud players that nobody is drastically reducing those prices, only storage prices.

Yes, it makes me wonder. For many applications, the ridiculous bandwidth costs will be substantially higher than compute costs, so it makes no sense that nobody in this supposedly competitive market wants to compete in this area.

> There are several providers that offer an unmetered 1gbps line PLUS a dedicated server for ~600-750/mo. Providers like OVH offer the bandwidth for as little as 100/month. ( https://www.ovh.com/us/dedicated-servers/bandwidth-upgrade.x.... ) I am just not sure how Amazon can justify a 160x price increase over OVH or a 30x increase over dedicated server + transfer.

OVH and other large volume providers are probably relying on the fact that many customers won't use 100% of their purchased capacity, but even taking that into consideration, AWS/GCE/Azure bandwidth pricing is insane.


It is indeed curiously unspoken among the major providers. (SoftLayer, too, raised all their bandwidth rates when IBM bought them.)

Part of building a moat? On one side we have a big stack of custom APIs. On the other -- once you have a few hundred terabytes of data in our systems -- it will be very costly to emigrate.


Exactly!

I knew as soon as I saw the headline that they would be reducing storage costs (who cares????) and not reducing the bandwidth costs, which have been stuck for years, which are price-fixed with the other providers, and which account for the vast majority of our S3 bill.


Isn't bandwidth in free? Are you referring to bandwidth-out costs? I don't work for them, but I imagine they model bandwidth out as disk reads for accessing and reading your data. Think about how much it would cost to read from disk continuously at the rate of a 1 gigabit line for a month. Put a CDN in front of S3 if you have high bandwidth costs and low diversity of files.


That would go some way towards explaining the over-inflated egress (bandwidth-out) costs, if S3 did not already charge for disk reads.

S3 already has request pricing in place: https://aws.amazon.com/s3/pricing/#Request_Pricing

In addition to that, the pricing for "Data Transfer OUT From Amazon S3 To Internet" is exactly the same as for "Data Transfer OUT From Amazon EC2 To Internet", so this is not specific to S3 but to EC2 and AWS as a whole it seems.

AWS appears to have really expensive egress costs (or really profitable egress margins) compared to OVH and Hetzner. If so, then something is stuck: either the costs are not being addressed or the margins are not being passed on.


Due to the high traffic volume, they peer with most large providers. Of course they need to pay for the fiber they lease between DCs and peering points, but at this level the cost should be rather low. And a lot of the traffic will be free, especially in Europe where you can peer with many providers on exchanges.


CloudFront costs are super expensive outside of the US and Europe. We are a small company and just in South America we are paying 1000USD per month for 4 TB of traffic ($0.25 per GB). Based on traffic alone we are losing money with some customers.


Why not use something like CloudFlare for ~$200/mo?


Cloudflare doesn't support large media files if I recall correctly.

Also, it means using DNS with them as well.


Seems like you should use CloudFront for things that are both small and need to be fast (HTML/ JS/ CSS), and then use a separate service for hosting your fat media files. Heck, rent a pair of servers with unlimited bandwidth somewhere for $100 a month. Yes the bandwidth may be a bit oversold and spiky, but if a giant file download slows from 10 MB/sec to 5 MB/sec for a bit, I doubt it matters THAT much.


Check out Fastly or CacheFly.


We tried CloudFlare around 2 years ago; we had complaints about performance issues, so we switched back to CloudFront.

Here is a blog post that shows the same problem we were having: http://goldfirestudios.com/blog/135/CDN-Benchmarks%3A-CloudF...


For many hosts, bandwidth is what they are really selling. You may rent a server from them and they may make some money from that over the capital costs, but the amount they are charging for bandwidth in comparison to what they pay for it is where the real money is made.


Also from OVH, have a look at OVH Object Storage, which is pretty much S3 but for 1/10 of the price [1].

There is also a GitHub comparison of all object storage providers (this price reduction isn't included yet, as far as I can see) [2].

[1] https://www.ovh.com/us/cloud/storage/object-storage.xml [2] http://gaul.org/object-store-comparison/


I just tried to sign up for an Object Storage account. The form assumed I was from Canada so I couldn't fill it in.


I use S3 as a web server for static pages. Total bandwidth costs: 17 cents a month, 14 or so cents now. And I have some tens of thousands of visits per month. The only thought I've had to put in since setup was questioning one aspect of a bill (it amounted to pennies, but I couldn't figure it out, and maybe they were overbilling bigger entities). And they gave me a month free (including DNS costs).

I wish the Route 53 pricing would come down, though... Hard to beat, even if I were making job-replacing amounts of income from it.


I assume from these numbers that you only have a few GB of traffic per month. If your page ever became popular this would quickly change.


Or he has a very skinny page :) Not a 3 MB js monstrosity.


Even at 100KB per page (assuming it's not only text), 1 million pageviews per month across all sites will bring you to 100GB, a point where a small VPS at DO/Vultr/Hetzner/OVH would be cheaper than AWS. And since he was talking about several pages, 1 million pageviews isn't even a very popular site.


If you are getting 1M pageviews, you should be able to make what, $5000 in AdSense revenue?

At that point, I think the reliability and multicast nature of Cloudfront or some other CDN is worth the $5 premium over DO/Vultr/OVH....


A) I think the revenue would be far less than that

B) Not everything is about selling ad space, or even generating revenue.


I think it is possible to draw some conclusions from the volume discounts AWS is giving on data transfer. From 10TB/month to 150TB/month the price drops $0.09 => $0.05. I don't see how this kind of change in an individual customer's volume would make any difference to how much Amazon pays for the traffic. So I assume it is quite profitable for them to be selling at $0.05 (and lower) and the higher prices are just because smaller customers accept them.


Bandwidth pricing seems to be used for locking users in: you don't get charged when you're uploading data to AWS, but downloading it back out is quite expensive.


I wholeheartedly agree. The bandwidth cost reduction is long overdue.


Well, the costs are nicer, but mostly, Glacier goes from an unusable pricing model to a usable one. I was terrified to use Glacier. Under the previous model, if you made requests too rapidly, you might be hit with thousands of dollars of bills for relatively small data retrievals; it was very easy to make a very expensive bug.

For a long time I had wanted Amazon to wrap it in something where they managed that complexity. Looks like they finally did.

Now the only thing Amazon needs to do is expand free tiers on all of their services, or at least very low cost ones. I prototype a lot of things from home for work -- kinda 20% time style projects where I couldn't really budget resources for it. The free tier is great for that. All services ought to have it -- especially RDS. I ought to be able to have a slice of a database (even kilobytes/tens of accesses/not-guaranteed security/shared server) paying nothing or pennies.


> Under the previous model, if you made requests too rapidly, you might be hit with thousands of dollars of bills for relatively small data retrievals; it was very easy to make a very expensive bug.

Glacier has supported, for something like 18 months now, the ability to put a policy on your vault that caps your maximum retrieval cost. Whenever a request would cause you to exceed that limit, it gets a throttle response that the SDK handles happily. I've used it when I needed to retrieve a whole bunch of data and wanted to do it faster than the free tier allowed. I set it at $5 and just left the retrieval running.
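
For anyone curious, setting that kind of cap looks roughly like this with boto3 (a sketch; the policy is actually set per account and region rather than per individual vault, and the 10 GB/hour figure is just an example):

    import boto3

    glacier = boto3.client("glacier")

    # Cap retrievals at roughly 10 GB/hour. Requests that would exceed the cap
    # are rejected with a PolicyEnforcedException instead of silently running
    # up the bill.
    glacier.set_data_retrieval_policy(
        accountId="-",  # "-" means the account that owns the credentials
        Policy={"Rules": [{"Strategy": "BytesPerHour",
                           "BytesPerHour": 10 * 1024 ** 3}]},
    )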


A t2.large costs 10 cents/hour, a t2.medium RDS instance costs 7 cents/hour. If you put in 50 hours/month on this side project, that's $8.50 for compute plus maybe $3 for ~ 30GB of storage.

$11.50/month doesn't sound too hard to budget for.


It might be more about bureaucracy than about cost. At my last job even small expenses would require printing forms and getting the CFO to sign them, so in the end it was pretty much not worth it for small tests.


Your work can't give you a few VMs in a lab somewhere that you can get free rein to prototype on? That is usually not too hard to get..

It seems the idea of the Amazon free tier is to give people a taste of AWS so they can decide whether to go further or not. It's not really designed to be a free prototyping environment for an existing large customer's products. Like the other poster said, you can host a tiny VM for $6 or so a month, not a big expense.

If you are asking for a $4000-a-month production cluster, yes, that is harder to just get.


I run dozens of small side projects off a $10/mo DigitalOcean server. I do not mind paying that absurdly low amount.


Agreed.

Glacier was, from my point of view, literally unusable. Now, it's usable. I still may not use it but at least I can.


RDS does have a free tier: https://aws.amazon.com/rds/free/


While cool, this is only valid for the first 12 months of an AWS account.


I've seen people suggest this a lot when someone is asking for hosting advice. Do people just make new accounts every year or something?


Yes, that's what I do and it's pretty much encouraged when I asked AWS if it was allowed.


I've been doing that every year since 2007! You can even use the same credit card information.


I dunno, I've never done that, too much of a pain :/


Dynamo might fill this need for you; you can provision a 1/1 table for less than a dollar per month and just put everything in it. Dynamo will bank some provisioned throughput for you to allow bursting, and the client APIs all implement exponential backoff in the event of a throttle, so a 1/1 table isn't nearly as scary as you might initially think.

Of course, then you're using Dynamo.
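
For reference, a minimal boto3 sketch of such a table (the table and key names are made up):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # One read capacity unit and one write capacity unit: the cheapest
    # provisioned table you can create. Bursts beyond that get throttled,
    # and the SDK retries with exponential backoff.
    dynamodb.create_table(
        TableName="prototype-kv",
        AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
        ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    )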


Is there a common perception of DynamoDB? Or are you just referring to the old Dynamo?


Same about prototypes! I wonder how many other people have created really great products for their company doing the same thing. It would be in any company's best interest to provide free tiers for experimentation.


While I'm not going to complain about a price reduction, I'd honestly be more excited if S3 implemented support for additional headers and redirect rules. Right now, anyone hosting a single page app (e.g. Angular/React) behind S3 and Cloudfront is going to get an F on securityheaders.io.

And even worse, there is no way to prerender an SPA site for search engines without standing up an nginx proxy on EC2, which eliminates almost all of the benefits of CloudFront. This is because right now S3 can only redirect based on a key prefix or error code, not based on a user agent like Googlebot or whatever.

This means that even if you can technically drop a <meta name="fragment" content="!"> tag in your front end and then have S3 redirect on the key prefix '?_escaped_fragment_=', that will be a 301 redirect. As a result, Google will ignore any <link rel="canonical" href="..."> tag on the prerendered page and will instead index https://api.yoursite.com or wherever your prerendered content is being hosted rather than your actual site.

Not only is it a bunch of extra work to stand up an nginx proxy as a workaround, but it's also a whole extra set of security concerns, scaling concerns, etc. Not a good situation.

edit: For more info on the prerendering issues, c.f.:

https://github.com/prerender/prerender/issues/93

https://gist.github.com/thoop/8165802
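
For reference, the prefix-based redirect described above looks roughly like this with boto3 (a sketch; the bucket and prerender host are hypothetical, and note the redirect is the 301 that causes the canonical-URL problem):

    import boto3

    s3 = boto3.client("s3")

    # S3 website routing rules can only match on a key prefix or an error
    # code, not on the requesting user agent.
    s3.put_bucket_website(
        Bucket="example-spa-bucket",
        WebsiteConfiguration={
            "IndexDocument": {"Suffix": "index.html"},
            "ErrorDocument": {"Key": "index.html"},
            "RoutingRules": [{
                "Condition": {"KeyPrefixEquals": "?_escaped_fragment_="},
                "Redirect": {"HostName": "api.example.com",
                             "Protocol": "https",
                             "HttpRedirectCode": "301"},
            }],
        },
    )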


Can you share your needs with me so I can pass them along to the S3 team?


Tell the CloudFront team to support adding custom HTTP headers to client responses (not the currently supported headers added to the origin requests), such as HSTS. CloudFlare and others already support this.


Thanks Jeff, I just sent you an email:

https://www.fwdeveryone.com/t/QOU4DQDbS8e4tddfI22J0w/s3-prer...

Of course this won't yet show up in Google until we get that nginx proxy stood up or this feature gets implemented. :-)


OT: That fwd:everyone service is really cool. Damn :)


Thanks! Consider making an account, we're re-launching the site in the next couple weeks so there should be a lot of good content up there shortly.


Looks really good. Will sign up shortly.


So, you're building a pre-rendering system, but only Google gets the benefit?

Also, doesn't Google penalize sites that serve different content to Googlebot and other UAs?


> Doesn't Google penalize sites that serve different content to Googlebot and other UAs?

The content you're serving to Google needs to accurately reflect the content users are seeing.


This is really something that needs to be set up/supported at the CloudFront layer (as CloudFlare and other CDNs already do) instead of S3. The danger is that somebody sets an HSTS or other security-related header on their bucket and inadvertently breaks access for all the other customers that fetch from the S3 domain.


> While I'm not going to complain about a price reduction, I'd honestly be more excited if S3 implemented support for additional headers and redirect rules. Right now, anyone hosting a single page app (e.g. Angular/React) behind S3 and Cloudfront is going to get an F on securityheaders.io.

You can setup "Origin Custom Headers" in CloudFront ;)


That's for sending headers to your web server or S3 though, not for sending headers to the user. There are a few extra headers you can send to the user, but not the security related ones.


Take a look at Netlify (disclaimer, I'm a co-founder), you'll get redirects, built-in prerendering, automated deployments, custom headers, etc. All with full cache-ability on our CDN.

https://www.netlify.com


I can vouch for Netlify. We use them for all our stuff at Graphcool and it is awesome. Going from S3+CF to Netlify provides a step improvement similar to going from EC2 instances to a managed Docker cluster.


Hmm, as someone who is about to move from using a webserver to hosting our single page app on S3 you may have just convinced me not to.

What exactly are the benefits over simply setting up nginx if not simplicity? Yeah, it's great to just serve the asset from s3 but the complexity of what you just described negates it almost entirely.


> What exactly are the benefits over simply setting up nginx if not simplicity?

We weren't fully aware of these limitations when we decided to host our site on S3. Had we been, we may have just used nginx.

There are obviously a ton of benefits to S3 and CloudFront, it's just that in practice you can't really get them if you need Google to index your site. And while Google claims they can now execute JavaScript and include async content, in practice this isn't true for any real Angular or React site.

And even if every search engine were to magically execute js correctly, you'd still need to prerender your site in order for Facebook and Twitter to populate the preview cards for your site with the proper headline, summary, and image.


You should keep in mind that S3 really is an object store. It's named as such, advertised as such, priced as such, and the limitations you're hitting are there because it's built as such.

It does work if you want to host a static site, and it's nice that they offer a bunch of extra niceties to help make those work... but expecting things like user-agent-specific redirects is a bit much for what's essentially a filesystem.


I mean they let you put S3 behind Cloudfront with a TLS cert. Having read through all the documentation, hundreds of blog posts, etc., I've seen absolutely nothing that indicates that S3 + Cloudfront isn't meant for serious web hosting.


Amen! Have been baffled by the urge to move stuff to S3 for app hosting. I understand the convenience (sort of) and the scalability aspects (mostly) but you seem to lose loads of functionality.


It works for anything static, but anything dynamic needs to go on a server or Lambda (or possibly the API gateway).


Do Angular sites render correctly in Google Webmaster Tools but not actually get indexed properly unless you serve pre-rendered pages? I'm curious, since when I ask Google to index a specific page it shows everything properly loaded/rendered, and I was under the impression that was also being indexed.


I don't really understand the advantage. Getting a small vps with nginx is fast to set up, needs very little maintenance and can handle a large amount of traffic (requesting static pages). Can't really see how S3 can be cheaper or easier.

Yes it gives some scalability, but so do many cloud providers. DigitalOcean and Vultr both have SSD storage that you can attach to VMs. The speed I've seen is fast enough to easily saturate the bandwidth. Of course you'd need to scale up when you need more than 100 Mbit of bandwidth, but this is still cheaper than paying for AWS bandwidth and servers.


Using S3 as a web server means: managed reliability, scalability, security, maintenance, updates, HTTPS, and price efficiency (pay only if pages are visited).

CloudFront or API Gateway help you integrate with non-static resources.

Bandwidth pricing is identical to EC2.

With Lambda out there, for me EC2 is just legacy.


If I'm not wrong, CloudFront actually follows 302 redirects internally instead of passing the redirect on to the client.

But since ?_escaped_fragment_= is a suffix, not a prefix, I don't think redirection rules help.


Confusingly, _escaped_fragment_= is always a prefix if you have a #! in your URL; otherwise I think it can come anywhere in the URL params.[1] Here is the exact system of redirects we were using, along with an explanation of why it didn't work:

https://www.fwdeveryone.com/t/mdDouBesQwCv0za_o7GjCA/serving...

[1] https://prerender.io/js-seo/angularjs-seo-get-your-site-inde...


Ah, yes, if you use #!.

In that case, can't you do a CloudFront custom behaviour on /?_escaped_fragment_=*? I haven't tried this.


Is anyone using either S3 or Glacier to store encrypted backups of their personal computer(s)? I've only used Time Machine to back up my machine for a long time, but I don't really trust it and would like to have another backup in the cloud. Any tools that automate backup and restore to/from S3/Glacier? What are your experiences?


I use Arq (https://www.arqbackup.com/) and it works very well. I've only tested retrieving small amounts of data from it so I can't comment much on a large bill. I only wish it worked on Linux. I've been thinking about seeing if it would work with Wine.


Also recommend Arq against any service. At a certain scale, especially after factoring bandwidth for restores, Amazon Cloud Drive at a flat $60/year becomes more attractive.


Ditto. I've been using Arq for 4 years now to back up nearly a TB of data to S3 and Google Cloud Storage.


Out of curiosity, why both?


I guess mostly to not have all my cloud backup eggs in one basket.


+1 for Arq!


I have about 150GB in /home backed up daily using Duplicity, which encrypts & compresses everything and saves to a second internal backup drive. Data is kept for 6 months minimum. After several years, the total backup size is 190GB which syncs daily to an S3 bucket and my monthly bills are about $11USD. If I ever had to restore all that from S3 it would cost extra but would not be prohibitively expensive.

Install the AWS CLI (https://aws.amazon.com/cli/) and choose whatever method you like for making an encrypted local backup. Then sync that backup partition to S3 every day.

Here's an example command you can adapt for cron to call via a shell script:

    /home/tom/.local/bin/aws s3 sync /mnt/backups/daily s3://your-s3-bucket-name --storage-class STANDARD_IA --acl private --sse
The --sse flag means server-side encryption, which is redundant since I've encrypted the data prior to uploading, but why not?


I use Tarsnap (https://www.tarsnap.com/), which uses S3. It's true, though, that I only backup a small enough subset of my files, not anywhere close to a complete image. OTOH, it's very easy to use (CLI only), and as secure as it gets.


Using S4cmd on my Debian box, I back up copies of my entire Lightroom folder structure. All my personal photos and videos get uploaded to a bucket on S3, then I convert the entire bucket to Glacier. Now with the new pricing of $0.004/GB and easy retrieval, it's a very nice setup.

To back up nearly a terabyte of photos costs about $4/month in storage costs. Uploading costs a bit extra due to the pricing for requests.

https://github.com/bloomreach/s4cmd


How much would it cost if you ever needed to restore that TB?

I looked into Glacier a couple of years ago, and, from memory, restore costs were insane.


I was a masochist and hand-rolled incremental tar snapshots to Glacier using a cronjob. It works great for keeping the price down, though I learnt the "lots of retrievals" lesson the hard way once I actually tried to retrieve my data. I now compromise by doing a full snapshot monthly and incrementals daily, so I'm upper-bounded at 30 retrievals.

I'd highly recommend not repeating my mistake - use a real backup service for your actual data. Though rolling your own can be fun and interesting, it's probably a bad idea to bet your data on it.
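
For the curious, the rough shape of what I mean, sketched with GNU tar and boto3 (the vault name and paths are made up; and again, don't make this your only copy):

    import datetime
    import subprocess
    import boto3

    snapshot = "/var/backups/home-%s.tar.gz" % datetime.date.today()

    # GNU tar incremental mode: the .snar file remembers what was already
    # backed up, so only changed files land in this archive.
    subprocess.run(
        ["tar", "--listed-incremental=/var/backups/home.snar",
         "-czf", snapshot, "/home"],
        check=True,
    )

    glacier = boto3.client("glacier")
    with open(snapshot, "rb") as body:
        resp = glacier.upload_archive(
            accountId="-",
            vaultName="home-backups",
            archiveDescription=snapshot,
            body=body,
        )
    # Keep the archive ID somewhere safe; you need it to retrieve the data.
    print(resp["archiveId"])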


I use my Synology NAS pointed at Google Nearline (through the included S3 support!). GCS also supports customer supplied encryption keys, but I store the bytes as encrypted on the NAS box itself.

As others have said, if you're just trying to back up your Mac, take a look at Arq.

Disclosure: I work on Google Cloud (and get $30/month of credit, so my first 3 TB are free).


Check out rclone [1]. I haven't used it yet but it's free and supports lots of backends.

[1] http://rclone.org/


I use Duplicity to back up my VPS to S3, and Fast Glacier to archive stuff to Glacier from my Windows machine (things like photographs, important documents, etc.)

Note: Glacier is not a backup service, it's an archive service. It's for long-term archives, vs the relatively short term of standard backups (full, delta, delta... cycles). With Glacier, if you delete an archive before 90 days, you'll still be charged for the full 90 days of storage.


Arq on OSX is pretty great.


I second this. It costs about $50 for a license, but they support incremental backups to any of a NAS, S3, S3/Glacier, various tiers of Google Cloud Storage, Dropbox, and Google Drive. Client-side encryption is built in. It's good software.


Check out Duplicity or Duplicati. Also Arq for OSX I believe is very nice and has a great GUI but isn't free.


+1 for Duplicity


This is a really dumb question, but since I've never used Glacier, what does the workflow for a Glacier application look like? I'm used to the world of immediate access needs and fast API responses, so I can't imagine sending off a request to an API and getting the response "Your data will be ready in 1-5 hours, come back later".


I work on quality control medical data (MRI images) and have huge data sets from machines going back over a decade. Most of the useful stuff is extracted metrics (stored in a db), but every now and then we need to pull up a data set and run updated analysis algorithms. We'll usually keep the latest couple of years in S3, and the rest in Glacier.

The data trove is fairly unique, and valuable in being the only one of its kind, but we don't need anywhere near instant access to most of it.


It may be for backing up infrequently accessed data (compliance logs, etc) for example.

Hypothetical: you create a logging service for users to send all their log data to you. You promise 365 days of archives, but 30 days of data accessible at any time. You create a lifecycle rule on your S3 bucket to automatically archive data to Glacier 30 days after creation. On the 31st day, your user decides they want to look at an old log. They click the big Download button. You display a message saying they'll get an email from you when that data is ready to download.
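
A sketch of that lifecycle rule with boto3 (the bucket name, prefix, and retention numbers are the hypothetical ones from the example):

    import boto3

    s3 = boto3.client("s3")

    # Transition objects under logs/ to Glacier 30 days after creation and
    # expire them after 365 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-logs-bucket",
        LifecycleConfiguration={"Rules": [{
            "ID": "archive-then-expire",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]},
    )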


Why not? If you have ever worked with any asynchronous API, you have already been introduced to the "come back later" model. Does it really matter how much later it is?


Glacier is not for data you want readily available. It's for when you care more about storage than access.


Right, but if you want to retrieve some data through their API, how does it work? Normally you open the connection, ask for the data, then receive it and close the connection; does that change if there's a 5+ hour wait between the ask and the receive? Do you just leave the connection open? Provide them with a webhook to call when it's ready? I don't personally care about the answer but I'm pretty sure that's what they were asking.


Well the CLI gives you a job ID that you can use to check on the status and retrieve when it's ready. You can also ask to be notified by SNS.

http://docs.aws.amazon.com/cli/latest/reference/glacier/init...


With Glacier you submit an "InitiateJob" request to say "Fetch me this archive". That returns you a job ID in the response.

From there you can submit a "DescribeJob" request, with that Job ID as the parameter, and the Glacier service responds with the state of the job.

Once the job is marked as complete, you submit a "GetJobOutput" request with that Job ID. That response is the archive body. (similar to how you'd do a GET request from S3).

You've got 24 hours to start the download of the archive before you'll have to repeat the entire InitiateJob->GetJobOutput cycle again.
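
In boto3 terms the whole dance looks roughly like this (a sketch; the vault name and archive ID are placeholders, and in practice you'd poll far less aggressively or use an SNS notification instead):

    import time
    import boto3

    glacier = boto3.client("glacier")

    # 1. Ask Glacier to stage the archive (InitiateJob).
    job = glacier.initiate_job(
        accountId="-",
        vaultName="my-vault",
        jobParameters={"Type": "archive-retrieval",
                       "ArchiveId": "EXAMPLE-ARCHIVE-ID"},
    )
    job_id = job["jobId"]

    # 2. Poll until the job completes (DescribeJob); this can take hours.
    while not glacier.describe_job(accountId="-", vaultName="my-vault",
                                   jobId=job_id)["Completed"]:
        time.sleep(900)

    # 3. Download the staged archive (GetJobOutput) within the 24-hour window.
    output = glacier.get_job_output(accountId="-", vaultName="my-vault",
                                    jobId=job_id)
    with open("restored-archive", "wb") as f:
        f.write(output["body"].read())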


At work our needs are simple, we manually run the aws cli to sync files up to S3 where there's a 1 day lifecycle policy to move them to Glacier. We don't use the API for restores, we do those through the web console and check back in a few hours to see if the files are downloadable.

I think through the API you do not leave the connection open, you check with whatever frequency you want and when it's ready, the response will include the temporary location on S3 for the file.


Is the outgoing bandwidth still the same price? Bandwidth cost is kind of high compared to other services.


Yes, they are high for all 3 major cloud providers. I guess it's to keep you from using the cheapest part of each provider. That way you have to use their whole ecosystem and they can offer some products below costs to lure people in.


What is the mechanism that makes it cheaper to take longer getting data out? Is it that they save money on a lower-throughput interface to the storage? Is it simply just market segmentation?


In theory, tape [1], optical [2], or spun-down disk [3] are cheaper but slower than spinning disk. Erasure coding [4] is also cheaper but slower than replication. One could even imagine putting warm data on the outer tracks of hard disks and cold data on the inner tracks. In practice I suspect Glacier is market segmentation.

[1] http://highscalability.com/blog/2014/2/3/how-google-backs-up...

[2] http://www.everspan.com/home

[3] https://www.microsoft.com/en-us/research/publication/pelican...

[4] https://code.facebook.com/posts/1433093613662262/-under-the-...


It has more to do with clever uses of erasure coding than you might think.


Pure speculation here, so I'm probably completely wrong, but I imagine it's so that they can essentially reduce the amount of seeking. Glacier uses magnetic tape storage, I think, and I believe each tape has to physically be inserted into a machine to be read, and then be removed afterward. So there would be some downtime as tapes are swapped in/out. Therefore, it would make sense to aggregate reads ahead of time and maybe even physically reorder whole tape accesses to reduce the time it takes to load them.

But this wouldn't explain why the read rate factors into cost. Maybe they scatter data across tapes as well, and higher bandwidth requires loading more tapes concurrently?

Again, total conjecture. Please let me know which parts are wrong and which are right.


It is rumored that Amazon denied that Glacier uses tape. It's still not that clear what it does use. https://storagemojo.com/2014/04/25/amazons-glacier-secret-bd... (also read the comments)


It looks like they're workload optimized drives.

https://news.ycombinator.com/item?id=4416065


Storing the data in tape systems or some other system which is automated but not available in a matter of milliseconds.


Tape is cheaper than hard drives are cheaper than flash drives are cheaper than ram.


I currently use S3 Infrequent Access buckets for some personal projects. These Glacier price reductions, along with the much better retrieval model look really great.

However using Glacier as a simple store from the command-line seems horribly convoluted:

https://docs.aws.amazon.com/cli/latest/userguide/cli-using-g...

Does anyone know of any good tooling around Glacier for the command line?


If you don't want to mess with all of that, you could use the standard "aws s3" command to upload your files to your S3 bucket like normal, and just apply an archive policy to your bucket or archive/ prefix or whatnot, and it will automatically transfer your files to Glacier for you.


Yup, works great.


Perhaps you can script in python using boto3.


Has anyone tried to migrate to Backblaze? Their pricing seems really aggressive, but I am not sure if we can compare Amazon and Backblaze when it comes to reliability.

https://www.backblaze.com/b2/cloud-storage-providers.html


I love the folks at backblaze but the single datacenter thing really worries me (and again, disclosure, I work on Google Cloud). If you're just using it as another backup, maybe that's less of a concern: your house would have to burn down at the same time that they have a catastrophic failure. But it is part of the reason you see S3 and GCS costing more (plus the massive amount of I/O you can do to either S3 or GCS; I'd be curious what happens when there's a run on the Backblaze DC).

Again, huge disclosure: I work on Google Cloud.


"But it is part of the reason you see S3 and GCS costing more"

I bet that when Backblaze increase their scale by adding data centers they will decrease prices, not increase them.

According to Ford's laws of service: Price, volume (scale) and quality are never opposed.

1. Decrease price and you can increase volume.

2. Increase volume and you can increase quality.

3. Increase quality and you can increase volume.

4. Increase volume and you can decrease price.


Sorry if I wasn't clear: your bytes on GCS and S3 are stored across multiple buildings (GCS Regional, S3 Standard). More copies is more dollars not less ;).


"More copies is more dollars not less ;)"

As far as I am aware GCS does erasure coding across sites?

Backblaze could do multiple tiers of erasure coding and they would still be able to reduce prices given more scale, ceteris paribus.

It's not a question of number of replicas, data centers or technical implementation, but a question of pricing policy.

Does one want to use volume and scale to drive prices down (and cheaper prices to increase volume) or does one want to use volume and scale to bloat margins? Backblaze are arguably doing the former.

Does one want to lock customers into an ecosystem by enforcing excessive bandwidth prices or does one want to pass on bandwidth cost-savings to customers? Backblaze are arguably doing the latter.

Backblaze would continue to be cheaper because their pricing policy serves customers across all dimensions.

More scale is definitely less dollars not more (even if it means a fraction of a few more erasure coded shards across sites).

Disclosure: I do not work for Backblaze.


Anyone else finding their S3 bill consists mostly of PUT/COPY/POST/LIST requests? Our service has a ton of data going in, very little going out, and we're sitting with 95% of the bill being P/C/P/L requests and only the remaining 5% being storage.

Either way, good news on the storage price reductions :)


I hit that scenario when we create tons of small files (e.g. < 10KB). In that use case it is often cheaper and easier to just use a database such as DynamoDB.

See my other comment; it has a link to an article about S3 cost optimizations with more detailed recommendations.


What app/site are you using to upload to S3? I use a combination of CloudBerry Backup and Arq Backup on my Macs/PCs here and the requests aren't that high (on average about 30GB of data per machine in around 300K files).

I am guessing it comes down to the algorithm used to compare and upload/download files. I believe the two solutions above use a separate 'index' file on S3 to track file compares.


It's more that we have a pretty high throughput system, using Lambda.

Users authenticate with an API Gateway endpoint; we do a PUT to store a descriptor file and send a presigned PUT URL back so they can upload their file; we then process the file, do a COPY+DELETE to move it out of the "not yet processed" stage, and finally do another PUT to upload the resulting processed file.

Despite a lot of data, the storage bill is barely scratching $40, but we're at almost $700/mo on API calls.


Heyo, that sounds not quite right. If you wanna shoot me an email at randhunt@amazon.com I'd be happy to try to figure it out. Your API calls shouldn't cost that much more than the storage without really strange access behavior. I don't know the answer off the top of my head but I'm down to try to find out. GETs cost money and outbound bandwidth costs money, but PUTs/POSTs should be negligible.


Thanks! I'll shoot an email.

Edit: Sent!


Ah, thanks for the extra info. We have several web apps that take user uploads and store them in S3 buckets here too, but we still don't see an abnormally high request load. Not sure if the handshaking involved in getting a pre-signed URL is upping your count?

We just use the AWS SDK on our Ruby back end. The user file is first uploaded to the (EC2) app server, then we use the SDK call to transfer it to the S3 bucket. Our storage and request costs are about equal at this stage.

Using Lambda/Node, I guess that the SDK is not an option and you have to use the pre-signed URL method? Or else use Python and the SDK library?


You can generate pre-signed URLs without making an API call.
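
Right, the signing happens locally with your credentials, so generating the URL itself costs nothing. Roughly, with boto3 (bucket and key are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Signed entirely client-side; no request hits S3 until the URL is used.
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "example-uploads", "Key": "incoming/report.csv"},
        ExpiresIn=3600,  # valid for one hour
    )
    print(url)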


Great discount. I'm only surprised that Infrequent Access doesn't get any discount.

By the way, I wrote an article on how to reduce S3 costs: https://www.sumologic.com/aws/s3/s3-cost-optimization/


Do we know now how Glacier actually works? Tape robots, spun-down disks, racks of optical media?

Best source I could find was: https://storagemojo.com/2014/04/25/amazons-glacier-secret-bd...


Any chance Google will match this price for their coldline storage? I was planning to archive a few TBs in Google coldline, but Glacier is now cheaper and has a sane retrieval pricing model.


> For example, retrieving 500 archives that are 1 GB each would cost 500GB x $0.01 + 500 x $0.05/1,000 = $5.25

Shouldn't that be $5.025? Or did I misunderstand?
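
Spelling out the arithmetic with the quoted rates:

    # 500 GB at $0.01/GB plus 500 requests at $0.05 per 1,000 requests
    print(500 * 0.01 + 500 * 0.05 / 1000)  # 5.025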


In our startup, the biggest cost is bandwidth. We live in an age where videos can be created and streamed in seconds to millions of people. With such a high cost for bandwidth, it's very difficult for bootstrapped startups to grow as quickly as those who raise VC funding. I hope AWS can reduce the outgoing bandwidth cost by 50%.


EDIT: My mistake, this is the new S3 pricing! NOT Glacier pricing! Thank you res0nat0r.

Am I understanding this right? $0.023/GB/month for Glacier, so * 12 months/year = $0.276/GB/year, which means:

    10GB  = $2.76/year
    100GB = $27.60/year
    1TB   = $276.00/year
    ...
And this is only the storage cost. This doesn't take into account the cost should you actually decide to retrieve the data.

So considering a 1TB hard drive [0] costs $50.00, how is this cost effective? I can buy 5x1TB hard drives for the price of 1TB on Glacier.

I understand there is overhead to managing it yourself. So, is this just not targeted to technically proficient folks?

[0] https://www.amazon.com/Blue-Cache-Desktop-Drive-WD10EZEX/dp/...


The new glacier pricing is $0.004/GB/month, those prices above look to be the new S3 pricing. Also you get 99.999999999% durability.


S3 data is backed up in three geographically separated data-centers, so you need to budget for 3 copies.

It's a running service (rather than cold storage), which means you need servers, power, network ports, etc. Again, times 3.

Finally, include the software development to build the server stack and the staff for 24x7 ops and security.

It's possible to beat their pricing but most people who actually do manage it by cutting in an area which they don't need.


That's comparing apples and oranges - your hard drive doesn't live in a secure, fireproof data center in the cloud you can access from anywhere.


Or eggs and omelettes...


I do miss my little 3 TB seagate that I rsync'ed from home to my office. That said, what if someone broke into our little office (encrypted backup of course)?

The reality though is that no business would be comfortable with the plan being "our data is replicated offsite at Steve's house". And, other than maybe a pair of NAS boxes (one at each end) the cheapness of the solution assumes you have a great network connection between the two, a machine to plug it into, and only need a single hard drive. That is, how would you do offsite, active backup of say 50 TB?

Disclosure: I work on Google Cloud (and do offsite backup to GCS Nearline, that I'll move to Coldline when I get a minute to play with our new per-object lifecycle rules).


Where are you going to keep it? On site? Not so good for disaster recovery.


It's kind of funny when Feral Hosting (I know I've been mentioning them a lot, but trust me, I'm in no way affiliated with them, I'm just a satisfied customer, though they have trouble now with OVH) offers 2TB + a 10Gbps unlimited bandwidth connection (it is unlimited, I abused it as much as I could with no warnings or anything) for 120 British pounds per year. That's good enough for me for mass storage of non-important data like movies, music, TV series, etc., that I convert then serve to users.


I Googled "Feral Hosting", clicked on the link to their web site [0], and was redirected to (presumably) their status page [1], which reads:

> tl;dr www.feralhosting.com is down, database lost, slots are up and will remain up. We're moving to the honour system for paying bills. ETA 25th November.

Not exactly a good first impression, to say the least.

[0]: https://www.feralhosting.com/

[1]: https://status.feral.io/


Living up to their name, I guess.


If costs matter to you, e.g. for home backups, don't buy Glacier (and heck don't buy S3). A 3TB drive costs about 110eur, so if you'd have to buy a new one every year (you don't) that'd cost 110/3/1000/12=0.31 cents per gigabyte per month. Glacier? 7 times more expensive at 2.3ct.

Hardware is usually not a business's main cost, but it does matter for home users, small businesses, or startups that didn't get funded yet, some of whom might consider Tarsnap or some other online storage solution which uses Glacier at best and S3 at worst. You could suddenly be 7× cheaper if you do the upkeep yourself (read: buy a Raspberry Pi) and if you throw away drives after one year.


There is value to having off-premises replicated storage on something more durable than home-user targeted drives.

Google Cloud Nearline costs $0.12 per gigabyte-year, with prices that will continue to fall. For a typical 500GB hard drive that saw perhaps 700GB of unique data, that's $84/year to have an outside-the-house replicated backup using something like Arq.


They are not the same thing at all. Glacier is the backup of the backup. It's where you go if your house burns down and the offsite backup at a relatives house is destroyed as well.

If you want to compare them, you have to buy space on a different continent, and store your backup there.


> A 3TB drive costs about 110eur, so if you'd have to buy a new one every year (you don't) that'd cost 110/3/1000/12=0.31 cents per gigabyte per month. Glacier? 7 times more expensive at 2.3ct.

Your pricing assumes that the drive is never powered.


I had such a setup when the Joplin 2011 tornado hit: http://www.ancell-ent.com/1715_Rex_Ave_127B_Joplin/images/ and I got off lightly. But the separate room my BackupPC hard drives were in was breached (see 302-2nd-bathroom-with-hole-of-unknown-origin) and those drives became fit only for Ontrack's $$$ recovery service, maybe, and one of my computers with e.g. my email was seriously damaged. The data on it was easily recovered from rsync.net's off-site Denver location, whose service I love and will continue to use for my most important and "hot" data.

LTO (-4) tape had gotten capacious and cheap enough that I went back to tape (I'd outgrown DAT); if I didn't have a big sunk cost in a well working tape system and pool of tapes, which are very easy to put in e.g. a safe deposit box (they're a bit fragile, but nothing like a hard drive), I'd already be using one of S3, Glacier, or Backblaze, maybe even GCS since suddenly and irretrievably losing access to my backup data because a bot decided I was evil would not likely coincide with a total data loss at home (Google simply cannot be trusted if you're small fry like myself, as HN has been discussing as of late).

As Glacier has gotten sane enough to use without twisting your mind into a pretzel, with the new price reduction for slow retrieval I can seriously think about adding it to the mix and switching to it when my LTO-4 tape drive dies someday (e.g. ~3TiB for ~$12/month per my quick calculation just now), instead of buying another tape drive.


How does this compare to Google's Coldline storage?


The biggest difference is that Glacier is still a "suspend/resume" type of access. However, if you just want to compare pricing, it'll depend on your access pattern and object sizes.

Retrieval in all Google Cloud Storage is instant and for Coldline is $.05/GB (and Nearline $.01/GB). If you value that instant access, it seems the closest you'd get with the updates to Glacier is via the Expedited retrieval ($.03/GB and $.01/"request" which is per "Archive" in Glacier). Then you have to decide how much throughput you want to guarantee at $100/month for each 150MB/s. (It's naturally unclear since it was just announced what kind of best-effort throughput we're talking about without the provisioned capacity).

If you're never going to touch the bytes, and each Archive is big enough to make the 40 KB of metadata negligible then the new $.004/GB/month is a nice win over Coldline's $.007. Somewhere in between and one of the bulk/batch retrieval methods might be a better fit for you.
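
To make that concrete, a toy comparison using only the per-GB rates quoted in this thread (ignoring request fees, provisioned throughput, and minimum storage durations):

    # Per-GB rates as quoted above (USD).
    glacier_storage, coldline_storage = 0.004, 0.007       # per GB-month
    glacier_expedited, coldline_retrieval = 0.03, 0.05     # per GB retrieved

    gb = 1024  # 1 TB

    # One year of storage plus one full retrieval of the archive.
    print(gb * (glacier_storage * 12 + glacier_expedited))   # ~79.9
    print(gb * (coldline_storage * 12 + coldline_retrieval)) # ~137.2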

But again, it's still a bit of a challenge to go in and out of Glacier while Coldline (and Nearline and Standard) storage in GCS is a single, uniform API. That's worth a lot to me, and our customers. But if Glacier were a good fit for a problem you have, and you're talking about enough money to make the pain worth it, you should seriously consider it.

Disclosure: I work on Google Cloud, so naturally I'd want you to use GCS ;).


Google has a "Glacier vs Nearline" calculator, which doesn't appear to have been updated yet:

https://cloud.google.com/pricing/tco/storage-nearline

(work at Google)


With S3 Standard essentially getting S3 Standard - Infrequent Access storage pricing, where does that leave the latter?


Hey, did you mean Reduced Redundancy (not Infrequent Access)?

I just noticed that the new pricing for Standard (2.3¢) is now less than the pricing for Reduced Redundancy (2.4¢)! So there appears to be no reason to use Reduced Redundancy anymore.


You're right, I mixed up after spending too much time on the different pricing pages yesterday. :)


Any EBS storage price reductions? Those are pretty high at this stage.


That's both a good and a terrible change.

- The price reduction on S3 is good! Kudos AWS.

- The price change on Glacier is a fucking disaster. They replaced the _single_ expensive Glacier fee with a choice among 3 user-selectable fee models (Standard, Expedited, Bulk). It's an absolute nightmare added on top of the current nightmare (e.g. try to understand the disk specifications & pricing; it takes months of learning).

I cannot follow the changes, too complicated. I cannot train my devs to understand glacier either, too much of a mess.

AWS if you read this: Please make your offers and your pricing simpler, NEVER more complicated.

(Even a single pricing option would be significantly better than that, even if it's more expensive.)


I wrote the post after spending all of about 15 minutes learning about this new feature. We did make things simpler.

> AWS if you read this: Please make your offers and your pricing simpler, NEVER more complicated.

I am reading this. We made them simpler. Decide how quickly you need your data, express your need in the request, and that's that. If you read the post you will see that these options encompass three distinct customer use cases.


I believe this HUGELY simplifies glacier pricing.

I don't think you understand just how insanely, incredibly, bizarrely complicated the old glacier pricing was.


There are 3 pricing models for S3, the 3rd one (Glacier) having 3 sub-pricing models to be chosen at request time.

I don't think you realize how insanely complex the entire S3 pricing model is once you get out of the "standard price".

Maybe I just have too much empathy for my poor devs and ops who try to understand how much what they're doing is gonna cost them. It's only one full page of text, both sides, after all.


Is that really that much different than hard drives? No hard drive manufacturer uses the same standards to determine how much space there will actually be on the disk. You can get hard drives that spin at many different RPMs. You can get hard drives with many different connector types. Drives with different numbers of platters. An 8TB, 7200 RPM, SATA, Western Digital drive is not going to have the same seek time as a 1TB, 7200 RPM, SATA, Western Digital drive.

There are so many combinations of hard drives that will result in different performance for different situations all with different costs. Then you start talking about cold storage as well and you've moved into other media formats.

Just because there is a page worth of a pricing model doesn't mean AWS or any cloud provider is doing anything incorrectly. You're paying for on demand X and engineers who are going to utilize that should understand it as well as they would understand how to build an appropriate storage solution of their own. On demand just means now they don't have to take the time to design, implement and operate it themselves.


I'm not saying that S3 isn't complicated, or that the pricing model, even after the changes, isn't nuts.

I'm saying that anyone who thinks this is more complicated than it was does not understand just how crazy Glacier pricing was before. Three static Glacier pricing tiers is a lot better than the previous system, which was so complex that earlier versions of the AWS pricing page just gave up and called it "free".

(Briefly: The old Glacier model's pricing wasn't based around data transfer, but on your automatically retroactively provisioned data transfer capacity based on your peak data transfer rate, billed on a sliding scale as if you'd maintained that rate for the entire month. There's probably a less intuitive way to bill people for downloading data, but if so I've never seen it. It was a system seemingly designed to prevent users from knowing what any given retrieval would cost.)


I appreciate your enthusiasm for our products, but this is a huge step forward for S3 customers. I think it's fair to say you find Glacier too complicated (it's part of the reason Coldline doesn't look like that), but to say this new setup is worse just isn't true.

I can assure you that if AWS had announced a simple flat-rate structure that was more expensive for everyone there would be plenty of unhappy existing customers. It's a tough call to balance simplicity and efficiency, by making this complex for you, they allow you to opt-in to the "screw it, give me my bytes back as fast as you can" model. You could just pretend that's the only option ;).

Again, I'm not criticizing your unhappiness with the complexity of Glacier. But I think it's only fair to recognize that the folks at AWS have just released a major improvement that provides real value for some customers.

Disclosure: I work at Google Cloud.


This is a good change. The stupid former "rate-based" system meant that if you stored a large file (say, 10GB), you'd have had to pay quite a bit of money if you tried to retrieve just that file. After all, you can't control the rate at which you retrieve one file.


I'm ignorant of exactly how Glacier's API works, but what's stopping you from reading bytes from the socket at any rate you want (for a single file)? E.g. what is stopping me from doing 128KB/s with this code:

    // Read at most roughly 128 KB per second from the response stream.
    InputStream in = socket.getInputStream();
    byte[] buffer = new byte[128 * 1024];
    while (in.read(buffer) != -1) {
        // do something with buffer here
        Thread.sleep(1000); // may throw InterruptedException
    }


The rate pertains to Glacier's "retrieval" of the file and making it available for download, not the reading of the file from your end.

(If you end up never downloading the file, you still get charged)


Simple pricing is easy to shop around. Their goal is not to make pricing simple; don't hold your breath.



