Quick tip for those using static (jekyll, hugo...) sites on s3: If you have cloudflare in front of it, you can turn on aggressive html caching by creating a page rule: example.com/* => Cache level => Cache everything
This downtime made me realize we weren't caching any html on cloudflare. I just turned it on and all our static sites are doing fine now (and our bills are smaller!).
If you're fancy you can even programmatically purge the cache when you do CI deploys using the cloudflare API.
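For example, a minimal sketch of that CI purge step in Python (the zone ID and token environment variable names are placeholders; the purge_cache endpoint is Cloudflare's v4 API):

    # Purge the whole Cloudflare cache for a zone after a deploy.
    import os
    import requests

    zone_id = os.environ["CF_ZONE_ID"]    # assumed to be provided by the CI environment
    token = os.environ["CF_API_TOKEN"]    # API token with cache purge permission

    resp = requests.post(
        f"https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache",
        headers={"Authorization": f"Bearer {token}"},
        json={"purge_everything": True},  # or {"files": [...]} for specific URLs
        timeout=30,
    )
    resp.raise_for_status()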
> Quick tip for those using static (jekyll, hugo...) sites on s3: If you have cloudflare in front of it, you can turn on aggressive html caching
> If you're fancy you can even programmatically purge the cache when you do CI deploys using the cloudflare API.
You need to be really careful with this.
I'm not sure how you've got things set up, but isn't this going to lead to issues unless you have a cache-busting or revalidation strategy? You can purge Cloudflare's cache, but that won't purge the cache on browsers that have already visited your site and cached a page. You might get cases where an old cached HTML page asks for resources that no longer exist, or pages break because the user sees the old HTML with the new CSS/JavaScript.
Also, if users can log in to your site, it can potentially cache one user's logged-in page content and share it with others if your Cache-Control settings don't include "private".
You could set things up so that the browser, and then the Cloudflare cache, always asks your server first whether a page has been updated recently (revalidation), but your server has to be configured right for this (e.g. ETag or modified-date usage).
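A rough sketch of one way to do that on the S3 side with boto3 (bucket and file names are made up): HTML gets "no-cache" so caches must revalidate against the ETag S3 returns, while fingerprinted assets can be cached aggressively.

    import boto3

    s3 = boto3.client("s3")

    # HTML: caches may store it, but must revalidate with the origin first.
    s3.put_object(
        Bucket="example-site",
        Key="index.html",
        Body=open("public/index.html", "rb"),
        ContentType="text/html",
        CacheControl="no-cache",
    )

    # Hashed asset: safe to cache for a long time because the filename changes per deploy.
    s3.put_object(
        Bucket="example-site",
        Key="assets/app.3f9c2b.css",
        Body=open("public/assets/app.3f9c2b.css", "rb"),
        ContentType="text/css",
        CacheControl="public, max-age=31536000, immutable",
    )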
Cloudflare is actually meant to have a feature that keeps a static version of your site to show in the event of outages:
You're both sort of correct. Always Online (AO) kicks in on specific 5xx error codes: 502, 504, and Cloudflare's 52x codes. It doesn't kick in on 500, 503, or 4xx errors.
Won't help if your cache is empty and Cloudflare can't access S3. I originally discovered this service outage when an image in cloudflare was returning a 500 error about 40 minutes ago.
Yes and no. In outage periods, S3 access is often degraded because throttling is greatly increased, leading to 503s. Meaning if you have that cache layer on top of it, you dodge the issue.
Where it doesn't help is if Cloudflare itself is having issues (all it does is move the weak link one layer up in this case). But that is easy enough to disable (assuming the cloudflare API or dashboard is up).
I don't think this is a good idea if you have lots of visits, as Cloudflare might ask you to upgrade to a Business account, which is very expensive. It's excellent for small static sites and blogs, but anything visited often, like the docs site for a popular project, might get dropped from CF's network. Don't quote me on that; I haven't experienced it myself, but I've heard they do ask people to upgrade. For example, Brian Fritz of omdbapi.com had to close the service to the public because of huge amounts of traffic, and that was just text, not images or video...
In my experience I can set the TTL to whatever value I want; last-mile providers don't really give a shit about that and just cache it for however long they want.
Cloudflare and all other modern CDNs do this instantly and for free. Only the legacy ones take longer (usually minutes at most) and might charge a small extra fee.
This is not true. Purges on Akamai take less than ten minutes for entire CP codes or specific objects.
Deploying new cache rules takes less than an hour in many cases.
How many POPs are you using? I can assure you, I am not wrong here, and I have internal knowledge of why, if you are truly using all POPs, it takes so long.
Look, I know what I'm talking about. Just because you can purge faster because of a different deployment doesn't mean that's the case for everyone. "Fast Purge" is not available for every type of deployment.
Thanks for replying, but unless you're willing to explain to the rest of us why so, why bother? I can't make heads or tails of this pissing contest thread.
"Site Delivery" is purged globally within 7 minutes.
"Fast purge" enabled products (site delivery) is purged globally in under 20 seconds.
All assets for websites are site delivery. Only media services family of products are not site delivery. Those purge within a couple of hours. If you are someone who uses media services products, you know how to force Akamai to go back to origin globally instantaneously and how to trigger invalidation. In media services you should never serve stale.
As I said, it depends on your deployment. Thanks for the reply and confirming what I said was correct, that under some deployments it can be a multi-hour purge.
Exactly, it doesn't. That's why you should use the actual cache management options from your CDN provider, or just wait for your (properly configured) Cache-Control headers to expire the cache (bear in mind that intermediate caching proxies don't always obey those TTLs or dates either).
I was just pointing out that some people adjust their menu links (etc.) to include some sort of dynamic variable (e.g. /?ts=xxx). I don't recommend this sort of scheme, though, except in unusual circumstances: IMO it's fragile, leaky, and inefficient.
Every time there is an outage at some cloud provider, I enjoy knowing that my site has maintained 100% availability since its launch. I run 3 redundant {name servers, web servers} pairs on 3 VPS hosted at 3 different providers on 3 different continents. Even if the individual providers are available only 98% of the time—7 days of downtime per year—my setup is expected to still provide five nines availability (details: http://blog.zorinaq.com/release-of-hablog-and-new-design/#co...)
Edit: It's not about bragging. It's not about the ROI. I want to (1) experiment & learn, and most importantly (2) show what is possible with very simple technical architectures. HN is the ideal place to "show and tell" these sorts of projects.
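For the curious, the arithmetic behind that availability claim (assuming provider failures are independent):

    # Availability of 3 redundant providers, each up 98% of the time.
    per_provider = 0.98
    all_down = (1 - per_provider) ** 3    # 0.02^3 = 8e-06
    print(f"{1 - all_down:.6%}")          # ~99.9992%, i.e. better than five nines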
I guess the part I'm confused about here are the DNS records and DNS pinning. If your zone returns 1.1.1.1 and 2.2.2.2 and 3.3.3.3 as IP addresses to use, and I'm browsing your site while resolving to 1.1.1.1 and 1.1.1.1 goes down -- your site will appear down for me, correct?
My browser won't automatically try 2.2.2.2 or 3.3.3.3, or would it?
Yes, both Chrome and Firefox would automatically try the next IP when I checked a couple of months ago. We used to use this for load balancing, returning the IPs of our servers in random order. Even when one of the servers went down we wouldn't lose any traffic, as the browser would just use the next one.
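If you want to see roughly what that fallback looks like outside a browser, here's a sketch of a client that walks all the A records for a host and tries each in turn (the hostname is a placeholder; real browser behavior is considerably more involved):

    import socket

    def connect_with_fallback(host, port=443, timeout=5):
        last_err = None
        # getaddrinfo returns one entry per A/AAAA record for the name.
        for *_, sockaddr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
            try:
                return socket.create_connection(sockaddr[:2], timeout=timeout)
            except OSError as err:    # refused, timed out, unreachable...
                last_err = err        # move on to the next address
        raise last_err

    sock = connect_with_fallback("example.com")
    print("connected to", sock.getpeername())
    sock.close()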
Not the op, but won't browsers fall back to the other IPs? In olden days some browsers would not handle multiple IPs per DNS record correctly, hopefully it works now.
It still depends, and it's a combination of two behaviors: (1) what a browser's resolver does when it receives multiple A/AAAA records, with some selecting the first unconditionally and forcing authoritatives to play spin-the-data and (2) failure behavior, both between timeouts and positive failures -- the difference between layer 3, 4, and 7 failures also comes into play. What happens if the connection resets after a single byte? What happens if a positive refusal comes back? What happens with a warm cache for the domain? etc. etc. The failure matrix explodes very quickly, and I seem to recall there being several hundred scenarios to test last time I looked into this seriously.
Last time I researched this, behavior was quite different across the board, and it's something one should test extensively when designing HA for HTTP. In some situations, the same browser on another platform will defer to the system resolver versus its own, for example, which will potentially change behavior #1 even for the same browser. Mobile is starting to perform weird tricks with TCP, too, so you really have to dig into this one to do it right. Then throw in HTTP/2 and you've magically created yourself about a decade of justifiable work ;)
Not sure, that's what I'm wondering. Sounds like a good solution if they automatically do fallback. I also wonder how browsers behave depending on how the server at a particular IP address is responding, as they can respond in different ways. (i.e. a server might respond with an error, accept a request but timeout on a response, or appear totally unreachable)
Any HA solution I've seen that attempts to reliably achieve this five nines capability relies on network-level things like virtual IPs and whatnot. And I don't consider it a five nines solution if only some customers can access it. How the browser behaves in this case could be critical depending on how your visitors use the site. I would not consider a site "up" if it's only available to some people and not others.
> And I don't consider it a five nines solution if only some customers can access it.
Well, that depends on the SLA/SLO, which is really what "nines" is speaking to. Intuitively I agree, but it can, realistically, not be the case and be "valid". Doesn't make it right. Just is.
The ROI on keeping a personal blog up at 5 9s is awful... I understand the desire from a geeky perspective, but it's only really useful for personal enjoyment of the challenge, or bragging.
That's almost certainly lower than combining reputable VPS providers geo-redundantly, and likely a much higher cost for the "convenience".
And it's not just for geeky pride. You learn the most when things break. Far too many funded startups run poorly architected apps in a single AWS AZ, unlike GP.
Off by 6 orders of magnitude: actually 10,000 rps sustained for a few hours (2,500 page hits/sec × 4 requests per page). These are my—admittedly very rare—peaks when the blog gets slashdotted.
Kudos to Amazon for sorting out their status page. The last time this happened, the status page didn't show anything until hours after the outage began. I noticed this one maybe 15 minutes ago, and 5-10 minutes later they acknowledged the problem.
On the other hand, I'm feeling a strong case of "Not this shit again". Wondering if US-East-1 is more trouble than it's worth, as these outages seem to happen mostly there.
> Us-east-1 is where all the shiny new features are released
Not true at all. It'll be among the first for newer features, in as much as the US regions usually are among the first, but it's certainly not the first place code is deployed to.
All services in AWS start out software deployments in smaller regions where the blast radius of a software bug is likely to be smaller.
It’s primarily just due to the incredible scale of the region. It’s easily twice as large as any other, so any issues relating to scale appear there first and are fixed before they affect any of the other regions.
As in, most global services run primarily within us-east-1, and have only subsystems running in each region.
For example, it's likely that the actual master data stores for IAM are solely within us-east-1, but the key data is cached and services run in each region.
Similarly, Cloudfront is theoretically a global service, but only ACM certificates set up within us-east-1 can be used with Cloudfront.
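A quick sketch of the practical consequence (the domain is a placeholder): even if your whole stack lives elsewhere, the certificate for CloudFront has to be requested in us-east-1.

    import boto3

    # Note the hard-coded region: CloudFront only picks up ACM certs from us-east-1.
    acm = boto3.client("acm", region_name="us-east-1")
    cert = acm.request_certificate(
        DomainName="cdn.example.com",
        ValidationMethod="DNS",
    )
    print(cert["CertificateArn"])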
We launched our production workload in us-west-2 (Oregon) 4 years ago for this exact reason and it was one of the best decisions we made. Our staging servers in us-east-1 have experienced all of the outages between then and now, but production hums right along. Of course we're ready to failover to us-east-1 if us-west-2 has trouble, but it's been far more stable (knock on wood).
I work for a consultancy and we pretty much always default to doing initial (MVP) deploys in Oregon these days. Feels like a no-brainer at this point.
If us-east and us-west regions are only relevant to you because of the relative reliability, then I envy your situation because latency doesn't exist in your world.
If you want to understand the impact of an S3 outage (or many other kinds) ahead of time, Gremlin (http://gremlininc.com) has built a tool to run these scenarios on your infrastructure.
Happy to answer any questions about the tool, either here or over email.
This was us! Added CloudFlare in front of every bucket with our own "cdn.example.com" DNS name and changed every reference to s3 in code. Didn't even know S3 had problems today until I saw this on HN. (note: This outage taught us to set up monitoring of the S3 assets separate from the "cdn")
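In case it helps anyone, a sketch of that lesson (hostnames are placeholders): probe the S3 origin and the CDN hostname separately, so a warm CDN cache can't hide an S3 outage.

    import requests

    checks = {
        "origin": "https://example-bucket.s3.amazonaws.com/health.txt",
        "cdn": "https://cdn.example.com/health.txt",
    }

    for name, url in checks.items():
        try:
            r = requests.get(url, timeout=5)
            print(name, r.status_code)
        except requests.RequestException as err:
            print(name, "DOWN", err)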
I've just launched https://pub.latency.at/ which gives you free Prometheus metrics for various cloud endpoints like S3. Doesn't look too bad right now though.
It's also hosted on S3 but still up and running. (The main service is independent of S3 anyway after setup)
As always, I recommend people practice Chaos Engineering [0] to minimize the impact of such events. Even if the complexity of cross-region failover is too expensive, having some sort of graceful failure is preferable to simply dying. Netflix's Simian Army [1] toolset is particularly useful here, though sadly I don't see an OSS version of Chaos Kong (which simulates the failure of an entire region).
I'm getting issues as well. Here's to hoping this issue is just a hiccup and not like the last outage.. otherwise, it may be time to seriously consider alternatives. Anyone know of a good solution for a self-hosted s3-esque service?
If you're serving a static site out of S3, put Cloudflare (ugh) in front of it and enable the feature to serve your cached content when the origin is down [1]. When S3 goes sideways, as everyone is learning, it goes down hard and you're probably not going to be able to make changes to objects, read object metadata, etc.
I would turn on bucket replication and sync everything to another region, or move your primary bucket to us-west-2 or something other than us-east-1 and replicate somewhere else.
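Roughly, enabling that with boto3 looks like the sketch below (bucket names and the IAM role ARN are placeholders; both buckets need versioning enabled, and the role needs replication permissions):

    import boto3

    s3 = boto3.client("s3")

    # Versioning is a prerequisite for replication on both sides.
    for bucket in ("example-primary", "example-replica"):
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"},
        )

    s3.put_bucket_replication(
        Bucket="example-primary",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/example-replication-role",
            "Rules": [{
                "Status": "Enabled",
                "Prefix": "",    # replicate every object
                "Destination": {"Bucket": "arn:aws:s3:::example-replica"},
            }],
        },
    )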
When serving files from s3, if you care about availability it's worth using your own subdomain rather than pointing directly at S3. You can't have a subdomain point to different buckets in different regions directly, but you can point your subdomain to a CDN and have your CDN's origin point to different buckets in different regions.
Not exactly. S3 doesn't have AZs at all, it's only split on regions. Further, a bucket can exist in only one region. You can set up cross-region replication, but you of course need to flip the bucket coordinates in all your applications to fail over. It's not nearly as easy as Multi-AZ support in things like RDS.
I meant that a bucket exists in only one region and that you have to replicate to a different bucket if you want to do anything to improve S3 availability.
True, but some internal AWS features themselves rely on S3 and apparently on us-east-1. Those are hosed right now, too. You can't even modify bucket metadata via the AWS Console or CLI tools, so things like checking permissions or logging aren't possible right now, and neither is CloudTrail.
Most of the recent issues seem to be in this particular colo (us-east-1), which is also their oldest facility. Maybe it's prudent to move resources to other, newer colos or pay for multi-zone availability.
Well, if your users are on the east coast or in London and you want to build something closer without spinning up in Europe yet, us-east-1 is your choice. So... YMMV.
Unrelated to the downtime but what do you use Airflow for? We've been considering using it for our ETL, but I'm even wondering if we could replace our Jenkins instance with it (Jenkins currently only runs cronjobs and one-off tasks; for CI we use travis).
We use airflow mainly for ETL work, backfills, and batch processing. We do a lot of work with clickstream type data, be it taking data from analytics.js and loading it into redshift, or taking analytic data from redshift and loading into Google analytics. We have developed many pipes and connectors for airflow that allow us to connect to many data sources, both at the source and sink ends. I mainly work on the DevOps team, running our infrastructure and working on back end systems, so my knowledge of airflow is more high level. I just keep the system up so it can run ;P FYI I work for a startup out of Cincinnati, OH called Astronomer, you can find us at https://www.astronomer.io
Also, we weren't all down; we just saw lots of timeout issues when reading/writing to S3.
Actually yes. Airflow has cron support baked in; when you set up a task you can give it a cron schedule, and it then takes care of running your tasks when scheduled.
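For example, a bare-bones DAG with a cron schedule looks something like this (the DAG id, schedule, and command are made up for illustration):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="nightly_cleanup",
        start_date=datetime(2017, 1, 1),
        schedule_interval="0 3 * * *",   # plain cron syntax: every day at 03:00
        catchup=False,
    )

    cleanup = BashOperator(
        task_id="cleanup_tmp",
        bash_command="find /tmp/etl -mtime +7 -delete",
        dag=dag,
    )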
As for the SSL, I messed up my link. We are in the process of moving the site, so some of the redirects are not working perfectly at the moment >.>
New update from AWS: 12:21 PM PDT We can confirm that some customers are receiving throttling errors accessing S3. We are currently investigating the root cause.
Before SSDs became common in the average computer, and given the theoretically short downtimes of AWS, telling someone to just restart their computer a few times until the website showed up might actually have "fixed" it.
I've gotten a lot of those errors while using the AWS CLI for S3, but it was intermittent. I was using the sync command, so I just kept kicking it back off again as soon as it finished, and eventually got everything up there.
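In script form, that "keep kicking it off" loop is just something like the sketch below (paths and bucket are placeholders; sync only re-uploads what's missing or changed, so re-running it is cheap):

    import subprocess
    import time

    while True:
        result = subprocess.run(["aws", "s3", "sync", "./dist", "s3://example-bucket/"])
        if result.returncode == 0:
            break
        time.sleep(10)   # back off a bit before retrying through the throttling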
"12:49 PM PDT We are now seeing recovery in the throttle error rates accessing Amazon S3. We have identified the root cause and have taken actions to prevent recurrence."
I wish that AWS would settle on a standard timezone (preferably UTC). Troubleshooting the fallout had me mentally converting between their PDT status pages and console graphs in both UTC and 'local browser' time [1]. All for a region located in EDT.
[1] I think they even have a graph somewhere whose axes are in UTC, but whose tooltips are in local browser time, but I can't recall for sure.
Definitely affecting S3 bucket ops, static site hosting on s3, and even things like CloudTrail. Anything that uses S3 East VA my guess. Rather severe impact. Maybe less focus on avocados is what's called for...
https://support.cloudflare.com/hc/en-us/articles/200172256-H...