Quick tip for those using static (jekyll, hugo...) sites on s3: If you have cloudflare in front of it, you can turn on aggressive html caching by creating a page rule: example.com/* => Cache level => Cache everything
This downtime made me realize we weren't caching any html on cloudflare. I just turned it on and all our static sites are doing fine now (and our bills are smaller!).
If you're fancy you can even programmatically purge the cache when you do CI deploys using the cloudflare API.
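For example, a minimal sketch of that CI purge step in Python (the zone ID and token environment variable names are placeholders; the purge_cache endpoint is Cloudflare's v4 API):

    # Purge the whole Cloudflare cache for a zone after a deploy.
    import os
    import requests

    zone_id = os.environ["CF_ZONE_ID"]    # assumed to be provided by the CI environment
    token = os.environ["CF_API_TOKEN"]    # API token with cache purge permission

    resp = requests.post(
        f"https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache",
        headers={"Authorization": f"Bearer {token}"},
        json={"purge_everything": True},  # or {"files": [...]} for specific URLs
        timeout=30,
    )
    resp.raise_for_status()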
> Quick tip for those using static (jekyll, hugo...) sites on s3: If you have cloudflare in front of it, you can turn on aggressive html caching
> If you're fancy you can even programmatically purge the cache when you do CI deploys using the cloudflare API.
You need to be really careful with this.
I'm not sure how you've got things set up, but isn't this going to lead to issues unless you have a cache-busting or revalidation strategy? You can purge Cloudflare's cache, but that won't purge the cache on browsers that have already visited your site and cached a page. You might get cases where an old cached HTML page asks for resources that no longer exist, or pages break because the user sees the old HTML with the new CSS/JavaScript.
Also, if users can log in to your site, it can potentially cache one user's logged-in page content and share it with others if your Cache-Control settings don't include "private".
You could set things up so that the browser, and then the Cloudflare cache, always asks your server first whether a page has been updated recently (revalidation), but your server has to be configured right for this (e.g. ETag or modified-date usage).
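A rough sketch of one way to do that on the S3 side with boto3 (bucket and file names are made up): HTML gets "no-cache" so caches must revalidate against the ETag S3 returns, while fingerprinted assets can be cached aggressively.

    import boto3

    s3 = boto3.client("s3")

    # HTML: caches may store it, but must revalidate with the origin first.
    s3.put_object(
        Bucket="example-site",
        Key="index.html",
        Body=open("public/index.html", "rb"),
        ContentType="text/html",
        CacheControl="no-cache",
    )

    # Hashed asset: safe to cache for a long time because the filename changes per deploy.
    s3.put_object(
        Bucket="example-site",
        Key="assets/app.3f9c2b.css",
        Body=open("public/assets/app.3f9c2b.css", "rb"),
        ContentType="text/css",
        CacheControl="public, max-age=31536000, immutable",
    )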
Cloudflare is actually meant to have a feature that keeps a static version of your site to show in the event of outages:
You're both sort of correct. Always Online (AO) kicks in on specific 5xx error codes: 502, 504, and Cloudflare's 52x codes. It doesn't kick in on 500, 503, or 4xx errors.
Won't help if your cache is empty and Cloudflare can't access S3. I originally discovered this service outage when an image in cloudflare was returning a 500 error about 40 minutes ago.
Yes and no. In outage periods, S3 access is often degraded because throttling is greatly increased, leading to 503s. Meaning if you have that cache layer on top of it, you dodge the issue.
Where it doesn't help is if Cloudflare itself is having issues (all it does is move the weak link one layer up in this case). But that is easy enough to disable (assuming the cloudflare API or dashboard is up).
I don't think this is a good idea if you have lots of visits, as Cloudflare might ask you to upgrade to a Business account, which is very expensive. It's excellent for small static sites and blogs, but anything visited often, like the docs site for a popular project, might get dropped from CF's network. Don't quote me on that; I haven't experienced it myself, but I've heard they do ask people to upgrade. For example, Brian Fritz of omdbapi.com had to close the service to the public because of huge amounts of traffic, and that was just text, not images or video...
In my experience I can set the TTL to whatever value I want; last-mile providers don't really give a shit about that and just cache it for however long they want.
Cloudflare and all other modern CDNs do this instantly and for free. Only the legacy ones take longer (usually minutes at most) and might charge a small extra fee.
This is not true. Purges on Akamai take less than ten minutes for entire CP codes or specific objects.
Deploying new cache rules takes less than an hour in many cases.
How many POPs are you using? I can assure you, I am not wrong here, and I have internal knowledge of why, if you are truly using all POPs, it takes so long.
Look, I know what I'm talking about. Just because you can purge faster because of a different deployment doesn't mean that's the case for everyone. "Fast Purge" is not available for every type of deployment.
Thanks for replying, but unless you're willing to explain to the rest of us why so, why bother? I can't make heads or tails of this pissing contest thread.
"Site Delivery" is purged globally within 7 minutes.
"Fast purge" enabled products (site delivery) is purged globally in under 20 seconds.
All assets for websites are site delivery. Only media services family of products are not site delivery. Those purge within a couple of hours. If you are someone who uses media services products, you know how to force Akamai to go back to origin globally instantaneously and how to trigger invalidation. In media services you should never serve stale.
As I said, it depends on your deployment. Thanks for the reply and confirming what I said was correct, that under some deployments it can be a multi-hour purge.
Exactly, it doesn't. That's why you should use the actual cache management options from your CDN provider, or just wait for your (properly configured) Cache-Control headers to expire the cache (bear in mind that intermediate caching proxies don't always obey those TTLs or dates either).
I was just pointing out that some people adjust their menu links (etc.) to include some sort of dynamic variable (e.g. /?ts=xxx). I don't recommend this sort of scheme, though, except in unusual circumstances: IMO it's fragile, leaky, and inefficient.
Every time there is an outage at some cloud provider, I enjoy knowing that my site has maintained 100% availability since its launch. I run 3 redundant {name servers, web servers} pairs on 3 VPS hosted at 3 different providers on 3 different continents. Even if the individual providers are available only 98% of the time—7 days of downtime per year—my setup is expected to still provide five nines availability (details: http://blog.zorinaq.com/release-of-hablog-and-new-design/#co...)
Edit: It's not about bragging. It's not about the ROI. I want to (1) experiment & learn, and most importantly (2) show what is possible with very simple technical architectures. HN is the ideal place to "show and tell" these sorts of projects.
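For the curious, the arithmetic behind that availability claim (assuming provider failures are independent):

    # Availability of 3 redundant providers, each up 98% of the time.
    per_provider = 0.98
    all_down = (1 - per_provider) ** 3    # 0.02^3 = 8e-06
    print(f"{1 - all_down:.6%}")          # ~99.9992%, i.e. better than five nines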
I guess the part I'm confused about here are the DNS records and DNS pinning. If your zone returns 1.1.1.1 and 2.2.2.2 and 3.3.3.3 as IP addresses to use, and I'm browsing your site while resolving to 1.1.1.1 and 1.1.1.1 goes down -- your site will appear down for me, correct?
My browser won't automatically try 2.2.2.2 or 3.3.3.3, or would it?
Yes, both Chrome and Firefox would automatically try the next IP when I checked a couple of months ago. We used to use this for load balancing, returning the IPs of our servers in random order. Even when one of the servers went down we wouldn't lose any traffic, as the browser would just use the next one.
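If you want to see roughly what that fallback looks like outside a browser, here's a sketch of a client that walks all the A records for a host and tries each in turn (the hostname is a placeholder; real browser behavior is considerably more involved):

    import socket

    def connect_with_fallback(host, port=443, timeout=5):
        last_err = None
        # getaddrinfo returns one entry per A/AAAA record for the name.
        for *_, sockaddr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
            try:
                return socket.create_connection(sockaddr[:2], timeout=timeout)
            except OSError as err:    # refused, timed out, unreachable...
                last_err = err        # move on to the next address
        raise last_err

    sock = connect_with_fallback("example.com")
    print("connected to", sock.getpeername())
    sock.close()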
Not the op, but won't browsers fall back to the other IPs? In olden days some browsers would not handle multiple IPs per DNS record correctly, hopefully it works now.
It still depends, and it's a combination of two behaviors: (1) what a browser's resolver does when it receives multiple A/AAAA records, with some selecting the first unconditionally and forcing authoritatives to play spin-the-data and (2) failure behavior, both between timeouts and positive failures -- the difference between layer 3, 4, and 7 failures also comes into play. What happens if the connection resets after a single byte? What happens if a positive refusal comes back? What happens with a warm cache for the domain? etc. etc. The failure matrix explodes very quickly, and I seem to recall there being several hundred scenarios to test last time I looked into this seriously.
Last time I researched this, behavior was quite different across the board, and it's something one should test extensively when designing HA for HTTP. In some situations, the same browser on another platform will defer to the system resolver versus its own, for example, which will potentially change behavior #1 even for the same browser. Mobile is starting to perform weird tricks with TCP, too, so you really have to dig into this one to do it right. Then throw in HTTP/2 and you've magically created yourself about a decade of justifiable work ;)
Not sure, that's what I'm wondering. Sounds like a good solution if they automatically do fallback. I also wonder how browsers behave depending on how the server at a particular IP address is responding, as they can respond in different ways. (i.e. a server might respond with an error, accept a request but timeout on a response, or appear totally unreachable)
Any HA solution I've seen that attempts to reliably achieve this five nines capability relies on network-level things like virtual IPs and whatnot. And I don't consider it a five nines solution if only some customers can access it. How the browser behaves in this case could be critical depending on how your visitors use the site. I would not consider a site "up" if it's only available to some people and not others.
> And I don't consider it a five nines solution if only some customers can access it.
Well, that depends on the SLA/SLO, which is really what "nines" is speaking to. Intuitively I agree, but it can, realistically, not be the case and be "valid". Doesn't make it right. Just is.
The ROI on keeping a personal blog up at 5 9s is awful... I understand the desire from a geeky perspective, but it's only really useful for personal enjoyment of the challenge, or bragging.
That's almost certainly lower than combining reputable VPS providers geo-redundantly, and likely a much higher cost for the "convenience".
And it's not just for geeky pride. You learn the most when things break. Far too many funded startups run poorly architected apps in a single AWS AZ, unlike GP.
Off by 6 orders of magnitude: actually 10,000 rps sustained for a few hours (2,500 page hits/sec × 4 requests per page). These are my—admittedly very rare—peaks when the blog gets slashdotted.
Kudos to Amazon for sorting out their status page. The last time this happened, the status page didn't show anything until hours after the outage began. I noticed this one maybe 15 minutes ago, and 5-10 minutes later they acknowledged the problem.
On the other hand, I'm feeling a strong case of "Not this shit again". Wondering if US-East-1 is more trouble than it's worth, as these outages seem to happen mostly there.
> Us-east-1 is where all the shiny new features are released
Not true at all. It'll be among the first for newer features, in as much as the US regions usually are among the first, but it's certainly not the first place code is deployed to.
All services in AWS start out software deployments in smaller regions where the blast radius of a software bug is likely to be smaller.
It’s primarily just due to the incredible scale of the region. It’s easily twice as large as any other, so any issues relating to scale appear there first and are fixed before they affect any of the other regions.
As in, most global services run primarily within us-east-1, and have only subsystems running in each region.
For example, it's likely that the actual master data stores for IAM are solely within us-east-1, but the key data is cached and services run in each region.
Similarly, Cloudfront is theoretically a global service, but only ACM certificates set up within us-east-1 can be used with Cloudfront.
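A quick sketch of the practical consequence (the domain is a placeholder): even if your whole stack lives elsewhere, the certificate for CloudFront has to be requested in us-east-1.

    import boto3

    # Note the hard-coded region: CloudFront only picks up ACM certs from us-east-1.
    acm = boto3.client("acm", region_name="us-east-1")
    cert = acm.request_certificate(
        DomainName="cdn.example.com",
        ValidationMethod="DNS",
    )
    print(cert["CertificateArn"])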
We launched our production workload in us-west-2 (Oregon) 4 years ago for this exact reason and it was one of the best decisions we made. Our staging servers in us-east-1 have experienced all of the outages between then and now, but production hums right along. Of course we're ready to failover to us-east-1 if us-west-2 has trouble, but it's been far more stable (knock on wood).
I work for a consultancy and we pretty much always default to doing initial (MVP) deploys in Oregon these days. Feels like a no-brainer at this point.
If us-east and us-west regions are only relevant to you because of the relative reliability, then I envy your situation because latency doesn't exist in your world.
If you want to understand the impact of an S3 outage (or many other kinds) ahead of time, Gremlin (http://gremlininc.com) has built a tool to run these scenarios on your infrastructure.
Happy to answer any questions about the tool, either here or over email.
This was us! Added CloudFlare in front of every bucket with our own "cdn.example.com" DNS name and changed every reference to s3 in code. Didn't even know S3 had problems today until I saw this on HN. (note: This outage taught us to set up monitoring of the S3 assets separate from the "cdn")
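In case it helps anyone, a sketch of that lesson (hostnames are placeholders): probe the S3 origin and the CDN hostname separately, so a warm CDN cache can't hide an S3 outage.

    import requests

    checks = {
        "origin": "https://example-bucket.s3.amazonaws.com/health.txt",
        "cdn": "https://cdn.example.com/health.txt",
    }

    for name, url in checks.items():
        try:
            r = requests.get(url, timeout=5)
            print(name, r.status_code)
        except requests.RequestException as err:
            print(name, "DOWN", err)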
I've just launched https://pub.latency.at/ which gives you free Prometheus metrics for various cloud endpoints like S3. Doesn't look too bad right now though.
It's also hosted on S3 but still up and running. (The main service is independent of S3 anyway after setup)
As always, I recommend people practice Chaos Engineering [0] to minimize the impact of such events. Even if the complexity of cross-region failover is too expensive, having some sort of graceful failure is preferable to simply dying. Netflix's Simian Army [1] toolset is particularly useful here, though sadly I don't see an OSS version of Chaos Kong (which simulates the failure of an entire region).
I'm getting issues as well. Here's to hoping this issue is just a hiccup and not like the last outage.. otherwise, it may be time to seriously consider alternatives. Anyone know of a good solution for a self-hosted s3-esque service?
If you're serving a static site out of S3, put Cloudflare (ugh) in front of it and enable the feature to serve your cached content when the origin is down [1]. When S3 goes sideways, as everyone is learning, it goes down hard and you're probably not going to be able to make changes to objects, read object metadata, etc.
I would turn on bucket replication and sync everything to another region, or move your primary bucket to us-west-2 or something other than us-east-1 and replicate somewhere else.
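Roughly, enabling that with boto3 looks like the sketch below (bucket names and the IAM role ARN are placeholders; both buckets need versioning enabled, and the role needs replication permissions):

    import boto3

    s3 = boto3.client("s3")

    # Versioning is a prerequisite for replication on both sides.
    for bucket in ("example-primary", "example-replica"):
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"},
        )

    s3.put_bucket_replication(
        Bucket="example-primary",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/example-replication-role",
            "Rules": [{
                "Status": "Enabled",
                "Prefix": "",    # replicate every object
                "Destination": {"Bucket": "arn:aws:s3:::example-replica"},
            }],
        },
    )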
When serving files from s3, if you care about availability it's worth using your own subdomain rather than pointing directly at S3. You can't have a subdomain point to different buckets in different regions directly, but you can point your subdomain to a CDN and have your CDN's origin point to different buckets in different regions.
Not exactly. S3 doesn't have AZs at all, it's only split on regions. Further, a bucket can exist in only one region. You can set up cross-region replication, but you of course need to flip the bucket coordinates in all your applications to fail over. It's not nearly as easy as Multi-AZ support in things like RDS.
I meant that a bucket exists in only one region and that you have to replicate to a different bucket if you want to do anything to improve S3 availability.
True, but some internal AWS features themselves rely on S3 and apparently on us-east-1. Those are hosed right now, too. You can't even modify bucket metadata via the AWS Console or CLI tools, so things like checking permissions or logging aren't possible right now, and neither is CloudTrail.
Most of the recent issues seem to be in this particular colo (us-east-1), which is also their oldest facility. Maybe it's prudent to move resources to other, newer colos or pay for multi-zone availability.
Well, if your users are on the east coast or in London and you want to build something closer without spinning up in Europe yet, us-east-1 is your choice. So... YMMV.
Unrelated to the downtime but what do you use Airflow for? We've been considering using it for our ETL, but I'm even wondering if we could replace our Jenkins instance with it (Jenkins currently only runs cronjobs and one-off tasks; for CI we use travis).
We use airflow mainly for ETL work, backfills, and batch processing. We do a lot of work with clickstream type data, be it taking data from analytics.js and loading it into redshift, or taking analytic data from redshift and loading into Google analytics. We have developed many pipes and connectors for airflow that allow us to connect to many data sources, both at the source and sink ends. I mainly work on the DevOps team, running our infrastructure and working on back end systems, so my knowledge of airflow is more high level. I just keep the system up so it can run ;P FYI I work for a startup out of Cincinnati, OH called Astronomer, you can find us at https://www.astronomer.io
Also, we weren't all down; we just saw lots of timeout issues when reading/writing to S3.
Actually yes. Airflow has cron support baked in; when you set up a task you can give it a cron schedule, and it then takes care of running your tasks when scheduled.
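For example, a bare-bones DAG with a cron schedule looks something like this (the DAG id, schedule, and command are made up for illustration):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="nightly_cleanup",
        start_date=datetime(2017, 1, 1),
        schedule_interval="0 3 * * *",   # plain cron syntax: every day at 03:00
        catchup=False,
    )

    cleanup = BashOperator(
        task_id="cleanup_tmp",
        bash_command="find /tmp/etl -mtime +7 -delete",
        dag=dag,
    )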
As for the SSL, I messed up my link. We are in the process of moving the site, so some of the redirects are not working perfectly at the moment >.>
New update from AWS: 12:21 PM PDT We can confirm that some customers are receiving throttling errors accessing S3. We are currently investigating the root cause.
Before SSDs became common in the average computer, and given the theoretically short downtimes of AWS, telling someone to just restart their computer a few times until the website showed up might actually have "fixed" it.
I've gotten a lot of those errors while using the AWS CLI for S3, but it was intermittent. I was using the sync command, so I just kept kicking it back off again as soon as it finished, and eventually got everything up there.
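In script form, that "keep kicking it off" loop is just something like the sketch below (paths and bucket are placeholders; sync only re-uploads what's missing or changed, so re-running it is cheap):

    import subprocess
    import time

    while True:
        result = subprocess.run(["aws", "s3", "sync", "./dist", "s3://example-bucket/"])
        if result.returncode == 0:
            break
        time.sleep(10)   # back off a bit before retrying through the throttling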
"12:49 PM PDT We are now seeing recovery in the throttle error rates accessing Amazon S3. We have identified the root cause and have taken actions to prevent recurrence."
I wish that AWS would settle on a standard timezone (preferably UTC). Troubleshooting the fallout had me mentally converting between their PDT status pages and console graphs in both UTC and 'local browser' time [1]. All for a region located in EDT.
[1] I think they even have a graph somewhere whose axes are in UTC, but whose tooltips are in local browser time, but I can't recall for sure.
Definitely affecting S3 bucket ops, static site hosting on s3, and even things like CloudTrail. Anything that uses S3 East VA my guess. Rather severe impact. Maybe less focus on avocados is what's called for...
https://support.cloudflare.com/hc/en-us/articles/200172256-H...