Sysadmin: I can forgive outages, but falsely reporting 'up' when you're obviously down is a heinous transgression.

Somewhere a sysadmin is having to explain to a mildly technical manager that AWS services are down and affecting business critical services. That manager will be chewing out the tech because the status site shows everything is green. Dishonest metrics are worse than bad metrics for this exact reason.

Any sysadmin who wasn't born yesterday knows that service metrics are gamed relentlessly by providers. Bluntly, there aren't many of us, and we talk. Message to all providers: sysadmins losing confidence in your outage reporting has a larger impact than you think. Because we will be the ones called on the carpet to explain why <services> are down when <provider> is lying about being up.




People were joking about this but it turns out to be true: they host the status icons on their service: https://twitter.com/awscloud/status/836656664635846656


Due to HN's flaky Cloudflare 503 errors, I noticed that Cloudflare is also affected by S3 being down, in a similar but subtler way. See the broken logo in the upper left-hand corner of their status page.[1] It was actually linking directly to an S3 URL: https://s3.amazonaws.com/statuspage-production/pages-transac...

[1]: https://www.cloudflarestatus.com/


https://www.cloudflarestatus.com/ http://www.trellostatus.com/

Looks like they are both using the same solution for their status pages. The icon on trellostatus also failed to display.


Was HN affected by Cloudbleed?

I would rather access HN without Cloudflare as man-in-the-middle, especially over HTTPS.


All websites that use Cloudflare were potentially affected by Cloudbleed, which is what makes it such a terrible thing.

You will never know the exact damage; the only thing you can do to play it 100% safe is to rotate all credentials on sites that use Cloudflare.

And you can't access HN without going through Cloudflare (unfortunately, but HN is having a hard enough time keeping up with traffic as it is; without Cloudflare it would perform a lot worse than it does).


Hahaha too bad.

Google Analytics, Cloudflare, AWS, those are things you can never escape from.


Disagree on Google Analytics.


EDIT: Misread parent.


Because your status page shouldn't depend on your own infrastructure. Literally the problem Amazon is having right now.


I guess they don't want to host their status page on their own CDN in case it went down too.


But only the logo image hosted on S3 is broken. That seems preventable if they hosted the logo image together with their status page.


Saw that too, sounds like a convenient excuse for being caught in a lie.

AWS Employee #1: Hey, people are catching on that our status page isn't accurate

AWS Employee #2: Tell them it's cause of S3


I'd suspect the humiliation of hosting your own status page on the infrastructure it's monitoring would far outweigh the "lie".


People are much more forgiving of mistakes than they are of deception.


The icons aren't hosted there (or if they are, they are cached). https://status.aws.amazon.com/images/status3.gif

The status information is hosted there.



The "red check mark is stored on S3" may have been sarcasm, but apparently there was a kernel of truth to it?!

Poor show when a service disruption means the status page can't be updated....


While the status icon being hosted on S3 is funny, I think it's more likely that the icon itself isn't what kept the status page from updating, but rather that the service information (say, a JSON file) used to generate the status page was stored on S3. The banner can probably be configured locally, so they chose to update that for the time being (e.g. while moving the status bucket somewhere else).


> Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard.


I like how HN (and others) handle this - there should be a static link to a 3rd party source, like a twitter feed, at the top of any status page.


If it's that simple, why didn't the text description under Details reflect the incident either?


Is there any service that distributes your files to multiple cloud services at the same time? With this recent S3 outage, I'm now feeling uneasy to store files on S3 for mission critical apps.


The outage was on us-east-1. If you are hosting mission-critical files in a single region, S3 is not the problem.


To be fair, most people don't use the region replication thing for S3. Of course this is why I push folks to use GCS's default Multi-Regional buckets, because when a regional outage does occur, it's awfully nice to at least have the option to spin up elsewhere. If your critical data is in Virginia today, all you can do is wait.

Disclosure: I work on Google Cloud, and we've had our fair share of outages.
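
For anyone who hasn't set it up, here is a minimal sketch of what turning on cross-region replication looks like with boto3 (the bucket names and IAM role ARN below are made up, and the destination bucket and role have to exist already):

    import boto3

    s3 = boto3.client("s3")

    # Replication requires versioning on both the source and destination buckets.
    s3.put_bucket_versioning(
        Bucket="my-critical-data",
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Replicate new objects into a bucket in a different region
    # (existing objects are not copied retroactively).
    s3.put_bucket_replication(
        Bucket="my-critical-data",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
            "Rules": [
                {
                    "ID": "replicate-everything",
                    "Prefix": "",  # empty prefix = all objects
                    "Status": "Enabled",
                    "Destination": {"Bucket": "arn:aws:s3:::my-critical-data-replica"},
                }
            ],
        },
    )

The replica costs you extra storage and transfer, which is presumably why so many people skip it right up until the day the primary region goes away.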


I believe that IPFS together with Filecoin is intended to be something like this in a broader, free market sense. Unfortunately IPFS is probably far from ready for mission critical apps and Filecoin hasn't launched at all.


ins3ption


It's unbelievable that the status page is still showing green checkmarks, almost what, 2 hours into the outage?

edit: oh, it is actually because of the outage! So if they can't get a fresh read on the service status from s3, they just optimistically assume it's green... even though the service failing to provide said read... is one of the services they're optimistically showing as green XD
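
To make the failure mode concrete, here's a toy sketch (not AWS's actual dashboard code, just an illustration, and the URL is made up) of the difference between failing open and failing closed when the status read itself errors out:

    import json
    import urllib.request

    # Hypothetical location of the JSON the dashboard is generated from.
    STATUS_URL = "https://example-status-bucket.s3.amazonaws.com/status.json"

    def service_status():
        try:
            with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
                return json.load(resp)["status"]  # e.g. "green" / "yellow" / "red"
        except Exception:
            # Fail open: pretend everything is fine (what the dashboard appears to do).
            # return "green"
            # Fail closed: admit you don't know; "unknown" is itself useful information.
            return "unknown"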


Hey, they can't change it due to the S3 issue. See their twitter post: https://twitter.com/awscloud/status/836656664635846656


Wow, so they currently can't update the S3 status page due to the S3 issues, because the processing that updates the status page itself runs on S3.

That raises many more questions about how accurately outages have been accounted for and reported in the past. It also highlights a design question: if you run things in the cloud, what fallback do you have when the cloud itself goes wrong? The impact of this outage is certainly going to echo for a while, with many questions being asked.


Smells like BS. As pointed out in https://news.ycombinator.com/item?id=13757284, the text should have reflected the real situation. So either the icons are not the only problem, or they are just an excuse.


It seems status pages should be on entirely independent infrastructure, given the criticality of the information they provide. Perhaps even on a separate domain.


Same related flaw as Three Mile Island. Fail closed and measure the output, not the intent.


> Because we [sysadmins] will be the ones called on the carpet to explain why <services> are down when <provider> is lying about being up.

But isn't that the whole point of lying? To the less technical manager (often the only person whose view matters at major customers), the status board saying "up" means the problem is the sysadmins, not the vendor.


That works in the vendor's favor in the short term, but can screw them in the long term because you get staff who go the extra mile to avoid the vendor in the future, including structuring requirements to avoid them.

For example, by experience and gossip I know Windstream has awful reliability, but they handwave that away. By including a requirement I knew they couldn't meet (dynamic E911), they were knocked out of a 200-site VoIP RFP early.


Hurry, look now, so you can tell your grandchildren!!!

Greenish ELB, RDS.

Yellow EC2, Lambda.

Red S3, Auto Scaling.

EDIT: A few dozen services in us-east-1 are down/degraded.


> but falsely reporting 'up' when you're obviously down is a heinous transgression.

When SLAs are in play, and so are job performance scores and bonuses, there is probably a strong incentive to fudge the numbers. It can be done officially ("Ah, but sub-chapter 3 of chapter X in the fine print explains this wasn't technically an outage") or unofficially.


When I worked in Antarctica any outage affecting users that lasted over 50 minutes was considered an official "outage" and had to be reported to mission command. So of course ALL maintenance was rolled back/backed out if it came anywhere even close to 50 minutes, just so we wouldn't have to fill out the stupid outage paperwork.


Thank you for the insight. Could you and/or any sysadmin on here elaborate on what a "nail in the coffin" situation might look like? For example, is this current outage with inaccurate status updates enough to seriously consider migrating to another CDN provider? If so, which one would you migrate to?


Disclaimer, not a job-toting sysadmin quite yet, but here's my 2¢:

- Architectural SPOFs (single points of failure) need to be carefully weighed up in any design, and "ALL our files are on $single_provider" is one such huge red flag. Unfortunately these considerations are all too frequently drowned out by the ease of taking the path of least resistance.

For example GitHub occasionally goes down, which breaks a remarkable amount of infrastructure: a huge number of people don't know how to use Git, do full clones from scratch each time, and have no idea how to work without a server (even though Git is built to work locally); CI systems tend to want to do green-field rebuilds, so start out with empty directory trees and need to do full clones each build (I'm not sure if any CI systems come with out-of-the-box Git caching); GH-powered authentication systems fall apart; etc. Kinda crazy, scary and really annoying, but yeah.

In terms of "nail in the coffin", that depends on a lot of factors, including a subjective analysis of how much local catastrophe was caused by the incident; subjective opinions about the provider's reaction to the issue, what they'll do to mitigate it, perhaps how transparent they are about it; etc.

Ultimately, the Internet likes to pretend that AWS and cloud computing are basically rock-solid. Unfortunately they're not, and stuff goes down. There were some truly redundant architecture experiments in the 80s (for example, the Tandem NonStop computer, one of which was recently noted to have been running continuously for 24 years: https://news.ycombinator.com/item?id=13514909) but x86 never really went there, and hyperscale computing is built on a sped-up version of the same ideas that connect desktop computers together, so while there are lots of architectural optical illusions, well, stuff falls apart.

- Everyone in this thread is talking about Google Compute Engine, but it really depends on your usage patterns and requirements. GCE is pretty much the single major competitor to AWS, although the infrastructure is _completely_ different - different tools, different APIs, different pricing structure. The problem is that it's not like MySQL vs PostgreSQL or Ubuntu vs Debian; it's like SQL vs Redis, or Linux vs BSD. Both work great, but you basically have to do twice the integration work, and map things manually. With this said, if you don't have particularly high resource usage, VPS or dedicated hosting may actually work out to be more cost-effective.

TL;DR: you go back to the SPOF problem, where _you_ have to foot the technical debt for the reliability level you want. Yay.


This is why I always set up my own monitoring for services in addition to the provider's status page. Simple SmokePing graphs have saved me a ton of time when it comes to troubleshooting provider outages. It especially helps when I can show them exactly when there are problems.
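
Even a dumb probe like this (not SmokePing, obviously; the target URL is just an example), run from cron and logged somewhere outside the provider, gives you your own timeline to point at:

    import time
    import urllib.request

    # Any object you expect to always be fetchable; the bucket/key here are made up.
    TARGET = "https://s3.amazonaws.com/my-bucket/healthcheck.txt"

    def probe():
        start = time.time()
        try:
            with urllib.request.urlopen(TARGET, timeout=10) as resp:
                result = str(resp.status)
        except Exception as exc:
            result = repr(exc)
        # One line per probe; graph it or grep it when the vendor insists everything is green.
        with open("/var/log/s3-probe.log", "a") as log:
            log.write("%s %s %.3fs\n" % (time.strftime("%Y-%m-%dT%H:%M:%S"),
                                         result, time.time() - start))

    if __name__ == "__main__":
        probe()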


Why is it always the manager that is the bad guy in these scenarios? Haven't we grown up yet?


The manager is not the bad guy. They are doing everything they should do in the scenario I presented: checking into an outage affecting a critical system, and questioning the sysadmin's findings based on the evidence that Amazon's status page disagrees. I don't expect a non-technical party to believe me over Amazon.

The bad guys are the providers who report false positives to preserve metrics.


Just commenting here because hopefully people can see: AWS status page updated: 1:44 CST


It's not a lie, it's an "alternative fact" about how totally like awesome AWS is!



