Sysadmin: I can forgive outages, but falsely reporting 'up' when you're obviously down is a heinous transgression.

Somewhere a sysadmin is having to explain to a mildly technical manager that AWS services are down and affecting business critical services. That manager will be chewing out the tech because the status site shows everything is green. Dishonest metrics are worse than bad metrics for this exact reason.

Any sysadmin who wasn't born yesterday knows that service metrics are gamed relentlessly by providers. Bluntly, there aren't many of us, and we talk. Message to all providers: sysadmins losing confidence in your outage reporting has a larger impact than you think. Because we will be the ones called on the carpet to explain why <services> are down when <provider> is lying about being up.




People were joking about this but it turns out to be true: they host the status icons on their service: https://twitter.com/awscloud/status/836656664635846656


Due to HN's flaky Cloudflare 503 errors, I noticed that Cloudflare is also affected by S3 being down, in a similar but subtler way. See the broken logo in the upper left-hand corner of their status page.[1] It was actually linking directly to an S3 URL: https://s3.amazonaws.com/statuspage-production/pages-transac...

[1]: https://www.cloudflarestatus.com/


https://www.cloudflarestatus.com/ http://www.trellostatus.com/

Looks like they are both using the same solution for their status pages. The icon on trellostatus also failed to display.


Was HN affected by Cloudbleed?

I would rather access HN without Cloudflare as man-in-the-middle, especially over HTTPS.


All websites that use Cloudflare were potentially affected by Cloudbleed, which is what makes it such a terrible thing.

You will never know the exact damage; the only thing you can do to play it 100% safe is to rotate all credentials on sites that use Cloudflare.

And you can't access HN without going through Cloudflare (unfortunately, but HN is having a hard enough time keeping up with traffic as it is; without Cloudflare it would perform a lot worse than it does).


Hahaha too bad.

Google Analytics, Cloudflare, AWS, those are things you can never escape from.


Disagree on Google Analytics.


EDIT: Misread parent.


Because your status page shouldn't depend on your own infrastructure. Literally the problem Amazon is having right now.


I guess they don't want to host their status page on their own CDN in case it went down too.


But only the logo image hosted on S3 is broken. That seems preventable if they hosted the logo image together with their status page.


Saw that too, sounds like a convenient excuse for being caught in a lie.

AWS Employee #1: Hey, people are catching on that our status page isn't accurate

AWS Employee #2: Tell them it's cause of S3


I'd suspect the humiliation of hosting your own status page on the infrastructure it's monitoring would far outweigh the "lie".


People are much more forgiving of mistakes than they are of deception.


The icons aren't hosted there (or if they are, they are cached). https://status.aws.amazon.com/images/status3.gif

The status information is hosted there.



The "red check mark is stored on S3" may have been sarcasm, but apparently there was a kernel of truth to it?!

Poor show when a service disruption means the status page can't be updated....


While the status icon being hosted on S3 is funny, I think it's more likely that the icon itself isn't what kept the status page from updating, but rather that the service information (say, a JSON file) used to generate the status page was stored on S3. The banner can probably be configured locally, so they chose to update that for the time being (e.g. while moving the status bucket somewhere else).


> Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard.


I like how HN (and others) handle this - there should be a static link to a 3rd party source, like a twitter feed, at the top of any status page.


If it's that simple, why didn't the text description under Details reflect the incident either?


Is there any service that distributes your files to multiple cloud services at the same time? With this recent S3 outage, I'm now feeling uneasy to store files on S3 for mission critical apps.


The outage was on us-east-1. If you are hosting mission-critical files in a single region, S3 is not the problem.


To be fair, most people don't use the region replication thing for S3. Of course this is why I push folks to use GCS's default Multi-Regional buckets, because when a regional outage does occur, it's awfully nice to at least have the option to spin up elsewhere. If your critical data is in Virginia today, all you can do is wait.

Disclosure: I work on Google Cloud, and we've had our fair share of outages.
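
For anyone who hasn't set it up, here is a minimal sketch of what turning on cross-region replication looks like with boto3 (the bucket names and IAM role ARN below are made up, and the destination bucket and role have to exist already):

    import boto3

    s3 = boto3.client("s3")

    # Replication requires versioning on both the source and destination buckets.
    s3.put_bucket_versioning(
        Bucket="my-critical-data",
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Replicate new objects into a bucket in a different region
    # (existing objects are not copied retroactively).
    s3.put_bucket_replication(
        Bucket="my-critical-data",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
            "Rules": [
                {
                    "ID": "replicate-everything",
                    "Prefix": "",  # empty prefix = all objects
                    "Status": "Enabled",
                    "Destination": {"Bucket": "arn:aws:s3:::my-critical-data-replica"},
                }
            ],
        },
    )

The replica costs you extra storage and transfer, which is presumably why so many people skip it right up until the day the primary region goes away.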


I believe that IPFS together with Filecoin is intended to be something like this in a broader, free market sense. Unfortunately IPFS is probably far from ready for mission critical apps and Filecoin hasn't launched at all.


ins3ption


It's unbelievable that the status page is still showing green checkmarks, almost what, 2 hours into the outage?

edit: oh, it is actually because of the outage! So if they can't get a fresh read on the service status from s3, they just optimistically assume it's green... even though the service failing to provide said read... is one of the services they're optimistically showing as green XD
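
To make the failure mode concrete, here's a toy sketch (not AWS's actual dashboard code, just an illustration, and the URL is made up) of the difference between failing open and failing closed when the status read itself errors out:

    import json
    import urllib.request

    # Hypothetical location of the JSON the dashboard is generated from.
    STATUS_URL = "https://example-status-bucket.s3.amazonaws.com/status.json"

    def service_status():
        try:
            with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
                return json.load(resp)["status"]  # e.g. "green" / "yellow" / "red"
        except Exception:
            # Fail open: pretend everything is fine (what the dashboard appears to do).
            # return "green"
            # Fail closed: admit you don't know; "unknown" is itself useful information.
            return "unknown"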


Hey, they can't change it due to the S3 issue. See their twitter post: https://twitter.com/awscloud/status/836656664635846656


Wow, so they currently can't update the S3 status page due to the S3 issues, because the processing that updates the status page itself runs on S3.

That raises many more questions about how accurately outages have been accounted for and reported in the past. It also highlights a design question: if you run things in the cloud, what fallback do you have when the cloud itself goes wrong? The impact of this outage is certainly going to echo for a while, with many questions being asked.


Smells like BS. As pointed out in https://news.ycombinator.com/item?id=13757284, the text should have reflected the real situation. So either the icons are not the only problem, or they are just an excuse.


It seems status pages should be on entirely independent infrastructure, given the criticality of the information they provide. Perhaps even on a separate domain.


Same related flaw as Three Mile Island. Fail closed and measure the output, not the intent.


> Because we [sysadmins] will be the ones called on the carpet to explain why <services> are down when <provider> is lying about being up.

But isn't that the whole point of lying? To the less technical manager (often the only person whose view matters at major customers), the status board saying "up" means the problem is the sysadmins, not the vendor.


That works in the vendor's favor in the short term, but can screw them in the long term because you get staff who go the extra mile to avoid the vendor in the future, including structuring requirements to avoid them.

For example, by experience and gossip I know Windstream has awful reliability, but they handwave that away. By including a requirement I knew they couldn't meet (dynamic E911), they were knocked out of a 200-site VoIP RFP early.


Hurry, look now, so you can tell your grandchildren!!!

Greenish ELB, RDS.

Yellow EC2, Lambda.

Red S3, Auto Scaling.

EDIT: A few dozen services in us-east-1 are down/degraded.


> but falsely reporting 'up' when you're obviously down is a heinous transgression.

When SLAs are in play, and so are job performance scores and bonuses, there is probably a strong incentive to fudge the numbers. It can be done officially ("Ah, but sub-chapter 3 of chapter X in the fine print explains this wasn't technically an outage") or unofficially.


When I worked in Antarctica any outage affecting users that lasted over 50 minutes was considered an official "outage" and had to be reported to mission command. So of course ALL maintenance was rolled back/backed out if it came anywhere even close to 50 minutes, just so we wouldn't have to fill out the stupid outage paperwork.


Thank you for the insight. Could you and/or any sysadmin on here elaborate on what a "nail in the coffin" situation might look like? For example, is this current outage with inaccurate status updates enough to seriously consider migrating to another CDN provider? If so, which one would you migrate to?


Disclaimer, not a job-toting sysadmin quite yet, but here's my 2¢:

- Architectural SPOFs (single points of failure) need to be carefully weighed up in any design, and "ALL our files are on $single_provider" is one such huge red flag. Unfortunately these considerations are all too frequently drowned out by the ease of taking the path of least resistance.

For example GitHub occasionally goes down, which breaks a remarkable amount of infrastructure: a huge number of people don't know how to use Git, do full clones from scratch each time, and have no idea how to work without a server (even though Git is built to work locally); CI systems tend to want to do green-field rebuilds, so start out with empty directory trees and need to do full clones each build (I'm not sure if any CI systems come with out-of-the-box Git caching); GH-powered authentication systems fall apart; etc. Kinda crazy, scary and really annoying, but yeah.

In terms of "nail in the coffin", that depends on a lot of factors, including a subjective analysis of how much local catastrophe was caused by the incident; subjective opinions about the provider's reaction to the issue, what they'll do to mitigate it, perhaps how transparent they are about it; etc.

Ultimately, the Internet likes to pretend that AWS and cloud computing are basically rock-solid. Unfortunately they're not, and stuff goes down. There were some truly redundant architecture experiments in the 80s (for example, the Tandem NonStop computer, one of which was recently noted to have been running continuously for 24 years: https://news.ycombinator.com/item?id=13514909) but x86 never really went there, and hyperscale computing is built on a sped-up version of the same ideas that connect desktop computers together, so while there are lots of architectural optical illusions, well, stuff falls apart.

- Everyone in this thread is talking about Google Compute Engine, but it really depends on your usage patterns and requirements. GCE is pretty much the single major competitor to AWS, although the infrastructure is _completely_ different - different tools, different APIs, different pricing structure. The problem is that it's not like MySQL vs PostgreSQL or Ubuntu vs Debian; it's like SQL vs Redis, or Linux vs BSD. Both work great, but you basically have to do twice the integration work, and map things manually. With this said, if you don't have particularly high resource usage, VPS or dedicated hosting may actually work out to be more cost-effective.

TL;DR: you go back to the SPOF problem, where _you_ have to foot the technical debt for the reliability level you want. Yay.


This is why I always set up my own monitoring for services in addition to the provider's status page. Simple SmokePing graphs have saved me a ton of time when it comes to troubleshooting provider outages. It especially helps when I can show them exactly when there are problems.
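
Even a dumb probe like this (not SmokePing, obviously; the target URL is just an example), run from cron and logged somewhere outside the provider, gives you your own timeline to point at:

    import time
    import urllib.request

    # Any object you expect to always be fetchable; the bucket/key here are made up.
    TARGET = "https://s3.amazonaws.com/my-bucket/healthcheck.txt"

    def probe():
        start = time.time()
        try:
            with urllib.request.urlopen(TARGET, timeout=10) as resp:
                result = str(resp.status)
        except Exception as exc:
            result = repr(exc)
        # One line per probe; graph it or grep it when the vendor insists everything is green.
        with open("/var/log/s3-probe.log", "a") as log:
            log.write("%s %s %.3fs\n" % (time.strftime("%Y-%m-%dT%H:%M:%S"),
                                         result, time.time() - start))

    if __name__ == "__main__":
        probe()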


Why is it always the manager that is the bad guy in these scenarios? Haven't we grown up yet?


The manager is not the bad guy. They are doing everything they should do in the scenario I presented: checking into an outage affecting a critical system, and questioning the sysadmin's findings based on the evidence that Amazon's status page disagrees. I don't expect a non-technical party to believe me over Amazon.

The bad guys are the providers who report false positives to preserve metrics.


Just commenting here because hopefully people can see: AWS status page updated: 1:44 CST


It's not a lie, it's an "alternative fact" about how totally like awesome AWS is!



