A couple of outages that affected my client I learned about first on HN. We were able to mobilize a team and get on top of them faster than any monitoring team at the client (a state government). I feel like HN should invoice us, haha.
Even when it's a false alarm, it's usually because something else is having a problem that affects many people and manifests itself as a particular service being down.
As always, a note to infrastructure providers: HOST YOUR LIVE STATUS DETAILS ON OTHER INFRASTRUCTURE.
Of course, they won't. If they host it on someone else's infrastructure, that might look bad (tacitly saying that a competitor is reliable and might be up when they are down), and if they hive off an extra copy of some of their own infrastructure, there will still be single points of failure, either accidentally, by human error (someone somehow messing up both segments at once), or by design (possibly management trying to save pennies when they notice this extra bit of infrastructure on the balance sheet).
I agree with your lead statement and argued as such, but was overruled. Last I knew and understood, Heroku Status is static pages pushed out to Fastly, with the internal admin site (that does that work) running in a Heroku Private Space. If you look at the DNS, it still appears to be served by Fastly, and Heroku Private Spaces are generally pretty isolated infra, so I would be curious what the failure mode was here. But ultimately this is the fire you play with when you self-host your status site...
I am pretty sure Heroku has a Linode box or something for their status page. I may be wrong though. They may also have moved it but that would seem like an incredibly dumb idea.
I worked there and I vaguely remember something like this but it's been a long time.
If I had to guess this is probably a DNS issue (it's always DNS).
Ahh... were you asleep in 2021? They went down in September, November, and December, and probably some other times earlier in the year as well. I stopped keeping track of how bad their service is because of how bad AWS's service is.
I recall at least one of them where I could not deploy new versions of my app, which is pretty bad. I don't believe I had any downtime at all due to Heroku platform outages in 2021. I believe you if you say you did, though!
This particular outage definitely seems to be of a rare level of severity.
I think apps still worked but I'm not sure. I know I couldn't create a PrivateLink to our AWS environment because the CLI kept failing, so we were locked out of a dev database. Not too bad.
Well, they aren't completely down now either. Heroku Shield is up. Dashboard is up. EU is reported as up.
As an aside, I spoke with our AE in early January about where they are going in dealing with AWS unreliability. One would think they would have a good answer. They don't.
You haven't been paying attention. We're very heavy Heroku users. Their dashboard/API has gone down a few times in the past year. Overall they've been reasonably good.
It's much better than it was 5+ years ago. Back then they had almost weekly downtime.
I am paying attention. My business runs on Heroku. I never said they don't go down... they go down quite a lot (much to my dismay), but never completely down like this one. Even their status page is down, which is a first for me. Must be bad.
Heroku runs on AWS, so it would be an AWS attack of some sort. But AWS in general seems to be up from what I can tell, so I'm feeling like it's not a cyberattack, and just a Heroku specific problem.
> Every organization in the United States is at risk from cyber threats
Heroku and AWS are organizations, no?
> While there are not currently any specific credible threats to the U.S. homeland, we are mindful of the potential for the Russian government to consider escalating its destabilizing actions in ways that may impact others outside of Ukraine.
This morning at work we had a database connection issue for a few hours. I work for a school district, which is an organization. Therefore, it was probably a Russian cyberattack.
=== Availability of Common Runtime apps 2022-02-24T16:50:47.249Z https://status.heroku.com/incidents/2402
investigating 2022-02-24T16:50:47.249Z (1 minute ago)
Engineers are looking into reports of connectivity issues to Common Runtime apps in the US and EU regions.
It was, but I don't know why. I'm curious to hear if Heroku releases any information about how this happened. Heroku's DNS was returning a single 100.64.x.x address, which is in a reserved range.
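If you want to check for this kind of thing yourself, here's a rough Python sketch: resolve a hostname and flag anything in 100.64.0.0/10, the shared address space that RFC 6598 reserves for carrier-grade NAT, which should never show up in public DNS for a hosted app. The hostname below is a placeholder, not a real app.

    # Resolve a hostname and flag addresses in 100.64.0.0/10, the shared
    # address space reserved by RFC 6598 for carrier-grade NAT.
    import ipaddress
    import socket

    CGNAT = ipaddress.ip_network("100.64.0.0/10")

    def check(hostname: str) -> None:
        try:
            infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        except socket.gaierror as exc:
            print(f"{hostname}: lookup failed ({exc})")
            return
        for info in infos:
            addr = ipaddress.ip_address(info[4][0])
            flag = "  <-- reserved CGNAT range!" if addr in CGNAT else ""
            print(f"{hostname} -> {addr}{flag}")

    # "example-app.herokuapp.com" is a stand-in for your own app's hostname.
    check("example-app.herokuapp.com")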
What is the proof that AWS is down? Functional monitoring of AWS by metrist.io (I'm a co-founder) shows no AWS problems. Downdetector is not a reliable source.
It's based solely on social media and user reports. It's the "smoke" in the saying "where there's smoke, there's fire," with the caveat that in some cases there's actually no fire, even if there's a decent amount of smoke.
Downdetector relies on user reports, so e.g. if a user's ISP is down and they can't get to Facebook, they might report Facebook being down (or vice versa). DD spikes are typically indicative of _something_, but it's not always the actual down service.
Gotcha. Although for a spike this large (over 1000) for a tech service (AWS vs. Facebook), I'd give some credence to it. It could be that everyone who reported AWS as down is running on Heroku. Definitely possible. For comparison, Azure [0] and Google Cloud [1] have spikes under 30.
It can be useful, but you have to take it with a grain of salt. A perfect example is the recent Facebook (Meta) outage. When that happened, Downdetector showed that AT&T, Verizon, and T-Mobile all had issues. They didn't; it was just Facebook, and users mentioned or otherwise claimed that it was their mobile carrier.
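That's the gap "functional monitoring" (mentioned upthread) tries to fill: probe the service directly instead of aggregating user reports. A minimal sketch in Python; the URL is a hypothetical health endpoint, not any vendor's real one.

    # A minimal functional check: hit an endpoint directly rather than
    # inferring status from user reports. The URL is a placeholder.
    import time
    import urllib.request

    def probe(url: str, timeout: float = 5.0) -> None:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                print(f"{url}: HTTP {resp.status} in {time.monotonic() - start:.2f}s")
        except Exception as exc:
            print(f"{url}: unreachable ({exc})")

    probe("https://status.example.com/healthz")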
If the image tag is drawn dynamically, then it's actually probably less bandwidth than the Unicode character, since the image can be cached in multiple locations, including the browser.
Because if it is cached, then it is 0 bytes transferred. The first request would probably be a few hundred bytes, but it's never needed again. And once it is at a CDN, there is never another request to the origin server.
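Easy enough to verify, too: the asset is cache-friendly if the response carries a long Cache-Control max-age, or an ETag for cheap revalidation. A quick Python sketch, with a made-up icon URL:

    # HEAD the asset and print the headers that determine cacheability.
    # A long max-age (or an ETag) means browsers and CDNs stop hitting
    # the origin after the first fetch. The URL is a placeholder.
    import urllib.request

    req = urllib.request.Request("https://status.example.com/green-dot.png", method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            for name in ("Cache-Control", "ETag", "Expires", "Content-Length"):
                print(f"{name}: {resp.headers.get(name)}")
    except Exception as exc:
        print(f"request failed: {exc}")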
At first the Heroku status page was still all green for me... although it took 30 seconds to load. I guess when the status page takes 30 seconds to load, that's an indicator!
I did figure, okay, a 30-second-to-load status page probably means my app outage is a Heroku platform problem.
(Also an indicator that the status page is sharing too much platform with the platform it's supposed to be reporting on? Also, in this case, an indication that the platform problems are pretty deep?)
Interestingly, my app logs (via Papertrail, which is still up) show that some traffic is getting through continually during this outage, even though I can't connect (and neither can my monitoring app, which is what pinged me).
The EU outage towards the end of last year was similarly bad, but lasted much longer. I asked Heroku for at least a refund of the dyno time while our apps were unavailable and was flatly told no.
I've only ever used Heroku for free-tier personal projects, but as I understand it, they use AWS to do the actual hosting. I can understand an outage affecting their deployment process, but what could cause running servers to go down?
As I typed that out I remembered that they handle DNS, load balancing, and databases, so I guess any one of them.
It indeed behaved like a routing issue of some kind (my app was still up, and was still logging, just no traffic could get to it), and a Heroku incident status line said "Engineers are recovering affected routing components," so, yup.
However, while I could not connect to my app, nobody I know could connect to my app, and my monitoring service could not reach it for a ping, my app logs showed that some traffic continued to get through throughout the outage. So it was not entirely universal. And it was clearly a routing problem of some kind.
It may have been during an earlier part of the incident, but we just scaled multiple affected apps down to a single dyno and they all recovered. A ps:restart with multiple dynos had no effect.