Heroku was Down
254 points by ericpauley on Feb 24, 2022 | 141 comments
Incident: https://status.heroku.com/incidents/2402

Update: our apps appear back up after 23 minutes total downtime. Others are reporting applications still down.

Update: it appears most or all services have been restored.




HN is truly a market leader in status page technology


A couple of outages that have affected my client I learned about first on HN. We were able to mobilize a team and get on top of it faster than any monitoring team at the client (a state government). I feel like HN should invoice us haha


You might benefit from some better monitoring!


I came to HN to make sure it was actually down.


Except when it is a false alarm.

I've seen a few get to the front page.

But when we're right we're right.


10/9 times


HN: All of our amps go to 11.


Even when it's a false alarm, it's usually that something else is having a problem that affects many people and manifests as a particular service being down.


I'll make my employer pay for it if HN starts a paid service. Honestly they should write a blog post about when downtimes were first reported on HN.


HN was also down for a moment because too much traffic


I experienced that too


Such a lightweight tool at that! Barely any JS loaded.


*was


All our apps are down as well. And it seems (and I really have to almost laugh here) that https://status.heroku.com no longer loads.


As always, a note to infrastructure providers: HOST YOUR LIVE STATUS DETAILS ON OTHER INFRASTRUCTURE.

Of course, they won't. If they host it on someone else's then that might look bad (tacitly saying that a competitor is reliable and might be up when they are down) and if they hive off an extra copy of some of their infrastructure there will still be single points of failure either accidentally, by human error (someone somehow messing up both segments at once), or by design (possibly through management trying to save pennies when they noticed this extra bit of infrastructure on the balance sheet).


I agree with your lead statement and argued as such, but was overruled. Last I knew and understood, Heroku Status is static pages pushed out to Fastly, with the internal admin site (that does that work) running in a Heroku Private Space. If you look at the DNS, it still appears to be served by Fastly, and Heroku Private Spaces are generally pretty isolated infra, so I would be curious what the failure mode was here. But ultimately this is the fire you play with when you self-host your status site...


I am pretty sure Heroku has a Linode box or something for their status page. I may be wrong though. They may also have moved it but that would seem like an incredibly dumb idea.

I worked there and I vaguely remember something like this but it's been a long time.

If I had to guess this is probably a DNS issue (it's always DNS).


I've been with Heroku for 7 years and this is the first time in recent memory that I've seen Heroku completely go down. Nothing at all works.


Ahh... were you sleeping in 2021? They went down in September, November, and December. Probably some other times earlier in the year as well. I stopped keeping track of how bad their service is because of how bad AWS's service is.


My app didn't go down during any of those times.

I recall that during at least one of them I could not deploy new versions of my app though, which is pretty bad. I don't believe I had any downtime at all due to Heroku platform outages in 2021. I believe you if you say you did though!

This particular outage definitely seems to be of a rare level of severity.


Their authentication system completely broke I think in January...for hours.


Oh right, I remember that now.

Didn't bring my deployed app down though! No user-facing outage for me.

This time my app was down for about 35 minutes.


I think apps still worked but I'm not sure. I know I couldn't create a privatelink to our AWS environment because the CLI kept failing, so we were locked out of a dev database. Not too bad.


I'm well aware of their other outages. But I don't remember another time when they were completely down like this, even down to their status page.


Well, they aren't completely down now either. Heroku Shield is up. Dashboard is up. EU is reported as up.

As an aside, I spoke with our AE in early January on where they are going in dealing with AWS unreliability. One would think they would have a good answer. They don't.


You haven't been paying attention. We're very heavy Heroku users. Their dashboard/API has gone down a few times in the past year. Overall they've been reasonably good.

It's much better than it was 5+ years ago. Back then they had almost weekly downtime.


I am paying attention. My business runs on Heroku. I never said they don't go down... they go down quite a lot (much to my dismay), but never completely down like this one. Even their status page is down, which is a first for me. Must be bad.


they had their big layoff in 2020 or 2021 though so I wouldn't expect it to improve


So... what are the chances this is related to a cyber attack from Russia? Also our apps are down


Now is a good time for everyone to double-check their backups and go over their disaster recovery plans. I know I am.

I mean, yesterday was a good day. But today's the best you've got if you didn't do it then.


Practically zero. Why would Russia care about Heroku?


Heroku runs on AWS, so it would be an AWS attack of some sort. But AWS in general seems to be up from what I can tell, so I'm feeling like it's not a cyberattack, and just a Heroku specific problem.



Collateral damage (but probably DNS lol)


Maybe not heroku per se but something hosted on heroku?


what would lead you to this conclusion?



what does this have to do with Heroku? why would it be targeted? especially given that they're running on AWS?


> Every organization in the United States is at risk from cyber threats

Heroku and AWS are organizations no?

> While there are not currently any specific credible threats to the U.S. homeland, we are mindful of the potential for the Russian government to consider escalating its destabilizing actions in ways that may impact others outside of Ukraine.


this morning at work we had a database connection issue for a few hours. I work for a school district, which is an organization. therefore, it was probably a Russian cyberattack.

like what


> what would lead you to this conclusion?

I'm just posting what would lead somebody to that conclusion. Nobody is definitively saying anything right now.


yes, and I'm just pointing out how ridiculous of a conclusion this is.


I wasn't gonna say it...


Heroku apps are down, but API access is available.

Might be a good time to run: heroku pg:backups:download


    % heroku status
    › Warning: Our terms of service have changed:
    https://dashboard.heroku.com/terms-of-service
    Apps: Yellow
    Data: No known issues at this time.
    Tools: No known issues at this time.

    === Availability of Common Runtime apps 2022-02-24T16:50:47.249Z https://status.heroku.com/incidents/2402
    investigating 2022-02-24T16:50:47.249Z (1 minute ago)
    Engineers are looking into reports of connectivity issues to Common Runtime apps in the US and EU regions.


oh I didn't know that was a CLI command, nice!


No update to their own status or their Twitter (@herokustatus) after 10+ minutes. Pro.


Let me bet: it's DNS.

It's always DNS.


It was, but I don't know why. I'm curious to hear if Heroku releases any information about how this happened. Heroku's DNS was returning a single 100.64.x.x address which is in a reserved range.
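
For anyone who wants to check what their own resolver is handing back: 100.64.0.0/10 is the shared address space reserved by RFC 6598 (carrier-grade NAT), so it should not appear as the public answer for an app hostname. A minimal Python sketch, assuming you already have the resolved addresses as strings; the function name is made up for illustration and this is not anything Heroku ships:

```python
import ipaddress

# RFC 6598 shared address space (used for carrier-grade NAT).
SHARED_ADDRESS_SPACE = ipaddress.ip_network("100.64.0.0/10")


def is_shared_address_space(addr: str) -> bool:
    """Return True if addr is an IPv4 address inside 100.64.0.0/10."""
    ip = ipaddress.ip_address(addr)
    return ip.version == 4 and ip in SHARED_ADDRESS_SPACE


if __name__ == "__main__":
    # Example addresses only; substitute whatever your DNS lookup returned.
    for candidate in ("100.64.12.34", "8.8.8.8"):
        print(candidate, is_shared_address_space(candidate))
```

The range runs from 100.64.0.0 through 100.127.255.255, so a plain prefix match on "100.64." would miss most of it.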


They typically publish post-mortems, I think? Not 100% sure. They definitely do in-house.

We'll have to wait and see I guess.


AWS is down too, this is likely the cause since Heroku runs on it. https://downdetector.com/status/aws-amazon-web-services/


What is the proof that AWS is down? Functional monitoring of AWS by metrist.io (I'm a co-founder) shows no AWS problems. Downdetector is not a reliable source.


What makes Downdetector unreliable? It's showing a huge spike right now.


It's solely based off social media and user reports. It's the "smoke" in the saying "where there's smoke, there's fire" with the caveat that in some cases there's actually no fire even if there's a decent amount of smoke.


Downdetector relies on user reports, so e.g. if a user's ISP is down and they can't get to Facebook, they might report Facebook being down (or vice versa). DD spikes are typically indicative of _something_, but it's not always the actual down service.
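
The alternative to counting user reports is an active probe that actually makes a request from outside the platform. A rough Python sketch of one such synthetic check, assuming a plain HTTP GET is a good enough health signal; the `probe` function and its `opener` parameter are hypothetical names for illustration, not how Downdetector or any monitoring vendor works:

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Run one synthetic HTTP check and return (is_up, detail)."""
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"unreachable: {exc}"
    return 200 <= status < 400, f"HTTP {status}"
```

Run it from a machine that does not share infrastructure with the thing being checked; otherwise the probe goes down together with its target.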


Gotcha. Although for a spike this large (over 1000) for a tech service (AWS vs Facebook) I'd give some credence to it. It could be that everyone who reported AWS as down is running on Heroku. Definitely possible. For comparison Azure [0] and Google Cloud [1] have spikes under 30.

[0] https://downdetector.com/status/windows-azure/ [1] https://downdetector.com/status/google-cloud/


Their (metrist) claim is that DD is human reported and therefore unreliable.

Metrist monitors via bots


It can be useful, but you have to take it with a grain of salt. A perfect example is the recent Facebook (Meta) outage. When that happened, Downdetector showed that AT&T, Verizon, and T-Mobile all had issues. They didn't; it was just Facebook, and users mentioned or otherwise claimed that it was their mobile carrier.


Neat! Wish you had a clickable demo rather than just screenshots.


Thanks for the feedback, we can do that soon.


maybe your service isn't as reliable as you think it is ;)


AWS status page is all green.... Just kidding! I know the information content of the AWS status page is literally zero, it's always green!


Nah, it's just on a 12 hour delay between updates.


AWS must be only partially down, all of it is running fine for me in us-east-1. Elastic beanstalk ftw! :)


If us-east-1 is up I have a hard time believing ANYTHING AWS is down


AWS status page is up, but going VERY slow, and says no issues.


I still don't get why they use images

    <td class="bb pad04 top center" style="width: 32px">
      <img src="/images/status0.gif">
    </td>
Instead of Unicode: https://www.htmlsymbols.xyz/unicode/U%2b2705


It's because they can host the images in S3 buckets, so that their status page goes down when S3 is down.

/s

But, this really happened in 2017.


if the image tag is drawn dynamically then it is actually probably less bandwidth than the unicode character since the image can be cached at multiple locations including the browser.


Less bandwidth than the 1-4 bytes of a codepoint in UTF-8? How do you figure?
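
For reference, the check mark linked above (U+2705) encodes to 3 bytes. A quick Python sketch to measure it; the helper name is just for illustration:

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    # U+2705 is the white heavy check mark.
    print(utf8_size("\u2705"))
```

ASCII characters take 1 byte and code points above U+FFFF take 4, which is where the 1-4 range comes from.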


Because if it is cached, then it is 0 bytes transferred. First request could be probably a few hundred bytes, but never needed again. And once it is at a CDN, there is never another request to the server.


.... no way.

Cache freshness checks involve a lot of headers, which take up way more bandwidth than one unicode character.


if you care about cache freshness - you could set the cache header to 10 years and it would never be an issue again.
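
A lifetime like that is normally expressed as a Cache-Control max-age in seconds. A minimal Python sketch of building the header value, assuming 365-day years; the helper name is hypothetical, and the `immutable` directive is an extra assumption on my part, not something mentioned above:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap days


def far_future_cache_control(years: int) -> str:
    """Build a Cache-Control value that keeps a static asset cached for `years`."""
    max_age = years * SECONDS_PER_YEAR
    return f"public, max-age={max_age}, immutable"


if __name__ == "__main__":
    print("Cache-Control:", far_future_cache_control(10))
```

This only makes sense if the image URL changes whenever the image does, since clients holding the old copy will not ask again.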


As usual whenever there's an AWS outage


No, AWS is not down.


At first the heroku status page was still all green for me... although took 30 seconds to load. I guess when the status page takes 30 seconds to load, that's an indicator!

I did figure, okay, a 30-second-to-load status page probably means my app outage is a heroku platform problem.

(Also an indicator the status page is sharing too much platform with the platform it's supposed to be reporting on? Also in this case an indication that the platform problems are pretty deep?)

Interestingly, my app logs (via Papertrail, which is still up) show that some traffic has kept getting through during this outage, although I can't connect (and neither can my monitoring app, which pinged me).


Status Page was running a bit slowly for me connecting to developers.salesforce.com.

Link here to latest snapshot:

https://postimg.cc/McZHpCWg

No issues noted there.


EU outage towards the end of last year was similarly bad, but lasted much longer. Asked Heroku for at least a refund for the dyno time when our apps were unavailable and was flatly told no.


Anyone still seeing outages? Our services are all back up.


All our applications are still down.


We're back up. Total 47 mins of downtime.


We are still down as well.


I've only ever used heroku for free tier level personal projects, but as I understand it they use AWS to do the actual hosting. I can understand an outage affecting their deployment process, but what could cause running servers to go down?

As I typed that out I remembered that they handle DNS, load balancing, and databases, so I guess any one of them.


All it takes is a load balancer misconfiguration on a core service to take a large scale service down.


it indeed behaved like a routing issue of some kind (my app was still UP, and was still logging, just no traffic could get to it), and a heroku incident status line said "Engineers are recovering affected routing components," so, yup.


Our app is currently 100% down at the router layer, and dashboard requests are mostly failing.


Routing and logs are down for me as well


Heroku just sent an email to us confirming issues. Links to a status page but that isn't loading.

https://status.heroku.com/incidents/2402


It appears to only be affecting apps that have multiple dynos.

While Heroku is busy fixing the issue if you scale your app to a single dyno (assuming it can handle the traffic) it should restore availability.


My single-dyno app was affected.

However, while I could not connect to my app, nobody I know could connect to my app, and my monitoring service could not connect to my app for a ping... my app logs showed that some traffic continued to connect throughout the outage. So it was not entirely universal. And was clearly a routing problem of some kind.


We also had single-dyno apps affected.


It may have been during an earlier part of the incident, but we just scaled multiple affected apps to a single dyno and all recovered. A ps restart with multiple dynos had no effect.


I did too. I'm curious if you also have pre-boot enabled? I think that may set the routing configuration the same as if multiple dynos are up.


My 6 apps are all single dyno. Not sure if I have pre-boot enabled or not though....


I had single dynos affected.


https://status.heroku.com/incidents/2402

(It took a few minutes to get this link to work for me)


It would be nice if they updated their status page. sigh


Our Heroku applications are all down. It seems to be a Heroku router/DNS issue.

We can access our logs and the application is running, just no incoming requests.


My heroku app was back up again as of about 100 seconds before this comment timestamp.

My app was down about 35 minutes total, according to my monitoring.


Our app is currently down as well and we host on Heroku. We received a down alert from our monitoring service at 11:32AM EST.


Yes it appears to be down for us as well.


Rust's crate repository (crates.io) seems to be down too. I wonder if they're connected...


Ours are up. US East. Heroku Shield.


Our apps are all down as well, Heroku Status not loading either (or just incredibly sluggish).


Gotta love asking in slack "is uh... literally everything down right now?"


Well that was a fun one to wake up to. We were down for ~33 minutes overnight.


2/3 of our apps have been timing out of synthetic checks since 8:30a PT


I was able to get my apps up faster by restarting the dynos


Yes, all apps are down


It is hard-down for me


Cypress is also down


All 6 of my apps & dashboard are down too.


Some of my apps are back up. Others are not.


Down along with the status page right now


My Heroku app is completely down as well


Mine's back up, at least 20 min of downtime


Yep, all 6 of my Heroku apps are down.


Interestingly enough, my 2 Squarespace sites weren't affected. Anyone know who they host with?


Things just came back up for me


And we're back? I think.


Yes all of my apps are down.


Same here, hard down for us


US down for us, EU is up



Functional monitoring of AWS by metrist.io (I'm a co-founder) shows no AWS problems. Downdetector is not a reliable source.


"User reports indicate problems at Amazon Web Services" User reported outage sounds like people confusing the Heroku outage with AWS.


status, dashboard and landing page are down for me. (socal.)


Our app is down too :(


Down for us as well.


Back up just now.


Mine is still up.


reporting from Nicaragua, our apps are down!


We're down


Down for us too


Back up now!


Same here


Same here


Same here


Same here


my apps are back up now.


yes, site is down



