Ask HN: Is GitHub down?
306 points by mikebonnell on June 29, 2023 | 184 comments
Not loading for me at all, but status page shows green across the board.



Every time GitHub goes down and my push/pull is rejected, I immediately assume they've discovered I'm incompetent and fired me. And I'm the head of engineering at my company.


I think the same thing every time my credentials to our issue tracker expire and I have to log in again.

I am the lead dev on two projects.


Is there a name for firing trauma? I've been like this ever since I was scapegoated.


I got logged out of our slack today, which I'm the primary owner of, and I was also wondering this.

I've also never been fired, so, it isn't always linked to trauma from past firings.


I’m no mental health professional, but that sounds like literal PTSD, to me.


There’s a long list of signals that can trigger folks into layoff panic.


Sudden self-awareness of bias?

https://news.ycombinator.com/item?id=36508656


Can you describe what happened that qualifies as "scapegoated"?


Borderline imposter syndrome?


> And I’m the head of engineering at my company.

Haha! Happy to see impostor syndrome goes all the way to the top of the hierarchy.


Something I've found that helps me with impostor syndrome is to read and talk about it.

Check out this Ted talk from the co-founder of Atlassian.

https://www.ted.com/talks/mike_cannon_brookes_how_you_can_us...


At a friend's company, the CEO had a calendar invite of "Fire Dan", for April 1. Dan went to confirm it was an April Fools' joke. It wasn't!


This is something my boss would post in his calendar publicly without even thinking about it. I guess it would help me get ahead of it if it were ever to happen to me, but having the rest of the company able to see it is pretty cold and unfeeling.


wtf. that's pretty cold.


Sounds kind of like those dreams where you can only run slowly, your punches land with noodly arms, and trying to turn on a light only has a dim effect.


Ooof, I felt that. My project management system logs me out a few times a year and each time it happens my heart rate elevates.


That's not very healthy, given how unreliable GitHub has become in recent years.

E.g. just yesterday, for a short time frame of a few hours (maybe half a day or so), they had a bug where some closed PRs were shown in the personal PR overview, which is supposed to show created, _not closed_, PRs.

Or GitHub CI having spurious job cancellations, or sometimes, when a job fails, waiting until some (quite long) timeout is reached before reporting it.

Or it temporarily being (partially or fully) down for a few hours.

Or its documentation, even though rather complete, somehow managing to be often rather inconvenient to use. Oh wait, that's not a bug, just subtly bad design, like its PR overview/filters. Both are cases of something that seems right at first look but becomes more and more inconvenient the more you use it. A trend I would argue describes GitHub as a whole rather well.


Something internal must be going on at Microsoft. My company's PowerBI service has had some major performance issues over the past week


You’re not alone.


I get this feeling when I get kicked out of Google services at a different time than my usual every-7-days (Monday) logout. I'm an admin of the Google services we use, and I still get that stomach-dropping feeling.


I resonate with this.


Well, I'm an IT team lead and impostor syndrome hits me hard too. I always wonder when the day will come.


I hope you are not contributing production code.


I feel this in my bones. Every. Single. Time.


Cannot log in to Slack


Seems to be a fairly catastrophic failure. https://github.com/ fails to load. https://www.githubstatus.com/ shows all green as of this writing. Nothing on the twitter yet https://twitter.com/githubstatus

edit: The outage is now acknowledged on the status page https://www.githubstatus.com/

edit: EU folks appear to have things working so it looks like a regional network fault


Strange stuff, as it works completely fine for me in the EU? I just posted comments to several issues.

Edit: Front page still loads and I am logged in. Everything is as normal. Status page shows everything is down. Lol.


Switched on a VPN in EU and it started loading. I can get back to what I was doing now ;).


yes, seems to be a network issue, not a service issue


That was my guess. Something on the frontend like a load balancer or proxy blocking traffic, but everything behind that was doing fine.


Sounds like a regional network fault then



Status page is red now, it probably only checks once every couple minutes.


EU here. Actions are failing to run. Rest is kinda ok.


Pages hosted on GitHub Pages also show the unicorn 503 page. However, https://pages.github.com/ loads.


Looks like they finally updated the second status page to show the outage.


GitHub pages are down too, although funnily enough https://pages.github.com is up


That is funnily.


Even the status page isn’t loading for me currently


Looks like they've updated it now.


status.github.com was a timeout error for me. githubstatus.com is the rainbow unicorn.

Lunch time.


for some reason www.githubstatus.com works while githubstatus.com doesn't


Status page is fully red now.


140.82.113.0/24 is visible in the global routing table:

  route-views>sh bgp 140.82.113.0
  BGP routing table entry for 140.82.113.0/24, version 62582026
  Paths: (19 available, best #4, table default)
The route is verified by RPKI so it's not a route hijack.

Edit: deleted traceroute
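
For anyone who wants to reproduce the RPKI part without access to a route server: RIPEstat exposes a public validation endpoint. A minimal Python sketch - assuming AS36459 as GitHub's origin ASN, which is what public BGP data suggests:

    import json
    import urllib.request

    # Assumption: AS36459 is GitHub's origin ASN (per public BGP data).
    url = ("https://stat.ripe.net/data/rpki-validation/data.json"
           "?resource=AS36459&prefix=140.82.113.0/24")
    with urllib.request.urlopen(url, timeout=10) as resp:
        result = json.load(resp)

    # "valid" means a ROA covers this origin/prefix pair,
    # i.e. not an obvious route hijack.
    print(result["data"]["status"])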


Your requests made it farther than mine - mine get to Charter in NYC and die there

    6  lag-26.nycmny837aw-bcr00.netops.charter.com (24.30.201.130)  158.033 ms
       lag-16.nycmny837aw-bcr00.netops.charter.com (66.109.6.74)  29.575 ms
       lag-416.nycmny837aw-bcr00.netops.charter.com (66.109.6.10)  30.077 ms
    7  lag-1.pr2.nyc20.netops.charter.com (66.109.9.5)  81.351 ms  37.879 ms  27.877 ms
    8  * * *


I’m in US east coast with a dev box in Helsinki. My dev box can still hit github.com, but I can’t at home.


Curious aside: that sounds like quite the roundtrip for day-to-day work. How do you cope with that, being used to IntelliJ IDEs? ;D


Surprisingly, not that bad ;) just a cheap hetzner box.


What IP does it resolve to in Helsinki?


From Finland, but not Helsinki: 140.82.121.3


yep, same


This is so cool! I'm not at all familiar with any of this network stuff. Any good resources for learning these tools and when to use them?

Sorry to bother!


TCP/IP Illustrated is a good start.
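
Most of the diagnosis in this thread reduces to two checks you can script yourself: what IP a name resolves to, and whether a TCP connection to it succeeds. A rough stdlib-only Python sketch:

    import socket

    host = "github.com"

    # Step 1: DNS - which address does your resolver hand back?
    ip = socket.gethostbyname(host)
    print(f"{host} resolves to {ip}")

    # Step 2: reachability - can we complete a TCP handshake on port 443?
    try:
        with socket.create_connection((ip, 443), timeout=5):
            print("TCP connect to 443 succeeded")
    except OSError as exc:
        print(f"TCP connect failed: {exc}")

traceroute and dig (used elsewhere in the thread) give you the per-hop and per-resolver detail once those basics point at a problem.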


Thank you!


github.com for me returns 140.82.121.3, which routes fine in the UK, returning from

lb-140-82-121-3-fra.github.com

which from the distance and name I would assume is a Frankfurt-based load balancer. I get there via BT -> Zayo.

I can reach that IP from Washington too, but GitHub returns 140.82.114.3 and 140.82.114.4 from DNS at 1.1.1.1 on a Level3 handoff in Washington

Spot checks around the place show the first returned IP as pingable across the world

Bangkok, Dhaka, Jakarta - 20.205.243.166

Seoul - 20.200.245.247

Nairobi - 20.87.225.212

Kabul, Dakar, Amman, Cairo - 140.82.121.3

Moscow, Riga, Istanbul - 140.82.121.4

Miami - 140.82.114.3
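
If you want to repeat these spot checks yourself, the trick is querying specific resolvers directly instead of your system default. A small sketch using dnspython (pip install dnspython), with 1.1.1.1 as in the Washington example above:

    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["1.1.1.1"]  # swap in any public resolver

    for rr in resolver.resolve("github.com", "A"):
        print(rr.address)  # compare answers across resolvers/regions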


Same from Finland, and same route. (Except my ISP instead of BT).


They do have other peering -- that IP from my ISP in Jakarta routes onto Hurricane Electric in Singapore and then to github. From Sao Paulo I go to Atlanta, USA, then to Paris and Frankfurt on twelve99/Telia


found the neteng guy


I'm able to reach it on 192.30.252.0/22.


https://www.githubstatus.com/

Just flipped to red.

See here: https://www.githubstatus.com/incidents/gqx5l06jjxhp

>Investigating - We are currently experiencing an outage of GitHub products and are investigating.

>Jun 29, 2023 - 17:52 UTC


As the person in charge of one such page, I'd like to take the opportunity to remind folks that many, if not most, of these status pages are hand-updated, and most bosses absolutely hate anyone having to update them to anything but green.


Sounds like an anti-pattern or SLA dodge to me.


Sometimes, but sometimes it’s just a precaution against automatic false alarms causing a huge panic.


See, I prefer panic. People don't pay enough attention to systems as it is. They may just yeet together a bunch of parts and never bother to learn how to actually troubleshoot, maintain, or reason through things.


https://codeberg.org is an open-source GitHub alternative without Microsoft (it's a German non-profit). You can also host your own lightweight https://forgejo.org instance.

In case anyone questions whether centralizing literally everything onto GitHub is a good idea: use these at least as a mirror for things you depend on.


Gitea is also a great self-hosted alternative!


I think that's what Forgejo forked from (and Gitea, in turn, forked from Gogs). I am not involved so I don't know the details, but basically any of these will do. I ran my own in the Gitea era and was happy with it: 10x lighter and easier than GitLab. I expect Forgejo offers a similar experience.



So down right now... I wonder why they still use https://www.githubstatus.com/ when it reports everything is fine when it's not!


https://downdetector.com/status/github/ is a far more reliable source - it's just powered by user reports and often will show issues long before the status page ever receives an update.


Keep in mind that downdetector can be brigaded and/or show knock-on problems instead of root causes. e.g. A couple weeks ago there were fairly major spikes across a rather huge variety of services on there, but it turned out that it was actually Comcast that was having trouble, rather than any of the “down” services.


Pretty much every company has been shown to have fake status pages at this point.


Status pages are updated by humans and the humans need to (1) realize there's a problem and (2) understand the magnitude of the problem and (3) put that on the status page.

It's not fake, it's just a human process. And automating this would be error prone just the same.
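
To illustrate: the naive automation people imagine is a single probe against the homepage, and a single probe is exactly the kind of signal that false-alarms. A minimal sketch (hypothetical, Python stdlib):

    import urllib.request

    def naive_up_check(url: str) -> bool:
        """One probe from one network - the kind of signal you should NOT page on."""
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except OSError:  # covers URLError, HTTPError, and timeouts
            return False

    # A single failure here could be your own ISP, a DNS blip, or a
    # 30-second failover - none of which mean the service is down for everyone.
    print(naive_up_check("https://github.com/"))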


Very good points. Meanwhile I have clients asking me why they can't have a status page, to which I reply: you can, but ultimately, to be completely fail-proof, it will be a human updating it slowly. To which they reply: but GitHub or X does it...

Very infuriating, that.


There's some nice tooling these days for this. E.g. https://firehydrant.com/ and https://incident.io both make this a faster, more embedded process.


And Jeli.io for this! With the Statuspage integration, you can set the status, impact, write a message for customers, and select impacted components all without leaving Slack. Statuspage gets updated with a click of a button.


Hey, incident.io CEO here! Thanks for mentioning us.


I wouldn't necessarily call them fake, but the issue often has to be big enough for most companies to admit to it. AWS often has smaller outages that they will never acknowledge.


Also (2b) convince their boss that the “optics” are better to update sooner than later.


Pretty much. They want the burden of proof for SLAs to fall on the customer, not on themselves. If a customer has to prove that an outage specifically affected them, they are much less likely to have a successful case against the failure to meet their SLA.

(Not directed at GitHub specifically, but at bogus status pages.)


fake and not automated are pretty different


From my experience, GitHub is the best out there when it comes to updating their status page.


Really? Why?

That's so disappointing.


Two technical reasons, capped by the driving business motivation:

- False positives

- Short outages that last a minute or three

Ultimately: SLAs and uptime guarantees. That way, a business can't automatically tally every minute of publicly admitted downtime against the 99.99999% uptime guarantee, and the onus to prove a breach of contract is on the customer.


Maybe the status page is down - it needs a status page to tell us if the status page is down


If it takes someone to manually change it from green to red, that does seem to defeat the purpose.


No it doesn't. The number of false-alarm alerts you can get with internet-based monitoring is more than zero. You could have a BGP route break things for one ISP your monitoring happens to use. You could have a failover event where it takes 30 seconds for everything to converge. I have multiple monitors on my app at 1-minute intervals from different vendors, and ALWAYS a user will email us within 5 seconds of an issue. It's not realistic for a company to have automatic status updates trigger without a person manually reviewing them, because too many things can go wrong and a false automatic status update would cause panic.


Most paid status monitoring services cover BGP route problems and ISP issues by only flagging an event if it's detected from X geographically diverse endpoints.

For the 30 seconds where you wait for failover to complete: that is a 30 second outage. It's not necessarily profitable to admit to it, but showing it as a 30 second outage would be accurate
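
The quorum logic itself is cheap; here's a toy sketch in Python with made-up regions and a threshold of two (both are assumptions - real services tune these):

    # Hypothetical probe results: True means the request from that region failed.
    probe_failed = {
        "us-east": True,
        "us-west": True,
        "eu-west": False,
        "ap-southeast": False,
    }

    QUORUM = 2  # only flag an event when at least this many diverse regions agree

    failing = [region for region, failed in probe_failed.items() if failed]
    if len(failing) >= QUORUM:
        print(f"flagging outage: failures from {', '.join(failing)}")
    else:
        print("single-region blip: likely local network trouble, not flagging")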


The TCP default timeout is more than 30 seconds. The internet itself has about 99.9% uptime. If one company showed every 30-second blip on their status page, all their competitors would have that screenshot on the first page of their pitch deck, even if they also had the same issue. 2-5 minutes is a reasonable window for a public service to announce an outage.


Don't forget about that CenturyLink BGP infinite-loop route bug that took down their whole system nationwide. A lot of monitoring services showed red even though it was one ISP that was down.


Who would panic? If nobody notices it's out because it's not, then nobody is going to be checking the status page. And if they do see the status page showing red while it's up, it's not like they're going to be unhappy about their SLA being met.

Maybe you want human confirmation on historic figures, but the live thing might as well be live.


Not really, things fail in unexpected ways. Automated anomaly detection is notoriously error-prone, leading to a lot of false positives and false negatives even in the trivial case of monitoring a single timeseries. For a system the size of GitHub, you need to monitor a whole host of things, and if it's quasi-impossible to do one timeseries well, there's basically no hope of doing automated anomaly detection across many timeseries with a signal-to-noise ratio better than "humans looking at the thing and realizing it's not going well".

There's stuff like this that can't be automated well. The automated result is far worse than the human-based alternative.


Unknown unknowns means you can have catastrophic system failures that automated alerts don't detect.


I bet they could teach Co-Pilot to create a PR to make the change, and build some GitHub actions to automatically merge those changes.


Yep, and when money comes into play when you're supposed to meet SLAs, you certainly don't want it being automatic.


Possibly, but sometimes with failures this bad you can't get to the page to update it.


There was that hilarious multi-hour AWS failure a while back where the status page was updated via one of their internal services... and that service went down as part of the outage.


To all the "same" and "not for me" posters: the very least you could add is a location


I feel sympathy for all the engineers at companies where I've implemented CI/CD based on GitHub Actions in recent years. It's not like I didn't tell them that GitHub has shown itself to be somewhat unreliable in recent years, and in contrast to their claim that "it's just the build pipeline, not the product", I think it is a horrible incident if you're not able to deploy to production and have barely any ad-hoc backup.

I'm always evangelizing Argo or Flux and some self-hosted GitLab or Gitea, but it seems they all prefer to throw their money at GitHub for now.


Tradeoffs and tolerances need to be considered.


Since nobody can work, I'll just leave this here: "I must have put a decimal point in the wrong place or something. Shit! I always do that. I always mess up some mundane detail."


Loads for me

Unless it is cached

Edit: I could even log in


Same. I'm in Europe and it loads slowly but it gets there.


And it seems like loading HN is even slower


Probably because everyone in the US is piling on HN to see if github is down


Same, in EU and is just as normal.


Yes, github status says all is operational but hacker news is faster. Go figure.


At this point I don't trust self-owned status pages at all - those crowd-sourced ones where users report issues are much faster to respond to outages that may never even get reported by status pages.


My team ran some code that crushed our Github actions very shortly before this outage. Nervous laughter around the office that it was us.


I am laughing so hard right now. My best friend mocked me for using git.lain.faith to host my code, saying it wasn't reliable. Well, well, well. In the last year GitLain hasn't gone down once.

I know he was still right in a way, who knows when git.lain.faith will just disappear. But still. I'm going to send some texts to bother him right now, hahaha.


I heard from somebody at GitHub that this one will make a good incident report. No other details or estimates for recovery time.


Update:

> We have identified the root cause of the outage and are working toward mitigation.

> Posted 4 minutes ago. Jun 29, 2023 - 18:02 UTC


Seems like an oopsie, if they found it so quickly.


Seems like it's coming back online in fits and starts


I noticed GitHub's OIDC token changed about half an hour ago. Security incident?



I never got that memo. Found out when it broke something.


Interesting observation, but I'd be surprised if that was related to a regional network fault


I must ask, how did you notice that?!




Not down for me accessing from Hong Kong. I suspect this is a regional outage.


Unsurprising. There is at least one outage with GitHub every single month. [0]

[0] https://news.ycombinator.com/item?id=35967921


Won't load for me.

https://githubstatus.com shows all green, but it's not the case...


Github status page doesn't even load for me ... "We're having a really bad day, the unicorns have taken over"


Are these status pages updated manually? At the very least it should be able to detect that the home page doesn't even load and turn itself red.


Yaay! I just pushed and my new commit showed up in CI!


I can't push from my desktop, and https://github.com/ spins in the browser


Yup. Totally down, happened right as I opened a PR.


You broke it.


I just noticed that artifacts download didn't work although the web site was up. There was some varnish proxy error.


If they used AI to rebuild their system and migrated it to Azure, I bet they would stop having all these problems.


I can't tell if this is sarcasm or not.


I'm pretty sure if I e-mailed my sales rep right now they would tell me that Azure Dev Ops doesn't have these problems.


Still can't tell if this is sarcasm or not.


That's how you know it's good sarcasm


Yep, looks like it's down. Can't pull/push and can't even get the web to load at all.


Unable to load Github website or push using git-remote-https as of the last several minutes.


Same. Spins for a while then dreaded `ERR_CONNECTION_TIMED_OUT` chrome error page.


Looks like we are back online.


Down for me as well, Operation Timed Out errors on all attempted SSH connections


No it isn't, it's working absolutely fine and has been all afternoon.


githubstatus.com disagrees. Here's the specific incident: https://www.githubstatus.com/incidents/gqx5l06jjxhp

I think I'll believe them when they say it's down, no offense.


Uhm, okay.

Rather than looking at a rather noddy status page, have you tried using it?


Yes and everything times out.


Are you sure it's Github and not your ISP or something? I've just pushed commits to a bunch of repositories in the past half hour.


This should be a weekly Ask HN; it seems to happen pretty frequently at this point


It works for me at the moment


First noticed it when trying to pull a Helm chart - got a 503 backend error page.


Not a single day passes without a MAJOR outage in a Microsoft owned service.


Often people point out how unreliable self-hosted services are; well, hosted services are just as unreliable, if not more so.

This, ladies and gentlemen, is why you should always self-host critical infrastructure


Depends on what you want. If you want uptime, then sure. If you want to be able to blame someone then no.

If you are down for 1 hour a year on self hosting, but Office 364 is down 3 days a year, your CEO is going to be more understanding of the Office outage as all his golf buddies have the same problem, and he reads about it in the NYT.

But in any case, zero downtime is difficult; that's why you need two independent systems. I had a 500-microsecond outage at the weekend when a circuit failed, which caused a business-affecting incident. Not a big one, fortunately, as it was only some singers, but it was still unacceptable -- had it happened at a couple of other events in the last 12 months it would have been far more problematic. Work has started to ensure it doesn't happen next year.


Yep it's down. Why do they even bother with the status page


Up in Africa (Morocco).


And just when I was about to get into flow state...


Status page updated showing all red https://www.githubstatus.com/


Same here, it can't send a verification SMS.


Use the downtime to purchase a Yubikey/FIDO2.


It's down for me


Cannot use oauth2 in algoexpert :/


I can't even load the status page.


Status page loads for me, it just incorrectly says all green: https://www.githubstatus.com/


GitHub should monitor their status page traffic for spikes, which probably mean something is wrong somewhere, even if they themselves haven't noticed yet.
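
The detection side is cheap to sketch: compare the current rate against a baseline. A toy example in Python (made-up numbers, a simple three-sigma threshold):

    from statistics import mean, stdev

    # Hypothetical requests-per-minute hitting the status page;
    # the last sample is everyone piling on at once.
    rpm = [120, 115, 130, 125, 118, 122, 980]

    baseline, current = rpm[:-1], rpm[-1]
    threshold = mean(baseline) + 3 * stdev(baseline)
    if current > threshold:
        print(f"traffic spike: {current} rpm vs threshold {threshold:.0f} - investigate")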


That requires manual acknowledgement. Probably requires an approval from a VP or some high level exec to change that status.


Maybe it's right and we're all wrong


Interesting, I'm used to using status.github.com, which got hit by whatever issue is hitting the main site.


they just updated it, now it's all red


looking on the bright side, at least we'll get an interesting post-mortem to read in a day or two.


Actions won't start for me


Yes. Can't even pull ;(


Good time to grab a beer!


Down for me right now.


Azure Strikes Again!


It shouldn't matter. Nobody should be using github post-2018.


seems to be a network issue, not a service issue


Being up on the localhost interface doesn't count!


Works on my machine!


Works for me?


Same for me


Same here


Yep


Same here.


it's all down currently


Same



