Every time GitHub goes down and my push/pull is rejected, I immediately assume they've discovered I'm incompetent and fired me. And I'm the head of engineering at my company.
This is something my boss would post in his calendar publicly without even thinking of it. I guess it helps me to get ahead of it, if it were to ever happen to me, but having the rest of the company able to see it is pretty cold and unfeeling.
That's not very healthy, given how unreliable GitHub has become in recent years.
E.g. just yesterday, for a short time frame of a few hours (maybe half a day or so), they had a bug where some closed PRs were shown in the personal overview that is supposed to show created, _not closed_, PRs.
Or GitHub CI having spurious job cancellations, or sometimes, when a job fails, waiting until some (quite long) timeout is reached before reporting it.
Or it temporarily being (partially or fully) down for a few hours.
Or its documentation, even though rather complete, somehow often managing to be rather inconvenient to use. Oh wait, that's not a bug, just subtly bad design, like its PR overview/filters. Both are cases of something that seems right at first glance but becomes more and more inconvenient the more you use it. A trend I would argue describes GitHub as a whole rather well.
I get this feeling when I get kicked out of Google services at a different time than my usual 7-day (Monday) logout. I'm an admin of the Google services we use, and I still get that stomach-dropping feeling.
Your requests made it farther than mine - mine get to Charter in NYC and die there:
6 lag-26.nycmny837aw-bcr00.netops.charter.com (24.30.201.130) 158.033 ms
lag-16.nycmny837aw-bcr00.netops.charter.com (66.109.6.74) 29.575 ms
lag-416.nycmny837aw-bcr00.netops.charter.com (66.109.6.10) 30.077 ms
7 lag-1.pr2.nyc20.netops.charter.com (66.109.9.5) 81.351 ms 37.879 ms 27.877 ms
8 * * *
They do have other peering -- that IP from my ISP in Jakarta routes onto Hurricane Electric in Singapore and then to GitHub. From São Paulo I go to Atlanta, USA, then to Paris and Frankfurt on Twelve99/Telia.
As the person in charge of one such page, I'd like to take the opportunity to remind folks that many, if not most, of these status pages are hand-updated, and most bosses absolutely hate anyone having to update them to anything but green.
See, I prefer panic. People don't pay enough attention to systems as it is. They may just yeet together a bunch of parts and never bother to learn to actually troubleshoot, maintain, or reason through things.
I think that's what Forgejo forked from (and Gitea, in turn, forked from Gogs). I'm not involved, so I don't know the details, but basically any of these will do. I ran my own in the Gitea era and was happy with it: 10x lighter and easier than GitLab. I expect Forgejo offers a similar experience.
https://downdetector.com/status/github/ is a far more reliable source - it's just powered by user reports and often will show issues long before the status page ever receives an update.
Keep in mind that downdetector can be brigaded and/or show knock-on problems instead of root causes. e.g. A couple weeks ago there were fairly major spikes across a rather huge variety of services on there, but it turned out that it was actually Comcast that was having trouble, rather than any of the “down” services.
Status pages are updated by humans and the humans need to (1) realize there's a problem and (2) understand the magnitude of the problem and (3) put that on the status page.
It's not fake, it's just a human process. And automating this would be error prone just the same.
Very good points. Meanwhile, I have clients asking me why they can't have a status page, to which I reply: you can, but ultimately, to be completely failproof, it will be a human updating it slowly. To which they reply: but GitHub or X does it...
And Jeli.io for this! With the Statuspage integration, you can set the status and impact, write a message for customers, and select impacted components, all without leaving Slack. Statuspage gets updated with the click of a button.
I wouldn't necessarily call them fake, but the issue often has to be big enough for most companies to admit to it. AWS often has smaller outages that they will never acknowledge.
Pretty much. They want the burden of proof for SLAs to fall on the customer, not on themselves. If a customer has to prove that an outage specifically affected them, they are much less likely to have a successful case against the failure to meet their SLA.
(Not directed at GitHub specifically, but at bogus status pages.)
Two technical reasons, capstoned by the driving business motivation:
- False positives
- Short outages that last a minute or three
Ultimately, SLAs and uptime guarantees. That way, a business can't automatically tally every minute of publicly admitted downtime against the 99.99999% uptime guarantee, and the onus to prove a breach of contract is on the customer.
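For a sense of scale, here's a rough back-of-the-envelope conversion of uptime tiers into allowed downtime per year (the tiers are just common examples, not anything from a specific vendor's SLA), which is why every publicly tallied minute matters:

    # Allowed downtime per year for a few common uptime tiers (illustrative only).
    MINUTES_PER_YEAR = 365 * 24 * 60

    for nines in ("99.9", "99.99", "99.999", "99.99999"):
        allowed_minutes = MINUTES_PER_YEAR * (1 - float(nines) / 100)
        print(f"{nines}% uptime -> {allowed_minutes:.2f} minutes of downtime per year")

At seven nines that works out to roughly three seconds a year, so a single honestly reported blip would blow the budget on its own.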
No it doesn't. The number of false alarms you can get with internet-based monitoring is more than zero. You could have a BGP route break things for one ISP your monitoring happens to use. You could have a failover event where it takes 30 seconds for everything to converge. I have multiple monitors on my app at 1-minute intervals from different vendors, and ALWAYS a user will email us within 5 seconds of an issue. It's not realistic for a company to have automatic status updates trigger things without a person manually reviewing them, because too many things can go wrong with an automatic status update and cause panic.
Most paid status-monitoring services cover BGP route problems and ISP issues by only flagging an event if it's detected from X geographically diverse endpoints.
As for the 30 seconds where you wait for failover to complete: that is a 30-second outage. It's not necessarily profitable to admit to it, but showing it as a 30-second outage would be accurate.
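As a rough illustration of that quorum idea (the probe regions, quorum threshold, and the check itself are all made up for the sketch; a real service would run each probe from the remote region rather than locally):

    # Sketch of quorum-based alerting: only flag an outage when enough
    # geographically diverse probes agree the endpoint is unreachable.
    # Probe regions and the quorum threshold are illustrative placeholders.
    from urllib.error import URLError
    from urllib.request import urlopen

    PROBE_REGIONS = ["us-east", "eu-west", "ap-southeast"]  # hypothetical probe pool
    QUORUM = 2  # require at least this many failing probes before alerting

    def probe(url: str, region: str) -> bool:
        """Return True if the URL answered; a real probe would run from `region`."""
        try:
            with urlopen(url, timeout=10) as resp:
                return resp.status < 500
        except URLError:
            return False

    def outage_detected(url: str) -> bool:
        failures = sum(not probe(url, region) for region in PROBE_REGIONS)
        return failures >= QUORUM

    print(outage_detected("https://github.com"))

Requiring agreement from more than one vantage point is what filters out the single-ISP/BGP false alarms mentioned above.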
The default TCP timeout is more than 30 seconds. The internet itself has about 99.9% uptime. If one company showed every 30-second blip on their outage page, all their competitors would have that screenshot on the first page of their pitch deck, even if they also had the same issue. 2-5 minutes is a reasonable threshold for a public service to announce an outage.
Forgot about that CenturyLink BGP infinite-loop route bug they had, where it took down their whole system nationwide. A lot of monitoring services showed red even though it was one ISP that was down.
Who would panic? If nobody notices it's out because it's not, then nobody is going to be checking the status page. And if they do see the status page showing red while it's up, it's not like they're going to be unhappy about their SLA being met.
Maybe you want human confirmation on historic figures, but the live thing might as well be live.
Not really; things fail in unexpected ways. Automated anomaly detection is notoriously error-prone, leading to a lot of false positives and false negatives, even in the trivial case of monitoring a single timeseries. For a system the size of GitHub, you need to monitor a whole host of things, and if it's quasi-impossible to do one timeseries well, there's basically no hope of doing automated anomaly detection across many timeseries with a signal-to-noise ratio that's better than "humans looking at the thing and realizing it's not going well".
There's stuff like this that can't be automated well. The automated result is far worse than the human-based alternative.
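To make the single-timeseries case concrete, here's a toy trailing-window z-score detector (data, window, and threshold all invented for illustration); tighten the threshold and it flags harmless spikes, loosen it and it misses slow degradations:

    # Toy z-score anomaly detector over one timeseries, to show the
    # false-positive/false-negative trade-off. Data and threshold are invented.
    from statistics import mean, stdev

    def anomalies(series, window=20, threshold=2.0):
        """Flag points deviating from the trailing window by > threshold sigmas."""
        flagged = []
        for i in range(window, len(series)):
            baseline = series[i - window:i]
            mu, sigma = mean(baseline), stdev(baseline)
            if sigma == 0:
                continue
            z = abs(series[i] - mu) / sigma
            if z > threshold:
                flagged.append((i, series[i], round(z, 2)))
        return flagged

    # Ordinary noisy traffic with a recurring, harmless burst every 10 minutes:
    requests_per_min = [100, 102, 98, 101, 97, 103, 99, 100, 140, 101] * 5
    print(anomalies(requests_per_min))  # the routine bursts get flagged anyway

And that's one series; a GitHub-sized system has thousands of them, each with its own seasonality, so the alert noise compounds quickly.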
There was that hilarious multi-hour AWS failure a while back where the status page was updated via one of their internal services... and that service went down as part of the outage.
I feel sympathy for all the engineers at the companies where I've implemented CI/CD based on GitHub Actions in recent years. It's not like I didn't tell them that GitHub had shown itself to be somewhat unreliable in recent years, and, in contrast to their claim that "it's just the build pipeline, not the product", I think it's a horrible incident if you're not able to deploy to production and have barely any ad-hoc backup.
I'm always evangelizing Argo or Flux plus some self-hosted GitLab or Gitea, but it seems like they all prefer to throw their money at GitHub for now.
Since nobody can work, I'll just leave this here: "I must have put a decimal point in the wrong place or something. Shit! I always do that. I always mess up some mundane detail."
At this point I don't trust self-owned status pages at all - those crowd-sourced ones where users report issues are much faster to respond to outages, which may never even get reported by official status pages.
I am laughing so hard right now. My best friend mocked me for using git.lain.faith to host my code, saying it wasn't reliable. Well, well, well. In the last year GitLain hasn't gone down once.
I know he was still right in a way; who knows when git.lain.faith will just disappear. But still. I'm going to send some texts to bother him right now, hahaha.
Depends on what you want. If you want uptime, then sure. If you want to be able to blame someone then no.
If you are down for 1 hour a year on self-hosting, but Office 364 is down 3 days a year, your CEO is going to be more understanding of the Office outage, as all his golf buddies have the same problem and he reads about it in the NYT.
But in any case, zero downtime is difficult; that's why you need two independent systems. I had a 500-microsecond outage at the weekend when a circuit failed, which caused a business-affecting incident. Not a big one, fortunately, as it was only some singers, but it was still one that was unacceptable -- had it happened at a couple of other events in the last 12 months, it would have been far more problematic. Work has started to ensure it doesn't happen next year.
GitHub should monitor their status page traffic for spikes, which probably mean something is wrong somewhere, even if they themselves haven't noticed yet.
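A minimal sketch of that heuristic, assuming you already have per-minute hit counts for the status page; the 3x multiplier and the one-hour baseline window are invented for illustration:

    # Sketch: treat a sudden jump in status-page traffic as an early-warning
    # signal. The hit counts, 3x multiplier, and 60-minute baseline are made up.
    from collections import deque

    class StatusPageTrafficWatch:
        def __init__(self, baseline_minutes=60, spike_factor=3.0):
            self.history = deque(maxlen=baseline_minutes)
            self.spike_factor = spike_factor

        def observe(self, hits_this_minute: int) -> bool:
            """Record one minute of traffic; return True if it looks like a spike."""
            is_spike = False
            if len(self.history) == self.history.maxlen:
                baseline = sum(self.history) / len(self.history)
                is_spike = hits_this_minute > self.spike_factor * max(baseline, 1)
            self.history.append(hits_this_minute)
            return is_spike

    watch = StatusPageTrafficWatch()
    for minute, hits in enumerate([120] * 60 + [900]):  # a quiet hour, then a rush
        if watch.observe(hits):
            print(f"minute {minute}: status-page traffic spike, page the on-call")

It wouldn't tell you what is broken, just that a lot of people suddenly want to know, which is often the earliest signal there is.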