The Consul outage that never happened (about.gitlab.com)
100 points by dankohn1 on Nov 16, 2019 | 54 comments



> There is just no substitute for understanding how everything works.

Great line, and one I strongly agree with. I love the tooling we have available today like Terraform, Ansible, etc. but the experience I gained by keeping bare metal servers alive and happy with nothing more than a shell has undoubtedly made me a much better admin.


Agree 100%. I'm an SRE team of one and I'm looking for an additional team member. I've interviewed about a dozen people so far and one thing I've noticed is that a lot of young engineers do not know the basics. They really can't tell me how to log into a Linux box and troubleshoot. I think that's a shame. You really lose a lot of insight if you don't understand how the underlying pieces work.


An interesting side-effect of having awesome tooling: when you have a service that is taking too much memory and you can see directly on a Grafana dashboard that the reason your service keeps getting OOM'd is too many Goroutines, you can avoid even having to go into a box to debug.

When I used to be on call, I remember very distinctly an incident where there were a ton of restarts on containers for a set of nodes (this was on K8s, so not immediately obvious). The operations team mentioned to me that in the logs (K8s logs here) there were a lot of errors being generated along the lines of "Discovery failing, restarting".

I asked if we could get a tcpdump to see what was happening, and the answer I got was "There is no tcpdump in our container." Meaning that we didn't have the binary in the actual container image.

For anyone with any knowledge of Kubernetes, you know that you can easily just SSH to the machine running the pod, get the current port it's being run on, and then run tcpdump there. However, nobody tied all of this together to actually get a tcpdump, which caused this issue to persist for an extra 3 hours when the root cause was a misconfigured, flapping NIC.
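
For the record, finding where to capture is basically one call against the API server. A rough sketch (the pod/namespace names and the ssh step are made up, not from the incident above):

    # Find the node a pod runs on so you can tcpdump from the node itself,
    # even when the container image ships without tcpdump.
    import json
    import subprocess

    POD, NS = "my-service-abc123", "prod"   # hypothetical names

    out = subprocess.check_output(
        ["kubectl", "get", "pod", POD, "-n", NS, "-o", "json"])
    status = json.loads(out)["status"]
    node_ip, pod_ip = status["hostIP"], status["podIP"]

    # Then capture on the node (which does have tcpdump installed), e.g.:
    #   ssh <node_ip> sudo tcpdump -i any host <pod_ip> -w /tmp/pod.pcap
    print(f"ssh {node_ip} sudo tcpdump -i any host {pod_ip} -w /tmp/pod.pcap")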

Without understanding how your systems actually work under the hood, you will be running them with your hands tied. It's not good enough to understand one abstraction, you need to be prepared at anytime to peel back the layers of abstraction and get your hands dirty in a different layer.


Even worse, I joined a new company in a senior technical role and it's seen as a negative that I prefer ssh/strace/tcpdump to debug problems.


It arguably is.

A good SRE needs to understand systems, as in the automation of n computers. Focusing on and preferring single-system tools where you have to take manual action points to an immaturity in dealing with complex distributed systems.

However, it's a common problem and most of the folks buying complicated distributed tracing systems don't have particular skills in using them either, so your skills are valuable, even if there could be better ways to do it.

Similarly if you focus on hiring SREs who know shell commands well, you might lose more pertinent skills such as knowing what terraform is actually doing, general programming skills, CI/CD and an understanding of cloud APIs.

Horses for courses; the more we know the better. Look for both sets of skills in your teams and cross train as much as possible.


Arguably those who demonstrate understanding of low level fundamentals have the drive to understand as many layers as possible.

Capturing AWS API calls through sslproxy not only implies you know what terraform is doing but you also have a higher probability of solving difficult problems.
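
For anyone curious how that looks in practice: mitmproxy (rather than sslproxy specifically) makes it about ten lines. A sketch; the addon file name and the .amazonaws.com filter are my own choices, not anyone's production setup:

    # log_aws_calls.py - run with `mitmdump -s log_aws_calls.py`, then run
    # terraform with HTTPS_PROXY pointed at the proxy and the proxy's CA
    # added to the trust store, and every AWS API call shows up here.
    from mitmproxy import http

    def request(flow: http.HTTPFlow) -> None:
        host = flow.request.pretty_host
        if host.endswith(".amazonaws.com"):
            # X-Amz-Target carries the action name for JSON-protocol services.
            target = flow.request.headers.get("X-Amz-Target", "")
            print(flow.request.method, host + flow.request.path, target)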

All code boils down to an execution layer and having inspection ability at that layer will always be valuable.


This was a fascinating read but I feel like I’m left with some more questions about what’s going on under the hood in gitlab cloud:

1. I thought gitlab was hosted in google cloud, but there are a lot of references to e.g. a hand-rolled consensus system and self-managed database clusters. I'm wondering if this event changes the math on build vs. buy at all for gitlab; it sounds like a lot of money has gone into this solution. How did that solution come about? Is it about some specific Postgres features that aren't available in google's hosted dbs, or about pricing?

2. Again on the google cloud angle, why are servers being hand-managed and rebooted? Elasticity in the cloud would make me think that the safest option would be to stand up parallel infrastructure (like in a DR plan) and migrate traffic. Was this just about speed of solution rollout? Does gitlab have plans to harden DR plans so that you can execute in cases like this? Whenever someone says they’re “in the cloud” and yet unable to treat servers like cattle, I get a bit worried.


I led my first from-the-ground-up greenfield infrastructure + development project on Consul+Nomad+Fabio, and I became a big Hashicorp fanboy then.

But, if I were in a cloud environment, I definitely wouldn't use any of those products. As far as cloud, I only have experience with AWS, but there is nothing Consul+Nomad+Fabio would give me that I couldn't do better with managed services. I'm sure GCP and Azure are the same.


Then comes the day when you need a hybrid system that spans clouds, and your managed services have suddenly become a hindrance. That is where these services are valuable.

I've run systems with Consul + a VPN that we did zero-downtime migrations of, first from AWS to GCP and then to managed servers, and once you do that you start appreciating platforms that you can replicate. This was prior to Nomad.

Note that this of course means you can use managed services if you're careful and they're set up in ways that let you do seamless migration - most importantly avoiding any proprietary features.

[the way we did the migrations was nothing fancy: establish a VPN between the two sites; use load balancers updating from the health checks for all services; add the new servers into the cluster; drain the old ones once the new ones were up; point DNS at the new external load balancers; but being able to easily have Consul span different cloud environments made it a lot easier]


So you're avoiding “proprietary features” of the cloud. Congratulations! Now you have the worst of all worlds. You're spending more on resources than you would at a colo, spending just as much on babysitting infrastructure, and you're not saving any money or time by outsourcing “the undifferentiated heavy lifting”.

You can't imagine how many times I have seen “software architects” use the “repository pattern”, interfaces, and be sure not to use any “proprietary features”, just in case one day their CTO decides to move away from their six-figure-a-year Oracle installation to MySQL.

If being cross platform or cloud is a requirement in the beginning, you have to constantly test on all of your supported platforms. If not, it’s a waste of time to design for some unknown eventuality most of the time.

In reality, at a certain size, you’re always tied into your infrastructure and few companies move from one cloud provider to another. The expense, the chance of regressions are too high and the rewards are too low.

But to address the Consul+Fabio use explicitly, here's how I've duplicated the functionality:

The KV Store: A DynamoDB table behind an API. The API can be used from anywhere.

Health Checks/restarts/service discovery: Fargate (Serverless docker) with a load balancer/target groups that do health checks and automatically launch new Docker instances or just regular EC2 instances behind an autoscaling group.

Nomad: we used Nomad because I didn't want to introduce Docker and it can orchestrate anything. These days I would just use Docker, or if it was simple enough, Lambda triggered by either scheduled events or other AWS events like queues or SNS messages.
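
To make the KV-store piece above concrete, the whole thing is a few lines of boto3; the table name and key schema here are made up for illustration, not what I actually ran:

    import boto3

    # "KV store behind an API": a DynamoDB table plus a trivial wrapper
    # that can be called from anywhere with AWS credentials.
    table = boto3.resource("dynamodb").Table("config-kv")

    def kv_put(key: str, value: str) -> None:
        table.put_item(Item={"k": key, "v": value})

    def kv_get(key: str, default=None):
        item = table.get_item(Key={"k": key}).get("Item")
        return item["v"] if item else default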


I tend to avoid cloud deployments in the first place, exactly because it is usually far more expensive than even managed hosting, even if you don't go all the way to a colo. And I've yet to do a migration off cloud services that didn't result in substantial cost reductions (including when factoring in devops costs).

But sometimes it's impossible to convince people of this, even with detailed cost breakdowns, and/or people have other reasons for using cloud providers, some more rational than others. My current system is hosted on AWS because it's convenient and it won't grow large enough (in terms of resource use) anytime soon to be a cost issue relative to the engineering resources put into developing it. But what you can do in those instances is to plan out an architecture that ensures you maintain flexibility and minimizes the impact of changes while taking advantage of what you can.

That does not imply avoiding all proprietary solutions is always the right choice. It means avoiding solutions that are particularly onerous to transition off, and limiting different services' exposure to the proprietary aspects. How far you take this certainly depends greatly on how likely you are to migrate.

My experience is different than yours - a very substantial proportion of the systems I've been contracted to work on have involved migrations between providers sooner or later.

(But part of that may well have been architectures that made migrations simple and cheap. In one case the company in question was explicitly constantly chasing the cheapest deal, and so were on AWS and then GCP thanks to huge amounts of free credits, and moved to Hetzner the moment they'd used them up; the credits added up to at least two orders of magnitude more than the cost of the migrations)

E.g. for anything deployed in AWS I'll happily use RDS, because it's no harder to migrate off an RDS Postgres instance than migrating any other Postgres instance, and the servers talking to it do not need to know whether they're talking to an RDS-based instance or something that has been moved to another cloud provider.

And I'll happily use many services that are fully AWS proprietary too, like SNS and SQS, because their interfaces are narrow enough that migration tends to be relatively easy, and the actual interaction with AWS can be narrowed down to a limited set of services.
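
As a sketch of what "narrow" means in practice (just an illustration; the class and queue URL are invented for this comment, not something I'm claiming to run):

    import boto3

    class Queue:
        """The rest of the code only sees enqueue/dequeue, so swapping SQS
        for something else later touches exactly one module."""

        def __init__(self, url: str):
            self._sqs = boto3.client("sqs")
            self._url = url

        def enqueue(self, body: str) -> None:
            self._sqs.send_message(QueueUrl=self._url, MessageBody=body)

        def dequeue(self, max_messages: int = 1) -> list:
            resp = self._sqs.receive_message(
                QueueUrl=self._url,
                MaxNumberOfMessages=max_messages,
                WaitTimeSeconds=10)
            bodies = []
            for m in resp.get("Messages", []):
                bodies.append(m["Body"])
                self._sqs.delete_message(
                    QueueUrl=self._url, ReceiptHandle=m["ReceiptHandle"])
            return bodies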

The point is not to be able to move at a moment's notice, but to avoid pointless coupling.

Being cross platform/cloud is rarely a requirement; not making decisions that make it harder to migrate in the future when faced with choices where picking the cloud agnostic solution is about the same cost/complexity as the cloud specific one, often is.

Using things like Fargate can fit perfectly fine into that kind of architecture.


We just run the same thing our customers run on-premise. Some companies that allow you to run one instance of whatever they make will usually tie that to a single cloud provider. That simplifies their operations but also reduces the pool of potential customers.

Not everyone may be using the same cloud provider, not everyone may be able to run in the cloud, and some may even have OS/distro restrictions.

As always it's a matter of tradeoffs.


Keep in mind that Gitlab can also be deployed on-premise, using (presumably) the same HA mechanism:

https://docs.gitlab.com/ee/administration/high_availability/...

Locking a product and company like Gitlab into a single cloud vendor's proprietary orchestrator would be a massive risk.

Sure, if you're an early stage SaaS startup whose only metric is growth, use whatever gets your infrastructure up and running with as little downtime as possible. In any other situation, locking yourself into a single cloud provider is a risk that needs to be carefully evaluated.


1. The consensus system you are talking about is Raft (https://raft.github.io/) and it's baked into Consul.

2. There are two ways to interpret parallel infrastructure. I will post my thoughts on both.

2a. Standing up a parallel Consul cluster. This is problematic because typically what people do with Consul is to put a Consul agent on every server (or pod) which registers its services for discovery to the Consul servers. When you make a parallel Consul cluster you need to also restart the Consul agents on every other service. They only mention Postgres in this blog post but there can potentially be a LOT of other servers registered to the Consul cluster.

2b. Standing up everything it takes to run Gitlab in parallel and then diverting traffic. Honestly sounds great. The reason a team wouldn't do this is due to either a) not having infrastructure code which allows for one-click deployment of whatever it takes to have Gitlab running, or b) it's actually pretty expensive to do if you're not Google or Amazon. The blog post mentions 255 clients (and 5 Consul servers). That's a lot of servers to rebuild!
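
To make 2a concrete, registration happens against each node's local agent over HTTP, something like the sketch below (service name, port and check are invented, this is not GitLab's config). Point all of those agents at a brand new server cluster and every one of these registrations has to happen again:

    import requests

    # Register a service with the *local* Consul agent (default port 8500).
    payload = {
        "Name": "postgres",
        "Port": 5432,
        "Check": {"TCP": "127.0.0.1:5432", "Interval": "10s"},
    }
    resp = requests.put(
        "http://127.0.0.1:8500/v1/agent/service/register", json=payload)
    resp.raise_for_status()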

Now, I would love to hear from anyone else who uses Consul, because I have my own thoughts on how they decided to handle the issue. I will focus my attention entirely on the Consul and not the Postgres portion.

The blog post mentions two limitations: 1) Reloading the configuration of the running service, which worked fine and did not drop connections, but the certificate settings are not included in the reloadable settings for our version of Consul. 2) Simultaneous restarts of various services, which worked, but our tools wouldn't allow us to do that with ALL of the nodes at once.

We don't need to reload. We can run a rolling systemctl restart, which Ansible is perfect for. The nice thing about this is that their stop-gap solution is to disable TLS verification. This means that servers with TLS verification ON in the meantime should be able to continue validating certs while other servers can have a rolling restart that disables TLS verification one server at a time. If we want to minimize downtime we would do every non-leader server in the cluster, then finally the leader, then every client in a serial manner. With 260 servers to deal with it would be slow but it shouldn't break Raft at any point. There is no reason for quorum to be broken. The gossip will still be communicated over TLS, just that some of the servers/clients wouldn't be validating the certs.

Then, we would follow exactly the same process for rolling out valid certificates with TLS validation turned back on. One non-leader server at a time, then the leader, then every client.
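
Roughly what I have in mind, as a sketch (the hostnames, the client list and the ssh/systemctl step are placeholders for whatever your inventory tooling gives you):

    import subprocess
    import requests

    CONSUL = "http://127.0.0.1:8500"
    leader = requests.get(f"{CONSUL}/v1/status/leader").json()  # e.g. "10.0.0.5:8300"
    peers = requests.get(f"{CONSUL}/v1/status/peers").json()    # all server peers
    clients = ["10.0.1.10", "10.0.1.11"]                        # placeholder list

    def restart(addr: str) -> None:
        host = addr.split(":")[0]
        subprocess.check_call(["ssh", host, "sudo", "systemctl", "restart", "consul"])

    for peer in [p for p in peers if p != leader]:  # non-leader servers first
        restart(peer)
    restart(leader)                                 # then the leader (one election)
    for client in clients:                          # then every client, serially
        restart(client)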

I could be missing some critical piece here, and it looks like the Gitlab team did run a lab test before making their change in prod. It's easy to miss a possibility when under pressure, and also easy for an online commentator like myself to think they are so much smarter. They still managed to get out of the crisis with no downtime and congrats to the operators who pulled it off!


I think what you are missing is that validating servers/clients would not allow non-validating servers to rejoin the cluster (i.e.: servers/clients with validation enabled will validate both outgoing and incoming connections).

As I see it, by the time you restart the leader (and hence quorum switches to the non-validating portion of the cluster) all of your clients will suddenly fail (they are still validating, and there's no good server for them to connect to). Conversely, if you restarted the clients first they would all become unavailable before the quorum switch happened.


I was speaking more to 2b, and it shouldn’t be expensive because the other servers should spin down after you’ve migrated. You’re maybe paying for a day of overlap if that, and you’re paying for massively reduced risk.

Fair point on the consensus stuff. I was keying in on the Patroni system, but reading up more it looks like that's less novel than on first read (the “framework” lines in the library README had me worried).


I had the exact same thought on the consul rolling restarts. I’ve done exactly this before. I’m assuming there was some other issue I’m not getting as to why that wouldn’t have worked for them.


> After looking everywhere, and asking everyone on the team, we got the definitive answer that the CA key we created a year ago for this self-signed certificate had been lost.

The GitLab outages always make the company seem disorganized and sloppy, and unable to reflect on how to improve how they work. So they don't have a central place to store their CA, and even after an outage, did they improve anything about how they work?

It's ironic that the post seems geared towards recruiting, though I guess it's honest: you know what you're getting into with that team.


I would guess that the root cause of most outages that have a human factor is disorganization and sloppiness, because if that weren't the case there wouldn't be an outage.

It’s interesting to me that GitLab are so public and honest. I don’t think that appeals to everyone, but it is a unique selling point to some.


We used to (half) joke that in our “5 whys” process, #4 was often “because we were lazy [or in a hurry]”.


Being public and honest is always cited when this happens to Gitlab. Which I can say because my fragile memory recalls a number of incidents. This should be alarming but apparently their psy ops is better than their dev ops because we all react with fondness and awe. Maybe I should do more of that at work!


I think that is because HN has a lot of people who know first hand that very few places are free of these kinds of issues.

In 25+ years of working in tech, I can honestly say I've never worked anywhere where there haven't been one or more serious issues where one or more parts of the cause was something everyone knew was a bad idea, but that slipped because of time constraints, or a mistaken belief it'd get fixed before it'd come back and bite people.

That's ranged from 5 people startups to 10,000 people companies.

Most of the time, customers and people in the company outside of the immediate team only get a very sanitized version of what happened, so it's easy to assume it doesn't happen very often.

Gitlab doesn't seem like the best ever at operating these services, but they also don't look any worse than average to me, which is in itself an achievement, as most of the best companies in this respect tend to be companies with more resources that have had a lot more time to run into and fix more issues. For a company their age, they seem to be doing fairly well to me.


So they went off and implemented a brand new fancy service discovery tool for what I bet is a problem they didn't have, but couldn't do the basics of tracking 2 KB of data for the CA. I don't think that's an age issue, and there's nothing that prevents companies of any size from self-reflection on what they're doing and what's important.

Also what’s the point of transparency if you’re not getting critical feedback from it and learning?


I mean, I much prefer them telling us about all their stupid mistakes to keeping all of the stupid mistakes hidden.

I know every company makes stupid mistakes, but all of the ones Gitlab made are public, and there are comparatively few.


That last phrase is what I disagree with. Every company makes stupid mistakes, but Gitlab seems to make a lot - more than average, compared to companies I've seen the insides of (of course a small sample).


For me, as soon as the company becomes bigger, the number of mistakes becomes seemingly endless.


Yeah. “We ran rm -rf on a production server, and our backups are useless, but we're public and honest!” Sorry, not impressed.


This happens everywhere. You just don’t know about it precisely because companies are normally not public and honest about it.


It really doesn't happen everywhere.

Most places with decent devops hygiene have defense-in-depth around their backups.

I've heard of people dropping production databases in big companies (but saved by backups).

There are some stories around the bitlocker blackmail thing that had similar impact, but that was with a malicious opponent.

The only similar thing I've heard of was the notorious self-modifying MIT program (for geo-political coding) in the 1990s, which destroyed itself without backups.


Gitlab was "saved by backups" as well. They lost some data since the latest backup, which is rather common.


Most places don't have decent "devops hygiene".


> You just don’t know about it precisely because companies are normally not public and honest about it.

If a big company lost a ton of user data, I'd absolutely know about it, whether they have Apple-level secrecy or not.


The incident described did not result in loss of tons of user data, and neither will most incidents, whether you choose to be open about them or not.


What are you talking about?

> This incident caused the GitLab.com service to be unavailable for many hours. We also lost some production data that we were eventually unable to recover. Specifically, we lost modifications to database data such as projects, comments, user accounts, issues and snippets, that took place between 17:20 and 00:00 UTC on January 31. Our best estimate is that it affected roughly 5,000 projects, 5,000 comments and 700 new user accounts.

https://about.gitlab.com/blog/2017/02/10/postmortem-of-datab...

Yes, most incidents from most companies don’t result in this kind of data loss, which is why GitLab stood out.


How do you know what most incidents result in? For example, when Github deleted their production database[1], they simply gave no numbers of affected users/repositories. We do know that the platform already had over 1M repositories[2], so 5000 affected seems perfectly possible, but their lack of transparency protected them against such claims. And that lack of transparency seems to me to be the norm.

[1] https://github.blog/2010-11-15-today-s-outage/

[2] https://github.blog/2010-07-25-one-million-repositories/


MySpace lost all its music from 2003 to 2015: https://news.ycombinator.com/item?id=19417640

Probably a few hundred TB or so. Maybe nearly a petabyte?


That's the point: we know about that. It's hard to believe “this happens everywhere” when we only know of a few instances, and any instance would be picked up by the media.


I've had to help clean up after any number of data losses or near losses that have never been made public, ranging from someone mkfs'ing the wrong device on a production server to truncating the wrong table. In some cases that meant people afterwards writing awful scripts to munge log files (that were never intended for that purpose) to reconstruct data that was too recent to be in the last backup.

Of course there are people that avoid this, but I've seen very few places where the processes are sufficient to fully protect against it - a lot of people get by more on luck than proper planning. Often these incidents are down to cold hard risk calculations: people know they're taking risks with customer data and have deemed them acceptable.


I'm not sure that I agree. These things happen; being open about it just makes it look that way.

Other companies just have a red/orange warning button when things go to shit. You don't know what really happened, you just see the "more positive than real" summary.


> It is maintained by the Infrastructure group, which currently consists of 20 to 24 engineers (depending on how you count)

20 if you count in Base-12 and 24 if you count in Base-10?


It blows my mind they didn’t have sane PKI with that many resources. It seems like even the “small” initial team of a couple devs, a manager, and a director would’ve at least spun up a vault instance to use as a CA.
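
For anyone who hasn't done it: the Vault PKI backend really is only a handful of API calls. A sketch against a dev-mode Vault (the address, token, role and domain names are all placeholders):

    import requests

    VAULT = "http://127.0.0.1:8200"
    H = {"X-Vault-Token": "dev-only-token"}

    # Mount the PKI engine, generate a root CA, and define a role.
    requests.post(f"{VAULT}/v1/sys/mounts/pki", headers=H, json={"type": "pki"})
    requests.post(f"{VAULT}/v1/pki/root/generate/internal", headers=H,
                  json={"common_name": "consul-ca", "ttl": "87600h"})
    requests.post(f"{VAULT}/v1/pki/roles/consul", headers=H,
                  json={"allowed_domains": "consul.internal",
                        "allow_subdomains": True, "max_ttl": "720h"})

    # Issue a host certificate; the CA key never leaves Vault, so it can't
    # be "lost" the way a file on someone's laptop can.
    issued = requests.post(f"{VAULT}/v1/pki/issue/consul", headers=H,
                           json={"common_name": "server1.consul.internal"}).json()
    print(issued["data"]["certificate"][:60], "...")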

Also, easy for me to say from the peanut gallery, but I don't understand why they couldn't have done rolling Consul restarts to update the configs; I've done this many times on Consul clusters.


> It blows my mind they didn’t have sane PKI with that many resources. It seems like even the “small” initial team of a couple devs, a manager, and a director would’ve at least spun up a vault instance to use as a CA.

Not mine. In-house CA management is a true PITA; even multiple-thousand-people companies regularly fuck this up. I have experienced hours of outage because someone failed to renew the certificate for one of the thousands of pieces making up a Cisco network environment, and don't get me started on the drama that is root CA certificate rollover. I've experienced this in three companies and nowhere was it painless...


I have seen this more than a couple of times, at big places with the resources to manage it. Is it just me, or does the TLS and PKI tooling just seem weak? I keep thinking there should be some badass tool that helps manage this sort of thing; is there something I don't know about?


It's not just the tooling that's weak, it's also terminology and education. If you're not dabbling in crypto occasionally, half the OpenSSL manual and 100% of its codebase will be like hieroglyphs... leading to most organizations handing the operation of their PKI to the one person who manages to get a working HTTPS cert after copy-pasting shit from Stack Overflow and wrangling with the validation tool of their certificate vendor.

What also really bothers me is that there is no way (assuming I own the domain example.com) that I can get a certificate that allows me to sign certificates for names below example.com that are verifiable by clients without messing around with the system root trust store - and then, many pieces of software carry their OWN trust store totally independent from the OS one (especially Java; it's a true pain in the ass every two years to update that keystore so that LDAPS works again)...


Maybe a non-validating consul does not want to talk to a validating one?


Consul issues have bitten me at two companies, and I heard word of it being the culprit for some serious outages elsewhere. One possible takeaway here is to remove it.


Totally agree. It can really shit the bed hard, it can. I had an 0.8 cluster that crashed with a never-ending leadership election. The only option was to burn the whole cluster to the ground and reinitialise it from scratch. That issue seems to have gone away since 1.0, but I'm not sure I can sleep after running it for a couple of years. Every time I bounce a cluster node for patching I clench.

I could write a book on the problems I’ve had with vagrant as well. Lost at least a day a week to that in the last month.


The worst outages will always be those involving core infrastructure.


Consul seems to be more prone to issues than one would hope though. Imo the feature set is not worth the increased complexity and operational burden. There are simpler ways of handling service discovery and configuration without running your own consensus based cluster.


Genuine question: can you explain a couple of these simpler ways please?


Central authorities are typically simpler than gossip and consensus systems. They have failure modes too, of course, but those failure modes are better understood and potentially easier to manage.

Sometimes you can't avoid the need for distributed consensus, but you can box it inside a well defined abstraction like leader election, and then do everything else in a traditional client-server way.
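
A sketch of what boxing it in can look like, using Consul's session/lock primitive as the only consensus-touching piece (only because Consul is what the thread is about; any lock service works the same way, and the key name, TTL and node name are arbitrary; a real version would also renew the session):

    import requests

    CONSUL = "http://127.0.0.1:8500"

    def try_become_leader(node_name: str) -> bool:
        # One session per candidate; the KV acquire is atomic, so at most
        # one node holds the key. Everything else stays plain client-server
        # and just reads the key to find the current leader.
        session = requests.put(f"{CONSUL}/v1/session/create",
                               json={"Name": node_name, "TTL": "15s"}).json()["ID"]
        won = requests.put(f"{CONSUL}/v1/kv/service/myapp/leader",
                           params={"acquire": session}, data=node_name).json()
        return bool(won)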


If they were going through all this trouble and worry, why not create a new CA and drop the certs from it on the hosts? That's the work of just a few minutes (plus some bash scripting to mass generate your host certs). If they had already accepted that they were going to restart the services on all the hosts anyway, it would have saved them having to restart all the services again in the future when they need to drop more certs.
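
Something like this, shown here with Python's cryptography package rather than bash+openssl; the hostnames, lifetimes and file names are placeholders:

    import datetime
    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa
    from cryptography.x509.oid import NameOID

    def self_signed_ca(common_name: str):
        # New CA: a self-signed cert with CA:TRUE.
        key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
        name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, common_name)])
        now = datetime.datetime.utcnow()
        cert = (x509.CertificateBuilder()
                .subject_name(name).issuer_name(name)
                .public_key(key.public_key())
                .serial_number(x509.random_serial_number())
                .not_valid_before(now)
                .not_valid_after(now + datetime.timedelta(days=3650))
                .add_extension(x509.BasicConstraints(ca=True, path_length=None),
                               critical=True)
                .sign(key, hashes.SHA256()))
        return key, cert

    def host_cert(ca_key, ca_cert, hostname: str):
        # Per-host leaf cert signed by the CA, with the hostname as a SAN.
        key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
        now = datetime.datetime.utcnow()
        cert = (x509.CertificateBuilder()
                .subject_name(x509.Name(
                    [x509.NameAttribute(NameOID.COMMON_NAME, hostname)]))
                .issuer_name(ca_cert.subject)
                .public_key(key.public_key())
                .serial_number(x509.random_serial_number())
                .not_valid_before(now)
                .not_valid_after(now + datetime.timedelta(days=365))
                .add_extension(x509.SubjectAlternativeName([x509.DNSName(hostname)]),
                               critical=False)
                .sign(ca_key, hashes.SHA256()))
        return key, cert

    ca_key, ca_cert = self_signed_ca("consul-ca")
    for host in ["consul-01.internal", "consul-02.internal"]:  # ...and the rest
        key, cert = host_cert(ca_key, ca_cert, host)
        open(f"{host}.pem", "wb").write(cert.public_bytes(serialization.Encoding.PEM))
        open(f"{host}-key.pem", "wb").write(key.private_bytes(
            serialization.Encoding.PEM,
            serialization.PrivateFormat.TraditionalOpenSSL,
            serialization.NoEncryption()))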


Yeah, I was really wondering why they didn't just add a new CA. It would be really weird if something as critical as Consul were one of the few programs in the world that only accept a single root certificate.


Maybe I am saying something stupid, but infrastructure services should be able to use a dynamic set of keys.

If the first doesn't work, you try the second, and then the third, and so on.

Similarly, on the client side we should be able to dynamically add certificates.

“Our own key expires and our services are all about to drop connections” seems like something that should not happen.
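
For what it's worth, TLS stacks already allow this on the trust side: a verifier can trust several CAs at once, so a rotation just means shipping a bundle with the old and the new CA for a while. A minimal Python ssl sketch (file names are placeholders; Consul has its own config options for pointing at CA material, check the docs for your version):

    import ssl

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # load_verify_locations can be called repeatedly and accumulates, so
    # certs signed by either the old or the new CA keep validating.
    ctx.load_verify_locations(cafile="old-ca.pem")
    ctx.load_verify_locations(cafile="new-ca.pem")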




