You know, this seems like a perfect example of a "simple" approach to scaling: realizing that CI is the most likely thing to cause issues in the future due to how popular it is, and then splitting the database along that particular boundary, based on the actual need that has arisen.
You don't always need to experiment with bleeding edge tech or very complicated multi-leader clusters (though those also have their own use cases at a certain scale), sometimes just splitting the whole thing, especially when using something as solid as PostgreSQL, is enough!
We did experiment with a few ideas outside of this decomposition method. In the end what you talked about here in your comment is exactly correct. And we value boring solutions in the end: https://about.gitlab.com/handbook/values/#boring-solutions.
While I agree that this is a wonderfully simple case, the main problem is that it isn't always that easy. As long as you can get away with slicing tables into different "shards" and aren't too join-heavy, it isn't too bad. However, it still adds complexity to the application code that is a constant maintenance cost. And eventually, if you keep scaling, you will get to the point where you need to shard a single table, and that is very painful, both in the risk and difficulty of the conversion and in the further ongoing development costs.
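To make that maintenance-cost point concrete, application-level sharding tends to end up looking something like this; a minimal sketch with made-up table, column, and helper names, using the plain `pg` gem:

```ruby
require "pg"

# One connection per shard; every query now needs to know its routing key.
SHARDS = {
  shard_one: PG.connect(dbname: "app_shard_one"),
  shard_two: PG.connect(dbname: "app_shard_two"),
}.freeze
SHARD_NAMES = SHARDS.keys

def shard_for(project_id)
  # Simple modulo routing for illustration; real setups need a stable mapping
  # that survives adding shards, which is where much of the ongoing pain lives.
  SHARDS[SHARD_NAMES[project_id % SHARD_NAMES.size]]
end

def builds_for(project_id)
  shard_for(project_id).exec_params(
    "SELECT * FROM ci_builds WHERE project_id = $1", [project_id]
  )
end

# Anything that used to be a cross-project JOIN now has to be stitched
# together in application code instead of SQL.
```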
I do dream of the day when distributed databases are the "default" for new projects. Likely running just a single instance at the beginning. But then you have a seamless path forward when (and if) you need it. Even at medium size three small nodes will be easy to manage and give easy HA and allow you to do version updates safely and with no downtime.
I don't think we are there yet; there are a few contenders in the running, but Postgres is tough competition. Its years of stability and predictability give it huge points even if it has the downsides of a centralized system. But I think the turning point is steadily approaching.
I am not sure your utopia is quite as utopian as you think.
I've been working on an internal service at a major tech company for a few years now. The project is awesome and super fun, the team is great, etc. - but the overall size of the data and query load for the key "working set" data store would easily be served by an RDBMS (including all likely future growth for 5+ years), yet we went for a (managed) distributed document store.
We are paying more money and dealing with all the complexity and limitations of a NoSQL document store, while we could have just used great ORM/SQL and enjoyed those sweet, sweet transactions and joins and the wealth of tooling that exists for mature RDBMS solutions. Sometimes, it would be great to simply choose the right tool for the job, even if it isn't sexy and doesn't help anyone add "distributed NoSQL database" to their CV...
I never said NoSQL. I think the relational model is incredibly valuable. I'm talking about updating the implementation, not the interface.
However there are some new databases that maintain a relational model (often fully compatible with PostgreSQL or other relational DBs). The only real downsides to these right now are stability and performance. Performance isn't a big deal when you can scale horizontally, so I think when this "NewSQL" generation matures it will be nearly all upsides.
I'm not sure we will be stuck with SQL forever but strong consistency, SQL and the database-backed constraints and indexes are invaluable and I wouldn't want to be stuck without them.
There is definitely lots of progress to make in this space. From what I have seen they are all significantly slower on a single node and are not as battle-hardened. But I think with time we will have a few really nice options to pick from.
You also have FoundationDB, which seems quite battle hardened with Snowflake's, Apple's and Datadog's adoption.
Only provides a quite barebones Key Value interface to play around with though, so it's not anywhere near a drop-in replacement for your traditional SQL database.
I didn't mention FoundationDB since it isn't relational. I think the relational model with automatic indexes and query optimization is a huge benefit. That being said of all of the options mentioned it does seem like the most battle-tested. I would love to see a relational layer built on top of it and see how that turns out.
Scaling like that always feels like applying band-aid on top of band-aid. Honestly I think that once you need to change your code to scale, you should take the opportunity to use another database right away.
DISCLAIMER: I do some paid work with Open Core Ventures[0] and I've been a GitLab fan for a while so I'm quite biased.
Just tried to use some GitLab repos I have and ran into a well specified 503 error page:
> We are splitting our database into Main and CI!
>
> For more information about what we're up to, you can check out our blog article. For progress updates, please check our status page.
Don't know how I missed it, but it's pretty nice that the downtime is clearly specified, and the blog post (and related posts) were very informative.
Some random thoughts below:
> GitLab.com's database architecture uses a single PostgreSQL database cluster. This single cluster (let's call it main), consists of a single primary and multiple read-only replicas and stores the data generated by all GitLab features. Database reads can be scaled horizontally through read-only replicas, but writes cannot because PostgreSQL does not support active-active replication natively.
All you need is Postgres. IIRC Reddit had a similar starting story.
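(To make the replica point concrete, this is roughly what read/write splitting looks like in stock Rails 6; just an illustrative sketch with assumed model and database names, not GitLab's actual setup:)

```ruby
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  # :primary and :primary_replica are entries in config/database.yml
  connects_to database: { writing: :primary, reading: :primary_replica }
end

class Project < ApplicationRecord; end

# Reads inside this block can be served by a replica...
ActiveRecord::Base.connected_to(role: :reading) do
  Project.where(visibility: "public").count
end

# ...but every write still lands on the single primary, which is the part
# that can't be scaled horizontally without something like active-active.
Project.create!(name: "example")
```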
Excited to see how much faster gitlab.com will be and how much more reliable their CI gets (I don't have numbers on hand, but I think they've gone down less than GH Actions in recent memory at least -- maybe there's some recency bias there).
In a world where Citus is now fully open source[1], I wonder if it needs to just get pulled into contrib.
Citus in PostgreSQL main would be great. Currently Citus is one of many solutions, but being in main would make it the solution. PostgreSQL would officially be a distributed database, not just by stacking some plugins on it.
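For anyone who hasn't tried it, the extension route today is already pretty unintrusive; something along these lines, where the table and distribution column are made up for the example and you need a Citus-enabled cluster:

```ruby
# Hypothetical Rails migration against a Citus-enabled Postgres cluster.
class DistributeCiBuilds < ActiveRecord::Migration[6.1]
  def up
    execute "CREATE EXTENSION IF NOT EXISTS citus"
    # After this, rows are hash-distributed across worker nodes by project_id,
    # and the coordinator fans queries out to them.
    execute "SELECT create_distributed_table('ci_builds', 'project_id')"
  end

  def down
    execute "SELECT undistribute_table('ci_builds')"
  end
end
```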
> Citus in PostgreSQL main would be great. Currently Citus is one of many solutions, but being in main would make it the solution. PostgreSQL would officially be a distributed database, not just by stacking some plugins on it.
Scaling concurrent transactions (locally and horizontally across machines) has been one of the valid complaints about postgres for so long, it would be amazing to see it solved so well/reliably in-tree.
This is one of the great things about pg to me -- the cottage industry of consultancies/companies that make truly awesome stuff and build viable businesses is seriously amazing. Most times they merge changes upstream (if only for the clout), and sometimes they just sprinkle gold dust on all of us.
As a sidenote, I have my M$ misgivings like any other dev of a certain age, but it is amazing that they chose to do that. Not only did they provide an exit to one of the awesome companies in the space, they also released such a huge benefit to the community for free. They'll probably get their return back by retaining the Citus people inside Microsoft, since obviously they're the ones making the magic.
IIRC GitLab uses Redis as well; it gets used by CI for streaming logs while a job is still running, before they get tossed up to object storage. Their statement that they only used a single cluster up to this point is a bit dated as well; I believe they've had a second cluster for nearly a year now for their new docker registry implementation. This new cluster for CI is more like a third.
You're correct. We also have Redis for a few things: Rails caching, Sidekiq jobs, session data, CI, and a few other things. You can learn more about GitLab.com's infrastructure architecture here [1] and more about what Redis is used for here [2].
> Disclaimer: working for GitLab, not in areas related to this
Just an FYI: this is a disclosure rather than a disclaimer; you're disclosing that you work for GitLab rather than disclaiming legal responsibility for something (a common mistake to mix these up).
Would be interesting to know if, and to what extent, the quite new Multiple Databases feature of ActiveRecord [1] (introduced with Rails 6) helped in achieving this.
Thanks for asking - I have forwarded your question; sharing a summary of what I learned below. Note that I am not a Rails engineer; for deeper questions I'd suggest commenting on the linked issues and tagging engineers directly :)
Rails 6+ multiple database support is being used by GitLab. The major difference is that a hand-rolled DB load balancer is currently used. [0] tracks the effort to change the load balancer to the native Rails implementation for connection handling.
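For reference, the native Rails 6+ mechanism looks roughly like this; class and database names here are assumed for illustration, not GitLab's actual code:

```ruby
# Each logical database gets its own abstract base class; Rails switches
# connections per model based on which base class it inherits from.
class CiApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  # :ci is a separate entry in config/database.yml alongside the main database
  connects_to database: { writing: :ci, reading: :ci }
end

class CiBuild < CiApplicationRecord
  # Reads and writes for this model now go to the CI database,
  # while models inheriting from the main base class keep using main.
end
```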
This comment summarizes [1] how things worked, and how Rails' support of multiple databases can help. It also provides flow diagrams and code snippets for better understanding.
Side note: The multiple database feature effort is also related to bringing Clickhouse as a datastore for Error Tracking [2] and more Observability data [3].
Is anyone else also experiencing problems with purchasing additional CI minutes? I bought extra minutes yesterday and this morning, but they've not been added to my account. Previously this worked instantly. It's blocking some of our development processes because the pipelines don't run right now; we've run out of minutes. I'm curious whether it's related to splitting the database.
If anyone from GitLab is reading this, it's for the public repository of Baserow https://gitlab.com/bramw/baserow. Help would be much appreciated.
Sorry for the troubles, and thanks for sharing here. I have forwarded your comment to engineering teams to investigate whether it's related to the migration.
To avoid blockers, I'd also suggest opening a support ticket [0] to let support and billing teams know and escalate. Include this HN comment and my name if you like.
Thanks for forwarding my comment, it's much appreciated! I opened a ticket a while ago; I'll add this comment to the conversation. We're working on setting up our runners in the meantime.
I have talked with the fulfillment team at GitLab [0], and there was a problem identified with workers not syncing the purchased CI minutes after the DB migration. Corrections have been made, and you should see the synced CI minutes soon. Support teams have access and can check with you; I do not have access myself.
Setting up your own runners is a sensible approach, and allows you to scale them for your own needs too. Maybe this workshop about pipeline efficiency can help with more ideas and insights [1].
Thank you for the update! We already have our runners working; it was super easy to set up. I can confirm that the pipeline minutes have been added to my account.
Why does something like this require downtime? Could not both databases be utilised at the same time, backfilling older entries in parallel, and once all data is migrated, flip a feature switch to stop writing to the old DB? Reads can be switched with a flip too, or with some smart logic to check both databases if the data is not found. This approach is harder to implement and the overall migration would take longer, but considering how many companies depend on GitLab SC, this should be the preferred approach I think.
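Something like this, as a rough sketch of the dual-write idea with hypothetical helper and flag names, not anything GitLab actually does:

```ruby
# Dual-write phase: every new row goes to both databases until the
# cutover flag stops writes to the old one.
def create_ci_build(attrs)
  new_db.insert(:ci_builds, attrs)
  old_db.insert(:ci_builds, attrs) unless Flags.enabled?(:stop_writing_old_ci_db)
end

# Reads prefer the new database once enabled, falling back to the old one
# for rows the backfill hasn't copied yet.
def find_ci_build(id)
  if Flags.enabled?(:ci_reads_from_new_db)
    new_db.find(:ci_builds, id) || old_db.find(:ci_builds, id)
  else
    old_db.find(:ci_builds, id)
  end
end

# Meanwhile a background job copies historical rows in batches until both sides match.
def backfill_ci_builds(batch_size: 1_000)
  old_db.each_batch(:ci_builds, of: batch_size) do |rows|
    new_db.upsert(:ci_builds, rows)
  end
end
```

Most of the extra cost is in keeping those two write paths consistent and verifying the backfill before the final flip.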
Swallowing an hour or two of pre-scheduled downtime can be worth it if you can significantly reduce the complexity and risks associated with a migration, and get it over and done with sooner. Particularly if you're already hurting from whatever it is that you're fixing, and it's about to cause you unscheduled downtime.
The comment [0] provides more insights into the planning and downtime requirements. The epic itself may be helpful too, it is linked from the blog post.