Successfully Merging the Work of 1000 Developers (shopify.com)
339 points by rom16384 on Nov 20, 2019 | 103 comments



This is exactly the kind of workflow that Bors (https://github.com/bors-ng/bors-ng) automates.

In addition to Bors, there are a number of apps and services that automate this kind of workflow. Here is an incomplete list: https://forum.bors.tech/t/other-apps-that-implement-the-nrsr...

Edited to add: Graydon Hoare (creator of Rust) called this the Not Rocket Science Rule Of Software Engineering (NRSROSE): "automatically maintain a repository of code that always passes all the tests" - https://graydon2.dreamwidth.org/1597.html

Disclaimer: I have contributed code to Bors.


When Bors was introduced at my workplace, it was summarily disabled after 2 separate incidents: it was improperly applying commit messages (which we patched internally) and then improperly applying patches (at which point we disabled it and explored other solutions). Combined with the merge latency overhead, I would be extremely hesitant to advocate running it at scale.


At this point there are like 4 implementations of Graydon's bors. Which one did you use?

The Servo and Rust projects have used a bors* for many years and have had few problems.

* Servo uses the https://github.com/servo/homu fork of the Homu rewrite of Graydon's original code, and Rust uses a slightly different fork at https://github.com/rust-lang/homu.


> Bors when introduced at my workplace was summarily disabled after 2 separate incidents where it was improperly applying commit messages

Seems a bit extreme to discard something after finding two bugs. Could you have fixed them instead?


We did fix one bug, but after the 2nd incident we decided that using a tool that silently corrupts your repository is not worth the marginal benefit it provides over alternative, more naive solutions. Most companies run post-merge checks so that in the event of a bad merge the offending commit can be isolated and reverted, or run a cron at a regular cadence to isolate the issue to a set of commits. Having your repo reflect developers’ patches is really table stakes for a tool like this anyway.


The problem is not the bugs, it's that it was probably recommended to them as a reliable solution, losing them hours of work. This can't build trust, especially for such an important part of the infrastructure: you'll always wonder when it's going to hit you next.


Nope, that happening once is bad enough. Those aren't bugs like "oh shoot, my icon is the wrong color". It's really, really important that source control is not lying to you, or things will get very difficult very quickly.


The limitation of Bors for us was throughput: we were more interested in testing multiple simultaneous pull requests merging at the same time, rather than testing against the latest master.

With our CI times and PR volume, by the time any CI run completes, master would have drastically changed.


Sequentializing landing so that you can ensure passing tests is the main feature of bors. It sounds like the normal GitHub CI flow is already what you wanted.

In a bors workflow, bors is the only thing allowed to push to master, so master cannot get out of date.

The Rust project solves throughput with rollups, which are semi-automated. It would be nice if someone wrote fully automated rollup support into a bors, but alas, no one has tried it that I know of.


You don't need absolute sequentiality for bors's guarantees. You can speculatively build and try to merge multiple PRs in parallel even though only one will "win". That's fine and not thundering-herd stupidity if your build system is incremental so you can share work.

None of this is new at all, btw, I'm just regurgitating MVCC from postgresql.


I've read the description, but I fail to understand how this is different from just merging current master into the PR and running integration on that before merging it back into master. This can be done with 10 lines of Groovy in Jenkins.

Also the exact workflow described with staging branch and batch merge is probably another 20-30 lines.


That process assumes nothing gets merged during the integration test run. If the script also checked whether the target branch had changed in the meantime, and failed the merge if it had, then I think it would be the same.
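
A minimal sketch of that check, assuming the CI agent has a plain git checkout and a hypothetical ./run_integration_tests.sh entry point (branch names are made up for illustration):

    import subprocess

    def git(*args):
        return subprocess.run(("git",) + args, check=True,
                              capture_output=True, text=True).stdout.strip()

    base = git("rev-parse", "origin/master")      # master as of the start of the run
    git("checkout", "--detach", base)
    git("merge", "--no-ff", "origin/my-feature")  # hypothetical PR branch name
    subprocess.run(["./run_integration_tests.sh"], check=True)

    git("fetch", "origin", "master")
    if git("rev-parse", "origin/master") != base:
        raise SystemExit("master changed during the test run; aborting the merge")
    git("push", "origin", "HEAD:master")  # git rejects a non-fast-forward push anyway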


This is what Travis CI did/does and I have not seen another CI platform offer the same.

Travis runs your test suite on the merge commit, not the head of the branch.


Travis tests both the development branch and the merge/integration branch.

https://docs.travis-ci.com/user/pull-requests/#double-builds...


The OpenStack project faced a similar problem a few years back, they produced Zuul[1] to solve the problem. I can't compare it to what Shopify produced, but Zuul is absolutely worth a look when it comes to solving large scale, high throughput, must always be green CI.

The linked page explains the speculative execution aspect, used to ensure every change is tested before merge, with the true state of master at the time of merge despite that state being different than it was when the CI run started ;)

[1]: https://zuul-ci.org/docs/zuul/user/gating.html


A bit of shameless self promotion: I built a more basic merge bot for GitHub that efficiently updates and merges PRs because we were wasting a ton of time keeping branches updated at work.

https://github.com/chdsbd/kodiak


We started using Kodiak after the Auto Rebase bot was discontinued a few weeks ago. Other than confusing me with a normal merge when I thought it had been configured to do squash-merges, it works great. Thanks for releasing it!


Thanks for the chart comparing it to many alternatives. It both makes it clear what new things Kodiak does and what alternatives one could look at.


This is how google manages these changes:

https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...

This was written 3 years ago and the scale has only grown. But at the time, the numbers quoted in the article were 25,000 authors/day with 16,000 commits, and another 24,000 commits/day from automated sources.


I find it difficult to imagine why you would need 1000 developers in most single projects. The cost of managing so many people and work streams would seem to overwhelm whatever volume you could get out of them


The core shopify codebase is a monolithic one, hence the merging infrastructure described in the post. It's not a "single project" in any useful definition of the phrase.


I'm interested in how what looks to me like a fairly standard web store could have such an enormous number of developers working on it. Is there some kind of insane complexity behind webstores I am not seeing here?


It’s about providing a platform for enabling global commerce. Our engineering blog has other posts that go into technical depth about a lot of the challenges we face, and our solutions. For context, here is some info about Black Friday/Cyber Monday for us last year: https://engineering.shopify.com/blogs/engineering/preparing-...


I understand it's a hugely popular service and there is a lot that goes into something so big, but there are bigger websites that run on much, much smaller teams of developers. I just don't understand how there is even enough surface area on the app that 1000 people could be working on it at the same time. Is that number counting people working on the ops stuff or building other tooling not part of the main app?


Think of it this way. 100 different projects. Each with different business goals etc. Make some of the teams Platform teams for good measure.


Shopify is far more than just a "fairly standard web store"


When you have 100 devs working on the same codebase the bugs start to get so gnarly you need 900 more devs to fix them.


Yeah, honestly, as a person who has had the ill fortune of having to work with shopify, I find this kind of boasting kind of funny. Every admin console page load takes ten seconds and their backend can't handle a product with more than 100 variants. They clearly got too big too fast and just piled code on code on code until they ended up with what they have today. They should focus on upgrading the quality of their system, take 20 of those developers and start on a complete rebuild, and spend a little less time wrangling 1000 devs.


If I read this article right, if anyone breaks any part of the build it breaks for everyone? Doesn't sound very scalable. Shouldn't the main goal be to break up your continuous integration steps so that a person at one end of the company can continue working even if a person at the other end broke their build?

That way you can also add tags for flaky tests etc, to make your builds more reliable.

Edit: I didn't understand before what people had against monorepos, but I guess if you tie continuous integration builds to each repo, then having a monorepo becomes a huge pain point. Are there any open source tools to fix this?


There is an explanation about how we handle this case when I talk about the failure-tolerance threshold. I go deeper into this in my GitHub Universe talk where I also talk about an alternative (but costlier) solution through running parallel branches, but unfortunately that talk is not posted up yet.


Presumably when a merge breaks the main branch, the merge gets reverted and it is the task of whoever pushed it to fix it. In the meantime the CI server can continue with the next branch/commit in the queue.


It would be useful in this article to hear about what content is acceptable in a merge request. For example: can these all go straight to queue because they use feature flags? Are commits a "single piece of work", etc.

Not to sound like a downer but this is really an article about fixing a broken process because not running CI on branches before merging to master goes against best practices. Would have loved to actually hear about their work process as this whole article could be summed up as "not running CI on branches before merging their commits to master is a great way to ruin master".


Well, the problem is that master is a bottleneck. Trying to run CI on every branch before merging to master just won't work at the scale they are dealing with. At 1000 developers, the rate of PRs coming in makes it impossible to determine what current master will be when the PR is ready to merge (i.e. when the branch has a green build). It's also wasteful to build each branch against current master, because what is "current" now will no longer be current when the branch is ready to merge.

Perhaps this problem is what microservices are meant to solve. When you can't coherently integrate code fast enough, attack the bottleneck (master) by splitting it (multiple services).


Microservices don't really help with this. They just force you to think about your interfaces, but you should do that in a monolith too. If your interfaces are reasonably stable, merging is unlikely to break master if the branch was green before; if your interfaces change rapidly, you get problems with microservices too, just one level higher up, where you try to integrate them into a usable product.


One of the things I think microservices do help with is thinking about systems as being composed of components that are developed at different velocities and with different tolerances for risk.

Imagine an e-commerce site broken into a bunch of services including search and checkout. The search team is making updates daily, trying to improve ranking and drive conversion. The checkout team (assuming that the site is mature and has hit some design equilibrium) may only be releasing changes every couple of months, and if a bug is introduced, the financial impact is a lot higher.

By not bundling the outputs of very different teams together, you can help those that want to "move fast and break things" with their "moving fast" goal, and de-risk breaking everything by reducing the surface area of changes. Microservices-based architectures are a way to reduce friction caused by the structure of your organization and are one outcome of an Inverse Conway Maneuver.


They do help if a single team of 5-7 developers owns a set of microservices; it's unlikely you will have tons of PRs to merge all at once in a single repository with a smaller team. Granted, the ownership is a bit more clear when talking about a self-contained system that a team owns: https://scs-architecture.org/vs-ms.html

In the SCS literature, you would integrate via async mechanisms across SCSes, provide versioned interfaces, and enforce via consumer-driven contract testing like Pact: https://pact.io


> Perhaps this problem is what microservices are meant to solve.

Kinda. Microservices have always been an organizational solution; they're a way to shard your company's work output. Usually that's API contracts, but whatever mechanism is bottlenecked on the work output is affected, including how many concurrent builds are running due to how many people are touching the code at the same time.


This paper might be of interest to you on this very subject:

https://eng.uber.com/research/keeping-master-green-at-scale/


We didn't have a merge queue at Google. You rebased if there was a merge conflict, ran through CI again, and hoped there wasn't another merge conflict. I think I ran into merge conflicts maybe once a year, if that.

I think the success of this system breaks down into several parts:

1) Yup, microservices. You could submit your proto change, which would affect all clients, before actually implementing the code that used the new feature. (Or after, in the case of renaming some field from foo to deprecated_foo and refactoring the clients to stop using that field.) That means you could wrangle that change without having to worry about it affecting your actual feature. (Typically proto changes did not cause any breakages since people were very conservative about what changes they would make. Nobody renames all the fields, invalidating dependent code, or renumbers the fields, invalidating all existing messages. You COULD do those things, but nobody ever did.)

2) Clear dependencies in the build system. The CI system only had to run a small set of tests for most changes, because it knew exactly what tests the change would affect. You had to go way out of your way to depend on code without informing the build system. This is very different from every CI system that I've seen outside of Google, which seem to default to running everything and hoping your programming language or build system magically tracks dependencies. It doesn't; Docker for example will happily use random images that it thinks haven't changed, without actually checking if it has changed. (Consider building your app on top of golang:latest. Go is updated, and docker may or may not pull that new base image. Meanwhile, docker will happily clear its build cache if you edit README.md and no code. The result is that 50% of the time you waste 10 minutes rebuilding stuff that didn't change, and 50% of the time you get an outdated build. And nobody seems to care at all!)

3) Being careful about keeping changes small. I don't know what the average CL size is, but I would aim for 100 lines changed rather than 1000 lines changed. This is something that surprised me post-Google, people go away and work for a week and you have a 2000 line PR to review. These are tough to merge and were relatively rare in my experience at Google. It is not always possible to make every change small, but that should be the norm. Figure out how much work you can do in a day, and try to make a CL/PR that is that size. A lot can churn in a week. A lot less churns in a day. If you respected steps 1 and 2, that means your tests will run fast and it's unlikely that your merge will fail between CI and actually merging. If you have 2000 lines of code across 8 services... you'll probably never get it merged. But I am sure that I have successfully merged ginormous changes before, it's just more work.

All in all, my takeaway from this article is that Shopify is huge but I'm surprised that specialized merge tooling was necessary. I wonder what the underlying problem is; do they really have a 1000 developer monolith? Do they not use a proper build system like Bazel?


Xoogler here. When I left in 2015 there were definitely teams that used merge queues (i.e. TAP presubmit). Generally these were teams with a more monolithic architecture, like YouTube that had a massive Python mono.


I guess TAP presubmit might be a merge queue... but it seems different from this. There was no requirement that some mechanical system checked that tests passed before you merged your CL. You could merge any code whenever it was approved. If you felt like running the tests, good for you. TAP presubmit is just that mechanical system that runs your tests before executing the merge. That seems like traditional CI to me, not a merge queue.

Jenkins with a Github plugin behaves almost exactly like this system. Every PR basically has tests run 3 times; once for the branch that the PR is on, once for your branch merged to master, and then once after you do the merge and submit it. TAP presubmit did the "once for your branch" and TAP did the post-merge CI.

TAP presubmit didn't really check that the resulting merge was sound, so you would see TAP presubmit pass, your change get merged, and then have the build break anyway because of the race condition. A merge queue would not have this race condition... so I'm not sure Shopify has one either. The more I think about it the more it sounds like they just rewrote Jenkins. (And for that, I don't blame them.)


I have never seen anyone automate what Bors does with Jenkins and have anything approaching decent UX. The closest I've ever seen is a permanent stage branch that sometimes has automatic promotion, little integration with reviews and inevitably breaks every few weeks until some poor soul debugs it.


Point 2 is very important and very hard to get right. For unit tests, there is a clear dependency on the code and you can easily just run a subset of the tests. But wouldn't you have to run any system and integration tests of the affected module, as it's not clear what effects the code change can have? This will blow up CI times again. How did Google deal with this?


Not sure if this actually answers the question, but - Bazel, the build system used at Google, creates dependency graphs (example: https://blog.bazel.build/2015/06/17/visualize-your-build.htm...), which I believe can be used to run tests on any code affected by a change.
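
For example, a rough sketch of the reverse-dependency query (the target name is made up for illustration; `rdeps` and `kind` are standard Bazel query operators):

    import subprocess

    # Hypothetical: find every test target that transitively depends on a changed
    # library, so CI only needs to run that subset instead of the whole suite.
    changed = "//payments/lib:tax_calculator"  # made-up target for illustration
    query = f"kind(test, rdeps(//..., {changed}))"
    out = subprocess.run(["bazel", "query", query],
                         check=True, capture_output=True, text=True)
    affected_tests = out.stdout.split()
    if affected_tests:
        subprocess.run(["bazel", "test"] + affected_tests, check=True)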


Your integration test needed the system that you were integrating with, so you'd have to declare that as a dependency.

My philosophy was to always have integration tests run in the normal CI system. This basically meant creating a test binary that happened to link in the systems you were integrating with, and run tests against that. This is easier when everything is written in the same programming language, and for the cases where it wasn't, I was usually happy with "fakes". (https://testing.googleblog.com/2013/06/testing-on-toilet-fak...)

Other teams really loved the sandbox environment with live instances of everything. They would have some machinery outside the standard CI system to inject their code into this sandbox and run some tests, as well as machinery for keeping their sandbox up to date with production. (And adding test data, etc., etc., which all becomes very complex very quickly.)

Both methodologies have their downsides and upsides.

I generally prefer simplicity and speed; people should be able to run the tests on their workstation 100% of the time without having to set up any external resources. If you have an integration test binary that is built from the build system, this is possible. The downside is that config changes in production can break your system; since you are starting up your own instance of some other team's server, they could theoretically make some config change that breaks your integration. Even if you include their configuration in your in-memory version of their service, there was no guarantee that what is running in production is actually checked in yet. (Debugging in production, emergency rollback to an older prebuilt binary, etc.) These were rare and never caused me problems, however, and not having machinery to maintain a shadow environment meant it was easier to work on the code.

Having a sandbox environment was good because you could "check" (not test) big changes before putting them into production. You could try out your flag flip, database migration, mapreduce, or just load up the website in your browser and send your coworkers a link without affecting production data. And you could test your actual production binary in production-like conditions; as long as you sync'd production changes to your sandbox, your automated test probably ran against something that was very much like production. This let you check for more subtle things like performance regressions before deploying. (I worked on a system to do just that.)

The main problem I had with this method was that it was maintenance-intensive (big teams that used this had entire teams just to maintain the sandbox, and that begat sub teams that maintained the sandbox maintenance) and slow. Building and running another test during CI was relatively fast, but starting up a job in production and scaling it up was significantly slower. This meant that you needed a parallel set of tools to run some subset of this environment locally, and it was always painful. Not having your tests in the standard system meant that downstream dependencies wouldn't see test failures in your system when you made a change, so the "buildcop" would have to detect and fix that.

I found this to be too much overhead, but it is probably necessary when you are developing, say, a mobile application. You will have to write some sort of software to make it possible to try your in-progress code on your personal phone. You will probably want to be able to share links with coworkers. I generally like to push changes to production multiple times a day, and make sure that clients can handle a newer server and still work correctly. This way, as soon as a build passes tests, you can start giving it, say 0.1% of production traffic and keep an eye on the error rates, and promote that to production as quickly as possible. The biggest problem I've run into with this strategy is that 0.1% of Google's traffic is way more than enough for a good canary, but at other places I've worked... 0.1% of traffic might be one request over several days. In that case, you have to have staging and manually bug people to try it out. Sometimes I wonder if that kind of software is worth writing at all, to be perfectly honest. If you get one request a day, maybe just make it open a support ticket, and hire 2 support engineers instead of one software engineer. But I digress ;)


Tangentially:

I've seen several blog posts from Google about using fakes and 'hermetic servers' for testing. We use GCP for our product, and unfortunately, Google doesn't seem to care much about making this easy. For example, I think I saw only one or two languages for which the Google Storage client libraries provided "fakes" of a Google Storage server. For PubSub (and maybe one or two other services?) there is the PubSub Emulator, which is unfortunately in Java and isn't supported by any of the CLI tools.

For all their love of fakes and hermetic servers, it would be awesome if they provided them for all the GCP services.


Wow, thanks for the detailed reply. You mentioned a couple of implementations that I hadn't thought about. But I guess the short version would be, as so often: testing systems is hard, and there's no one-fits-all solution.


By virtue of having a queue of PRs that need to be tested and merged, you could pipeline this thing out pretty substantially.

The implication here being that a queue must be processed in-order, so you will ultimately have a perfect sequence of future commits to speculate against, and can incrementally build up each hypothetical future master state for a test build on one of any number of parallel build agents. As the queue depth grows, you would see higher and higher throughput.
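
A rough sketch of that pipelining, with illustrative names (create_merge_commit and start_build stand in for whatever git and CI hooks you actually have):

    # Each queued PR is tested against the hypothetical master that would exist
    # if everything ahead of it in the queue merges cleanly.
    def enqueue_speculative_builds(queue, master_sha, git, ci):
        speculative_base = master_sha
        builds = []
        for pr in queue:  # in order: PR n is stacked on the merges of PRs 1..n-1
            merge_sha = git.create_merge_commit(base=speculative_base, head=pr.head_sha)
            builds.append(ci.start_build(merge_sha))
            speculative_base = merge_sha
        return builds

    # If build k passes, master can fast-forward to its merge commit, landing
    # PRs 1..k in one step. If build k fails, PR k is kicked out and everything
    # behind it is re-stacked on build k-1's merge commit.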


> Trying to build CI on every branch before merging to master just won't work with the scale they are dealing with.

Google does it with 50 times the developer count.

> At 1000 developers, the rate of PRs coming in makes it impossible to determine what current master will be when the PR is ready to merge (i.e. when the branch has a green build).

True, it is impossible to catch all errors like this, but you can catch almost every error by building and testing against current master and then merging with master 20 minutes later when the build is done. I have seen maybe one build breakage a year being introduced due to this in projects I've worked on, so it isn't a big deal.


For even better accuracy you can use a tool that will run tests against speculative merge states. Zuul[1] is an open source project that supports it out of the box.

[1] https://zuul-ci.org/docs/zuul/user/gating.html


> building and testing it against current master and then merge it with the master 20 minutes later when the build is done.

And I'm pretty sure that is the way Google does it too. Test a commit against current master; if tests are green, commit. Then run tests against master again (and I think this stage might not run for every single commit) to see if anything broke on the rare occasions there was an actual conflict. If that run was red, which should be rare, then you can have the system do a bisect to find the offending commit, or just run all the ones that haven't been individually tested.
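
That post-merge bisect step can itself be automated; a minimal sketch, assuming a ./run_tests.sh entry point that exits non-zero on failure:

    import subprocess

    def find_breaking_commit(good_sha, bad_sha, test_cmd="./run_tests.sh"):
        # `git bisect run` does the binary search: it checks out candidate commits
        # and uses the test command's exit code to mark each one good or bad.
        subprocess.run(["git", "bisect", "start", bad_sha, good_sha], check=True)
        try:
            result = subprocess.run(["git", "bisect", "run", test_cmd],
                                    check=True, capture_output=True, text=True)
            return result.stdout  # contains the "<sha> is the first bad commit" line
        finally:
            subprocess.run(["git", "bisect", "reset"], check=True)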


You have no idea how Google solved it. Basically everyone with a Monorepo (except Google) implements it as a cargo cult best practice. Mindlessly copying Google without understanding how Google actually does it.


Yet it seems like large companies mostly prefer monorepos, so while it takes investment to have such a monorepo, it seems the benefits are worth the investment.


Google, Microsoft, Facebook and Twitter prefer monorepos but this is not indicative of most large orgs.

You'll notice that those listed have had to customize or create new VCSs to meet their needs.

https://news.ycombinator.com/item?id=17605371 https://news.ycombinator.com/item?id=11789182 https://medium.com/@maoberlehner/monorepos-in-the-wild-33c6e...


This is an appeal-to-accomplishment fallacy. Because large companies have a lot of money, whatever they do must be great. But this is false - they do what they do because they are large companies, not because it is a good idea.

At scale, managing complexity can require either a lot of coordination, or a lot of careful planning. Large companies (especially tech companies) don't do either well, so they pick architectures that remove choices, and iterate on them until they are workable. And they have the money and workforce to do it.


This is the problem external libraries were created to solve, in a time when it was a much harder problem.

Microservices are the same kind of solution, with the same gains and costs for this specific problem.


Shopify always runs CI on branches before merging to master. Everything this article describes is in addition to that, in order to deal with the problems the article talks about at "merge to master" time, like 2 merged PRs failing together, or a stale PR that passed on its branch but fails on master due to changes.

At this scale you need to be deploying constantly, otherwise deploys are hundreds of commits large and it's impossible to triage - which PR in the deploy broke something, is it even safe to roll back, etc. That is the primary reason to automate deploys and manage the deploy queue.


It smells like a capacity planning error.

What's the minimum residency time to reliably detect problems with my PR? Add deployment time, double to account for jitter caused by humans being humans (forgetful, lunch, meetings, etc), and there probably are not enough hours in the day for 1000 people to be deploying the same monolith.

To increase residency time you can deploy separate units (You can have multiple deployment units even in a monorepo), and those also reduce the surface area of merges.

Honestly, what are they doing with 1000 developers? Duplicated effort goes up considerably with a team and codebase of that size. If you forced me to hire that many people, I'd have a lot of them working on open source, trying to steward feature enhancements that help our process. Because otherwise they'd be running around writing proprietary versions of a bunch of shit that already exists in a better, more documented form.


And I'm not even a little surprised:

https://engineering.shopify.com/blogs/engineering/introducin....

Folks, when you hire enough devs, they feel empowered to rewrite the world. I have lived all sides of this phenomenon and rarely is it pretty.

Scaling is a concern that goes in both directions. Shopify has 1000 developers today. How screwed would they be if they suddenly had to drop to 600? Or even if there's a hiring freeze? What happens when the people who wrote these tools go work somewhere else?

When I do tool smithing work these days, it's always with an effort to provide the thinnest of shims around open source or commercial tools with healthy user communities, so that at the end of the day they have a larger pool of resources than what is in house. People move on. Money dries up. Mandates change.

"Being important" in a company is about how much you support new work, not how locked in people are to your old work. If you can't give your old work away then you're shackling yourself, both to your current responsibilities and to the company. I can't believe that I'm the only one who has ever stayed at a company out of guilt for how screwed they'd be if I left. But that quickly turns into resentment which is worse.

If you are important for new work, then you always get new challenges. You stay sharp and your resume looks good. If the company stops doing new work altogether, do you really want to stay there anyway? Plus you could always go back to one of your old projects.


Sorry if my comment was unclear. I consider the queue to be a “branch” as well. Many people use a “develop” branch instead of a queue in this instance. The queue appears designed to allow arbitrary selection rather than merging in order (though the new solution with CD seems generally in order)

Totally agree that CD is required with this many commits. It’s commonplace on teams with many fewer developers. Was surprised to see you folks roll your own workflows rather than using other systems.

Would also be interesting to see if you tag commits that go to master in instrumentation systems so you have visibility into production metrics and can correlate them with what code was running at the time.


Generally our metrics and exception reports are tagged with the sha and the deploy stage.


Good to hear, that’ll make change management less of a chore.

I think the main thing that was missing for me is the rationale behind building this system rather than building a workflow in one of the existing CI/CD tools. Was there a throughput bottleneck in existing tools? Was there something custom about your workflow that wasn’t supported elsewhere? I may be wrong, but the workflow you landed upon seems pretty common, so I’m curious why you needed to build and maintain a tool in house for this.


Hi, Author here!

Pull requests are our unit of work, and the queue was created to support all pull requests. We do have feature flags as a tool, but we let our developers make the judgment call on how their changes should be rolled out.


Is anyone "signing off" on the deploys or is it fully automatic? I can't really imagine it being manual 40 times per day, but just wanted to hear.

How do you handle the scenario where some developer pushes a send_me_all_the_credit_card_details() function to the code base which does something 'evil'? Do you rely on the reviewer "doing their work properly" to handle that?

I'm not saying formal "signing off" steps in processes handle it, but some companies do them for that reason.


We generally require 2 reviewers, and no sign-off on deploys. For PCI-compliant code things work a bit differently, but we try to follow this as closely as possible.


Interesting. It seems like you have a very flexible process of how to launch code which could contribute to issues with visibility and rollbacks.

I’m curious as to why you had a queue instead of a develop branch before moving to CD? Was this to allow arbitrary commits to be launched to production rather than getting them batched by time?


A `develop` branch has several disadvantages.

You will want to make your `develop` branch the default branch in git and on GitHub, to make sure pull requests automatically are targeted properly (not doing this would be a major UX pain). However, that means that when you `git clone` a repository you are not guaranteed to get a working version.

The `develop` branch can still be broken, which is a problem that needs to be addressed. While you can revert breaking changes (or force-push it back to a previously known good sha), and you can automate this process, the pull request is already marked as merged at this point. This means that developers have to open a new PR whenever that happens.

With the queue approach, pull requests remain open until we are sure they integrate properly. Also, we have the opportunity to use multiple branches to test different permutations of PRs, so we can still progress and merge some PRs even if the "happy path" that includes all PRs does not integrate properly.


Thanks, I was hoping for more of this in the blog post. Since tools are just an expression of process/policy, it’s more interesting to hear about the process and the why than it is about building “yet another CD tool”. Appreciate the thoughtful and thorough response.

The major pain point I agree with on develop is changing the defaults to merge to that rather than master. It’s a shame this is not easier to do in git/github.

I’m not sure I agree with “develop can still be broken” as an issue that supports a queue. Whether it’s a queue or develop, one should run CI on each change to validate that merging it to master will not cause issues. It’s possible for both to be broken via the same scenarios just as it’s possible for master to be broken. Since CI runs before the branch is merged to develop and upon merge, a failure would “stop the world” and prevent more code from being merged unless that code fixes the failure.

I guess I’m not fully understanding how a queue prevents this. Since you don’t have a full picture of the state of master until something is merged from the queue, how do the CI checks in the queue prevent things that branch-based CI checks wouldn’t prevent in a “develop” branch? With branches and develop, pull requests remain open until they can be assured they merge properly with develop as well.

For clarity, I’m not arguing that a develop branch is the way to go, I think CD is much better.

Maybe I’m missing something big here but using multiple branches is permissible in other setups also. You can cherry pick a bunch of commits to a branch and test permutations but only certain branches get deployed to staging and production based on rules.

I’m glad that Shopify has found tools and a process that works. Honestly, I’m just having trouble comparing and constraining this to the other tools that are out there. The article never speaks about other approaches and whether or not they were considered and why you decided to go with a queue. It’s not clear to me if this was a case of improving the existing queue system because it was already in place or whether or not the queue was specifically chosen again because it was better than other alternatives (and why).


> I guess I’m not fully understanding how a queue prevents this. Since you don’t have a full picture of the state of master until something is merged from the queue, how do the CI checks in the queue prevent things that branch-based CI checks wouldn’t prevent in a “develop” branch? With branches and develop, pull requests remain open until they can be assured they merge properly with develop as well.

The trick of the merge queue is that it splits the "merging a branch / pull request" in two steps:

1. Create a merge commit with master and your PR branch as ancestors.

2. Update the `master` ref to point to the merge commit.

Normally when you press the "Merge Pull Request" button, it will do those two things in one go. By splitting it up in two distinct steps, we can run CI between step 1 and 2, and only fast-forward master if CI is green.

This means that master only ever gets forwarded to green commits. And because the sha doesn't change during a fast-forward, all the CI statuses are retained. Only when we fast-forward will GitHub consider the pull request merged, so we don't have to "undo" pull request merges when they fail to integrate. If the merge commit fails to build successfully, we leave a comment on the PR saying that merging failed, and the PR stays open.

When we have multiple PRs in the queue, we can create merge commits on top of merge commits, and run CI on those merge commits. When one of these CI runs comes back green, we can fast-forward master to it, potentially merging multiple pull requests at once with this approach.


I think I see where you are coming from. Being as we use different tools, we wouldn’t allow a pull to be merged if it wasn’t up-to-date with master which is similar but a different approach. You’ll have to check at merge time because getting up-to-date could take a while and master could have changed. Jenkins does this and it can be done in other CI/CD systems with a bit of custom code.

I’d imagine at 1,000 developers and with a monolithic codebase, you’re looking to minimize test runs both from a time and cost of runners perspective.

You may also want to look into Zuul or Bazel if cost of test suite runs is a factor in coming to this solution.


> Being as we use different tools, we wouldn’t allow a pull to be merged if it wasn’t up-to-date with master which is similar but a different approach

That wouldn't work for us due to the amount of changes we need to ship. If you rebase your branch and wait for CI to come back green, chances are another PR will have merged in the mean time, which means your rebased branch is no longer up to date with master. You end up stuck in a rebase cycle.

For this reason, we have no choice but to batch PRs, which is what the merge queue tool does. Faster CI will reduce this problem, and we're working on that as well, but it won't fully solve it.


That’s understandable. I’d imagine at some point you’ll need to decouple the monolith a bit in order to work effectively as you scale. Best of luck with the challenge.


The queue is simply an automated "develop" branch.


From what I gathered in the article, that’s the case now but before the queue required manual merges.


No, even with v1, the merges weren't manual. A bot would merge for you, but directly into master.

Now the bot merges into a temporary branch, and master is fast-forwarded to it if CI validates it.


Interesting.

Would you say this is more of a decision based around the constraints of using GitHub or more of the ideal process for Shopify’s needs?

I’m curious because the article doesn’t mention the core reasons that you chose to write your own CD tool versus the other options that exist. The workflow you describe seems readily available in most tools. Perhaps the throughput was causing other options to break?


The ideal process for Shopify’s needs based on the constraints we have to work with (CI speed, deploy speed, rate of changes, etc).


We (Shopify) still run the full CI on each development branch as the article mentions:

> We check if Branch CI has passed and if the pull request has been approved by a reviewer before adding the pull request to the queue


> Trying to build CI on every branch before merging to master just won't work with the scale they are dealing with. At 1000 developers, the rate of PRs coming in makes it impossible to determine what current master will be when the PR is ready to merge (i.e. when the branch has a green build). It's also wasteful to build each branch against current master because what is "current" will not be when the branch is ready to merge.

I'm starting to think most CI problems are just people not looking at the problem the right way. Here is the problem re-worded:

- When a PR has a green light and someone hits 'merge', it locks anything else from merging to master, and you merge your PR. When it finishes merging and deploying, all the other waiting PRs have to rebuild themselves to see if they will merge with this new state of master. So 100s of PRs are rebuilding every time you merge one PR, and there's constant CI churn.

Here is why that problem exists:

- The system was designed for 1000 developers to all be writing to the same code base.

Here is how you solve that:

- Don't let 1000 developers all write to the same code base. Break the code down into discrete components that different small teams manage. The only bottleneck for that code base is that small team.

This small team is often called the two-pizza team, and their discrete components are often called microservices.


Google doesn't solve it your way: https://news.ycombinator.com/item?id=21586180


Yes, that's correct, Google invented its own proprietary distributed object store and distributed version control system and distributed Linux-only filesystem and distributed build-and-test-system to work with a single SDLC that its entire company must follow strictly to release anything, just so it could keep using a single repository.

What's your point?


Clearly, given those costs, Google really believes in the monorepo, and presumably they have tried to back it up with internal stats?

Although it's hard to get stats without a control group - maybe the control group could be acquisitions?


You have no idea how Google solved it. Basically everyone with a Monorepo (except Google) implements it as a cargo cult best practice. Mindlessly copying Google without understanding how Google actually does it.


Literally the definition of CI is to run on master or release branches a few times a day, not on every dev branch.

https://en.wikipedia.org/wiki/Continuous_integration


The definition is sourced from here[0], and says "Each check-in is then verified by an automated build, allowing teams to detect problems early.". It's not a hard rule to "not run on every dev branch".

"continuous integration" is often just "npm ci && npm run build", and sometimes "npm run test" (or similar for your language). For products that don't make any remote API calls (or when they use a faker service), most of this is done on the same machine and costs very little to do on every commit, making it easier to precisely define which commit broke something.

0: https://www.thoughtworks.com/continuous-integration


Rather than being pedantic about the definition, maybe you could share your experiences with why that’s superior to validating each branch? Being dogmatic about a definition rather than experimenting with works best in production at your business seems illogical.


This sounds quite similar to https://bors.tech. If the authors are here, did you see this, and can you compare and contrast it with what you built? https://graydon2.dreamwidth.org/1597.html also has a good overview of the problem and the original bors.


Yes, we have seen this before! The main difference is that throughput is extremely important for us, which we would not get with Bors. Also, compatibility of multiple simultaneously merging PRs is the case we are optimizing for, vs. compatibility with current master.


If you don't mind me asking, How long does a CI run take for you? How do you manage running CI with so many merges?

Our CI takes ~8 hours of machine/VM time, which is about 35 minutes of wall time with our current testing cluster (including non-distributed parts like building). We skip certain long tests during the day, so that brings wall time down to ~13 minutes. But we also test 2-3 branches with decent churn. So even if we're only doing post-merge CI based on the current state of master, we're still getting 5+ commits fairly often.

I want to get to a world where CI is run before and after each merge with master, on every commit (or push/pull), but it seems like it would take so much more resources and infrastructure than we currently have.


Very surprised they hadn't locked merge to master until recently.

I do like the emergency exception. The problem then becomes having a solid policy on when to use it.

Not really sure about the flakiness check (the 25% figure seems arbitrary), as you could have a test that fails only at a certain hour of the day, and you would be furiously rerunning the test suite for nothing because it would always fail, wasting resources. It would be much better to at least single out the failing tests and rerun only those, dropping the ones that pass, until max tries are reached or the queue is empty.
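
A sketch of that retry policy, assuming the test runner can be pointed at an explicit list of test IDs (run_tests here is a stand-in for it):

    # Rerun only the tests that failed, up to max_tries, instead of rerunning the
    # whole suite. run_tests takes a list of test IDs (or None for the full suite)
    # and returns the set of failing test IDs.
    def run_with_retries(run_tests, max_tries=3):
        failures = run_tests(None)                   # first pass: the full suite
        tries = 1
        while failures and tries < max_tries:
            failures = run_tests(sorted(failures))   # rerun only what failed last time
            tries += 1
        return failures                              # non-empty => treat the build as red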


Is it the problem that Gitlab merge trains aim to solve?

https://docs.gitlab.com/ee/ci/merge_request_pipelines/pipeli...


This seems like a complicated way of shoehorning the Linux kernel mailing list workflow into GitHub.


Anyone know what build times are for the Shopify monolith?


I'm working on the test-infra team at Shopify. Over the last couple of weeks we got CI down to around 22m at the 50th percentile and 33m at the 95th percentile. At the moment we get most of our speedup by parallelizing our build steps a lot, but we've hit a ceiling with that and are working on a project to run tests selectively.



Did I read it correctly that in the first iteration they merged "NOT TESTED" code into master?

Who would ever think that's a good idea?


The branches that are being merged are tested, also in the first version. However, different branches can conflict with each other, and break the master builds. This was happening often enough for us to want to prevent this.

The simple way would be to rebase your branch (or merge in master). However, with the amount of changes being merged, by the time the CI result comes in for your rebased branch, the tip of master has already changed, making the CI result obsolete. So we went for the queue approach instead.


Don't know if I'm responding to the author, but either way the end solution was to merge into a semi-master and run CI before merging to master.

So overall you came to the conclusion that what you explained, even if it can fail, is pretty much the best way to do it (essentially a custom rebase).

Overall, DevOps principles are proven to simply work; you just have to follow them.


That’s what I was trying to get at in a “more polite” way. Many people use a “develop” branch to stage changes before master so they can run through CI. I’m curious as to why they used an untested queue instead. It seems like they wanted to cherry-pick which commits made it to master rather than going in chronological order.

What’s odd is that they seem to characterize this as a tools issue rather than a process issue. There are plenty of CI/CD tools that allow for a similar or the same workflow as what they created. It’s also kinda scary that there’s not an emphasis on the overall SDLC and how the specific attributes of a branch or commit should/shouldn’t affect the process. You’d think at 1000 developers it’d be very important to define what is “launchable” as well. Haven’t worked with 1,000 on the same project but even with 20+, the standards and practices around development were always more important than the tooling. Tooling was just meant to represent workflows that were already defined.


It was tested. It's simply that between the time you pushed on your branch, and the time you merged, many other commits made it to the master branch, potentially breaking your branch.


That pretty much forces tooling to try to rebase and run CI either way? If it cannot rebase, it's up to the developer to fix their branch.

Master should be always in a state of release at any moment.

I just cannot imagine it was an unknown practice for some.


Except that with the amount of activity on the repo it's simply impossible.

Every new merge on master would require rebasing several hundred branches being worked on or awaiting review. Multiply this by the hundreds of commits merged to master every day and you end up with way too many CI jobs to run.


It's not impossible.. that's what they ended up with either way. And how do I know it's possible? Well, I have over 2000 developers working on the same codebase at my current company. I work as a DevOps engineer and this conversation is stupid..



