I'm using gitlab-ci with its docker executor, and overall I'm very happy with it.
I use it on some rather beefy machines, but most of the CI time is not spent compiling; it is spent on setting up the environment.
Are there any tips/tricks to speed up this startup time? I know stuff like ensuring that artifacts are not passed in if not needed can help a lot, but it seems that most of the execution time is simply spent waiting for docker to spin up a container.
The short answer is "do as little as possible". What this means in practice is breaking down every step of CI, figuring out the dependencies for that step, and then ordering the graph of dependencies such that you start as much as possible as early as possible. This process also usually shows you where things are slow and what the critical path is.
Unfortunately, doing this in most CI services is actually quite difficult. It usually means a complex graph of execution, complex cache usage, and being careful to not re-generate artifacts you don't need to.
In my experience, building this, at a level of reliability necessary for a team of more than a few devs, is hard. Jenkins can do it reliably, but doing it fast is hard because the caching primitives are poor. Circle and Gitlab can do it quickly, but the execution dependency primitives aren't great and the caches can be unreliable. Circle also has _terrible_ network speeds, so doing too much caching slows down builds. GitHub Actions is pretty good for all of this, but it's still a ton of work.
The best answer is to use a build system or CI system that is modelled in a better way. Things like Bazel essentially manage this graph for you in a very smart way, but they only really work when you have a CI system designed to run Bazel jobs, and there aren't many of these that I've seen. It's a huge paradigm shift, and requires quite a lot of dev work to make happen.
It's so surprising to me that this is such a poorly supported paradigm in commodity CI systems. Caching artifacts and identifying slow stages is like... super important for scaling CI for large enough orgs. We need better tools!
While it sounds cynical... it doesn't strike me as 'wrong' entirely. It's a non-trivial problem, but until some service provides great tools to handle this, and makes the experience 10x better (to encourage more use/experimentation/etc), everyone will keep offering the same experience all around. If a service could automatically cut build times by, say, 70%, that's a lot of revenue they may lose from charging for the build time. They could raise the price, or hope that enough new people get onboard to make up the loss... ?
CircleCI sort of do! They have something called Docker Layer Caching, which basically puts all the Docker layers from your previous build on the execution machine.
The problem is that it's a) very slow to download those layers from their cache storage, and b) very expensive. It works out to costing ~20 minutes of build time.
The problem I had with GitLab was that the mechanisms for controlling dependencies between stages were fairly basic. They only added them in ~2020 I think, and they weren't well documented.
Additionally, there's no cache guarantees between jobs within one execution. This means that you can't reliably cache an artifact in one job, and then share it with multiple downstream jobs. It mostly works, but it's hard to debug when it doesn't, especially if the cache artifact isn't versioned.
GitLab is "fine", and has some nice usability features for basic pipelines, but it's definitely not doing anything better than the other major providers with respect to these problems.
Dependency controls have improved quite a bit. They went through a couple of variations of this and the current solution is nice.
I haven't ever run into an issue with artifact hand off to this point though. Maybe it's one of the more rare concerns, but it's not something I've experienced (fortunately). I imagine it would be a concern to debug though.
I'm not sure how to speed up the spin-up-a-container time (at least not without more details), but I have two suggestions that may help mitigate it. Based on your wording ("waiting for docker to spin up a container"), the second one may not be relevant.
## 1. Do more in the job's script
If you have multiple jobs that use (or could use) the same image, perhaps those jobs can be combined. It's definitely a tradeoff, and it depends on what you want from your pipeline. For example, normally you may have separate `build` and `test` jobs, but if they take, say (30s init + 5s work) + (30s init + 10s work), then combining them into a single job taking (30s init + 15s work) _might_ be an acceptable trade-off. (These numbers are small enough that it probably isn't, but you get the idea.)
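A hedged sketch of what that combined job could look like (the image and npm commands are stand-ins for whatever your jobs already share):

```yaml
build_and_test:
  image: node:20            # placeholder image; use whatever your jobs already share
  stage: test
  script:
    - npm ci                # environment setup paid once
    - npm run build         # the old `build` job's work
    - npm test              # the old `test` job's work
  artifacts:
    paths:
      - dist/               # keep the build output if later stages still need it
```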
## 2. Pre-build the job's image
If your job's script uses an off-the-shelf image, and has a lot of setup, consider building an image that already has that done, and using that as your job's image instead. For example, you might be using a `node` image, but your build requires pulling translations from a file in an S3 bucket, and so you need to install the AWS CLI to grab the translation file. Rather than including the installation of the AWS CLI in the script, build it into the image ahead of time.
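A minimal sketch of that idea, assuming the pre-built image lives in the project's own GitLab registry, is described by a hypothetical `ci/Dockerfile`, and the runner allows docker-in-docker:

```yaml
build_ci_image:
  image: docker:24
  services:
    - docker:24-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"   # standard docker-in-docker setup
  rules:
    - changes:
        - ci/Dockerfile            # only rebuild when the CI image definition changes
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE/node-aws:latest" -f ci/Dockerfile ci/
    - docker push "$CI_REGISTRY_IMAGE/node-aws:latest"

pull_translations:
  image: $CI_REGISTRY_IMAGE/node-aws:latest    # AWS CLI is already baked in
  script:
    - aws s3 cp "s3://example-bucket/translations.json" translations.json
```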
> If you have multiple jobs that use (or could use) the same image, perhaps those jobs can be combined. It's definitely a tradeoff, and it depends on what you want from your pipeline. For example, normally you may have separate `build` and `test` jobs, but if they take, say (30s init + 5s work) + (30s init + 10s work), then combining them into a single job taking (30s init + 15s work) _might_ be an acceptable trade-off. (These numbers are small enough that it probably isn't, but you get the idea.)
This is a good idea and something I will seriously consider
I'm already doing #2, but I'm glad to see others come to the same conclusion as me. :D
Make sure your docker build is being cached properly, and break the stuff that rarely needs to re-run into its own steps, then move those to the top of the Dockerfile.
Crucially: Make sure that the large layers say they are "cached" when you rebuild the container. Docker goes out of its way to make this difficult in CI environments. The fact that it works on your laptop doesn't mean that it will be able to cache the big layers in CI.
Once you've done that, make sure that the CI machines are actually pulling the big layers from their local docker cache.
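One pattern that tends to make this work on ephemeral runners is pulling the previous image and handing it to `--cache-from` with BuildKit's inline cache; a rough sketch, assuming the image is kept in the project's GitLab registry and the runner allows docker-in-docker:

```yaml
build_image:
  image: docker:24
  services:
    - docker:24-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
    DOCKER_BUILDKIT: "1"
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker pull "$CI_REGISTRY_IMAGE:latest" || true    # fine if it doesn't exist yet
    - >
      docker build
      --build-arg BUILDKIT_INLINE_CACHE=1
      --cache-from "$CI_REGISTRY_IMAGE:latest"
      -t "$CI_REGISTRY_IMAGE:latest" .
    - docker push "$CI_REGISTRY_IMAGE:latest"
    # on the next run, the big layers should show up as CACHED in the build output
```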
30-90 seconds to pull docker images for each run of a golang project's CI environment is too high. You might look into using "go mod vendor" to download the dependencies early in the docker build, then using a symlink and "--mod=vendor" to tell the tests to use an out-of-tree vendor directory. (I haven't tested this; presumably go will follow symlinks...)
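A variation on that theme, expressed as a GitLab job that caches the Go module cache keyed on go.sum instead of vendoring inside the Docker build (the job name and Go version are placeholders):

```yaml
go_test:
  image: golang:1.22                           # placeholder Go version
  variables:
    GOMODCACHE: "$CI_PROJECT_DIR/.gomodcache"  # keep the module cache inside the workspace so it can be cached
  cache:
    key:
      files:
        - go.sum                               # new deps -> new cache entry
    paths:
      - .gomodcache/
  script:
    - go mod download                          # hits the cache when go.sum is unchanged
    - go vet ./...
    - go test ./...
```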
My usual strategy is to ensure that the lengthy parts are executed only once. For example, one of the lengthy parts for me too is environment setup, so what I did was put as much as possible into the docker image I build, and then I start tests from an image that is mostly ready to run. Something similar can of course be done at runtime: if starting the software you test takes a long time, you could set it up only once and run multiple tests without tearing down the setup. This has the disadvantage of a possibly tainted environment, and there is a risk of making the tests depend on previous state. On the other hand, it could also help discover problems that are hidden by always running tests on a clean slate, so it's a tradeoff. I should note that I mostly do integration testing, so the long parts are probably in different places than for unit testing.
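A very rough sketch of the "set it up once, run many tests against it" shape, assuming the job image has the docker compose plugin and using made-up service names and test entrypoints:

```yaml
integration_tests:
  image: docker:24
  services:
    - docker:24-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  script:
    - docker compose up -d --wait                          # the lengthy setup happens once
    - docker compose exec -T app ./run-tests.sh suite-a    # hypothetical test entrypoints
    - docker compose exec -T app ./run-tests.sh suite-b
    - docker compose exec -T app ./run-tests.sh suite-c
  after_script:
    - docker compose down -v                               # tear down even if a suite fails
```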
I had similar problems with CircleCI and its docker executor. We recently switched to GitHub Actions and the following led to huge improvements:
- Much faster network speeds.
- We no longer run on the docker executor. Instead we run on Ubuntu runners, which boot in a second or two pretty consistently.
- The bulk of our test suite was able to be pulled out of docker entirely (a lot of Jest and PHPUnit tests).
- We have a bigger suite of E2E PHPUnit tests that we spin up a whole docker compose stack for. These are slower but still manageable.
Parallelism is key in all of this too. Our backend test suite has a full execution time of something like 250 minutes, but we just split it over a bunch of small workers and the whole thing completes in about 8 minutes.
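For anyone curious what that splitting looks like on GitHub Actions, a hedged sketch (the shard-runner script is a placeholder; Jest's `--shard` flag or PHPUnit groups are two ways to do the actual slicing):

```yaml
name: tests
on: push
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4, 5, 6, 7, 8]   # eight small workers instead of one big one
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/run-shard.sh "${{ matrix.shard }}" 8   # placeholder: run only this shard's slice of the suite
```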
Pulling snapshots helps, particularly with slowdowns over time. Pulling deps is a problem that deserves its own initiatives.
For me the controlling factor with build time and to a lesser extent production performance is to divorce visibility from vigilance. You can’t watch things 24/7 waiting to pounce on any little size or time regressions. You need to be able to audit periodically and narrow the problem to a commit or at least an hour in a day when the problem happened. Otherwise nobody will be bothered to look and it’s just a tragedy of the commons.
Graphs work well. Build time, test count, slow test count, artifact sizes, and so on.
> Pulling snapshots helps, particularly with slowdowns over time. Pulling deps is a problem that deserves its own initiatives.
I just had some success running Android builds on a self-hosted GitHub runner. One of the big setup stages was having sdkmanager pull down large dependencies (SDK, emulator images, etc.) on startup.
Forcing sdkmanager into HTTP-only mode and pointing it at a properly-configured squid took a large percentage off the build time.
Similar story for the gradle build, where running a remote gradle cache node locally to the job means gradle steps get automatically cached without any magic CI pipeline steps.
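The sdkmanager half of that might look roughly like the following on a self-hosted GitHub runner; the proxy host, port, runner labels, and package names are all placeholders:

```yaml
name: android
on: push
jobs:
  build:
    runs-on: [self-hosted, android]          # placeholder runner labels
    steps:
      - uses: actions/checkout@v4
      - name: Pull SDK packages through the caching proxy
        run: |
          sdkmanager --no_https --proxy=http \
            --proxy_host=squid.internal --proxy_port=3128 \
            "platforms;android-34" "build-tools;34.0.0"
```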
Dep caches are great until they aren't. We had to turn them off because of a weird interaction between two Artifactory instances (an M&A leftover) that was giving us inconsistent behavior.
Being able to pull deltas sure is fast, but it also violates some of the principles of CI. Artifactory or similar tools can split the difference. As long as nobody is doing something dumb, that is.
This is what I'm working on next week. The majority of the time is spent building the first n layers of our Dockerfiles (which aren't cached in our test/deploy pipeline).
I'll be baking some images with the dependencies included, so the only stuff left in the updated Dockerfile will be pulling the pre-baked image from our registry plus the commands to build and run our app code.
We do the pre-baked dependency images too, and it's definitely workable, but I feel like it's a lot of overhead maintaining those: you have to build, distribute, and lifecycle them, and it's extra jobs to monitor. Plus you now have an implicit dependency between jobs that adds complication to black-start scenarios. I wish tools like GitLab CI had more automated workflows for managing those intermediate containers, e.g.:
- Here's a setup stage, the resulting state of which is to be saved as a container image and used as the starting point for any follow-on stages that declare `image: <project/job/whatever>`
- Various circumstances should trigger me to be rebuilt: weekly on Saturday night, whenever repo X has a new tag created, whenever I'm manually run, whenever a special parallel "check" stage takes more than X minutes to complete, etc.
Ultimately, I think the necessity for all this kind of thing really just exposes how weak the layered container image model is: something like Nixery that can deliver ad-hoc environments in a truly composable way is ultimately a much better fit for this type of use-case, but it has its own issues with maturity.
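For what it's worth, the closest approximation to that wish-list with today's GitLab primitives seems to be a scheduled or manually triggered pipeline that refreshes the setup image; a sketch (the Dockerfile path and image name are made up), covering the "weekly plus on-demand" cases but not the cross-repo triggers:

```yaml
rebuild_setup_image:
  image: docker:24
  services:
    - docker:24-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # e.g. a weekly Saturday-night schedule
    - if: '$CI_PIPELINE_SOURCE == "web"'        # or whenever someone runs it by hand
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE/setup:latest" -f ci/setup.Dockerfile .
    - docker push "$CI_REGISTRY_IMAGE/setup:latest"
```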
Docker images in CI are typically just that: a tmpfs with a chroot and some network isolation. If you have it working once, you're pretty much guaranteed it will work again.
Doing this on bare metal with a script to clean the FS, ensure correct dependencies, and maybe isolate the network (for safe unit tests) means you're just reimplementing a non-trivial portion of docker or other container tools. Maybe that's worth it, but without justification it just smells like risky NIH to me.
Docker on Debian 11 bare metal with gitlab-ci installed the "blessed" way (by adding GitLab's apt repos).
No optimisation to the base OS other than mounting /var/lib/docker on a RAID0 array with noatime on the volume, and CPU mitigations disabled on the host.
Compilation is mostly go binaries (with the normal stuff like go vet/go test).
Rarely it will do other things like commit-lint (javascript) or KICS/SNYK scanning.
The machines themselves are dual EPYC 7313s w/ 256G DDR4.
Where do you keep your bare metal machines, if I may ask? I wanted to do a similar setup a while ago (building/testing on Hetzner bare metal, deployments and the rest on AWS) but due to Amazon's pricing policy the cost of traffic would have been enormous.
Not the person you asked, but we have something similar to what you described: our GitLab is self-hosted on Hetzner cloud and the build machine is a beefy bare-metal machine in the same datacenter (plus an additional Mac in our office just for iOS). Built images are stored in the GitLab registry and deployed from there.
We deploy to AWS (among others) and had no issues regarding traffic price since it's ingress into AWS.