Show HN: Managed GitHub Actions Runners for AWS
117 points by jacobwg 9 months ago | 57 comments
Hey HN! I'm Jacob, one of the founders of Depot (https://depot.dev), a build service for Docker images, and I'm excited to show what we’ve been working on for the past few months: run GitHub Actions jobs in AWS, orchestrated by Depot!

Here's a video demo: https://www.youtube.com/watch?v=VX5Z-k1mGc8, and here’s our blog post: https://depot.dev/blog/depot-github-actions-runners.

While GitHub Actions is one of the most prevalent CI providers, its hosted runners are slow for a few reasons: GitHub uses underpowered CPUs, network throughput for cache and the internet at large is capped at 1 Gbps, and total cache storage is limited to 10 GB per repo. It is also rather expensive for runners with more than 2 CPUs, and larger runners frequently take a long time to start running jobs.

Depot-managed runners solve this! Rather than your CI jobs running on GitHub's slow compute, Depot routes those same jobs to fast EC2 instances. And not only is this faster, it’s also 1/2 the cost of GitHub Actions!

We do this by launching a dedicated instance for each job, registering that instance as a self-hosted Actions runner in your GitHub organization, then terminating the instance when the job is finished. Using AWS as the compute provider has a few advantages:

- CPUs (we use the m7a instance type) are typically 30%+ more performant than alternatives.

- Each instance has high-throughput networking of up to 12.5 Gbps, hosted in us-east-1, so interacting with artifacts, cache, container registries, or the internet at large is quick.

- Each instance has a public IPv4 address, so it does not share rate limits with anyone else.

We integrated the runners with the distributed cache system (backed by S3 and Ceph) that we use for Docker build cache, so jobs automatically save and restore cache from it, at speeds of up to 1 GB/s and without the default 10 GB per-repo limit.
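For reference, adopting the runners in an existing workflow is roughly just a change to the `runs-on` label (the label below is illustrative, the exact names are in our docs), and thanks to the cache integration above, a standard actions/cache step picks up the faster save/restore without further changes:

    jobs:
      build:
        runs-on: depot-ubuntu-22.04   # illustrative label; was: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/cache@v4    # save/restore served by the S3/Ceph-backed cache
            with:
              path: ~/.npm
              key: npm-${{ hashFiles('package-lock.json') }}
          - run: npm ci && npm test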

Building this was a fun challenge; some matrix workflows start 40+ jobs at once, requiring 40+ EC2 instances to launch simultaneously.

We’ve effectively gotten very good at starting EC2 instances: a "warm pool" system lets us prepare many EC2 instances to run a job, stop them, then resize and start them when an actual job request arrives, keeping job queue times around 5 seconds. We're using a homegrown orchestration system, as alternatives like autoscaling groups or Kubernetes weren't fast or secure enough.

There are three alternatives to our managed runners currently:

1. GitHub offers larger runners: these have more CPUs, but still have slow network and cache. Depot runners are also 1/2 the cost per minute of GitHub's runners.

2. You can self-host the Actions runner on your own compute: this requires ongoing maintenance, and it can be difficult to ensure that the runner image or container matches GitHub's.

3. There are other companies offering hosted GitHub Actions runners, though they frequently use cheaper compute hosting providers that are bottlenecked on network throughput or geography.

Any feedback is very welcome! You can sign up at https://depot.dev/sign-up for a free trial if you'd like to try it out on your own workflows. We aren't able to offer a trial without a signup gate, both because using it requires installing a GitHub app and because we're offering build compute, so we need some way to keep out the cryptominers :)




At Notion we run our GitHub Actions jobs on ECS, and use auto-scaling to add and remove hosts from the ECS cluster as demand fluctuates throughout the day. We also age out and terminate hosts, although they usually live for a few days to a week. I guess we had to pay some one-time setup costs around configuring the ECS cluster and fiddling with runner tags, but it seems to work pretty well. We have our own cache action, although it’s not as fancy as Depot's, just a tarball in S3.

Overall it’s a pretty simple Terraform setup plus a couple of Dockerfiles. And we get to run in the same region as the rest of our infra, which is close to most of our devs (us-west-2).

ECS might sound more complicated than “just use EC2”, but we don’t have to screw around with lambdas and the Terraform is pretty simple, much simpler than the philips-labs one. It’s about 1400 lines of Terraform across 2 files, since ECS has so much built in and integrates well with auto-scaling groups.
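Roughly, the "tarball in S3" pattern is just a pair of steps like these (a simplified sketch, not our actual action - the bucket name, cached directory, and build script are made up):

    - name: Restore cache
      run: |
        aws s3 cp "s3://our-ci-cache/${{ github.repository }}/cache.tar.gz" - \
          | tar -xzf - || echo "cache miss, continuing"
    - name: Build
      run: ./build.sh   # whatever your build script is
    - name: Save cache
      if: github.ref == 'refs/heads/main'
      run: |
        tar -czf - .cache | aws s3 cp - "s3://our-ci-cache/${{ github.repository }}/cache.tar.gz"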


Thanks for sharing! Do you have any blog post or YouTube video where you go deep into the details of the implementation? That would be a good one.


This sounds like a good setup - if you don't mind me asking, what do you use for your auto scaling metric?

Also curious how much y'all isolate it from your other infra. I've thought about this but I've been torn on whether I'd set up a separate vpc for it.


Can you post the TF file?


How will you compete if GitHub talks to the Azure folks (who have the benefit of Azure scale) and gets better compute and network treatment for runners? Or is the assumption that GH runners remain perpetually stunted as described (which is potentially a fair and legit assumption to make based on MS silos and enterprise inertia)?

To be clear, this is a genuine question, as compute (even when efficiently orchestrated and arbitraged) is a commodity. Your cache strategy is good (I'll be interested in testing to tease out where S3 is used and where Ceph is), but it's not a moat and is somewhat straightforward to replicate.

(again, questions from a place of curiosity, nothing more)


Yep, it's a good question! At the moment, my thoughts are roughly:

GitHub's incentives and design constraints are different from ours. GitHub needs to offer something that serves a very large user base and covers the widest possible range of workflows, and they've done this by offering basic ephemeral VMs on demand. CI and builds are also not GitHub's primary focus as an org.

We're trying to be the absolute fastest place to build software, with a deep focus on achieving maximum performance and reducing build time as much as possible (even to 0 with caching). Software builds today are often wildly inefficient, and I personally believe there's an opportunity to do for build compute what has been done for application compute over the last 10 years.

GitHub Actions workflows are more of an "input" for us then (similar to how container image builds have been), with the goal of adding more input types over time and applying the same core tech to all of them.


Good reply. It seems like you understand the market and where your product fits, which is half the battle.

Wishing you much success.


Thank you!


This is one of the best responses about market positioning that I've ever read. Best of luck to you!


This is a really clever product and I'd love to learn more - good luck.


I believe the solution is to decentralise, i.e. let the customer run the machines in their own AWS account (what I'm doing with RunsOn, link in bio if interested).

It is very hard for a single player to get favourable treatment from Azure / AWS / GCP to handle many thousands of jobs every day / hour.

I wish Depot all the luck, I think they've done good work wrt caching.


Corporate inertia might not be the only reason for excessive pricing.

They might simply charge for the convenience of everything working out of the box. Or even for users not being aware there are other options.


Not mentioned down-thread, but GitHub’s incentive is to sell you CI minutes, and slow runners make that like shooting fish in a barrel.


I recently set up AWS GitHub runners with this Terraform module. It works well and you don't have to pay anything extra on top of AWS.

https://github.com/philips-labs/terraform-aws-github-runner


I helped set this up at my workplace and can second that it works fairly well, but it definitely does have scale issues: we tend to exhaust our GH org's API rate limit and sometimes end up unable to scale up, and we also see containers prematurely terminated because the scale-down lambda doesn't always seem to see them in the GH API. It's also lacking a lot of tooling around building runner images and cache optimization, which we ended up building in-house.

Definitely linking OP to my team now.


We looked at this philips-labs solution originally in a previous org and eventually decided on Karpenter + Actions Runner Controller instead, configured with webhook (push-based) triggers. It’s really the best solution for scale, but it does take a while to implement and tune to get right. If you have dedicated infra people, I can recommend it. If you don’t, I would look to a more managed solution like OP’s offering.


Yeah this is a good option if you'd like something to deploy yourself! You can also build an AMI from GitHub's upstream image definition (https://github.com/actions/runner-images/tree/main/images/ub...) if you'd like it to match what's available in GitHub-hosted Actions.

With Depot, we're moving towards deeper performance optimizations and observability than vanilla GitHub runners - we've integrated the runners with a cache storage cluster, for instance, and we're working on deeper integration with the compute platform that we built for distributed container image builds, as well as expanding the types of builds we can process beyond Actions and Docker.

But different options will be better for different folks, and the `philips-labs` project is good at what it does.


One of the most interesting value adds for me is not any of the things mentioned by OP. I would like a managed hosted runner solution where I can have a BuildKit cache that I manage in the same data center, so I don't have to pay ingress/egress to that cache, but also don't have to manage my own runner infra. I have done the whole self-hosted Karpenter + Actions Runner Controller thing to achieve this, and it is a lot of work to set up and tune to get right.

The problem is really that GitHub’s caching offering is very limited for anything except the most basic of use cases, and they don’t offer a way to colo your own cache with them so that you aren’t paying cloud fees back and forth. You have to use their machines, their storage, and their protocol, which is only really viable if your definition of caching is literally just “upload files here” and “check if the uploaded built file already exists”.

Yes, I’m aware that buildkit offers “experimental” GHA caching support. But given how fat image layers are, it’s basically unusable for anything beyond a toy project that builds a couple of layers on top of an alpine image (as of the time of writing this post, GHA limits cache size to 10 GB per repo - fine if you’re building npm or pypi packages or whatever, but hilariously inadequate for buildkit layer caching).


What you are looking for is a local S3 cache, which buildx supports as a backend. Just make sure you have an S3 gateway endpoint connected to your VPC (and that your S3 bucket is in the same region as your runners!) and enjoy free bandwidth, unlimited cache size, and crazy fast network throughput.

https://runs-on.com/reference/caching/ https://runs-on.com/features/s3-cache-for-github-actions/
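With docker/build-push-action, the S3 backend is just two cache parameters - a rough sketch (bucket name and region are placeholders, and the runner's instance role needs access to the bucket):

    - uses: docker/setup-buildx-action@v3
    - uses: docker/build-push-action@v5
      with:
        push: false
        cache-from: type=s3,region=us-east-1,bucket=my-buildkit-cache
        cache-to: type=s3,region=us-east-1,bucket=my-buildkit-cache,mode=max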


Looks neat but is there a way to guarantee that it’s colo’ed with GHA hosted runners and that I won’t pay ingress/egress? If not then I don’t see how it’s much different than simply putting up my own bucket aside from saving me the logistics around permissions etc.

Edit: I see. This solution you linked doesn’t use GHA hosted runners at all - it’s intended to be a turnkey self hosted runner solution. In other words, a direct competitor to the service linked in OP. That wasn’t super clear from your comment but after reading your links it is more clear. I do really like the pricing here, if it actually works as advertised it’s a pretty great value prop for a lot of orgs.


Oh yes, it can't work with GHA hosted runners otherwise you'll pay egress fees. From your first post I was assuming you were starting from the point of view that you would be running your own runners already.

It does work as advertised, try it :) And yes RunsOn is a direct competitor to the 5 YCombinator-funded companies operating in this space (Ubicloud, Warpbuild, Buildjet, Blacksmith, Depot).


Hey, yep we have this today! Depot's original product is a fully-managed container build service (https://news.ycombinator.com/item?id=34898253) that caches all BuildKit layers to SSDs, so there's no cache-to/cache-from and cache doesn't need to transfer over the network at all.

Our original version of that system used vanilla BuildKit + EBS volumes + orchestration; nowadays we've replaced EBS with a distributed Ceph storage cluster for significantly faster IOPS and throughput, and we've modified BuildKit to be better suited for high-performance distributed builds.

Both the container build service and the Actions runners are in the same AWS VPC, so they get good network performance between the two and don't need to egress over the internet.


Other Depot founder here. This is a really great point.

You've hit on all the main points regarding Docker image cache in GHA. Persisting the massive layer cache over the network is incredibly slow and has weird limits (like 10 GB per repo). With our first service, accelerated container image builds, we persist the layer cache to Ceph volumes and orchestrate your cache so it's immediately available across builds. Our GHA runners run right next to that same infra, so you don't have ingress/egress. All of that can be hosted in your own AWS account (we're also open to running it in any general compute environment for folks who need it).


The depot.dev service has excellent caching for docker. It's almost like building locally.

Though the site (depot.dev) focuses on that aspect, this post doesn't.

@jacobwg - Do the runners in AWS get the same docker caching performance as depot.dev hosted runners?


At the moment, I think you'd want to use both products together, i.e. using `depot build` in place of `docker build` to move the container build portion to a Depot container builder.

I'd like to have a more automatic integration at some point - the challenge is that a lot of BuildKit's architecture performs best when many different build requests all arrive at a single build host, which is then able to efficiently deduplicate and cache work across all those build requests. So you really want the many different Actions jobs all communicating with the same BuildKit host.

We have some ideas for reducing the amount of change to Actions workflows to adopt ^ - longer term we're also working on our own build engine, to free those workloads from being confined to single hosts (be that single CI runners or single container builders).


Half the price of GitHub is not great right now; this space is heating up! Ubicloud is 10x cheaper, and https://runs-on.com is in the same ballpark by using spot instances. (Currently switching to RunsOn.)


Yeah, I think our goal is to be the fastest at building software, not necessarily the cheapest. Part of that involves AWS, to have access to more powerful and elastic infrastructure, but that comes at a premium.

But besides just compute, I think the bigger long-term unlock for build performance is a new distributed compute engine, to free build workloads from single machines. We've started building this for our container build product, and plan to integrate Actions jobs as an input as well, starting with the cache integration we have today.


> - Each instance has high-throughput networking of up to 12.5 Gbps, hosted in us-east-1, so interacting with artifacts, cache, container registries, or the internet at large is quick.

Do you actually get the promised 12.5 Gbps? I've been doing some experiments and it's really hard to get over 2.5 Gbps upstream from AWS EC2, even when using large 64 vCPU machines. Intra-AWS (e.g. VPC) traffic is another thing, and that seems to be OK.


We do get the promised throughput, but it depends on the destination as you've discovered. AWS actually has some docs on this[0]:

- For instances with >= 32 vCPUs, traffic to an internet gateway can use 50% of the throughput

- For instances with < 32 vCPUs, traffic to an internet gateway can use 5 Gbps

- Traffic inside the VPC can use the full throughput

So for us, that means traffic outbound to the public internet can use up to 5 Gbps, but for things like our distributed cache or pulling Docker images from our container builders, we can get the full 12.5 Gbps.

[0] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...


> > - Each instance has high-throughput networking of up to 12.5 Gbps, hosted in us-east-1

with that pull quote, I thought you were going to point out their use of us-fail-1. I struggle to think of a service that I care so little about its availability that I'd host it there, but CI/CD for sure wouldn't be one


Hey, @jacobwg, this looks great.

I couldn't find it anywhere on the page, but do you support Graviton3 (i.e. m7g instances) for GHA Runners? If the answer is no, are there any plans to support it in the future?

> start them when an actual job request arrives, to keep job queue times around 5 seconds

Did you have to fine-tune Ubuntu kernel/systemd boot to reach such fast startup times?


We do support Graviton! I actually _just_ enabled them today, which we're calling "beta" for the moment: https://depot.dev/docs/github-actions/overview#depot-support....

The challenge with Arm is actually just that GitHub doesn't have a runner image defined for Arm. For the Intel runners, we build our image directly from GitHub's source[0], and we're doing the same for the Arm runners by patching those same Packer scripts for arm64. It also looks like some popular actions, like `actions/setup-*`, don't always have arm support either.

So the disclaimers for launching Depot `-arm` instances at the moment are basically just that (1) we have no idea if our image is compatible with your workflows, and (2) those instances take a bit longer to start.
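If you want to try arm alongside Intel, a matrix split looks roughly like this (the label names here are illustrative - check the docs for the exact ones):

    jobs:
      test:
        strategy:
          matrix:
            runner: [depot-ubuntu-22.04, depot-ubuntu-22.04-arm]
        runs-on: ${{ matrix.runner }}
        steps:
          - uses: actions/checkout@v4
          - run: uname -m && make test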

On achieving fast startup times, it's a challenge. :) The main slowdown that prevents a <5s kernel boot is actually EBS lazy-loading the AMI from S3 on launch.

To address that at the moment, we do keep a pool of instances that boot once, load their volume contents, then shut down until they're needed for a job. That works, at the cost of extra complexity and extra money - we're experimenting with some more exotic solutions now, though, like netbooting the AMI. That'll be a nice blog post someday, I think.

[0] https://github.com/actions/runner-images/tree/main/images/ub...


Not affiliated, just guessing.

This 5 seconds might be the warm start, not cold - i.e. they likely have a pool of autoscaled, multi-tenant workers.


Yeah 5 seconds is from stopped to running, but to get that speed we need to pre-initialize the root EBS volumes so that they're not streaming their contents from S3 during boot. The GitHub Actions runner image is 50GB in size _just_ from preinstalled software!


A few months ago I had some issues with build performance. We're on the free plan with GitHub, so using custom runners is not an issue. But I found a nice workaround:

- Create a virtual machine with everything you need in Google Cloud (would work for AWS as well). Pick something nice and fast. Suspend it.

- In your GitHub Action, resume the VM, SSH into it to run your build script, and suspend it afterwards.

Super easy to implement and easy to script using gcloud commands. It adds about 30 seconds to the build for starting the VM. On the machine, we simply pull from git and check out the relevant branch. It doesn't work for concurrent builds, but it's a nice low-tech solution. And you only pay for the time the machine is up and running, which is a few minutes per day. So you can get away with using VMs that have lots of CPU and memory.
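The GitHub Actions side is basically three steps - a rough sketch (the instance name, zone, and build script are made up, and it assumes the workflow has gcloud auth configured):

    - name: Resume build VM
      run: gcloud compute instances resume build-box --zone=us-central1-a
    - name: Run build on the VM
      run: |
        gcloud compute ssh build-box --zone=us-central1-a \
          --command="cd repo && git fetch && git checkout $GITHUB_SHA && ./build.sh"
    - name: Suspend build VM
      if: always()   # suspend even if the build fails
      run: gcloud compute instances suspend build-box --zone=us-central1-a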


Interesting - it makes a lot of sense to me as far as pricing goes, too. However, I feel the video demonstration could be greatly improved in terms of explanation and enthusiasm. It's super cool though, and presentations/demos should showcase the full potential!


I used this to set up my runners on a dedicated server: https://github.com/vbem/multi-runners


A cool idea, but I'm not sure about the business case. I wrote a quick-and-dirty bash script which automates the process of adding 2x GitHub runners on instances (2 CPU cores and 4 GB memory each). Simply scale out horizontally. Since the instances are persistent, you get Docker image caching out of the box, unlike hosted runners on GitHub. Also, arm64 is fully supported.


For the AWS CDK folks, I've been very happy with this library: https://github.com/CloudSnorkel/cdk-github-runners. Love that I can use spot pricing and the c7g instances for CI/CD.


Congrats on shipping! We built something similar internally. Tweaking it for the right cost/availability/speed was interesting, but we now have it working to where workers are generally spun up from 0 faster than GitHub's own are.


Yeah, GitHub's runners, especially the ones with >2 CPUs, have surprisingly long start times!


Hey Jacob, awesome suggestion!

Are you building your base image from the GitHub runner-images repo?

Do you have any appetite for building self hosted EC2 agents for Azure DevOps and GitHub?

I'm happy to help if you are, I'm working on something similar myself for my employer.


Hey, we are building our base image from the runner-images repo! I'll send you an email!


Your website is surprisingly good. Often, show hn sites are pretty basic and a little off the mark, but this was clear. Pricing seems simple too. Great job. Will give it a try.


How do you ensure privacy/isolation between users if you have a pool of ready VMs that you re-use?


We don't re-use the VMs - a VM's lifecycle is basically:

1. Launch, prepare basic software, shut down

2. A GitHub job request arrives at Depot

3. The job is assigned to the stopped VM, which is then started

4. The job runs on the VM and completes

5. The VM is terminated

So the pool exists to speed up the EC2 instance launch time, but the VMs themselves are both single-tenant and single-use.


My question is thus more about the on-disk data: you mention the VM being terminated - is that data wiped too, and does the new VM start on a brand new disk?


Correct yeah, each run starts on a brand new VM with a brand new disk - since these are EC2 instances with EBS volumes for their root disk, the whole instance and the EBS volume are deleted after the job finishes and are not reused.


Thanks for the reply


Can I use my own AWS account?


You can! The default is that we launch the runners on our AWS account, but we do also have a bring-your-own-cloud deployment option.

We have some docs on this for our container builder product - still need to write the docs for Actions runners too, though they use the same underlying system: https://depot.dev/docs/self-hosted/overview.


TL;DR: managed runners by construction constitute a major ongoing infosec liability.

A managed runner means not only entrusting a third party with your code but also typically providing it with enough data/network connectivity to make testing/validation feasible as a part of the build process. While this is doable per se, it introduces multiple major failure modes outside of data owners' control.

Failure scenario (hypothetical): you hydrate your test DB using live data; you store it in a dedicated secure S3 bucket, which you make accessible for the build process. Now the managed runner organization gets hacked because making resilient infra is hard, and the attackers intercept the S3 credentials used by your build process. Boom! Your live data is now at the mercy of the attackers.


It’s not wisdom to point out that using 3P software constitutes a threat vector. Personally, except in rare cases of unusual competence or unusual sensitivity, I believe that in-house CI will be more vulnerable than managed.


How does it compare to BuildJet?


We're both offering managed GitHub Actions runners - some of the differences include:

- Depot runners are hosted in AWS us-east-1, which has implications for network speed, cache speed, access to internet services, etc. (BuildJet is hosted in Europe - maybe Hetzner?)

- Also thanks to AWS: each runner has a dedicated public IP address, so you're not sharing any third-party rate limits (e.g. Docker Hub) with other users

- We have an option to deploy the runners in your own AWS account or VPC-peer with your VPC

- We're integrating Actions runners with the acceleration tech we've built for container builds, starting with distributed caching



GitHub has a colo presence in Frankfurt so pulling repos from Europe is quick.



