Show HN: Managed GitHub Actions Runners for AWS

jitl · 2024-04-05T01:30:27 1712280627

At Notion we run our GitHub Actions jobs on ECS, and use auto-scaling to add and remove hosts from the ECS cluster as demand fluctuates throughout the day. We also age out and terminate hosts although they usually live for a few days to a week. I guess we had to pay some one time setup costs around configuring the ECS cluster and fiddling runner tags, but it seems to work pretty well. We have our own cache action although it’s not as fancy as depot’s, just a tarball in s3.

Overall it’s pretty simple terraform setup plus a couple dockerfiles. And we get to run in the same region as the rest of our infra that’s close to most of our devs (us-west-2).

ECS might sound more complicated than “just use ec2” but we don’t have to screw around with lambdas and the terraform is pretty simple, much simpler then the Philips-labs one. It’s about 1400 lines of Terraform across 2 files since ECS has so much stuff built in and integrates with auto scale groups well.

vicio · 2024-04-05T08:22:30 1712305350

Thanks for sharing - Do you have any blogpost or YouTube video where you go deep into the details of the implementation? that would be a good one

liamfd · 2024-04-05T06:09:01 1712297341

This sounds like a good setup - if you don't mind me asking, what do you use for your auto scaling metric?

Also curious how much y'all isolate it from your other infra. I've thought about this but I've been torn on whether I'd set up a separate vpc for it.

throwaway384638 · 2024-04-05T09:58:55 1712311135

Can you post the TF file?

toomuchtodo · 2024-04-04T20:15:07 1712261707

How will you compete if GitHub talks to the Azure folks (who have the benefit of Azure scale) and gets better compute and network treatment for runners? Or is the assumption GH running remains perpetually stunted as described (which is potentially a fair and legit assumption to make based on MS silos and enterprise inertia)?

To be clear, this is a genuine question, as compute (even when efficiently orchestrated and arbitraged) is a commodity. Your cache strategy is good (will be interested in testing to tease out where is S3 and where is Ceph), but not a moat and somewhat straightforward to replicate.

(again, questions from a place of curiosity, nothing more)

jacobwg · 2024-04-04T20:34:54 1712262894

Yep, it's a good question! At the moment, my thoughts are roughly:

GitHub's incentives and design constraints are different than ours. GitHub needs to offer something that covers a very large user-base, to cover the widest possible number of workflows, and they've done this by offering basic ephemeral VMs on-demand. CI and builds are also not GitHub's primary focus as an org.

We're trying to be the absolute fastest place to build software, with a deep focus on achieving maximum performance and reducing build time as much as possible (even to 0 with caching). Software builds today are often wildly inefficient, and I personally believe there's an opportunity to do for build compute what has been done for application compute over the last 10 years.

GitHub Actions workflows are more of an "input" for us then (similar to how container image builds have been), with the goal of adding more input types over time and applying the same core tech to all of them.

toomuchtodo · 2024-04-04T20:37:00 1712263020

Good reply. It seems like you understand the market and where your product fits, which is half the battle.

Wishing you much success.

jacobwg · 2024-04-04T21:19:41 1712265581

Thank you!

mason55 · 2024-04-05T11:25:30 1712316330

This is one of the best responses about market positioning that I've ever read. Best of luck to you!

JohnMakin · 2024-04-05T00:14:13 1712276053

This is a really clever product and I'd love to learn more - good luck.

crohr · 2024-04-04T20:54:26 1712264066

I believe the solution is to decentralise, i.e. let the customer run the machines in their own AWS account (what I'm doing with RunsOn, link in bio if interested).

It is very hard for a single player to get favourable treatment from Azure / AWS / GCP to handle many thousands of jobs every day / hour.

I wish Depot all the luck, I think they've done good work wrt caching.

playingalong · 2024-04-04T20:35:44 1712262944

Corporate inertia might not be the only reason for excessive pricing.

They might simply charge for everything working out of the box convenience. Or even for not being aware there are other options.

coredog64 · 2024-04-05T02:57:34 1712285854

Not mentioned down-thread, but GitHub’s incentive is to sell you CI minutes, and slow runners are shooting fish in a barrel.

werewrsdf · 2024-04-04T20:30:44 1712262644

I recently set up AWS Github runners with this terraform. It works well and you don't have to pay any extra in addition to AWS.

https://github.com/philips-labs/terraform-aws-github-runner

striking · 2024-04-04T21:57:28 1712267848

I helped set this up at my workplace and can second that it works fairly well, but it definitely does have scale issues (we tend to exhaust our GH org's API ratelimit and end up being unable to scale up sometimes, as well as seeing containers be prematurely terminated because the scale down lambda doesn't seem to always see them in the GH API) and it's definitely lacking a lot of tooling around building runner images and caching optimization that we ended up building in-house.

Definitely linking OP to my team now.

SOLAR_FIELDS · 2024-04-05T06:06:07 1712297167

We looked at this Phillips solution originally in a previous org and eventually decided on Karpenter + Actions Runner Controller instead, configured with webhook aka push based triggers. It’s really the best solution for scale but it does take awhile to implement and tune to get right. If you have dedicated infra people I can recommend it. If you don’t, I would look to a more managed solution like OP’s offering

jacobwg · 2024-04-04T21:02:58 1712264578

Yeah this is a good option if you'd like something to deploy yourself! You can also build an AMI from GitHub's upstream image definition (https://github.com/actions/runner-images/tree/main/images/ub...) if you'd like it to match what's available in GitHub-hosted Actions.

With Depot, we're moving towards deeper performance optimizations and observability than vanilla GitHub runners - we've integrated the runners with a cache storage cluster for instance, and we're working on deeper integration with the compute platform that we built for distributed container image builds - as well as expanding the types of builds we can process beyond Actions and Docker, for instance.

But different options will be better for different folks, and the `philips-labs` project is good at what it does.

SOLAR_FIELDS · 2024-04-05T06:13:40 1712297620

One of the most interesting value adds for me is not any of the things mentioned by OP. I would like to have a managed hosted runner solution where I can have a buildkit cache also in the same data center that I manage where I don’t have to pay ingress/egress to that cache but also I don’t have to manage my own runner infra. I have done the whole self hosted Karpenter + Actions Runner Controller thing to achieve this and it is a lot of work to set up and tune to get right.

The problem is actually really that GitHub’s caching offering is very limited for anything except the most basic of use cases and also they don’t offer a way to colo your own cache with them so that you aren’t paying cloud fees back and forth. You have to use their machines, their storage and their protocol which is only really viable if your definition of caching is literally just “upload files here” and “check if the uploaded built file already exists”.

Yes, I’m aware that buildkit offers “experimental” GHA caching support. But given how fat image layers are it’s basically unusable for anything beyond a toy project that builds a couple layers on top of an alpine image (as of the time of writing this post GHA limits cache size to 10gb per repo. Fine if you’re building npm or pypi packages or whatever, but hilariously inadequate for buildkit layer caching)

crohr · 2024-04-05T06:42:11 1712299331

What you are looking for is a local S3 cache, which buildx supports as a backend. Just make sure you have an S3 gateway connected to your VPC (and that your S3 bucket is in the same region than your runners!) and enjoy free bandwidth, unlimited cache size, and crazy fast network throughput.

https://runs-on.com/reference/caching/ https://runs-on.com/features/s3-cache-for-github-actions/

SOLAR_FIELDS · 2024-04-05T06:47:45 1712299665

Looks neat but is there a way to guarantee that it’s colo’ed with GHA hosted runners and that I won’t pay ingress/egress? If not then I don’t see how it’s much different than simply putting up my own bucket aside from saving me the logistics around permissions etc.

Edit: I see. This solution you linked doesn’t use GHA hosted runners at all - it’s intended to be a turnkey self hosted runner solution. In other words, a direct competitor to the service linked in OP. That wasn’t super clear from your comment but after reading your links it is more clear. I do really like the pricing here, if it actually works as advertised it’s a pretty great value prop for a lot of orgs.

crohr · 2024-04-05T07:01:13 1712300473

Oh yes, it can't work with GHA hosted runners otherwise you'll pay egress fees. From your first post I was assuming you were starting from the point of view that you would be running your own runners already.

It does work as advertised, try it :) And yes RunsOn is a direct competitor to the 5 YCombinator-funded companies operating in this space (Ubicloud, Warpbuild, Buildjet, Blacksmith, Depot).

jacobwg · 2024-04-05T07:17:11 1712301431

Hey, yep we have this today! Depot's original product is a fully-managed container build service (https://news.ycombinator.com/item?id=34898253) that caches all BuildKit layers to SSDs, so there's no cache-to/cache-from and cache doesn't need to transfer over the network at all.

Our original version of that system used vanilla BuildKit + EBS volumes + orchestration, nowadays we've replaced EBS with a distributed ceph storage cluster for significantly faster IOPS and throughput and have modified BuildKit to be better suited for high-performance distributed builds.

Both the container build service and the Actions runners are in the same AWS VPC, so they get good network performance between the two and don't need to egress over the internet.

kylegalbraith · 2024-04-05T07:07:21 1712300841

Other Depot founder here. This is a really great point.

You've hit on all the main points regarding Docker image cache in GHA. Persisting the massive layer cache over networks is incredibly slow and has weird limits (like 10GB per repo). We persist the layer cache to ceph volumes and orchestrate your cache so it's immediately available across builds with our first service, accelerated container image builds. Our GHA runners run right next to that same infra, so you don't have ingress/egress. All that can be hosted in your own AWS account (we're also open to running that in any general compute environment for folks who need it).

boundlessdreamz · 2024-04-05T06:19:43 1712297983

The depot.dev service has excellent caching for docker. It's almost like building locally.

Though the site (depot.dev) focuses on that aspect, this post doesn't.

@jacobwg - Do the runners in AWS get the same docker caching performance as depot.dev hosted runners?

jacobwg · 2024-04-05T07:25:11 1712301911

At the moment, I think you'd want to use both products together, i.e. using `depot build` in place of `docker build` to move the container build portion to a Depot container builder.

I'd like to have a more automatic integration at some point - the challenge is that a lot of BuildKit's architecture performs best when many different build requests all arrive at a single build host, it is then able to efficiently deduplicate and cache work across all those build requests. So you really want the many different Actions jobs all communicating with the same BuildKit host.

We have some ideas for reducing the amount of change to Actions workflows to adopt ^ - longer term we're also working on our own build engine, to free those workloads from being confined to single hosts (be that single CI runners or single container builders).

cocoflunchy · 2024-04-05T05:17:10 1712294230

Half the price of github is not great right now, this space is heating up! Ubicloud is 10x cheaper and https://runs-on.com is in the same ballpark by using spot instances. (Currently switching to RunsOn)

jacobwg · 2024-04-05T08:38:06 1712306286

Yeah, I think our goal is to be the fastest at building software, not necessarily the cheapest. Part of that involves AWS, to have access to more powerful and elastic infrastructure, but that comes at a premium.

But besides just compute, I think the bigger long-term unlock for build performance is a new distributed compute engine, to free build workloads from single machines. We've started building this for our container build product, and plan to integrate Actions jobs as an input as well, starting with the cache integration we have today.

madisp · 2024-04-04T21:48:09 1712267289

> - Each instance has high-throughput networking of up to 12.5 Gbps, hosted in us-east-1, so interacting with artifacts, cache, container registries, or the internet at large is quick.

do you actually get the promised 12.5 Gbps? I've been doing some experiments and it's really hard to get over 2.5Gbit/s upstream from AWS EC2, even when using large 64 vCPU machines. Intra-AWS (e.g. VPC) traffic is another thing and that seems to be ok.

jacobwg · 2024-04-04T21:57:14 1712267834

We do get the promised throughput, but it depends on the destination as you've discovered. AWS actually has some docs on this[0]:

- For instances with >= 32 vCPUs, traffic to an internet gateway can use 50% of the throughput

- For instances with < 32 vCPUs, traffic to an internet gateway can use 5 Gbps

- Traffic inside the VPC can use the full throughput

So for us, that means traffic outbound to the public internet can use up to 5 Gbps, but for things like our distributed cache or pulling Docker images from our container builders, we can get the full 12.5 Gbps.

[0] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...

mdaniel · 2024-04-05T02:04:22 1712282662

> > - Each instance has high-throughput networking of up to 12.5 Gbps, hosted in us-east-1

with that pull quote, I thought you were going to point out their use of us-fail-1. I struggle to think of a service that I care so little about its availability that I'd host it there, but CI/CD for sure wouldn't be one

watermelon0 · 2024-04-04T20:22:52 1712262172

Hey, @jacobwg, this looks great.

I couldn't find it anywhere on the page, but do you support Graviton3 (i.e. m7g instances) for GHA Runners? If the answer is no, are there any plans to support it in the future?

> start them when an actual job request arrives, to keep job queue times around 5 seconds

Did you have to fine-tune Ubuntu kernel/systemd boot to reach such fast startup times?

jacobwg · 2024-04-04T20:48:52 1712263732

We do support Graviton! I actually _just_ enabled them today, which we're calling "beta" for the moment: https://depot.dev/docs/github-actions/overview#depot-support....

The challenge with Arm is actually just that GitHub doesn't have a runner image defined for Arm. For the Intel runners, we build our image directly from GitHub's source[0], and we're doing the same for the Arm runners by patching those same Packer scripts for arm64. It also looks like some popular actions, like `actions/setup-*`, don't always have arm support either.

So the disclaimers for launching Depot `-arm` instances at the moment is basically just (1) we have no idea if our image is compatible with your workflows, and (2) those instances take a bit longer to start.

On achieving fast startup times, it's a challenge. :) The main slowdown that prevents a <5s kernel boot is actually EBS lazy-loading the AMI from S3 on launch.

To address that at the moment, we do keep a pool of instances that boot once, load their volume contents, then shutdown until they're needed for a job. That works, at the cost of extra complexity and extra money - we're experimenting some with more exotic solutions now though like netbooting the AMI. That'll be a nice blog post someday I think.

[0] https://github.com/actions/runner-images/tree/main/images/ub...

playingalong · 2024-04-04T20:38:09 1712263089

Not affiliated, just guessing.

This 5 seconds might be the warm start, not cold. I.e. they likely have a pool of autoscaled, multi tenant workers

jacobwg · 2024-04-04T20:50:26 1712263826

Yeah 5 seconds is from stopped to running, but to get that speed we need to pre-initialize the root EBS volumes so that they're not streaming their contents from S3 during boot. The GitHub Actions runner image is 50GB in size _just_ from preinstalled software!

jillesvangurp · 2024-04-05T11:36:42 1712317002

A few months ago I had some issues with build performance. We're on the free plan with Github so using custom runners is not an issue. But I found a nice workaround:

- create a virtual machine with everything you need in gcloud (would work for aws as well). Pick something nice and fast. Suspend it.

- in your github action, resume the vm, ssh into it to run your build script, and suspend it afterwards.

Super easy to implement and easy to script using gcloud commands. It adds about 30 seconds of time to the build for starting the vm. On the machine, we simply pull from git and checkout the relevant branch. Doesn't work for concurrent builds but it's a nice low tech solution. And you only pay for the time the machine is up and running, which is a few minutes per day. So, you can get away with using vms that have lots of CPU and memory.

bpsh · 2024-04-05T00:30:26 1712277026

Interesting, makes a lot of sense to me as far as pricing too. However, I feel the video demonstration could greatly improve in terms of explaining and enthusiasm. It's super cool though and presentations/demos should showcase the full potential!

math0ne · 2024-04-04T23:21:26 1712272886

I used this to setup my runners on a dedicated server: https://github.com/vbem/multi-runners

nodesocket · 2024-04-05T02:45:42 1712285142

A cool idea, but not sure the business case. I wrote a quick and dirty bash script which automates the process of adding 2x GitHub runners on instances (2 CPU cores and 4 GB memory each). Simply scale out horizontally. Since the instances are persistent you get docker image caching out of the box unlike hosted runners on GitHub. Also arm64 is fully supported.

brycelarkin · 2024-04-05T00:26:38 1712276798

For the AWS CDK folks, I’ve been very happy with this library. https://github.com/CloudSnorkel/cdk-github-runners. Love that I can use spot pricing and the c7g instances for cicd.

timvdalen · 2024-04-04T20:38:39 1712263119

Congrats on shipping! We built something similar internally. Tweaking it for the right cost/availability/speed was interesting, but we now have it working to where workers are generally spun up from 0 faster than GitHub's own are.

jacobwg · 2024-04-04T21:07:14 1712264834

Yeah, GitHub's runners, especially the ones with >2 CPUs, have surprisingly long start times!

LilBytes · 2024-04-04T22:54:47 1712271287

Hey Jacob, awesome suggestion!

Are you building your base image from the GitHub runner-images repo?

Do you have any appetite for building self hosted EC2 agents for Azure DevOps and GitHub?

I'm happy to help if you are, I'm working on something similar myself for my employer.

jacobwg · 2024-04-05T07:29:39 1712302179

Hey, we are building our base image from the runner-images repo! I'll send you an email!

siborg · 2024-04-05T09:44:00 1712310240

Your website is surprisingly good. Often, show hn sites are pretty basic and a little off the mark, but this was clear. Pricing seems simple too. Great job. Will give it a try.

alas44 · 2024-04-04T20:32:07 1712262727

How do you ensure privacy/isolation between users if you have a pool of ready VMs that you re-use?

jacobwg · 2024-04-04T20:52:31 1712263951

We don't re-use the VMs - a VM's lifecycle is basically:

1. Launch, prepare basic software, shut down

2. A GitHub job request arrives at Depot

3. The job is assigned to the stopped VM, which is then started

4. The job runs on the VM completes

5. The VM is terminated

So the pool exists to speed up the EC2 instance launch time, but the VMs themselves are both single-tenant and single-use.

alas44 · 2024-04-08T10:39:23 1712572763

My question is thus more on the on-disk data, you mention VM being terminated, does that data is wipped too and new VM starts on a brand new disk?

jacobwg · 2024-04-08T11:36:06 1712576166

Correct yeah, each run starts on a brand new VM with a brand new disk - since these are EC2 instances with EBS volumes for their root disk, the whole instance and the EBS volume are deleted after the job finishes and are not reused.

alas44 · 2024-04-12T21:19:01 1712956741

Thanks for the reply

playingalong · 2024-04-04T20:09:59 1712261399

Can I use my own AWS account?

jacobwg · 2024-04-04T20:16:25 1712261785

You can! The default is that we launch the runners on our AWS account, but we do also have a bring-your-own-cloud deployment option.

We have some docs on this for our container builder product - still need to write the docs for Actions runners too, though they use the same underlying system: https://depot.dev/docs/self-hosted/overview.

YouWhy · 2024-04-05T04:01:54 1712289714

TL;DR: managed runners by construction constitute a major ongoing infosec liability.

A managed runner means not only entrusting a third party with your code but also typically providing it with enough data/network connectivity to make testing/validation feasible as a part of the build process. While this is doable per se, it introduces multiple major failure modes outside of data owners' control.

Failure scenario (hypothetical): you hydrate your test DB using live data; you store it in a dedicated secure S3 bucket, which you make accessible for the build process. Now the managed runner organization gets hacked because making resilient infra is hard, and the attackers intercept the S3 credentials used by your build process. Boom! Your live data is now at the mercy of the attackers.

benced · 2024-04-05T04:58:25 1712293105

It’s not wisdom to point out that using 3P software constitutes a threat vector. Personally, except in rare cases of unusual competence or unusual sensitivity, I believe that in-house CI will be more vulnerable than managed.

pestkranker · 2024-04-04T20:18:58 1712261938

How does it compare to BuildJet?

jacobwg · 2024-04-04T20:40:30 1712263230

We're both offering managed GitHub Actions runners - some of the differences include:

- Depot runners are hosted in AWS us-east-1, which has implications for network speed, cache speed, access to internet services, etc. (BuildJet is hosted in Europe - maybe Hetzner?)

- Also thanks to AWS: each runner has a dedicated public IP address, so you're not sharing any third-party rate limits (e.g. Docker Hub) with other users

- We have an option to deploy the runners in your own AWS account or VPC-peer with your VPC

- We're integrating Actions runners with the acceleration tech we've built for container builds, starting with distributed caching

crohr · 2024-04-04T20:49:10 1712263750

Yes, BuildJet runs from Hetzner - https://runs-on.com/reference/benchmarks-gha-providers/

everfrustrated · 2024-04-04T21:57:51 1712267871

GitHub has a colo presence in Frankfurt so pulling repos from Europe is quick.