GitHub Actions as a time-sharing supercomputer (alexellis.io)
205 points by petercooper 11 months ago | 79 comments



One thing that I still don't understand about github actions is how much computing power gets wasted downloading and reinstalling the same dependencies over and over. A build that takes a couple of seconds on my local machine can easily take several minutes in github actions.


I prefer GA to basically every other build system especially because of the plugins but yeah, and I think caching can be turned on... but it has tons of inconsistencies related to packages and deps. This video really highlighted just how brittle the whole thing is even behind the scenes:

https://youtu.be/9qljpi5jiMQ (13:58 time stamp for the part that is especially relevant here)


After that video, I think I finally get Github Actions.


In Azure DevOps pipelines there's a fun issue where if you're pulling a dependency from nuget.org it gets pulled very quickly, but pulling from your own artifact repositories hosted in the very same Azure DevOps instance takes 40-80 seconds. I'm not the one paying the extra bills when my builds take 5 to 6 times longer than they should but it's still frustrating.

And Microsoft has effectively zero incentive to fix this because it still technically works and they get to charge my employer for all that extra build agent time. Here's a few examples of people reporting similar issues and getting zero help:

https://developercommunity.visualstudio.com/t/devops-build-p...

https://developercommunity.visualstudio.com/t/nuget-restore-...

https://developercommunity.visualstudio.com/t/slow-performan...


Microsoft hosted agents aren’t charged by the minute on ADO, IIRC. You pay for a certain number of parallel agents per month.

Unless you’re booting build servers on demand on your infra, you aren’t paying more for more minutes. Just waiting :(

(Unlike GitHub, where they do charge per minute.)


do you know if ADO has any caching mechanisms for pipelines doing builds?


The issue in this case isn't caching. Even if I'm not actually pulling any packages from our private nuget repo, simply having that repo listed in the project's nuget.config file is enough to cause the issue.


GitHub charges per minute. Some incentive to keep actions on the most under specced machines available.


Yeah. We have a similar thing going on with self-hosted GitLab at my work. (And there, there isn’t even any incentive why it should be slow.)

When I make an edit and rebuild on my computer, it takes a few seconds to rebuild because most things are cached.

In CI it takes 20 minutes on the MR branches of our repo. 30 minutes on master. Because a whole bunch of crap is downloaded from scratch and rebuilt from scratch.

Our infrastructure guys did set up something that is able to cache stuff that some of the repos use. But with the limited access I have I have not been able to figure out how to make use of it for the repo I work on. And they don’t have time to look into it for me either.

That’s my life lol


You can define a cache key in your .gitlab-ci.yml

https://docs.gitlab.com/ee/ci/yaml/index.html#cache
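
For anyone wondering what a cache key buys you: the usual idea is to derive the key from the files that pin your dependencies, so the cache gets reused until those files change. A rough sketch of that idea in Python. This is not GitLab's actual implementation, and the function name, the `deps` prefix and the `deps.lock` file are made up for illustration:

```python
import hashlib
import tempfile
from pathlib import Path

def cache_key(prefix, lockfiles):
    """Build a cache key that changes only when a lockfile's contents change."""
    digest = hashlib.sha256()
    for path in sorted(map(str, lockfiles)):  # sort so argument order does not matter
        digest.update(Path(path).read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"

# Demo with a throwaway, hypothetical lockfile.
with tempfile.TemporaryDirectory() as tmp:
    lock = Path(tmp) / "deps.lock"
    lock.write_text("libfoo==1.0\n")
    before = cache_key("deps", [lock])
    lock.write_text("libfoo==1.1\n")
    after = cache_key("deps", [lock])
    print(before != after)  # the key changes once the pinned dependencies change
```

Same lockfile, same key, so the runner can restore the old cache instead of downloading everything again.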


Considering gitlab is a very k8s native app, its way of running a build and moving caches into it, vs moving the container over to where the caches are, is very annoying. I've had issues open for years to update the cache-key with a CSI mount and let k8s move the pods to where the mount is fast. As it is now it pulls everything from S3.


Yeah gitlab has pretty good caching support and it is basically a must if you do bigger jobs, especially on stuff like c++ apps. It is pretty brittle sometimes, and in my experience it won't be able to be as fast as actual local dev but it's still good. You can even make it use a remote cache IIRC.


On free runners I'd regularly hit the 1 hour limit. Took some time to set up self-hosted through EC2 and really optimize caching, and it's now down to 10-20 mins, depending on the repo. Most of that is because of the self-hosted runners, the free ones are painfully slow.

Still, on my local machine these jobs would take 5 mins so it's not perfect. And as the build gets more complicated and more stages are added, the problem compounds since the initial download is the slowest part.


Do other CI services do it differently? I worked with Jenkins and Circleci and it was exactly the same.

For github actions caching was fairly easy to add.


Last time I tried to add caching to my actions, I spent several hours and still couldn't get it to work properly in the end. I really wonder how many computing hours have been wasted because the default and most likely to work is to not cache anything.


On CircleCI it's also easy to add using save_cache and restore_cache, the docs have examples for different popular frameworks. There are a couple of advanced features and I have been able to cache most build artifacts.

They also cache the Docker images without you having to do anything but it really only works if you use their base images. With custom images the cache rate doesn't look great in my experience but I have no stats to back that up.


Use a persistent runner and you can take advantage of on disk caches...at the cost of repeatability you get from a clean image on each run.


repeatability must mean different things to us, a cache hit is repeatability to me, a network call can easily fail for so many reasons


It sure can fail but running on a clean agent each time reduces the "Well, it works on my machine" scenarios you can run into.

For example:

The last run didn't clean up, or changed something global.

New runs don't work because some expected state is missing that had been set up on a previous run.

Blowing the agent away each time reduces the chance that something previously run can impact your current run.

This is at the cost of pulling from networks each time. You can reduce those issues by adding network caches for modules/dependencies.


I use persistent runners, but we had to add additional steps (docker prune, mvn -U, docker image ls, etc) to keep the runner healthy and be better able to debug issues.


Given GitHub also owns npm these days, at least some of the constantly reinstalled dependencies are just getting copied from the equivalent of "next door".


And/or they can fix GitHub’s caching of NPM if it saves NPM money.


"super" seems a bit generous. Actions runners have always felt slow to me. Fast enough to get the job done for CI/CD but for a batch job running it locally would be faster.

This idea does leech free compute in an API-agnostic way though, if this could be tied together with a GCP free tier, AWS free instance, etc I wonder if you could cobble together enough free resources to run everything you wanted.


I suspect they are slightly slow on purpose - so people don't abuse it for mining or whatever


CPU is okay from my observations, but iops is surprisingly slow


Alex's product vision is fantastic. I hope https://github.com/self-actuated gets noticed by more folks out there who are hitting the limits of GitHub's hosted runners.


Seems like a solid tech approach too. I'm surprised there isn't more of this around GHA. It really feels like Microsoft calibrated exactly good enough, yet everyone seems to have their own piles of workarounds in published actions, even bigger piles in their infra, and then some more in the workflow yamls themselves, to get everything actually working, especially when you need to support GHA runners, self-hosted runners, ARC runners, `act` runners, etc. If they can foster an OSS community around easy self-hosted, and then also offer a hosted runners product that is priced well, I'd pay 'em.


GitHub Actions computing time is crazy expensive. Especially the Mac prices.


Running macOS legally requires real mac servers and a bespoke storage solution: https://www.datacenterdynamics.com/en/analysis/not-just-stac...

A self-hosted macOS runner will be more economical in the long-run, if you have a spot you can hook it up at; or, if you're fine doing things less than legally, you can use https://github.com/sickcodes/Docker-OSX.


To add to this, they recently published a video showing how they provision and manage their macOS runners: https://www.youtube.com/watch?v=I2J2MzKjcqY


I'm sure legality depends on jurisdiction, too. If you acquired the software legally and you need to keep it running in a VM, I'm sure it's legal at least in some places.

But yeah, just drive-by-downloading macOS to your Windows box is probably not quite on the up and up.


Or get one from Hetzner (there's one in the server finder right now for 64/month).


Highly recommend everyone check out the self-hosted runners. GitHub has made it crazy easy to set up and all your actions are free and can have local cache. And you get all the benefits of the Actions ecosystem. Throw a used Mac mini or old PC or whatever at it.


They also price gouge by charging each job you run as per minute rounded up.

So if you have five 10-second actions that run when a PR is created, that 50 seconds of compute is charged as 5 minutes.
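
The arithmetic is easy to check. A minimal sketch, assuming each job is billed on its own and rounded up to the next whole minute (the function name is just for illustration):

```python
import math

def billed_minutes(job_durations_seconds):
    """Total billed minutes when every job is rounded up to a whole minute."""
    return sum(math.ceil(seconds / 60) for seconds in job_durations_seconds)

jobs = [10] * 5  # five jobs of 10 seconds each
print(sum(jobs))             # 50 seconds of actual compute
print(billed_minutes(jobs))  # 5 billed minutes
```

Folding the same five checks into a single job would bill 1 minute under the same assumption, which is why lots of tiny jobs are the expensive case.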


Is it really price gouging, or just embodying the real warmup costs of spinning up a job?


I truly don't understand why this isn't more widely discussed (I've seen several "GH Actions Gotchas" articles where this isn't mentioned). Many of the community actions also seem to be designed to run as short jobs to paper around missing features (for ex: https://github.com/dorny/paths-filter ), that end up eating up an enormous amount of your minutes budget.


We are working on an alternative that's 100% compatible with GHA but much cheaper and faster. Check out https://dime.run/


Why do actions use VMs instead of containers? And yet codespaces use containers.


Good question, I covered that in detail, including the issues with Docker In Docker.

https://actuated.dev/blog/blazing-fast-ci-with-microvms

https://docs.actuated.dev/faq/


sometimes you need to be able to run containers within your action, which can get complicated if you're already in a container


Is there an inherent complication here?


There are some! This is called "Docker in Docker", and you need to expose your docker socket to the container, which has some security implications.

It can be secured, as gitpod does, but as I understand it is a PITA.


That's not what dind is. Rather, there is a docker daemon running inside the container, and the containers it hosts are nested inside its cgroup in the host kernel. The result is very close in feel to docker in its own VM. Furthermore nesting can be done inside one of the payload containers creating a turducken. E.g. you can run k8s in a container, with k8s nodes implemented as nested containers and the cluster pods as doubly nested "pigeon" containers.

I haven't tried more than three levels but in theory more should work.


docker-in-docker doesn't run a docker daemon in the container, it just bind-mounts the host's docker socket inside the container, and the docker client talks to that. Any containers you launch from within docker-in-docker are siblings, not nested.


You are mistaken. From https://hub.docker.com/_/docker:

    What is Docker in Docker?
    Although running Docker inside Docker is generally not recommended, there are some legitimate use cases, such as development of Docker itself.
    ...If you are still convinced that you need Docker-in-Docker and not just access to a container's host Docker server, then read on.
This makes it pretty clear that it's a different copy of the docker daemon (which eg. allows you to test changes to docker itself) and specifically says it's different from "just access to a container's host Docker server".


Yikes, I stand corrected. Thanks for setting me straight.
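
To make the two setups in this subthread concrete, here is a small Python sketch that only builds the two `docker run` command lines, it does not execute them. The container name `inner-docker` is made up, and a real dind setup typically needs extra networking and TLS options that are left out here:

```python
import shlex

HOST_SOCKET = "/var/run/docker.sock"

def socket_mount_command(image="docker", inner=("docker", "ps")):
    """Client in a container talking to the HOST daemon; containers it starts are siblings."""
    return ["docker", "run", "--rm",
            "-v", f"{HOST_SOCKET}:{HOST_SOCKET}",
            image, *inner]

def dind_command(name="inner-docker", image="docker:dind"):
    """A separate daemon inside a privileged container; containers it starts are nested."""
    return ["docker", "run", "--privileged", "-d", "--name", name, image]

for argv in (socket_mount_command(), dind_command()):
    print(shlex.join(argv))
```

The first form shares the host's daemon through the bind-mounted socket, the second starts its own daemon, which is why it needs `--privileged`.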


> you need to expose your docker socket to the container

I always thought this was a hard limitation, but I deployed some self-hosted GHA runners in Kubernetes this week and to my surprise that setup came with an option to run the full docker daemon inside of a container - so apparently it is possible.


If you're running a full docker daemon, then you'll be running as a privileged container which is worse or about the same in terms of poor security. Anyone's workload can compromise the host, and likely the cluster.

Rootless containers are a lot of work and do not support many scenarios that you're going to need.

MicroVMs are the same experience as GitHub, full system and Kernel, do what you will. Even launch a nested VM.


It's hardly a PITA, I've done this a few times and as long as you use the dind docker image, it's easy to forget you're even using DIND.


I think it simply got easier over the years, as the various implementations overcame problems


> There's something persuasive about running jobs and I don't think it's because developers "don't want to maintain infrastructure".

I remember taking a small university course on Ethereum, and getting an introduction to smart contracts, trustless environments, and so on. We then heard about a couple of example projects, and were finally asked for our own ideas.

Now, after learning a bit more, I'm pretty sure none of the ideas presented either by the lecturers, me, or the other students really benefitted from the trustless environment, which is mostly what you'd use Ethereum for, and arguably what you pay (a lot) for during contract execution. Yet there were so many ideas about what could be done using smart contracts which were really cool projects on their own.

I think a big world computer with nodes that can perform calculations, react to user input, be called from other nodes, and exchange tokens and information, is somehow an incredibly natural abstraction that humans can work very well with. So, agreed.


The company I co-founded (Terrateam) develops a Terraform/OpenTofu GitOps CI/CD that focuses on GitHub. We use GitHub to run the operations, for a number of reasons, but we treat it just like ephemeral compute. We initiate the run, basically a near blank image, with our runner on it, and it "morphs" the image into what we need to do for that run. It works really well, except that GitHub Actions can be slow and I think they haven't quite figured out how to do reliable operations for it, yet.

The major thing about the API that I don't like is when you initiate a GHA run via the API, it does not give you an ID that you can use to track it. So if you initiate and it either never runs or fails for some reason prior to any code you put on the image, there is no good way to track that.


Cloud in general is just "mainframe 2.0."



Thank you, much appreciated. While I'm still processing the arguments, the style itself is breathtaking. They don't make sentences like this anymore:

> Home computers are not unusual, anymore; the novelty is gone.


If you're a fan of elegant prose, I can't recommend Epicurean Dealmaker[1] enough. Sadly, the blog hasn't been written to in 8-ish years now. But I suspect that most of the information is still probably reasonably accurate.

---

1. https://epicureandealmaker.blogspot.com/2013/02/table-of-con...


Definitely validates the mainframe architecture though doesn't it?


Ironically the fastest path to getting everyone connected was to have them talk to cloud data-centers/neo-mainframes.

So it both justifies the mainframe, but if you look at why we all went mainframe in the end, it was to, ironically, connect many computers together.

There's also the overwhelming issue of power and control. Moving computing back to the mainframe allows totalistic control over computing by the service provider. This is a good way to make money. But is it good for the world? And what would the world look like if we had reliable fast trustable interconnected systems, instead of cloud mainframes/data keeps?

I'm forgetting which books but some of the books about early computing talked about protests against computerization, against the mass data ingestion (probably among others What the Dormouse Said?). For a while the personal computer was a friendlier less scary mass-roll-out of computing, but this cloud era has not seen many viable alternatives to staying connected while keeping computing personal. RemoteStorage was early in, and Tim Berners-Lee trying a seemingly very reasonable Solid idea seem like very reasonable takes, or going full p2p with data/hyper and that world: none of these have the inertia where others can follow suit. The problem is much harder, but I think it's more path dependence and perverse incentives, that breaking out will be found to be quite workable and good and validated, but there's gross inaction on finding the moral, open ecosystem, protocols & standards based alternatives for connecting ourselves together as we might.


Fast, trustable, and interconnected is a function of the infrastructure, so everyone who wants to host still needs a data center.


Hosting from home seems absurdly viable for many. I have a systemd-timer that keeps a upnp-igd nat hole punched so I can ssh in, and that has absurdly good uptime. My fiber to the home would survive quite a lot of use.

Past that, vps can be had so cheap. If we have good software, the computing footprint ought to be tiny.

One real challenge is scale out. Ideally, for p2p to really work, some multi-tenant systems seem required, so we can effectively co-host. I loved the sandstorm model, but didn't actually use it, and I think there's further refinement possible.

Ideally imo, I could host like 10 apps, but if you want to use one, you spin up your own tenant instance. The lambda engine/serverless/FaaS thing wouldn't actually spin up new runtimes, it'd use the same FaaS instances, but be fed your tenant context when it ran and only be able to access your tenant stuff. That way as a host, I kind of know & can manage what runs, but you can have your own freedom of configuring your own instance to a large degree.
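
One way to picture that shared-runtime idea: the host registers each function once, and every invocation is handed a tenant context that can only see that tenant's data. A toy sketch in Python. All the names here (`Host`, `TenantContext`, the `counter` app) are invented for illustration and no real FaaS platform is implied:

```python
from dataclasses import dataclass, field

@dataclass
class TenantContext:
    """What a function sees when it runs: one tenant's id and storage only."""
    tenant_id: str
    storage: dict

@dataclass
class Host:
    functions: dict = field(default_factory=dict)  # shared code, registered once
    _data: dict = field(default_factory=dict)      # per-tenant storage

    def register(self, name, fn):
        self.functions[name] = fn

    def invoke(self, tenant_id, name, *args):
        # Same function object for every tenant; only the context differs.
        storage = self._data.setdefault(tenant_id, {})
        return self.functions[name](TenantContext(tenant_id, storage), *args)

def counter(ctx, step=1):
    """Example app: a per-tenant counter kept in the tenant's own storage."""
    ctx.storage["count"] = ctx.storage.get("count", 0) + step
    return ctx.storage["count"]

host = Host()
host.register("counter", counter)
host.invoke("alice", "counter")
host.invoke("alice", "counter")
print(host.invoke("alice", "counter"))  # 3
print(host.invoke("bob", "counter"))    # 1, bob never sees alice's count
```

The host knows exactly which code runs (one registry), while each tenant's state stays separate. Real isolation would need far more than a dict per tenant, this only shows the shape.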

Then we need front ends that let you traffic steer and anycast in fancy ways, so you can host your own but it falls back to me, or you have 10 peers helping you host & you can weight between them.

Operationalizing what we have already & finding efficient wins to scale is kind of cart before the horse, since fediverse et al. are so new, but I think the deployment/management model to let us scale our footprint beyond ourselves is a crucial leap. And I think we are remarkably closer than we might think, that the jump into a bigger more holistic pattern is possible if we leverage the excellent serverless runtimes & operational tools that have recently merged integratively.


It's not a mainframe unless IBM ships 6 entirely different OSes for it.


> "is GitHub Actions production ready? The answer is yes, so by proxy, you could run this tool in production"

I sent this to some of my coworkers for a good laugh.


TIL production-readiness is contagious.


Maybe it should be thought of as production-adjacent...


our ops group imposed Github Actions on us and build times have tripled on top of being less reliable.


This is really cool, Alex: the ability to run arbitrary jobs on GitHub actions that you've wrapped into a convenient API that (I guess) automatically runs on the 3000 free-minutes that every GH account gets on default Ubuntu-runner actions every month.

Super cool, I love this idea!

Less general than you, but I also had a notion to do kewl shit on GH actions, and I figured out how to turn it into a "personal ephemeral VPN-like" using BrowserBox.

Basically, you:

1. fork/generate the BrowserBox project into your own account

2. enable Actions and issues on the fork, and

3. open a new issue from the "Make VPN" issue template which triggers the action to create your remote browser and gives you a login link.

And voila! In a few minutes you get an up and running remote browser/private VPN that runs for 15 minutes or so (conservative, but you can tweak the value in the actions yaml!).

Even cooler, the action job will post comments guiding you through any setup steps you need, and then post the login link for your BrowserBox instance into the comments section of the issue you opened! Here's the action yaml that I use to do this^0

One hassle I have still not avoided is the necessity of using ngrok for this. Sure, you can get around it using a tor hidden service which we also support, but that requires the user connecting via a tor browser.

Ngrok is required to create the tunnel from the BrowserBox running inside the GH Actions Runner, to the outside world.*

Seems like a lot of steps, right? So I added a "conversational" set of instructions auto posted by the runner as issue comments to guide you through it.

Check it out la! :)

https://github.com/BrowserBox/BrowserBox?tab=readme-ov-file#...

* Technically, this is probably possible by using mkcert on the IP address of the runner, and posting the rootCA.pem as an attachment to an issue comment. You then need to add it to your trust store and you're good to go, but ngrok is the easier way: sign up to ngrok, get your API key, add it to the Repository Secrets in settings and hit the "Make VPN" issue.

0: https://github.com/BrowserBox/BrowserBox/blob/boss/.github/w...


I know this is tangential to the linked article, but why the "Open" in OpenFaaS if the community edition is so limited? The first payment tier is 1000 dollars. Is that open in the sense you can send pull request and inspect code?


...and this is why we can't have nice things


Neat, this is a great simulation of eventual computing. When GH Actions goes down, it would be like your time-sharing quota has run out. (:


The approach the author describes is batch processing, not time sharing. The whole point of time sharing was to allow users to work with the computer interactively rather than having to submit a job and wait for results.


Wouldn't Google Colab then be closer to both "time shared" and "supercomputer"? Jupyter notebooks are certainly interactive and Colab's purpose is to allow often low-powered clients to offload ML training/inference to a GPU-enabled server, so kind-of like time-shared systems of old?


This reminds me of my dad talking about how amazing it was when time-sharing became a thing at UWaterloo. He'd talk about how batch-processing was such a pain when you'd discover some tiny error and have to get back in line, and how with time-sharing you could show up at 3am and have an entire PDP to yourself.


My PhD advisor stayed up late at night (in the '60s) to get exclusive access to an IBM mainframe that was normally used at UC for payroll. When I joined the lab the convention was to go find a machine that was only lightly loaded (manually; there was no batch queue, just telnet to a cluster of machines) and start your multi-month simulation and hope nobody else popped onto the machine to steal half your processor.


Batch processing and time sharing, there are pros and cons. Time aloning, though, there are no cons.


It needs JCL cards for proper batch processing.

Speaking of mainframes... I ran across this video on the latest workflow for COBOL / CICS development on your IBM mainframe. You have the old 3270 emulation way yes, but then we move into 2010 technology: a custom version of Eclipse IDE that submits the jobs for you, then we move into modern times with Visual Studio Code integration (using something called Zowe CLI):

https://www.youtube.com/watch?v=_CYUYnKim7U



Yes. In the courses I give on MLOps, we use Github Actions to run on a schedule:

1. feature pipelines

2. batch inference pipelines


I needed to set up GitHub Actions for several open source repos for automated CI/CD, using it for ML pipelines seems awful.


I would love to hear your suggestion of an alternative free platform that can be used to run Python programs on a schedule.



