CI would run and build a binary that was stored as a release asset in GitHub. Since the project was private, I had to put a proxy in front of it to attach the auth token, so I used CF Workers. GH also has limits on the number of downloads, so the CF Worker doubled as a cache to cut down on connections to GH.
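A minimal sketch of that proxy logic. The real one was a Cloudflare Worker; it's shown in Go here to match the rest of these examples, and the repo path (acme/agent), GH_TOKEN env var, and port are all assumptions:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"sync"
)

var (
	mu    sync.Mutex
	cache = map[string][]byte{} // asset path -> binary, so GH sees one download
)

func asset(w http.ResponseWriter, r *http.Request) {
	// e.g. GET /assets/12345 -> GH API asset endpoint for the private repo
	upstream := "https://api.github.com/repos/acme/agent/releases" + r.URL.Path

	mu.Lock()
	body, ok := cache[upstream]
	mu.Unlock()

	if !ok {
		req, err := http.NewRequest("GET", upstream, nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		req.Header.Set("Authorization", "token "+os.Getenv("GH_TOKEN"))
		req.Header.Set("Accept", "application/octet-stream") // raw asset, not JSON
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			http.Error(w, "upstream error", http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			http.Error(w, "upstream error", http.StatusBadGateway)
			return
		}
		if body, err = io.ReadAll(resp.Body); err != nil {
			http.Error(w, "read error", http.StatusBadGateway)
			return
		}
		mu.Lock()
		cache[upstream] = body
		mu.Unlock()
	}
	w.Write(body)
}

func main() {
	http.HandleFunc("/assets/", asset)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```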
I then had another private repo with a JSON file in it where I could specify CIDR ranges and version numbers. It also went through a similar CF Worker path.
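The exact shape of that file isn't described, but something like this fits the CIDR-plus-version idea (field names are my guess). First match wins, so a /32 entry canaries a single machine while the catch-all keeps everyone else on the stable build:

```json
{
  "releases": [
    { "cidr": "203.0.113.7/32", "version": "1.4.0" },
    { "cidr": "0.0.0.0/0",      "version": "1.3.7" }
  ]
}
```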
Machines regularly (at randomized intervals) hit a CF Worker with their current version and IP address. The worker would read the JSON file and, if a new version was needed, return the binary in the same response (or a 304 Not Modified otherwise). The agent would download the new binary, copy it into place over itself, and then quit. The OS would restart it a minute later.
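The agent side of that check, sketched under the assumption of a hypothetical updates.example.com/update endpoint:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

const version = "1.3.7" // baked in at build time

func checkForUpdate() error {
	resp, err := http.Get("https://updates.example.com/update?version=" + version)
	if err != nil {
		return err // backend down? fine, try again next cycle
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNotModified:
		return nil // already current
	case http.StatusOK:
		// new binary in the body; fall through to install it
	default:
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}

	self, err := os.Executable()
	if err != nil {
		return err
	}

	// Write the new binary next to the current one, then rename over it.
	// Rename is atomic on the same filesystem, and the running process keeps
	// its (now replaced) image until it exits.
	tmp := self + ".new"
	f, err := os.OpenFile(tmp, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, resp.Body); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp, self); err != nil {
		return err
	}
	os.Exit(0) // quit; the OS supervisor restarts us a minute later
	return nil
}

func main() {
	if err := checkForUpdate(); err != nil {
		fmt.Fprintln(os.Stderr, "update check failed:", err)
	}
}
```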
It worked exceptionally well. With CIDR-based targeting, I could release a new version to a single machine or to every machine, which made testing really easy. The initial install was just a one-line bash/curl command that fetched the latest version of the app.
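Worker-side, the targeting is just first-match-wins over the CIDR list; a minimal sketch with Go's net package, using the guessed config shape from above:

```go
package main

import (
	"fmt"
	"net"
)

type release struct {
	CIDR    string `json:"cidr"`
	Version string `json:"version"`
}

// versionFor returns the version for the first CIDR containing the caller's IP.
func versionFor(ip net.IP, releases []release) (string, bool) {
	for _, rel := range releases {
		_, ipnet, err := net.ParseCIDR(rel.CIDR)
		if err != nil {
			continue // skip malformed entries instead of breaking every machine
		}
		if ipnet.Contains(ip) {
			return rel.Version, true
		}
	}
	return "", false
}

func main() {
	releases := []release{
		{CIDR: "203.0.113.7/32", Version: "1.4.0"}, // canary: one machine
		{CIDR: "0.0.0.0/0", Version: "1.3.7"},      // everyone else
	}
	v, _ := versionFor(net.ParseIP("203.0.113.7"), releases)
	fmt.Println(v) // 1.4.0
}
```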
I also had another 'ping' endpoint, where I could send commands to the machine that would be executed by my golang app (running as root). The machine would ping, and the pong response would be some JSON that I could use to do anything on the machine. I had a Postgres database running in GCP and used GCP Cloud Functions. I stored machine metrics and other individual worker data in there that just needed to be updated on every ping. So I could just update a column, and the machine would eventually ping, grab the command out of the column, and then erase it. It was all eventually consistent and idempotent.
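A sketch of that ping loop, assuming a hypothetical /ping endpoint and pong shape; the actual command format isn't described, so piping it through /bin/sh is a stand-in:

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"math/rand"
	"net/http"
	"os/exec"
	"time"
)

type pong struct {
	Command string `json:"command"` // empty when there's nothing to do
}

func pingOnce() {
	metrics, _ := json.Marshal(map[string]any{
		"version": "1.3.7",
		"load":    0.42, // whatever per-machine metrics get updated each ping
	})
	resp, err := http.Post("https://updates.example.com/ping",
		"application/json", bytes.NewReader(metrics))
	if err != nil {
		log.Println("ping failed, will retry:", err) // backend down is fine
		return
	}
	defer resp.Body.Close()

	var p pong
	if err := json.NewDecoder(resp.Body).Decode(&p); err != nil || p.Command == "" {
		return
	}
	// Runs as root, like the author's agent. The command came out of a
	// Postgres column that gets erased once served, so re-pings are no-ops.
	out, err := exec.Command("/bin/sh", "-c", p.Command).CombinedOutput()
	log.Printf("ran %q: err=%v out=%s", p.Command, err, out)
}

func main() {
	for {
		pingOnce()
		// Randomized interval, ~8 minutes on average: that spreads 30k
		// machines into a steady ~60 requests/second.
		time.Sleep(5*time.Minute + time.Duration(rand.Int63n(int64(6*time.Minute))))
	}
}
```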
At ~30k workers, that came to about 60 requests per second, 24/7 (each machine checking in roughly every eight minutes), and it cost at most about $300 a month total. It worked flawlessly. If anything on the backend went down, the machines would just keep doing their thing.