Hacker News new | past | comments | ask | show | jobs | submit login

I don't really believe that, at the speed of nic it makes pretty much 0 difference even on 30k servers. Shaving couple of ms at worse few seconds vs modifing a binary, def not worth it.



The servers are not all on gige. Many are on 100mbit and yes, that saturates the network when they are all updating. I learned through trial and error.

The updates are not pushed, they are pulled. Why? Because the machines might be in some sort of rebooting state at any point. So trying to first communicate with the machine and timeouts from that, would just screw everything up.

So, the machines check for an update on a somewhat random schedule and then update if they need to. This means that a lot of them updating at the same time would also saturate the network.

Smaller binaries matter.


I’m curious why you’ve got servers on 100Mb. Last time I ran a server on 100Mb was more than 20 years ago. I remember the experience well because we needed AppleTalk support which wasn’t trivial on GbE (for reasons unrelated to GbE — but that’s another topic entirely).

What’s your use case for having machines on 100Mb? Are you using GbE hardware but dropping down to 100Mb, and if not, where are you getting the hardware from?

Sounds like you might work in a really interesting domain :)


Not the GP but edge devices on wifi/m2m are another scenario where you're very sensitive to deployment size.

Which can also be solved with compression at various other stages of the pipeline as mentioned by other commenters, but just to say that that's an easy case where this matters.


Because the 12 GPUS in them are a lot more important than the networking speed. =)

They were for mining ETH... we've turned them off though now that PoS has been successful.


For large-ish scale distributed updates like that, maybe some kind of P2P type of approach would work well?

IBM used to use a variant of Bittorrent to internally distribute OS images between machines. That was more than a decade ago though, when I was last working with that stuff.


Answered below. https://news.ycombinator.com/item?id=36052632

Another issue with that is that the systems I was running can go offline at any time. P2P, which could work, kind of wants a lot more uptime than what we had. It would just add some complexity to deal with individual downtime.


Interesting stuff. Thanks for the insight


What I ended up with was really neat.

machine <-> cloudflare <-> github

CI would run, build a binary that was stored as an asset in github. Since the project is private, I had to build a proxy in front of it to pass the auth token, so I used CF workers. GH also has limitations on number of downloads, so CF also worked as a proxy to reduce the connections to GH.

I then had another private repo with a json file in it where I could specify CIDR ranges and version numbers. It also went through a similar CF worker path.

Machines regularly/randomly hit a CF worker with their current version and ip address. The worker would grab the json file and then if a new version was needed, in the same response, return the binary (or return a 304 not modified). The binary would download, copy itself into position and then quit. The OS would restart it a minute later.

It worked exceptionally well. With CIDR based ranges, I could release a new version and only update a single machine or every machine. It made testing really easy. The initial install process was just a single line bash/curl to request to get the latest version of the app.

I also had another 'ping' endpoint, where I could send commands to the machine that would be executed by my golang app (running as root). The machine would ping, and the pong response would be some json that I could use to do anything on the machine. I had a postgres database running in GCP and used GPC functions. I stored machine metrics and other individual worker data in there that just needed to be updated every ping. So, I could just update column and the machine would eventually ping, grab the command out of the column and then erase it. It was all eventually consistent and idempotent.

At ~30k workers, we had about 60 requests per second 24/7 and cost us at most about $300 a month total. It worked flawlessly. If anything on the backend went down, the machines would just keep doing their thing.


could be IOT or edge type stuff that's POE'd?


Sounds like an interesting problem to have. Would something peer-to-peer like BitTorrent work to spread the load? Utilize more of the networks' bisectional bandwidth, as opposed to just saturating a smaller number of server uplinks. I recall reading many years ago that Facebook did this (I think it was them?)


The complication of implementing BitTorrent isn't worth it at 4mb binary sizes.

Always go with the simplest solution first.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: