I’ve learned a bunch of stuff about batch processing in the last few years that I would have sworn I already knew.

We had a periodic script that came with all of these caveats about checking telemetry on the affected systems before running it, and even when the telemetry looked fine it took gobs of hardware and ran for over 30 minutes.

There were all sorts of traffic-shaping mistakes that made it very bursty, like using batching where it should have been rate limiting, so the settings had been determined by trial and error, essentially tuned to the 95th percentile of the worst case (which is to say occasionally you'd get unlucky and knock things over). It also had to gather data from three services to feed a fourth, and it was very spammy about that as well.
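
To make the burstiness concrete, here's a rough sketch (Python asyncio, with a made-up `send` coroutine and made-up numbers, not our actual script) of the difference between fire-a-batch-then-sleep and steady pacing:

    import asyncio

    # Hypothetical illustration: batch-then-sleep averages out to a fine
    # rate on paper but hits the downstream service in spikes.
    async def bursty(items, send, batch_size=500, pause=5.0):
        for i in range(0, len(items), batch_size):
            # a few hundred requests land at once...
            await asyncio.gather(*(send(x) for x in items[i:i + batch_size]))
            # ...followed by dead air
            await asyncio.sleep(pause)

    # Pacing the same work to a steady rate keeps the instantaneous load
    # close to the average load.
    async def smooth(items, send, rate=100.0):
        for x in items:
            await send(x)
            await asyncio.sleep(1.0 / rate)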

I reworked the whole thing with actual rate limiting, separate async blocks to interleave traffic to the different services, and composite rate limiting so we would call service C no faster than service D could retire requests.
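
A minimal sketch of what I mean by composite rate limiting, again asyncio with hypothetical call_c / call_d client calls rather than our real code: a limiter caps the raw request rate to C, and a semaphore that only frees a slot once D finishes ties C's rate to how fast D can retire work.

    import asyncio
    import time

    class RateLimiter:
        """At most `rate` acquisitions per second, spaced evenly."""
        def __init__(self, rate: float):
            self.interval = 1.0 / rate
            self._next = time.monotonic()
            self._lock = asyncio.Lock()

        async def acquire(self):
            async with self._lock:
                now = time.monotonic()
                wait = self._next - now
                self._next = max(now, self._next) + self.interval
            if wait > 0:
                await asyncio.sleep(wait)

    async def feed_c_into_d(items, call_c, call_d, c_rate=50, max_in_flight=10):
        limiter = RateLimiter(c_rate)                    # hard cap on calls to C
        backpressure = asyncio.Semaphore(max_in_flight)  # slots freed only when D finishes

        async def one(item):
            await backpressure.acquire()       # wait for D to have capacity
            try:
                await limiter.acquire()        # smooth out the traffic to C
                data = await call_c(item)
                await call_d(data)             # D retires the request...
            finally:
                backpressure.release()         # ...which is what lets C proceed

        await asyncio.gather(*(one(i) for i in items))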

At one point I had cut the cluster core count by 70% and the run time down to 8 minutes, which works out to roughly a 12x improvement in core-minutes. Exactly the same amount of work, just done smarter.

CDNs and SaaS companies are in a weird spot where typical spider etiquette falls down. Good spiders limit themselves to N simultaneous requests per domain, trying to balance their burden across the entire internet. But they are capable of M*N total simultaneous requests, so if many of your domains share one cluster, or you just get unlucky, they can spider twenty of your sites at the same time. Depending on how your cluster works (i.e., cache expiry) that may actually cause more stress on the cluster than just blowing up one host at a time.
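
For illustration, the gap looks something like this from the crawler's side (hypothetical Python, nothing any particular spider actually does): a per-hostname cap is polite per site, but nothing stops twenty hostnames that share a backend from being hit at once unless you also cap per cluster.

    import asyncio
    from urllib.parse import urlsplit

    PER_HOST = 4      # "N": the polite per-hostname limit
    PER_CLUSTER = 8   # extra cap, only possible if you can tell hosts share a backend

    host_sems: dict[str, asyncio.Semaphore] = {}
    cluster_sems: dict[str, asyncio.Semaphore] = {}

    def cluster_of(host: str) -> str:
        # Placeholder: a real crawler would need CDN/DNS knowledge to know
        # that twenty different hostnames land on the same cluster.
        return ".".join(host.split(".")[-2:])

    async def polite_fetch(url: str, do_get):
        host = urlsplit(url).hostname or ""
        h = host_sems.setdefault(host, asyncio.Semaphore(PER_HOST))
        c = cluster_sems.setdefault(cluster_of(host), asyncio.Semaphore(PER_CLUSTER))
        async with c, h:   # without the cluster cap, total load can reach M*N
            return await do_get(url)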

People can get quite grumpy about this behind closed doors, and punishing the miscreants definitely gets discussed.



