
> Anecdotally, there is a ton of money left on the table by established businesses...

True. FWIW, I worked on the same project at Twitter four years back - the Facebook folks call it capacity planning at scale; we called it capacity utilization modeling. The goal was the same - there are all these "jobs" - tens of thousands of programs running on distributed clusters, hogging CPU, memory, and disk. Can we look at a snapshot in time of the jobs' usage and then predict/forecast what the next quarter's usage will be? If you get these forecasts right (within reasonable error bounds), the folks making purchasing decisions (how many machines to lease for the datacenters next quarter) can save a bundle.

From an engineering POV, every job would need to log its p95 and p99 CPU usage, memory stats, disk stats... Since Twitter was running some 50k programs back then (2013ish) on these Mesos clusters, the underlying C++ API had hooks to obtain CPU and memory stats, even though the actual programs running were coded up in Scala (mostly), Python/Ruby (a bigger minority), or C/Java/R/Perl (a smaller minority). There's an interesting Quora discussion on why Mesos was in C++ while the rest of Twitter is Scala-land... mostly because you can't do this sort of CPU/memory/disk profiling in JVM-land as well as you can in C++.
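A minimal sketch, in Scala, of the kind of per-job usage record described above; the field names and units are illustrative assumptions, not the actual Mesos/Twitter schema.

```scala
// Hypothetical per-job, per-minute usage record (illustrative, not the real schema).
final case class JobUsage(
  jobName:  String,
  minuteTs: Long,    // sample timestamp, epoch minutes
  cpuP95:   Double,  // 95th-percentile CPU cores used in that minute
  cpuP99:   Double,  // 99th-percentile CPU cores used in that minute
  memRssMb: Double,  // resident memory, MB
  diskMb:   Double   // disk used, MB
)
```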

OK, so now you have all these CPU stats. What do you do with them? Before you get to that, you have the usual engineering hassles - how often should you collect the CPU stats? Where would you store them?

So at Twitter we got these stats every minute (serious overkill :) and stored them in a monstrous JSON (a horrible idea given 50,000 programs * 1,440 minutes in a day * all the different stats you were storing :)).

So every day I'd get a gigantic 20 GB JSON from infra, and then I'd have to do the modeling.

In those days, you couldn't find a single Scala JSON parser that would load that gigantic JSON without choking. We tried them all. Finally we settled on GSON, Google's JSON parser written in Java, which handled these gigantic JSONs with no hiccups.
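A minimal sketch of streaming a huge JSON dump with GSON's JsonReader from Scala, so the whole file never has to sit in memory at once. It reuses the hypothetical JobUsage record above; the field names and the "top-level array of records" layout are assumptions about the dump format, not the actual schema.

```scala
import com.google.gson.stream.JsonReader
import java.io.{BufferedReader, FileReader}

object StreamStats {
  // Stream the dump record by record, calling f on each parsed JobUsage.
  def foreachRecord(path: String)(f: JobUsage => Unit): Unit = {
    val reader = new JsonReader(new BufferedReader(new FileReader(path)))
    try {
      reader.beginArray()                        // assumed: top-level array of records
      while (reader.hasNext()) {
        reader.beginObject()
        var job = ""; var ts = 0L
        var cpu95 = 0.0; var cpu99 = 0.0; var mem = 0.0; var disk = 0.0
        while (reader.hasNext()) {
          reader.nextName() match {
            case "jobName"  => job   = reader.nextString()
            case "minuteTs" => ts    = reader.nextLong()
            case "cpuP95"   => cpu95 = reader.nextDouble()
            case "cpuP99"   => cpu99 = reader.nextDouble()
            case "memRssMb" => mem   = reader.nextDouble()
            case "diskMb"   => disk  = reader.nextDouble()
            case _          => reader.skipValue()  // ignore fields we don't model
          }
        }
        reader.endObject()
        f(JobUsage(job, ts, cpu95, cpu99, mem, disk))
      }
      reader.endArray()
    } finally reader.close()
  }
}
```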

Before you get to the math, you have to parse the JSON and build a data structure that stores these (x, t) tuples in memory. You had 50k programs, so each program got its own model; each model originated from a shitton of (x, t) tuples, and with t sampled every minute and some of these programs having been running for years, you had very large datasets.
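A small sketch (continuing the hypothetical JobUsage/StreamStats types above) of folding the stream into one (t, x) series per job, ready to hand to a per-job model. Using p95 CPU as x is an assumption for illustration.

```scala
import scala.collection.mutable

object BuildSeries {
  // One (timestamp, p95 CPU) series per job name.
  def perJobCpuSeries(path: String): Map[String, Vector[(Long, Double)]] = {
    val acc = mutable.Map.empty[String, mutable.ArrayBuffer[(Long, Double)]]
    StreamStats.foreachRecord(path) { r =>
      acc.getOrElseUpdate(r.jobName, mutable.ArrayBuffer.empty) += ((r.minuteTs, r.cpuP95))
    }
    acc.map { case (job, buf) => job -> buf.toVector }.toMap
  }
}
```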

The math was relatively straightforward... I used so-called "LAD" - least absolute deviations - as opposed to simple OLS, because least squares wasn't quite predictive enough for that use case. Building the LAD modeling thing in Scala was somewhat interesting... Most of the work was done by the Apache Commons Math libraries; I mostly had to make sure the edge cases wouldn't throw you off, because LAD admits multiple solutions for the same dataset - it's not like OLS, where you give it a dataset and it finds a unique best-fit line. Here you'd have many lines sitting in an array, depending on how long you let the simplex solver run. Then came the problem of visualizing these 50,000 piecewise-linear models in JavaScript, heh heh. The front-end guys had a ball with the models I spit out.
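A minimal sketch of LAD line fitting via a simplex solver, in the spirit described above, assuming Commons Math 3's org.apache.commons.math3.optim.linear package; the variable layout, iteration cap, and fitting p95 CPU against time are my assumptions, not the original code.

```scala
import org.apache.commons.math3.optim.linear._
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType
import org.apache.commons.math3.optim.{MaxIter, PointValuePair}

object LadFit {
  // Fit x ~ a + b*t by minimizing sum |x_i - (a + b*t_i)|.
  // LP formulation: variables (a, b, e_1..e_n); minimize sum(e_i)
  // subject to  a + b*t_i + e_i >= x_i  and  a + b*t_i - e_i <= x_i,
  // so at the optimum each e_i equals the absolute residual.
  def fit(ts: Array[Double], xs: Array[Double], maxIter: Int = 10000): (Double, Double) = {
    val n = ts.length
    require(n == xs.length && n >= 2)

    // Objective: 0*a + 0*b + 1*e_1 + ... + 1*e_n
    val objCoeffs = Array.fill(2 + n)(0.0)
    (0 until n).foreach(i => objCoeffs(2 + i) = 1.0)
    val objective = new LinearObjectiveFunction(objCoeffs, 0.0)

    val constraints = (0 until n).flatMap { i =>
      val upper = Array.fill(2 + n)(0.0)   // a + b*t_i + e_i >= x_i
      upper(0) = 1.0; upper(1) = ts(i); upper(2 + i) = 1.0
      val lower = Array.fill(2 + n)(0.0)   // a + b*t_i - e_i <= x_i
      lower(0) = 1.0; lower(1) = ts(i); lower(2 + i) = -1.0
      Seq(new LinearConstraint(upper, Relationship.GEQ, xs(i)),
          new LinearConstraint(lower, Relationship.LEQ, xs(i)))
    }

    val solution: PointValuePair = new SimplexSolver().optimize(
      new MaxIter(maxIter),
      objective,
      new LinearConstraintSet(constraints: _*),
      GoalType.MINIMIZE,
      new NonNegativeConstraint(false)     // intercept and slope may be negative
    )
    val p = solution.getPoint
    (p(0), p(1))                           // (intercept a, slope b)
  }
}
```

Because this LP can have multiple optima, different iteration caps (or different solver runs) can return different but equally good lines, which is the degeneracy mentioned above.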

If someone were doing this from scratch these days, NNs would be your best bet; regime changes (jobs whose usage pattern shifts abruptly) are a big part of why.


