
The first memorable locality problem I ever solved: I had clocked a function at 10% of overall CPU time. The more I looked at it, the more I realized we were calling it way too often. It wasn’t allocating a lot of memory, it was just crunching a bit of data.

A bit of jiggling of the call structure above it and it went from two invocations per call to one, so almost exactly half as many calls overall. I ran my end-to-end benchmark again, expecting a 4-6% improvement in run time. I got 20%. Twice what I would have expected to get by commenting the function out entirely.

A few years later, similar situation. An operation was taking about 5 times as long as we were allowed. Function A gets a list from the database. Function B gets the same list. I don’t recall what percent of the overall time this was, but it was a tall tent pole and obvious waste. Something like 30%. As a first course for a longer optimization engagement, I changed the method signatures to take the list and return a result, and looked up the list once in the common parent function. Best case I figured a 20% reduction, which would still be slow enough that I could use the stopwatch on my phone to estimate the improvement at the UI. I didn’t get 20%, I got 90%. So I went from 5x over the target to half, in one change. Unfortunately even this result didn’t convince some people they were barking up the wrong tree with respect to performance problems.
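A minimal sketch of that kind of refactor in Java, with invented names (WidgetDao, summarize, validate are placeholders, not the original code): the child functions stop fetching the list themselves and take it as a parameter, and the common parent looks it up once.

    import java.util.List;

    // Hypothetical types standing in for whatever the real list held.
    record Widget(long id) {}
    record Summary(int count) {}
    record Validation(boolean ok) {}
    record Report(Summary summary, Validation validation) {}

    interface WidgetDao { List<Widget> loadWidgets(); }

    class ReportService {
        private final WidgetDao dao;
        ReportService(WidgetDao dao) { this.dao = dao; }

        // Before: summarize() and validate() each fetched the list from the database.
        // After: the common parent looks it up once and passes it down.
        Report buildReport() {
            List<Widget> widgets = dao.loadWidgets();   // single lookup
            return new Report(summarize(widgets), validate(widgets));
        }

        Summary summarize(List<Widget> widgets) { return new Summary(widgets.size()); }
        Validation validate(List<Widget> widgets) { return new Validation(!widgets.isEmpty()); }
    }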

Right around the time of the first experience, I had been learning an ugly truth: profilers lie. They aren’t the floodlight people think they are; they’re a book of matches in the dark. They reveal part of the picture, and cast as many shadows as they do light. You have to fill in a lot of gaps yourself, and that takes mechanical sympathy and more than a little stubbornness. Hardware counters narrow that gap a lot, but they still aren’t perfect, and a lot of languages seem not to use them.




"Performance (Really) Matters" by Emery Berger talks a lot about layout. https://www.youtube.com/watch?v=7g1Acy5eGbE

And on the topic of Coordinated Omission: "How NOT to Measure Latency" by Gil Tene https://www.youtube.com/watch?v=lJ8ydIuPFeU

Injecting slowness into an application to measure the relative impact of that portion of the application has really worked well for me.
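One way to do that (just a sketch of the idea in Java; the Supplier-wrapping shape and names are my own, not anything from this thread) is a tiny wrapper that pads every call to a suspect call site with a fixed delay, then watching how much the end-to-end number moves.

    import java.util.function.Supplier;

    // Minimal sketch: add ~1 ms to every invocation of a suspect call site.
    // If the end-to-end time moves by much less than callCount x 1 ms, that code
    // isn't on the critical path; if it moves by much more, you've found
    // amplification (contention, timeouts, GC) the profiler wasn't showing you.
    final class Slowdown {
        static final long INJECTED_NANOS = 1_000_000; // 1 ms per invocation

        static <T> T slow(Supplier<T> suspect) {
            T result = suspect.get();
            long deadline = System.nanoTime() + INJECTED_NANOS;
            // Busy-wait so the injected cost shows up as CPU time, not just wall time.
            while (System.nanoTime() < deadline) { Thread.onSpinWait(); }
            return result;
        }
    }

For the duration of the experiment, a call like parseRow(row) (a hypothetical suspect) becomes Slowdown.slow(() -> parseRow(row)).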


Once I’ve done the dead-obvious bits, I start looking at invocation counts, on two fronts.

Not enough is said about running a subsystem a deterministic number of times and then evaluating the call counts. Invariably there is some method being called 2x as often as the requirements and architecture dictate (scenario one above). Someone just goofed: they forgot they already had an answer, or part of one.
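A sketch of what that looks like in practice (Java, with invented counter and method names): drive the subsystem a fixed number of times, count the invocations yourself, and compare against what the design says the number should be.

    import java.util.concurrent.atomic.LongAdder;

    final class CallCountCheck {
        static final LongAdder PARSE_CALLS = new LongAdder();

        // Stand-in for the real unit of work; the real one would do the parsing.
        static void parseRecord() { PARSE_CALLS.increment(); }

        // Stand-in for one pass through the subsystem. The requirements say
        // one parse per record, so N iterations should mean exactly N parses.
        static void processOneRecord() {
            parseRecord();
            parseRecord(); // the "someone goofed" duplicate you're trying to catch
        }

        public static void main(String[] args) {
            int iterations = 10_000;
            for (int i = 0; i < iterations; i++) processOneRecord();

            long expected = iterations;
            long actual = PARSE_CALLS.sum();
            System.out.printf("expected %d parse calls, got %d (%.1fx)%n",
                    expected, actual, (double) actual / expected);
        }
    }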

And then there’s the “profilers lie” part. If you have two functions that each take 4.5% of overall time, and one is called 25k times while the other is called 250k times, the chances of bookkeeping errors in the latter are much higher. It could be 4%, or it could be 10%. When in doubt, sort by invocation count and try another optimization pass.

I’ve also seen a variation on this in GC’d languages: one periodic large allocation gets stuck paying for a full GC almost every time, because a dozen little methods are churning out garbage at a tempo that stays just under the GC threshold, and the periodic allocation is what finally tips it over, which keeps ratcheting the heap size toward whatever max you’ve set. To an extent cache thrashing does the same thing: little invalidations make problems for bigger methods.


In situations like this, off-heap collections or an array of primitive values can reduce GC churn and pressure.
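For example (a sketch only, with made-up sizes): a primitive array keeps the data in one GC-friendly allocation with no per-element objects, and a direct ByteBuffer moves it off the Java heap entirely, so neither feeds the churn described above.

    import java.nio.ByteBuffer;

    final class GcQuietStorage {
        // Instead of List<Double> (one boxed object per element, all future garbage),
        // a primitive array is a single contiguous allocation with nothing per element to trace.
        final double[] onHeapSamples = new double[1_000_000];

        // Off-heap: direct buffer contents live outside the collector's reach.
        final ByteBuffer offHeapSamples = ByteBuffer.allocateDirect(1_000_000 * Double.BYTES);

        void record(int index, double value) {
            onHeapSamples[index] = value;
            offHeapSamples.putDouble(index * Double.BYTES, value);
        }
    }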


> Coordinated Omission

Now I have something new to worry about.

Or at least a new name for a nameless terror.

Humans are impatient and hate bad news, which adds to these things. I filed a PR recently against a small benchmarking library because I discovered, when I cranked up the number of iterations, that it was running out of memory.

They were going for simplicity, and doing too much processing along the way complicates things like our friend cache locality, but also branch prediction and thermal throttling.

But they weren’t consolidating between runs either, and the callback sequence prevented you from doing manual cleanup.

Code is hard if you really think about it. Hard as you want it to be (or often, harder).

For injecting complexity, there have been times when I’ve added a for loop to an inner scope to multiply the number of calls and see whether the profiling data follows the expected curve. Plenty of surprises to find there.
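As a concrete sketch (Java, with an invented crunch() standing in for the suspect function): if the profiler says crunch() is 5% of the run, multiplying its calls by 10 should push its share toward roughly 50 / 145, about 34%; if the share barely moves, the original attribution was suspect to begin with.

    final class CallMultiplier {
        static final int MULTIPLIER = 10;

        // The function under suspicion (hypothetical).
        static double crunch(double x) {
            return Math.sqrt(x) * Math.log1p(x);
        }

        // Temporary experiment: inflate the call count and re-profile.
        static double crunchMultiplied(double x) {
            double last = 0;
            for (int i = 0; i < MULTIPLIER; i++) {
                last = crunch(x);
            }
            return last;
        }
    }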





