
The first memorable locality problem I ever solved: I had clocked a function at 10% of overall CPU time. The more I looked at it, the more I realized we were calling it way too often. It wasn’t allocating a lot of memory, it was just crunching a bit of data.

A bit of jiggling of the call structure above it and it went from two invocations per call to one, so almost exactly half as many calls overall. I ran my end-to-end benchmark again, expecting a 4-6% improvement in run time. I got 20%. Twice what I would have expected to get by commenting the function out entirely.

A few years later, similar situation. An operation was taking about 5 times as long as we were allowed. Function A gets a list from the database. Function B gets the same list. I don’t recall what percent of the overall time this was, but it was a tall tent pole and obvious waste. Something like 30%. As a first course for a longer optimization engagement, I changed the method signatures to take the list and return a result, and looked up the list once in the common parent function. Best case I figured a 20% reduction, which would still be slow enough that I could use the stopwatch on my phone to estimate the improvement at the UI. I didn’t get 20%, I got 90%. So I went from 5x over the target to half, in one change. Unfortunately even this result didn’t convince some people they were barking up the wrong tree with respect to performance problems.
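A minimal sketch of that kind of refactor in Java, with invented names (WidgetDao, summarize, validate are placeholders, not the original code): the child functions stop fetching the list themselves and take it as a parameter, and the common parent looks it up once.

    import java.util.List;

    // Hypothetical types standing in for whatever the real list held.
    record Widget(long id) {}
    record Summary(int count) {}
    record Validation(boolean ok) {}
    record Report(Summary summary, Validation validation) {}

    interface WidgetDao { List<Widget> loadWidgets(); }

    class ReportService {
        private final WidgetDao dao;
        ReportService(WidgetDao dao) { this.dao = dao; }

        // Before: summarize() and validate() each fetched the list from the database.
        // After: the common parent looks it up once and passes it down.
        Report buildReport() {
            List<Widget> widgets = dao.loadWidgets();   // single lookup
            return new Report(summarize(widgets), validate(widgets));
        }

        Summary summarize(List<Widget> widgets) { return new Summary(widgets.size()); }
        Validation validate(List<Widget> widgets) { return new Validation(!widgets.isEmpty()); }
    }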

Right around the time of the first experience, I had been learning an ugly truth: profilers lie. They aren’t the floodlight people think they are; they’re a book of matches in the dark. They reveal part of the picture, and cast as many shadows as they do light. You have to fill in a lot of gaps yourself, and that takes mechanical sympathy and more than a little stubbornness. Hardware counters narrow that gap a lot, but they still aren’t perfect, and a lot of languages seem not to use them.




"Performance (Really) Matters" by Emery Berger talks a lot about layout. https://www.youtube.com/watch?v=7g1Acy5eGbE

And on the topic of Coordinated Omission: "How NOT to Measure Latency" by Gil Tene https://www.youtube.com/watch?v=lJ8ydIuPFeU

Injecting slowness into an application to measure the relative impact of that portion of the application has really worked well for me.
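One way to do that (just a sketch of the idea in Java; the Supplier-wrapping shape and names are my own, not anything from this thread) is a tiny wrapper that pads every call to a suspect call site with a fixed delay, then watching how much the end-to-end number moves.

    import java.util.function.Supplier;

    // Minimal sketch: add ~1 ms to every invocation of a suspect call site.
    // If the end-to-end time moves by much less than callCount x 1 ms, that code
    // isn't on the critical path; if it moves by much more, you've found
    // amplification (contention, timeouts, GC) the profiler wasn't showing you.
    final class Slowdown {
        static final long INJECTED_NANOS = 1_000_000; // 1 ms per invocation

        static <T> T slow(Supplier<T> suspect) {
            T result = suspect.get();
            long deadline = System.nanoTime() + INJECTED_NANOS;
            // Busy-wait so the injected cost shows up as CPU time, not just wall time.
            while (System.nanoTime() < deadline) { Thread.onSpinWait(); }
            return result;
        }
    }

For the duration of the experiment, a call like parseRow(row) (a hypothetical suspect) becomes Slowdown.slow(() -> parseRow(row)).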


Once I’ve done the dead-obvious bits, I start looking at invocation counts, on two fronts.

Not enough is said about running a subsystem a deterministic number of times and then evaluating the call counts. Invariably there is some method being called 2x as often as the requirements and architecture dictate (scenario one above). Someone just goofed: they forgot they already had an answer, or part of one.
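A sketch of what that looks like in practice (Java, with invented counter and method names): drive the subsystem a fixed number of times, count the invocations yourself, and compare against what the design says the number should be.

    import java.util.concurrent.atomic.LongAdder;

    final class CallCountCheck {
        static final LongAdder PARSE_CALLS = new LongAdder();

        // Stand-in for the real unit of work; the real one would do the parsing.
        static void parseRecord() { PARSE_CALLS.increment(); }

        // Stand-in for one pass through the subsystem. The requirements say
        // one parse per record, so N iterations should mean exactly N parses.
        static void processOneRecord() {
            parseRecord();
            parseRecord(); // the "someone goofed" duplicate you're trying to catch
        }

        public static void main(String[] args) {
            int iterations = 10_000;
            for (int i = 0; i < iterations; i++) processOneRecord();

            long expected = iterations;
            long actual = PARSE_CALLS.sum();
            System.out.printf("expected %d parse calls, got %d (%.1fx)%n",
                    expected, actual, (double) actual / expected);
        }
    }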

And then there’s the “profilers lie” part. If you have two functions that each take 4.5% of overall time, and one is called 25k times while the other is called 250k times, the chances of bookkeeping errors in the latter are much higher. It could be 4%, or it could be 10%. When in doubt, sort by invocation count and try another optimization pass.

I’ve also seen a variation on this in GC’d languages: one periodic large allocation gets stuck paying for a full GC almost every time, because a dozen little methods are churning out garbage at a tempo that stays just under the GC threshold, and the periodic allocation is what finally tips it over, which keeps ratcheting the heap size toward whatever max you’ve set. To an extent cache thrashing does the same thing: little invalidations make problems for bigger methods.


In situations like this, off-heap collections or an array of primitive values can reduce GC churn and pressure.
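For example (a sketch only, with made-up sizes): a primitive array keeps the data in one GC-friendly allocation with no per-element objects, and a direct ByteBuffer moves it off the Java heap entirely, so neither feeds the churn described above.

    import java.nio.ByteBuffer;

    final class GcQuietStorage {
        // Instead of List<Double> (one boxed object per element, all future garbage),
        // a primitive array is a single contiguous allocation with nothing per element to trace.
        final double[] onHeapSamples = new double[1_000_000];

        // Off-heap: direct buffer contents live outside the collector's reach.
        final ByteBuffer offHeapSamples = ByteBuffer.allocateDirect(1_000_000 * Double.BYTES);

        void record(int index, double value) {
            onHeapSamples[index] = value;
            offHeapSamples.putDouble(index * Double.BYTES, value);
        }
    }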


> Coordinated Omission

Now I have something new to worry about.

Or at least a new name for a nameless terror.

Humans are impatient and hate bad news, which adds to these things. I filed a PR recently against a small benchmarking library because I discovered, when I cranked up the number of iterations, that it was running out of memory.

They were going for simplicity, and doing too much processing along the way complicates things like our friend cache locality, but also branch prediction and thermal throttling.

But they weren’t consolidating between runs either, and the callback sequence prevented you from doing manual cleanup.

Code is hard if you really think about it. Hard as you want it to be (or often, harder).

For injecting complexity, there have been times when I’ve added a for loop to an inner scope to multiply the number of calls and see whether the profiling data follows the expected curve. Plenty of surprises to find there.
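As a concrete sketch (Java, with an invented crunch() standing in for the suspect function): if the profiler says crunch() is 5% of the run, multiplying its calls by 10 should push its share toward roughly 50 / 145, about 34%; if the share barely moves, the original attribution was suspect to begin with.

    final class CallMultiplier {
        static final int MULTIPLIER = 10;

        // The function under suspicion (hypothetical).
        static double crunch(double x) {
            return Math.sqrt(x) * Math.log1p(x);
        }

        // Temporary experiment: inflate the call count and re-profile.
        static double crunchMultiplied(double x) {
            double last = 0;
            for (int i = 0; i < MULTIPLIER; i++) {
                last = crunch(x);
            }
            return last;
        }
    }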





