For most code, you don't know how fast it "ought to" be, so you don't even know there is a problem. Most people discover there is a problem only when somebody comes up with a faster program. Then the program you thought was fast instantly becomes slow.
So, in general it is best to just assume there is a problem, and that the program really ought to be running 2x or 10x faster. Then, discover why it isn't, and what must be done to get it there. Often enough, having got to 2x, a new bottleneck is slowing it down, and you may start over and get out another 2x. And again.
Sometimes it is best to release your intermediate improvements, and space them out a ways; three 2x releases can be much better for your reputation than a single 8x release, and no more work.
Also, sometimes clearing a bottleneck (apparently) doesn't end up helping, because you have just lurched from one bottleneck to another. Don't be discouraged by this: it's normal. You're digging a ditch across an oxbow, and halfway doesn't help anybody. You just need to keep digging. When you get there, the flow takes off, and you are a magician.
> For most code, you don't know how fast it "ought to" be
This is exactly it. I was working on a "fastest substring search" challenge back in 2016 for some prize money and I was trying to think about the methodology I should use to win the competition.
I realized that the biggest problem was not actually the problem of optimizing the substring search defined by the competition, but rather knowing when my entry would almost certainly be faster than the other entries.
After that, the next biggest problems were still not the substring problem, but the types of machine the competition organizers would benchmark on, and the various sizes of the input data they would use.
I decided to start with the practical limit and then work backwards from there. In other words, measure something almost like an identity function followed by a simple linear for loop buffer scan, and then use this "null-cipher" limit as my competitor, because there's nothing faster than doing almost no work.
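Something like this rough sketch, for illustration only (C++ here, and the buffer size and timing loop are my own assumptions, not the competition harness):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Hedged sketch of a "null-cipher" baseline: touch every byte once with a
// trivial linear scan and time it. Nothing that actually searches the buffer
// can beat this, so it serves as the imaginary competitor to measure against.
int main()
{
    std::vector<unsigned char> haystack(64 * 1024 * 1024, 'a');  // assumed size

    auto t0 = std::chrono::steady_clock::now();
    unsigned long long sum = 0;
    for (unsigned char b : haystack)
        sum += b;                                 // near-identity work per byte
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("null-cipher limit: %.2f ms (checksum %llu)\n", ms, sum);
}
```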
All this led away from theoretically optimal algorithms such as Aho–Corasick towards something based on Rabin-Karp that could also switch dynamically to a second, fallback brute-force algorithm optimized for small data, depending on the length of the input. There was also a curve of precomputed branch points in the algorithm to know when to switch, calculated according to the type of machine the organizers would use, probably server grade rather than desktop.
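Roughly the shape of that switching hybrid, as a hedged sketch rather than the actual entry: the function names, the hash base, and the single kSmallCutoff constant are all illustrative stand-ins (the real entry used a precomputed curve of branch points, not one threshold):

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Brute force for small inputs: std::string_view::find stands in here, since
// it is essentially a tuned linear scan in common standard library implementations.
std::size_t find_brute(std::string_view hay, std::string_view needle)
{
    return hay.find(needle);
}

// Rabin-Karp rolling hash for large inputs, with arithmetic mod 2^64 and an
// explicit verification on every hash hit.
std::size_t find_rabin_karp(std::string_view hay, std::string_view needle)
{
    const std::size_t n = hay.size(), m = needle.size();
    if (m == 0) return 0;
    if (m > n)  return std::string_view::npos;

    const std::uint64_t base = 257;
    std::uint64_t power = 1, hn = 0, hh = 0;   // base^(m-1), needle hash, window hash
    for (std::size_t i = 0; i < m; ++i) {
        hn = hn * base + static_cast<unsigned char>(needle[i]);
        hh = hh * base + static_cast<unsigned char>(hay[i]);
        if (i + 1 < m) power *= base;
    }
    for (std::size_t i = 0; ; ++i) {
        if (hh == hn && hay.compare(i, m, needle) == 0) return i;  // verify hit
        if (i + m == n) return std::string_view::npos;
        hh = (hh - static_cast<unsigned char>(hay[i]) * power) * base
             + static_cast<unsigned char>(hay[i + m]);             // roll the window
    }
}

constexpr std::size_t kSmallCutoff = 4096;     // illustrative, not the real curve

std::size_t find_hybrid(std::string_view hay, std::string_view needle)
{
    if (hay.size() < kSmallCutoff)
        return find_brute(hay, needle);        // small data: skip hashing overhead
    return find_rabin_karp(hay, needle);       // large data: rolling hash wins
}
```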
Learning this lesson was probably the most important one for me so far. These days when I start on a design problem, I figure out what the "null-cipher" performance would be (this kind of number is usually fairly cheap to obtain: like the speed of light in a vacuum if you were optimizing fiber links, or the max hello-world HTTP requests per second for a typical server), and then use back-of-the-envelope numbers to work backwards from there to know where the implementation should reasonably be expected to end up if there are no performance bugs. This way you know you've got a winner.
This is the exact opposite of starting with something and then trying to make it faster, and I find it saves wasted debugging time by avoiding implementations that can never be made fast enough for the requirements.
By a hair, 1ms, on the sprint data set (76ms vs 77ms), but by 2x on the long-distance marathon data set, compared to the next runner-up. They had also entered a switching hybrid with two types of "runner" in terms of algorithm, but with a single switch point instead of a curve, so their sprint algorithm wasn't always used when it should have been, and their long-distance algorithm was more theoretically optimal but slower in practice, with pointer chasing and more cache misses on real hardware.
This was also in JavaScript, on a literal V8 input data string instead of a Node.js buffer, and the thing that really helped was taking time upfront to recode the string to a buffer first. It sounds crazy, slow and unintuitive, but byte operations on a buffer are way faster than charCodeAt() calls on a string, and this more than made up for the upfront recoding. Of all the prototypes I tried, this one came the closest to the "null-cipher" imaginary competitor I had, within a few percent, so I submitted...
> For most code, you don't know how fast it "ought to" be, so you don't even know there is a problem
One way to build up intuition about this is to start predicting in advance how many clock cycles your newly written functions will take to run. It helps when you keep your functions small and focused on a single task.
Before benchmarking some code, try to estimate in advance: "How many arithmetic operations, how many memory fetches, how many cache misses, how many branches, etc.?" Then multiply each count by its respective clock-cycle cost, add them up, divide by the chip's frequency, and voila: an estimate. Probably not that accurate, but at least you're building a mental model.
Now run the code section through Valgrind, and compare what you got right and wrong. If you keep repeating this exercise, you'll get substantially better at predicting code performance ahead of time. The major benefit of that is that you'll tend to write high performance code, even before you optimize or profile it.
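As a concrete version of the exercise, here is the kind of function to estimate by hand before checking it; the numbers in the comment are rough assumptions about a generic modern x86 core, not measurements:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Back-of-the-envelope estimate, per element: 1 load, 1 add, 1 compare+branch.
// With out-of-order execution this should be throughput-bound rather than
// latency-bound, so guess roughly 1-2 cycles per element: ~1-2 million cycles
// for n = 1e6, i.e. a few hundred microseconds at ~3-4 GHz.
// Then check the guess with:  valgrind --tool=cachegrind ./a.out
// and compare the instruction counts and cache misses against the estimate.
long long sum_array(const int* data, std::size_t n)
{
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];          // 1 load + 1 add + loop compare/branch
    return total;
}

int main()
{
    std::vector<int> v(1'000'000, 1);
    std::printf("%lld\n", sum_array(v.data(), v.size()));
}
```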
Like any skill, the key to improvement is fast and accurate feedback cycles combined with repeated practice.
Our collective "deal with the devil" is that, with all the various caches and predictors, ordinary code runs much faster than it has any business running, but it is impossible to be certain, up front, of the effects of any particular choice. Doing something clever is as likely to make code slower as it is to make it faster.
Still, it is true that practice improves intuition, and a habit of measuring helps in learning to avoid spending extra time to actually make code worse. But microarchitectures have become so complicated and so poorly documented that it amounts to a guessing game to discover why a change had the effect it did.
Compiler optimizers are always a few steps behind, often doing something that would once have been clever but is now foolish. For example, it was once smart to intersperse arithmetic operations between branch instructions to fill pipeline holes, but those can now prevent macro-op fusion of the compare and branch, making an innermost loop take two cycles where it could take one, a 2x penalty overall. Telling GCC to optimize for Zen 2 seems to result in much slower code, on Zen 2, if the code had already been tuned.
I recently got a 2x improvement in Quicksort, applied to random input, which on its face should have been impossible for such a mature algorithm. We have learned to avoid pointer chasing and special-casing, but now a branch over extra work may take much longer than the work it skips, if it is predicted badly.
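Not the actual change, but a generic sketch of that kind of trade: a partition pass where the badly predictable branch is replaced by unconditional work plus a data-dependent index increment, which the compiler can turn into straight-line code:

```cpp
#include <cstddef>
#include <cstdio>
#include <utility>

// Instead of "if (a[i] < pivot) { swap; ++store; }", always swap and advance
// the destination by the comparison result. More work per element, but no
// branch to mispredict on random input. (Self-swap when store == i is a no-op.)
std::size_t partition_branchless(int* a, std::size_t n, int pivot)
{
    std::size_t store = 0;
    for (std::size_t i = 0; i < n; ++i) {
        bool lt = a[i] < pivot;        // 0 or 1, evaluated before the swap
        std::swap(a[store], a[i]);     // unconditional swap
        store += lt;                   // keep it only when a[i] was < pivot
    }
    return store;                      // a[0..store) < pivot, a[store..n) >= pivot
}

int main()
{
    int v[] = { 7, 2, 9, 4, 1, 8, 3 };
    std::size_t k = partition_branchless(v, 7, 5);
    std::printf("split at %zu:", k);
    for (int x : v) std::printf(" %d", x);
    std::printf("\n");
}
```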
With the end of Dennard Scaling, just waiting doesn't make our programs faster anymore. As Moore's law, which still operates, lets microarchitecture complexity grow without bound, the improvement possible given extra attention increases, but the amount of skill needed to extract it grows too. As it grows, the number of people who can do it falls faster, and skepticism that any particular individual has it grows, particularly around languages that offer less direct control. As it gets harder, the range of applications that merit such extra work and attention narrows. Often, now, in places where better performance is most needed, no one on hand understands how to get it, or even knows it is available.
Graphical profilers are really helpful for identifying hot spots in your code, and in my experience usually lead to a 2-5x improvement in performance just by identifying allocator pressure and algorithmic issues. Go may be an annoying language, but its graphical profiler is great! Xcode's profiler tools are pretty nice as well.
Fast and slow are relative, that’s for sure. But assuming there is a problem before discovering it... isn’t that the definition of premature optimization?
I once watched another developer spend an evening creating a binary tree implementation to do round-robin selection of servers to connect to, in order to avoid simply iterating over an array and/or having to reinitialize the array when a server is added or removed. We had a total of 4 servers at the time and expected no more than 8, ever, for this particular thing. BTW, this was written in PHP on top of its built-in array().
You would only ever do this on code where performance matters. You ask yourself: if I made this twice as fast (or half as slow) would anyone know or care? If yes, then it's not premature, it's just garden-variety optimization.
But choosing the right bit of code to work on needs some investigation. The assumption here is that this has already happened.
> Fast and slow are relative, that’s for sure. But assuming there is a problem before discovering it... isn’t that the definition of premature optimization?
That works if you have a definite performance target, say "less than 5ms". Not if your performance target is "it must be faster by a good margin than all the competitors we know and don't know about, and stay that way".
> For most code, you don't know how fast it "ought to" be, so you don't even know there is a problem.
Notable exceptions being realtime audio and video games, as both have a rigid time budget. If I try to run 1000 sinewave oscillators, I can literally hear if my code is too slow (for the given machine).
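A hedged sketch of that budget check, with typical but assumed numbers (48 kHz sample rate, 256-sample callback, 1000 naive sin() oscillators); the work either fits in bufferSize / sampleRate seconds or you hear the dropout:

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    constexpr double kPi        = 3.14159265358979323846;
    constexpr double sampleRate = 48000.0;   // assumed
    constexpr int    bufferSize = 256;       // assumed
    constexpr int    numOsc     = 1000;
    const double budgetMs = 1000.0 * bufferSize / sampleRate;   // ~5.33 ms

    std::vector<double> phase(numOsc, 0.0);
    std::vector<float>  out(bufferSize, 0.0f);
    const double inc = 2.0 * kPi * 440.0 / sampleRate;          // 440 Hz per oscillator

    auto t0 = std::chrono::steady_clock::now();
    for (int o = 0; o < numOsc; ++o)
        for (int i = 0; i < bufferSize; ++i) {
            out[i] += static_cast<float>(std::sin(phase[o]));   // naive per-sample sin()
            phase[o] += inc;
        }
    auto t1 = std::chrono::steady_clock::now();

    double tookMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("budget %.2f ms, naive sin() took %.2f ms (out[0]=%f)\n",
                budgetMs, tookMs, out[0]);
}
```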
When you have a deadline, you get a bright line between "fast enough" and "too slow", and no immediately apparent benefit from being faster than just enough.
In such cases, if it is known that more work could be done, more work can often be found to do, but it is hard to know without actually trying to do it. If it is not certain the extra work could be finished in time, it might be wrong to start it instead of doing something else simpler and more certain.
I have only observed this problem in a few extreme cases. Once I unrolled a 24x24 matrix*vector multiplication into 72 AVX instructions. Another time I implemented a JIT to generate a large volume of AVX2 code at runtime.
In the other 99% of cases, inlining did better despite producing more code.
Could you expand on the 24x24 unrolling with AVX? How big a benefit did you obtain, compared to, say, using a library such as Armadillo, Fastor, Eigen, etc.?
Eigen was my baseline. As far as I remember, my best AVX version (which unrolls 24x4 blocks, running 6 iterations of the loop) was about 50% faster.
Eigen is designed on the assumption that these vectors/matrices are large. 24 is not large; such a vector fits in 3 of the 16 available vector registers. Eliminating the data-access latency of the accumulators is what caused the speedup.
Libraries don’t cut it, IMO. Sometimes you want your complete input data in registers, 4x4 matrices fit there nicely. Other times you want to stream stuff from memory, the above-mentioned 24x24 multiplication streams the complete matrix, while loading 4-long chunks of the vector.
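For a rough idea of the shape of such a kernel, here is my own sketch rather than the code described above; it assumes a column-major float matrix, AVX2 + FMA, and compilation with something like -mavx2 -mfma:

```cpp
#include <cstdio>
#include <immintrin.h>

// y = A * x for a 24x24 float matrix stored column-major. The 24-element
// result lives in three 8-lane accumulators; each loop iteration consumes a
// 24x4 block (4 columns), so the loop runs 6 times. With the constant-bound
// loops unrolled, that is 24 columns x 3 FMAs = 72 FMA instructions, roughly
// matching the instruction count mentioned above.
void matvec24(const float* A /* column-major 24x24 */, const float* x, float* y)
{
    __m256 acc0 = _mm256_setzero_ps();   // y[0..7]
    __m256 acc1 = _mm256_setzero_ps();   // y[8..15]
    __m256 acc2 = _mm256_setzero_ps();   // y[16..23]

    for (int j = 0; j < 24; j += 4)      // 6 iterations, 4 columns each
    {
        for (int k = 0; k < 4; ++k)
        {
            const float* col = A + (j + k) * 24;          // stream the matrix
            __m256 xj = _mm256_broadcast_ss(x + j + k);   // 4-long chunks of x
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(col),      xj, acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(col + 8),  xj, acc1);
            acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(col + 16), xj, acc2);
        }
    }
    _mm256_storeu_ps(y,      acc0);
    _mm256_storeu_ps(y + 8,  acc1);
    _mm256_storeu_ps(y + 16, acc2);
}

int main()
{
    float A[24 * 24] = {};                               // identity matrix
    for (int i = 0; i < 24; ++i) A[i * 24 + i] = 1.0f;
    float x[24], y[24];
    for (int i = 0; i < 24; ++i) x[i] = static_cast<float>(i);
    matvec24(A, x, y);
    std::printf("y[23] = %f (expect 23)\n", y[23]);
}
```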