I'm sad to see that the one line of code, with its magic number and comparatively inscrutable contents, is entirely undocumented. Please, if you're doing things like that: a good commit message is necessary but not sufficient. (Yes, `git blame` exists, but you're far less likely to consult it than a comment sitting right there in the code.) A blog post is handy but not sufficient. A good inline comment is invaluable when you return to the code later and the situation has changed, as it typically will in time for such precise, empirically derived performance tweaks.
// This single line boosts performance by a factor of around 2 on GCC.
// The optimal lookahead distance i+3 was chosen by experimentation.
// See http://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/
Yeah, you're right: had I not written the code myself, it would not have been clear to me, without documentation, what that special line did or why it uses i+3. Your proposed comment describes it nicely; I've added it verbatim.
I agree with your general sentiment and was initially inclined to make the comment more verbose and not include the URL; I would certainly not be content with just a URL. Still, in this case I would consider it a fair compromise, primarily because the inclusion of the URL is non-essential but value-adding. That is, the two important points (the purpose of the code and the reason for the number 3) are captured by the lines above, while the blog post adds more information which could potentially be of use later. Furthermore, the author of the code is the author of the blog post, which increases the probability of its survival.
Yes, the +3 deserves an explanation (but something better than "because of experimentation"). My guess is that it's big enough to fetch the next cache line and small enough not to fetch something that won't be needed.
And I find it funny when people on HN are shibeing __builtin_prefetch like "wow, I just heard about caches and prefetching data how amazing"
You can get another 2x speedup if you choose the pivot better. Right now it seems you're doing median of 3.
When you're doing Order Statistics (i.e. Selection/Quickselect) with 1M elements, you should be very, very close to the theoretical optimum (1.5N) in complexity. See my comment here:
Neat! A quick skim of the referenced article in that comment (http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6193457) shows that it solves a problem I'd wondered about for a while: how to optimally choose the pivots when computing order stats. (Yeah, I'm just doing median of 3 right now.) If you want to sort the whole list, then clearly you want the pivots to be as near to the median as possible. But if you just want an order stat, that's not necessarily true; if you wanted, say, the 10th percentile, my gut feeling would be that you should aim for your first pivot to be roughly the 20th percentile or so. I'll read that article more closely and see if I can improve even more.
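For concreteness, the median-of-3 choice is nothing more sophisticated than the sketch below (on plain doubles; the real code compares PyObject pointers through the Python comparison API):

  #include <stddef.h>

  /* Minimal sketch: return the index of the median of a[lo], a[mid], a[hi]. */
  static size_t median_of_3(const double *a, size_t lo, size_t hi)
  {
      size_t mid = lo + (hi - lo) / 2;
      if (a[lo] < a[mid]) {
          if (a[mid] < a[hi]) return mid;        /* lo < mid < hi */
          return (a[lo] < a[hi]) ? hi : lo;      /* mid largest: max of lo, hi */
      } else {
          if (a[lo] < a[hi]) return lo;          /* mid <= lo < hi */
          return (a[mid] < a[hi]) ? hi : mid;    /* lo largest: max of mid, hi */
      }
  }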
Great article. But it seems somewhat unfair to characterize this as a 'one-line change'. He had to run various performance experiments that will be invisible to someone else reading the code, and the optimal prefetch distance might silently shift over time, say if each object grows larger. It's a little like the Carmack magic number in that respect: https://en.wikipedia.org/wiki/Fast_inverse_square_root
I've had this idea for some time that our representation of programs is fundamentally broken, because it excludes a detailed characterization of the input space. I took a first stab at a solution with http://akkartik.name/post/tracing-tests, which lets me write all my unit tests at the top level, showing how each piece of functionality augments the space of inputs my program can address. See, for example, https://github.com/akkartik/wart/blob/3039fe6d03/literate/02.... I suspect reading other people's 'code' could be an order of magnitude easier and more convenient if 'code' included the right information. I'm still working to flesh this idea out.
> No way, substantial speedups can really only come from algorithm changes.
I've often found the reverse to be true: implementation details matter most.
The best fixes tend to be altering my code so that the compiler/CPU can be smarter. In this case, the author gave the CPU a cache hint. Other times, I've found that looping in a different order improves my cache hit rate. Other times, properly naming variables and not reusing the same name allows the compiler to do smarter things.
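The loop-order point in particular is just the classic row-major vs. column-major traversal; a toy sketch (mine, nothing to do with the article's code):

  #include <stddef.h>

  /* Both functions sum the same matrix, but the first walks memory
   * contiguously (row-major, as C stores it), while the second strides
   * across it, touching a new cache line on almost every access. */
  double sum_row_major(const double *m, size_t rows, size_t cols)
  {
      double s = 0.0;
      for (size_t r = 0; r < rows; r++)
          for (size_t c = 0; c < cols; c++)
              s += m[r * cols + c];   /* consecutive addresses */
      return s;
  }

  double sum_col_major(const double *m, size_t rows, size_t cols)
  {
      double s = 0.0;
      for (size_t c = 0; c < cols; c++)
          for (size_t r = 0; r < rows; r++)
              s += m[r * cols + c];   /* stride of cols doubles */
      return s;
  }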
Other times, when I've tried a new algorithm, it's been slower because of bad constants, or because I've accidentally messed up the implementation enough to negate all the benefits.
As a complete side note, you need to be careful when using "older" algorithms, i.e. those published pre-2000. I once implemented, based on a Hoare paper, a doubly linked list for sparse matrices, and it was significantly slower than the naive method, simply because CPU caches have gotten really good while predicting linked-list memory locations has not.
This is all true, but I've never seen cache optimizations like that cause more than a 2x difference on average, or 5x in very rare extreme cases. That is very far from what you can get from algorithmic Big-O changes, where it is easy to get 1000x speedups.
That's true. Having been briefly in academia, I used to evaluate algorithms based only on Big-O, and just a bit of practical experience taught me that there tends to be a sweet spot, where the O(n^2.7) algorithm is much faster than your naive implementation, yet is simpler to code and has better constants and better cache behavior than the O(n^2.7003) one.
I don't think this gain came from pre-fetching anything. The compiler may simply have decided to treat the entire function, or even the entire file, differently because of that one line.
I compiled (x86, GCC 4.8) lazysort.c both with and without the prefetch intrinsic, and there were substantial differences in code surrounding calls in to Python, but no change at all to the guts of the partition or sorting routines.
Regardless of what is responsible, it's not good to leave something like this in your code without really understanding what it did. There may have been something else that occurred incidentally from this change that could have led you to a better solution.
> I don't think this gain came from pre-fetching anything.
It may have, and it's possible that it doesn't show up on some CPUs. See this thread for an example of how difficult it is to find a good generic prefetching size:
https://lkml.org/lkml/2013/10/11/534
>> The moral of the story, at least for me, is that you can occasionally get substantial improvements by understanding what is actually happening under the hood of your application, rather than fundamentally changing your application itself.
Also known as the rule of leaky abstractions, as in all abstractions are leaky.
If you are talking about abstracting (i.e. factoring out) similar bits of code into one place, whether it's via polymorphism or refactoring into functions, then no, that is not a "leaky abstraction".
I am talking about abstraction as in creating a simplified / unified interface over some complex thing. Abstracting the idea of something from its concrete realizations into its own concept. OSI networking layers, drivers for hardware, modeling real world systems, etc.
If you want to be pedantic instead of snappy, you can change my statement to "all abstraction layers are leaky".
This jumped out at me too. GCC's documentation says:
Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid. For example, a prefetch of p->next does not fault if p->next is not a valid address, but evaluation faults if p is not a valid address.
This seems to make the code as presented incorrect (i.e. it results in undefined behavior).
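If so, the fix would presumably be to guard the index so the pointer load itself stays in bounds; a sketch, with ob_item and n standing in for the list's item array and length, and the actual comparison elided:

  #include <stddef.h>

  /* The prefetch target may be any address, but evaluating ob_item[i + 3]
   * past the end of the array is what triggers the undefined behavior,
   * so only do it while the index is still valid. */
  static void partition_pass(void **ob_item, size_t n)
  {
      for (size_t i = 0; i < n; i++) {
          if (i + 3 < n)
              __builtin_prefetch(ob_item[i + 3]);
          /* ... compare ob_item[i] against the pivot here ... */
      }
  }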
It was a long involved bug chase, but some django middleware had an optional cache of some s3 information. The actual variable was self._something, with an @property on self.something so that the getter would backfill the metadata if it didn't exist.
In the code, it was checking if self.something existed, and then doing something else if it didn't. This triggered the lazy getter, and had the effect of pulling in the metadata for a 300k item s3 bucket.
Checking for self._something instead correctly returned None; the code then checked s3 for the item and uploaded it directly.
Ya, the title is pretty link bait-y. Just think of all the people who clicked the link and then found out they have no interest in sorting.
Title: I saved the world!!!#$^%!1111
Content: I was about to step on an ant today and narrowly missed it, saving the universe and all of existence for that ant.
I would expect that the magic number 3 is related to the size of a cache line relative to the size of an array element, and to the time spent in each loop iteration. You want to prefetch the next cache line so that it is just available by the time you want to start processing it (and, ideally, not a moment earlier, because that would push data out of the cache that could still be used before the requested data is needed).
A robust implementation should figure out the size of a cache line and the load delay, and derive the magic offset from that.
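Something along these lines, as a sketch (the cycle counts are placeholders you'd have to measure, and _SC_LEVEL1_DCACHE_LINESIZE is a glibc/Linux extension):

  #include <unistd.h>

  /* Issue the prefetch roughly latency/work iterations ahead, so the
   * line arrives just as the loop reaches it. */
  static long prefetch_distance(long miss_latency_cycles, long cycles_per_element)
  {
      long dist = miss_latency_cycles / cycles_per_element;
      return dist > 0 ? dist : 1;
  }

  /* If you prefetch the pointer array itself rather than the objects it
   * points to, one prefetch already covers this many consecutive entries. */
  static long pointers_per_cache_line(void)
  {
      long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
      if (line <= 0)
          line = 64;   /* common fallback when the value isn't reported */
      return line / (long)sizeof(void *);
  }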
Er, are you aware of the C++ standard library's std::nth_element() function? (Even my "pure C" programs tend to be compiled with g++ to gain access to that and std::sort().)
Now, I didn't just "know" to prefetch the PyObject three indices forward in the array. I arrived at this value through experimentation.
There is a good chance that 3 is specific to the hardware you tested on. Different systems will have different memory latencies (and NUMA systems can make it even more complicated). It isn't likely that 3 will turn into a pathological case on any other hardware, but it may well end up being less optimal than say 2 or 4.
This sort of thing is why porting naive x86 code to the GPU programming model (scalar programs aka 'kernels' streamed through the device using a configuration) often results in an amazing 15-20x speedup over 6 cores, while the expected speedup is only 5-7x. x86 compilers often don't get the prefetching right in loops using only their heuristics, so you have to talk to them. Then you begin to doubt the compiler more and more, carving out a few percentage points here and there by adding vector intrinsics, loop unrolling and so on. And before you notice, your codebase is bigger, more complex and more error-prone than if you had ported it to the GPU from the beginning.
Please note: of course this insight doesn't apply in your case, since your algorithm is meant for general-purpose usage, not HPC programming on systems where you have the hardware under your control. I'm simply urging people to whom the latter applies to think about whether their algorithms might be suitable for the GPU instead; doing lots of prefetch instructions and loop unrolls is a sign of that.
One thing that wasn't clear to me in your explanation is that you're sorting a list of pointers to objects. To sort that list you're comparing two of the objects themselves.
// follows pointer to access *ob_item[i]
IF_LESS_THAN(ob_item[i], pivot)
So the point of __builtin_prefetch is to dereference this pointer in advance and so avoid the latency of the 12 million reads scattered all over memory. Nice.
Another useful thing to do here is to see if you can find a "streaming operator" to dismiss *ob_item[i] and ob_item[i] from cache after the comparison. They won't be needed again for a while.
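Two knobs I know of that come close, as a sketch (whether either actually pays off here would have to be measured):

  #include <stddef.h>
  #include <emmintrin.h>   /* _mm_clflush (SSE2) */

  /* __builtin_prefetch's third argument is a temporal-locality hint; 0 says
   * the data need not be kept in cache after the access. _mm_clflush goes
   * further and evicts the line holding a given address outright. */
  static void compare_step(void **ob_item, size_t i, size_t n)
  {
      if (i + 3 < n)
          __builtin_prefetch(ob_item[i + 3], 0 /* read */, 0 /* no reuse */);

      /* ... compare *ob_item[i] against the pivot here ... */

      /* Drop the object we just compared. Flushing &ob_item[i] as well would
       * also evict its neighbours in the same cache line, so that part is
       * probably not worth it for the pointer array. */
      _mm_clflush(ob_item[i]);
  }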
Another good article on this optimisation (mentioned by OP)
This is a great post. I guess the idea here is that you can start bringing future objects to be compared up through the cache hierarchy while you are doing the comparison on the current object. If that's the case, then I think the speedup from prefetching will depend on the speed of the comparison which in turn will depend on the type of objects in the collection.
In your post it says you have a collection of 10MM elements but you didn't say what they were. Are they just ints or something else?
Oh, thanks for pointing out the oversight! They were floats generated with random.random() and then shuffled with random.shuffle(). Shuffling them removes any locality that might have arisen when the floats were being generated, even though the distribution remains uniform on (0, 1). (And indeed, empirically the code runs faster if you don't shuffle them.)
Yeah, I'm not sure how well this would work if you were sorting different objects. I would guess that it depends on two things--how large the objects are in memory, and how long it takes to compare them.
I was thinking about that when I read the article too: you've obviously done a lot of testing, but the runtime's allocator algorithm is going to have an effect on this, and allocating N floats in a loop will perform differently than if the floats were allocated over time (with other allocations mixed in) or if the objects were larger. Especially if the comparison function requires accessing object members from other memory locations... The magic value 3 might have a lot to do with the number of objects that fit in the cache or a page.
Anyway, it's an interesting article and an interesting library.
The prefetch will always pull in the same amount of data, no matter what the size of the Python object in question is. That leads to two results. 1) You can miss if your cache line size is less than the Python object size. 2) You can miss if your Python object is not aligned to a cache line and the comparison touches the part that sits in the next cache line.
If these two conditions apply, then there is a limit of 2 on the speedup, and increasing the Python object size reduces the effect of the prefetch, though it still helps.
Another thing, about the prefetch distance: I'd pick 5 based on his testing data.
1) It has no downside in the current tests.
2) Because the prefetch is of constant size, any case that's hurt by picking a distance of 5 instead of 3 is unlikely to see a meaningful effect at any prefetch distance.
3) It leaves room for the comparison to get faster, e.g. if there were a JIT on the Python side to specialize the test.
4) It allows a nice factor of improvement in core execution speed relative to memory latency.
5) It lets you handle page misses better.
The prefetch may take the same amount of time, but the number of objects that it pulls into cache depends on their size and allocation pattern. If they are large or were allocated over time (= from different memory blocks) fewer will be pulled in. It seems to me that this would make prefetch less effective. That is, if the thing you are trying to prefetch was pulled in because it was close enough to the last one, prefetch is a no-op.
Funny, I just experimented with __builtin_prefetch this week, and got no speedup. Does anyone know whether the kernel list.h always triggers hardware prefetching when going over a list?
BTW, my case wasn't the problem the kernel maintainers encountered with small lists, detailed here:
I'm pretty sure you could have saved yourself a lot of work by just using profile guided optimisation (PGO). It's mature & ready to go in GCC at least.
(Integrating it into your build process is a little bit more challenging, I admit. I've set it up in a large dev environment used by multiple large projects, and it was well, well worth the effort.)
It's one of those things... micro-optimisation becomes the big win once you've done enough algorithm optimisation. Even the humble memory copy becomes faster from prefetching, clever use of registers... and from actually making the algorithm theoretically worse. :P
Also, note that it is possible to remove the magic from your magic number with some thought about latency, the size and alignment of your data, etc. Fetching a cache line takes a fairly predictable number of cycles.
The opposite of the prefetch is the non-temporal reads/writes, where you deliberately don't pollute the cache, to prevent big one-off copies from damaging the performance of other code...
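For the copy case that looks something like the sketch below (assumes SSE2, 16-byte-aligned buffers and a size that's a multiple of 16):

  #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
  #include <stddef.h>

  /* Non-temporal copy: the stores bypass the cache, so a huge one-off
   * memcpy doesn't evict the rest of the working set. */
  static void stream_copy(void *dst, const void *src, size_t bytes)
  {
      __m128i *d = (__m128i *)dst;
      const __m128i *s = (const __m128i *)src;
      for (size_t i = 0; i < bytes / 16; i++)
          _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
      _mm_sfence();   /* make the streamed stores visible in order */
  }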
Interesting post! I wonder if the author experimented with different gcc command-line optimisation options as well? gcc might be able to insert some prefetches by itself with some settings?
Thanks! I tried -fprefetch-loop-arrays before figuring out that it only prefetches the array memory (i.e., memory with addresses like &ob_item[i]), rather than the memory referenced by the array (i.e., memory with addresses like ob_item[i]). And empirically, adding -fprefetch-loop-arrays got me no speedups.
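In code the difference is just which expression gets handed to the builtin (a sketch, with void ** standing in for the list's PyObject ** item array):

  #include <stddef.h>

  static void prefetch_targets(void **ob_item, size_t i)
  {
      /* The slot in the pointer array itself: roughly all that
       * -fprefetch-loop-arrays knows how to do on its own. */
      __builtin_prefetch(&ob_item[i + 3]);

      /* The object that slot points to: the memory that actually misses,
       * which the compiler can't see through automatically. */
      __builtin_prefetch(ob_item[i + 3]);
  }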
I didn't try any other GCC command line arguments because I honestly didn't know what other relevant options there were.