I'm sad to see that the one line of code, with its magic number and comparatively inscrutable contents, is entirely undocumented. Please, if you're doing things like that: a good commit message is necessary but not sufficient. (Yes, `git blame` exists, but you're far less likely to consult it than a comment sitting right there in the code.) A blog post is handy but not sufficient. A good inline comment is invaluable when you return to the code later and the situation has changed, as it typically will in time for such precise, empirically derived performance tweaks.
// This single line boosts performance by a factor of around 2 on GCC.
// The optimal lookahead distance i+3 was chosen by experimentation.
// See http://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/
Yeah, you're right: had I not written the code myself, it would not have been clear to me, without documentation, what that special line did or why it uses i+3. Your proposed comment describes it nicely; I've added it verbatim.
I agree with your general sentiment and was initially inclined to make the comment more verbose and not include the URL; I would certainly not be content with just a URL. Still, in this case I would consider it a fair compromise, primarily because the inclusion of the URL is non-essential but value-adding. That is, the two important points (the purpose of the code and the reason for the number 3) are captured by the lines above, while the blog post adds more information which could potentially be of use later. Furthermore, the author of the code is the author of the blog post, which increases the probability of its survival.
Yes, the +3 deserves an explanation (but something better than "because of experimentation"). My guess is that it's big enough to fetch the next cache line and small enough not to fetch something that won't be needed.
And I find it funny when people on HN are shibeing __builtin_prefetch like "wow, I just heard about caches and prefetching data how amazing"
You can get another 2x speedup if you choose the pivot better. Right now it seems you're doing median of 3.
When you're doing Order Statistics (i.e. Selection/Quickselect) with 1M elements, you should be very, very close to the theoretical optimum (1.5N) in complexity. See my comment here:
Neat! A quick skim of the referenced article in that comment (http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6193457) shows that it solves a problem I'd wondered about for a while: how to optimally choose the pivots when computing order stats. (Yeah, I'm just doing median of 3 right now.) If you want to sort the whole list, then clearly you want the pivots to be as near to the median as possible. But if you just want an order stat, that's not necessarily true; if you wanted, say, the 10th percentile, my gut feeling would be that you should aim for your first pivot to be roughly the 20th percentile or so. I'll read that article more closely and see if I can improve even more.
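For concreteness, the median-of-3 choice is nothing more sophisticated than the sketch below (on plain doubles; the real code compares PyObject pointers through the Python comparison API):

  #include <stddef.h>

  /* Minimal sketch: return the index of the median of a[lo], a[mid], a[hi]. */
  static size_t median_of_3(const double *a, size_t lo, size_t hi)
  {
      size_t mid = lo + (hi - lo) / 2;
      if (a[lo] < a[mid]) {
          if (a[mid] < a[hi]) return mid;        /* lo < mid < hi */
          return (a[lo] < a[hi]) ? hi : lo;      /* mid largest: max of lo, hi */
      } else {
          if (a[lo] < a[hi]) return lo;          /* mid <= lo < hi */
          return (a[mid] < a[hi]) ? hi : mid;    /* lo largest: max of mid, hi */
      }
  }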
Great article. But it seems somewhat unfair to characterize this as a 'one-line change'. He had to run various performance experiments that will be invisible to someone else reading the code, and the optimal prefetch distance might silently shift over time, say if each object grows larger. It's a little like the Carmack magic number in that respect: https://en.wikipedia.org/wiki/Fast_inverse_square_root
I've had this idea for some time that our representation of programs is fundamentally broken, because it excludes a detailed characterization of the input space. I took a first stab at a solution with http://akkartik.name/post/tracing-tests, which lets me write all my unit tests at the top level, showing how each piece of functionality augments the space of inputs my program can address. See, for example, https://github.com/akkartik/wart/blob/3039fe6d03/literate/02.... I suspect reading other people's 'code' could be an order of magnitude easier and more convenient if 'code' included the right information. I'm still working to flesh this idea out.
> No way, substantial speedups can really only come from algorithm changes.
I've often found the reverse to be true: implementation details matter most.
The best fixes tend to be altering my code so that the compiler/CPU can be smarter. In this case, the author gave the CPU a cache hint. Other times, I've found that looping in a different order improves my cache hit rate. Other times, properly naming variables and not reusing the same name allows the compiler to do smarter things.
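The loop-order point in particular is just the classic row-major vs. column-major traversal; a toy sketch (mine, nothing to do with the article's code):

  #include <stddef.h>

  /* Both functions sum the same matrix, but the first walks memory
   * contiguously (row-major, as C stores it), while the second strides
   * across it, touching a new cache line on almost every access. */
  double sum_row_major(const double *m, size_t rows, size_t cols)
  {
      double s = 0.0;
      for (size_t r = 0; r < rows; r++)
          for (size_t c = 0; c < cols; c++)
              s += m[r * cols + c];   /* consecutive addresses */
      return s;
  }

  double sum_col_major(const double *m, size_t rows, size_t cols)
  {
      double s = 0.0;
      for (size_t c = 0; c < cols; c++)
          for (size_t r = 0; r < rows; r++)
              s += m[r * cols + c];   /* stride of cols doubles */
      return s;
  }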
Other times, when I've tried a new algorithm, it's been slower because of bad constants, or because I've accidentally messed up the implementation enough to negate all the benefits.
As a complete side note, you need to be careful when using "older" algorithms, i.e. those published pre-2000. I once implemented, based on a Hoare paper, a doubly linked list for sparse matrices, and it was significantly slower than the naive method, simply because CPU caches have gotten really good while predicting linked-list memory locations has not.
This is all true, but I've never seen cache optimizations like that cause more than a 2x difference on average, or 5x in very rare extreme cases. That is very far from what you can get from algorithmic Big-O changes, where it is easy to get 1000x speedups.
That's true. Having been briefly in academia, I used to evaluate algorithms based only on Big-O, and just a bit of practical experience taught me that there tends to be a sweet spot, where the O(n^2.7) algorithm is much faster than your naive implementation, yet is simpler to code and has better constants and better cache behavior than the O(n^2.7003) one.
I don't think this gain came from pre-fetching anything. The compiler may simply have decided to treat the entire function, or even the entire file, differently because of that one line.
I compiled (x86, GCC 4.8) lazysort.c both with and without the prefetch intrinsic, and there were substantial differences in code surrounding calls in to Python, but no change at all to the guts of the partition or sorting routines.
Regardless of what is responsible, it's not good to leave something like this in your code without really understanding what it did. There may have been something else that occurred incidentally from this change that could have led you to a better solution.
> I don't think this gain came from pre-fetching anything.
It may have, and it's possible that it doesn't show up on some CPUs. See this thread for an example of how difficult it is to find a good generic prefetching size:
https://lkml.org/lkml/2013/10/11/534
>> The moral of the story, at least for me, is that you can occasionally get substantial improvements by understanding what is actually happening under the hood of your application, rather than fundamentally changing your application itself.
Also known as the rule of leaky abstractions, as in all abstractions are leaky.
If you are talking about abstracting (i.e. factoring out) similar bits of code into one place, whether it's via polymorphism or refactoring into functions, then no, that is not a "leaky abstraction".
I am talking about abstraction as in creating a simplified / unified interface over some complex thing. Abstracting the idea of something from its concrete realizations into its own concept. OSI networking layers, drivers for hardware, modeling real world systems, etc.
If you want to be pedantic instead of snappy, you can change my statement to "all abstraction layers are leaky".
This jumped out at me too. GCC's documentation says:
Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid. For example, a prefetch of p->next does not fault if p->next is not a valid address, but evaluation faults if p is not a valid address.
This seems to make the code as presented incorrect (i.e. it results in undefined behavior).
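If so, the fix would presumably be to guard the index so the pointer load itself stays in bounds; a sketch, with ob_item and n standing in for the list's item array and length, and the actual comparison elided:

  #include <stddef.h>

  /* The prefetch target may be any address, but evaluating ob_item[i + 3]
   * past the end of the array is what triggers the undefined behavior,
   * so only do it while the index is still valid. */
  static void partition_pass(void **ob_item, size_t n)
  {
      for (size_t i = 0; i < n; i++) {
          if (i + 3 < n)
              __builtin_prefetch(ob_item[i + 3]);
          /* ... compare ob_item[i] against the pivot here ... */
      }
  }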
It was a long involved bug chase, but some django middleware had an optional cache of some s3 information. The actual variable was self._something, with an @property on self.something so that the getter would backfill the metadata if it didn't exist.
In the code, it was checking if self.something existed, and then doing something else if it didn't. This triggered the lazy getter, and had the effect of pulling in the metadata for a 300k item s3 bucket.
Checking for self._something instead correctly returned None; the code then checked s3 for the item and uploaded it directly.
Ya, the title is pretty link bait-y. Just think of all the people who clicked the link and then found out they have no interest in sorting.
Title: I saved the world!!!#$^%!1111
Content: I was about to step on an ant today and narrowly missed it, saving the universe and all of existence for that ant.
I would expect that the magic number 3 is related to the size of a cache line relative to the size of an array element, and to the time spent in each loop iteration. You want to prefetch the next cache line so that it is just available by the time you want to start processing it (and, ideally, not a moment earlier, because that would push data out of the cache that could still be used before the requested data is needed).
A robust implementation should figure out the size of a cache line and the load delay, and derive the magic offset from that.
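Something along these lines, as a sketch (the cycle counts are placeholders you'd have to measure, and _SC_LEVEL1_DCACHE_LINESIZE is a glibc/Linux extension):

  #include <unistd.h>

  /* Issue the prefetch roughly latency/work iterations ahead, so the
   * line arrives just as the loop reaches it. */
  static long prefetch_distance(long miss_latency_cycles, long cycles_per_element)
  {
      long dist = miss_latency_cycles / cycles_per_element;
      return dist > 0 ? dist : 1;
  }

  /* If you prefetch the pointer array itself rather than the objects it
   * points to, one prefetch already covers this many consecutive entries. */
  static long pointers_per_cache_line(void)
  {
      long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
      if (line <= 0)
          line = 64;   /* common fallback when the value isn't reported */
      return line / (long)sizeof(void *);
  }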
Er, are you aware of the C++ standard library's std::nth_element() function? (Even my "pure C" programs tend to be compiled with g++ to gain access to that and std::sort().)
Now, I didn't just "know" to prefetch the PyObject three indices forward in the array. I arrived at this value through experimentation.
There is a good chance that 3 is specific to the hardware you tested on. Different systems will have different memory latencies (and NUMA systems can make it even more complicated). It isn't likely that 3 will turn into a pathological case on any other hardware, but it may well end up being less optimal than say 2 or 4.
This sort of thing is why porting naive x86 code to the GPU programming model (scalar programs aka 'kernels' streamed through the device using a configuration) often results in an amazing 15-20x speedup over 6 cores, while the expected speedup is only 5-7x. x86 compilers often don't get the prefetching right in loops using only their heuristics, so you have to talk to them. Then you begin to doubt the compiler more and more, carving out a few percentage points here and there by adding vector intrinsics, loop unrolling and so on. And before you notice, your codebase is bigger, more complex and more error-prone than if you had ported it to the GPU from the beginning.
Please note: of course this insight doesn't apply in your case, since your algorithm is meant for general-purpose usage, not HPC programming on systems where you have the hardware under your control. I'm simply urging people to whom the latter applies to think about whether their algorithms might be suitable for the GPU instead; doing lots of prefetch instructions and loop unrolls is a sign of that.
One thing that wasn't clear to me in your explanation is that you're sorting a list of pointers to objects. To sort that list you're comparing two of the objects themselves.
// follows pointer to access *ob_item[i]
IF_LESS_THAN(ob_item[i], pivot)
So the point of __builtin_prefetch is to dereference this pointer in advance and so avoid the latency of the 12 million reads scattered all over memory. Nice.
Another useful thing to do here is to see if you can find a "streaming operator" to dismiss *ob_item[i] and ob_item[i] from cache after the comparison. They won't be needed again for a while.
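Two knobs I know of that come close, as a sketch (whether either actually pays off here would have to be measured):

  #include <stddef.h>
  #include <emmintrin.h>   /* _mm_clflush (SSE2) */

  /* __builtin_prefetch's third argument is a temporal-locality hint; 0 says
   * the data need not be kept in cache after the access. _mm_clflush goes
   * further and evicts the line holding a given address outright. */
  static void compare_step(void **ob_item, size_t i, size_t n)
  {
      if (i + 3 < n)
          __builtin_prefetch(ob_item[i + 3], 0 /* read */, 0 /* no reuse */);

      /* ... compare *ob_item[i] against the pivot here ... */

      /* Drop the object we just compared. Flushing &ob_item[i] as well would
       * also evict its neighbours in the same cache line, so that part is
       * probably not worth it for the pointer array. */
      _mm_clflush(ob_item[i]);
  }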
Another good article on this optimisation (mentioned by OP)
This is a great post. I guess the idea here is that you can start bringing future objects to be compared up through the cache hierarchy while you are doing the comparison on the current object. If that's the case, then I think the speedup from prefetching will depend on the speed of the comparison which in turn will depend on the type of objects in the collection.
In your post it says you have a collection of 10MM elements but you didn't say what they were. Are they just ints or something else?
Oh, thanks for pointing out the oversight! They were floats generated with random.random() and then shuffled with random.shuffle(). Shuffling them removes any locality that might have arisen when the floats were being generated, even though the distribution remains uniform on (0, 1). (And indeed, empirically the code runs faster if you don't shuffle them.)
Yeah, I'm not sure how well this would work if you were sorting different objects. I would guess that it depends on two things--how large the objects are in memory, and how long it takes to compare them.
I was thinking about that when I read the article too: you've obviously done a lot of testing, but the runtime's allocator algorithm is going to have an effect on this, and allocating N floats in a loop will perform differently than if the floats were allocated over time (with other allocations mixed in) or if the objects were larger. Especially if the comparison function requires accessing object members from other memory locations... The magic value 3 might have a lot to do with the number of objects that fit in the cache or a page.
Anyway, it's an interesting article and an interesting library.
The prefetch will always pull in the same amount of data, no matter what the size of the Python object in question is. That leads to two results. 1) You can miss if your cache line size is less than the Python object size. 2) You can miss if your Python object is not aligned to a cache line and the comparison touches the part that sits in the next cache line.
If these two conditions apply, then there is a limit of 2 on the speedup, and increasing the Python object size reduces the effect of the prefetch, though it still helps.
Another thing, about the prefetch distance: I'd pick 5 based on his testing data.
1) It has no downside in the current tests.
2) Because the prefetch is of constant size, any case that's hurt by picking a distance of 5 instead of 3 is unlikely to see a meaningful effect at any prefetch distance.
3) It leaves room for the comparison to get faster, e.g. if there were a JIT on the Python side to specialize the test.
4) It allows a nice factor of improvement in core execution speed relative to memory latency.
5) It lets you handle page misses better.
The prefetch may take the same amount of time, but the number of objects that it pulls into cache depends on their size and allocation pattern. If they are large or were allocated over time (= from different memory blocks) fewer will be pulled in. It seems to me that this would make prefetch less effective. That is, if the thing you are trying to prefetch was pulled in because it was close enough to the last one, prefetch is a no-op.
Funny, I just experimented with __builtin_prefetch this week, and got no speedup. Does anyone know whether the kernel list.h always triggers hardware prefetching when going over a list?
BTW, my case wasn't the problem the kernel maintainers encountered with small lists, detailed here:
I'm pretty sure you could have saved yourself a lot of work by just using profile guided optimisation (PGO). It's mature & ready to go in GCC at least.
(Integrating it into your build process is a little bit more challenging, I admit. I've set it up in a large dev environment used by multiple large projects, and it was well, well worth the effort.)
It's one of those things... micro-optimisation becomes the big win once you've done enough algorithm optimisation. Even the humble memory copy becomes faster from prefetching, clever use of registers... and from actually making the algorithm theoretically worse. :P
Also, note that it is possible to remove the magic from your magic number with some thought about latency, the size and alignment of your data, etc. Fetching a cache line takes a fairly predictable number of cycles.
The opposite of the prefetch is the non-temporal reads/writes, where you deliberately don't pollute the cache, to prevent big one-off copies from damaging the performance of other code...
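For the copy case that looks something like the sketch below (assumes SSE2, 16-byte-aligned buffers and a size that's a multiple of 16):

  #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
  #include <stddef.h>

  /* Non-temporal copy: the stores bypass the cache, so a huge one-off
   * memcpy doesn't evict the rest of the working set. */
  static void stream_copy(void *dst, const void *src, size_t bytes)
  {
      __m128i *d = (__m128i *)dst;
      const __m128i *s = (const __m128i *)src;
      for (size_t i = 0; i < bytes / 16; i++)
          _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
      _mm_sfence();   /* make the streamed stores visible in order */
  }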
Interesting post! I wonder if the author experimented with different gcc command-line optimisation options as well? gcc might be able to insert some prefetches by itself with some settings?
Thanks! I tried -fprefetch-loop-arrays before figuring out that it only prefetches the array memory (i.e., memory with addresses like &ob_item[i]), rather than the memory referenced by the array (i.e., memory with addresses like ob_item[i]). And empirically, adding -fprefetch-loop-arrays got me no speedups.
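In code the difference is just which expression gets handed to the builtin (a sketch, with void ** standing in for the list's PyObject ** item array):

  #include <stddef.h>

  static void prefetch_targets(void **ob_item, size_t i)
  {
      /* The slot in the pointer array itself: roughly all that
       * -fprefetch-loop-arrays knows how to do on its own. */
      __builtin_prefetch(&ob_item[i + 3]);

      /* The object that slot points to: the memory that actually misses,
       * which the compiler can't see through automatically. */
      __builtin_prefetch(ob_item[i + 3]);
  }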
I didn't try any other GCC command line arguments because I honestly didn't know what other relevant options there were.