Hang on, you can't just quote MB/s numbers for an O(n log(n)) sort. What length were these tests run at?
The code size might not end up quite as good (also requires malloc), but a branchless merge sort is a contender for a fast and lightweight sort. Just published, tiny-sort-rs[0] cites 632 bytes and looks like ~350MB/s at 1e4 elements on Zen 3. In my tests, my own pisort[1] benches a little over twice as fast as LongSort, but it uses sorting networks as the base case so it's like 5KB. It's roughly based on piposort[2] which has more complicated recursion but a simpler base case.
400 MB/s seems a bit slow for a radix sort on that hardware: I'm hitting those numbers on my i5-6200U, which has less than half the clock rate, with my own radix sort. Recommend checking ska_sort_copy from [3] as it has about the same performance.
The first example is "assembly code they published for sorting an array with three items" - this isn't an entire general-purpose sorting algorithm, it's just the innermost part.
Just realized that obviously you don't need stability if you're using in-place quicksort, so the tiny-sort heapsort is a better recommendation. 304 bytes, although the scaling to large arrays is much worse because of the awful access patterns.
How much does code size matter here? As long as the code has a good access pattern that maintains cache locality, is there any fundamental difference between 632 bytes and 5KB? L1 cache sizes are generally somewhere around 16-64 KB these days, so it seems like there wouldn't be a big difference here. Or am I just totally off base?
> L1 cache sizes are generally somewhere around 16-64 KB these days, so it seems like there wouldn't be a big difference here.
Would you want to depend on a C library that claims all 64KB of your L1 cache for itself? Of course not. You'd want to use a library that stays out of the way, so that your code can be the one exploiting system resources.
All else equal, code with a larger L1 cache footprint still microbenchmarks nicely, but is pretty often worse in a real system.
Sometimes it's better to choose a slower routine with a smaller footprint. Nowadays memory bandwidth is arguably the biggest performance limiting factor.
I think the benefit is that once code starts getting inlined you can end up with multiple copies of it, so the size multiplies out, and ideally you want your code to be as small as possible so that you get fewer instruction cache misses. So whether the code is 632 bytes or 5KB is unlikely to matter if you're calling a single function in a loop (well, instruction decoding throughput aside, I suppose), but when you're looking more broadly it might.
longsort appears in cosmopolitan libc, and possibly gets embedded in all the output executables? For most applications the requirements are much less restrictive. I'm working on sorting for interpreted programming languages; I see >20KB for each sort now and don't have a problem with that. For small arrays only a fraction of the code will be used. I still make some effort to reduce size, but if you're doing HPC work where sorting matters you can go much bigger, with sorting networks for every small size or something like that. icache is 32KB/core on every processor I've checked, although it's often reported oddly. But it's fine for a hybrid sort targeting large arrays to exceed that, because many components like partitioning will spend their time running on lots of data, so the time to load them is relatively insignificant.
> The above algorithm shows what the new and improved libcxx is doing. It's basically quicksort except it switches to the sorting kernels and insertion sort when recursing into smaller slices.
This is a pretty standard technique, isn't it? You can eliminate hella recursive calls just by cutting off the bottom level of the call tree. For instance, take the following (Emacs lisp) functions:
(defun fib1 (n)
  (if (or (= n 1) (= n 0))
      1
    (+ (fib1 (- n 1)) (fib1 (- n 2)))))

(defun fib2 (n)
  (if (or (= n 1) (= n 0))
      1
    (if (= n 2)
        2
      (+ (fib2 (- n 1)) (fib2 (- n 2))))))
Using the first function to calculate (fib 5) makes 29 recursive calls before finally terminating, while the second makes only 17.
With libcxx I think they even took the added step of schlepping in heapsort, which is kind of slow, but prevents adversaries from smashing your stack.
Oh, yeah. I forgot to quote that, because I was going to comment on it. It's an iterative, in-place algorithm, so there are no recursive calls to be made.
The bit I meant to comment on was the "kind of slow" part. It is true that heapsort tends to be slower than a well-implemented quicksort, but you don't use heapsort when you need the absolute best speed. The (IMO) best thing about heapsort is that its best case and worst case are the same order of magnitude, so sorting n things will take a fairly consistent amount of time, no matter what.
Heap sort can sort n elements with O(1) auxiliary data while quick sort (which is what libcxx usually relies on) in its worst-case performance would require storing O(n) stack frames. Since stack sizes are usually small, an adversary making you sort a million elements would likely cause a stack overflow.
The linked implementation uses logarithmic auxiliary storage but allocates extra storage such that the amount allocated is a constant and rejects inputs that are too large (inputs that wouldn't fit into any computer anyway). A similar trick can be used to convert any algorithm to "use constant space." Just allocate enough space to handle inputs of some large size and reject larger inputs.
Did you scroll all the way to the bottom? The "never fail" version that always sorts the smaller partition first, and allocates 300 "pseudo stack frames" will successfully sort any array on a real computer.
Sure, theoretically, it does what you say, but you know what they say, right? In theory, theory and practice are the same. In practice, not so much. And, these are real differences, too: theory idealizes real computers as general Turing machines, when, in fact, they're really only linear bounded automata: https://en.wikipedia.org/wiki/Linear_bounded_automaton
See also the commentary following the code:
> This might be slightly slower than the first one, but it will never fail because it always performs the smaller partition first. Hence, the only way it could run into the limit is if the to-be-sorted array was at least 2^MAX_LEVELS elements in size. Since 2^300 is greater than 10^90, and there are only about 10^80 fundamental particles in this universe from which a human-made computer can be built, no larger limit will ever be needed.
> (Note: Someone reminded me that a typical 64-bit index variable can index only 2^64 items, not 2^300. That's true, but if you're using a 64-bit computer, you're probably not going to have an array of more than 2^64 elements anyway, even if each element was only one byte.)
Is it strange that it's slower in jart's testing but claimed to be faster in the AlphaDev blog post?
jart doesn't provide detail about length of sequences used in testing, and AlphaDev basically says that between 6 and 249,999 elements the optimizations are slower (they only claim improvement for very small and 250k+ element sequences).
The AlphaDev numbers are so curious as well. AFAICT there's extra branching when you splice the tiny-sequence optimized versions (slower), but better sorting for the tiny sequences (faster).
Is it, like, branch prediction gets an edge when the leaf nodes of the recursion are all sorting tiny sequences? In jart's code, it's DFS, which I can only guess would trample a bit on branch prediction. I wonder if a BFS search could be better
No idea what would cause this though, curious if anyone has other ideas, I really don't know.
Just an aside, but I noticed when I was reading this that Justine's C coding style resembles my preference, and it reminded me of why I like to code this way. I learned Pascal before C, and Pascal required that you define all procedures and functions before they are used. (This is a restriction that ANSI C and C++ worked around with function prototypes.) You can still code C without prototypes if you completely define all functions in the same file, before they are called. IMHO, code without forward references and/or function prototypes is inherently easier to read, but unfortunately it's not always possible to produce.
Wouldn't this mov instruction be handled by the register renamer (Allocate/Rename/MoveElimination/ZeroIdiom) at essentially 'zero' cost? Yet clearly they're measuring a difference. I'll be curious what Agner Fog and Peter Cordes think.
Answer: renaming can fail if the operands aren't ready and it isn't zero cost, just less.
Just having to go through the hardware frontend is a cost. It's one of the reasons SIMD is fast: you go through the frontend 1 time for N lanes of data.
This post inspired me to look at move 37 of the Lee Sedol game with a strong AI (stronger than AlphaGo Lee anyway) - interestingly, it thinks that the position favours Lee Sedol (~51% win rate to white, AlphaGo was still grinding down the komi advantage) and there was a better move available for black. Still was a good move though. It also thought AlphaGo then misplayed the following sequence and gave Lee an advantage for a brief moment.
It can be done for fixed-length lists. Optimal sorting networks [1] are an active research topic with many interesting connections to differentiable sorting and ranking [2].
The thing that has struck me about all these sorting algorithms is before you can run them you already need to have all the items to be sorted, which may involve you waiting to receive them. So the act of sorting ends up being on the critical path of what you need to achieve. Not good.
I think it's far better to sort the items incrementally, as and when you receive them, so the act of sorting is no longer on the critical path. Then the sorting takes virtually no time at all afterward, no matter how many items you need to sort.
The major bottleneck to computer systems is almost always the network and not the computing of algorithms. You can interleave the compute steps into the time spent waiting for the network.
>I don't think it's progress that OpenAI is promising to automate all the tasks I love doing most, like coding.
1. That sounds like a you problem. Automating coding would be fantastic if it could be done; programs could rewrite themselves at runtime to do whatever the user needs. You could add features by asking for them. It would fundamentally revolutionize how we interact with computers.
2. Deepmind's algorithm discovery is just a different approach to automating coding. It's less learning from preexisting code and more searching the space of possible code - you get more original solutions but at a higher compute cost.
How do compilers detect the need to replace high-level code with a hard-coded three-number sorting algorithm? As someone not deeply familiar with compiler internals, I'm eager to understand the underlying mechanisms. Could anyone shed light on how modern compilers recognize situations where it's beneficial to replace generic code with optimized assembly instructions specifically designed for sorting three numbers?
I just realized that Justine was the person responsible for the massive reduction in the memory footprint of the Llama models back in March.[1] Super impressive! These are my favorite kinds of blog posts.
Just a wild guess – developers lack soft/managerial skills. (Overgeneralization.) In companies, this is accounted for because developers are shielded by managers. But in F/OSS you get to interact directly with the developers. As for why devs lack soft skills – hold your hats – programming is basically commanding. Without "thank you", without "please", and with a lot of swearing. No time for pleasant bullshit. (By this reasoning, devs using declarative/functional programming should be more polite than imperative programmers, right? :D)
There was more. You can't just splat giant C structs with pointers into shared memory/a file, and expect another process to just mmap and be able to recreate valid state again. At the very least the pointers are going to be all wrong. There was necessary work to adjust the file format. Not rocket science, but not just turning while(fread()) into open();mmap().
Also, there were insights into how to minimize which models needed adjustment. The ideas and code were worked on by at least 2 people, and I'm an outsider on that project, but I didn't see anything untoward like "stealing credit". The magic change wasn't a perfect move, but is the kind of thing I do locally when I don't know the project/binary format well yet, so not exactly the megalomaniacal move it was painted as. Better that only the version number changed, but she's independent and doing good work, so you'd kind of hope she has a self-promotion streak! Changing the magic would be on the very very low end of letting that side go a bit too far, assuming that was the impetus.
Why is it Justine posts and seemingly only Justine posts that always get this type of comments? Do people regularly comment on the authors of other content, for better or for worse, and I miss it?
[0] https://github.com/Voultapher/tiny-sort-rs
[1] https://github.com/mlochbaum/SingeliSort/blob/master/src/mer...
[2] https://github.com/scandum/piposort
[3] https://github.com/skarupke/ska_sort