Hang on, you can't just quote MB/s numbers for an O(n log(n)) sort. What length were these tests run at?
The code size might not end up quite as good (also requires malloc), but a branchless merge sort is a contender for a fast and lightweight sort. Just published, tiny-sort-rs[0] cites 632 bytes and looks like ~350MB/s at 1e4 elements on Zen 3. In my tests, my own pisort[1] benches a little over twice as fast as LongSort, but it uses sorting networks as the base case so it's like 5KB. It's roughly based on piposort[2] which has more complicated recursion but a simpler base case.
400 MB/s seems a bit slow for a radix sort on that hardware: I'm hitting those numbers on my i5-6200U, which has less than half the clock rate, with my own radix sort. Recommend checking ska_sort_copy from [3] as it has about the same performance.
The first example is "assembly code they published for sorting an array with three items" - this isn't an entire general-purpose sorting algorithm, it's just the innermost part.
Just realized that obviously you don't need stability if you're using in-place quicksort, so the tiny-sort heapsort is a better recommendation. 304 bytes, although the scaling to large arrays is much worse because of the awful access patterns.
How much does code size matter here? As long as the code has a good access pattern that maintains cache locality, is there any fundamental difference between 632 bytes and 5KB? L1 cache sizes are generally somewhere around 16-64 KB these days, so it seems like there wouldn't be a big difference here. Or am I just totally off base?
> L1 cache sizes are generally somewhere around 16-64 KB these days, so it seems like there wouldn't be a big difference here.
Would you want to depend on a C library that claims all 64KB of your L1 cache for itself? Of course not. You'd want to use a library that stays out of the way, so that your code can be the one exploiting system resources.
All else equal, code with a larger L1 cache footprint still microbenchmarks nicely, but is pretty often worse in a real system.
Sometimes it's better to choose a slower routine with a smaller footprint. Nowadays memory bandwidth is arguably the biggest performance limiting factor.
I think the benefit is that once code starts getting inlined you can end up with multiple copies of it, so the size multiplies out, and ideally you want your code to be as small as possible so that you get fewer instruction cache misses. So whether the code is 632 bytes or 5KB is unlikely to matter if you're calling a single function in a loop (well, instruction decoding throughput aside, I suppose), but when you're looking more broadly it might.
longsort appears in cosmopolitan libc, and possibly gets embedded in all the output executables? For most applications the requirements are much less restrictive. I'm working on sorting for interpreted programming languages; I see >20KB for each sort now and don't have a problem with that. For small arrays only a fraction of the code will be used. I still make some effort to reduce size, but if you're doing HPC work where sorting matters you can go much bigger, with sorting networks for every small size or something like that. icache is 32KB/core on every processor I've checked, although it's often reported oddly. But it's fine for a hybrid sort targeting large arrays to exceed that, because many components like partitioning will spend their time running on lots of data, so the time to load them is relatively insignificant.
> The above algorithm shows what the new and improved libcxx is doing. It's basically quicksort except it switches to the sorting kernels and insertion sort when recursing into smaller slices.
This is a pretty standard technique, isn't it? You can eliminate hella recursive calls just by cutting off the bottom level of the call tree. For instance, take the following (Emacs lisp) functions:
(defun fib1 (n)
  (if (or (= n 1) (= n 0))
      1
    (+ (fib1 (- n 1)) (fib1 (- n 2)))))

(defun fib2 (n)
  (if (or (= n 1) (= n 0))
      1
    (if (= n 2)
        2
      (+ (fib2 (- n 1)) (fib2 (- n 2))))))
Using the first function to calculate (fib 5) makes 29 recursive calls before finally terminating, while the second makes only 17.
With libcxx I think they even took the added step of schlepping in heapsort, which is kind of slow, but prevents adversaries from smashing your stack.
Oh, yeah. I forgot to quote that, because I was going to comment on it. It's an iterative, in-place algorithm, so there are no recursive calls to be made.
The bit I meant to comment on was the "kind of slow" part. It is true that heapsort tends to be slower than a well-implemented quicksort, but you don't use heapsort when you need the absolute best speed. The (IMO) best thing about heapsort is that its best case and worst case are the same order of magnitude, so sorting n things will take a fairly consistent amount of time, no matter what.
Heap sort can sort n elements with O(1) auxiliary data while quick sort (which is what libcxx usually relies on) in its worst-case performance would require storing O(n) stack frames. Since stack sizes are usually small, an adversary making you sort a million elements would likely cause a stack overflow.
The linked implementation uses logarithmic auxiliary storage but allocates extra storage such that the amount allocated is a constant and rejects inputs that are too large (inputs that wouldn't fit into any computer anyway). A similar trick can be used to convert any algorithm to "use constant space." Just allocate enough space to handle inputs of some large size and reject larger inputs.
Did you scroll all the way to the bottom? The "never fail" version that always sorts the smaller partition first, and allocates 300 "pseudo stack frames" will successfully sort any array on a real computer.
Sure, theoretically, it does what you say, but you know what they say, right? In theory, theory and practice are the same. In practice, not so much. And, these are real differences, too: theory idealizes real computers as general Turing machines, when, in fact, they're really only linear bounded automata: https://en.wikipedia.org/wiki/Linear_bounded_automaton
See also the commentary following the code:
> This might be slightly slower than the first one, but it will never fail because it always performs the smaller partition first. Hence, the only way it could run into the limit is if the to-be-sorted array was at least 2^MAX_LEVELS elements in size. Since 2^300 is greater than 10^90, and there are only about 10^80 fundamental particles in this universe from which a human-made computer can be built, no larger limit will ever be needed.
> (Note: Someone reminded me that a typical 64-bit index variable can index only 2^64 items, not 2^300. That's true, but if you're using a 64-bit computer, you're probably not going to have an array of more than 2^64 elements anyway, even if each element was only one byte.)
Is it strange that it's slower in jart's testing but claimed to be faster in the AlphaDev blog post?
jart doesn't provide detail about length of sequences used in testing, and AlphaDev basically says that between 6 and 249,999 elements the optimizations are slower (they only claim improvement for very small and 250k+ element sequences).
The AlphaDev numbers are so curious as well. AFAICT there's extra branching when you splice the tiny-sequence optimized versions (slower), but better sorting for the tiny sequences (faster).
Is it, like, branch prediction gets an edge when the leaf nodes of the recursion are all sorting tiny sequences? In jart's code, it's DFS, which I can only guess would trample a bit on branch prediction. I wonder if a BFS search could be better
No idea what would cause this though, curious if anyone has other ideas, I really don't know.
Just an aside, but I noticed when I was reading this that Justine's C coding style resembles my preference, and it reminded me of why I like to code this way. I learned Pascal before C, and Pascal required that you define all procedures and functions before they are used. (This is a restriction that ANSI C and C++ worked around with function prototypes.) You can still code C without prototypes if you completely define all functions in the same file, before they are called. IMHO, code without forward references and/or function prototypes is inherently easier to read, but unfortunately it's not always possible to produce.
Wouldn't this mov instruction be handled by the register renamer (Allocate/Rename/MoveElimination/ZeroIdiom) at essentially 'zero' cost? Yet clearly they're measuring a difference. I'll be curious what Agner Fog and Peter Cordes think.
Answer: renaming can fail if the operands aren't ready and it isn't zero cost, just less.
Just having to go through the hardware frontend is a cost. It's one of the reasons SIMD is fast: you go through the frontend 1 time for N lanes of data.
This post inspired me to look at move 37 of the Lee Sedol game with a strong AI (stronger than AlphaGo Lee anyway) - interestingly, it thinks that the position favours Lee Sedol (~51% win rate to white, AlphaGo was still grinding down the komi advantage) and there was a better move available for black. Still was a good move though. It also thought AlphaGo then misplayed the following sequence and gave Lee an advantage for a brief moment.
It can be done for fixed-length lists. Optimal sorting networks [1] are an active research topic with many interesting connections to differentiable sorting and ranking [2].
The thing that has struck me about all these sorting algorithms is before you can run them you already need to have all the items to be sorted, which may involve you waiting to receive them. So the act of sorting ends up being on the critical path of what you need to achieve. Not good.
I think it's far better to sort the items incrementally, as and when you receive them, so the act of sorting is no longer on the critical path. Then the sorting takes virtually no time at all afterward, no matter how many items you need to sort.
The major bottleneck to computer systems is almost always the network and not the computing of algorithms. You can interleave the compute steps into the time spent waiting for the network.
>I don't think it's progress that OpenAI is promising to automate all the tasks I love doing most, like coding.
1. That sounds like a you problem. Automating coding would be fantastic if it could be done; programs could rewrite themselves at runtime to do whatever the user needs. You could add features by asking for them. It would fundamentally revolutionize how we interact with computers.
2. Deepmind's algorithm discovery is just a different approach to automating coding. It's less learning from preexisting code and more searching the space of possible code - you get more original solutions but at a higher compute cost.
How do compilers detect the need to replace high-level code with a hard-coded three-number sorting algorithm? As someone not deeply familiar with compiler internals, I'm eager to understand the underlying mechanisms. Could anyone shed light on how modern compilers recognize situations where it's beneficial to replace generic code with optimized assembly instructions specifically designed for sorting three numbers?
I just realized that Justine was the person responsible for the massive reduction in the memory footprint of the Llama models back in March.[1] Super impressive! These are my favorite kinds of blog posts.
Just a wild guess – developers lack soft/managerial skills. (Overgeneralization.) In companies, this is accounted for because developers are shielded by managers. But in F/OSS you get to interact directly with the developers. As for why devs lack soft skills – hold your hats – programming is basically commanding. Without "thank you", without "please", and with a lot of swearing. No time for pleasant bullshit. (By this reasoning, devs using declarative/functional programming should be more polite than imperative programmers, right? :D)
There was more. You can't just splat giant C structs with pointers into shared memory/a file, and expect another process to just mmap and be able to recreate valid state again. At the very least the pointers are going to be all wrong. There was necessary work to adjust the file format. Not rocket science, but not just turning while(fread()) into open();mmap().
Also, there were insights into how to minimize which models needed adjustment. The ideas and code were worked on by at least 2 people, and I'm an outsider on that project, but I didn't see anything untoward like "stealing credit". The magic change wasn't a perfect move, but is the kind of thing I do locally when I don't know the project/binary format well yet, so not exactly the megalomaniacal move it was painted as. Better that only the version number changed, but she's independent and doing good work, so you'd kind of hope she has a self-promotion streak! Changing the magic would be on the very very low end of letting that side go a bit too far, assuming that was the impetus.
Why is it Justine posts and seemingly only Justine posts that always get this type of comments? Do people regularly comment on the authors of other content, for better or for worse, and I miss it?
[0] https://github.com/Voultapher/tiny-sort-rs
[1] https://github.com/mlochbaum/SingeliSort/blob/master/src/mer...
[2] https://github.com/scandum/piposort
[3] https://github.com/skarupke/ska_sort