More

danlark · 2024-07-25T14:33:14 1721917994

3 and 4 elements will fail to prove the complexity is linear

You still can do 3 or 4 but with slight modifications

For example, for 4 elements, it's advised to take lower median for the first half and upper median for the second half. Then the complexity will be linear

danlark · 2024-07-25T12:53:14 1721911994

Around 4 years ago I compared lots of different median algorithms and the article turned out to be much longer than I anticipated :)

https://danlark.org/2020/11/11/miniselect-practical-and-gene...

thanatropism · 2024-07-26T14:21:27 1722003687

Is any of those easily modifiable to return the arg-median (the index which has the median).

cfors · 2024-07-25T13:36:12 1721914572

Just wanted to say thank you for this article - I've read and shared this a few times over the years!

danlark · on Dec 23, 2023

Disclaimer: the author of the comment is the CEO of ClickHouse

danlark · on June 7, 2023

I agree on the sorting front. Removing one cmov is not likely to improve much.

orlp · on June 7, 2023

It's not even a cmov. Look at Figure 3 in the paper: https://www.nature.com/articles/s41586-023-06004-9

They eliminated a register-register mov.

danlark · on June 7, 2023

You can see hashing optimizations as well https://www.deepmind.com/blog/alphadev-discovers-faster-sort..., https://github.com/abseil/abseil-cpp/commit/74eee2aff683cc7d...

I was one of the members who reviewed expertly what has been done both in sorting and hashing. Overall it's more about assembly, finding missed compiler optimizations and balancing between correctness and distribution (in hashing in particular).

It was not revolutionary in a sense it hasn't found completely new approaches but converged to something incomprehensible for humans but relatively good for performance which proves the point that optimal programs are very inhuman.

Note that for instructions in sorting, removing them does not always lead to better performance, for example, instructions can run in parallel and the effect can be less profound. Benchmarks can lie and compiler could do something differently when recompiling the sort3 function which was changed.

For hashing it was even funnier, very small strings up to 64 bit already used 3 instructions like add some constant -> multiply 64x64 -> xor upper/lower. For bigger ones the question becomes more complicated, that's why 9-16 was a better spot and it simplified from 2 multiplications to just one and a rotation. Distribution on real workloads was good, it almost passed smhasher and we decided it was good enough to try out in prod. We did not rollback as you can see from abseil :)

But even given all that, it was fascinating to watch how this system was searching and was able to find particular programs can be further simplified. Kudos to everyone involved, it's a great incremental change that can bring more results in the future.

CalChris · on June 7, 2023

This sounds like a more intelligent version of superoptimization. The original Masselin SO, albeit for its time, also created surprising results which is similar to AlphaDev's incomprehensible for humans. You see the same thing in computer chess which Agadmator calls disgusting engine lines.

https://courses.cs.washington.edu/courses/cse501/15sp/papers...

mtlmtlmtlmtl · on June 8, 2023

Stockfish actually use machine learning to pick the best magic numbers for their code.

agalunar · on June 8, 2023

SPSA, if memory serves.

https://en.m.wikipedia.org/wiki/Simultaneous_perturbation_st...

thomasahle · on June 7, 2023

I'm disappointed a the hashing is just based on training on microbenchmarks and SMHasher, rather than designing a fast _provably_ universal hash. Suites like SMHasher are never complete. They are just trying to catch the most common weaknesses. If you train on the test cases you'll only get an algorithm that passes the tests, but people can always find a set of values on which you will do badly.

eklitzke · on June 7, 2023

I think you're confusing proving that the hash function is collision resistant with the other goal which is hashing speed. If you really need a collision resistant hash you need to use a cryptographic hash function, but outside of cryptographic applications that is rarely the requirement. And (huge caveat, this isn't my domain expertise) I'm not sure what security properties are really "proven" about existing cryptographic hash functions, AFAIK existing cryptographic hash functions are considered secure because we don't know how to break them, not because of some fundamental mathematical property about them.

For the other 99.999% of hashing applications there is a balance between collision resistance and hashing latency. For example, in a hash table (probably the most common use for a non-cryptographic hash function) there is a cost incurred by hash collisions because lookups on keys with collisions may have to do extra probing. On the other hand, every hash table lookup requires doing at least one hash operation, regardless of whether or not it collides. So it may make sense to have a slightly worse hash function (in the sense that it is more likely to have collisions with pathological inputs) if it has slightly lower latency. The only way to really know what is faster for a real world application is to have some kind of benchmark to train against as a loss function.

thomasahle · on June 7, 2023

> I think you're confusing proving that the hash function is collision resistant with the other goal which is hashing speed. If you really need a collision resistant hash you need to use a cryptographic hash function.

I wish this misconception would die. There is a great theory of algorithmic probabilistic hash functions, completely distinct from cryptographic hash functions. If you are designing a hash table, or a different algorithm using a hash function, you nearly always want the former kind.

The idea is that `Pr[h(x) = h(y)]` is small _no matter the inputs x and y_. Here the probability is over the random seed of h. Lots of good hash functions, like UMASH (https://engineering.backtrace.io/2020-08-24-umash-fast-enoug...) has this guarantee. Other fast hash functions, like MURMUR don't.

When a function doesn't have this guarantee, it means I can find sets of values x1, x2, ... that will likely collide under _any_ or most seeds! Sure, if your inputs are basically random, this probably won't happen, but people can still use this to DDoS your hash table, or whatever you are coding.

Notice again, this has nothing to do with cryptography. It is all about probabilistic guarantees. You can't just test the hash function on a fixed number of inputs and say it's good, since you may just have moved the "bad set" to somewhere else.

In this day and age there are super fast algorithmic hash functions with guaranteed low expected collisions. It's just silly to use one that you can break so easily.

Dylan16807 · on June 8, 2023

> The idea is that `Pr[h(x) = h(y)]` is small _no matter the inputs x and y_.

That sounds like such a function is strongly collision resistant, which means it's also second preimage resistant. And that gets you most of the way to a cryptographic hash function.

Is the only difference that it doesn't have to be first preimage resistant? Compared to cryptographic hashes, does that expand the set of viable functions a lot, to allow first preimages while still not allowing second preimages?

> It is all about probabilistic guarantees

So are cryptographic hash functions.

When I search for `algorithmic probabilistic hash functions` I just get results about bloom filters.

thomasahle · on June 8, 2023

> > It is all about probabilistic guarantees

> So are cryptographic hash functions.

Cryptographic hash functions like MD5, SHA-2, BLAKE2, etc are deterministic functions, so it doesn't really make sense to talk about Pr[h(x)=h(y)]. Either the collide or not.

It's muddied a bit by the fact that cryptographers also use universal hashing (or probabilistic hashing, or what I called algorithmic hashing) for stuff like UMACs, https://en.m.wikipedia.org/wiki/UMAC#NH_and_the_RFC_UMAC , but they often have a lot of extra considerations on top of just collision resistance.

Some algorithms also need stronger probabilistic guarantees than just collision resistance (see e.g. https://en.m.wikipedia.org/wiki/K-independent_hashing#Indepe... ). These properties are usually too hard to test for with an experimental testing suite like SMhasher, but if your hash function don't have them, people will be able to find inputs that break your algorthm.

Dylan16807 · on June 8, 2023

> Cryptographic hash functions like MD5, SHA-2, BLAKE2, etc are deterministic functions, so it doesn't really make sense to talk about Pr[h(x)=h(y)]. Either the collide or not.

Eh, that's how I usually see collision resistance described. The probability is based on generating fresh inputs with any method you want/the most effective attack method available.

But I wouldn't say the hash you linked is nondeterministic just because it has a seed. You can seed MD5, SHA-2, and BLAKE2 by tossing bytes in as a prefix. It'll prevent the same attacks and you can give it the same analysis.

So I'm still not sure in what sense a hash like this is facing different requirements than a cryptographic hash.

thomasahle · on June 8, 2023

> You can seed MD5, SHA-2, and BLAKE2 by tossing bytes in as a prefix. It'll prevent the same attacks and you can give it the same analysis.

I'm curious if you can link to such an analysis. These functions are notoriously much harder to analyze than simple functions like "h(x) = ax+b mod p" which is all you need for the probabilistic guarantee.

But even if you could analyze this, you would just end up with a universal hash function that's way slower than you need, because you didn't pick the right tool for the job.

Dylan16807 · on June 8, 2023

By definition, if they're secure then they should meet the requirements, right?

> But even if you could analyze this, you would just end up with a universal hash function that's way slower than you need, because you didn't pick the right tool for the job.

I understand that, I'm just trying to figure out how a universal hash is easier to construct. But as you've gone through the descriptions here I think I understand how the collision resistance necessary is much much simpler, and there seems to be an assumption that the output of the hash will not be available to the attacker.

versteegen · on June 9, 2023

> But I wouldn't say the hash you linked is nondeterministic just because it has a seed. You can seed MD5, SHA-2, and BLAKE2 by tossing bytes in as a prefix.

Yes, but the point is that hash functions used for hash tables are much, much faster than these cryptographic ones.

tga_d · on June 7, 2023

>If you really need a collision resistant hash you need to use a cryptographic hash function, but outside of cryptographic applications that is rarely the requirement.

There are reasons to use (strongly) collision resistant hashes outside of cryptographic settings. E.g., the default Rust hash function, used in hash maps and sets, has strong collision resistance, because otherwise you could open up applications to DoS attacks (the attacker uses lots of inserts with collisions to kill performance of accesses and further inserts at those buckets).[0]

>I'm not sure what security properties are really "proven" about existing cryptographic hash functions, AFAIK existing cryptographic hash functions are considered secure because we don't know how to break them, not because of some fundamental mathematical property about them.

There are provably secure hash functions[1] (typically using the same sort of primitives as public key crypto), but they're generally only used when certain properties need to be composed, and are often less secure than the non-provable ones in practice anyway. This is pretty similar to the state of symmetric vs. asymmetric cryptography in general: primitives like RSA, DH, etc. have much stronger proofs than AES, but algorithms built using AES for security are generally viewed as a lot less likely to be broken any time soon than algorithms built using typical asymmetric primitives for security, even ignoring things like quantum advantage.

[0] https://doc.rust-lang.org/std/collections/struct.HashMap.htm...

[1] https://en.wikipedia.org/wiki/Security_of_cryptographic_hash...

Someone · on June 7, 2023

> I'm not sure what security properties are really "proven" about existing cryptographic hash functions

AFAIK, we don’t even know whether trapdoor functions exist.

https://en.wikipedia.org/wiki/Trapdoor_function:

“As of 2004, the best known trapdoor function (family) candidates are the RSA and Rabin families of functions”

Also note that the ‘examples’ section starts with:

“In the following two examples, we always assume it is difficult to factorize a large composite number (see Integer factorization).”

aidenn0 · on June 8, 2023

IIRC, if P=NP then trapdoor functions do not exist, so proving that one existed would be a huge deal even outside of cryptography.

Dylan16807 · on June 8, 2023

Only if your definitions of "easy" and "hard" are based entirely on complexity classes.

If you show me a setup where "easy" is n^3 and "hard" is n^15 I will happily call that a trapdoor function.

jacquesm · on June 7, 2023

Indeed, and this has been the case for quite a while now. You can always improve on some general algorithm by taking advantage of knowledge of the data but that never generalizes and usually leads to either worse performance on other data and/or new pathological cases that result in results that are unusable.

It's an instance of overfitting.

matteoraso · on June 8, 2023

>Indeed, and this has been the case for quite a while now. You can always improve on some general algorithm by taking advantage of knowledge of the data but that never generalizes and usually leads to either worse performance on other data and/or new pathological cases that result in results that are unusable.

Deepmind did the exact same thing with AlphaTensor. While they do some geniunely incredible things, there's always a massive caveat that the media ignores. Still, I think it's great that they figured out a way to search a massive space where most of the solutions are wrong, and with only 16 TPUs running for 2 days max. Hopefully this can be repurposed into a more useful program, like one that finds proofs for theorems.

bee_rider · on June 7, 2023

Ship the optimization framework in with the application, sample from the user data, and optimize for that? It isn’t overfitting if you overfit on the data you care about, right?

jacquesm · on June 7, 2023

Data tends to change over time, and once a hash function is in use you can't really replace it easily without a lot of overhead, possibly quite a bit more overhead than what you saved in the first place. There are some examples of this in the sorting arena too, such as 'Timsort', personally I haven't found any that gave a substantial boost, but probably there are some cases where they do. Unless sorting or hashing (and lookup) are the main bottleneck for an application I would spend my time on other aspects of it.

scns · on June 7, 2023

Sounds like the JVMs recompilation of hor paths to me.

hajile · on June 7, 2023

To what extent is this simply working around the weirdness of x86? Do these improvements apply to something like MIPS, ARM64, or RISC-V that have inherently simpler ISAs?

danlark · on June 7, 2023

In this particular case they were universal but in paper it's said the optimizations were done on x86. One of the ideas was to use LLVM IR but intuition for optimizer over optimizer was unlikely to work properly.

ndesaulniers · on June 7, 2023

> but intuition for optimizer over optimizer was unlikely to work properly.

Wut?

mindwok · on June 8, 2023

My guess: Using LLVM IR would mean that the LLVM optimiser might have made the results more noisy or hard to understand when it was compiled to actually execute.

moonchild · on June 8, 2023

> 9-16 was a better spot and it simplified from 2 multiplications to just one and a rotation

I'm very confused as to why rotation was at all useful. Xoring with a random-ish constant makes sense, because the constant has high entropy and is likely to decorrelate bits from the input (also can use a different constant per hash table). But rotating by a constant—and a fixed one at that—seems like it just accounts for expected input distribution. Especially (assuming this is intended for text) if shifting by a value >8 makes a significant difference (vs shifting by the same value mod 8), it smells like serious overfit. Could be useful for something like a perfect hash, but seems problematic and prone to issues as a general hash.

Edit: to make my objection clearer: the hash simply replaces lo with rotr(lo, 53). If rotr(lo, 53) performs significantly better than rotr(lo, 53 mod 8), then that implies the following. I can take a set of strings, and I can apply the same permutation to the characters of all of the strings in the set, and this will significantly affect the quality of the hash function. That seems like an undesirable property, even for a non-cryptographic hash.

manwe150 · on June 9, 2023

Does that end up moving high bits into the low bits? That could possibly be very helpful for all sets of strings, since the equality test will start with the first byte, so it could better to have the rotr on the hash so that the hash is less affected by the first byte (and more affected by later bytes). Just hypothetically speaking that is where the implication could break down, since it doesn’t consider the non-uniform cost of the equality.

moonchild · on June 10, 2023

> the equality test will start with the first byte

I would expect the equality test to compare at least a full word at a time, just as the hash hashes at least a full word at a time.

manwe150 · on June 24, 2023

I didn’t look at their implementation, but in general, strings don’t have to be aligned so you can only peek one byte at a time looking for the end, besides not wanting to annoy valgrind and other similar UB detection tools by reading past the end of the string.

moonchild · on June 29, 2023

Strawman; nul-terminated strings are horribly slow for nearly every application. Hence I assume (especially given that they are using c++) that they are using length-accompanied strings rather than nul-terminated ones.

13of40 · on June 7, 2023

> something incomprehensible for humans

This might be buggy whip talk, but I wonder if you could take the same system and apply it to smaller problems (e.g. computing an 8-bit hash) so the novel techniques could be identified and used by humans.

tyingq · on June 8, 2023

One of the examples from another comment[1] here was:

"They found that in a sorting network handling 3 inputs, the AI found a way to save an instruction by reducing a "min(A, B, C)" operation to just "min(A, B)" by taking advantage of the fact that previous operations guaranteed that B <= C."

Which isn't incomprehensible at all.

[1] https://news.ycombinator.com/item?id=36229068

cabalamat · on June 7, 2023

> proves the point that optimal programs are very inhuman

Maybe there should be an AI that produces optimally-readable/understandable programs? That's what I would want if I was adding the output to a codebase.

whatever1 · on June 7, 2023

Optimal routing of delivery vehicles for UPS/Fedex etc is also non-compressible to the drivers, so the planners often intentionally generate suboptimal solutions.

A suboptimal implemented solution is a better than an optimal not implemented one.

cabalamat · on June 8, 2023

> Optimal routing of delivery vehicles for UPS/Fedex etc is also non-compressible to the drivers, so the planners often intentionally generate suboptimal solutions.

Really? That's the first time I've heard that (and I've worked on vehicle routing).

Normally, the driver would get the next address they have to visit and would use satnav to work out how to get there. They don't need to "comprehend" the overall route.

whatever1 · on June 8, 2023

Packages might have delivery time windows attached to them, so the optimal solution calls for multiple visits in the same neighborhood by the same van. This is bs from a driver’s perspective.

m3kw9 · on June 7, 2023

It’s not that is written like obfuscated, the routine/ algo is just hard to understand even if they commented every line. Likely some recursive trick is involved, those are always hard to follow

make3 · on June 7, 2023

that's what LLMs can with rl from human (or ai) readability feedback & instruction tuning + prompting. we will 100% see this if gpt-4 doesn't already do this.

ska · on June 7, 2023

I wouldn't classify any of the output I've seen so far as "optimally readable/understandable".

Some if it looks pretty ok, especially where it overlaps with well established approaches.

chaxor · on June 7, 2023

It can do well with optimization and readability *if you ask it specifically for those things*. Especially if you have a particular paradigm and algorithm in mind (you obviously already should anyway).

This is why these systems are helpful in programming: they allow developers to think more about the design paradigms, and algorithmic solutions, rather than the fine grained code syntax and typing.

My hope (not prediction unfortunately, but *hope*) is that these systems will make people "better* programmers. This could happen by alleviating the requirement of typing out the code in a particular way, and allowing more time to really try out or think carefully about the correct solution for how to make their programs (i.e. multiprocessed, producer-consumers, distribute data with ipfs, faster algorithms, etc)

ska · on June 8, 2023

> It can do well with optimization and readability if you ask it specifically for those things

My experience so far (including -4) this isn't really true, even when focusing on those aspects. I'm cautiously optimistic this will get better.

brundolf · on June 8, 2023

Thought: optimizing JIT compilers like V8 already observe code behavior and use that to choose how to optimize things as they go. I wonder if one day V8 will incorporate some ML to help study and optimize the running code

gumby · on June 7, 2023

Your description makes the approach sound like applying ca 1980s simulated annealing, also a form of gradient descent. Am I missing something?

ska · on June 7, 2023

This doesn't really sound like SA, which is a principled approach to avoid the local min problem in descent algorithms and end up with a true global optimization. The general result is clean and impractical, the approximations hard to analyze.

thesz · on June 7, 2023

So it works as a superoptimizer of sorts.

ks84 · on June 10, 2023

I haven’t read the paper, but it sounds like this is a sort of similar approach to genetic algorithms? Create lots of agents with random changes and pick the ones that produce the most promising results? Is my impression wrong?

sytelus · on June 7, 2023

If more compute is spent than before but in parallel to reduce latency then wouldn't this increase power? Shouldn't latency and power both be objective?

lukeplato · on June 8, 2023

> optimal programs are very inhuman

sounds revolutionary

danlark · on June 7, 2023

I am the author of this trick as well

You can read about it in https://community.arm.com/arm-community-blogs/b/infrastructu...

jart · on June 7, 2023

Wow. I love your work. Thank you for coming here and talking about it. You could write Hacker's Delight 2nd edition for a new generation.

spicymaki · on June 7, 2023

Hacker's Delight already has a 2nd edition.

https://www.oreilly.com/library/view/hackers-delight-second/...

tr33house · on June 7, 2023

would totally love to read a modern `Hacker's Delight`. My mind was so blown away the first time I learned about low-level optimizations. I wish I did more of that on a day to day

zX41ZdbW · on June 7, 2023

Let's add it to ClickHouse: https://github.com/ClickHouse/ClickHouse/blob/master/base/ba...

It should significantly improve the performance on ARM.

Sesse__ · on June 7, 2023

The VSHRN trick is nice (I used it only two hours ago!), but it really does feel like a crutch; I don't understand why they couldn't simply implement a PMOVMSKB-like instruction to begin with (it cannot possibly be very expensive in silicon, at least not if it moved into a vector register). One-bit-per-byte is really the sweet spot for almost any kind of text manipulation, and often requires less setup/post-fixup on either side of the POVMSKB/VSHRN.

jeffreygoesto · on June 7, 2023

> However, developers often encounter problems with Arm NEON instructions being expensive to move to scalar code and back.

I remember talking to an ARM engineer easily 10 years ago and he told us in that nice british accent: "You know, NEON is like 'back in the yard'" :-D. This has changed a lot, but not enough from what you wrote... Bit sad that these SIMD optimizations are still hand written...

danlark · on April 20, 2022

Sorry, it definitely was a little inconsistent as I stored all snippets without proper formatting and I tried at the same time align texts for phones, etc

Failed at everything. Will improve

danlark · on April 20, 2022

Hi, the author is here

I am working in MapReduce/Data-pipelining/Dataproc efficiency internally and externally. In cloud you can check out Google Cloud Dataflow

We are working closely with general efficiency and bare metal teams, yes :). They definitely do a fascinating work. This one was my 20% project together with the google efficiency team

danlark · on April 20, 2022

Thanks, I'll remove the note about "recalling"

westurner · on April 20, 2022

xeus-cling didn't exist back then: https://github.com/jupyter-xeus/xeus-cling#a-c-notebook

https://github.com/fffaraz/awesome-cpp#debug

A standard way to benchmark and chart {sorting algorithms, web framework benchmarks,} would be great.

The TechEmpower "framework overhead" benchmarks might have at least average case sorting in there somewhere: https://www.techempower.com/benchmarks/#section=data-r20&hw=...

westurner · on April 20, 2022

awesome-algorithms https://github.com/tayllan/awesome-algorithms#github-librari...

awesome-theoretical-computer-science > Machine Learning Theory, Physics; Grover's; and surely something is faster than Timsort: https://github.com/mostafatouny/awesome-theoretical-computer...

samhw · on April 20, 2022

Christ, the description on that second repo is so comically terribly spelled that I can’t figure out whether it’s a joke:

> The interdicplinary of Mathematics and Computer Science, Distinguisehed by its emphasis on mathemtical technique and rigour.

(Perhaps a witty satire on the reputation of mathmos for being godawful at anything literary...)

danlark · on April 20, 2022

You don't, it's just not disallowed :)

infogulch · on April 20, 2022

"because it prevents the user from writing a comparison function that modifies the list while sorting" is not a reason I would expect to prefer rust, and yet here we are.

mbrubeck · on April 20, 2022

It's possible in Rust too, but only for a subset of element types: those that support shared mutation (also known as "internal mutation"). For example, a type containing an `AtomicU32` or a `Mutex<T>` or a `Cell<T>` could have an `Ord::cmp` implementation that alters its value, and the compiler will not prevent this.

orlp · on April 20, 2022

Yes, I had a nasty bug in an initial version of glidesort where I (naively) assumed that I could create a temporary copy of an element by doing a memcpy and call the comparison function on this temporary copy a bunch of times before throwing it away, since I assumed the comparison function would not mutate the copy. Unfortunately interior mutability means this would lead to potential double drops, and it was just not sound.

account42 · on April 22, 2022

The comparison operator is not actually mutating the list, which would not be allowed in C++ either. Instead it is dynamically generating the < comparison operator for the elements in the list based on the order for which elements it is evaluated. The order for two elements remains fixed once examined and in the end it always ends up with a valid totally ordered comparison operator. The only thing required for this to work is that the comparison operator can have mutable state and that the same shared state is used for all comparisons in the sorting algorithm (asided from thread-safety so you'd need adjustments for a parallel sort).