Go will use pdqsort in next release (github.com/golang)
433 points by ngaut on April 21, 2022 | 122 comments



Recent and related:

Changing std:sort at Google’s scale and beyond - https://news.ycombinator.com/item?id=31098822 - April 2022 (144 comments)


With how long we've been studying sorting algorithms, I'm amazed we're still coming up with optimizations that yield real-world improvements.


It's not just the algorithms, but the hardware as well; the base algorithms we were taught in school were usually aimed at single-threaded applications, but in the past two decades, multithreading and parallelisation have become mainstream.


That's not particularly relevant; no standard library sort method uses multithreading in the background. It really is about algorithmic improvements.


For sorting algorithms many of the "algorithmic improvements" are caching / locality / access pattern improvements, as latency differences get more extreme and cache sizes grow. (At multiple levels, L1/L2 vs. RAM, RAM vs. disk, node vs. network.) So it's still very much tied to changes in hardware.

Whether you call that an "algorithmic improvement" or not is mostly semantics, it would not have been called such in the 80s but it would be today - what properties we consider part of "the algorithm" has also changed to meet hardware changes.

Even whether it's "parallelization" is somewhat semantics - if you change an algorithm to avoid a pipeline stall, is that designing it for parallelization? I guess not because it's single-threaded, but again someone in the 70s/80s might say yes because two things are happening at once.


You're referring to instruction-level parallelism. The parent was referring to thread-level parallelism.


I mean, probably, which is also what I said. But what the parent actually said could also be either.

The point of my comment is that not only are the sorting algorithms changing over time, but what we say constitutes "algorithmic" vs. "implementation" changes also changes over time, so it's really hard to say whether certain changes are "algorithmic" changes or not!


Given that sorting algorithms are evaluated on their practical merits on real-world hardware, when the hardware changes, sometimes the sorting algorithm has to change too.


They are just switching from "pure blood" algorithms like merge sort or quicksort, to hybrid and heuristics-based ones. (Not to say these advances are irrelevant though.)


According to the benchmarks, pdqsort only really helps for Go's integer sort.

The fastest way to sort integers is a radix-256 increasing-significance radix sort, and that has been a well-known fact since the 1960s. In other words: a stable bytewise counting sort, sorting from the least-significant byte to the most-significant byte.

A comparison sort, no matter how clever, will not outperform a radix sort, except for very small arrays or very large (many byte) integers.
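
For the curious, here is a minimal sketch of that least-significant-byte-first counting sort for uint32 keys (the function name and layout are illustrative, not taken from any standard library):

    package main

    import "fmt"

    // radixSortUint32 sorts 32-bit keys with four stable counting-sort passes
    // over radix-256 digits, from the least to the most significant byte.
    func radixSortUint32(keys []uint32) {
        buf := make([]uint32, len(keys))
        for shift := uint(0); shift < 32; shift += 8 {
            var count [256]int
            for _, k := range keys {
                count[(k>>shift)&0xFF]++
            }
            // Prefix sums turn the counts into output positions.
            pos := 0
            for i := range count {
                c := count[i]
                count[i] = pos
                pos += c
            }
            // Stable scatter into the scratch buffer, then swap roles.
            for _, k := range keys {
                d := (k >> shift) & 0xFF
                buf[count[d]] = k
                count[d]++
            }
            keys, buf = buf, keys
        }
        // After an even number of passes (four here) the sorted data ends up
        // back in the slice the caller passed in, so no final copy is needed.
    }

    func main() {
        xs := []uint32{42, 7, 1 << 30, 0, 99999, 7}
        radixSortUint32(xs)
        fmt.Println(xs) // [0 7 7 42 99999 1073741824]
    }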


On a side note, this change's author is from ByteDance. It's great to see everyone working together on open source.


That makes sense. ByteDance has many employees with competitive programming experience, and those folks tend to be more attuned to, and more aggressive about, algorithmic changes.


The ByteDance interviews must be tough, but what they're making is complete trash. I'm not talking about the quality of their product.


I did a quick Google search for pdqsort and found this HN post: https://news.ycombinator.com/item?id=14666710

> I think it's fair to say that pdqsort (pattern-defeating quicksort) is overall the best unstable sort and timsort is overall the best stable sort in 2017, at least if you're implementing one for a standard library.

_hrfd called it 5 years ago.



I think I will copy/paste my comment from a thread from yesterday (https://news.ycombinator.com/item?id=31098822), which incidentally was also referring to pdqsort, as it is just as relevant here:

Hi all, I'm the author of pdqsort. I recently gave a talk at my institute on efficient in-memory sorting and the ideas in pdqsort, in case you're interested in hearing some of the theory behind it all: https://www.youtube.com/watch?v=jz-PBiWwNjc

Next week I will give another talk in the Dutch seminar on data systems design (https://dsdsd.da.cwi.nl/) on glidesort, a new stable sorting algorithm I've been working on. It is a combination of an adaptive quicksort (like pdqsort, fully adaptive for many equal elements) and an adaptive mergesort (like timsort, fully adaptive for long pre-sorted runs). It is the first practical implementation I'm aware of that is fully adaptive for both. Like pdqsort it uses modern, architecture-aware branchless sorting, and it can use an arbitrary buffer size, becoming faster as you give it more memory (although with a constant buffer size it degrades to O(n (log n)^2) in theory; in practice, for realistic workloads, it's just a near-constant factor (c <= ~3-5) slower).

The source code isn't ready to publish yet; I still have to do some extra correctness vetting and testing, and in particular exception safety is not yet fully implemented. This is important because I wrote it in Rust, where we must always give back a valid, initialized array, even if a comparison operator caused a panic. But I do have some performance numbers to quote that won't significantly change.

For sorting 2^24 randomly shuffled distinct u32s using a buffer of 2^23 elements (n/2), glidesort beats Rust's stdlib slice::sort (which is a classic timsort also using a buffer of n/2) by a factor of 3.5 times. When stably sorting the same numbers comparing only their least significant 4 bits, it beats stdlib slice::sort by 12.5 times using 6.5 times fewer comparisons, both numbers on my Apple M1 Macbook. All of this is just using single-threaded code with a generic comparison function. No SIMD, no threads, no type-specific optimizations.

Finally, glidesort with a buffer size of >= n/2 is faster than pdqsort.


>I still have to do some extra correctness vetting and testing, and in particular exception safety is not yet fully implemented. This is important because I wrote it in Rust, where we must always give back a valid, initialized array, even if a comparison operator caused a panic. But I do have some performance numbers to quote that won't significantly change.

Do you make extensive use of `unsafe` Rust? What other features do you use? I ask this as my PhD thesis is building a deductive verifier for (safe) Rust, and I'm always looking out for interesting challenges.

If you would ever be interested in a collaboration proving your algorithm correct let me know :)


> Do you make extensive use of `unsafe` Rust?

Yes. Glidesort moves data back and forth between partially initialized buffers a lot, so unsafe code is unavoidable. I also need to concatenate slices that are adjacent in memory, which is impossible to do with safe slices (due to pointer provenance). Mixing pointers/slices in general is just a bad idea, so for the most part I simply use pointers.

I use no other special (unstable) features, just pointers and ptr::copy(_nonoverlapping).

> If you would ever be interested in a collaboration proving your algorithm correct let me know :)

I might take you up on that, or at least ask you to take a look once I'm done polishing the code.


> Yes. Glidesort moves data back and forth between partially initialized buffers a lot, so unsafe code is unavoidable. I also need to concatenate slices that are adjacent in memory, which is impossible to do with safe slices (due to pointer provenance). Mixing pointers/slices in general is just a bad idea, so for the most part I simply use pointers.

Ah, that's frustrating, my tool Creusot is meant for the verification of safe code, though I also worked on unsafe code verification as part of the RustHornBelt project (upcoming PLDI paper).

I'd still love to chat and see how far we can push verification, you can reach me via email at xldenis @ lri . fr


How is it that programmers are still discovering state-of-the-art algorithms?

I just assumed Einstein-tier people had discovered all the optimal ways to do things by now.

Do you have a process for coming up with these improved algorithms, or does it just start from a passing thought in your mind?


The field of computer science is at best 180 years old, and the first compiler is a mere 70 years old. Based on the history of every other field of science it seems safe to assume that the majority of notable algorithms are still undiscovered.

And that's ignoring that "optimal" is a shifting target. What's optimal on a 1960s computer where RAM is as fast as registers may not be optimal on a modern CPU with instruction-level parallelism and RAM that's orders of magnitude slower than registers.


I'm of the impression that computer science isn't considered a "science" field despite the name. Anyway, stuff like this is largely just math; other science fields often move slower because you're at the mercy of natural experiments and expensive instruments (e.g., waiting for some burst of gravitational waves to arrive from some black hole collision that happened billions of years ago).


Computer Science is neither a science, nor is it about computers. It is an awful name. Computational mathematics would be significantly better.

The worst part is that most people pursuing a computer science degree actually want to learn software engineering, so most curricula nowadays disappoint both the people who want to become software engineers ("why do I have to learn about formal grammars and automata?") and those who want to become computational mathematicians ("why do I have to do a course on agile?").


Most of our day jobs are closer to engineering, but there is a very theoretical side of computer science, and algorithm design certainly fits there. As you said it has a lot of similarities to math, and there's no consensus on whether math is a science, so I won't comment on the "science-ness" of computer science.

But I can say that mathematics is an ancient "science" which has been studied by great scholars for millennia, yet many of the things we take for granted today are relatively recent: modern set theory was introduced only 150 years ago by Cantor and Dedekind, meaning things like Russell's Paradox [1], Gödel's Incompleteness Theorem [2] and the Peano Axioms [3] were all discovered after that, despite being taught in many (university-level) introductory math courses today. Rings are also relatively young [4]. In comparison, imaginary numbers are about 500 years old.

1: https://brilliant.org/wiki/russells-paradox/

2: https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...

3: https://en.wikipedia.org/wiki/Peano_axioms

4: https://en.wikipedia.org/wiki/Ring_(mathematics)


I think of "computer science" somewhat like "engine science", AKA thermodynamics: they're both intimately tied to practical engineering (hardware/software; steam locomotives) and they're both intimately tied to abstract mathematics (computability, complexity theory; statistics, information).

An easy way to reconcile this is to consider that everything is an engine (~ 'system that transforms energy'); and likewise that everything is a computer (~ a 'system that turns inputs to outputs'). Each field offers a different perspective, on different aspects of, the world :)


To quote my CS professor, “You will notice that most of the algorithms we will cover, and most of those on the white book, are so short and simple that they fit on a single page and usually less. The real world code is longer, but there is a world out there of algorithms which probably aren’t this short and simple but we don’t know them yet.” (1995 or thereabouts)

If anything I think he missed a key component which is that algorithmic improvement is actually only interesting in a few very specific places.

I think pdqsort actually shows this to be true - it uses added complexity to improve on things that really do fit in a few lines of pseudocode.


There is definitely a trade-off between brevity and efficiency/practicality. A classic example is "The Fastest and Shortest Algorithm for All Well-Defined Problems":

> An algorithm M is described that solves any well-defined problem p as quickly as the fastest algorithm computing a solution to p, save for a factor of 5 and low-order additive terms

Spoiler alert, the "algorithm M" is a brute-force search; the "low-order additive terms" in its time complexity come from brute-force searching for provably faster programs; if one is found, we restart with that. All the time spent searching for proofs is independent of the input size (e.g. it depends on "sorting", but not which particular list we've been given); hence it's technically just a constant 'warmup time'!

http://hutter1.net/ai/pfastprg.htm


Efficient O(n log n) sorting was "solved" by John von Neumann in the 1940s; everything since then is more a matter of engineering than science:

https://en.wikipedia.org/wiki/Merge_sort


> How is it that programmers are still discovering state-of-the-art algorithms? I just assumed Einstein-tier people had discovered all the optimal ways to do things by now.

Because it's a moving target. There isn't a fixed 'optimal' sorting algorithm.

We've had worst case O(n log n) sorting algorithms for decades, since at least the '50s, possibly the '40s. pdqsort isn't more 'optimal' in the algorithmic sense than merge sort is, they're both worst/expected O(n log n) algorithms.

If RAM is low-latency and your CPU isn't superscalar, heapsort is going to take less time than quicksort. With high-latency RAM and speculatively executing CPUs, quicksort is going to take less time than heapsort. So every few hardware generations we need to sit down and benchmark everything.

pdqsort is based on the principle that the real world is different from the academic world. There are no frictionless spherical cows in a vacuum. In academic research, we assume that the data has been contrived to break your algorithm, or that it's randomly distributed. In the real world, data is almost never randomly distributed. A lot of the data that gets plugged into your standard library of choice's sorting algorithm is already sorted. Or is sorted in reverse. Or is sorted with one random element plopped onto the end. Or is sorted except for one element in the middle that's out of place. pdqsort is designed to account for that.
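
Those patterns are easy to reproduce if you want to benchmark a sort yourself; a small sketch (the generator names are mine, not from the Go CL):

    package main

    import (
        "math/rand"
        "sort"
    )

    // ascending returns 0..n-1, i.e. already-sorted input.
    func ascending(n int) []int {
        xs := make([]int, n)
        for i := range xs {
            xs[i] = i
        }
        return xs
    }

    // descending returns already-sorted input in reverse order.
    func descending(n int) []int {
        xs := ascending(n)
        for i, j := 0, len(xs)-1; i < j; i, j = i+1, j-1 {
            xs[i], xs[j] = xs[j], xs[i]
        }
        return xs
    }

    // sortedPlusTail is sorted data with one random element appended at the end.
    func sortedPlusTail(n int) []int {
        return append(ascending(n-1), rand.Intn(n))
    }

    // sortedWithOutlier is sorted data with one out-of-place element in the middle.
    func sortedWithOutlier(n int) []int {
        xs := ascending(n)
        xs[n/2] = rand.Intn(n)
        return xs
    }

    func main() {
        for _, gen := range []func(int) []int{ascending, descending, sortedPlusTail, sortedWithOutlier} {
            xs := gen(1 << 16)
            sort.Ints(xs) // wrap this call in a timer to compare sorts per pattern
        }
    }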


> I just assumed Einstein-tier people had discovered all the optimal ways to do things by now.

This is the answer to your question, almost everyone makes the same assumption you do.

There are a surprising number of problems in computer science that no one works on because everyone assumes that other clever people are. Relatedly, people assume the set of possible solutions has been exhaustively explored because there have been no incremental improvements in decades. Or people assume the scope of optimality of a well-known solution is much broader than it actually is.

The vast majority of people in software never test or revisit these assumptions. If you do test these assumptions, you will be surprised by how often you find low-hanging fruit.


In part, the optimum algorithm changes as the hardware changes. In the 1980s, for instance, the CPU was about as fast as a RAM access. Today the CPU is on the order of 200 times quicker than a RAM access, which means algorithms need to be designed to work well with CPU caches to perform well. There is also much MORE memory available, so in some cases it now makes sense to consume more memory to get more speed. Computers often have many cores now as well, which also changes the equation at times.


Hats off to you. I remember doing an algorithms and data structures course a number of years ago and I could barely understand (merge|quick)sort.


Amazing to see a former gg2 player do this :)


Coincidentally, just a few hours ago I read an old HN thread that says pdqsort is the best unstable sorting algorithm.


So the behavior is going to change: sorting becomes unstable, i.e. it doesn't preserve the relative order of equal elements. I'm pretty sure this will affect users.


If you want a stable sort, you should always use https://pkg.go.dev/sort#SliceStable. This PR only modifies the Slice method, which is documented as "not guaranteed to be stable".
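
Roughly, the difference for callers (both functions are in the standard sort package):

    package main

    import (
        "fmt"
        "sort"
    )

    type row struct {
        name string
        age  int
    }

    func main() {
        rows := []row{{"b", 30}, {"a", 20}, {"c", 20}}
        less := func(i, j int) bool { return rows[i].age < rows[j].age }

        // Not guaranteed stable: "a" and "c" (both age 20) may come out in
        // either order; this PR only changes how this variant sorts.
        sort.Slice(rows, less)

        // Guaranteed stable: equal-age rows keep their original relative order.
        sort.SliceStable(rows, less)

        fmt.Println(rows)
    }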


Go already randomizes the order of keys in map iterations to discourage relying on undefined behavior. It would be funny if they randomized the order of equal elements in what would otherwise be a stable algorithm for the same reason.


Currently sort is not guaranteed to be stable.

https://github.com/golang/go/blob/master/src/sort/sort.go#L4...


Is it stable, but not guaranteed to be stable, or is it unstable? If it just happens to be stable, code somewhere will eventually depend on that behavior.


Go has moved in this direction before, when they made map iteration order actually random to match the spec that said that you can't depend on iteration order. IIRC the impl is actually pretty light and only randomizes the starting index, and only generates the random offset once at startup, but it's enough to prevent people from accidentally depending on the order e.g. in their test suite. There might be a bit of disruption, but I expect it will be reasonably well received by most Go users.

There's a debate/lesson in here about when implementations should change to actually take up all the space claimed by the spec.
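
A quick way to see the effect (the exact orders vary from run to run, which is the point):

    package main

    import "fmt"

    func main() {
        m := map[string]int{"a": 1, "b": 2, "c": 3, "d": 4}

        // Iterating the same, unmodified map several times can visit the keys
        // in a different order each time; Go deliberately promises no order.
        for i := 0; i < 3; i++ {
            for k := range m {
                fmt.Print(k, " ")
            }
            fmt.Println()
        }
    }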


I kinda like that, because a lot of people will not read the spec but just try something first. If their map iteration is the same as e.g. insertion order, they will shrug and assume it's correct and move on.


Go's decision here also defeats the next lazy thing people do, which is they run it once, and then they copy-paste whatever the results are as "correct" into their test case. With the randomisation their next run is also randomised and this "correct" answer will now be wrong so their tests fail again.

As a result most people will go "Oh, I guess I should actually test the outcome I care about" (good) or "Oh, testing is hard, I'll skip it" (bad, but at least not a problem for the team maintaining the standard library). Very few will say to themselves "I bet I can defeat the randomisation somehow with a different incorrect test design" because humans are lazy.


If you ignore the stated guarantees of the API and depend on implementation details instead, then it’s on you? I might understand if it were unstated, but it explicitly states it’s not.

Otherwise the only way to have a non-stable sort function that currently happens-to-be stable would have to include a randomized shuffle afterwards to ensure no one goes around depending on it…

It would be equivalent to assuming the golang ABI was stable because it survived unmodified for a couple versions, despite it being explicitly not. And then freezing the ABI to maintain a guarantee it explicitly didn’t want to maintain.


> if you ignore the stated guarantees of the API and depend on implementation details instead, then it’s on you?

In theory, yes. In practice, if an OS or library update uncovers bugs in many programs, that update won't get favorable reviews.

That’s the reason behind Linux’s “we do not break userspace!” rule (https://linuxreviews.org/WE_DO_NOT_BREAK_USERSPACE), and also why Microsoft Windows has layer upon layer of APIs, a copy of the Windows 95 memory manager (https://mcpmag.com/articles/2010/07/27/trip-down-memory-lane...), why large databases support character sets that probably haven’t been used anywhere since before many HN readers were born (https://docs.oracle.com/goldengate/1212/gg-winux/GWUAD/wu_ch...), etc.

(Apple has a totally different viewpoint on this. That's one reason it took a long time to really get into businesses.)


Linus' rule also has disadvantages as it blocks real innovation in the Linux ecosystem. Granted, Desktop Linux doesn't need another Python 2 -> 3 or X.org -> Wayland deathmarch...


> Linus' rule also has disadvantages as it blocks real innovation in the Linux ecosystem.

Block real innovation like what?

> Granted, Desktop Linux doesn't need another Python 2 -> 3 or X.org -> Wayland deathmarch...

That's all userspace? Userspace doesn't care if it breaks. That's why userspace sucks. You need to keep compatibility if you want a stable base to build something on. That's why we have (sometimes silly) standards like SUS or POSIX.

"we do not break userspace" at least garantees that you need to fix only the userspace, and not the whole thing all the time.


Breakages of userspace are almost always unintended consequences. I don't have examples to the contrary, but I believe that intentional breakages and redesigns are almost unthinkable because of Linus's rule.

Significant changes in the kernel already often need years to land because of the multitude of configurations and interfaces that have to be supported. And even longer to deprecate and remove. Not necessarily a bad thing, just the price of success.

Large-scale breakages like the Wayland transition affect the whole Desktop Linux ecosystem, and are just an example of what mayhem fundamental changes can cause.


While I totally agree with the sentiment it is nice when language developers go out of their way to not break stuff, even when they don't have to. For example the golang team have made efforts not to break certain usages of the unsafe package in the past, despite the package offering no guarantees about stability.

On a similar note, I'm sure there was a suggestion to artificially make the previous golang sort function unstable so that users wouldn't come to rely on stable behaviour where it wasn't guaranteed. I quite like the idea; start the fight against Hyrum's law!


Happened in JS land. IE was stable and sites depended on it. FF switched to match IE. Chrome switched to match IE and FF. And then finally they put it into the spec.


> IE was stable

[citation needed]

This is the wrong wording; IE had a decently large installed/captive user base. It was never particularly stable.


The former. The spec just says the result will be sorted, it just happens that the current implementations use a method that results in stable positioning of items with equal sort value.

There aren't a lot of places where stability matters, where it does the programmer should hopefully know and not have relied on an undefined behaviour… The most likely impact, I think, will be in UIs where the difference won't really matter but could be quite noticeable.

Do note though that "unstable" can mean two different things. A single-threaded sort will produce the same result for the same input, and many parallel sorts have this property too due to the way they coordinate their threads and collate their results, so the output order of equal items is arbitrary but not random. Some parallel sorts do order items with the same sort key more randomly, though, due to being sensitive to timing differences caused by uncontrolled factors (the OS's scheduler and whatever else it is dealing with, differing IO delays if the sort is not in-memory, different numbers of threads if comparing run results with those done on other processor architectures, …). IIRC pdqsort is the former of those two types of unstable.
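
One of the few places stability visibly matters is multi-key ordering, e.g. sorting UI rows by one column and then re-sorting by another. A small illustration (data and field names are made up for the example):

    package main

    import (
        "fmt"
        "sort"
    )

    type person struct {
        name string
        dept string
    }

    func main() {
        people := []person{
            {"carol", "ops"}, {"alice", "dev"}, {"bob", "ops"}, {"dave", "dev"},
        }

        // Order by name first, then stably re-order by department: within each
        // department the alphabetical order survives only because the second
        // sort is stable.
        sort.Slice(people, func(i, j int) bool { return people[i].name < people[j].name })
        sort.SliceStable(people, func(i, j int) bool { return people[i].dept < people[j].dept })

        fmt.Println(people) // [{alice dev} {dave dev} {bob ops} {carol ops}]
    }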


Good question. I checked the 1.18.1 code, and Sort() and Stable() call different functions. The former calls quick sort, the latter something else (groups of 20 items with insertion sort, then symmerge?). Quick sort's traditional implementation isn't stable, so I'd be surprised if Sort() would actually be.


Perl's move to actively randomize hash/dict ordering, where previously it'd been sort of stable but not guaranteed, affected a lot of codebases that had incorrectly relied on that stability.

I'm just noting that users are necessarily flawed, and moving from "stable but not guaranteed to be" to "actually unstable" will be a breaking change for some real world codebases.


This exact issue is why Go actively randomises map iteration too.

Seems really odd that the same team would have used a stable sort and just documented it as "not guaranteed to be stable".


That’s not the reason. Library developers aren’t changing code to annoy or educate their users.

The reason is that a hash table with a fixed hash function may be used for denial of service attacks. If an attacker knows how a hash map maps keys and can control the keys being inserted, they can insert thousands of keys that all map to the same hash, effectively turning the hash map into a list or vector that has to be searched linearly (on both lookups and insertions)

And that’s a real problem. https://lwn.net/Articles/474912/ says:

“That's where web frameworks come into play. Those frameworks helpfully collect up all of the POST form data that gets sent to a particular web page into a dictionary that gets handed off to the application for processing. The amount of data that the frameworks will allow is typically rather large (e.g. 1MB), so an attacker can create tens of thousands of form entries in a single HTTP request. Because they all collide, those entries can take tens of seconds to a few minutes to process on an i7 core according to the advisory [PDF] released by Wälde and Klink.”

If you manage to insert 10,000 keys with identical hashes, that’s about 10,000²/2 key comparisons.
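
To make the blow-up concrete, here is a toy chained hash table with a fixed, attacker-predictable hash. All names are illustrative; this is not how any real runtime implements maps:

    package main

    import "fmt"

    // toyMap is a chained hash table with a fixed, easily predictable hash.
    type toyMap struct {
        buckets [][]string
        probes  int // key comparisons performed, to expose the quadratic blow-up
    }

    func newToyMap(n int) *toyMap { return &toyMap{buckets: make([][]string, n)} }

    // weakHash stands in for any fixed hash an attacker can compute offline.
    func (m *toyMap) weakHash(k string) int {
        h := 0
        for i := 0; i < len(k); i++ {
            h += int(k[i])
        }
        return h % len(m.buckets)
    }

    func (m *toyMap) insert(k string) {
        b := m.weakHash(k)
        for _, existing := range m.buckets[b] {
            m.probes++
            if existing == k {
                return
            }
        }
        m.buckets[b] = append(m.buckets[b], k)
    }

    // collidingKeys builds 2^n distinct strings with identical byte sums, so
    // under weakHash they all land in the same bucket.
    func collidingKeys(n int) []string {
        keys := []string{""}
        for i := 0; i < n; i++ {
            next := make([]string, 0, 2*len(keys))
            for _, k := range keys {
                next = append(next, k+"ac", k+"bb") // 'a'+'c' == 'b'+'b'
            }
            keys = next
        }
        return keys
    }

    func main() {
        m := newToyMap(1 << 12)
        keys := collidingKeys(12) // 4096 keys, one shared bucket
        for _, k := range keys {
            m.insert(k)
        }
        // Roughly len(keys)^2/2 comparisons, instead of ~len(keys) with a
        // randomized (per-process seeded) hash.
        fmt.Println("keys:", len(keys), "comparisons:", m.probes)
    }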


> That’s not the reason. Library developers aren’t changing code to annoy or educate their users.

They absolutely are.

> The reason is that a hash table with a fixed hash function may be used for denial of service attacks. If an attacker knows how a hash map maps keys and can control the keys being inserted, they can insert thousands of keys that all map to the same hash, effectively turning the hash map into a list or vector that has to be searched linearly (on both lookups and insertions)

That has nothing to do with what I'm talking about.

Independent from hash randomisation, Go will randomise the iteration of hashmaps, such that iterating the same hashmap twice, within the same process, without having modified it in any way, will yield different results. Currently this is done by offsetting the starting point of the iteration by a random amount.

For some reason they dropped this explicit mention when https://go.dev/blog/go-maps-in-action was moved over to https://go.dev/blog/maps, but it used to be there:

> Since the release of Go 1.0, the runtime has randomized map iteration order. Programmers had begun to rely on the stable iteration order of early versions of Go, which varied between implementations, leading to portability bugs.

You can trivially observe this behaviour by just iterating the same map multiple times. With mere hash randomisation, even if hash randomisation was done on a per-map basis by iterating the same map you'd observe the same order, but in Go you don't: https://go.dev/play/p/3HZBv2WHKcT

As you can see the items are always in the same relative order, but the starting point of the iteration varies.

You can also see this in the initialisation of the map iterator, it's even labelled with a nice little comment: https://github.com/golang/go/blob/master/src/runtime/map.go#...


This discussion is already dead, but for posterity: The real motivation was actually a performance improvement, as can be seen at https://github.com/golang/go/commit/55c458e05f35d0d5d539107d...


Thanks! Didn’t know that. That’s an interesting design choice.


Go's sort was already unstable.


oh, that's pretty cool


Those users deserve what they get; not only does Sort() explicitly not guarantee a stable sort, but the library provides --- directly next to Sort() in the package --- a Stable().


I wanted a stable sort in one of my projects, so I implemented merge sort.

Then I had a larger input (around 100k elements) and wondered why it took a few minutes to print the output when the calculation itself finished in a second. Well, it was the merge sort being called after the calculation completed, to print the results sorted. A double fail, because in this case the calculation already produced sorted output.

Now I sort a list of indices with quicksort instead.
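
The index-sorting trick looks roughly like this with the standard library, and breaking ties on the index even recovers a stable order from an unstable sort:

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        data := []float64{3.2, 1.5, 9.9, 1.5}

        // Sort a slice of indices instead of moving the (possibly large) records.
        idx := make([]int, len(data))
        for i := range idx {
            idx[i] = i
        }
        sort.Slice(idx, func(a, b int) bool {
            if data[idx[a]] != data[idx[b]] {
                return data[idx[a]] < data[idx[b]]
            }
            return idx[a] < idx[b] // tie-break on index: the result is stable
        })

        for _, i := range idx {
            fmt.Println(i, data[i]) // 1 1.5 / 3 1.5 / 0 3.2 / 2 9.9
        }
    }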



Nothing changes. If you read the diff, you will see that they used Quicksort before. And vanilla quicksort is unstable as well.


See also this 2017 thread on pdqsort: https://news.ycombinator.com/item?id=14661659

Direct link to the GitHub home of pdqsort (in C++), which has some performance comparisons: https://github.com/orlp/pdqsort


Good for them. It’s good to improve on the standard library sort, but if you’re looking for performance you probably aren’t just blindly using the standard library sort anyway. You try a few things until you find the one that is best for your situation.


> if you're looking for performance you probably aren't just blindly using the standard library sort anyway

That's so interesting - I think I have the opposite response. I figure that unless I'm doing something very specific that doesn't fit well with the stdlib sort, it's probably good enough. Those language folks usually know their stuff; who am I to try something else?


I agree that it is relatively rare for sort to be your bottleneck. What I meant is that in that relatively rare circumstance most people are going to look outside the standard library and try a few different sorts.

My comment also had a more dismissive tone than I intended.


Are you looking at that from a perspective of implementing your own sort if you’re not using the stdlib one? From my perspective I can likely install and try out a drop-in replacement library in about 60 seconds. So if sorting is a bottleneck then why wouldn’t I?


This is true, but it’s definitely still a very nice win. For things where either performance isn’t very critical or there just isn’t time to micro-optimize, it’s nice when the standard library provides best-in-class general solutions for problems. The philosophy is sort of the same as that of a rising tide lifts all boats. The difference it makes could still be a quality of life improvement without the element of necessity.

Overall, I think the Go standard library does pretty good here, and yet it could still do better, and it’s definitely nice to see it improve.


This is my first time reading about pdqsort. I haven't read through the paper, but it's a neat concept, sort of a hybrid that can reduce to a radix sort-like behavior. The cost is instability and slightly worse behavior for, I suppose, higher-entropy inputs.

It's interesting to think that sorting has a maximum "total complexity", distributed across the whole input space. And this hybrid algorithm kind of shapes that distribution, squeezing efficiency out of one part of the space, and into another, more practically meaningful part.


Wow. I did not expect such a simple improvement to have such a big effect.


IMO, these changes do not look that simple.


This is a small thing, but I love that it uses Heapsort as its fallback. Heapsort is such a pretty algorithm and I always like seeing it used in the wild.
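
For anyone who hasn't seen it in a while, a textbook heapsort (an illustrative sketch only, not the fallback code from the CL):

    package main

    import "fmt"

    // siftDown restores the max-heap property for the subtree rooted at i
    // within a[:n].
    func siftDown(a []int, i, n int) {
        for {
            child := 2*i + 1
            if child >= n {
                return
            }
            if child+1 < n && a[child+1] > a[child] {
                child++
            }
            if a[i] >= a[child] {
                return
            }
            a[i], a[child] = a[child], a[i]
            i = child
        }
    }

    // heapsort builds a max-heap, then repeatedly swaps the maximum to the end.
    func heapsort(a []int) {
        n := len(a)
        for i := n/2 - 1; i >= 0; i-- {
            siftDown(a, i, n)
        }
        for end := n - 1; end > 0; end-- {
            a[0], a[end] = a[end], a[0]
            siftDown(a, 0, end)
        }
    }

    func main() {
        xs := []int{5, 2, 9, 1, 5, 6}
        heapsort(xs)
        fmt.Println(xs) // [1 2 5 5 6 9]
    }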


This solution was developed by a SWE 2 candidate in the FAANG on-site interview loop. /s

(Just joking but some interviews indeed feel that way)


And the candidate was ultimately not hired by the said FAANG.


"This is clever, but I don't think it will work in practice and I don't believe it is O(n * log(n)) and will assert it is O(n*n) until I die."


"It outperforms any of the solutions I've seen but the code isn't as readable, next!"


“Sorry Dr. Knuth, we’re looking for someone with more algorithm experience.”


I get the sarcasm [0], but: Having watched a few of Dr. Knuth's lectures [1], I'm convinced he'd wipe the floor with FAANG interviewers.

[0] I mean, this has happened for real: Bram Cohen (BitTorrent co-inventor) and a co-developer of brew (the package manager for macOS) couldn't get past Google interviews: https://news.ycombinator.com/item?id=9695102

[1] https://hn.algolia.com/?query=knuth%20christmas


Programming MIX? Do you have any relevant experience, Dr. Knuth?


"Yes, yes. I meant PRACTICAL algorithm experience."


You joke, but real world problems have been solved in analogous scenarios, such as homework assignments where the author simply didn't know the problem was yet unsolved and banged their head until they got it.


> In 1939, a misunderstanding brought about surprising results. Near the beginning of a class, Professor Neyman wrote two problems on the blackboard. Dantzig arrived late and assumed that they were a homework assignment. According to Dantzig, they "seemed to be a little harder than usual", but a few days later he handed in completed solutions for both problems, still believing that they were an assignment that was overdue.[4][6] Six weeks later, an excited Neyman eagerly told him that the "homework" problems he had solved were two of the most famous unsolved problems in statistics.

https://en.wikipedia.org/wiki/George_Dantzig

Does anybody have more examples?


Not exactly this as far as official sources say, but the Karatsuba algorithm (long multiplication in n^1.6) was the result of Kolmogorov (already a foundational figure, but only beginning his work in what would come to be called algorithmic complexity) starting a seminar with the goal of showing that long multiplication can't be done in less than n^2 time (because nobody had been able to do it faster in thousands of years), and a student (Karatsuba) coming in after a couple of meetings to prove him wrong (the catch being that the constant factors make the result not worth it for values of n people historically worked with by hand).

(The algorithm boils down to the observation that (Xb + Y)(Zb + W) = XZ b^2 + [(X + Y)(Z + W) - XZ - YW] b + YW, for a total of three unique base-b products, plus what is nowadays a routine exercise to see that calculating these products recursively in base b/2 yields complexity n ^ log_2 3.)
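
The identity is easy to sanity-check numerically, e.g. with base b = 10^4 and example values for X, Y, Z, W:

    package main

    import "fmt"

    func main() {
        // Check (Xb + Y)(Zb + W) = XZ b^2 + [(X+Y)(Z+W) - XZ - YW] b + YW,
        // i.e. three unique base-b products instead of four.
        const b = 10000
        X, Y, Z, W := 12, 3456, 78, 9012

        xz := X * Z
        yw := Y * W
        mid := (X+Y)*(Z+W) - xz - yw // equals XW + YZ, but costs only one extra product

        viaKaratsuba := xz*b*b + mid*b + yw
        direct := (X*b + Y) * (Z*b + W)
        fmt.Println(viaKaratsuba == direct) // true
    }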


Reminds me of how I once had a math test in college about simplifying trigonometric expressions (cos(a + b) kinds of formulas) and I made a cheat sheet with all the formulas because I couldn't remember them all. Obviously I forgot one formula, and as fate would have it, it was on my test. Took me about 15 minutes, but eventually I derived it and put it all on the test paper. My gosh, the math professor was proud of me; probably one of the top 10 moments of my life when he kept boasting about it as he announced the test results. Many years later I kept thinking I should tell him the full story, but alas, he died not too long ago. RIP 'Stakan Stakanovich'



Huffman coding was invented when Huffman's professor, Robert Fano, gave a class the option of either taking a regular final exam or solving the optimal encoding problem which (unknown to the students) he and Shannon had been failing to solve for some time.


In a class of his that I took, he told us that he even crumpled up the paper with the solution and threw it away, then went back, retrieved it, and realized it was right.


Not quite the same scenario, but Dantzig had the roles reversed on him by John von Neumann. From Wikipedia

> When George Dantzig brought von Neumann an unsolved problem in linear programming "as I would to an ordinary mortal", on which there had been no published literature, he was astonished when von Neumann said "Oh, that!", before offhandedly giving a lecture of over an hour, explaining how to solve the problem using the hitherto unconceived theory of duality.


Not quite the same thing...

> George Pólya, whose lectures at ETH Zürich von Neumann attended as a student, said "Johnny was the only student I was ever afraid of. If in the course of a lecture I stated an unsolved problem, the chances were he'd come to me at the end of the lecture with the complete solution scribbled on a slip of paper."

> https://en.wikipedia.org/wiki/John_von_Neumann#Cognitive_abi...


Does anyone know what the two unproven theorems he proved were?


They appear to be published in these two articles, as described and cited by Snopes[1]:

Dantzig, George B. “On the Non-Existence of Tests of ‘Student’s’ Hypothesis Having Power Functions Independent of Sigma.” Annals of Mathematical Statistics. No. 11; 1940 (pp. 186-192).

Dantzig, George B. and Abraham Wald. “On the Fundamental Lemma of Neyman and Pearson.” Annals of Mathematical Statistics. No. 22; 1951 (pp. 87-93).

[1] https://www.snopes.com/fact-check/the-unsolvable-math-proble...


I was wondering about the morality of using candidates' solutions in your products. Does the hiring company own the test results? Does the candidate? What if the candidate gets hired, who owns the solution, then?

The candidate does not get paid to solve the interview test, but it could be seen as a collaborative effort between the company and the candidate, since both put time and engineering resources into it.


It is immoral to use candidate answers in your product, assuming you don’t hire them.

If you do hire them then I guess it becomes fine, but otherwise definitely immoral.

The problem is that we’re fleshy meatbags and easily influenced. It’s possible we internalise answers to interview questions and then “spontaneously” come up with them later, without _knowingly_ stealing the idea.

This happens sometimes in comedy, and the comedian apologises. You can usually tell the difference in comedy between plagiarism and accidental theft because the joke is delivered much differently but has the same core. It would be harder with interview answers.


If you’re hiring, and your candidate produces a solution to a real problem, and your interviewers actually want to use their solution, that person is getting an offer unless they were absolutely obnoxious and terrible to work with. Failure to do so is just bad recruitment.

Or, put differently: if I’m the candidate and they end up “stealing” my solution, that’s a small price to pay to avoid working in what is likely a terrible environment.


That assumes the interviewer is ethical. In my experience, most people in SW do not have a strong cultural basis for ethical behavior of this sort.


> If you do hire them then I guess it becomes fine

Not that anything will stop companies from doing this, and not that I think it ever really happens, but for it to be morally OK I think you’d have to both get their permission and compensate them fairly.

Before I work for you, my ideas are mine even if I write them on your whiteboard.


Sure, but if I ask you an interview question and your answer is good enough that I want to use it: you'll probably be the one implementing the solution.

Why would I trust the answer more than the person?


>Why would I trust the answer more than the person? //

Why wouldn't you? The truth of statements is not a function of the people making them (though of course that can often be a strong indicator of truth).


The real question is, what on earth is a candidate doing in a few hours of interview processes that's worth stealing?

I've had candidates use useful Python libraries that I didn't know about in tech tests before, but that's pretty much the extent to which I've learned anything helpful when recruiting.


About coding maybe, but I find interviewing really informative to learn about what other companies are doing and how they have architected stuff.


It's not a work for hire and there is no copyright assignment so copyright will remain with the candidate.

However some whiteboard scribbles are not the same as working code, so it is likely easy to make an implementation that does not infringe in most cases. Not that it really matters because it's highly unlikely that anyone is going to sue over it.


There is no copyright on algorithms.

To ask interviewees a question about an actual issue you have and see what solutions they come up with is clever. You get more input for free and you may find someone you want to hire.

I don't see a problem as long as you are really looking to hire someone and not advertising fake job openings.


Is an algorithm copyrightable?

If it can be regarded as intellectual property, does that mean that the company would be infringing on the candidate's rights?


XOR was certainly patented for use in displaying mouse cursors.


Go is a great language; reading the code in this PR is so easy.


On the other hand, without generics, they are literally using a text templating language to generate the code.


Generics alone wouldn't be sufficient to replace the code generator, you'd also need specialization and/or overloading, or at least some significantly different syntax allowed for comparing primitive types.


I think it probably would in this case. They appear to be substituting just a comparison function. They could instead take a generic type argument that had the desired functions defined. Then you would just need to write the top level wrappers with the different names.

In fact they could probably have done this with a traditional Go interface if they weren't worried about missed optimizations.
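
A rough sketch of that idea with Go 1.18 generics (an illustrative constraint and insertion sort, not the standard library's actual code, and as the sibling notes it may not optimize the same way under gcshape stenciling):

    package main

    import "fmt"

    // Ordered lists the primitive types that support <, mirroring what the
    // generated code specializes for.
    type Ordered interface {
        ~int | ~int8 | ~int16 | ~int32 | ~int64 |
            ~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 | ~uintptr |
            ~float32 | ~float64 | ~string
    }

    // insertionSort is the small-slice building block written once, generically,
    // instead of once per generated variant.
    func insertionSort[T Ordered](a []T) {
        for i := 1; i < len(a); i++ {
            for j := i; j > 0 && a[j] < a[j-1]; j-- {
                a[j], a[j-1] = a[j-1], a[j]
            }
        }
    }

    func main() {
        xs := []int{3, 1, 2}
        ys := []string{"pear", "apple"}
        insertionSort(xs)
        insertionSort(ys)
        fmt.Println(xs, ys) // [1 2 3] [apple pear]
    }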


I fear the generic version would miss on those same optimizations, and that is the real reason it's written like it is. Remember, Go generics use gcshape stenciling, not C++ style templates.


I believe OP is being sarcastic.


[flagged]


You created a brand new HN account to shitpost about Rob Pike? I'm not even mad, that's amazing.


To be honest, this is not a shitpost in my opinion. He might really have met Pike and Pike really might have said that some day.


If you loved that you’ll love my story about meeting Rob Pike in an LA grocery store.


Go on...


I don't know, the name "hc based hacker" doesn't help


Does this bar serve anything other than everclear?


The idea is not bad but it sounds a bit extreme... Indeed, choosing the right sorting algorithm is important and depends on the problem one is solving. But, 99.99% of the time, a general, all-round algorithm will do the job within 99% of the best one's performance...

But when it fails, well, it will fail big... In my experience, some devs (me included :-)) run tests with a reduced dataset 'cos they can iterate faster, and then go to production without testing on a real-size dataset. All of a sudden, the O(n²) edge case becomes, well, a problem :-)


Reminds me of the time some dev[0] accidentally recreated the traveling salesman problem, created a unit test with 5 items in it, said "looks good", and pushed it.

[0] - me


That's if you ignore linear/non-comparison algorithms like radix sort or bucket sort. Those are much harder to adapt to your particular problem, but should result in significantly lower run times when properly adapted.


Sadly I can't disagree for myself either! Doing diligent performance testing is hard to prioritise.



