Sorting Algorithm Cheat Sheet (interviewcake.com)
229 points by gameguy43 on April 26, 2019 | 49 comments



> Radix sort looks fast, with its O(n) worst-case time complexity. But, if you're using it to sort binary numbers, then there's a hidden constant factor that's usually 32 or 64 (depending on how many bits your numbers are). That's often way bigger than O(lg(n)), meaning radix sort tends to be slow in practice.

The constant factor in radix sort isn't the number of bits, it's the number of digits. Since one typically sorts on a computer using 8-bit 'digits', that's a constant factor of 4 or 8 (not 32 or 64).
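
For what it's worth, here's a minimal sketch of what byte-sized digits look like in an LSD radix sort (assuming non-negative 32-bit keys, so four passes); the helper name and bucket count of 256 are just illustrative:

    def radix_sort_bytes(nums, key_bits=32):
        """Minimal LSD radix sort using 8-bit 'digits' (non-negative ints only)."""
        for shift in range(0, key_bits, 8):          # one pass per byte: 4 passes for 32-bit keys
            buckets = [[] for _ in range(256)]       # one bucket per possible byte value
            for x in nums:
                buckets[(x >> shift) & 0xFF].append(x)
            nums = [x for bucket in buckets for x in bucket]  # stable concatenation
        return nums

    print(radix_sort_bytes([170, 45, 75, 90, 802, 24, 2, 66]))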


It's the number of bits. You can change to bytes, but you are just hiding another constant factor of 8 by doing that.


If you have a theoretical computer with single-bit registers, sure. Quicksort is also quite slow on such a computer.


And this is why O() notation drops constants. O(bits) and O(bits/8) (i.e. bytes) are the same thing. It's also worth pointing out that in standard comparison sorts, the comparison itself is technically linear in the number of bits too, but is treated as constant. I get why it's dropped, but it's worth knowing.


Exactly. You can try to compare two bytes, but really you are comparing 8 bits, if you think about it algorithmically. Maybe those steps are 100% parallel, but that's irrelevant to the big O. You are measuring the number of steps, and it's intended to be hardware independent.


Big O is hardware independent.


Of course, in the real world, for problems where sorting is actually the bottleneck (and not just something you don't want killing your app's performance), you end up with things like Timsort that destroy all of these in practice.


I'd say it's the other way around: you use whatever your language provides as default (which is probably something like Timsort) until it becomes a bottleneck and then you analyse the data and write something that is specific to your use case that will blow Timsort out of the water.

Timsort is your first choice.


In practice this boils down to, "you use whatever your language provides".

If sorting became a bottleneck, I'd literally look at every possible approach to speed it up that I could find before considering improving the sorting algorithm. Starting with trying to figure out how to throw around less data while trying to solve the problem.


As someone who spent a great deal of time on sorting algorithms when I was in academia (I even published one), I completely agree with this sentiment. Sorting algorithms are tricky to get right and there are a lot of edge cases, so even when you (think you) know what you are doing, you still get it wrong. Been there many times, and it always causes some frustration.

Do you really want your business-critical sorting to rely on something that only might work?

Sorting algorithms are like crypto, don't roll your own if you can avoid it.


For a problem I faced, I ended up being able to produce very small chunks of already-sorted data and use the merge part of the stable sort algorithm (it was provided).

Not only that, I ended up being able to do it lazily, making the user experience much better.
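
For anyone curious, that lazy merge pattern can be sketched with Python's heapq.merge, which yields elements on demand instead of sorting everything up front (the chunk contents here are made up for illustration):

    import heapq

    # Already-sorted chunks, e.g. produced incrementally by some upstream process.
    chunks = [
        [1, 4, 9],
        [2, 3, 10],
        [0, 5, 6],
    ]

    # heapq.merge returns a lazy iterator: elements are produced as they're consumed,
    # so the downstream consumer (e.g. the UI) can start working immediately.
    for value in heapq.merge(*chunks):
        print(value)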


And I have been forced to write sorting algorithms from scratch as well. But it was truly a last resort. In my case I had to work with data that literally did not fit in uncompressed form on the computer that I had to work with. So I had to write a disk sort where data was kept in compressed form. (I used merge sort for this.)
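
Not the parent's actual code, of course, but a toy sketch of that kind of approach: sort fixed-size runs in memory, write each run out gzip-compressed, then lazily merge the runs (run size and file layout are made up for illustration):

    import gzip, heapq, os, tempfile

    def external_sort(numbers, run_size=1000):
        """Toy external merge sort: compressed sorted runs on disk, then a k-way merge."""
        run_dir = tempfile.mkdtemp()
        run_files = []
        for i in range(0, len(numbers), run_size):
            run = sorted(numbers[i:i + run_size])        # sort one run in memory
            path = os.path.join(run_dir, f"run{i}.gz")
            with gzip.open(path, "wt") as f:             # keep the run compressed on disk
                f.write("\n".join(map(str, run)))
            run_files.append(path)

        def read_run(path):
            with gzip.open(path, "rt") as f:
                for line in f:
                    yield int(line)

        # k-way merge of the compressed runs, lazily.
        return list(heapq.merge(*(read_run(p) for p in run_files)))

    print(external_sort([5, 3, 8, 1, 9, 2, 7], run_size=3))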

However the fact that sometimes we are forced to our last resort doesn't change the fact that we should be aware that it really is a last resort.


Is there any evidence of timsort being superior in practice? I've been looking for benchmarks, but on synthetic ones with randomized data, both mergesort and quicksort handily beat timsort. The belief in timsort's superiority seems to me to be more about Tim Peters being a very well-respected developer than about performance data.


On completely random data, Timsort is just a mergesort with some extra bookkeeping. Its advantage is on arrays that have sorted subarrays of non-trivial size, which is often the case with real data.
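
To make that concrete, a rough sketch of the run-detection idea (this is a simplification, not CPython's actual implementation, which also reverses descending runs, enforces a minimum run length, and gallops during merges):

    def find_runs(a):
        """Split a list into maximal non-decreasing runs (a simplification of what Timsort does)."""
        runs, start = [], 0
        for i in range(1, len(a)):
            if a[i] < a[i - 1]:          # run ends where the order breaks
                runs.append(a[start:i])
                start = i
        runs.append(a[start:])
        return runs

    # Mostly-sorted data collapses into a handful of long runs,
    # so only a few cheap merges are needed afterwards.
    print(find_runs([1, 2, 3, 7, 8, 4, 5, 6, 9]))  # [[1, 2, 3, 7, 8], [4, 5, 6, 9]]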


Also no mention of introsort (C++ std::sort), which uses insertion sort for small arrays, detects when quicksort is falling too far towards its O(n^2) worst case, and punts to heapsort when it needs to.
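
Roughly, the idea looks like this (a sketch, not the actual libstdc++/libc++ code; the threshold of 16 and the 2*log2(n) depth limit are just typical choices):

    import heapq, math

    def introsort(a, lo=0, hi=None, depth=None):
        """Rough introsort sketch: quicksort with a depth limit, heapsort fallback,
        and insertion sort for small ranges. Sorts a[lo:hi] in place."""
        if hi is None:
            hi = len(a)
        if depth is None:
            depth = 2 * int(math.log2(len(a))) if a else 0
        if hi - lo <= 16:                       # small range: insertion sort
            for i in range(lo + 1, hi):
                x, j = a[i], i
                while j > lo and a[j - 1] > x:
                    a[j] = a[j - 1]
                    j -= 1
                a[j] = x
            return
        if depth == 0:                          # partitioning has gone badly: heapsort this range
            chunk = a[lo:hi]
            heapq.heapify(chunk)
            a[lo:hi] = [heapq.heappop(chunk) for _ in range(len(chunk))]
            return
        pivot = a[(lo + hi) // 2]               # simplistic pivot choice
        i, j = lo, hi - 1
        while i <= j:                           # Hoare-style partition
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i, j = i + 1, j - 1
        introsort(a, lo, j + 1, depth - 1)
        introsort(a, i, hi, depth - 1)

    data = [7, 2, 9, 4, 1, 8, 0, 5, 3, 6] * 5
    introsort(data)
    print(data == sorted(data))  # True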


You can pry my stable sort algorithms out of my cold, dead hands.


Heh, I think stable sorts are great. However, you don't always need one, and you do pay a little bit for it.


Add the factors to the sort key? Every sort is stable if you have some vague ideas about what to sort on.


What?


I think it's important to mention that counting sort and radix sort are not "true" comparison-based sorts: they have additional requirements on the input data.


Indeed. Usually if you get a question such as "you have an array of 1 through n", the immediate thought should be "why is it just up to n and not arbitrary numbers?" From there, you know you can get a linear sort by leveraging what you know about the range of these numbers.
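
For example, a minimal sketch of that linear-time answer, assuming the values are known to lie in 1..n:

    def sort_one_to_n(nums):
        """Linear-time counting sort for values known to be in 1..len(nums)."""
        n = len(nums)
        counts = [0] * (n + 1)          # counts[v] = how many times v appears
        for v in nums:
            counts[v] += 1
        result = []
        for v in range(1, n + 1):       # walk the value range in order
            result.extend([v] * counts[v])
        return result

    print(sort_one_to_n([3, 1, 4, 5, 2, 2, 5]))  # [1, 2, 2, 3, 4, 5, 5]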


Who is still asking candidates to talk about sorting algorithms?

It is just trivia and has little to no bearing on whether the candidate can actually do the job (unless the job relates to the runtime of sorting algorithms, of course).


There are two situations where I wouldn't harass my coworkers for asking sort-related questions.

One, I've asked. Not implementing sort, but implementing complex compare() functions for sortable values. There are many things I can accept as flaws in a coworker, but inability to deal with 3 conditionals at once isn't one of them. And I've had people (interviewed or, worse, worked with) who can't sort by last name, first name, age.
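
For concreteness, a quick sketch of that kind of compare (assuming hypothetical (last_name, first_name, age) tuples); in Python the idiomatic route is a key tuple, and the explicit compare() is the same three conditionals spelled out:

    from functools import cmp_to_key

    people = [
        ("Smith", "Alice", 40),
        ("Jones", "Bob", 25),
        ("Smith", "Alice", 31),
        ("Smith", "Aaron", 50),
    ]

    # Idiomatic version: a key tuple.
    by_key = sorted(people, key=lambda p: (p[0], p[1], p[2]))

    # The explicit compare() the interview question is really asking about.
    def compare(a, b):
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1    # last name first
        if a[1] != b[1]:
            return -1 if a[1] < b[1] else 1    # then first name
        return (a[2] > b[2]) - (a[2] < b[2])   # then age

    by_cmp = sorted(people, key=cmp_to_key(compare))
    print(by_key == by_cmp)  # True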

The other is the "I gave you code, now tell me what's wrong with it" question, but that's not writing a sort algorithm, that's detecting that someone needs a better one.

Personally, I'd be all too happy if we stopped interviewing people expecting answers that would never pass a code review. I'd say it's hypocritical, but it's not even that. It's just wrongheaded, and gives bad information to the candidate.


I see your point, and usually these work best if the interviewer can show that such a thing is actually needed. For instance, many jobs might require sorting data that cannot fit in memory (think of loading a massive file that you want to parse, where sorting might help you process it), in which case knowing merge sort (the on-disk variety) can help. So yes, the onus is on the interviewer, I think.


I'm less worried about understanding of sorting algorithms than I am about understanding of patterns. With any team at scale, my hardest challenge is about patterns, and more importantly why a pattern is used. The efficiency of a sorting algorithm can be refactored later. Is the implementation of the sorting composable? I'll hire without question someone who understands patterns and composability over someone who can cite an ideal solution. Seriously, when does favoring complexity over readability really mean anything for consumer value? I understand that in a very limited scope computational efficiency is imperative, but let's be pragmatic and say that good programming and an understanding of best practices trump perfect solutions.


I think the intention is that the job will relate to the running time of other algorithms and that sorting things is a toy problem you can use to test whether they know anything about that subject in general.

There may be better ways to assess that, though. For example, you could ask them to give one or two examples each of algorithms that are O(1), O(log N), O(N), O(N log N), O(N^2), and something worse than O(N^2).
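
One possible answer sheet for that kind of question, sketched in Python (plenty of other examples would do just as well):

    # O(1):        dictionary lookup / array indexing
    value = {"a": 1, "b": 2}["a"]

    # O(log N):    binary search in a sorted list
    import bisect
    idx = bisect.bisect_left([1, 3, 5, 7, 9], 7)

    # O(N):        linear scan (max, sum, membership test on a list)
    largest = max([4, 8, 15, 16, 23, 42])

    # O(N log N):  comparison sorting
    ordered = sorted([5, 3, 1, 4, 2])

    # O(N^2):      comparing all pairs (e.g. naive duplicate detection)
    xs = [5, 3, 1, 4, 2]
    pairs = [(a, b) for i, a in enumerate(xs) for b in xs[i + 1:]]

    # Worse than O(N^2): enumerating all subsets is O(2^N)
    from itertools import combinations
    subsets = [c for r in range(len(xs) + 1) for c in combinations(xs, r)]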


Behold, the worstsort:

Generate all permutations of the data. Stop when one of them is in order.

O(n*n!).
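
For the curious, a sketch of that in Python; it really is O(n*n!) in the worst case, so keep the input tiny:

    from itertools import permutations

    def worstsort(a):
        """Try every permutation in turn and return the first one that is in order."""
        for perm in permutations(a):                              # up to n! candidates
            if all(perm[i] <= perm[i + 1] for i in range(len(perm) - 1)):  # O(n) check each
                return list(perm)

    print(worstsort([3, 1, 2]))  # [1, 2, 3]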


Sorting is such a foundational problem in computer science that pretty much every job will end up relying on the runtime of sorting algorithms. Not every job requires being able to write a bulletproof dual-pivot quicksort, of course, but knowing complexity bounds is important, even if you’re just calling your standard library’s implementation: it places fundamental bounds on what things you can improve.


Dev with decades of experience here - I've worked with a variety of languages and platforms, and across different domains. I'm not sure I've ever had to implement a sorting algorithm, or even had sorting as an optimisation issue (and I love micro-optimisation, a bit too much TBH!). I've had to sort things, of course, but standard (or at least "common") libraries invariably have sorting functions built-in.


Pretty much every job? I'd say about 10%. With all the web and web-adjacent jobs around, maybe even less. Most stdlibs are already written in a sensible way, so calling "sort" usually means calling quicksort. Most people just do not care, they have tickets and bosses to worry about.


Absolutely. While every job will depend on sorting, many of them would see negligible benefit from swapping the least efficient algorithm for the most efficient one.

And most of them just use a native language sort that does something relatively smart out of the box.

It's good to write all of these at some point so you understand why things are inefficient.


Disagree. If you think that, what do you ask? "How would you Google or Stack Overflow for the answer?" That's hardly going to provide any useful signal. At the end of the day you need to ask something that shows evidence the candidate knows something about efficiency tradeoffs.


Sorting algorithm complexity and implementation are rote-learnt at this point. It hardly tells you much.

Traditional technical interviews don't seem to help you find the best candidates (Google famously studied their interview process and decided it was no better than chance).

So maybe don't do a technical interview. Or if you do, just take a couple of real problems from work.

If you're making a lot of hires, do some research and run an RCT.


I get the feeling that the people who don't like algorithms interviews are also the people who find them difficult to pass.

Go ahead and memorize, it'll be obvious if you can't explain how you got that answer.


Some people like that sort of thing. Mostly the "academic" types. The memorizers. There are people who got through their higher education by using memory instead of wit. I know some of those people, I went to school with some of them, they were always studying and ramming as much "data" into their brains as possible... while I was getting through with a minimum of effort by just coding a lot and trying to think about things in my own way.

The memorizers in this industry often get jobs higher up the tree and therefore they are highly represented in the hiring practices.

Anyways, I completely agree with you. In my opinion it is essential to KNOW that you DON'T KNOW (i.e. you don't remember how exactly an RB tree works, you just know there is such a thing). Everything else is just a quick Google session away.


This has to be the most misguided attitude that you can have. I don't think this will serve you well in your career.


Original author here, happy to answer questions about data structures, algorithms, and coding interviews!


It might be notable that on very small lists (up to around 50, maybe 100 depending on CPU/cache) insertion sort is the fastest of all, even on difficult input distributions. It's so simple that the CPU can skip through it rapidly. I noticed this myself and have also read other developers mention it in sort design discussions. A popular general-purpose sort called 'Timsort' defaults to it for small sequences. It's possible to tweak the insertion algorithm to sum the distance elements have been moved so far and escape to a more substantial algorithm when that gets excessive. If little rearrangement is required, insertion sort is about as fast as scanning through memory can be.
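
For reference, the whole algorithm is only a few lines; a plain sketch (a real implementation would run over a native array rather than a Python list):

    def insertion_sort(a):
        """Shift each element left until it sits in order; very fast when little movement is needed."""
        for i in range(1, len(a)):
            x, j = a[i], i
            while j > 0 and a[j - 1] > x:   # walk left past larger elements
                a[j] = a[j - 1]
                j -= 1
            a[j] = x
        return a

    print(insertion_sort([4, 2, 7, 1, 9, 3]))  # [1, 2, 3, 4, 7, 9]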

I think in practice the answer to the question "which should I use" (rather than "which should ideally be used") is the best-developed hybrid sort library for the platform. Beyond the suitable case for insertion sort, it's a really substantial job for an individual to develop, test, and optimize a high-performance sort library that can handle a range of distribution types.


One of the big factors behind this is that the cost of pipeline stalls in a theoretically good algorithm can significantly exceed the cost of extra comparisons.

Whether that is true depends on a combination of how much data you have, what kind of data you have, and your programming language's low-level representation.

For ints in C++ and a list of 20 elements, it will be reliably true. For strings in Perl with a list of 500 elements, it reliably won't be true.


LSB radix sort can be slow, but MSB radix sort is the fastest sorting algorithm for an array of machine words.


"No CS degree necessary." lured me in. I'm subscribed and looking forward to learning more.


Curious: did people notice that you can click on each algorithm and click the blue button to get a detailed write up of how it works? (Or is that too hard to find?)


The links to decent writeups with example code (excellent to have) are a nice surprise when a row is expanded.

On my Firefox browser, the paragraphs and elements in the table and writeups are somewhat excessively spaced vertically.


There are also sorting networks. If you have to sort a short list of items, they're probably the best solution around.

https://en.wikipedia.org/wiki/Sorting_network
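
For illustration, a sketch of the classic 5-comparator network for 4 inputs; the comparator sequence is fixed, so it runs the same way for every input:

    # Each pair is a compare-and-swap; there is no data-dependent branching in the structure.
    NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

    def network_sort4(a):
        a = list(a)
        for i, j in NETWORK_4:
            if a[i] > a[j]:
                a[i], a[j] = a[j], a[i]
        return a

    print(network_sort4([4, 1, 3, 2]))  # [1, 2, 3, 4]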


The best real sorting algorithms are often hybrid.

Would be nice to see mention of parallel sorting algorithms. These have basically linear speedup and you can do them on GPUs.
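
A crude CPU-side sketch of that idea (sort chunks in parallel, then do one k-way merge); real GPU sorts use rather different algorithms such as bitonic or radix sorting, so this is only meant to show the shape of the speedup:

    import heapq
    from multiprocessing import Pool

    def parallel_sort(data, workers=4):
        """Sort chunks in parallel processes, then merge the sorted chunks."""
        chunk_size = max(1, len(data) // workers)
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        with Pool(workers) as pool:
            sorted_chunks = pool.map(sorted, chunks)     # each chunk sorted in its own process
        return list(heapq.merge(*sorted_chunks))

    if __name__ == "__main__":
        import random
        data = [random.randint(0, 1000) for _ in range(10000)]
        print(parallel_sort(data) == sorted(data))  # True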

Obvious disclaimer that you should never care about your sorting algorithm choice until you need to. Performance is more often IO or memory related than compute. Tech interviews are daft, etc.


Best case for heapsort is actually O(n). Building the heap always takes O(n). Then, e.g. when all keys are the same, each max-heapify call will take O(1) instead of O(log n).
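
To make that concrete, a sketch of sift-down heapsort; the early return in sift_down is the O(1) case being described:

    def sift_down(a, start, end):
        """Push a[start] down until neither child is larger; returns immediately if already true."""
        root = start
        while 2 * root + 1 < end:
            child = 2 * root + 1
            if child + 1 < end and a[child + 1] > a[child]:
                child += 1                      # pick the larger child
            if a[root] >= a[child]:             # with all-equal keys this hits straight away: O(1)
                return
            a[root], a[child] = a[child], a[root]
            root = child

    def heapsort(a):
        n = len(a)
        for i in range(n // 2 - 1, -1, -1):     # build the max-heap bottom-up: O(n) in total
            sift_down(a, i, n)
        for end in range(n - 1, 0, -1):
            a[0], a[end] = a[end], a[0]         # move the current max into place
            sift_down(a, 0, end)
        return a

    print(heapsort([5, 1, 4, 2, 8, 2]))  # [1, 2, 2, 4, 5, 8]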


Hmmmm good point!


feature request: make O(k) bright red, like O(n^2). As it stands, it looks nice and pleasant like O(1) but it's honkin' terrifying.


Oh yeah, looks like we messed that up. That space complexity also doesn't even agree with the final space complexity we describe in the write up.

I'm gonna actually change the space complexity in the table for both counting sort and radix sort to O(n)--that at least matches the amount of hand-waviness we used for the time costs for those two in the table.



