Calling binary search and mergesort implementations "broken" does the author no ...

karpierz · on Nov 6, 2022

Mergesort and binary search have a contract which defines:

- Which inputs are valid.

- For a valid input, what the constraints on the return value are.

You'd have a point if the implementations had as an input constraint: "array must be less than 2^30". But they didn't.

Otherwise, nothing is broken unless it never returns the right answer. Take:

fn add(x: u32, y:u32) -> u32 { return 1; }

This implementation works perfectly for numbers that add to 1. It just loses generality outside of that.

fn add(x: u32, y: u32) -> u32 { return (x + y) - (x >> 10) - (y >> 10); }

This implementation works for x < 2^10 and y < 2^10. Arguably this implementation is much worse than the previous one because it fails unexpectedly. At least the previous implementation would be much more obviously broken.

But these are both broken because they don't fulfill the (implicit) contract for add. You can't just say "well, it's implied that my add function only takes inputs that add to 1" unless you actually write that somewhere and make it clear.

dataflow · on Nov 6, 2022

I get what you're saying but I don't think they're analogous. If nothing else, strlen is defined only with null-terminated strings; this comes in both the spec itself, as well as the documentation of pretty much every implementation you find. Whereas most binary search implementations don't claim they only work under some particular inputs. (I think there are likely more differences too, but this is sufficient to make my point.)

More generally, I feel like the thought process of "it's not broken if it works fine for inputs that occur 99% of the time" is an artifact of how little attention we pay to correctness, not something that is intrinsically true. If your function breaks for inputs that are clearly within its domain without any kind of warning... it's broken, as much as we might not want to admit it. We're just so used to this happening near edge cases that we don't think about it that way, but it's true.

gnull · on Nov 6, 2022

> most binary search implementations don't claim they only work under some particular inputs

They do implicitly. It's just common sense. When you read a recipe in a cookbook, it usually doesn't mention that you're expected to be standing on your legs, not on your arms. Reader is expected to derive these things themselves.

A lot of generic algorithm implementations will start acting weird if your input size has the order of INT_MAX. Instances this big will take days or weeks or process on commodity CPUs, so if you're doing something like that you would normally use a specialized library that takes these specifics into account.

dataflow · on Nov 6, 2022

>> most binary search implementations don't claim they only work under some particular inputs

> They do implicitly. It's just common sense.

That's neither how language specifications work, nor true in this case even if it's true in other cases. Providing one more of the same kind of input that already works is in no way the same thing as changing something totally unrelated.

> When you read a recipe in a cookbook, it usually doesn't mention that you're expected to be standing on your legs, not on your arms.

I don't think this binary search was breaking because of people standing on their arms either.

> A lot of generic algorithm implementations will start acting weird if your input size has the order of INT_MAX. Instances this big will take days or weeks or process on commodity CPUs,

It's incredibly strange to read this from someone in 2022. I don't know of any standard library algorithm that would take "days or weeks" for inputs of size 2^31 now, let alone the majority of them being like this. In fact I don't think this was the case back when the article was written either.

gnull · on Nov 6, 2022

Ok, I looked at it closer and I admit that quicksort implemented in C won't take days on an input of 2³¹ elements. It will take less than 1-2 hours, I think. Something that is a bit worse than O(n log n) or has a 20× bigger constant hidden in O(·) will take days though.

I don't see my other arguments being convincingly refuted, so they still hold.

dataflow · on Nov 6, 2022

> Ok, I looked at it closer and I admit that quicksort implemented in C won't take days on an input of 2³¹ elements. It will take less than 1-2 hours, I think.

How ancient is your machine? Quicksorting (2^31 - 16) elements (because that's Java's hardcoded limit) takes < 11 seconds on my machine, and a big chunk of that time is taken in the random number generation to create the input...

  # Temp.java
  import java.util.*;
  public class Temp { public static void main(String[] args) { byte[] arr = new byte[0x7FFFFFF0]; new Random().nextBytes(arr); Arrays.sort(arr); } }

  $ javac Temp.java && time -p java Temp
  real 10.30

gnull · on Nov 6, 2022

No, wait. The QuickSort from the article is O(n²), in fact. So that one specifically will take weeks, or even months to run — especially in Java. Feel free to test it and get back to me if you think I'm wrong.

dataflow · on Nov 6, 2022

> No, wait. The QuickSort from the article is O(n²),

What article are you referring to? This article is about binary search and mergesort, not quicksort?

And which quicksort has O(n^2) typical behavior? That's the worst-case behavior you normally only get on adversarial inputs, not the typical one. (And standard libraries have workaround for that anyway.)

gnull · on Nov 7, 2022

Sorry, I indeed switched to a wrong browser tab and skimmed instead of reading when started this discussion. Please disregard the quicksort discussion.

I still think that it's normal for programs to behave weird when the input parameters are getting close to INT_MAX. Sometimes it's unavoidable. And if it's not specified in the function docs, you should go and check the implementation as a programmer. For binary search it is avoidable, so the criticism of the linked implementation is fair.

queuebert · on Nov 6, 2022

> Calling binary search and mergesort implementations "broken" does the author no service with his argument.

Very much on brand for the FAANG r/iamverysmart crowd though.

_dain_ · on Nov 6, 2022

What on Earth are you talking about? There's nothing "iamverysmart" about the blogpost at all. The guy literally cites an example where the code broke in production, it isn't an esoteric hairsplitting point at all.

Jensson · on Nov 6, 2022

They tried to implement a standard algorithm themselves and failed. Doesn't mean that almost all binary searches are wrong. C++ standard library had a correct implemented binary search with more flexible signature when this article was written, they could just have used that one instead.

Blackthorn · on Nov 6, 2022

This blog post predates r/iamverysmart. There was a way of talking and discourse in 2006 that this is very much as example of. One has to take things from the time they were written.

moralestapia · on Nov 6, 2022

r/iamverysmart was created in response to a particular human behavior, not the other way around.

_dain_ · on Nov 6, 2022

it literally threw an exception in production on a valid input so yes it is fair to say it was broken