There Are Only Four Billion Floats, So Test Them All (2014) (randomascii.wordpress.com)
329 points by tosh on Sept 14, 2017 | 50 comments



On the topic of "test them all", I wrote this in my paper about error bounds and worst cases for complex multiplication (http://www.daemonology.net/papers/complexmultiply.pdf):

[For finding worst-case inputs] the approach taken was to perform an exhaustive search, taking several hours on the second author’s laptop, of IEEE single-precision inputs, using only a few arguments from Theorem 1 to prune the search.

Even though (2^32)^4 was far too large for it to be tractable to test all 4-tuples of single-precision inputs, a "pruned" search was able to easily cover and/or exclude the entire search space.


Nice!

I have noticed that there is a lot of high-quality mathematical/numerical software coming out of INRIA, and I have often wondered of late why we do not have an equivalent in the US. Do you have any thoughts on why this is the case, and what the closest equivalent in the USA might be?


It's more a matter of what have they done for us lately. We had a number of supercomputing centers in the '90s (at least one has shuttered), and among other things NCSA brought most people their first telnet client and their first web browser via NSF funding that Al Gore pushed for (see also: Al Gore invented the Internet).

After that they were working with INRIA on collaboration software fifteen years before WebEx was a thing. They also had large divisions looking at VR and data visualization.


Oh, and also Apache web server was a fork of NCSA's httpd.



Intel has MKL/TBB/IPP/etc which are now under one umbrella: https://software.intel.com/en-us/performance-libraries


This is largely done by private companies in the US. In the past, AT&T Bell Labs and Xerox PARC were shining examples. Today, every major technology company has labs and other internal organizations contributing research to new technologies: Netflix, Google, Microsoft, etc.


When I took a testing course, I was taught that Correlated Active Clause Coverage was good enough for most test cases. This term might not be as widely known, though.


Did you find anything surprising?


Unfortunately, this doesn't scale well beyond a single 32-bit floating point number.

The good news is that if your code works with doubles or multiple inputs, you might still be able to test it exhaustively using an SMT solver like Z3 which supports a "theory of floating point numbers"[1]. A solver like this symbolically models floating point operations at a binary level, letting it use an efficient pruning search to solve problems about the exact behavior of floating point numbers.

I haven't tried it myself so I don't know quite how well it scales, but it should definitely work better than a loop across all possible inputs!

[1]: https://pdfs.semanticscholar.org/db9f/2cc0bf18c4661f5629b088...
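
For anyone curious, a minimal sketch of what this can look like with the z3-solver Python bindings (the property checked below is just an illustrative one I made up, not anything from the article): ask whether any double in [1, 2] fails the "halve then double" round trip.

    from z3 import FP, FPVal, Float64, RNE, fpMul, fpLEQ, fpEQ, Not, Solver

    x = FP('x', Float64())
    rm = RNE()  # round-to-nearest-even, the IEEE default rounding mode

    # (x * 0.5) * 2.0, evaluated in IEEE double precision
    round_trip = fpMul(rm, fpMul(rm, x, FPVal(0.5, Float64())), FPVal(2.0, Float64()))

    s = Solver()
    s.add(fpLEQ(FPVal(1.0, Float64()), x), fpLEQ(x, FPVal(2.0, Float64())))
    s.add(Not(fpEQ(round_trip, x)))  # ask for a counterexample

    print(s.check())  # "unsat" means the round trip is exact for every double in [1, 2]

Multiplying by 0.5 and 2.0 only shifts the exponent, so the solver should report unsat here; the interesting cases are when you replace that property with whatever your code under test actually computes.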


I want to disagree with you about scaling: most formulas that one might reasonably come up with don't depend in any serious way on the number of bits. So suppose you have a formula that you can test exhaustively with 32-bit floats, and you were to implement it on 64-bit floats. If it fails a couple of times on 32-bit floats, giving it a failure rate of a couple per billion, the same failures are likely to show up in testing with 64-bit floats, a couple per billion. A couple per billion is good enough to be triggered by purely random tests. It's only specifically the exhaustive part that doesn't scale, but that part is not really necessary for this to be useful.

Z3's support for floating point theories is really cool, but sometimes the simpler brute-force approach works too!

IOW, you could change the message to "Your formula's failure rate is either exactly zero or larger than a few per billion, so random testing is good enough", and I think it would still be good, insufficiently-applied advice.
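
To put a rough number on "good enough to be triggered by purely random tests" (my own back-of-the-envelope figure, not from the article or the parent):

    import math

    p = 2 / 2**32         # a failure rate of a couple per billion
    n = 10_000_000_000    # ten billion uniform random trials
    # probability of hitting at least one failing input
    print(1 - math.exp(-n * p))   # ~0.99 (Poisson approximation to 1 - (1-p)**n)

So a random-testing run of that size is very likely to surface any failure that an exhaustive 32-bit search would have found.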


That's a good point—even a few hundred random test cases are usually enough to cover "normal" code. I use QuickCheck for that at work, and it's enough to both find bugs I wouldn't see otherwise and to make me confident in my code when it doesn't find any bugs. (As a bonus, it also tries to "reduce" failing cases, which can really help with debugging.)

It's worth tuning your random input generator to produce inputs that commonly cause problems like -0.

I can still think of a few situations where exhaustiveness might be worth the effort:

- verifying code passed in from the outside (security considerations?)

- verifying code generated automatically (ie as part of program synthesis)

- verifying very general code you expect to be used in hard-to-predict contexts (ie library routines or compiler optimizations)


A few hundred random test cases? Double-precision numbers can have 2,048 different exponents. I've seen multiple bugs that only happen with denormals so you need thousands of tests to have any hope of hitting those.

If you're going to do random testing then you need to have a huge set of special numbers (denormals, NaNs, infinity, zero, exact powers of two, one more or less than exact powers of two) and you should do millions or billions of random numbers. Then you can start to feel confident.
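
A rough sketch of what such a generator might look like for doubles (the particular constants and the mix ratio here are arbitrary choices of mine):

    import random
    import struct

    SPECIALS = [0.0, -0.0, float('inf'), float('-inf'), float('nan'),
                1.0, -1.0, 0.5, 2.0**-1074, 2.0**-1022,    # smallest denormal / smallest normal
                2.0**52, 2.0**52 + 1, 2.0**53 - 1, 2.0**53]

    def random_double():
        if random.random() < 0.25:      # bias towards known troublemakers
            return random.choice(SPECIALS)
        bits = random.getrandbits(64)   # otherwise a uniform 64-bit pattern, so
        return struct.unpack('<d', struct.pack('<Q', bits))[0]   # denormals and NaNs still show up occasionally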


>It's worth tuning your random input generator to produce inputs that commonly cause problems like -0.

This would definitely include denormals, as well as ±inf, NaN, odd and even ints (positive and negative), ±1, and some .5s.


I'm not sure I understand your point --

For instance, if an algorithm failed only when exactly all three inputs were 0, then it would fail only once per (2^32)^3 inputs, despite its failure rate not being exactly 0. So exhaustive testing only works if you have a single input, since 2^96 is somewhat harder to exhaustively search, and statistical testing may not be enough, since 1/2^96 << 1/2^32.

We can imagine a situation where it fails with each of the values being a number besides 0, making this nontrivial to check for. (For example, if the inputs are normalized somehow before being processed.)


I wanted to make the point that, in practice, very few formulas fail only on really special inputs.


When rewriting sin() for Xbox 360 I found that the old function gave the wrong result for denormals - off by a factor of two. Since double has an eleven bit exponent that means that roughly one number out of 2,048 would hit this special failure case.

For round/ceil/floor a common failure case is the number immediately preceding 0.5. If you're "lucky" then your algorithm fails for lots of numbers ending in 0.5, but if it's just the one number then a random search will never find it. And, an exhaustive search over all of the doubles from 0.0 to 1.0 turns out to cover a really big search space (one quarter of all doubles, roughly).
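
To make that concrete, here's the classic way a naive round-half-up goes wrong on exactly that one number (a small standalone illustration, not the Xbox 360 code):

    import math
    import struct

    def prev_double(x):
        # largest double strictly less than x (assumes x is positive and finite)
        bits = struct.unpack('<Q', struct.pack('<d', x))[0]
        return struct.unpack('<d', struct.pack('<Q', bits - 1))[0]

    x = prev_double(0.5)          # 0.49999999999999994
    print(math.floor(x + 0.5))    # x + 0.5 rounds up to exactly 1.0, so this prints 1, not 0

A random search over [0, 1) is astronomically unlikely to ever produce that exact bit pattern.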


> "but if it's just the one number then a random search will never find it"

Minor, but I think it's not never, just very rare. If you're sampling from all floats then the problematic inputs should be possible.

I think the distinction people aren't seeing is what you are testing. If it's the sin routine, then I think you're making valid points.

But if we're talking about testing for correctness and the overall logic of the program or the abstract expression of the calculation, there are only a limited number of variables. Trying to divide them and vary them under this kind of control seems like a valuable tactic. It might not be sufficient but I've seen way more bugs in shitty if else boolean soups than I have from bits getting flipped.

If the output is so sensitive to the value of a certain calculation, there should probably be a secondary check or a checksum. Unless you're doing games where... Framerates.


The most common ones (for example, when something is 0 or a list is empty) are also typically prioritised in tools like quickcheck.


Thanks for replying.

I'm not sure I agree with that, but I'll meet you halfway: those errors rarely matter in practice. (There are only 2^33 people, so getting 2^96 events (or any significant portion of that) is, well, impractical. So a failure on a single random input is unlikely to matter.)


Yeah but what about something like hypot or atan2?


Well, for organizations of a certain size they could map reduce the job across 4 billion servers for 90 seconds (2^32 servers each handling 2^32 operations).


If you're interested in this style of testing and use Python, https://github.com/HypothesisWorks/hypothesis-python is a very powerful library that lets you create parametrized tests which will automatically be run against a set of arguments designed to exercise those interesting edge-cases.
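
For example, a float-based property test looks something like this (a toy property, just to show the shape; the floats() strategy tends to favour nasty values like NaN, infinities and ±0 unless you exclude them, as this property has to, since math.floor rejects NaN and infinity):

    from hypothesis import given, strategies as st
    import math

    @given(st.floats(allow_nan=False, allow_infinity=False))
    def test_floor_is_a_lower_bound(x):
        # a deliberately simple property about math.floor
        assert math.floor(x) <= x < math.floor(x) + 1

When a property fails, hypothesis also shrinks the counterexample to a simpler failing input before reporting it.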


Hypothesis is a pretty different style of testing; it is not intended for exhaustive testing.

Which you can easily do in pytest directly using parametrized test cases.


Just don't try testing the nine billion names of God.



Just because something was a valid reference doesn't mean it shouldn't be downvoted. I fail to see how it's relevant to the article.


The point of the article was the feasibility and reasonableness of testing 4 billion entities (the full 32-bit range). The linked sci-fi story is about testing 9 billion entities (the names of God in a given alphabet).

The relevance is that a sci-fi author thought about applying this method over 60 years ago, and at the time it seemed like an insurmountable task for humans that could only be solved with massive computers, with the unexpected consequence of ending the universe. And now anyone can do this on their desktop.


Often humorous and witty comments are my favorite parts of HN. They are far better than posts on a topic where people seem not to have read the article.


Sure, most people enjoy humor or wit, but most people aren't as funny as they think, and I think even you'd agree the luster tends to wear off once the top thread on every post is a chain of memes, in-jokes, and stupid puns nested too deep to fit comfortably on any screen.

That is the natural entropic end-state of all large social news sites. HN has antibodies against that in the form of downvoting a bit more aggressively than might seem warranted at first glance, but all in all I think it's a wise choice.


"Often humorous and witty comments are my favorite parts of HN"

Then you'll find far better on Slashdot where free speech seems to actually happen, as opposed to here where you're not allowed to joke about anything.


That was an excellent read. Kind of a reverse of the short story "Nightfall" by Isaac Asimov.



Does this need a (2014) on the title? Certainly worth seeing if you missed it before though.


Yes! I'm the author of the article but, still, I hate it when old articles are posted without acknowledgment that they are old.

I mean, it's a timeless classic, but the fact that it's three years old is still worth acknowledging.


I remember doing this. In Python I wanted to know if a float -> str -> float would result in the same number... it does.

But it wasn't really documented or guaranteed that it would, so I wrote some bit-bashing code to generate all 2^32 bit patterns.
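
Presumably something along these lines (my reconstruction, assuming the 32-bit patterns were reinterpreted through the struct module):

    import struct

    def roundtrips(bits):
        f = struct.unpack('<f', struct.pack('<I', bits))[0]   # bit pattern -> float
        if f != f:                   # NaN never compares equal to itself; skip
            return True
        return float(str(f)) == f    # float -> str -> float

    print(sum(not roundtrips(b) for b in range(2**32)))   # 0, eventually -- it's a long loop in pure Python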


After reading about the case the author knew was not right (negative zero), I wonder why the relevant "equality" test wasn't more like

    if(isfinite(testValue.f) && isfinite(refValue.f) 
            ? testValue.i == refValue.i
            : (fpclassify(testValue.f) == fpclassify(refValue.f)
               && signbit(testValue.f) == signbit(refValue.f)))
i.e., if the numbers are both finite [including denormals], they must have the same bit pattern; otherwise, they have to be the same kind (i.e., both infinite, or both nan) and have the same sign (my manpage for signbit says NaN has a sign, but that's news to me actually).

I doubt it's that I know more than Bruce Dawson about FP; is it about working in the distant past (2014, cough) without compiler support for C99 stuff like fpclassify and signbit?


I don't like testing for NaNs using == and != either, but the only values that compare the same while having different bit patterns are ±0 and all the NaNs, so as long as you get those right, the plain equality is fine. Also, I think your test has a corner case with 0x7ff8000000000000 and 0xfff8000000000000, because NaNs can have sign bits too, while the intention in the post is to treat any NaN as matching any NaN. So the test for equality could be something like "isnan(a) == isnan(b) && (isnan(a) || (a == b && signbit(a) == signbit(b)))" without any bitwise manipulation.


I wouldn't want to write the test that way because it seems very complex. The other way I'd be tempted to write it would be to compare the bit patterns and treat mismatched bit patterns as a failure unless both numbers are NaN.

To some extent it comes down to personal taste and how much knowledge of FP you want embedded in the test.
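
For doubles, that bit-pattern style of comparison might look something like this sketch:

    import math
    import struct

    def bits(x):
        return struct.unpack('<Q', struct.pack('<d', x))[0]

    def same_result(a, b):
        # identical bit patterns, or both NaN (any NaN matches any NaN)
        return bits(a) == bits(b) or (math.isnan(a) and math.isnan(b))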



Something is wrong with the encoding of the code snippets on this site. I spent quite a while staring at "while (i &lt;= stop)" until I realized those are HTML escape codes.

Weird, because I am running standard Chrome on Windows 7.


Fixed. Wordpress mangled the code when I made an unrelated change.


Also: property-based testing.

- Haskell, Erlang: QuickCheck

- Elixir: StreamData

- Clojure/ClojureScript: clojure.spec

Other languages have their own implementations. It may not cover everything, but it will cover more than manually hardcoding values in your tests.


Now do doubles.


4 billion is fine. What happens if your program has a few thousand floats?


Then you can't use exhaustive testing?


Funny he complains so much about incorrect rounding without doing any research: that's called bankers' rounding.

http://wiki.c2.com/?BankersRounding
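
Banker's rounding rounds ties to the nearest even value; it's what Python's built-in round() does, for instance:

    >>> round(0.5), round(1.5), round(2.5), round(3.5)
    (0, 2, 2, 4)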


The functions called "ceil" and "floor" shouldn't do banker's rounding - the very names describe what the behavior is supposed to be.


The author is not complaining about the default rounding rule for floats in IEEE 754 (a very thoughtful standard, principally designed by William Kahan, who has several articles/rants about it on his homepage).

The author is complaining about this particular (bad) implementation of ceil and related functions giving incorrect results (some due to said rounding rule).


2 things here:

- firstly, binary ops really need not 4G ops to test them all, but 4G^2 ops, which is not a 'few seconds' but 4G times a few seconds

- secondly, as a chip designer, by the time you've got silicon it's essentially too late - testing 4G or 4G^2 operations in the original Verilog/VHDL model is very much not a few seconds or 4G times a few seconds



