There Are Only Four Billion Floats–So Test Them All (2014) (randomascii.wordpress.com)
252 points by albrewer on Feb 9, 2023 | 65 comments



It's not often you can employ exhaustive testing, but it's useful in some cases.

I did a project in college for a VLSI minor where we implemented a 16-bit multiplier. After we got the chips back we were supposed to do scan testing, but I just wrote a little program on an FPGA, wired up my chip to test, and swept all possible inputs, verifying the output. I think I could only run this at 10 MHz, and you had to shift in the operands one bit at a time, but it still only took on the order of an hour or two to complete.
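
A host-side golden model for that kind of sweep is only a few lines; a Rust sketch for illustration, where read_product_from_chip() is a hypothetical stand-in for however the operands get shifted in and the result read back:

    // Sweep every 16x16 input pair and compare the chip's answer against
    // a software golden model. read_product_from_chip() is a hypothetical
    // stand-in for the FPGA/scan interface.
    fn check_multiplier(read_product_from_chip: impl Fn(u16, u16) -> u32) {
        for a in 0..=u16::MAX {
            for b in 0..=u16::MAX {
                let expected = (a as u32) * (b as u32);
                assert_eq!(read_product_from_chip(a, b), expected,
                           "mismatch at {a} x {b}");
            }
        }
    }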

I also verified this setup by capturing some cycles on a logic analyzer - then pulled together a little awk script to turn a dump of that into the actual numbers I could check. Fun project.


https://semiengineering.com/knowledge_centers/test/scan-test...

scan test: it involves scanning test patterns into internal circuits within the device under test (DUT). The design’s flip-flops are modified to allow them to function as stimulus and observation points, or “scan cells” during test, while performing their intended functional role during normal operation.


I used a similar idea the other day while implementing, in Rust, a CRC algorithm that a device required but that the vendor didn't document 100% clearly.

Fortunately, I had some device vendor sample code in C (of varying quality and provenance), which I believed was probably correct, but I wasn't 100% certain which (if any) of numerous CRC standard variants it was actually implementing (so I couldn't just use an off-the-shelf Rust library).

So, the fastest/best way to get a perfectly compatible Rust implementation seemed to be:

1. Write idiomatic Rust code to try to mimic the behavior of the C code.

2. Write some C code to generate a text file of exhaustive results from the known-good C implementation, for all possible inputs up to length N.

3. Exhaustively check the Rust implementation against that text file of inputs and outputs.
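
For step 3, the harness itself can stay tiny; a sketch of the shape of it, assuming the reference file holds one "hex_input hex_crc16" pair per line (names and format here are illustrative, not the real ones):

    use std::fs;

    // Check a ported CRC routine against an exhaustive dump produced by
    // the known-good C implementation. Each line of reference.txt is
    // assumed to be "hex_input hex_crc16" (hypothetical format/filename).
    fn check_against_reference(crc: impl Fn(&[u8]) -> u16) {
        let data = fs::read_to_string("reference.txt").unwrap();
        for line in data.lines() {
            let (input_hex, crc_hex) = line.split_once(' ').unwrap();
            let input: Vec<u8> = (0..input_hex.len())
                .step_by(2)
                .map(|i| u8::from_str_radix(&input_hex[i..i + 2], 16).unwrap())
                .collect();
            let expected = u16::from_str_radix(crc_hex, 16).unwrap();
            assert_eq!(crc(&input), expected, "mismatch on input {input_hex}");
        }
    }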


For future reference, there's an excellent tool for discovering the parameters of a CRC model from pairs of inputs and outputs called RevEng [0] (GitHub copy: [1]). Once you have the parameters you can use the CRC crate [2] with a predefined CRC or, in the unlikely case nothing matches, a custom algorithm.

[0] https://reveng.sourceforge.io/

[1] https://github.com/berney/reveng

[2] https://crates.io/crates/crc


Going on with this "brute force" discussion... the CRC Zoo computes the minimum Hamming distances of various CRCs.

https://users.ece.cmu.edu/~koopman/crc/notes.html

I don't know what algorithm they're using, but the idea that you can do a Hamming-distance calculation over all possible input sets these days is pretty spectacular.


Thanks! And another tool recommended to me: https://crccalc.com/


https://en.wikipedia.org/wiki/Test_oracle

Test oracles are great for porting software.


This sounds like things I heard in Rob Pike's presentation regarding how they wrote the Go parser. I am no expert though.


I also did this when searching for a fast string-to-float conversion function for ClickHouse:

https://github.com/ClickHouse/ClickHouse/blob/master/src/IO/...

- we create a table, insert all four billion floats there, and check our functions with SQL queries.

Currently, the state-of-the-art library is https://github.com/lemire/fast_double_parser, but we use a slightly worse (but more performant) routine.

It is also interesting how to format floats as strings in an efficient and user-friendly way.

For this problem, we switched iteratively: strtod -> Google's Double Conversion -> Ryu (patched) -> Dragonbox. Currently, we use Dragonbox.
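
For the parsing side, the brute-force check is tiny even outside SQL; a minimal Rust sketch of the round-trip idea, with the standard library standing in for the parser and formatter under test:

    // Round-trip every finite f32 through format-then-parse and demand
    // bit-exact recovery. std's formatter/parser stand in for the
    // routines under test.
    fn main() {
        for bits in 0..=u32::MAX {
            let x = f32::from_bits(bits);
            if !x.is_finite() {
                continue; // NaN never compares equal; skip non-finite values
            }
            let parsed: f32 = x.to_string().parse().unwrap();
            assert_eq!(parsed.to_bits(), bits, "round trip failed for {x}");
        }
    }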


> ...but we use a slightly worse (but more performant) routine.

If it (dragonbox?) is more performant, why is it also worse?


The fast algorithm is less accurate in the least significant bits. In zX41ZdbW's first link above, there's a code comment that includes the exact count of numbers grouped by how inaccurate the results were.


Presumably it's not as accurate?


Someone needs to make a bumper sticker for developers that says “fast and wrong”. I’d imagine non-programmers would buy more of them for other reasons…


It's only on edge cases that floating point isn't mathematically wrong. And sometimes you don't mind an extra couple ULPs of drift.


fast and wrong? that's not a great bumper sticker to have when you're getting pulled over for speeding.


A modern GPU with 5120 shaders at 2 GHz can perform 2^32 ops in just 0.4 milliseconds (AMD's RX 6900 XT; yes, sub-millisecond times).

The truth of today's computers is that the 32-bit space is a sub-second problem, possibly faster than human reaction speeds.

Test the whole 32 bit space. It's cheap today.

-------

Extrapolating a bit: the 2^48 space takes ~27 seconds at one op per input. Maybe 100 ops are needed for your test code, so the 48-bit space (i.e., 3x 16-bit inputs) can be exhaustively tested in roughly 50 minutes.


And a 64b sweep is ~21 days. As a HW guy who tests fp units... just saying.

The real issue is that the verification code is much, much, slower.


> And a 64b sweep is ~21 days

Per GPU. A consumer GPU at that (A100 is faster).

4x GPUs (the AMD 6900 XT) per node and maybe 10x nodes in a rack (4U per node, 42U cabinet) can do a 64-bit sweep in half a day.

> The real issue is that the verification code is much, much, slower.

Isn't verification of 64b multipliers actually a 128-bit problem? (x * y == z, so you have 2x 64-bit numbers to work with).

Also, I think they use BDDs; although multiplication is exponential for BDDs, the exponent is still small enough that 128-bit BDDs fit on today's computers. So it's not a brute-force thing like the one discussed in this topic, but actually a very elegant data structure (though with tons of complications, as is the nature of NP-complete and EXPSPACE problems).


It’s a lot easier to do formal verification… which is a fancy BDD, in many ways.


At that point I might just take the first 2^32 numbers, the last 2^32 numbers, and some range in between, and feel good about it.


https://en.wikipedia.org/wiki/Bounds_checking still seems sensible; if 2^32-1 works, why wouldn't 2^32-2?


Even modern CPUs can explore the 32-bit space extremely quickly. If you get a 128-core CPU running at 4 GHz then it can do 2^32 ops in roughly 1/128 of a second. Pretty amazing. The improvement in the nine years since I wrote the article is fun to see.
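
A sketch of such a sweep in Rust, assuming the rayon crate for the multi-core part, with ceil() as a stand-in function under test and promotion to f64 as the slow-but-obviously-right oracle:

    use rayon::prelude::*;

    // Exhaustively sweep all 2^32 float bit patterns across every core.
    fn main() {
        (0u64..1 << 32).into_par_iter().for_each(|i| {
            let x = f32::from_bits(i as u32);
            if x.is_nan() {
                return; // NaN payload handling varies; skip
            }
            let fast = x.ceil(); // function under test
            let reference = (x as f64).ceil() as f32; // oracle
            // Compare bit patterns so -0.0 vs 0.0 is caught exactly.
            assert_eq!(fast.to_bits(), reference.to_bits(),
                       "mismatch at bit pattern {i:#010x}");
        });
    }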


Ah, but you've missed two key advantages of CPUs:

1. Superscalar (multiple instructions in one clock tick)

2. SIMD (not as good as GPU, but still substantial)

Ryzen 9 7950X (Zen4 cores, 16-core) has AVX512. It's only 256 bits wide, but there are 4 pipelines (!!). (Actually 8 pipelines, but only 4 of them are SIMD. It's complicated; let's just assume 4.)

So each pipeline can handle 8x 32-bit integers per clock, and up to 4 pipelines can be active at once. I do believe Zen4 allows all 4 pipelines to perform the classic MAD / multiply-and-accumulate instruction (very important for padding those matrix-multiplication TFLOPS numbers), though how much you want to count this is... rather difficult, because the 4 pipelines don't all have the same capabilities.

-----------

I can't say I've tested this out myself. But the Ryzen 9 7950X has a theoretical throughput of 16 cores x 8 ops per pipeline x 4 pipelines == 512 x 32-bit ops per clock tick, and is therefore 4x faster than your "128-core" estimate.

Running this code in practice is difficult... but similar to GPU programming. Intel ispc is a good tool for generating optimized AVX512 in the "GPU programming style". I know that OpenMP has a SIMD statement, and the autovectorizers in GCC/Clang/LLVM are getting better.

A lot of options, all difficult but "valid"... at least valid for a simple brute-force embarrassingly parallel problem as discussed here. Things get rather complicated rather quickly in this space when we talk about practical applications...

------------

Note: to achieve this in practice requires some very careful coding and magic. You'll need to fit inside of the micro-op cache, you'll need to keep the pipeline perfectly open, etc. etc.


4 billion inputs is a little on the extreme end, especially for slower interpreted languages, but it's always useful to look for functions which can be exhaustively tested; e.g. booleans and enums.

I spotted a bug in a code review today, in code that was formatting dates with ordinal numbers, like "9th Feb". The ordinal logic only has to work for inputs 1 to 31, so I wrote an exhaustive test, which it failed by outputting "11st", "12nd" and "13rd" (due to only looking at the last digit).
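
The fix plus its exhaustive test fit in a few lines; a Rust sketch for illustration (the repair is checking day % 100 before the last digit):

    // Ordinal suffix for a day of the month. 11..13 must be handled
    // before the last digit - exactly the case the buggy version missed.
    fn ordinal(day: u32) -> &'static str {
        match (day % 100, day % 10) {
            (11..=13, _) => "th",
            (_, 1) => "st",
            (_, 2) => "nd",
            (_, 3) => "rd",
            _ => "th",
        }
    }

    #[test]
    fn all_days() {
        // The input domain is only 1..=31, so just test every value.
        for day in 1..=31 {
            let expected = match day {
                1 | 21 | 31 => "st",
                2 | 22 => "nd",
                3 | 23 => "rd",
                _ => "th",
            };
            assert_eq!(ordinal(day), expected, "day {day}");
        }
    }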


A while back for some videogame related data processing I ended up writing a "regular number to roman numeral" (or maybe the other way around) function in idk ruby or something. Not super convinced that I had gotten it right, I sat down to write a few tests, and then I realized I only needed it to work for numbers up to like 16. So I deleted the actual code, extended the table of test cases to cover 1 through 20 just in case, and just did a lookup in that.
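
In Rust, for illustration (the original was Ruby or so), that shape of solution is just the table:

    // When the domain is tiny, the lookup table *is* the implementation.
    const ROMAN: [&str; 21] = [
        "", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X",
        "XI", "XII", "XIII", "XIV", "XV", "XVI", "XVII", "XVIII", "XIX", "XX",
    ];

    fn to_roman(n: usize) -> &'static str {
        ROMAN[n] // panics outside 0..=20, i.e. loudly refuses out-of-domain input
    }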


It happens. Writing the test puts you in the headspace of “assume the problem is already solved”, which is sometimes wildly more effective. There’s some story (that I hope an archivist can link) of someone writing a shell script to verify some giant program. Well, turns out the script could just _replace_ the program. Ran way faster. Sales wouldn’t have that, though.


That’s actually the right solution to that specific problem.


true senior eng shit right there.

https://i.imgflip.com/7alx6q.jpg


Never underestimate the power of a properly placed lookup table.



I remember being surprised when Intel released a Pentium chip that couldn't do floating point right. It would have been trivial to, I don't know, try dividing by 10 and then multiplying back and checking the answer. It couldn't do that.

At the time I was thinking, their test group could have tested every single mantissa over several operations exhaustively. Why didn't they? What excuse could they have had? It was a computer; they feel no pain when asked to do laborious repetitive tasks.


You'd think so, except no. Floating-point operations are innately lossy and non-associative (addition and multiplication are still commutative, but grouping matters).

A+B+C does not equal B+C+A in floating point.


Which should happen in an entirely predictable and orderly fashion. In fact, they should be testing for that too.


Hmmmmm.

Reading the aside

> In fact IEEE floating-point math is designed to, whenever practical, give the best possible answer (correctly rounded)

Made me realize - there are not "just" four billion floats, there are four billion floats multiplied by however many rounding modes you want to test against.

The "ninety seconds" doesn't qualify if the test harness was multithreaded. If it wasn't, a modern >30 core box (maybe buried in CI somewhere) should be able to manage quite a few rounding levels without difficulty thankfully.


Yup - I've just finished bringing up some FP hardware - you can't do the whole 64-bit double space, but you can do all the 32-bit single-op instructions (converts, sqrt, etc.) - but add/sub/multiply (32x32) are not doable.

But this is even more appropriate for 16-bit floats - you can realistically do all the adds/multiplies/etc., but maybe not the 3-op multiply-accumulates.
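
For two-operand fp16 ops, the whole 2^32 pair space is also trivially checkable in software against a wider oracle; a sketch assuming the half crate (2.x), using the fact that the sum of two f16 values is exact in f64, so rounding that sum back to f16 gives the correctly rounded answer:

    use half::f16; // assumes the `half` crate (2.x)

    // Sweep every pair of f16 bit patterns; check f16 addition against
    // an f64 oracle (two f16s sum exactly in f64, so converting the sum
    // back to f16 is the correctly rounded result).
    fn main() {
        for a_bits in 0..=u16::MAX {
            for b_bits in 0..=u16::MAX {
                let (a, b) = (f16::from_bits(a_bits), f16::from_bits(b_bits));
                if a.is_nan() || b.is_nan() {
                    continue; // NaN payload propagation is implementation-defined
                }
                let got = a + b;
                let want = f16::from_f64(a.to_f64() + b.to_f64());
                if got.is_nan() {
                    assert!(want.is_nan()); // inf + -inf; don't compare payloads
                    continue;
                }
                assert_eq!(got.to_bits(), want.to_bits(),
                           "{a_bits:#06x} + {b_bits:#06x}");
            }
        }
    }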


If I were to use a mental analogy with numerical methods that you get taught in grad school, isn't this kind of just a problem of identifying where poles are in your functions?

If you're cycling through all possible numbers and seeing where the output might blow up or go pathological, then to be efficient about it you probably instead test the far limits of the input range, and then sample the interior with an adaptive spacing to see where things are changing and whether it's about to go crazy somewhere - like dividing by zero, subtracting two nearly equal small numbers, etc. Or analyze the function to see where subtraction (or whatever operation is applicable here) might misbehave, rather than spend all your compute sampling every possible number equally.

I'm no expert but it seems to be the same type of problem.


The insight is that 32-bit space is so cheap to exhaustively explore now that you don’t need to resort to possible misses through sampling or spend the effort to make and verify a smart way of sampling. Just brute force the darn thing and be done.


Ah, I see, for a test that you'll only run once or rarely, might as well!


No no no! That's really the realisation they're pushing at: this is fine for routine tests too! In this almost-ten-year-old post the runtime was ~90s, which isn't too painful. Elsewhere in this thread, times on modern hardware are quoted as milliseconds or less.

The takeaway might be more: spend time to optimise testing everything, rather than spending time optimising the selection of the subset to test.

Massive caveat of course is that some search spaces are just big, but 32bit space shouldn’t be considered big any more.


Thanks for the point!


If you have a function that fails on a substantial number of inputs, random testing is probably a better idea, as it will find a failure in short expected time. An exhaustive testing harness may waste large amounts of time on a segment of the input space it hits first, if that initial segment is free of bugs.

You can use random testing for exhaustive testing. On an input space of size N, you will cover that space completely in O(N log N) tests on average, and you will have included any single input within O(N log N) trials with high probability.
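
(That's the coupon collector's problem; as a worked equation, the expected number of random draws to see all N inputs is

    E[T] = N * (1 + 1/2 + ... + 1/N) = N * H_N ≈ N ln N

so for N = 2^32 that's H_N ≈ 23 times the work of one deterministic pass.)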


For many things (like floating point), saving computer time is not any kind of goal. It's not gonna complain if you ask it to try all the things.

Further, writing clever shotgun tests may actually take you longer than writing the simple loop. In which case it's wasting developer time too.

Be very sure you need to do anything but exhaustive testing before expending the time, money, and effort on it. That's my advice.


I loved this, since it resonates with my own experience. If the problem space is small enough, just test against a known-good (but slower) implementation. With custom algorithms this can also be very interesting for the person writing the test: their reference code needs to be correct beyond doubt, but can forgo almost every manual optimization. Also, for larger input spaces I like to do a nested for-loop over vectors of "interesting" values, plus add a large set of random values.


Ages ago I was testing firmware that implemented RAID data handling for a DLT tape array. I constructed a test data set containing every four-byte value - and found one value that exposed a bizarre bug that would reliably trigger over-temp logic and shut down the device under test. Fun times.


*Only 4 billion floats for 32-bit floats.

When you get into double-precision floats, well, there's a lot more.


Or I could be using two floats (adding or multiplying). Or three.


I was looking for this comment. In my experience functions that work with large multi-variate datasets cause TDD to fall flat on its face.


No they don't. As far as Test Driven Development is commonly understood, at least. TDD would be extremely stupid if the idea fell flat on its face whenever tests can't cover the whole input space.


Aren't most float operations binary? You'd need to iterate over a 64-bit space to test those.


> For very small numbers (less than about FLT_EPSILON * 0.25) adding 0.5 gives 0.5 exactly, and this then rounds to zero

Is that not correct? Isn't it expected that very small numbers round to zero?


The quote is talking about code that is supposed to find the ceiling of a number, and does so by adding 0.5 and then rounding to the nearest value (breaking ties by rounding towards even).

If you have a very small but non-zero positive number, it gets rounded to 0.5 after the addition, and then the second rounding incorrectly rounds toward 0 instead of 1.

The result is that you have a value x such that x > ceil(x), which shouldn't ever happen.
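
The failure is a two-liner to reproduce; a Rust sketch, using round_ties_even() to mirror the IEEE default rounding the article describes:

    fn main() {
        let x = f32::MIN_POSITIVE; // any positive x below about FLT_EPSILON * 0.25
        let bad_ceil = (x + 0.5).round_ties_even(); // the buggy ceil recipe
        assert_eq!(bad_ceil, 0.0); // x + 0.5 rounds to exactly 0.5; ties-to-even gives 0
        assert_eq!(x.ceil(), 1.0); // the correct answer, so x > bad_ceil
    }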


> supposed to find the ceiling

Oh duh. I somehow forgot that over the span of a few sentences. Thanks.


This is in the context of the implementation of a ceil() function: a very small but still positive float should have a ceil() of 1. As said near this part, IEEE default rounding is round half to nearest even: 0.5 rounds to 0, the closest even integer, so if adding 0.5 to the argument of ceil() results in exactly 0.5, the result will incorrectly be 0 instead of 1.


I recently had to implement floating point math. Can confirm that testing them all was by far the cheapest way to fix all the bugs and then verify the final solution was correct.


If you're writing functions for basic mathematical operations, then sure.

But how often is anyone doing that? If you design microprocessors, or programming languages (maybe), or numerical libraries. But hopefully if you do any of those, you already know to test every value against a reference algorithm.

Is there anything in this post that is more generally applicable to the other 99.99% of developers? I'm trying to think of anything I've ever written where it would even be possible to test billions of input values, but there'd never be a reference to test it against.


Or if you're coding for very small microcontrollers.


Until you have to test a binary operation. Or, god forbid, you have to verify that fma works.


Silly me. As far as I can remember I've always used epsilon when comparing floats.


That's not a general solution.

(And if you know what you are doing, you can also compare floats exactly in some circumstances.)

But the article already mentions that.


Absolutely. I was once refactoring some badly written floating-point code. To make sure I did not screw up, I wanted the results of the old and new code to be identical, since I knew the refactoring should not change the float operations. (I also knew the code wouldn't produce NaNs, so == sufficed.)
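
A sketch of that kind of check in Rust; comparing bit patterns instead of using == also catches -0.0 vs 0.0 and copes with NaNs, if they could occur:

    // Bit-exact agreement between an old and a new implementation
    // over every f32 bit pattern.
    fn agrees(old: impl Fn(f32) -> f32, new: impl Fn(f32) -> f32) -> bool {
        (0..=u32::MAX).all(|bits| {
            let x = f32::from_bits(bits);
            old(x).to_bits() == new(x).to_bits()
        })
    }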


Oh, yes... This. Had a project like this not long ago. NaNs almost killed me. The code used complex numbers, and you know who their userbase is. Anyway, exact floating point comparisons made all of this possible.

In the process, my coworker and I found something we believed to be a bug with complex NaNs in some recent GCC versions, but we were not smart enough to work out whether it is actually a bug or undefined behaviour in the standard.

Anyway, we were so exhausted from the refactor, we didn't follow it up again.


>"And if you know what you are doing"

I knew what I was doing, and that is why using an epsilon was the most reliable thing in my data-processing algos.


Hence the 'in some circumstances'.


I cannot recall ever needing to do "if x == y" without also needing to account for tolerances.


Your experience seems a bit limited then?

Obviously, you need to account for tolerances, but that accounting does not need to be via an epsilon in your code; it can also take the form of careful numerical analysis. For example, in some numerical algorithms it is right and proper to run them until an equality test (without epsilon) succeeds, and using any positive epsilon would give a worse result.



