The Floppy Disk of Floating Point (evanmiller.org)
128 points by todsacerdoti on July 14, 2020 | 92 comments



> You’ll find F2XM1, which computes in a swoop (2𝑥−1) – a kind of pauper’s fused multiply-add, some twenty-five years ahead of its time.

No. It computes pow(2, x) - 1.

Suppose x is close to 0, so that pow(2, x) is close to 1, like 1.00000000000000002345678. Because of that leading 1 (to the left of the radix point), all those zeros also have to be stored, meaning some of the low-end digits will have to be dropped, so you end up rounding to 1.00000000000000002. But if you can subtract 1 from the exact result before rounding, then you can store more of the low-end digits: 2.345678e-18. Hence the F2XM1 instruction.


expm1 is the modern equivalent. Those (plus the related log1p) are probably some of the most underutilized standard library math functions. They can substantially improve the accuracy of tons of calculations.
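A minimal C sketch of the difference for a tiny x (the value 1e-18 is just an illustrative assumption):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 1e-18;
        /* 1 + 1e-18 rounds to exactly 1.0 in a double, so the subtraction
           leaves nothing behind: */
        printf("%g\n", exp(x) - 1.0);   /* prints 0 */
        /* expm1 never forms the intermediate 1 + x, so the result survives: */
        printf("%g\n", expm1(x));       /* prints ~1e-18 */
        return 0;
    }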


Can you give an example of a common situation where a programmer should have used expm1 or log1p but didn't? (the way it's common for people to test if something is inside/outside a circle or sphere by comparing the square root of the sum of squares to the radius as opposed to comparing the sum of squares to the square of the radius, or using trig functions when linear algebra will do)

I do a reasonable amount of computational programming but to my recollection I've never used either of those functions.


One example I can think of is doing an interest calculation with small interest rates/short periods, where you'd want to know that someone owes you 0.000000001234 dollars in interest today rather than owing you 1.000000001 - 1 dollars.
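As a hedged sketch in C (made-up principal and rate, continuous compounding assumed), the lost interest is exactly the difference between the two lines:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double principal = 1.0e9;      /* hypothetical balance, dollars */
        double rate      = 1.234e-18;  /* hypothetical (absurdly small) daily rate */
        /* Naive growth-factor-minus-one: the leading 1 swallows the interest. */
        printf("naive: %g\n", principal * (exp(rate) - 1.0));  /* 0 */
        /* expm1 keeps it. */
        printf("expm1: %g\n", principal * expm1(rate));        /* ~1.234e-9 */
        return 0;
    }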


One common situation is when evaluating CDFs for exponential or Weibull probability distributions. The CDF formula for both of those distributions includes a 1 - e^(-x) term that can easily be transformed into expm1 for additional precision.
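For the exponential case the transformation is just 1 - e^(-lambda*x) = -expm1(-lambda*x); a small C sketch:

    #include <math.h>

    /* Exponential CDF. For small lambda*x, 1.0 - exp(-lambda*x) cancels
       catastrophically; -expm1(-lambda*x) does not. */
    double exponential_cdf(double lambda, double x) {
        return -expm1(-lambda * x);
    }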


I think the sigil for 'raise to the power of' got dropped in that line, rather than that the author did not understand what it did.


The author then continues: "a kind of pauper’s fused multiply-add", so I'm afraid your interpretation is too charitable.


Ah yes, good point. So maybe it already got dropped further upstream.


I'm sad to see the x87 go. I've never been able to convince anyone of the value of 80 bits of precision. Never mind that I've seen engineers convinced their sums were correct (because the computer can't be wrong) when they'd used insufficient precision for adding too many numbers.

But you can thank (or blame!) me for extending its life - back when Microsoft was developing their 64 bit Windows, they were all set to not save the x87 state during a context switch, effectively killing the x87. I talked them out of it.

Microsoft's VC still doesn't support the x87; their "long double" is 64 bits. I suppose that's for compatibility with other processors it supports.

I'm proud of the Digital Mars C/C++ compiler which still sports full 80 bit long doubles, as well as the D programming language which still (despite opposition) delivers 80 bit precision for type real, even for 64 bit Windows.


I had to debug a nasty x87-related issue with a piece of scientific computing software. The output of each version of the software was deterministic, but different versions usually (not always) gave very, very slightly different output, even if the numerical code was unchanged. This bothered the original author, but he was a hardware engineer, and his attitude towards it was basically "software sucks, you can only depend on computers if you lay out the circuits yourself."

So the mystery remained until I was brought on and was able to figure out that the default settings of gcc at the time did not use any of the SSE2 instructions supported on the workstations we were targeting, instead using x87 floating point instructions. The state of the x87 stack was affected by code interleaved with the numerical code (there was a bit more going on than just number-crunching) causing the 80-bit representations to be copied to and from 64-bit registers (and hence rounded) at slightly different points in the computation. This resulted in tiny changes in the output.

I added a flag to enable SSE2 instructions, and thereafter the output only changed when the numerical code was changed.


> The state of the x87 stack was affected by code interleaved with the numerical code (there was a bit more going on than just number-crunching) causing the 80-bit representations to be copied to and from 64-bit registers (and hence rounded) at slightly different points in the computation.

The x87 can load/store 80-bit floats from/to memory, so it can definitely save intermediate results in full precision if asked to; I'd call that more of a compiler flaw than anything else.


I had this experience on several occasions in C# on 32-bit. The whole idea of the 80-bit operations is flawed in an environment where you can't control the native code generated, with register allocation and so on (so in most high level languages). We became so used to these bugs that we immediately recognized them when some calculation differed on just some machines.

As C# is jit-compiled, you could never be sure what code would actually run at the end user machine, and where the truncation from 80 to 64 bits would occur.

In the end the best cure is to ensure you never ever use x87, which happened automatically when dropping 32 bit support.

Determinism is too important to give up for 16 extra bits.


I feel like I read something once that said that when writing numerical/scientific code, there were traditionally so many weird high performance computers in use (Crays or whatever) that you had to be robust to all sorts of different types of FP anyway.

Nowadays maybe that sort of diversity is less of an issue? Expecting determinism in the sense you mean it just seems weird to me.


Having absolute determinism is probably still difficult, but using SSE on x64 on Windows, where all users have compatible compilers (i.e. determinism without diversity), is at least "good enough" nowadays. I haven't seen any issues with that scenario so far, even though it's certainly possible problems can arise.


I think it’s in Goldberg91.


The round to 64 bits (it's a round, not a truncation) never occurs if you use a language type that is 80 bits.


The question, though, is which gives you more accurate results? :-)


The one that let us build a library of bit-for-bit regression tests ;-) but for that I have to explain a bit more. There were computational "features" that could be turned on and off to tweak the computation for different problems and for different speed/accuracy trade-offs, and sometimes these features had very slight effects, so regressions could result in very small errors.

An experienced eye could easily see the difference between the tiny x87-related indeterminacy and other kinds of changes, but we were uncomfortable automating this comparison, and it took a while for someone without strong domain knowledge (such as myself or any other software engineer being hired) to become comfortable eyeballing it. With deterministic output, we could use automated tests to verify that, for example, the work we did to add a new computational feature did not change the output when that feature was not enabled, or that small changes intended as performance optimizations did not inject tiny numerical errors.

Our customers were also a lot more comfortable when they could use "diff" to validate that our latest X-times-faster release was really doing the same computation as the last one :-)

EDIT: We also got a noticeable speed-up by enabling the SSE2 instructions. The bulk of the numeric work was done in hardware, so it wasn't dramatic, but it was measurable.


Yes, that makes sense.

As for speed, Intel has neglected to keep up the x87.


This has happened to thousands of people throughout the ages. The damn -ffloat-store option! Ugh!!!


> they used insufficient precision for adding too many numbers.

80-bit, 128-bit or sum of 64-bit floats, doesn't matter. Summing an arbitrarily large list of floating point numbers while maximizing precision is actually a hard problem, because the order in which you add the numbers really matters due to incremental errors. Even when this problem is recognized, it turns out the most precise algorithm escaped us for decades--it's not as simple as adding them up low to high. It's only been within the past 10 years that this optimal algorithm has been discovered.

80-bit is a non-portable abomination. It's caused us "average" programmers decades of headaches, and it generally hasn't even given experts better numerical precision.

Good riddance IMHO!


> Even when this problem is recognized, it turns out the most precise algorithm escaped us for decades--it's not as simple as adding them up low to high. It's only been within the past 10 years that this optimal algorithm has been discovered.

Interesting, do you have a link for us lazy but curious folk?


> this optimal algorithm

I'm curious - link please?


There is some good background starting from the Kahan algorithm:

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

This paper from 1993 analyzes five algorithms:

https://www.semanticscholar.org/paper/The-Accuracy-of-Floati...

Sorry, I don't have a reference for the "optimal" algorithm--I had it explained to me in person by someone working in the field and they didn't give me a specific place to go track it down. The gist of it is that you need to "bin" the numbers by their approximate order of magnitude and then even in each bin you need to keep a separate running estimate of the error within each bin. It's a kind of divide-and-conquer approach. AFAIR it is "optimal" with respect to minimum error, but not optimally efficient.

It's apparently a hard problem!
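For reference, a minimal C sketch of plain Kahan (compensated) summation, i.e. the starting point from that background reading rather than the binned "optimal" algorithm described above:

    #include <stddef.h>

    /* Kahan summation: c carries the low-order bits lost when a small term
       is added to the large running sum, and feeds them back into the next add. */
    double kahan_sum(const double *a, size_t n) {
        double sum = 0.0, c = 0.0;
        for (size_t i = 0; i < n; i++) {
            double y = a[i] - c;   /* re-inject previously lost low bits      */
            double t = sum + y;    /* big + small: low bits of y are lost...  */
            c = (t - sum) - y;     /* ...and recovered here                   */
            sum = t;
        }
        return sum;
    }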


I did find this:

http://code.activestate.com/recipes/393090/

but according to the comments on it, it isn't entirely accurate either. It is indeed a hard problem, and just having more bits is easy :-)


The non-portability of long double is just bonkers, though. Floating point is super-tricky to get right at all levels, and programmers shouldn't be constantly surprised, and often infuriated, when their compiler introduces crazy amounts of nondeterminism by "helpfully" utilizing 80-bits underneath.

I am actually pretty OK with just having an 80-bit float type in the language, as long as it is always 80 bits, and no 64-bit floats get silently promoted. But C didn't have that because it's not what x87 does the fastest. C just has whatever crap is the fastest. I hate that! I imagine that you did a better job of it in D.

In designing WebAssembly we were very careful to go for as much determinism as possible and stick closely to IEEE 754. There were loud voices calling for nondeterminism (even pushing for the default to be nondeterminism as to whether FTZ mode was enabled!) but in the end the portability considerations won out. I was very heavily on the determinism side.


> back when Microsoft was developing their 32 bit Windows, they were all set to not save the x87 state during a context switch, effectively killing the x87

Do you mean '64-bit Windows'? (When 32-bit Windows was being developed, killing x87 would have meant killing all hardware floating point on x86.)


yes, 64-bit


I edited that now. Hope that's ok.


edit: Microsoft was developing their 64 bit Windows


I've doubled the 32 in your original comment. Hope that's ok.


Thank you.


One of the interesting things about the 80-bit format was that it allowed the use of 64-bit integers on a 16 or 32-bit machine. This was what Thomas Nicely was using them for when he found the Pentium FDIV bug: https://faculty.lynchburg.edu/~nicely/pentbug/pentbug.html
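A quick illustration in C, assuming a compiler where long double is the 80-bit x87 format (64-bit significand):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int64_t big = 9007199254740993LL;    /* 2^53 + 1: not representable in a double */
        printf("%.0f\n",  (double)big);      /* 9007199254740992 -- the +1 is lost */
        printf("%.0Lf\n", (long double)big); /* 9007199254740993 -- exact in 80 bits */
        return 0;
    }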


https://en.wikipedia.org/wiki/Quadruple-precision_floating-p... shows (non-IEEE) 128-bit floats existed in the 1970s, that IEEE ones were added to Power in 2015, and that the clang, gcc and Intel C compilers support them in software. There clearly is a market for them.

Truth is that successive rounding errors accumulate, and complex calculations can lose a lot of bits of precision. That’s why you sometimes need to do calculations in a lot more bits than needed in the answer.

Also, the x87 had 80 bits floats, but not all instructions computed all of them correctly (https://www.infania.net/misc/coproc.html)

I don’t know how current FPUs fare there, but would hope they have become optimal.


> but not all instructions computed all of them correctly (https://www.infania.net/misc/coproc.html)

I failed to find which instructions weren't computed correctly in the text linked, and I'd like to know about that, that is, if it's something I wasn't aware of. Can you specify a bit more?


Search for “Accuracy of calculations performed by a coprocessor” in the article I linked to.

it says “Intel claims that maximum error in the 8087, 80287, and 80387 for all transcendental functions does not exceed two bits in the mantissa of the double extended format”. Typically, those results get rounded to double format, but that doesn’t guarantee that such errors get dropped.

Edit: http://www.math.utah.edu/~beebe/software/ieee/ has lots of links and code for testing floats.

http://li.mit.edu/Archive/Activities/Archive/CourseWork/Ju_L... gives quite a few corner cases, so it would be helpful when writing a test suite.

https://slixe.de/stash/projects/MAS/repos/libfirm/browse/ir/... has source code for a/the compliance checker

https://core.ac.uk/download/pdf/151193633.pdf describes what seems to be a better version of that test framework (it adds “++” to the name). Unfortunately I couldn’t find its source code.


That statement that you refer to, regarding the instructions for transcendental functions on 80-bit doubles, is the result of the way the hardware and the microcode for these functions are implemented. Internally, out of 80 bits, 64 bits are available for the mantissa, the steps for calculating the result accumulate error, and so the last two bits in these transcendental functions are "noisy", which was still very good for its time. And it was documented.

But that's still only for the "convenient" input range for the implemented algorithm.

In practice, programmers sometimes want to pass values from the less convenient input range, and then it got much worse when the library functions in high level languages expected too much from the implementation in the processor:

https://randomascii.wordpress.com/2014/10/09/intel-underesti...

The resulting problem in the high level implementations described there was blind belief in the wrong documentation (wrong in that it did not correctly state the input range) by some of those who then implemented the C library functions for some environments, only a few years ago:

The Intel's documentation was "buggy. By claiming near-perfect accuracy across a huge domain and then only providing it in a tiny domain Intel has misled developers and caused them to make poor decisions."

And it was fixed in 2014:

https://software.intel.com/blogs/2014/10/09/fsin-documentati...


The x87 FPU finds an interesting niche application in size-optimised demoscene productions, where the stack-based architecture and scientific-calculator operations allow for some impressively compact 3D demo effects. For example, this one is 256 bytes:

http://www.pouet.net/prod.php?which=53816


A good article. However, the real floppy disk is the floating point standard itself. These days other number representations are exceeding floating point in terms of dynamic range, flexibility, fewer exception cases, better rounding, and cheaper/faster hardware implementations.

Read about Unum computing, pioneered by John Gustafson, and implemented on many modern software and hardware platforms.

https://en.wikipedia.org/wiki/Unum_(number_format)


> ...a kind of pauper’s fused multiply-add, some twenty-five years ahead of its time.

Maybe 12 years. The PowerPC had fused multiply-add in 1992. For whatever reason, it didn't really appear in x86 until 2012.


And many DSPs had it long before then.


One problem with x87 is that it performed all operations in 80 bits, converting to float and double on load/store.

This sounds nice, but in practice was quite problematic, as results depended on whether or not intermediate results spilled into memory.
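A contrived C sketch of the effect (the exact behaviour depends on compiler, flags and optimization level, so treat this as illustrative only):

    #include <stdio.h>

    int main(void) {
        double a = 1e16;
        double b = 2.9999999999999996;
        /* With x87 code generation, a + b may be held in an 80-bit register
           (giving roughly 3 after the subtraction) or spilled to memory as a
           64-bit double first (giving exactly 2). With SSE2 it is always the
           64-bit result, everywhere. */
        double t = a + b;
        printf("%.17g\n", t - a);
        return 0;
    }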


I have a feeling that Kahan lived in a time where matrix inversion and eigenvalue computation was considered the new hotness, just like neural networks today.

It is very easy to build small invertible matrices that cannot be inverted with 32-bit floats, or even 64-bit floats, thus Kahan's insistence on very high precision floating points.

Edit: small invertible matrices


Kahan is a numerical analyst, and probably had extensive experience with worrying about known numerical analysis pitfalls, such as the tablemaker's dilemma, which means you don't necessarily know a priori how much precision you need on the input or intermediate computations to get the desired output precision.

As it turns out, though, most people don't need more than 3 or 4 decimal digits of precision, so while a float may easily accumulate enough error to corrupt that last needed digit, a double tends to be more than roomy enough for almost everybody.


>lived

The man lives still, not to be numbered among the dead:

https://en.m.wikipedia.org/wiki/William_Kahan


Then he must be apoplectic about the new bfloat16 format - only 7 bit mantissa - what the heck

It's totally possible that scientists in the future will laugh at our extremely wide 16-bit neural networks. Computation at the biological synapse level is considered to have an accuracy of 1 to 3 bits only - opinions differ.


I assume you mean small invertible matrices that cannot be inverted with …


> small matrices that cannot be inverted with 32-bit floats, or even 64-bit floats

Presumably these all have eigenvalues near zero and away from zero (i.e. the condition number is large)?


https://en.wikipedia.org/wiki/Hilbert_matrix

"The Hilbert matrices are canonical examples of ill-conditioned matrices, being notoriously difficult to use in numerical computation."


Yes. Ill conditioned matrices are hard to “invert” (more like solve Ax=b for a given A and b) almost by definition. The condition number is basically a measure of sensitivity. If I change b a little, the relative change in x can be as large as the relative change in b multiplied by the condition number. If the matrix is very ill conditioned (say more than 1e10 for a double precision algorithm), the matrix is really singular for any practical purpose, so the problem is really ill posed.
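The standard bound, for reference (relative perturbations, any consistent norm):

    ||dx|| / ||x||  <=  cond(A) * ||db|| / ||b||,   where cond(A) = ||A|| * ||A^-1||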


Solar system scale 128-bit floats can be useful for storing coordinates. Any other reasons to go beyond 64 (or 80) bits?


One example could be when computing sin (or any trig function) of a large number - first you perform range reduction, ie bring it into the range 0..2pi. For that, you're (potentially) subtracting two large numbers that are close together, so you lose a lot of accuracy.

This implementation of mod2pi in Julia, for example, uses double doubles ("TwicePrecision") here and there:

https://github.com/pkofod/RemPiO2.jl/blob/master/src/RemPiO2...

And, pertinent to the article, here's a bug in mod2pi that was probably triggered by 80 bit extended floats:

https://github.com/JuliaLang/julia/issues/8799


I have a book somewhere in the basement on organising calculations from the early twentieth century, back when computer was a job title and job runtime was measured in weeks...

===Update===

Couldn't find the one I was looking for, but I did dig up Eckert, Punched Card Methods in Scientific Computation (1940).

From Chapter VII "The Multiplication of Series":

p.68 "By a continuation of this process we obtain about 100 groups of cards each with its rate card. Since the capacity of the multiplier is eight digits, those groups which have 9 and 10 digit multipliers and eight digit multiplicand are done in two groups. The few terms with large multiplier and multiplicand are done by hand. In a series with reasonable convergence there may be about 20,000 cards altogether. The reproducer is of course used to check the punching."

10 decimal digits is what, 33 bits?

p.74 "The machine time for 100 series is as follows:

    Original punching and verifying . . . . 2   days
    Listing and summary punching checksums. 1/4 day
    Sorting and gang punching duplicates. . 1   day
    Multiplying . . . . . . . . . . . . . . 1   day
    Sorting, tabulating, and summary punch. 1   day
In this example the checks would have been exact had we included a card for each product regardless of size, but this would have required about forty thousand cards instead of four or five thousand, and the multiplying time would have been over a week instead of one day."

I guess in principle a punched card could have contained 80 decimal places, but it seemed like the ALU equivalents were much narrower.

(this was published by the Thomas J. Watson Astronomical Computing Bureau)


I actually ran into this problem in grad school. I was trying to simulate an optical phased array with a target at the Moon, and it turns out that a single bit of a double is about one wavelength of light at the distance of the Moon. I ended up doing most of the calculations symbolically and only using floats for the differences in the distances between each beam, which was enough to get out of the solar system but not out of the galaxy...


You need multiprecision floating point for very deep mandelbrot zooms - it's really easy to hit the limits of a 53 bit mantissa on today's computers.

Interestingly, you don't need multiprecision floating point for every pixel: you can do precise calculations for a few tentpole pixels and calculate most of the rest at double precision based on the differences between their location and a tentpole. But you also need some significant cleverness to accurately fill in the gap between "most" and "all":

http://www.fractalforums.com/announcements-and-news/pertubat...


> Any other reasons to go beyond 64 (or 80) bits?

Algorithm stability. As precision decreases when magnitude rises, many tricks are used in sci. comp. to scale everything into ranges where computations won't diverge due to a lack of precision.


Why do you need 128-bit floats for storing solar system coordinates? 64-bit floats will give you millimeter precision on Pluto, if your coordinate system has an origin in the sun.
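A quick check in C (Pluto's distance taken very roughly as 7.4e12 m, about its aphelion):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double pluto = 7.4e12;   /* rough aphelion distance from the Sun, metres */
        /* Spacing between adjacent doubles at that magnitude: about 1 mm. */
        printf("%g m\n", nextafter(pluto, INFINITY) - pluto);  /* ~0.000976 m */
        return 0;
    }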


Similarly, this NASA blog post has some fun facts about the precision of pi that JPL uses in its calculations:

https://www.jpl.nasa.gov/edu/news/2016/3/16/how-many-decimal...

* For JPL's highest accuracy calculations, which are for interplanetary navigation, we use 3.141592653589793.

* How many digits of pi would we need to calculate the circumference of a circle with a radius of 46 billion light years to an accuracy equal to the diameter of a hydrogen atom (the simplest atom)? The answer is that you would need 39 or 40 decimal places.


So if he'd only had interplanetary probes, غیاث الدین جمشید کاشانی‎ Ghiyās-ud-dīn Jamshīd Kāshānī could in principle have navigated them to the proper accuracy in 1424.

By 1596 Ludolph van Ceulen had 20 digits, and even more by the time he designed his tombstone: https://upload.wikimedia.org/wikipedia/commons/e/e8/Monument...

39 or 40 decimal places would have to wait until 1699.

https://en.wikipedia.org/wiki/Chronology_of_computation_of_π


Note that 3.141592653589793 is the shortest decimal representation for the closest double precision float to pi.
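One way to check, assuming a math.h that provides M_PI (it's POSIX rather than strict ISO C):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* The 16-digit string parses back to exactly the same double... */
        printf("%d\n", strtod("3.141592653589793", NULL) == M_PI);  /* 1 */
        /* ...while printing 17 significant digits shows the full value. */
        printf("%.17g\n", M_PI);   /* 3.1415926535897931 */
        return 0;
    }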


Is that 2 or 3 dimensions?


As many dimensions as you like, you would use one coordinate for each dimension.


well, 2 dimensions of 64 bits is 128 bits, while 3 dimensions of 64-bit precision is 192 bits... so mm precision to Pluto is along one dimension, I take it. I just wasn't sure of the magnitude we were talking about.


The precision doesn’t change when you change the number of dimensions.

If you are using a 64-bit float, then you have 53 bits of precision in 1D, 53 bits of precision in 2D, and 53 bits of precision in 3D. The precision doesn’t change as you add or remove dimensions.


The system will also be more tolerant of ill-formed numbers, and thus numerically more stable (not running into NAN / INF as frequently).


Babbage's Difference Engine Number 2 has 31 digits of precision (about 103 bits), but it is fixed point rather than floating point. https://en.wikipedia.org/wiki/Difference_engine


High resolution timestamps relative to 1970-01-01T00:00:00


Yes, 64 bits is not really comfortable for a long-term timestamp format.

The NTP nanokernel uses a 64.64 integer representation https://papers.freebsd.org/2000/phk-nanokernel/ because 32 bits above the binary point isn't enough (it runs out in 2038), and below the binary point you need 64 bits to accurately represent the nanosecond-scale frequency of the CPU's cycle counter to translate it to wall-clock time.


The trick I've seen to deal with this problem is to use a pair of doubles: one for the integer part and one for the fractional part.


You can use double-doubles for near-128-bit floating point [1]. It's much more powerful than the 53.53 fixed point you mentioned, but more complicated as well.

[1] http://web.mit.edu/tabbott/Public/quaddouble-debian/qd-2.3.4...
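The basic building block is error-free addition (Knuth's two-sum); a minimal C sketch, assuming round-to-nearest doubles:

    /* two_sum: hi + lo is exactly a + b; hi is the rounded double sum and
       lo is the rounding error it lost. Double-double arithmetic is built
       out of operations like this one. */
    typedef struct { double hi, lo; } dd;

    static dd two_sum(double a, double b) {
        double s   = a + b;
        double bb  = s - a;
        double err = (a - (s - bb)) + (b - bb);
        dd r = { s, err };
        return r;
    }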


Isn't it simpler to use fixed point in this case?


Calculating the heat death of the universe down to the nanosecond are we?

(i mean this is a legitimate reason, it's just amusing the extremes you'd have to go to to break 64bits).


I know it is a joke, but 64-bit fixed-point representation with nanosecond resolution has just ~585 years of range. With 1e100 years of range needed to measure out to the heat-death of the universe, you end up with resolution of ~5e80 years, which is roughly 70 orders of magnitude longer than the current age of the universe.

It is easy to run out of bits when trying to use a single number as both an absolute date/time, and to compute relative durations between timestamps with reasonable precision. I've lost track of the number of bugs I've fixed where someone assumed a double would be big enough without doing the math.
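The arithmetic, for reference:

    2^64 ns  ≈  1.845e19 ns  ≈  1.845e10 s  ≈  585 years
    1e100 years / 2^64  ≈  5.4e80 years per tick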


Hunh? January 1, 1970 + 9.007199254740992e15 nanoseconds gets you all the way to... tax day[1].

Microseconds gets you to the year 2255; milliseconds to heat death.

1. https://www.wolframalpha.com/input/?i=January+1%2C+1970+%2B+...


Quibble: Microseconds gets you to the year 2540 with double precision, because the implied 1-bit means you effectively have a 54-bit mantissa.


Intermediates in computations. There are lots of quantities of the form z=x*y or z=x+y where, while z may well be in the realm of "reasonably sized numbers that fit comfortably in 64 bits", x and y are (beyond) astronomical.


Multiplication doesn't lose precision, does it?

Addition has problems where x is close to negative y, sure.


Well you may still have an x larger than the largest value representable in your floating point system, and y=1/x. It is not insane to want z=x*y=1 to be approximately true.


At that point you're not multiplying x and y to begin with.

So of course you want to avoid error states, but a double already has an exponent range well beyond "beyond astronomical". The width of the universe is around 10^27 metres, a Planck length is around 10^-35, and the limit of the format is ±10^308.


While I know there are better and more numerically stable ways to implement it, think of the softmax function. It is perfectly possible to have a list of non-astronomical/non-Plancklength numbers, naively softmax it, and die because of precision.
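One of the standard stable formulations, sketched in C (shift by the max so no exp() argument is positive), rather than reaching for wider floats:

    #include <math.h>
    #include <stddef.h>

    /* Numerically stable softmax: subtracting the max makes every exponent
       <= 0, so exp() cannot overflow no matter how large the inputs are. */
    void softmax(const double *x, double *out, size_t n) {
        double m = x[0], sum = 0.0;
        for (size_t i = 1; i < n; i++)
            if (x[i] > m) m = x[i];
        for (size_t i = 0; i < n; i++) {
            out[i] = exp(x[i] - m);
            sum += out[i];
        }
        for (size_t i = 0; i < n; i++)
            out[i] /= sum;
    }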


Softmax goes out the other side. Naively tossing around exponentials is so bad that it can easily explode any float, even 128 bit. It only adds another 4 bits to the exponent, after all. Even a 256 bit float has less than 20 bits of exponent! So that's an example where floats in general can go wrong, but it's not an example of where using larger floats is a meaningful help.


Fair enough.


IMHO there's no reason for hardware support for floats beyond 64 bit.

If 64 bit IEEE-754 floating point numbers do not support your needs, adding more bits will almost certainly not fix the problem. The problem isn't lack of bits, it's going to be something fundamental about the nature of IEEE-754. You'll probably need something like a rational data type, or a symbolic algebra system that can abstractly represent everything that shows up in your domain without losing any precision. Generally you'll be required to do this in software.

32 bit is the sweet spot between precision and performance. There are only a handful of things that require 64 bits (notably latitude and longitude), but these things are common enough that hardware support is (IMO) valuable.


There's quite a lot of citations to problems fixed by high precision numbers here: https://www.mdpi.com/2227-7390/3/2/337/htm


The first example won't be solved by adding a higher precision floating point type.

Basically, they're summing up a bunch of intermediate calculations. The intermediate calculations are 64 bit floats, and are precise to 17 decimal digits or whatever. They sum all the intermediate components, and get a result that's only precise to 13 digits or whatever. They use double doubles to perform the summation, and they get a result that's back to being precise to 17 decimal digits. Great.

Now imagine doing this with 128 bit floats, which are precise to 33 digits. So you sum the intermediate results, now you're precise to 29 digits. So you've lost 4 digits of precision again. So you add a 192 bit floating point type...

IEEE-754 floating points are always going to have that problem. If you add two numbers of differing magnitude, you're going to lose precision. Finding the sum of a large sequence is generally going to result in the addition of a very large running total with small individual members.

(I perused the other examples, but they seem to be a variation on the same theme)


I think you are missing the point here; sure, you still lose significant digits regardless of how many you started with. But the point was that for many practical real world applications quadruple floats have so much headroom that you are able to perform the needed calculations and still end up with a sufficient number of significant digits, whereas with doubles you don't. Choice quote:

> This permitted benchmark results to be accurately reproduced for a significantly longer time, with virtually no change in total run time

They clearly understand that bigger floats do not magically change the fundamental behavior, but the additional headroom makes a significant difference.


Here's the rest of the quote:

> A larger example of this sort arose in an atmospheric model (a component of large climate model). While such computations are by their fundamental nature “chaotic,” so that computations will eventually depart from any benchmark standard case, nonetheless it is essential to distinguish avoidable numerical error from fundamental chaos.

> Researchers working with this atmospheric model were perplexed by the difficulty of reproducing benchmark results. Even when their code was ported from one system to another, or when the number of processors used was changed, the computed data diverged from a benchmark run after just a few days of simulated time. As a result, they could never be sure that in the process of porting their code or changing the number of processors that they did not introduce a bug into their code.

> After an in-depth analysis of this code, He and Ding found that merely by employing double-double arithmetic in two critical global summations, almost all of this numerical variability was eliminated. This permitted benchmark results to be accurately reproduced for a significantly longer time, with virtually no change in total run time [3].

The keyword is "numerical variability." Here's the referenced article: https://link.springer.com/article/10.1023/A:1008153532043 Choice quote from the article:

> In climate model simulations, for example, the initial conditions and boundary forcings can seldomly be measured more accurately than a few percent. Thus in most situations, we only require 2 decimal digits accuracy in final results. But this does not imply that 2 decimal digits accuracy arithmetic (or 6-7 bits mantissa plus exponents) can be employed during the internal intermediate calculations. In fact, double precision arithmetic is usually required.

The problem isn't lack of precision. The problem is numerical instability when adding up a bunch of numbers with high absolute values which, being roughly evenly split between positive and negative, sum to approximately 1. IEEE-754 floats, as useful as they are, are just bad at this, and adding more bits isn't a solution, it's a punt. They used Kahan summation or Bailey summation and the problem went away. No 128 bit hardware floats required. (Kahan summation is very well known, Bailey summation is new to me)

Here's my point: if double precision floating point doesn't satisfy your needs, you should dig into the problem and understand why. Understand first, write code second. 999 times out of 1000 the solution isn't "we need 128 bit floats", and for that 0.1% we're waaaay better off telling those people "Sorry, do it in software and take the performance hit."


> Professor Kahan was last seen ranting about benchmarks.

Incorrect. Scroll to the bottom of this page to find works dated 2016:

https://people.eecs.berkeley.edu/~wkahan/



I hope my new Mill CPU will natively support Unums. And the whole thing should be memristor-based of course, running on carbon nanotube batteries.


"There’s even FYL2XP1, for all those occasions when you need to compute ylog2(x+1) on the double" When would I want to do that?


“ The (ε+1) expression is commonly found in compound interest and annuity calculations. The result can be simply converted into a value in another logarithm base by including a scale factor in the ST(1) source operand.” https://www.felixcloutier.com/x86/fyl2xp1
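In C terms this is the same territory as log1p; e.g. a hedged sketch of a compound-interest style use (made-up rate):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double r = 1e-9;   /* hypothetical per-period interest rate */
        /* Periods needed to double the principal: log(2) / log(1 + r).
           log1p keeps the precision that log(1.0 + r) would throw away. */
        printf("%.17g\n", log(2.0) / log1p(r));   /* ~6.93e8 periods */
        return 0;
    }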



