There are quite a few citations here to problems fixed by high-precision numbers: https://www.mdpi.com/2227-7390/3/2/337/htm



The first example won't be solved by adding a higher precision floating point type.

Basically, they're summing up a bunch of intermediate calculations. The intermediate results are 64-bit floats, precise to roughly 17 decimal digits. They sum all the intermediate components and get a result that's only precise to roughly 13 digits. They use double-doubles to perform the summation, and they get a result that's back to being precise to 17 decimal digits. Great.
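
"Double-double" is worth a concrete sketch here: each addition is done with an error-free transformation, so the low-order bits that a plain 64-bit add would discard get carried along in a second double. A rough Python sketch of the idea (function names are my own, not the paper's code):

    def two_sum(a, b):
        # Knuth's TwoSum: returns (s, err) with a + b == s + err exactly
        s = a + b
        bb = s - a
        return s, (a - (s - bb)) + (b - bb)

    def quick_two_sum(a, b):
        # cheaper variant, valid when |a| >= |b|
        s = a + b
        return s, b - (s - a)

    def dd_sum(values):
        hi, lo = 0.0, 0.0                  # double-double accumulator: value ~ hi + lo
        for x in values:
            s, e = two_sum(hi, x)          # add x, capturing the rounding error e
            hi, lo = quick_two_sum(s, e + lo)
        return hi + lo                     # collapse back to a single double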

Now imagine doing this with 128-bit floats, which are precise to roughly 33 digits. You sum the intermediate results and you're down to 29 digits: you've lost 4 digits of precision again. So you add a 192-bit floating-point type...

IEEE-754 floating point is always going to have that problem. If you add two numbers of differing magnitude, you lose precision, and summing a long sequence generally means adding small individual terms to a very large running total.
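
A tiny Python illustration of both the loss and the order dependence (the same effect that shows up when a different processor count changes the reduction order); math.fsum is the standard library's exact summation:

    import math

    a = [1e16, 1.0, -1e16]
    b = [1e16, -1e16, 1.0]             # same numbers, different order

    print(sum(a), sum(b))              # 0.0 1.0 -- the 1.0 is absorbed by the big running total
    print(math.fsum(a), math.fsum(b))  # 1.0 1.0 -- exact summation is order-independent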

(I perused the other examples, but they seem to be variations on the same theme.)


I think you're missing the point here: sure, you still lose significant digits regardless of how many you started with. But the point is that for many practical real-world applications, quadruple-precision floats have so much headroom that you can perform the needed calculations and still end up with a sufficient number of significant digits, whereas with doubles you can't. Choice quote:

> This permitted benchmark results to be accurately reproduced for a significantly longer time, with virtually no change in total run time

They clearly understand that bigger floats do not magically change the fundamental behavior, but the additional headroom makes a significant difference.


Here's the rest of the quote:

> A larger example of this sort arose in an atmospheric model (a component of large climate model). While such computations are by their fundamental nature “chaotic,” so that computations will eventually depart from any benchmark standard case, nonetheless it is essential to distinguish avoidable numerical error from fundamental chaos.

> Researchers working with this atmospheric model were perplexed by the difficulty of reproducing benchmark results. Even when their code was ported from one system to another, or when the number of processors used was changed, the computed data diverged from a benchmark run after just a few days of simulated time. As a result, they could never be sure that in the process of porting their code or changing the number of processors that they did not introduce a bug into their code.

> After an in-depth analysis of this code, He and Ding found that merely by employing double-double arithmetic in two critical global summations, almost all of this numerical variability was eliminated. This permitted benchmark results to be accurately reproduced for a significantly longer time, with virtually no change in total run time [3].

The key phrase is "numerical variability." Here's the referenced article, with a choice quote from it below: https://link.springer.com/article/10.1023/A:1008153532043

> In climate model simulations, for example, the initial conditions and boundary forcings can seldomly be measured more accurately than a few percent. Thus in most situations, we only require 2 decimal digits accuracy in final results. But this does not imply that 2 decimal digits accuracy arithmetic (or 6-7 bits mantissa plus exponents) can be employed during the internal intermediate calculations. In fact, double precision arithmetic is usually required.

The problem isn't lack of precision. The problem is numerical instability when adding up a bunch of numbers with large absolute values that are roughly evenly split between positive and negative, so their sum is approximately 1. IEEE-754 floats, as useful as they are, are just bad at this, and adding more bits isn't a solution, it's a punt. They used Kahan summation or Bailey's double-double summation and the problem went away. No 128-bit hardware floats required. (Kahan summation is very well known; Bailey's double-double was new to me.)
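
For anyone curious, Kahan (compensated) summation is only a few lines. A minimal Python sketch, with a toy data set of my own (not the climate model's numbers):

    def kahan_sum(values):
        total = 0.0
        c = 0.0                     # running compensation: low-order bits lost so far
        for x in values:
            y = x - c               # fold the previous step's error back in
            t = total + y           # big + small: low-order bits of y may be dropped here...
            c = (t - total) - y     # ...so recover them algebraically into c
            total = t
        return total

    # Toy example: a large running total swallows small terms one at a time.
    data = [1e16] + [1.0] * 1000 + [-1e16]
    print(sum(data))                # 0.0     -- every 1.0 was rounded away
    print(kahan_sum(data))          # 1000.0  -- the compensation term keeps them

(math.fsum gives 1000.0 here too, and the double-double accumulation mentioned upthread buys even more headroom than a single compensation term.)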

Here's my point: if double-precision floating point doesn't satisfy your needs, you should dig into the problem and understand why. Understand first, write code second. 999 times out of 1000 the solution isn't "we need 128-bit floats", and for the remaining 0.1%, we're way better off telling those people "Sorry, do it in software and take the performance hit."



