The quadratic formula and low-precision arithmetic (johndcook.com)
147 points by ingve on April 28, 2018 | 41 comments



We have, in the past, had code that computes the roots of quadratic equations in what appears to be a needlessly complex way. It's deliberately uncommented, and deliberately has an innocuous name.

When a newer employee comes across the code they often "simplify" it using the quadratic formula they know so well, and which they believe always works.

The tests go red.

They roll back and try to understand. Eventually they either work it out, or they come and ask for advice. We work through the test case with them, and point out just how important it is that:

* There are tests;

* Routines have useful and enlightening names;

* If the code is complex, there's probably a reason, although not always;

* Sometimes a comment might be useful.

It's proven to be a really useful learning experience, and gets people up to speed very quickly. The code is then left as it was, a learning experience waiting to happen.

Of course, this can only work in a culture where people work constructively to understand what's going on, and support each other - a place where it's OK to say that you don't understand something, and know that others will work with you. I've been in places where this wouldn't work.

All in all, I've been lucky.

The tests, by the way, only have a moderate size b, but have a very small a. In this way sqrt(b^2 - 4ac) is very close to b, so one of the solution paths is subtracting nearly equal quantities, getting small rubbish. But then you go on to divide that by a very small number (2a), giving big rubbish.

The geometric interpretation is that the parabola is very, very flat, so if you move it even the tiniest amount up or down the roots move very far, very fast.
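
A minimal JS sketch of that failure mode, with made-up coefficients rather than the actual test values:

    // Moderate b, very small a: the naive formula subtracts nearly equal
    // numbers and then divides by a tiny 2a.
    const a = 1e-16, b = 1, c = 1
    const m = Math.sqrt(b * b - 4 * a * c)   // very close to b

    console.log((-b + m) / (2 * a))          // about -1.11; the true small root is about -1

    // Stable alternative: take the root with no cancellation and recover
    // the other one from the product of the roots, c/a.
    const big = (-b - m) / (2 * a)           // about -1e16, no cancellation
    console.log((c / a) / big)               // about -1, correct to within an ulp or two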

It's a useful example - my thanks to johndcook for the write-up.



From the book "Real Computing Made Real" by Acton, the suggestion (not always applicable!) is to first view the quadratic as x^2 + 2bx + c, so its roots are -b +/- sqrt(b^2 - c). The product of the roots is c, and therefore we can compute the root that does not involve the subtraction and then divide c by that root to get the other root.

This avoids any of the bad subtraction.

Some sample JS code that corresponds to the quadratic in the post:

    // Coefficients of x^2 + 2bx + c, corresponding to the quadratic in the post
    const b = 5e7
    const c = 1

    const d = b * b - c   // (quarter) discriminant: real roots iff d >= 0

    if (d >= 0) {
        const m = Math.sqrt(d)
        let r, s
        if (b > 0) {
            r = -b - m   // both terms negative: no cancellation
            s = c / r    // other root from the product of roots, which is c
        } else if (b < 0) {
            s = -b + m   // both terms positive: no cancellation
            r = c / s
        } else {
            r = m
            s = -r
        }
        console.log(r, s)
    } else {
        console.log("no real roots")
    }
This gives the correct roots for this problem in one go.


Numerical Recipes says to find q = -(1/2)[b + sgn(b) sqrt(b^2 - 4ac)] and then find x1 = q/a and x2 = c/q. Is that better?
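
A minimal JS sketch of that recipe, with made-up coefficients, assuming real roots and nonzero a and q:

    // Numerical Recipes form: q = -(b + sign(b)*sqrt(b^2 - 4ac)) / 2,
    // then x1 = q/a and x2 = c/q.  Neither root involves subtracting
    // nearly equal quantities.
    const a = 1, b = 1e8, c = 1
    const q = -0.5 * (b + Math.sign(b) * Math.sqrt(b * b - 4 * a * c))
    console.log(q / a, c / q)   // about -1e8 and -1e-8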

http://www.solipsys.co.uk/cgi-bin/sews.py?NumericallyStableS...


> The quadratic equation violated the cardinal rule of numerical analysis: avoid subtracting nearly equal numbers. The more similar two numbers are, the more precision you can lose from subtracting them.

Are there other "rules" like this, regarding numerical analysis? I understand this, but I've never thought of articulating it like that.


If you add together many numbers, do it in ascending order, numbers with smallest absolute values first.

Comparing floats is tricky. Use either an absolute epsilon, a relative epsilon, or ULPs (all are tricky).
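
For the comparison part, a small sketch of the relative-epsilon idea (nearlyEqual and the tolerances are arbitrary choices, not recommendations):

    // Treat x and y as equal if they differ by less than a relative tolerance,
    // with an absolute floor for values near zero.
    function nearlyEqual(x, y, relTol = 1e-12, absTol = 1e-12) {
        const diff = Math.abs(x - y)
        return diff <= Math.max(absTol, relTol * Math.max(Math.abs(x), Math.abs(y)))
    }

    console.log(0.1 + 0.2 === 0.3)            // false
    console.log(nearlyEqual(0.1 + 0.2, 0.3))  // true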


That costs a sort, better to use the Kahan summation formula https://en.wikipedia.org/wiki/Kahan_summation_algorithm
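
For reference, a minimal sketch of Kahan summation (kahanSum is a made-up name; the test values are arbitrary):

    // Carry a running compensation term so the low-order bits lost by
    // each addition get fed back into the next one.
    function kahanSum(xs) {
        let sum = 0
        let comp = 0
        for (const x of xs) {
            const y = x - comp
            const t = sum + y
            comp = (t - sum) - y   // what (sum + y) just lost
            sum = t
        }
        return sum
    }

    // Summing 0.1 ten million times: the naive left-to-right sum drifts
    // visibly, the compensated sum stays within a couple of ulps of 1e6.
    const xs = new Array(1e7).fill(0.1)
    console.log(xs.reduce((s, x) => s + x, 0))
    console.log(kahanSum(xs))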


Look at the accuracy comment: it's faster but less accurate than sorting first.


If the numbers happen to be >=0, one could do a little better by merging in Huffman order. In big 'O' the time complexity is the same.


There are surprisingly many situations where the unwary end up violating this cardinal rule. For example: solving a system of linear equations. Mathematically, you can use any pivot, but numerical stability suffers if you don't take the largest possible one. All of this used to be standard in 1st year engineering classes, so I'm always quite surprised when I see articles like this on HN.
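
A tiny illustration of the pivoting point, on a made-up 2x2 system whose true solution is x ≈ 1, y ≈ 1:

    // Solve  eps*x + y = 1,  x + y = 2  by Gaussian elimination.
    const eps = 1e-20

    // Using the tiny pivot eps: the elimination step swamps the other row.
    const yBad = (2 - 1 / eps) / (1 - 1 / eps)   // rounds to exactly 1
    const xBad = (1 - yBad) / eps                // cancellation, then /eps: gives 0
    console.log(xBad, yBad)                      // 0 1  (x is completely wrong)

    // Partial pivoting: swap the rows so the pivot is the largest entry (1).
    const yGood = (1 - 2 * eps) / (1 - eps)      // about 1
    const xGood = 2 - yGood                      // about 1
    console.log(xGood, yGood)                    // 1 1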


> So we might be interested in low-precision arithmetic to save energy in a battery powered mobile device, or to save clock cycles in a server application manipulating a lot of data. This means that the numerical tricks that most people have forgotten about are relevant again

The more I experience engineering advances the more I see how cyclical techniques and technologies are. Almost everything old that was useful will eventually be needed again. Early computers were severely memory and cpu restricted. Then we got giant computers. Then we're putting microcontrollers everywhere and need those frugal techniques again. And so on the cycle will continue.

A possible conclusion is that the danger of obsolescence from really profound knowledge in a field is far lower than one would expect, especially if you can weather the cycles of technology and find new markets to apply your expertise.


Or do what I do and work with integers and do the heavy stuff on a big computer with lots of CPU and memory.

Integer math is fast, easy, and not error prone unless you cross the limits in limits.h, and those are simple bounds checks, or you size the vars large enough that you never need to worry about them. If you do use floats, grab the data and send it on its way; don't process it if you don't have to. Drop it onto MQTT/CoAP/AMQP.

I'm thinking in the similar way Klipper offloads 3d printer math to a heavy computer with GHz and GBs of ram for trig math and array compensation.


I'm pretty sure integer (i.e., fixed point) math suffers from the same problem here. The only difference is that (a) for a given size, you get a little more precision, since you're not spending bits on the exponent, and (b) you have to hand-code the shift where you lose precision, instead of letting the FPU automatically make the shift where you lose precision.

Sometimes you can get clever with the modular arithmetic--like, it's okay to add (or subtract) a list of fixed-point numbers where partial sums overflow and wrap, as long as the final result doesn't. I don't know any useful analogy to that here, though.
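
A small sketch of that wrap-around property, using JS's 32-bit coercion (| 0) as a stand-in for fixed point; the values are made up:

    // The partial sum overflows 32 bits and wraps, but the final result
    // still fits, so it comes out exactly right.
    let acc = 0
    acc = (acc + 1500000000) | 0
    acc = (acc + 1500000000) | 0   // wraps to -1294967296
    acc = (acc - 1000000000) | 0   // wraps back into range
    console.log(acc)               // 2000000000, the exact sum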


When dealing with floating point, every conversion and computation leads to 'wrong answers' that are only in close proximity to the correct answer.

This paper discusses a great deal the sources of error, and how the IEEE standard does things. https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...

Using integers instead bypasses this source of error. The only caveat that I pay attention to is that you have to be able to expand the size of the space to prevent decimals from being used (think of multiplying by 100 when dealing with cash). And if you do deal with floats, do the math you have to and then drop back into ints, or save them and send them off.


Oh, and to be complete: The reason why multiplying by 100 = 2^2 * 5^2 helps when dealing with cash is that 0.20 isn't exactly representable in a finite-length binary number (since that's 1/5, and 5 isn't a power of two). It's the multiplication by 5^2 = 25 that makes previously inexact things exact. The additional multiplication by 2^2 = 4 just moves the decimal place by two digits, nothing the FPU couldn't have done on its own. You could represent cash in units of four cents (or two cents, or 256 cents) instead of one cent, and the result would be unchanged except for the location of the decimal place.
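
A quick illustration, with made-up amounts:

    // 0.1 and 0.2 dollars are not exact binary fractions, but 10 and 20 cents are.
    console.log(0.1 + 0.2)         // 0.30000000000000004
    console.log(10 + 20)           // 30, exact, in cents
    console.log((10 + 20) / 100)   // 0.3, with a single rounding at the very end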

There is no useful, general analogy to this in numerical analysis. If you're using any fixed-sized representation of numbers--floating point, fixed point, or otherwise--then some ways to calculate an expression will be less accurate than others. You can avoid this with variable-length representations, but then some ways are slower. There is no way to avoid thinking about numerical conditioning.


Your comment is deeply misleading. Floating point is mostly just fixed point plus some rules for automatically moving the decimal point. The point is that depending on how you express the formula, you need to "expand the size of the space to prevent decimals from being used" (what do you even mean by that, if your answer has no exact binary representation? how would you express 1/7, or 10000/7, or 65536/7?) by spectacularly more or less.

If you think I'm wrong: Can you code the quadratic formula using integer or other fixed-point math, and demonstrate why it doesn't need the attention to bad numerical conditioning noted in the parent?


> Your comment is deeply misleading. Floating point is mostly just fixed point plus some rules for automatically moving the decimal point.

Uhhhhh..... I hope this is a farcical comment. Because if you believe this, then you might want to read up here: https://en.wikipedia.org/wiki/Floating-point_arithmetic


Perhaps I'm misunderstanding you somewhere; but this is basically what I do for a living, so I don't think I'm wrong. I have carefully re-read my comments above, and would urge you to do the same.

Again: Can you write the quadratic formula in the way that you think is better (in pseudocode, or a language of your choice), and give examples of inputs where it performs better than floating point? If you don't do this and I'm wrong, then you're missing the opportunity to teach me (and all the people I'm misleading with my comments) something. If I'm right, then you're missing the opportunity to learn something yourself.


Rounding issues in the quadratic formula have bitten me several times: in particular, the inverse of a 2D bilinear mapping requires solving a quadratic equation which (if things are skewed) can be very badly conditioned. On the other hand, if one uses a bilinear map on a parallelogram then the quadratic term is gone and one root is zero: also annoying.

In practice I just check the coefficients and then use appropriate versions of the quadratic formula. I guess that I am luckier than Mr. Cook here since I only need one root.

code (some of it was written by me, some improvements are by other people): https://github.com/dealii/dealii/blob/master/source/fe/mappi...

This set of cases (which can probably be further improved) is equivalent to his first answer. I don't think it's possible to do better in this test case due to the catastrophic cancellation in subtracting the square root of the discriminant from b.


The inverse bilinear map is a good example where, even if there is an analytical solution, it's often more robust to just do a couple of Newton iterations until you get down to machine precision.

An added benefit is that it carries over to 3D (although with a larger Jacobian, of course).


You are absolutely right re: robustness (and that code falls back to the iterative method), but IIRC the Newton scheme was at least one order of magnitude slower than the formula. It was also slightly less accurate simply because it did more floating point operations.

I have not yet figured out how to make the 3D formula work so we always do the iterative method in 3D.

A nice trick: we approximate the (bi/tri)linear map as affine (usually quite accurate) to get a good initial guess.


I always try to relate these floating point issues back to real situations. Is there really a case where you would use a quadratic equation such as this and what kind of precision would you really be interested in? In the first example, the root is wrong by 25% in percentage terms but is wrong by less than 2.5e-9 in absolute terms.

What is important is to tailor your maths to the real problem you are solving (and remember garbage in = garbage out). Very rarely do you actually need 53 bits of precision (as you get in a double), or even need to store intermediate results with 53 bits of precision.


This exact issue turns out to be critical in the Hessenberg QR algorithm, which is at the heart of MATLAB's eig.


Yeah I understand why a product like Matlab which needs accuracy across a wide range of end use cases would need to consider things like this.


This is great and reminds me of an issue we ran into developing the numerical engine in circuit simulation software; user-specified values spanning many orders of magnitude mean you're likely to run into issues like this. In our case the answer was to go to 128-bit floating point operations inside our sparse matrix solvers, which we published here: https://www.edn.com/design/analog/4418707/1/Extended-precisi...


> We measured less than 3% slowdown within a typical well-behaved circuit simulation.

That's pretty amazing. CPU pipelining at work, or was the profile just dominated by other things?


A raytracer I worked on had an interesting bug, where a manual optimization caused some unforeseen floating-point issues down the line.

The short story was that we had an expression of the form x = a/b * c/d * e/f in a fairly hot code path. As most developers know, divisions are expensive, so this got rewritten to x = (a * c * e) / (b * d * f).

The problem was that in certain edge cases, the factors were all extremely small. This caused the product b * d to become a denormal number, and when multiplied with f it became zero. This of course caused the division to return infinity.

However, closer inspection of the source of the terms a to f revealed that the pairs a,b etc always had roughly the same magnitude. Thus even though a and b were almost denormal numbers, the division a / b was well-behaved and gave a sensible result, and similar for the others.

Undoing the optimization and using the original form made the code well-behaved again.

Another expression that had to be de-optimized for the same reason was x = a / (b * b), which had to be rewritten as x = (a / b) / b.

So as with subtraction, division of IEEE floating-point numbers can be tricky.
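
A small sketch of the effect in doubles; the magnitudes are made up to force the underflow:

    // All six factors are tiny and each pair has roughly the same magnitude.
    const a = 3e-108, b = 1e-108, c = 3e-108, d = 1e-108, e = 3e-108, f = 1e-108

    // "Optimized" form: the denominator underflows to 0, the numerator is a
    // denormal, and the division blows up.
    console.log((a * c * e) / (b * d * f))      // Infinity

    // Original form: each ratio is well-behaved.
    console.log((a / b) * (c / d) * (e / f))    // 27 (to within rounding)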

This[1] page contains a nice set of tables with the results of performing IEEE floating-point operations on edge-case inputs.

[1]: https://www.cs.uaf.edu/2011/fall/cs301/lecture/11_09_weird_f...


I wonder how much the sqrt function is simplifying our lives here — might it be simpler to deal with these issues in terms of Newton's method directly, just ignoring the quadratic formula?
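
One way to read that, sketched with made-up coefficients: start from the naively computed root and polish it with a few Newton steps on f(x) = ax^2 + bx + c.

    // Polish the badly cancelled small root of x^2 + 1e8*x + 1.
    const a = 1, b = 1e8, c = 1
    const f  = x => (a * x + b) * x + c   // Horner form
    const df = x => 2 * a * x + b

    let x = (-b + Math.sqrt(b * b - 4 * a * c)) / (2 * a)
    console.log(x)   // about -7.45e-9; the true root is about -1e-8

    for (let i = 0; i < 3; i++) x -= f(x) / df(x)   // Newton iterations
    console.log(x)   // about -1e-8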


Unums? I bet that many of us would be very interested in the thoughts of folks here who know about them. Is it a revolution, or is it nonsense, an engineering mistake?


Posits are a hardware-friendly implementation of unums, a sort of halfway house on the road to Gustafson's full vision of unums. Posits have a fixed size and are meant to be a drop-in replacement for IEEE floats. They can be, and are being, implemented in hardware.

Unums are more ambitious. They would have variable lengths, so implementing them would require more changes than just changing the way a fixed-length set of bits is interpreted.


As part of the spec, a true posit implementation includes an exact accumulator unit. This has been described for decades, is pipelineable, and allows for, say, 10e20 + 1 - 10e20 == 1. We've shown that this gives a speedup and about a 10^5 improvement in accuracy for Linpack 1000.


Unums are a bit of technically questionable idealism. Posits, on the other hand, simply seem technically superior to floats in almost every way (including many measures of simplicity and efficiency).

But technical inertia is such that the best newcomer cannot always beat the established ways.

edit: Somewhat relevant xkcd :)

https://xkcd.com/567/


Unums are an odd duck, but the related posits from the same person make a lot of sense. They extend the range of floating point numbers by slowly tapering the relative accuracy the further you get away from 1. I find the paper[0] very readable, but if you just want some quick examples of the numbers, I wrote a blog post[1] some time ago which you may find helpful.

[0] http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf [1] http://nhaehnle.blogspot.de/2017/09/posits.html


Whenever I use floating point for anything serious, I always feel tempted to use modules like decimal or bigfloat from the start, just to eliminate potential problems that can (and eventually will) crop up.


I worked at a small company that had a mathematician on staff. He wanted to retire, and hired a replacement.

The new hire made some comments that I thought sounded irresponsible, such as that almost nobody does numerical analysis; instead they use double precision because, usually, even with catastrophic cancellation there are enough bits left to give a useful answer. But, apparently, "use a higher precision" is a common answer in scientific programming, and it's much easier to get right than numerical analysis ( https://www.davidhbailey.com/dhbtalks/dhb-expm-sun.pdf , first and last slides).

If you can afford it, that probably is the best approach. Cook’s article specifically mentions that the issue crops up with single precision floating point (which people might want for SIMD or GPU reasons, or on mobile to save power). In other words, Cook is talking about cases where you can’t afford extended precision.


Interesting historical piece from George Forsythe: http://i.stanford.edu/pub/cstr/reports/cs/tr/66/40/CS-TR-66-...


> CPUs have gotten much faster, but moving bits around in memory has not.

Would it then not suffice to just load 32 bit floats, convert them to 64 bits on the processor, and then store them again as 32 bits? Essentially, we are 'compressing' our data in memory, whilst doing the computational steps with full precision.
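
In JS terms, typed arrays already behave roughly this way; a small sketch with made-up data:

    // Store the data as 32-bit floats; all arithmetic on the loaded values
    // happens in 64-bit doubles, and writes round back down to 32 bits.
    const data = new Float32Array([0.1, 0.2, 0.3])   // stored rounded to 32 bits

    let sum = 0
    for (const v of data) sum += v   // v is widened to a double when read

    console.log(sum)                 // accumulated in double precision
    data[0] = sum                    // rounded back to 32 bits on store
    console.log(data[0])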


There is the 80 bit "long double" type that offers excess precision over the 64 bit double type. But if you want to write code that's both portable and reliable, you have to turn off a lot of optimizations and use 32 or 64 bit types with standardised behavior.


Am I missing something, or is this just an example from "What Every Computer Scientist Should Know About Floating-Point Arithmetic"? There are some other nice ones there as well. And when did any of this cease being relevant?


> Well, there’s an interesting wrinkle. When the linear coefficient b is large relative to the other coefficients, the quadratic formula can give wrong results when implemented in floating point arithmetic.

No, there is no wrinkle whatsoever: your computer, guided by your assumptions and given inadequate input (let alone processing logic), got it wrong.

The day you start to blame inputs for errors in output you have complete control over is the day you should seriously consider whether you are in the right game.


He’s not blaming the inputs. He’s simply saying that some inputs can produce incorrect outputs. He’s not saying the incorrect outputs are the _fault_ of the inputs. He goes on to explain where the blame actually lies, in floating point precision.



