Floating point visually explained (2017) (fabiensanglard.net)
288 points by luisha on Nov 28, 2021 | 57 comments



Why is everyone complaining about people finding floats hard? Sure, scientific notation is easy to grasp, but you can't honestly tell me that it's trivial AFTER you consider rounding modes, subnormals, etc. Maybe if hardware had infinite precision like the real numbers nobody would be complaining ;)

One thing I dislike in discussions about floats is this incessant focus on the binary representation. The representation is NOT the essence of the number. Sure it matters if you are a hardware guy, or need to work with serialized floats, or some NaN-boxing trickery, but you can get a perfectly good understanding of binary floats by playing around with the key parameters of a floating point number system:

- precision = how many bits you have available

- exponent range = lower/upper bounds for exponent

- radix = 2 for binary floats

Consider listing out all possible floats given precision=3, exponents from -1 to 1, radix=2. See what happens when you have a real number that needs more than 3 bits of precision. What is the maximum rounding error using different rounding strategies? Then move on to subnormals and see how that adds a can of worms to underflow scenarios that you don't see in digital integer arithmetic. For anyone interested in a short book covering all this, I would recommend "Numerical Computing with IEEE Floating Point Arithmetic" by Overton [1].

[1]: https://cs.nyu.edu/~overton/book/index.html
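If you want to play with exactly those parameters, here is a rough Python sketch of that exercise; the encoding convention (an integer significand scaled by radix^(exponent - precision + 1)) is just one reasonable choice, not something prescribed by the book:

    # Enumerate every nonzero normal value of a toy system:
    # precision = 3 bits, exponent from -1 to 1, radix = 2 (no subnormals).
    precision, radix = 3, 2
    values = set()
    for e in range(-1, 2):                                          # exponent range
        for m in range(radix**(precision - 1), radix**precision):   # significands 100..111 (binary)
            for sign in (1, -1):
                values.add(sign * m * radix**(e - precision + 1))
    print(sorted(v for v in values if v > 0))
    # [0.5, 0.625, 0.75, 0.875, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]

Rounding a real number such as 2.3 to the nearest value in that list gives 2.5, an error of 0.2, which makes the precision limit concrete.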


I think the binary representation is the essence of floating point numbers, and if you go beyond the "sometimes, the result is slightly wrong" stage, you have to understand it.

And so far the explanation in the article is the best I found, not least because subnormal numbers appear naturally.

There is a mathematical foundation behind it of course, but it is not easy for a programmer like me. I think it is better to think in terms of bits and the integers they make, because that's what the computer sees. And going this way, you get NaN-boxing and serialization as a bonus.

Now, I tend to be most comfortable with a "machine first", bottom-up, low level approach to problems. Mathematical and architectural concepts are fine and all, but unless I have some idea of what it looks like in memory and the kind of instructions being run, I tend to feel lost. Some people may be more comfortable with high level reasoning; we don't all have the same approach, and that's what I call real diversity. It is a good thing.


Sorry, I didn't mean to downplay the value of using concrete examples. I absolutely agree that everyone learns better from concrete settings, which is why my original comment fixed the parameters for people to play with. I was referring more to the discussions of how exponents are stored with a bias, how the leading mantissa bit is an implied 1 (except for subnormals), and so on. All these are distracting features that can (and should) be covered once the reader has a strong intuition of the more fundamental aspects.


The binary form is important for understanding the implementation details. You even mention underflow. It's difficult for most people to initially understand why a large number that fits exactly in an integer of the same size can't be stored accurately as a float.

The binary form handily demonstrates the limitations. Understanding the floating point instructions is kinda optional but still valuable.

Otherwise everyone should just use varint-encoded arbitrary precision numbers.


It also explains why 0.1+0.2 is not 0.3. With binary IEEE-754 floats, none of those can be represented exactly[a]. With decimal IEEE-754 floats, it's possible, but the majority of hardware people interact with works on binary floats.

[a]: Sure, if you `console.log(0.1)`, you'll get 0.1, but it's not possible to express it in binary exactly; only after rounding. 0.5, however, is exactly representable.


    Python 3.9.5
    >>> 0.1.hex()
    '0x1.999999999999ap-4'
    >>> 0.2.hex()
    '0x1.999999999999ap-3'
    >>> (0.1 + 0.2).hex()
    '0x1.3333333333334p-2'
    >>> 0.3.hex()
    '0x1.3333333333333p-2'


But they are repeating. So, by definition, they are not exactly representable in a (binary) floating point system. Again, that’s why 0.1 + 0.2 is not 0.3 in binary floating point.


These are not "repeating". This is showing the exact binary representation of the nearest double precision value in each case.
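For anyone following along, the exact value each literal rounds to can be inspected directly, for example in Python:

    >>> from fractions import Fraction
    >>> Fraction(0.1)                 # the exact rational stored for the literal 0.1
    Fraction(3602879701896397, 36028797018963968)
    >>> from decimal import Decimal
    >>> Decimal(0.1)                  # the same value written out in decimal
    Decimal('0.1000000000000000055511151231257827021181583404541015625')

Nothing repeats there; it is the finite binary fraction (an integer over 2^55) closest to 1/10.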


> It's difficult for most people to initially understand why you can't store a large number that can be represented by an equivalent size integer as a float accurately.

Because you don't have all the digits available just for the mantissa? That seems quite intuitive to me, even if you don't know about the corner cases of FP. This isn't one of them.


I was going to respond with something like this. I think for getting a general flavor*, just talking about scientific notation with rounding to a particular number of digits at every step is fine.

I guess -- the one thing that explicitly looking at the bits does bring to the table, is the understanding that (of course) the number of bits or digits in the mantissa must be less than the number of bits or digits in an equivalent length integer. Of course, this is pretty obvious if you think about the fact that they are the same size in memory, but if we're talking about sizes in memory, then I guess we're talking about bits implicitly, so may as well make it explicit.

* actually, much more than just getting a general flavor, most of the time in numerical linear algebra stuff the actual bitwise representation is usually irrelevant, so you can get pretty far without thinking about bits.


>"Sure it matters if you are a hardware guy, or need to work with serialized floats, or some NaN-boxing trickery"

Might you or someone else elaborate on what "NaN-boxing trickery" is and why such "trickery" is needed?


In a 64-bit double, a NaN is any bit pattern where the 11 exponent bits are all ones and the mantissa is nonzero, so roughly 51 payload bits (plus the sign bit) are free; you can put data in there without impeding your ability to represent every ordinary floating point value.

For example, if you are designing a scripting language interpreter, you can make every value a floating point number and use some of the NaNs to represent pointers, booleans, etc.

See also http://wingolog.org/archives/2011/05/18/value-representation...
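For the curious, here is a minimal sketch of the idea in Python, just to show the bit manipulation; real engines do this in C with tagged 64-bit words, and whether payload bits survive a round trip is platform dependent:

    # Stash a small integer in the payload bits of a quiet NaN (illustrative only).
    import struct

    QNAN = 0x7ff8000000000000                    # quiet NaN with an empty payload

    def box(payload):                            # payload must fit in the low 51 bits
        return struct.unpack('<d', struct.pack('<Q', QNAN | payload))[0]

    def unbox(value):
        bits = struct.unpack('<Q', struct.pack('<d', value))[0]
        return bits & 0x0007ffffffffffff         # keep only the payload bits

    v = box(12345)
    print(v != v, unbox(v))                      # True 12345 (a NaN never equals itself)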


As one who understands floats, I really wish there were a better notation for literals; the best would be a floating-point literal in binary representation.

For integers you can write 0x0B, 11, 0b1011 and have a very precise representation

For floats you write 1e-1 or 0.1 and you get an ugly truncation. If it were possible to write something like 1eb-1 (for 0.5) and 1eb-2 (for 0.25)... people would be incentivized to use nice negative powers of 2, which are much less error prone than ugly base conversions.

This way you could overcome the fear of floats being inexact and start writing more accurate (bit-for-bit) tests in many cases.


> better notation for literals [...] something like 1eb-1 (for 0.5) and 1eb-2 (for 0.25)

There are floating point hex literals. These can be written as 0x1p-1 == 0.5 and 0x1p-2 == 0.25.

You can use them in C/C++, Java, Julia, Swift, ..., but they are not supported everywhere.

https://observablehq.com/@jrus/hexfloat
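Python, for instance, has no hex float literal, but it gets at the same notation through float.fromhex and float.hex:

    >>> float.fromhex('0x1p-1')
    0.5
    >>> float.fromhex('0x1.8p-2')
    0.375
    >>> (0.375).hex()
    '0x1.8000000000000p-2'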


C++ hex floats are an interesting combination of 3 numeric bases in one!

the mantissa is written in base 16

the exponent is written in base 10

the exponent itself is a power of 2 (not of 16 or 10), so that's base 2

One can only wonder how that came to be. I think they chose base 10 for the exponent to allow using the 'f' suffix to denote float (as opposed to double)


Julia is missing 32 bit and 16 bit hex floats unfortunately.


You can just wrap the literal in a conversion function, eg Float32(0x1p52), which should get constant propagated at compile time.


I know, it just isn't as nice to read.


Are you asking for pow(2, n)? Or for `float mkfloat(int e, int m)` which literally implements the given formula? I doubt that the notation you’re suggesting will be used in code, except for really rare bitwise cases.

The initial precision doesn't really matter, because if you plan to use this value in a computation, it will quickly accumulate an error, which you have to deal with anyway. There are three ways to deal with it: 1) ignore it, 2) account for it, 3) use numerical methods which keep it within a decent range. You may find that (1) accidentally coincides with (3), but the problem doesn't go away in general.


I'm not sure this is what you are asking for, but would this be suitable?

> ISO C99 and ISO C++17 support floating-point numbers written not only in the usual decimal notation, such as 1.55e1, but also numbers such as 0x1.fp3 written in hexadecimal format. [...] The exponent is a decimal number that indicates the power of 2 by which the significant part is multiplied. Thus ‘0x1.f’ is 1 15/16, ‘p3’ multiplies it by 8, and the value of 0x1.fp3 is the same as 1.55e1.

https://gcc.gnu.org/onlinedocs/gcc/Hex-Floats.html


The "default" formula as presented in the article seems… strange. Is this really how it's normally taught?

    (-1)^S * 1.M * 2^(E - 127)
This seems unnecessarily confusing. And 1.M isn't notation I've seen before. If we expand 1.M into 1 + M with 0 <= M < 1, then we pretty quickly arrive at the author's construction of "windows" and "offsets".

    (-1)^S * (1 + M) * 2^(E - 127)
    (-1)^S * 2^log_2(1 + M) * 2^(E - 127)
    
    let F := log_2(1 + M)
    0 <= F < 1
    
    Note: [0 <= M < 1] implies [1 <= M + 1 < 2] implies [0 <= log_2(1+M) < 1]

    (-1)^S * 2^(E - 127 + F)

Since we know F is between 0 and 1, we can see that F controls where the number lands between 2^(E - 127) and 2^(E - 127 + 1) (ignoring the sign). It's the "offset".
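To make the equivalence concrete, here is a small Python sketch that reads a single-precision value back as a window and an offset (normal numbers only; the field names are mine, following the article's terminology):

    import struct

    def window_offset(x):
        bits = struct.unpack('<I', struct.pack('<f', x))[0]
        S = bits >> 31                     # sign bit
        E = (bits >> 23) & 0xff            # biased exponent
        M = bits & 0x7fffff                # 23 mantissa bits
        window = 2.0 ** (E - 127)          # the value lives in [window, 2*window)
        offset = M / 2**23                 # how far across that window it sits
        return S, window, offset

    print(window_offset(6.1))   # roughly (0, 4.0, 0.525): 6.1 sits about 52.5% across [4, 8)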


I don't understand how M * 2^E (modulo small representation details) is difficult to grasp. Then you have decimal types as M * 10^E.

It's certainly much clearer than the windowing and bucketing in this article.


His approach works well for me. I don't retain arbitrary facts. At least part of this is a fear that if I don't properly understand its dynamics I will misapply it; better to discard it.

The windowing explanation shows me a path from the intent of the designer through to the implementation (the algorithm). Now I can retain that knowledge.


> I don't retain arbitrary facts.

I still fail to understand what is arbitrary in scientific notation. The designers almost certainly didn't think in terms of windows when creating the data types, they probably thought about the mantissa and exponent the way it's usually explained, and maybe about information density (entropy per bit) when comparing with other possibilities.

Anyway, if it helps, great. Different explanations are always welcome. But this one is post-fact.


> I still fail to understand what is arbitrary in scientific notation

I used the word arbitrary, but did not say that scientific notation was arbitrary.

I will try to explain another way. The equation is an assertion. For me, it does not plainly follow from the underlying problem. Once the author has introduced the windowing approach, the equation follows easily for me.

> almost certainly [..] they probably [..] Different explanations are always welcome. But this one is post-fact.

By your own words, it is not clear that it is. It is clear that you doubt that the author of the system was thinking in terms of windows and buckets.


Ahem, exponential forms go back to Archimedes, and no design (or intent thereof) assumed windows or offsets in FP. It's just a fixed-precision floating-point notation, no more, no less. The problem persists with decimal tax calculations like $11111.11 x 12% == $1333.33|32, where 0.0032 is lost in a monetary form. It also persists with on-paper calculations, because there may be no room for all the digits that your algorithm produces (think x/7) and you have to choose your final acceptable precision.

> At least part of this is a fear that if I don't properly understand its dynamics I will misapply it

The explanation that you like can't save you from that either. It's not something you think through once and then just write the correct code. Quick exercise with the "new" knowledge you've got: you have an array of a million floats; add them up correctly.
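(For anyone who wants to attempt that exercise, the classic answer is compensated, a.k.a. Kahan, summation; a quick Python sketch, with math.fsum as the even more careful built-in:)

    import math, random

    def kahan_sum(xs):
        total, c = 0.0, 0.0              # c carries the low-order bits lost so far
        for x in xs:
            y = x - c
            t = total + y
            c = (t - total) - y          # what rounding just threw away
            total = t
        return total

    xs = [random.uniform(0, 1) for _ in range(1_000_000)]
    print(sum(xs), kahan_sum(xs), math.fsum(xs))   # naive vs. compensated vs. most careful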


I think windowing and bucketing are a useful conceptual model for understanding the precision of the encoding space. I understand how FP values work in practice, but this article really hits the nail on the head when explaining the less tangible aspects of FP encoding.


Some past related threads:

Floating Point Visually Explained (2017) - https://news.ycombinator.com/item?id=23081924 - May 2020 (96 comments)

Floating Point Visually Explained (2017) - https://news.ycombinator.com/item?id=19084773 - Feb 2019 (17 comments)

Floating Point Visually Explained - https://news.ycombinator.com/item?id=15359574 - Sept 2017 (106 comments)


As someone who never learned the "dreadful notation... discouraging legions of programmers", this was really helpful! The "canonical" equation doesn't mean anything to me and the window/offset explanation does.

I'm sure the author intended to keep the article short, but I think it would benefit from more examples, including a number like 5e-5 or something. It isn't clear to me how the window/offset explains that.

Edit: to clarify, I do not understand how the floating point representation of 0.00005 is interpreted with windows/offsets.


5e-5 is not really related to FP, it’s just scientific notation. The number itself is still stored as shown in the post, it’s just being printed in a more compact form.

The number after e is a power of 10:

    5e2  = 5 * 10^2  = 5 * 100 = 500
    5e-5 = 5 * 10^-5 = 5 / 100000 = 0.00005
Once you internalize this, you just read the exponent as "shift the decimal point that many places" (to the right for a positive exponent, to the left for a negative one).


Scientific notation is easily seen as just floating point in base ten, though. It has the same rules about a nonzero leading digit and such. Add in significant digits, and you have almost all of the same concerns.


Scientific notation IS a floating-point representation.


Seeing the Intel math coprocessor was a blast from the past. I sold and installed a few of them when I used to work in computer stores.



It probably differs a lot per person, and I usually do find graphical explanations of things easier to grasp. But that formula was way easier for me to understand than the other explanation. Maybe it is related to the fact that I am pretty OK at math but have kinda bad dyslexia.


I was kinda hoping for a visualization of which numbers exists in floating point.

While I always knew about

    0.1 + 0.2 -> 0.30000000000000004
it was still kind of an epiphany realizing that floating point numbers don't so much have rounding errors as they are simply discrete numbers.

You can move from one float to the next, which is a meaningful operation on discrete numbers like integers, but not on continuous numbers like the rationals and irrationals.

I also feel that this is a much more important takeaway for a user of floating point than knowing what the mantissa is.
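(In Python 3.9+, for example, that "move to the next float" operation is available directly:)

    >>> import math
    >>> math.nextafter(1.0, math.inf)
    1.0000000000000002
    >>> math.nextafter(0.3, math.inf)
    0.30000000000000004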


I highly recommend watching this video, which explains floats in the context of Mario 64: https://www.youtube.com/watch?v=9hdFG2GcNuA

Games are a great way to explain floats because they're so visual. Specifically, check out 3:31, where the game has been hacked so that the possible float values in one of the movement directions is quite coarse.

(If you're curious what a PU is, and would like some completely useless knowledge, this video is also an absolute classic and very entertaining: https://www.youtube.com/watch?v=kpk2tdsPh0A)


> You can move from one float to the next, which is a meaningful operation on discrete numbers like integers, but not on continuous numbers like the rationals and irrationals.

What do you mean by continuous? Obviously if you take a number line and remove either the rational or irrational numbers, you will end up with infinitely many holes.

The thing that makes floating point numbers unique is that, for any given representation, there are actually only finitely many. There’s a largest possible value and a smallest possible value and each number will have gaps on either side of it. I think you meant that the rationals and irrationals are dense (for any two distinct numbers, you can find another number between them), which is also false for floating-point numbers.
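(Those bounds, and the gap size at 1.0, are easy to inspect for doubles:)

    >>> import sys
    >>> sys.float_info.max
    1.7976931348623157e+308
    >>> sys.float_info.min        # smallest *normal* positive double; subnormals go lower
    2.2250738585072014e-308
    >>> sys.float_info.epsilon    # gap between 1.0 and the next representable double
    2.220446049250313e-16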


> What do you mean by continuous?

I meant continuous in the "mathematical sense". But maybe continuous is actually a statistical term, and dense is the correct mathematical term.


I dunno, I feel it's better to consider a float not as a single discrete number but as an interval.


Wild, just got done listening to Coding Blocks podcast episode on data structure primitives where they go in depth on floating point, fixed point, binary floating point, and more. Great listen! See around 54min mark - https://www.codingblocks.net/podcast/data-structures-primiti...


I found a tool[0] that helps me debug potential floating point issues when they arise. This one has modes for half-, single- and double-precision IEEE-754 floats, so I can deal with the various issues when converting between 64-bit and 32-bit floats.

[0] https://news.ycombinator.com/item?id=29370883


This is the article that finally helped me grasp and understand IEEE-754. An excellent read.


I also found this video from SimonDev educative (and entertaining): https://www.youtube.com/watch?v=Oo89kOv9pVk


Anyone got a good reference for practical implications of floating point vs fixed point in measurement calculations (especially timing measurements)? Gotchas, rules of thumb, best practices, etc.?


Having read this article I understand floating point even less now


Why is there no float type with linear precision?


that's called fixed point. there isn't hardware for it because it is cheap to make in software using integer math.


But wouldn't it be sweet to have SIMD acceleration for fixed point?

Or is that something you can do with integers and SIMD today?

Say 4x4 matrix multiplication?


Having implemented a nested PID controller using fixed-point arithmetic in an FPGA, I can tell you that fixed-point is a pain in the ass, and you only use it when floating-point is too slow, too large (in terms of on-chip resources), or consumes too much power, or when you really need to control the precision going into and out of an arithmetic operation.


your add is just an integer add. your multiply is a mul_hi, 2 shifts, 1 add and 1 mul.
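A rough Python sketch of the idea, assuming a Q16.16 format (16 integer bits, 16 fraction bits); Python integers are arbitrary precision, so the mul_hi/shift dance described above is what the multiply becomes on fixed-width registers:

    FRAC_BITS = 16
    ONE = 1 << FRAC_BITS

    def to_fix(x):   return int(round(x * ONE))        # real number -> fixed point
    def to_float(f): return f / ONE                    # fixed point -> real number

    def fix_add(a, b): return a + b                    # a plain integer add
    def fix_mul(a, b): return (a * b) >> FRAC_BITS     # widen, multiply, shift back down

    a, b = to_fix(3.25), to_fix(0.5)
    print(to_float(fix_add(a, b)), to_float(fix_mul(a, b)))   # 3.75 1.625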


I know some of those words, but I'm still curious: why has OpenGL not explored fixed point for 3D graphics?


Are you asking if integer SIMD exists? (It does)


Some DSP chips had hardware for fixed point. I think it's a shame that C never added support for fixed point.


There was a draft and GCC supports it in stdfix.h. The downside is that the types have limited integer range since they're tailored for DSP applications where values are kept scaled between +/-1.0.


But stdfix does not take advantage of SIMD-style acceleration, right?


Several years ago I made a basic, interactive explanation of the same:

https://dennisforbes.ca/articles/understanding-floating-poin...

At the time it was to educate a group I was working with about why "every number in JavaScript is a double" doesn't mean that 1 is actually 1.000000001 (everyone knows that floating point numbers are potentially approximations, so there is a widespread belief that every integer so represented must be approximate as well).
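(The short version of the argument: every integer with magnitude up to 2^53 is represented exactly by a double, which is easy to check:)

    >>> 1.0 == 1
    True
    >>> 2.0**53 == 2**53
    True
    >>> 2.0**53 + 1 == 2**53 + 1      # the first integer a double cannot hold exactly
    False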



