Why is everyone complaining about people finding floats hard? Sure, scientific notation is easy to grasp, but you can't honestly tell me that it's trivial AFTER you consider rounding modes, subnormals, etc. Maybe if hardware had infinite precision like the real numbers nobody would be complaining ;)
One thing I dislike in discussions about floats is this incessant focus on the binary representation. The representation is NOT the essence of the number. Sure, it matters if you are a hardware guy, need to work with serialized floats, or are doing some NaN-boxing trickery, but you can get a perfectly good understanding of binary floats by playing around with the key parameters of a floating point number system:
- precision = how many bits you have available
- exponent range = lower/upper bounds for exponent
- radix = 2 for binary floats
Consider listing out all possible floats given precision=3, exponents from -1 to 1, radix=2. See what happens when you have a real number that needs more than 3 bits of precision. What is the maximum rounding error using different rounding strategies? Then move on to subnormals and see how that adds a can of worms to underflow scenarios that you don't see in digital integer arithmetic. For anyone interested in a short book covering all this, I would recommend "Numerical Computing with IEEE Floating Point Arithmetic" by Overton [1].
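If it helps, here is a rough Python sketch of that exercise (the parameter names are mine); it just enumerates every value the toy system can represent, subnormals included:

    # Toy float system: radix 2, precision = 3 significand bits, exponent in [-1, 1].
    # Normals are 1.xx * 2^e; subnormals are 0.xx * 2^EMIN.
    PRECISION, EMIN, EMAX = 3, -1, 1

    values = {0.0}
    for e in range(EMIN, EMAX + 1):
        for frac in range(2 ** (PRECISION - 1)):       # bits after the implicit leading 1
            values.add((1 + frac / 2 ** (PRECISION - 1)) * 2 ** e)
    for frac in range(1, 2 ** (PRECISION - 1)):        # subnormals: leading bit is 0
        values.add(frac / 2 ** (PRECISION - 1) * 2 ** EMIN)

    print(sorted(values))
    # Note the even spacing inside each binade [2^e, 2^(e+1)) and how the
    # subnormals fill the gap between 0 and the smallest normal, 2^EMIN.

Try rounding a few real numbers (say 1.3) to the nearest value in this list under different rounding rules and see how large the error gets.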
I think the binary representation is the essence of floating point numbers, and if you go beyond the "sometimes, the result is slightly wrong" stage, you have to understand it.
And so far the explanation in the article is the best I've found, not least because subnormal numbers appear naturally.
There is a mathematical foundation behind it of course, but it is not easy for a programmer like me. I think it is better to think in terms of bits and the integers they make, because that's what the computer sees. And going this way, you get NaN-boxing and serialization as a bonus.
Now, I tend to be most comfortable with a "machine first", bottom-up, low-level approach to problems. Mathematical and architectural concepts are fine and all, but unless I have some idea of what it looks like in memory and the kind of instructions being run, I tend to feel lost. Some people may be more comfortable with high-level reasoning; we don't all have the same approach, and that's what I call real diversity. It is a good thing.
Sorry, I didn't mean to downplay the value of using concrete examples. I absolutely agree that everyone learns better from concrete settings, which is why my original comment fixed the parameters for people to play with. I was referring more to the discussions of how exponents are stored with a bias, how the mantissa's leading bit is an implied 1 (except for subnormals), and so on. All of these are distracting details that can (and should) be covered once the reader has a strong intuition for the more fundamental aspects.
The binary form is important for understanding the implementation details. You even mention underflow. It's difficult for most people to initially understand why a large number that an equivalent-size integer can represent exactly can't be stored accurately as a float.
The binary form handily demonstrates the limitations. Understanding the floating point instructions is kinda optional but still valuable.
Otherwise everyone should just use varint-encoded arbitrary precision numbers.
It also explains why 0.1+0.2 is not 0.3. With binary IEEE-754 floats, none of those can be represented exactly[a]. With decimal IEEE-754 floats, it's possible, but the majority of hardware people interact with works on binary floats.
[a]: Sure, if you `console.log(0.1)`, you'll get 0.1, but it's not possible to express it in binary exactly; only after rounding. 0.5, however, is exactly representable.
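If anyone wants to see this for themselves, Python's Decimal will print the exact value of the binary double you actually get (a quick sketch):

    from decimal import Decimal

    print(Decimal(0.1))      # 0.1000000000000000055511151231257827021181583404541015625
    print(Decimal(0.5))      # 0.5 exactly, because 0.5 = 2^-1
    print(0.1 + 0.2 == 0.3)  # False: each literal was already rounded before the add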
But those binary expansions are repeating. So, by definition, the values are not exactly representable in a (binary) floating point system. Again, that's why 0.1 + 0.2 is not 0.3 in binary floating point.
> It's difficult for most people to initially understand why you can't store a large number that can be represented by an equivalent size integer as a float accurately.
Because you don't have all the digits available just for the mantissa? That seems quite intuitive to me, even if you don't know about the corner cases of FP. This isn't one of them.
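For anyone who wants the concrete case, a quick Python check (a double has 53 significand bits, so 2^53 + 1 is the first integer it can't hold):

    import sys

    print(sys.float_info.mant_dig)   # 53 significand bits in a double
    n = 2**53 + 1                    # trivially exact as a 64-bit integer
    print(float(n) == 2**53)         # True: the +1 was rounded away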
I was going to respond with something like this. I think for getting a general flavor*, just talking about scientific notation with rounding to a particular number of digits at every step is fine.
I guess the one thing that explicitly looking at the bits does bring to the table is the understanding that (of course) the number of bits or digits in the mantissa must be less than the number of bits or digits in an equivalent-length integer. Of course, this is pretty obvious if you think about the fact that they are the same size in memory, but if we're talking about sizes in memory, then we're already talking about bits implicitly, so we may as well make it explicit.
* Actually, much more than just getting a general flavor: in most numerical linear algebra work the actual bitwise representation is irrelevant, so you can get pretty far without thinking about bits.
In a 64-bit double, a NaN leaves the sign bit and 51 mantissa bits unused (everything except the exponent field and the quiet-NaN bit), so you can put data in there without impeding your ability to represent every floating point value.
For example, if you are designing a scripting language interpreter, you can make every value a floating point number and use some of the NaNs to represent pointers, booleans, etc.
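A hedged sketch of the idea in Python (real interpreters do this in C on raw 64-bit words, the tag layout below is made up, and whether a given runtime preserves NaN payload bits is platform-dependent):

    import math
    import struct

    QNAN = 0x7FF8_0000_0000_0000          # exponent all ones + quiet bit set

    def box(payload: int) -> float:
        # Hide a 48-bit payload (e.g. a pointer) in the low mantissa bits of a quiet NaN.
        assert 0 <= payload < 1 << 48
        return struct.unpack("<d", struct.pack("<Q", QNAN | payload))[0]

    def unbox(value: float) -> int:
        bits = struct.unpack("<Q", struct.pack("<d", value))[0]
        return bits & ((1 << 48) - 1)

    v = box(0xDEAD_BEEF)
    print(math.isnan(v))                  # True: still an ordinary-looking NaN
    print(hex(unbox(v)))                  # 0xdeadbeef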
As one who understands floats, I really wish there were a better notation for literals; the best would be a float literal in binary representation.
For integers you can write 0x0B, 11, 0b1011 and have a very precise representation
For floats you write 1e-1 or 0.1 and you get an ugly rounded approximation. If it were possible to write something like 1eb-1 (for 0.5) and 1eb-2 (for 0.25)... people would be incentivized to use nice negative powers of 2, which are much less error-prone than ugly base conversions.
This way you could overcome the fears around floats being inexact and start writing more accurate (bit-for-bit) tests in many cases.
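Part of this already exists, by the way: the C99/C++17 hex float literals mentioned just below, and Python's float.hex()/float.fromhex(), are exact notations with a power-of-2 exponent. A quick sketch:

    x = float.fromhex("0x1p-1")                 # exactly 0.5
    y = float.fromhex("0x1p-2")                 # exactly 0.25
    print(x, y)                                 # 0.5 0.25
    print((0.1).hex())                          # 0x1.999999999999ap-4 -- the rounding is visible
    assert float.fromhex((0.1).hex()) == 0.1    # exact, bit-for-bit round trip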
C++ hex floats are an interesting combination of 3 numeric bases in one!
- the mantissa is written in base 16
- the exponent is written in base 10
- the exponent itself is a power of 2 (not of 16 or 10), so that's base 2
One can only wonder how that came to be. I think they chose base 10 for the exponent to allow using the 'f' suffix to denote float (as opposed to double)
Are you asking for pow(2, n)? Or for `float mkfloat(int e, int m)` which literally implements the given formula? I doubt that the notation you’re suggesting will be used in code, except for really rare bitwise cases.
The initial precision doesn’t really matter, because if you plan to use this value in a computation, it will quickly accumulate an error, which you have to deal with anyway. There are three ways to deal with it: 1) ignore it, 2) account for it, 3) use numerical methods that keep it within a decent range. Sometimes (1) accidentally coincides with (3), but the problem doesn’t go away in general.
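A tiny Python illustration of that accumulation: even a "nice looking" value drifts as soon as you compute with it.

    x = 0.0
    for _ in range(10):
        x += 0.1          # every addition rounds to the nearest double
    print(x)              # 0.9999999999999999
    print(x == 1.0)       # False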
I'm not sure this is what you are asking for, but would this be suitable?
> ISO C99 and ISO C++17 support floating-point numbers written not only in the usual decimal notation, such as 1.55e1, but also numbers such as 0x1.fp3 written in hexadecimal format. [...] The exponent is a decimal number that indicates the power of 2 by which the significant part is multiplied. Thus ‘0x1.f’ is 1 15/16, ‘p3’ multiplies it by 8, and the value of 0x1.fp3 is the same as 1.55e1.
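The arithmetic in that quote checks out; Python's float.fromhex accepts the same notation if you want to verify it yourself:

    print(float.fromhex("0x1.fp3"))   # 15.5
    print((1 + 15 / 16) * 2**3)       # 15.5, i.e. 1.9375 * 8
    print(1.55e1)                     # 15.5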
The "default" formula as presented in the article seems… strange. Is this really how it's normally taught?
(-1)^S * 1.M * 2^(E - 127)
This seems unnecessarily confusing. And 1.M isn't notation I've seen before. If we expand 1.M into 1 + M with 0 <= M < 1, then we pretty quickly arrive at the author's construction of "windows" and "offsets".
(-1)^S * (1 + M) * 2^(E - 127)
(-1)^S * 2^log_2(1 + M) * 2^(E - 127)
let F := log_2(1 + M)
0 <= F < 1
Note: [0 <= M < 1] implies [1 <= 1 + M < 2] implies [0 <= log_2(1 + M) < 1]
(-1)^S * 2^(E - 127 + F)
Since we know F is between 0 and 1, we can see that F controls where the number lands between 2^(E - 127) and 2^(E - 127 + 1) (ignoring the sign). It's the "offset".
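The same derivation can be done mechanically on a real 32-bit float; here is a Python sketch (the function name is mine) that pulls out S, E, M and reconstructs the value as window * (1 + offset):

    import struct

    def window_and_offset(x: float):
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
        S = bits >> 31
        E = (bits >> 23) & 0xFF
        M = bits & 0x7FFFFF
        window = 2.0 ** (E - 127)     # [2^(E-127), 2^(E-126)) for normal numbers
        offset = M / 2**23            # F in [0, 1): how far into the window we are
        return S, window, offset

    S, window, offset = window_and_offset(6.5)
    print(window, offset)                     # 4.0 0.625
    print((-1) ** S * window * (1 + offset))  # 6.5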
His approach works well for me. I don't retain arbitrary facts. At least part of this is a fear that if I don't properly understand its dynamics I will misapply it; better to discard it.
The windowing explanation shows me a path from the intent of the designer through to the implementation (the algorithm). Now I can retain that knowledge.
I still fail to understand what is arbitrary in scientific notation. The designers almost certainly didn't think in terms of windows when creating the data types, they probably thought about the mantissa and exponent the way it's usually explained, and maybe about information density (entropy per bit) when comparing with other possibilities.
Anyway, if it helps, great. Different explanations are always welcome. But this one is post-fact.
> I still fail to understand what is arbitrary in scientific notation
I used the word arbitrary, but did not say that scientific notation was arbitrary.
I will try to explain another way. The equation is an assertion. For me, it does not plainly follow from the underlying problem. Once the author has introduced the windowing approach, the equation follows easily for me.
> almost certainly [..] they probably [..] Different explanations are always welcome. But this one is post-fact.
By your own words, it is not clear that it is. It is clear that you doubt that the author of the system was thinking in terms of windows and buckets.
Ahem, exponential forms go back to Archimedes, and no design (or intent thereof) assumed windows or offsets in FP. It’s just a fixed-precision floating-point notation, no more, no less. The problem persists with decimal tax calculations like $11111.11 x 12% == $1333.33|32, where the trailing 0.0032 is lost once the result is kept in monetary form. It also persists with on-paper calculations, because there may be no room for all the digits that your algorithm produces (think x/7) and you have to choose your final acceptable precision.
> At least part of this is a fear that if I don't properly understand its dynamics I will misapply it
The explanation that you like can’t save you from that either. It’s not something you can simply think through and then just write the correct code. Quick exercise with the “new” knowledge you’ve got: you have an array of a million floats; add them up correctly.
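One well-known answer to that exercise, as a Python sketch (math.fsum gives the exactly rounded result; Kahan summation is the classic hand-rolled compensation trick):

    import math
    import random

    xs = [random.uniform(-1e6, 1e6) for _ in range(1_000_000)]

    def kahan_sum(values):
        total = 0.0
        c = 0.0                    # compensation for lost low-order bits
        for v in values:
            y = v - c
            t = total + y
            c = (t - total) - y
            total = t
        return total

    print(sum(xs))         # naive left-to-right sum: error grows with the number of terms
    print(kahan_sum(xs))   # compensated sum: error stays within a few ulps
    print(math.fsum(xs))   # exactly rounded reference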
I think windowing and bucketing are a useful conceptual model for understanding the precision of the encoding space. I understand how FP values work in practice, but this article really hits the nail on the head when explaining the less tangible aspects of FP encoding.
As someone who never learned the "dreadful notation... discouraging legions of programmers", this was really helpful! The "canonical" equation doesn't mean anything to me and the window/offset explanation does.
I'm sure the author intended to keep the article short, but I think it would benefit from more examples, including a number like 5e-5 or something. It isn't clear to me how the window/offset explains that.
Edit: to clarify, I do not understand how the floating point representation of 0.00005 is interpreted with windows/offsets.
5e-5 is not really related to FP, it’s just scientific notation. The number itself is still stored as shown in the post, it’s just being printed in a more compact form.
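To make the 0.00005 case concrete anyway, a small Python sketch in window/offset terms: its window is [2^-15, 2^-14), and the mantissa says how far into that window it sits.

    import math

    x = 5e-5
    e = math.floor(math.log2(x))      # -15, so the window is [2^-15, 2^-14)
    window = 2.0 ** e
    offset = x / window - 1           # ~0.6384: fraction of the way through the window
    print(e, offset)
    print(window * (1 + offset) == x) # True: scaling by a power of two is exact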
Scientific notation is easily seen as just floating point in base ten, though. It has the same rules about a nonzero leading digit and such. Add in significant digits, and you have most of the same concerns.
It probably differs a lot per person, and I usually do find graphical explanations of things easier to grasp. But that formula was way easier for me to understand than the other explanation. Maybe it is related to the fact that I am pretty OK at math, but I've got kinda bad dyslexia.
I was kinda hoping for a visualization of which numbers exist in floating point.
While I always knew about
0.1 + 0.2 -> 0.30000000000000004
it was still kind of an epiphany realizing that floating point numbers don’t so much have rounding errors as they are simply discrete numbers.
You can move from one float to the next, which is a meaningful operation on discrete numbers like integers, but not on continuous numbers like the rationals and irrationals.
I also feel that this is a much more important takeaway for a user of floating point than knowing what the mantissa is.
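That "move to the next float" operation is exposed directly in many languages; in Python 3.9+ it's math.nextafter (quick sketch):

    import math

    x = 1.0
    up = math.nextafter(x, math.inf)      # the next representable double above 1.0
    down = math.nextafter(x, -math.inf)   # the one just below
    print(up - x)      # 2.220446049250313e-16  (2^-52, one ulp above 1.0)
    print(x - down)    # 1.1102230246251565e-16 (2^-53, the ulp below 1.0 is half the size)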
Games are a great way to explain floats because they're so visual. Specifically, check out 3:31, where the game has been hacked so that the possible float values in one of the movement directions are quite coarse.
(If you're curious what a PU is, and would like some completely useless knowledge, this video is also an absolute classic and very entertaining: https://www.youtube.com/watch?v=kpk2tdsPh0A)
> You can move from one float to the next, which is a meaningful operation on discrete numbers like integers, but not on continuous numbers like the rationals and irrationals.
What do you mean by continuous? Obviously if you take a number line and remove either the rational or irrational numbers, you will end up with infinitely many holes.
The thing that makes floating point numbers unique is that, for any given representation, there are actually only finitely many. There’s a largest possible value and a smallest possible value and each number will have gaps on either side of it. I think you meant that the rationals and irrationals are dense (for any two distinct numbers, you can find another number between them), which is also false for floating-point numbers.
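Concretely, both the gaps and the finite range are easy to measure in Python (3.9+ for math.ulp):

    import math
    import sys

    print(math.ulp(1.0))            # 2.220446049250313e-16: gap to the next double after 1.0
    print(math.ulp(1e16))           # 2.0: above 2^53, consecutive integers start getting skipped
    print(sys.float_info.max)       # 1.7976931348623157e+308, the largest finite double
    print(sys.float_info.max * 2)   # inf: nothing representable beyond it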
Wild, just got done listening to Coding Blocks podcast episode on data structure primitives where they go in depth on floating point, fixed point, binary floating point, and more. Great listen! See around 54min mark - https://www.codingblocks.net/podcast/data-structures-primiti...
I found a tool[0] that helps me debug potential floating point issues when they arise. This one has modes for half-, single- and double-precision IEEE-754 floats, so I can deal with the various issues when converting between 64-bit and 32-bit floats.
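The same kind of check can be scripted; Python's struct packs IEEE-754 half/single/double as 'e'/'f'/'d', so a sketch of the round trip looks like:

    import struct

    def roundtrip(x: float, fmt: str) -> float:
        # Squeeze a double through a narrower IEEE-754 format and back.
        return struct.unpack(fmt, struct.pack(fmt, x))[0]

    x = 0.1
    print(roundtrip(x, "<d"))   # 0.1                  (64-bit: unchanged)
    print(roundtrip(x, "<f"))   # 0.10000000149011612  (32-bit: re-rounded)
    print(roundtrip(x, "<e"))   # 0.0999755859375      (16-bit: much coarser)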
Anyone got a good reference for the practical implications of floating point vs fixed point in measurement calculations (especially timing measurements)? Gotchas, rules of thumb, best practices, etc.?
Having implemented a nested PID controller using fixed-point arithmetic in an FPGA, I can tell you that fixed-point is a pain in the ass, and you only use it when floating-point is too slow, too large (in terms of on-chip resources), or consumes too much power, or when you really need to control the precision going into and out of an arithmetic operation.
There was a draft standard for fixed-point types in C (the Embedded C TR), and GCC supports it via stdfix.h. The downside is that the types have limited integer range, since they're tailored for DSP applications where values are kept scaled between +/-1.0.
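For a feel of what fixed-point looks like in software, here is a hedged Python sketch of Q1.15-style arithmetic (real DSP/FPGA code does this on integer registers, and the format choice here is just for illustration):

    FRAC_BITS = 15
    SCALE = 1 << FRAC_BITS            # Q1.15: values in [-1, 1) stored as 16-bit integers

    def to_fix(x: float) -> int:
        return round(x * SCALE)

    def fix_mul(a: int, b: int) -> int:
        return (a * b) >> FRAC_BITS   # product has 30 fraction bits; shift back down to 15

    def to_float(a: int) -> float:
        return a / SCALE

    a, b = to_fix(0.5), to_fix(0.3)
    print(to_float(fix_mul(a, b)))    # ~0.15, with quantization error you control explicitly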
At the time it was to educate a group I was working with about why "every number in JavaScript is a double" doesn't mean that 1 is actually 1.000000001 (everyone knows that floating point numbers are potentially approximations, so there is a widespread belief that every integer so represented is approximate too).
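A quick REPL check makes the point (Python floats are the same IEEE-754 binary64 as JavaScript numbers):

    print(1.0 == 1)                        # True: 1 is exact, not 1.000000001
    print(all(float(i) == i for i in range(1, 10_000)))   # True: small integers are exact
    print(float(2**53) == 2**53)           # True: exact up to 2^53...
    print(float(2**53 + 1) == 2**53 + 1)   # False: ...but not beyond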
[1]: https://cs.nyu.edu/~overton/book/index.html