Hacker News

Because if your algorithm contains a "-" and you don't have complete control over the inputs on both sides of that minus, you will have cancellation.

There is no general way to mitigate that. You can use numerically superior algorithms or screen your inputs, but these only help in specific cases; every algorithm needs to be treated individually.
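A minimal Python sketch of the effect (my example, not the parent's): computing 1 - cos(x) for tiny x subtracts two nearly equal values and loses everything, while the algebraically identical form 2*sin(x/2)^2 has no subtraction.

```python
import math

# True value of 1 - cos(x) for x = 1e-8 is about x*x/2 = 5e-17, but
# cos(1e-8) rounds to exactly 1.0 in double precision, so the
# subtraction cancels completely: 100% relative error.
x = 1e-8
naive = 1.0 - math.cos(x)               # catastrophic cancellation
stable = 2.0 * math.sin(x / 2.0) ** 2   # algebraically equal, no cancellation

print(naive)   # 0.0
print(stable)  # ~5e-17
```

The "fix" here is exactly the kind of case-by-case rewrite the parent describes: it works for this formula and says nothing about the next one.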




Okay, so, in practice, we do know lots of things about the inputs. It may be impossible "in the general case," but engineering is done in the specific case. Anybody writing control software for spacecraft knows whether the mass of the moon or the spacecraft is larger, and can rearrange their equations accordingly.
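A sketch of that kind of rearrangement (my toy example, not spacecraft code): sqrt(x+1) - sqrt(x) for large x cancels badly, but multiplying by the conjugate removes the subtraction entirely.

```python
import math

x = 1e15
# Two nearly equal values around 3.16e7; their difference keeps only
# a few significant bits.
naive = math.sqrt(x + 1) - math.sqrt(x)
# Algebraically identical, but with no subtraction of close values.
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

# The true value is about 1.58e-8; the naive form is off by several
# percent here, while the rearranged form is accurate to full precision.
print(naive, stable)
```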


That's not really related to floating point though. You'll have the same issues for fixed point.


>That's not really related to floating point though. You'll have the same issues for fixed point.

You don't. "-" is exact in fixed point unless the result falls outside the range of representable values.


Exact and precise are different ideas. In fixed point, all operations except division are exact, but their precision depends on the magnitude of the numbers being operated on. You can have a 64-bit fixed point number system that gives you only 16 bits of precision on most of your operations.
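A toy illustration of that magnitude dependence, simulating a fixed-point format with 16 fractional bits (an assumed Q47.16-style layout, just for the sketch):

```python
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS  # 65536

def to_fixed(x: float) -> int:
    """Round a real value to the nearest representable fixed-point value."""
    return round(x * SCALE)

def to_float(f: int) -> float:
    return f / SCALE

# A large value keeps plenty of significant bits...
big = to_fixed(123456.789)
# ...but a small one keeps only a handful: 0.001 becomes 66/65536,
# which is about 0.7% away from the real value.
small = to_fixed(0.001)

print(to_float(big))
print(to_float(small))
```

Every conversion here is a "rounding to the grid" step, and the relative error of that step blows up as the value shrinks toward the grid spacing.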

In floating point, almost every operator (other than subtraction) gives results precise to the full width of the mantissa, within 0.5 ULP. Not all operators are guaranteed to be exact, but they are far more precise on average than the equivalent operators in fixed point.

Cancellation isn't an issue of exactness, it's an issue of precision.


Sure, but errors in fixed point happen very differently from floating point errors.

E.g. (a-b)*c, the common example for cancellation: in floating point, if a and b are very close, the result can have unbounded relative error compared to the result in the real numbers. Since all operations besides "/" are exact in fixed point, this operation introduces no error in fixed point (as long as all the values are representable).

Claiming that fixed and floating point suffer in the same way is just wrong.


(a - b) in fixed point will have very low precision, too, and will generally be exact in floating point. The following multiplication by c may or may not be exact in floating point (or in fixed point, mind you: multiplying two n-bit numbers exactly needs 2n bits, which I doubt you will give yourself in fixed point). The "unbounded" error comes in when a and b themselves are not precise, and you have that same precision problem in fixed point as you do in floating point.

For example, suppose your fixed point format is "the integers" and your floating point format has 6 significant digits: if you have real-valued a = 100000.5 and b = 100001.9, both number systems will round a to 100001 and b to 100002. In both cases, (b - a) will be 1, while (b - a) should be 1.4 if done in the reals. That rounding problem exists in fixed point just as much as in floating point.

In both systems, the operation that causes the cancellation is itself an exact calculation; the issue is that it's not precise. Fixed point will just give you 1 in the register, while floating point will add a bunch of spurious trailing zeros. Floating point can represent 1.4, though, while fixed point can't. If a and b were represented exactly (a = 100001 and b = 100002 in the reals), there would be no problem in either number system.
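That example, sketched with Python's decimal module. I use round-half-up to match the rounding described above; IEEE's default round-half-to-even would send 100000.5 to 100000 instead, which doesn't change the point.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_to_integer(x: str) -> Decimal:
    # For values near 1e5, both "integers as fixed point" and
    # "6 significant digits of float" round to the units place,
    # so one helper covers both number systems.
    return Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_UP)

a = round_to_integer("100000.5")  # -> 100001 in either system
b = round_to_integer("100001.9")  # -> 100002 in either system

print(b - a)                                      # 1 in either system
print(Decimal("100001.9") - Decimal("100000.5"))  # 1.4, the real answer
```

The subtraction of the rounded values is exact in both systems; the error was already baked in when a and b were rounded to the representable grid.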

The only time you get better cancellation behavior is when you have more precision in the initial results, which, comparing double-precision float against 64-bit fixed point, happens only when your fixed-point operands have their MSB at the 53rd position or above. That only happens when your dynamic range is so deeply limited that you can't do much math.

When you are thinking about cancellation numerically, exact is a red herring. Precise is what you want to think about.


If a and b are very close, this will break in fixed point too.

It depends on how many bits you have. For example, compare f128 with Q8.8: which one do you think would give better astronomical calculation results?



