FloatX: A C++ Library for Customized Floating-Point Arithmetic (github.com/oprecomp)
62 points by ArtWomb on Jan 8, 2020 | 14 comments



Link to the paper in the ACM Digital Library:

https://dl.acm.org/doi/10.1145/3368086


I have encountered many real-world cases where I needed less precision/range than what float/double had available, but usually I found fixed point to be a better solution than reduced-precision floats. I wonder what applications there are that can deal with reduced precision but somehow still need the range you get with an exponent?


16-bit floats with an 8-bit exponent and a 7+1 bit mantissa are popular for neural networks, because they have the same range as standard 32-bit floats while taking half the memory and memory bandwidth.

https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
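
The reason the range survives is that a bfloat16 is essentially just the top 16 bits of a 32-bit float, so the 8-bit exponent is kept intact and only mantissa bits are dropped. A minimal round-trip sketch (truncating for brevity; a real conversion would round to nearest):

  #include <cstdint>
  #include <cstring>
  #include <cstdio>

  // Store a float as bfloat16 by keeping its top 16 bits (truncation;
  // a careful implementation would round to nearest even instead).
  static std::uint16_t float_to_bf16(float f) {
      std::uint32_t u;
      std::memcpy(&u, &f, sizeof u);
      return static_cast<std::uint16_t>(u >> 16);
  }

  // Expand a bfloat16 back to a float by re-attaching 16 zero bits.
  static float bf16_to_float(std::uint16_t h) {
      std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
      float f;
      std::memcpy(&f, &u, sizeof f);
      return f;
  }

  int main() {
      float big = 3.0e38f;                  // near the top of the float range
      float back = bf16_to_float(float_to_bf16(big));
      std::printf("%g -> %g\n", big, back); // magnitude survives, low bits do not
  }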


Interesting, although I believe most neural nets nowadays have moved to linear (ReLU) activations, which again removes the need for an exponent and would work really well with fixed point.

Here's a reference on 16-bit fixed-point neural nets: http://ieeexplore.ieee.org/document/7011421/?part=1
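
For anyone who hasn't worked with fixed point, a rough sketch of what a Q8.8 multiply plus ReLU looks like (illustrative only, not taken from the linked paper):

  #include <cstdint>
  #include <cstdio>

  // Q8.8 fixed point: a 16-bit integer whose value is raw / 256.
  using q8_8 = std::int16_t;

  static q8_8 from_double(double x) { return static_cast<q8_8>(x * 256.0); }
  static double to_double(q8_8 x)   { return x / 256.0; }

  // Multiply two Q8.8 values: widen to 32 bits, then shift the extra
  // 8 fractional bits back out (no rounding, for brevity; assumes an
  // arithmetic right shift for negatives, as on mainstream compilers).
  static q8_8 mul(q8_8 a, q8_8 b) {
      return static_cast<q8_8>((static_cast<std::int32_t>(a) * b) >> 8);
  }

  // ReLU is just a comparison, no exponent handling needed.
  static q8_8 relu(q8_8 x) { return x > 0 ? x : 0; }

  int main() {
      q8_8 w = from_double(-0.75), x = from_double(2.5);
      std::printf("relu(w*x) = %g\n", to_double(relu(mul(w, x))));
  }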


How does this compare to boost::multiprecision?

https://www.boost.org/doc/libs/1_72_0/libs/multiprecision/do...


Ironically, the section labeled "What FloatX is NOT" tells you more about what it is than what it isn't.

It's a system for emulating narrower-precision floating-point arithmetic using native machine-width floating point. The authors claim that it's much faster than using the integer unit for this purpose.
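
If I'm reading the README correctly, usage looks roughly like this (treat the exact namespace and template-parameter meaning as my recollection, not gospel):

  #include "floatx.hpp"   // header from the oprecomp/FloatX repo
  #include <cstdio>

  int main() {
      // flx::floatx<exponent_bits, significand_bits>; values are stored in
      // and computed with a native backend type, then re-rounded.
      using small_float = flx::floatx<8, 7>;   // bfloat16-like split, on my reading

      small_float a = 1.0 / 3.0;   // 1/3 rounded to the reduced precision
      small_float b = a * a;       // arithmetic runs on the FPU, result re-rounded
      std::printf("%.17g %.17g\n", double(a), double(b));
  }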

Boost::multiprecision and MPFR, by contrast, are libraries for executing higher-precision arithmetic, commonly using the integer hardware to do so.


Emulating a narrower format using wider floating-point arithmetic can be dodgy, since you open yourself up to double-rounding scenarios. For the IEEE 754 types (half, single, double, and quad), the primitive operations (+, -, *, /, sqrt) are all correctly rounded if you emulate them by converting to the next size up, doing the math, and converting back down. For non-IEEE 754 types (such as bfloat16, or the x87 80-bit type), this does not hold in general, so double rounding is a possible concern.
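
To make the double-rounding hazard concrete, here's a toy illustration with 2-bit and 3-bit significands (deliberately tiny, not any of the formats above): rounding an exact value directly to the narrow precision can disagree with rounding it to an intermediate precision first.

  #include <cmath>
  #include <cstdio>

  // Round x (assumed in [1,2)) to a significand of p bits, ties to even.
  // std::nearbyint honors the default FE_TONEAREST mode, i.e. ties-to-even.
  static double round_to_p_bits(double x, int p) {
      double scale = std::ldexp(1.0, p - 1);   // 2^(p-1)
      return std::nearbyint(x * scale) / scale;
  }

  int main() {
      double x = 1.6875;                        // exactly 1.1011 in binary
      double direct = round_to_p_bits(x, 2);    // straight to 2 bits: 1.5
      double twice  = round_to_p_bits(round_to_p_bits(x, 3), 2);
                                                // via 3 bits: 1.75, then 2.0
      std::printf("direct: %g  double-rounded: %g\n", direct, twice);
  }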


That section also says:

> it is not likely that FloatX will be useful in production codes (sic).

I would be interested in why the author thinks that. Quality of implementation issue? Or is it in reference to the statement before that it is WIP?


Wow thanks! Yeah that wasn't clear to me at all. I hadn't considered the need for a floating point library that is less precise (and less performant!) than the 32/64 bit native types.


I'm trying to use 16-bit floats for matrix multiplication on x86-64. I found solutions for ARM and some NVIDIA GPUs, but none for any x86-64 chips. Any pointers in this direction would be helpful.


Here is a little sample code I threw together; it shows the whole cycle, from the conversion to half-floats to the conversion back to floats, and performs a simple multiplication of the values:

https://godbolt.org/z/FYu_rK
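
In case the link goes stale: the gist on x86-64 is that the F16C extension (compile with -mf16c and -mavx) only converts between half and single precision, so the multiply itself happens in float. A rough stand-alone sketch of that cycle, not necessarily identical to the code behind the link:

  #include <immintrin.h>   // F16C conversion intrinsics
  #include <cstdint>
  #include <cstdio>

  int main() {
      // Eight half-precision values stored as raw 16-bit words
      // (0x3C00 = 1.0, 0x4000 = 2.0, 0x4200 = 3.0, 0x4400 = 4.0 in IEEE half).
      alignas(16) std::uint16_t a[8] = {0x3C00, 0x4000, 0x4200, 0x4400,
                                        0x3C00, 0x4000, 0x4200, 0x4400};
      alignas(16) std::uint16_t b[8] = {0x4000, 0x4000, 0x4000, 0x4000,
                                        0x4000, 0x4000, 0x4000, 0x4000};
      alignas(16) std::uint16_t c[8];

      __m256 af = _mm256_cvtph_ps(_mm_load_si128((const __m128i*)a)); // half -> float
      __m256 bf = _mm256_cvtph_ps(_mm_load_si128((const __m128i*)b));
      __m256 cf = _mm256_mul_ps(af, bf);                              // multiply in float
      _mm_store_si128((__m128i*)c,
                      _mm256_cvtps_ph(cf, _MM_FROUND_TO_NEAREST_INT)); // float -> half

      for (int i = 0; i < 8; ++i) std::printf("0x%04x ", (unsigned)c[i]);
      std::printf("\n");
  }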

Hope it helps.


Thank you.



Ha, very cool to see this here! I briefly worked with one of the authors in Zurich ^^



