FloatX: A C++ Library for Customized Floating-Point Arithmetic (github.com/oprecomp)
62 points by ArtWomb on Jan 8, 2020 | 14 comments



Link to the paper in the ACM Digital Library:

https://dl.acm.org/doi/10.1145/3368086


I have encountered many real-world cases where I needed less precision/range than what float/double had available, but usually I found fixed point to be a better solution than reduced-precision floats. I wonder what applications there are that can deal with reduced precision but somehow still need the range you get with an exponent?


16-bit floats with an 8-bit exponent and a 7+1 bit mantissa are popular for neural networks, because they have the same range as standard 32-bit floats while taking half the memory and memory bandwidth.

https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
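
The reason the range survives is that a bfloat16 is essentially just the top 16 bits of a 32-bit float, so the 8-bit exponent is kept intact and only mantissa bits are dropped. A minimal round-trip sketch (truncating for brevity; a real conversion would round to nearest):

  #include <cstdint>
  #include <cstring>
  #include <cstdio>

  // Store a float as bfloat16 by keeping its top 16 bits (truncation;
  // a careful implementation would round to nearest even instead).
  static std::uint16_t float_to_bf16(float f) {
      std::uint32_t u;
      std::memcpy(&u, &f, sizeof u);
      return static_cast<std::uint16_t>(u >> 16);
  }

  // Expand a bfloat16 back to a float by re-attaching 16 zero bits.
  static float bf16_to_float(std::uint16_t h) {
      std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
      float f;
      std::memcpy(&f, &u, sizeof f);
      return f;
  }

  int main() {
      float big = 3.0e38f;                  // near the top of the float range
      float back = bf16_to_float(float_to_bf16(big));
      std::printf("%g -> %g\n", big, back); // magnitude survives, low bits do not
  }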


Interesting, although I believe most neural nets nowadays have moved to linear (ReLU) activations, which again removes the need for an exponent and would work really well with fixed point.

Here's a reference on 16-bit fixed-point neural nets: http://ieeexplore.ieee.org/document/7011421/?part=1
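
For anyone who hasn't worked with fixed point, a rough sketch of what a Q8.8 multiply plus ReLU looks like (illustrative only, not taken from the linked paper):

  #include <cstdint>
  #include <cstdio>

  // Q8.8 fixed point: a 16-bit integer whose value is raw / 256.
  using q8_8 = std::int16_t;

  static q8_8 from_double(double x) { return static_cast<q8_8>(x * 256.0); }
  static double to_double(q8_8 x)   { return x / 256.0; }

  // Multiply two Q8.8 values: widen to 32 bits, then shift the extra
  // 8 fractional bits back out (no rounding, for brevity; assumes an
  // arithmetic right shift for negatives, as on mainstream compilers).
  static q8_8 mul(q8_8 a, q8_8 b) {
      return static_cast<q8_8>((static_cast<std::int32_t>(a) * b) >> 8);
  }

  // ReLU is just a comparison, no exponent handling needed.
  static q8_8 relu(q8_8 x) { return x > 0 ? x : 0; }

  int main() {
      q8_8 w = from_double(-0.75), x = from_double(2.5);
      std::printf("relu(w*x) = %g\n", to_double(relu(mul(w, x))));
  }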


How does this compare to boost::multiprecision?

https://www.boost.org/doc/libs/1_72_0/libs/multiprecision/do...


Ironically, the section labeled "What FloatX is NOT" tells you more about what it is than what it isn't.

It's a system for emulating narrower-precision floating-point arithmetic using native machine-width floating point. The authors claim that it's much faster than using the integer unit for this purpose.
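
If I'm reading the README correctly, usage looks roughly like this (treat the exact namespace and template-parameter meaning as my recollection, not gospel):

  #include "floatx.hpp"   // header from the oprecomp/FloatX repo
  #include <cstdio>

  int main() {
      // flx::floatx<exponent_bits, significand_bits>; values are stored in
      // and computed with a native backend type, then re-rounded.
      using small_float = flx::floatx<8, 7>;   // bfloat16-like split, on my reading

      small_float a = 1.0 / 3.0;   // 1/3 rounded to the reduced precision
      small_float b = a * a;       // arithmetic runs on the FPU, result re-rounded
      std::printf("%.17g %.17g\n", double(a), double(b));
  }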

Boost::multiprecision and MPFR, by contrast, are libraries for executing higher-precision arithmetic, commonly using the integer hardware to do so.


Emulating a narrower format using wider floating-point arithmetic can be dodgy, since you open yourself up to double-rounding scenarios. For the IEEE 754 types (half, single, double, and quad), the primitive operations (+, -, *, /, sqrt) are all correctly rounded if you emulate them by converting to the next size up, doing the math, and converting back down. For non-IEEE 754 types (such as bfloat16, or the x87 80-bit type), this does not hold in general, so double rounding is a possible concern.
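
To make the double-rounding hazard concrete, here's a toy illustration with 2-bit and 3-bit significands (deliberately tiny, not any of the formats above): rounding an exact value directly to the narrow precision can disagree with rounding it to an intermediate precision first.

  #include <cmath>
  #include <cstdio>

  // Round x (assumed in [1,2)) to a significand of p bits, ties to even.
  // std::nearbyint honors the default FE_TONEAREST mode, i.e. ties-to-even.
  static double round_to_p_bits(double x, int p) {
      double scale = std::ldexp(1.0, p - 1);   // 2^(p-1)
      return std::nearbyint(x * scale) / scale;
  }

  int main() {
      double x = 1.6875;                        // exactly 1.1011 in binary
      double direct = round_to_p_bits(x, 2);    // straight to 2 bits: 1.5
      double twice  = round_to_p_bits(round_to_p_bits(x, 3), 2);
                                                // via 3 bits: 1.75, then 2.0
      std::printf("direct: %g  double-rounded: %g\n", direct, twice);
  }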


That section also says:

> it is not likely that FloatX will be useful in production codes (sic).

I would be interested in why the author thinks that. Quality of implementation issue? Or is it in reference to the statement before that it is WIP?


Wow thanks! Yeah that wasn't clear to me at all. I hadn't considered the need for a floating point library that is less precise (and less performant!) than the 32/64 bit native types.


I'm trying to use 16-bit floats for matrix multiplication on x86-64. I found solutions for ARM and some NVIDIA GPUs, but none for any x86-64 chips. Any pointers in this direction would be helpful.


Here is a little sample code I threw together; it shows the whole cycle, from the conversion to half-floats to the conversion back to floats, and performs a simple multiplication of the values:

https://godbolt.org/z/FYu_rK
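
In case the link goes stale: the gist on x86-64 is that the F16C extension (compile with -mf16c and -mavx) only converts between half and single precision, so the multiply itself happens in float. A rough stand-alone sketch of that cycle, not necessarily identical to the code behind the link:

  #include <immintrin.h>   // F16C conversion intrinsics
  #include <cstdint>
  #include <cstdio>

  int main() {
      // Eight half-precision values stored as raw 16-bit words
      // (0x3C00 = 1.0, 0x4000 = 2.0, 0x4200 = 3.0, 0x4400 = 4.0 in IEEE half).
      alignas(16) std::uint16_t a[8] = {0x3C00, 0x4000, 0x4200, 0x4400,
                                        0x3C00, 0x4000, 0x4200, 0x4400};
      alignas(16) std::uint16_t b[8] = {0x4000, 0x4000, 0x4000, 0x4000,
                                        0x4000, 0x4000, 0x4000, 0x4000};
      alignas(16) std::uint16_t c[8];

      __m256 af = _mm256_cvtph_ps(_mm_load_si128((const __m128i*)a)); // half -> float
      __m256 bf = _mm256_cvtph_ps(_mm_load_si128((const __m128i*)b));
      __m256 cf = _mm256_mul_ps(af, bf);                              // multiply in float
      _mm_store_si128((__m128i*)c,
                      _mm256_cvtps_ph(cf, _MM_FROUND_TO_NEAREST_INT)); // float -> half

      for (int i = 0; i < 8; ++i) std::printf("0x%04x ", (unsigned)c[i]);
      std::printf("\n");
  }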

Hope it helps.


Thank you.



Ha, very cool to see this here! I briefly worked with one of the authors in Zurich ^^



