
Why stop at 16 bits? I'd be curious to see a study that tries every bit width from 32 down. I see https://en.wikipedia.org/wiki/Minifloat describes an 8-bit float that uses 4 bits for the exponent and 3 bits for the significand. Maybe there is a sweet spot between 8 and 16 bits with a good-enough accuracy tradeoff. Of course the hardware for that isn't standard, but maybe low-bit float hardware would be useful.
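
For concreteness, here's a minimal sketch of decoding that 1-4-3 minifloat, assuming an IEEE-754-style layout with exponent bias 7 (FP8 variants like E4M3 handle the special values differently, so treat this as illustrative):

    def decode_minifloat(byte):
        # 1 sign bit, 4 exponent bits, 3 significand bits; bias 7 is an assumption
        sign = -1.0 if byte & 0x80 else 1.0
        exp = (byte >> 3) & 0x0F
        frac = byte & 0x07
        if exp == 0x0F:                          # IEEE-style infinities and NaNs
            return sign * float("inf") if frac == 0 else float("nan")
        if exp == 0:                             # subnormals and zero
            return sign * (frac / 8) * 2.0 ** -6
        return sign * (1 + frac / 8) * 2.0 ** (exp - 7)

    print(decode_minifloat(0b0_0000_001))   # smallest subnormal, 2**-9
    print(decode_minifloat(0b0_1110_111))   # largest finite value, 240.0

With this layout the dynamic range is roughly ±2^-9 to ±240 out of only 256 bit patterns.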



Microsoft Research published results with 1-bit gradients.

https://www.microsoft.com/en-us/research/publication/1-bit-s...
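
For anyone wondering what "1-bit" can mean in practice: the usual trick is to send only the sign of each gradient entry plus a single scale, and carry the quantization error forward into the next step (error feedback). A rough sketch, using the mean absolute value as one plausible choice of scale (not necessarily the paper's exact scheme):

    import numpy as np

    def one_bit_quantize(grad, residual):
        g = grad + residual              # add back the error left over from last step
        scale = np.abs(g).mean()         # one float for the whole tensor
        q = np.sign(g) * scale           # only the signs plus the scale get communicated
        return q, g - q                  # new residual = this step's quantization error

    grad = np.random.randn(1000)
    residual = np.zeros_like(grad)
    q, residual = one_bit_quantize(grad, residual)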

int4 (fixed point) is already popular for inference (https://developer.nvidia.com/blog/int4-for-ai-inference/), and int3 has seen some use for LLaMA-at-home.
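
The int4 case is simple enough to sketch: symmetric per-tensor quantization maps weights onto the signed range [-7, 7] with a single scale. Real deployments use per-channel scales, calibration, and pack two values per byte, so this is just the core idea:

    import numpy as np

    def quantize_int4(w):
        scale = np.abs(w).max() / 7.0                            # symmetric signed 4-bit range
        q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
        return q, scale

    def dequantize_int4(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int4(w)
    print(np.abs(w - dequantize_int4(q, s)).max())               # worst-case rounding error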


Posits seem to be better at 8 or even 6 bits. There is only one not-a-number encoding, NaR (not a real), which means 6-bit posits give you 63 points in the number space.

https://en.wikipedia.org/wiki/Unum_(number_format)#Unum_III
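
To see where the 63 comes from, here's a sketch of decoding a 6-bit posit with es=2 (the value fixed by the 2022 posit standard); every bit pattern except NaR maps to a real number:

    import math

    def decode_posit(bits, n=6, es=2):
        mask = (1 << n) - 1
        bits &= mask
        if bits == 0:
            return 0.0
        if bits == 1 << (n - 1):
            return float("nan")                        # NaR, the single non-real encoding
        sign = -1.0 if bits >> (n - 1) else 1.0
        if sign < 0:
            bits = (-bits) & mask                      # negate, then decode as positive
        body, nb = bits & ((1 << (n - 1)) - 1), n - 1  # drop the sign bit
        first = (body >> (nb - 1)) & 1
        i = nb - 1
        while i >= 0 and ((body >> i) & 1) == first:   # regime: run of identical bits
            i -= 1
        k = (nb - 1 - i) - 1 if first else -(nb - 1 - i)
        rem = max(i, 0)                                # bits left after regime + terminator
        tail = body & ((1 << rem) - 1)
        e_avail = min(es, rem)
        e = (tail >> (rem - e_avail)) << (es - e_avail) if e_avail else 0
        f_bits = rem - e_avail
        frac = (tail & ((1 << f_bits) - 1)) / (1 << f_bits) if f_bits else 0.0
        return sign * (1 + frac) * 2.0 ** (k * (1 << es) + e)

    vals = [decode_posit(b) for b in range(64)]
    print(sum(not math.isnan(v) for v in vals))        # 63 numeric values, 1 NaR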


The way posits concentrate precision on numbers near 1.0 is probably going to have a bigger effect. A 6-bit float with 4 exponent bits is the closest competitor to a 6-bit posit, and it would only have four non-finite encodings (±inf and two NaNs) anyway.
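
For the record, here's a quick check of how that 6-bit float spends its 64 encodings, assuming an IEEE-style layout (bias 7, all-ones exponent reserved for infinities and NaNs):

    import math

    def decode_f6(bits):                     # 1 sign, 4 exponent, 1 fraction bit
        s = -1.0 if bits & 0x20 else 1.0
        e = (bits >> 1) & 0x0F
        f = bits & 0x01
        if e == 0x0F:                        # infinities and NaNs
            return s * float("inf") if f == 0 else float("nan")
        if e == 0:                           # subnormal or zero
            return s * (f / 2) * 2.0 ** -6
        return s * (1 + f / 2) * 2.0 ** (e - 7)

    vals = [decode_f6(b) for b in range(64)]
    print(sum(not math.isfinite(v) for v in vals))         # 4 non-finite encodings
    print(len({v for v in vals if math.isfinite(v)}))      # 59 distinct finite values (+0 == -0)

Which suggests the reclaimed NaN codes (59 vs 63 distinct values) matter less than where those values sit on the number line.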


What would it mean, though? If the information is embedded in the network and we pick some performance figure of merit, wouldn't performance come out about the same once normalized to power utilization?

Maybe it's about optimizing every clock cycle?


Why would you expect performance to be constant when normalized to power utilisation?

If your 16-bit floats perform about the same as 32-bit floats in an absolute sense, then they will probably perform even better when normalised for power utilisation.

And if 16 bits work, 15-bit floats might perform well too, for all we know. That's what the original commenter was getting at, I think.


Yes, that is what I was getting at. Floating-point hardware with fewer bits has less complexity, and therefore smaller die area and lower power consumption.


If I recall correctly, area/power is proportional to the square of the length of the mantissa.
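
If so (the multiplier array is the usual quadratic term; adders grow roughly linearly), the back-of-the-envelope ratios look like this, counting the hidden bit:

    sig_bits = {"fp32": 24, "fp16": 11, "bfloat16": 8}   # significand bits incl. hidden bit
    for name, m in sig_bits.items():
        ratio = sig_bits["fp32"] ** 2 / m ** 2
        print(f"{name}: {m}-bit significand, ~{ratio:.1f}x smaller multiplier than fp32")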


I'm guessing that "about the same" will be hard to measure, and that at some point thermodynamics will dictate the maximum performance per unit of power (assuming a fixed transistor architecture).


Thermodynamics will dictate performance in a bit-operations-per-joule sense.

The more important performance metric is not the number of bit-operations, but the quality of the neural network output.

The hypothesis is that fewer bits in your numbers give you the same or nearly the same output quality, but with drastically fewer bit operations performed, and thus fewer joules spent.





