> Could the additional -0 carry some pseudo-gradient information, ("the 0 leaning towards the negative side")?
Probably, but is it worth the cost? One of the goals behind BitNet and this paper is to find a way to implement LLMs as efficiently in hardware as possible, and forgoing floating point semantics is a big part of it. I'm not sure if there's a way to encode -0 that doesn't throw out half the performance gains.
But if I understand it correctly, they already need to use 2 bits, one for the sign and another one for the value, so there is already one wasted state, which could be used for -0.
How exactly would you do that? Three states need log2(3) ≈ 1.58 bits, which is a tad more than 1.5. Two 3-state values have 3²=9 combined states, while three bits only give you 2³=8 states.
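To make the counting concrete, here is a minimal sketch of base-3 packing. It is not taken from the paper; the function names `pack_trits` and `unpack_trits` are made up for illustration. Two ternary values don't fit in 3 bits, but five of them do fit in one byte (3⁵=243 ≤ 256), which works out to 1.6 bits per value.

```python
import math

# Information content of one ternary value, in bits (about 1.585).
BITS_PER_TRIT = math.log2(3)

def pack_trits(trits):
    """Pack five values from {-1, 0, 1} into one integer in 0..242 (fits a byte)."""
    if len(trits) != 5 or any(t not in (-1, 0, 1) for t in trits):
        raise ValueError("expected exactly five values from {-1, 0, 1}")
    value = 0
    # Treat the list as base-3 digits, with trits[0] as the least significant digit.
    for t in reversed(trits):
        value = value * 3 + (t + 1)
    return value

def unpack_trits(byte):
    """Inverse of pack_trits: recover the five ternary values."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits
```

Decoding this way needs divisions or a lookup table, so whether it is cheap enough in hardware is a separate question.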
I wonder if there are some encoding tricks you can use to reduce it to 8 (or fewer?) effective states, given that you're only using them with a reduced set of mathematical operations. E.g., can you automatically convert all (-1, 1) to (1, -1) and save one encoded state, since they add up to the same result anyway?
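A quick way to count what such a trick could save, as a sketch only. It assumes the order of the two values really doesn't matter, which would only hold if both weights were applied to the same input; that assumption is mine and isn't established anywhere in this thread.

```python
from itertools import product

VALUES = (-1, 0, 1)

# All ordered pairs of ternary values: 3 * 3 = 9 states.
ordered_pairs = list(product(VALUES, repeat=2))

# If (a, b) and (b, a) are treated as the same state, sort each pair to canonicalize it.
unordered_pairs = {tuple(sorted(pair)) for pair in ordered_pairs}

# If only the sum of the pair matters, even fewer states remain.
distinct_sums = {a + b for a, b in ordered_pairs}

print(len(ordered_pairs), len(unordered_pairs), len(distinct_sums))
```

Under that assumption the 9 ordered states collapse to 6 unordered ones, which would fit in 3 bits.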
For example, you can use one bit per value to mark zero versus non-zero, and then spend a sign bit only on the non-zero values. The sign part will be variable length but can probably be made very fast with hardware support.
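A minimal sketch of that mask-plus-sign scheme, using Python lists of 0/1 in place of real bitstreams. The names `encode` and `decode` are made up for illustration, and nothing here reflects how BitNet or the paper actually stores weights.

```python
def encode(weights):
    """Split ternary weights into a zero/non-zero mask and a shorter list of sign bits."""
    mask = [1 if w != 0 else 0 for w in weights]
    # One sign bit per non-zero weight only: 1 means negative, 0 means positive.
    signs = [1 if w < 0 else 0 for w in weights if w != 0]
    return mask, signs

def decode(mask, signs):
    """Rebuild the weights by consuming one sign bit for each set mask bit."""
    sign_iter = iter(signs)
    return [(-1 if next(sign_iter) else 1) if m else 0 for m in mask]

weights = [0, 1, -1, 0, 0, 1]
mask, signs = encode(weights)
total_bits = len(mask) + len(signs)  # 6 mask bits + 3 sign bits
```

The cost is 1 bit per value plus 1 more per non-zero value, so if all three values were equally likely it would average about 1.67 bits, slightly worse than log2(3) ≈ 1.58. It only comes out ahead when zeros are common.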