There is a mathematical proof that a binary representation is enough to capture the latent space. In fact, we don't even need to do "training" to get that representation.
The practical application we tried out for this algorithm was to create an alternate space for mpnet embeddings of Wikipedia paragraphs. Using bit embeddings, we are able to represent 36 million Wikipedia passages in 2 GB (https://gpt3experiments.substack.com/p/building-a-vector-dat...).
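To make "bit embedding" concrete, here is a minimal sketch of one well-known, training-free way to map float vectors to bits: random hyperplane hashing (keep only the sign of a set of projections). This is not necessarily the construction from the post; `float_embeddings`, the 768 dimensions, and `n_bits` are placeholder choices standing in for the mpnet vectors and whatever the paper actually does.

```python
# Minimal sketch (NOT necessarily the author's algorithm): random hyperplane
# hashing, a well-known training-free way to map float vectors to bits.
import numpy as np

rng = np.random.default_rng(0)
float_embeddings = rng.normal(size=(10_000, 768))  # stand-in for mpnet vectors
n_bits = 512                                       # hypothetical code length

planes = rng.normal(size=(768, n_bits))            # fixed, non-learned hyperplanes
bits = float_embeddings @ planes > 0               # which side of each plane?
codes = np.packbits(bits, axis=1)                  # n_bits / 8 = 64 bytes per passage

def hamming_search(query_vec, k=5):
    """Rank stored passages by Hamming distance to the query's bit code."""
    q = np.packbits(query_vec @ planes > 0)
    dists = np.unpackbits(codes ^ q, axis=1).sum(axis=1)  # XOR + popcount
    return np.argsort(dists)[:k]

print(hamming_search(float_embeddings[42]))  # passage 42 should rank first (distance 0)
```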
You're talking about mapping floating-point vector representations (i.e., embeddings) computed by a pretrained LLM to binary vector representations, right? And you're talking about doing this by first having someone else's pretrained LLM compute the embeddings, right? Sorry, but that seems only minimally, tangentially related to the topic of running LLMs in ternary space. I don't see how your comment is relevant to the discussion here.
Yeah, sorry, I needed a much bigger canvas than a comment to explain. Let me try again. The example I gave was meant to show a mapping from one space to another, and it may have come across as not learning anything. Yes, you are right that it was someone else's pretrained LLM, but this new space learnt the latent representations of the original embedding space. Instead of the original embedding space, it could just as well have been some image representation or some audio representation. Neural networks likewise take input in an X space and learn a representation in a Y space. The paper shows that any layer of a neural network can in fact be replaced with a set of planes, that we can represent a space using those planes, and that those planes can be created in a non-iterative way.
Not sure if I am being clear, but I have written a small blog post showing, for MNIST, how an NN creates the planes (https://gpt3experiments.substack.com/p/understanding-neural-...). I will write more on how, once these planes are drawn, we can use a bit representation instead of floating-point values to get similar prediction accuracy, and then on how we can draw those planes without the iterative training process.
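For readers who want the "planes" picture spelled out, here is a minimal sketch (not the blog post's code) using sklearn's small 8x8 digits dataset as a stand-in for MNIST: each first-layer unit of an ordinary MLP defines a hyperplane in input space, and keeping only one bit per plane ("which side is the input on?") still supports a reasonable classifier. The 64 hidden units and the logistic-regression read-out are arbitrary choices for illustration, and the planes here still come from iterative training, which is the part the paper claims can be avoided.

```python
# Minimal sketch (not the blog post's code) of the "planes" view of a layer.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train an ordinary MLP; its first layer is a set of hyperplanes w_i . x + b_i = 0.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
W, b = mlp.coefs_[0], mlp.intercepts_[0]

def bit_code(X):
    # One bit per plane: which side of plane i does the input fall on?
    return (X @ W + b > 0).astype(np.uint8)

readout = LogisticRegression(max_iter=1000).fit(bit_code(X_train), y_train)
print("MLP accuracy on floats:", mlp.score(X_test, y_test))
print("Read-out on bit codes: ", readout.score(bit_code(X_test), y_test))
```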
We are currently trying to build a full-fledged LLM using just this approach (no LLM training etc.) and also an ASR. We should have something to share in a couple of months.
The attempt is not to replace a particular neural network that has already been trained using sigmoid or ReLU functions. If one does that, one necessarily has to use non-linear maps.
The whole point is that such a non-linear technique is not necessary for classification.
It is not necessary to confine clusters by hyperplanes for solving a classification problem.
Our focus is on individual points.
(1) Take your data as a stream. Use your machine-learning gadget to give you the (predicted) probability of each possible next token. Then use those probabilities in arithmetic coding to specify which token actually came next.
(2) Take your data D. Apply lossy compression to it and store the result L := lossy(D). Also compute the residue R := D - uncompress(L). If your lossy compression is good, R will be mostly zeroes (with only a few actual differences), so it will compress well with a lossless compression algorithm.
Approach (1) is a more sophisticated version of (2); a rough sketch of (2) is below. None of this is anything I came up with; these approaches are well known.
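A minimal sketch of approach (2) in Python, with coarse quantization standing in for a real lossy codec (any real codec would do the same job better). Approach (1) would replace the fixed quantizer with a predictive model whose probabilities drive an arithmetic coder, at an ideal cost of about -log2 p bits per token.

```python
# Minimal sketch of approach (2): lossy compress, keep the residue, and
# compress the residue losslessly. The round trip is exact.
import zlib
import numpy as np

rng = np.random.default_rng(0)
data = np.cumsum(rng.integers(-3, 4, size=100_000)).astype(np.int16)  # a smooth-ish signal D

step = 8
lossy = (data // step * step).astype(np.int16)  # L := lossy(D)
residue = (data - lossy).astype(np.int16)       # R := D - uncompress(L), values in [0, step)

packed_plain = zlib.compress(data.tobytes(), 9)
packed_lossy = zlib.compress((lossy // step).astype(np.int16).tobytes(), 9)
packed_residue = zlib.compress(residue.tobytes(), 9)

print("raw bytes:      ", data.nbytes)
print("plain zlib:     ", len(packed_plain))
print("lossy + residue:", len(packed_lossy) + len(packed_residue))

# Lossless round trip: D == uncompress(L) + R
restored = (np.frombuffer(zlib.decompress(packed_lossy), dtype=np.int16) * step
            + np.frombuffer(zlib.decompress(packed_residue), dtype=np.int16))
assert np.array_equal(restored, data)
```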