Interesting that it takes an LLM with 405 BILLION parameters to accurately recall text from a document with slightly fewer than 728 THOUSAND words (a document nearly six decimal orders of magnitude smaller than the parameter count).
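A quick back-of-the-envelope check of that ratio, taking the two figures quoted above at face value. This only compares counts; it is a sketch and says nothing about how much information a parameter or a word actually carries.

```python
import math

params = 405_000_000_000   # parameter count quoted above
words = 728_000            # upper bound on the word count quoted above

ratio = params / words
orders = math.log10(ratio)

print(f"parameters per word: {ratio:,.0f}")
print(f"decimal orders of magnitude: {orders:.2f}")
```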
I guess the challenge is that the parameters have to encode much more intricacy than just the Bible. Even if a model were trained purely on text from the Bible, it would likely not converse as well as conversationally tuned models do.
Perhaps there's a middle ground: an LLM fine-tuned on scripture recall, plus o1-style background reasoning to produce the best output. Or even just RAG.
I'm talking about it mostly from an entropy standpoint here. With larger size, you can represent more information.
Think of it like a JPEG: you can compress it further and further, but eventually you lose the ability to understand the original image.
With the models, if they had infinite size, I imagine they could recall values in their training data extremely accurately. But as you compress down further and further to smaller and smaller models, you are trying to distil the same amount of information in less space, and so things cannot be perfectly recalled.
We have people tuning these smaller models to squeeze every ounce of what we consider meaningful out of them (passing certain benchmarks, seeming coherent in dialogue, etc.), but in the process they lose the abilities we don't tune them for, such as accurate recall of scripture (not that we should tune for that).
I will say, with all of that, that I only have a high-level understanding of LLMs. I've integrated them into products on the job, but I am by no means an ML engineer.
As I mentioned in my edit, I'm not being snarky; I'm really curious about this. I found your post very thought-provoking.
The JPEG example is a good one, and I was using the FFT in my edit, which is like the magic in JPEG. The FFT converts from the time domain to the frequency domain, and you can run it in both directions to get from one to the other. JPEG uses weighted DCTs, but it is a similar concept.
The reason this is useful is that there is a lot of stuff in the signal that isn't "important", or more specifically doesn't contribute much to the overall signal. As a result, when you do this encoding you need fewer 'parameters' in your dataset to recreate the picture than you needed 'pixels' in the original picture. So the total number of things you need in order to recreate the picture is less than the original source material.
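To make the "fewer parameters than pixels" idea concrete, here is a minimal sketch using a 1-D orthonormal DCT written in plain Python. It is not JPEG: real JPEG works on 2-D 8 x 8 blocks and uses quantization tables, whereas this swaps in a cruder "keep only the m largest coefficients" rule. The pixel row and the helper names (dct, idct, keep_largest) are made up for illustration.

```python
import math

def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * total)
    return out

def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1 / n)
        total += sum(coeffs[k] * math.sqrt(2 / n)
                     * math.cos(math.pi * (i + 0.5) * k / n)
                     for k in range(1, n))
        out.append(total)
    return out

def keep_largest(coeffs, m):
    """Zero all but the m largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    keep = set(order[:m])
    return [c if k in keep else 0.0 for k, c in enumerate(coeffs)]

# Hypothetical row of 8 pixel values
row = [52, 55, 61, 66, 70, 61, 64, 73]

coeffs = dct(row)
exact = idct(coeffs)                    # all 8 coefficients: lossless round trip
lossy = idct(keep_largest(coeffs, 3))   # only 3 'parameters' for 8 'pixels'
max_err = max(abs(a - b) for a, b in zip(row, lossy))

print("exact:", [round(v, 6) for v in exact])
print("lossy:", [round(v, 1) for v in lossy])
print("largest per-pixel error with 3 coefficients:", round(max_err, 2))
```

Because the transform is orthonormal, the squared reconstruction error equals the energy of the coefficients that were dropped, which is why discarding the small ones costs so little.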
So now let's say you take two pictures and encode them. For the second picture you can store only those constants that differ from the first. Now, instead of a single additional picture, you add a thousand additional pictures. If you think of those pictures as a series of frames in time, then there is a vector of changes that gets you from any picture to any other picture. And if you have enough pictures, pretty soon every value of the 64 x 64 pixel block is covered in your dataset and now you can generate any picture from the 'reference' set of pixel blocks.
Looking at that as a time sequence, you can transform it so that your picture is a vector path through an n-dimensional space that is your picture baseline. And the 'weights' of that vector need only be the delta from where you are to where you need to go next in your block-weight space.
But if LLMs were like this, then the parameter expansion from one picture to a thousand pictures to a million pictures would asymptotically approach a constant number of parameters for the model overall, because there is only a finite number of pixel values you can have.
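The saturation argument can be sketched with a toy simulation. Everything here is an assumption chosen to keep the numbers small: 2 x 2 blocks of 1-bit pixels (so only 16 possible blocks), 16 blocks per picture, and uniformly random content, which real images are not. The point is only the shape of the curve: the set of distinct blocks grows quickly at first and then stops growing.

```python
import random

BLOCK_PIXELS = 4            # hypothetical 2 x 2 block
LEVELS = 2                  # 1-bit pixels, so 2**4 = 16 possible blocks
BLOCKS_PER_PICTURE = 16
PICTURES = 200

rng = random.Random(0)
seen = set()                # the 'reference' set of blocks
growth = []                 # size of the reference set after each picture

for _picture in range(PICTURES):
    for _block in range(BLOCKS_PER_PICTURE):
        block = tuple(rng.randrange(LEVELS) for _ in range(BLOCK_PIXELS))
        seen.add(block)
    growth.append(len(seen))

max_blocks = LEVELS ** BLOCK_PIXELS
print("after first pictures:", growth[:5])
print("after last picture:", growth[-1], "of", max_blocks, "possible blocks")
```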
When we were at Blekko we were using Dirichlet accumulators to hold probability vectors for identifying the contents of pages. They added vectors rapidly for the first 10,000 pages and then added fewer and fewer as we got deeper into pages of that content. And for all pages and all content, what we needed to identify any document had this asymptotic approach, which was maybe a decimal order wider than the number of documents but not the number of words, because the set of words is finite and, while the combination of words is infinite, the set of useful and unique combinations is significantly smaller than that.
Now one of the things that those accumulators could do was regenerate phrases and words from the document using a simple statistical algorithm; they are kind of super Markov generators in that regard. So for me, that was how I had been thinking of LLM parameters. However, if that were the case, then a 400B parameter model would be able to perfectly recall at least 40B unique documents? But to understand that I have been looking at what the 'parameter' in an LLM actually represents.
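For readers who haven't seen one, this is roughly what a plain Markov text generator looks like. It is a first-order word chain, swapped in for illustration only; it is not the Dirichlet accumulator design described above, and the source sentence and function names are hypothetical.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that followed it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, rng):
    """Walk the chain from `start`, picking observed successors at random."""
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:       # dead end: this word never had a successor
            break
        out.append(rng.choice(followers))
    return out

source = "the cat sat on the mat and the dog sat on the log"
chain = build_chain(source)
sample = generate(chain, "the", 8, random.Random(0))
print(" ".join(sample))
```

The output only ever contains word pairs that occur in the source, which is the sense in which such a table can "regenerate" phrases without storing the document verbatim.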
I don't think it's necessarily about the parameter count, but about the amount of training material about the Bible relative to the rest of the training material, with higher-parameter models able to retain more Bible information even when a higher proportion of their training is on other topics.
That is a good question, and it implies the definition of a parameter as a compression artifact/constant? If you've read chapter 4 of the Feynman Lectures on Computation, where he talks about information coding, you get a sense of where I'm coming from. There is some reversible function in LLMs that goes from book/document => parameters => book/document. The 'parameters' are the controlling information of that function. So what does the information contained in a parameter represent with respect to a book/document?
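For contrast, a conventional lossless codec is a function that really is reversible in this document => encoded form => document sense. The sketch below uses Python's standard zlib module on a made-up, highly repetitive "document". Whether LLM training behaves anything like this is the open question in the comment above, not something this example demonstrates.

```python
import zlib

# Hypothetical stand-in for a "book": repetitive text compresses well
document = ("in the beginning was the word " * 50).encode("utf-8")

encoded = zlib.compress(document, 9)   # 9 = highest compression level
decoded = zlib.decompress(encoded)

print("original bytes:", len(document))
print("encoded bytes: ", len(encoded))
print("round trip exact:", decoded == document)
```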