This is intuitively obvious. If I give you some data x and you transform it with a non-reversible function f into f(x) then you are losing information. Repeated applications of the function, f(f(f(...f(x)...))), can only make the end result worse. The current implementations inject some random bits, b ~ N(μ, σ²), but that can be thought of as convolving the transformation with the distribution g of the injected noise, so each step applies something like g*f. After repeated applications, (g*f)((g*f)((g*f)(...(g*f)(x)...))), you still end up with less information than you started with, because the transformation remains non-reversible: convolution cannot undo the non-reversible character of the original function.
I'm sure there is some calculation using entropy of random variables and channels that fully formalizes this but I don't remember the references off the top of my head. The general reference I remember is called the data processing inequality.¹
¹ https://en.wikipedia.org/wiki/Data_processing_inequality
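For anyone who wants to poke at it numerically, here's a rough toy sketch (my own, not from any reference) of the data processing inequality on a chain X -> Y -> Z, where each step quantizes (non-invertible) and adds noise. The histogram-based mutual information estimate is crude, but I(X;Z) should come out no larger than I(X;Y):

    import numpy as np

    rng = np.random.default_rng(0)

    def lossy_step(v):
        # quantize to multiples of 8, then add Gaussian noise
        return (v // 8) * 8 + rng.normal(0, 2, size=v.shape)

    def mutual_information(a, b, bins=32):
        # crude plug-in estimate from a 2D histogram
        joint, _, _ = np.histogram2d(a, b, bins=bins)
        p = joint / joint.sum()
        px = p.sum(axis=1, keepdims=True)
        py = p.sum(axis=0, keepdims=True)
        nz = p > 0
        return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

    x = rng.uniform(0, 256, size=200_000)
    y = lossy_step(x)
    z = lossy_step(y)
    print("I(X;Y) ~", mutual_information(x, y))
    print("I(X;Z) ~", mutual_information(x, z))  # expected: no larger than I(X;Y)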
This seems obvious, but you're forgetting the inputs may actually have low entropy to begin with. Lossy compression is non-reversible, but usually the expectation is that we don't care about the parts we lost.
How might this cash out with recursive LLMs? Generalizing is very similar to compression: imagine recovering the Schrödinger equation from lots of noisy physical experiments. You might imagine that an LLM could output a set of somewhat general models from real data, and that training it on data generated from those models generalizes further in future passes, until maybe it caps out at the lowest-entropy model (a theory of everything?).
It doesn't seem like it actually works that way with current models, but it isn't a foregone conclusion at the mathematical level at least.
So correct me if I'm wrong here, but wouldn't another way to look at this be something like re-compressing a JPEG? Each time you compress an already-compressed JPEG, you strip more and more information out of it. Same with any lossy compression, really.
These LLMs are inherently a bit like lossy compression algorithms. They take information and pack it in a way that keeps its essence around (at least that is the plan). But like any lossy compression, you cannot reconstruct the original. Training a lossy compression scheme like an LLM on its own data is just taking that already-packed information and degrading it further.
I hope I'm right to frame it this way, because ultimately that is partly what an LLM is: a lossy compression of "the entire internet". A lossless model that could be queried like an LLM would be massive, slow and probably impossible with today's tech.
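If anyone wants to see the JPEG analogy concretely, here's a quick Pillow sketch ("photo.png" is just a placeholder path): re-encode the same image at a fixed quality over and over and measure the drift from the original. In my experience most of the damage tends to show up in the first few generations and then level off, which is itself an interesting wrinkle on the analogy.

    from io import BytesIO

    import numpy as np
    from PIL import Image

    # Generation-loss sketch: re-encode the same image as JPEG repeatedly and
    # track the error against the original. "photo.png" is a placeholder path.
    original = Image.open("photo.png").convert("RGB")
    current = original
    for generation in range(1, 51):
        buf = BytesIO()
        current.save(buf, format="JPEG", quality=75)
        buf.seek(0)
        current = Image.open(buf).convert("RGB")
        if generation % 10 == 0:
            err = np.mean((np.asarray(original, float) - np.asarray(current, float)) ** 2)
            print(f"generation {generation}: MSE vs original = {err:.2f}")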
I suspect that we will develop new information theory that mathematically proves these things can’t escape the box they were trained in, meaning they cannot come up with new information that isn’t already represented in the relationships between the various bits of data they were constructed with. They can “only” find new ways to link together the information in their corpus of knowledge. I use “only” in quotes because simply doing that alone is pretty powerful. It’s connecting the dots in ways that haven’t been done before.
Honestly the whole LLM space is cool as shit when you really think about it. It's both incredibly overhyped yet very underhyped at the same time.
It's not intuitively obvious that losing information makes things worse. In fact, it's not even true. Plenty of lossy functions leave the problem under consideration better off: denoising, optimizing, models that expose useful underlying structure, and on and on.
Also injecting noise can improve many problems, like adding jitter before ADC (think noise shaping, which has tremendous uses).
So claiming that things like "can only make the end result worse" are "intuitively obvious" is demonstrably wrong.
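To make the jitter-before-the-ADC point concrete, here's a toy numpy sketch (uniform dither of one LSB, entirely illustrative): a sine wave whose amplitude is below one quantization step vanishes under plain rounding, but survives on average once independent noise is added before quantizing.

    import numpy as np

    # Dither sketch: a sub-LSB sine is erased by plain quantization (round to the
    # nearest integer, LSB = 1.0) but is recoverable on average with dither.
    rng = np.random.default_rng(1)
    t = np.linspace(0, 1, 10_000, endpoint=False)
    signal = 0.3 * np.sin(2 * np.pi * 5 * t)   # amplitude well below 1 LSB

    plain = np.round(signal)                   # all zeros: the signal is gone
    print("max of plain quantization:", np.abs(plain).max())

    trials = 200
    recovered = np.zeros_like(signal)
    for _ in range(trials):
        recovered += np.round(signal + rng.uniform(-0.5, 0.5, t.size))
    recovered /= trials
    rms_err = np.sqrt(np.mean((recovered - signal) ** 2))
    print("RMS error of averaged dithered quantization:", round(rms_err, 3))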
> with a non-reversible function f into f(x) then you are losing information.
A non-reversible function f does not necessarily lose information. Some non-reversible functions, like one-way functions used in cryptography, can be injective or even bijective but are computationally infeasible to invert, which makes them practically irreversible while retaining all information in a mathematical sense. However, there is a subset of non-reversible functions, such as non-injective functions, that lose information both mathematically and computationally. It’s important to distinguish these two cases to avoid conflating computational irreversibility with mathematical loss of information.
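A tiny illustration of that distinction (toy numbers, obviously; real discrete-log groups are enormous): modular exponentiation with a primitive root is a bijection, so nothing is lost mathematically even though inverting it at scale is infeasible, while a simple mod-4 reduction genuinely collapses inputs together.

    # Two flavors of "non-reversible", sketched with tiny numbers.
    p, g = 23, 5  # 5 is a primitive root mod 23, so x -> 5**x % 23 is a bijection on 1..22

    one_way = {x: pow(g, x, p) for x in range(1, p)}
    assert len(set(one_way.values())) == p - 1  # injective: no information lost,
                                                # inversion is merely expensive at scale

    lossy = {x: x % 4 for x in range(1, p)}
    assert len(set(lossy.values())) < p - 1     # non-injective: collisions destroy information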
On the arguments modeling inference as simply some function f: the specific expression OP used ignores that each subsequent application would follow another round of backpropagation, which implies a new f' at each application and undermines the claim.
At that point, something like chaos-theoretic sensitivity is at play across the population of natural language, if not some truth that has been expressed but not yet considered.
This undermines the subsequent claim about the convolved functions as well; I think all the GPUs might have something to say about whether the bits changing the layers are random or correlated.
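For what it's worth, the "new f' at each application" setup is easy to sketch with a toy model (a one-dimensional Gaussian standing in for the LLM, purely illustrative): each generation samples from the current fit and refits on those samples, so the map applied at step t really is a different f_t each time. Whether the fitted spread drifts, shrinks, or holds depends on the sample size and the number of generations.

    import numpy as np

    # Toy version of "train on your own output": refit a Gaussian on its own
    # samples each generation. The model (mu, sigma) changes at every step,
    # so there is no single fixed f being iterated.
    rng = np.random.default_rng(42)
    real_data = rng.normal(0.0, 1.0, size=1_000)

    mu, sigma = real_data.mean(), real_data.std()
    for generation in range(10):
        synthetic = rng.normal(mu, sigma, size=1_000)   # sample from the current model
        mu, sigma = synthetic.mean(), synthetic.std()   # refit: a new f each generation
        print(f"gen {generation}: mu={mu:+.3f}, sigma={sigma:.3f}")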
If a hash can transform an input of any size into a fixed-length string, then that implies irreversibility due to the pigeonhole principle. It's impossible, not infeasible.
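You can watch the pigeonhole principle bite by shrinking the output space, e.g. truncating SHA-256 to a single byte (256 possible outputs): a collision is guaranteed within 257 distinct inputs, and in practice it shows up almost immediately.

    import hashlib
    import itertools

    # Truncate SHA-256 to 1 byte: only 256 possible outputs, so by the pigeonhole
    # principle any 257 distinct inputs must contain a collision.
    seen = {}
    for i in itertools.count():
        digest = hashlib.sha256(str(i).encode()).digest()[:1]
        if digest in seen:
            print(f"collision: inputs {seen[digest]} and {i} both map to 0x{digest.hex()}")
            break
        seen[digest] = i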
If the goal of an image improvement algorithm is effectively "how would this image have looked IN THE REAL WORLD if it had been taken with a better camera", then training on previous "virtual upscaled images" would be training on the wrong fitness function.
It is real information, it is just information that is not targeted at anything in particular. Random passwords are, well, random. That they are random and information is what makes them useful as passwords.
As others have said, there is nothing terribly insightful about making something estimate the output of another thing via a non-perfect reproduction mechanism and noticing that the output is different. Absent any particular guidance, the difference will not be targeted. That is tautologically obvious.
The difference is still information, though, and with guidance you can target the difference to perform some goal. This is essentially what the gpt-o1 training was doing: training on data generated by itself, but only on the generated data that produced the correct answer.
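In pseudocode it's basically rejection sampling into a fine-tuning set. Everything below (model.generate, check_answer, finetune) is a hypothetical stand-in, not a real API, just to show the shape of "train on your own outputs, but only the verified ones":

    # Hypothetical sketch; model.generate, check_answer and finetune are stand-ins.
    def build_self_training_set(model, problems, samples_per_problem=8):
        kept = []
        for problem in problems:
            for _ in range(samples_per_problem):
                attempt = model.generate(problem.prompt)
                # Keep the trace only if the final answer can be verified as correct.
                if check_answer(attempt, problem.reference_answer):
                    kept.append((problem.prompt, attempt))
        return kept

    # new_model = finetune(model, build_self_training_set(model, problems))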
> The only way AI can create information is by doing something in the real world.
Everything done is done in the real world, but the only way an AI can gather (not create) information about some particular thing is to interact with that thing. Without interacting with anything external to itself, all information it can gather is the information already gathered to create it.
Maybe information needs to be understood relationally as in "information for a subject x". So if we have an image with a license plate that is unreadable and there's an algorithm that makes it readable to x, there is an information gain for x, although the information might have been in the image all along.
If the license plate was not readable, then the additional information is false data. You do not know more about the image than you knew before by definition.
Replacing pixels with plausible data does not mean a gain of information.
If anything, I'd argue that a loss of information occurs: The fact that x was hardly readable/unreadable before is lost, and any decision later on can not factor this in as "x" is now clearly defined and not fuzzy anymore.
Would you accept a system that "enhances" images to find the license plate numbers of cars and fine their owners?
If the plate number is unreadable the only acceptable option is to not use it.
Inserting a plausible number and rolling with it even means that instead of a range of suspects, only one culprit can be supposed.
Would you like to find yourself in court for crimes/offenses you never committed because some black box decided it was a great idea to pretend it knew it was you?
Edit: I think I misunderstood the premise. Nonetheless my comment shall stay.
Sure, but what if the upscaling algorithm misinterpreted a P as an F? Without manual supervision/tagging, there's an inherent risk that this information will have an adverse effect on future models.
"Made up" information is noise, not signal. (OTOH, generated images are used productively in training all the time, but the information content added is not in the images themselves; it's in their selection and their relation to captions.)
Once more and more new training images are based on those upscaled images, the training of the upscaling algorithms will tend to generate even more of the same type of information, drowning out the rest.
That's assuming that the same function is applied in the same way at each iteration.
Think about this: The sum total of the human-generated knowledge was derived in a similar manner, with each generation learning from the one before and expanding the pool of knowledge incrementally.
Simply adding a bit of noise and then selecting good outputs after each iteration based on a high-level heuristic such as "utility" and "self consistency" may be sufficient to reproduce the growth of human knowledge in a purely mathematical AI system.
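A minimal caricature of "noise plus selection" (the score function below is just a stand-in for whatever "utility" or "self-consistency" check you'd actually use): random mutations are proposed and only kept when the heuristic doesn't get worse, and that alone is enough to climb toward the target.

    import random

    # Toy "add noise, then select" loop. The score function is a placeholder for a
    # real utility / self-consistency heuristic; here it just counts matching chars.
    TARGET = "expanding the pool of knowledge"
    ALPHABET = "abcdefghijklmnopqrstuvwxyz "

    def score(candidate):
        return sum(a == b for a, b in zip(candidate, TARGET))

    def mutate(candidate):
        i = random.randrange(len(candidate))
        return candidate[:i] + random.choice(ALPHABET) + candidate[i + 1:]

    best = "".join(random.choice(ALPHABET) for _ in TARGET)
    for _ in range(20_000):
        noisy = mutate(best)             # inject a bit of noise
        if score(noisy) >= score(best):  # selection: keep only non-regressions
            best = noisy
    print(best)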
Something that hasn't been tried yet because it's too expensive (for now) is to let a bunch of different AI models act as agents updating a central wikipedia-style database.
These could start off with "simply" reading every single textbook and primary source on Earth, updating and correcting the Wikipedia in every language. Then cross-translate from every source in some language to every other language.
Then use the collected facts to find errors in the primary sources, then re-check the Wikipedia based on this.
Train a new generation of AIs on the updated content and mutate them slightly to obtain some variations.
Iterate again.
Etc...
This could go on for quite a while before it would run out of steam. Longer than anybody has budget for, at least for now!
> The sum total of the human-generated knowledge was derived in a similar manner, with each generation learning from the one before and expanding the pool of knowledge incrementally.
Is human knowledge really derived in a similar manner though? That reduction of biological processes to compression algorithms seems like a huge oversimplification.
It's almost like saying that all of human knowledge derives from Einstein's Field Equations, the Standard Model Lagrangian, and the Second Law of Thermodynamics (what else could human knowledge really derive from?), and that all we have to do to create artificial intelligence is model these forces at high enough fidelity and with enough computation.
It's not just any compression algorithm, though, it's a specific sort of algorithm that does not have the purpose of compression, even if compression is necessary for achieving its purpose. It could not be replaced by most other compression algorithms.
Having said that, I think this picture is missing something: when we teach each new generation what we know, part of that process involves recapitulating the steps by which we got to where we are. It is a highly selective (compressed?) history, however, focusing on the things that made a difference and putting aside most of the false starts, dead ends and mistaken notions (except when the topic is history, of course, and often even then.)
I do not know if this view has any significance for AI.
The models we use nowadays operate on discrete tokens. To grossly oversimplify the process of human learning: we take in a constant stream of realtime information. It never ends and it's never discrete. Nor do we learn in an isolated "learn" stage in which we're not interacting with our environment.
If you try taking reality and breaking it into discrete (and, in the case of LLMs, ordered) parts, you lose information.
> Think about this: The sum total of the human-generated knowledge was derived in a similar manner, with each generation learning from the one before and expanding the pool of knowledge incrementally.
Not true. No amount of such iteration gets you from buffalo cave paintings to particle accelerators.
Humans generate knowledge by acting in the world, not by dwelling on our thoughts. The empiricists won a very long time ago.
When I pursued creative writing in my teens and early 20s, it became clear to me that originality is extremely difficult. I am not entirely sure I have ever had an original thought--every idea I've put to paper thinking it was original, I later realized was a recombination of ideas I had come across somewhere else. The only exceptions I've found were places where I had a fairly unusual experience which I was able to interpret and relate, i.e. a unique interaction with the world.
Perhaps more importantly, LLMs do not contain any mechanism which even attempts to perform pure abstract thought, so even if we accept the questionable assumption that humans can generate ideas ex nihilo, that doesn't mean that LLMs can.
Unless your argument is that all creative writing is inspired by God, or some similar "external" source, then clearly a closed system such as "humanity" alone is capable of generating new creative works.
This sounds fascinating! I know what a Sierpiński triangle is, but I'm having some trouble seeing the connection from picking functions randomly to the triangle. Is there some graphic or animation somewhere on the web that someone can point me to, to visualize this better?
It's basically using the fact that the fractal is self-similar. So picking one function (each one maps the whole triangle onto one of its three half-size copies) and using it to transform a single point on the fractal gives you a new point that is also on the fractal.
If you repeat this process many times, you get a lot of points on the fractal.
You can even start the process at any point and it will "get attracted" to the fractal.
That's why fractals like this are sometimes called (strange) attractors.
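I don't have a link to an animation handy, but the whole "chaos game" is about ten lines of Python if you want to watch it build up yourself (matplotlib, purely illustrative):

    import random

    import matplotlib.pyplot as plt

    # Chaos game: jump halfway toward a randomly chosen vertex, over and over.
    # The visited points get "attracted" onto the Sierpinski triangle.
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]
    x, y = random.random(), random.random()   # arbitrary starting point
    xs, ys = [], []
    for i in range(50_000):
        vx, vy = random.choice(vertices)
        x, y = (x + vx) / 2, (y + vy) / 2
        if i > 20:                            # drop the first few transient points
            xs.append(x)
            ys.append(y)

    plt.scatter(xs, ys, s=0.1)
    plt.gca().set_aspect("equal")
    plt.show()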
Good one, but these theorems are useful to have in mind when thinking about information-processing systems and whatever promises the hype artists are making about the latest and greatest iteration of neural networks. There is no way to cheat entropy and basic physics, so if it sounds too good to be true, it probably is.
Humans are not immune to the effect. We invented methodologies to mitigate the effect.
Think about science. I mean hard science, like physics. You cannot say a theory is proven[0] if it is purely derived from existing data. You can only say it when you release your theory and it successfully predicts the results of future experiments.
In other words, you need to do new experiments and gather new information, effectively "injecting" entropy into humanity's scientific consensus.
[0]: Of course, when we say some physical theory is proven, it just means the probability that it's violated under certain conditions is negligible, not that it's a universal truth.