This is intuitively obvious. If I give you some data x and you transform it with a non-reversible function f into f(x) then you are losing information. Repeated applications of the function, f(f(f(...f(x)...))), can only make the end result worse. The current implementations inject some random bits, b ~ N(μ, σ²), but that can be thought of as convolving the transformation with the distribution g of the injected noise, so each step applies something like g*f. After repeated applications, (g*f)((g*f)((g*f)(...(g*f)(x)...))), you still end up with less information than you started with, because the transformation remains non-reversible: convolution cannot undo the non-reversible character of the original function.
I'm sure there is some calculation using entropy of random variables and channels that fully formalizes this but I don't remember the references off the top of my head. The general reference I remember is called the data processing inequality.¹
¹ https://en.wikipedia.org/wiki/Data_processing_inequality
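For anyone who wants to poke at it numerically, here's a rough toy sketch (my own, not from any reference) of the data processing inequality on a chain X -> Y -> Z, where each step quantizes (non-invertible) and adds noise. The histogram-based mutual information estimate is crude, but I(X;Z) should come out no larger than I(X;Y):

    import numpy as np

    rng = np.random.default_rng(0)

    def lossy_step(v):
        # quantize to multiples of 8, then add Gaussian noise
        return (v // 8) * 8 + rng.normal(0, 2, size=v.shape)

    def mutual_information(a, b, bins=32):
        # crude plug-in estimate from a 2D histogram
        joint, _, _ = np.histogram2d(a, b, bins=bins)
        p = joint / joint.sum()
        px = p.sum(axis=1, keepdims=True)
        py = p.sum(axis=0, keepdims=True)
        nz = p > 0
        return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

    x = rng.uniform(0, 256, size=200_000)
    y = lossy_step(x)
    z = lossy_step(y)
    print("I(X;Y) ~", mutual_information(x, y))
    print("I(X;Z) ~", mutual_information(x, z))  # expected: no larger than I(X;Y)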
This seems obvious, but you're forgetting the inputs may actually have low entropy to begin with. Lossy compression is non-reversible, but usually the expectation is that we don't care about the parts we lost.
How might this cash out with recursive LLMs? Generalizing is very similar to compression: imagine recovering the Schrödinger equation from lots of noisy physical experiments. You might imagine that an LLM could output a set of somewhat general models from real data, and that training it on data generated from those models generalizes further in future passes, until maybe it caps out at the lowest-entropy model (a theory of everything?).
It doesn't seem like it actually works that way with current models, but it isn't a foregone conclusion at the mathematical level at least.
So correct me if I'm wrong here, but wouldn't another way to look at this be something like re-compressing a JPEG? Each time you compress an already-compressed JPEG, you strip more and more information out of it. Same with any lossy compression, really.
These LLMs are inherently a bit like lossy compression algorithms. They take information and pack it in a way that keeps its essence around (at least that is the plan). But like any lossy compression, you cannot reconstruct the original. Training a lossy compression scheme like an LLM on its own data is just taking that already-packed information and degrading it further.
I hope I'm right to frame it this way, because ultimately that is partly what an LLM is: a lossy compression of "the entire internet". A lossless model that could be queried like an LLM would be massive, slow and probably impossible with today's tech.
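If anyone wants to see the JPEG analogy concretely, here's a quick Pillow sketch ("photo.png" is just a placeholder path): re-encode the same image at a fixed quality over and over and measure the drift from the original. In my experience most of the damage tends to show up in the first few generations and then level off, which is itself an interesting wrinkle on the analogy.

    from io import BytesIO

    import numpy as np
    from PIL import Image

    # Generation-loss sketch: re-encode the same image as JPEG repeatedly and
    # track the error against the original. "photo.png" is a placeholder path.
    original = Image.open("photo.png").convert("RGB")
    current = original
    for generation in range(1, 51):
        buf = BytesIO()
        current.save(buf, format="JPEG", quality=75)
        buf.seek(0)
        current = Image.open(buf).convert("RGB")
        if generation % 10 == 0:
            err = np.mean((np.asarray(original, float) - np.asarray(current, float)) ** 2)
            print(f"generation {generation}: MSE vs original = {err:.2f}")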
I suspect that we will develop new information theory that mathematically proves these things can’t escape the box they were trained in, meaning they cannot come up with new information that isn’t already represented in the relationships between the various bits of data they were constructed with. They can “only” find new ways to link together the information in their corpus of knowledge. I use “only” in quotes because simply doing that alone is pretty powerful. It’s connecting the dots in ways that haven’t been done before.
Honestly the whole LLM space is cool as shit when you really think about it. It's both incredibly overhyped yet very underhyped at the same time.
It's not intuitively obvious that losing information makes things worse. In fact, it's not even true. Plenty of lossy functions leave the problem under consideration better off: denoising, optimizing, models that expose useful underlying structure, and on and on.
Also injecting noise can improve many problems, like adding jitter before ADC (think noise shaping, which has tremendous uses).
So claiming that things like "can only make the end result worse" are "intuitively obvious" is demonstrably wrong.
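To make the jitter-before-the-ADC point concrete, here's a toy numpy sketch (uniform dither of one LSB, entirely illustrative): a sine wave whose amplitude is below one quantization step vanishes under plain rounding, but survives on average once independent noise is added before quantizing.

    import numpy as np

    # Dither sketch: a sub-LSB sine is erased by plain quantization (round to the
    # nearest integer, LSB = 1.0) but is recoverable on average with dither.
    rng = np.random.default_rng(1)
    t = np.linspace(0, 1, 10_000, endpoint=False)
    signal = 0.3 * np.sin(2 * np.pi * 5 * t)   # amplitude well below 1 LSB

    plain = np.round(signal)                   # all zeros: the signal is gone
    print("max of plain quantization:", np.abs(plain).max())

    trials = 200
    recovered = np.zeros_like(signal)
    for _ in range(trials):
        recovered += np.round(signal + rng.uniform(-0.5, 0.5, t.size))
    recovered /= trials
    rms_err = np.sqrt(np.mean((recovered - signal) ** 2))
    print("RMS error of averaged dithered quantization:", round(rms_err, 3))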
> with a non-reversible function f into f(x) then you are losing information.
A non-reversible function f does not necessarily lose information. Some non-reversible functions, like one-way functions used in cryptography, can be injective or even bijective but are computationally infeasible to invert, which makes them practically irreversible while retaining all information in a mathematical sense. However, there is a subset of non-reversible functions, such as non-injective functions, that lose information both mathematically and computationally. It’s important to distinguish these two cases to avoid conflating computational irreversibility with mathematical loss of information.
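A tiny illustration of that distinction (toy numbers, obviously; real discrete-log groups are enormous): modular exponentiation with a primitive root is a bijection, so nothing is lost mathematically even though inverting it at scale is infeasible, while a simple mod-4 reduction genuinely collapses inputs together.

    # Two flavors of "non-reversible", sketched with tiny numbers.
    p, g = 23, 5  # 5 is a primitive root mod 23, so x -> 5**x % 23 is a bijection on 1..22

    one_way = {x: pow(g, x, p) for x in range(1, p)}
    assert len(set(one_way.values())) == p - 1  # injective: no information lost,
                                                # inversion is merely expensive at scale

    lossy = {x: x % 4 for x in range(1, p)}
    assert len(set(lossy.values())) < p - 1     # non-injective: collisions destroy information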
On the arguments modeling inference as simply some function f: the specific expression OP used ignores that each subsequent application would follow another round of backpropagation, which implies a new f' at each application and undermines the claim.
At that point, something like chaos-theoretic sensitivity is at play across the population of natural language, if not some truth that has been expressed but not yet considered.
This undermines the subsequent claim about the convolved functions as well; I think all the GPUs might have something to say about whether the bits changing the layers are random or correlated.
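For what it's worth, the "new f' at each application" setup is easy to sketch with a toy model (a one-dimensional Gaussian standing in for the LLM, purely illustrative): each generation samples from the current fit and refits on those samples, so the map applied at step t really is a different f_t each time. Whether the fitted spread drifts, shrinks, or holds depends on the sample size and the number of generations.

    import numpy as np

    # Toy version of "train on your own output": refit a Gaussian on its own
    # samples each generation. The model (mu, sigma) changes at every step,
    # so there is no single fixed f being iterated.
    rng = np.random.default_rng(42)
    real_data = rng.normal(0.0, 1.0, size=1_000)

    mu, sigma = real_data.mean(), real_data.std()
    for generation in range(10):
        synthetic = rng.normal(mu, sigma, size=1_000)   # sample from the current model
        mu, sigma = synthetic.mean(), synthetic.std()   # refit: a new f each generation
        print(f"gen {generation}: mu={mu:+.3f}, sigma={sigma:.3f}")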
If a hash can transform an input of any size into a fixed-length string, then that implies irreversibility due to the pigeonhole principle. It's impossible, not infeasible.
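You can watch the pigeonhole principle bite by shrinking the output space, e.g. truncating SHA-256 to a single byte (256 possible outputs): a collision is guaranteed within 257 distinct inputs, and in practice it shows up almost immediately.

    import hashlib
    import itertools

    # Truncate SHA-256 to 1 byte: only 256 possible outputs, so by the pigeonhole
    # principle any 257 distinct inputs must contain a collision.
    seen = {}
    for i in itertools.count():
        digest = hashlib.sha256(str(i).encode()).digest()[:1]
        if digest in seen:
            print(f"collision: inputs {seen[digest]} and {i} both map to 0x{digest.hex()}")
            break
        seen[digest] = i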
If the goal of an image improvement algorithm is effectively "how would this image have looked IN THE REAL WORLD if it had been taken with a better camera", then training on previous "virtual upscaled images" would be training on the wrong fitness function.
It is real information, it is just information that is not targeted at anything in particular. Random passwords are, well, random. That they are random and information is what makes them useful as passwords.
As others have said, there is nothing terribly insightful about making something estimate the output of another thing via a non-perfect reproduction mechanism and noticing that the output is different. Absent any particular guidance, the difference will not be targeted. That is tautologically obvious.
The difference is still information, though, and with guidance you can target the difference to perform some goal. This is essentially what the gpt-o1 training was doing: training on data generated by itself, but only on the generated data that produced the correct answer.
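In pseudocode it's basically rejection sampling into a fine-tuning set. Everything below (model.generate, check_answer, finetune) is a hypothetical stand-in, not a real API, just to show the shape of "train on your own outputs, but only the verified ones":

    # Hypothetical sketch; model.generate, check_answer and finetune are stand-ins.
    def build_self_training_set(model, problems, samples_per_problem=8):
        kept = []
        for problem in problems:
            for _ in range(samples_per_problem):
                attempt = model.generate(problem.prompt)
                # Keep the trace only if the final answer can be verified as correct.
                if check_answer(attempt, problem.reference_answer):
                    kept.append((problem.prompt, attempt))
        return kept

    # new_model = finetune(model, build_self_training_set(model, problems))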
> The only way AI can create information is by doing something in the real world.
Everything done is done in the real world, but the only way an AI can gather (not create) information about some particular thing is to interact with that thing. Without interacting with anything external to itself, all information it can gather is the information already gathered to create it.
Maybe information needs to be understood relationally as in "information for a subject x". So if we have an image with a license plate that is unreadable and there's an algorithm that makes it readable to x, there is an information gain for x, although the information might have been in the image all along.
If the license plate was not readable, then the additional information is false data. You do not know more about the image than you knew before by definition.
Replacing pixels with plausible data does not mean a gain of information.
If anything, I'd argue that a loss of information occurs: The fact that x was hardly readable/unreadable before is lost, and any decision later on can not factor this in as "x" is now clearly defined and not fuzzy anymore.
Would you accept a system that "enhances" images to find the license plate numbers of cars and fine their owners?
If the plate number is unreadable the only acceptable option is to not use it.
Inserting a plausible number and rolling with it even means that instead of a range of suspects, only one culprit can be supposed.
Would you like to find yourself in court for crimes/offenses you never committed because some black box decided it was a great idea to pretend it knew it was you?
Edit: I think I misunderstood the premise. Nonetheless my comment shall stay.
Sure, but what if the upscaling algorithm misinterpreted a P as an F? Without manual supervision/tagging, there's an inherent risk that this information will have an adverse effect on future models.
"Made up" information is noise, not signal. (OTOH, generated images are used productively in training all the time, but the information content added is not in the images themselves; it's in their selection and their relation to captions.)
Once more and more new training images are based on those upscaled images, the training of the upscaling algorithms will tend to generate even more of the same type of information, drowning out the rest.
That's assuming that the same function is applied in the same way at each iteration.
Think about this: The sum total of the human-generated knowledge was derived in a similar manner, with each generation learning from the one before and expanding the pool of knowledge incrementally.
Simply adding a bit of noise and then selecting good outputs after each iteration based on a high-level heuristic such as "utility" and "self consistency" may be sufficient to reproduce the growth of human knowledge in a purely mathematical AI system.
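A minimal caricature of "noise plus selection" (the score function below is just a stand-in for whatever "utility" or "self-consistency" check you'd actually use): random mutations are proposed and only kept when the heuristic doesn't get worse, and that alone is enough to climb toward the target.

    import random

    # Toy "add noise, then select" loop. The score function is a placeholder for a
    # real utility / self-consistency heuristic; here it just counts matching chars.
    TARGET = "expanding the pool of knowledge"
    ALPHABET = "abcdefghijklmnopqrstuvwxyz "

    def score(candidate):
        return sum(a == b for a, b in zip(candidate, TARGET))

    def mutate(candidate):
        i = random.randrange(len(candidate))
        return candidate[:i] + random.choice(ALPHABET) + candidate[i + 1:]

    best = "".join(random.choice(ALPHABET) for _ in TARGET)
    for _ in range(20_000):
        noisy = mutate(best)             # inject a bit of noise
        if score(noisy) >= score(best):  # selection: keep only non-regressions
            best = noisy
    print(best)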
Something that hasn't been tried yet because it's too expensive (for now) is to let a bunch of different AI models act as agents updating a central wikipedia-style database.
These could start off with "simply" reading every single textbook and primary source on Earth, updating and correcting the Wikipedia in every language. Then cross-translate from every source in some language to every other language.
Then use the collected facts to find errors in the primary sources, then re-check the Wikipedia based on this.
Train a new generation of AIs on the updated content and mutate them slightly to obtain some variations.
Iterate again.
Etc...
This could go on for quite a while before it would run out of steam. Longer than anybody has budget for, at least for now!
> The sum total of the human-generated knowledge was derived in a similar manner, with each generation learning from the one before and expanding the pool of knowledge incrementally.
Is human knowledge really derived in a similar manner though? That reduction of biological processes to compression algorithms seems like a huge oversimplification.
It's almost like saying that all of human knowledge derives from Einstein's Field Equations, the Standard Model Lagrangian, and the Second Law of Thermodynamics (what else could human knowledge really derive from?), and that all we have to do to create artificial intelligence is model these forces at high enough fidelity and with enough computation.
It's not just any compression algorithm, though, it's a specific sort of algorithm that does not have the purpose of compression, even if compression is necessary for achieving its purpose. It could not be replaced by most other compression algorithms.
Having said that, I think this picture is missing something: when we teach each new generation what we know, part of that process involves recapitulating the steps by which we got to where we are. It is a highly selective (compressed?) history, however, focusing on the things that made a difference and putting aside most of the false starts, dead ends and mistaken notions (except when the topic is history, of course, and often even then.)
I do not know if this view has any significance for AI.
The models we use nowadays operate on discrete tokens. To grossly oversimplify the process of human learning: we take in a constant stream of realtime information. It never ends and it's never discrete. Nor do we learn in an isolated "learn" stage in which we're not interacting with our environment.
If you try taking reality and breaking it into discrete (and, in the case of LLMs, ordered) parts, you lose information.
> Think about this: The sum total of the human-generated knowledge was derived in a similar manner, with each generation learning from the one before and expanding the pool of knowledge incrementally.
Not true. No amount of such iteration gets you from buffalo cave paintings to particle accelerators.
Humans generate knowledge by acting in the world, not by dwelling on our thoughts. The empiricists won a very long time ago.
When I pursued creative writing in my teens and early 20s, it became clear to me that originality is extremely difficult. I am not entirely sure I have ever had an original thought--every idea I've put to paper thinking it was original, I later realized was a recombination of ideas I had come across somewhere else. The only exceptions I've found were places where I had a fairly unusual experience which I was able to interpret and relate, i.e. a unique interaction with the world.
Perhaps more importantly, LLMs do not contain any mechanism which even attempts to perform pure abstract thought, so even if we accept the questionable assumption that humans can generate ideas ex nihilo, that doesn't mean that LLMs can.
Unless your argument is that all creative writing is inspired by God, or some similar "external" source, then clearly a closed system such as "humanity" alone is capable of generating new creative works.
This sounds fascinating! I know what a Sierpiński triangle is, but I'm having some trouble seeing the connection from picking functions randomly to the triangle. Is there some graphic or animation somewhere on the web that someone can point me to, to visualize this better?
It's basically using the fact that the fractal is self-similar. So picking one function (each one maps the whole triangle onto one of its three half-size copies) and using it to transform a single point on the fractal gives you a new point that is also on the fractal.
If you repeat this process many times, you get a lot of points on the fractal.
You can even start the process at any point and it will "get attracted" to the fractal.
That's why fractals like this are sometimes called (strange) attractors.
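I don't have a link to an animation handy, but the whole "chaos game" is about ten lines of Python if you want to watch it build up yourself (matplotlib, purely illustrative):

    import random

    import matplotlib.pyplot as plt

    # Chaos game: jump halfway toward a randomly chosen vertex, over and over.
    # The visited points get "attracted" onto the Sierpinski triangle.
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]
    x, y = random.random(), random.random()   # arbitrary starting point
    xs, ys = [], []
    for i in range(50_000):
        vx, vy = random.choice(vertices)
        x, y = (x + vx) / 2, (y + vy) / 2
        if i > 20:                            # drop the first few transient points
            xs.append(x)
            ys.append(y)

    plt.scatter(xs, ys, s=0.1)
    plt.gca().set_aspect("equal")
    plt.show()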
Good one, but these theorems are useful to have in mind when thinking about information-processing systems and whatever promises the hype artists are making about the latest and greatest iteration of neural networks. There is no way to cheat entropy and basic physics, so if it sounds too good to be true, it probably is.
Humans are not immune to the effect. We invented methodologies to mitigate the effect.
Think about science. I mean hard science, like physics. You cannot say a theory is proven[0] if it is purely derived from existing data. You can only say it when you release your theory and it successfully predicts the results of future experiments.
In other words, you need to do new experiments and gather new information, effectively "injecting" entropy into humanity's scientific consensus.
[0]: Of course, when we say some physical theory is proven, it just means the probability that it's violated under certain conditions is negligible, not that it's a universal truth.