
Some of the topics in the parent post should not be a major surprise to anyone who has read https://people.math.harvard.edu/~ctm/home/text/others/shanno... ! If we have not read the foundations of the field that we are in, we are doomed to be mystified by unexplained phenomena which arise pretty naturally as consequences of already-distilled work!

That said, the experiments seem very thorough on a first pass/initial cursory examination; I appreciate the amount of detail that seems to have gone into them.

The tradeoff between learning existing theory and attempting to re-derive it from scratch is a hard one, I think: not having the traditional foundation allows for the discovery of new things, while having it allows for a deeper understanding of certain phenomena. There is a cost either way.

I've seen several people here in the comments seemingly shocked that a model which maximizes the log likelihood of a sequence given the data does not somehow magically deviate from that behavior when run in inference. It's a density estimation model; do you want it to magically recite Shakespeare from the void?
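If it helps to ground that, here is a minimal sketch with a toy bigram counting model (my own illustration, nothing to do with the architecture in the article): the maximum-likelihood fit is just the empirical conditional frequencies, and "inference" is nothing more than sampling from those same frequencies, so there is no mechanism by which generation could deviate from the estimated density.

    # Toy bigram model fit by maximum likelihood: counts normalized per context.
    import random
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat sat on the mat the cat".split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def sample_next(prev):
        nxts = counts[prev]
        total = sum(nxts.values())
        return random.choices(list(nxts), [c / total for c in nxts.values()])[0]

    # Generation just re-emits the estimated density, one token at a time.
    word, out = "the", ["the"]
    for _ in range(6):
        word = sample_next(word)
        out.append(word)
    print(" ".join(out))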

Please! Let's stick to the basics; it will help experiments like this make much more sense, as there is already a very clear mathematical foundation which explains them (and said emergent phenomena).

If you want more specifics, there are several layers to this; Shannon's treatment of ergodic sources is a good start. (There is some minor deviation from that setting here, but it is likely a close enough match to be properly instructive about the general dynamics of what is going on.)




> the topics in the parent post should not be a major surprise to anyone who has read https://people.math.harvard.edu/~ctm/home/text/others/shanno... !

> which clearly explains it (and said emergent phenomena)

Very smart people have looked at neural networks through the lens of information theory and published famous papers about it years ago. That lens couldn't explain many things about neural networks, but it was interesting nonetheless.

FWIW it's not uncommon for smart people to say "this mathematical structure looks like this other idea with [+/- some structure]!!" and that it totally explains everything... (kind of with so and so exceptions, well and also this and that and..). Truthfully, we just don't know. And I've never seen theorists in this field actually take the theory and produce something novel or make useful predictions with it. It's all try stuff and see what works, and then retroactively make up some crud on why it worked, if it did work (otherwise brush it under the rug).

There was this one posted recently on transformers being kernel smoothers: https://arxiv.org/abs/1908.11775
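If anyone wants the one-screen version of the correspondence that paper draws, here is my own toy illustration (not code from the paper): a single attention head is essentially a Nadaraya-Watson-style kernel smoother, with the softmax of query-key similarity acting as the kernel weights and the output being a weighted average of the values.

    import numpy as np

    def attention_as_smoother(query, keys, values):
        # kernel weights: softmax of scaled query-key similarity
        scores = keys @ query / np.sqrt(len(query))
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        # output: kernel-weighted average of the values
        return weights @ values

    rng = np.random.default_rng(0)
    keys, values, query = rng.normal(size=(5, 4)), rng.normal(size=(5, 4)), rng.normal(size=4)
    print(attention_as_smoother(query, keys, values))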


I think there is more here than a backward look.

The article introduced a discrete, algorithmic method for approximating the gradient-optimized model.

It would be interesting to optimize the discrete algorithm for both design and inference times, and see if any space or time advantages over gradient learning could be found. Or whether new ideas popped up as a result of optimization successes or failures.

It also might have an advantage in terms of algorithm adjustments. For instance, given the most likely responses at each step, discard the most likely whenever follow-ups are not too far below, and see if that reliably avoided copyright issues.
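To make that concrete, here is a rough sketch of one way the per-step rule could look (the margin value and the names are my own assumptions, not anything from the article):

    import random

    def pick_next(candidates, margin=0.1):
        """candidates: list of (token, probability) pairs for the current step."""
        ranked = sorted(candidates, key=lambda tp: tp[1], reverse=True)
        # Discard the most likely token whenever the runner-up is not too far below it.
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
            ranked = ranked[1:]
        tokens, probs = zip(*ranked)
        total = sum(probs)
        return random.choices(tokens, [p / total for p in probs])[0]

    print(pick_next([("memorized", 0.48), ("paraphrase", 0.45), ("other", 0.07)]))

Whether that actually avoids near-verbatim reproduction would have to be measured, and it does change the sampled distribution, so output quality could suffer.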

A lot easier to poke around in a discrete algorithm, with zero uncertainty as to what is happening, than in vast tensor models.


> It's all try stuff and see what works, and then retroactively make up some crud on why it worked

People have done this in earlier days too. The theory around control systems was developed after PID controllers had been successfully used in practice.


> It's all try stuff and see what works, and then retroactively make up some crud on why it worked, if it did work (otherwise brush it under the rug).

Reminds me of how my ex-client's data scientists would develop ML models.


I appreciate what you're saying, but convergence (via alternative paths, of various depths) is its own signal. Repeated rediscovery isn't necessarily wastefulness, but affirmation and validation of a deep truth for which there are multiple paths of arrival :)


I wish that this worked out in the long run! However, watching the field spin its wheels in the mud over and over with silly pet theories and local results makes it pretty clear that a lot of people are just chasing the butterfly, and then after a few years they grow disenchanted and sort of just give up.

The bridge comes when people connect concepts to those that are already well known and well understood, and that is good. It is all well and good to say that rediscovering things is bad in theory (it is not, necessarily!), but when it becomes Groundhog Day for years on end without significant theoretical change, that is an indicator that something is amiss in how we learn and interpret information in the field.

Of course, this is just my crotchety young opinion coming up on 9 years in the field, so please take it with a grain of salt and all that.


Meanwhile, in economics, you have economists arguing that the findings of anthropologists are invalid because they don't understand modern economic theory. It's history that needs to change.


In another adjacent thread, people are talking about the copyright implications of a neural network conforming to the training data within some error margin.

Many textbooks on information theory already call out the content-addressable nature of such networks[1], and they're even used in applications like compression because of this property[2][3], so it's no surprise that when the NYT prompted OpenAI models with a few paragraphs of its articles, the models reproduced them nearly verbatim.

[1] https://www.inference.org.uk/itprnn/book.pdf

[2] https://bellard.org/nncp/

[3] https://pub.towardsai.net/stable-diffusion-based-image-compr...
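For anyone who wants to reproduce the shape of that prompting probe, here is a minimal sketch using the Hugging Face transformers API, with gpt2 as a stand-in model and article.txt as a hypothetical local copy of a training-set article (the NYT complaint concerned far larger models; this only illustrates the check, not its outcome):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    article = open("article.txt").read()   # assumed local copy of an article from the training set
    prompt, held_out = article[:500], article[500:1000]

    inputs = tok(prompt, return_tensors="pt")
    # Greedy decoding: ask only for the single most likely continuation.
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    continuation = tok.decode(output[0][inputs["input_ids"].shape[1]:])

    # Crude overlap measure: words matching at the same positions as the held-out text.
    matches = sum(a == b for a, b in zip(continuation.split(), held_out.split()))
    print("positionally matching words:", matches)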


Yes! This is a consequence of empirical risk minimization via maximum likelihood estimation. To have a model not reproduce the density of the data it trained on would be like trying to get a horse and buggy to work well at speed, "now just without the wheels this time". It would generally not go all that well, I think! :'D


Ok, but why didn't Shannon get us GPT?


He was busy getting us towards wifi first.


I get the feeling you may not have read the paper as closely as you could have! Section 8 followed by Section 2 may look a tiny bit different if you consider it from this particular perspective.... ;)


Kudos for plugging Shannon's masterpiece.



