
Only very remotely "kind of". The LHC isn't a mindless "will it work if we scale it up?" type of experiment at all. The idea behind it is simple: there are fundamental properties we know we are after, we have predictions about how particles behave, and we know we need to smash particles together hard enough to test some of those predictions. We know how to accelerate charged particles, so here we go: we need a powerful particle accelerator. Building or even running the LHC isn't the experiment in itself; it's just an "easy" way to find out how theoretically predicted particle states look in real life (and hopefully to disprove some candidate theories).

Many ML experiments, though, are exactly what you said: we don't have the slightest idea what we are doing, but maybe if we throw enough compute at gradient descent over a generic enough network, it will learn to do something interesting. Well, I'll tell you what: theoretically you need only two wide layers of neurons to compute anything there is to compute; there just isn't enough computational power in the entire world to come up with the "correct" weights by chance. So what the research is mostly really about is coming up with tricks to make do with less computation, which essentially means making sense of the computational problems and making neural networks "less random", to direct the learning process. It's either new neural architectures, new ways of posing questions for the network to answer, ways to squeeze more training data out of the same amount of available external data, or alternatives to the gradient-descent learning approach altogether.
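To make the "two wide layers plus gradient descent" picture concrete, here is a toy sketch (my own illustration, assuming only NumPy; none of the names or numbers come from any real system): one wide hidden layer plus an output layer, fitted to sin(x) with hand-rolled gradient descent.

    # Toy illustration: a single wide hidden layer + output layer ("two layers
    # of neurons") fitted to sin(x) by plain gradient descent. In principle
    # width buys accuracy; in practice finding good weights is the hard part.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 256).reshape(-1, 1)
    y = np.sin(x)

    width = 512
    W1 = rng.normal(0.0, 1.0, (1, width)); b1 = np.zeros(width)
    W2 = rng.normal(0.0, 0.1, (width, 1)); b2 = np.zeros(1)
    lr = 1e-3

    for step in range(5000):
        h = np.tanh(x @ W1 + b1)          # hidden layer
        pred = h @ W2 + b2                # output layer
        err = pred - y                    # gradient of MSE up to a constant factor
        # backpropagation by hand
        dW2 = h.T @ err / len(x); db2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1 - h ** 2)
        dW1 = x.T @ dh / len(x); db1 = dh.mean(axis=0)
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

    print("final MSE:", float(np.mean((pred - y) ** 2)))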

So, to wrap it up: it may be kind of interesting to know that a GPT-2-type network didn't reach its full capacity, and that if we scale it up even more, it still learns something new. Inspecting how it behaves might also lead to insights or new applications. But ultimately, if there's no novelty of any kind in an experiment, training an NN nobody can realistically reproduce is a great way to show off (or to achieve some practical goal for the owner of that NN, if it was used to find something useful), but it doesn't really contribute much (or anything) to ML research.




"Yes, but."

Two wide layers can compute any computable function, but they don't scale in any manageable fashion with problem complexity, because you have to model problems as increasingly large lookup tables, whose size explodes rapidly. The big thing about GPT is that, so far, growth in GPT's size has corresponded to a (logarithmic, I think?) improvement in both the quality of its responses and the complexity of the problem domain it can model. GPT scales in exactly the way a two-layer network doesn't, and we don't yet have a handle on at what size it stops scaling.
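To put a rough number on "explodes rapidly" (my own illustration, not the parent's): a lookup table for an arbitrary Boolean function of n inputs needs 2^n entries, so memorizing rather than generalizing stops being an option almost immediately.

    # Illustrative only: a lookup table for an arbitrary Boolean function
    # of n inputs needs 2**n entries.
    for n in (10, 20, 30, 40, 50):
        print(f"{n} inputs -> {2 ** n:,} table entries")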


Uh, no, you must be misreading me. It's just "no". Let me simplify: I'm plainly stating that you are absolutely wrong. While GPT-3 is a "will it work if we scale it up?" type of experiment, the LHC totally isn't, and even comparing the two is silly.


Are you sure we know exactly what's going to happen when we scale up LHC experiments? You don't think we can discover something completely new?


I think what OP is saying is that physics experiments are guided by theory, in the sense that their goal is to prove or disprove this or that theoretical claim. It's not to fish for interesting correlations in a set of observations collected at random.

I don't know enough physics to give a sensible example, but in physics, as in science generally, a theory is first formed to explain a set of observations; the theory is then tested against new observations and either discarded, if the new observations disagree with it, or accepted otherwise (where "accepted" doesn't necessarily mean that all debate will cease and everyone will agree the theory is true). So, in short, the point of the LHC is to test the predictions of some theory (I think what it's testing is the Standard Model, but I'm not sure), and its size is a function of how easy or hard it is to test those predictions, not an attempt to "look harder" just in case something comes up. And if "something completely new" does come up, then the cycle starts all over again, with a new theory and new experiments to test its predictions. Just "discovering something completely new", i.e. making a new observation, doesn't really tell us anything until we have an explanation for it, in the form of a scientific theory that can be tested.

In machine learning we do not have any theory to guide our experiments, so the done thing is to try things and see what works. So just because the LHC and GPT-3 are both "large" doesn't mean they have the same goals, or that the goal of the LHC is just to be large because large is better.

To clarify: "will it work if we scale it?" is not much of a scientific claim, because it doesn't tell us anything new. We've known for a long time that it's possible to improve performance by spending more computing resources. If I run bubblesort on a supercomputer for a month and tell people that I have sorted a list of integers larger than anyone ever has before, will people fall over their chairs in amazement? Of course not. In no other field of computer science (except perhaps high-performance computing) is scaling resources an acceptable way to claim progress. That it is in machine learning is a result of the fact that we have no guiding theory to drive our experiments. So people just try things to see what works.
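To spell out the analogy (a textbook bubblesort, nothing specific to any real system): faster hardware makes each comparison cheaper, but the number of comparisons still grows quadratically with the length of the list, so the "achievement" is pure brute force.

    # Textbook bubblesort: O(n^2) comparisons. A supercomputer makes each
    # comparison faster but does not change how the work grows with n.
    def bubblesort(xs):
        xs = list(xs)
        for i in range(len(xs)):
            for j in range(len(xs) - 1 - i):
                if xs[j] > xs[j + 1]:
                    xs[j], xs[j + 1] = xs[j + 1], xs[j]
        return xs

    print(bubblesort([5, 3, 8, 1, 2]))   # [1, 2, 3, 5, 8]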

Obviously, if someone has more resources than almost everyone else, they can try more things and hope to luck out on more interesting results. That's the reason why it's always Google, OpenAI, Uber, Facebook etc. that are in the news for "new machine learning results". They've got more stuff to throw at more walls, and more eyes to see what sticks.


I think you might be downplaying the scientific method involved in the development of novel ML models and methods. It's not exactly a random-walk process. Most of the progress has been the result of people trying to model what's going on in our heads: convnets modeling our vision system, rnns modeling feedback loops, reinforcement learning modeling sparse reward signals, attention based models modeling, well, attention. Network training methods (e.g. SGD) are grounded in optimization theory. There are plenty of theories trying to explain why or how things work in deep learning. Most of them are probably wrong, but some, sooner or later, will turn out to be right. Not unlike physics, which for years has had competing theories (e.g. string theory vs. loop quantum gravity).
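For concreteness, the SGD I'm referring to is just the textbook gradient-step update; a toy sketch (made-up loss and numbers, not any particular library):

    # Toy gradient descent on L(w) = (w - 3)^2: the parameter takes a small
    # step against the gradient each iteration and converges to the minimum.
    def grad(w):                 # dL/dw
        return 2 * (w - 3)

    w, lr = 0.0, 0.1
    for _ in range(50):
        w -= lr * grad(w)        # the (stochastic) gradient descent update rule

    print(round(w, 4))           # ~3.0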

The original transformer experiment tested the hypothesis "if we organize layers of attention mechanisms in a certain way and feed them tons of text, they will be able to process that text effectively enough to build a very rich and consistent language model". The GPT experiments test the hypothesis "if we feed the transformer more data, its language model might become rich and consistent enough to produce human-level results". To me this sounds like a well-defined scientific experiment.
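For readers who haven't seen the paper, the attention mechanism being stacked is essentially scaled dot-product attention; here's a bare-bones NumPy sketch (single head, no masking or learned projections, shapes made up for illustration):

    # Bare-bones scaled dot-product attention: each position's output is a
    # softmax-weighted mix of every position's values.
    import numpy as np

    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                      # query/key similarity
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                 # softmax over keys
        return w @ V                                       # weighted sum of values

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))        # stand-in for 4 token embeddings of width 8
    print(attention(X, X, X).shape)    # (4, 8)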

Using your analogy, GPT-3 is more like if you devised an algorithm which produces n + k digits of pi after processing n digits of pi, without knowing anything about how to compute pi, or what pi is. To me that deserves falling over my chair in amazement.


Specifically about attention:

>> The original transformer experiment tested the hypothesis "if we organize layers of attention mechanisms in a certain way and feed them tons of text, they will be able to process that text effectively enough to build a very rich and consistent language model".

I read the paper when it came out, and I checked it again now to confirm: there's no such hypothesis in there. In fact, "Attention is all you need" is a typical example of a post-hoc paper that describes what a research team did that worked and how well it worked. "We tweaked these knobs and out came STUFF!". Typically for deep learning papers, it lacks a theory section, whose space is instead taken by an "Architecture" section which, well, describes the architecture. There are no theorems or proofs. There is nothing that connects some kind of theoretical claim to the experiments. The main claim of the paper is "we built this system and it has better performance than previous systems". Like I say in my earlier comment, that's not an interesting scientific claim.

I'm sorry, but personally I find that kind of work irritating. "We tried some stuff and got some results." Woo-hoo. But why did you try that stuff, and why did you get those results? Did you try to get the same results without that stuff? Did you try to get some other results with the same stuff? Can you explain what is going on in your system and why it does that when I twist this knob? If I twist that knob, can you tell me what it will do without having to run it first to find out? Typically, 99% of the time, the answer to all of this is "no" (i.e. no ablation experiments, no theoretical explanations, no nothing). It's like I say above, just throwing stuff at the wall to see what sticks. And then writing a paper to describe it.

Oh, and calling stuff suggestive names like "attention". If I call it "boredom", am I more or less justified than the authors?


>> Most of the progress has been the result of people trying to model what's going on in our heads: convnets modeling our vision system, rnns modeling feedback loops, reinforcement learning modeling sparse reward signals, attention based models modeling, well, attention.

The problem with those advances is that most of them happened twenty or more years ago. My comment is about the state of machine learning research right now, which is that there are very few new ideas and the majority of the field doesn't have a clear direction.

Note also that the advances you describe were not "guided by theory"; they were inspired by ideas about how the mind works. But finding ideas to try is not the scientific process I describe above, and just because you're inspired by an idea doesn't mean your work is in any way a proof or disproof of that idea. For example, CNNs were not created in an effort to demonstrate the accuracy of a certain model of the visual cortex. In fact, Yann LeCun is on record saying that deep learning is nothing like the brain:

Yann LeCun: My least favorite description [of deep learning] is, “It works just like the brain.” I don’t like people saying this because, while Deep Learning gets an inspiration from biology, it’s very, very far from what the brain actually does.

https://spectrum.ieee.org/automaton/artificial-intelligence/...

>> Using your analogy, GPT-3 is more like if you devised an algorithm which produces n + k digits of pi after processing n digits of pi, without knowing anything about how to compute pi, or what pi is.

That's not a good example. GPT-3 can't actually do this. In fact, no technology we know of can do this for a sufficiently large k with accuracy better than chance. Personally, I find GPT-3's text generation very underwhelming, and nowhere near the magical guessing machine your example seems to describe.



