Only very remotely "kind of". The LHC isn't a mindless "will it work if we scale it up?" type of experiment at all. The idea behind it is very simple: there are fundamental properties we know we are after, we have some predictions about how particles behave, and we know we want to smash the particles together hard enough to verify some of these predictions. We know how to accelerate charged particles, so here we go: now we need a powerful particle accelerator. Building or even running the LHC isn't the experiment by itself, it's just an "easy" way to find out how theoretically predicted particle states look in real life (hopefully disproving some possible theories).
Many ML experiments, though, are exactly what you said: we ultimately don't have the slightest idea what we are doing, but maybe if we throw enough compute power into gradient descent over some generic enough network, it will learn to do something interesting. Well, I tell you what: theoretically you need only 2 wide layers of neurons to compute anything there is to compute. There just isn't enough computational power in the entire world to possibly come up with the "correct" neuron weights entirely by chance. So what the research mostly really is about is coming up with various tricks to make do with fewer computational resources, which essentially means making sense of some computational problems and making neural networks "less random", to direct the learning process. It is either about new neural architectures, new ways of posing questions for an NN to answer, ways to make more training data out of the same amount of available external data, or alternatives to the gradient-descent learning approach altogether.
So, to wrap it up: it may be kinda interesting to know that a GPT-2 type of network didn't reach its full capacity, and that if we scale it up even more, it still learns something new. Inspecting how it behaves also might lead to some insights or new possible applications. But ultimately, if there's no novelty of any kind in an experiment, training an NN nobody can realistically reproduce is a great way to show off (or to achieve some practical goal for the owner of this NN, if it was used to find something practical), but doesn't really contribute much (or anything) to ML research.
Two wide layers can compute any computable function, but they do not scale in any manageable fashion with problem complexity. (Because you have to model problems as increasingly large lookup tables, whose size explodes rapidly.) The big thing about GPT is that so far, improvement in GPT size has corresponded to a (I think logarithmic?) improvement in both quality of response and complexity of the problem domain that GPT can model. GPT scales in exactly the way that a 2-layer network doesn't, and we don't yet have a handle on at what size it stops scaling.
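To make the lookup-table point concrete, here is a minimal sketch restricted to boolean functions on n bits (my simplification, not a general statement about computable functions). It builds a network with one hidden layer of threshold units, one unit per input pattern that should output 1, plus an OR-like output unit. The names `parity_table`, `build_lookup_network` and `run_network` are made up for this illustration. For parity, this construction needs 2^(n-1) hidden units, so the layer doubles with every extra input bit.

```python
from itertools import product


def parity_table(n_bits):
    """Truth table for n-bit parity: maps each input tuple to 0 or 1."""
    return {p: sum(p) % 2 for p in product((0, 1), repeat=n_bits)}


def build_lookup_network(truth_table, n_bits):
    """Hidden layer with one threshold unit per input pattern mapped to 1.

    Each unit is a (weights, bias) pair chosen so that it fires only
    when the input equals its pattern exactly.
    """
    hidden = []
    for pattern in product((0, 1), repeat=n_bits):
        if truth_table[pattern]:
            weights = [1 if bit else -1 for bit in pattern]
            bias = -sum(pattern)
            hidden.append((weights, bias))
    return hidden


def run_network(hidden, x):
    """Evaluate the network: step-activated hidden layer, then an OR."""
    activations = [
        1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else 0
        for weights, bias in hidden
    ]
    return 1 if sum(activations) >= 1 else 0


if __name__ == "__main__":
    for n in (2, 4, 8, 12):
        net = build_lookup_network(parity_table(n), n)
        print(f"{n} input bits -> {len(net)} hidden units")
```

The network is exact, but it is only a memorized truth table: nothing in it generalizes, and its width grows exponentially with the number of input bits.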
Uh. No, you must be misreading me. It's just "no". Let me simplify: I'm plainly stating that you are absolutely wrong, and while GPT-3 is a "will it work if we scale it up?" type of experiment, LHC totally isn't and even comparing these 2 things is silly.
I think what OP is saying is that physics experiments are guided by theory, in the sense that their goal is to prove or disprove this or that theoretical claim. It's not to fish for interesting correlations in a set of observations collected at random.
I don't know enough physics to give a sensible example, but in physics, as generally in science, a theory is first formed to explain a set of observations, then the theory is tested against new observations and either discarded, if the new observations do not agree with the theory, or accepted otherwise (where "accepted" doesn't necessarily mean that all debate will cease and everyone will agree the theory is true). So, in short, the point of the LHC is to test the predictions of some theory (I think what it's testing is the standard model but not sure) and its size is a function of how easy or hard it is to test such predictions, not an attempt to "look harder" just in case something comes up. And if "something completely new" does come up, then the cycle starts all over again, with a new theory and new experiments to test its predictions. Just "discovering something completely new", i.e. making a new observation, doesn't really tell us anything until we have an explanation for it, in the form of a scientific theory that can be tested.
In machine learning we do not have any theory to guide our experiments, so the done thing is to try things and see what works. So just because the LHC and GPT-3 are both "large" doesn't mean they have the same goals, or that the goal of the LHC is just to be large because large is better.
To clarify, "will it work if we scale it?" is not much of a scientific claim, because it doesn't tell us anything new. We've known that it's possible to improve performance by spending more computing resources for a long time now. If I run bubblesort on a supercomputer for a month and tell people that I have sorted a list of integers larger than anyone has ever done before, will people fall over their chairs in amazement? Of course not. In no other field of computer science (except perhaps high performance computing) is scaling resources an acceptable way to claim progress. That it is in machine learning is a result of the fact that we have no guiding theory to drive our experiments. So people just try things to see what works.
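To put the bubblesort analogy in concrete terms, here is a minimal sketch (the function name `bubble_sort_with_count` is made up for this illustration). It counts comparisons while sorting: the count is always n(n-1)/2 for this version, so doubling the list roughly quadruples the work. A bigger machine lets you sort a bigger list, but the algorithm, and what we know about sorting, stays exactly the same.

```python
def bubble_sort_with_count(values):
    """Plain bubble sort; returns (sorted_list, number_of_comparisons)."""
    items = list(values)  # work on a copy, leave the input untouched
    comparisons = 0
    for end in range(len(items) - 1, 0, -1):
        # After each pass the largest remaining item sits at index `end`.
        for i in range(end):
            comparisons += 1
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
    return items, comparisons


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        _, count = bubble_sort_with_count(range(n, 0, -1))
        print(f"n={n}: {count} comparisons")
```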
Obviously, if someone has more resources than almost everyone else, they can try more things and hope to luck out on more interesting results. That's the reason why it's always Google, OpenAI, Uber, Facebook etc. that are in the news for "new machine learning results". They got more stuff to throw at more walls and more eyes to see what sticks.
I think you might be downplaying the scientific method involved in the development of novel ML models and methods. It's not exactly a random walk process. Most of the progress has been the result of people trying to model what's going on in our heads: convnets modeling our vision system, RNNs modeling feedback loops, reinforcement learning modeling sparse reward signals, attention based models modeling, well, attention. Network training methods (e.g. SGD) are based on optimization theory. There are plenty of theories trying to explain why or how things work in deep learning. Most of these are probably wrong, but some, sooner or later, will turn out to be right. Not unlike physics, which for years has had competing theories (e.g. string theory vs loop quantum gravity, etc).
The original transformer experiment tested the hypothesis: "if we organize layers of attention mechanisms in a certain way and feed them tons of text they will be able to process that text effectively enough to build very rich and consistent language model". GPT experiments test the hypothesis "if we feed the transformer more data its language model might become rich and consistent enough to produce human level results". To me this sounds like a well-defined scientific experiment.
Using your analogy, GPT-3 is more like if you devised an algorithm which produces n + k of pi digits after processing n pi digits - without knowing anything about how to compute pi, or what pi is. To me that deserves falling over my chair in amazement.
>> The original transformer experiment tested the hypothesis: "if we organize
layers of attention mechanisms in a certain way and feed them tons of text
they will be able to process that text effectively enough to build very rich
and consistent language model".
I read the paper when it came out and I checked it again now to confirm:
there's no such hypothesis in there. In fact "Attention is all you need" is a
typical example of a post-hoc paper that describes what a research team did
that worked and how well it worked. "We tweaked these knobs and out came
STUFF!". Typically for deep learning papers it lacks a theory section, the
space of which is instead taken by an "Architecture" section which, well,
describes the architecture. There are no theorems, or proofs. There is nothing
that connects some kind of theoretical claim to the experiments. The main
claim of the paper is "we build this system and it has better performance than
previous systems". Like I say in my earlier comment, that's not an
interesting scientific claim.
I'm sorry but personally I find that kind of work irritating. "We tried some
stuff and got some results". Woo-hoo. But, why did you try that stuff and why
did you get those results? Did you try to get the same results without that
stuff? Did you try to get some other results with the same stuff? Can you
explain what is going on in your system and why it does that when I twist
this knob? If I twist that knob, can you tell me what it will do without
having to run it first to find out? Typically, 99% of the time, the answer to
all this is "no" (i.e. no ablation experiments, no theoretical explanations,
etc. etc, no nothing). It's like I say above, just throwing stuff at the wall
to see what sticks. And then writing a paper to describe it.
Oh, and calling stuff suggestive names like "attention". If I call it
"boredom", am I more or less justified than the authors?
>> Most of the progress has been the result of people trying to model what's
going on in our heads: convnets modeling our vision system, rnns modeling
feedback loops, reinforcement learning modeling sparse reward signals,
attention based models modeling, well, attention.
The problem with all those advances is that they all happened more than 20
years ago. My comment discusses the state of machine learning research right
now, which is that there are very few new ideas and the majority of the field
doesn't have a clear direction.
Note also that the advances you describe were not "guided by theory". They
were inspired by ideas about how the mind works. But, finding ideas to try
is not the scientific process I describe above. And just because you're
inspired by an idea doesn't mean your work is in any way a proof or disproof
of that idea. For example, CNNs were not created in an effort to demonstrate
the accuracy of a certain model of the visual cortex. In fact, Yann LeCun is on
record saying that deep learning is nothing like the brain:
Yann LeCun: My least favorite description [of deep learning] is, “It works
just like the brain.” I don’t like people saying this because, while Deep
Learning gets an inspiration from biology, it’s very, very far from what the
brain actually does.
>> Using your analogy, GPT-3 is more like if you devised an algorithm which
produces n + k of pi digits after processing n pi digits - without knowing
anything about how to compute pi, or what pi is.
That's not a good example. GPT-3 can't actually do this. In fact, no
technology we know of can do this for a k sufficiently large and with accuracy
better than chance. Personally I find GPT-3's text generation very
underwhelming and not anywhere near the magickal guessing machine your example
seems to describe.