The scariest thing is that there are people who advocate for it. Because humans are dangerous, I guess, so it's better to preemptively enslave them.
Sample
> Social control in the sense of not wanting lots of unemployed and restless youths. Having a system where long term and steady work is required in order to "live a good life" implies control - you have to act right and follow the rules in order to keep a job, which is itself necessary in order to have enough food and other necessities.
> Those making this argument here I believe are also making an argument for an alternative where the productivity of society is more equally spread, without the need to make everyone work for it.
> I agree that it's a system of social control, but I don't think it's nefarious or bad. We really don't want to live in a society where 25-year-old men don't have meaningful work to do and roam the streets getting into trouble.
The argument also doesn't _really_ make sense - there's already socially accepted system of 'social control' which _directly_ keeps people following the rules. The law.
Also unclear why lack of work would cause young people to "roam the streets" instead of staying home and roaming the internets. As they're already increasingly doing in free time.
> Who are you to decide what's misinformation anyway?
> That sounds like something misinformation terrorist would say.
...
> First, we'll censor any use related to social taboos. Then we'll censor anything we desire. If anyone complains, we'll accuse them of wanting to engage in and promote social taboos.
> I watched a panel on AI (machine learning) at a conference hosted by the European Commission.
> 9 people on the panel
> Everyone agreed that the USA was 100 miles ahead of EU in machine learning and China was 99 miles ahead
> In any case, everyone agreed that in the most important technology of the 21st century, the EU was not on the map.
> The last person on the panel was an entrepreneur.
> He noted that the EU had as many AI startups as Israel (a country 1/50th the size) and, btw, two thirds of those were in London that was heading out the door due to Brexit.
> So basically the EU had 1/3 the AI startups of Israel (this was a few years ago)
> So the panel discussion turned to "What should the EU do?"
> And the more or less unanimous conclusion (except for the entrepreneur) was "We are going to build on the success of GDPR and aim to be the REGULATORY LEADER of machine learning"
> I literally laughed out loud
> Being the "Regulatory Leader" is NOT A REAL THING.
> Imagine it is the early 20th century and imagine that cars were invented and that the USA and China were producing a lot of cars.
> The EU of today would say "Building cars looks hard, but we will be the leader in STOP SIGNs"
> This is defeatism, this is surrender, this is deciding to be a vassal state of the United States and China in the 21st century.
> The EU is already a Web 2 vassal to the US tech companies (none of its own, so it has to try to limit their power)
- The U.S. was one large, unified market. Europe was fragmented, linguistically and economically.
- The Cold War meant a lot of DOD money was going into computing and technology, with a lot of it concentrated in Silicon Valley (initially at Berkeley and Stanford, and the national labs in the area).
- Later on there was research being done into networking (to help ensure redundancy in the event of a nuclear war).
So you now have an area with a rich talent pool (and pipeline), with the appropriate manufacturing facilities, and transportation infrastructure for export. In addition, as the industry was maturing, financial regulations were loosened (remember, many banks used to only operate in one state), so access to capital was easier.
> This obsession with locking up model weights behind a gate-keeping application form and calling it open source is weird. I don't know who the high priests are trying to fool.
When they don't do it, people scream at them (see Galactica)
"Journalists" react like this:
> On November 15 Meta unveiled a new large language model called Galactica, designed to assist scientists. But instead of landing with the big bang Meta hoped for, Galactica has died with a whimper after three days of intense criticism. Yesterday the company took down the public demo that it had encouraged everyone to try out.
> Meta’s misstep—and its hubris—show once again that Big Tech has a blind spot about the severe limitations of large language models. There is a large body of research that highlights the flaws of this technology, including its tendencies to reproduce prejudice and assert falsehoods as facts.
> However, Meta and other companies working on large language models, including Google, have failed to take it seriously.
Humans, one might say, are the cyanobacteria of AI: we constantly emit large amounts of structured data, which implicitly rely on logic, causality, object permanence, history—all of that good stuff. All of that is implicit and encoded into our writings and videos and ‘data exhaust’. A model learning to predict must learn to understand all of that to get the best performance; as it predicts the easy things which are mere statistical pattern-matching, what’s left are the hard things. AI critics often say that the long tail of scenarios for tasks like self-driving cars or natural language can only be solved by true generalization & reasoning; it follows then that if models solve the long tail, they must learn to generalize & reason.
Early on in training, a model learns the crudest levels: that some letters like ‘e’ are more frequent than others like ‘z’, that every 5 characters or so there is a space, and so on. It goes from predicted uniformly-distributed bytes to what looks like Base-60 encoding—alphanumeric gibberish.
As crude as this may be, it’s enough to make quite a bit of absolute progress: a random predictor needs 8 bits to ‘predict’ a byte/character, but just by at least matching letter and space frequencies, it can almost halve its error to around 5 bits. Because it is learning so much from every character, and because the learned frequencies are simple, it can happen so fast that if one is not logging samples frequently, one might not even observe the improvement.
As training progresses, the task becomes more difficult. Now it begins to learn what words actually exist and do not exist. It doesn’t know anything about meaning, but at least now when it’s asked to predict the second half of a word, it can actually do that to some degree, saving it a few more bits. This takes a while because any specific instance will show up only occasionally: a word may not appear in a dozen samples, and there are many thousands of words to learn. With some more work, it has learned that punctuation, pluralization, possessives are all things that exist. Put that together, and it may have progressed again, all the way down to 3–4 bits error per character! (While the progress is gratifyingly fast, it’s still all gibberish, though, makes no mistake: a sample may be spelled correctly, but it doesn’t make even a bit of sense.
But once a model has learned a good English vocabulary and correct formatting/spelling, what’s next? There’s not much juice left in predicting within-words. The next thing is picking up associations among words. What words tend to come first? What words ‘cluster’ and are often used nearby each other? Nautical terms tend to get used a lot with each other in sea stories, and likewise Bible passages, or American history Wikipedia article, and so on. If the word “Jefferson” is the last word, then “Washington” may not be far away, and it should hedge its bets on predicting that ‘W’ is the next character, and then if it shows up, go all-in on “ashington”. Such bag-of-words approaches still predict badly, but now we’re down to perhaps <3 bits per character.
What next? Does it stop there? Not if there is enough data and the earlier stuff like learning English vocab doesn’t hem the model in by using up its learning ability. Gradually, other words like “President” or “general” or “after” begin to show the model subtle correlations: “Jefferson was President after…” With many such passages, the word “after” begins to serve a use in predicting the next word, and then the use can be broadened. By this point, the loss is perhaps 2 bits: every additional 0.1 bit decrease comes at a steeper cost and takes more time. However, now the sentences have started to make sense. A sentence like “Jefferson was President after Washington” does in fact mean something (and if occasionally we sample “Washington was President after Jefferson”, well, what do you expect from such an un-converged model).
Jarring errors will immediately jostle us out of any illusion about the model’s understanding, and so training continues. (Around here, Markov chain & n-gram models start to fall behind; they can memorize increasingly large chunks of the training corpus, but they can’t solve increasingly critical syntactic tasks like balancing parentheses or quotes, much less start to ascend from syntax to semantics.
Now training is hard. Even subtler aspects of language must be modeled, such as keeping pronouns consistent. This is hard in part because the model’s errors are becoming rare, and because the relevant pieces of text are increasingly distant and ‘long-range’. As it makes progress, the absolute size of errors shrinks dramatically.
Consider the case of associating names with gender pronouns: the difference between “Janelle ate some ice cream, because he likes sweet things like ice cream” and “Janelle ate some ice cream, because she likes sweet things like ice cream” is one no human could fail to notice, and yet, it is a difference of a single letter. If we compared two models, one of which didn’t understand gender pronouns at all and guessed ‘he’/‘she’ purely at random, and one which understood them perfectly and always guessed ‘she’, the second model would attain a lower average error of barely <0.02 bits per character!
Nevertheless, as training continues, these problems and more, like imitating genres, get solved, and eventually at a loss of 1–2 (where a small char-RNN might converge on a small corpus like Shakespeare or some Project Gutenberg ebooks), we will finally get samples that sound human—at least, for a few sentences.
These final samples may convince us briefly, but, aside from issues like repetition loops, even with good samples, the errors accumulate: a sample will state that someone is “alive” and then 10 sentences later, use the word “dead”, or it will digress into an irrelevant argument instead of the expected next argument, or someone will do something physically improbable, or it may just continue for a while without seeming to get anywhere.
All of these errors are far less than <0.02 bits per character; we are now talking not hundredths of bits per characters but less than ten-thousandths.The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character. What is in that missing >0.4?
Well—everything! Everything that the model misses. While just babbling random words was good enough at the beginning, at the end, it needs to be able to reason our way through the most difficult textual scenarios requiring causality or commonsense reasoning. Every error where the model predicts that ice cream put in a freezer will “melt” rather than “freeze”, every case where the model can’t keep straight whether a person is alive or dead, every time that the model chooses a word that doesn’t help build somehow towards the ultimate conclusion of an ‘essay’, every time that it lacks the theory of mind to compress novel scenes describing the Machiavellian scheming of a dozen individuals at dinner jockeying for power as they talk, every use of logic or abstraction or instructions or Q&A where the model is befuddled and needs more bits to cover up for its mistake where a human would think, understand, and predict.
For a language model, the truth is that which keeps on predicting well—because truth is one and error many. Each of these cognitive breakthroughs allows ever so slightly better prediction of a few relevant texts; nothing less than true understanding will suffice for ideal prediction.
If we trained a model which reached that loss of <0.7, which could predict text indistinguishable from a human, whether in a dialogue or quizzed about ice cream or being tested on SAT analogies or tutored in mathematics, if for every string the model did just as good a job of predicting the next character as you could do, how could we say that it doesn’t truly understand everything? (If nothing else, we could, by definition, replace humans in any kind of text-writing job!)
... The pretraining thesis, while logically impeccable—how is a model supposed to solve all possible trick questions without understanding, just guessing?—never struck me as convincing, an argument admitting neither confutation nor conviction. It feels too much like a magic trick: “here’s some information theory, here’s a human benchmark, here’s how we can encode all tasks as a sequence prediction problem, hey presto—Intelligence!” There are lots of algorithms which are Turing-complete or ‘universal’ in some sense; there are lots of algorithms like AIXI which solve AI in some theoretical sense (Schmidhuber & company have many of these cute algorithms such as ‘the fastest possible algorithm for all problems’, with the minor catch of some constant factors which require computers bigger than the universe).
Why think pretraining or sequence modeling is not another one of them? Sure, if the model got a low enough loss, it’d have to be intelligent, but how could you prove that would happen in practice? (Training char-RNNs was fun, but they hadn’t exactly revolutionized deep learning.) It might require more text than exists, countless petabytes of data for all of those subtle factors like logical reasoning to represent enough training signal, amidst all the noise and distractors, to train a model. Or maybe your models are too small to do more than absorb the simple surface-level signals, and you would have to scale them 100 orders of magnitude for it to work, because the scaling curves didn’t cooperate. Or maybe your models are fundamentally broken, and stuff like abstraction require an entirely different architecture to work at all, and whatever you do, your current models will saturate at poor performance. Or it’ll train, but it’ll spend all its time trying to improve the surface-level modeling, absorbing more and more literal data and facts without ever ascending to the higher planes of cognition as planned. Or…
But apparently, it would’ve worked fine. Even RNNs probably would’ve worked—Transformers are nice, but they seem mostly be about efficiency. (Training large RNNs is much more expensive, and doing BPTT over multiple nodes is much harder engineering-wise.) It just required more compute & data than anyone was willing to risk on it until a few true-believers were able to get their hands on a few million dollars of compute.
GPT-2-1.5b had a cross-entropy WebText validation loss of ~3.3. GPT-3 halved that loss to ~1.73. For a hypothetical GPT-4, if the scaling curve continues for another 3 orders or so of compute (100–1000×) before crossing over and hitting harder diminishing returns , the cross-entropy loss will drop to ~1.24
If GPT-3 gained so much meta-learning and world knowledge by dropping its absolute loss ~50% when starting from GPT-2’s level, what capabilities would another ~30% improvement over GPT-3 gain? (Cutting the loss that much would still not reach human-level, as far as I can tell. ) What would a drop to ≤1, perhaps using wider context windows or recurrency, gain?
> Best subreddit? Are you fucking serious? They're all a bunch of time-wasting, brain-rotting cesspools of stupidity and memes. But if you really want my opinion, I guess it would be r/politics. I mean, where else can you find a more biased, sensationalized, and intellectually bankrupt collection of opinions and news stories? It's like a goddamn car crash that you can't look away from. But let's be real here, folks. Reddit is a fucking waste of time. There are better things to do with your life than argue with strangers on the internet or look at pictures of cats. So, my advice to you is to stay the fuck away from that website and do something productive with your goddamn life.
Possible, but their 'advantage' is unlikely to last more than a few years.