S1: A $6 R1 competitor? (timkellogg.me)
851 points by tkellogg 8 days ago | 415 comments

I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how much low-hanging fruit we're still missing. It's so weird to think that improvements in a branch of computer science boil down to conjuring the right incantation words. How do you even change your mindset to start thinking this way?

I think the fact alone that distillation and quantization are techniques that can produce substantial improvements is a strong sign that we still have no real comprehensive understanding of how the models work.

If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.

Yet this is what happens - the distilled or quantized models often come very close to the original model.

So I think there are still many low-hanging fruits to pick.


We have a partial understanding of why distillation works—it is explained by The Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635). But if I am understanding correctly, that doesn't mean you can train a smaller network from scratch. You need a lot of randomness in the initial large network, for some neurons to have "winning" states. Then you can distill those winning subsystems to a smaller network.
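For concreteness, here is a toy sketch of the iterative magnitude-pruning loop that paper describes (PyTorch assumed; `make_model` and `train` are hypothetical helpers, and this is an illustration, not the paper's actual code):

  import torch

  def find_winning_ticket(make_model, train, prune_frac=0.2, rounds=3):
      # Train the big random network, drop the smallest-magnitude weights,
      # rewind the survivors to their original initialization, and repeat.
      model = make_model()
      init_state = {k: v.clone() for k, v in model.state_dict().items()}
      masks = {name: torch.ones_like(p) for name, p in model.named_parameters()
               if p.dim() > 1}  # prune weight matrices only, not biases
      for _ in range(rounds):
          train(model, masks)  # caller zeroes masked weights during training
          with torch.no_grad():
              for name, p in model.named_parameters():
                  if name not in masks:
                      continue
                  mask = masks[name]
                  surviving = p.abs()[mask.bool()]
                  k = max(1, int(prune_frac * surviving.numel()))
                  threshold = surviving.kthvalue(k).values
                  masks[name] = mask * (p.abs() > threshold).float()
          model.load_state_dict(init_state)  # the "rewind" step
      return model, masks

The rewind to `init_state` is the key point: the surviving sparse subnetwork trains well largely because it keeps its original "lucky" initialization.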

Note that a similar process happens in the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."


So, can a distilled 8B model (say, DeepSeek-R1-Distill-Llama-8B or whatever) be "trained up" to a higher-parameter 16B model after distillation from a superior model, or is it forever stuck at the 8B parameters that can just be fine-tuned?

So more 'mature' models might arise in the near future with fewer params and better benchmarks?

That's been happening consistently for over a year now. Small models today are better than big models from a year or two ago.

"Better", but not better than the model they were distilled from, at least that's how I understand it.

I think this is how the "child brain" works too. The better the parents and the environment are, the better the child develops :)

Not at all — how many people were geniuses and their parents not? I can name several and I’m sure with a quick search you can too.

How is that relevant? A few examples do not disprove anything. It's pretty common knowledge that the more successful/rich etc. your parents were, the more likely you'll be successful/rich etc.

This does not directly prove the theory your parent comment posits, being that better circumstances during a child's development improve the development of that child's brain. That would require success being a good predictor of brain development, which I'm somewhat uncertain about.


They might also be more biased and less able to adapt to new technology. Interesting times.

I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.

The question I hear you raising seems to be along the lines of, can we use a new compression method to get better resolution (reproducibility of the original) in a much smaller size.


> in that a distilled model of an LLM is like a JPEG of a photo

That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLMs as a compressed version of the training data.


And what is compression but finding the minimum amount of information required to reproduce a phenomenon? I.e. discovering natural laws.

Finding minimum complexity explanations isn't what finding natural laws is about, I'd say. It's considered good practice (Occam's razor), but it's often not really clear what the minimal model is, especially when a theory is relatively new. That doesn't prevent it from being a natural law, the key criterion is predictability of natural phenomena, imho. To give an example, one could argue that Lagrangian mechanics requires a smaller set of first principles than Newtonian, but Newton's laws are still very much considered natural laws.

Maybe I'm just a filthy computationalist, but the way I see it, the most accurate model of the universe is the one which makes the most accurate predictions with the fewest parameters.

The Newtonian model makes provably less accurate predictions than Einsteinian (yes, I'm using a different example), so while still useful in many contexts where accuracy is less important, the number of parameters it requires doesn't much matter when looking for the one true GUT.

My understanding, again as a filthy computationalist, is that an accurate model of the real bonafide underlying architecture of the universe will be the simplest possible way to accurately predict anything. With the word "accurately" doing all the lifting.

As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/Nature%20of%20...

I'm sure there are decreasingly accurate, but still useful, models all the way up the computational complexity hierarchy. Lossy compression is, precisely, using one of them.


The thing is, Lagrangian mechanics makes exactly the same predictions as Newtonian, and it starts from a foundation of just one principle (least action) instead of three laws, so it's arguably a sparser theory. It just makes calculations easier, especially for more complex systems; that's its raison d'être. So in a world where we don't know about relativity yet, both make the best predictions we know (and they always agree), but Newton's laws were discovered earlier. Do they suddenly stop being natural laws once Lagrangian mechanics is discovered? Standard physics curricula would not agree with you btw, they practically always teach Newtonian mechanics first and Lagrangian later, also because the latter is mathematically more involved.

I will argue that 'has least action as foundation' does not in itself imply that Lagrangian mechanics is a sparser theory:

Here is something that Newtonian mechanics and Lagrangian mechanics have in common: it is necessary to specify whether the context is Minkowski spacetime, or Galilean spacetime.

Before the introduction of relativistic physics the assumption that space is Euclidean was granted by everybody. The transition from Newtonian mechanics to relativistic mechanics was a shift from one metric of spacetime to another.

In retrospect we can recognize Newton's first law as asserting a metric: an object in inertial motion will in equal intervals of time traverse equal distances of space.

We can choose to make the assertion of a metric of spacetime a very wide assertion: such as: position vectors, velocity vectors and acceleration vectors add according to the metric of the spacetime.

Then to formulate Newtonian mechanics these two principles are sufficient: The metric of the spacetime, and Newton's second law.

Hamilton's stationary action is the counterpart of Newton's second law. Just as in the case of Newtonian mechanics: in order to express a theory of motion you have to specify a metric; Galilean metric or Minkowski metric.

To formulate Lagrangian mechanics: choosing stationary action as foundation is in itself not sufficient; you have to specify a metric.

So: Lagrangian mechanics is not sparser; it is on par with Newtonian mechanics.

More generally: transformation between Newtonian mechanics and Lagrangian mechanics is bi-directional.

Shifting between Newtonian formulation and Lagrangian formulation is similar to shifting from cartesian coordinates to polar coordinates. Depending on the nature of the problem one formulation or the other may be more efficient, but it's the same physics.


You seem to know more about this than me, but it seems to me that the first law does more than just induce a metric, I've always thought of it as positing inertia as an axiom.

There's also more than one way to think about complexity. Newtonian mechanics in practice requires introducing forces everywhere, especially for more complex systems, to the point that it can feel a bit ad hoc. Lagrangian mechanics very often requires fewer such introductions and often results in descriptions with fewer equations and fewer terms. If you can explain the same phenomenon with fewer 'entities', then it feels very much like Occam's razor would favor that explanation to me.


Indeed inertia. Theory of motion consists of describing the properties of Inertia.

In terms of Newtonian mechanics the members of the equivalence class of inertial coordinate systems are related by Galilean transformation.

In terms of relativistic mechanics the members of the equivalence class of inertial coordinate systems are related by Lorentz transformation.

Newton's first law and Newton's third law can be grouped together in a single principle: the Principle of uniformity of Inertia. Inertia is uniform everywhere, in every direction.

That is why I argue that for Newtonian mechanics two principles are sufficient.

The Newtonian formulation is in terms of F=ma; the Lagrangian formulation is in terms of interconversion between potential energy and kinetic energy.

The work-energy theorem expresses the transformation between F=ma and potential/kinetic energy. I give a link to an answer by me on physics.stackexchange where I derive the work-energy theorem: https://physics.stackexchange.com/a/788108/17198
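In outline, the standard one-dimensional form of that derivation is (LaTeX):

  \int_{x_0}^{x_1} F \, dx
    = \int_{t_0}^{t_1} m \frac{dv}{dt}\, v \, dt
    = \int_{v_0}^{v_1} m v \, dv
    = \tfrac{1}{2} m v_1^2 - \tfrac{1}{2} m v_0^2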

The work-energy theorem is the most important theorem of classical mechanics.

About the type of situation where the energy formulation of mechanics is more suitable: when there are multiple degrees of freedom then the force and the acceleration of F=ma are vectorial. So F=ma has the property that there are vector quantities on both sides of the equation.

When expressing in terms of energy: As we know: the value of kinetic energy is a single value; there is no directional information. In the process of squaring the velocity vector directional information is discarded, it is lost.

The reason we can afford to lose the directional information of the velocity vector: the description of the potential energy still carries the necessary directional information.

When there are, say, two degrees of freedom the function that describes the potential must be given as a function of two (generalized) coordinates.

This comprehensive function for the potential energy allows us to recover the force vector. To recover the force vector we evaluate the gradient of the potential energy function.

The function that describes the potential is not itself a vector quantity, but it does carry all of the directional information that allows us to recover the force vector.

I will argue the power of the Lagrangian formulation of mechanics is as follows: when the motion is expressed in terms of interconversion of potential energy and kinetic energy there is directional information only on one side of the equation; the side with the potential energy function.

When using F=ma with multiple degrees of freedom there is a redundancy: directional information is expressed on both sides of the equation.

Anyway, expressing mechanics taking place in terms of force/acceleration or in terms of potential/kinetic energy is closely related. The work-energy theorem expresses the transformation between the two. While the mathematical form is different the physics content is the same.


Nicely said, but I think then we are in agreement that Newtonian mechanics has a bit of redundancy that can be removed by switching to a Lagrangian framework, no? I think that's a situation where Occam's razor can be applied very cleanly: if we can make the exact same predictions with a sparser model.

Now the other poster has argued that science consists of finding minimum complexity explanations of natural phenomena, and I just argued that the 'minimal complexity' part should be left out. Science is all about making good predictions (and explanations); Occam's razor is more like a guiding principle to help find them (a bit akin to shrinkage in ML) rather than a strict criterion that should be part of the definition. And my example to illustrate this was Newtonian mechanics, which in a complexity/Occam's sense should be superseded by Lagrangian, yet that's not how anyone views this in practice. People view Lagrangian mechanics as a useful calculation tool to make equivalent predictions, but nobody thinks of it as nullifying Newtonian mechanics, even though it should be preferred from Occam's perspective. Or, as you said, the physics content is the same, but the complexity of the description is not, so complexity does not factor into whether it's physics.


Laws (in science, not government) are just relationships that are consistently observed, so Newton's laws remain laws until contradictions are observed, regardless of the existence of one or more alternative models which would predict them to hold.

The kind of Occam’s Razor-ish rule you seem to be trying to query about is basically a rule of thumb for selecting among formulations of equal observed predictive power that are not strictly equivalent (that is, if they predict exactly the same actually observed phenomena instead of different subsets of subjectively equal importance, they still differ in predictions which have not been testable), whereas Newtonian and Lagrangian mechanics are different formulations that are strictly equivalent, which means you may choose between them for pedagogy or practical computation, but you can't choose between them for truth because the truth of one implies the truth of the other, in either direction; they are exactly the same in substance, differing only in presentation.

(And even where it applies, its just a rule of thumb to reject complications until they are observed to be necessary.)


Newtonian and Lagrangian mechanics are equivalent only in their predictions, not in their complexity - one requires three assumptions, the other just one. Now you say the fact that they have the same predictions makes them equivalent, and I agree. But it's clearly not compatible with what the other poster said about looking for the simplest possible way to explain a phenomenon. If you believe that that's how science should work, you'd need to discard theories as soon as simpler ones that make the same predictions are found (as in the case of Newtonian mechanics). It's a valid philosophical standpoint imho, but it's in opposition to how scientists generally approach Occam's razor, as evidenced e.g. by common physics curricula. That's what I was pointing out. Having to exclude Newtonian mechanics from what can be considered science is just one prominent consequence of the other poster's philosophical stance, one that could warrant reconsidering whether that's how you want to define it.

> Do they suddenly stop being natural laws once Lagrangian mechanics is discovered?

Not my question to answer, I think that lies in philosophical questions about what is a "law".

I see useful abstractions all the way down. The linked Asimov essay covers this nicely.



Well, a JPEG can be thought of as a compression of the natural world of which the photograph was taken.

And we can answer the question why quantization works with a lossy format, since quantization just drops accuracy for space but still gives us a good enough output, just like a lossy jpeg.

Reiterating again, we can lose a lot of data (have incomplete data) and have a perfectly visible jpeg (or MP3, same thing).


This brings up an interesting thought too. A photo is just a lossy representation of the real world.

So it's lossy all the way down with LLMs, too.

Reality > Data created by a human > LLM > Distilled LLM


What you say makes sense, but is there the possibility that because it’s compressed it can generalize more? In the spirit of bias/variance.

Yeah but it does seem that they're getting high % numbers for the distilled models accuracy against the larger model. If the smaller model is 90% as accurate as the larger, but uses much < 90% of the parameters, then surely that counts as a win.

Nope, it's quite obvious why distillation works. If you just predict the next token, then the only information you can use to compute the loss is THE expected token. Whereas if you distill, you can also use (typically a few) logits from the teacher.

"My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.

Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them too. That way, model learns faster, because it gets more information in each update.

(So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
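A minimal sketch of what that combined loss can look like (classic soft-target distillation in PyTorch, as an illustration rather than any particular lab's recipe):

  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, target_ids,
                        T=2.0, alpha=0.5):
      # Hard-label term: plain next-token cross-entropy ("...Foo" is the answer).
      ce = F.cross_entropy(student_logits, target_ids)
      # Soft-label term: match the teacher's whole distribution over tokens,
      # softened by temperature T so the other plausible names carry signal too.
      log_p_student = F.log_softmax(student_logits / T, dim=-1)
      p_teacher = F.softmax(teacher_logits / T, dim=-1)
      kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
      return alpha * ce + (1 - alpha) * kl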


For quantization I don't think that's really true. Quantization is just making more efficient use of bits in memory to represent numbers.
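For instance, a toy symmetric int8 quantization of a single weight tensor (a sketch, not any particular library's scheme):

  import numpy as np

  w = np.random.randn(256, 256).astype(np.float32)  # fp32 weights: 4 bytes each
  scale = np.abs(w).max() / 127.0                   # one scale for the whole tensor
  w_q = np.round(w / scale).astype(np.int8)         # stored in 1 byte each
  w_hat = w_q.astype(np.float32) * scale            # dequantized for the matmul
  print(np.abs(w - w_hat).max())                    # error is small relative to w's range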

> still have no real comprehensive understanding of how the models work.

We do understand how they work, we just have not optimised their usage.

For example, someone who has a good general understanding of how an ICE or EV car works can figure out how to drive any car within a couple of minutes, even if the user interface is very unfamiliar.

But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.


Your example is somewhat inadequate. We _fundamentally_ don’t understand how deep learning systems work, in the sense that they are more or less black boxes that we train and evaluate. Innovations in ML are a whole bunch of wizards with big stacks of money changing “Hmm” to “Wait” and seeing what happens.

Would a different sampler help you? I dunno, try it. Would a smaller dataset help? I dunno, try it. Would training the model for 5000 days help? I dunno, try it.

Car technology is the opposite of that - it’s a white box. It’s composed of very well defined elements whose interactions are defined and explained by laws of thermodynamics and whatnot.


> _fundamentally_ don’t understand how deep learning systems work.

It's like saying we don't understand how quantum chromodynamics works. Very few people do, and it's the kind of knowledge not easily distilled for the masses in an easily digestible, popsci way.

Look into how older CNNs work -- we have very good visual/accessible/popsci materials on how they work.

I'm sure we'll have that for LLM but it's not worth it to the people who can produce that kind of material to produce it now when the field is moving so rapidly, those people's time is much better used in improving the LLMs.

The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks.


As a person who has trained a number of computer vision deep networks, I can tell you that we have some cool-looking visualizations on how lower layers work but no idea how later layers work. The intuition is built over training numerous networks and trying different hyperparameters, data shuffling, activations, etc. it’s absolutely brutal over here. If the theory was there, people like Karpathy who have great teacher vibes would’ve explained it for the mortal grad students or enthusiast tinkerers.

> The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks

I say this less as an authoritative voice but more as an amused insider: Spend a week with some ML grad students and you will get a chuckle whenever somebody says we’re not some monkeys throwing things at GPUs.


It may be as simple as this:

https://youtube.com/shorts/7GrecDNcfMc

Many many layers of that. It’s not a profound mechanism. We can understand how that works, but we’re dumbfounded how such a small mechanism is responsible for all this stuff going on inside a brain.

I don’t think we don’t understand, it’s a level beyond that. We can’t fathom the implications, that it could be that simple, just scaled up.


> Many many layers of that. It’s not a profound mechanism

Bad argument. Cavemen understood stone, but they could not build the aqueducts. Medieval people understood iron, water and fire, but they could not make a steam engine.

Finally we understand protons, electrons, and neutrons and the forces that govern them, but it does not mean we understand everything they could possibly make.


"Cavemen understood stone"

How far removed are you from a caveman is the better question. There would be quite some arrogance coming out of you to suggest the several million years gap is anything but an instant in the grand timeline. As in, you understood stone just yesterday ...

The monkey that found the stone is the monkey that built the cathedral. It's only a delusion the second monkey creates to separate it from the first monkey (a feeling of superiority, with the only tangible asset being "a certain amount of notable time passed since point A and point B").

"Finally we understand protons, electrons, and neutrons and the forces that govern them, but it does not mean we understand everything they could possibly make"

You and I agree. That those simple things can truly create infinite possibilities. That's all I was saying: we cannot fathom it (either because infinity is hard to fathom, or because its origins are humble - just a few core elements, or both, or something else).

Anyway, this discussion can head in any direction.


> The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work

Just like alchemists made enormous strides in chemistry, but their goal was to turn piss into gold.


Isn't that just scale? Even small LLMs have more parts than any car.

LLMs are more analogous to economics, psychology, politics -- it is possible there's a core science with explicability, but the systems are so complex that even defining the question is hard.


You can make a bigger ICE engine (like a container ship engine) and still understand how the whole thing works. Maybe there’s more parts moving but it still has the structure of an ICE engine.

With neural networks big or small, we got no clue what’s going on. You can observe the whole system, from the weights and biases, to the activations, gradients, etc and still get nothing.

On the other hand, one of the reasons why economics, psychology and politics are hard is because we can’t open up people’s heads and define and measure what they’re thinking.


One way I've heard it summarized: Computer Science as a field is used to things being like physics or chemistry, but we've suddenly encountered something that behaves more like biology.

[flagged]


"God" as a concept is unproven to exist; it is also impossible to prove, so for all intents and purposes it doesn't exist.

Could be, but it does not change the fact that we do not understand them as of now.

We know how the next token is selected, but not why doing that repeatedly brings all the capabilities it does. We really don't understand how the emergent behaviours emerge.

Eh, I feel like that's mostly just down to this: yes, transformers are a "next token predictor", but during fine-tuning for instruct, the attention-related wagon slapped on the back is partially hijacked as a bridge from input tokens to sequences of connections in the weights.

For example if I ask "If I have two foxes and I take away one, how many foxes do I have?" I reckon attention has been hijacked to essentially highlight the "if I have x and take away y then z" portion of the query to connect to a learned sequence from readily available training data (apparently the whole damn Internet) where there are plenty of examples of said math question trope, just using some other object type than foxes.

I think we could probably prove it by tracing the hyperdimensional space the model exists in and ask it variants of the same question/find hotspots in that space that would indicate it's using those same sequences (with attention branching off to ensure it replies with the correct object type that was referenced).


It feels less like a word prediction algorithm and more like a world model compression algorithm. Maybe we tried to create one and accidentally created the other?

It's almost like a Model of Language, but very Large

Why would asking a question about ice cream trigger a consideration about all possible topics? As in, to formulate the answer, the LLM will consider the origin of Elephants even. It won’t be significant, but it will be factored in.

Why? In the spiritual realm, many postulated that even the Elephant you never met is part of your life.

None of this is a coincidence.


The "Wait" vs. "Hmm" discussion in the paper does not suggest we know how they work. If we knew, we wouldn't have to try things and measure to figure out the best prompt.

It feels like we're back in 1900 when anyone's clever idea (and implementation) can give huge performance improvements, such as Ford's assembly line and Taylor's scientific management of optimizing shovel sizes for coal.

yes, it also feels like we are going to lose our just-in-time global shipments of anything to anywhere any day now. It will soon feel like 1900 in other ways.

Hope we don't get 1914 again, too.

We’ll have to raise our own chickens too…

Agreed. Here are three things that I find surreal about the s1 paper.

(1) The abstract changed how I thought about this domain (advanced reasoning models). The only other paper that did that for me was the "Memory Resource Management in VMware ESX Server". And that paper got published 23 years ago.

(2) The model, data, and code are open source at https://github.com/simplescaling/s1. With this, you can start training your own advanced reasoning models. All you need is a thousand well-curated questions with reasoning steps.

(3) More than half the references in the paper are from 2024 and Jan 2025. Just look at the paper's first page. https://arxiv.org/pdf/2501.19393 In which other field do you see this?


Omg, another fan of "Memory Resource Management in VMware ESX Server"!! It's one of my favorite papers ever - so clever.

Now imagine where we are in 12 months from now. This article from February 5 2025 will feel quaint by then. The acceleration keeps increasing. It seems likely we will soon have recursive self-improving AI -- reasoning models which do AI research. This will accelerate the rate of acceleration itself. It sounds stupid to say it, but yes, the singularity is near. Vastly superhuman AI now seems to arrive within the next few years. Terrifying.

This is something I have been suppressing since I don't want to become chicken little. Anyone who isn't terrified by the last 3 months probably doesn't really understand what is happening.

I went from accepting I wouldn't see a true AI in my lifetime, to thinking it is possible before I die, to thinking it is possible in the next decade, to thinking it is probable in the next 3 years, to wondering if we might see it this year.

Just 6 months ago people were wondering if pre-training was stalling out and if we hit a wall. Then deepseek drops with RL'd inference time compute, China jumps from being 2 years behind in the AI race to being neck-and-neck and we're all wondering what will happen when we apply those techniques to the current full-sized behemoth models.

It seems the models that are going to come out around summer time may be jumps in capability beyond our expectations. And the updated costs means that there may be several open source alternatives available. The intelligence that will be available to the average technically literate individual will be frightening.


This frightens mostly people whose identity is built around "intelligence", but without grounding in the real world. I've yet to see really good articulations of what, precisely we should be scared of.

Bedroom superweapons? Algorithmic propaganda? These things have humans in the loop building them. And the problem of "human alignment" is one unsolved since Cain and Abel.

AI alone is words on a screen.

The sibling thread details the "mass unemployment" scenario, which would be destabilizing, but understates how much of the current world of work is still physical. It's a threat to pure desk workers, but we're not the majority of the economy.

Perhaps there will be political instability, but .. we're already there from good old humans.


Depends on the model I suppose. Atm everything is being heavily trained as LLMs without much capability outside of input text->output text aside from non-modelised calls out to the Internet/RAG system etc.

But at some point (still quite far away) I'm sure we'll start training a more general purpose model, or an LLM self-training will break outside of the "you're a language model" bounds and we'll end up with exactly that;

An LLM model in a self-training loop that breaks outside of what we've told it to be (a Language model), becomes a general purpose model and then becomes intelligent enough to do something like put itself out onto the Internet. Obviously we'd catch the feelers that it puts out and realise that this sort of behaviour is starting to happen, but imagine if we didn't? A model that trained itself to be general purpose but act like a constantly executing LLM, uploads itself to Hugging Face, gets run on thousands of clusters by people, because it's "best in class" and yes it's sitting there answering LLM type queries but also in the background is sending out beacons & communicating with itself between those clusters to...idk do something nefarious.


Some of the scariest horror movies are the ones where the monster isn't shown. Often once the monster is shown, it is less terrifying.

In a general sense, uncertainty causes anxiety. Once you know the properties of the monster you are dealing with you can start planning on how to address it.

Some people have blind and ignorant confidence. A feeling they can take on literally anything, no matter how powerful. Sometimes they are right, sometimes they are wrong.

I'm reminded by the scene in No Country For Old Men where the good guy bad-ass meets the antagonist and immediately dies. I have little faith in blind confidence.

edit: I'll also add that human adaptability (which is probably the trait on which most confidence in humans would rest) has shown itself capable of saving us from many previous civilization-changing events. However, this change with AI is happening much, much faster than any before it. So part of the anxiety is whether or not our species' reaction time is enough to avoid the cliff we are accelerating towards.


> without grounding in the real world.

> I've yet to see really good articulations of what, precisely we should be scared of. Bedroom superweapons?

Loss of paid employment opportunities and increasing inequality are real world concerns.

UBI isn't coming by itself.


Worst case scenario, humans mostly go back to manual labor, which would fix a lot of modern-day ailments such as obesity and (some) mental health struggles, with the added bonus of enormous engineering advancements based on automated research.

Manual labour jobs are not magically going to appear.

Sure, but those are also real world concerns in the non-AI alternate timeline. As is the unlikelihood of UBI.

Yes, but they are likely dramatically accelerated in the AI timeline.

> This frightens mostly people whose identity is built around "intelligence", but without grounding in the real world.

It has certainly had this impact on my identity; I am unclear how well-grounded I really am*.

> I've yet to see really good articulations of what, precisely we should be scared of.

What would such an articulation look like, given you've not seen it?

> Bedroom superweapons? Algorithmic propaganda? These things have humans in the loop building them.

Even with current limited systems — which are not purely desk workers, they're already being connected to and controlling robots, even by amateurs — AI lowers the minimum human skill level needed to do those things.

The fear is: how far are we from an AI that doesn't need a human in the loop? Because ChatGPT was almost immediately followed by ChaosGPT, and I have every reason to expect people to continue to make clones of ChaosGPT continuously until one is capable of actually causing harm. (As with 3d-printed guns, high chance the first ones will explode in the face of the user rather than the target).

I hope we're years away, just as self driving cars turned out to be over-promised and under-delivered for the last decade — even without a question of "safety", it's going to be hard to transition the world economy to one where humans need not apply.

> And the problem of "human alignment" is one unsolved since Cain and Abel.

Yes, it is unsolved since time immemorial.

This has required us to not only write laws, but also design our societies and institutions such that humans breaking laws doesn't make everything collapse.

While I dislike the meme "AI == crypto", one overlap is that both have nerds speed-running discovering how legislation works and why it's needed — for crypto, specifically financial legislation after it explodes in their face; for AI, to imbue the machine with a reason to approximate society's moral code, because they see the problem coming.

--

* Dunning Kruger applies; and now I have first-hand experience of what this feels like from the inside, as my self-perception of how competent I am at German has remained constant over 7 years of living in Germany and improving my grasp of the language the entire time.


> The intelligence that will be available to the average technically literate individual will be frightening.

That's not the scary part. The scary part is the intelligence at scale that could be available to the average employer. Lots of us like to LARP that we're capitalists, but very few of us are. There's zero ideological or cultural framework in place to prioritize the well being of the general population over the profits of some capitalists.

AI, especially accelerating AI, is bad news for anyone who needs to work for a living. It's not going to lead to a Star Trek fantasy. It means an eventual phase change for the economy that consigns us (and most consumer product companies) to wither and fade away.


> AI, especially accelerating AI, is bad news for anyone who needs to work for a living. It's not going to lead to a Star Trek fantasy. It means an eventual phase change for the economy that consigns us (and most consumer product companies) to wither and fade away.

How would that work? If there are no consumers then why even bother producing? If the cost of labor and capital trends towards zero then the natural consequence is incredible deflation. If the producers refuse to lower their prices then they either don’t participate in the market (which also means their production is pointless) or ensure some other way that the consumers can buy their products.

Our society isn’t really geared for handling double digit deflation so something does need to change if we really are accelerating exponentially.


> How would that work? If there are no consumers then why even bother producing?

Whim and ego. I think the advanced economy will shift to supporting trillionaires doing things like "DIY home improvement" for themselves. They'll own a bunch of automated resources (power generation, mining, manufacturing, AI engineers), and use it to do whatever they want. Build pyramids on the moon, while the now economically-useless former middle-class laborers shiver in the cold? Sure, why not?


> If there are no consumers then why even bother producing?

> If the producers refuse to lower their prices then they either don’t participate in the market (which also means their production is pointless) or ensure some other way that the consumers can buy their products.

Imagine you're a billionaire with a data centre and golden horde of androids.

You're the consumer, the robots make stuff for you; they don't make stuff for anyone else, just you, in the same way and for the same reason that your power tools and kitchen appliances don't commute to work — you could, if you wanted, lend them to people, just like those other appliances, but you'd have to actually choose to, it wouldn't be a natural consequence of the free market.

Their production is, indeed, pointless. This doesn't help anyone else eat. The moment anyone can afford to move from "have not" to "have", they drop out of the demand market for everyone else's economic output.

I don't know how big the impact of dropping out would be: the right says "trickle down economics" is good, and this would be the exact opposite of that; while the left's criticism of trickle-down economics is that in practice the super-rich already have so much stuff that making them richer doesn't enrich anyone else who might service them. So if the right is correct then this is bad, but if the left is correct then this makes very little difference.

Unfortunately, "nobody knows" is a great way to get a market panic all by itself.


I agree with you and I am scared. My problem is: if most people can't work, who is going to pay for the product/services created with AI?

I get a lot of "AI will allow us to create SaaS in a weekend" and "AI will take engineers' jobs", which I think may both be true. But a lot of SaaS survive because engineers pay for them -- if engineers don't exist anymore, a lot of SaaS won't either. If you eat your potential customers, creating quick SaaS doesn't make sense anymore (yeah, there are exceptions, etc., I know).


> My problem is: if most people can't work, who is going to pay for the product/services created with AI?

A lot of those will probably go under, too. I think a lot of people are in for a rude awakening.

The only people our society and economy really values are the elite with ownership and control, and the people who get to eat and have comfort are those who provide things that are directly or indirectly valuable to that elite. AI will enable a game of musical chairs, with economic participants iteratively eliminated as the technology advances, until there are only a few left controlling vast resources and capabilities, to be harnessed for personal whims. The rest of us will be like rats in a city, scraping by on the margins, unwanted, out of sight, subsisting on scraps, perhaps subject to "pest control" regimes.


This is the same conclusion I can't help but reach. I would love nothing more than to be convinced that (there is a chance that) that is not going to happen.

> The only people our society and economy really values are the elite with ownership and control

This isn’t true. The biggest companies are all rich because they cater to the massive US middle class. That’s where the big money is at.


> This isn’t true. The biggest companies are all rich because they cater to the massive US middle class..

It is true, but I can see why you'd be confused. Let me ask you this: if members of the "the massive US middle class" can be replaced with automation, are those companies going 1) to keep paying those workers to support the middle-class demand which made them rich, or are they going to 2) fire them so more money can be shoveled up to the shareholders?

The answer is obviously #2, which has been proven time and again (e.g. how we came to have "the Rust Belt").

> That’s where the big money is at

Now, but not necessarily in the future. I think AI (if it doesn't hit a wall) will change that, maybe not instantaneously, but over time.


It’s true, but I can see why you’d be confused. You conflated what the economy rewards (which is what caters to the large middle class pool of money) with what individual companies try to optimize for (eliminating labor costs).

So you end up with a huge starved mob trying to come for your mansions and islands. I somehow think Musk is totally capable of nuking those mobs, or unleashing the (future) AI dogs on them, because the mob cannot produce anymore (because of AI) and cannot pay anymore (because there are no jobs, because of AI). So the mob will be totally worthless to this style of "capitalism". Really why would they bother with UBI when they can let the mob just die out?

> Really why would they bother with UBI when they can let the mob just die out?

Personally, I think UBI is a ploy to keep the "huge starved mob[s]" pacified during the transition, when they still have enough power to act, before the tech oligarchs fully cement their control.

Once the common people are powerless to protect themselves and their interests, then they'll be left to die out.


Those people will simply be surplus to requirements. They'll be left alone as long as they don't get in the way of the ruling class, and disposed of if they do. As usual in history.

That's a fallacy. You can't have an advanced economy with most people sitting on the side. Money needs to keep flowing. If all that remains of the economy consists of a few datacenters talking to each other, how can the ruling class profit off that?

> Money needs to keep flowing. If all that remains of the economy consists of a few datacenters talking to each other, how can the ruling class profit off that?

Plenty of profit was made off feudalism, and technofeudalism has all the tools of modern technology at its disposal. If things go in that direction, they will have an unlimited supply of serfs desperate for whatever human work/livelihood is left.


Unlimited supply yes, but highly limited usage for them. So even if a few will work for free, the rest will be starving, and angry.

> You can't have an advanced economy with most people sitting on the side.

If AI lives up to the hype, that will become possible.

> If all that remains of the economy consists of a few datacenters talking to each other, how can the ruling class profit off that?

I don't think it would be that. There'd also be power generation, manufacturing, mining, and construction, etc.; but all extremely automated. If you get to truly extreme levels of wealth concentration, things would shift out of our capitalist market system model, and concepts like "profit" would become anachronisms.

It actually might kinda look like the "economy" of Starcraft: you gather resources, decide what to build with them, and order it all around according to your whim. There will be a handful of guys playing, and everyone else will be a NPC.


> It actually might kinda look like the "economy" of Starcraft: you gather resources, decide what to build with them, and order it all around according to your whim. There will be a handful of guys playing, and everyone else will be a NPC.

I guess if the “players” are sociopathic enough they might decide to just wipe out the NPCs. The possibility of someone like Putin or Musk becoming the sole member of the post-singularity humanity does make me pause.


That's assuming people like Altman can keep artificial superintelligence under human control. It very well may escape control and humanity would be disempowered forever. Or worse, wiped out.

That is assuming the accelerating AI stays under human control.

We're racing up a hill at an ever-increasing speed, and we don't know what's on the other side. Maybe 80% chance that it's either nothing or "simply" a technological revolution.


Yes, and Accelerationism predicted this development back in the 1990s, perhaps most prominently in the opening lines of Nick Land's Meltdown (1994) text:

  [[ ]] The story goes like this: Earth is captured by a technocapital singularity as renaissance rationalization and oceanic navigation lock into commoditization take-off. Logistically accelerating techno-economic interactivity crumbles social order in auto-sophisticating machine runaway. As markets learn to manufacture intelligence, politics modernizes, upgrades paranoia, and tries to get a grip.

> reasoning models which do AI research

In the introduction to my research project on Accelerationism [0], I write:

  Faced with the acceleration of progress in Artificial Intelligence (AI) — with AI agents now automating AI research and development —, Accelerationism no longer seems like an abstract philosophy producing empty hyperstitional hype, but like a sober description of reality. The failed 2023 memorandum to stop AI development on systems more powerful than OpenAI's ChatGPT-4 perfectly illustrates the phenomenological aspects of Accelerationism: "To be rushed by the phenomenon, to the point of terminal institutional paralysis, is the phenomenon." [1]

At the current rate of acceleration, if you don't write hyperstitionally, your texts are dead on arrival.

[0] https://retrochronic.com/

[1] Nick Land (2017). A Quick-and-Dirty Introduction to Accelerationism in Jacobite Magazine.


Hope we get the Nick Land the younger, and not Nick Land the elder, set of outcomes. Somewhere, sometime, along the way, it seems like everything from CCRU and Duginism leapt out of the page into the real. Maybe it's just the beginning of the Baudrillardian millennium.

Nice. Though I couldn't understand those "opening lines" until I read in your Introduction:

> For Land, capitalism begins in Northern Italy around 1500 with "the emerging world of technologists and accountants", the spiral interexcitation of "oceanic navigation and place-value calculation", and zero-unlocked double-entry book-keeping

Fibonacci, amongst many others, played a critical role in that highly accelerative technology.


In a way it's the same thing as finding that models got lazier closer to Christmas, ie the "Winter Break" hypothesis.

I'm not sure what caused the above, but in my opinion it's either that the training is affected by the date in the training data (i.e. it refuses to answer properly because in every year of the training data there were fewer or lower-quality examples at the end of the year), or it's a cultural impression of humans talking about going on holiday/having a break etc. in the training data at certain times, and the model associating this with the meaning of "having a break".

I still wonder if we're building models wrong by training them on a huge amount of data from the Internet, then fine tuning for instruct where the model learns to make certain logical associations inherent or similar to the training data (which seems to introduce a myriad of issues like the strawberry problem or is x less than y being incorrect).

I feel like these models would have a lot more success if we trained a model to learn logic/problem solving separately without the core data set or to restrict the instruct fine tuning in some way so that we reduce the amount of "culture" it gleans from the data.

There's so much that we don't know about this stuff yet and it's so interesting to see something new in this field every day. All because of a wee paper on attention.


I think a skill here is learning a bias for experimentation and accepting the results one finds. Also the book "Why Greatness Cannot Be Planned" showcases the kind of open ended play that results in people discovering stuff like this.

One thing to realize is that we as humans have thinking steps (an internal monologue) before we output text. When LLMs produce text, we expect this thinking process to happen as well, but it does not - they are 'idiots that babble the first thing that comes to their minds'.

The above 'hack' is one of many realizations of the above differences.


Wait, so the trick is they reach into the context and basically switch '</think>' with 'wait' and that makes it carry on thinking?

Not sure if your pun was intended, but 'wait' probably works so well because of the models being trained on text structured like your comment, where "wait" is followed by a deeper understanding.

Yes, that's explicitly mentioned in the blog post:

>In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".
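In pseudocode the loop is roughly this (a sketch built around a hypothetical `next_token` decoding helper, not the s1 authors' actual implementation):

  MAX_FORCED_WAITS = 2  # the "budget" in budget forcing

  def think_with_budget_forcing(prompt):
      out, forced = "", 0
      while True:
          tok = next_token(prompt + out)   # hypothetical single decoding step
          if tok == "</think>" and forced < MAX_FORCED_WAITS:
              out += "Wait"                # suppress end-of-thinking, keep reasoning
              forced += 1
              continue
          out += tok
          if tok == "</think>":            # budget spent: let it stop thinking
              return out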


Yes, that's one of the tricks.

I've noticed that R1 says "Wait," a lot in its reasoning. I wonder if there's something inherently special in that token.

Semantically, wait is a bit of a stop-and-breathe point.

Consider the text:

I think I'll go swimming today. Wait, ___

what comes next? Well, not something that would usually follow without the word "wait", probably something entirely orthogonal that impacts the earlier sentence in some fundamental way, like:

Wait, I need to help my dad.


Yes, R1 seems to mostly use it like that. It's either to signal a problem with its previous reasoning, or if it's thought of a better approach. In coding it's often something like "this API won't work here" or "there's a simpler way to do this".

I guess it goes to show how important reiteration is for general logic problems. And tbf when finding a solution to something myself I'll consider each part, and/or consider parts in relation to each other and/or consider all parts in relation to each other (on a higher level) before coming to a final solution.

It's weird because I feel like we should've known that from work in general logic/problem solving studies, surely?


I bet a token like "sht!", "f*" or "damn!" would have the same or even stronger effect but the LLM creators would not like to have the users read them

It's literally in the article: they measured it, and "wait" was the best token.

Maybe, but it doesn't just use it to signify that it's made a mistake. It also uses it in a positive way, such as it's had a lightbulb moment. Of course some people use expletives in the same way, but that would be less common than for mistakes.

I think you're onto something; however, as the training is done on text and not actual thoughts, it may take some experimentation to find these stronger words.

> a branch of computer science

It should be considered a distinct field. At some level there is overlap (information theory, Kolmogorov complexity, etc.), but prompt optimization and model distillation is far removed from computability, formal language theory, etc. The analytical methods, the techniques to create new architectures, etc. are very different beasts.


Almost seems more like computer engineering. Is it really that different than signal/image processing?

I suspect CS departments don’t want to concede because they are now in the limelight…


I agree - I don't know what field it formally is, but computer science it is not. It is also related to information retrieval aka "Google skills", problem presentation, 'theory of mind', even management and psychology. I'm saying the latter because people often ridicule AI responses for giving bad answers that are 'too AI'. But often it is simply because not enough context-specific information was given to allow the AI to give a more personalized response. One should compare the response to "If I had asked a random person on the internet this query, what might I have gotten?" If you write "The response should be written as a <insert characteristics, context, whatever you feel is relevant>" it will deliver a much less AI-sounding response. This is just as much about how you pose a problem in general as it is about computer science.

Hm, I am surprised that people who are presumably knowledgeable about how attention works are surprised by this. The more tokens in the output, the more computation the model is able to do overall. Back in September, when I was testing my iOS hands-free voice AI prototype powered by an 8B LLM, whenever I wanted it to give really thoughtful answers to philosophical questions, I would instruct it to output several hundred whitespace characters (because they are not read aloud) before the actual answer.
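For illustration, the kind of instruction I mean was roughly this (a paraphrase, not the exact prompt):

  Before giving your answer, first output about 300 space characters on a line
  by themselves, then write your final answer. The spaces are never read aloud.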

What I am more surprised about is why models actually seem to have to produce "internal thoughts" instead of random tokens. Maybe during training, having completely random tokens in the thinking section derailed the model's thought process in the same way background noise can derail ours?


There are more than 10 different ways that I know for sure will improve LLMs, just like `wait`. It is part of the CoT. I assume most researchers know this. CoT is as old as 2019.

Chain of thought (CoT)?

Mind elaborating ?

I mean the “wait” thing is obvious if you’ve ever asked an LLM to look at its own response and ask if it’s really sure about its answer.

May sound like a conspiracy theory, but NVIDIA and a whole lot of AI startups have a strong vested interest to not seek+publish such findings.

If I don’t need a huge model and GPU, then AI is little more than an open source program running on an idle PC.

I feel like AI was NVIDIA’s lifeboat as GPU mining waned. Don’t see anything after that in the near future.


I think NVIDIAs future is pretty bright.

We're getting to the run-your-capable-LLM on-prem or at-home territory.

Without DeepSeek (and hopefully its successors) I wouldn't really have a usecase for something like NVIDIAs Project Digits.

https://www.nvidia.com/en-us/project-digits/


Except I can run R1 1.5b on a GPU-less and NPU-less Intel NUC from four-five years ago using half its cores and the reply speed is…functional.

As the models have gotten more efficient and distillation better the minimum viable hardware for really cooking with LLMs has gone from a 4090 to suddenly something a lot of people already probably own.

I definitely think a Digits box would be nice, but honestly I’m not sure I’ll need one.


R1 1.5b won’t do what most people want at all.

No, it won't. But that's not the point I was making

Yeah but what was R1 trained with? 50k GPUs as far as I've heard as well as distillation from OpenAI's models (basically leaning on their GPUs/GPU time).

Besides the fact that consumers will still always want GPUs for gaming, rendering, science compute etc.

No, I don't have any Nvidia stocks.


I mean is "wait" even the ideal "think more please" phrase? Would you get better results with other phrases like "wait, a second", or "let's double-check everything"? Or domain-dependent, specific instructions for how to do the checking? Or forcing tool-use?

its fascinating how certain political movements avoid that Wait moment...

I'm strictly speaking never going to think of model distillation as "stealing." It goes against the spirit of scientific research, and besides every tech company has lost my permission to define what I think of as theft forever

At most it would be illicit copying.

Though it's poetic justice that OpenAI is complaining about someone else playing fast and loose with copyright rules.


The First Amendment is not just about free speech, but also the right to read; the only question is whether AI has that right.

Does my software have the right to read the contents of a DVD and sell my own MP4 of it? No. If a streamer plays a YouTube video on their channel, is the content original? Yes. When GPT-3 was training, people saw it as a positive. When people started asking ChatGPT more things rather than searching sites, it became a negative.

If AI was just reading, there would be much less controversy. It would also be pretty useless. The issue is that AI is creating its own derivative content based on the content it ingests.

Isn't any answer to a question which hasn't been previously answered a derivative work? Or when a human write a parody of a song, or when a new type of music is influenced by something which came before.

This argument is so bizarre to me. Humans create new, spontaneous thoughts. AI doesn’t have that. Even if someone’s comment is influenced by all the data they have ingested over their lives, their style is distinct and deliberate, to the point where people have been doxxed before/anonymous accounts have been uncovered because someone recognized the writing style. There’s no deliberation behind AI, just statistical probabilities. There’s no new or spontaneous thoughts, at most pseudorandomness introduced by the author of the model interface.

Even if you give GenAI unlimited time, it will not develop its own writing/drawing/painting style or come up with a novel idea, because strictly by how it works it can only create „new” work by interpolating its dataset


This argument is so bizarre to me.

There is no evidence whatsoever to support that humans create "new, spontaneous thoughts" in any materially, qualitatively different way than an AI does - in other words, as a Turing-computable function over the current state. It may be that current AIs can't, but the notion that there is some fundamental barrier is a hypothesis with no evidence to support it.

> Even if you give GenAI unlimited time, it will not develop its own writing/drawing/painting style or come up with a novel idea, because strictly by how it works it can only create „new” work by interpolating its dataset

If you know of any mechanism whereby humans can do anything qualitatively different, then you'd have the basis for a Nobel Prize-winning discovery. We know of no mechanism that could allow humans to exceed the Turing computability that AI models are limited to.

We don't even know how to formalize what it would mean to "come up with a novel idea" in the sense you appear to mean, as presumably, something purely random would not satisfy you, yet something purely Turing computable would also not do, but we don't know of any computable functions that are not Turing computable.


This argument, by now a common refrain from defenders of companies like OpenAI, misses the entire putative point of intellectual property, and the point of law in general. It is a distraction of a common sort - an attempt to reframe a moral and legal question into an abstract ontological one.

The question of whether the mechanism of learning in a human brain and that in an artificial neural network is similar is a philosophical and perhaps technical one that is interesting, but not relevant to why intellectual property law was conceived: to economically incentivize human citizens to spend their time producing creative works. I don't actually think property law is a good way to do this. Nonetheless, when massive capital investments are used to scrape artists' work in order to undercut their ability to make a living from that work, for the benefit of private corporations that do not have the artists' consent, the question is whether this should violate the artificial notion of intellectual property we constructed for exactly this purpose - and in that sense, it's fairly obvious that the answer is yes.


I wasn't responding to a moral and legal question. I was responding to a comment arguing that humans are some magical special case in nature.

If you want to argue it's a distraction, argue that with the person I replied to, who was the person who changed the focus.


Yeah, I'll give you that. But many people seem to have the argument you've made loaded up for these conversations - an argument which is dubious on its own terms, by the way, since we don't really have a complete picture of human learning, and the assumption that it simply follows the mechanisms we understand from machine learning is not a null hypothesis exempt from justification - and it needs to be addressed wherever possible that the ontological question is not what matters here.

> which is dubious on its own terms, by the way, as we don't really have a complete picture of human learning and the assumption that it simply follows the mechanisms we understand from machine learning is not a null hypothesis that doesn't demand justification

The argument I made in no way rests on a "complete picture of human learning". The only thing it rests on is the lack of evidence of computation exceeding the Turing computable set. Finding evidence of such computation would upend physics, symbolic logic, and maths. It'd be a finding that would guarantee a Nobel Prize.

I gave the justification. It's a simple one, and it stands on its own. There is no known computable function that exceeds the Turing computable, and all Turing computable functions can be computed on any Turing complete system. Per the extended Church-Turing thesis, this includes any natural system given the limitations of known physics. In other words: unless you can show new, unknown physics, human brains are computers with the same limitations as any electronic computer, and the notion of "something new" arising from humans, other than as a computation over pre-existing state, in a way an electronic computer can't also do, is an entirely unsupportable hypothesis.

> and it needs to be addressed wherever possible that the ontological question is not what matters here

It may not be what matters to you, but to me the question you clearly would prefer to discuss is largely uninteresting.


> In other words: As a Turing-computable function over the current state.

You need to be a bit more expansive. Turing-computable functions need to halt and return eventually. (And they need to be proven to halt.)

> We know of no mechanism that could allow humans to exceed the Turing computability that AI models are limited to.

Depends on which AI models you are talking about? When generating content, humans have access to vastly more computational resources than current AI models. To give a really silly example: as a human I can swirl some water around in a bucket and be inspired by the sight. A current AI model does not have the computational resources to simulate the bucket of water (nor does it have a robotic arm and a camera to interact with the real thing instead.)


> You need to be a bit more expansive. Turing-computable functions need to halt and return eventually. (And they need to be proven to halt.)

This is pedantry. Any non-halting function can be decomposed into a step function and a loop. What matters is that step function. But ignoring that, human existence halts, and so human thought processes can be treated as a singular function that halts.

> Depends on which AI models you are talking about? When generating content, humans have access to vastly more computational resources than current AI models. To give a really silly example: as a human I can swirl some water around in a bucket and be inspired by the sight. A current AI model does not have the computational resources to simulate the bucket of water (nor does it have a robotic arm and a camera to interact with the real thing instead.)

An AI model does not have computational resources. It's a bunch of numbers. The point is not the actual execution but theoretical computational power if unconstrained by execution environment.

The Church-Turing thesis also presupposes an unlimited amount of time and storage.


Yes, that's why we need something stronger than the Church-Turing thesis.

See https://scottaaronson.blog/?p=735 'Why Philosophers should care about Computational Complexity'

Basically, what the brain can do in reasonable amounts of time (eg polynomial time), computers can also do in polynomial time. To make it a thesis something like this might work: "no physically realisable computing machine (including the brain) can do more in polynomial time than BQP already allows" https://en.wikipedia.org/wiki/BQP


If people were claiming that a computer might be able to, but will be to slow, that might be an angle to take, but to date, in these discussions, none of the people arguing that brains can do more have argued that they're just more efficient, but that they inherently have more capabilities, so it's an unnecessarily convoluted argument.

> Humans create new, spontaneous thoughts

I don't believe we do; just look at media - most plot-lines in movies/TV are little more than "boy meets girl" or Pocahontas retold.

And if you say that a model could not create anything new because of its static data set but humans could... I disagree with that, because us humans are working with a data set that we add to some days, but if we use the example of writing a TV script, the writer draws from knowledge (gained through life experience) that is as finite as a model's training set.

I've made this sort of comment before. Even look to high fantasy; what are elves but humans with different ears? Goblins are just little humans with green skin. Dragons are just big lizards. Minotaurs are just humans but mixed with a bull. We basically create no new ideas - 99% of human "creativity" is just us riffing on things we know of that already exist.

I'd say the incidences of humans having a brand new thought or experience not rooted in something that already exists is very, very low.

Even just asking free chat gpt to make me a fantasy species with some culture and some images of the various things it described does pretty well; https://imgchest.com/p/lqyeapqkk7d. But it's all rooted in existing concepts, same as anything most humans would produce.


> Humans create new, spontaneous thoughts.

The compatibility of determinism and freedom of will is still controversially debated. There is a good chance that Humans don’t „create“.

> There’s no deliberation behind AI, just statistical probabilities. There’s no new or spontaneous thoughts, at most pseudorandomness introduced by the author of the model interface.

You can say exactly the same about deterministic humans since it is often argued that the randomness of thermodynamic or quantum mechanical processes is irrelevant to the question of whether free will is possible. This is justified by the fact that our concept of freedom means a decision that is self-determined by reasons and not a sequence of events determined by chance.


> The compatibility of determinism and freedom of will is still controversially debated. There is a good chance that Humans don’t „create“.

Determinism and free will are pretty irrelevant here.

Unless P=NP, there's no way for us to distinguish in general between eg pseudo random systems and truly random systems from the outside.

Btw, I don't think determinism in humans/AI has anything to do with deliberation.

The newest AI models are allowed to deliberate. At least by some meanings of the word.

> This is justified by the fact that our concept of freedom means a decision that is self-determined by reasons and not a sequence of events determined by chance.

Well, different people have different definitions here. None of them very satisfying.


> Determinism and free will are pretty irrelevant here.

No. It's the other way around. Free will is the basis for „creating something new“.

> Btw, I don't think determinism in humans/AI has anything to do with deliberation.

With determinism there is no deliberation.


> With determinism there is no deliberation.

As far as we can tell, all the laws of the universe are completely deterministic. (And that includes quantum mechanics.) As far as we can tell, human beings obey the laws of physics.

(To explain: quantum mechanics as a theory is completely deterministic and even linear. Some outdated interpretations of quantum mechanics, like Copenhagen, use randomisation. But interpretations don't make a difference to what the underlying theory actually is. And more widely accepted interpretations like 'Many Worlds' preserve the determinism of the underlying theory.)

Btw, neural nets are typically sampled from, and you can use as good a random number generator (even a physical random number generator) as there is, if you want to. I don't think it'll change what we think neural nets are capable of.


That's exactly their point (and mine), with respect to the person above arguing humans unlike AI can create "new things". For that distinction to make sense "new things" must be interpreted as "something that can't be deterministically derived from the current world state", as they're trying to create a distinction between a purely deterministic algorithm and human consciousness.

Kind of. The constitution as a whole, and the amendments, don't give you the right to do anything; you have the right to do whatever you want whenever you want. The constitution tells the government what it can and cannot stop you from doing.

I'm not sure the US 'First Amendment' is relevant here? DeepSeek is in China.

It's hardly even illicit - at least in the United States, the output of an AI isn't copyrightable.

Was that decided in courts, yet?

In any case, copyright ain't the only thing that prevents copying.


Stochastic decompression. Dass-it.

I think it's less about that and more whether or not they used the free or paid API.

I think if OpenAI (or any other company) are paid for their compute time/access as anybody would, then using content generated by other models is fair game. Because it's an active/ongoing cost and not a passive one.

Whereas if someone trained on my dumb tweets or HN posts, then so be it; it's a passive cost for me - I paid my time to say x thing for my own benefit (tribal monk-e social interaction), therefore I have already gotten the value out of it.


Maybe but something has gotta pay the bills to justify the cutting edge. I guess it's a similar problem to researching medicine.

Well, the artists and writers also want to pay their bills. We threw them under the bus; might as well throw OpenAI under it too and get an actual open AI that we can use.

The investment thrown at OpenAI seems deeply inflated for how much meaningful progress they're able to make with it

I think it's clear that innovative breakthroughs in bleeding-edge research are not just a matter of blindly hurling more money at a company to build unprecedentedly expensive datacenters

But also, even if that was a way to do it, I don't think we should be wielding the law to enable privately-held companies to be at the forefront of research, especially in such a grossly inconsistent manner


If chain of thought acts as a scratch buffer by providing the model more temporary "layers" to process the text, I wonder if it would make sense to make this buffer a separate context with its own separate FNN and attention. In essence, there's a macroprocess of "reasoning" that takes unbounded time to complete, and then a microprocess of describing this incomprehensible stream of embedding vectors in natural language - in a way returning to the encoder/decoder architecture, but where both encoder and decoder are autoregressive. Maybe this would give us a denser representation of said "thought", not constrained by imitating human text.

I've had an idea since I was a kid which I can share. I was contemplating AI and consciousness generally, probably around the time I read "The Minds I".

I reflected on the pop-psychology idea of consciousness and subconsciousness. I thought of each as an independent stream of tokens, like stream of consciousness poetry. But along the stream there were joining points between these two streams, points where the conscious stream was edited by the subconscious stream. You could think of the subconscious stream as performing CRUD like operations on the conscious stream. The conscious stream would act like a buffer of short-term memory while the subconscious stream would act like a buffer of long-term memory. Like, the subconscious has instructions related to long-term goals and the conscious stream has instructions related to short-term goals.

You can imagine perception as input being fed into the conscious stream and then edited by the subconscious stream before execution.

It seems entirely possible to actually implement this idea in this current day and age. I mean, it was a fever dream as a kid, but now it could be an experiment!


Consciousness as the subconscious pretending not to be subconscious, something like that - a thin wrapper. CRUD makes sense.

It gels closely with Buddhism - hell, all religions.


Have you read Jaynes' "The Origin of Consciousness in the Breakdown of the Bicameral Mind"?

I haven't read the original but I am familiar with the broad stroke view. There are similarities (perhaps vague) in the more recent work of someone like McGilchrist and his The Master and His Emissary (another book which I only have a broad stroke view of).

At the time I had this idea I did not know of either of these. I think I was drawing explicitly on the conscious / subconscious vocabulary.


Here's a paper your idea reminds me of. https://arxiv.org/abs/2501.19201

It's also not so far from Meta's Large Concept Model idea.


Previous discussion:

[41 comments, 166 points] https://news.ycombinator.com/item?id=42919597


I had this exact same thought yesterday.

I’d go so far as to add one more layer to monitor this one and stop adding layers. My thinking is that this meta awareness is all you need.

No data to back my hypothesis up. So take it for what it’s worth.


This is where I was headed, but I think you said it better. Some kind of executive process monitoring the situation, the random stream of consciousness, and the actual output. Looping back around to outdated psychology: you have the ego, which is the output (speech); the super-ego is the executive process; and the id is the <think>internal monologue</think>. This isn't the standard definition of those three, but it's close enough.

My thought along the same lines: do all tokens live in the same latent space, or in many spaces, with each logical unit trained separately from the others?

> this incomprehensible stream of embedding vectors as natural language explanation, in a way returning to encoder/decoder architecture

This is just standard decoding; the stream of vectors is called the KV cache.


The problem is that RL is extremely inefficient. It's one thing to use it for fine tuning an LLM to do the chain of thought trick and quite another to do thinking entirely from scratch. The pretrained LLM does a lot of heavy lifting there.

And it would have to be RL for your idea to work since there is no "thinking" dataset for a novel token space. There isn't even one for existing LLM token space, but they have the base model to work off of. When the thought is expressed in English, the model already knows the relationships between the tokens in the thought, it's merely repurposing it for a "thinking" application.


> The problem is that RL is extremely inefficient.

Wait, what? That is an odd way of defining it. That's like saying Turing machines are an inefficient way to solve TSP. You would, at the least, want to define this in terms of complexity, or put it into the context of domains and observability.

RL, by definition, is a field about finding efficient solutions to problems in the domain of choice [1]. There are likely regimes in LLM/LRM learning where RL can be quite efficient, polynomial time even in the state space; we just need to explore and find them. For example, you can use dynamic programming as a "more" efficient way to solve MDPs [1] because it is polynomial in the size of the state space × action space.

[1]https://web.stanford.edu/class/psych209/Readings/SuttonBarto...


RL provides a very poor training signal for deep learning - an order of magnitude or more worse than supervised learning. Better than nothing, of course.

What the OP suggested is similar to training a transformer from scratch using RL (ie. no training tokens) towards an objective of steering a pretrained LLM to produce human readable output. It will probably not even converge, and if it does it would take immense compute.


In the case of supervised problem domains, you implicitly make a decision about what is signal, and what is noise, and sure, in that closed setting, supervised learning is much more sample efficient. But I think what we're learning now is that with strong enough base models, 'aha' moments in RL training show that it might be possible to essentially 'squeeze out signal from language itself', giving you far greater breadth of latent knowledge than supervised examples, and letting you train to generalize to far greater horizons than a fixed dataset might allow. In a fascinating way it is rather reminiscent of, well, abiogenesis. This might sound like speculative claptrap if you look at the things the current generation of models are still weak at, but... there's a real chance that there is a very heavy tail to the set of outcomes in the limit.

With a pretrained LLM most of the work is done. RL just steers the model into a 'thinking' mode. There is enough signal for that to work and for the inefficiency to not matter.

The downside is that you are limiting the model to think in the same language it outputs. An argument could be made that this is not how all humans think. I know that I rarely think in language or even images, just concepts (probably isn't even the right word) mix and transform and often I don't even bother to make the transformation to language at the end, just action.


I strongly agree; in fact I think what best matches the thought process is something like the multiset tree/forest workspace approach as suggested by Marcolli, Chomsky, and Berwick - a Hopf algebra that can be externalized into (non-planar) embeddings of linearized strings, or alternately into semantic manifolds.

Once we train models on the chain of thought outputs, next token prediction can solve the halting problem for us (eg, this chain of thinking matches this other chain of thinking).

I think that is how human brains work. When we practice, at first we have to be deliberate (thinking slow). Then we “learn” from our own experience and it becomes muscle memory (thinking fast). Of course, it increases the odds we are wrong.

Or worse, we incorrectly overweight the wrong chain of thinking to an irrelevant output (but pragmatically useful output), at scale.

For example, xenophobia as a response to economic hardship is the wrong chain of thinking embedded in the larger zeitgeist.


Comments on a google doc? Nesting in social media comments?

Seems like similar concepts. I think there is some potential in improving how LLMs refine and further their own reasoning lines, but I'm no AI mage.


Off topic, but I just bookmarked Tim’s blog, great stuff.

I dismissed the X references to S1 without reading them - big mistake. I have been working generally in AI for 40 years and in neural networks for 35 years, and the exponential progress since the hacks that made deep learning possible has been breathtaking.

Reduction in processing and memory requirements for running models is incredible. I have been personally struggling with creating my own LLM-based agents with weaker on-device models (my same experiments usually work with 4o-mini and above models) but either my skills will get better or I can wait for better on device models.

I was experimenting with the iOS/iPadOS/macOS app On-Device AI last night, and the person who wrote it was successful in getting web search tool calling working with a very small model - something that I have been trying to perfect.


If an LLM output is like a sculpture, then we have to sculpt it. I never did sculpting, but I do know they first get the clay spinning on a plate.

Whatever you want to call this “reasoning” step, ultimately it really is just throwing the model into a game loop. We want to interact with it on each tick (spin the clay), and sculpt every second until it looks right.

You will need to loop against an LLM to do just about anything and everything, forever - this is the default workflow.

Those who think we will quell our thirst for compute have another thing coming, we’re going to be insatiable with how much LLM brute force looping we will do.


I can't believe this hasn't been done yet, perhaps it is a cost issue.

My literal first thought about AI was wondering why we couldn't just put it in a loop. Heck, one update per day, or one update per hour would even be a start. You have a running "context", the output is the next context (or a set of transformations on a context that is a bit larger than the output window). Then ramp that up ... one loop per minute, one per second, millisecond, microsecond.
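
A rough sketch of that loop, with `llm` standing in for whatever completion function you have (all names here are hypothetical, not from any existing system):

  STABLE_PREFIX = ("You are a long-running process. Maintain a running summary of "
                   "your goals, observations, and next steps.")

  def run_loop(llm, seed_context, ticks=10):
      context = seed_context
      for _ in range(ticks):
          # old context in, new context out; the stable prefix is the part that never changes
          context = llm(f"{STABLE_PREFIX}\n\nCurrent context:\n{context}\n\nUpdated context:").strip()
      return context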


The hard part is coming up with a good way to grade results. Which you need to update the weights based on the outcome, otherwise the model will not actually learn anything.

For the "looping" I'm talking about you don't need to update the weights. It is simply, old context in, new context out, new context in, new-new context out, etc.

Of course, keeping that coherent over numerous loops isn't going to be easy. No doubt there is a chance it goes off the rails. So you might have a section of context that is constantly stable, a section of context that updates each loop, etc.

In the other response to my comment someone mentioned eventually updating the weights (e.g. daily) and you would in that case have to have some kind of loss function.


Then I'm not quite sure what benefit you expect to derive from it? Making e.g. QwQ-32 loop isn't hard - it often does it all by itself, even. But it doesn't translate to improvements on every iteration; it just goes in circles.

When you look out your eyes, that rectangular viewport is all your context. For example, we cannot fit Jupiter into your viewport.

So, if we can never fit large concepts like Jupiter into your viewport (your context window), does it follow that you will never understand Jupiter? Or is there no way to take a picture of Jupiter and make it smaller so it fits into your viewport?

See, coherence is just resizing or reimagining things so it fits into context. Context can never hold it all because we have the capacity to always imagine infinitely large things.

So I don’t fully know if it’s old context out, new context in, but could be just the right context, just enough of it, and just the right looking context so we can assemble the right worldview. This process would have to go on infinitely, and that is the world as you see and understand it.


Same. And the next step is that it must feed back into training, to form long-term memory and to continually learn.

I analogize this with sleep. Perhaps that is what is needed: 6 hours offline per day to LoRA the base model on some accumulated context from the day.

LLMs need to sleep too. Do they dream of electric sheep?

This is a fantastic insight and really has my gears spinning.

We need to cluster the AI's insights on a spatial grid hash, give it a minimap with the ability to zoom in and out, and give it the agency to try and find its way to an answer and build up confidence and tests for that answer.

coarse -> fine, refine, test, loop.

Maybe a parallel model that handles the visualization stuff. I imagine its training would look more like computer vision. Mind palace generation.

If you're stuck or your confidence is low, wander the palace and see what questions bubble up.

Bringing my current context back through the web is how I think deeply about things. The context has the authority to reorder the web if it's "epiphany grade".

I wonder if the final epiphany at the end of what we're creating is closer to "compassion for self and others" or "eat everything."


> If an LLM output is like a sculpture, then we have to sculpt it. I never did sculpting, but I do know they first get the clay spinning on a plate.

That’s pottery, not sculpture. Traditionally in sculpture you start from a block of marble or wood, but you can also make sculptures of cast bronze or welded steel (or clay, but you don’t use a spinning plate).


Thank you for the clarification. I wanted to use some kind of visual to show the model in a loop. Otherwise, I’d just have to say explicitly that the sculptor is the one in the loop, as in the person will not stop chiseling. It’s in this infinite chiseling that we get our answers (same thing as finding a limit in calculus as it approaches infinity, we will never get the discrete answer, but we will get infinitely close enough to label a discrete point confidently).

In other words, we fly as close to the sun as possible and get our measurements :)


> having 10,000 H100s just means that you can do 625 times more experiments than s1 did

I think the ball is very much in their court to demonstrate they actually are using their massive compute in such a productive fashion. My BigTech experience would tend to suggest that frugality went out the window the day the valuation took off, and they are in fact just burning compute for little gain, because why not...


This is pure speculation on my part but I think at some point a company's valuation became tied to how big their compute is so everybody jumped on the bandwagon.

I don't think you need to speculate too hard. On CNBC they are not tracking revenue, profits, or technical breakthroughs, but how much the big companies are spending (on GPUs). That's the metric!

I probably don't have to repeat it, but this is a perfect example of Goodhart's Law: when a metric is used as a target, it loses its effectiveness as a metric.

If you were a reporter who didn't necessarily understand how to value a particular algorithm or training operation, but you wanted a simple number to compare the amount of work OpenAI vs. Google vs Facebook are putting into their models, yeah, it makes sense. How many petaflops their datacenters are churning through in aggregate is probably correlated to the thing you're trying to understand. And it's probably easier to look at their financials and correlate how much they've spent on GPUs to how many petaflops of compute they need.

But when your investors are giving you more money based on how well they perceive you're doing, and their perception is not an oracle but is instead directly based on how much money you're spending... the GPUs don't actually need to do anything other than make number go up.


This feels like one of those stats they show from 1929 and everyone is like “and they didn’t know they were in a bubble?”

> but how much the big companies are spending (on gpus). That's the metric!

Burn rate based valuations!

The 2000's are back in full force!


"But tulip sales keep increasing!"

They absolutely are tracking revenues/profits on CNBC, what are you talking about?

Matt Levine tangentially talked about this during his podcast this past Friday (or was it the one before?). His point was that it's a reasonable way to value these companies according to their compute size, since those chips are very valuable. At a minimum, the chips are an asset that acts as collateral.

I hear this a lot, but what the hell. It's still computer chips. They depreciate. Short supply won't last forever. Hell, GPUs burn out. It seems like using ice sculptures as collateral, and then spring comes.

If so, wouldn't it be the first time in history that more processing power went unused?

In my experience CPU/GPU power is used up as much as possible. Increased efficiency just leads to more demand.


I think you're missing the point: H100 isn't going to remain useful for a long time, would you consider Tesla or Pascal graphic cards a collateral? That's what those H100 will look like in just a few years.

Not sure I do tbh.

Any asset depreciates over time. But they usually get replaced.

My 286 was replaced by a faster 386, and that by an even faster 486.

I’m sure you see a naming pattern there.


> Any asset depreciates over time.

That's why "those chips are very valuable" is not necessarily a good way to value companies - and it isn't, unless they can extract the value from the chips before those chips become worthless.

> But they usually get replaced.

They usually produce enough income to cover depreciation so you actually have the cash to replace them.


My 1070 was replaced by… nothing; I moved it from a Haswell box to an Alder Lake box.

Given that inference time will soon be extremely valuable with agents and <thinking> models, H100s may yet be worth something in a couple years.


And that's why such assets represents only a marginal part of valuation. (And if you look at accounting, this depreciation is usually done over three years for IT hardware, and as such most of these chips have already lost half of their accounting value in the balance sheet).

> My 286 was replaced by a faster 386 and that by an even faster 468.

How much was your 286 chip worth when you bought your 486?


Yeah, exactly! I've got some 286, 386, and 486 CPUs that I want to claim as collateral!

That is the wrong take. Depreciated and burned-out chips are replaced, and total compute value typically increases over time. Efficiency gains are also calculated and projected over time. Seasons are inevitable and cyclical. Spring might be here, but winter is coming.

Year over year gains in computing continue to slow. I think we keep forgetting that when talking about these things as assets. The thing controlling their value is the supply which is tightly controlled like diamonds.

They have a fairly limited lifetime even if progress stands still.

Last I checked AWS 1-year reserve pricing for an 8x H100 box more than pays for the capital cost of the whole box, power, and NVIDIA enterprise license, with thousands left over for profit. On demand pricing is even worse. For cloud providers these things pay for themselves quickly and print cash afterwards. Even the bargain basement $2/GPU/hour pays it off in under two years.
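
Back-of-the-envelope version, with every number below a rough assumption rather than a quote:

  gpu_hour_rate = 2.00                          # assumed "bargain basement" $/GPU/hour
  gpus_per_box = 8
  yearly_revenue = gpu_hour_rate * gpus_per_box * 24 * 365   # ~$140k/year at full utilization
  box_cost = 250_000                            # assumed all-in cost of an 8x H100 server
  payback_years = box_cost / yearly_revenue     # ~1.8 years, ignoring power, labor, idle time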

Labor! You need it to turn the bill of sale into a data center and keep it running. The bargain basement would be even cheaper otherwise...

Honestly, I don't fully understand the reason for this shortage.

Isn't it because we insist on only using the latest nodes from a single company for manufacture?

I don't understand why we can't use older process nodes to boost overall GPU making capacity.

Can't we have tiers of GPU availability?

Why is Nvidia not diversifying aggressively to Samsung and Intel, no matter the process node?

Can someone explain?

I've heard packaging is also a concern, but can't you get Intel to figure that out with a large enough commitment?


> Isn't it because we insist on only using the latest nodes from a single company for manufacture?

TSMC was way ahead of anyone else introducing 5nm. There's a long lead time porting a chip to a new process from a different manufacturer.

> I don't understand why we can't use older process nodes to boost overall GPU making capacity.

> Can't we have tiers of GPU availability?

Nvidia does this. You can get older GPUs, but more performance is better for performance-sensitive applications like training or running LLMs.

Higher performance needs better manufacturing processes.


> Year over year gains in computing continue to slow.

This isn't true in the AI chip space (yet). And so much of this isn't just about compute but about the memory.


From a per mm2 performance standpoint things absolutely have slowed considerably. Gains are primarily being eked out via process advantage (which has slowed down) and larger chips (which has an ever-shrinking limit depending on the tech used)

Chiplets have slowed the slowdown in AI, but you can see in the gaming space how much things have slowed to get an idea of what is coming for enterprise.


> It was a good way to value these companies according to their compute size since those chips are very valuable.

Are they actually, though? Presently yes, but are they actually driving ROI? Or just an asset nobody really is meaningfully utilizing, but helps juice the stocks?


I asked this elsewhere, but, I don't fully understand the reason for the critical GPU shortage.

Isn't it because NVIDIA insists on only using the latest nodes from a single company (TSMC) for manufacture?

I don't understand why we can't use older process nodes to boost overall GPU making capacity.

Can't we have tiers of GPU availability some on cutting edge nodes, others built on older Intel and Samsung nodes?

Why is Nvidia not diversifying aggressively to Samsung and Intel, no matter the process node?

Can someone explain?

I've heard packaging is also a concern, but can't you get Intel to figure that out with a large enough commitment?

(Also, I know NVIDIA has some capacity on Samsung. But why not go all out, even using Global Foundries?)


That's a great way to value a company that is going bankrupt.

But, I'm not going to value an operating construction company based on how many shovels or excavators they own. I'm going to want to see them putting those assets to productive use.


If you are a cloud provider renting them out, sure.

Otherwise, you'd better keep them humming while trying to find a business model, because they certainly aren't getting any newer as chips.


So, "No one was ever fired for ... buying more server infrastructure."

Walmart has massive, idle datacenters full of running machines doing nothing.

Mainly it points to a non-scientific "bigger is better" mentality, and the researchers probably didn't mind playing around with the power because "scale" is "cool".

Remember that the Lisp AI-lab people were working on unsolved problems on absolute potatoes of computers back in the day. We have a semblance of a solution now, but so much of it has been brute force (even if there have been real improvements in the field).

The big question is whether this insane spending has pulled the rug on real progress - whether we head into another AI winter of disillusionment, or whether there is enough real progress just around the corner to give investors hope in a post-DeepSeek valuation hangover.


We are in a phase where costs are really coming down. We had this phase from GPT2 to about GPT4 where the key to building better models was just building bigger models and training them for longer. But since then a lot of work has gone into distillation and other techniques to make smaller models more capable.

If there is another AI winter, it will be more like the dotcom bubble: lots of important work got done in the dotcom bubble, but many of the big tech companies started from the fruits of that labor in the decade after the bubble burst


Besides that, AI training (aka gradient descent) is not really an "embarrassingly parallel" problem. At some point, there are diminishing returns on adding more GPUs, even though a lot of effort is going into making it as parallel as possible.

What? It definitely is.

Data parallelism, model parallelism, parameter server to workers, MoE itself can be split up, etc.

But even if it wasn’t, you can simply parallelize training runs with slight variations in hyperparameters. That is what the article is describing.
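
The hyperparameter-sweep part really is embarrassingly parallel. A toy sketch, with `train_and_eval` as a stand-in for a full training run (nothing here is from the article):

  from itertools import product
  from multiprocessing import Pool

  def train_and_eval(cfg):
      lr, warmup_steps = cfg
      # ... kick off a training run with these hyperparameters, return a validation score ...
      return {"lr": lr, "warmup_steps": warmup_steps, "score": 0.0}

  if __name__ == "__main__":
      grid = list(product([1e-4, 3e-4, 1e-3], [100, 500]))
      with Pool(len(grid)) as pool:             # in practice: one cluster job per config
          results = pool.map(train_and_eval, grid)
      print(max(results, key=lambda r: r["score"]))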


This claim is mathematically nonsensical. It implies a more-or-less linear relationship, that more is always better. But there's no reason to limit that to H100s. Conventional servers are, if anything, rather more established in their ability to generate value, by which I mean, however much potential AI servers may have to be more important than conventional servers that they may manifest in the future, we know how to use conventional servers to generate value now.

And thus, by this logic, every company in the world should just be buying as many servers as they can get their hands on, because More Servers = More Value.

Obviously, this is not happening. It doesn't take much analysis to start listing the many and manifold reasons why. Many of those reasons will apply to GPUs as well. Just as if everything in AWS got 10x faster, overnight, this would not create a situation where everyone suddenly starts grabbing more servers in AWS. Obviously everyone would start trimming down, even if perhaps in a few years time they'd find some way to use this burst of power such that they can use more later. This can't happen overnight, though. It would take time, and not "weeks" or "months" but "years" at scale.

Incorporating the important variable of time in the analysis, if AIs become literally hundreds of times cheaper to run, today, then it is perfectly logical that the near-term demand for the hardware to run them is also going to go way, way down. However much potential AI may have, it is fairly clear looking out at the AI landscape right now that there isn't really anyone out there unlocking vast amounts of value and sitting there wringing their hands because they just can't get more GPU compute. The GPU rush has been from fear that someone will figure out how to "really" unlock AI and then they'll be stuck without the hardware to compete.

It may be the case that vastly cheaper AI will in fact be part of unlocking that value, and that as the AI industry grows it will grow faster as a result... but that's still going to be on a multi-year time frame, not a tomorrow time frame. And all those GPUs and all those valuations are still broadly based on them being valuable real soon now, not in a few years, and all those GPU purchases are on the assumption they need them now, or on a timeframe where we can't be waiting around, rather than waiting for some rounds of exponential doublings to bring price down. The hardware curve in 5 years may be higher but the curve in the next year would be lower, and by a lot.

And, you know, who's to say we're done? I doubt there's another 100x in there, but is someone going to eke out another 2x improvement? Or a 10x improvement? Making it easier to run lots of experiments makes it much more likely for that to happen. I'm skeptical of another 10x general improvement but 10x improvements for specific, important use cases I can't rule out.

Edit: I should also point out this is an extremely common pattern in technology in general. Often the very hardest part is producing a thing that does a particular task at all. Once we have it in hand, once we can use it and learn how it operates and what its characteristic operating modes are, once we can try modifications to it in the real world and see what happens, optimizing it becomes much easier, sometimes explosively so by comparison. Taking any first iteration of a tech that is practical and then trying to straight-line demand based on it is silly, in all sorts of ways and all directions. The internal combustion engine, for example, has had a myriad of impacts on the world and certainly after various improvements many, many millions if not billions of them have been made... but any company that reacted to the first couple of cars and just went ballistic buying those first-generation internal combustion engines would have lost everything, and rather quickly.


For starters every employee has an H100 under their desk.

The part about taking control of a reasoning model's output length using <think></think> tags is interesting.

> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".

I had found a few days ago that this let you 'inject' your own CoT and jailbreak it easier. Maybe these are related?

https://pastebin.com/G8Zzn0Lw

https://news.ycombinator.com/item?id=42891042#42896498
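
For anyone who wants to play with the trick quoted above, here is roughly what it looks like at the text level. `complete` is a hypothetical wrapper around any completion API that accepts stop strings - this is a sketch of the idea, not the s1 authors' code:

  def think_longer(complete, prompt, extensions=2):
      trace = "<think>"
      for i in range(extensions + 1):
          # generation halts right before it would emit "</think>"
          trace += complete(prompt + trace, stop=["</think>"])
          if i < extensions:
              trace += "\nWait"        # instead of closing the block, nudge it to re-check
      return complete(prompt + trace + "</think>\n")   # now let it write the final answer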


This even points to a reason why OpenAI hides the "thinking" step: it would be too obvious that the context is being manipulated to induce more thinking.

It's weird that you need to do that at all - couldn't you just reject that token and use the next most probable one?

In case you’re not sure what S1 is, here is the original paper: https://arxiv.org/html/2501.19393v1

It's linked in the blog post, too. In the first sentence, actually, but for some reason the author never bothered to attach the name to it. As if keeping track of o1, 4o, r1, r2d2, wasn't exhausting enough already.

> for some reason the author never bothered to attach the name to it

Respect for his readers’ intelligence, maybe.


it's also the first link in the article's first sentence

Good call, I must have missed it. I read the whole blog then went searching for what S1 was.

  To enforce a minimum, we suppress the generation of the end-of-thinking token delimiter and optionally append the string “Wait” to the model’s current reasoning trace to encourage the model to reflect on its current generation.
Does this mean that the end-of-thinking delimiter is a single token? Presumably </think> or similar wasn't a single token for the base model. Did they just pick a pair of uncommon single-token symbols to use as delimiters?

EDIT: Never mind, end of thinking is represented with <|im_start|> followed by the word 'answer', so the code dynamically adds/removes <|im_start|> from the list of stop tokens.
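
So a sketch of that mechanism at the sampling-loop level might look like the following. `sample_next` and `tokenizer` are stand-ins (the real code drives an inference engine), not the actual s1/vLLM implementation:

  END_OF_THINKING = "<|im_start|>"      # per the EDIT above, followed by the word "answer"

  def generate_with_min_thinking(sample_next, tokenizer, ids, min_think_tokens, max_tokens=32768):
      eot_id = tokenizer.convert_tokens_to_ids(END_OF_THINKING)
      for step in range(max_tokens):
          # while under budget, the delimiter is banned (its logit masked to -inf inside sample_next)
          banned = [eot_id] if step < min_think_tokens else []
          tok = sample_next(ids, banned_token_ids=banned)
          ids.append(tok)
          if tok == eot_id:             # budget met and the model chose to stop thinking
              break
      return ids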


I don't know what R1 is either

It’s the DeepSeek reasoning model.

This feels just like telling a constraint satisfaction engine to backtrack and find a more optimal route through the graph. We saw this 25 years ago with engines like PROVERB doing directed backtracking, and with adversarial planning when automating competitive games.

Why would you control the inference at the token level? Wouldn’t the more obvious (and technically superior) place to control repeat analysis of the optimal path through the search space be in the inference engine itself?

Doing it by saying “Wait” feels like fixing dad’s laptop over a phone call. You’ll get there, but driving over and getting hands on is a more effective solution. Realistically, I know that getting “hands on” with the underlying inference architecture is way beyond my own technical ability. Maybe it’s not even feasible, like trying to fix a cold with brain surgery?


What would a superior control approach be? It's not clear to me how to get an LLM to be an LLM if you're not doing stochastic next token prediction. Given that, the model itself is going to know best how to traverse its own concept space. The R1 chain of thought training encourages and develops exactly that capability. Still, you want that chain of thought to terminate and not navel gaze endlessly.

So how to externally prod it to think more when it does terminate? Replacing thought termination with a linguistic signifier of continued reasoning plus novel realization seems like a charmingly simple, principled, and general approach to continue to traverse concept space.


This is the difference between science and engineering. What they have done is engineering. If the result is 90% of the way there with barely any effort, its best to move on to something else that may be low hanging fruit than to spend time chasing that 10%.

Totally agreed this is not the solution we are looking for; in fact, it's the only solution we have in our hands right now. It's a good step forward.

S1 has no relationship to R1. It's a marketing campaign for an objectively terrible and unrelated paper.

S1 is fully supervised by distilling Gemini. R1 works by reinforcement learning with a much weaker judge LLM.

They don't follow the same scaling laws. They don't give you the same results. They don't have the same robustness. You can use R1 for your own problems. You can't use S1 unless Gemini works already.

We know that distillation works and is very cheap. This has been true for a decade; there's nothing here.

S1 is a rushed hack job (they didn't even run most of their evaluations, with the excuse that the Gemini API is too hard to use!) that probably existed before R1 was released and was then pivoted into this mess.


Love the look under the hood! Especially discovering that some AI hack I came up with is how the labs are doing things too.

In this case, I was also forcing R1 to continue thinking by replacing </think> with “Okay,” after augmenting reasoning with web search results.

https://x.com/0xmmo/status/1886296693995646989


For all the hype about thinking models, this feels much like compression in terms of information theory instead of a "takeoff" scenario.

There is a finite amount of information stored in any large model; the models are really good at presenting the correct information back, and adding thinking blocks made them even better at doing that. But there is a cap to that.

Just like how you can compress a file by a lot, there is a theoretical maximum to the amount of compression before it starts becoming lossy. There is also a theoretical maximum of relevant information from a model regardless of how long it is forced to think.


I think an interesting avenue to explore is creating abstractions and analogies. If a model can take a novel situation and create an analogy to one that it is familiar with, it would expand its “reasoning” capabilities beyond its training data.

I think this is probably accurate and what remains to be seen is how "compressible" the larger models are.

The fact that we can compress a GPT-3 sized model into an o1 competitor is only the beginning. Maybe there is even more juice to squeeze there?

But even more, how much performance will we get out of o3-sized models? That is what is exciting, since they are already performing near PhD level on most evals.


My thinking (hope?) is that the reasoning models will be more like how a calculator doesn't have to "remember" all the possible combinations of addition, multiplication, etc. for all the numbers, but can actually compute the results.

As reasoning improves the models could start with a basic set of principles and build from there. Of course for facts grounded in reality RAG would still likely be the best, but maybe with enough “reasoning” a model could simulate an approximation of the universe well enough to get to an answer.


This thing that people are calling “reasoning” is more like rendering to me really, or multi pass rendering. We’re just refining the render, there’s no reasoning involved.

That was succinct and beautifully stated. Thank-you for the "Aha!" moment.

Hah. You should check out my other comment on how I think we’re obviously in a simulation (remember, we just need to see a good enough render).

LLMs are changing how I see reality.


How are you defining "reasoning"?

Because I see these sorts of gnostic assertions about LLMs all the time - claims that they "definitely aren't doing <thing we normally apply to meat-brains>" - justified only by gesturing at the technical things the model is doing, with no attempt to actually justify the negative assertion.

It often comes across as privileged reason trying to justify that of course the machine isn't doing some ineffable thing only meat-brains do.


From my other ridiculous comment, as I do entertain simulation theory in my understanding of God:

Reasoning as we know it could just be a mechanism to fill in gaps in obviously sparse data (we absolutely do not have all the data to render reality accurately, you are seeing an illusion). Go reason about it all you want.

The LLM doesn’t know anything. We determine what output is right, even if the LLM swears the output is right. We “reason” about it, I guess? Well in this case the whole “reasoning” process is to simply get an output that looks right, so what is reasoning in our case?

Let me just go one ridiculous level lower. If I measure every frame the Hubble telescope takes, and I measure with a simple ruler the distances between things, frame by frame, I can “reason” out some rules of the universe (planetary orbits). In this “reasoning” process, the very basic question of “well why, and who made this” immediately arises, so reasoning always leads to the fundamental question of God.

So, yeah. We reason to see God, because that’s all we’re seeing, everything else is an illusion. Reasoning is inextricably linked to God, so we have to be very open minded when we ask what is this machine doing.


Honestly, I was going to nitpick, but this definition scratches an itch in my brain so nicely that I'll just complement it as beautiful. "We reason to see God", I love it.

(Also, if I might give a recommendation, you might be the type of person to enjoy Unsong by Scott Alexander https://unsongbook.com/)


Thank you for the suggestion and nice words. Trust me, I have to sit here and laugh at the stuff I write too, because I wasn’t always a believer. So it’s a little bit of a trip for me too, I’m still exploring my own existence.

Yes.

Before LLMs we had n-gram language models. Many tasks like speech recognition worked as beam search in the graph defined by the n-gram language model. You could easily get huge accuracy gains simply by pruning your beam less.

s1 reminds me of this. You can always trade off latency for accuracy. Given that these LLMs are much more complex than good old n-grams, we're just discovering how to make this trade.
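
For anyone who never touched that era: the knob being described is just the beam width. A generic sketch, where `next_scores(seq)` stands in for whatever model (n-gram LM, acoustic + language model, LLM) scores the next symbol - not any particular recognizer's code:

  import math

  def beam_search(next_scores, vocab, start, end, beam_width, max_len=50):
      beams = [([start], 0.0)]                          # (sequence, log-probability)
      for _ in range(max_len):
          candidates = []
          for seq, logp in beams:
              if seq[-1] == end:                        # finished hypotheses carry over unchanged
                  candidates.append((seq, logp))
                  continue
              probs = next_scores(seq)                  # dict: symbol -> probability (assumed > 0)
              for sym in vocab:
                  candidates.append((seq + [sym], logp + math.log(probs[sym])))
          # pruning step: a wider beam keeps more hypotheses alive (slower, usually more accurate)
          beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
          if all(seq[-1] == end for seq, _ in beams):
              break
      return max(beams, key=lambda b: b[1])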


Let me carry that concept, “learning to do this trade”, it’s a new trade.

I don't believe computer science has the algorithms to handle this new paradigm. Everything was about sequential, deterministic outputs and clever ways to compute them fast, and that toolkit is of little use at the moment. We need new thinkers on how not to think sequentially, or how not to think about the universe in such a small way.

Verifying input/output pairs is the old way. We need to understand differently going forward.


We could see it the other way around: what we call "reasoning" may actually be some kind of multipass rendering, whether it is performed by computers or human brains.

Yes, of course. The implications are awesome.

Which is related to multistage/ hierarchical/coarse-to-fine optimization, which is a pretty good way to find the global optimum in many problem domains.

sshhhh, let the money flow

"...there’s no reasoning involved...wait, could I just be succumbing to my heuristic intuitions of what is (seems to be) true....let's reconsider using System 2 thinking..."

Or there is no objective reality (well there isn’t, check out the study), and reality is just a rendering of the few state variables that keep track of your simple life.

A little context about you:

- person

- has hands, reads HN

These few state variables are enough to generate a believable enough frame in your rendering.

If the rendering doesn’t look believable to you, you modify state variables to make the render more believable, eg:

Context:

- person

- with hands

- incredulous demeanor

- reading HN

Now I can render you more accurately based on your “reasoning”, but truly I never needed all that data to see you.

Reasoning as we know it could just be a mechanism to fill in gaps in obviously sparse data (we absolutely do not have all the data to render reality accurately, you are seeing an illusion). Go reason about it all you want.


Is this a clever rhetorical trick to make it appear that your prior claim was correct?

If not: what am I intended to take away from this? What is its relevance to my comment?


You made a joke about questioning reality, I simply entertained it. You can do whatever you want with it, wasn’t a slight at all.

It may have been in the form of a joke, but I certainly wasn't joking.

I think it is interesting what actions cannot be done by humans.


I wasn’t joking either. Things are just getting started with this AI stuff, and I feel like programmers will experience that “de ja vu” phenomenon that they talk about in the Matrix, that eerie feeling something isn’t right.

Look, why have game developers spent so much time lazy loading parts of the game world? Very rarely do they just load the whole world, even in 2025. See, the worlds get bigger, so even as the tech gets better, we will always lazy load worlds in.

It’s a context issue right? Developers have just recently been given this thing called “context”.

But yeah man, why do we think just because we walked from our house to the supermarket that this reality didn’t lazy load things. That’s how programmers have been doing it all along …

Anyways


A more parsimonious explanation: consciousness is generative, like an LLM. And, according to cultural conditioning, this generated scenario is referred to as reality.

I like this version for at least two reasons:

1. It is 100% compliant with large quantities of scientific findings (psychology and neuroscience), whereas I believe yours has a conservation of mass problem at least

2. Everyone dislikes it at least in certain scenarios (say, when reference is made to it during an object level disagreement)


> "Note that this s1 dataset is distillation. Every example is a thought trace generated by another model, Qwen2.5"

The traces are generated by Gemini Flash Thinking.

8 hours of H100 is probably more like $24 if you want any kind of reliability, rather than $6.


"You can train a SOTA LLM for $0.50" (as long as you're distilling a model that cost $500m into another pretrained model that cost $5m)

The original statement stands, if what you are suggesting in addition to it is true. If the initial one-time investment of $505m is enough to distill new SOTA models for $0.50 a piece, then the average cost for subsequent models will trend toward $0.50.

That's absolutely fantastic, because if you have 1 good idea that's additive to the SOTA, you can test it for a dollar, not millions

I work at a mid-sized research firm, and there’s this one coworker who completely turned her performance around. A complete 180. A few months ago, she was one of the slowest on the team, now she’s always the first to get her work done. I was curious, so I asked her what changed. She just laughed and said she just used an AI tool that she randomly found on YouTube to do 90% of her work.

We’ve been working on a project together, and every morning for the past two months, she’s sent me clean, perfectly organized Fed data. I assumed she was just working late to get ahead. Turns out, she automated the whole thing. She even scheduled it to send automatically. Tasks that used to take hours (gathering thousands of rows of data, cleaning them, running regression analyses, time series, hypothesis tests, etc.) she now completes almost instantly. Everything. Even random things like finding discounts for her Pilates class. She just needs to check and make sure everything is good. She’s not super technical, so I was surprised she could do these complicated workflows, but the craziest part is that she just prompted the whole thing. She just types something like “compile a list of X, format it into a CSV, and run X analysis” or “go to Y, see what people are saying, give me background on the people saying Z” and it just works. She’s even joking about connecting it to the office printer. I’m genuinely baffled. The barrier to effort is gone.

Now we’ve got a big market report due next week, and she told me she’s planning to use DeepResearch to handle it while she takes the week off. It’s honestly wild. I don’t think most people realize how doomed knowledge work is.


> having 10,000 H100s just means that you can do 625 times more experiments than s1 did

The larger the organisation, the fewer experiments you can afford to do. Employees are mostly incentivised to get something done quickly enough not to be fired in this job market. They know that the higher-ups would cut them loose for the sake of temporary gains. Rush this deadline, ship that feature, produce something that looks OK enough.


All you need is attention and waiting. I feel like a zen monk.

S1 (and R1 tbh) has a bad smell to me or at least points towards an inefficiency. It's incredible that a tiny number of samples and some inserted <wait> tokens can have such a huge effect on model behavior. I bet that we'll see a way to have the network learn and "emerge" these capabilities during pre-training. We probably just need to look beyond the GPT objective.

I agree, but LLMs in general have a horrendously bad smell in terms of efficiency. s1 and r1 are just proving it.

The models' latent spaces are insanely large. The vast, vast majority pretty much has to be irrelevant and useless, it's just that the training commandeers random fragments of that space to link up the logic they need and it's really hard to know which of the weights are useless, which are useful but interchangeable with other weights, and which are truly load-bearing. You could probably find out easily by testing the model against every possible thing you ever might want it to do, just as soon as someone gets around to enumerating that non-enumerable collection of tasks.

These bogus <wait> tokens kind of demonstrate that the models are sort of desperate to escape the limitations imposed by the limited processing they're allowed to do -- they'll take advantage of thinking time even when it's provided in the silliest manner possible. It's amazing what you can live with if it's all you have!

(Apologies for the extended anthropomorphizing.)


can you please elaborate on the wait tokens? what's that? how do they work? is that also from the R1 paper?

The same idea is in both the R1 and S1 papers (<think> tokens are used similarly). Basically they're using special tokens to mark in the prompt where the LLM should think more/revise the previous response. This can be repeated many times until some stop criteria occurs. S1 manually inserts these with heuristics, R1 learns the placement through RL I think.
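To make that concrete, here's a rough sketch of the budget-forcing loop in Python. The `next_token` and `complete` calls are hypothetical stand-ins for whatever inference API you use, not the authors' code:

    # Rough sketch of s1-style "budget forcing"; `next_token` and `complete`
    # are hypothetical stand-ins for a real inference API.
    def generate_with_budget(model, prompt, min_think=512, max_think=4096):
        output = prompt + "<think>"
        thought = 0
        while True:
            token = model.next_token(output)        # sample one token (hypothetical)
            if token == "</think>":
                if thought < min_think:
                    output += " Wait"               # suppress end-of-thinking, keep reasoning
                    continue
                break                               # budget satisfied, let it stop
            if thought >= max_think:
                break                               # trim: force thinking to end here
            output += token
            thought += 1
        return model.complete(output + "</think>")  # generate the final answer (hypothetical)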

? They're not really special tokens

i'm not actually sure whether they're special tokens in the sense of being in the vocabulary

<think> might be; I think "wait" is tokenized like any other word in the pretraining

There's a decent explanation in the article, just FYI.



When you're only used to ollama, how do I go about using this model?

I think we need to wait for someone to convert it into a GGUF file format.

However, once that happens, you can run it (and any GGUF model) from Hugging Face![0]

[0] https://huggingface.co/docs/hub/en/ollama



I ran it, so far it seems like a pretty good model, especially locally.

oh god, this is terrible!

I just said "Hello!" and it went off the rails.


why how what? can you add a sample prompt with output ?

In my original prompt, it gave me back multiple greetings in various languages. In this example, it gave me back just multiple greetings:

>>> hello!

Hello there! I'm here to help with any questions you might have. How can I assist you today? If you're asking a question, feel free to provide some details about what you need help with. I'll do my best to answer your question or guide you through the process. Just let me know!

<|im_start|>think

The user has asked "hello! " and provided no further information. Answer here is a direct response to the user's request for information, which was essentially just a greeting. The appropriate response is to greet back and offer assistance.

<|im_start|>answer

Answer:

Hello! How can I assist you today? If you have a question or need help with something specific, feel free to ask. I'm here to support you!

<|im_start|>think

<|im_start|>answer

The user's message was "hello!" and there was no other information provided. The appropriate response is a greeting back and an offer of assistance.

<|im_start|>answer

Answer:

Hello! How can I assist you today? If you have a question or need help with something specific, feel free to ask. I'm here to support you!

<|im_start|>think

The user has greeted with "hello!" An appropriate response involves acknowledging the greeting and offering assistance. This could involve asking how we can help them or what they'd like guidance on.

<|im_start|>answer

Answer:

Hello! How can I assist you today? If you have a question or need help with something specific, feel free to ask. I'm here to support you!


You can load the safetensors with Ollama; you just have to provide a Modelfile, or wait for someone to do it. It will in theory also quantize it for you, as I guess most people cannot load a 129 GB model...

> Why did it cost only $6? Because they used a small model and hardly any data.

> After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that’s needed to achieve o1-preview performance on a 32B model. Adding data didn’t raise performance at all.

> 32B is a small model, I can run that on my laptop. They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6.


I have a bunch of questions, would love for anyone to explain these basics:

* The $5M DeepSeek-R1 (and now this cheap $6 R1) are both based on very expensive oracles (if we believe DeepSeek-R1 queried OpenAI's model). If these are improvements on existing models, why is this being reported as decimating training costs? Isn't fine-tuning already a cheap way to optimize? (maybe not as effective, but still)

* The R1 paper talks about improving one simple game - Countdown. But the original models are "magic" because they can solve a nearly uncountable number of problems and scenarios. How does the DeepSeek / R1 approach scale to the same gigantic scale?

* Phrased another way, my understanding is that these techniques are using existing models as black-box oracles. If so, how many millions/billions/trillions of queries must be probed to replicate and improve the original dataset?

* Is anything known about the training datasets used by DeepSeek? OpenAI used presumably every scraped dataset they could get their hands on. Did DS do the same?


If what you say is true, and distilling LLMs is easy and cheap, and pushing the SOTA without a better model to rely on is dang hard and expensive, then that means the economics of LLM development might not be attractive to investors - spending billions to have your competitors come out with products that are 99% as good, and cost them pennies to train, does not sound like a good business strategy.

What I still don’t understand is how one slurps out an entire model (closed source) though.

Does the deepseek paper actually say what model it’s trained off of, or do they claim the entire thing is from scratch?


AFAIK DeepSeek have not publicly acknowledged training their model on OpenAI output - the OpenAI people have alleged that they did.

At any rate, I don't think distillation involves 'slurping out' the whole model, as I understand it, it means providing the other model's output as training data input to create your new model. Maybe analogous to an expert teaching a novice how to do something by providing carefully selected examples, without having to expose the novice to all the blind alleys the expert went down to achieve mastery.


> If these are improvements on existing models, why is this being reported as decimating training costs?

Because that's what gets the clicks...

Saying they spent a boatload of money on the initial training + iteration + final fine-tuning isn't as headline grabbing as "$5 million trained AI beats the pants off the 'mericans".


It appears that someone has implemented a similar approach for DeepSeek-R1-Distill-Qwen-1.5B: https://reddit.com/r/LocalLLaMA/comments/1id2gox/improving_d...

I hope it gets tested further.


It just occurred to me that if you squint a little (just a little!) the S1 paper just provided the scientific explanation for why Twitter's short tweets mess you up and books are good for you.

Kidding, but not really. It's fascinating how we seem to be seeing a gradual convergence of machine learning and psychology.


> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait". It’ll then begin to second guess and double check its answer. They do this to trim or extend thinking time (trimming is just abruptly inserting "</think>")

I know some are really opposed to anthropomorphizing here, but this feels eerily similar to the way humans work, i.e. if you just dedicate more time to analyzing and thinking about the task, you are more likely to find a better solution

It also feels analogous to navigating a tree: the more time you have to explore the nodes, the bigger the space you'll have covered, and hence the higher the chance of finding a more optimal solution

At the same time, if you have "better intuition" (better training?), you might be able to find a good solution faster, without needing to think too much about it


What’s missing in that analogy is that humans tend to have a good hunch about when they have to think more and when they are “done”. LLMs seem to be missing a mechanism for that kind of awareness.

LLMs actually do have such a hunch; they just don't utilize it. You can literally ask them "Would you do better if you started over?" and start over if the answer is yes. This works.

https://arxiv.org/abs/2410.02725
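As a sketch of what that looks like in practice (using a hypothetical `chat` helper, not the paper's exact protocol):

    # Ask the model whether a fresh attempt would be better, and regenerate if so.
    # `chat` is a hypothetical helper that returns the model's text response.
    def answer_with_restarts(chat, question, max_attempts=3):
        answer = chat(question)
        for _ in range(max_attempts - 1):
            verdict = chat(
                f"Question: {question}\nYour answer: {answer}\n"
                "Would you do better if you started over? Answer YES or NO."
            )
            if "YES" not in verdict.upper():
                break                    # the model is satisfied with its answer
            answer = chat(question)      # start over with a fresh attempt
        return answer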


Great observation. Maybe an additional “routing model” could be trained to predict when it’s better to think more vs just using the current result

From the S1 paper:

> Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end

I'm feeling proud of myself that I had the crux of the same idea almost 6 months ago, before reasoning models came out (and a bit disappointed that I didn't take this idea further!). Basically, during inference time you have to choose the next token to sample. Usually people just sample the distribution using the same sampling rules at each step... but you don't have to! You can selectively insert words into the LLM's mouth based on what it said previously or what it wants to say, and decide "nah, say this instead". I wrote a library so that you could sample an LLM using llama.cpp in Swift and write rules to sample tokens and force tokens into the sequence depending on what was sampled. https://github.com/prashanthsadasivan/LlamaKit/blob/main/Tes...

Here, I wrote a test that asks Phi-3 instruct "how are you", and if it tried to say "as an AI I don't have feelings" or "I'm doing ", I forced it to say "I'm doing poorly" and refuse to help, since it was always so dang positive. It sorta worked, though the instruction-tuned models REALLY want to help. But at the time I just didn't have a great use case for it - I had thought about a more conditional extension to llama.cpp's grammar sampling (you could imagine changing the grammar based on previously sampled text), or even just making it go down certain paths, but I lost steam because I couldn't describe a killer use case for it.

This is that killer use case! Forcing it to think more is such a great use case for inserting ideas into the LLM's mouth, and I feel like there must be more to this idea to explore.


So what you mean is that if the current train of thought is going in a direction we find to be not optimal, we could just interrupt it and hint it into the right direction?

That sounds very useful, albeit a bit different than how current "chat" implementations would work, as in you could control both ways of the conversation.


> and a bit disappointed that I didn't take this idea further!

Don’t be, that’s pretty common.

https://en.wikipedia.org/wiki/Multiple_discovery


CoT is widely known technique - what became fully novel was the level of training embedding CoT via RL with optimal reward trajectory. DeepSeek took it further due to their compute restriction to find memory, bandwidth, parallelism optimizations in every part (GRPO - reducing memory copies, DualPipe for data batch parallelism between memory & compute, kernel bypasses (PTX level optimization), etc.) - then even using MoE due to sparse activation and further distillation. They operated on the power scaling laws of parameters & tokens but high quality data circumvents this. I’m not surprised they utilized synthetic generation from OpenAI or copied the premise of CoT, but where they should get the most credit is their infra level & software level optimizations.

With that being said, I don't think the benchmarks we currently have are strong enough, and the next frontier models are yet to come. I'm sure at this point U.S. LLM research firms understand their lack of infra/hardware optimizations (they just threw compute at the problem) and will begin paying closer attention. Their RL-level and parent training will become even greater, while the newly freed resources can go toward the sub-optimizations that have traditionally been avoided due to computational overhead.


The point about agents to conceal access to the model is a good one.

Hopefully we won’t lose all access to models in future


Well dang, I am great at tinkering like this because I can’t remember things half the time. I wonder if the ADHD QA guy solved this for the devs?

>it can run on my laptop

Has anyone run it on a laptop (unquantized)? Disk size of the 32B model appears to be 80GB. Update: I'm using a 40GB A100 GPU. Loading the model took 30GB vRAM. I asked a simple question "How many r in raspberry". After 5 minutes nothing got generated beyond the prompt. I'm not sure how the author ran this on a laptop.


32B models are easy to run on 24GB of RAM at a 4-bit quant.

It sounds like you need to play with some of the existing 32B models with better documentation on how to run them if you're having trouble, but it is entirely plausible to run this on a laptop.

I can run Qwen2.5-Instruct-32B-q4_K_M at 22 tokens per second on just an RTX 3090.


My question was about running it unquantized. The author of the article didn't say how he ran it. If he quantized it, then saying he ran it on a laptop is not news.

I can't imagine why anyone would run it unquantized, but there are some laptops with more than the 70GB of RAM that would be required. It's not that it can't be done... it's just that quantizing to at least 8-bit seems to be standard practice these days, and DeepSeek has shown that it's even worth training at 8-bit resolution.
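For anyone curious, the back-of-the-envelope weight memory (ignoring KV cache and runtime overhead) works out roughly like this:

    # Approximate memory for the weights alone of a 32B-parameter model,
    # ignoring KV cache, activations, and runtime overhead.
    params = 32e9
    for name, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
    # fp16/bf16: ~64 GB  -> needs a very large-RAM laptop or multiple GPUs
    # 8-bit:     ~32 GB  -> fits a 40-48 GB GPU or a big-RAM laptop
    # 4-bit:     ~16 GB  -> fits in 24 GB of (V)RAM with room for context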

Maybe he has a 64GB laptop. Also he said he can run it, not that he actually tried it.

At 6 dollars per run, I'm tempted to try to figure out how to replicate this. I'd like to try some alternatives to "wait" - e.g. "double checking..." Or write my own chains of thought.

Like the ones they tested?

Yes, that is what "replicate" with my own ideas means.

I think a lot of people in the ML community were excited for Noam Brown to lead the O series at OpenAI because intuitively, a lot of reasoning problems are highly nonlinear i.e. they have a tree-like structure. So some kind of MCTS would work well. O1/O3 don’t seem to use this, and DeepSeek explicitly mentioned difficulties training such a model.

However, I think this is coming. DeepSeek mentioned it was hard to learn a value model for MCTS from scratch, but this doesn’t mean we couldn’t seed it with some annotated data.


Do you have a reference for us to check? - "DeepSeek explicitly mentioned difficulties training such a model."

Section 4.2: Unsuccessful attempts

https://arxiv.org/pdf/2501.12948



> I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.

Couldn't they just start hiding the thinking portion?

It would be easy for them to do this. Currently, they already provide one-sentence summaries for each step of the thinking. I think users would be fine, or at least stay, if it were changed to provide only that.


They hid it and deepseek came up with R1 anyway, with RL on only results and not even needing any of the thinking tokens that OpenAI hid.

Which is still the funniest and most interesting result in AI so far IMO. Fascinating, but sort of makes intuitive sense too!

I found it interesting but the "Wait" vs. "Hmm" bit just made me think we don't really understand our own models here. I mean, sure, it's great that they measured and found something better, but it's kind of disturbing that you have to guess.

This argument that the data centers and all the GPUs will be useful even in the context of Deepseek doesn't add up... basically they showed that it's diminishing returns after a certain amount. And so far it didn't make OpenAI or Anthropic go faster, did it?

What is the source for the diminishing returns? I would like to read about it as I have only seen papers referring to the scaling law still applying.

Deepseek R1 uses <think> and "wait", and you can see it second-guessing itself in the thinking tokens. How does the model know when to wait?

These reasoning models feed into OP's last point about NVidia and OpenAI data centers not being wasted, since reasoning models require more tokens and faster tps.


From playing around they seem to 'wait' when there's a contradiction in their logic.

And I think the second point is due to The Market thinking there is no need to spend ever increasing amounts of compute to get to the next level of AI overlordship.

Of course the Jevons paradox is also all over the news these days..


Probably when it would expect a human to second guess himself, as shown in literature and maybe other sources.

Cool trick. But is this better than reinforcement learning, where the LLM decides for itself the optimal thinking time for each prompt?

Maybe this is why OpenAI hides o1/o3 reasoning tokens - constraining output at inference time seems to be easy to implement for other models and others would immediately start their photocopiers.

It also gave them a few months to recoup costs!


LLMs still feel so magical. It’s like quantum physics. “I get it” but I don’t. Not really. I don’t think I ever will. Perhaps a human mind can only comprehend so much.

Hmmm, 1 + 1 equals 3. Alternatively, 1 + 1 equals -3.

Wait, actually 1 + 1 equals 1.


As one with teaching experience, the idea of asking a student "are you sure about that?" is to get them to think more deeply rather than just blurting a response. It doesn't always work, but it generally does.

It works because the question itself is a hint born of knowledge. “Are you sure about that” is a polite way to say “that answer is wrong, try again”. Students know that, so instead of doubling down will redo their work with the assumption they made a mistake. It is much rarer to ask the question when the answer is correct, and in fact doing so is likely to upset the learner because they had to redo the work for no reason.

If you want a true comparison, start asking that question every time and then compare. My hypothesis is students would start ignoring the prompt and answering “yes” every time to get on with it.


> They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6

Running where? H100s are usually over $2/hr; that's closer to $25.
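The GPU-hours are fixed by the setup; the dollar figure depends entirely on the hourly rate you assume (the rates below are purely illustrative):

    # 16 H100s for 26 minutes; the rate per GPU-hour is the only variable.
    gpu_hours = 16 * 26 / 60                 # ~6.9 GPU-hours
    for rate in (0.85, 2.00, 3.50):          # illustrative $/GPU-hour rates
        print(f"${rate:.2f}/hr -> ${gpu_hours * rate:.2f}")
    # $0.85/hr -> ~$5.89   (cheap spot/interruptible pricing)
    # $2.00/hr -> ~$13.87  (typical on-demand pricing)
    # $3.50/hr -> ~$24.27  (pricier, more reliable capacity)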


Qwen's QvQ-72B produces many more "wait"s than the other LLMs with CoT I've tried; maybe they've already used that trick to some extent?


> even the smartest people make hundreds of tiny experiments

This is the most important point, and why DeepSeek’s cheaper training matters.

And if you check the R1 paper, they have a section for “things that didn’t work”, each of which would normally be a paper of its own but because their training was so cheap and streamlined they could try a bunch of things.


Thank you for this, I really appreciate this article and I learned a bunch!

Great article! I enjoyed reading it

This isn't RLVR, so it's sorta uninteresting; they're just distilling work that's already been done

Anyone else want more articles on how those benchmarks are created and how they work?

Those models can be trained in ways tailored to produce good results on specific benchmarks, making them far less general than they seem. No accusation from me, but I'm skeptical of all the recent so-called 'breakthroughs'.


Is it me, or are the affiliations totally missing in the cited paper? The authors look like they come from a mix of UK/US institutions

Wait

> If you believe that AI development is a prime national security advantage, then you absolutely should want even more money poured into AI development, to make it go even faster.

This, this is the problem for me with people deep in AI. They think it’s the be-all and end-all for everything. They have the vision of the ‘AI’ they’ve seen in movies in mind, see the current ‘AI’ being used, and to them it’s basically almost the same; their brains are mentally bridging the concepts and saying it’s only a matter of time.

To me, that’s stupid. I observe the more populist and socially appealing CEOs of these VC startups (Sam Altman being the biggest, of course.) just straight up lying to the masses, for financial gain, of course.

Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

And the enthusiasm of the people who are into it feeds into those who aren’t aware of it in the slightest; they see you can chat with a ‘robot’, they hear all this hype from their peers, and they buy into it. We are social creatures after all.

I think using any of this in a national security setting is stupid, wasteful and very, very insecure.

Hell, if you really care about being ahead, pour 500 billion dollars into quantum computing so you can try to break current encryption. That’ll get you so much further than this nonsensical bs.


You can choose to be somewhat ignorant of the current state in AI, about which I could also agree that at certain moments it appears totally overhyped, but the reality is that there hasn't been a bigger technology breakthrough probably in the last ~30 years.

This is not "just" machine learning because we have never been able to do things which we are today and this is not only the result of better hardware. Better hardware is actually a byproduct. Why build a PFLOPS GPU when there is nothing that can utilize it?

If you spare yourself some time and read through the actual (scientific) papers of multiple generations of LLM models, the first one being from Google ~~not DeepMind~~ in 2017, you might get to understand that this is no fluff.

And I'm speaking this from a position of a software engineer, without bias.

The reason why all this really took off with so much hi-speed is because of the not quite expected results - early LLM experiments have shown that "knowledge" with current transformers architecture can linearly scale with regards to the amount of compute and training time etc. That was very unexpected and to this day scientists do not have an answer why this even works.

So, after reading a bunch of material, I am inclined to think that this is something different. The future of loading a codebase into the model and asking it to explain the code to me or fix bugs has never been so close and realistic. For better or worse.


This line of thinking doesn't really correspond to the reason Transformers were developed in the first place, which was to better utilize how GPUs do computation. RNNs were too slow to train at scale because you had to sequentially compute the time steps, Transformers (with masking) can run the input through in a single pass.

It is worth noting that the first "LLM" you're referring to was only 300M parameters, but even then the amount of training required (at the time) was such that training a model like that outside of a big tech company was infeasible. Obviously now we have models that are in the hundreds of billions / trillions of parameters. The ability to train these models is directly a result of better / more hardware being applied to the problem, as well as the Transformer architecture being specifically designed to better conform with parallel computation at scale.

The first GPT model came out ~8 years ago. I recall when GPT-2 came out they initially didn't want to release the weights out of concern for what the model could be used for; looking back now, that's kind of amusing. However, fundamentally, all these models have the same setup as what was used then: decoder-based Transformers. They are just substantially larger, trained on substantially more data, with substantially more hardware.


What line of thinking are you referring to?

Transformers were aimed at solving the "context" problem, and the authors, aware that RNNs don't scale at all and don't solve that particular problem either, had to come up with an algorithm that overcomes both of those issues. It turned out that the compute-scale of self-attention was the crucial ingredient to solve the problem, something that RNNs were totally incapable of.

They modeled the algorithm to run on the hardware they had available at the time, but the hardware developed afterwards was a direct consequence, or as I called it a byproduct, of transformers proving themselves able to continuously scale. Had that not been true, we wouldn't have all those iterations of NVidia chips.

So, although one could say that the NVidia chip design is what enabled the transformers' success, one could also say that we wouldn't have those chips if transformers hadn't proven themselves to be so damn efficient. And I'm inclined to think the latter.


> This is not "just" machine learning because we have never been able to do things which we are today and this is not only the result of better hardware. Better hardware is actually a byproduct. Why build a PFLOPS GPU when there is nothing that can utilize it?

This is the line of thinking I'm referring to.

The "context" problem had already been somewhat solved. The attention mechanism existed prior to Transformers and was specifically used on RNNs. They certainly improved it, but innovation of the architecture was making it computation efficient to train.

I'm not really following your argument. Clearly you're acknowledging that it was first the case that, with the hardware at the time, researchers demonstrated that simply scaling up training with more data yielded better models. The fact that hardware was then optimized for these architectures only reinforces this point.

All the papers discussing scaling laws point to the same thing, simply using more compute and data yields better results.

> this is not only the result of better hardware

Regarding this in particular: a majority of the improvement from GPT-2 to GPT-4 was simply training at a much larger scale. That was enabled by better hardware, and lots of it.


> the innovation of the architecture was making it computationally efficient to train.

and

> researchers demonstrated that simply scaling up training with more data yielded better models

and

> The fact that hardware was then optimized for these architectures only reinforces this point.

and

> All the papers discussing scaling laws point to the same thing, simply using more compute and data yields better results.

is what I am saying as well. I've read the majority of those papers, so this is all well known to me, but I am perhaps writing it down in a more condensed format so that other readers who are light on the topic can pick up the idea more easily.

> A majority of the improvement from GPT-2 to GPT-4 was simply training at a much larger scale. That was enabled by better hardware, and lots of it.

Ok, I see your point and the conclusion here is what we disagree with. You say that the innovation was simply enabled by the better hardware whereas I say that that better hardware wouldn't have its place if there hadn't been a great innovation in the algorithm itself. I don't think it's fair to say that the innovation is driven by the NVidia chips.

I guess my point, simplistically, is that if we had a lousy algorithm, new hardware wouldn't mean anything without rethinking or rewriting the algorithm. And with transformers, that definitely wasn't the case. There have been plenty of optimizations throughout the years to better utilize the HW (e.g. flash-attention), but the architecture of transformers has remained more or less the same.


> the first one being from DeepMind in 2017

? what paper are you talking about



Oh gotcha. Maybe pedantic, but that is not a DeepMind paper

I stand corrected, not sure why I was under the impression that this paper came out of DeepMind lab.

> Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

That sounds to me like dismissing the idea that a Russian SSBN might cross the Pacific and nuke Los Angeles because "submarines can't swim".

Even if the machine learning isn't really intelligent, it is still capable of performing IF..THEN..ELSE operations, which could have detrimental effects for [some subset of] humans.

And even if you argue that such a machine _shouldn't_ be used for whatever doomsday scenario would harm us, rest assured that someone, somewhere, who either does not understand what the machines are designed to do or just pretends that they work like magic, will put the machines in a position to make such a decision.


One could hope...

Even at the height of the Cold War there was always a human between <leader presses button> and <nukes go aflyin'>.

--edit--

...which has me wondering if a president even has the constitutional authority to destroy the entire planet and if one could interpret their command as a 'lawful order'. Makes one think.


On the topic of fail-deadly nukes:

https://en.wikipedia.org/wiki/Dead_Hand


> They think it’s the end all be all for everything.

Is (human-based) general intelligence not one of the fundamental enabling elements of literally every human activity throughout history, regardless of how many layers of automation and technology one has to peel back to get to it?

Can you maybe imagine how the ability to create arbitrary amounts of general intelligence, completely divorced from the normal lengthy biological process, could upend that foundation of human activity?

> They have the vision of the ‘AI’ they’ve seen in movies in mind, see the current ‘AI’ being used and to them it’s basically almost the same, their brain is mental bridging the concepts and saying it’s only a matter of time.

I've found that most AI-related movies exclusively focus on "quality ASI" scenarios, which are mostly irrelevant to our current state of the world, as an immense amount of danger/value/disruption will arrive with AGI. People who are seriously reasoning about the impacts of AGI are not using movies as references. "Those stupid movie watching idiots" is just a crutch you are using to avoid thinking about something that you disagree with.

> Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

Do you have any evidence to support this conclusion? And does it even matter? If "fake intellect" can replace a human, that human still has to deal with the very real issue of not having a job anymore. If "fake intellect" is used to conduct mass surveillance and direct suppression activities towards divergent individuals, those individuals are still going to have a bad time.


>> Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

> Do you have any evidence to support this conclusion? And does it even matter? If "fake intellect" can replace a human, that human still has to deal with the very real issue of not having a job anymore. If "fake intellect" is used to conduct mass surveillance and direct suppression activities towards divergent individuals, those individuals are still going to have a bad time.

I think the "fake intelligence can replace a human" needs more support in general. We know how human intellect works practically (not theoretically) and we know how to apply it in different scenarios. We're still far from knowing how "fake intelligence" works and how to apply it to different scenarios.


I couldn't agree more.

If we're not talking exclusively about cyber war, such as finding and exploiting vulnerabilities, then for the time being national security will still rest on traditional armies.

Just a few weeks ago, Italy announced a €16bln plan to buy >1000 Rheinmetall IFVs. That alone would make Italy's army one of the best equipped in Europe. I can't imagine what would happen with a $500bln investment in defense, lol. I don't agree with what Meloni's government is doing, but one of the ministers I agree with more is the defense minister Crosetto

Furthermore, what is being shown, at least for the time being, is that open source can be and is crucial in helping develop better models. This collides with the big, "one winner takes it all" VC mentality (because let's be honest, these defense pitches are still made by startup/VC bros)


> Italy announced a €16bln plan to buy >1000 Rheinmetall IFVs. That alone would make Italy's army one of the best equipped in Europe.

So target practice for a beyond-the-horizon missile system launched ground-to-ground or air-to-ground? As an attacking force, conventional ground forces and tactics are a non-runner in a modern theatre of operations when faced with air and drone support. This is why no single EU country is incentivised into dumping money into any single area, as the only probable defense would be against the USA/Russia/China to begin with.

The US proved it beyond doubt in Afghanistan - partisans simply haven't a chance against a gunship with IR or NV optics; the last time they levelled the playing field against air interdictors was in Charlie Wilson's Afghanistan when the Mujahideen took on that era of Soviet gunships with hand-held AA systems.


Italy has had a joke of a tank fleet for the last decade(s); it clearly wasn't on par with the other European alternatives, and the defense budget went to other places, namely the navy and air force. Since some of the Italian tanks are being decommissioned, there was a need for a new tank, but not having invested much in tanks meant not having cutting-edge technology to develop an alternative. So the decision was taken to procure tanks from Rheinmetall (Leonardo was open to working with another German producer, but they didn't want to transfer technology and produce in Italy).

Not having tanks is a big no-no given recent events. I don't see the grudge you hold against them, when it's clear they wouldn't be used in the same flat scenario as Ukraine or Poland. Given that Italy is highly mountainous, it made sense to prioritize the air force and navy first. I think they're also compatible with anti-drone guns

Piaggio Aerospace was acquired a few weeks ago by Baykar; hopefully it will produce drones.

I do not get why you're saying that a change in the approach to war makes tanks not useful anymore, when it's clear they will eventually be adapted to counter drones and similar threats


> I can't imagine what would happen with a 500$bln investment in defense,lol.

The $90,000 bag of bushings becomes a $300,000 bag?


No, at least in Italy we would create some gimmick consultant position for some ex-politician and repay him/her for past favours /s

It's not one or the other, though. AI-controlled drones are already a thing in Ukraine, today.

Been saying this for years, it's been fucking baffling. Generating images, video and text that sort-of resembles what a human would come up with is genuinely quite impressive. It is not "let's claim it'll fix our country" (looking at you, Keir) impressive though, and I cannot believe so much money has been pumped into it.

But you have to overpromise and underdeliver, otherwise you won't receive that sweet, sweet money

Yes, I'd like to see some examples where our current AI can actually extrapolate rather than interpolate. Let it invent new things, new drawing styles, new story plots, etc. Maybe _then_ it will impress me.


I'm not convinced. This is using the tooling and paradigms invented by humans.

Science is a paradigm invented by humans. If a human uses this paradigm to invent something he is considered intelligent but if an ai uses it it’s not? All humans use human paradigms and they are intelligent.

If a human uses such a paradigm to success it is the same.


Biology is a field where you can find new things relatively easily, if you follow existing protocols and do the hard lab work (which costs time and money).

Ah yes. I believe this is commonly called "moving the goalposts".

No it isn't.

Can you? Most people can't do anything that isn't 99% derivative.

I can only say that exponential curves grow nominally sublinearly before they take off. AI is not quite at the obvious take off point, but owners of the biggest clusters have seen the extrapolations and it isn't pretty - once your competitor achieves take off and you aren't anywhere close, you're done for. The risk of not participating in that are too great.

> This is machine learning except the machines are bigger than ever before. There is no intellect.

Define "intellect".


What is even the possible usage of AI for national security? Generating pictures of kittens riding nuclear weapons to the very end like in Dr Strangelove?

> What is even the possible usage of AI for national security? Generating pictures of kittens riding nuclear weapons to the very end like in Dr Strangelove?

For all that critics of AI dismiss them as lacking imagination, your reaction suggests a lack of imagination.

Off the top of my head: facial recognition and identification to make "smart" guns that hit specific targets with reduced collateral damage (as found on most digital cameras even before smartphones); creating and A/B testing propaganda campaigns; using modified wifi signals as wall-penetrating radar capable of pose estimation, heart rate and breathing monitoring[0]; take any self-driving car's AI and conditionally invert the part that says "don't hit pedestrians" when a certain target is spotted; ANPR to track specific vehicles with known owners over long distances; alternative targeting systems for cruise missiles in the absence or jamming of GPS; using them as red teams in war-game exercises; using them to automate intrusion detection by monitoring for changes to the background distributions of basically every measurable event; person-tracking by watching CCTV in secure areas; control systems for security robots (think Boston Dynamics' Spot) that are currently in deployment.

There's likely a lot more, too.

[0] https://openaccess.thecvf.com/content_cvpr_2018/papers/Zhao_...


Gentlemen you can't use AI here, this is the deep learning lab!

Lol: Where I live (Memphis) both “one” and “two” are considered two syllable words. Seriously. Our kids were taught this in the best public elementary school.

How does that work, one-uh, two-uh?

egzacly

Autonomous weapons.

Agreed. I was working on some haiku things with ChatGPT and it kept telling me that busy has only one syllable. This is a trivially searchable fact.

link a chat please

It wasn't just busy that it failed on. I was feeding it haikus and wanted them broken into a list of 17 words/fragments. Certain 2 syllable words weren't split and certain 1 syllable words were split into two.

Also, the narrative that we are currently on the brink of an AI explosion, and that this random paper shows it, has been the same tired old story handed out by AI hawks for years now. Like yes, I agree with the general idea that more compute means more progress for humans, and perhaps having a more responsive user interface through some kind of AI-type technology would be good. But I don’t see why that will turn into Data from Star Trek. I also think all these AI hawks kind of narcissistically overvalue their own being. Like, blink and their lives are over in the grand scheme of things. Maybe our “awareness” of the world around us is an illusion provided by evolution because we needed it to value self-preservation whereas other animals don’t. There is an inherent belief in the specialness of humans that I suppose I mistrust.

> But I don’t see why that will turn into Data from Star Trek.

"Is Data genuinely sentient or is he just a machine with this impression" was a repeated plot point in TNG.

https://en.wikipedia.org/wiki/The_Measure_of_a_Man_(Star_Tre...

https://en.wikipedia.org/wiki/The_Offspring_(Star_Trek:_The_...

https://en.wikipedia.org/wiki/The_Ensigns_of_Command

https://en.wikipedia.org/wiki/The_Schizoid_Man_(Star_Trek:_T...

Similar with The Doctor on VOY.

Even then, what we have with LLMs is basically already at the level of the ship's main computer as it was written in TNG/DS9/VOY.

But what counts as personhood? No idea. We're as ignorant as of that as the writers of the TV show; humanity's best attempts at philosophy of mind are still making the same kinds of arguments today as we were a century ago, the only real shift is to be slightly more inclusive of other species besides Homo Sapiens, and the only real improvement over two centuries ago is to include all of Homo Sapiens.


The main computer does not make choices stochastically and always understands what people ask it. I do not think that resembles the current crop of LLMs. On Voyager, the ship's computer is some kind of biological computing entity that they eventually give up on as a story topic, but there is an episode where the bio-computing gel packs get sick.

I believe Data and the Doctor would both be people to me. But is Minuet? The woman created by the Bynars to distract Riker so they could steal the ship? Star Trek goes back and forth on who counts as a person. This is one of the points made in the episode where the Doctor learns his hologram has been cloned many times to work the mines on an asteroid, and he says this is slavery. Does Minuet get to say she doesn't want to seduce Riker if she doesn't feel like it? Or is she a program that does whatever the programmer wants?

But moreover, linking back to the original technology, does EQTransformer have awareness? Do we treat it as a thinking being that we enslaved to find earthquakes in data because it uses the same technology as ChatGPT? I don't think anyone thinks we should, but it seems like that is where people want to take it. That's strange to me. At what point do data pipelines doing some kind of stochastic transformation and summarization of training data become an individual that expresses a desire for autonomy like Data or the Doctor?

I think there’s lots of questions here to answer and I don’t know the answers to them.


> The main computer does not make choices stochastically and always understands what people ask it.

The mechanism is never explained, but no, it doesn't always understand correctly — and neither does Data. If hologram-Moriarty is sentient (is he?), then the capability likely exceeds what current LLMs can do, but the cause of the creation is definitely a misunderstanding.

Even the episode where that happens, the script for Dr. Pulaski leading up to Moriarty's IQ boost was exactly the same arguments used against LLMs: https://www.youtube.com/watch?v=4pYDy7vsCj8

(Common trope in that era being that computers (including Data) are too literal, so there was also: https://www.youtube.com/watch?v=HiIlJaSDPaA)

Similar with every time the crew work iteratively to create something in the holodeck. And, of course: https://www.youtube.com/watch?v=srO9D8B6dH4

> I do not think that resembles the current crop of LLMs. On voyager the ships computer is some kind of biological computing entity that they eventually give up on as a story topic but there is an episode where the bio computing gel packs get sick.

"Take the cheese to sickbay" is one of my favourite lines from that series.

> But is minuet?

I would say the character was a puppet, with the Bynars pulling the strings, because the holo-character was immediately seen as lacking personhood the moment they stopped fiddling with the computer.

Vic Fontaine was more ambiguous in that regard. Knew he was "a lightbulb", but (acted like) he wanted to remain within that reality in a way that to me felt like he was *programmed* to respond as if the sim around him was the only reality that mattered rather than having free will in that regard.

(But who has total free will? Humans are to holograms as Q is to humans, and the main cast were also written to reject "gifts" from Riker that time he briefly became a Q).

The villagers of Fair Haven were, I think, not supposed to be sentient (from the POV of the crew), but were from the POV of the writers: https://en.wikipedia.org/wiki/Fair_Haven_(Star_Trek:_Voyager... and https://en.wikipedia.org/wiki/Spirit_Folk_(Star_Trek:_Voyage...

> does eqtransformer have awareness?

There's too many different definitions for a single answer.

We don't know what part of our own brains gives us the sensation of our own existence; and even if we did, we wouldn't know if it was the only mechanism to do so.

To paraphrase your own words:

At what point does chemical pipelines doing some kind of stochastic transformation and electrochemical integration of sensory input become an individual that presents a desire for autonomy like data or the doctor?

I don't know. Like you, I'd say:

> I think there’s lots of questions here to answer and I don’t know the answers to them.


I find the last part of the paragraph offputting and I agree

It used to be much easier to be conservative about AI, especially AGI, after living through three cycles of AI winters. No more. Dismissing it as “merely machine learning” is worse than unfair to the last decade of machine learning ;-)

The hard part now is relatively trivial. Does anyone think that there is a fundamental and profound discovery that evolution made purely by selection in the last 200,000 years? I mean a true qualitative difference?

Sure: we call it language, which is just another part of a fancy animal’s tool kit.

Does anyone think there is an amazing qualitative difference between the brain of a chimp and the brain of a human?

No, not if they know any biology.

(Although that does not stop some scientist from looking for a “language gene” like FOXP2.)

So what did dumb mutations and 200,000 years of selection do that a group of dedicated AI scientists cannot do with their own genuine general intelligence?

Nothing; nothing other than putting a compact, energy-efficient LLM with reinforcement learning on a good robotic body and letting it explore and learn like we did as infants, toddlers and teenagers.

Each one of us has experienced becoming a “general intelligence”. I remember it hit me on the head in 6th grade when I dreamed up a different way of doing long division. I remember thinking: “How did I think that?” And each one of us who has watched an infant turn into a toddler has watched it as an observer or teacher. This is what makes babies so fascinating to “play” with.

We have to give our baby AGI a private memory and a layer of meta-attention like we all gain as we mature, love, and struggle.

I read the linked article, and as a neuroscientist I realized the “wait” cycles that improved performance so much are roughly equivalent to the prefrontal cortex: the part of the CNS most responsible for enabling us to check our own reasoning recursively. Delay, as in delayed gratification, is a key attribute of intelligent systems.

We are finally on the doorstep of Hofstadter’s Strange Loop and Maturana’s and Varela’s “enactive” systems, but now implemented in silicon, metal, and plastic by us rather than by dumb but very patient natural selection.

Karl Friston and Demis Hassabis (two very smart neuroscientists) figured this out years ago. And they were preceded by three other world-class neuroscientists: Humberto Maturana, Francisco Varela, and Rich Sutton (honorary neuroscientist). And big credit to Terry Winograd for presaging this path forward long ago too.


> I think using any of this in a national security setting is stupid

What about AI enabled drones and guided missiles/rockets? The case for their effectiveness is relatively simple in terms of jamming resistance.


drone and missile guidance system development has been using ML for decades at this point. That's just as much "AI" as anything currently coming out of the LLM craze.

It's not just target guidance at this point. There are prototypes of drone swarms, for example.

Like a lot of AI boosters, would you like to explain how that works, other than magic AI dust? Some forms of optical guidance are already in use, but there are other limitations (lighting! weather!)

Sure thing. The basic idea would be:

1) Have a camera on your drone

2) Run some frames through a locally running version of something like AWS Rekognition's celebrity identification service, but for relevant military targets

3) Navigate towards the coordinates of target individuals

It isn't exactly magic; here's a video of a guy doing navigation with OpenCV on images: https://www.youtube.com/watch?v=Nrzs3dQ9exw


I believe this is a capability that the Switchblade 600 or STM KARGU already has.

https://en.wikipedia.org/wiki/STM_Kargu


I think jamming resistance is a red herring. AI weapons will have their own failure modes due to jamming; any sensor modality has its own particular weaknesses. There are also reasoning-model malfunctions, i.e. hallucinations.

Not to mention false GPS etc...


I would say that they don't require a $500bln investment. AFAIK, drones that help lock onto targets have already started being used in Ukraine

I generally agree, piggybacking on innovations in smartphone GPUs / batteries will probably be enough to get locally running AI models in drones.

This somehow reminds me of a certain killer robot from a Black Mirror episode ;)

> This is machine learning

Yeah, I was thinking about this while trying to figure out author affiliations.

There was a Stanford paper a few years ago that dusted off some old intelligence concepts and the authors seemed excited about it.

But given the pace of AI, it's difficult to look in new directions. It will probably take an AI winter and some unbridled enthusiasm immune to burnout to make some real progress outside of feed forward neural networks.


You would solve the AI problem if you correctly defined what intellect even is.

I agree AGI won't solve national security, but saying this isn't intelligence is false.

This is AI, and trend lines point to an intelligence that matches or barely exceeds human intellect in the future.

You're part of a trend of people in denial. When LLMs first came out, there were hordes of people on HN claiming they were just stochastic parrots and displayed zero intellectual ability. It is now abundantly clear that this is not true.

We don't fully understand LLMs. That's why gains like CoT are just black-box adjustments that come from changing external configurations. We have no way to read the contents of the black box and make adjustments based on it. Yet idiots like you make such vast and hard claims when nobody really fully understands these things. You're delusional.

I agree that LLMs won’t allow us to make some super weapon to give us some edge in national security.


> then you absolutely should want even more money poured into AI development, to make it go even faster.

Indeed. People are welcome to go "all in" on whatever nonsense gambling they want to do with their personal investments, but national security demands actually thinking about things - adversarially. Because the enemy will as well.

It's perfectly possible to lose a war by investing in expensive superweapons that under deliver. The Nazis were particularly bad at this.


That sovereign wealth fund with TikTok might set a good precedent; when we have to 'pour money' into these companies, we can do so with a stake in them held in our sovereign wealth fund.

Extra-legal financial instruments meant to suck money from other federal departments don't strike me as a good precedent in any sense. I don't disagree, though, that nationalizing the value of enormous public investments is something we should be considering (looking at you, oil industry). But until Congress appropriates the money under law, it's a pipe dream or theft.

Sorry for being lazy, but I just don't have the time right now to read the paper. Is there in the paper or somewhere else a comparison based on benchmarks of S1 vs R1 (the full R1, not quantized or distilled)?

The S1 paper is not meant to compete with R1. It simply shows that with 1k well-curated examples for finetuning (26 minutes of training on 16 GPUs) and a simple hack for controlling the length of the thinking process, one can dramatically increase the performance of a non-reasoning model and show a clear increase in benefit with increased test-time compute. It is worth a quick skim.

> Going forward, it’ll be nearly impossible to prevent distealing (unauthorized distilling). One thousand examples is definitely within the range of what a single person might do in normal usage, no less ten or a hundred people. I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.

(sorry for the long quote)

I will say (naively perhaps) "oh, but that is fairly simple". For any API request from 'unverified' users, add a 5-second delay before the next one is served. Make a "blue check" tier (à la X/Twitter). For the big sales, have a third-party vetting process, so that if US Corporation XYZ wants access, they prove themselves worthy (i.e. not Chinese competition) and then you do give them the 1000/min deal.

For everyone else, add the 5-second (or whatever other duration makes sense) timer/overhead and watch them drop from 1000 requests per minute to 500 per day. Or just cap them at 500 per day and close that back door. And if you get 'many cheap accounts' doing hand-overs (AccountA does 1-500, AccountB does 501-1000, AccountC does 1001-1500, and so on), then you mass-block them.
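A minimal sketch of that kind of throttle (the limits and field names here are made up for illustration, not any provider's actual policy):

    # Per-request delay plus a daily cap for unverified accounts; limits are illustrative.
    import time
    from collections import defaultdict
    from datetime import date

    DAILY_CAP = 500
    DELAY_SECONDS = 5
    usage = defaultdict(lambda: {"day": date.today(), "count": 0})

    def allow_request(account_id: str, verified: bool) -> bool:
        if verified:
            return True                        # vetted accounts keep the high rate limit
        record = usage[account_id]
        if record["day"] != date.today():      # reset the counter each day
            record["day"], record["count"] = date.today(), 0
        if record["count"] >= DAILY_CAP:
            return False                       # cap reached, reject until tomorrow
        record["count"] += 1
        time.sleep(DELAY_SECONDS)              # slow unverified traffic to a trickle
        return True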



