S1: A $6 R1 competitor? (timkellogg.me)
851 points by tkellogg 8 days ago | 415 comments

I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how much low-hanging fruit we're still missing. It's so weird to think that improvements in a branch of computer science boil down to conjuring the right incantation words. How do you even change your mindset to start thinking this way?

I think the fact alone that distillation and quantization are techniques that can produce substantial improvements is a strong sign that we still have no real comprehensive understanding of how the models work.

If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.

Yet this is what happens - the distilled or quantized models often come very close to the original model.

So I think there are still many low-hanging fruits to pick.


We have a partial understanding of why distillation works—it is explained by The Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635). But if I am understanding correctly, that doesn't mean you can train a smaller network from scratch. You need a lot of randomness in the initial large network, for some neurons to have "winning" states. Then you can distill those winning subsystems to a smaller network.
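For concreteness, here is a toy sketch of the iterative magnitude-pruning loop that paper describes (PyTorch assumed; `make_model` and `train` are hypothetical helpers, and this is an illustration, not the paper's actual code):

  import torch

  def find_winning_ticket(make_model, train, prune_frac=0.2, rounds=3):
      # Train the big random network, drop the smallest-magnitude weights,
      # rewind the survivors to their original initialization, and repeat.
      model = make_model()
      init_state = {k: v.clone() for k, v in model.state_dict().items()}
      masks = {name: torch.ones_like(p) for name, p in model.named_parameters()
               if p.dim() > 1}  # prune weight matrices only, not biases
      for _ in range(rounds):
          train(model, masks)  # caller zeroes masked weights during training
          with torch.no_grad():
              for name, p in model.named_parameters():
                  if name not in masks:
                      continue
                  mask = masks[name]
                  surviving = p.abs()[mask.bool()]
                  k = max(1, int(prune_frac * surviving.numel()))
                  threshold = surviving.kthvalue(k).values
                  masks[name] = mask * (p.abs() > threshold).float()
          model.load_state_dict(init_state)  # the "rewind" step
      return model, masks

The rewind to `init_state` is the key point: the surviving sparse subnetwork trains well largely because it keeps its original "lucky" initialization.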

Note that a similar process happens in the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."


So, can a distilled 8B model (say, DeepSeek-R1-Distill-Llama-8B or whatever) be "trained up" to a higher-parameter 16B model after distillation from a superior model, or is it forever stuck at the 8B parameters that can just be fine-tuned?

So more 'mature' models might arise in the near future with fewer params and better benchmarks?

That's been happening consistently for over a year now. Small models today are better than big models from a year or two ago.

"Better", but not better than the model they were distilled from, at least that's how I understand it.

I think this is how the "child brain" works too. The better the parents and the environment are, the better the child develops :)

Not at all — how many people were geniuses and their parents not? I can name several and I’m sure with a quick search you can too.

How is that relevant? A few examples do not disprove anything. It's pretty common knowledge that the more successful/rich etc. your parents were, the more likely you'll be successful/rich etc.

This does not directly prove the theory your parent comment posits, being that better circumstances during a child's development improve the development of that child's brain. That would require success being a good predictor of brain development, which I'm somewhat uncertain about.


They might also be more biased and less able to adapt to new technology. Interesting times.

I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.

The question I hear you raising seems to be along the lines of, can we use a new compression method to get better resolution (reproducibility of the original) in a much smaller size.


> in that a distilled model of an LLM is like a JPEG of a photo

That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLMs as a compressed version of the training data.


And what is compression but finding the minimum amount of information required to reproduce a phenomenon? I.e. discovering natural laws.

Finding minimum complexity explanations isn't what finding natural laws is about, I'd say. It's considered good practice (Occam's razor), but it's often not really clear what the minimal model is, especially when a theory is relatively new. That doesn't prevent it from being a natural law, the key criterion is predictability of natural phenomena, imho. To give an example, one could argue that Lagrangian mechanics requires a smaller set of first principles than Newtonian, but Newton's laws are still very much considered natural laws.

Maybe I'm just a filthy computationalist, but the way I see it, the most accurate model of the universe is the one which makes the most accurate predictions with the fewest parameters.

The Newtonian model makes provably less accurate predictions than Einsteinian (yes, I'm using a different example), so while still useful in many contexts where accuracy is less important, the number of parameters it requires doesn't much matter when looking for the one true GUT.

My understanding, again as a filthy computationalist, is that an accurate model of the real bonafide underlying architecture of the universe will be the simplest possible way to accurately predict anything. With the word "accurately" doing all the lifting.

As always: https://www.sas.upenn.edu/~dbalmer/eportfolio/Nature%20of%20...

I'm sure there are decreasingly accurate, but still useful, models all the way up the computational complexity hierarchy. Lossy compression is, precisely, using one of them.


The thing is, Lagrangian mechanics makes exactly the same predictions as Newtonian, and it starts from a foundation of just one principle (least action) instead of three laws, so it's arguably a sparser theory. It just makes calculations easier, especially for more complex systems; that's its raison d'être. So in a world where we don't know about relativity yet, both make the best predictions we know (and they always agree), but Newton's laws were discovered earlier. Do they suddenly stop being natural laws once Lagrangian mechanics is discovered? Standard physics curricula would not agree with you btw, they practically always teach Newtonian mechanics first and Lagrangian later, also because the latter is mathematically more involved.

I will argue that 'has least action as foundation' does not in itself imply that Lagrangian mechanics is a sparser theory:

Here is something that Newtonian mechanics and Lagrangian mechanics have in common: it is necessary to specify whether the context is Minkowski spacetime, or Galilean spacetime.

Before the introduction of relativistic physics the assumption that space is Euclidean was granted by everybody. The transition from Newtonian mechanics to relativistic mechanics was a shift from one metric of spacetime to another.

In retrospect we can recognize Newton's first law as asserting a metric: an object in inertial motion will in equal intervals of time traverse equal distances of space.

We can choose to make the assertion of a metric of spacetime a very wide assertion: such as: position vectors, velocity vectors and acceleration vectors add according to the metric of the spacetime.

Then to formulate Newtonian mechanics these two principles are sufficient: The metric of the spacetime, and Newton's second law.

Hamilton's stationary action is the counterpart of Newton's second law. Just as in the case of Newtonian mechanics: in order to express a theory of motion you have to specify a metric; Galilean metric or Minkowski metric.

To formulate Lagrangian mechanics: choosing stationary action as foundation is in itself not sufficient; you have to specify a metric.

So: Lagrangian mechanics is not sparser; it is on par with Newtonian mechanics.

More generally: transformation between Newtonian mechanics and Lagrangian mechanics is bi-directional.

Shifting between Newtonian formulation and Lagrangian formulation is similar to shifting from cartesian coordinates to polar coordinates. Depending on the nature of the problem one formulation or the other may be more efficient, but it's the same physics.


You seem to know more about this than me, but it seems to me that the first law does more than just induce a metric, I've always thought of it as positing inertia as an axiom.

There's also more than one way to think about complexity. Newtonian mechanics in practice requires introducing forces everywhere, especially for more complex systems, to the point that it can feel a bit ad hoc. Lagrangian mechanics very often requires fewer such introductions and often results in descriptions with fewer equations and fewer terms. If you can explain the same phenomenon with fewer 'entities', then it feels very much like Occam's razor would favor that explanation to me.


Indeed inertia. Theory of motion consists of describing the properties of Inertia.

In terms of Newtonian mechanics the members of the equivalence class of inertial coordinate systems are related by Galilean transformation.

In terms of relativistic mechanics the members of the equivalence class of inertial coordinate systems are related by Lorentz transformation.

Newton's first law and Newton's third law can be grouped together in a single principle: the Principle of uniformity of Inertia. Inertia is uniform everywhere, in every direction.

That is why I argue that for Newtonian mechanics two principles are sufficient.

The Newtonian formulation is in terms of F=ma; the Lagrangian formulation is in terms of interconversion between potential energy and kinetic energy.

The work-energy theorem expresses the transformation between F=ma and potential/kinetic energy. I give a link to an answer by me on physics.stackexchange where I derive the work-energy theorem: https://physics.stackexchange.com/a/788108/17198
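In outline, the standard one-dimensional form of that derivation is (LaTeX):

  \int_{x_0}^{x_1} F \, dx
    = \int_{t_0}^{t_1} m \frac{dv}{dt}\, v \, dt
    = \int_{v_0}^{v_1} m v \, dv
    = \tfrac{1}{2} m v_1^2 - \tfrac{1}{2} m v_0^2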

The work-energy theorem is the most important theorem of classical mechanics.

About the type of situation where the energy formulation of mechanics is more suitable: when there are multiple degrees of freedom then the force and the acceleration of F=ma are vectorial. So F=ma has the property that there are vector quantities on both sides of the equation.

When expressing in terms of energy: As we know: the value of kinetic energy is a single value; there is no directional information. In the process of squaring the velocity vector directional information is discarded, it is lost.

The reason we can afford to lose the directional information of the velocity vector: the description of the potential energy still carries the necessary directional information.

When there are, say, two degrees of freedom the function that describes the potential must be given as a function of two (generalized) coordinates.

This comprehensive function for the potential energy allows us to recover the force vector. To recover the force vector we evaluate the gradient of the potential energy function.

The function that describes the potential is not itself a vector quantity, but it does carry all of the directional information that allows us to recover the force vector.

I will argue the power of the Lagrangian formulation of mechanics is as follows: when the motion is expressed in terms of interconversion of potential energy and kinetic energy there is directional information only on one side of the equation; the side with the potential energy function.

When using F=ma with multiple degrees of freedom there is a redundancy: directional information is expressed on both sides of the equation.

Anyway, expressing mechanics taking place in terms of force/acceleration or in terms of potential/kinetic energy is closely related. The work-energy theorem expresses the transformation between the two. While the mathematical form is different the physics content is the same.


Nicely said, but I think then we are in agreement that Newtonian mechanics has a bit of redundancy that can be removed by switching to a Lagrangian framework, no? I think that's a situation where Occam's razor can be applied very cleanly: if we can make the exact same predictions with a sparser model.

Now the other poster has argued that science consists of finding minimum complexity explanations of natural phenomena, and I just argued that the 'minimal complexity' part should be left out. Science is all about making good predictions (and explanations); Occam's razor is more like a guiding principle to help find them (a bit akin to shrinkage in ML) rather than a strict criterion that should be part of the definition. And my example to illustrate this was Newtonian mechanics, which in a complexity/Occam's sense should be superseded by Lagrangian, yet that's not how anyone views this in practice. People view Lagrangian mechanics as a useful calculation tool to make equivalent predictions, but nobody thinks of it as nullifying Newtonian mechanics, even though it should be preferred from Occam's perspective. Or, as you said, the physics content is the same, but the complexity of the description is not, so complexity does not factor into whether it's physics.


Laws (in science, not government) are just relationships that are consistently observed, so Newton's laws remain laws until contradictions are observed, regardless of the existence of one or more alternative models which would predict them to hold.

The kind of Occam’s Razor-ish rule you seem to be trying to query about is basically a rule of thumb for selecting among formulations of equal observed predictive power that are not strictly equivalent (that is, if they predict exactly the same actually observed phenomena instead of different subsets of subjectively equal importance, they still differ in predictions which have not been testable), whereas Newtonian and Lagrangian mechanics are different formulations that are strictly equivalent, which means you may choose between them for pedagogy or practical computation, but you can't choose between them for truth because the truth of one implies the truth of the other, in either direction; they are exactly the same in substance, differing only in presentation.

(And even where it applies, its just a rule of thumb to reject complications until they are observed to be necessary.)


Newtonian and Lagrangian mechanics are equivalent only in their predictions, not in their complexity - one requires three assumptions, the other just one. Now you say the fact that they have the same predictions makes them equivalent, and I agree. But it's clearly not compatible with what the other poster said about looking for the simplest possible way to explain a phenomenon. If you believe that that's how science should work, you'd need to discard theories as soon as simpler ones that make the same predictions are found (as in the case of Newtonian mechanics). It's a valid philosophical standpoint imho, but it's in opposition to how scientists generally approach Occam's razor, as evidenced e.g. by common physics curricula. That's what I was pointing out. Having to exclude Newtonian mechanics from what can be considered science is just one prominent consequence of the other poster's philosophical stance, one that could warrant reconsidering whether that's how you want to define it.

> Do they suddenly stop being natural laws once Lagrangian mechanics is discovered?

Not my question to answer, I think that lies in philosophical questions about what is a "law".

I see useful abstractions all the way down. The linked Asimov essay covers this nicely.



Well, a JPEG can be thought of as a compression of the natural world of which the photograph was taken.

And we can answer the question why quantization works with a lossy format, since quantization just drops accuracy for space but still gives us a good enough output, just like a lossy jpeg.

Reiterating again, we can lose a lot of data (have incomplete data) and have a perfectly visible jpeg (or MP3, same thing).


This brings up an interesting thought too. A photo is just a lossy representation of the real world.

So it's lossy all the way down with LLMs, too.

Reality > Data created by a human > LLM > Distilled LLM


What you say makes sense, but is there the possibility that because it’s compressed it can generalize more? In the spirit of bias/variance.

Yeah but it does seem that they're getting high % numbers for the distilled models accuracy against the larger model. If the smaller model is 90% as accurate as the larger, but uses much < 90% of the parameters, then surely that counts as a win.

Nope, it's quite obvious why distillation works. If you just predict the next token, then the only information you can use to compute the loss is THE expected token. Whereas if you distill, you can also use (typically a few) logits from the teacher.

"My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.

Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them too. That way, model learns faster, because it gets more information in each update.

(So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
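A minimal sketch of what that combined loss can look like (classic soft-target distillation in PyTorch, as an illustration rather than any particular lab's recipe):

  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, target_ids,
                        T=2.0, alpha=0.5):
      # Hard-label term: plain next-token cross-entropy ("...Foo" is the answer).
      ce = F.cross_entropy(student_logits, target_ids)
      # Soft-label term: match the teacher's whole distribution over tokens,
      # softened by temperature T so the other plausible names carry signal too.
      log_p_student = F.log_softmax(student_logits / T, dim=-1)
      p_teacher = F.softmax(teacher_logits / T, dim=-1)
      kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
      return alpha * ce + (1 - alpha) * kl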


For quantization I don't think that's really true. Quantization is just making more efficient use of bits in memory to represent numbers.
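For instance, a toy symmetric int8 quantization of a single weight tensor (a sketch, not any particular library's scheme):

  import numpy as np

  w = np.random.randn(256, 256).astype(np.float32)  # fp32 weights: 4 bytes each
  scale = np.abs(w).max() / 127.0                   # one scale for the whole tensor
  w_q = np.round(w / scale).astype(np.int8)         # stored in 1 byte each
  w_hat = w_q.astype(np.float32) * scale            # dequantized for the matmul
  print(np.abs(w - w_hat).max())                    # error is small relative to w's range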

> still have no real comprehensive understanding of how the models work.

We do understand how they work, we just have not optimised their usage.

For example, someone who has a good general understanding of how an ICE or EV car works can figure out how to drive any car within a couple of minutes, even if the user interface is very unfamiliar.

But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.


Your example is somewhat inadequate. We _fundamentally_ don’t understand how deep learning systems work, in the sense that they are more or less black boxes that we train and evaluate. Innovations in ML are a whole bunch of wizards with big stacks of money changing “Hmm” to “Wait” and seeing what happens.

Would a different sampler help you? I dunno, try it. Would a smaller dataset help? I dunno, try it. Would training the model for 5000 days help? I dunno, try it.

Car technology is the opposite of that - it’s a white box. It’s composed of very well defined elements whose interactions are defined and explained by laws of thermodynamics and whatnot.


> _fundamentally_ don’t understand how deep learning systems work.

It's like saying we don't understand how quantum chromodynamics works. Very few people do, and it's the kind of knowledge not easily distilled for the masses in an easily digestible, popsci way.

Look into how older CNNs work -- we have very good visual/accessible/popsci materials on how they work.

I'm sure we'll have that for LLM but it's not worth it to the people who can produce that kind of material to produce it now when the field is moving so rapidly, those people's time is much better used in improving the LLMs.

The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks.


As a person who has trained a number of computer vision deep networks, I can tell you that we have some cool-looking visualizations on how lower layers work but no idea how later layers work. The intuition is built over training numerous networks and trying different hyperparameters, data shuffling, activations, etc. it’s absolutely brutal over here. If the theory was there, people like Karpathy who have great teacher vibes would’ve explained it for the mortal grad students or enthusiast tinkerers.

> The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work and they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks

I say this less as an authoritative voice but more as an amused insider: Spend a week with some ML grad students and you will get a chuckle whenever somebody says we’re not some monkeys throwing things at GPUs.


It may be as simple as this:

https://youtube.com/shorts/7GrecDNcfMc

Many many layers of that. It’s not a profound mechanism. We can understand how that works, but we’re dumbfounded how such a small mechanism is responsible for all this stuff going on inside a brain.

I don’t think we don’t understand, it’s a level beyond that. We can’t fathom the implications, that it could be that simple, just scaled up.


> Many many layers of that. It’s not a profound mechanism

Bad argument. Cavemen understood stone, but they could not build the aqueducts. Medieval people understood iron, water and fire, but they could not make a steam engine.

Finally we understand protons, electrons, and neutrons and the forces that govern them, but it does not mean we understand everything they could possibly make.


"Cavemen understood stone"

How far removed are you from a caveman is the better question. There would be quite some arrogance coming out of you to suggest the several million years gap is anything but an instant in the grand timeline. As in, you understood stone just yesterday ...

The monkey that found the stone is the monkey that built the cathedral. It's only a delusion the second monkey creates to separate it from the first monkey (a feeling of superiority, with the only tangible asset being "a certain amount of notable time passed since point A and point B").

"Finally we understand protons, electrons, and neutrons and the forces that govern them, but it does not mean we understand everything they could possibly make"

You and I agree. That those simple things can truly create infinite possibilities. That's all I was saying: we cannot fathom it (either because infinity is hard to fathom, or because its origins are humble - just a few core elements, or both, or something else).

Anyway, this discussion can head in any direction.


> The kind of progress being made leads me to believe there absolutely ARE people who absolutely know how the LLMs work

Just like alchemists made enormous strides in chemistry, but their goal was to turn piss into gold.


Isn't that just scale? Even small LLMs have more parts than any car.

LLMs are more analogous to economics, psychology, politics -- it is possible there's a core science with explicability, but the systems are so complex that even defining the question is hard.


You can make a bigger ICE engine (like a container ship engine) and still understand how the whole thing works. Maybe there’s more parts moving but it still has the structure of an ICE engine.

With neural networks big or small, we got no clue what’s going on. You can observe the whole system, from the weights and biases, to the activations, gradients, etc and still get nothing.

On the other hand, one of the reasons why economics, psychology and politics are hard is because we can’t open up people’s heads and define and measure what they’re thinking.


One way I've heard it summarized: Computer Science as a field is used to things being like physics or chemistry, but we've suddenly encountered something that behaves more like biology.

[flagged]


"God" as a concept is unproven to exist; it is also impossible to prove, so for all intents and purposes it doesn't exist.

Could be, but it does not change the fact that we do not understand them as of now.

We know how the next token is selected, but not why doing that repeatedly brings all the capabilities it does. We really don't understand how the emergent behaviours emerge.

Eh, I feel like that's mostly just down to this: yes, transformers are a "next token predictor", but during fine-tuning for instruct, the attention-related wagon slapped on the back is partially hijacked as a bridge from input tokens to sequences of connections in the weights.

For example if I ask "If I have two foxes and I take away one, how many foxes do I have?" I reckon attention has been hijacked to essentially highlight the "if I have x and take away y then z" portion of the query to connect to a learned sequence from readily available training data (apparently the whole damn Internet) where there are plenty of examples of said math question trope, just using some other object type than foxes.

I think we could probably prove it by tracing the hyperdimensional space the model exists in and ask it variants of the same question/find hotspots in that space that would indicate it's using those same sequences (with attention branching off to ensure it replies with the correct object type that was referenced).


It feels less like a word prediction algorithm and more like a world model compression algorithm. Maybe we tried to create one and accidentally created the other?

It's almost like a Model of Language, but very Large

Why would asking a question about ice cream trigger a consideration about all possible topics? As in, to formulate the answer, the LLM will consider the origin of Elephants even. It won’t be significant, but it will be factored in.

Why? In the spiritual realm, many postulated that even the Elephant you never met is part of your life.

None of this is a coincidence.


The "Wait" vs. "Hmm" discussion in the paper does not suggest we know how they work. If we knew, we wouldn't have to try things and measure to figure out the best prompt.

It feels like we're back in 1900 when anyone's clever idea (and implementation) can give huge performance improvements, such as Ford's assembly line and Taylor's scientific management of optimizing shovel sizes for coal.

yes, it also feels like we are going to lose our just-in-time global shipments of anything to anywhere any day now. It will soon feel like 1900 in other ways.

Hope we don't get 1914 again, too.

We’ll have to raise our own chickens too…

Agreed. Here are three things that I find surreal about the s1 paper.

(1) The abstract changed how I thought about this domain (advanced reasoning models). The only other paper that did that for me was the "Memory Resource Management in VMware ESX Server". And that paper got published 23 years ago.

(2) The model, data, and code are open source at https://github.com/simplescaling/s1. With this, you can start training your own advanced reasoning models. All you need is a thousand well-curated questions with reasoning steps.

(3) More than half the references in the paper are from 2024 and Jan 2025. Just look at the paper's first page. https://arxiv.org/pdf/2501.19393 In which other field do you see this?


Omg, another fan of "Memory Resource Management in VMware ESX Server"!! It's one of my favorite papers ever - so clever.

Now imagine where we are in 12 months from now. This article from February 5 2025 will feel quaint by then. The acceleration keeps increasing. It seems likely we will soon have recursive self-improving AI -- reasoning models which do AI research. This will accelerate the rate of acceleration itself. It sounds stupid to say it, but yes, the singularity is near. Vastly superhuman AI now seems to arrive within the next few years. Terrifying.

This is something I have been suppressing since I don't want to become chicken little. Anyone who isn't terrified by the last 3 months probably doesn't really understand what is happening.

I went from accepting I wouldn't see a true AI in my lifetime, to thinking it is possible before I die, to thinking it is possible in the next decade, to thinking it is probable in the next 3 years, to wondering if we might see it this year.

Just 6 months ago people were wondering if pre-training was stalling out and if we hit a wall. Then deepseek drops with RL'd inference time compute, China jumps from being 2 years behind in the AI race to being neck-and-neck and we're all wondering what will happen when we apply those techniques to the current full-sized behemoth models.

It seems the models that are going to come out around summer time may be jumps in capability beyond our expectations. And the updated costs means that there may be several open source alternatives available. The intelligence that will be available to the average technically literate individual will be frightening.


This frightens mostly people whose identity is built around "intelligence", but without grounding in the real world. I've yet to see really good articulations of what, precisely we should be scared of.

Bedroom superweapons? Algorithmic propaganda? These things have humans in the loop building them. And the problem of "human alignment" is one unsolved since Cain and Abel.

AI alone is words on a screen.

The sibling thread details the "mass unemployment" scenario, which would be destabilizing, but understates how much of the current world of work is still physical. It's a threat to pure desk workers, but we're not the majority of the economy.

Perhaps there will be political instability, but .. we're already there from good old humans.


Depends on the model I suppose. Atm everything is being heavily trained as LLMs without much capability outside of input text->output text aside from non-modelised calls out to the Internet/RAG system etc.

But at some point (still quite far away) I'm sure we'll start training a more general purpose model, or an LLM self-training will break outside of the "you're a language model" bounds and we'll end up with exactly that;

An LLM model in a self-training loop that breaks outside of what we've told it to be (a Language model), becomes a general purpose model and then becomes intelligent enough to do something like put itself out onto the Internet. Obviously we'd catch the feelers that it puts out and realise that this sort of behaviour is starting to happen, but imagine if we didn't? A model that trained itself to be general purpose but act like a constantly executing LLM, uploads itself to Hugging Face, gets run on thousands of clusters by people, because it's "best in class" and yes it's sitting there answering LLM type queries but also in the background is sending out beacons & communicating with itself between those clusters to...idk do something nefarious.


Some of the scariest horror movies are the ones where the monster isn't shown. Often once the monster is shown, it is less terrifying.

In a general sense, uncertainty causes anxiety. Once you know the properties of the monster you are dealing with you can start planning on how to address it.

Some people have blind and ignorant confidence. A feeling they can take on literally anything, no matter how powerful. Sometimes they are right, sometimes they are wrong.

I'm reminded by the scene in No Country For Old Men where the good guy bad-ass meets the antagonist and immediately dies. I have little faith in blind confidence.

edit: I'll also add that human adaptability (which is probably the trait on which most confidence in humans would rest) has shown itself capable of saving us from many previous civilization-changing events. However, this change with AI is happening much, much faster than any before it. So part of the anxiety is whether or not our species' reaction time is enough to avoid the cliff we are accelerating towards.


> without grounding in the real world.

> I've yet to see really good articulations of what, precisely we should be scared of. Bedroom superweapons?

Loss of paid employment opportunities and increasing inequality are real world concerns.

UBI isn't coming by itself.


Worst case scenario, humans mostly go back to manual labor, which would fix a lot of modern-day ailments such as obesity and (some) mental health struggles, with the added bonus of enormous engineering advancements based on automated research.

Manual labour jobs are not magically going to appear.

Sure, but those are also real world concerns in the non-AI alternate timeline. As is the unlikelihood of UBI.

Yes, but they are likely dramatically accelerated in the AI timeline.

> This frightens mostly people whose identity is built around "intelligence", but without grounding in the real world.

It has certainly had this impact on my identity; I am unclear how well-grounded I really am*.

> I've yet to see really good articulations of what, precisely we should be scared of.

What would such an articulation look like, given you've not seen it?

> Bedroom superweapons? Algorithmic propaganda? These things have humans in the loop building them.

Even with current limited systems — which are not purely desk workers, they're already being connected to and controlling robots, even by amateurs — AI lowers the minimum human skill level needed to do those things.

The fear is: how far are we from an AI that doesn't need a human in the loop? Because ChatGPT was almost immediately followed by ChaosGPT, and I have every reason to expect people to continue to make clones of ChaosGPT continuously until one is capable of actually causing harm. (As with 3d-printed guns, high chance the first ones will explode in the face of the user rather than the target).

I hope we're years away, just as self driving cars turned out to be over-promised and under-delivered for the last decade — even without a question of "safety", it's going to be hard to transition the world economy to one where humans need not apply.

> And the problem of "human alignment" is one unsolved since Cain and Abel.

Yes, it is unsolved since time immemorial.

This has required us to not only write laws, but also design our societies and institutions such that humans breaking laws doesn't make everything collapse.

While I dislike the meme "AI == crypto", one overlap is that both have nerds speed-running discovering how legislation works and why it's needed — for crypto, specifically financial legislation after it explodes in their face; for AI, to imbue the machine with a reason to approximate society's moral code, because they see the problem coming.

--

* Dunning Kruger applies; and now I have first-hand experience of what this feels like from the inside, as my self-perception of how competent I am at German has remained constant over 7 years of living in Germany and improving my grasp of the language the entire time.


> The intelligence that will be available to the average technically literate individual will be frightening.

That's not the scary part. The scary part is the intelligence at scale that could be available to the average employer. Lots of us like to LARP that we're capitalists, but very few of us are. There's zero ideological or cultural framework in place to prioritize the well being of the general population over the profits of some capitalists.

AI, especially accelerating AI, is bad news for anyone who needs to work for a living. It's not going to lead to a Star Trek fantasy. It means an eventual phase change for the economy that consigns us (and most consumer product companies) to wither and fade away.


> AI, especially accelerating AI, is bad news for anyone who needs to work for a living. It's not going to lead to a Star Trek fantasy. It means an eventual phase change for the economy that consigns us (and most consumer product companies) to wither and fade away.

How would that work? If there are no consumers then why even bother producing? If the cost of labor and capital trends towards zero then the natural consequence is incredible deflation. If the producers refuse to lower their prices then they either don’t participate in the market (which also means their production is pointless) or ensure some other way that the consumers can buy their products.

Our society isn’t really geared for handling double digit deflation so something does need to change if we really are accelerating exponentially.


> How would that work? If there are no consumers then why even bother producing?

Whim and ego. I think the advanced economy will shift to supporting trillionaires doing things like "DIY home improvement" for themselves. They'll own a bunch of automated resources (power generation, mining, manufacturing, AI engineers), and use it to do whatever they want. Build pyramids on the moon, while the now economically-useless former middle-class laborers shiver in the cold? Sure, why not?


> If there are no consumers then why even bother producing?

> If the producers refuse to lower their prices then they either don’t participate in the market (which also means their production is pointless) or ensure some other way that the consumers can buy their products.

Imagine you're a billionaire with a data centre and golden horde of androids.

You're the consumer, the robots make stuff for you; they don't make stuff for anyone else, just you, in the same way and for the same reason that your power tools and kitchen appliances don't commute to work — you could, if you wanted, lend them to people, just like those other appliances, but you'd have to actually choose to, it wouldn't be a natural consequence of the free market.

Their production is, indeed, pointless. This doesn't help anyone else eat. The moment anyone can afford to move from "have not" to "have", they drop out of the demand market for everyone else's economic output.

I don't know how big the impact of dropping out would be: the right says "trickle down economics" is good, and this would be the exact opposite of that; while the left's criticism of trickle-down economics is that in practice the super-rich already have so much stuff that making them richer doesn't enrich anyone else who might service them. So if the right is correct then this is bad, but if the left is correct then this makes very little difference.

Unfortunately, "nobody knows" is a great way to get a market panic all by itself.


I agree with you and I am scared. My problem is: if most people can't work, who is going to pay for the product/services created with AI?

I get a lot of "AI will allow us to create SaaS in a weekend" and "AI will take engineers' jobs", which I think may both be true. But a lot of SaaS survive because engineers pay for them -- if engineers don't exist anymore, a lot of SaaS won't either. If you eat your potential customers, creating quick SaaS doesn't make sense anymore (yeah, there are exceptions, etc., I know).


> My problem is: if most people can't work, who is going to pay for the product/services created with AI?

A lot of those will probably go under, too. I think a lot of people are in for a rude awakening.

The only people our society and economy really values are the elite with ownership and control, and the people who get to eat and have comfort are those who provide things that are directly or indirectly valuable to that elite. AI will enable a game of musical chairs, with economic participants iteratively eliminated as the technology advances, until there are only a few left controlling vast resources and capabilities, to be harnessed for personal whims. The rest of us will be like rats in a city, scraping by on the margins, unwanted, out of sight, subsisting on scraps, perhaps subject to "pest control" regimes.


This is the same conclusion I can't help but reach. I would love nothing more than to be convinced that (there is a chance that) that is not going to happen.

> The only people our society and economy really values are the elite with ownership and control

This isn’t true. The biggest companies are all rich because they cater to the massive US middle class. That’s where the big money is at.


> This isn’t true. The biggest companies are all rich because they cater to the massive US middle class..

It is true, but I can see why you'd be confused. Let me ask you this: if members of the "the massive US middle class" can be replaced with automation, are those companies going 1) to keep paying those workers to support the middle-class demand which made them rich, or are they going to 2) fire them so more money can be shoveled up to the shareholders?

The answer is obviously #2, which has been proven time and again (e.g. how we came to have "the Rust Belt").

> That’s where the big money is at

Now, but not necessarily in the future. I think AI (if it doesn't hit a wall) will change that, maybe not instantaneously, but over time.


It’s true, but I can see why you’d be confused. You conflated what the economy rewards (which is what caters to the large middle class pool of money) with what individual companies try to optimize for (eliminating labor costs).

So you end up with a huge starved mob trying to come for your mansions and islands. I somehow think Musk is totally capable of nuking those mobs, or unleashing the (future) AI dogs on them, because the mob cannot produce anymore (because of AI) and cannot pay anymore (because there are no jobs, because of AI). So the mob will be totally worthless to this style of "capitalism". Really why would they bother with UBI when they can let the mob just die out?

> Really why would they bother with UBI when they can let the mob just die out?

Personally, I think UBI is a ploy to keep the "huge starved mob[s]" pacified during the transition, when they still have enough power to act, before the tech oligarchs fully cement their control.

Once the common people are powerless to protect themselves and their interests, then they'll be left to die out.


Those people will simply be surplus to requirements. They'll be left alone as long as they don't get in the way of the ruling class, and disposed of if they do. As usual in history.

That's a fallacy. You can't have an advanced economy with most people sitting on the side. Money needs to keep flowing. If all that remains of the economy consists of a few datacenters talking to each other, how can the ruling class profit off that?

> Money needs to keep flowing. If all that remains of the economy consists of a few datacenters talking to each other, how can the ruling class profit off that?

Plenty of profit was made off feudalism, and technofeudalism has all the tools of modern technology at its disposal. If things go in that direction, they will have an unlimited supply of serfs desperate for whatever human work/livelihood is left.


Unlimited supply yes, but highly limited usage for them. So even if a few will work for free, the rest will be starving, and angry.

> You can't have an advanced economy with most people sitting on the side.

If AI lives up to the hype, that will become possible.

> If all that remains of the economy consists of a few datacenters talking to each other, how can the ruling class profit off that?

I don't think it would be that. There'd also be power generation, manufacturing, mining, and construction, etc.; but all extremely automated. If you get to truly extreme levels of wealth concentration, things would shift out of our capitalist market system model, and concepts like "profit" would become anachronisms.

It actually might kinda look like the "economy" of Starcraft: you gather resources, decide what to build with them, and order it all around according to your whim. There will be a handful of guys playing, and everyone else will be a NPC.


> It actually might kinda look like the "economy" of Starcraft: you gather resources, decide what to build with them, and order it all around according to your whim. There will be a handful of guys playing, and everyone else will be a NPC.

I guess if the “players” are sociopathic enough they might decide to just wipe out the NPCs. The possibility of someone like Putin or Musk becoming the sole member of the post-singularity humanity does make me pause.


That's assuming people like Altman can keep artificial superintelligence under human control. It very well may escape control and humanity would be disempowered forever. Or worse, wiped out.

That is assuming the accelerating AI stays under human control.

We're racing up a hill at an ever-increasing speed, and we don't know what's on the other side. Maybe 80% chance that it's either nothing or "simply" a technological revolution.


Yes, and Accelerationism predicted this development back in the 1990s, perhaps most prominently in the opening lines of Nick Land's Meltdown (1994) text:

  [[ ]] The story goes like this: Earth is captured by a technocapital singularity as renaissance rationalization and oceanic navigation lock into commoditization take-off. Logistically accelerating techno-economic interactivity crumbles social order in auto-sophisticating machine runaway. As markets learn to manufacture intelligence, politics modernizes, upgrades paranoia, and tries to get a grip.

> reasoning models which do AI research

In the introduction to my research project on Accelerationism [0], I write:

  Faced with the acceleration of progress in Artificial Intelligence (AI) — with AI agents now automating AI research and development —, Accelerationism no longer seems like an abstract philosophy producing empty hyperstitional hype, but like a sober description of reality. The failed 2023 memorandum to stop AI development on systems more powerful than OpenAI's ChatGPT-4 perfectly illustrates the phenomenological aspects of Accelerationism: "To be rushed by the phenomenon, to the point of terminal institutional paralysis, is the phenomenon." [1]

At the current rate of acceleration, if you don't write hyperstitionally, your texts are dead on arrival.

[0] https://retrochronic.com/

[1] Nick Land (2017). A Quick-and-Dirty Introduction to Accelerationism in Jacobite Magazine.


Hope we get the Nick Land the younger, and not Nick Land the elder, set of outcomes. Somewhere, sometime, along the way, it seems like everything from CCRU and Duginism leapt out of the page into the real. Maybe it's just the beginning of the Baudrillardian millennium.

Nice. Though I couldn't understand those "opening lines" until I read in your Introduction:

> For Land, capitalism begins in Northern Italy around 1500 with "the emerging world of technologists and accountants", the spiral interexcitation of "oceanic navigation and place-value calculation", and zero-unlocked double-entry book-keeping

Fibonacci, amongst many others, played a critical role in that highly accelerative technology.


In a way it's the same thing as finding that models got lazier closer to Christmas, ie the "Winter Break" hypothesis.

I'm not sure what caused the above, but in my opinion it's either that the training is affected by the date in the training data (i.e. it refuses to answer properly because in every year of the training data there were fewer or lower-quality examples at the end of the year), or it's a cultural impression of humans talking about going on holiday/having a break etc. in the training data at certain times, and the model associating this with the meaning of "having a break".

I still wonder if we're building models wrong by training them on a huge amount of data from the Internet, then fine tuning for instruct where the model learns to make certain logical associations inherent or similar to the training data (which seems to introduce a myriad of issues like the strawberry problem or is x less than y being incorrect).

I feel like these models would have a lot more success if we trained a model to learn logic/problem solving separately without the core data set or to restrict the instruct fine tuning in some way so that we reduce the amount of "culture" it gleans from the data.

There's so much that we don't know about this stuff yet and it's so interesting to see something new in this field every day. All because of a wee paper on attention.


I think a skill here is learning a bias for experimentation and accepting the results one finds. Also the book "Why Greatness Cannot Be Planned" showcases the kind of open ended play that results in people discovering stuff like this.

One thing to realize is that we as humans have thinking steps (an internal monologue) before we output text. When LLMs produce text, we expect this thinking process to happen as well, but it does not - they are 'idiots that babble the first thing that comes to their minds'.

The above 'hack' is one of many realizations of the above differences.


Wait, so the trick is they reach into the context and basically switch '</think>' with 'wait' and that makes it carry on thinking?

Not sure if your pun was intended, but 'wait' probably works so well because of the models being trained on text structured like your comment, where "wait" is followed by a deeper understanding.

Yes, that's explicitly mentioned in the blog post:

>In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".
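In pseudocode the loop is roughly this (a sketch built around a hypothetical `next_token` decoding helper, not the s1 authors' actual implementation):

  MAX_FORCED_WAITS = 2  # the "budget" in budget forcing

  def think_with_budget_forcing(prompt):
      out, forced = "", 0
      while True:
          tok = next_token(prompt + out)   # hypothetical single decoding step
          if tok == "</think>" and forced < MAX_FORCED_WAITS:
              out += "Wait"                # suppress end-of-thinking, keep reasoning
              forced += 1
              continue
          out += tok
          if tok == "</think>":            # budget spent: let it stop thinking
              return out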


Yes, that's one of the tricks.

I've noticed that R1 says "Wait," a lot in its reasoning. I wonder if there's something inherently special in that token.

Semantically, wait is a bit of a stop-and-breathe point.

Consider the text:

I think I'll go swimming today. Wait, ___

what comes next? Well, not something that would usually follow without the word "wait", probably something entirely orthogonal that impacts the earlier sentence in some fundamental way, like:

Wait, I need to help my dad.


Yes, R1 seems to mostly use it like that. It's either to signal a problem with its previous reasoning, or if it's thought of a better approach. In coding it's often something like "this API won't work here" or "there's a simpler way to do this".

I guess it goes to show how important reiteration is for general logic problems. And tbf when finding a solution to something myself I'll consider each part, and/or consider parts in relation to each other and/or consider all parts in relation to each other (on a higher level) before coming to a final solution.

It's weird because I feel like we should've known that from work in general logic/problem solving studies, surely?


I bet a token like "sht!", "f*" or "damn!" would have the same or even stronger effect but the LLM creators would not like to have the users read them

It's literally in the article: they measured it, and "wait" was the best token.

Maybe, but it doesn't just use it to signify that it's made a mistake. It also uses it in a positive way, such as it's had a lightbulb moment. Of course some people use expletives in the same way, but that would be less common than for mistakes.

I think you're onto something; however, as the training is done on text and not actual thoughts, it may take some experimentation to find these stronger words.

> a branch of computer science

It should be considered a distinct field. At some level there is overlap (information theory, Kolmogorov complexity, etc.), but prompt optimization and model distillation is far removed from computability, formal language theory, etc. The analytical methods, the techniques to create new architectures, etc. are very different beasts.


Almost seems more like computer engineering. Is it really that different than signal/image processing?

I suspect CS departments don’t want to concede because they are now in the limelight…


I agree - I don't know what field it formally is, but computer science it is not. It is also related to information retrieval aka "Google skills", problem presentation, 'theory of mind', even management and psychology. I'm saying the latter because people often ridicule AI responses for giving bad answers that are 'too AI'. But often it is simply because not enough context-specific information was given to allow the AI to give a more personalized response. One should compare the response to "If I had asked a random person on the internet this query, what might I have gotten?" If you write "The response should be written as a <insert characteristics, context, whatever you feel is relevant>" it will deliver a much less AI-sounding response. This is just as much about how you pose a problem in general as it is about computer science.

Hm, I am surprised that people who are presumably knowledgeable about how attention works are surprised by this. The more tokens in the output, the more computation the model is able to do overall. Back in September, when I was testing my iOS hands-free voice AI prototype powered by an 8B LLM, whenever I wanted it to give really thoughtful answers to philosophical questions, I would instruct it to output several hundred whitespace characters (because they are not read aloud) before the actual answer.
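For illustration, the kind of instruction I mean was roughly this (a paraphrase, not the exact prompt):

  Before giving your answer, first output about 300 space characters on a line
  by themselves, then write your final answer. The spaces are never read aloud.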

What I am more surprised about is why models actually seem to have to produce "internal thoughts" instead of random tokens. Maybe during training, having completely random tokens in the thinking section derailed the model's thought process in the same way background noise can derail ours?


There are more than 10 different ways that I know for sure will improve LLMs, just like `wait`. It is part of the CoT. I assume most researchers know this. CoT is as old as 2019.

Chain of thought (CoT)?

Mind elaborating ?

I mean the “wait” thing is obvious if you’ve ever asked an LLM to look at its own response and ask if it’s really sure about its answer.

May sound like a conspiracy theory, but NVIDIA and a whole lot of AI startups have a strong vested interest to not seek+publish such findings.

If I don’t need a huge model and GPU, then AI is little more than an open source program running on an idle PC.

I feel like AI was NVIDIA’s lifeboat as GPU mining waned. Don’t see anything after that in the near future.


I think NVIDIAs future is pretty bright.

We're getting to the run-your-capable-LLM on-prem or at-home territory.

Without DeepSeek (and hopefully its successors) I wouldn't really have a usecase for something like NVIDIAs Project Digits.

https://www.nvidia.com/en-us/project-digits/


Except I can run R1 1.5b on a GPU-less and NPU-less Intel NUC from four-five years ago using half its cores and the reply speed is…functional.

As the models have gotten more efficient and distillation better the minimum viable hardware for really cooking with LLMs has gone from a 4090 to suddenly something a lot of people already probably own.

I definitely think a Digits box would be nice, but honestly I’m not sure I’ll need one.


R1 1.5b won’t do what most people want at all.

No, it won't. But that's not the point I was making

Yeah but what was R1 trained with? 50k GPUs as far as I've heard as well as distillation from OpenAI's models (basically leaning on their GPUs/GPU time).

Besides the fact that consumers will still always want GPUs for gaming, rendering, science compute etc.

No, I don't have any Nvidia stocks.


I mean is "wait" even the ideal "think more please" phrase? Would you get better results with other phrases like "wait, a second", or "let's double-check everything"? Or domain-dependent, specific instructions for how to do the checking? Or forcing tool-use?

its fascinating how certain political movements avoid that Wait moment...

I'm strictly speaking never going to think of model distillation as "stealing." It goes against the spirit of scientific research, and besides every tech company has lost my permission to define what I think of as theft forever

At most it would be illicit copying.

Though it's poetic justice that OpenAI is complaining about someone else playing fast and loose with copyright rules.


The First Amendment is not just about free speech, but also the right to read; the only question is whether AI has that right.

Does my software have the right to read the contents of a DVD and sell my own MP4 of it? No. If a streamer plays a YouTube video on their channel, is the content original? Yes. When GPT-3 was training, people saw it as a positive. When people started asking ChatGPT more things rather than searching sites, it became a negative.

If AI was just reading, there would be much less controversy. It would also be pretty useless. The issue is that AI is creating its own derivative content based on the content it ingests.

Isn't any answer to a question which hasn't been previously answered a derivative work? Or when a human write a parody of a song, or when a new type of music is influenced by something which came before.

This argument is so bizarre to me. Humans create new, spontaneous thoughts. AI doesn’t have that. Even if someone’s comment is influenced by all the data they have ingested over their lives, their style is distinct and deliberate, to the point where people have been doxxed before/anonymous accounts have been uncovered because someone recognized the writing style. There’s no deliberation behind AI, just statistical probabilities. There’s no new or spontaneous thoughts, at most pseudorandomness introduced by the author of the model interface.

Even if you give GenAI unlimited time, it will not develop its own writing/drawing/painting style or come up with a novel idea, because strictly by how it works it can only create „new” work by interpolating its dataset


This argument is so bizarre to me.

There is no evidence whatsoever to support that humans create "new, spontaneous thoughts" in any materially, qualitatively different way than an AI does - in other words, as a Turing-computable function over the current state. It may be that current AIs can't, but the notion that there is some fundamental barrier is a hypothesis with no evidence to support it.

> Even if you give GenAI unlimited time, it will not develop its own writing/drawing/painting style or come up with a novel idea, because strictly by how it works it can only create „new” work by interpolating its dataset

If you know of any mechanism whereby humans can do anything qualitatively different, then you'd have the basis for a Nobel Prize-winning discovery. We know of no mechanism that could allow humans to exceed the Turing computability that AI models are limited to.

We don't even know how to formalize what it would mean to "come up with a novel idea" in the sense you appear to mean, as presumably, something purely random would not satisfy you, yet something purely Turing computable would also not do, but we don't know of any computable functions that are not Turing computable.


This argument, by now a common refrain from defenders of companies like OpenAI, misses the entire putative point of intellectual property, and the point of law in general. It is a distraction of a common sort - an attempt to reframe a moral and legal question into an abstract ontological one.

The question of whether the mechanism of learning in a human brain and that in an artificial neural network is similar is a philosophical and perhaps technical one that is interesting, but not relevant to why intellectual property law was conceived: to economically incentivize human citizens to spend their time producing creative works. I don't actually think property law is a good way to do this. Nonetheless, when massive capital investments are used to scrape artists' work in order to undercut their ability to make a living from that work, for the benefit of private corporations that do not have the artists' consent, the question is whether this should violate the artificial notion of intellectual property we constructed for exactly this purpose - and in that sense, it's fairly obvious that the answer is yes.


I wasn't responding to a moral and legal question. I was responding to a comment arguing that humans are some magical special case in nature.

If you want to argue it's a distraction, argue that with the person I replied to, who was the person who changed the focus.


Yeah, I'll give you that. But many people seem to have the argument you've made loaded up for these conversations - an argument which is dubious on its own terms, by the way, since we don't really have a complete picture of human learning, and the assumption that it simply follows the mechanisms we understand from machine learning is not a null hypothesis exempt from justification - and it needs to be addressed wherever possible that the ontological question is not what matters here.

> which is dubious on its own terms, by the way, as we don't really have a complete picture of human learning and the assumption that it simply follows the mechanisms we understand from machine learning is not a null hypothesis that doesn't demand justification

The argument I made in no way rests on a "complete picture of human learning". The only thing it rests on is the lack of evidence of computation exceeding the Turing computable set. Finding evidence of such computation would upend physics, symbolic logic, and maths. It'd be a finding that would guarantee a Nobel Prize.

I gave the justification. It's a simple one, and it stands on its own. There is no known computable function that exceeds the Turing computable, and all Turing computable functions can be computed on any Turing complete system. Per the extended Church-Turing thesis, this includes any natural system given the limitations of known physics. In other words: unless you can show new, unknown physics, human brains are computers with the same limitations as any electronic computer, and the notion of "something new" arising from humans, other than as a computation over pre-existing state, in a way an electronic computer can't also do, is an entirely unsupportable hypothesis.

> and it needs to be addressed wherever possible that the ontological question is not what matters here

It may not be what matters to you, but to me the question you clearly would prefer to discuss is largely uninteresting.


> In other words: As a Turing-computable function over the current state.

You need to be a bit more expansive. Turing-computable functions need to halt and return eventually. (And they need to be proven to halt.)

> We know of no mechanism that could allow humans to exceed the Turing computability that AI models are limited to.

Depends on which AI models you are talking about? When generating content, humans have access to vastly more computational resources than current AI models. To give a really silly example: as a human I can swirl some water around in a bucket and be inspired by the sight. A current AI model does not have the computational resources to simulate the bucket of water (nor does it have a robotic arm and a camera to interact with the real thing instead.)


> You need to be a bit more expansive. Turing-computable functions need to halt and return eventually. (And they need to be proven to halt.)

This is pedantry. Any non-halting function can be decomposed into a step function and a loop. What matters is that step function. But ignoring that, human existence halts, and so human thought processes can be treated as a singular function that halts.

> Depends on which AI models you are talking about? When generating content, humans have access to vastly more computational resources than current AI models. To give a really silly example: as a human I can swirl some water around in a bucket and be inspired by the sight. A current AI model does not have the computational resources to simulate the bucket of water (nor does it have a robotic arm and a camera to interact with the real thing instead.)

An AI model does not have computational resources. It's a bunch of numbers. The point is not the actual execution but theoretical computational power if unconstrained by execution environment.

The Church-Turing thesis also presupposes an unlimited amount of time and storage.


Yes, that's why we need something stronger than the Church-Turing thesis.

See https://scottaaronson.blog/?p=735 'Why Philosophers should care about Computational Complexity'

Basically, what the brain can do in reasonable amounts of time (eg polynomial time), computers can also do in polynomial time. To make it a thesis something like this might work: "no physically realisable computing machine (including the brain) can do more in polynomial time than BQP already allows" https://en.wikipedia.org/wiki/BQP


If people were claiming that a computer might be able to, but will be to slow, that might be an angle to take, but to date, in these discussions, none of the people arguing that brains can do more have argued that they're just more efficient, but that they inherently have more capabilities, so it's an unnecessarily convoluted argument.

> Humans create new, spontaneous thoughts

I don't believe we do; just look at media - most plot-lines in movies/TV are little more than "boy meets girl" or Pocahontas retold.

And if you say that a model could not create anything new because of its static data set but humans could... I disagree with that, because us humans are working with a data set that we add to some days, but if we use the example of writing a TV script, the writer draws from knowledge (gained through life experience) that is as finite as a model's training set.

I've made this sort of comment before. Even look to high fantasy; what are elves but humans with different ears? Goblins are just little humans with green skin. Dragons are just big lizards. Minotaurs are just humans but mixed with a bull. We basically create no new ideas - 99% of human "creativity" is just us riffing on things we know of that already exist.

I'd say the incidences of humans having a brand new thought or experience not rooted in something that already exists is very, very low.

Even just asking free chat gpt to make me a fantasy species with some culture and some images of the various things it described does pretty well; https://imgchest.com/p/lqyeapqkk7d. But it's all rooted in existing concepts, same as anything most humans would produce.


> Humans create new, spontaneous thoughts.

The compatibility of determinism and freedom of will is still controversially debated. There is a good chance that Humans don’t „create“.

> There’s no deliberation behind AI, just statistical probabilities. There’s no new or spontaneous thoughts, at most pseudorandomness introduced by the author of the model interface.

You can say exactly the same about deterministic humans since it is often argued that the randomness of thermodynamic or quantum mechanical processes is irrelevant to the question of whether free will is possible. This is justified by the fact that our concept of freedom means a decision that is self-determined by reasons and not a sequence of events determined by chance.


> The compatibility of determinism and freedom of will is still controversially debated. There is a good chance that Humans don’t „create“.

Determinism and free will are pretty irrelevant here.

Unless P=NP, there's no way for us to distinguish in general between eg pseudo random systems and truly random systems from the outside.

Btw, I don't think determinism in humans/AI has anything to do with deliberation.

The newest AI models are allowed to deliberate. At least by some meanings of the word.

> This is justified by the fact that our concept of freedom means a decision that is self-determined by reasons and not a sequence of events determined by chance.

Well, different people have different definitions here. None of them very satisfying.


> Determinism and free will are pretty irrelevant here.

No. It's the other way around. Free will is the basis for „creating something new“.

> Btw, I don't think determinism in humans/AI has anything to do with deliberation.

With determinism there is no deliberation.


> With determinism there is no deliberation.

As far as we can tell, all the laws of the universe are completely deterministic. (And that includes quantum mechanics.) As far as we can tell, human beings obey the laws of physics.

(To explain: quantum mechanics as a theory is completely deterministic and even linear. Some outdated interpretations of quantum mechanics, like Copenhagen, use randomisation. But interpretations don't make a difference to what the underlying theory actually is. And more widely accepted interpretations like 'Many Worlds' preserve the determinism of the underlying theory.)

Btw, neural nets are typically sampled from, and you can use as good a random number generator (even a physical random number generator) as there is, if you want to. I don't think it'll change what we think neural nets are capable of.


That's exactly their point (and mine), with respect to the person above arguing humans unlike AI can create "new things". For that distinction to make sense "new things" must be interpreted as "something that can't be deterministically derived from the current world state", as they're trying to create a distinction between a purely deterministic algorithm and human consciousness.

Kind of. The constitution as a whole, and the amendments, don't give you the right to do anything; you have the right to do whatever you want whenever you want. The constitution tells the government what it can and cannot stop you from doing.

I'm not sure the US 'First Amendment' is relevant here? DeepSeek is in China.

It's hardly even illicit - at least in the United States, the output of an AI isn't copyrightable.

Was that decided in courts, yet?

In any case, copyright ain't the only thing that prevents copying.


Stochastic decompression. Dass-it.

I think it's less about that and more whether or not they used the free or paid API.

I think if OpenAI (or any other company) are paid for their compute time/access as anybody would, then using content generated by other models is fair game. Because it's an active/ongoing cost and not a passive one.

Whereas if someone trained on my dumb tweets or HN posts, then so be it; it's a passive cost for me - I paid my time to say x thing for my own benefit (tribal monk-e social interaction), therefore I have already gotten the value out of it.


Maybe but something has gotta pay the bills to justify the cutting edge. I guess it's a similar problem to researching medicine.

Well, the artists and writers also want to pay their bills. We threw them under the bus; might as well throw OpenAI under it too and get an actual open AI that we can use.

The investment thrown at OpenAI seems deeply inflated for how much meaningful progress they're able to make with it

I think it's clear that innovative breakthroughs in bleeding-edge research are not just a matter of blindly hurling more money at a company to build unprecedentedly expensive datacenters

But also, even if that was a way to do it, I don't think we should be wielding the law to enable privately-held companies to be at the forefront of research, especially in such a grossly inconsistent manner


If chain of thought acts as a scratch buffer by providing the model more temporary "layers" to process the text, I wonder if it would make sense to make this buffer a separate context with its own separate FNN and attention. In essence, there's a macroprocess of "reasoning" that takes unbounded time to complete, and then a microprocess of describing this incomprehensible stream of embedding vectors in natural language - in a way returning to the encoder/decoder architecture, but where both encoder and decoder are autoregressive. Maybe this would give us a denser representation of said "thought", not constrained by imitating human text.

I've had an idea since I was a kid which I can share. I was contemplating AI and consciousness generally, probably around the time I read "The Minds I".

I reflected on the pop-psychology idea of consciousness and subconsciousness. I thought of each as an independent stream of tokens, like stream of consciousness poetry. But along the stream there were joining points between these two streams, points where the conscious stream was edited by the subconscious stream. You could think of the subconscious stream as performing CRUD like operations on the conscious stream. The conscious stream would act like a buffer of short-term memory while the subconscious stream would act like a buffer of long-term memory. Like, the subconscious has instructions related to long-term goals and the conscious stream has instructions related to short-term goals.

You can imagine perception as input being fed into the conscious stream and then edited by the subconscious stream before execution.

It seems entirely possible to actually implement this idea in this current day and age. I mean, it was a fever dream as a kid, but now it could be an experiment!


Consciousness as the subconscious pretending not to be subconscious, something like that - a thin wrapper. CRUD makes sense.

It gels closely with Buddhism - hell, all religions.


Have you read Jaynes' "The Origin of Consciousness in the Breakdown of the Bicameral Mind"?

I haven't read the original but I am familiar with the broad stroke view. There are similarities (perhaps vague) in the more recent work of someone like McGilchrist and his The Master and His Emissary (another book which I only have a broad stroke view of).

At the time I had this idea I did not know of either of these. I think I was drawing explicitly on the conscious / subconscious vocabulary.


Here's a paper your idea reminds me of. https://arxiv.org/abs/2501.19201

It's also not so far from Meta's Large Concept Model idea.


Previous discussion:

[41 comments, 166 points] https://news.ycombinator.com/item?id=42919597


I had this exact same thought yesterday.

I’d go so far as to add one more layer to monitor this one and stop adding layers. My thinking is that this meta awareness is all you need.

No data to back my hypothesis up. So take it for what it’s worth.


This is where I was headed, but I think you said it better. Some kind of executive process monitoring the situation, the random stream of consciousness, and the actual output. Looping back around to outdated psychology: you have the ego, which is the output (speech); the super-ego is the executive process; and the id is the <think>internal monologue</think>. This isn't the standard definition of those three, but it's close enough.

My thought along the same lines: do all tokens live in the same latent space, or in many spaces, with each logical unit trained separately from the others?

> this incomprehensible stream of embedding vectors as natural language explanation, in a way returning to encoder/decoder architecture

This is just standard decoding; the stream of vectors is called the KV cache.


The problem is that RL is extremely inefficient. It's one thing to use it for fine tuning an LLM to do the chain of thought trick and quite another to do thinking entirely from scratch. The pretrained LLM does a lot of heavy lifting there.

And it would have to be RL for your idea to work since there is no "thinking" dataset for a novel token space. There isn't even one for existing LLM token space, but they have the base model to work off of. When the thought is expressed in English, the model already knows the relationships between the tokens in the thought, it's merely repurposing it for a "thinking" application.


> The problem is that RL is extremely inefficient.

Wait, what? That is an odd way of defining it. That's like saying Turing machines are an inefficient way to solve TSP. You would, at the least, want to define this in terms of complexity, or put it into the context of domains and observability.

RL, by definition, is a field about finding efficient solutions to problems in the domain of choice [1]. There are likely regimes in LLM/LRM learning where RL can be quite efficient, polynomial time even in the state space; we just need to explore and find them. For example, you can use dynamic programming as a "more" efficient way to solve MDPs [1] because it is polynomial in the size of the state space × action space.

[1]https://web.stanford.edu/class/psych209/Readings/SuttonBarto...


RL provides a very poor training signal for deep learning - an order of magnitude or more worse than supervised learning. Better than nothing, of course.

What the OP suggested is similar to training a transformer from scratch using RL (ie. no training tokens) towards an objective of steering a pretrained LLM to produce human readable output. It will probably not even converge, and if it does it would take immense compute.


In the case of supervised problem domains, you implicitly make a decision about what is signal, and what is noise, and sure, in that closed setting, supervised learning is much more sample efficient. But I think what we're learning now is that with strong enough base models, 'aha' moments in RL training show that it might be possible to essentially 'squeeze out signal from language itself', giving you far greater breadth of latent knowledge than supervised examples, and letting you train to generalize to far greater horizons than a fixed dataset might allow. In a fascinating way it is rather reminiscent of, well, abiogenesis. This might sound like speculative claptrap if you look at the things the current generation of models are still weak at, but... there's a real chance that there is a very heavy tail to the set of outcomes in the limit.

With a pretrained LLM most of the work is done. RL just steers the model into a 'thinking' mode. There is enough signal for that to work and for the inefficiency to not matter.

The downside is that you are limiting the model to think in the same language it outputs. An argument could be made that this is not how all humans think. I know that I rarely think in language or even images, just concepts (probably isn't even the right word) mix and transform and often I don't even bother to make the transformation to language at the end, just action.


I strongly agree; in fact I think what best matches the thought process is something like the multiset tree/forest workspace approach as suggested by Marcolli, Chomsky, and Berwick - a Hopf algebra that can be externalized into (non-planar) embeddings of linearized strings, or alternately into semantic manifolds.

Once we train models on the chain of thought outputs, next token prediction can solve the halting problem for us (eg, this chain of thinking matches this other chain of thinking).

I think that is how human brains work. When we practice, at first we have to be deliberate (thinking slow). Then we “learn” from our own experience and it becomes muscle memory (thinking fast). Of course, it increases the odds we are wrong.

Or worse, we incorrectly overweight the wrong chain of thinking to an irrelevant output (but pragmatically useful output), at scale.

For example, xenophobia as a response to economic hardship is the wrong chain of thinking embedded in the larger zeitgeist.


Comments on a google doc? Nesting in social media comments?

Seems like similar concepts. I think there is some potential in improving how LLMs refine and further their own reasoning lines, but I'm no AI mage.


Off topic, but I just bookmarked Tim’s blog, great stuff.

I dismissed the X references to S1 without reading them - big mistake. I have been working generally in AI for 40 years and in neural networks for 35 years, and the exponential progress since the hacks that made deep learning possible has been breathtaking.

Reduction in processing and memory requirements for running models is incredible. I have been personally struggling with creating my own LLM-based agents with weaker on-device models (my same experiments usually work with 4o-mini and above models) but either my skills will get better or I can wait for better on device models.

I was experimenting with the iOS/iPadOS/macOS app On-Device AI last night, and the person who wrote it was successful in getting web search tool calling working with a very small model - something that I have been trying to perfect.


If an LLM output is like a sculpture, then we have to sculpt it. I never did sculpting, but I do know they first get the clay spinning on a plate.

Whatever you want to call this “reasoning” step, ultimately it really is just throwing the model into a game loop. We want to interact with it on each tick (spin the clay), and sculpt every second until it looks right.

You will need to loop against an LLM to do just about anything and everything, forever - this is the default workflow.

Those who think we will quell our thirst for compute have another thing coming, we’re going to be insatiable with how much LLM brute force looping we will do.


I can't believe this hasn't been done yet, perhaps it is a cost issue.

My literal first thought about AI was wondering why we couldn't just put it in a loop. Heck, one update per day, or one update per hour would even be a start. You have a running "context", the output is the next context (or a set of transformations on a context that is a bit larger than the output window). Then ramp that up ... one loop per minute, one per second, millisecond, microsecond.
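
A rough sketch of that loop, with `llm` standing in for whatever completion function you have (all names here are hypothetical, not from any existing system):

  STABLE_PREFIX = ("You are a long-running process. Maintain a running summary of "
                   "your goals, observations, and next steps.")

  def run_loop(llm, seed_context, ticks=10):
      context = seed_context
      for _ in range(ticks):
          # old context in, new context out; the stable prefix is the part that never changes
          context = llm(f"{STABLE_PREFIX}\n\nCurrent context:\n{context}\n\nUpdated context:").strip()
      return context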


The hard part is coming up with a good way to grade results. Which you need to update the weights based on the outcome, otherwise the model will not actually learn anything.

For the "looping" I'm talking about you don't need to update the weights. It is simply, old context in, new context out, new context in, new-new context out, etc.

Of course, keeping that coherent over numerous loops isn't going to be easy. No doubt there is a chance it goes off the rails. So you might have a section of context that is constantly stable, a section of context that updates each loop, etc.

In the other response to my comment someone mentioned eventually updating the weights (e.g. daily) and you would in that case have to have some kind of loss function.


Then I'm not quite sure what benefit you expect to derive from it? Making e.g. QwQ-32 loop isn't hard - it often does it all by itself, even. But it doesn't translate to improvements on every iteration; it just goes in circles.

When you look out your eyes, that rectangular viewport is all your context. For example, we cannot fit Jupiter into your viewport.

So, if we can never fit large concepts like Jupiter into your viewport (your context window), does it follow that you will never understand Jupiter? Or is there no way to take a picture of Jupiter and make it smaller so it fits into your viewport?

See, coherence is just resizing or reimagining things so it fits into context. Context can never hold it all because we have the capacity to always imagine infinitely large things.

So I don’t fully know if it’s old context out, new context in, but could be just the right context, just enough of it, and just the right looking context so we can assemble the right worldview. This process would have to go on infinitely, and that is the world as you see and understand it.


Same. And the next step is that it must feed back into training, to form long-term memory and to continually learn.

I analogize this with sleep. Perhaps that is what is needed: 6 hours offline per day to LoRA the base model on some accumulated context from the day.

LLMs need to sleep too. Do they dream of electric sheep?

This is a fantastic insight and really has my gears spinning.

We need to cluster the AI's insights on a spatial grid hash, give it a minimap with the ability to zoom in and out, and give it the agency to try and find its way to an answer and build up confidence and tests for that answer.

coarse -> fine, refine, test, loop.

Maybe a parallel model that handles the visualization stuff. I imagine its training would look more like computer vision. Mind palace generation.

If you're stuck or your confidence is low, wander the palace and see what questions bubble up.

Bringing my current context back through the web is how I think deeply about things. The context has the authority to reorder the web if it's "epiphany grade".

I wonder if the final epiphany at the end of what we're creating is closer to "compassion for self and others" or "eat everything."


> If an LLM output is like a sculpture, then we have to sculpt it. I never did sculpting, but I do know they first get the clay spinning on a plate.

That’s pottery, not sculpture. Traditionally in sculpture you start from a block of marble or wood, but you can also make sculptures of cast bronze or welded steel (or clay, but you don’t use a spinning plate).


Thank you for the clarification. I wanted to use some kind of visual to show the model in a loop. Otherwise, I’d just have to say explicitly that the sculptor is the one in the loop, as in the person will not stop chiseling. It’s in this infinite chiseling that we get our answers (same thing as finding a limit in calculus as it approaches infinity, we will never get the discrete answer, but we will get infinitely close enough to label a discrete point confidently).

In other words, we fly as close to the sun as possible and get our measurements :)


> having 10,000 H100s just means that you can do 625 times more experiments than s1 did

I think the ball is very much in their court to demonstrate they actually are using their massive compute in such a productive fashion. My BigTech experience would tend to suggest that frugality went out the window the day the valuation took off, and they are in fact just burning compute for little gain, because why not...


This is pure speculation on my part but I think at some point a company's valuation became tied to how big their compute is so everybody jumped on the bandwagon.

I don't think you need to speculate too hard. On CNBC they are not tracking revenue, profits, or technical breakthroughs, but how much the big companies are spending (on GPUs). That's the metric!

I probably don't have to repeat it, but this is a perfect example of Goodhart's Law: when a metric is used as a target, it loses its effectiveness as a metric.

If you were a reporter who didn't necessarily understand how to value a particular algorithm or training operation, but you wanted a simple number to compare the amount of work OpenAI vs. Google vs Facebook are putting into their models, yeah, it makes sense. How many petaflops their datacenters are churning through in aggregate is probably correlated to the thing you're trying to understand. And it's probably easier to look at their financials and correlate how much they've spent on GPUs to how many petaflops of compute they need.

But when your investors are giving you more money based on how well they perceive you're doing, and their perception is not an oracle but is instead directly based on how much money you're spending... the GPUs don't actually need to do anything other than make number go up.


This feels like one of those stats they show from 1929 and everyone is like “and they didn’t know they were in a bubble?”

> but how much the big companies are spending (on gpus). That's the metric!

Burn rate based valuations!

The 2000's are back in full force!


"But tulip sales keep increasing!"

They absolutely are tracking revenues/profits on CNBC, what are you talking about?

Matt Levine tangentially talked about this during his podcast this past Friday (or was it the one before?). His point was that it's a reasonable way to value these companies according to their compute size, since those chips are very valuable. At a minimum, the chips are an asset that acts as collateral.

I hear this a lot, but what the hell. It's still computer chips. They depreciate. Short supply won't last forever. Hell, GPUs burn out. It seems like using ice sculptures as collateral, and then spring comes.

If so, wouldn't it be the first time in history that more processing power went unused?

In my experience CPU/GPU power is used up as much as possible. Increased efficiency just leads to more demand.


I think you're missing the point: H100 isn't going to remain useful for a long time, would you consider Tesla or Pascal graphic cards a collateral? That's what those H100 will look like in just a few years.

Not sure I do tbh.

Any asset depreciates over time. But they usually get replaced.

My 286 was replaced by a faster 386, and that by an even faster 486.

I’m sure you see a naming pattern there.


> Any asset depreciates over time.

That's why "those chips are very valuable" is not necessarily a good way to value companies - and it isn't, unless they can extract the value from the chips before those chips become worthless.

> But they usually get replaced.

They usually produce enough income to cover depreciation so you actually have the cash to replace them.


My 1070 was replaced by… nothing; I moved it from a Haswell box to an Alder Lake box.

Given that inference time will soon be extremely valuable with agents and <thinking> models, H100s may yet be worth something in a couple years.


And that's why such assets represents only a marginal part of valuation. (And if you look at accounting, this depreciation is usually done over three years for IT hardware, and as such most of these chips have already lost half of their accounting value in the balance sheet).

> My 286 was replaced by a faster 386 and that by an even faster 468.

How much was your 286 chip worth when you bought your 486?


Yeah, exactly! I've got some 286, 386, and 486 CPUs that I want to claim as collateral!

That is the wrong take. Depreciated and burned-out chips are replaced, and total compute value typically increases over time. Efficiency gains are also calculated and projected over time. Seasons are inevitable and cyclical. Spring might be here, but winter is coming.

Year over year gains in computing continue to slow. I think we keep forgetting that when talking about these things as assets. The thing controlling their value is the supply which is tightly controlled like diamonds.

They have a fairly limited lifetime even if progress stands still.

Last I checked AWS 1-year reserve pricing for an 8x H100 box more than pays for the capital cost of the whole box, power, and NVIDIA enterprise license, with thousands left over for profit. On demand pricing is even worse. For cloud providers these things pay for themselves quickly and print cash afterwards. Even the bargain basement $2/GPU/hour pays it off in under two years.
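
Back-of-the-envelope version, with every number below a rough assumption rather than a quote:

  gpu_hour_rate = 2.00                          # assumed "bargain basement" $/GPU/hour
  gpus_per_box = 8
  yearly_revenue = gpu_hour_rate * gpus_per_box * 24 * 365   # ~$140k/year at full utilization
  box_cost = 250_000                            # assumed all-in cost of an 8x H100 server
  payback_years = box_cost / yearly_revenue     # ~1.8 years, ignoring power, labor, idle time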

Labor! You need it to turn the bill of sale into a data center and keep it running. The bargain basement would be even cheaper otherwise...

Honestly, I don't fully understand the reason for this shortage.

Isn't it because we insist on only using the latest nodes from a single company for manufacture?

I don't understand why we can't use older process nodes to boost overall GPU making capacity.

Can't we have tiers of GPU availability?

Why is Nvidia not diversifying aggressively to Samsung and Intel, no matter the process node?

Can someone explain?

I've heard packaging is also a concern, but can't you get Intel to figure that out with a large enough commitment?


> Isn't it because we insist on only using the latest nodes from a single company for manufacture?

TSMC was way ahead of anyone else introducing 5nm. There's a long lead time porting a chip to a new process from a different manufacturer.

> I don't understand why we can't use older process nodes to boost overall GPU making capacity.

> Can't we have tiers of GPU availability?

Nvidia does this. You can get older GPUs, but more performance is better for performance-sensitive applications like training or running LLMs.

Higher performance needs better manufacturing processes.


> Year over year gains in computing continue to slow.

This isn't true in the AI chip space (yet). And so much of this isn't just about compute but about the memory.


From a per mm2 performance standpoint things absolutely have slowed considerably. Gains are primarily being eked out via process advantage (which has slowed down) and larger chips (which has an ever-shrinking limit depending on the tech used)

Chiplets have slowed the slowdown in AI, but you can see in the gaming space how much things have slowed to get an idea of what is coming for enterprise.


> It was a good way to value these companies according to their compute size since those chips are very valuable.

Are they actually, though? Presently yes, but are they actually driving ROI? Or just an asset nobody really is meaningfully utilizing, but helps juice the stocks?


I asked this elsewhere, but, I don't fully understand the reason for the critical GPU shortage.

Isn't it because NVIDIA insists on only using the latest nodes from a single company (TSMC) for manufacture?

I don't understand why we can't use older process nodes to boost overall GPU making capacity.

Can't we have tiers of GPU availability some on cutting edge nodes, others built on older Intel and Samsung nodes?

Why is Nvidia not diversifying aggressively to Samsung and Intel, no matter the process node?

Can someone explain?

I've heard packaging is also a concern, but can't you get Intel to figure that out with a large enough commitment?

(Also, I know NVIDIA has some capacity on Samsung. But why not go all out, even using Global Foundries?)


That's a great way to value a company that is going bankrupt.

But, I'm not going to value an operating construction company based on how many shovels or excavators they own. I'm going to want to see them putting those assets to productive use.


If you are a cloud provider renting them out, sure.

Otherwise, you'd better keep them humming while trying to find a business model, because they certainly aren't getting any newer as chips.


So, "No one was ever fired for ... buying more server infrastructure."

Walmart has massive, idle datacenters full of running machines doing nothing.

Mainly it points to a non-scientific "bigger is better" mentality, and the researchers probably didn't mind playing around with the power because "scale" is "cool".

Remember that the Lisp AI-lab people were working on unsolved problems on absolute potatoes of computers back in the day. We have a semblance of a solution now, but so much of it has been brute force (even if there have been real improvements in the field).

The big question is whether this insane spending has pulled the rug on real progress - whether we head into another AI winter of disillusionment, or whether there is enough real progress just around the corner to give investors hope in a post-DeepSeek valuation hangover.


We are in a phase where costs are really coming down. We had this phase from GPT2 to about GPT4 where the key to building better models was just building bigger models and training them for longer. But since then a lot of work has gone into distillation and other techniques to make smaller models more capable.

If there is another AI winter, it will be more like the dotcom bubble: lots of important work got done in the dotcom bubble, but many of the big tech companies started from the fruits of that labor in the decade after the bubble burst


Besides that, AI training (aka gradient descent) is not really an "embarrassingly parallel" problem. At some point, there are diminishing returns on adding more GPUs, even though a lot of effort is going into making it as parallel as possible.

What? It definitely is.

Data parallelism, model parallelism, parameter server to workers, MoE itself can be split up, etc.

But even if it wasn’t, you can simply parallelize training runs with slight variations in hyperparameters. That is what the article is describing.
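
The hyperparameter-sweep part really is embarrassingly parallel. A toy sketch, with `train_and_eval` as a stand-in for a full training run (nothing here is from the article):

  from itertools import product
  from multiprocessing import Pool

  def train_and_eval(cfg):
      lr, warmup_steps = cfg
      # ... kick off a training run with these hyperparameters, return a validation score ...
      return {"lr": lr, "warmup_steps": warmup_steps, "score": 0.0}

  if __name__ == "__main__":
      grid = list(product([1e-4, 3e-4, 1e-3], [100, 500]))
      with Pool(len(grid)) as pool:             # in practice: one cluster job per config
          results = pool.map(train_and_eval, grid)
      print(max(results, key=lambda r: r["score"]))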


This claim is mathematically nonsensical. It implies a more-or-less linear relationship, that more is always better. But there's no reason to limit that to H100s. Conventional servers are, if anything, rather more established in their ability to generate value, by which I mean, however much potential AI servers may have to be more important than conventional servers that they may manifest in the future, we know how to use conventional servers to generate value now.

And thus, by this logic, every company in the world should just be buying as many servers as they can get their hands on, because More Servers = More Value.

Obviously, this is not happening. It doesn't take much analysis to start listing the many and manifold reasons why. Many of those reasons will apply to GPUs as well. Just as if everything in AWS got 10x faster, overnight, this would not create a situation where everyone suddenly starts grabbing more servers in AWS. Obviously everyone would start trimming down, even if perhaps in a few years time they'd find some way to use this burst of power such that they can use more later. This can't happen overnight, though. It would take time, and not "weeks" or "months" but "years" at scale.

Incorporating the important variable of time in the analysis, if AIs become literally hundreds of times cheaper to run, today, then it is perfectly logical that the near-term demand for the hardware to run them is also going to go way, way down. However much potential AI may have, it is fairly clear looking out at the AI landscape right now that there isn't really anyone out there unlocking vast amounts of value and sitting there wringing their hands because they just can't get more GPU compute. The GPU rush has been from fear that someone will figure out how to "really" unlock AI and then they'll be stuck without the hardware to compete.

It may be the case that vastly cheaper AI will in fact be part of unlocking that value, and that as the AI industry grows it will grow faster as a result... but that's still going to be on a multi-year time frame, not a tomorrow time frame. And all those GPUs and all those valuations are still broadly based on them being valuable real soon now, not in a few years, and all those GPU purchases are on the assumption they need them now, or on a timeframe where we can't be waiting around, rather than waiting for some rounds of exponential doublings to bring price down. The hardware curve in 5 years may be higher but the curve in the next year would be lower, and by a lot.

And, you know, who's to say we're done? I doubt there's another 100x in there, but is someone going to eke out another 2x improvement? Or a 10x improvement? Making it easier to run lots of experiments makes it much more likely for that to happen. I'm skeptical of another 10x general improvement but 10x improvements for specific, important use cases I can't rule out.

Edit: I should also point out this is an extremely common pattern in technology in general. Often the very hardest part is producing a thing that does a particular task at all. Once we have it in hand, once we can use it and learn how it operates and what its characteristic operating modes are, once we can try modifications to it in the real world and see what happens, optimizing it becomes much easier, sometimes explosively so by comparison. Taking any first iteration of a tech that is practical and then trying to straight-line demand based on it is silly, in all sorts of ways and all directions. The internal combustion engine, for example, has had a myriad of impacts on the world and certainly after various improvements many, many millions if not billions of them have been made... but any company that reacted to the first couple of cars and just went ballistic buying those first-generation internal combustion engines would have lost everything, and rather quickly.


For starters every employee has an H100 under their desk.

The part about taking control of a reasoning model's output length using <think></think> tags is interesting.

> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".

I had found a few days ago that this let you 'inject' your own CoT and jailbreak it easier. Maybe these are related?

https://pastebin.com/G8Zzn0Lw

https://news.ycombinator.com/item?id=42891042#42896498
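
For anyone who wants to play with the trick quoted above, here is roughly what it looks like at the text level. `complete` is a hypothetical wrapper around any completion API that accepts stop strings - this is a sketch of the idea, not the s1 authors' code:

  def think_longer(complete, prompt, extensions=2):
      trace = "<think>"
      for i in range(extensions + 1):
          # generation halts right before it would emit "</think>"
          trace += complete(prompt + trace, stop=["</think>"])
          if i < extensions:
              trace += "\nWait"        # instead of closing the block, nudge it to re-check
      return complete(prompt + trace + "</think>\n")   # now let it write the final answer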


This even points to a reason why OpenAI hides the "thinking" step: it would be too obvious that the context is being manipulated to induce more thinking.

It's weird that you need to do that at all - couldn't you just reject that token and use the next most probable one?

In case you’re not sure what S1 is, here is the original paper: https://arxiv.org/html/2501.19393v1

It's linked in the blog post, too. In the first sentence, actually, but for some reason the author never bothered to attach the name to it. As if keeping track of o1, 4o, r1, r2d2, wasn't exhausting enough already.

> for some reason the author never bothered to attach the name to it

Respect for his readers’ intelligence, maybe.


it's also the first link in the article's first sentence

Good call, I must have missed it. I read the whole blog then went searching for what S1 was.

  To enforce a minimum, we suppress the generation of the end-of-thinking token delimiter and optionally append the string “Wait” to the model’s current reasoning trace to encourage the model to reflect on its current generation.
Does this mean that the end-of-thinking delimiter is a single token? Presumably </think> or similar wasn't a single token for the base model. Did they just pick a pair of uncommon single-token symbols to use as delimiters?

EDIT: Never mind, end of thinking is represented with <|im_start|> followed by the word 'answer', so the code dynamically adds/removes <|im_start|> from the list of stop tokens.
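
So a sketch of that mechanism at the sampling-loop level might look like the following. `sample_next` and `tokenizer` are stand-ins (the real code drives an inference engine), not the actual s1/vLLM implementation:

  END_OF_THINKING = "<|im_start|>"      # per the EDIT above, followed by the word "answer"

  def generate_with_min_thinking(sample_next, tokenizer, ids, min_think_tokens, max_tokens=32768):
      eot_id = tokenizer.convert_tokens_to_ids(END_OF_THINKING)
      for step in range(max_tokens):
          # while under budget, the delimiter is banned (its logit masked to -inf inside sample_next)
          banned = [eot_id] if step < min_think_tokens else []
          tok = sample_next(ids, banned_token_ids=banned)
          ids.append(tok)
          if tok == eot_id:             # budget met and the model chose to stop thinking
              break
      return ids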


I don't know what R1 is either

It’s the DeepSeek reasoning model.

This feels just like telling a constraint satisfaction engine to backtrack and find a more optimal route through the graph. We saw this 25 years ago with engines like PROVERB doing directed backtracking, and with adversarial planning when automating competitive games.

Why would you control the inference at the token level? Wouldn’t the more obvious (and technically superior) place to control repeat analysis of the optimal path through the search space be in the inference engine itself?

Doing it by saying “Wait” feels like fixing dad’s laptop over a phone call. You’ll get there, but driving over and getting hands on is a more effective solution. Realistically, I know that getting “hands on” with the underlying inference architecture is way beyond my own technical ability. Maybe it’s not even feasible, like trying to fix a cold with brain surgery?


What would a superior control approach be? It's not clear to me how to get an LLM to be an LLM if you're not doing stochastic next token prediction. Given that, the model itself is going to know best how to traverse its own concept space. The R1 chain of thought training encourages and develops exactly that capability. Still, you want that chain of thought to terminate and not navel gaze endlessly.

So how to externally prod it to think more when it does terminate? Replacing thought termination with a linguistic signifier of continued reasoning plus novel realization seems like a charmingly simple, principled, and general approach to continue to traverse concept space.


This is the difference between science and engineering. What they have done is engineering. If the result is 90% of the way there with barely any effort, its best to move on to something else that may be low hanging fruit than to spend time chasing that 10%.

Totally agreed this is not the solution we are looking for; in fact, it's the only solution we have in our hands right now. It's a good step forward.

S1 has no relationship to R1. It's a marketing campaign for an objectively terrible and unrelated paper.

S1 is fully supervised by distilling Gemini. R1 works by reinforcement learning with a much weaker judge LLM.

They don't follow the same scaling laws. They don't give you the same results. They don't have the same robustness. You can use R1 for your own problems. You can't use S1 unless Gemini works already.

We know that distillation works and is very cheap. This has been true for a decade; there's nothing here.

S1 is a rushed hack job (they didn't even run most of their evaluations, with the excuse that the Gemini API is too hard to use!) that probably existed before R1 was released and was then pivoted into this mess.


Love the look under the hood! Especially discovering that some AI hack I came up with is how the labs are doing things too.

In this case, I was also forcing R1 to continue thinking by replacing </think> with “Okay,” after augmenting reasoning with web search results.

https://x.com/0xmmo/status/1886296693995646989


For all the hype about thinking models, this feels much like compression in terms of information theory instead of a "takeoff" scenario.

There is a finite amount of information stored in any large model; the models are really good at presenting the correct information back, and adding thinking blocks made them even better at doing that. But there is a cap to that.

Just like how you can compress a file by a lot, there is a theoretical maximum to the amount of compression before it starts becoming lossy. There is also a theoretical maximum of relevant information from a model regardless of how long it is forced to think.


I think an interesting avenue to explore is creating abstractions and analogies. If a model can take a novel situation and create an analogy to one that it is familiar with, it would expand its “reasoning” capabilities beyond its training data.

I think this is probably accurate and what remains to be seen is how "compressible" the larger models are.

The fact that we can compress a GPT-3 sized model into an o1 competitor is only the beginning. Maybe there is even more juice to squeeze there?

But even more, how much performance will we get out of o3-sized models? That is what is exciting, since they are already performing near PhD level on most evals.


My thinking (hope?) is that the reasoning models will be more like how a calculator doesn't have to "remember" all the possible combinations of addition, multiplication, etc. for all the numbers, but can actually compute the results.

As reasoning improves the models could start with a basic set of principles and build from there. Of course for facts grounded in reality RAG would still likely be the best, but maybe with enough “reasoning” a model could simulate an approximation of the universe well enough to get to an answer.


This thing that people are calling “reasoning” is more like rendering to me really, or multi pass rendering. We’re just refining the render, there’s no reasoning involved.

That was succinct and beautifully stated. Thank-you for the "Aha!" moment.

Hah. You should check out my other comment on how I think we’re obviously in a simulation (remember, we just need to see a good enough render).

LLMs are changing how I see reality.


How are you defining "reasoning"?

Because I see these sorts of gnostic assertions about LLMs all the time - claims that they "definitely aren't doing <thing we normally apply to meat-brains>" - justified only by gesturing at the technical things the model is doing, with no attempt to actually justify the negative assertion.

It often comes across as privileged reason trying to justify that of course the machine isn't doing some ineffable thing only meat-brains do.


From my other ridiculous comment, as I do entertain simulation theory in my understanding of God:

Reasoning as we know it could just be a mechanism to fill in gaps in obviously sparse data (we absolutely do not have all the data to render reality accurately, you are seeing an illusion). Go reason about it all you want.

The LLM doesn’t know anything. We determine what output is right, even if the LLM swears the output is right. We “reason” about it, I guess? Well in this case the whole “reasoning” process is to simply get an output that looks right, so what is reasoning in our case?

Let me just go one ridiculous level lower. If I measure every frame the Hubble telescope takes, and I measure with a simple ruler the distances between things, frame by frame, I can “reason” out some rules of the universe (planetary orbits). In this “reasoning” process, the very basic question of “well why, and who made this” immediately arises, so reasoning always leads to the fundamental question of God.

So, yeah. We reason to see God, because that’s all we’re seeing, everything else is an illusion. Reasoning is inextricably linked to God, so we have to be very open minded when we ask what is this machine doing.


Honestly, I was going to nitpick, but this definition scratches an itch in my brain so nicely that I'll just complement it as beautiful. "We reason to see God", I love it.

(Also, if I might give a recommendation, you might be the type of person to enjoy Unsong by Scott Alexander https://unsongbook.com/)


Thank you for the suggestion and nice words. Trust me, I have to sit here and laugh at the stuff I write too, because I wasn’t always a believer. So it’s a little bit of a trip for me too, I’m still exploring my own existence.

Yes.

Before LLMs we had n-gram language models. Many tasks like speech recognition worked as beam search in the graph defined by the n-gram language model. You could easily get huge accuracy gains simply by pruning your beam less.

s1 reminds me of this. You can always trade off latency for accuracy. Given that these LLMs are much more complex than good old n-grams, we're just discovering how to make this trade.
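
For anyone who never touched that era: the knob being described is just the beam width. A generic sketch, where `next_scores(seq)` stands in for whatever model (n-gram LM, acoustic + language model, LLM) scores the next symbol - not any particular recognizer's code:

  import math

  def beam_search(next_scores, vocab, start, end, beam_width, max_len=50):
      beams = [([start], 0.0)]                          # (sequence, log-probability)
      for _ in range(max_len):
          candidates = []
          for seq, logp in beams:
              if seq[-1] == end:                        # finished hypotheses carry over unchanged
                  candidates.append((seq, logp))
                  continue
              probs = next_scores(seq)                  # dict: symbol -> probability (assumed > 0)
              for sym in vocab:
                  candidates.append((seq + [sym], logp + math.log(probs[sym])))
          # pruning step: a wider beam keeps more hypotheses alive (slower, usually more accurate)
          beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
          if all(seq[-1] == end for seq, _ in beams):
              break
      return max(beams, key=lambda b: b[1])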


Let me carry that concept, “learning to do this trade”, it’s a new trade.

I don't believe computer science has the algorithms to handle this new paradigm. Everything was about sequential, deterministic outputs and clever ways to compute them fast, and that toolkit is of little use at the moment. We need new thinkers on how not to think sequentially, or how not to think about the universe in such a small way.

Verifying input/output pairs is the old way. We need to understand differently going forward.


We could see it the other way around: what we call "reasoning" may actually be some kind of multipass rendering, whether it is performed by computers or human brains.

Yes, of course. The implications are awesome.

Which is related to multistage/ hierarchical/coarse-to-fine optimization, which is a pretty good way to find the global optimum in many problem domains.

sshhhh, let the money flow

"...there’s no reasoning involved...wait, could I just be succumbing to my heuristic intuitions of what is (seems to be) true....let's reconsider using System 2 thinking..."

Or there is no objective reality (well there isn’t, check out the study), and reality is just a rendering of the few state variables that keep track of your simple life.

A little context about you:

- person

- has hands, reads HN

These few state variables are enough to generate a believable enough frame in your rendering.

If the rendering doesn’t look believable to you, you modify state variables to make the render more believable, eg:

Context:

- person

- with hands

- incredulous demeanor

- reading HN

Now I can render you more accurately based on your “reasoning”, but truly I never needed all that data to see you.

Reasoning as we know it could just be a mechanism to fill in gaps in obviously sparse data (we absolutely do not have all the data to render reality accurately, you are seeing an illusion). Go reason about it all you want.


Is this a clever rhetorical trick to make it appear that your prior claim was correct?

If not: what am I intended to take away from this? What is its relevance to my comment?


You made a joke about questioning reality, I simply entertained it. You can do whatever you want with it, wasn’t a slight at all.

It may have been in the form of a joke, but I certainly wasn't joking.

I think it is interesting what actions cannot be done by humans.


I wasn’t joking either. Things are just getting started with this AI stuff, and I feel like programmers will experience that “de ja vu” phenomenon that they talk about in the Matrix, that eerie feeling something isn’t right.

Look, why have game developers spent so much time lazy loading parts of the game world? Very rarely do they just load the whole world, even in 2025. See, the worlds get bigger, so even as the tech gets better, we will always lazy load worlds in.

It’s a context issue right? Developers have just recently been given this thing called “context”.

But yeah man, why do we think just because we walked from our house to the supermarket that this reality didn’t lazy load things. That’s how programmers have been doing it all along …

Anyways


A more parsimonious explanation: consciousness is generative, like an LLM. And, according to cultural conditioning, this generated scenario is referred to as reality.

I like this version for at least two reasons:

1. It is 100% compliant with large quantities of scientific findings (psychology and neuroscience), whereas I believe yours has a conservation of mass problem at least

2. Everyone dislikes it at least in certain scenarios (say, when reference is made to it during an object level disagreement)


> "Note that this s1 dataset is distillation. Every example is a thought trace generated by another model, Qwen2.5"

The traces are generated by Gemini Flash Thinking.

8 hours of H100 is probably more like $24 if you want any kind of reliability, rather than $6.


"You can train a SOTA LLM for $0.50" (as long as you're distilling a model that cost $500m into another pretrained model that cost $5m)

The original statement stands, if what you are suggesting in addition to it is true. If the initial one-time investment of $505m is enough to distill new SOTA models for $0.50 a piece, then the average cost for subsequent models will trend toward $0.50.

That's absolutely fantastic, because if you have 1 good idea that's additive to the SOTA, you can test it for a dollar, not millions

I work at a mid-sized research firm, and there’s this one coworker who completely turned her performance around. A complete 180. A few months ago, she was one of the slowest on the team, now she’s always the first to get her work done. I was curious, so I asked her what changed. She just laughed and said she just used an AI tool that she randomly found on YouTube to do 90% of her work.

We’ve been working on a project together, and every morning for the past two months, she’s sent me clean, perfectly organized Fed data. I assumed she was just working late to get ahead. Turns out, she automated the whole thing. She even scheduled it to send automatically. Tasks that used to take hours (gathering thousands of rows of data, cleaning them, running regression analyses, time series, hypothesis tests, etc.) she now completes almost instantly. Everything. Even random things like finding discounts for her Pilates class. She just needs to check and make sure everything is good. She’s not super technical, so I was surprised she could do these complicated workflows, but the craziest part is that she just prompted the whole thing. She just types something like “compile a list of X, format it into a CSV, and run X analysis” or “go to Y, see what people are saying, give me background on the people saying Z” and it just works. She’s even joking about connecting it to the office printer. I’m genuinely baffled. The barrier to effort is gone.

Now we’ve got a big market report due next week, and she told me she’s planning to use DeepResearch to handle it while she takes the week off. It’s honestly wild. I don’t think most people realize how doomed knowledge work is.


> having 10,000 H100s just means that you can do 625 times more experiments than s1 did

The larger the organisation, the fewer experiments you can afford to do. Employees are mostly incentivised to get something done quickly enough not to be fired in this job market. They know that the higher-ups would cut them loose for the sake of temporary gains. Rush this deadline, ship that feature, produce something that looks OK enough.


All you need is attention and waiting. I feel like a zen monk.

S1 (and R1 tbh) has a bad smell to me or at least points towards an inefficiency. It's incredible that a tiny number of samples and some inserted <wait> tokens can have such a huge effect on model behavior. I bet that we'll see a way to have the network learn and "emerge" these capabilities during pre-training. We probably just need to look beyond the GPT objective.

I agree, but LLMs in general have a horrendously bad smell in terms of efficiency. s1 and r1 are just proving it.

The models' latent spaces are insanely large. The vast, vast majority pretty much has to be irrelevant and useless, it's just that the training commandeers random fragments of that space to link up the logic they need and it's really hard to know which of the weights are useless, which are useful but interchangeable with other weights, and which are truly load-bearing. You could probably find out easily by testing the model against every possible thing you ever might want it to do, just as soon as someone gets around to enumerating that non-enumerable collection of tasks.

These bogus <wait> tokens kind of demonstrate that the models are sort of desperate to escape the limitations imposed by the limited processing they're allowed to do -- they'll take advantage of thinking time even when it's provided in the silliest manner possible. It's amazing what you can live with if it's all you have!

(Apologies for the extended anthropomorphizing.)


can you please elaborate on the wait tokens? what's that? how do they work? is that also from the R1 paper?

The same idea is in both the R1 and S1 papers (<think> tokens are used similarly). Basically they're using special tokens to mark in the prompt where the LLM should think more/revise the previous response. This can be repeated many times until some stop criteria occurs. S1 manually inserts these with heuristics, R1 learns the placement through RL I think.
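To make that concrete, here's a rough sketch of the budget-forcing loop in Python. The `next_token` and `complete` calls are hypothetical stand-ins for whatever inference API you use, not the authors' code:

    # Rough sketch of s1-style "budget forcing"; `next_token` and `complete`
    # are hypothetical stand-ins for a real inference API.
    def generate_with_budget(model, prompt, min_think=512, max_think=4096):
        output = prompt + "<think>"
        thought = 0
        while True:
            token = model.next_token(output)        # sample one token (hypothetical)
            if token == "</think>":
                if thought < min_think:
                    output += " Wait"               # suppress end-of-thinking, keep reasoning
                    continue
                break                               # budget satisfied, let it stop
            if thought >= max_think:
                break                               # trim: force thinking to end here
            output += token
            thought += 1
        return model.complete(output + "</think>")  # generate the final answer (hypothetical)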

? They're not really special tokens

i'm not actually sure whether they're special tokens in the sense of being in the vocabulary

<think> might be; I think "wait" is tokenized like any other word in the pretraining

There's a decent explanation in the article, just FYI.



When you're only used to ollama, how do I go about using this model?

I think we need to wait for someone to convert it into a GGUF file format.

However, once that happens, you can run it (and any GGUF model) from Hugging Face![0]

[0] https://huggingface.co/docs/hub/en/ollama



I ran it, so far it seems like a pretty good model, especially locally.

oh god, this is terrible!

I just said "Hello!" and it went off the rails.


why how what? can you add a sample prompt with output ?

In my original prompt, it gave me back multiple greetings in various languages. In this example, it gave me back just multiple greetings:

>>> hello!

Hello there! I'm here to help with any questions you might have. How can I assist you today? If you're asking a question, feel free to provide some details about what you need help with. I'll do my best to answer your question or guide you through the process. Just let me know!

<|im_start|>think

The user has asked "hello! " and provided no further information. Answer here is a direct response to the user's request for information, which was essentially just a greeting. The appropriate response is to greet back and offer assistance.

<|im_start|>answer

Answer:

Hello! How can I assist you today? If you have a question or need help with something specific, feel free to ask. I'm here to support you!

<|im_start|>think

<|im_start|>answer

The user's message was "hello!" and there was no other information provided. The appropriate response is a greeting back and an offer of assistance.

<|im_start|>answer

Answer:

Hello! How can I assist you today? If you have a question or need help with something specific, feel free to ask. I'm here to support you!

<|im_start|>think

The user has greeted with "hello!" An appropriate response involves acknowledging the greeting and offering assistance. This could involve asking how we can help them or what they'd like guidance on.

<|im_start|>answer

Answer:

Hello! How can I assist you today? If you have a question or need help with something specific, feel free to ask. I'm here to support you!


You can load the safetensors with Ollama; you just have to provide a Modelfile, or wait for someone to do it. It will in theory also quantize it for you, as I guess most people cannot load a 129 GB model...

> Why did it cost only $6? Because they used a small model and hardly any data.

> After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that’s needed to achieve o1-preview performance on a 32B model. Adding data didn’t raise performance at all.

> 32B is a small model, I can run that on my laptop. They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6.


I have a bunch of questions, would love for anyone to explain these basics:

* The $5M DeepSeek-R1 (and now this cheap $6 R1) are both based on very expensive oracles (if we believe DeepSeek-R1 queried OpenAI's model). If these are improvements on existing models, why is this being reported as decimating training costs? Isn't fine-tuning already a cheap way to optimize? (maybe not as effective, but still)

* The R1 paper talks about improving one simple game - Countdown. But the original models are "magic" because they can solve a nearly uncountable number of problems and scenarios. How does the DeepSeek / R1 approach scale to the same gigantic scale?

* Phrased another way, my understanding is that these techniques are using existing models as black-box oracles. If so, how many millions/billions/trillions of queries must be probed to replicate and improve the original dataset?

* Is anything known about the training datasets used by DeepSeek? OpenAI used presumably every scraped dataset they could get their hands on. Did DS do the same?


If what you say is true, and distilling LLMs is easy and cheap, and pushing the SOTA without a better model to rely on is dang hard and expensive, then that means the economics of LLM development might not be attractive to investors - spending billions to have your competitors come out with products that are 99% as good, and cost them pennies to train, does not sound like a good business strategy.

What I still don’t understand is how one slurps out an entire model (closed source) though.

Does the deepseek paper actually say what model it’s trained off of, or do they claim the entire thing is from scratch?


AFAIK DeepSeek have not publicly acknowledged training their model on OpenAI output - the OpenAI people have alleged that they did.

At any rate, I don't think distillation involves 'slurping out' the whole model, as I understand it, it means providing the other model's output as training data input to create your new model. Maybe analogous to an expert teaching a novice how to do something by providing carefully selected examples, without having to expose the novice to all the blind alleys the expert went down to achieve mastery.


> If these are improvements on existing models, why is this being reported as decimating training costs?

Because that's what gets the clicks...

Saying they spent a boatload of money on the initial training + iteration + final fine-tuning isn't as headline grabbing as "$5 million trained AI beats the pants off the 'mericans".


It appears that someone has implemented a similar approach for DeepSeek-R1-Distill-Qwen-1.5B: https://reddit.com/r/LocalLLaMA/comments/1id2gox/improving_d...

I hope it gets tested further.


It just occurred to me that if you squint a little (just a little!) the S1 paper just provided the scientific explanation for why Twitter's short tweets mess you up and books are good for you.

Kidding, but not really. It's fascinating how we seem to be seeing a gradual convergence of machine learning and psychology.


> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait". It’ll then begin to second guess and double check its answer. They do this to trim or extend thinking time (trimming is just abruptly inserting "</think>")

I know some are really opposed to anthropomorphizing here, but this feels eerily similar to the way humans work, i.e. if you just dedicate more time to analyzing and thinking about the task, you are more likely to find a better solution

It also feels analogous to navigating a tree: the more time you have to explore the nodes, the bigger the space you'll have covered, and hence the higher the chance of finding a more optimal solution

At the same time, if you have "better intuition" (better training?), you might be able to find a good solution faster, without needing to think too much about it


What’s missing in that analogy is that humans tend to have a good hunch about when they have to think more and when they are “done”. LLMs seem to be missing a mechanism for that kind of awareness.

LLMs actually do have such a hunch; they just don't utilize it. You can literally ask them "Would you do better if you started over?" and start over if the answer is yes. This works.

https://arxiv.org/abs/2410.02725
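As a sketch of what that looks like in practice (using a hypothetical `chat` helper, not the paper's exact protocol):

    # Ask the model whether a fresh attempt would be better, and regenerate if so.
    # `chat` is a hypothetical helper that returns the model's text response.
    def answer_with_restarts(chat, question, max_attempts=3):
        answer = chat(question)
        for _ in range(max_attempts - 1):
            verdict = chat(
                f"Question: {question}\nYour answer: {answer}\n"
                "Would you do better if you started over? Answer YES or NO."
            )
            if "YES" not in verdict.upper():
                break                    # the model is satisfied with its answer
            answer = chat(question)      # start over with a fresh attempt
        return answer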


Great observation. Maybe an additional “routing model” could be trained to predict when it’s better to think more vs just using the current result

From the S1 paper:

> Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end

I'm feeling proud of myself that I had the crux of the same idea almost 6 months ago, before reasoning models came out (and a bit disappointed that I didn't take this idea further!). Basically, during inference time you have to choose the next token to sample. Usually people just sample the distribution using the same sampling rules at each step... but you don't have to! You can selectively insert words into the LLM's mouth based on what it said previously or what it wants to say, and decide "nah, say this instead". I wrote a library so that you could sample an LLM using llama.cpp in Swift and write rules to sample tokens and force tokens into the sequence depending on what was sampled. https://github.com/prashanthsadasivan/LlamaKit/blob/main/Tes...

Here, I wrote a test that asks Phi-3 instruct "how are you", and if it tried to say "as an AI I don't have feelings" or "I'm doing ", I forced it to say "I'm doing poorly" and refuse to help, since it was always so dang positive. It sorta worked, though the instruction-tuned models REALLY want to help. But at the time I just didn't have a great use case for it - I had thought about a more conditional extension to llama.cpp's grammar sampling (you could imagine changing the grammar based on previously sampled text), or even just making it go down certain paths, but I lost steam because I couldn't describe a killer use case for it.

This is that killer use case! Forcing it to think more is such a great use case for inserting ideas into the LLM's mouth, and I feel like there must be more to this idea to explore.


So what you mean is that if the current train of thought is going in a direction we find to be not optimal, we could just interrupt it and hint it into the right direction?

That sounds very useful, albeit a bit different than how current "chat" implementations would work, as in you could control both ways of the conversation.


> and a bit disappointed that I didn't take this idea further!

Don’t be, that’s pretty common.

https://en.wikipedia.org/wiki/Multiple_discovery


CoT is widely known technique - what became fully novel was the level of training embedding CoT via RL with optimal reward trajectory. DeepSeek took it further due to their compute restriction to find memory, bandwidth, parallelism optimizations in every part (GRPO - reducing memory copies, DualPipe for data batch parallelism between memory & compute, kernel bypasses (PTX level optimization), etc.) - then even using MoE due to sparse activation and further distillation. They operated on the power scaling laws of parameters & tokens but high quality data circumvents this. I’m not surprised they utilized synthetic generation from OpenAI or copied the premise of CoT, but where they should get the most credit is their infra level & software level optimizations.

With that being said, I don't think the benchmarks we currently have are strong enough, and the next frontier models are yet to come. I'm sure at this point U.S. LLM research firms understand their lack of infra/hardware optimizations (they just threw compute at the problem) and will begin paying closer attention. Their RL-level and parent training will become even greater, while the newly freed resources can go toward the sub-optimizations that have traditionally been avoided due to computational overhead.


The point about agents to conceal access to the model is a good one.

Hopefully we won’t lose all access to models in future


Well dang, I am great at tinkering like this because I can’t remember things half the time. I wonder if the ADHD QA guy solved this for the devs?

>it can run on my laptop

Has anyone run it on a laptop (unquantized)? Disk size of the 32B model appears to be 80GB. Update: I'm using a 40GB A100 GPU. Loading the model took 30GB vRAM. I asked a simple question "How many r in raspberry". After 5 minutes nothing got generated beyond the prompt. I'm not sure how the author ran this on a laptop.


32B models are easy to run on 24GB of RAM at a 4-bit quant.

It sounds like you need to play with some of the existing 32B models with better documentation on how to run them if you're having trouble, but it is entirely plausible to run this on a laptop.

I can run Qwen2.5-Instruct-32B-q4_K_M at 22 tokens per second on just an RTX 3090.


My question was about running it unquantized. The author of the article didn't say how he ran it. If he quantized it, then saying he ran it on a laptop is not news.

I can't imagine why anyone would run it unquantized, but there are some laptops with more than the 70GB of RAM that would be required. It's not that it can't be done... it's just that quantizing to at least 8-bit seems to be standard practice these days, and DeepSeek has shown that it's even worth training at 8-bit resolution.
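For anyone curious, the back-of-the-envelope weight memory (ignoring KV cache and runtime overhead) works out roughly like this:

    # Approximate memory for the weights alone of a 32B-parameter model,
    # ignoring KV cache, activations, and runtime overhead.
    params = 32e9
    for name, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
    # fp16/bf16: ~64 GB  -> needs a very large-RAM laptop or multiple GPUs
    # 8-bit:     ~32 GB  -> fits a 40-48 GB GPU or a big-RAM laptop
    # 4-bit:     ~16 GB  -> fits in 24 GB of (V)RAM with room for context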

Maybe he has a 64GB laptop. Also he said he can run it, not that he actually tried it.

At 6 dollars per run, I'm tempted to try to figure out how to replicate this. I'd like to try some alternatives to "wait" - e.g. "double checking..." Or write my own chains of thought.

Like the ones they tested?

Yes, that is what "replicate" with my own ideas means.

I think a lot of people in the ML community were excited for Noam Brown to lead the O series at OpenAI because intuitively, a lot of reasoning problems are highly nonlinear i.e. they have a tree-like structure. So some kind of MCTS would work well. O1/O3 don’t seem to use this, and DeepSeek explicitly mentioned difficulties training such a model.

However, I think this is coming. DeepSeek mentioned it was hard to learn a value model for MCTS from scratch, but this doesn’t mean we couldn’t seed it with some annotated data.


Do you have a reference for us to check? - "DeepSeek explicitly mentioned difficulties training such a model."

Section 4.2: Unsuccessful attempts

https://arxiv.org/pdf/2501.12948



> I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.

Couldn't they just start hiding the thinking portion?

It would be easy for them to do this. Currently, they already provide one-sentence summaries for each step of the thinking. I think users would be fine, or at least stay, if it were changed to provide only that.


They hid it and deepseek came up with R1 anyway, with RL on only results and not even needing any of the thinking tokens that OpenAI hid.

Which is still the funniest and most interesting result in AI so far IMO. Fascinating, but sort of makes intuitive sense too!

I found it interesting but the "Wait" vs. "Hmm" bit just made me think we don't really understand our own models here. I mean, sure, it's great that they measured and found something better, but it's kind of disturbing that you have to guess.

This argument that the data centers and all the GPUs will be useful even in the context of Deepseek doesn't add up... basically they showed that it's diminishing returns after a certain amount. And so far it didn't make OpenAI or Anthropic go faster, did it?

What is the source for the diminishing returns? I would like to read about it as I have only seen papers referring to the scaling law still applying.

Deepseek R1 uses <think> and "wait", and you can see it second-guessing itself in the thinking tokens. How does the model know when to wait?

These reasoning models feed into OP's last point about NVidia and OpenAI data centers not being wasted, since reasoning models require more tokens and faster tps.


From playing around they seem to 'wait' when there's a contradiction in their logic.

And I think the second point is due to The Market thinking there is no need to spend ever increasing amounts of compute to get to the next level of AI overlordship.

Of course the Jevons paradox is also all over the news these days..


Probably when it would expect a human to second guess himself, as shown in literature and maybe other sources.

Cool trick. But is this better than reinforcement learning, where the LLM decides for itself the optimal thinking time for each prompt?

Maybe this is why OpenAI hides o1/o3 reasoning tokens - constraining output at inference time seems to be easy to implement for other models and others would immediately start their photocopiers.

It also gave them a few months to recoup costs!


LLMs still feel so magical. It’s like quantum physics. “I get it” but I don’t. Not really. I don’t think I ever will. Perhaps a human mind can only comprehend so much.

Hmmm, 1 + 1 equals 3. Alternatively, 1 + 1 equals -3.

Wait, actually 1 + 1 equals 1.


As one with teaching experience, the idea of asking a student "are you sure about that?" is to get them to think more deeply rather than just blurting a response. It doesn't always work, but it generally does.

It works because the question itself is a hint born of knowledge. “Are you sure about that” is a polite way to say “that answer is wrong, try again”. Students know that, so instead of doubling down will redo their work with the assumption they made a mistake. It is much rarer to ask the question when the answer is correct, and in fact doing so is likely to upset the learner because they had to redo the work for no reason.

If you want a true comparison, start asking that question every time and then compare. My hypothesis is students would start ignoring the prompt and answering “yes” every time to get on with it.


> They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6

Running where? H100s are usually over $2/hr; that's closer to $25.
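The GPU-hours are fixed by the setup; the dollar figure depends entirely on the hourly rate you assume (the rates below are purely illustrative):

    # 16 H100s for 26 minutes; the rate per GPU-hour is the only variable.
    gpu_hours = 16 * 26 / 60                 # ~6.9 GPU-hours
    for rate in (0.85, 2.00, 3.50):          # illustrative $/GPU-hour rates
        print(f"${rate:.2f}/hr -> ${gpu_hours * rate:.2f}")
    # $0.85/hr -> ~$5.89   (cheap spot/interruptible pricing)
    # $2.00/hr -> ~$13.87  (typical on-demand pricing)
    # $3.50/hr -> ~$24.27  (pricier, more reliable capacity)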


Qwen's QvQ-72B produces many more "wait"s than the other LLMs with CoT I've tried; maybe they've already used that trick to some extent?


> even the smartest people make hundreds of tiny experiments

This is the most important point, and why DeepSeek’s cheaper training matters.

And if you check the R1 paper, they have a section for “things that didn’t work”, each of which would normally be a paper of its own but because their training was so cheap and streamlined they could try a bunch of things.


Thank you for this, I really appreciate this article and I learned a bunch!

Great article! I enjoyed reading it

This isn't RLVR, so it's sorta uninteresting; they're just distilling work that's already been done

Anyone else want more articles on how those benchmarks are created and how they work?

Those models can be trained in ways tailored to produce good results on specific benchmarks, making them far less general than they seem. No accusation from me, but I'm skeptical of all the recent so-called 'breakthroughs'.


Is it me, or are the affiliations totally missing in the cited paper? The authors look like they come from a mix of UK/US institutions

Wait

> If you believe that AI development is a prime national security advantage, then you absolutely should want even more money poured into AI development, to make it go even faster.

This, this is the problem for me with people deep in AI. They think it’s the be-all and end-all for everything. They have the vision of the ‘AI’ they’ve seen in movies in mind, see the current ‘AI’ being used, and to them it’s basically almost the same; their brains are mentally bridging the concepts and saying it’s only a matter of time.

To me, that’s stupid. I observe the more populist and socially appealing CEOs of these VC startups (Sam Altman being the biggest, of course.) just straight up lying to the masses, for financial gain, of course.

Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

And the enthusiasm of the people who are into it feeds into those who aren’t aware of it in the slightest; they see you can chat with a ‘robot’, they hear all this hype from their peers, and they buy into it. We are social creatures after all.

I think using any of this in a national security setting is stupid, wasteful and very, very insecure.

Hell, if you really care about being ahead, pour 500 billion dollars into quantum computing so you can try to break current encryption. That’ll get you so much further than this nonsensical bs.


You can choose to be somewhat ignorant of the current state in AI, about which I could also agree that at certain moments it appears totally overhyped, but the reality is that there hasn't been a bigger technology breakthrough probably in the last ~30 years.

This is not "just" machine learning because we have never been able to do things which we are today and this is not only the result of better hardware. Better hardware is actually a byproduct. Why build a PFLOPS GPU when there is nothing that can utilize it?

If you spare yourself some time and read through the actual (scientific) papers of multiple generations of LLM models, the first one being from Google ~~not DeepMind~~ in 2017, you might get to understand that this is no fluff.

And I'm speaking this from a position of a software engineer, without bias.

The reason why all this really took off with so much hi-speed is because of the not quite expected results - early LLM experiments have shown that "knowledge" with current transformers architecture can linearly scale with regards to the amount of compute and training time etc. That was very unexpected and to this day scientists do not have an answer why this even works.

So, after reading a bunch of material, I am inclined to think that this is something different. The future of loading a codebase into the model and asking it to explain the code to me or fix bugs has never been so close and realistic. For better or worse.


This line of thinking doesn't really correspond to the reason Transformers were developed in the first place, which was to better utilize how GPUs do computation. RNNs were too slow to train at scale because you had to sequentially compute the time steps, Transformers (with masking) can run the input through in a single pass.

It is worth noting that the first "LLM" you're referring to was only 300M parameters, but even then the amount of training required (at the time) was such that training a model like that outside of a big tech company was infeasible. Obviously now we have models that are in the hundreds of billions / trillions of parameters. The ability to train these models is directly a result of better / more hardware being applied to the problem, as well as the Transformer architecture being specifically designed to better conform with parallel computation at scale.

The first GPT model came out ~8 years ago. I recall when GPT-2 came out they initially didn't want to release the weights out of concern for what the model could be used for; looking back now, that's kind of amusing. However, fundamentally, all these models have the same setup as what was used then: decoder-based Transformers. They are just substantially larger, trained on substantially more data, with substantially more hardware.


What line of thinking are you referring to?

Transformers were aimed at solving the "context" problem, and the authors, aware that RNNs don't scale at all and don't solve that particular problem either, had to come up with an algorithm that overcomes both of those issues. It turned out that the compute-scale of self-attention was the crucial ingredient to solve the problem, something that RNNs were totally incapable of.

They modeled the algorithm to run on the hardware they had available at the time, but the hardware developed afterwards was a direct consequence, or as I called it a byproduct, of transformers proving themselves able to continuously scale. Had that not been true, we wouldn't have all those iterations of NVidia chips.

So, although one could say that the NVidia chip design is what enabled the transformers' success, one could also say that we wouldn't have those chips if transformers hadn't proven themselves to be so damn efficient. And I'm inclined to think the latter.


> This is not "just" machine learning because we have never been able to do things which we are today and this is not only the result of better hardware. Better hardware is actually a byproduct. Why build a PFLOPS GPU when there is nothing that can utilize it?

This is the line of thinking I'm referring to.

The "context" problem had already been somewhat solved. The attention mechanism existed prior to Transformers and was specifically used on RNNs. They certainly improved it, but innovation of the architecture was making it computation efficient to train.

I'm not really following your argument. Clearly you're acknowledging that it was first the case that, with the hardware at the time, researchers demonstrated that simply scaling up training with more data yielded better models. The fact that hardware was then optimized for these architectures only reinforces this point.

All the papers discussing scaling laws point to the same thing, simply using more compute and data yields better results.

> this is not only the result of better hardware

Regarding this in particular: a majority of the improvement from GPT-2 to GPT-4 was simply training at a much larger scale. That was enabled by better hardware, and lots of it.


> the innovation of the architecture was making it computationally efficient to train.

and

> researchers demonstrated that simply scaling up training with more data yielded better models

and

> The fact that hardware was then optimized for these architectures only reinforces this point.

and

> All the papers discussing scaling laws point to the same thing, simply using more compute and data yields better results.

is what I am saying as well. I've read the majority of those papers, so this is all well known to me, but I am perhaps writing it down in a more condensed format so that other readers who are light on the topic can pick up the idea more easily.

> A majority of the improvement from GPT-2 to GPT-4 was simply training at a much larger scale. That was enabled by better hardware, and lots of it.

Ok, I see your point and the conclusion here is what we disagree with. You say that the innovation was simply enabled by the better hardware whereas I say that that better hardware wouldn't have its place if there hadn't been a great innovation in the algorithm itself. I don't think it's fair to say that the innovation is driven by the NVidia chips.

I guess my point, simplistically, is that if we had a lousy algorithm, new hardware wouldn't mean anything without rethinking or rewriting the algorithm. And with transformers, that definitely wasn't the case. There have been plenty of optimizations throughout the years to better utilize the HW (e.g. flash-attention), but the architecture of transformers has remained more or less the same.


> the first one being from DeepMind in 2017

? what paper are you talking about



Oh gotcha. Maybe pedantic, but that is not a DeepMind paper

I stand corrected, not sure why I was under the impression that this paper came out of DeepMind lab.

> Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

That sounds to me like dismissing the idea that a Russian SSBN might cross the Pacific and nuke Los Angeles because "submarines can't swim".

Even if the machine learning isn't really intelligent, it is still capable of performing IF..THEN..ELSE operations, which could have detrimental effects for [some subset of] humans.

And even if you argue that such a machine _shouldn't_ be used for whatever doomsday scenario would harm us, rest assured that someone, somewhere, who either does not understand what the machines are designed to do or just pretends that they work like magic, will put the machines in a position to make such a decision.


One could hope...

Even at the height of the Cold War there was always a human between <leader presses button> and <nukes go aflyin'>.

--edit--

...which has me wondering if a president even has the constitutional authority to destroy the entire planet and if one could interpret their command as a 'lawful order'. Makes one think.


On the topic of fail-deadly nukes:

https://en.wikipedia.org/wiki/Dead_Hand


> They think it’s the end all be all for everything.

Is (human-based) general intelligence not one of the fundamental enabling elements of literally every human activity throughout history, regardless of how many layers of automation and technology one has to peel back to get to it?

Can you maybe imagine how the ability to create arbitrary amounts of general intelligence, completely divorced from the normal lengthy biological process, could upend that foundation of human activity?

> They have the vision of the ‘AI’ they’ve seen in movies in mind, see the current ‘AI’ being used and to them it’s basically almost the same, their brain is mental bridging the concepts and saying it’s only a matter of time.

I've found that most AI-related movies exclusively focus on "quality ASI" scenarios, which are mostly irrelevant to our current state of the world, as an immense amount of danger/value/disruption will arrive with AGI. People who are seriously reasoning about the impacts of AGI are not using movies as references. "Those stupid movie watching idiots" is just a crutch you are using to avoid thinking about something that you disagree with.

> Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

Do you have any evidence to support this conclusion? And does it even matter? If "fake intellect" can replace a human, that human still has to deal with the very real issue of not having a job anymore. If "fake intellect" is used to conduct mass surveillance and direct suppression activities towards divergent individuals, those individuals are still going to have a bad time.


>> Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

> Do you have any evidence to support this conclusion? And does it even matter? If "fake intellect" can replace a human, that human still has to deal with the very real issue of not having a job anymore. If "fake intellect" is used to conduct mass surveillance and direct suppression activities towards divergent individuals, those individuals are still going to have a bad time.

I think the "fake intelligence can replace a human" needs more support in general. We know how human intellect works practically (not theoretically) and we know how to apply it in different scenarios. We're still far from knowing how "fake intelligence" works and how to apply it to different scenarios.


I couldn't agree more.

If we're not talking exclusively about cyber war, such as finding and exploiting vulnerabilities, then for the time being national security will still rest on traditional armies.

Just a few weeks ago, Italy announced a €16bln plan to buy >1000 Rheinmetall IFVs. That alone would make Italy's army one of the best equipped in Europe. I can't imagine what would happen with a $500bln investment in defense, lol. I don't agree with what Meloni's government is doing, but one of the ministers I agree with more is the defense minister Crosetto

Furthermore, what is being shown, at least for the time being, is that open source can be and is crucial in helping develop better models. This collides with the big, "one winner takes it all" VC mentality (because let's be honest, these defense pitches are still made by startup/VC bros)


> Italy announced a €16bln plan to buy >1000 Rheinmetall IFVs. That alone would make Italy's army one of the best equipped in Europe.

So target practice for a beyond-the-horizon missile system launched ground-to-ground or air-to-ground? As an attacking force, conventional ground forces and tactics are a non-runner in a modern theatre of operations when faced with air and drone support. This is why no single EU country is incentivised into dumping money into any single area, as the only probable defense would be against the USA/Russia/China to begin with.

The US proved it beyond doubt in Afghanistan - partisans simply haven't a chance against a gunship with IR or NV optics; the last time they levelled the playing field against air interdictors was in Charlie Wilson's Afghanistan when the Mujahideen took on that era of Soviet gunships with hand-held AA systems.


Italy has had a joke of a tank fleet for the last decade(s); it clearly wasn't on par with the other European alternatives, and the defense budget went to other places, namely the navy and air force. Since some of the Italian tanks are being decommissioned, there was a need for a new tank, but not having invested much in tanks meant not having cutting-edge technology to develop an alternative. So the decision was taken to procure tanks from Rheinmetall (Leonardo was open to working with another German producer, but they didn't want to transfer technology and produce in Italy).

Not having tanks is a big no-no given recent events. I don't see the grudge you hold against them, when it's clear they wouldn't be used in the same flat scenario as Ukraine or Poland. Given that Italy is highly mountainous, it made sense to prioritize the air force and navy first. I think they're also compatible with anti-drone guns

Piaggio Aerospace was acquired a few weeks ago by Baykar; hopefully it will produce drones.

I do not get why you're saying that a change in the approach to war makes tanks not useful anymore, when it's clear they will eventually be adapted to counter drones and similar threats


> I can't imagine what would happen with a 500$bln investment in defense,lol.

The $90,000 bag of bushings becomes a $300,000 bag?


No, at least in Italy we would create some gimmick consultant position for some ex-politician and repay him/her for past favours /s

It's not one or the other, though. AI-controlled drones are already a thing in Ukraine, today.

Been saying this for years, it's been fucking baffling. Generating images, video and text that sort-of resembles what a human would come up with is genuinely quite impressive. It is not "let's claim it'll fix our country" (looking at you, Keir) impressive though, and I cannot believe so much money has been pumped into it.

But you have to overpromise and underdeliver, otherwise you won't receive that sweet, sweet money

Yes, I'd like to see some examples where our current AI can actually extrapolate rather than interpolate. Let it invent new things, new drawing styles, new story plots, etc. Maybe _then_ it will impress me.


I'm not convinced. This is using the tooling and paradigms invented by humans.

Science is a paradigm invented by humans. If a human uses this paradigm to invent something he is considered intelligent but if an ai uses it it’s not? All humans use human paradigms and they are intelligent.

If a human uses such a paradigm to success it is the same.


Biology is a field where you can find new things relatively easily, if you follow existing protocols and do the hard lab work (which costs time and money).

Ah yes. I believe this is commonly called "moving the goalposts".

No it isn't.

Can you? Most people can't do anything that isn't 99% derivative.

I can only say that exponential curves grow nominally sublinearly before they take off. AI is not quite at the obvious take off point, but owners of the biggest clusters have seen the extrapolations and it isn't pretty - once your competitor achieves take off and you aren't anywhere close, you're done for. The risk of not participating in that are too great.

> This is machine learning except the machines are bigger than ever before. There is no intellect.

Define "intellect".


What is even the possible usage of AI for national security? Generating pictures of kittens riding nuclear weapons to the very end like in Dr Strangelove?

> What is even the possible usage of AI for national security? Generating pictures of kittens riding nuclear weapons to the very end like in Dr Strangelove?

For all that critics of AI dismiss them as lacking imagination, your reaction suggests a lack of imagination.

Off the top of my head: facial recognition and identification to make "smart" guns that hit specific targets with reduced collateral damage (as found on most digital cameras even before smartphones); creating and A/B testing propaganda campaigns; using modified wifi signals as wall-penetrating radar capable of pose estimation, heart rate and breathing monitoring[0]; take any self-driving car's AI and conditionally invert the part that says "don't hit pedestrians" when a certain target is spotted; ANPR to track specific vehicles with known owners over long distances; alternative targeting systems for cruise missiles in the absence or jamming of GPS; using them as red teams in war-game exercises; using them to automate intrusion detection by monitoring for changes to the background distributions of basically every measurable event; person-tracking by watching CCTV in secure areas; control systems for security robots (think Boston Dynamics' Spot) that are currently in deployment.

There's likely a lot more, too.

[0] https://openaccess.thecvf.com/content_cvpr_2018/papers/Zhao_...


Gentlemen you can't use AI here, this is the deep learning lab!

Lol: Where I live (Memphis) both “one” and “two” are considered two syllable words. Seriously. Our kids were taught this in the best public elementary school.

How does that work, one-uh, two-uh?

egzacly

Autonomous weapons.

Agreed. I was working on some haiku things with ChatGPT and it kept telling me that busy has only one syllable. This is a trivially searchable fact.

link a chat please

It wasn't just busy that it failed on. I was feeding it haikus and wanted them broken into a list of 17 words/fragments. Certain 2 syllable words weren't split and certain 1 syllable words were split into two.

Also, the narrative that we are currently on the brink of an AI explosion, and that this random paper shows it, has been the same tired old story handed out by AI hawks for years now. Like yes, I agree with the general idea that more compute means more progress for humans, and perhaps having a more responsive user interface through some kind of AI-type technology would be good. But I don’t see why that will turn into Data from Star Trek. I also think all these AI hawks kind of narcissistically overvalue their own being. Like, blink and their lives are over in the grand scheme of things. Maybe our “awareness” of the world around us is an illusion provided by evolution because we needed it to value self-preservation whereas other animals don’t. There is an inherent belief in the specialness of humans that I suppose I mistrust.

> But I don’t see why that will turn into Data from Star Trek.

"Is Data genuinely sentient or is he just a machine with this impression" was a repeated plot point in TNG.

https://en.wikipedia.org/wiki/The_Measure_of_a_Man_(Star_Tre...

https://en.wikipedia.org/wiki/The_Offspring_(Star_Trek:_The_...

https://en.wikipedia.org/wiki/The_Ensigns_of_Command

https://en.wikipedia.org/wiki/The_Schizoid_Man_(Star_Trek:_T...

Similar with The Doctor on VOY.

Even then, what we have with LLMs is basically already at the level of the ship's main computer as it was written in TNG/DS9/VOY.

But what counts as personhood? No idea. We're as ignorant as of that as the writers of the TV show; humanity's best attempts at philosophy of mind are still making the same kinds of arguments today as we were a century ago, the only real shift is to be slightly more inclusive of other species besides Homo Sapiens, and the only real improvement over two centuries ago is to include all of Homo Sapiens.


The main computer does not make choices stochastically and always understands what people ask it. I do not think that resembles the current crop of LLMs. On Voyager, the ship's computer is some kind of biological computing entity that they eventually give up on as a story topic, but there is an episode where the bio-computing gel packs get sick.

I believe Data and the Doctor would both be people to me. But is Minuet? The woman created by the Bynars to distract Riker so they could steal the ship? Star Trek goes back and forth on who counts as a person. This is one of the points made in the episode where the Doctor learns his hologram has been cloned many times to work the mines on an asteroid, and he says this is slavery. Does Minuet get to say she doesn't want to seduce Riker if she doesn't feel like it? Or is she a program that does whatever the programmer wants?

But moreover, linking back to the original technology, does EQTransformer have awareness? Do we treat it as a thinking being that we enslaved to find earthquakes in data because it uses the same technology as ChatGPT? I don't think anyone thinks we should, but it seems like that is where people want to take it. That's strange to me. At what point do data pipelines doing some kind of stochastic transformation and summarization of training data become an individual that expresses a desire for autonomy like Data or the Doctor?

I think there’s lots of questions here to answer and I don’t know the answers to them.


> The main computer does not make choices stochastically and always understands what people ask it.

The mechanism is never explained, but no, it doesn't always understand correctly — and neither does Data. If hologram-Moriarty is sentient (is he?), then the capability likely exceeds what current LLMs can do, but the cause of the creation is definitely a misunderstanding.

Even the episode where that happens, the script for Dr. Pulaski leading up to Moriarty's IQ boost was exactly the same arguments used against LLMs: https://www.youtube.com/watch?v=4pYDy7vsCj8

(Common trope in that era being that computers (including Data) are too literal, so there was also: https://www.youtube.com/watch?v=HiIlJaSDPaA)

Similar with every time the crew work iteratively to create something in the holodeck. And, of course: https://www.youtube.com/watch?v=srO9D8B6dH4

> I do not think that resembles the current crop of LLMs. On voyager the ships computer is some kind of biological computing entity that they eventually give up on as a story topic but there is an episode where the bio computing gel packs get sick.

"Take the cheese to sickbay" is one of my favourite lines from that series.

> But is minuet?

I would say the character was a puppet, with the Bynars pulling the strings, because the holo-character was immediately seen as lacking personhood the moment they stopped fiddling with the computer.

Vic Fontaine was more ambiguous in that regard. Knew he was "a lightbulb", but (acted like) he wanted to remain within that reality in a way that to me felt like he was *programmed* to respond as if the sim around him was the only reality that mattered rather than having free will in that regard.

(But who has total free will? Humans are to holograms as Q is to humans, and the main cast were also written to reject "gifts" from Riker that time he briefly became a Q).

The villagers of Fair Haven were, I think, not supposed to be sentient (from the POV of the crew), but were from the POV of the writers: https://en.wikipedia.org/wiki/Fair_Haven_(Star_Trek:_Voyager... and https://en.wikipedia.org/wiki/Spirit_Folk_(Star_Trek:_Voyage...

> does eqtransformer have awareness?

There's too many different definitions for a single answer.

We don't know what part of our own brains gives us the sensation of our own existence; and even if we did, we wouldn't know if it was the only mechanism to do so.

To paraphrase your own words:

At what point does chemical pipelines doing some kind of stochastic transformation and electrochemical integration of sensory input become an individual that presents a desire for autonomy like data or the doctor?

I don't know. Like you, I'd say:

> I think there’s lots of questions here to answer and I don’t know the answers to them.


I find the last part of the paragraph offputting and I agree

It used to be much easier to be conservative about AI, especially AGI, after living through three cycles of AI winters. No more. Dismissing it as “merely machine learning” is worse than unfair to the last decade of machine learning ;-)

The hard part now is relatively trivial. Does anyone think that there is a fundamental and profound discovery that evolution made purely by selection in the last 200,000 years? I mean a true qualitative difference?

Sure: we call it language, which is just another part of a fancy animal’s tool kit.

Does anyone think there is an amazing qualitative difference between the brain of a chimp and the brain of a human?

No, not if they know any biology.

(Although that does not stop some scientist from looking for a “language gene” like FOXP2.)

So what did dumb mutations and 200,000 years of selection do that a group of dedicated AI scientists cannot do with their own genuine general intelligence?

Nothing; nothing other than putting a compact, energy-efficient LLM with reinforcement learning on a good robotic body and letting it explore and learn like we did as infants, toddlers and teenagers.

Each one of us has experienced becoming a “general intelligence”. I remember it hit me on the head in 6th grade when I dreamed up a different way of doing long division. I remember thinking: “How did I think that?” And each one of us who has watched an infant turn into a toddler has watched it as an observer or teacher. This is what makes babies so fascinating to “play” with.

We have to give our baby AGI a private memory and a layer of meta-attention like we all gain as we mature, love, and struggle.

I read the linked article, and as a neuroscientist I realized the “wait” cycles that improved performance so much are roughly equivalent to the prefrontal cortex: the part of the CNS most responsible for enabling us to check our own reasoning recursively. Delay, as in delayed gratification, is a key attribute of intelligent systems.

We are finally on the doorstep of Hofstadter’s Strange Loop and Maturana’s and Varela’s “enactive” systems, but now implemented in silicon, metal, and plastic by us rather than by dumb but very patient natural selection.

Karl Friston and Demis Hassabis (two very smart neuroscientists) figured this out years ago. And they were preceded by three other world-class neuroscientists: Humberto Maturana, Francisco Varela, and Rich Sutton (honorary neuroscientist). And big credit to Terry Winograd for presaging this path forward long ago too.


> I think using any of this in a national security setting is stupid

What about AI enabled drones and guided missiles/rockets? The case for their effectiveness is relatively simple in terms of jamming resistance.


drone and missile guidance system development has been using ML for decades at this point. That's just as much "AI" as anything currently coming out of the LLM craze.

It's not just target guidance at this point. There are prototypes of drone swarms, for example.

Like a lot of AI boosters, would you like to explain how that works, other than magic AI dust? Some forms of optical guidance are already in use, but there are other limitations (lighting! weather!)

Sure thing. The basic idea would be:

1) Have a camera on your drone

2) Run some frames through a locally running version of something like AWS Rekognition's celebrity identification service, but for relevant military targets

3) Navigate towards the coordinates of target individuals

It isn't exactly magic; here's a video of a guy doing navigation with OpenCV on images: https://www.youtube.com/watch?v=Nrzs3dQ9exw


I believe this is a capability that the Switchblade 600 or STM KARGU already has.

https://en.wikipedia.org/wiki/STM_Kargu


I think jamming resistance is a red herring. AI weapons will have their own failure modes due to jamming; any sensor modality has its own particular weaknesses. There are also reasoning-model malfunctions, i.e. hallucinations.

Not to mention false GPS etc...


I would say that they don't require a $500bln investment. AFAIK, drones that help lock onto targets have already started being used in Ukraine

I generally agree, piggybacking on innovations in smartphone GPUs / batteries will probably be enough to get locally running AI models in drones.

This somehow reminds me of a certain killer robot from a Black Mirror episode ;)

> This is machine learning

Yeah, I was thinking about this while trying to figure out author affiliations.

There was a Stanford paper a few years ago that dusted off some old intelligence concepts and the authors seemed excited about it.

But given the pace of AI, it's difficult to look in new directions. It will probably take an AI winter and some unbridled enthusiasm immune to burnout to make some real progress outside of feed forward neural networks.


You would solve the AI problem if you correctly defined what intellect even is.

I agree AGI won't solve national security, but saying this isn't intelligence is false.

This is AI, and trend lines point to an intelligence that matches or barely exceeds human intellect in the future.

You're part of a trend of people in denial. When LLMs first came out, there were hordes of people on HN claiming they were just stochastic parrots and displayed zero intellectual ability. It is now abundantly clear that this is not true.

We don't fully understand LLMs. That's why gains like CoT are just black-box adjustments that come from changing external configurations. We have no way to read the contents of the black box and make adjustments based on it. Yet idiots like you make such vast and hard claims when nobody really fully understands these things. You're delusional.

I agree that LLMs won’t allow us to make some super weapon to give us some edge in national security.


> then you absolutely should want even more money poured into AI development, to make it go even faster.

Indeed. People are welcome to go "all in" on whatever nonsense gambling they want to do with their personal investments, but national security demands actually thinking about things - adversarially. Because the enemy will as well.

It's perfectly possible to lose a war by investing in expensive superweapons that under deliver. The Nazis were particularly bad at this.


That sovereign wealth fund with TikTok might set a good precedent; when we have to 'pour money' into these companies, we can do so with a stake in them held in our sovereign wealth fund.

Extra-legal financial instruments meant to suck money from other federal departments don't strike me as a good precedent in any sense. I don't disagree, though, that nationalizing the value of enormous public investments is something we should be considering (looking at you, oil industry). But until Congress appropriates the money under law, it's a pipe dream or theft.

Sorry for being lazy, but I just don't have the time right now to read the paper. Is there in the paper or somewhere else a comparison based on benchmarks of S1 vs R1 (the full R1, not quantized or distilled)?

The S1 paper is not meant to compete with R1. It simply shows that with 1k well-curated examples for finetuning (26 minutes of training on 16 GPUs) and a simple hack for controlling the length of the thinking process, one can dramatically increase the performance of a non-reasoning model and show a clear increase in benefit with increased test-time compute. It is worth a quick skim.

> Going forward, it’ll be nearly impossible to prevent distealing (unauthorized distilling). One thousand examples is definitely within the range of what a single person might do in normal usage, no less ten or a hundred people. I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.

(sorry for the long quote)

I will say (naively perhaps) "oh, but that is fairly simple". For any API request from 'unverified' users, add a 5-second delay before the next one is served. Make a "blue check" tier (à la X/Twitter). For the big sales, have a third-party vetting process, so that if US Corporation XYZ wants access, they prove themselves worthy (i.e. not Chinese competition) and then you do give them the 1000/min deal.

For everyone else, add the 5-second (or whatever other duration makes sense) timer/overhead and watch them drop from 1000 requests per minute to 500 per day. Or just cap them at 500 per day and close that back door. And if you get 'many cheap accounts' doing hand-overs (AccountA does 1-500, AccountB does 501-1000, AccountC does 1001-1500, and so on), then you mass-block them.
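A minimal sketch of that kind of throttle (the limits and field names here are made up for illustration, not any provider's actual policy):

    # Per-request delay plus a daily cap for unverified accounts; limits are illustrative.
    import time
    from collections import defaultdict
    from datetime import date

    DAILY_CAP = 500
    DELAY_SECONDS = 5
    usage = defaultdict(lambda: {"day": date.today(), "count": 0})

    def allow_request(account_id: str, verified: bool) -> bool:
        if verified:
            return True                        # vetted accounts keep the high rate limit
        record = usage[account_id]
        if record["day"] != date.today():      # reset the counter each day
            record["day"], record["count"] = date.today(), 0
        if record["count"] >= DAILY_CAP:
            return False                       # cap reached, reject until tomorrow
        record["count"] += 1
        time.sleep(DELAY_SECONDS)              # slow unverified traffic to a trickle
        return True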



