They're just outputting tokens that resemble a reasoning process. The underlying tech is still the same LLM it always has been.
I can't deny that doing it that way improves results, but any model could do the same thing if you add extra prompts to encourage the reasoning process, then use that as context for the final solution. People discovered that trick before "reasoning" models became the hot thing. It's the "Work it out step by step" trick but in a dedicated fine-tune.
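That two-pass flow is easy to reproduce by hand. Here's a minimal sketch, assuming `complete` is whatever prompt-in/text-out call you have handy (it's a stand-in, not a real library function):

    from typing import Callable

    # The pre-"reasoning model" trick: ask for the reasoning first, then
    # feed that reasoning back in as context for the final answer.
    # `complete` is a stand-in for any text-completion call.
    def answer_with_scratchpad(question: str,
                               complete: Callable[[str], str]) -> str:
        reasoning = complete(
            question + "\n\nWork it out step by step before answering."
        )
        return complete(
            question
            + "\n\nHere is a step-by-step analysis:\n" + reasoning
            + "\n\nUsing the analysis above, give only the final answer."
        )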
> They're just outputting tokens that resemble a reasoning process.
Looking at one such process of emulating reasoning (I run deepseek-70B locally), I'm starting to wonder how that differs from actual reasoning. We "think" about something, may make errors in that thinking, look for things that don't make sense, and correct ourselves. That "think" step is still a black box.
I asked that LLM a typical question about gas exchange between containers; it made some errors and noticed some calculations that didn't make sense:
> Moles left A: ~0.0021 mol
> Moles entered B: ~0.008 mol
> But 0.0021 +0.008=0.0101 mol, which doesn't make sense because that would imply a net increase of moles in the system.
Well, that's a totally invalid calculation; it should be a "-" in there. It also noticed elsewhere that those two quantities should be the same.
Eventually, after 102 minutes and 10141 tokens, which involved checking the answer from different angles multiple times, it output an approximately correct response.
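Spelling out the sanity check it was fumbling, with its own numbers (a trivial sketch; the fix is just swapping the "+" for a "-" and expecting roughly zero):

    moles_left_A = 0.0021     # moles that left container A (its numbers)
    moles_entered_B = 0.008   # moles that entered container B (its numbers)
    # In a closed system these two should match, so the check is a
    # difference, not a sum.
    imbalance = moles_entered_B - moles_left_A
    print(f"imbalance: {imbalance:.4f} mol")  # ~0.0059 mol -> the numbers disagree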
Does it matter if it doesn't know why this particular pattern is suitable? Also, do you always ask yourself why you use a particular pattern, or do you just use it?
It seems like you are implying that I don't think before I speak. Maybe that is sometimes the case, but I would venture to say, "not usually, and certainly not always."
The point I'm making here is that all of these observations are made after-the-fact. We humans see five different categories of output:
1. "I do know X" where X is indeed correct information
2. "I do know X" where X is false information or nonsense
3. "I don't know" when it really doesn't
4. "I don't know" when a slightly different prompt would lead to option #1
5. Output that is not phrased as a direct answer to a question.
The article introduced #2 as "hallucinations". I introduced #4 in my previous comment (and just now #5), and propose that all five are hallucinations.
As far as the LLM is concerned, there is only one category of output: the most likely next token. Which of the five it will be is determined by the examples present in the training corpus and how they are weighted during training.
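To put "the most likely next token" concretely, here is a toy sketch (made-up tokens and scores, not any real model's internals or API):

    import math, random

    # A single next-token step: the model only produces a score per token.
    # Whether the continuation happens to be true, false, or "I don't know"
    # is downstream of which token the training data made most likely.
    logits = {"true": 2.1, "false": 0.3, "maybe": -0.5}   # made-up scores
    total = sum(math.exp(v) for v in logits.values())
    probs = {tok: math.exp(v) / total for tok, v in logits.items()}

    next_token = max(probs, key=probs.get)  # greedy pick
    # Sampling instead of picking the max is one reason a slightly
    # different prompt can flip the answer:
    sampled = random.choices(list(probs), weights=list(probs.values()))[0]
    print(probs, next_token, sampled)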
Logic is not present in the process. It is only present in the result.
> It seems like you are implying that I don't think before I speak.
I'm implying that most of the time you don't think about your thinking, either before or after it (neither you nor I typically meta-think).
I'm saying that very often I (and, it looks like, a lot of people around me) don't think much before I speak. I have an internal monologue when I'm "thinking something out", but I typically don't think things through when I'm speaking with people in day-to-day conversations, only when I encounter a problem I haven't seen yet and am not "trained" in solving. Maybe some people can compose fully reasoned sentences in a split second before they start talking, but not me. IIRC those two modes are called slow and fast thinking.
> Logic is not present in the process. It is only present in the result.
I'm talking about that process. Have you seen the "thinking" part of current reasoning LLMs? It does indeed look like a process of using logic. After the "thinking" part, there is an "output" part that draws conclusions from that thinking process. Recently I asked a local version of deepseek about a gas exchange problem and it thought about it a lot, making some small mistakes in logic, correcting them, and ultimately returning an approximately valid result. It even made some small errors in calculation and corrected itself by multiplying parts of the numbers and adding them up to get the correct result. I've put that example online[1] if you'd like to read it; it's pretty interesting.
I guess the crux of it is this: is it training or awareness?
What I see happening between the <think> tags of Deepseek-R1 is essentially a premade set of circular prompts. Each of these prompts is useful, because it explores a path of tokens that are likely to match a written instance of logical deduction.
When the <think> continuation rewrites part of a prompt as a truthy assertion, it reaches a sort of fork in the road: to present a story of either acceptance or rejection of that assertion. The path most likely followed depends entirely on how the assertion is phrased (both in the prompt, and in the training corpus). Remember that back in the training corpus, example assertions that look sensible are usually followed by a statement of acceptance, and example assertions that look contradictory or fallacious are usually followed by a statement of rejection.
Because the token generation process follows an implicit branching structure, and because that branching structure is very likely to match a story of logical deduction, the result is likely to be logically coherent. It's even likely to be correct!
The distinction I want to make here is that these branches are not logic. They are literary paths that align to a story, and that story is - to us - a well-formed example of written logical deduction. Whether that story leads to fact or fiction is no more and no less than an accident. We humans often tend to follow a similar process, but we can actively choose to do real critical thinking instead.
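To make the loop between the <think> tags concrete, here is a caricature of how I read it (`complete` stands in for any text-completion call; this is my sketch, not Deepseek's actual implementation):

    from typing import Callable

    # Keep feeding the model its own scratchpad until it "decides" to stop,
    # then ask for the answer. Each round just extends the most familiar
    # continuation of the scratchpad; the acceptance-or-rejection fork is
    # settled by whichever phrasing the training corpus makes more likely.
    def think_then_answer(question: str, complete: Callable[[str], str],
                          max_rounds: int = 8) -> str:
        scratchpad = "<think>\n"
        for _ in range(max_rounds):
            step = complete(question + "\n" + scratchpad)
            scratchpad += step + "\n"
            if "</think>" in step:   # the model stops exploring
                break
        return complete(question + "\n" + scratchpad + "\nFinal answer:")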
This design pattern is really useful for a few reasons:
- it keeps the subjects of the prompt in context
- it presents the subjects of the prompt from different perspectives
- it often stumbles into a result that is equivalent to real critical thinking
On the other hand,
- it may fill the context window with repetitive conversation, and lose track of important content
- it may get caught in a loop that never ends
- it may confidently present a false conclusion to itself, then expand that conclusion into a whole thread
- the false conclusions it presents will be much less obvious, because they will always be written as if they came out of a thorough process of logical deduction
I find that all of these problems are much more likely to occur when using a smaller locally hosted copy of the model than when using the full-sized one that is hosted on chat.deepseek.com. That doesn't mean these problems are solved by using a bigger model, only that its set of familiar examples is large enough to fit most use cases. The more unique and interesting your conversation is, the less utility these models will have.
> We humans often tend to follow a similar process, but we can actively choose to do real critical thinking instead.
> - it may confidently present a false conclusion to itself, then expand that conclusion into a whole thread
I want to know how that differs from human "real critical thinking", because I may be missing this function. How do you know whether what you thought of is true or false? I only know it because I think I know it. I have made a lot of mistakes in the past with a lot of confidence.
> The more unique and interesting your conversation is, the less utility these models will have.
Yeah, that also happens with a lot of people I know.
> ... the result is likely to be logically coherent. It's even likely to be correct!
Yeah, a lot of training data made sure that what it outputs is as correct as possible. I still remember my own training over many days and nights to be able to multiply properly, with two different versions of the multiplication table and many wrong results until I got it right.
> I guess the crux of it is this: is it training or awareness?
I don't think LLMs are really aware (yet). But they do indeed follow a logical reasoning method, even if not a perfect one yet.
Just a thought: when do you think about how and what you think (awareness of your thoughts)? While you actually think through a problem, or after that thinking? Maybe to be self-aware, AIs should be given some "free-thinking time". Currently it's "think about this problem and then immediately stop, do not think any more", and the training data discourages any "out-of-context" thinking, so they don't do it.
We know what true and false mean. An LLM knows what true and false are likely to be surrounded with.
The problem is that expressions of logic are written in many ways. Because we are talking about instances of natural language, they are often ambiguous. LLMs do not resolve ambiguity. Instead, they continue it with the most familiar patterns of writing. This works out when two things are true:
1. Everything written so far is constructed in a familiar writing pattern.
2. The familiar writing pattern that follows will not mix up the logic somehow.
The self-prompting train-of-thought LLM pattern is good at keeping its exploration inside these two domains. It starts by attempting to phrase its prompt and context in a particular familiar structure, then continues to rephrase it with a pattern of structures that we expect to work.
Much of the logic we actually write is quite simple. The complexity is in the subjects we logically tie together. We also have some generalized preferences for how conditions, conclusions, etc. are structured around each other. This means we have imperfectly simplified the domain that the train-of-thought writing pattern is exploring. On top of that, the training corpus may include many instances of unfamiliar logical expressions, each followed by a restatement of that expression in a more familiar/compatible writing style. That can help trim the edge cases, but it isn't perfect.
---
What I'm trying to design is a way to actually resolve ambiguity and do real logical deduction from there. Because ambiguity cannot be resolved to a single correct result (that's what ambiguity means), my plan is to use an arbitrary backstory for disambiguation each time. This way, we could be intentional about the process instead of relying on the statistical familiarity of tokens to choose for us. We would also guarantee that the process itself is logically sound, and fix it where it breaks.
OpenAI claims recent models are actually reasoning to some extent.