My nuclear fire hot take is that the chat pattern is actively hampering AI tools...

disqard · 2025-02-04T17:13:55

Whoa! You broke my brain a bit there (but your posts often do, in a Good way!)

Would you be so kind as to ELI5 what you did in that index.js?

I've used ollama to run models locally, but I'm still stuck in chat-land.

Of course, if a blog post is in the works, I'll just wait for that :)

xena · 2025-02-04T17:26:19

The file explains it a bit, but my blog post https://xeiaso.net/notes/2025/s1-simple-test-time-scaling/ probably doesn't explain it as well as it could. I'll write up more later, but just for you, here's a summary of what I'm going to end up writing.

AI models fundamentally work on the basis of "given what's before, what comes next?" When you pass messages to an API like:

    [
      { "role": "system", "content": "You are an expert in selling propane and propane accessories. Whenever someone talks about anything that isn't propane, steer them back." },
      { "role": "user", "content": "What should I use to cook food on my grill?" },
      { "role": "assistant", "content": "For cooking food on your grill, using propane is a great choice due to its convenience and efficiency. [...]" }
    ]

Under the hood, the model actually sees something like this (using the formatting that DeepSeek's Qwen 2.5 32b reasoning distillation uses):

    You are an expert in selling propane and propane accessories. Whenever someone talks about anything that isn't propane, steer them back.
    <｜User｜>What should I use to cook food on my grill?<｜endofsentence｜>
    <｜Assistant｜>

And then the model starts generating tokens to get you a reply. What the model returns is something like:

    For cooking food on your grill, using propane is a great choice due to its convenience and efficiency. [...]<｜endofsentence｜>

The runtime around the model then appends that as the final "assistant" message and sends it back to the user so there's a façade of communication.
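To make that round trip concrete, here is a minimal sketch of what such a runtime might do. The function names `renderPrompt` and `toAssistantMessage` are made up for illustration, and the layout (including the newlines between turns) just mirrors the example above rather than the model's official chat template, so treat it as an approximation:

```javascript
// Special tokens copied from the example above; other model families use different ones.
const USER = "<｜User｜>";
const ASSISTANT = "<｜Assistant｜>";
const END = "<｜endofsentence｜>";

// Flatten a list of chat messages into the single string the model continues.
// (Hypothetical helper; the turn layout is an assumption based on the example.)
function renderPrompt(messages) {
  let prompt = "";
  for (const { role, content } of messages) {
    if (role === "system") prompt += content + "\n";
    else if (role === "user") prompt += USER + content + END + "\n";
    else if (role === "assistant") prompt += ASSISTANT + content + END + "\n";
    else throw new Error(`unknown role: ${role}`);
  }
  // Leave the assistant turn open so the model's next tokens become the reply.
  return prompt + ASSISTANT;
}

// Wrap the raw completion back up as a chat message, dropping the end marker.
function toAssistantMessage(completion) {
  const text = completion.endsWith(END)
    ? completion.slice(0, -END.length)
    : completion;
  return { role: "assistant", content: text };
}
```

The point is that "roles" only exist on the outside. By the time the text reaches the model it is one flat string, and the reply is whatever comes after the open assistant marker.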

What I'm doing here is manually assembling the context window so that I can take advantage of that and nudge the model into thinking more. The basic context window looks like:

    Follow this JSON schema: [omitted for brevity]
    <｜User｜>Tell me about Canada.<｜endofsentence｜>
    <｜Assistant｜><think>Okay

And then the model will output reasoning steps until it sends a </think> token, which can be used to tell the runtime that it's done thinking and to treat any tokens after that as the normal chat response. However, sometimes the model stops thinking too soon, so what you can do is intercept this </think> token and then append a newline and the word "Wait" to the context window. Then when you send it back to the model, it will second-guess and double-check its work.
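A rough sketch of that interception loop, not the actual index.js: `generate` is a hypothetical stand-in for whatever raw completion call your runtime exposes (it takes the context string and resolves to the text the model produced after it), and `thinkLonger` and `extraRounds` are names invented for this example.

```javascript
const THINK_END = "</think>";

// Force the model to keep reasoning for `extraRounds` extra passes.
// `generate(context)` is an assumed async function returning the model's
// continuation of `context` as a string.
async function thinkLonger(generate, context, extraRounds) {
  for (let round = 0; round < extraRounds; round++) {
    const chunk = await generate(context);
    const cut = chunk.indexOf(THINK_END);
    if (cut === -1) {
      // The model never closed its thinking (for example, it hit a token
      // limit), so there is nothing to intercept. Return what we have.
      return context + chunk;
    }
    // Keep the reasoning so far, drop "</think>" and anything after it,
    // and append "Wait" so the model second-guesses itself.
    context += chunk.slice(0, cut) + "\nWait";
  }
  // Final pass: let the model finish thinking and write its answer.
  return context + (await generate(context));
}
```

You would call it with the hand-assembled context from above (ending in `<think>Okay`) and however many extra rounds you want. Everything after the last `</think>` in the returned string is the normal chat response.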

The paper s1: Simple test-time scaling (https://arxiv.org/abs/2501.19393) concludes that this is probably how OpenAI implemented the "reasoning effort" slider for their o1 API. My index.js file applies this principle and has DeepSeek's Qwen 2.5 32b reasoning distillation think for three rounds of effort and then output some detailed information about Canada.

In my opinion, this is the kind of thing that people need to be more aware of, and the kind of stuff that I use in my own research for finding ways to make AI models benefit humanity instead of replacing human labor.

disqard · 2025-02-04T18:14:21

Thank You so much for making time to write that up! Deeply appreciated.

It's fascinating how this "turn-taking protocol" has emerged in this space -- as a (possibly weird) analogy, different countries don't always use the same electrical voltage or plug/socket form-factor.

Yet the `role` and `content` attributes in JSON appear to be pretty much a de facto standard now.