Kinda disappointed that they're hiding the thought process. Hopefully the open source community will figure out how to effectively match and replicate what OpenAI is doing.
I wonder how far we are from having a model that can correctly solve a word soup search problem directly from just a prompt and input image. It seems like the crossword example is close. For a word search it would require turning the image into an internal grid representation, prepare the list of words, and do a search. I'd be interested in seeing if this model can already solve the word grid search problem if you give it the correct representation as an input.
> Hopefully the open source community will figure out how to effectively match and replicate what OpenAI is doing.
No need for that, there is a Reflection 70B model that does the exact same thing - with chains of thought being separated from the "final answer" via custom 'tag' tokens.
Some reported benchmarks do seem to be rather dubious, but the 70B model itself is quite real. Sample output:
$ ollama run reflection:70b-q4_0
>>> hello
<thinking>
To respond to "hello", I'll need to consider several factors:
1. The user's intent: They're likely saying hello as a greeting.
2. Appropriate responses: Common ways to respond to "hello" are:
- Hello back
- Hi
- Hey
3. Formality level: Since this is an AI response, I'll aim for a friendly but professional tone.
<reflection>
The approach of responding with a standard greeting seems appropriate in this context. It acknowledges the user's hello and provides a polite response.
</reflection>
Given these considerations, the most suitable response would be to echo "hello" back to the user.
</thinking>
<output>
Hello!
</output>
There was a hackernews post a few days ago, pointing to a reddit thread where some guys proved that the founder/s of relection AI were faking their model by just passing the input to Claude (Sonnet 3.5) and stripping the word "Claude" from the output, amongst other things. Then when they got caught they switched it to GPT 4-o.
After this, I will be very skeptical to anything they claim to achieve.
I have access to the model via the web client and it does show the thought process along the way. It shows a little icon that says things like "Examining parser logic", "Understanding data structures"...
However, once the answer is complete the chain of thought is lost
I wonder how far we are from having a model that can correctly solve a word soup search problem directly from just a prompt and input image. It seems like the crossword example is close. For a word search it would require turning the image into an internal grid representation, prepare the list of words, and do a search. I'd be interested in seeing if this model can already solve the word grid search problem if you give it the correct representation as an input.