> Attempts to get LLMs with vision to properly identify zones were also found to be slow and unreliable, and the risk of hallucinated results is unacceptable, especially as a first step. Non-deterministic systems may be fine for creative projects, but not here. (Once we have a reliable reference, we can then play with LLMs and, if necessary, control the results by measuring the distance to the source.)
He tried it for fixing footnotes and the result went "classic LLM":
> It was a complete flop. Using OpenRouter, I tested over 200 models. More than 70% couldn't even count the footnotes right, but that wasn't the worst part.
> The "best" models just made stuff up to meet the requirements. They lied in three ways:
>
> - Basic (stupid) lies: wrong counts but claiming they matched ('footnotes: 5, references: 3, match: true')
> - Better lies: claiming they placed references when they hadn't
> - Premium lies: making up new text to attach footnotes to when they weren't sure where they went (against explicit instructions in the prompt never to do that)
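
All three kinds of lie can be caught with ordinary deterministic code rather than by trusting the model's self-report, which is roughly the "measure the distance to the source" idea from the first quote. Here is a minimal Python sketch. It assumes footnote references are written as `[^1]`-style markers, which is my assumption and not something stated in the quoted post. `check_output`, `strip_markers`, and the `min_ratio` threshold are hypothetical names made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumed marker syntax: Markdown-style "[^1]" references (hypothetical choice).
MARKER = re.compile(r"\[\^(\d+)\]")


def strip_markers(text: str) -> str:
    """Remove footnote reference markers and collapse whitespace."""
    return " ".join(MARKER.sub("", text).split())


def check_output(source: str, output: str, footnote_count: int,
                 min_ratio: float = 1.0) -> dict:
    """Verify a model's output without relying on anything the model claims."""
    # Count the references that are actually present in the output.
    placed = MARKER.findall(output)
    # Distance to the source: once markers are removed, the text should be unchanged.
    ratio = SequenceMatcher(
        None, strip_markers(source), strip_markers(output), autojunk=False
    ).ratio()
    return {
        "references_placed": len(placed),
        "counts_match": len(placed) == footnote_count,
        "similarity": ratio,
        "text_unchanged": ratio >= min_ratio,
    }


source = "The cat sat. The dog ran."
honest = "The cat sat.[^1] The dog ran.[^2]"
invented = "The cat sat.[^1] The bird flew away.[^2]"

print(check_output(source, honest, footnote_count=2))
print(check_output(source, invented, footnote_count=2))
```

In this sketch, a wrong count or a claimed-but-missing reference shows up as `counts_match` being false, and invented text shows up as `text_unchanged` being false, regardless of what the model says about its own work. The strict default of `min_ratio=1.0` is a design choice for this illustration; a real pipeline might need a looser threshold if whitespace or punctuation is legitimately normalized.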