> The "best" models just made stuff up to meet the requirements. They lied in th...

bambax · 2024-12-18T12:47:06 1734526026

Looks interesting, but the cost is prohibitive for a hobby project. Also, it doesn't really solve my problem.

Google Vision already returns the coordinates of each word (and even of each letter), so it's easy to know where the word was on the page, and even, if necessary, to rebuild the page with the words correctly placed -- that's fundamentally what I do with the mouseover on the interactive demo: https://divers.medusis.net/boislisle/pub (at the paragraph level).

But my problem isn't to know where the words are (Google Vision provides that); it's to know what belongs to what, what is footnotes, what is main text, etc. This is what the post discusses. Just having the text following the same layout as in the original wouldn't help, because I'm not trying to reproduce the layout or the typesetting, I want to rebuild the content semantically, so as to do different "flows".

That said, it got me thinking... there may be an opportunity to do a cheaper version of LLMwhisperer? ;-)