Hacker News new | past | comments | ask | show | jobs | submit login

I just tested Gemini 1.5 Flash (interactively on Google AI Studio) and the results are far from acceptable.

OCR seems good, on par with Google Vision.

But the footnotes are not properly identified on most pages; they are properly identified when there is a large gap and the first line of the footnotes starts with a number; but when the footnotes block starts with text (continuing a footnote from a previous page) and/or the gap is small or almost non-existent, it fails (all text on the page is considered belonging to main text).

But the main problem isn't even that, it's that it takes between 10 to 20 seconds per page. That would mean over three hours per volume of 600 pages. Google Vision takes less than one second per page.

It's possible there is a setup cost and that doing batches or even full PDFs would be better, though. Do you have experience with this? And can you maybe share "prompt secrets" that would improve the results...?






Gemini 1.5 pro worked better for me at Korean OCR on camera phone taken scans so must be better in some scenarios. You could try it but it's certainly slow.

Could you share the page numbers where it has trouble with footnotes? I’ll give it a try.

- There is no reason it should take 10-20 seconds per page. In my experience with complex documents of 20-30 pages it takes 3-4 seconds.

- Please message me for more info if you like as this should be something LLMs can handle with ease.

This was my prompt (except for the URLs I provided the actual JPG, one as the example and the other the page to process). It took 2 seconds:

"

Extract the text from this page, correctly identifying the different zones the text appears in.

Pages contain the following possible zones (colors refer to the image)

-header (blue)

-comments in the margins (green)

-main text (pink)

-footnotes (yellow)

-signature mark (purple)

{https://blog.medusis.com/assets/38/ss_example_zones_colors.j...}

Extracting the text means correctly parsing the words in those different zones and reconstructing them properly, so as to produce readable text.

- Many pages have comments or subtitles in the margins that don't belong in the main text

- There are extensive footnotes that need to be properly tagged since they're not part of the main text

- Each page has a header that you might need to remove for smooth reading across pages

- On some pages, but not all, there's a 'signature mark' at the bottom. These need to be captured.

So you need to process the output to properly identify all these elements and mark paragraphs correctly.

{https://divers.medusis.net/boislisle/pub/tomes/t01/out/cropp...}

"

This is the JSON schema I told it to output in:

{ "type": "OBJECT", "properties": { "header": { "type": "string", "nullable": false }, "main_text": { "type": "STRING", "nullable": false }, "footnotes": { "type": "array", "nullable": true, "items": { "type": "string" } }, "comments": { "type": "STRING", "nullable": true }, "signature_mark": { "type": "STRING", "nullable": true } }, "required": [ "header", "main_text", "footnotes", "comments", "signature_mark" ] }

This was the output:

{ "comments": "Guimaraes, son caractère et ses mœurs.", "footnotes": [ "1. Voyez une lettre du général Marquis de Saint-Simon, dans le Moniteur, du 18 août 1838. — Cet ouvrage, cessionnaire de Boisange, eut vingt et un volumes in-8° par Sautel, publiés par Delloye, et celle de 1883, publiée par les frères Ducharne, quarante volumes in-18.", "2. L'édition de 4820-4830, la publiée de l'édition de 1840.", "3. Mémoires complets et authentiques du duc de Saint-Simon sur le règne de Louis XIV, et la Régence, collationnés sur le manuscrit ori- ginal par M. Chéruel, et précédés d'une notice par Sainte-Beuve, de l'Académie française. — Paris, 1856, in-8° de 1840 pages. — Cette édition est imprimée en deux volumes, sans faute, et avec une exactitude parfaite, en raison des volumes de 1861 ; un autre, dans le format in-42, ac- compagné de dix-un.", "4. En treize volumes. — Un premier tirage, sans le concours ; un troi- sième, dans le format in-18, en 1883, et un quatrième en 1865, dans le format in-16.", "5. Cette maison venait d'inaugurer sa Bibliothèque des chemins de fer, qui contribua beaucoup au succès de cette publication.", "6. Sa propriété est particulièrement confirmée par des arrêts anté- rieurs à l'acquisition ; l'un du tribunal de première instance de Paris en date du 8 juin 1856, un autre de la Cour d'appel en date du 8 fé-" ], "header": "MEMOIRES DE SAINT-SIMON.", "main_text": "ce manuscrit, en y pratiquant toutefois ce qu'il appelait « les corrections et les retranchements indispensables ». Outre cette première édition, datée de 1829-1830, les Mémoires complets et authentiques du duc de Saint-Si- mon sur le siècle de Louis XIV et la Régence furent deux fois réimprimés par les soins du général de Saint-Simon en 1840 et 1856, avant que M. Chéruel obtint de faire l'édition de 1856, que depuis lors, on a considéré, non sur l'original une nouvelle revision ou d'ont sorties sans raison, comme édition principale, et plusieurs réim- pressions successives du texte sec, en moindre format, toutes faites par la maison Hachette³, qui devint propriétaire du manuscrit des Mémoires.", "signature_mark": null }


Also, if you're wondering why it output "Guimaraes, son caractère et ses mœurs." as a comment, it's because my instructions were not clear enough and it thought the prompt was asking for it to provide comment on the text :D



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: