
You could upload the books to the Internet Archive and let their OCR pipeline give it a try. It is (or at least was) built around ABBYY. Results weren't great but they were a start.

I wonder what eventually happened with OCRopus, which was supposed to help with page segmentation. I was a bit disappointed to see that this article used Google Vision as its OCR engine. I was hoping for something self-hosted.




I searched the Internet Archive for Les Mémoires de Saint-Simon - https://archive.org/search?query=Les+Mémoires+de+Saint-Simon... - skimmed the results for items with the right number of pages, and came up empty.

I uploaded a new item - https://archive.org/details/memoires-de-saint-simon-nouvelle... - though I made a mess of the metadata. It's still processing.


The Internet Archive's OCR is built around Tesseract nowadays, but you're right about piggybacking off their pipeline. Upload a text to archive.org and get hOCR for free.
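
For anyone who wants to do something with that hOCR afterwards: it is HTML with the layout stored in `class` and `title` attributes, so the standard library is enough to pull out words and their bounding boxes. Below is a minimal sketch, assuming the hOCR has already been downloaded and uses the usual `ocrx_word` spans with a `bbox x0 y0 x1 y1` property. The names `parse_bbox`, `HocrWords`, `extract_words` and the `SAMPLE` snippet are made up for illustration, not part of any archive.org tooling.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title such as
    'bbox 100 200 180 230; x_wconf 93', or None if there is no bbox."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from elements with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0  # > 0 while inside an ocrx_word element
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup inside a word (e.g. <em>); keep track of depth.
            self._depth += 1
            return
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attributes.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._bbox))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def extract_words(hocr):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample in hOCR style, not real archive.org output.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 1000 1400'>
<span class='ocr_line' title='bbox 100 200 400 230'>
<span class='ocrx_word' title='bbox 100 200 180 230; x_wconf 93'>Mémoires</span>
<span class='ocrx_word' title='bbox 190 200 220 230; x_wconf 96'>de</span>
</span></div>"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

The bounding boxes are what make hOCR useful for a footnote-heavy edition: position and line height on the page are a plausible signal for separating notes from the main text, though that heuristic is my assumption and would need checking against real pages.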


The book is being worked on here https://fr.wikisource.org/wiki/Livre:Saint-Simon_-_M%C3%A9mo... already (volume 1 of 20). Not the same edition as what OP is working with, but it's a start.


That edition (the Chéruel edition) is the first complete edition of the Mémoires. It was OCRed a long time ago and has been available in text form for 20+ years. But it has almost no footnotes.

The edition I'm working on here, the "Boislisle", is completely different thanks to the richness and coverage of its footnotes (but the main text should be almost identical).


If it's public domain, you can create a new record for it on Wikisource once you think it's ready for the human touch. That is the purpose of Wikisource, after all: taking the messy automated OCR and letting volunteers correct, proofread, and format everything.


Well, in my experience Google Vision is far, far ahead of Tesseract.



