Oh wow! I've worked on turning PAIP (Paradigms of Artificial Intelligence Progra...

bambax · 2024-12-17T22:33:52 1734474832

You are right that the quality of the scans is paramount! Unfortunately I don't have access to the physical books and have to work with the scans as they are (they're not good). But I will look at ScanTailor; it looks interesting.

For now I reconstruct paragraphs in HTML, but I could do Markdown just as well (where paragraph breaks are marked by double line breaks, and single line breaks don't count).
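
To make that convention concrete, here is a minimal sketch, assuming the OCR output is plain text in which blank lines separate paragraphs and every other line break is just a hard wrap. The function name `reflow_paragraphs` is made up for illustration; it is not taken from any existing tool.

```python
import re


def reflow_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into paragraphs separated by one blank line.

    Assumption: one or more blank lines mark a paragraph break in the input;
    single line breaks inside a paragraph carry no meaning.
    """
    # Split on runs of blank (or whitespace-only) lines.
    chunks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for chunk in chunks:
        # Drop the hard wraps: strip each line and join with single spaces.
        lines = [line.strip() for line in chunk.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    # Markdown convention: a double line break between paragraphs.
    return "\n\n".join(paragraphs)
```

Words hyphenated across a line break are deliberately left alone here, since telling them apart from real hyphenated compounds needs a dictionary or a human.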

Collaborative proofreading would be cool but it would require some way of properly tracking who wrote what, and I'm not sure what to use or if I should build a simple system from scratch. Do you have recommendations?

pronoiac · 2024-12-18T08:08:11 1734509291

I got a copy of the 30-year-old book from eBay or Amazon for $20, chopped the spine off, and fed it through a scanner. Doing that to a century-old book feels wrong!

ScanTailor was tricky to start with; dunno if there's a manual. I remember belatedly realizing that there's automation at each step, that one can then quickly skim and manually adjust.

For collaborative editing, git via GitHub worked for us. Tracking who did what, and when, is easy. It allowed for sweeping edits covering multiple chapters. Building some porcelain on top of that, for less technical folks, could be good.

pronoiac · 2024-12-19T18:14:37 1734632077

> Pour obtenir un document de Gallica en haute définition, contacter utilisation.commerciale@bnf.fr.

roughly:

> To obtain a Gallica document in high definition, contact utilisation.commerciale@bnf.fr.

My expectations would be very low, but I'd reach out to them anyway.

jfil · 2024-12-18T03:15:09 1734491709

Because you're creating webpages from the text, one option for collaborative notes/corrections is to use a Web Annotation system like Hypothes.is.

2Gkashmiri · 2024-12-18T07:05:45 1734505545

A few years ago I got so good at the whole scan > ScanTailor > PDF workflow that I could scan a 100-150 page book, send it to ScanTailor, clean it up and export it to TIFF, then convert to PDF and OCR it, all in half an hour.

I got very good at this, but page turning was a bore.

The PDF turned out in a mechanical fashion without much effort.

I made a few scripts to convert TIFF to PDF, then stitch the pages together and run OCR.
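
Those scripts aren't shown here, so the following is only a rough sketch of what such a pipeline could look like, not the actual ones. It assumes the `img2pdf` and `ocrmypdf` command-line tools are installed (both are swapped in for illustration), and the helper names and default file names are hypothetical. It only builds the command lines, ordering pages numerically so that `page10.tif` comes after `page2.tif`.

```python
import re


def natural_key(name: str):
    """Sort key that compares digit runs as numbers (page2 before page10)."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


def build_commands(tiff_names, stitched="book.pdf",
                   searchable="book_ocr.pdf", lang="eng"):
    """Return (stitch_cmd, ocr_cmd) as argument lists for subprocess.run.

    Assumptions: img2pdf merges the TIFF pages into one PDF, and ocrmypdf
    adds a text layer to it. File names and language are placeholders.
    """
    pages = sorted(tiff_names, key=natural_key)
    # img2pdf page1.tif page2.tif ... -o book.pdf
    stitch_cmd = ["img2pdf", *pages, "-o", stitched]
    # ocrmypdf -l eng book.pdf book_ocr.pdf
    ocr_cmd = ["ocrmypdf", "-l", lang, stitched, searchable]
    return stitch_cmd, ocr_cmd
```

Each list can then be passed to `subprocess.run(cmd, check=True)` in order, so a failure in the stitching step stops the run before OCR starts.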

pronoiac · 2024-12-18T17:15:49 1734542149

Page turning? So, non-destructive, with cameras? How's the quality?

2Gkashmiri · 2024-12-23T13:07:42 1734959262

I did two things. 1. Destructive: removing the page staples and then just turning the pages manually.

Or 2. A book holder for non-destructive page turning. More delicate, and slower to turn.

For the camera, I use a cheap one:

https://www.amazon.in/ClickScan-Foldable-Metallic-Design-2-P...