Oh wow! I've worked on turning PAIP (Paradigms of Artificial Intelligence Programming) from a book into a bunch of Markdown files, but that's "only" about a thousand pages, compared to the roughly 27,000 pages across all those volumes. I have advice, possibly helpful, possibly not.
Getting higher quality scans could save you some headaches. Check the Internet Archive. Or, get library copies, and the right camera setup.
Scantailor might help; it lets you semi-automate a chunk of things, with interactive adjustments. I don't know how its deskewing would compare to ImageMagick. The signature marks might be filtered out here.
You are right that the quality of the scans is paramount! Unfortunately I don't have access to the physical books and have to work with the scans as they are (they're not good). But I will look at Scantailor, it looks interesting.
For now I reconstruct paragraphs in html but I could do markdown just as well (where paragraph breaks are marked by double line breaks, and single line breaks don't count).
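A rough sketch of that paragraph rule, in case it helps anyone following along. It assumes the OCR text arrives as a plain string in which blank lines separate paragraphs; the function names are made up for illustration, and it does not try to rejoin words hyphenated across line ends.

```python
import html
import re


def reflow_paragraphs(text: str) -> list[str]:
    """Split on blank lines and join the single line breaks inside each paragraph."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Single line breaks "don't count": collapse them into spaces.
        lines = [line.strip() for line in block.splitlines()]
        joined = " ".join(line for line in lines if line)
        if joined:
            paragraphs.append(joined)
    return paragraphs


def to_html(paragraphs: list[str]) -> str:
    """One <p> element per paragraph, with special characters escaped."""
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)


def to_markdown(paragraphs: list[str]) -> str:
    """Paragraphs separated by a blank line (a double line break)."""
    return "\n\n".join(paragraphs)
```

Because both outputs come from the same paragraph list, switching between HTML and Markdown is only a matter of which formatter gets called at the end.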
Collaborative proofreading would be cool but it would require some way of properly tracking who wrote what, and I'm not sure what to use or if I should build a simple system from scratch. Do you have recommendations?
I got a copy of the 30-year-old book from eBay or Amazon for $20, chopped the spine off, and fed it through a scanner. Doing that to a century-old book feels wrong!
ScanTailor was tricky to start with; dunno if there's a manual. I remember belatedly realizing that there's automation at each step, that one can then quickly skim and manually adjust.
For collaborative editing, git via GitHub worked for us. Tracking who did what, and when, is easy. It allowed for sweeping edits covering multiple chapters. Building some porcelain on top of that, for less technical folks, could be good.
A few years ago I got so good at the whole scan > ScanTailor > PDF routine that I could scan a 100-150 page book, send it to ScanTailor, edit and clean it up into TIFFs, convert to PDF, and OCR it, all in half an hour.
I got very good at this, but page turning was a bore.
The PDF turned out in a mechanical fashion without much effort.
I made a few scripts to do TIFF to PDF, then stitch the PDFs together and OCR them.
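Not the scripts mentioned above, just a guess at what such a pipeline could look like. This sketch assumes the `tesseract` and `pdfunite` (poppler-utils) command-line tools are installed, and it does the OCR per page before stitching, which is a different order from the one described. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(tiffs, out_pdf):
    """Return the command lines for OCRing each TIFF and merging the results."""
    commands = []
    page_pdfs = []
    for tiff in tiffs:
        base = Path(tiff).with_suffix("")  # page1.tif -> page1
        # "tesseract IMAGE OUTPUTBASE pdf" writes OUTPUTBASE.pdf with a text layer.
        commands.append(["tesseract", str(tiff), str(base), "pdf"])
        page_pdfs.append(f"{base}.pdf")
    # pdfunite takes the input PDFs in order, followed by the output file.
    commands.append(["pdfunite", *page_pdfs, str(out_pdf)])
    return commands


def run_pipeline(tiffs, out_pdf):
    """Run each command in turn, stopping at the first failure."""
    for command in build_commands(tiffs, out_pdf):
        subprocess.run(command, check=True)
```

Keeping the command building separate from the running makes it easy to print the commands and eyeball them before letting the thing loose on a few hundred pages.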
I wrote out some of my process for handling scans here - https://github.com/norvig/paip-lisp/releases/tag/v1.2 . I maybe should blog about it.
If you get to the point of collaborative proofreading, I highly recommend Semantic Linefeeds - each sentence gets its own line. https://rhodesmill.org/brandon/2012/one-sentence-per-line/ I got there by:
* giving each paragraph its own line
* then, linefeed at punctuation, maybe with quotation marks and parentheses? It's been a while
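Those two steps could be sketched roughly like this. The regular expression is only a guess at the "punctuation, maybe with quotation marks and parentheses" rule, and the function name is invented; it will also split after abbreviations such as "e.g.", so a manual pass is still needed.

```python
import re

# Sentence-ending punctuation, optionally followed by closing quotes,
# parentheses, or brackets, and then whitespace.
_SENTENCE_END = re.compile(r"""([.!?]["')\]]*)\s+""")


def semantic_linefeeds(paragraph: str) -> str:
    """Put a paragraph on one line, then break after each sentence."""
    # Step 1: give the paragraph its own line by collapsing all whitespace.
    one_line = " ".join(paragraph.split())
    # Step 2: replace the whitespace after sentence-ending punctuation with a linefeed.
    return _SENTENCE_END.sub(r"\1\n", one_line)
```

With one sentence per line, a git diff of a proofreading fix shows only the sentence that changed rather than a whole rewrapped paragraph.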