
"A very crude method would be to remove the last line every 16 pages but that would not be very robust if there were missing scans or inserts, etc. I prefer to check every last line of every page for the content of the signature mark, and measuring a Levenshtein distance to account for OCR errors."
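For readers who want to see the quoted idea concretely, here is a minimal sketch, not the author's actual code. The function names, the sample mark "Vol. I. B" and the threshold of 3 edits are all assumptions made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def strip_signature_mark(page_text: str, expected: str, max_distance: int = 3):
    """Return (text, found). Drops the last line if it is close to `expected`.

    `expected` and `max_distance` are hypothetical parameters; a real
    threshold would have to be tuned against the actual OCR output.
    """
    lines = page_text.rstrip("\n").split("\n")
    if levenshtein(lines[-1].strip(), expected) <= max_distance:
        return "\n".join(lines[:-1]), True
    return page_text, False


# An OCR error ("1" read instead of "l") is still within the threshold.
text, found = strip_signature_mark("some text\nVo1. I. B", "Vol. I. B")
```

Checking every page this way, rather than counting pages, is what keeps the approach working when scans are missing or inserted.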

I'm curious: did you also check whether the signature mark was indeed found every 16 pages? Were there any scans missing?

Great project btw!






Yes, that's one of the (many) benefits of logging!

And in fact, there is a hiatus, because the introduction at the beginning is from a different "sub-book", where the pages are numbered using Roman numerals. Typically the introduction would be written and typeset after the main book had been typeset, so its number of pages would not be known in advance, and that's why it uses a different numbering system.

So one finds a signature mark on pages 9, 25, 41, 57, 73, 89, and then it starts again at page 93, 109, 125, 141, 157, 173, 189, etc. (those numbers come from the filenames of the scans, not the numbers printed on the pages).

=> Another reason for not starting with the first signature mark and simply adding 16 is that it would miss the change of sub-book (or any irregular number of pages, for any reason).
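A rough sketch of how such a check could be pulled out of the logs, assuming the page numbers where a mark was detected have already been collected into a list. The function name and logger name are made up for illustration; the page numbers are the ones quoted in this comment.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("signature_marks")  # hypothetical logger name


def find_irregular_gaps(pages_with_mark, expected_gap=16):
    """Return (previous_page, page, gap) for every gap that is not expected_gap."""
    irregular = []
    for prev, curr in zip(pages_with_mark, pages_with_mark[1:]):
        gap = curr - prev
        if gap != expected_gap:
            log.warning("mark on page %d follows page %d after %d pages", curr, prev, gap)
            irregular.append((prev, curr, gap))
    return irregular


# Scan numbers (from the filenames) where a signature mark was found.
pages = [9, 25, 41, 57, 73, 89, 93, 109, 125, 141, 157, 173, 189]
print(find_irregular_gaps(pages))  # [(89, 93, 4)]
```

The single irregular gap, between 89 and 93, is where the introduction ends and the main book begins.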



