My description was not correct. There are 210k works, but not all of them are in the public domain:
"Here you will find high-quality digitized copies of books, manuscripts and other media from the Staatsbibliothek zu Berlin. Where the originals are in the public domain, we provide them with a public domain license. Currently there are 209,779 works in total."
All of them are scans of the original material, and many (if not most) of the works also have OCRed text available. It would be nice if all of this were fed to an algorithm and an AncientChatGPT were created :-)
They are most likely already in ChatGPT. The question is why those and all the other inputs are not really accessible (via a subscription, for example) to the general public...
I worked on this for the library of Uni Würzburg when I was a student. Pretty cool to see it on HN.
We worked on high-quality scans, processing terabytes of raw image material using Akka (this was around 2012, I think), and also built pipelines to run OCR at scale on these scans. Doing this on Fraktur and medieval minuscule scripts was tricky, and we didn't get really good results during my time there.
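For anyone curious about the shape of such a pipeline: a minimal sketch in today's Akka Streams (Scala) might look roughly like the one below. The directory path and the ocrPage stand-in are purely illustrative assumptions, not what we actually used back then (our 2012 stack predated Akka Streams), and a real setup would plug in an OCR engine such as Tesseract at that step.

    import java.nio.file.{Files, Path, Paths}
    import scala.concurrent.Future
    import scala.jdk.CollectionConverters._
    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}

    object OcrPipeline extends App {
      implicit val system: ActorSystem = ActorSystem("ocr-pipeline")
      import system.dispatcher

      // Stand-in for a real OCR engine call (e.g. Tesseract) -- illustrative only.
      def ocrPage(scan: Path): Future[String] =
        Future(s"<ocr output for ${scan.getFileName}>")

      // Stream scan files from disk and OCR several pages in parallel.
      Source
        .fromIterator(() => Files.walk(Paths.get("/data/scans")).iterator().asScala)
        .filter(_.toString.endsWith(".tif"))
        .mapAsyncUnordered(parallelism = 8)(ocrPage)
        .runWith(Sink.foreach(text => println(text.take(80))))
        .onComplete(_ => system.terminate())
    }

The interesting knob is the parallelism of the OCR stage, which bounds how many pages are processed concurrently while backpressure keeps the file-reading side from flooding memory.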
"Here you will find high-quality digitized copies of books, manuscripts and other media from the Staatsbibliothek zu Berlin. Where the originals are in the public domain, we provide them with a public domain license. Currently there are 209,779 works in total."