I remember having used, a very long time ago, a self-hosted search engine on my library of PDFs, and it was unbelievably useful.
I dream about a similar thing that can do OCR on scanned docs and extract text from my also sprawling library of epub and mobi files. If someone builds something like this, with maybe a LOCAL LLM to extract text descriptions from photos and movies as well as indexing metadata for everything, subtitles from movies and lyrics for songs, and add that to a NAS appliance, it’d be a killer.
That's what our app, curiosity.ai, does: a local index, support for many file types and apps out of the box, and integrated local OCR, STT and even a local LLM.
Evernote will do this: you can feed it a bunch of PDFs and other documents, and it will OCR them and make them all searchable. It's not perfect, but you can also add manual tags for things you know are important.
For text-based PDFs, doc files etc., this is super easy to do and requires very little technical expertise or configuration. Download the GPT4All client, pick an LLM (the new Meta Llama 3 model works really well), then configure the local documents plugin.
I wrote a scraper to download all of the California EdCode from the government's site and convert it to txt docs, and now I can ask questions about the California EdCode in plain English.
I work in a shared governance capacity that requires us to refer to the EdCode for contractual negotiations, and it’s been extremely helpful.
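For anyone wanting to try the "convert them all to txt docs" step, here is a minimal sketch using only the Python standard library. It assumes the scraped pages are HTML (the comment above doesn't say what the site serves); `TextExtractor` and `html_to_text` are hypothetical names made up for this illustration, not the commenter's actual code.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Tags that should start a new line in the plain-text output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the page's visible text, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).splitlines()
    # Collapse runs of whitespace and drop empty lines.
    cleaned = (" ".join(line.split()) for line in raw_lines)
    return "\n".join(line for line in cleaned if line)


if __name__ == "__main__":
    page = "<h1>Heading</h1><p>First   paragraph.</p>"
    print(html_to_text(page))
```

The resulting text can be written to one .txt file per page and pointed at by whatever local documents feature you use.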
I have most of the code for doing this; it just needs to be rewritten for local storage (I was running it on Google Cloud). I need to pick something that doesn't run Solr as a service for local use. With Ollama, we have function calls running, so it should be doable. I was also thinking about using Open WebUI for it.
You can always build a small search web service that you open source and that your proprietary software calls out to, removing the need to open source everything.
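A rough sketch of that split, using only the Python standard library: the search code lives in its own small HTTP service, and the other application only ever talks to it over the network. Everything here is an assumption for illustration: the toy in-memory index stands in for a real search library, and `DOCS`, `search`, `start_service`, `query_service` and the `/search?q=` route are hypothetical names, not any project's API.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, quote, urlparse
from urllib.request import urlopen

# Hypothetical toy corpus; a real service would wrap the search library here.
DOCS = {
    1: "Xapian is an embeddable search library",
    2: "Recoll indexes PDF files on the desktop",
    3: "A NAS appliance with local OCR",
}


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)


def search(query):
    """Return sorted ids of documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(INDEX.get(w, set()) for w in words))
    return sorted(hits)


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        hits = search(params.get("q", [""])[0])
        body = json.dumps({"hits": hits}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_service(port=0):
    """Serve on a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), SearchHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_service(port, q):
    """What the separate client application does: a plain HTTP call."""
    with urlopen(f"http://127.0.0.1:{port}/search?q={quote(q)}") as resp:
        return json.load(resp)["hits"]


if __name__ == "__main__":
    server = start_service()
    print(query_service(server.server_address[1], "search library"))
    server.shutdown()
```

The client side needs nothing but an HTTP library, so it shares no code with the service.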
Linux is GPL too, and that didn't stop companies from making trillions on top of it.
Linux is not a good example because of the syscall exemption. The licensing situation is not at all comparable, as Xapian’s original reason for existing was embedding.
A lot more than a decade. I've been using it for 15 years and it was a very mature project even then. Repo history goes back to 1999 and, according to the history page, the project's roots go back to the 80s. A bit like Postgres in this respect.
I used it commercially, at a very big international company. All users had access to its unmodified source code and templates. There was no trouble at all.
The project at one point started tracking files for a potential rewrite to rid itself of its GPL history. I used it many years ago (pre-Elasticsearch times) and quite enjoyed it, but unfortunately the license situation didn’t help the project become popular.
This means that files are automatically organised by year of publication, that I can search by tag name, and that I don't have to escape chars in the terminal. One day I hope to get round to building an Emacs mode to filter by the different elements.
If your PDFs have Author properties then you might be able to do "author:Heidegger" too. The Recoll PDF filter extracts some of these fields and if I remember right it can be configured to extract additional custom properties too.
Xapian is nice. I've used it before to add interactive autocomplete to a Python web app, since my previous favourite, Whoosh, is unmaintained and somehow slower than grep on a folder (I remember it being pretty fast years back, I'd love to know what happened).
I'd say my favourite thing about Xapian is that it's just a simple library you can embed in any app, with no need for a separate database or JVM tuning. For simple use cases and small-to-medium datasets, it just works.