Xapian: Open source search engine library

rbanffy · 2024-08-17T14:15:23 1723904123

I remember having used, a very long time ago, a self-hosted search engine on my library of PDFs, and it was unbelievably useful.

I dream about a similar thing that can do OCR on scanned docs and extract text from my also sprawling library of epub and mobi files. If someone builds something like this, with maybe a LOCAL LLM to extract text descriptions from photos and movies as well as indexing metadata for everything, subtitles from movies and lyrics for songs, and add that to a NAS appliance, it’d be a killer.

theolivenbaum · 2024-08-17T16:36:08 1723912568

That's what our app, curiosity.ai, does: a local index, support for many file types and apps out of the box, and integrated local OCR, STT and even a local LLM.

rmholt · 2024-08-17T17:37:18 1723916238

I couldn't find any mention on your website about LOCAL LLMs and according to your FAQ, it requires an account with your website.

Is there a way to run curiosity.ai fully offline, without an account on your servers?

Bluestein · 2024-08-17T16:56:04 1723913764

That is the future :) Much success!

glompers · 2024-08-17T15:30:46 1723908646

DEVONthink 3 [0] (Apple only) will do most of that although I don't keep up at all with its interoperability with LLM extensions.

[0] https://www.devontechnologies.com/apps/devonthink

andyfilms1 · 2024-08-17T14:24:52 1723904692

Evernote will do this: you can feed it a bunch of PDFs and other documents, and it will OCR them and make them all searchable. It's not perfect, but you can also add manual tags for things you know are important.

rbanffy · 2024-08-17T15:25:49 1723908349

At some point I'd love to further train an LLM on all my PDFs and be able to ask it questions.

Jzush · 2024-08-20T16:35:23 1724171723

For text-based PDFs, doc files etc., this is super easy to do and requires very little technical expertise or configuration. Download the gpt4all client, pick an LLM (the new Meta Llama 3 model works really well), then configure the local documents plugin.

I wrote a scraper to download all of the California EdCode from the government's site and convert it all to txt docs, and now I can ask questions about California EdCode in plain English.

I work in a shared governance capacity that requires us to refer to the Ed code for contractual negotiations and it’s been extremely helpful.
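The "convert them all to txt docs" step can be sketched roughly as below. This is not the scraper described above: it assumes the pages have already been downloaded as HTML strings, and `html_to_text` is a hypothetical helper built only on Python's standard library.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Each resulting string can then be written to a `.txt` file and pointed at by whatever local-documents feature the client offers.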

pdw · 2024-08-17T16:56:12 1723913772

> similar thing that can do OCR on scanned docs

It's only part of what you want, but ocrmypdf will add an OCRed text layer to PDF files, making the text selectable and indexable.
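A batch run over a folder might look like the sketch below. It assumes the `ocrmypdf` command is installed and on the PATH; the helper names `ocr_command` and `ocr_folder` are made up for illustration. The `--skip-text` option tells ocrmypdf to leave pages that already contain text alone.

```python
import subprocess
from pathlib import Path


def ocr_command(src, dst):
    """Build the ocrmypdf command line for one file."""
    # --skip-text: do not OCR pages that already have a text layer.
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def ocr_folder(folder, out_dir):
    """OCR every PDF directly inside `folder`, writing results to `out_dir`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if ocrmypdf fails on a file.
        subprocess.run(ocr_command(src, out_dir / src.name), check=True)
```

The output folder can then be handed to whichever indexer you use.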

kordlessagain · 2024-08-17T15:14:16 1723907656

I have most of the code for doing this; it just needs to be rewritten for local storage (I was running it on Google Cloud). I need to pick something that doesn't run Solr as a service for local use. With Ollama, we have function calls running, so it should be doable. I was also thinking about using Open WebUI as the front end.

Bluestein · 2024-08-17T16:01:36 1723910496

> extract text descriptions from photos and movies as well as indexing metadata for everything, subtitles from movies and lyrics for songs

The big AI players are probably already scraping the bottom of this "barrel" in their search for training data, I am sure ...

infocollector · 2024-08-17T11:59:35 1723895975

This project has been around and maintained for more than a decade! Small footprint, good speed. One downside might be GPL v2 for commercial use.

frenchman99 · 2024-08-17T12:54:09 1723899249

You can always build a small search web service that you open source and that your proprietary software calls out to, removing the need to open source everything.

Linux is GPL too, and that didn't hinder companies from making trillions on top of it.

the_mitsuhiko · 2024-08-17T13:05:38 1723899938

Linux is not a good example because of the syscall exemption. The licensing situation is not at all comparable, as Xapian’s original reason for existing was embedding.

synergy20 · 2024-08-17T13:00:19 1723899619

You mean: don't compile it and link it into my application, but instead wrap it as a separate program and call it via RPC, remotely or locally?

frenchman99 · 2024-08-17T13:24:08 1723901048

Yes, exactly that.
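As a rough sketch of that separation, the search code can sit behind a tiny HTTP endpoint that the main application talks to as a client. Everything here is illustrative: `search` is a toy in-memory stand-in for the real engine, and the `/search?q=` route and JSON shape are assumptions, not anything Xapian provides.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode, urlparse

# Toy corpus standing in for a real index (hypothetical data).
DOCS = {1: "xapian is a search library", 2: "recoll is a desktop search app"}


def search(query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    return [i for i, text in DOCS.items() if all(w in text.split() for w in words)]


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/search":
            self.send_error(404)
            return
        query = parse_qs(url.query).get("q", [""])[0]
        body = json.dumps({"ids": search(query)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Start the service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), SearchHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def remote_search(port, query):
    """Client side: what the calling application would run."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/search?" + urlencode({"q": query}))
        return json.loads(conn.getresponse().read())["ids"]
    finally:
        conn.close()
```

The calling application only ever sees `remote_search`; the engine lives in its own process and can be swapped out behind the same route.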

donio · 2024-08-17T19:09:21 1723921761

A lot more than a decade. I've been using it for 15 years and it was a very mature project even then. Repo history goes back to 1999 and, according to the history page, the project's roots go back to the 80s. A bit like Postgres in this respect.

https://xapian.org/history https://sigir.org/files/forum/S2000/MUSCAT_note.pdf

rbanffy · 2024-08-17T14:16:29 1723904189

For what kind of use do you think GPLv2 would be a blocker?

rurban · 2024-08-18T06:32:01 1723962721

I used it commercially, at a very big international company. All users had access to its unmodified source code and templates. There was no trouble at all.

Bluestein · 2024-08-17T12:38:50 1723898330

In fact, AIUI its roots go back about 3 decades. The "about" page has a nice historical overview.

the_mitsuhiko · 2024-08-17T12:54:07 1723899247

The project at one point started tracking files for a potential rewrite to rid itself of GPL history. I used it many years ago and I quite enjoyed it (pre-Elasticsearch times), but unfortunately the license situation didn’t help the project become popular.

synergy20 · 2024-08-17T12:57:36 1723899456

That's true. I wonder if there is an alternative that is not GPL.

bearjaws · 2024-08-17T13:11:43 1723900303

Sonic search https://github.com/valeriansaliou/sonic

Maybe not exactly the same: it's a server where you can store documents and then retrieve their IDs using a search string.

Bluestein · 2024-08-17T13:16:49 1723900609

Elasticsearch and its Amazon fork OpenSearch, perhaps?

JackSlateur · 2024-08-17T13:32:44 1723901564

Xapian is a library, while Elasticsearch has a client-server model.

Xapian is more like SQLite, while Elasticsearch would be MariaDB.

inertiatic · 2024-08-17T14:00:55 1723903255

Lucene, which is what ES builds upon, is a library with bindings in languages other than Java, and it's Apache-licensed.

Bluestein · 2024-08-17T13:46:14 1723902374

Thanks for the spot-on, very illuminating comparison.

openrisk · 2024-08-17T12:33:46 1723898026

Also used by Recoll, the desktop search app: https://www.recoll.org/

nanna · 2024-08-17T13:07:58 1723900078

I use Recoll to index and search thousands of PDFs. Because I always have the author name in the filename I can filter queries like this:

Cybernetics OR steering filename:Heidegger ext:pdf

It's an absolute power tool.

nickpsecurity · 2024-08-17T14:19:22 1723904362

I do it like this:

Title Year Author Name.pdf

Same benefits as you mentioned. You can also filter by time that way.

nanna · 2024-08-17T20:00:19 1723924819

My way is to do

year__author1_author2_author-n~~title-of-book~subtitle##tag1#tag-2#tagn.pdf

This means that files are automatically organised by year of publication, that I can search by tag name, and that I don't have to escape chars in the terminal. One day I hope to get round to building an Emacs mode to filter by the different elements.
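Filtering by the different elements is also easy to prototype outside Emacs. Here is a minimal sketch in Python that splits a name following the pattern above into fields. `parse_name` is a hypothetical helper, and treating the subtitle and tags as optional is my assumption about the scheme.

```python
from pathlib import Path


def parse_name(filename):
    """Split year__authors~~title~subtitle##tags.pdf into its fields."""
    stem = Path(filename).stem                      # drop the ".pdf" suffix
    main, _, tag_part = stem.partition("##")        # tags assumed optional
    head, _, title_part = main.partition("~~")
    year, _, author_part = head.partition("__")
    title, _, subtitle = title_part.partition("~")  # subtitle assumed optional
    return {
        "year": year,
        "authors": author_part.split("_") if author_part else [],
        "title": title,
        "subtitle": subtitle,
        "tags": tag_part.split("#") if tag_part else [],
    }


def with_tag(filenames, tag):
    """Keep only the file names carrying the given tag."""
    return [name for name in filenames if tag in parse_name(name)["tags"]]
```

For example, `with_tag(names, "to-read")` would return every file tagged `to-read`, and sorting on `parse_name(name)["year"]` reproduces the by-year ordering.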

donio · 2024-08-17T20:42:04 1723927324

If your PDFs have Author properties then you might be able to do "author:Heidegger" too. The Recoll PDF filter extracts some of these fields and if I remember right it can be configured to extract additional custom properties too.

nanna · 2024-08-18T06:21:46 1723962106

That would be great, however most of my pdfs don't have the author property set.

Beijinger · 2024-08-17T14:44:13 1723905853

Recoll is magic...

dvdkon · 2024-08-17T11:58:42 1723895922

Xapian is nice. I've used it before to add interactive autocomplete to a Python web app, since my previous favourite, Whoosh, is unmaintained and somehow slower than grep on a folder (I remember it being pretty fast years back, I'd love to know what happened).

I'd say my favourite thing about Xapian is that it's just a simple library you can embed in any app, no need for a separate database and JVM tuning. For simple use cases and small-to-medium datasets, it just works.
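To give a feel for what "embed in any app" means, here is a minimal sketch using Xapian's Python bindings. It assumes the `xapian` module is installed (it ships separately from the Python standard library); `index_and_search` is a made-up helper name, and the throwaway temp-directory database is only for illustration.

```python
import os
import tempfile


def index_and_search(texts, query_string, limit=10):
    """Index `texts` into a throwaway Xapian database and run one query."""
    import xapian  # Xapian's Python bindings, installed separately

    path = os.path.join(tempfile.mkdtemp(), "db")
    db = xapian.WritableDatabase(path, xapian.DB_CREATE_OR_OPEN)
    stemmer = xapian.Stem("en")

    indexer = xapian.TermGenerator()
    indexer.set_stemmer(stemmer)
    for text in texts:
        doc = xapian.Document()
        doc.set_data(text)          # stored payload, returned with matches
        indexer.set_document(doc)
        indexer.index_text(text)    # tokenise, stem and add terms
        db.add_document(doc)
    db.commit()

    parser = xapian.QueryParser()
    parser.set_stemmer(stemmer)
    parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    parser.set_database(db)
    query = parser.parse_query(query_string)

    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    # get_data() returns bytes under Python 3, so decode for the caller.
    hits = [m.document.get_data().decode("utf-8") for m in enquire.get_mset(0, limit)]
    db.close()
    return hits
```

There is no server to start and nothing to configure: the index is just a directory on disk that the process opens directly.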