Meilisearch expands search power with Arroy's filtered disk ANN

goestoo · 2023-12-24T16:51:36.000000Z

I tested Meilisearch last year: indexing 10 million documents took 2 days! The same dataset on the same server specs took less than 2 hours to index in Elasticsearch.

Kerollmops · 2023-12-24T16:56:11.000000Z

Thank you for the feedback. You should probably look at our blog post explaining how to index in a smarter way [1]. You should probably come and talk to us on Discord, too. We work with MrBeast and they don't have any issue indexing millions of YouTube videos every day.

Oh, and just to tease you a little bit: we greatly improved indexing speed in Meilisearch v1.6 (January 24). Here is a teaser tweet [2].

[1]: https://blog.meilisearch.com/best-practices-for-faster-index...
[2]: https://x.com/kerollmops/status/1734576622303404317

milkshakes · 2023-12-24T20:05:30.000000Z

a lot of people will never read your blog, or visit your discord but would be quite interested in using a tool like meilisearch. you're correct that the information is out there, but unless you put it right in front of them, such people will have a bad time (and sometimes complain about it, publicly!).

simonw · 2023-12-24T18:20:10.000000Z

Why does MrBeast index millions of YouTube videos?

Kerollmops · 2023-12-24T19:00:51.000000Z

They recently released a new YouTube stats website and they use Meilisearch in the frontend.

https://news.ycombinator.com/item?id=38681328

supz_k · 2023-12-24T18:20:00.000000Z

We recently re-indexed comments into Meilisearch (a PHP process that synced data from MySQL to Meilisearch) after a Meilisearch version upgrade. It only took about 1 hour for about 12 million documents on a 16GB/4vCPU server. In your case, maybe it was a config issue? Or an old version?

Kerollmops · 2023-12-24T19:14:39.000000Z

And once you use v1.6, you will see much better indexing speed.

https://x.com/kerollmops/status/1734576622303404317

ko_pivot · 2023-12-24T14:49:58.000000Z

This is great. The only major thing Meilisearch is missing after hybrid search is introduced IMO is high availability. Without clustering, it’s hard to run any meaningful production workloads and there doesn’t seem to be an online upgrades story.

Kerollmops · 2023-12-24T14:59:46.000000Z

Indeed, you are right about this. This is why we will release geo-replication on the cloud in Q1 next year. We heard you and worked hard on both of those features. We already have a great working proof-of-concept.

marginalia_nu · 2023-12-24T23:24:56.000000Z

> but we decided to go with RoaringBitmaps to reduce their size

Interesting. Could you elaborate on the benefit of this?

I've (possibly prematurely) discarded the notion of using RoaringBitmaps like this because, while they use less memory, traversal and mutation are so much slower than using a fixed buffer out of a pool and considering an upper-bounded slice of putative results at a time.

Although this is for a search engine that typically deals with under-specified queries and is designed for best-effort retrieval given an upper computation time.

Kerollmops · 2023-12-25T11:26:43.000000Z

> Interesting. Could you elaborate on the benefit of this?

I'm not sure which part I should elaborate on.

Storing integers that are near each other is much more compact in a RoaringBitmap than in a flat array. The reason is that it stores the high part of the integers only once and stores the low parts efficiently in an array or a bitmap.

Also, we already use RoaringBitmaps on the other end of Meilisearch, and converting them to another data structure could take a lot of time.
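
To make the size argument concrete, here is a rough sketch of the high/low split described above. It is not the actual RoaringBitmap implementation: it only estimates payload bytes, ignores headers and run containers, and the function names are made up for illustration. The 16-bit split and the 4096-value threshold between array and bitmap containers follow the Roaring format.

```python
from collections import defaultdict

ARRAY_LIMIT = 4096    # above this many values, a container becomes a bitmap
BITMAP_BYTES = 8192   # 2**16 bits, one bit per possible low part


def roaring_payload_bytes(values):
    """Estimate payload bytes of a Roaring-style layout for 32-bit integers.

    Simplified on purpose: no headers, no run containers.
    """
    containers = defaultdict(set)
    for v in values:
        # The high 16 bits select the container, the low 16 bits go inside it.
        containers[v >> 16].add(v & 0xFFFF)

    total = 0
    for lows in containers.values():
        total += 2  # the 16-bit high part is stored once per container
        if len(lows) <= ARRAY_LIMIT:
            total += 2 * len(lows)  # sorted array of 16-bit low parts
        else:
            total += BITMAP_BYTES   # fixed-size bitmap
    return total


def flat_bytes(values):
    """Payload bytes of a flat array of distinct 32-bit integers."""
    return 4 * len(set(values))
```

With this estimate, a dense run such as `range(65536)` costs one bitmap container, while the flat array pays 4 bytes per integer. Sparse sets spread over many containers benefit far less, which is where access costs can start to dominate.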

marginalia_nu · 2023-12-25T11:53:43.000000Z

It's clearly a more succinct representation, but based on my prior experience with these structures it seems like it would come at a not insignificant fee in access costs, and while sometimes they're better, there appear to be many cases where they are not.

Don't take this as me questioning your design process or choices, I'm sure they are well motivated. I just think it's good practice to be curious about how someone reached a different conclusion than yourself :-)

Kerollmops · 2023-12-25T13:02:55.000000Z

Indeed, there are probably cases where you see a higher deserialization cost than with raw lists of integers. But when it comes to a large number of integers, I can confirm that it is much more efficient.

marginalia_nu · 2023-12-25T13:18:29.000000Z

Hmm, I should do some benchmarks I guess... Thanks for the datapoints :)

wg0 · 2023-12-24T21:35:01.000000Z

I'm thinking of using this for software documentation. The idea is that the content is all markdown in Astro or SvelteKit with SSG (static site generation), then index it all at build time and search it all via Meilisearch.
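
A minimal sketch of what that build-time step could look like, assuming the pages are `.md` files under one content directory. The field names (`id`, `title`, `path`, `content`) and both helper names are hypothetical, and sending the resulting list to a Meilisearch index is deliberately left out.

```python
import hashlib
from pathlib import Path


def to_document(rel_path: str, text: str) -> dict:
    """Turn one markdown page into a flat, JSON-serializable record."""
    # Use the first level-1 heading as the title, falling back to the path.
    title = next(
        (line[2:].strip() for line in text.splitlines() if line.startswith("# ")),
        rel_path,
    )
    # Hash the path so the id is stable across builds and only contains hex characters.
    doc_id = hashlib.sha1(rel_path.encode("utf-8")).hexdigest()
    return {"id": doc_id, "title": title, "path": rel_path, "content": text}


def collect_documents(root: str) -> list[dict]:
    """Walk a content directory and build one record per markdown file."""
    base = Path(root)
    return [
        to_document(p.relative_to(base).as_posix(), p.read_text(encoding="utf-8"))
        for p in sorted(base.rglob("*.md"))
    ]
```

The list returned by `collect_documents` can be dumped to JSON during the static build and pushed to the search index in one batch.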

darkotic · 2023-12-24T23:04:35.000000Z

I want to use meilisearch but it takes up more disk space than I'd like unless that has already improved.

naiv · 2023-12-24T23:49:29.000000Z

This also caught me off guard when comparing it to other search engines:

https://www.meilisearch.com/docs/learn/advanced/storage#meas...

So I never even tried Meilisearch, as our base JSON is 5 GB.

sgt101 · 2023-12-24T13:52:35.000000Z

feels like product placement dressed as a blog

Kerollmops · 2023-12-24T14:41:36.000000Z

It's my own company, and I work hard on both parts of Meilisearch: the keyword search part, since 2018, and the soon-to-be-released semantic search part. Hybrid search will also be part of the v1.6 release in January. I took the time to write those three blog posts.

What do you think about the content of the blog posts?

sroussey · 2023-12-24T20:20:34.000000Z

Is there a way to make this a bundled library (more like sqlite) than a server (more like postgresql)?

I want to have something running local and bundled into an app for friends, maybe more.

dureuill · 2023-12-25T11:47:03.000000Z

One of our top OSS contributors is developing mimir as an embedded Meilisearch: https://github.com/GregoryConrad/mimir/tree/main/packages/mi...

sroussey · 2023-12-27T06:45:29.000000Z

I wonder if I can get this as a WASM module?

dureuill · 2023-12-27T08:22:33.000000Z

AFAIK, not yet. It was a goal of Mimir, but LMDB (used by arroy in particular and Meilisearch in general) uses filesystem and OS features (mmap, file locking) that are incompatible with WASM.

If you're interested in bringing WASM support, I think the most promising avenue was to support redb as a heed backend. Heed is the high-level binding library we use to talk to LMDB. If you want to work on this, you can get in touch with the maintainer of mimir.

As a Meilisearch team member, I'd love to participate as well, but I'm afraid I won't have the bandwidth.

Kerollmops · 2023-12-25T10:08:35.000000Z

You can use arroy, the vector store, directly as a library. It is written in Rust and runs on top of LMDB, which manages a file on disk.

sroussey · 2023-12-27T06:45:35.000000Z

I wonder if I can get this as a WASM module?

_a_a_a_ · 2023-12-24T14:50:06.000000Z

It's information-dense and informative (also over my head). As such it feels absolutely nothing like product placement. I think you're being needlessly cynical.