
Looks like I didn't give Meilisearch enough latitude -- it's still ingesting and I didn't give it enough disk:

[2022-10-24T20:04:38Z ERROR meilisearch_lib::tasks::update_loop] an error occurred while processing an update batch: Internal error: Quota exceeded (os error 122)

Looks like I'll also have to re-feed it, but I reduced the batch size, so hopefully it's a little happier with that.




Hello, I'm the Meilisearch CEO. Your issue could be that you sent your data without configuring your index first -- that's what I gather from your comment. Just change your settings so URLs aren't indexed.

Check out the docs: https://docs.meilisearch.com/reference/api/settings.html#upd... with the payload `["title", "descriptionHTML"]`.
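With meilisearch-js it's roughly this (a sketch -- the host, key, and index name `posts` are placeholders, adjust for your setup):

```js
import { MeiliSearch } from 'meilisearch'

// Placeholders -- point these at your own instance and index.
const client = new MeiliSearch({ host: 'http://localhost:7700', apiKey: 'masterKey' })

// Only these fields get indexed for search; the rest of each document
// is still stored and returned in results, just never searched.
await client.index('posts').updateSearchableAttributes(['title', 'descriptionHTML'])
```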

It will change everything!


Hey, thanks for letting me know! I did add a few fields that weren't directly searched on because I wanted to be a bit more fair across the other search engines (Postgres is holding the whole document and the indexes).

I’m going to change the configuration and see how that goes.

One thing I'd love help with (it would make an awesome recipe section for your docs site) is best practices around bulk insertion! I couldn't tell whether there's an actual benefit to using addDocuments() vs. addDocumentsInBatches().


If you remove the URLs from indexing, it'll generally save a ton of space and indexing will be much, much faster. We're thinking about not indexing URLs by default; you can help us by explaining your use case here -> https://github.com/meilisearch/product/discussions/553

Just a detail: if you run `du -sh` on your machine, the size on disk will stay unchanged because we do soft deletion ;). Don't worry, it will be physically deleted after a while, so you'll get the space back if you need it in the future.

If you kept the default configuration of Meilisearch, the maximum size of an HTTP payload is 100MB (for security). You can change it here -> https://docs.meilisearch.com/learn/configuration/instance_op...

addDocumentsInBatches() is just a helper that splits your big JSON array into multiple parts and sends each part separately; I'm not absolutely sure you'll need it. (Code -> https://github.com/meilisearch/meilisearch-js/blob/807a6d827...)
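Conceptually it's just this (a sketch -- the index name and batch size are arbitrary):

```js
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700', apiKey: 'masterKey' })
const index = client.index('posts')
const documents = [/* ...your big JSON array... */]

// The helper slices the array and sends each slice as its own request:
await index.addDocumentsInBatches(documents, 5000)

// ...which is roughly equivalent to doing it by hand:
for (let i = 0; i < documents.length; i += 5000) {
  await index.addDocuments(documents.slice(i, i + 5000))
}
```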


Thanks! I removed the URLs, and now the searchable attributes are only title, description, and some author fields!

> Just a detail: if you run `du -sh` on your machine, the size on disk will stay unchanged because we do soft deletion ;). Don't worry, it will be physically deleted after a while, so you'll get the space back if you need it in the future.

Ah, I was just wildly undershooting the size I gave the PVC! I gave it much more and it's fine -- right now it's resting around 19Gi of usage, which is actually a bit of a problem considering the data set was originally only about 4GB. That said, disk really isn't an issue, so I'll just throw more at it -- maybe leave it at 32GB and call it a day. It's around 1.6MM documents out of ~2MM, so it shouldn't grow too much more.

> If you kept the default configuration of Meilisearch, the maximum size of an HTTP payload is 100MB (for security). You can change it here -> https://docs.meilisearch.com/learn/configuration/instance_op...

Thanks for this, I'll keep it in mind -- so I could actually pass off HUGE chunks to Meilisearch.

It seems like the larger the chunk, the more efficient? There didn't seem to be much change in how long it took to work through a chunk of documents; it was more that having lots of smaller chunks went slower overall. I started with 10k in a batch, then went to 1k, then back to 5k -- maybe I should go to 100k docs in a batch and see how it performs.
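Something like this rough sketch is what I'm picturing (assuming meilisearch-js's waitForTask -- option and field names vary a bit between client versions -- and a real test would reset the index between runs so later passes aren't just updates):

```js
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({ host: 'http://localhost:7700', apiKey: 'masterKey' })
const index = client.index('posts')
const documents = [/* ...the full document set... */]

for (const batchSize of [1000, 5000, 10000, 100000]) {
  const start = Date.now()
  const tasks = await index.addDocumentsInBatches(documents, batchSize)
  // Tasks are processed in order, so waiting on the last one measures
  // actual indexing time, not just how long the HTTP requests took.
  await client.waitForTask(tasks[tasks.length - 1].taskUid, { timeOutMs: 60 * 60 * 1000 })
  console.log(`batch size ${batchSize}: ${(Date.now() - start) / 1000}s`)
}
```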

There's a blog post waiting to be written in here...

> addDocumentsInBatches() is just a helper that splits your big JSON array into multiple parts and sends each part separately; I'm not absolutely sure you'll need it. (Code -> https://github.com/meilisearch/meilisearch-js/blob/807a6d827...)

Thanks! Was this something someone requested? Is there a tangible benefit (were there customers who didn't want to split up the payloads themselves)? Otherwise it seems like unnecessary cruft in the API.



