Hacker News | PSeitz's comments

tantivy has two dictionary implementations: FST and SSTable. We added the SSTable to tantivy because it works great with object storage, while the FST does not: with some metadata we can download only the required parts instead of the whole dictionary.
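To illustrate the idea, here is a hand-rolled sketch (the struct and field names are mine, not tantivy's actual SSTable layout): a small, eagerly loaded block index maps a term to the byte range of a single block, so a lookup only fetches that range from object storage.

```rust
use std::ops::Range;

/// Sketch of a block index over a sorted term dictionary. Because terms
/// are sorted, binary search over the first term of each block tells us
/// which byte range could contain the term -- and only that range needs
/// to be downloaded. (Illustrative only; tantivy's real layout differs.)
struct BlockIndex {
    first_terms: Vec<String>,     // first term of each block, sorted
    byte_ranges: Vec<Range<u64>>, // where each block lives in the file
}

impl BlockIndex {
    /// Byte range of the block that could contain `term`.
    fn locate(&self, term: &str) -> Range<u64> {
        let i = self.first_terms.partition_point(|t| t.as_str() <= term);
        self.byte_ranges[i.saturating_sub(1)].clone()
    }
}

fn main() {
    let idx = BlockIndex {
        first_terms: vec!["apple".into(), "mango".into(), "zebra".into()],
        byte_ranges: vec![0..4096, 4096..8192, 8192..12288],
    };
    // Only ~4 KB is fetched from object storage for this lookup.
    assert_eq!(idx.locate("banana"), 0..4096);
}
```

An FST, by contrast, is a single interlinked automaton, so there is no comparable way to resolve a term from a small byte range.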

The SSTable does not support regex queries; they would require a full load and scan, which would be very expensive.

Your best bet currently would be to make it work with tokenizing, which is way more efficient anyway.

prefix queries are supported btw


Are in-order queries supported? e.g., TERM1*TERM2 should return matches with those terms in that specific order.


They serve quite different use cases.

Quickwit was built to handle extremely large data volumes; you can ingest and search terabytes or even petabytes of logs.

Meilisearch's indexing doesn't scale, as it becomes slower the more data you have; e.g., I failed to ingest 7 GB of data.


Hey PSeitz, Meilisearch CEO here. Sorry to hear that you failed to index a low volume of data. When did you last try Meilisearch? We have made significant improvements in the indexing speed. We have a customer with hundreds of gigabytes of raw data on our cloud, and it scales amazingly well. https://x.com/Kerollmops/status/1772575242885484864


Frankly, I'm okay with Meilisearch for instant search because y'all are clear about analytics choices, offer understandable FOSS Rust, and have a non-AGPL license. If/when we make some money, I'm in favor of $upporting and consulting for the tools we use, to keep them alive out of self-interest.


> Hm, I am interested, but I would love to use it as a rust lib and just have rust types instead of some json config...

Yes, that's how you use tantivy normally; not sure which JSON config you mean.

tantivy-cli is more like a showcase, https://github.com/quickwit-oss/tantivy is the actual project.


Yes, and there is https://tantivy-search.github.io/examples/basic_search.html

But instead of this, I would prefer some way to just hand it JSON and for it to just index all the fields...

For comparison, this is my Meilisearch SDK code:

    fun createCustomers() {
        val client = Client(Config("http://localhost:7700", "password"))
        val index = client.index("customers")
        // Serialize all customers to a JSON array string
        val customersJson = transaction {
            val customers = Customer.all()
            val json = customers.map { CustomerJson.from(it) }
            Json.encodeToString(ListSerializer(CustomerJson.serializer()), json)
        }
        index.addDocuments(customersJson, "id")
    }


You can just put everything in a JSON field in tantivy and set it to INDEXED and FAST.


Hm, I need to read up on the trade-offs of going this route.

Thanks!


The issue for geo search is here: https://github.com/quickwit-oss/tantivy/issues/44


> It would definitely compress much better than roaring bitmaps. In terms of performance, it depends on the access patterns. If very sparse (large jumps), PEF would likely be faster, if dense (visit a large fraction of the bitmap), it'd be slower.

Just for clarification: you mean the access pattern is sparse, not the data itself? How is that relevant?

I did part of the investigation and implementation, but didn't look much into Elias-Fano coding. The select for the dense codec is already really fast (with popcount); there's not much room for improvement on the instruction side (https://godbolt.org/z/dq7WeE66Y), except implicitly by touching less memory. The sparse codec with its binary search should be easy to beat, though. Partitioned Elias-Fano indexes may be a superior choice compared to the sparse codec in terms of rank and compression, and probably less so for select and code complexity.


> How is that relevant?

Roaring bitmaps and similar data structures get their speed from decoding consecutive groups of elements together, so if you decode sequentially, or decode a large fraction of the list, you get excellent performance.

EF instead excels at random skipping, so if you visit a small fraction of the list you generally get better performance. This is why it works so well for inverted indexes: queries are generally very selective (otherwise, why do you need an index?), and with good intersection algorithms you can skip a large fraction of documents.

I didn't follow the rest of your comment, select is what EF is good at, every other data structure needs a lot more scanning once you land on the right chunk. With BMI2 you can also use the PDEP instruction to accelerate the final select on a 64-bit block: https://github.com/facebook/folly/blob/main/folly/experiment...
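To make the split concrete, here is a toy Elias-Fano encoder (illustrative only, not tantivy's or any production code): each value is split into `l` low bits stored verbatim and a high part stored in a unary-coded bit vector, giving roughly 2 + log2(universe/n) bits per element while keeping O(1) random access.

```rust
/// Toy Elias-Fano encoding of a sorted u64 list (illustrative sketch).
struct EliasFano {
    low_bits: u32,
    lows: Vec<u64>,   // low parts, one per element (left unpacked for clarity)
    highs: Vec<bool>, // unary bit vector of high parts (a real impl bitpacks this)
}

impl EliasFano {
    fn encode(values: &[u64], universe: u64) -> Self {
        let n = values.len() as u64;
        let low_bits = if universe > n { (universe / n).ilog2() } else { 0 };
        let mask = (1u64 << low_bits) - 1;
        let mut highs = vec![false; values.len() + (universe >> low_bits) as usize + 1];
        for (i, &v) in values.iter().enumerate() {
            // High part is gap-coded: the i-th element sets bit (v >> l) + i.
            highs[(v >> low_bits) as usize + i] = true;
        }
        let lows = values.iter().map(|&v| v & mask).collect();
        EliasFano { low_bits, lows, highs }
    }

    /// Random access to the k-th value: one select on `highs` plus one
    /// array read. A real implementation answers the select in O(1);
    /// here we scan for clarity.
    fn get(&self, k: usize) -> u64 {
        let pos = self
            .highs
            .iter()
            .enumerate()
            .filter(|&(_, &set)| set)
            .nth(k)
            .unwrap()
            .0;
        let high = (pos - k) as u64;
        (high << self.low_bits) | self.lows[k]
    }
}

fn main() {
    let ef = EliasFano::encode(&[3, 7, 15, 31, 40], 41);
    let decoded: Vec<u64> = (0..5).map(|k| ef.get(k)).collect();
    assert_eq!(decoded, vec![3, 7, 15, 31, 40]);
}
```

Because `get` touches only one position of the high bit vector and one low-bits slot, skipping to an arbitrary element never requires decoding its neighbors, which is exactly the random-skipping strength described above.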


>Roaring bitmaps and similar data structures get their speed from decoding together consecutive groups of elements, so if you do sequential decoding or decode a large fraction of the list you get excellent performance. EF instead excels at random skipping, so if you visit a small fraction of the list you generally get better performance. This is why it works so well for inverted indexes, as generally the queries are very selective (otherwise why do you need an index?) and if you have good intersection algorithms you can skip a large fraction of documents.

There's no sequential decoding in our variant; every access is independent. The roaring-bitmap variant is used only for the optional index (1 doc id => 0 or 1 value) in the columnar storage (DocValues), not for the inverted index. Since this is used for aggregations, some queries may be a full scan.

The inverted index in tantivy uses bitpacked blocks of 128 elements with a skip index on top.
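A minimal sketch of that kind of fixed-width bitpacking (my own toy code, not tantivy's; the block size and bit width are illustrative): every value in a block gets exactly `num_bits` bits, so a skip index only needs each block's offset and bit width to jump straight to any element.

```rust
/// Toy fixed-width bitpacking for a block of doc-id deltas.
/// Assumes num_bits < 64.
fn pack(values: &[u32], num_bits: usize) -> Vec<u64> {
    debug_assert!(num_bits < 64);
    let mut out = vec![0u64; (values.len() * num_bits + 63) / 64];
    for (i, &v) in values.iter().enumerate() {
        let bit = i * num_bits;
        out[bit / 64] |= (v as u64) << (bit % 64);
        if bit % 64 + num_bits > 64 {
            // value straddles a word boundary
            out[bit / 64 + 1] |= (v as u64) >> (64 - bit % 64);
        }
    }
    out
}

/// Random access without decoding the rest of the block.
fn unpack_at(packed: &[u64], num_bits: usize, i: usize) -> u32 {
    let bit = i * num_bits;
    let mut v = packed[bit / 64] >> (bit % 64);
    if bit % 64 + num_bits > 64 {
        v |= packed[bit / 64 + 1] << (64 - bit % 64);
    }
    (v & ((1u64 << num_bits) - 1)) as u32
}

fn main() {
    let deltas = [5u32, 100, 127, 0, 33];
    let packed = pack(&deltas, 7); // 7 bits suffice for deltas < 128
    assert_eq!(unpack_at(&packed, 7, 1), 100);
}
```

In a real posting list the bit width is chosen per 128-element block from the largest delta in that block.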

> I didn't follow the rest of your comment, select is what EF is good at, every other data structure needs a lot more scanning once you land on the right chunk. With BMI2 you can also use the PDEP instruction to accelerate the final select on a 64-bit block

The select for the sparse codec is a simple array index access, which is hard to beat (https://github.com/quickwit-oss/tantivy/blob/main/columnar/s...). Compression is not good near the 5k threshold, though. PDEP is currently deactivated since Ryzen CPUs before Zen 3 were really slow at it.
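In toy form (function names mine, not tantivy's), the sparse codec is just the sorted doc ids that carry a value: select is one array read, while rank needs the binary search that should be easy to beat.

```rust
/// Toy "sparse" optional-index codec: the sorted doc ids that have a value.
/// select(k) is a single array read -- hard to beat.
fn select_sparse(doc_ids: &[u32], k: usize) -> u32 {
    doc_ids[k]
}

/// rank(d): how many docs with a value come before doc id `d`.
/// This is the binary search that Partitioned Elias-Fano could improve on.
fn rank_sparse(doc_ids: &[u32], doc_id: u32) -> usize {
    doc_ids.partition_point(|&d| d < doc_id)
}

fn main() {
    let doc_ids = [2u32, 17, 90, 4096];
    assert_eq!(select_sparse(&doc_ids, 1), 17);
    assert_eq!(rank_sparse(&doc_ids, 18), 2);
}
```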

Creation speed is also quite important; do you know how Partitioned Elias-Fano performs there?


Indeed, if you are colorblind that may be an issue. I'll address it if I get to it. Otherwise, it assigns names to colors; no idea how you could be confused by that.


As far as I know, LZ4 is much faster than most compression algorithms, with decompression speeds of over 4 GB/s.


I ported the block format to Rust, matching the C implementation in performance and compression ratio.

https://github.com/pseitz/lz4_flex


Has anyone written the appropriate Reader wrappers to use this with file io? (Asking b/c a quick search didn't turn anything up.)


File IO should come with the frame format, which is not yet implemented. The block format isn't really suited for it.


NICE! Well done!


But I like coal, so coal can't be the problem either. ROFL!


Or with Einstein: "yeah, you were pretty much right. We don't have anything new."

