> It would definitely compress much better than roaring bitmaps. In terms of per...

ot · on July 2, 2023

> How is that relevant?

Roaring bitmaps and similar data structures get their speed from decoding together consecutive groups of elements, so if you do sequential decoding or decode a large fraction of the list you get excellent performance.

EF instead excels at random skipping, so if you visit a small fraction of the list you generally get better performance. This is why it works so well for inverted indexes, as generally the queries are very selective (otherwise why do you need an index?) and if you have good intersection algorithms you can skip a large fraction of documents.

I didn't follow the rest of your comment, select is what EF is good at, every other data structure needs a lot more scanning once you land on the right chunk. With BMI2 you can also use the PDEP instruction to accelerate the final select on a 64-bit block: https://github.com/facebook/folly/blob/main/folly/experiment...

PSeitz · on July 3, 2023

>Roaring bitmaps and similar data structures get their speed from decoding together consecutive groups of elements, so if you do sequential decoding or decode a large fraction of the list you get excellent performance. EF instead excels at random skipping, so if you visit a small fraction of the list you generally get better performance. This is why it works so well for inverted indexes, as generally the queries are very selective (otherwise why do you need an index?) and if you have good intersection algorithms you can skip a large fraction of documents.

There's no sequential decoding in our variant, every access is independent. The roaring bitmap variant is used only for the optional index (1 docid => 0 or 1 value) in the columnar storage (DocValues), not for the inverted index. Since this is used for aggregation, some queries may be a full scan.

The inverted index in tantivy uses bitpacked values of 128 elements with a skip index on top.

> I didn't follow the rest of your comment, select is what EF is good at, every other data structure needs a lot more scanning once you land on the right chunk. With BMI2 you can also use the PDEP instruction to accelerate the final select on a 64-bit block

The select for the sparse codec is a simple array index access, that is hard to beat (https://github.com/quickwit-oss/tantivy/blob/main/columnar/s...). Compression is not good near the 5k threshold though. PDEP is currently deactivated since ryzen before Zen3 was really slow.

Creation speed is also quite important, do you know how "Partitioned Elias-Fano" performs there?