Hacker News new | past | comments | ask | show | jobs | submit login

> It would definitely compress much better than roaring bitmaps. In terms of performance, it depends on the access patterns. If very sparse (large jumps), PEF would likely be faster, if dense (visit a large fraction of the bitmap), it'd be slower.

Just for clarification you mean the access pattern is sparse and not the data itself? How is that relevant?

I did part of the investigation and implementation, but didn't look much into elias fano coding. The select for the dense codec is already really fast (with popcount), there's not much room for improvement on the instruction side (https://godbolt.org/z/dq7WeE66Y), except implicitly by touching less memory. The sparse codec with its binary search should be easy to beat though. Partitioned Elias-Fano indexes may be a superior choice in contrast to the sparse codec in terms of rank and compression, and probably less so for select and code complexity.




> How is that relevant?

Roaring bitmaps and similar data structures get their speed from decoding together consecutive groups of elements, so if you do sequential decoding or decode a large fraction of the list you get excellent performance.

EF instead excels at random skipping, so if you visit a small fraction of the list you generally get better performance. This is why it works so well for inverted indexes, as generally the queries are very selective (otherwise why do you need an index?) and if you have good intersection algorithms you can skip a large fraction of documents.

I didn't follow the rest of your comment, select is what EF is good at, every other data structure needs a lot more scanning once you land on the right chunk. With BMI2 you can also use the PDEP instruction to accelerate the final select on a 64-bit block: https://github.com/facebook/folly/blob/main/folly/experiment...


>Roaring bitmaps and similar data structures get their speed from decoding together consecutive groups of elements, so if you do sequential decoding or decode a large fraction of the list you get excellent performance. EF instead excels at random skipping, so if you visit a small fraction of the list you generally get better performance. This is why it works so well for inverted indexes, as generally the queries are very selective (otherwise why do you need an index?) and if you have good intersection algorithms you can skip a large fraction of documents.

There's no sequential decoding in our variant, every access is independent. The roaring bitmap variant is used only for the optional index (1 docid => 0 or 1 value) in the columnar storage (DocValues), not for the inverted index. Since this is used for aggregation, some queries may be a full scan.

The inverted index in tantivy uses bitpacked values of 128 elements with a skip index on top.

> I didn't follow the rest of your comment, select is what EF is good at, every other data structure needs a lot more scanning once you land on the right chunk. With BMI2 you can also use the PDEP instruction to accelerate the final select on a 64-bit block

The select for the sparse codec is a simple array index access, that is hard to beat (https://github.com/quickwit-oss/tantivy/blob/main/columnar/s...). Compression is not good near the 5k threshold though. PDEP is currently deactivated since ryzen before Zen3 was really slow.

Creation speed is also quite important, do you know how "Partitioned Elias-Fano" performs there?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: