
I work in cheminformatics, and wrote one of the documents cited by Sagar.

The answer is "yes and no".

My area of focus is a part of molecular similarity. The overall idea is that molecules which are similar tend to have similar functionality. There are many approximate ways to estimate molecular similarity. The one I focus on maps chemical substructures ("features" or "descriptors") to one or more bits in a bitstring of length ~1024 bits, called a fingerprint.

We use Jaccard similarity (called Tanimoto similarity in my field) of two fingerprints as a proxy for molecular similarity, computed as popcount(A & B) / popcount(A | B). Since popcount(A) and popcount(B) can be pre-computed, this ends up being popcount(A & B) / (popcount(A) + popcount(B) - popcount(A & B)). If the fingerprints are grouped by popcount then this boils down to computing popcount(A & B), plus some tiny amount of integer math.

This can be used to find the k-nearest matches to a single query, to cluster the fingerprints, to identify diverse fingerprints, and so on.

These methods are bounded by two things: 1. the time needed to compute popcount(A & B), and 2. the memory bandwidth.

The CPU bottleneck really is the popcount calculation. At https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0... I compared different popcount implementations and found POPCNT about 2-3x faster than the fastest pure-C popcount.

However, POPCNT on Intel processors isn't all that fast. Rather, when I was really focused on this 5+ years ago, Intel processors only had one execution port that could handle POPCNT, so I could only process 8 bytes per cycle. (Some AMD processors have several such ports, but I never tried one of those CPUs.)

Instead, Wojciech Mula, Nathan Kurz and Daniel Lemire showed that AVX2 instructions were even faster than sequential POPCNT instructions because AVX2 could handle more things in parallel. See my table for numbers.

For small bitstrings (I think 512 bits was the boundary) POPCNT is still the fastest.

With AVX2 it's fast enough that memory bandwidth becomes the limiting issue, and I need to start worrying about cache locality.




Since Intel Ice Lake and AMD Zen 4, the Intel and AMD CPUs with AVX-512 support (or AVX10 in the future) have the VPOPCNT instruction, which works on vectors of up to 512 bits.

With VPOPCNT it is easy to accelerate any POPCNT-dependent algorithm to speeds far beyond what is possible with any other instructions.


Yes, I certainly expect so!

In the 2019 paper I linked to earlier, I predicted "The VPOPCNTDQ instruction in the AVX-512 instruction set computes a 512-bit popcount, which should be faster still." because at the time I didn't have access to such hardware.

While those are easier to find now, I haven't revisited the issue because my experiments back then strongly suggested my code is now limited by memory bandwidth, not popcount evaluation performance.

I also don't know how many of my customers deploy on VPOPCNT-capable hardware.

My bitvectors have a 1-bit density of 5%, so I think the real next step is to look into something like BLOSC, where I store the data in compressed form, using a compression format that supports on-the-fly decompression faster than a memory read, then do the popcount on that transient data. Since 75% compression moves 4x fewer bytes across the memory bus, the popcount stage would need to be 4x faster to keep up.

I've tried to use BLOSC for this, but wasn't quickly able to figure out how to integrate it with my code, and realized it would likely require some pretty big breaking changes that I couldn't really justify, so I've been putting it off for years.



