
I work in cheminformatics, and wrote one of the documents cited by Sagar.

The answer is "yes and no".

My area of focus is a part of molecular similarity. The overall idea is that molecules which are similar tend to have similar functionality. There are many approximate ways to estimate molecular similarity. The one I focus on maps chemical substructures ("features" or "descriptors") to one or more bits in a bitstring of length ~1024 bits, called a fingerprint.

We use Jaccard similarity (called Tanimoto similarity in my field) of two fingerprints as a proxy for molecular similarity, computed as popcount(A & B) / popcount(A | B). Since popcount(A) and popcount(B) can be pre-computed, this ends up being popcount(A & B) / (popcount(A) + popcount(B) - popcount(A & B)). If the fingerprints are grouped by popcount then this boils down to computing popcount(A & B), plus some tiny amount of integer math.

This can be used to find the k-nearest matches to a single query, to cluster the fingerprints, to identify diverse fingerprints, and so on.

These methods are bounded by two things: 1. the time needed to compute popcount(A & B), and 2. the memory bandwidth.

The CPU bottleneck really is the popcount calculation. At https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0... I compared different popcount implementations and found POPCNT about 2-3x faster than the fastest pure-C popcount.

However, POPCNT on Intel processors isn't all that fast. Rather, when I was really focused on this 5+ years ago, Intel processors only had one execution port that could handle POPCNT, so I could only process 8 bytes per cycle. (Some AMD processors have several such ports, but I never tried one of those CPUs.)

Instead, Wojciech Mula, Nathan Kurz and Daniel Lemire showed that AVX2 instructions were even faster than sequential POPCNT instructions because AVX2 could handle more things in parallel. See my table for numbers.

For small bitstrings (I think 512 bits was the boundary) POPCNT is still the fastest.

With AVX2 it's fast enough that memory bandwidth becomes the limiting issue, and I need to start worrying about cache locality.




Since Intel Ice Lake and AMD Zen 4, the Intel and AMD CPUs with AVX-512 support (or AVX10 in the future) have the VPOPCNT instruction, which works on vectors of up to 512 bits.

With VPOPCNT it is easy to accelerate any POPCNT-dependent algorithm to speeds far beyond what is possible with any other instructions.


Yes, I certainly expect so!

In the 2019 paper I linked to earlier, I predicted "The VPOPCNTDQ instruction in the AVX-512 instruction set computes a 512-bit popcount, which should be faster still." because at the time I didn't have access to such hardware.

While those are easier to find now, I haven't revisited the issue because my experiments back then strongly suggested my code is now limited by memory bandwidth, not popcount evaluation performance.

I also don't know how many of my customers deploy on VPOPCNT-capable hardware.

My bitvectors have a 1-bit density of 5%, so I think the real next step is to look into something like BLOSC, where I store the data in compressed form, using a compression format that supports on-the-fly decompression faster than a memory read, then do the popcount on that transient data. Since 75% compression moves 4x fewer bytes across the memory bus, the popcount stage would need to be 4x faster to keep up.

I've tried to use BLOSC for this, but wasn't quickly able to figure out how to integrate it with my code, and realized it would likely require some pretty big breaking changes that I couldn't really justify, so I've been putting it off for years.



