The part of the article I found most amusing:

"Intel added AVX512-VP2INTERSECT to Tiger Lake. But it was really slow. (microcoded ~25 cycles/46 uops) It was so slow that someone found a better way to implement its functionality without using the instruction itself. Intel deprecates the instruction and removes it from all processors after Tiger Lake. (ignoring the fact that early Alder Lake unofficially also had it) AMD adds it to Zen5. So just as Intel kills off VP2INTERSECT, AMD shows up with it. Needless to say, Zen5 had probably already taped out by the time Intel deprecated the instruction. So VP2INTERSECT made it into Zen5's design and wasn't going to be removed.

But how good is AMD's implementation? Let's look at AIDA64's dumps for Granite Ridge:

AVX512_VP2INTERSECT :VP2INTERSECTQ k1+1, zmm, zmm L: [diff. reg. set] T: 0.23ns= 1.00c

Yes, that's right. 1 cycle throughput. ONE cycle. I can't... I just can't...

Intel was so bad at this that they dropped the instruction. And now AMD finally appears and shows them how it's done - 2 years too late."
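
For anyone who hasn't run into the instruction: as I understand it, VP2INTERSECTQ compares every 64-bit lane of one vector against every lane of the other and writes a pair of masks, one per input, marking which lanes had a match somewhere in the other vector. Here is a rough scalar sketch of that behavior for the 512-bit (8-lane) case. The names `intersect_masks` and `vp2intersectq_ref` are my own, made up for illustration; this is only a reference model of the semantics, not the faster emulation the article alludes to and not how any hardware implements it.

```c
#include <stdint.h>

/* Hypothetical result type: one 8-bit mask per input vector. */
typedef struct {
    uint8_t k_a; /* bit i set if a[i] equals some element of b */
    uint8_t k_b; /* bit j set if b[j] equals some element of a */
} intersect_masks;

/* Scalar model of an 8-lane, 64-bit all-pairs equality compare.
 * Assumption: this mirrors the documented mask-pair semantics of
 * VP2INTERSECTQ on zmm registers; it is not the real instruction. */
static intersect_masks vp2intersectq_ref(const uint64_t a[8],
                                         const uint64_t b[8])
{
    intersect_masks m = {0, 0};
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            if (a[i] == b[j]) {
                m.k_a |= (uint8_t)(1u << i); /* lane i of a has a match */
                m.k_b |= (uint8_t)(1u << j); /* lane j of b has a match */
            }
        }
    }
    return m;
}
```

That is 64 comparisons per call, which gives a sense of why one result per cycle is a striking number next to the ~25 cycles quoted for Tiger Lake.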