> \* VPTERNLOGD (your swiss army railgun for bitwise logic, can often fuse 2 or ...

sparkie · on June 20, 2023

I doubt it would be much use as cryptographic operations tend to mainly use xor on two inputs.

VPTERNLOGD basically works by constructing a truth table for 3 inputs.

    | A | B | C |  R
    | 0 | 0 | 0 |  x
    | 0 | 0 | 1 |  x
    | 0 | 1 | 0 |  x
    | 0 | 1 | 1 |  x
    | 1 | 0 | 0 |  x
    | 1 | 0 | 1 |  x
    | 1 | 1 | 0 |  x
    | 1 | 1 | 1 |  x

You pick the values you want for R, then pass the resulting 8-bit value as the immediate operand to the instruction along with the 3 inputs. The top row of the table is bit 0 of the immediate and the bottom row is bit 7.

For example, A ∧ B ∧ C would be 0b10000000. A ∧ ¬B ∧ ¬C would be 0b00010000.

There are 256 such tables and many of them can be represented by multiple boolean expressions.
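To make the table lookup concrete, here is a rough scalar model of one 32-bit lane in C. It is only a sketch: `ternlog32` and `ternlog32_examples` are made-up names, and it assumes A is the high bit of the row index, matching the column order of the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane (hypothetical helper, not an intrinsic).
 * At each bit position the bits of a, b and c form a 3-bit row index
 * (a is the high bit); the result bit is that row's entry in imm8. */
uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned row = (((a >> i) & 1u) << 2)
                     | (((b >> i) & 1u) << 1)
                     |  ((c >> i) & 1u);
        r |= (uint32_t)((imm8 >> row) & 1u) << i;
    }
    return r;
}

/* Checks the two example tables against plain C expressions. */
void ternlog32_examples(void)
{
    uint32_t a = 0xDEADBEEFu, b = 0x12345678u, c = 0x0F0F0F0Fu;
    assert(ternlog32(a, b, c, 0x80) == (a & b & c));   /* 0b10000000 */
    assert(ternlog32(a, b, c, 0x10) == (a & ~b & ~c)); /* 0b00010000 */
}
```

The real instruction does this for every bit of the vector at once; the loop here is only meant to show the indexing.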

mmozeiko · on June 20, 2023

Here's a fancy trick from LLVM source: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Targ...

  #define A 0xf0
  #define B 0xcc
  #define C 0xaa

And then you can build the immediate for a VPTERNLOG operation by writing a bitwise expression with the A/B/C values in source code.

For example, A^B^C=150. A^(~B&C)=210. And so on...

Also mentioned by Fabian here: https://twitter.com/rygorous/status/1187032693944410114
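The trick works because each constant is that input's column of the truth table, read from the bottom row up. A small C sketch that checks the numbers above (the enum names and `check_immediates` are made up for illustration; the `& 0xff` is there because `~` on an `int` sets the high bits):

```c
#include <assert.h>
#include <stdint.h>

/* Each constant is one input's column of the 8-row truth table,
 * with the bottom row (A=B=C=1) in bit 7 and the top row in bit 0. */
#define A 0xf0
#define B 0xcc
#define C 0xaa

/* Immediates written as ordinary expressions; names are hypothetical. */
enum {
    IMM_XOR3            = (A ^ B ^ C) & 0xff,
    IMM_A_XOR_NOTB_AND_C = (A ^ (~B & C)) & 0xff,
    IMM_AND3            = (A & B & C) & 0xff,
    IMM_A_AND_NOTB_NOTC = (A & ~B & ~C) & 0xff
};

/* Confirms the values quoted in the thread. */
void check_immediates(void)
{
    assert(IMM_XOR3 == 150);             /* 0x96 */
    assert(IMM_A_XOR_NOTB_AND_C == 210); /* 0xd2 */
    assert(IMM_AND3 == 0x80);            /* 0b10000000 */
    assert(IMM_A_AND_NOTB_NOTC == 0x10); /* 0b00010000 */
}
```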

addaon · on June 20, 2023

This is basically how the basic primitive of an FPGA, a LUT (look-up table), works. It's a small ROM or RAM of size 2^N x 1 that is indexed by the N-bit "address" formed from the inputs. Modern FPGAs tend to use N=4 to N=6, with some additional fanciness occasionally present to make the up-to-64 bits of ROM/RAM useful in other ways as well.
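As a rough illustration of the 2^N x 1 lookup, here is a toy C model for N up to 6, with the table contents packed into a 64-bit word. `lut_read` and `lut_examples` are hypothetical names, and this ignores everything a real FPGA cell adds around the LUT.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of an N-input LUT, N <= 6 (hypothetical helper).
 * `table` holds the 2^N one-bit entries; the N input bits packed
 * into `addr` select which entry is read out. */
unsigned lut_read(uint64_t table, unsigned n_inputs, unsigned addr)
{
    assert(n_inputs >= 1 && n_inputs <= 6);
    assert(addr < (1u << n_inputs));
    return (unsigned)((table >> addr) & 1u);
}

/* Two 4-input examples: AND (only entry 15 set) and XOR (parity). */
void lut_examples(void)
{
    assert(lut_read(0x8000u, 4, 0xF) == 1); /* all four inputs high */
    assert(lut_read(0x8000u, 4, 0x7) == 0);
    assert(lut_read(0x6996u, 4, 0xB) == 1); /* 0b1011 has three 1 bits */
    assert(lut_read(0x6996u, 4, 0x3) == 0); /* 0b0011 has two 1 bits */
}
```

With N=3 and an 8-bit table this is the same lookup VPTERNLOGD performs at every bit position.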

marssaxman · on June 20, 2023

That is delightful. Thanks for explaining it so clearly.

dougall · on June 20, 2023

Yeah – ARM specifically added EOR3 and BCAX instructions to accelerate SHA-3 hashes, both of which can be handled by VPTERNLOGD.
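For the curious, a sketch of which immediates those would be, checked against a scalar model in C. Assumptions: EOR3 is a three-way XOR and BCAX is n ^ (m & ~a), which is how I read the Arm descriptions; the function names are made up, and the model uses the same A=0xf0, B=0xcc, C=0xaa column convention as above.

```c
#include <assert.h>
#include <stdint.h>

/* Word-wide scalar model (hypothetical helper): OR together the
 * minterm of every truth-table row whose bit is set in imm8. */
uint32_t ternlog_model(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
    uint32_t r = 0;
    for (unsigned row = 0; row < 8; row++) {
        if ((imm8 >> row) & 1u) {
            r |= ((row & 4u) ? a : ~a)
               & ((row & 2u) ? b : ~b)
               & ((row & 1u) ? c : ~c);
        }
    }
    return r;
}

/* Reference semantics as assumed in the lead-in. */
uint32_t eor3_ref(uint32_t n, uint32_t m, uint32_t a) { return n ^ m ^ a; }
uint32_t bcax_ref(uint32_t n, uint32_t m, uint32_t a) { return n ^ (m & ~a); }

void check_sha3_immediates(void)
{
    uint32_t n = 0xDEADBEEFu, m = 0x12345678u, a = 0x0F0F0F0Fu;
    /* 0xf0 ^ 0xcc ^ 0xaa = 0x96 */
    assert(ternlog_model(n, m, a, 0x96) == eor3_ref(n, m, a));
    /* 0xf0 ^ (0xcc & ~0xaa) = 0xb4 */
    assert(ternlog_model(n, m, a, 0xB4) == bcax_ref(n, m, a));
}
```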

Nyan · on June 20, 2023

Very useful. In fact, it speeds up a single instance (i.e. not taking advantage of SIMD) of MD5 by 20%: https://github.com/animetosho/md5-optimisation#x86-avx512-vl...