As someone who used to eat assembly instructions for breakfast back in the day and remembers when a MUL took more than one cycle, is there any resource you'd recommend for learning to use the highly vectorized/parallelized instruction sets in modern CPUs?
I already know about Daniel Lemire (lemire.me).
Anybody / anything else you'd recommend?