I'll check your new aesni hash, but I would recommend trying a simple, fast 32-bit hash instead, like the built-in crc32c. It needs half as much space, and cache misses are essentially what will kill you here. Half the space means roughly half as many cache misses.
And a simple linear hash table instead of cuckoo would also help reduce cache misses. There are no deletions. It should be about 20% faster, I think.
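To make the suggestion concrete, here is a minimal sketch of an insert-only table of 32-bit hashes. It reads "linear" as linear probing, which is an assumption on my part. The names `crc32c_sw`, `hashset32`, `hashset32_init` and `hashset32_add` are hypothetical, and the bitwise CRC32C is only a portable stand-in for the hardware instruction (`_mm_crc32_u64` with SSE4.2). The table does not grow, so the capacity has to be chosen up front.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Portable bitwise CRC32C (Castagnoli polynomial, reflected form).
 * Slow; a stand-in for the hardware crc32 instruction. */
static uint32_t crc32c_sw(const void *data, size_t len) {
    const unsigned char *p = (const unsigned char *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Insert-only set of 32-bit hashes. The value 0 marks an empty slot. */
typedef struct {
    uint32_t *slots;
    size_t mask;   /* capacity - 1; capacity must be a power of two */
    size_t used;
} hashset32;

static int hashset32_init(hashset32 *s, size_t capacity_pow2) {
    s->slots = (uint32_t *)calloc(capacity_pow2, sizeof *s->slots);
    s->mask = capacity_pow2 - 1;
    s->used = 0;
    return s->slots ? 0 : -1;
}

/* Returns 1 if h was newly inserted, 0 if it was already present,
 * and -1 if the table is full. No tombstones are needed because
 * nothing is ever deleted. */
static int hashset32_add(hashset32 *s, uint32_t h) {
    if (h == 0) h = 1;                    /* 0 is reserved for "empty" */
    size_t i = h & s->mask;
    for (size_t probes = 0; probes <= s->mask; probes++) {
        if (s->slots[i] == h) return 0;
        if (s->slots[i] == 0) {
            s->slots[i] = h;
            s->used++;
            return 1;
        }
        i = (i + 1) & s->mask;            /* neighbour slot, usually the same cache line */
    }
    return -1;
}
```

A line would be added with something like `hashset32_add(&set, crc32c_sw(line, len))`. A probe sequence walks adjacent 4-byte slots, so a lookup usually stays within one or two cache lines, which is the effect being argued for above.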
So I tested your aesnihash with smhasher. It's really bad. A much better 64-bit aesni variant would be falkhash, which is 4x faster, supports a seed, and passes most tests.
The main point of this hash, in this context, is to do streaming hashing and find \n in one loop. The intention is to reduce the number of data loads (_mm_loadu_si128): I already have the user data in xmm0, so why not do some aesni on it right away? Because it's streaming, I can't, for example, derive the initial seed from the chunk length, since the length is unknown at the time the hash is called. See:
I don't need a full AES hash, but maybe that could be an option as well.
In other words, in my case I don't care just about hash() speed. I care about memchr() + hash() speed. I would like to understand/measure the hash quality itself. Maybe adding another aesenc round would be sufficient to fix it.
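A minimal sketch of that fused loop, to show the shape of the idea. This is not the actual aesnihash: `hash_line`, `line_result` and the key constants are hypothetical, it assumes x86 with AES-NI and GCC or Clang (`__attribute__((target))`, `__builtin_ctz`), and it makes no claim about hash quality. Each 16-byte load is used twice, once to look for \n and once as input to an aesenc round. The length is only mixed in at the end, because it is not known up front.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    size_t   len;    /* bytes before the first '\n' (or all of buf) */
    uint64_t hash;   /* hash of those bytes */
    int      found;  /* 1 if a '\n' was seen */
} line_result;

/* 16 x 0xFF followed by 16 x 0x00; loading at offset 16 - k yields a
 * mask that keeps the first k bytes of a block. */
static const unsigned char keep_tbl[32] = {
    255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255
};

__attribute__((target("sse2,aes")))
static line_result hash_line(const char *buf, size_t n) {
    const __m128i nl  = _mm_set1_epi8('\n');
    /* Arbitrary illustrative constants, not a vetted key. */
    const __m128i key = _mm_set_epi32(0x243F6A88, 0x13198A2E,
                                      0x03707344, 0x299F31D0);
    __m128i state = key;
    line_result r = {0, 0, 0};

    for (size_t off = 0; off < n; off += 16) {
        size_t avail = n - off;
        __m128i block;
        if (avail >= 16) {
            block = _mm_loadu_si128((const __m128i *)(buf + off));
        } else {
            char tmp[16] = {0};           /* zero-pad the final partial block */
            memcpy(tmp, buf + off, avail);
            block = _mm_loadu_si128((const __m128i *)tmp);
        }
        /* Same register feeds both the newline search and the hash. */
        unsigned m = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(block, nl));
        size_t valid = avail < 16 ? avail : 16;
        if (m) {
            valid = (size_t)__builtin_ctz(m);
            r.found = 1;
        }
        /* Drop the '\n' and everything after it before mixing. */
        __m128i keep = _mm_loadu_si128((const __m128i *)(keep_tbl + 16 - valid));
        block = _mm_and_si128(block, keep);
        state = _mm_aesenc_si128(_mm_xor_si128(state, block), key);
        r.len += valid;
        if (r.found) break;
    }
    /* Mix in the length last, then one extra round as a finalizer. */
    state = _mm_xor_si128(state, _mm_set_epi64x(0, (long long)r.len));
    state = _mm_aesenc_si128(state, key);
    state = _mm_aesenc_si128(state, key);

    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, state);
    r.hash = out[0] ^ out[1];
    return r;
}
```

The caller advances by `len + 1` when `found` is set and calls again for the next line. Adding or removing aesenc rounds in the loop is then a one-line change, which makes it easy to measure both speed and smhasher results for each variant.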
Even falkhash is not that great, and features an abnormal number of collisions as the number of hashes increases. Basically, all "naive" AES implementations share this design weakness.
And a simple linear hash table instead of cuckoo would also help reduce cache misses. There are no deletions. It should be about 20% faster, I think.
Or even gperf.