Nit: you call _mm_aesenc_si128() once per call to next(), right? I don't see how that constitutes "1/4th of an AES round per output word". (You do output 1/4th of an AES block.)
You are, of course, right about the actual point you're making. And calling _mm_aesenc_si128() once per 4 calls to next() may well suffice to pass a statistical test. Then again, even an LFSR passes most statistical tests...
A buffered version that refills only when the current block is used up would need just one AES call every 4 words (and only one block of storage). Instead, I chose to avoid the `if` and compute blocks ahead of time, which makes for more predictable performance.
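For anyone following along, here is a rough sketch of what the buffered variant with the `if` might look like. This is not the code from the post: `lazy_rng`, `lazy_init`, `lazy_next` and `round_stub` are names made up for illustration, and `round_stub` is a non-cryptographic placeholder standing in for `_mm_aesenc_si128()` so the sketch compiles without AES-NI. The `rounds` counter exists only to make the one-call-per-4-words ratio visible.

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit block viewed as four 32-bit output words. */
typedef struct { uint32_t w[4]; } block128;

typedef struct {
    block128 state;        /* the single block of storage */
    block128 key;          /* round key (hypothetical; fixed after init) */
    unsigned used;         /* words already handed out from state, 0..4 */
    unsigned long rounds;  /* block computations so far (illustration only) */
} lazy_rng;

/* Placeholder for one AES round. Real code would call
   _mm_aesenc_si128(state, key) here. NOT cryptographic. */
static block128 round_stub(block128 s, block128 k) {
    for (int i = 0; i < 4; i++) {
        uint32_t x = s.w[i] ^ k.w[i];
        x *= 0x9E3779B1u;
        x ^= x >> 15;
        s.w[i] = x + s.w[(i + 1) & 3];
    }
    return s;
}

static void lazy_init(lazy_rng *g, uint32_t seed) {
    for (int i = 0; i < 4; i++) {
        g->state.w[i] = seed + (uint32_t)i;
        g->key.w[i] = 0x243F6A88u ^ (seed * (uint32_t)(i + 1));
    }
    g->used = 4;    /* force a refill on the first call */
    g->rounds = 0;
}

static uint32_t lazy_next(lazy_rng *g) {
    assert(g->used <= 4);
    if (g->used == 4) {            /* this is the branch being avoided */
        g->state = round_stub(g->state, g->key);
        g->rounds++;
        g->used = 0;
    }
    return g->state.w[g->used++];
}
```

The trade-off is the one described above: three out of four calls are a cheap load, and every fourth call pays for the whole block, so per-call latency is uneven even though the average cost is lower.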