Several people have raised questions about SIMD. For many workloads, SIMD is obviously a huge win. But in my experience, for problems like this, the gains are marginal and the additional complexity often isn't worth it. e.g. Do you have a build-time option? Check at runtime at every call? Use templates to check at startup and run a different code path?
I couldn't tell you why - it's possible something could be tweaked to make it actually execute faster, or that enabling -mavx2 causes other code to get slower, but I wanted to illustrate that enabling SIMD is not always an automatic go fast button.
I did look at the instructions generated and they seem reasonable:
I suppose it's also worth pointing out that, while string scanning is an important part of JSON parsing, the instruction histogram is generally spread across the entire parser. That is, a doubling in string parsing performance would only help overall parse times by 10-20% maybe.
Edit: Upon thinking about it, it's probably slower because there's a dependency chain through a pile of high-latency vector instructions all the way to the bsr (should be bsf, I fixed on the branch) to the next loop iteration's load. The branch predictor doesn't have much opportunity to help us out here. You could look at Agner Fog's tables and calculate the critical path latency of this loop, but I'm guessing it's not pretty.
I think this is marginally more expensive than my suggestion, but I'm not 100% sure, and I think the critical paths are similar. My suggestion requires (after the load):
1 shift (4 bits right), unfortunately no byte-granularity shift is available.
2 PAND operations to ensure that the 'magic high bit' (0x80) is switched off (this bit is magic to PSHUFB)
2 PSHUFB operations to look up a table
1 PAND to combine the table.
... and then rejoins your implementation at the VPMOVMSKB.
I don't think your instructions are high latency unless you are using something surprising - VPOR/VPXOR/VCMP*B are all latency 1 on most recent Intel architecture operations.
Whether these techniques depends on context. I'd be surprised if vector operations do that well against short-to-medium writes. We found that a few of our sequences that are design to 'find the first match' saw no benefit from moving from SSE to AVX2 due to similar effects.
If you're using AVX instructions make sure to insert a vzeroupper after you're done, otherwise you take a big speed hit transitioning between SSE and AVX. The compiler will generate these automatically for AVX instructions it is emitting:
Little known fact about AVX: every time you use an AVX instruction Intel processors reduce the core frequency by a few percents for about a millisecond to avoid overheating (I guess because they need to turn on an otherwise unused functional unit).
As a result, if you sprinkle some AVX instructions in your code but not enough to cause a significant efficiency win you might actually end up reducing your capacity (and affecting latency).
"Intel AVX instructions require more power to run. When executing these instructions, the processor may run at less than the marked frequency to maintain thermal design power (TDP) limits."
...
"When the processor detects Intel AVX instructions, additional voltage is applied to the core. With the additional voltage applied, the processor could run hotter, requiring the operating frequency to be reduced to maintain operations within the TDP limits. The higher voltage is maintained for 1 millisecond after the last Intel AVX instruction completes, and then the voltage returns to the nominal TDP voltage level."
Several people have raised questions about SIMD. For many workloads, SIMD is obviously a huge win. But in my experience, for problems like this, the gains are marginal and the additional complexity often isn't worth it. e.g. Do you have a build-time option? Check at runtime at every call? Use templates to check at startup and run a different code path?
That said, it never hurts to try, so I implemented the fast string path in AVX2 (see https://github.com/chadaustin/sajson/commit/e87da27883ffe0a3...). It actually benchmarks slower than the scalar code: https://gist.github.com/chadaustin/6fe8d250ac8402ad70a724e7a...
I couldn't tell you why - it's possible something could be tweaked to make it actually execute faster, or that enabling -mavx2 causes other code to get slower, but I wanted to illustrate that enabling SIMD is not always an automatic go fast button.
I did look at the instructions generated and they seem reasonable:
I suppose it's also worth pointing out that, while string scanning is an important part of JSON parsing, the instruction histogram is generally spread across the entire parser. That is, a doubling in string parsing performance would only help overall parse times by 10-20% maybe.Edit: Upon thinking about it, it's probably slower because there's a dependency chain through a pile of high-latency vector instructions all the way to the bsr (should be bsf, I fixed on the branch) to the next loop iteration's load. The branch predictor doesn't have much opportunity to help us out here. You could look at Agner Fog's tables and calculate the critical path latency of this loop, but I'm guessing it's not pretty.