
Hi all, author here.

Several people have raised questions about SIMD. For many workloads, SIMD is obviously a huge win. But in my experience, for problems like this, the gains are marginal and the additional complexity often isn't worth it. For example: do you offer a build-time option? Check at runtime on every call? Use templates to check once at startup and dispatch to a different code path?
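
For what it's worth, here is a minimal sketch of the "check once at startup and run a different code path" option, using a function pointer cached on first use. The names and the GCC/Clang builtins are illustrative only, not what sajson actually does:

    #include <cstddef>

    // Hypothetical scalar and AVX2 variants of the same routine.
    static std::size_t scan_scalar(const char* p, std::size_t n) { /* scalar loop */ return n; }
    static std::size_t scan_avx2(const char* p, std::size_t n)   { /* AVX2 loop */   return n; }

    std::size_t scan_string(const char* p, std::size_t n) {
        // The static initializer runs once (thread-safely in C++11); after
        // that, every call is an indirect call with no feature check.
        static const auto impl =
            (__builtin_cpu_init(), __builtin_cpu_supports("avx2"))
                ? scan_avx2 : scan_scalar;
        return impl(p, n);
    }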

That said, it never hurts to try, so I implemented the fast string path in AVX2 (see https://github.com/chadaustin/sajson/commit/e87da27883ffe0a3...). It actually benchmarks slower than the scalar code: https://gist.github.com/chadaustin/6fe8d250ac8402ad70a724e7a...

I couldn't tell you exactly why - it's possible something could be tweaked to make it execute faster, or that enabling -mavx2 causes other code to get slower - but I wanted to illustrate that enabling SIMD is not always an automatic go-fast button.

I did look at the instructions generated and they seem reasonable:

    .LBB5_118:
        # Main loop: scan the string 32 bytes at a time.
        vmovdqu	ymm5, ymmword ptr [rbx - 32]
        # Compare each byte against the precomputed constants in ymm0-ymm3
        # (the byte classes that end the fast scan) and OR the results.
        vpcmpgtb	ymm6, ymm5, ymm0
        vpcmpeqb	ymm7, ymm5, ymm1
        vpcmpeqb	ymm8, ymm5, ymm2
        vpcmpeqb	ymm5, ymm5, ymm3
        vpxor	ymm5, ymm5, ymm4
        vpor	ymm6, ymm8, ymm6
        vpor	ymm5, ymm5, ymm6
        vpor	ymm5, ymm5, ymm7
        # One mask bit per flagged byte; nonzero means this chunk contains
        # a character that needs attention.
        vpmovmskb	esi, ymm5
        test	esi, esi
        jne	.LBB5_148
        mov	rbp, rax
        sub	rbp, rbx
        mov	rsi, rbx
        lea	rbx, [rbx + 32]
        cmp	rbp, 31
        jg	.LBB5_118
        jmp	.LBB5_120

    .LBB5_148:
        # A flagged byte was found in the last 32-byte chunk.
        add	rbx, -32
        # bsr locates a flagged byte within the chunk (the edit below notes
        # this should be bsf, i.e. the first flagged byte, not the last).
        bsr	eax, esi
        xor	eax, 31
        lea	rsi, [rbx + rax]
        mov	bl, byte ptr [rbx + rax]
    .LBB5_125:
        mov	r10, qword ptr [rsp + 24] # 8-byte Reload
        mov	r9b, byte ptr [rsp + 39] # 1-byte Reload
        sub	rcx, r8
        movzx	eax, bl
        cmp	eax, 34
        jne	.LBB5_127

I suppose it's also worth pointing out that, while string scanning is an important part of JSON parsing, the instruction histogram is generally spread across the entire parser. That is, if string scanning accounts for maybe 20-40% of total parse time, doubling its speed only improves overall parse times by 10-20%.

Edit: Upon thinking about it, it's probably slower because there's a dependency chain through a pile of high-latency vector instructions, all the way from the load to the bsr (which should be bsf; I fixed it on the branch) and on to the next loop iteration's load. The branch predictor doesn't have much opportunity to help us out here. You could look at Agner Fog's tables and calculate the critical path latency of this loop, but I'm guessing it's not pretty.
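
For a rough sense of what that calculation looks like, here's a back-of-the-envelope tally of the hit path's dependency chain. The cycle counts are approximate Skylake-class figures (roughly in line with Agner Fog's tables), not measurements of this code:

    vmovdqu 256-bit load (L1 hit)     ~7 cycles
    vpcmpeqb / vpcmpgtb                1
    vpor x2 (on the critical path)     2
    vpmovmskb                         ~2-3
    bsr                                3
    lea + dependent byte load         ~5-6
                                      ------
                                      ~20 cycles before the next dependent load can begin

Even if individual numbers are off by a cycle or two, the chain is long and entirely serial.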


I think this is marginally more expensive than my suggestion, but I'm not 100% sure, and I think the critical paths are similar. My suggestion requires (after the load):

1 shift (4 bits right); unfortunately, no byte-granularity shift is available.

2 PAND operations to ensure that the 'magic high bit' (0x80) is switched off (this bit is magic to PSHUFB)

2 PSHUFB operations to look up a table

1 PAND to combine the two table lookups.

... and then rejoins your implementation at the VPMOVMSKB.
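
Putting those steps together, here's a minimal AVX2 sketch of the nibble-lookup idea as I understand it, flagging the bytes a JSON string scan has to stop on (quote, backslash, and control characters below 0x20). The table contents and the function name are my own illustration, not taken from sajson or from the code above:

    // Classify 32 bytes at once: two PSHUFB table lookups keyed on the low
    // and high nibbles, ANDed together. A byte is "special" if any class
    // bit survives the AND.
    #include <immintrin.h>
    #include <cstdint>

    uint32_t find_special_bytes(const unsigned char* p) {
        const __m256i input = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(p));

        // Class bits: 1 = quote (0x22), 2 = backslash (0x5C), 4 = control (< 0x20).
        // Each 16-entry table is replicated into both 128-bit lanes for VPSHUFB.
        const __m256i lo_table = _mm256_setr_epi8(
            4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 4, 4, 4,
            4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 4, 4, 4);
        const __m256i hi_table = _mm256_setr_epi8(
            4, 4, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            4, 4, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
        const __m256i nibble_mask = _mm256_set1_epi8(0x0f);

        // 1 shift: there is no byte-granularity shift, so shift 16-bit lanes...
        // 2 PANDs: ...and mask down to a nibble, which also clears the 0x80
        // bit that would otherwise zero the PSHUFB result.
        const __m256i lo_nibble = _mm256_and_si256(input, nibble_mask);
        const __m256i hi_nibble =
            _mm256_and_si256(_mm256_srli_epi16(input, 4), nibble_mask);

        // 2 PSHUFBs: look up class bits for each nibble.
        const __m256i lo_class = _mm256_shuffle_epi8(lo_table, lo_nibble);
        const __m256i hi_class = _mm256_shuffle_epi8(hi_table, hi_nibble);

        // 1 PAND to combine the two lookups.
        const __m256i special = _mm256_and_si256(lo_class, hi_class);

        // One extra compare turns nonzero bytes into 0xFF, then VPMOVMSKB
        // rejoins the mask-and-branch logic from the listing above.
        const __m256i hits = _mm256_cmpgt_epi8(special, _mm256_setzero_si256());
        return static_cast<uint32_t>(_mm256_movemask_epi8(hits));
    }

Bytes with the high bit set (UTF-8 continuation bytes and the like) map to high nibbles 8-15, where the table is zero, so they are never flagged.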

I don't think your instructions are high latency unless you are using something surprising - VPOR/VPXOR/VCMP*B are all latency 1 on most recent Intel architectures.

Whether these techniques help depends on context. I'd be surprised if vector operations do that well against short-to-medium strings. We found that a few of our sequences that are designed to 'find the first match' saw no benefit from moving from SSE to AVX2 due to similar effects.


> Do you have a build-time option? Check at runtime at every call? Use templates to check at startup and run a different code path?

With glibc there's also the IFUNC option:

https://sourceware.org/glibc/wiki/GNU_IFUNC
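
For illustration, a minimal sketch of what that looks like with GCC's ifunc attribute under glibc; the function names and the scalar/AVX2 split here are hypothetical, not something sajson does:

    #include <cstddef>

    // Two hypothetical implementations of the same routine.
    static std::size_t scan_scalar(const char* p, std::size_t n) { /* scalar loop */ return n; }
    static std::size_t scan_avx2(const char* p, std::size_t n)   { /* AVX2 loop */   return n; }

    // The resolver runs once, when the dynamic linker binds the symbol.
    // extern "C" keeps its symbol name unmangled so the ifunc attribute
    // below can refer to it.
    extern "C" std::size_t (*resolve_scan_string())(const char*, std::size_t) {
        __builtin_cpu_init();
        return __builtin_cpu_supports("avx2") ? scan_avx2 : scan_scalar;
    }

    // Callers simply call scan_string(); there is no per-call feature check.
    std::size_t scan_string(const char* p, std::size_t n)
        __attribute__((ifunc("resolve_scan_string")));

Compared with checking CPU features on every call, the dispatch cost collapses to an ordinary indirect call through the PLT.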


If you're using AVX instructions, make sure to insert a vzeroupper after you're done; otherwise you take a big speed hit transitioning between SSE and AVX code. The compiler will generate these automatically around AVX instructions it emits itself:

https://godbolt.org/g/jNXJyA

Here's an intrinsic for it:

http://technion.ac.il/doc/intel/compiler_c/main_cls/intref_c...
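
For a concrete (hypothetical) picture of where the intrinsic goes when you're writing the AVX code by hand rather than letting the compiler emit it:

    #include <immintrin.h>
    #include <cstddef>

    void fill_ones(float* dst, std::size_t n) {
        std::size_t i = 0;
        const __m256 ones = _mm256_set1_ps(1.0f);
        for (; i + 8 <= n; i += 8)
            _mm256_storeu_ps(dst + i, ones);   // 256-bit AVX stores
        _mm256_zeroupper();  // clear the upper ymm halves before SSE-era code runs
        for (; i < n; ++i)   // the scalar tail now avoids the transition penalty
            dst[i] = 1.0f;
    }

As the godbolt link shows, when the compiler itself emits the AVX instructions it places the vzeroupper for you.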


Little known fact about AVX: every time you use an AVX instruction, Intel processors reduce the core frequency by a few percent for about a millisecond to avoid overheating (I guess because they need to turn on an otherwise unused functional unit).

As a result, if you sprinkle some AVX instructions in your code, but not enough to produce a significant efficiency win, you might actually end up reducing your capacity (and hurting latency).


Can you provide a citation for that or a link to more reading? I'm not asking because I don't believe you, but because I would like to learn more. :-)


Sure, see this doc from Intel: https://computing.llnl.gov/tutorials/linux_clusters/intelAVX...

"Intel AVX instructions require more power to run. When executing these instructions, the processor may run at less than the marked frequency to maintain thermal design power (TDP) limits."

...

"When the processor detects Intel AVX instructions, additional voltage is applied to the core. With the additional voltage applied, the processor could run hotter, requiring the operating frequency to be reduced to maintain operations within the TDP limits. The higher voltage is maintained for 1 millisecond after the last Intel AVX instruction completes, and then the voltage returns to the nominal TDP voltage level."



