Writing a Fast JSON Parser (chadaustin.me)
316 points by jsheard on May 26, 2017 | 50 comments



Hi all, author here.

Several people have raised questions about SIMD. For many workloads, SIMD is obviously a huge win. But in my experience, for problems like this, the gains are marginal and the additional complexity often isn't worth it. e.g. Do you have a build-time option? Check at runtime at every call? Use templates to check at startup and run a different code path?

That said, it never hurts to try, so I implemented the fast string path in AVX2 (see https://github.com/chadaustin/sajson/commit/e87da27883ffe0a3...). It actually benchmarks slower than the scalar code: https://gist.github.com/chadaustin/6fe8d250ac8402ad70a724e7a...

I couldn't tell you why - it's possible something could be tweaked to make it actually execute faster, or that enabling -mavx2 causes other code to get slower, but I wanted to illustrate that enabling SIMD is not always an automatic go-fast button.

I did look at the instructions generated and they seem reasonable:

    .LBB5_118:
        vmovdqu	ymm5, ymmword ptr [rbx - 32]
        vpcmpgtb	ymm6, ymm5, ymm0
        vpcmpeqb	ymm7, ymm5, ymm1
        vpcmpeqb	ymm8, ymm5, ymm2
        vpcmpeqb	ymm5, ymm5, ymm3
        vpxor	ymm5, ymm5, ymm4
        vpor	ymm6, ymm8, ymm6
        vpor	ymm5, ymm5, ymm6
        vpor	ymm5, ymm5, ymm7
        vpmovmskb	esi, ymm5
        test	esi, esi
        jne	.LBB5_148
        mov	rbp, rax
        sub	rbp, rbx
        mov	rsi, rbx
        lea	rbx, [rbx + 32]
        cmp	rbp, 31
        jg	.LBB5_118
        jmp	.LBB5_120

    .LBB5_148:
        add	rbx, -32
        bsr	eax, esi
        xor	eax, 31
        lea	rsi, [rbx + rax]
        mov	bl, byte ptr [rbx + rax]
    .LBB5_125:
        mov	r10, qword ptr [rsp + 24] # 8-byte Reload
        mov	r9b, byte ptr [rsp + 39] # 1-byte Reload
        sub	rcx, r8
        movzx	eax, bl
        cmp	eax, 34
        jne	.LBB5_127
I suppose it's also worth pointing out that, while string scanning is an important part of JSON parsing, the instruction histogram is generally spread across the entire parser. That is, doubling string parsing performance would only improve overall parse times by maybe 10-20%.

Edit: Upon thinking about it, it's probably slower because there's a dependency chain through a pile of high-latency vector instructions all the way to the bsr (should be bsf, I fixed it on the branch) to the next loop iteration's load. The branch predictor doesn't have much opportunity to help us out here. You could look at Agner Fog's tables and calculate the critical path latency of this loop, but I'm guessing it's not pretty.


I think this is marginally more expensive than my suggestion, but I'm not 100% sure, and I think the critical paths are similar. My suggestion requires (after the load):

1 shift (4 bits right), unfortunately no byte-granularity shift is available.

2 PAND operations to ensure that the 'magic high bit' (0x80) is switched off (this bit is magic to PSHUFB)

2 PSHUFB operations to look up a table

1 PAND to combine the two table results.

... and then rejoins your implementation at the VPMOVMSKB.

I don't think your instructions are high latency unless you are using something surprising - VPOR/VPXOR/VCMP*B are all latency 1 on most recent Intel architectures.

Whether these techniques pay off depends on context. I'd be surprised if vector operations do that well against short-to-medium writes. We found that a few of our sequences that are designed to 'find the first match' saw no benefit from moving from SSE to AVX2 due to similar effects.


> Do you have a build-time option? Check at runtime at every call? Use templates to check at startup and run a different code path?

With glibc there's also the IFUNC option:

https://sourceware.org/glibc/wiki/GNU_IFUNC
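
A minimal sketch of how that can look (GCC/Clang on a glibc target; the function names here are made up for illustration):

    #include <cstddef>

    extern "C" {
    // Hypothetical scalar and AVX2 string scanners; only the dispatch
    // mechanism is the point here.
    static size_t scan_string_scalar(const char* p, size_t n) { /* ... */ return n; }
    static size_t scan_string_avx2(const char* p, size_t n)   { /* ... */ return n; }

    // The resolver runs once, at dynamic-link time, and picks an implementation.
    static size_t (*resolve_scan_string(void))(const char*, size_t) {
        __builtin_cpu_init();
        return __builtin_cpu_supports("avx2") ? scan_string_avx2 : scan_string_scalar;
    }

    // Callers bind directly to whichever function the resolver returned,
    // so there's no per-call feature check.
    size_t scan_string(const char* p, size_t n)
        __attribute__((ifunc("resolve_scan_string")));
    }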


If you're using AVX instructions, make sure to insert a vzeroupper after you're done; otherwise you take a big speed hit transitioning between SSE and AVX. The compiler will generate these automatically for the AVX instructions it emits:

https://godbolt.org/g/jNXJyA

Here's an intrinsic for it:

http://technion.ac.il/doc/intel/compiler_c/main_cls/intref_c...
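
Illustrative only, in hand-written intrinsics code:

    #include <immintrin.h>
    #include <cstddef>

    void scan_avx2(const unsigned char* p, size_t n) {
        // ... AVX2 work on __m256i registers here ...

        // Clear the upper halves of the YMM registers before returning to
        // SSE-era code, avoiding the AVX<->SSE transition penalty.
        _mm256_zeroupper();
    }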


Little known fact about AVX: every time you use an AVX instruction, Intel processors reduce the core frequency by a few percent for about a millisecond to avoid overheating (I guess because they need to turn on an otherwise unused functional unit).

As a result, if you sprinkle some AVX instructions in your code but not enough to cause a significant efficiency win you might actually end up reducing your capacity (and affecting latency).


Can you provide a citation for that or a link to more reading? I'm not asking because I don't believe you, but because I would like to learn more. :-)


Sure, see this doc from Intel: https://computing.llnl.gov/tutorials/linux_clusters/intelAVX...

"Intel AVX instructions require more power to run. When executing these instructions, the processor may run at less than the marked frequency to maintain thermal design power (TDP) limits."

...

"When the processor detects Intel AVX instructions, additional voltage is applied to the core. With the additional voltage applied, the processor could run hotter, requiring the operating frequency to be reduced to maintain operations within the TDP limits. The higher voltage is maintained for 1 millisecond after the last Intel AVX instruction completes, and then the voltage returns to the nominal TDP voltage level."


I felt moved to comment on the thread. HN user burntsushi has very kindly written up a trick we developed here on the Hyperscan team that should be considerably faster than this system, assuming there's enough data to make using SIMD pay off (it would be a loss if the sizes are small enough).

It's possible that a SIMD implementation (assuming that you routinely pass over enough bytes) can blow this away. I'm not sure what the cross-over point is - it depends on whether you are falling out of this code quickly or not. Obviously using a big SIMD sequence to replace a "2 cycles per byte" implementation is pretty stupid if the SIMD sequence is "0.5 cycles a byte" but the fixed cost of using the SIMD means that your costs are >=16 cycles and the typical case is small.

My comment follows:

It's also not clear that you actually need a SIMD instruction - if you've squashed your hotspot to the point that adding the complexity of per-platform SIMD specialization is only a small win, then why bother?

All that being said...

I've been really lazy in writing this up, but someone kindly did a decent write-up for us in the Rust reimplementation of our "Teddy" literal matcher. Please see https://github.com/rust-lang/regex/blob/master/src/simd_acce... if interested or look up the Hyperscan project at https://github.com/01org/hyperscan for a few different versions of this trick.


The trick, very briefly:

Use two 16-byte values in SIMD registers as 16-entry lookup tables with "8 buckets" (really just bits) each.

Look up those two separate tables with the high nibble (bits 4..7) and low nibble (bits 0..3) of your input (you can do this in parallel either 16/32/64 ways depending on whether you are using SSSE3/AVX2/AVX512).

AND together the table results (or OR them, depending on the sense of the table).

PMOVMSKB (or VPMOVMSKB) to extract the results back into GPR-land.

So if you want to match, say 0x12 and 0xab at once, assign 0x12 to 'bucket #0' and 0xab to 'bucket #1'.

High nibble table gives you 0x1 -> 0x1 (that is, 0x10 in the high nibble gives you bucket #0) and 0xa -> 0x2

Low nibble table gives you 0x2 -> 0x1 and 0xb -> 0x2.

Tables are otherwise zero.

You can now look up lots of different useful character classes (not all! That takes three shuffle tables) in about 6 instructions. Sadly the instructions are pretty much a long linear chain of dependencies so that latency is a bit worse than throughput, but we've built some fun stuff around them.
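
A rough sketch of the above in SSSE3 intrinsics, wired up for the 0x12/0xab example (my illustration of the idea, not Hyperscan's actual code):

    #include <immintrin.h>
    #include <cstdint>

    // Returns a 16-bit mask with bit i set if chunk[i] is 0x12 or 0xab.
    static uint32_t match16(const unsigned char* chunk) {
        // High-nibble table: 0x1 -> bucket #0 (0x01), 0xa -> bucket #1 (0x02).
        const __m128i hi_table = _mm_setr_epi8(
            0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0x02, 0, 0, 0, 0, 0);
        // Low-nibble table: 0x2 -> bucket #0 (0x01), 0xb -> bucket #1 (0x02).
        const __m128i lo_table = _mm_setr_epi8(
            0, 0, 0x01, 0, 0, 0, 0, 0, 0, 0, 0, 0x02, 0, 0, 0, 0);
        const __m128i nib_mask = _mm_set1_epi8(0x0f);

        __m128i input = _mm_loadu_si128(reinterpret_cast<const __m128i*>(chunk));
        // Split each byte into its nibbles; the PANDs also keep the index
        // high bit clear, which PSHUFB treats specially.
        __m128i lo = _mm_and_si128(input, nib_mask);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(input, 4), nib_mask);
        // A byte matches a bucket only if both of its nibble lookups agree.
        __m128i buckets = _mm_and_si128(_mm_shuffle_epi8(lo_table, lo),
                                        _mm_shuffle_epi8(hi_table, hi));
        // Collapse to a bitmask of bytes whose bucket bits are non-zero.
        __m128i is_zero = _mm_cmpeq_epi8(buckets, _mm_setzero_si128());
        return ~static_cast<uint32_t>(_mm_movemask_epi8(is_zero)) & 0xffffu;
    }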


You don't even have to go into SIMD and lookup tables: on 64-bit archs you can process 8 bytes at a time using a few bit tricks. Lookup tables work great in microbenchmarks because the tables are always in cache, but not so great if there's cache pressure from other code.

See this implementation of JSON escaping for example: https://github.com/facebook/folly/commit/2f0cabfb48b8a8df84f...

De-escaping is even simpler because you only have to detect two characters, \ and ".
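
A rough sketch of that style of bit trick (illustrative, not folly's actual code): the word-level test is exact for "does this 8-byte chunk contain a quote or backslash at all", though the individual bit positions can be off because borrows propagate across bytes.

    #include <cstdint>
    #include <cstring>

    // Non-zero iff some byte of `word` equals `needle` (classic has-zero-byte trick).
    static inline uint64_t contains_byte(uint64_t word, uint8_t needle) {
        uint64_t x = word ^ (0x0101010101010101ULL * needle);
        return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
    }

    static inline bool needs_unescaping(const char* p) {
        uint64_t word;
        std::memcpy(&word, p, sizeof(word));  // unaligned-safe 8-byte load
        return (contains_byte(word, '"') | contains_byte(word, '\\')) != 0;
    }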


The bit tricks are OK. We used them extensively back when we supported platforms that lacked SSSE3, but they are way slower than the SIMD techniques because of all the bit-bashing, and far less flexible.

The technique I outline can match most character classes that you run into (and there's a 3-PSHUFB variant that can handle all predicates on bytes), while the bit-twiddling version will get linearly slower as you add more cases. If there's a way to get at what we did with general purpose operations and bit twiddling hacks I'm too stupid to think of it.


Well of course, they're different trade-offs. Here we don't care about flexibility, we're talking about a specific syntax (JSON).

Also SSE will be beneficial only if your strings are long enough. I'd be curious to benchmark my implementation against an SSE one on a representative dataset.

EDIT: the microbenchmark indicates that the bit-trick implementation spends about 1 cycle per byte (for the entire escaping routine, if there's nothing to escape).


1 cycle per byte is pretty good, and its properties might be very nice for small strings.

Assuming you can prep everything in advance, the SIMD version is 3 loads and about 5 SIMD operations (1 shift, 2 PANDs, 2 PSHUFBs) and a couple transfer/scan operations (PMOVMSKB and whatever bit scan operation you want) to eat 32 bytes. I'm figuring 10 operations. Depending on your workload you might care about reciprocal throughput or latency. Generally our work has been more throughput oriented so those 10 operations can be scheduled along with lots of others, but this isn't always the case.

The other way SIMD (or for that matter bit-twiddling) can be helpful is that you can cover a wider variety of sizes in a branch-free fashion, so if your sizes are distributed <32 bytes (or <8 for bit-twiddling) the code can be straight line with no branch misses. So a SIMD version of the code might be "slower" than a bit-twiddling version even if both sides always get 9 bytes, but if it's a mix of sizes and thus unpredictable, it might take fewer branch misses (unless some clown starts giving you 33 byte inputs half the time, of course).

Again, benchmark data is always a good cure for folkloric beliefs whether pro-SIMD ("SIMD is always fast, yay!") or anti-SIMD ("the overheads are never worth it").



People will keep laughing at me when I tell them to just produce a binary, pre-parsed format for their data.

Parsing will always be slow no matter what you do, and you can hardly parallelize it. Of course it's fast on a non-mobile CPU, but it will always be slow on a smartphone, hence the need for such a format, and it's the reason smartphones use apps instead of HTML+JS.

It's funny: computers are very fast, but batteries are not so good, so are we realizing that well-designed formats have an effect on battery life?

I've seen several articles about how webpages are bloated, but I don't see any real proposal for how to fix it.


The fastest file format is always the one you can just dump into memory and fix a few pointers. But these also tend to be the hardest formats to change/evolve...

Densely coded formats, such as msgpack, are neat for archival / long term storage, but their coding means that the unpackers are quite complex and not very fast (perhaps drawing some lessons from parallel-length x86 decoding could help implementations?); they tend to get slower as the data structure becomes more complex. Since the output format of these is cumbersome, applications will copy everything yet another time. If we're not using C/C++/Rust, they will also mean a lot of allocator activity. Plus, data validation is separate from format validation, which means at least another pass over the structure.

So purely from a performance perspective these formats are not that good (despite all their claims). Designing purpose-built formats is obviously superior, but it also takes quite a bit of time and means more parsers to test (and fuzz). Stuff like Capnproto might be an adequate middle ground; I'm not sure, I've never worked with it and don't know much about how they implemented it (I assume custom accessor code that mostly does nothing on x86/ARM).


> The fastest file format is always the one you can just dump into memory and fix a few pointers. But these also tend to be the hardest formats to change/evolve...

Flatbuffers is kind of like this. With untrusted data you just have to run the validation routine (that the code generator spits out) to check that all the internal offsets are contained within your buffer.

https://google.github.io/flatbuffers/
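
Roughly (using the tutorial's Monster schema as a stand-in; the Verify*/Get* functions are generated from your own schema, and namespaces are omitted here):

    #include <cstdint>
    #include "flatbuffers/flatbuffers.h"
    #include "monster_generated.h"  // produced by flatc from the schema

    bool try_load(const uint8_t* buf, size_t len) {
        // Walks every offset in the untrusted buffer and checks it stays in
        // bounds; after this, the generated accessors are safe zero-copy views.
        flatbuffers::Verifier verifier(buf, len);
        if (!VerifyMonsterBuffer(verifier)) return false;
        const Monster* monster = GetMonster(buf);  // zero-copy view into buf
        // ... read fields directly via monster->...() accessors ...
        return monster != nullptr;
    }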


> I've seen several articles about how webpages are bloated, but I don't see any real proposal for how to fix it.

Amp - Google = Fast webpages anyone can host.

Webpages are slow because of either fetch time (which can be addressed by not sending more data than needed) or processing time (which can be addressed by minimizing your time spent in JavaScript).

Pure HTML/CSS pages are always going to be fast. Just look at HN for an example. It's when you are asking users to download huge images/movies, or when JavaScript is used to do all of the DOM creation and fetching of assets, that things start to slow down.



I was about to suggest Protobuf, but it looks like it has evolved into this, and/or FlatBuffers.

https://developers.google.com/protocol-buffers



My first exposure to American Fuzzy Lop! What an amazingly easy tool to get running.


I've found that if you're in a situation where JSON parsing performance is becoming a bottleneck, then data/schema validation bottlenecks are probably not far behind.

I have several hot paths in server projects where parsing IP addresses, dates, and other such data types that JSON doesn't support eats cycles.

Or maybe nobody else bothers to validate their data.


Nobody has mentioned PCMPISTRI (and the other SSE4.2 string extensions), but they deserve a benchmark here. Some of them appear in my regular glibc now, and they're hard to beat.

I found this Intel article for XML (but not JSON): https://software.intel.com/en-us/articles/xml-parsing-accele...


Intel employee here, working in string and regex processing. SSE4.2 has its uses, but it is not always the fastest thing you can use. We don't use it anywhere in Hyperscan. SSE4.2 instructions are not particularly hard to beat with other sequences, and it's worth noting that these instructions have not been promoted to AVX2, much less AVX 512.


PCMPISTRI and friends are great, but they're not very friendly to the problem described in the article, since they can only check membership in a 16-element set.


Eh? PCMPISTRI has a few different modes of operation, including full substring search and character classes. You can, for example, use PCMPISTRI with a needle that contains adjacent ranges: `azAZ09` would check if any byte in the search string is in any of the ranges a-z, A-Z and 0-9.
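
A rough sketch of the ranges mode (illustrative; real code also has to worry about chunk length and embedded NULs, since the "I" variants are implicit-length):

    #include <nmmintrin.h>  // SSE4.2

    // Index (0..15) of the first byte in a 16-byte chunk that falls in
    // a-z, A-Z or 0-9, or 16 if no byte matches.
    static int first_alnum(const char* chunk) {
        const __m128i ranges = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>("azAZ09\0\0\0\0\0\0\0\0\0"));
        const __m128i data = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(chunk));
        return _mm_cmpistri(ranges, data,
                            _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES |
                            _SIDD_POSITIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT);
    }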

Regardless, in the OP, they're specifically looking for one of a small number of bytes, which is exactly what PCMPISTRI is supposed to be good for.

With that said, my experience mirrors glangdale's. Every time I've tried to use PCMPISTRI, it's either been slower than other methods or not enough of an improvement to justify it.


Oh, didn't know that! Thanks.


This is a cool story of micro-optimization. But now I wonder about comparing to other serializations entirely (protobuf, cap'n'proto, cbor) when not constrained to JSON.


If you really want it to be faster, get rid of the DOM. Most projects that use a parser just transfer the data from the DOM into project-specific data structures. You save memory and gain performance if you transfer the data directly into project data structures and bypass the DOM.
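
Something like a push/SAX-style interface (names invented for illustration), where callbacks fill your own structs instead of building a tree:

    #include <string>
    #include <string_view>

    struct ServerConfig { std::string host; int port = 0; };

    // Callbacks a streaming parser would invoke as it walks the input;
    // values land directly in ServerConfig, with no intermediate DOM.
    struct ConfigHandler {
        ServerConfig cfg;
        void on_string(std::string_view key, std::string_view value) {
            if (key == "host") cfg.host = std::string(value);
        }
        void on_number(std::string_view key, double value) {
            if (key == "port") cfg.port = static_cast<int>(value);
        }
    };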


Very interesting. I was looking for a comparison between ujson and sajson/rapidjson and I stumbled on this benchmark:

https://github.com/miloyip/nativejson-benchmark


Milo Yip (author of RapidJSON) has done fantastic work on validating the correctness of and benchmarking JSON parsers. The results listed there predate the current round of sajson optimizations and sajson's dynamic allocation mode, so it would be great to see them rerun.


For JSON parsing, I really like JSMN (https://github.com/zserge/jsmn/). It doesn't allocate memory, has a simple and easy to use API, and is pretty fast.
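
Typical usage (from memory, so treat as a sketch) - you hand it a fixed token array up front:

    #include <cstring>
    #include "jsmn.h"

    bool parse_example() {
        const char* js = "{\"key\": [1, 2, 3]}";
        jsmn_parser parser;
        jsmntok_t tokens[64];  // caller supplies all token storage up front
        jsmn_init(&parser);
        int count = jsmn_parse(&parser, js, std::strlen(js), tokens, 64);
        // Negative means error (e.g. JSMN_ERROR_NOMEM if 64 tokens isn't enough);
        // otherwise each token carries a type plus start/end offsets into js.
        return count >= 0;
    }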


I started out using JSMN for parsing JSON in our embedded devices at work, but the requirement to pre-allocate all the tokens turned out to be a problem - I was running out of memory when the configuration files got too large…

So I wrote a new parser that trades lower RAM usage for higher CPU usage: https://bitbucket.org/tlabwest/json-reader/overview

Sample usage:

  json_token_t json;
  json_validate(json_string, &json);
  
  json_token_t value;
  json_find(&json, "key1/subkey:2/value1", &value);
  
  char s[bufsize];
  json_decode_string(&value, s, bufsize);
  
  printf("%s\n", s);
  
  json_free(&json);


It's sadly not strict enough in what it accepts; I got a buffer overflow quite easily. So you probably shouldn't use it on user-supplied data.


In terms of pure syntactic sugar/ease of use, it's hard to beat nlohmann/json: https://github.com/nlohmann/json

It may not be as fast as these other guys but it's sure as heck a lot more intuitive and convenient. For casual JSON use cases, I've been recommending this library.


Where did he get that assembler output from? It uses "jump-if-less" and "load" instead of "jl" and "mov". Is that some Swift/LLVM syntax?


I think it's pseudo code.


Excellent write-up. I have actually never thought about how having pointers as locals instead of e.g. a class member is an optimization. Also, that goto-based state machine is wonderful: I'll probably be stealing that for my next project. I'm curious why you went with that instead of the more common switch statement approach though.
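
(For anyone who hasn't seen the pattern, the goto style looks roughly like this - not the article's actual code - with each state as a label and transitions as direct jumps, so there's no outer loop re-dispatching on a state variable:)

    // Skip whitespace, then consume digits; each label is a parser state.
    const char* skip_ws_then_digits(const char* p) {
    ws:
        if (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') { ++p; goto ws; }
        goto digits;
    digits:
        if (*p >= '0' && *p <= '9') { ++p; goto digits; }
        return p;
    }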


We recently wrote our own JSON parser for ETL-style operations on huge JSON streams. The idea was that you are still able to define your structs/classes, and annotated members are auto-parsed at full speed. Some SSE4 optimizations were used, and performance should be on par with or faster than RapidJSON (could not benchmark yet).

http://code.dlang.org/packages/asdf

As a simple tool you could try je, which is similar to a simple version of jq but extracts JSON objects as a CSV/TSV table. It's about 10x faster than jq. https://github.com/tamediadigital/je


I find it mildly amusing that people are competing on the fastest decoder for a horribly inefficient format. Don't get me wrong, I like e.g. ZX Spectrum demos, but the practical utility of such an endeavor is questionable.

But that aside, I wonder how this compares to the compressed indexes approaches (see https://www.youtube.com/watch?v=uA0Z7_4J7u8&t=2850s and http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf), obviously, once the index is built.


The practical utility is the reason for such work! JSON is an interchange format. When you work on a project that interfaces with others, the interchange format is likely outside of your control. So you have to do the best you can with that format.


Yeah, I know, but it was more of a lament. Although - if somebody is producing so much JSON for you that you need to care about the speed of parsing it, wouldn't it make sense for both of you to agree on a more efficient format? I mean, they would probably save money on generating it too.


Unless you are at either a very large scale, or a really small "embedded" scale, I think the time spent engineering a different interchange format would dwarf the savings from it, especially when you consider that the "bottleneck" of development is almost always the developers, and pulling them off to re-implement another format is time that they don't get to spend on the actual product.


There already exist formats other than JSON for data exchange. This wouldn't require reengineering; just integration with an existing library that meets your licensing needs.


I'd just like to point out that it's still engineering work to integrate a library into your code (it's a bit of a sore spot with me... I've been told to "just use a library, it shouldn't take any time at all" in the past, and apparently I'm still not over it!)

But regardless, it's still a non-zero amount of work, and in my experience choosing a mutual format isn't always a simple meeting...

Not to mention that if JSON is still needed elsewhere in your system (either internally, or externally with others) all you've done is add complexity to your application and spend developer time all for some savings on parsing speed.


But you already accepted the cost of this by playing this game - this particular library is in C, which might be an effort to integrate with a different programming language.

On the other hand, there are CBOR or MsgPack libraries that are fast and native for many languages.


This seems like a use-case where the Mill CPU architecture would work quite well.

With the Mill's loop pipelining and vector ops, this could be pretty well optimized.


Pretty sure you could do a single-instruction vectorized version of the C code on Gold with 2c latency, yeah.


What is a critical execution path when talking about a CPU? And why is it only the "increment p"?



