Building a high performance JSON parser (cheney.net)
532 points by davecheney on Nov 5, 2023 | 189 comments



Looks pretty good! Even though I've written far too many JSON parsers already in my career, it's really nice to have a reference for how to think about making a reasonable, fast JSON parser, going through each step individually.

That said, I will say one thing: you don't really need an explicit tokenizer for JSON. You can get rid of the concept of tokens and integrate parsing and tokenization entirely, which is what I usually do since it makes everything simpler. This is a lot harder to do for the rest of ECMAScript, since there you wind up needing look-ahead (sometimes arbitrarily large look-ahead... consider arrow functions: the parameter list is mostly a subset of the grammar of a parenthesized expression. Comma is an operator, and for default values, equal is an operator. It isn't until the => does or does not appear that you know for sure!)


What line of work are you in that you've "written far too many JSON parsers already" in your career?!!!


Reasons differ. C++ is a really hard place to be. It's gotten better, but if you can't tolerate exceptions, need code that is as obviously memory-safe as possible, or need to parse incrementally (think SAX style), off-the-shelf options like jsoncpp may not fit the bill.

Handling large documents is indeed another big one; it sort of fits in the same category as being able to parse incrementally. That said, Go has a JSON scanner you can sort of use for incremental parsing, but in practice I've found it to be a lot slower, so for large documents it's a problem.
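
For readers who haven't used it, incremental decoding with encoding/json's Decoder looks roughly like this; the element shape is made up for illustration:

    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    func main() {
        // Hypothetical input: a large array, decoded one element at a time
        // so memory use stays bounded by a single element.
        r := strings.NewReader(`[{"id":1},{"id":2},{"id":3}]`)
        dec := json.NewDecoder(r)

        // Consume the opening '[' delimiter.
        if _, err := dec.Token(); err != nil {
            panic(err)
        }
        for dec.More() {
            var v struct {
                ID int `json:"id"`
            }
            if err := dec.Decode(&v); err != nil {
                panic(err)
            }
            fmt.Println(v.ID)
        }
    }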

I've done a couple in hobby projects too. One time I did a partial one in Win32-style C89 because I wanted one that didn't depend on libc.


Large documents are often handled by mmap/VirtualAlloc-ing the file, but Boost.JSON has a streaming mode, is reasonably fast, and its license is good for pulling into anything. It's not the fastest, but it's faster than RapidJSON while having the interface of nlohmann JSON. For most tasks, most of the libraries taking a JSON document approach are wasting a lot of time/memory to get to the point that we want: normal data structures, not a JSON document tree. If we pull that out and parse straight to the data structures there is a lot of win in performance and memory, with less/no code, just mappings. That's how I approached it, at least.
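
In Go terms, that's the difference between decoding into a generic document tree and decoding straight into your own types; a minimal sketch, with a made-up Order type:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Order is a hypothetical application type; decoding straight into it
    // skips building an intermediate map[string]any document tree.
    type Order struct {
        ID    int     `json:"id"`
        Total float64 `json:"total"`
    }

    func main() {
        data := []byte(`{"id": 42, "total": 9.99}`)

        var tree map[string]any // the generic "JSON document" approach
        _ = json.Unmarshal(data, &tree)

        var o Order // straight to the data structure
        if err := json.Unmarshal(data, &o); err != nil {
            panic(err)
        }
        fmt.Println(o.ID, o.Total)
    }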


> most of the libraries taking a JSON document approach are wasting a lot of time/memory

I agree. That's the same situation as with XML/HTML: in many cases you don't really need to build a DOM or JSON object model in memory if your task is just deserializing some native structures.

This XML scanner of mine does not allocate any memory at all while parsing HTML/XML: https://www.codeproject.com/Articles/14076/Fast-and-Compact-...

It is even simpler than a SAX parser.


For the interesting JSON of significant size, an iterator/range interface that parses to concrete types works really well. Usually these are large arrays or JSONL-like things.


If you have files that are large enough that json is a problem, why use json in the first place? Why not use a binary format that will be more compact and easier to memory map?


Chances are they can’t control that; they’re perhaps provided by a vendor.


I've written JSON parsers because in one instance we had to allow users to keep their formatting but also edit documents programmatically. At the time I couldn't find parsers that did that, but it was a while back.

In another instance, it was easier to parse into some application-specific structures, skipping the whole intermediate generic step (for performance reasons).

With JSON it's easier to convince your boss that you can actually write such a parser because the language is relatively simple (if you overlook botched definitions of basically every element...) So, for example, if the application that uses JSON is completely under your control, you may take advantage of stupid decisions made by JSON authors to simplify many things. More concretely, you can decide that there will never be more than X digits in numbers. That you will never use "null". Or that you will always put elements of the same type into "lists". Or that you will never repeat keys in "hash tables".


I've seen "somebody doesn't agree with the standard and we must support it" way too many times, and I've written JSON parsers because of this. (And, of course, it's easy to end up with some difference from the JSON standard.)

I've had problems with handling streams like the OP on basically every programming language and data-encoding language pair that I've tried. It looks like nobody ever thinks about it (I do use chunking any time I can, but sometimes you can't).

There are probably lots and lots of reasons to write your own parser.


This reminds me of my favourite quote about standards.

>The wonderful thing about standards is that there are so many of them to choose from.

And, keeping with the theme, this quote may be from Grace Hopper, Andrew Tanenbaum, Patricia Seybold or Ken Olsen.


Probably anywhere that requires parsing large JSON documents. Off the shelf JSON parsers are notoriously slow on large JSON documents.


There are several that are into the GB/s of performance with various interfaces. Most are just trash for large documents and sit in the allocators far too long, but that's not required either.


What on Earth are you storing in JSON that this sort of performance issue becomes an issue?

How big is 'large' here?

I built a simple CRUD inventory program to keep track of one's gaming backlog and progress, and the dumped JSON of my entire 500+ game statuses is under 60kB and can be imported in under a second on decade-old hardware.

I'm having difficulty picturing a JSON dataset big enough to slow down modern hardware. Maybe Gentoo's portage tree if it were JSON encoded?


> What on Earth are you storing in JSON that this sort of performance issue becomes an issue?

I've been in the industry for a while. I've probably left more than one client site muttering "I've seen some things ...".

If it can be done, it will be done. And often in a way that shouldn't have even been considered at all.

Many times, "it works" is all that is needed. Not exactly the pinnacle of software design. But hey, it does indeed "work"!


Insurance price transparency can have 16 GB of compressed JSON that represents a single object.

Here is the Anthem page. The TOC link is 16 GB.

https://www.anthem.com/machine-readable-file/search/

They are complying with the mandate, but not optimizing for the parsers.


I've seen people dump and share entire databases in JSON format at my job....


I've seen tens of millions of market data events from a single day of trading encoded in JSON and used in various post-trade pipelines.


Ah, that's a dataset of certainly intimidating size, and in an environment where performance means money. Thanks for pointing that out!


In my case, sentry events that represent crash logs for Adobe Digital Video applications. I’m trying to remember off the top of my head, but I think it was in the gigabytes for a single event.


Chrome trace format files also use JSON and can also become large and are a pain to work with.


Not necessarily. For example, Newtonsoft is fine with multiple hundreds of megabytes if you use it correctly. But of course it depends on how large we are talking about.


Someone misunderstood the JSONParserFactory somewhere along the line.


The walkthrough is very nice: how to do this if you're going to do it.

If you're going for pure performance in a production environment you might take a look at Daniel Lemire's work: https://github.com/simdjson/simdjson. Or the MinIO port of it to Go: https://github.com/minio/simdjson-go.


If your JSON always looks the same you can also do better than general JSON parsers.


Andreas Fredriksson demonstrates exactly that in this video: https://vimeo.com/644068002


I really enjoyed this video even though he lost me with the SIMD code.


I like this video because there's a lot of good actionable advice before he gets into SIMD code.


Very enjoyable video!


You might also move to something other than JSON if parsing it is a significant part of your workload.


Most of the times I’ve had to deal with JSON performance issues, it involved a 3rd party API and JSON was the only option.

If you’re building something net-new and know you’ll have these problems out the gate, something other than JSON might be feasible, but the moment some other system not in the closed loop needs to work with the data, you’re back to JSON and any associated perf issues.


I wonder: can fast, special-case JSON parsers be dynamically autogenerated from JSON Schemas?

Perhaps some macro-ridden Rust monstrosity that spits out specialised parsers at compile time, dynamically…


For JSON Schema specifically there are some tools like go-jsonschema [1], but I've never used them personally. You can also use something like ffjson [2] in Go to generate a static serialize/deserialize function based on a struct definition.

[1] https://github.com/omissis/go-jsonschema [2] https://github.com/pquerna/ffjson


Hey, go-jsonschema is my project. (Someone else just took over maintaining it, though.) It still relies on the standard Go parser; all it does is generate structs with the right types and tags.


A fundamental problem with JSON parsing is that it has variable-length fields that don't encode their length; in a streaming scenario you basically need to keep resizing your buffer until the data fits. If the data is on disk and not streaming you may get away with reading ahead to find the end of the field first, but that's also not particularly fast.

Schemas can't fix that.


Why couldn't they? Schemas can carry exactly that information. E.g. JSON Schema lets you define max and min lengths on variable-sized things. You can avoid all dynamic resizing if you're careful enough.

I'll definitely agree that most things won't fully take advantage of that even if you provide that information, but it is definitely possible to do so.
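
For instance, a schema along these lines puts hard bounds on every variable-length field, which a sufficiently careful generated parser could turn into fixed-size buffers (the field names are illustrative):

    {
        "type": "object",
        "required": ["name"],
        "properties": {
            "name": { "type": "string", "maxLength": 64 },
            "tags": {
                "type": "array",
                "maxItems": 16,
                "items": { "type": "string", "maxLength": 32 }
            }
        }
    }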


Unless you have fixed field lengths, you're still doing twice the work either scanning or resizing the buffer (or over-allocating memory I guess).

That said, JSON is designed for human readability above performance, so it's a design concession that makes sense. What doesn't make sense is using JSON anywhere performance matters.


Only if you are using pointers/slices into the buffer as an optimisation.

Otherwise there is no need to keep a buffer of anything after it has been parsed.


I'm talking about during parsing.

Let's assume I send you a JSON object that is one very long string and nothing else. It's e.g. 1 GB in size. To know you need to allocate a 1GB buffer, you need to first scan it, and then copy it; or keep reallocating the same buffer until it fits.

It's an absurd case, but shorter strings face similar overhead.


Doesn't the serde crate's JSON support do precisely this? It generates structs that have optionals in all the right places and with all the right types anyway. Seems like the LLVM optimiser can probably do something useful with that even if the serde feature isn't using a priori knowledge from the schema.


Somewhat tangentially related, Fabian Iwand posted this regex prefix tree visualiser/generator last week [0], which may offer some inspiration for prototyping auto generated schemas.



You forgot to include the link?


It's relatively common in D applications to use the compile-time capabilities to generate a parser at compile time.


Pydantic does that to some extent, I think.


Last time I compared the performance of various json parsers the simd one turned out to be disappointingly slow.


The fastest JSON lib in Go is the one from the company behind TikTok (ByteDance's Sonic).



Fastest at what?


> For all sizes of json and all scenarios of usage, Sonic performs best.

The repository has benchmarks


I’m not seeing simdjson in them though? I must be missing something because the Go port of it is explicitly mentioned in the motivation[1] (not the real thing, though).

[1] https://github.com/bytedance/sonic/blob/main/docs/INTRODUCTI...


Excellent treat vector.


simdjson has not been the fastest for a long long time


What is faster? According to https://github.com/kostya/benchmarks#json nothing is.


My own lessons from writing fast JSON parsers include a lot of language-specific things, but here are some generalisations:

Avoid heap allocations in tokenising. Have a tokeniser that is a function returning either a stack-allocated struct or an int64 token that is a packed field describing the start offset, length and type of the token.
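
A rough sketch of such a packed token in Go; the bit widths here are arbitrary choices for illustration, not anything from the post:

    package token

    // Token packs a kind tag, a length and a start offset into one int64,
    // so the tokenizer returns a plain value and never touches the heap.
    type Token int64

    const (
        KindString = iota
        KindNumber
        KindBool
        KindNull
        KindDelim
    )

    // Pack stores kind in the top 8 bits, length in the next 24 bits and
    // the start offset in the low 32 bits.
    func Pack(kind, start, length int) Token {
        return Token(int64(kind)<<56 | int64(length)<<32 | int64(start))
    }

    func (t Token) Kind() int   { return int(t >> 56) }
    func (t Token) Length() int { return int(t>>32) & 0xFFFFFF }
    func (t Token) Start() int  { return int(int64(t) & 0xFFFFFFFF) }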

Avoid heap allocations in parsing: support a getString(key String) type interface for clients that want to chop up a buffer.

For deserialising to objects where you know the fields at compile time, generally generate a switch on key length before comparing string values.
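
The key-length trick looks roughly like this; the struct and field names are invented for the example:

    package fields

    import "strconv"

    type Point struct {
        X, Y  int
        Label string
    }

    // setField matches a key against the known fields, switching on the
    // key's length first so most non-matching keys are rejected without
    // a string comparison. This is the shape a code generator would emit.
    func setField(dst *Point, key, val string) {
        switch len(key) {
        case 1:
            switch key {
            case "x":
                dst.X, _ = strconv.Atoi(val)
            case "y":
                dst.Y, _ = strconv.Atoi(val)
            }
        case 5:
            if key == "label" {
                dst.Label = val
            }
        }
    }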

My experience in data pipelines that process lots of JSON is that the choice of JSON library can be a 3-10x performance difference, and that all the main parsers want to allocate objects.

If the classes you are serialising or deserialising are known at compile time then Jackson for Java does a good job, but you can get a 2x boost with careful coding and profiling.

Whereas if you are parsing arbitrary JSON, all the mainstream parsers want to do lots of allocations that a more intrusive parser you write yourself can avoid, and you can make massive performance wins if you are processing thousands or millions of objects per second.


I've taken a very similar approach and built a GraphQL tokenizer and parser (amongst many other things) that's also zero memory allocations and quite fast. In case you'd like to check out the code: https://github.com/wundergraph/graphql-go-tools


You might also want to check out this abomination of mine: https://github.com/graph-guard/gqlscan

I held a talk about this; unfortunately it wasn't recorded. I tried to squeeze as much out of Go as I could and went crazy doing that :D


It's a bit verbose.


How big of an issue is this for GQL servers where all queries are known ahead of time (allowlist), i.e. you can cache/memoize the AST parsing and this is only a perf issue for a few minutes after the container starts up?

Or does this bite us in other ways too?


I've been building GraphQL API gateways / routers for 5+ years now. It would be nice if trusted documents or persisted operations were the default, but the reality is that a lot of people want to open up their GraphQL to the public. For that reason we've built a fast parser, validator, normalizer and many other things to support these use cases.


In n2[1] I needed a fast tokenizer and had the same "garbage factory" problem, which is basically that there's a set of constant tokens (like json.Delim in this post) and then strings which cause allocations.

I came up with what I think is a kind of neat solution, which is that the tokenizer is generic over some T and takes a function from byteslice to T and uses T in place of the strings. This way, when the caller has some more efficient representation available (like one that allocates less) it can provide one, but I can still unit test the tokenizer with the identity function for convenience.

In a sense this is like fusing the tokenizer with the parser at build time, but the generic allows layering the tokenizer such that it doesn't know about the parser's representation.

[1] https://github.com/evmar/n2
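
The shape of that idea, sketched in Go generics rather than n2's Rust (all names invented):

    package tok

    // Token is either a fixed delimiter or a caller-defined string
    // representation T, so the tokenizer never decides how strings
    // are stored (or whether they allocate).
    type Token[T any] struct {
        Delim byte // one of '{', '}', '[', ']', ':', ',' when non-zero
        Str   T    // set when Delim is zero
    }

    // Tokenizer converts raw byte slices to T with a caller-supplied
    // function, e.g. an interning function that avoids allocations.
    type Tokenizer[T any] struct {
        conv func([]byte) T
    }

    func New[T any](conv func([]byte) T) *Tokenizer[T] {
        return &Tokenizer[T]{conv: conv}
    }

In unit tests the identity function, func(b []byte) string { return string(b) }, keeps things simple, which matches the trick described above.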


It's possible to improve over the standard library with better API design, but it's not really possible to do a fully streaming parser that doesn't half fill structures before finding an error and bailing out in the middle, which is another explicit design constraint for the standard library.


Maybe I overlooked something, but the author keeps repeating that they wrote a "streaming" parser without ever explaining what that actually means. In particular, they never explain how they deal with repeated keys in "hash tables". What does their parser do? Call the "sink" code twice with the repeated key? Wait until the entire "hash table" is read and then call the "sink" code?

In my mind, JSON is inherently inadequate for streaming because of its hierarchical structure, no length known upfront and, most importantly, repeating keys. You could probably make a subset of JSON more streaming-friendly, but at that point, why bother? I mean, if the solution is to modify JSON, then a better solution would be something that's not JSON at all.


Great to see a shout out to Phil Pearl! Also worth looking at https://github.com/bytedance/sonic


I'm surprised there's no way to say 'I really mean it, inline this function' for the stuff that didn't inline because it was too big.

The baseline whitespace count/search operation seems like it would be MUCH faster if you vectorized it with SIMD, but I can understand that being out of scope for the author.


Of course you can force-inline.


Obviously you can manually inline functions. That's what happened in the article.

The comment is about having a directive or annotation to make the compiler inline the function for you, which Go does not have. IMO, the pre-inlined code was cleaner. It's a shame that the compiler could not optimize it.

There was once a proposal for this, but it's really against Go's design as a language.

https://github.com/golang/go/issues/21536


You can in any systems programming language.

Go is mostly a toy language for cloud people.


> toy language

You may be surprised to hear that Go is used in a ton of large scale critical systems.


I don't consider cloud technology a critical system.


"It’s unrealistic to expect to have the entire input in memory" -- wrong for most applications


Most applications read JSONs from networks, where you have a stream. Buffering and fiddling with the whole request in memory increases latency by a lot, even if your JSON is smallish.


Most (most) JSON payloads are probably much smaller than many buffer sizes, so they just end up all in memory anyway.


On a carefully built WebSocket server you would ensure your WebSocket messages all fit within a single MTU.


Yes but for applications where you need to do ETL style transformations on large datasets, streaming is an immensely useful strategy.

Sure you could argue go isn’t the right tool for the job but I don’t see why it can’t be done with the right optimizations like this effort.


If performance is important why would you keep large datasets in JSON format?


Because you work at or for some bureaucratic MegaCorp that does weird things with no real logic behind them other than clueless Dilbert managers making decisions based on LinkedIn blogs. Alternatively, desperate IT consultants trying to get something to work with too low a budget and/or no access to do things the right way.

Be glad you have JSON to parse, and not EDI, some custom delimited data format (with no or outdated documentation), or, shudders, that you work in the airline industry with SABRE.


sometimes it's not your data


Usually because the downstream service or store needs it


If you're building a library you either need to explicitly call out your limits or do streaming.

I've pumped gigs of JSON data, so a streaming parser is appreciated. Plus, streaming shows the author is better at engineering and is aware of the various use cases.

Memory is not cheap or free except in theory.


Here people confidently keep repeating "streaming JSON". What do you mean by that? I'm genuinely curious.

Do you mean XML SAX-like interface? If so, how do you deal with repeated keys in "hash tables"? Do you first translate JSON into intermediate objects (i.e. arrays, hash-tables) and then transform them into application-specific structures, or do you try to skip the intermediate step?

I mean, streaming tokens is kind of worthless on its own. If you are going for a SAX-like interface, you want to be able to go all the way with streaming (i.e. no layer of the code that reads JSON "accumulates" data, especially not indefinitely, until it can be sent to the layer above).


> If so, how do you deal with repeated keys in "hash tables"?

depending on the parser, behaviour might differ. But looking at https://stackoverflow.com/questions/21832701/does-json-synta... , it seems like the "best" option is to have 'last key wins' as the resolution.

This works fine under a SAX like interface in a streaming JSON parser - your 'event handler' code will execute for a given key, and a 2nd time for the duplicate.
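
For what it's worth, Go's encoding/json behaves this way when decoding duplicate keys into a map:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        var m map[string]string
        _ = json.Unmarshal([]byte(`{"k": "first", "k": "second"}`), &m)
        fmt.Println(m["k"]) // prints "second": the last duplicate wins
    }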


> This works fine

This is a very strange way of using the word "fine"... What if the value that lives in the key triggers some functionality in the application that should never happen due to the semantics you just botched by executing it?

Example:

    {
      "commands": {
        "bumblebee": "rm -rf /usr",
        "bumblebee": "echo 'I have done nothing wrong!'"
      }
    }
With the obvious way to interpret this...

So, you are saying that it's "fine" for an application to execute the first followed by second, even though the semantics of the above are that only the second one is the one that should have an effect?

Sorry, I have to disagree with your "works fine" assessment.


you're layering the application semantics into the transport format.

It's fine, in the sense that a JSON with duplicate keys is already invalid - but the parser might handle it, and I suggested a way (just from reading the Stack Overflow answer).

It's the same "fine" that you get from undefined C compiler behaviour.


Why do you keep inventing stuff... No, JSON with duplicate keys is not invalid. The whole point of streaming is to be able to process data before it completely arrived. What "layering semantics" are you talking about?

This has no similarity with undefined behavior. This is documented and defined.


A JSON object with duplicate keys is explicitly defined by the spec as undefined behavior, and is left up to the individual implementation to decide what to do. It's neither valid nor invalid.


last key wins is terrible advice and has serious security implications.

see https://bishopfox.com/blog/json-interoperability-vulnerabili... or https://www.cvedetails.com/cve/CVE-2017-12635/ for concrete examples where this treatment causes security issues.

RFC 7493 (https://datatracker.ietf.org/doc/html/rfc7493) defines a stricter format where duplicate keys are not allowed.


Last key wins is the most common behavior among widely-used implementations. It should be assumed as the default.


I guess it's all relative. Memory is significantly cheaper if you get it anywhere but on loan from a cloud provider.


RAM is always expensive no matter where you get it from.

Would you rather do two hours of work or force thousands of people to buy more RAM because your library is a memory hog?

And on embedded systems RAM is at a premium. More RAM = more cost.


If you can live with "fits on disk" mmap() is a viable option? Unless you truly need streaming (early handling of early data, like a stream of transactions/operations from a single JSON file?)
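
A minimal sketch of that on Linux/macOS, assuming golang.org/x/sys/unix and a hypothetical big.json (error handling trimmed):

    package main

    import (
        "fmt"
        "os"

        "golang.org/x/sys/unix"
    )

    func main() {
        f, err := os.Open("big.json") // hypothetical file
        if err != nil {
            panic(err)
        }
        defer f.Close()
        st, _ := f.Stat()

        // Map the file read-only; the kernel pages it in on demand, so the
        // parser can treat it as one big []byte without reading it up front.
        data, err := unix.Mmap(int(f.Fd()), 0, int(st.Size()),
            unix.PROT_READ, unix.MAP_SHARED)
        if err != nil {
            panic(err)
        }
        defer unix.Munmap(data)

        fmt.Println(len(data)) // hand `data` to the parser here
    }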


In general, JSON comes over the network, so MMAP won't really work unless you save to a file. But then you'll run out of disk space.

I mean, you have a 1k, 2k, 4k buffer. Why use more, because it's too much work?


Is the HTTP request body part of the input?



You might want to take a look at https://github.com/ohler55/ojg. It takes a different approach with a single pass parser. There are some performance benchmarks included on the README.md landing page.


> Any (useful) JSON decoder code cannot go faster that this.

That line feels like a troll. Cunningham’s Law in action.

You can definitely go faster than 2 Gb/sec. In a word, SIMD.


we could re-frame by distinguishing problem statements from implementations

Problem A: read a stream of bytes, parse it as JSON

Problem B: read a stream of bytes, count how many bytes match a JSON whitespace character

Problem B should require fewer resources* to solve than problem A. So in that sense problem B is a relaxation of problem A, and a highly efficient implementation of problem B should be able to process bytes much more efficiently than an "optimal" implementation of problem A.

So in this sense, we can probably all agree with the author that counting whitespace bytes is an easier problem than the full parsing problem.

We're agreed that the author's implementation (half a page of go code that fits on a talk slide) to solve problem B isn't the most efficient way to solve problem B.

I remember reading somewhere the advice that to set a really solid target for benchmarking, you should avoid measuring the performance of implementations and instead try to estimate a theoretical upper bound on performance, based on say a simplified model of how the hardware works and a simplification of the problem -- that hopefully still captures the essence of what the bottleneck is. Then you can compare any implementation to that (unreachable) theoretical upper bound, to get more of an idea of how much performance is still left on the table.

* for reasonably boring choices of target platform, e.g. amd64 + ram, not some hypothetical hardware platform with surprisingly fast dedicated support for JSON parsing and bad support for anything else.


Everything you said is totally reasonable. I'm a big fan of napkin math and theoretical upper bounds on performance.

simdjson (https://github.com/simdjson/simdjson) claims to fully parse JSON on the order of 3 GB/sec. Which is faster than OP's Go whitespace parsing! These tests are running on different hardware so it's not apples-to-apples.

The phrase "cannot go faster than this" is just begging for a "well ackshully". Which I hate to do. But the fact that there is an existence proof of Problem A running faster in C++ SIMD than OP's Problem B scalar Go is quite interesting and worth calling out imho. But I admit it doesn't change the rest of the post.


"But there is a better trick that we can use that is more space efficient than this table, and is sometimes called a computed goto."

From 1989:

https://raw.githubusercontent.com/spitbol/x32/master/docs/sp...

"Indirection in the Goto field is a more powerful version of the computed Goto which appears in some languages. It allows a program to quickly perform a multi-way control branch based on an item of data."


I remember reading a SO question which asks for a C library to parse JSON. A comment was like - C developers won't use a library for JSON, they will write one themselves.

I don't know how "true" that comment is but I thought I should try to write a parser myself to get a feel :D

So I wrote one, in Python - https://arunmani.in/articles/silly-json-parser/

It was a delightful experience though, writing and testing to break your own code with a wide variety of inputs. :)


> I remember reading a SO question which asks for a C library to parse JSON. A comment was like - C developers won't use a library for JSON, they will write one themselves.

> I don't know how "true" that comment is

Either way it's a good way to get a pair of quadratic loops in your program: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...


I wrote a small JSON parser in C myself which I called jsoncut. It just cuts out a certain part of a json file. I deal with large JSON files, but want only to extract and parse certain parts of it. All libraries I tried parse everything, use a lot of RAM and are slow.

Link here, if interested to have a look: https://github.com/rgex/jsoncut


The words you’re looking for are SAX-like JSON parser or streaming json parser. I don’t know if there’s any command line tools like the one you wrote that use it though to provide a jq-like interface.


I tried JQ and other command line tools, all were extremely slow and seemed to always parse the entire file.

My parser just reads the file byte by byte until it finds the target, then outputs the content. When that's done it stops reading the file, meaning that it can be extremely fast when the targeted information is at the beginning of the JSON file.


You're still describing a SAX parser (i.e. streaming). jq doesn't use a SAX parser because it's a multi-pass document editor at its core, hence why I said "jq-like" in terms of supporting a similar syntax for single-pass queries. If you used RapidJSON's SAX parser in the body of your custom code (returning false once you found what you're looking for), I'm pretty sure it would significantly outperform your custom hand-rolled code. Of course, your custom code is very small with no external dependencies and presumably fast enough, so tradeoffs.


I guess there are only so many ways to write a JSON parser, because one I wrote on a train in Python looks very similar!

I thought it would be nice and simple but it really was still simpler than I expected. It's a fantastic spec if you need to throw one together yourself, without massive performance considerations.


Good for you but what does this have to do with the article?


I remember this JSON benchmark page from RapidJSON [1].

[1] https://rapidjson.org/md_doc_performance.html


I've recently held a talk (https://youtu.be/a7VBbbcmxyQ?si=0fGVxfc4qmKMVCXk) about github.com/romshark/jscan that I've been working on. It's a performance-oriented JSON iterator / tokenizer you might want to take a look at if interested in high performance zero allocation JSON parsing in Go.


This is fantastically useful.

Funny enough I stumbled upon your article just yesterday through google search.


This is a very poor and overly simplified text on writing basic JSON parsers, not touching any of the topics of writing actually fast JSON parsers, such as non-copying tokenizers (e.g. jsmn), word-wise tokenizers (simdjson) and fast numeric conversions (fast_double_parser et al.).


A person who helped me out a lot when I was learning to code wrote his own .NET JSON library because the MS provided one had a rough API and was quite slow.

His lib became the de facto JSON lib in .NET dev and, naturally, MS head-hunted him.

Fast JSON is that important these days.


Writing a json parser is definitely an educational experience. I wrote one this summer for my own purposes that is decently fast: https://github.com/nwpierce/jsb


Can someone explain to me why JSON can't have comments or trailing commas? I really hope the performance gains are worth it, because I've lost 100s of man-hours to those things, and had to resort to stuff like this in package.json:

  "IMPORTANT: do not run the scripts below this line, they are for CICD only": true,


It can't have comments because it didn't originally have comments, so now it's too late. And it originally didn't have comments because Douglas Crockford thought they could be abused for parsing instructions.

As for not having trailing commas, it's probably a less intentional bad design choice.

That said, if you want commas and comments, and control the parsers that will be used for your JSON, then use JSONC (JSON with comments). VSCode for example does that for its JSON configuration.


JSONC also supports trailing commas. It is, in effect, "JSON with no downsides".

TOML/Yaml always drive me batty with all their obscure special syntax. Whereas it's almost impossible to look at a formatted blob of JSON and not have a very solid understanding of what it represents.

The one thing I might add is multiline strings with `'s, but even that is probably more trouble than it's worth, as you immediately start going down the path of "well let's also have syntax to strip the indentation from those strings, maybe we should add new syntax to support raw strings, ..."


Amusingly, it originally did have comments. Removing comments was the one change Crockford ever made to the spec[1].

[1] https://web.archive.org/web/20150105080225/https://plus.goog... (thank you Internet Archive for making Google’s social network somewhat accessible and less than useless)


Does JSONC have a specification or formal definition? People have suggested[1] using JSON5[2] instead for that reason

[1] https://github.com/microsoft/vscode/issues/100688

[2] https://spec.json5.org/


Unfortunately, JSON5 says keys can be ES5 IdentifierName[1]s, which means you must carry around Unicode tables. This makes it a non-option for small devices, for example. (I mean, not really, you technically could fit the necessary data and code in low single-digit kilobytes, but it feels stupid that you have to. Or you could just not do that but then it’s no longer JSON5 and what was the point of having a spec again?)

[1] https://es5.github.io/x7.html#x7.6


Or take the SQLite route, if you don't need to reject strictly invalid JSON5. When they extended SQLite's JSON parser to support JSON5, they specifically relaxed the definition of unquoted keys (in a compatible way) to avoid Unicode tables[1]:

> Strict JSON5 requires that unquoted object keys must be ECMAScript 5.1 IdentifierNames. But large unicode tables and lots of code is required in order to determine whether or not a key is an ECMAScript 5.1 IdentifierName. For this reason, SQLite allows object keys to include any unicode characters greater than U+007f that are not whitespace characters. This relaxed definition of "identifier" greatly simplifies the implementation and allows the JSON parser to be smaller and run faster.

[1]: https://www.sqlite.org/json1.html


Or take the XML route. Unicode offers several strategies to avoid continuously updating the Unicode table [1], which have been adopted by XML 1.1 and also later editions of XML 1.0. The actual syntax can be as simple as the following (based on XML, converted to ABNF as in RFC 8295):

    name = name-start-char *name-char
    name-start-char =
        %x41-5A / %x5F / %x61-7A / %xC0-D6 / %xD8-F6 / %xF8-2FF / %x370-37D /
        %x37F-1FFF / %x200C-200D / %x2070-218F / %x2C00-2FEF / %x3001-D7FF /
        %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
    name-char = name-start-char / %x30-39 / %xB7 / %x300-36F / %x203F-2040
In my opinion, this approach is so effective that I believe every programming language willing to support Unicode identifiers but nothing more complex (e.g. case folding, normalization or confusable detection) should use this XML-based syntax. You don't even need to narrow it down, because Unicode explicitly avoided allocating identifier characters outside of those ranges due to the very existence of XML identifiers!

[1] https://unicode.org/reports/tr31/#Immutable_Identifier_Synta...


Yeah, definitely for small devices. For things like VS Code's configuration file format (the parent comment) or other such use cases, I don't see a problem


If you want commas and comments, another term to search for is JWCC, which literally stands for "JSON With Commas and Comments". HuJSON is another name for it.


Yeah if you could get NPM to allow JSONC in package.json, that'd be great.


It’s not that it “can’t”, more like it “doesn’t”. Douglas Crockford prioritized simplicity when specifying JSON. Its BNF grammar famously fits on one side of a business card.

Other flavors of JSON that include support for comments and trailing commas exist, but they are reasonably called by different names. One of these is YAML (mostly a superset of JSON). To some extent the difficulties with YAML (like unquoted ‘no’ being a synonym for false) have vindicated Crockford’s priorities.


There's no real reason for that. It just happened like that. These aren't the only bad decisions made by the JSON author, and not the worst either.

What you can do is: write comments using pound sign, and rename your files to YAML. You will also get a benefit of a million ways of writing multiline strings -- very confusing, but sometimes useful.


I don't know the historic reason why it wasn't included in the original spec, but at this point it doesn't matter. JSON is entrenched and not going to change.

If you want comments, you can always use jsonc.


I wrote a Rust library that works similarly to the author's byteReader: https://crates.io/crates/fixed-buffer


These are always interesting to read because you get to see runtime quirks. I'm surprised there was so much function call overhead, for example. And it's interesting that you can bypass range checking.

The most important thing, though, is the process: measure then optimize.


This was a very good read, and I did learn some nice tricks, thank you very much.


I notice 'sample.json' contains quite a few escaped nulls \u0000 inside quoted strings.

Is "\u0000" legal JSON?

P.S. ... and many other control characters < \u0020


My favorite bit about this is his reference to John Ousterhout: define errors out of existence. youtu.be/bmSAYlu0NcY?si=WjC1ouEN1ad2OWjp&t=1312

Note the distinct lack of:

    if err != nil {


How is this compared to Daniel Lemire's simdjson? https://github.com/simdjson/simdjson


In what cases would an application need to regularly parse gigabytes of JSON? Wouldn't it be advantageous for the app to get that data into a DB?



Wish I wasn't 4 or 5 uncompleted projects deep right now and had the time to rewrite a monkey parser using all these tricks.


what does this bring over goccy's json encoder?


Nowadays I am more interested in a "forgiving" JSON/YAML parser that would recover from LLM errors. Is there such a thing?


Perhaps not quite what you're asking for, but along the same lines there's this "Incomplete JSON" parser, which takes a string of JSON as it's coming out of an LLM and parses it into as much data as it can get. Useful for building streaming UIs; for instance, it is used on https://rexipie.com quite extensively.

https://gist.github.com/JacksonKearl/6778c02bf85495d1e39291c...

Some example test cases:

    { input: '[{"a": 0, "b":', output: [{ a: 0 }] },
    { input: '[{"a": 0, "b": 1', output: [{ a: 0, b: 1 }] },

    { input: "[{},", output: [{}] },
    { input: "[{},1", output: [{}, 1] },
    { input: '[{},"', output: [{}, ""] },
    { input: '[{},"abc', output: [{}, "abc"] },
Work could be done to optimize it, for instance to add streaming support. But the cycles consumed either way are minimal for LLM-output-length-constrained JSON.

Fun fact: as best I can tell, GPT-4 is entirely unable to synthesize code to accomplish this task. Perhaps that will change as this implementation is made public, I do not know.


If the LLM did such a bad job that the syntax is wrong, do you really trust the data inside?

Forgiving parsers/lexers are common in compilers for languages like Rust, C# or TypeScript; you may want to investigate TypeScript in particular since it's applicable to JSON syntax. Maybe you could repurpose their parser.


I feel like trying to infer valid JSON from invalid JSON is a recipe for garbage. You’d probably be better off doing a second pass with the “JSON” through the LLM but, as the sibling commenter said, at this point even the good JSON may be garbage …


The jsonrepair tool https://github.com/josdejong/jsonrepair might interest you. It's tailored to fix JSON strings.

I've been looking into something similar for handling partial JSONs, where you only have the first n chars of a JSON. This is common with LLMs with streamed outputs aimed at reducing latency. If one knows the JSON schema ahead of time, then one can start processing these first fields before the remaining data has fully loaded. If you have to wait for the whole thing to load there is little point in streaming.

Was looking for a library that could do this parsing.


See my sibling comment :)


halloween was last week


[flagged]


> “Json" and "Go" seem antithetical in the same sentence as "high performance" to me

Absolute highest performance is rarely the highest priority in designing a system.

Of course we could design a hyper-optimized, application specific payload format and code the deserializer in assembly and the performance would be great, but it wouldn’t be useful outside of very specific circumstances.

In most real world projects, performance of Go and JSON is fine and allow for rapid development, easy implementation, and flexibility if anything changes.

I don’t think it’s valid to criticize someone for optimizing within their use case.

> I feel like it should deserve a mention of what the stack is, otherwise there is no reference point.

The article clearly mentions that this is a GopherCon talk in the header. It was posted on Dave Cheney’s website, a well-known Go figure.

It’s clearly in the context of Go web services, so I don’t understand your criticisms. The context is clear from the article.

The very first line of the article explains the context:

> This talk is a case study of designing an efficient Go package.


This is an article about optimizing a JSON parser in Go.

As always, try to remember that people usually aren't writing (or posting their talks) specifically for an HN audience. Cheney clearly has an audience of Go programmers; that's the space he operates in. He's not going to title his post, which he didn't make for HN, just to avoid a language war on the threads here.

It's our responsibility to avoid the unproductive language war threads, not this author's.


Even sticking within the confines of JSON there's low-hanging fruit, e.g. if your data is typically like:

    [{"k1": true, "k2": [2,3,4]}, {"k1": false, "k2": []}, ...]
You can amortize the overhead of the keys by turning this from an array of structs (AoS) into a struct of arrays (SoA):

    {"k1": [true, false, ...], "k2": [[2,3,4], [], ...]}
Then you only have to read "k1" and "k2" once, instead of once per record. Presumably there will be the odd record that contains something like {"k3": 0} but you can use mini batches of SoA and tune their size according to your desired latency/throughput tradeoff.

Or if your data is 99.999% of the time just pairs of k1 and k2, turn them into tuples:

    {"k1k2": [true,[2,3,4],false,[], ...]}
And then 0.001% of the time you send a lone k3 message:

    {"k3": 2}
Even if your endpoints can't change their schema, you can still trade latency for throughput by doing the SoA conversion, transmitting, then converting back to AoS at the receiver. Maybe worthwhile if you have to forward the message many times but only decode it once.
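
A sketch of doing that AoS-to-SoA conversion at the sender in Go; the record types are invented for the example:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Rec is the original array-of-structs record shape.
    type Rec struct {
        K1 bool  `json:"k1"`
        K2 []int `json:"k2"`
    }

    // Cols is the struct-of-arrays shape: each key is encoded exactly once.
    type Cols struct {
        K1 []bool  `json:"k1"`
        K2 [][]int `json:"k2"`
    }

    func toCols(recs []Rec) Cols {
        var c Cols
        for _, r := range recs {
            c.K1 = append(c.K1, r.K1)
            c.K2 = append(c.K2, r.K2)
        }
        return c
    }

    func main() {
        recs := []Rec{{true, []int{2, 3, 4}}, {false, []int{}}}
        out, _ := json.Marshal(toCols(recs))
        fmt.Println(string(out)) // {"k1":[true,false],"k2":[[2,3,4],[]]}
    }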


JSON is probably the fastest serialization format to produce and parse that is also safe for public use, compared to binary formats, which often have fragile, highly specific and vulnerable encodings because they're directly plopped into memory and used as-is (i.e. they're not parsed at all; it's just two computers exchanging memory dumps).

Compare it with XML for example, which is a nightmare of complexity if you actually follow the spec and not just make something XML-like.

We have some formats which try to walk the boundary between safe/universal and fast like ASN.1 but those are obscure at best.


I prefer msgpack if the data contains a lot of numeric values. Representing numbers as strings like in JSON can blow up the size and msgpack is usually just as simple to use.


Msgpack is nice and simple sure.


JSON is often the format of an externally provided data source, and you don't have a choice.

And whatever language you're writing in, you usually want to do what you can to maximize performance. If your JSON input is 500 bytes it probably doesn't matter, but if you're intaking a 5 MB JSON file then you can definitely be sure the performance does.

What more do you need to know about "the stack" in this case? It's whenever you need to ingest large amounts of JSON in Go. Not sure what could be clearer.


Exactly, it's not too hard to implement in C. The one I made never copied data; instead it saved the pointer/length to the data. The user only had to memory-map the file (or equivalent) and pass that data into the parser. The only memory allocation was for the JSON nodes.

This way the user only paid the parsing tax (decoding doubles, etc.) if they actually used that data.

You hit the nail on the head


The linked article is a GopherCon talk.

The first line of the article explains the context of the talk:

> This talk is a case study of designing an efficient Go package.

The target audience and context are clearly Go developers. Some of these comments are focusing too much on the headline without addressing the actual article.


Yup, and if your implementation uses a hashmap for object key -> value lookup, then I recommend allocating the hashmap after parsing the object, not during, to avoid continually resizing the hashmap. You can implement this by using an intrusive linked list to track your key/value JSON nodes until the time comes to allocate the hashmap. Basically, when parsing an object: 1. use a counter 'N' to track the number of keys, 2. link the JSON nodes representing key/value pairs into an intrusive linked list, 3. after parsing the object use 'N' to allocate a perfectly sized hashmap in one go. You can then iterate over the linked list of JSON key/value pair nodes, adding them to the hashmap. You can use this same trick when parsing JSON arrays to avoid continually resizing a backing array. Alternatively, never allocate a backing array and instead use the linked list to implement an iterator.
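
A sketch of that count-first approach in Go; the node layout is invented for illustration:

    package obj

    // member is one parsed key/value pair; Next makes it an intrusive
    // singly-linked list, so nothing besides the nodes themselves is
    // allocated while the object is being parsed.
    type member struct {
        Key  string
        Val  any
        Next *member
    }

    // buildObject runs once the closing '}' has been seen: n is the number
    // of members counted during parsing, head is the list of them. The map
    // is allocated exactly once, already at its final size.
    func buildObject(head *member, n int) map[string]any {
        m := make(map[string]any, n)
        for node := head; node != nil; node = node.Next {
            m[node.Key] = node.Val
        }
        return m
    }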


Like how https://github.com/zserge/jsmn works. I thought it would be neat to have such a parser for https://github.com/vshymanskyy/muon


I've made a JSON parser which works like this too. No dynamic memory allocations, similar to the JSMN parser but stricter to the specification.

Always nice to be in control over memory :)

https://sr.ht/~cryo/cj


> The user only had to Memory Map the file (or equivalent)

Having done this myself, it's a massive cheat code because your bottleneck is almost always I/O, and memory-mapped I/O is orders of magnitude faster than sequential calls to read().

But that said it's not always appropriate. You can have gigabytes of JSON to parse, and the JSON might be available over the network, and your service might be running on a small node with limited memory. Memory mapping here adds quite a lot of latency and cost to the system. A very fast streaming JSON decoder is the move here.


> memory mapped i/o is orders of magnitude faster than sequential calls to read()

That’s not something I’ve generally seen. Any source for this claim?

> You can have gigabytes of JSON to parse, and the JSON might be available over the network, and your service might be running on a small node with limited memory. Memory mapping here adds quite a lot of latency and cost to the system

Why does mmap add latency? I would think that mmap adds more latency for small documents because the cost of doing the mmap is high (cross CPU TLB shoot down to modify the page table) and there’s no chance to amortize. Relatedly, there’s minimal to no relation between SAX vs DOM style parsing and mmap - you can use either with mmap. If you’re not aware, you do have some knobs with mmap to hint to the OS how it’s going to be used although it’s very unwieldy to configure it to work well.


Experience? Last time I made that optimization it was 100x faster, ballpark. I don't feel like benchmarking it right now, try yourself.

The latency comes from the fact you need to have the whole file. The use case I'm talking about is a JSON document you need to pull off the network because it doesn't exist on disk, might not fit there, and might not fit in memory.


> Experience? Last time I made that optimization it was 100x faster, ballpark. I don't feel like benchmarking it right now, try yourself.

I have. Many times. There's definitely not a 100x difference given that normal file I/O can easily saturate NVMe throughput. I'm sure it's possible to build a repro showing a 100x difference, but you have to be doing something intentionally to cause that (e.g. using a very small read buffer so that you're doing enough syscalls that it shows up in a profile).

> The latency comes from the fact you need to have the whole file

That's a whole other matter. But again, if you're pulling it off the network, you usually can't mmap it anyway unless you're using a remote-mounted filesystem (which will add more overhead than mmap vs buffered I/O).


I think you misunderstood my point, which was to highlight exactly when mmap won't work....


In my experience mmap is at best 50% faster compared to good pread usage on Linux and MacOS.


I also really like this paradigm. It’s just that in old crusty null-terminated C style this is really awkward because the input data must be copied or modified. But it’s not an issue when using slices (length and pointer). Unfortunately most of the C standard library and many operating system APIs expect that.

I’ve seen this referred to as a pull parser in a Rust library? (https://github.com/raphlinus/pulldown-cmark)


Did you publish the code somewhere? I'd be interested in reading it.


"Antiethical".

Is it true that mindlessly bashing Go became some kind of new religion in some circles?


Did you even open the article? The following is literally in the first paragraph:

> This package offers the same high level json.Decoder API but higher throughput and reduced allocations


How does that contradict what the parent poster says? I think it's very weird to call something "high performance" when it looks like it's maybe 15-20% of the performance of simdjson in C++. This is not "going from normal performance to high performance", this is going from "very subpar" to "subpar".


Par is different for different stacks. It's reasonable for someone to treat their standard library's JSON parser as "par", given that that's the parser that most of their peers will be using, even if there are faster options that are commonly used in other stacks.


Ok but how many teams are building web APIs in C++?


I worked with a guy who did this. It was fast, but boy howdy was it not simple.


Because even in C++ people don't use simdjson; most projects use RapidJSON, which Go is on par with.


> "Json" and "Go" seem antithetical in the same sentence as "high performance" to me. As long as we are talking about _absolute performance_.

Even just "JSON" is problematic here as a wire protocol for absolute performance, no matter what the programming language is.


I think people say that because they give disproportionate weight to the fact it's text-based, while ignoring how astoundingly simple and linear it is to write and read.

The only way to nudge the needle is to start exchanging direct memory dumps, which is what Protobuf and the like do. But this is clearly only for very specific use.


> while ignoring how astoundingly simple and linear it is to write and read.

The code may be simple, but you have lots of performance penalties: resolving field keys, and needing to construct complicated data structures through memory allocations, which is expensive.

> to start exchanging direct memory dumps, which is what Protobuf and the like do

Protobuf actually does parsing; it is just a binary format. What you're describing is more like FlatBuffers.

> But this is clearly only for very specific use.

yes, specific use is high performance computations )


Resolving what keys? JSON has keyval sequences ("objects") but what you do with them is entirely up to you. In a streaming reader you can do your job without ever creating maps or dehydrating objects.

Plus no one makes people use objects in JSON. If you can send a tuple of fields as an array... then send an array.


> In a streaming reader you can do your job without ever creating maps or dehydrating objects.

but you need some logic which will check that the key is what you want to process, and also do various type transformations (e.g. JSON string to long).

> Plus no one makes people use objects in JSON. If you can send a tuple of fields as an array... then send an array.

then this shouldn't be called JSON parsing anymore.


The "logic" to match a key is not slow. Your CPU can parse hundreds of keys while waiting for the next keyval sequence to be loaded from RAM.

As for whether you must stuff everything in objects for it to be JSON, I mean that makes no sense. JSON arrays are also JSON. And JSON scalars are also JSON.

If this argument requires we deliberately go out of our way to be dumb, then it's a bad argument.

I'm writing a platform which has a JSON-like format for message exchange and I realized early on that in serialized data, maps are at best nominal. You process everything serially (hence the word serialization). It's a stream of tokens. The fact some tokens are marked as keys and some as values is something that can be useful to communicate intent, but it doesn't mandate how you utilize it.

Everything else is just prejudices and biases, such as "maps take resources to allocate". JSON doesn't force you to build maps when you have an object. Point in the spec where JSON mandates how you must store the parsed results of a JSON object. Should it be a hashmap? A b-tree? A linked list? A tuple? Irrelevant.


> The "logic" to match a key is not slow. Your CPU can parse hundreds of keys while waiting for the next keyval sequence to be loaded from RAM.

RAM on my computer works with pages loaded into CPU cache.

Rants, personal attacks, handwavings are ignored.


No, pages are not loaded in cache. Cache lines are. RAM pages are typically 4kb, and cache lines are most commonly 64 bytes. This means you have 64 cache lines per RAM page. And this entire detour has no relevance to what I said in the first place, which still stands. But you know, someone was wrong on the Internet.


> RAM pages

I didn't say "RAM pages"; well-vectorized code can dump MBs of data preemptively into cache, instead of reading the "next keyval sequence" each time.

> And this entire detour has no relevance to what I said in the first place,

or you just don't understand such relevance.


I never said the CPU reads one key at a time. I said it can decode hundreds of keys in the time it takes to load one from memory. This is completely irrelevant to how memory reads are batched. It's about a ratio, like 100:1, get it? Seems like you felt your ego attacked, and you just had to respond in a patronizing way about something, but didn't know about what.

Hashing a string as you read it from memory and jumping to a hash bucket is not an expensive operation. This entire argument sounds like some kindergarten understanding of compute efficiency. This is not a 6502.


Will ignore you this time )


Excellent work. Don't drop to my level where I made a simple factual statement.


Let's sync on numbers maybe? My engine processes 600 MB/s (megabytes, not megabits) of data per core (I have very many cores) for my wire format; the current bottleneck is that the Linux vpages system can't allocate/deallocate pages fast enough when reading from an NVMe RAID.

What are your numbers for your cool JSON serializer?


I think maybe you possibly forgot what our argument was. I said the bottleneck is memory, and not processing/hashing the keys to match them to the symbol you want to populate.

And you're currently telling me the bottleneck is memory.

I also said you don't need to parse a JSON object to a hashmap or a b-tree. The format suggests nothing of the sort. You can hash the key and fill it into a symbol slot in a tuple which... literally only takes the amount of RAM you need for the value, while the key is "free", because it just resolves to a pointer address.

Additionally, if you have a fixed tuple format, you can encode it as a JSON array, thus skipping the keys entirely. None of that is against JSON. You decide what you need and what you don't need. The keyvals are there when you need keyvals. No one is forcing you to use them where you don't need them at gunpoint.

I have a message format for a platform I'm working on, it has a JSON option, just for compatibility. It doesn't use objects at all, yet (but it DOES transfer object states). Nested arrays are astonishingly powerful on their own with the right mindset.


> And you're currently telling me the bottleneck is memory.

Not memory, but the virtual pages implementation in Linux, which is apparently single-threaded and doesn't scale to high throughput. There was a patch to fix this, but it didn't make it to mainline: https://lore.kernel.org/lkml/20180403133115.GA5501@dhcp22.su...


"High throughput" seems like an odd problem to have. You don't have to throw away pages and allocate new ones all the time. You can reuse a page.


That's not up to me. It's how the Linux virtual fs is implemented.



