Looks pretty good! Even though I've written far too many JSON parsers already in my career, it's really nice to have a reference for how to think about making a reasonable, fast JSON parser, going through each step individually.

That said, I will say one thing: you don't really need an explicit tokenizer for JSON. You can drop the concept of tokens and integrate tokenization into the parser entirely. This is what I usually do, since it makes everything simpler. It's a lot harder to pull off with the rest of ECMAScript, because there you wind up needing look-ahead (sometimes arbitrarily large look-ahead... consider arrow functions: the parameter list is mostly a subset of the grammar of a parenthesized expression, comma is an operator, and for default values, equals is an operator. It isn't until the => does or does not appear that you know for sure which one you're parsing!)
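
To make that concrete, here's a rough sketch of the shape I mean (C++17, just typed out for illustration: minimal error handling, escapes are skipped rather than decoded, numbers aren't really validated). The point is that each parse function looks at the next character and recurses directly; no token objects ever exist.

    #include <cctype>
    #include <cstddef>
    #include <string_view>

    // Tokenizer-free JSON structure checker: dispatch on the next character.
    struct Parser {
        std::string_view s;
        std::size_t i = 0;

        void skip_ws() { while (i < s.size() && std::isspace((unsigned char)s[i])) ++i; }
        char peek() { skip_ws(); return i < s.size() ? s[i] : '\0'; }

        bool value() {
            switch (peek()) {
                case '{': return object();
                case '[': return array();
                case '"': return string();
                case 't': return literal("true");
                case 'f': return literal("false");
                case 'n': return literal("null");
                default:  return number();
            }
        }

        bool object() {
            ++i;                                        // consume '{'
            if (peek() == '}') { ++i; return true; }
            for (;;) {
                if (peek() != '"' || !string()) return false;   // key
                if (peek() != ':') return false;
                ++i;
                if (!value()) return false;
                char c = peek();
                if (c == ',') { ++i; continue; }
                if (c == '}') { ++i; return true; }
                return false;
            }
        }

        bool array() {
            ++i;                                        // consume '['
            if (peek() == ']') { ++i; return true; }
            for (;;) {
                if (!value()) return false;
                char c = peek();
                if (c == ',') { ++i; continue; }
                if (c == ']') { ++i; return true; }
                return false;
            }
        }

        bool string() {
            ++i;                                        // consume opening '"'
            while (i < s.size()) {
                if (s[i] == '\\') { i += 2; continue; } // skip escape pair
                if (s[i] == '"') { ++i; return true; }
                ++i;
            }
            return false;
        }

        bool number() {
            std::size_t start = i;
            while (i < s.size() && (std::isdigit((unsigned char)s[i]) ||
                   s[i] == '-' || s[i] == '+' || s[i] == '.' ||
                   s[i] == 'e' || s[i] == 'E')) ++i;
            return i > start;
        }

        bool literal(std::string_view lit) {
            if (s.substr(i, lit.size()) != lit) return false;
            i += lit.size();
            return true;
        }
    };

    // usage: Parser p{R"({"a":[1,2,false],"b":"hi"})"}; bool ok = p.value();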




What line of work are you in that you've "written far too many JSON parsers already" in your career?!!!


Reasons differ. C++ is a really hard place to be. It's gotten better, but if you can't tolerate exceptions, need code that is as obviously memory-safe as possible, or need to parse incrementally (think SAX style), off-the-shelf options like jsoncpp may not fit the bill.
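
For what it's worth, the shape of interface I mean is roughly this (a made-up sketch, not any particular library): the application implements a handler, feeds the parser whatever bytes have arrived, and every failure comes back as a return value instead of an exception.

    #include <cstddef>
    #include <string_view>

    // Hypothetical SAX-style, exception-free interface. Callbacks return
    // false to abort parsing early; the parser itself never throws.
    struct Handler {
        virtual bool on_object_begin() = 0;
        virtual bool on_object_end() = 0;
        virtual bool on_array_begin() = 0;
        virtual bool on_array_end() = 0;
        virtual bool on_key(std::string_view k) = 0;
        virtual bool on_string(std::string_view v) = 0;
        virtual bool on_number(double v) = 0;
        virtual bool on_bool(bool v) = 0;
        virtual bool on_null() = 0;
        virtual ~Handler() = default;
    };

    class IncrementalParser {
    public:
        explicit IncrementalParser(Handler& h) : handler_(h) {}
        // Feed the next chunk as it arrives; returns false on a syntax
        // error (no exceptions, no whole-document buffering).
        bool write(const char* data, std::size_t n);
        // Signal end of input; returns false if the document is incomplete.
        bool finish();
    private:
        Handler& handler_;
        // ... internal state machine elided ...
    };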

Handling large documents is indeed another big one. It sort of fits in the same category as being able to parse incrementally. That said, Go has a JSON scanner you can sort of use for incremental parsing, but in practice I've found it to be a lot slower, so it's a problem for large documents.

I've done a couple in hobby projects too. One time I did a partial one in Win32-style C89 because I wanted one that didn't depend on libc.


Large documents are often handled by mmap/VirtualAlloc-ing the file, but Boost.JSON has a streaming mode, is reasonably fast, and its license is good for pulling into anything. It's not the fastest, but it's faster than RapidJSON while offering an interface like nlohmann/json's.

For most tasks, it does seem that most of the libraries taking a JSON document approach are wasting a lot of time/memory to get to the point where we have what we actually want: normal data structures, not a JSON document tree. If we cut that step out and parse straight into the data structures, there is a lot of win in performance and memory with less (or no) code, just mappings. That's how I approached it, at least.
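
On the mmap point, it's roughly this (POSIX-only sketch, error handling abbreviated; the parser call in the usage comment is just an example of anything that accepts a string_view): map the file read-only and hand the parser a view over it, so the document never gets copied into a heap buffer.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <string_view>

    // Map a file read-only and return a view over its bytes (empty view on
    // failure). The caller munmap()s when done; closing the fd right after
    // mmap is fine because the mapping keeps the file referenced.
    std::string_view map_file(const char* path) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return {};
        struct stat st {};
        if (::fstat(fd, &st) != 0 || st.st_size == 0) { ::close(fd); return {}; }
        void* p = ::mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);
        if (p == MAP_FAILED) return {};
        return { (const char*)p, (size_t)st.st_size };
    }

    // usage:
    //   std::string_view doc = map_file("huge.json");
    //   auto parsed = boost::json::parse(doc);   // or any parser taking a view
    //   ::munmap(const_cast<char*>(doc.data()), doc.size());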


> that most of the libraries taking a JSON document approach are wasting a lot of time/memory

I agree. That's the same situation as with XML/HTML. In many cases you don't really need to build a DOM or JSOM in memory if your task is just deserializing into some native structures.

This XML scanner of mine does not allocate any memory at all while parsing HTML/XML: https://www.codeproject.com/Articles/14076/Fast-and-Compact-...

It is even simpler than a SAX parser.


For the interesting JSON of significant size, an iterator/range interface that parses to concrete types works really well. Usually these are large arrays or JSONL-like things.
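
For the JSONL case the loop ends up looking something like this (sketch only; nlohmann/json is used purely for brevity, and the file name and field names are made up): stream the file, materialize one concrete record at a time, and never hold the whole document in memory.

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <nlohmann/json.hpp>

    struct Quote {
        std::string symbol;
        double      price;
    };

    int main() {
        std::ifstream in("quotes.jsonl");   // hypothetical input file
        std::string line;
        while (std::getline(in, line)) {    // one JSON object per line
            if (line.empty()) continue;
            auto j = nlohmann::json::parse(line);   // tiny per-record tree only
            Quote q{ j.at("symbol").get<std::string>(),
                     j.at("price").get<double>() };
            std::cout << q.symbol << ' ' << q.price << '\n';
        }
    }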


If you have files that are large enough that JSON is a problem, why use JSON in the first place? Why not use a binary format that will be more compact and easier to memory-map?


Chances are they can’t control that; they’re perhaps provided by a vendor.


I've written JSON parsers because in one instance we had to allow users to keep their formatting but also edit documents programmatically. At the time I couldn't find parsers that did that, but it was a while back.
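Structurally, "keep their formatting" boils down to a full-fidelity tree; a rough sketch of the node shape (names made up, not the actual code): every node remembers the raw text and surrounding whitespace it came from, so untouched nodes are written back byte-for-byte and only edited nodes get re-serialized.

    #include <string>
    #include <vector>

    // Full-fidelity ("lossless") node: untouched nodes are emitted verbatim
    // from raw_text, so the user's original formatting survives a round trip.
    struct Node {
        std::string leading_trivia;   // whitespace before the value
        std::string raw_text;         // the value exactly as it appeared in the input
        std::string trailing_trivia;  // whitespace up to the next ',' / '}' / ']'
        std::vector<Node> children;   // object members or array elements
        bool edited = false;          // if false, emit raw_text verbatim
    };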

In another instance, it was easier to parse into some application-specific structures, skipping the whole intermediate generic step (for performance reasons).

With JSON it's easier to convince your boss that you can actually write such a parser because the language is relatively simple (if you overlook botched definitions of basically every element...) So, for example, if the application that uses JSON is completely under your control, you may take advantage of stupid decisions made by JSON authors to simplify many things. More concretely, you can decide that there will never be more than X digits in numbers. That you will never use "null". Or that you will always put elements of the same type into "lists". Or that you will never repeat keys in "hash tables".
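
As a concrete (made-up) example of that kind of shortcut: if both producer and consumer are yours, you can declare that numbers are plain integers of at most 15 digits, and the number parser collapses into a trivial loop with no overflow or floating-point edge cases to worry about.

    #include <cctype>
    #include <cstddef>
    #include <cstdint>
    #include <string_view>

    // Private contract: numbers are integers with at most 15 digits, so the
    // value always fits in int64_t and we never need strtod-style parsing.
    bool parse_small_int(std::string_view s, std::size_t& i, std::int64_t& out) {
        bool neg = (i < s.size() && s[i] == '-');
        if (neg) ++i;
        std::size_t start = i;
        std::int64_t v = 0;
        while (i < s.size() && std::isdigit((unsigned char)s[i])) {
            v = v * 10 + (s[i] - '0');
            ++i;
        }
        std::size_t digits = i - start;
        if (digits == 0 || digits > 15) return false;   // violates our contract
        out = neg ? -v : v;
        return true;
    }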


I've seen "somebody doesn't agree with the standard and we must support it" way too many times, and I've written JSON parsers because of this. (And, of course, it's easy to get some difference with the JSON standard.)

I've had problems handling streams like the OP describes with basically every programming language and data-encoding pair that I've tried. It looks like nobody ever thinks about it (I do use chunking any time I can, but sometimes you can't).

There are probably lots and lots of reasons to write your own parser.


This reminds me of my favourite quote about standards.

>The wonderful thing about standards is that there are so many of them to choose from.

And, keeping with the theme, this quote may be from Grace Hopper, Andrew Tanenbaum, Patricia Seybold or Ken Olsen.


Probably anywhere that requires parsing large JSON documents. Off the shelf JSON parsers are notoriously slow on large JSON documents.


There are several that get into the GB/s range with various interfaces. Most are just trash for large documents and sit in the allocator far too long, but that isn't required either.


What on Earth are you storing in JSON that this sort of performance issue becomes an issue?

How big is 'large' here?

I built a simple CRUD inventory program to keep track of one's gaming backlog and progress, and the dumped JSON of my entire 500+ game statuses is under 60kB and can be imported in under a second on decade-old hardware.

I'm having difficulty picturing a JSON dataset big enough to slow down modern hardware. Maybe Gentoo's portage tree if it were JSON encoded?


> What on Earth are you storing in JSON that this sort of performance issue becomes an issue?

I've been in the industry for a while. I've probably left more than one client site muttering "I've seen some things ...".

If it can be done, it will be done. And often in a way that shouldn't have even been considered at all.

Many times, "it works" is all that is needed. Not exactly the pinnacle of software design. But hey, it does indeed "work"!


Insurance price transparency files can have 16 GB of compressed JSON that represents a single object.

Here is the Anthem page. The TOC link is 16 GB:

https://www.anthem.com/machine-readable-file/search/

They are complying with the mandate, but not optimizing for the parsers.


I've seen people dump and share entire databases in JSON format at my job....


I've seen tens of millions of market data events from a single day of trading encoded in JSON and used in various post-trade pipelines.


Ah, that's a dataset of certainly intimidating size, in an environment where performance means money. Thanks for pointing that out!


In my case, Sentry events that represent crash logs for Adobe digital video applications. I'm trying to remember off the top of my head, but I think it was in the gigabytes for a single event.


Chrome trace format files also use JSON and can also become large and are a pain to work with.


Not necessarily; for example, Newtonsoft is fine with multiple hundreds of megabytes if you use it correctly. But of course it depends on how large we are talking about.


Someone misunderstood the JSONParserFactory somewhere along the line.

