Looks pretty good! Even though I've written far too many JSON parsers already in my career, it's really nice to have a reference for how to think about making a reasonable, fast JSON parser, going through each step individually.
That said, I will say one thing: you don't really need an explicit tokenizer for JSON. You can get rid of the concept of tokens and integrate parsing and tokenization entirely. That's what I usually do, since it makes everything simpler. It's a lot harder to do for the rest of ECMAScript, because there you wind up needing look-ahead (sometimes arbitrarily large look-ahead... consider arrow functions: their parameter list is mostly a subset of the grammar of a parenthesized expression. Comma is an operator, and for default values, equals is an operator. It isn't until the => does or does not appear that you know for sure which one you're parsing!)
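To make that concrete, here is a rough Go sketch of the integrated approach (illustrative only, not anybody's production parser: true/false/null, string escapes, and strict number validation are left out). Each function reads bytes directly and recurses, so no token stream ever exists:

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // parser reads the input directly: each parse function inspects the next
    // byte itself, so there is no separate token type or token stream.
    type parser struct {
        s string
        i int
    }

    func (p *parser) ws() { // skip whitespace
        for p.i < len(p.s) && strings.ContainsRune(" \t\r\n", rune(p.s[p.i])) {
            p.i++
        }
    }

    func (p *parser) expect(c byte) { // consume one required byte or abort
        if p.i >= len(p.s) || p.s[p.i] != c {
            panic(fmt.Sprintf("expected %q at offset %d", c, p.i))
        }
        p.i++
    }

    // value dispatches on the next byte: tokenizing and parsing are one step.
    func (p *parser) value() any {
        p.ws()
        if p.i >= len(p.s) {
            panic("unexpected end of input")
        }
        switch c := p.s[p.i]; {
        case c == '{':
            return p.object()
        case c == '[':
            return p.array()
        case c == '"':
            return p.str()
        case c == '-' || (c >= '0' && c <= '9'):
            return p.number()
        default: // true/false/null omitted to keep the sketch short
            panic("unexpected character")
        }
    }

    func (p *parser) object() map[string]any {
        p.expect('{')
        m := map[string]any{}
        for p.ws(); p.i < len(p.s) && p.s[p.i] != '}'; p.ws() {
            if len(m) > 0 { // members after the first need a comma
                p.expect(',')
                p.ws()
            }
            k := p.str()
            p.ws()
            p.expect(':')
            m[k] = p.value()
        }
        p.expect('}')
        return m
    }

    func (p *parser) array() []any {
        p.expect('[')
        a := []any{}
        for p.ws(); p.i < len(p.s) && p.s[p.i] != ']'; p.ws() {
            if len(a) > 0 {
                p.expect(',')
            }
            a = append(a, p.value())
        }
        p.expect(']')
        return a
    }

    func (p *parser) str() string {
        p.expect('"')
        start := p.i
        for p.i < len(p.s) && p.s[p.i] != '"' { // escape sequences not handled
            p.i++
        }
        out := p.s[start:p.i]
        p.expect('"')
        return out
    }

    func (p *parser) number() float64 {
        start := p.i
        for p.i < len(p.s) && strings.ContainsRune("+-.eE0123456789", rune(p.s[p.i])) {
            p.i++
        }
        f, _ := strconv.ParseFloat(p.s[start:p.i], 64) // error handling elided
        return f
    }

    func main() { fmt.Println((&parser{s: `{"id": 7, "scores": [1, 2.5, -3], "name": "demo"}`}).value()) }

The same shape works in C or C++ with a pointer/length pair in place of the string and index.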
Reasons differ. C++ is a really hard place to be. It's gotten better, but if you can't tolerate exceptions, need code that is as obviously memory-safe as possible, or need to parse incrementally (think SAX style), off-the-shelf options like jsoncpp may not fit the bill.
Handling large documents is indeed another big one. It sort of fits in the same category as being able to parse incrementally. That said, Go does have a JSON scanner you can sort of use for incremental parsing, but in practice I've found it to be a lot slower, so it's a problem for large documents.
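For reference, the incremental interface Go actually exposes is encoding/json's Decoder (the underlying scanner itself isn't exported); a minimal token-at-a-time sketch looks like this:

    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    func main() {
        // Read tokens one at a time instead of materializing the whole document;
        // the Decoder works on any io.Reader, so the input could be a huge file.
        dec := json.NewDecoder(strings.NewReader(`{"users": [{"id": 1}, {"id": 2}]}`))
        for {
            tok, err := dec.Token()
            if err != nil {
                break // io.EOF once the stream is exhausted
            }
            fmt.Printf("%T %v\n", tok, tok)
        }
    }

Each Token call yields a single delimiter, string, number, bool, or null, so memory stays bounded even for huge inputs, though the per-token overhead is presumably part of the slowdown mentioned above.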
I've done a couple in hobby projects too. One time I did a partial one in Win32-style C89 because I wanted one that didn't depend on libc.
Large documents are often handled by mmap/VirtualAlloc-ing the file, but Boost.JSON has a streaming mode, is reasonably fast, and its license is good for pulling into anything. It's not the fastest, but it's faster than RapidJSON and has an interface like nlohmann's JSON. For most tasks, it does seem that most of the libraries taking a JSON document approach are wasting a lot of time/memory getting to the point where we actually want to be: normal data structures, not a JSON document tree. If we pull that step out and parse straight into the data structures, there is a lot of win in performance and memory, with less (or no) code, just mappings. That's how I approached it, at least.
> that most of the libraries taking a JSON document approach are wasting a lot of time/memory
I agree. It's the same situation as with XML/HTML: in many cases you don't really need to build a DOM (or its JSON equivalent) in memory if your task is just deserializing some native structures.
For the interesting JSON of significant size, an iterator/range interface that parses to concrete types works really well. Usually those are large arrays or JSONL-like things.
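As a sketch of that pattern in Go (the Event type and its fields are made up for illustration), a large top-level array can be decoded element by element into concrete types:

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "strings"
    )

    // Event stands in for whatever concrete type the array elements map to.
    type Event struct {
        ID   int    `json:"id"`
        Name string `json:"name"`
    }

    // forEachEvent streams a top-level JSON array, decoding one element at a
    // time into Event. For JSONL input, drop the two Token calls and just loop
    // on Decode until io.EOF.
    func forEachEvent(r io.Reader, fn func(Event) error) error {
        dec := json.NewDecoder(r)
        if _, err := dec.Token(); err != nil { // consume the opening '['
            return err
        }
        for dec.More() {
            var e Event
            if err := dec.Decode(&e); err != nil {
                return err
            }
            if err := fn(e); err != nil {
                return err
            }
        }
        _, err := dec.Token() // consume the closing ']'
        return err
    }

    func main() {
        data := `[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]`
        err := forEachEvent(strings.NewReader(data), func(e Event) error {
            fmt.Println(e.ID, e.Name)
            return nil
        })
        if err != nil {
            fmt.Println("parse error:", err)
        }
    }

Only one element is materialized at a time, and the values land directly in the concrete type instead of a generic document tree.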
If you have files that are large enough that JSON is a problem, why use JSON in the first place? Why not use a binary format that will be more compact and easier to memory-map?
I've written JSON parsers because in one instance we had to allow users to keep their formatting but also edit documents programmatically. At the time I couldn't find parsers that did that, but it was a while back.
In another instance, it was easier to parse into some application-specific structures, skipping the whole intermediate generic step (for performance reasons).
With JSON it's easier to convince your boss that you can actually write such a parser, because the language is relatively simple (if you overlook the botched definitions of basically every element...). So, for example, if the application that uses JSON is completely under your control, you may take advantage of the stupid decisions made by JSON's authors to simplify many things. More concretely, you can decide that there will never be more than X digits in numbers, that you will never use "null", that you will always put elements of the same type into "lists", or that you will never repeat keys in "hash tables".
I've seen "somebody doesn't agree with the standard and we must support it" way too many times, and I've written JSON parsers because of this. (And, of course, it's easy to get some difference with the JSON standard.)
I've had problems handling streams, like the OP describes, with basically every programming language and data-encoding pair I've tried. It looks like nobody ever thinks about it (I do use chunking any time I can, but sometimes you can't).
There are probably lots and lots of reasons to write your own parser.
There are several libraries that reach GB/s of parsing performance with various interfaces. Most are still trash for large documents and sit in the allocator far too long, but that isn't required either.
What on Earth are you storing in JSON for this sort of performance to become an issue?
How big is 'large' here?
I built a simple CRUD inventory program to keep track of my gaming backlog and progress, and the dumped JSON of all 500+ game statuses is under 60 kB and imports in under a second on decade-old hardware.
I'm having difficulty picturing a JSON dataset big enough to slow down modern hardware. Maybe Gentoo's portage tree if it were JSON encoded?
In my case, Sentry events that represent crash logs for Adobe Digital Video applications. I'm trying to remember off the top of my head, but I think it was in the gigabytes for a single event.
Not necessarily; Newtonsoft, for example, is fine with multiple hundreds of megabytes if you use it correctly. But of course it depends on how large we are talking about.