Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: A Fast, Malloc-Free C++14 Json Parser and Encoder (github.com/matt-42)
98 points by matt42 on Nov 14, 2014 | hide | past | favorite | 45 comments



This uses std::string all over the place, which allocates memory under the hood.

I'm not sure how any JSON parser could avoid memory allocation without a difficult to use interface. Numbers, arrays, strings, and objects are unbounded by the JSON spec, so a truly malloc free library would need to provide a kind of streaming interfaces where things are returned in fixed sized chunks.

JSON wouldn't be my first choice for data storage in situations where I needed to avoid dynamic memory.


The parser does not extensively use std::string. to parse json keys, It uses an internal stringview class mirroring the json string (string_view is candidate to get into the next C++ standard). When I said malloc free, I meant that no intermediate memory is allocated: actually, If your json object only contains int or floats, the parser will not use malloc: it will just traverse the string and fill the scalars of your destination object (thanks to static introspection), a big difference with state of the art json parsers returning an intermediate dynamically allocated tree that you have to interpret after the parsing. Of course, if the destination object stores instances of std::string of std::vector, the iod library will allocated them and fill them with the json values.


It also uses vector, which is almost certainly using malloc and/or placement new even if the SIO somehow really does reserve space for a single object. Its also using the heap unless there is some hidden Allocator object in the code that explicitly is using stack allocated space.


I wrote a JSON-Parser in C with minimal validation in ~250 LOC that does not allocate memory, but leaves the type check and conversion to the user. https://github.com/vurtun/json


It would be nice if the README mentioned it was originally based on my https://github.com/quartzjer/js0n :)


Added your project as a reference. Sorry for the missing reference I originally never planned to make the project known publicly and therefore never added the references.


This is absolutely beautiful. I didn't know you could get an address of a label in C or to do a goto to an address from variable instead of a literal label. Is this new in C, or was it always there?



Is there an equivalent for Clang/LLVM?


Yes.

http://blog.llvm.org/2010/01/address-of-label-and-indirect-b...

A venerable interpreter implementation trick.


Unfortunately it won't port to IAR, which is a shame. I don't understand why pointers to labels never made it to the C spec; it's one of the (equivaling) features that actually makes assembler look elegant.


That test.c file looks familiar!


Yeah I was looking for unit test macros and found your project. I added your project as a reference in the README for clarification of the origin.


That's a very nice code.


I wrote a JSON parser that avoids using dynamic memory but works a bit differently. It's a callback interface like SAX, but instead of pushing already "typed" data to the user, the user calls a function to interpret the data. In this fashion, if the user wants a float instead of an integer, they call the function for float. Likewise, if the user has some space allocated for string values, the library can convert the already inmemory JSON format string to UTF-8 in the user's buffer.

The C interface is purposely ugly but the C++ is much easier to work with.

https://codehero.github.io/benejson/


> JSON wouldn't be my first choice for data storage in situations where I needed to avoid dynamic memory.

I work a lot with very low-resource embedded systems and we like to use JSON host-side for structured data configuration, but convert to BSON (http://bsonspec.org) for target-side storage. This can be parsed and iterated with very low-overhead and no dynamic memory. Similar idea to protocol buffers but with JSON-like data.

Avoiding a build-time conversion would be nice, though, and this idea is interesting.


sajson allocates once: it turns out you can structure your parse tree such that each byte of input only requires one machine word of output. http://chadaustin.me/2013/01/sajson-why-the-parse-tree-is-bi...

https://github.com/chadaustin/sajson


That's correct in the worst case, but in the best case, std::string does not allocate memory anymore (short string optimisation), e.g. libc++: http://stackoverflow.com/a/21710033

So it might be "malloc-free" for configuration files where the possible values are limited, etc. :)


jsmn is a nice JSON tokenizer. You malloc a chunk of memory up front, but that's it. It gives you pointers to the tokens and you can do what you like with them. I am using it in an app that has a fixed-size data structure that is populated from a JSON file. So my code is malloc-free also.


Link for those interested: https://bitbucket.org/zserge/jsmn/wiki/Home

jsmn is pretty awesome, and contrary to this (std::string, std::vector) actually puts you in charge of allocations. It's also in C.

In a similar vein, and also possibly already mentioned:

- yayl (C): https://lloyd.github.io/yajl/

- microjson (C): http://esr.ibiblio.org/?p=6262


I've got a streaming JSON parser in GCC C around here somewhere that does basically that - numbers and keys are truncated, but everything else is chopped up into chunks and streamed. Someday I ought to clean it up, document it and release it. Runs quite nicely on an Arduino.


> As of today, all json parsers rely dynamic data structures to parse and store the json objects.

I'm not sure that's entirely fair. Callback-based parsers like YAJL leave the application free to store the data in whatever data structure they want, or even to stream-process the input without storing in a data structure at all.

But regardless, the meta-programming approach described here is interesting and novel. Generating structure-specific parsing code is a well-explored area (for example, Protocol Buffers is designed entirely around this idea), but doing it as C++ metaprogramming is a novel approach (Protocol Buffers relies on compile-time code generation).

I don't actually understand how the object inspection and compile-time codegen works with this meta-programming approach; will be interesting to dig in a little deeper and learn more.


I've used Boost Fusion associative maps, in conjunction with a bit of glue code and an appropriate JSON or XML library, to provide pretty seemless serialisation and de-serialisation for both XML and JSON for a while. For XML it's a little tricky because you have to have a means of mapping the simple flat relationship between a struct and its fields, to the various relationships between XML elements, subelements, attributes, and CDATA etc. Here's an example of what my declarations for XML look like in some code I wrote last year.

    BOOST_FUSION_ADAPT_ADT (
        xml_encoded<Description>,
        XML_ATTR        (string, "summary")
        XML_TEXT        (string)
    )

    BOOST_FUSION_ADAPT_ADT (
        xml_encoded<Event>,
        XML_ATTR        (string, "name")
        XML_SUBTREE     (shared_ptr<Description const>, "description")
        XML_ATTR        (optional<unsigned>, "since")
    )
The #defines for each macro are short, and beyond Fusion there's only a small support header

Here's a talk from CppConn describing a similar use case, but for binary data formats https://www.youtube.com/watch?v=wbZdZKpUVeg


The introspection comes from the SIO's of the iod library, they are novel types of object offering introspection at compile time. The iod library makes iterating on the members of an sio object trivial with the foreach primitive (see https://github.com/matt-42/iod/blob/master/iod/json.hh#L417) so generating a serializer / deserializer like this one is not so complicated since you have access to the type and the name of each members. For more info, check the readme of https://github.com/matt-42/iod I did my best to explain the concepts of the library. But I'm still lacking of time to polish the documentation....


Not that it's well-known or anything, but the Gin library that's part of Chromium uses C++ meta-programming to generate structure-specific code to 'parse' JavaScript types into C++ types:

https://code.google.com/p/chromium/codesearch#chromium/src/g...


On one hand this is undoubtedly a very clever use of ++ features. On the other hand that heck of a lot of scaffolding (just look in /iod directory) and the more scaffolding there is, the more caution is needed in adopting the code.

The same goal - not parsing what's not needed - can be done with a conventional callback-based C code. You basically go through the json data, parse, say, a field and call the app asking "is this ok? shall I proceed?". If it's a yes, then you indeed proceed and parse out the value chunk and pass it to the app the same way. If it's a no, you either abort or skip over the value. The end effect is the same - parsing of an invalid json input is aborted as soon as the app flags the first bad entry; and unwanted fields are never parsed in full.

So I seriously doubt that this is a little more than a marketing spin of a proud developer -

  This makes its performances impossible to match
  in other languages such as C or Java that do not
  provide static introspection.
I am fairly certain that vurtun's code [1] can match and most likely beat this lib's performance, with ease.

[1] https://news.ycombinator.com/item?id=8609236


Vurtun's code is probabely faster than the iod json parser since it leaves the type check and conversion to the user. Since iod knows the structure of the object you are parsing, it directly checks types and throw exceptions if the json string does not contains all the required fields of the destination object, things that you cannot do without compile-time introspection, and that vurtun's library leaves to the user.


No, I meant of course that vurtun + all the callbacks are going to be at least as fast as iod. You'd basically pass a bit of context to every callback, so that the app would know where exactly it is in its parsing. It's clearly a bit more of leg work in terms of coding, but I would personally take that over introducing a dependency on a larger abstract framework. I come from the embedded programming background, so I would opt for a pre-processor step that takes in a struct definition and generates a proper set of parsing callbacks. This way you'd at least see the interim code before it gets compiled. But I digress, to each his own and that's not a point here. I spent a lot of time optimizing C code and I find your blanket performance claims to be too absolute and cocky for comfort.


Sax parsers are tools that you have to use to implement a parser for a given json object. The goal of iod's json parser is to do it automatically thanks to static introspection. It also reports errors if the json string is not compatible with the structure of your object, things that you have to do manually with other json parsers.


See also: Scala Pickling [1]. Serialization and deserialisation logic optimised for specific datatype is generated purely at compile time using Scala Macros [2].

[1] http://lampwww.epfl.ch/~hmiller/pickling/

[2] http://scalamacros.org


> This makes its performances impossible to match in other languages such as C or Java that do not provide static introspection.

A CHALLENGE!

So, er, who's up for it?

You could implement an analogue of this approach in Java. It's true that Java doesn't have language constructs that would let you do this as part of compilation, but Java has its ways. You could write an annotation processor to do this at compile time, or use a bytecode parser at runtime (this is yucky, but a fairly standard technique these days). Either way, the output would be a pair of synthetic classes which implemented the parser and encoder. A tool like this would be moderately laborious to write, but a straightforward matter of programming.


It reminds me of jackson-afterburner[0] in Java, which generates byte code for parsing and generating JSON.

[0] https://github.com/FasterXML/jackson-module-afterburner


Interesting approach. Could be a use case for the ByteBuddy library at http://bytebuddy.net


RapidJSON claims to support "in-situ parsing", which is presumably mostly zero copy, and presumably doesn't allocate much either. I'd like to see benchmarks over comparable code.


in-situ parsing would only work when decoding from an existing buffer (returning slices from the original buffer instead of copying the data), not when decoding JSON from e.g. a socket, right?


If you're using C++, it's by definition malloc-free ;)


Random downvote? Using malloc in C++ is considered terrible practice. Should I even source it?


> In classic C or C++, you would define a function taking optional arguments as :

Is that true? Can classic C have default (optional) arguments?


  void fun(int mandatory_arg, int optional_arg1 = 1, int optional_arg2 = 12, int optional_arg3 = 12);
No, this isn't legal ISO C. There are ways to simulate them though: https://gustedt.wordpress.com/2010/06/03/default-arguments-f...


My mistake, I just fixed the readme.


No, classic C doesn't have default arguments.


That's what I thought.


Sorry, didn't read the code yet. But malloc free means stack allocation, thus dangerous to stack attacks. Please clarify.


I also haven't read the code yet but malloc free may simply mean the code doesn't call malloc but it can still allocate things on the heap using New.


This program appears to use heap allocation, at least indirectly through its use of std::string.

On a more general note, libraries that perform zero dynamic allocation (and instead require the library user to pass in memory) can be very convenient for systems programming where portability is a concern. For example, I use Intel XED instruction encoder/decoder, which is a library that performs no dynamic memory allocation. This allows me to use it in user space and kernel space without hijacking the malloc symbol.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: