Show HN: ffjson: faster json serialization in Go (querna.org)
83 points by pquerna on March 31, 2014 | 35 comments



As someone who has been working on parsing/serialization for many years, I can absolutely confirm that generating schema-specific code will beat generic code almost every time.

The article discovers an important tradeoff: speed vs. convenience. Generated code is faster but less convenient, because it adds steps to your compile. And you pay this compile-time overhead for every message type you want to manipulate. The pain of this generated code was one of Kenton Varda's motivations for creating Cap'n Proto after working on Protocol Buffers for years. Unlike Protocol Buffers, Cap'n Proto doesn't need to generate parsing/serialization code because its serialization format also works as an in-memory format.

I have taken a somewhat different approach to the problem, with my Protocol Buffer / JSON / etc. serialization framework upb (https://github.com/haberman/upb). Instead of using static code generation, I use a JIT approach where I generate specialized code on-the-fly. This approach is particularly important when you are wrapping the library in an interpreted language like Python/Ruby/Lua/JavaScript, because users of these languages don't have a compile cycle at all, so adding one is a large inconvenience.

My library isn't ready for prime time but I'm hoping to have something ready to use this year.


> The article discovers an important tradeoff: speed vs. convenience. Generated code is faster but less convenient, because it adds steps to your compile.

Rust's approach to serialization eliminates this unnecessary tradeoff, IMO: it uses the built-in macro system to parse the structure of your serializable data types and generates the specialized code at compile time. There is no special build step, you get the full speed of specialized code, and it's as simple as writing "#[deriving(Encodable)]" at the top of your data structures.

(As an added bonus, we're using the same infrastructure to generate trace hooks for precise GC in Servo, which eliminates a major pain point of integrating systems-level code with precise garbage collectors.)


The Encodable trait sounds convenient for lots of use cases. However, speed and convenience are not the only things you might want; multi-language support and protocol evolvability are often also important, and when they are, something like Cap'n Proto's schema language is very useful.

Instead of "serialization frameworks", I like to call these libraries "type systems for distributed computing". I can then ask: where are your types defined?


Agreed!

I would also take it one step further and suggest that these libraries are "type systems for data interchange."

Say you want to stuff your logs into a database like BigQuery (which I work on at Google) for later analysis. What is the schema of your logs? I dream of a world where the logfile->JSON parser can use the same schema definition as the database itself, so there is no glue/conversion required to turn one into the other.

Likewise I think these schema languages are ideal for defining ASTs and other compiler infrastructure. Wouldn't it be cool if you could dump your AST into a database and run queries over it? Again, the goal is to do this without requiring a big schema mapping / transformation first.

I believe in this vision, and I think Protocol Buffer schemas fit the bill as something that can be usefully applied to much more than just Protocol Buffers. The first step I'm taking towards this is making JSON a first-class citizen in my library.


That's cool, I'll have to look into this at some point. This approach would simplify compile-time code generation, but it doesn't help interpreted languages that have no compile step at all.

> As an added bonus, we're using the same infrastructure to generate trace hooks for precise GC in Servo, which eliminates a major pain point of integrating systems-level code with precise garbage collectors.

I don't know what this means, but it sounds really interesting, do you have a reference with more info?


> This approach would simplify compile-time code generation, but it doesn't help interpreted languages that have no compile step at all.

Well, it does if they have macro systems like Scheme :) Of course, mainstream interpreted languages don't.

> I don't know what this means, but it sounds really interesting, do you have a reference with more info?

There isn't an official writeup that I'm aware of, but I can briefly explain what it is. In precise GCs, you have to have a way for the GC to traverse all the GC'd objects that a particular object points to. This is a problem in languages like C++ that have no or insufficient reflection. Traditionally the solution has been for everyone to manually implement "trace hooks" or "visitor hooks" for all objects that participate in garbage collection, which are C++ methods that enumerate all of the objects that a given object points to; this is what Gecko (with the cycle collector) and Blink (with Oilpan) do. But this is tedious and error-prone, and is especially scary when you consider that errors can lead to hard-to-diagnose use-after-free vulnerabilities.

We observed that this is a very similar problem to serialization; in serialization you want to call a method on every object that a given object points to (to serialize it), while in tracing you also want to call a method on every object that a given object points to (to trace it). So we decided to reuse the same compiler infrastructure. This has worked pretty well, despite the weirdness of saying `#[deriving(Encodable)]` for GC stuff. (We'll probably want to rename it.)


> Well, it does if they have macro systems like Scheme :)

I wouldn't generally count that because generated code in dynamic languages is (in my experience) an order of magnitude slower anyway.

For example, generated protobuf-parsing code in Python is something crazy like 100x slower than "the same" generated code in C++. Python might not be the best example since it's a lot slower than other dynamic languages like JavaScript or Lua (don't know about Scheme). But in general my experience is that generated code in dynamic languages isn't in the same ballpark as generated code in a low-level language like C/C++ (and probably Rust).

> So we decided to reuse the same compiler infrastructure.

Very interesting. What is the function signature of the generated functions? Are you saying that the functions you generate for serialization are the same (and have the same signature) as the functions you generate for GC tracing?


> Very interesting. What is the function signature of the generated functions? Are you saying that the functions you generate for serialization are the same (and have the same signature) as the functions you generate for GC tracing?

Yes, they're the same. They take the actual serializer (JSON, or YAML, or the GC, etc) as a type parameter so that you can just write `#[deriving(Encodable)]` and have it be used with different serializers. Type parameters are expanded away at compile time, so this leads to zero overhead at runtime.


Got it, so it looks like "Encoder" is this trait/interface: http://static.rust-lang.org/doc/master/serialize/trait.Encod...

I think of what you call "Encoder" as a "Visitor" (just in case you're looking for renaming input :)

So the function that you are generating at runtime is similar to a template function in C++, templated on the specific serializer (Encoder/Visitor/etc).

One thing that this approach does not currently support (which admittedly is probably not required for most users, and is probably not in line with the overall design of Rust) is the ability to resume. Suppose you are serializing to a socket and the socket is not write-ready. You would need to block until the socket is write-ready (or allocate a buffer to hold the serialized data). This interface doesn't provide a way of suspending the visit/encode and resuming it later.

This also doesn't seem to have a way of identifying the fields -- is this true? Like if you were trying to encode as JSON but wanted to know the name of each field?


`encode_field` has the name in it—was there something else you were thinking of? `#[deriving(Encodable)]` should be able to read attributes on fields and provide the name accordingly.

And yes, you can't resume with this interface. You can implement traits for types outside the module they were defined in though, so a "resumable serialization" library could provide a new "ResumableEncoder" type if it wanted to.


The Encoder trait I linked to doesn't seem to list an "encode_field" function -- am I looking in the wrong place?


emit_struct_field, sorry.



"static if", __traits, and mixin are the D equivalent of Rust's macros, more or less.


There may not be "macros", per se, but there is definitely a lot of compile-time logic.


> The article discovers an important tradeoff: speed vs. convenience. Generated code is faster but less convenient, because it adds steps to your compile.

Not if you use languages with built-in metaprogramming capabilities that are expressive enough to generate schema-specific code at compile time (e.g. Scheme, Scala, D, Rust, Nimrod, etc.)

I'm personally a huge fan of typeclasses as implicits + macro materializers in Scala: http://docs.scala-lang.org/overviews/macros/implicits.html


Though that would require those features to be implemented more robustly than the Go json library's runtime reflection.


The last time I looked at upb, it only supported parsing, which was a deal-breaker. Have you implemented serializing since then? It's hard to tell from the git log, which is mostly "dump from google internal sources XX-YY-ZZZZ".

I'm very excited about upb! Thanks for your work on it over the years. Do you have any tasks that an outside contributor could help with?

Thanks!


Still no serialization, sorry. This is getting close to being my top priority though. It's a hard problem to design the kind of generic visitor I want while being speed-competitive with the fastest stuff that's out there.

Probably can't use a lot of outside help, but thanks for asking. What's taking so long is a very intense process of refining the design. I am often reminded of this bit from "Worse is Better":

"The right thing takes forever to design, but it is quite small at every point along the way."

This is exactly the approach I am taking. It's taken forever to design, but over time it becomes more and more capable without the core growing in size very much. I hope that this intensive design work will pay off soon.

Thanks for your interest and I hope to soon be able to reward it with a highly compelling parsing/serialization experience!


Awesome, looking forward to that day :). Keep up the great work.


This is way more complicated than it needs to be. Use code.google.com/p/go.tools/go/types to process the AST and you'll get basically the same information that the compiler sees. With that you can generate the code pretty easily. For comparison, our JSON code generator implementation is just ~350 lines and supports having different JSON representations for the same type, which vary depending on the container type.

Also, if you want to make the serialization faster you need to understand exactly what makes encoding/json slow (hint: it's not only reflect) and remove all reasonable bottlenecks. You state that megajson does not support the MarshalJSON interface like that's a bad thing, but I'm pretty sure that's deliberate, because it's indeed a feature. When encoding/json encounters a type which implements MarshalJSON it does the following:

1 - Call MarshalJSON to obtain its JSON representation as []byte

2 - Validate the produced JSON using a slower-than-the-bad-guy's-horse state-machine function which processes each character individually

3 - Copy the []byte returned by MarshalJSON into its own buffer

Unsurprisingly (after reading encoding/json's code, of course), having a MarshalJSON method is way slower than letting encoding/json use reflection unless the JSON you're generating is trivial and almost completely flat, because the reflection path avoids the extra allocations, copies, and the validation step.


Shameless plug, but this looks like the exact inverse of gojson[0], which generates code (struct definitions) based on sample JSON.

I originally wrote it when writing a client library to interface with a third-party API; it saves a lot of effort compared to typing out struct definitions manually, and a lot of type assertions compared to using map[string]interface{} everywhere.

[0] https://github.com/ChimeraCoder/gojson


The implementation of this is pretty interesting, in that it generates code that imports your code, compiles it, and then uses reflection to generate the serialization code. And in the end, that worked out better for the author than using the AST.


If anything, that would be an indication that the tooling is nowhere near good enough. Any reflection should be doable entirely with data the compiler has anyway.


Actually there is a tool for this. I think the author was simply unaware of it. It is called oracle* and is able to answer this typing question and many more.

*: https://code.google.com/p/go/source/browse/cmd/oracle/main.g...


The tooling is excellent, but the author wasn't aware of it. In fact, the stdlib and the go.types repository provide almost anything required to write a front-end for a Go compiler.


I made a library this weekend that doesn't need code generation to achieve a 2x improvement[1] over the standard library.

While the OP only implemented the encoding part, I only implemented the decoding part =):

https://github.com/aybabtme/fatherhood

So I guess they complement each other nicely in that.

[1]:https://github.com/aybabtme/fatherhood#performance


Looking at the name, at first I thought ffmpeg author and general programming god Fabrice Bellard had come up with it.


Seems like this approach might also work for building type-specific collection implementations!


On the .NET side, Jil does something similar, creating a custom static serializer per type. It's able to do the code generation at runtime by emitting IL: https://github.com/kevin-montrose/Jil


You can get nearly the same speedup by just avoiding reflection. You unmarshal into an interface{}, then pull the data out manually using type assertions as necessary. In my last project I think I got about a 1.6-1.7x speedup this way.


Nice hack (in the hacker sense), but not exactly convenient.


Good stuff, there.

Feature request: optionally emit canonicalized (key-sorted) JSON.


This is a good example of why everyone should avoid using the reflect package as much as possible.

I use reflect for quick development and then remove it before production roll out.


I've had to use reflect due to missing functions in an upstream API. We could have forked the upstream package or used reflect in one function.

We decided to put in a patch and use reflect until it is accepted.



