On the complexity of JSON serialization (2020) (einarwh.wordpress.com)
92 points by fanf2 on Jan 24, 2021 | 103 comments



This really has nothing at all to do with JSON. The author is really complaining about encoding and decoding domain objects (which happen to use JSON as a representation) and being able to rehydrate them into full-blown objects. This is actually a very trivial problem indeed!

You gotta remember that JSON doesn’t support all of the types that many languages do. So you need to engineer around that by annotating your attributes with a type or relying on a naming convention on keys. You can’t get mad at JSON for not being good at this. It would be like getting mad at a dog for barking at the mailman. It’s doing what it was designed to do.

There are lots of tools and approaches for doing this. The native DynamoDB storage format comes to mind.
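
As a rough sketch of that idea in Java (the "S"/"N"/"M" type keys below are modelled on DynamoDB's attribute-value encoding from memory, so treat the details as illustrative rather than gospel):

    // Every value is wrapped in a one-key object naming its type,
    // built here with Jackson's tree model.
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    class TypedJsonSketch {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            ObjectNode item = mapper.createObjectNode();
            item.putObject("name").put("S", "Jane Doe");   // string
            item.putObject("age").put("N", "42");          // numbers travel as strings
            item.putObject("address").putObject("M")       // nested map
                .putObject("street").put("S", "Main St");
            System.out.println(mapper.writeValueAsString(item));
            // {"name":{"S":"Jane Doe"},"age":{"N":"42"},"address":{"M":{"street":{"S":"Main St"}}}}
        }
    }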


I didn't really read JSON as the core issue to begin with. In practice, JSON serialization libraries are used to map between domain objects and a JSON representation. The author notes that this really breaks into two problems, one of which is specific to JSON and the other of which is truly dependent on what exactly your domain model looks like.

The whole point is that no single library can solve a problem of unbounded (domain-dependent) complexity, and yet we rely on libraries like GSON to do that for us anyway. This leads to some problems, which the author calls out.

> This is actually a very trivial problem indeed!

I'm not sure I agree. The author specifically mentions the problem of coupling field names to names in the representation, such that an otherwise safe refactoring ends up breaking format compatibility.

The more general problem is a strict coupling between the structure of your domain objects and the structure of your representation. For instance, if your domain models cyclically-related objects, you have to decouple these structures somehow -- JSON cannot directly represent cycles. Sometimes you can tell whatever domain mapping library you're using to cut cycles in certain ways, but the author rightly calls this a Faustian bargain.

Retaining full control over these representational issues obviates a lot of the problems you'd otherwise face when adapting a library to your domain.


> safe refactoring ends up breaking format compatibility

I have a moderately strict rule for my teams that API objects must exist separately from normal domain objects (i.e., persistence). And any APIs that have separate lifecycles (say, private webclient API vs official published API) get separate DTOs (which is what they really are).

This works fine? It's not much work, not even in Java (thanks to Lombok). There are clear migration strategies for these API objects and you can refactor your domain objects without risk of breaking something.

I guess this is the "by hand" mapping that the article concludes with, but honestly it seems like a lot of words just to say "keep your API objects separate from your domain objects".

> JSON cannot directly represent cycles

It's incredibly easy to tweak JSON to allow it, and you don't even need a special parser. I wrote this five years ago: https://github.com/jsog/jsog
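
For anyone who hasn't seen it, the gist (from memory, so check the README for the exact key names) is that the document stays 100% plain JSON: each object carries an "@id" and repeat occurrences become "@ref" markers. A minimal Java sketch of reading one with an ordinary JSON parser (this is not the jsog library itself, just an illustration):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    class JsogSketch {
        public static void main(String[] args) throws Exception {
            String doc = "{ \"@id\": \"1\", \"name\": \"Sally\","
                       + "  \"friend\": { \"@id\": \"2\", \"name\": \"Bob\","
                       + "                \"friend\": { \"@ref\": \"1\" } } }";
            JsonNode root = new ObjectMapper().readTree(doc);
            // The cycle is only *described* at this point; application code (or a
            // JSOG-aware layer) still has to wire the reference back to the root.
            System.out.println(root.at("/friend/friend"));  // {"@ref":"1"}
        }
    }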


> I guess this is the "by hand" mapping that the article concludes with

Not "the", "a". One of the comments under the OP itself links to another blog post [0] describing an alternative to explicit DTOs, which I personally prefer.

The point isn't just to pick some other solution and be done with it; the point is to understand the problem in the first place.

[0] http://www.ballofcode.com/ddd/boundaries/serialization/2016/...

> It's incredibly easy to tweak JSON to allow it

Sure, but that was just an example of a general class of coupled-representation problems. As discussed elsewhere in this thread, the choice of JSON isn't even essential to the problem being discussed.

Also, even though (or precisely because) JSOG is 100% JSON, the fact remains that after your JSON parser finishes reading your JSOG document, you still have to hook up all the cycles. Either you do this by hand (per the article), or you wrap it up into parser-level knowledge (which breaks somewhat from the "100% JSON" intent).


  static JObject CreateMessageRepresentation(Customer customer)
  {
    return new JObject(
      new JProperty("customer",
        new JObject(
          new JProperty("name", customer.Name),
          new JProperty("address",
            new JObject(
              new JProperty("street", customer.Address.Street),
              new JProperty("zipCode", customer.Address.Zip),
              new JProperty("town", customer.Address.City)
            )
          )
        )
      )
    );
  }

Yuck.

Just make CustomerData and AddressData classes, even if you only use them for that one API response. And even if you have ten other versions of CustomerData and AddressData for ten other methods. You get type safety and your tests refactor nicely.

  @Value
  static class AddressData {
    String street;
    String zipCode;
    String town;
  }

  @Value
  static class CustomerData {
    String name;
    AddressData address;
  }

  @Value
  static class Message {
    CustomerData customer;
  }


  public Message createMessage(Customer customer) {
    final Address addy = customer.getAddress();
    return new Message(new CustomerData(customer.getName(), new AddressData(addy.getStreet(), addy.getZip(), addy.getCity())));
  }
You could format this nicer, and adding some constructors would help, but at least the typechecker is doing work for you.


Well you have chosen an example that is little more than an API object in the first place.

    class IdealGas implements EquationOfState {
        private final double gamma;

        public IdealGas(double gamma) {
            this.gamma = gamma;
        }

        public double energyDensity(double pressure, double internalEnergy) {
            return (1 + gamma) * pressure * internalEnergy;
        }
    }
Why create a separate type over this class which is just a projection of its data? You can just use a JSONObject as the API object. You are already going to need some special tricks to deal with the union with other EquationOfState implementations, on top of some out-of-band type field to designate which class is to be used.

You will have the same sort of boilerplate in either case: either a `public EquationOfStateData getEOSData()` or a `public JSONObject getJSON()`. In one case you get type safety and the deserializer provides your validation messages (though you should still do some custom validation on top to handle mismatched unions); in the other case you perform the type checks yourself (ordinarily by using methods like `json.getDouble()`) and get to give custom messages.

Choose your poison, they really aren't all that different.


> Why create a separate type over this class which is just a projection of its data?

Because it lets you refactor your API separately from your domain. And use the full power of the IDE/typechecker in both cases.

This approach especially shines with JAX-RS; you can write fully typed tests against your API methods, which look like pure logic functions.


Adapting your example to a component of a real project I'm working on:

    JsonParser<Customer> customerP
      = productP
      . field("customer", productP
        . field("name", stringP),
        . field("address", productP
          . field("street", stringP)
          . field("zipCode", stringP)
          . field("town", stringP)
          . map(uncurry3(street -> zipCode -> town ->
              new Address(street, zipCode, town))))
        . map(uncurry2(name -> address ->
            new Customer(name, address))));
This is a parser, not a serializer, but hopefully it's clear how this approach can be applied in the other direction. (I still have the analogous code on a branch somewhere, but the parsing logic needed a cleanup more, and sooner.)

Some of the uncurried function stuff can be cleaned up with dedicated wrappers for `BiFunction` and whatnot. And of course, the address parser could be extracted out if we want to unit test it separately.

I don't find the code I give above to be any worse than the code you gave (for that matter, I don't have the same "yuck" reaction to the C# example, either). I much prefer not needing extra data types that only serve to configure analogous translation code.

We can keep going back and forth, addressing the concerns we have about each other's approach, but it ultimately comes down to preference.


You can use field names or you can encode the type alongside. Ex:

{ "birthdate": ["date", "01-02-1991"] }

And then in your codebase you know that all the values in your data are actually [type, value] pairs. In this specific case I'd define a "date" as "mm-dd-yyyy".
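
A sketch of what decoding that convention could look like in Java with Jackson (the tag names and the "mm-dd-yyyy" choice are just my example, not a standard):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    class TaggedValueSketch {
        // Each value arrives as a [type, value] pair; dispatch on the type tag.
        static Object decode(JsonNode tagged) {
            String tag = tagged.get(0).asText();
            String value = tagged.get(1).asText();
            switch (tag) {
                case "date":
                    return LocalDate.parse(value, DateTimeFormatter.ofPattern("MM-dd-yyyy"));
                case "string":
                    return value;
                default:
                    throw new IllegalArgumentException("unknown tag: " + tag);
            }
        }

        public static void main(String[] args) throws Exception {
            JsonNode doc = new ObjectMapper().readTree("{\"birthdate\": [\"date\", \"01-02-1991\"]}");
            System.out.println(decode(doc.get("birthdate")));  // 1991-01-02
        }
    }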

These are trivial problems. We have all solved them.

The root of the problem is wanting to put all your trust/hope/faith in the regular old JSON.encode and JSON.decode methods. Reading between the lines, that is what I believe the author is frustrated with. It sounds like they want the equivalent of Python’s pickle or PHP’s serialize. But of course you will quickly run into limitations there, too.

The lesson being: we will all need to augment vanilla serialization tools with little tweaks and enhancements to fit our specific use cases. Again, blaming the tool is not productive. JSON is not the problem. JSON can do all of the things the author wants it to do. The issue is that JSON.stringify cannot.


> You can use field names or you can encode the type alongside. Ex:

> { "birthdate": ["date", "01-02-1991"] }

> And then in your codebase you know that all the values in your data are actually [type, value] pairs. In this specific case I'd define a "date" as "mm-dd-yyyy"

This addresses neither of the problems I referenced from the article. The problem you've chosen to demonstrate is, indeed, trivial.

> It sounds like they want the equivalent of Python’s pickle or PHP’s serialize.

They very much want the opposite. I wonder how you can read the article and conclude that they want an all-in-one serialization facility. The image at the end of the post well summarizes their position [0].

[0] https://einarwh.files.wordpress.com/2020/05/json-serializati...


This is how everyone is already using JSON.

This post is complaining about a problem that doesn’t exist.


Sadly, I can personally attest that the problem does exist, and that not everybody is performing serialization in such a reasonable, decoupled way.

It sounds like you agree with the article, but find its existence unnecessary. That's fine.


I have to agree with the other bloke: this is very much a problem with certain 'magical' JSON 'serializers' in certain very popular server stacks, e.g. Java with Spring and FasterXML (Jackson).

I've seen JSON serialization done right, where you ser/deser directly from/to a JSON object and then perform a manual mapping to domain or business objects, and I've seen it done the wrong way, where the domain and data objects are highly coupled by default.

I think the OP is bemoaning that being a common practice.


It can even happen without cycles. If object A contains two references to object B and you want to read A back into memory the same way (rather than have it contain two different but identical objects B) you have to deal with the issue of referential integrity. No cycles needed.

If your domain model is strictly that of rows in an RDBMS, there's no problem. Otherwise you're not storing objects but graphs of objects and overlooking that simple fact is the source of many serialization problems.
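
A common way to keep that referential integrity by hand is to serialize each shared object once under an id and have referrers point at the id. A minimal sketch (every name here is made up for illustration):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A holds two references to the same B. Instead of inlining B twice, emit a
    // table of objects keyed by id and let A refer to the ids, so a reader can
    // rebuild the shared reference (the same trick also handles cycles).
    class SharedRefSketch {
        record B(String id, String label) {}
        record A(B first, B second) {}

        static Map<String, Object> encode(A a) {
            Map<String, Object> objects = new LinkedHashMap<>();
            objects.put(a.first().id(), Map.of("label", a.first().label()));
            objects.put(a.second().id(), Map.of("label", a.second().label()));  // same key, stored once
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("objects", objects);
            doc.put("a", Map.of("first", a.first().id(), "second", a.second().id()));
            return doc;  // hand this to any JSON serializer
        }

        public static void main(String[] args) {
            B shared = new B("b1", "hello");
            System.out.println(encode(new A(shared, shared)));
            // e.g. {objects={b1={label=hello}}, a={first=b1, second=b1}} (inner map order may vary)
        }
    }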


It actually is a JSON problem, specifically because ...

> You gotta remember that JSON doesn’t support all of the types that many languages do. So you need to engineer around that

And now you have N+1 problems.

> You can’t get mad at JSON for not being good at this.

It's not about being mad at JSON or even critical. It's about recognizing that the format's simplicity pushes some logic into either your language, your codec/serdes or your application itself.

Unfortunately, what I've seen is that people opt for the "application itself" approach, do it completely ad hoc, and trust all kinds of godawful things that have no reason to be trusted.

If you're using JSON and represent richer types (edit to add: and almost every JSON API does; there's no Date type, so you're definitely making some assumptions about names or string [or, bless your heart, number] data), you should either:

- Use a transport format that extends it in a consistent way like Transit.

- Use the same codec logic on coupled client/server, and make your server's codec logic a first class part of your SDK which you should provide.

There are good tools for sharing and translating type definitions. Use those too. There are better tools even than that (one of which I'm working on, and hope to have a `Show HN` soon).


I think a shorter way of saying this is that everyone ad-hoc invents new formats that are subsets of JSON but don’t implement validators or parsers for them.


That’s true (except I assume you mean superset) but I think some of the shortening of it has a risk of masking just how awful that is. And in this case it’s not a JSON problem, it’s general data interchange design.

I’ve worked on too many applications that have HTTP body, query, header, cookie logic deep in business logic. Or distributed systems just pulling rando values out of messages. Or database queries making assumptions about blob data. It’s pervasive. Engineers routinely design systems where they overload the types of their tools and just whistle while their world burns.


> You gotta remember that JSON doesn’t support all of the types that many languages do. So you need to engineer around that by annotating your attributes with a type or relying on a naming convention on keys. You can’t get mad at JSON for not being good at this.

Oh I absolutely can get mad at it. JSON was not designed to be good at this. That’s the problem! Well, that and for various reasons (some good, some bad) programmers decided to use JSON for _everything_.

JSON sucks. If it didn’t exist the world would probably invent and standardize on something better. But it does exist, and it sucks, and we’re stuck with it.


I guess you haven’t spent a lot of time with XML? JSON is not perfect but I wouldn’t say it sucks.


XML is a dumpster fire. JSON merely sucks.


JSON replaced the previous iteration, XML, which also had a million ways to do things, plus rarely-used schemas, and miles of abstractions between your object and the text XML representation. The standard libraries for working with XML also had extremely insecure defaults around entities that were tantamount to remote code execution.

Overall, JSON simplified serialization tasks for the 80% case. The remaining 20% of special situations lead to the pain and suffering, especially in statically typed languages.


Yes, JSON was not designed for this sort of thing; it was designed to follow the Javascript object model, which is VERY limited. It was a quick and clever hack to get serialization working between Javascript systems, so it did mostly solve the problem of rehydrating whole objects (in JS). The only tricky problem is if you change the data structure and then try to rehydrate from an old document (the same problem you encounter in databases when you change the schema). And most use cases have done alright despite that.

But of course once you take JSON and try to make it work in other languages, you run into trouble with the lack of types. There have been many attempts at serialization formats that cover the 80% case, each with varying levels of success. The thing is, JSON and XML opened the door to a new requirement: human readability. Nowadays we want to be able to inspect and edit the data without needing some kind of specialized binary editor to do it. We want to be able to load it up into our text editor of choice, make changes, and expect it to work when fed into the machine. That's huge, and marked a paradigm shift in data communications and serialization (despite the terrible inefficiencies this introduces). But it's always been the lack of types that tripped up nontrivial use cases, requiring all sorts of unportable hacks and workarounds to shoehorn it into your particular use case.

I've been developing Concise Encoding [1] over the past 3 years specifically to tackle this problem. For the 80% use case, the format must:

- Be ad-hoc capable (most real-world use cases don't actually need a schema)

- Be human readable/writable as text

- Be efficient (don't waste energy constantly serializing/deserializing text)

- Support all common types (bool, int, float, date, URL, UUIDs, list, map, array, etc)

- Support recursive data

- Support pseudo-objects like comments and metadata

- Be precise (no lossy conversions from things such as floats)

Another pet peeve of mine is date formats. Pretty much every serialization format either gets it wrong or doesn't support all of the cases it should (this includes ISO-8601).

[1] https://concise-encoding.org


Your response is technically correct but completely useless.

Of course if we all just programmed in the JSON data model everything would be simple. The author correctly points out the uselessness of that. He correctly identifies that the complexity lies in providing a useful JSON serialization library that is both compliant in input/output yet flexible enough to be configured to go from JSON object model to business model.

It's actually a complex problem, going from one type system to another. Want to know how complex? Just look at the compilers and interpreters that transform instructions for complex data types down to x86 (or M1, or RISC, or whatever) opcodes.


You can’t model the entire world. Well, you can, but you can’t. Take a look at schema.org to see where that rabbit hole will take you. So instead we use more primitive tools that allow us to model our domain without needing a tool that magically knows how to do it for us. My domain has unique nouns that differ from someone else’s.

JSON offers building blocks of primitive types like integers and strings that can be used to build more sophisticated types like "timezone-aware datetime". We use ISO8601 for this. We can use that inside JSON. Now you and I can agree on what a date is.
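
For example, a trivial Java sketch using java.time (the "createdAt" field name is just an example):

    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;

    class IsoDateSketch {
        public static void main(String[] args) {
            ZonedDateTime now = ZonedDateTime.now();
            // Serialize: an ISO 8601 string both sides can agree on, offset included.
            String wire = now.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
            System.out.println("{\"createdAt\": \"" + wire + "\"}");
            // Deserialize: parse it back without losing the offset.
            ZonedDateTime back = ZonedDateTime.parse(wire, DateTimeFormatter.ISO_OFFSET_DATE_TIME);
            System.out.println(back);
        }
    }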

We could build this into the language itself - but many would argue the beauty of JSON is in the fact that it is very naive and can be used quite literally to model any scenario.


> We use ISO8601

Ah yes, the old "stringly typed" data model. https://gist.github.com/timvisee/fcda9bbdff88d45cc9061606b4b...

Application developers aren't trying to model the entire world, they are trying to model their domain, and I'm sorry, but primitive strings, numbers, lists, and maps are the building blocks of a domain model, but they are not sufficient. See for example https://fsharpforfunandprofit.com/ddd/


The F# post argues for marking optional strings, but that’s bad design. There’s basically never a real distinction between string absent and string blank, and if somehow your domain did have that distinction, you would need to record it as a bool or enum that clearly lays out “this is really blank and not just we don’t know it” or whatever.


> There’s basically never a real distinction between string absent and string blank

On the contrary, this comes up all the time; there are plenty of times when blank really is a valid value, and even if blank isn't a valid value, abusing it as a representation for "unknown" is bad for all the usual reasons in-band signalling is bad.

> you would need to record it as a bool or enum that clearly lays out “this is really blank and not just we don’t know it” or whatever.

An actual sum type is a million times better than using two awkwardly coupled fields. Type systems without sum types are a joke and there's no excuse for using them in the current millennium.
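
To make that concrete, a sketch of what the sum type could look like in a recent Java (21-style pattern matching; the middle-name domain is just borrowed from this thread for illustration):

    // Three distinct cases the type system forces every caller to handle:
    // a known value, a verified "there is none", and "we simply don't know".
    sealed interface MiddleName permits MiddleName.Known, MiddleName.None, MiddleName.Unknown {
        record Known(String value) implements MiddleName {}
        record None() implements MiddleName {}
        record Unknown() implements MiddleName {}

        static String display(MiddleName m) {
            return switch (m) {          // exhaustive: the compiler knows all the cases
                case Known k   -> k.value();
                case None n    -> "";    // genuinely blank
                case Unknown u -> "";    // displays the same, but search/merge logic can differ
            };
        }
    }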


Give examples. I have never seen a case where there was a business reason to distinguish a null vs blank string. Dates? Yes. Numbers? Yes. Strings? Nope, never.

Let’s start with the most common false example: middle names. What are you going to do differently in your app if someone has an unknown vs a blank middle name? … Nothing. The displays will be the same. The search will still have to account for ignoring middle names because a missing middle name might show up as “NMI” in an external system. There are no implications to having a missing middle name vs no middle name. You might need a flag for “date that the user verified that this is their full name” but that has nothing to do with null strings.


Error message "" is very different from no error message. Any legitimate use case for a null/absent string is an example; all those use cases are arguable and domain-specific, but they're a different thing from blank. For example if you're using null/absent to represent "unknown" then your search behaviour should be different from empty - for your example of middle names, a search with first=William/middle=Henry/last=Gates should return records with middle=null, but should not return records with middle="".


Why would you send a blank error message? Surely you would at least send “Unknown error”.


You probably wouldn't send "hlcrkorluatnsuh" either. But using that to represent "no error" would still be a bad idea.


I don’t understand what the structure you’re proposing is. There are a lot of ways of doing form validation. Typically you’ll have field levels errors and form level errors. In those cases, the no error state is represented by an empty list. I have never seen errors represented in a way that null means no error but blank string means unknown error. That would be a weird way to represent it.


Incidentally, the JS DOM represents no error on an input element as a blank string, not null: https://developer.mozilla.org/en-US/docs/Web/API/HTMLObjectE...


This reminds me of using DynamoDB, which didn't (until recently) accept the empty string as a value, instead helpfully offering to change it to null. However, there was a clear difference between a field in our database that a user had intentionally left blank and null as in the missing value which was usually indicative of some kind of scripting, API, or validation error. It is rather annoying that these assumptions get made.


Lots of programming languages treat 0 and null as the same thing. Numeric database columns, however, do not. This has resulted, at one of my employers, in a mix of different languages and systems treating 0 as equal to null. Sometimes, but not always. That inconsistency is a source of a fair number of bugs.


Numbers are totally different than strings. 0 is a distinct number in a way that blank is not a distinct string.


> There’s basically never a real distinction between string absent and string blank

Oh, child. If someone came to me claiming that something not existing was the same as that thing existing but being empty, I might be inclined to refer them to a remedial course on logic.


Give a business example for a string specifically being modeled where blank is valid and distinct from null which isn’t better represented by a separate column for entry state.


There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy


I too can quote something instead of addressing the topic of discussion. “Shimmy shimmy ya shimmy ya shimmy yay.”


It's not like everyone has the choice to not use JSON.


???

There is absolutely nothing inherent to JSON that would cause the woes the author is experiencing.

Any data can be serialized with JSON.


Can't any data also be serialized as a string? Does that mean "string payload" is good enough for exchanging data between systems?


Bingo.


I guess my point was that I'm rather on the side of "Ok then maybe it is still worth finding a serialization format that maps better to my domain, like perhaps that can serialize numbers with the same precision without having to route them through a string and back." This is like claiming that it's not worth comparing programming languages because machine code is Turing complete.


Everything is encoded in bits. What's like, the difference between anything?


> Any data can be serialized with JSON.

Yep, by writing ad-hoc and context-dependent formats on top of JSON for every language in your system. JSON's set of types is just too limited.


In Scala (and I assume many other languages), this is exactly how it works in the popular libraries. You design, or automatically derive, the mapping between your domain objects and a JSON AST. As the author mentions, the mapping between the AST and the string encoding is something you can consider to be a black box, unless you want to change settings like compact vs pretty printing.

To me, the most interesting question is the one the author alluded to: how to manage the coupling between arbitrary business objects and the JSON AST. Writing a ton of boilerplate code is not at all fun. But as he points out, attempting to automate the translation through metaprogramming results in a problematic coupling of internal system details and the external contract.

An idea I like is something more like code generation. Macros where the expansion gets written to disk, so hand edits can be made. There should be automatic verification of any changes to the contract, which can be done by generating an OpenAPI spec and version controlling it. Also, it should be really easy to differentiate between the default generated code and hand edits. There shouldn't be any need to crawl through all of the mapping code if there are far fewer changes relative to the stuff that was autogenerated.


Well, I don't know if the author is going to see this, but...

I'm a young programmer and I think you've just, I don't know, "opened" something in me, something much bigger than the topic of JSON mapping. Not sure yet how to express it, about the more general topic of abstraction, maybe.

You've made my world less simple but probably more correct. Thank you I guess...


Welcome! You'll be glad for it (and cranky about it). This sensation is what many of us feel when we find the incidental complexities of things designed to be simple. Once you find enough you start to spot them yourself, and then you see them everywhere. The ugly part is... you see them everywhere. The beautiful part is you don't just get pattern recognition, you get a familiar solution mapping.

I felt this way as a young programmer really getting my teeth in. I was also self-taught, so I didn't have familiarity with some things that would probably be considered basics/fundamentals.

My advice regardless is: when you get this unsettling mind-expanding feeling go research prior art. Go find out how other people solve problems like it. Even if you come up wanting more/better, at least you have a lay of the land. And learn the terminology used describing the problem space to expand your hunt. You'll be amazed what you turn up!

Edit: since this is on the topic of JSON (de)serialization, while I’d love to tout the very good pattern I see in my usual stack (TypeScript) where I’m working on an offering in the space, I’d actually recommend looking at prior art in a very different stack with very different goals:

- Transit[1] which standardizes type metadata within JSON (but leaves type resolution up to producers/consumers).

- EDN[2], which is the philosophical basis for Transit, written in Clojure syntax. It’s demonstrably worse for performance but syntactically a nicer format/DX if you have tooling to deal with it, and it’s nearly tooling-free if you use the stack.

A lot of efforts to standardize rich data type representation in JSON unfortunately do it very haphazardly, so I wanted to include examples that come from the “pattern recognition/solution mapping” side as an example. Both have downsides, but they’re exceptionally well designed for what they are and deserve to be part of this discussion.

[1]: https://github.com/cognitect/transit-format

[2]: https://github.com/edn-format/edn


Treasure those moments! They definitely get rarer over time (for many reasons: the amount of work (and so, time) required to reach another epiphany becomes greater, the amount of time you can dedicate to learning (rather than maintenance or leadership) decreases, and the likelihood that you'll be introduced to an interesting new problem in the course of your daily life becomes lower), but they're definitely worth it.


You might like this post, which is roughly on that topic: https://ideolalia.com/essays/composition-is-interpretation.h...

(I thought of it when I first saw your comment but couldn't remember where, then I found it closing some few-day-old tabs :) )


I did. Thank you for telling me about this experience. The world isn't simple, but at least it's interesting.


I can't articulate it either but I know exactly what you mean. I've had these moments of epiphany before. It's wonderful.


I think the word he needs is "serdes" or "serde". Just like codec is short for COder/DECoder and modem is short for MOdulator/DEModulator.

The other word he is looking for is marshaling.


You know, it never occurred to me to ask why Rust's popular serialization package was named https://serde.rs but now I can't un-see it. :)


> So what am I suggesting? I’m suggesting letting JSON serialization be about JSON only. Let JSON serializer libraries handle translating between text and a representation of the JSON object model. They can do that one job really well, quickly and robustly. Once you have that, you take over! You take direct control over the mapping from JSON to your own model.

I'm mainly immersed in the Scala world so I'm not sure what solutions in other languages look like, but this is essentially what Circe[0] does, and very well IMO. You write Encoders/Decoders for your domain model which serialize your data to the JSON object model, and the library takes care of serializing that to text. In most simple cases you can probably just use the (semi-)auto-derived Encoders/Decoders so you don't need to actually write them, but that's beside the point, which is that Circe does have this separation of concerns and makes working with JSON mostly painless (at least for me).

[0] https://circe.github.io/circe/


This problem tends not to occur in Clojure codebases because Clojure is a data-centric (vs. model-centric) language.

In other words, all our core data is expressed as json-like data (e.g. EDN), so the translation is quite trivial - basically from one format to the other.

Building a codebase around plain old data is admittedly not for everyone, but it works! You certainly can still have a type-like system (in dev/test time) and validation (in production) using the various available mechanisms (Spec, Schema, Malli).


This is one of the reasons I think FP is better than OOP for business development.

You can still have entities (that are plain maps, possibly annotated with types so you can keep your sanity), but the business logic now works with simple data structures: you put some data in and get more data back. Do this transformation a few more times and very complex business requirements can be expressed. Testing is also easy since most functions are pure.

Now you need to transform the data? Ok, convert it to another data format and pass it to the serializer. Same thing for deserializing.

Traditional OOP really makes this last step more complicated than it should be, since now you have an object, and the object should be opaque so you don't leak implementation details. To serialize, you possibly have a complex hierarchy of objects, so your serialization logic is more complex too.

Of course OOP doesn't have to be done this way, but the FP model makes it much simpler to enforce this data-processing style.


I'm not so sure whether this is truly a benefit of FP. Is it the same in languages like Haskell or OCaml? (I haven't used them in any significant way.)

As for Clojure, it's described as "data-driven programming", which is almost as important as its FP designation. I think it's that data-driven part that brings a lot of the benefits you're talking about.


Yes, smooth sailing in Haskell.

Define the datatype and let GHC generics and the Aeson library handle the rest.

The worst-case scenario is when keys in your JSON conflict with keywords or existing functions: 'id' and 'data' are pretty common. It means you need to rename your datatype fields to something like '_id' and '_data', and provide a mapping function to Aeson.

Where the 'fully automatic' declaration looks like:

    instance FromJSON Coord
The declaration where you remap to avoid name clashes could look like:

    instance FromJSON Coord where
        parseJSON = genericParseJSON defaultOptions { fieldLabelModifier = dropWhile (=='_') }


> I'm not so sure whether this is truly a benefit of FP.

Yeah, it is not. But like pure functions (which are also not necessarily an FP-only thing; you can have pure functions in probably any language out there), it is easier to apply in FP and also more idiomatic there.

Like, you can simply generate a map or dict in most languages and have it transformed and converted, but once things get complicated, probably someone somewhere will say in a pull request "this is not the idiomatic way in OOP, you should create a data object and store this data there" and blah blah blah. FP doesn't have this kind of baggage.


I've written enough Clojure over the years to think that this is probably not the case. The reason this problem doesn't happen as much with Clojure is that Clojure's data structures are closer to JSON (and when using EDN instead of JSON, identical). If your serialization object model is close to your in-code object model, translating between the two is easy. If your serialization object model is distant from your in-code object model, then translation becomes difficult.

You can directly serialize and deserialize plain old Java objects to and from a native format and it's trivial to do so, but serializing to and from JSON is trickier because JSON is meant to represent dynamically typed data stored in maps and lists, and Java doesn't naturally store data in that manner. Java's manner is more obtuse, but if you could be certain that both the producer and the consumer of an object were going to be Java, creating a serialization format that could be transparently used would be a straightforward task. That is, so long as both the source and target of your serialization are in the same language, the problem is a lot simpler, and if your language happens to be closer to Javascript (the source of JSON) it's also relatively simple to write a JSON parser for it.


I see more similarities than differences in our thinking :)

One can make the problem easier in Java by using HashMaps instead of classes, just like one could make it harder in JavaScript by emulating rigid OOP.

Which is to say, practically all languages have all the necessary technology for JSON not to be a pervasive problem. The only impediments are social/ideological.

(An idea which resonates with the original article's message)


You might expect your service to only receive a particular JSON format, but it could really receive any JSON, or any String, or any bytes, etc.

At some point in the program you want to rely on your data already having been validated. HashMaps are not a good fit for this.

For me the neatest way to address both of these is to have a very narrow conversion window. Transform the data from bytes/string/hashmaps into well-typed objects only within the controller.
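
Roughly like this, as a sketch (framework-agnostic pseudo-controller; all names are invented, and it assumes a Jackson version recent enough to bind records):

    import com.fasterxml.jackson.databind.ObjectMapper;

    // The only place raw strings/bytes are touched; everything downstream of the
    // controller works with a validated, well-typed object.
    class CustomerController {
        private final ObjectMapper mapper = new ObjectMapper();

        record CreateCustomerRequest(String name, String email) {}

        void handlePost(String rawBody) throws Exception {
            CreateCustomerRequest request = mapper.readValue(rawBody, CreateCustomerRequest.class);
            if (request.name() == null || request.name().isBlank()) {
                throw new IllegalArgumentException("name is required");
            }
            // From here on: no HashMaps, no raw JSON trees, just domain-facing types.
            // customerService.create(request);  // hypothetical downstream call
        }
    }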


In Python land, we use a library called marshmallow for mapping domain object attributes to JSON fields. In most cases, these mappings are a direct reflection of the associated domain objects, so the result is a bit verbose but still works nicely because you can easily version them as your model definition evolves over time.

I was under the impression that having an interface between models and JSON serialization was commonplace, given the popularity of marshmallow library in Python. Is this not the case in other languages and ecosystems?


Pydantic is another great library. Golang also has the builtin json struct deserialiser.


Mapping JSON to rigidly typed languages is also a bit of a problem. E.g. in Go you can use interface{}, at some performance cost, to handle arbitrary maps of (string) keys to arbitrary (interface{}) values.

Usually I've encountered JSON blobs that start with a map (dictionary / object / whatever) of keys to values, but they might also be lists, or maybe even just a string in the case of errors. The values can also be of any type.

For rapid prototyping or cases where performance is trivial, I would really like a language to have some kind of object tree type which can represent any of JSON's data types. It would only need to be converted / cast at the leaf level, when transforming data to/from JSON.
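
(For comparison, Jackson's tree model in Java is roughly the kind of thing I mean; a sketch, with the document contents made up:)

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    class TreeModelSketch {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // readTree doesn't care whether the root is an object, array, or bare string.
            JsonNode root = mapper.readTree("{\"items\": [1, \"two\", {\"three\": 3.0}], \"error\": null}");

            for (JsonNode item : root.path("items")) {           // iterate whatever is there
                System.out.println(item.getNodeType());          // NUMBER, STRING, OBJECT, ...
            }
            double three = root.at("/items/2/three").asDouble(); // convert only at the leaf
            System.out.println(three);
            System.out.println(mapper.writeValueAsString(root)); // and write it back out unchanged
        }
    }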

The same data structure would also be useful for manipulating any other 'trivial within memory' documents, like XML or anything else that easily fits within memory.

Extra bonus points if the interface type is also an option for external databases to wired up and mapped as an ORM.

Golang has struct tags that do this for XML and JSON mapping of native objects, but they tend to be clunky and I generally end up hating myself for using them on anything not trivial. Offhand I'm not sure if there's a more proper way of implementing it via configuration files / scripts or programmatically (build or runtime?)...

It'd be interesting to see if any languages have an effective and easy solution to this issue; any suggestions from others?


I've been pretty happy with Elixir's Ecto. I wouldn't call it easy, though. There is a LOT of ceremony, but it's designed to be declarative and it sure puts me at ease.

For database ingress, I use schemas, and put in aggressive validations in the changesets. For jsonb values, I use embedded_schemas (https://thoughtbot.com/blog/embedding-elixir-structs-in-ecto...), and cast their creation as a normal part of filling in the database object. For egress to a third party that accepts JSON, I use naked embedded_schemas (even better if they can be shared with the database) and use a protocol to build an custom encoder that can for example send the 3rd party a deeply structured JSON from a flat struct. Due to the way protocols work in elixir, when I send it to Oauth, Oauth will (indirectly) call my protocol code.


This sounds pretty sweet, do you happen to have any resources handy that dive deeper into these concepts?


unfortunately it's not super well documented. Some of the concepts are a bit new (~2 yrs or so), and I kind of stumbled into this as a best practice by accident literally in the last two months when I had to do this for work and was like, "oh, this is a thing".

This video helped: https://www.youtube.com/watch?v=k_xDi7zAcNM


This article got my attention. Related to what you are saying, in Java, the problem that I was really fed up with was creating domain-specific JSON object models to map the JSON documents into for use in code. In other words, mapping JSON to rigidly typed language structures. It's boilerplate, is tedious to do (as the author points out in the article), difficult to change and usually a pain. I solved this problem by creating unify-jdocs, which completely eliminates the need to create object models or POJO classes to represent your JSON object. You can read more about it here -> https://github.com/americanexpress/unify-jdocs I hope it helps you and others.


There is a beautiful json lib for F# called Thoth. Works on .NET and Fable.

    Encode.object [
        "blockId", Encode.guid data.Id
        "blockName", Encode.string data.Name
        "nestedType", Nested.Encode data.Nested
    ]

    Decode.object (fun get ->
        { Block.Id = get.Required.Field "blockId" Decode.guid
          Block.Name = get.Required.Field "blockName" Decode.string
          Block.Nested = Nested.Decode get })


The author understates the problem. Many half-baked solutions handle serializing, but don't consider how you're going to deserialize data. And if you do anything with numbers, you may need to guard against intermediate processors that truncate all numbers to 64-bit float accuracy.

I took a shot at it with https://pypi.org/project/json-syntax/ after doing some ad hoc solutions.

My approach was:

1. Don't serialize directly to JSON, but convert to the "jsonic" types.

2. Allow users to pick the rules they need, and make it reasonably easy to create new rules.

3. Generate picklable encoding and decoding functions.

It's worked pretty well for a project that has a very complex set of objects that need to be transmitted via JSON. The main difficulty is that it depends on interrogating Python's type annotations, which were pretty sketchy in earlier versions.


Creating Java object models to map the JSON documents into was the problem that I was really fed up with. In my work, we have hundreds of JSON documents to manage, and many a time the structure of the JSON document also changes. Managing the JSON object model classes in Java is all boilerplate which adds very limited value in my opinion. To solve this, I wrote unify-jdocs. You can completely eliminate the use of domain-specific object model classes and work directly on the JSON document (an intermediate construct similar to what the author is pointing to). You can read more about it here at https://github.com/americanexpress/unify-jdocs


"Databinding", though it also and now more commonly refers to UI-data mapping.


Naming fields is a small pain; the real PITA is that JSON does not have a native encoding for enum/variant/sum types. So now there are multiple ways to encode them: internally tagged, externally tagged, adjacently tagged, implicit/untagged. And JSON Schema does not solve that at all: you can write schemas for all of them, but they do not convey the same information in the end languages even if those languages have all the type capabilities.
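
To spell those out with a concrete (made-up) variant type, here is roughly what each convention does to the same value (the terminology is the one serde uses):

    // A made-up sum type: Shape = Circle(radius) | Rect(w, h)
    sealed interface Shape permits Circle, Rect {}
    record Circle(double radius) implements Shape {}
    record Rect(double w, double h) implements Shape {}

    // Four common JSON encodings of a Circle with radius 1.0:
    //
    //   externally tagged:  {"Circle": {"radius": 1.0}}
    //   internally tagged:  {"type": "Circle", "radius": 1.0}
    //   adjacently tagged:  {"type": "Circle", "value": {"radius": 1.0}}
    //   untagged/implicit:  {"radius": 1.0}   (the reader guesses from the shape)
    //
    // Each can be described with JSON Schema, but the schema alone doesn't tell a
    // code generator in every target language that these are one and the same sum type.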


Enum and variant/sum types are supported by a lot of serialization frameworks that have JSON serializations though. Thrift, Avro, Protocol Buffers, Flatbuffers etc.

JSON is not a framework for serialization any more than XML is.


Great point about how encoding/decoding custom objects/models to JSON is the biggest pain point. Coming up with requirements before starting on the Object->JSON encoding really helps you solve the more complex bottlenecks you might hit later in "serialization".

It's super interesting, especially when you are using floating-point numbers, dates, or C-style unions (represented as nested lists/dicts).


This is a problem space that crystal-lang handles with beautiful elegance [1]. Include a single module, get sane behaviour out-of-the-box, and specify anything domain-specific with type annotations.

[1]: https://crystal-lang.org/api/0.35.1/JSON/Serializable.html


How does it look if you have two different json schemas/versions that are both valid (maybe an old one and a new one)?


You specify the type to deserialise to. This lets you be very concrete about it if you have a versioned API endpoint or a different data source. Alternatively, if you deserialise into a type union, the first compatible type will be used. Finally, you can also use a discriminator field [1] to select an appropriate subtype.

[1]: https://crystal-lang.org/api/0.35.1/JSON/Serializable.html#u...


I mean, what if the type I deserialize into is the same (my current domain object type), but I have to deserialize two (or more) different versions of JSON into it?

I think what you linked is the case where my domain object type is either A or B - but that's not what I meant.


Those "floating point" numbers are part of the problem.

It's shocking how many multiply-adds you can do in the time it takes to serialize and deserialize a single float.

On top of that the numbers you are writing (say 0.1 or 0.2) don't really exist in the number system unless the number is a fraction over a power of two. So you get anomalies such as 0.1 + 0.2 != 0.3


JSON has no floating point numbers. It just has decimal numbers, which can be arbitrarily large and arbitrarily precise. If you deserialize them into floats and then are upset that they aren't exactly right, that's on you, not JSON. Deserialize them into some bigdecimal type if this is a problem for your domain.
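
In Java, for instance, Jackson can be told to do exactly that (a minimal sketch; the feature name is from memory, so double-check it):

    import com.fasterxml.jackson.databind.DeserializationFeature;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.math.BigDecimal;

    class BigDecimalSketch {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper()
                .enable(DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS);
            JsonNode node = mapper.readTree("{\"price\": 0.1}");
            BigDecimal price = node.get("price").decimalValue();  // exact decimal, no binary rounding
            System.out.println(price.add(new BigDecimal("0.2"))); // 0.3
        }
    }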


If JSON is transported through any intermediate processor, you have to assume they're translating them to floats and back again. If you treat them as arbitrary decimals, you have to expect that data is being silently corrupted.

And this includes intermediate processors that are invoked manually, so if someone uses a command like jq to simply test something, numbers are being truncated and they're baffled as to why they're getting different results.


The "decimal floats" specified in JSON are exactly what is wrong with it; Javascript approaches numerics like Applesoft basic did, you have floats that can often stand in for ints. The JSON specification promises one thing, but it's not supported by the 'reference implementation' that it is based on.

Also it is a lose-lose situation.

Not only are floats wrong in many ways (e.g. I think 0.1 + 0.2 = 0.30000000000000004 makes many people decide "computing is not for me") but parsing floats (really any ASCII numbers) is astonishingly slow, and people get "frog boiled" into accepting it. (E.g. there is no problem with the speed of parsing 10 floats, but when you are parsing a million floats you have a real problem, and a million floats is just 4 MB of core, well within the range that a web or other application could handle on anything bigger than an 8-bit microcontroller.)

Like the numerous problems that cause programmers to not use the SIMD instructions in your Intel CPU, there are multiple problems with floats, each of which can be dismissed by apologists, but when you add them up it's a major drag on the industry.


You're still complaining about floats, and now JavaScript, neither of which is relevant to JSON. JSON is perhaps named poorly, but it merely "was inspired by the object literals of JavaScript" (quoting the spec) - there is no reference implementation, and it is defined by spec.

I also don't really see what alternative you're implying. If you want a human-readable format, but need to express non-integer numbers, what do you suggest we should do?


But almost all JSON libraries parse them to binary doubles and don't even do that correctly.


Is this a language issue? Users of dynamic languages can have almost fully general "Data model mapping" in like 40 LOC. That is, the tasks of translating between json dicts and class instances and handling graphs that have multiple in-edges to the same object can just be done once. Are there other issues?


I've tried this approach, and it's not great. You either have to pack a lot of noisy information in the JSON for the deserializer to work reliably, or you kludge it to try and guess, and then you find yourself designing your data structures around all the kludges.

You inevitably run the risk that your deserializer can, effectively, make arbitrary function calls. And your data is tied not only to your language, but also needs to find classes in specific modules.


I don't think I need to run arbitrary code to take a dict and make it into an instance of a specific class whose members are that dict, but this may depend on your choice of dynamic language!

Edit: I guess you mean the deserializer is running arbitrary code to import the necessary classes if they are not already imported. I would prefer that it fail in this case. I think it's not too much trouble to ask users who insist on using classes to have them around before instantiating them.


I mean that fundamentally the deserializer has a lookup function and it's going to look up anything it's told to.

You're right, some languages are worse about this; e.g. Python's pickle will perform arbitrary code execution.

But it's easy enough to restrict it to a set of known good constructors. You still have the classic late-binding dilemma: it's flexibility we don't actually want. Why should my find_matching_socks function ever have to worry that it might be passed basketballs?

The usual answer is, "it's fine, it'll just crash if they do that."

It might crash, or it might hang, or silently corrupt something. An attacker can stick the wrong kind of object into your code anywhere they like.


If you don't mind, can you show us what this 40 LOC looks like?


I imagine the complexity is bound to the language you are using. From JavaScript writing a JSON serialization library wasn’t that challenging. Compared to various other web technologies serializing JSON was almost trivial.


You’re aware that json stands for “JavaScript object notation”, right? ;)


Sometimes I still miss when I worked with Java, and Jackson showed me that databinding was a solved problem. But it seems to me that library writers for other languages still haven't learned it.


I assumed the author was writing from a Java/Jackson POV. What was your reading of it? What do you use now that makes Jackson look good?


I think the article is missing the immense value of simply having a good default serialization.

For most data models expressed in most languages, you can define a single canonical JSON representation that can be easily provided by a library.

Then, there come the corner cases of course. And you do need to add a lot of code to handle them. But good defaults + some customizability is a very good strategy for many problems.

It is true that you can encounter problems where your data model is so far away from JSON that it's a better idea to just transform it entirely by hand. But the very reason JSON caught on is that these cases are rare in the industry.


Hardly anyone has a system written in only one language. Typically you'll have a mix of languages: some statically typed, some dynamic, some strongly typed, some not. The problem is not about serializing data from language A and deserializing back to language A, although that comes with enough problems (e.g. what if your language supports functions as first-class types?), but going from language A, serializing the data, and then deserializing it in a language that may or may not support something the source language has built in: floating point, complex types, UTF-8 strings, and so on.


Presumably, if you're exchanging data between languages A and B, you have already figured out how you want to map all of the data structures between them.

Then, what I would expect you do is that you have code in A that can translate Complex Data Structure -> Lowest Common Denominator Data Structure -> A built-in JSON serialization -> JSON -> B built-in JSON deserialization -> Lowest Common Denominator -> Complex Data Structure B.

Of course, the language/library can't help with the complex -> lowest common denominator translation part; that one will have to get written by hand. But that shouldn't mean that the lowest common denominator has to be as basic as JSON itself.


Essentially identifying the usefulness of a layered architecture with abstractions between each layer.


This sounds more like it's about using OpenAPI/protobuf/etc. and mapping language objects to some data format.



