Gojq: Pure Go Implementation of Jq (github.com/itchyny)
153 points by laqq3 on Aug 21, 2022 | 74 comments



"gojq does not keep the order of object keys" is a bit disappointing.

I care about key order purely for cosmetic reasons: when I'm designing JSON APIs I like to put things like the "id" key first in an object layout, and when I'm manipulating JSON using jq or similar I like to maintain those aesthetic choices.

I know it's bad to write code that depends on key order, but it's important to me as a way of keeping JSON as human-readable as possible.

After all, human readability is one of the big benefits of JSON over various other binary formats.


I bet it's an artifact of Go having a randomized iteration order over maps [0]. Getting a deterministic ordering requires extra work.

[0] https://stackoverflow.com/questions/9619479/go-what-determin...
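
For anyone who hasn't seen it, a minimal standalone demonstration of the randomized iteration (a toy snippet, not gojq code):

    package main

    import "fmt"

    func main() {
        m := map[string]int{"id": 1, "name": 2, "tags": 3}

        // Each run (and often each pass) may print the keys in a
        // different order; Go deliberately randomizes map iteration.
        for i := 0; i < 3; i++ {
            for k := range m {
                fmt.Print(k, " ")
            }
            fmt.Println()
        }
    }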


I used to have the exact same problem with Python, until Python 3.7 made maintaining sort order a feature of the language: https://softwaremaniacs.org/blog/2020/02/05/dicts-ordered/


Go actually went in the other direction for a bunch of reasons (e.g. hash collision DoS) and made key order quasi-random when iterating. Small maps used to maintain order, but a change was made to randomize that so people didn't rely on that and get stung when their maps got larger: https://github.com/golang/go/issues/6719


Right, the startling thing about Python's previous dict was that it was so terrible that the ordered dict was actually significantly faster.

It's like if you did such a bad job making a drag racer that the street legal model of the same car was substantially faster over a quarter mile despite also having much better handling and reliability.

In some communities the reaction would have been to write a good unordered dict which would obviously be even faster, but since nobody is exactly looking for the best possible performance from Python, they decided that ordered behaviour was worth the price, and it's not as though existing Python programmers could complain since it was faster than what they'd been tolerating previously.

Randomizing is the other choice if you actually want your maps to be fast and want to resist Hyrum's law, but see the absl experience - they initially didn't bother to randomize tiny maps but then the order of those tiny maps changed for technical reasons and... stuff broke. Because hey, in testing I made six of this tiny map, they always had the same order therefore (ignoring the documentation imploring me not to) I shall assume the order is always the same...


> In some communities the reaction would have been to write a good unordered dict which would obviously be even faster

Actually an ordered dictionary has improved performance over an unordered dictionary for the kinds of common Python workloads you encounter in the real world. The reason is that the design is only incidentally ordered; it arises from trying to improve memory efficiency and iteration speed. The dict ends up ordered because the real k/v pairs are stashed in a regular array which is indexed by the hash table, and populating that array is most efficient in insertion order. For pure "unordered map" type operations the newer implementation is actually a tiny bit slower.
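
Roughly that layout as a toy Go sketch (a hypothetical compactMap, not CPython's actual code): the hash table only maps keys to slots in a dense entries slice, so iteration follows insertion order as a side effect, and deletion has to leave a tombstone.

    package main

    import "fmt"

    // entry holds the real key/value pair in the dense slice.
    type entry struct {
        key   string
        value int
    }

    // compactMap: the index maps keys to positions in entries; the
    // entries slice is append-only, so it stays in insertion order.
    type compactMap struct {
        index   map[string]int // stand-in for the real hash table of indices
        entries []entry
    }

    func newCompactMap() *compactMap {
        return &compactMap{index: map[string]int{}}
    }

    func (m *compactMap) Set(k string, v int) {
        if i, ok := m.index[k]; ok {
            m.entries[i].value = v
            return
        }
        m.index[k] = len(m.entries)
        m.entries = append(m.entries, entry{k, v})
    }

    // Delete leaves a tombstone (cleared slot) so the remaining
    // indices stay valid; a real implementation compacts later.
    func (m *compactMap) Delete(k string) {
        if i, ok := m.index[k]; ok {
            m.entries[i] = entry{} // tombstone
            delete(m.index, k)
        }
    }

    func (m *compactMap) Each(f func(k string, v int)) {
        for _, e := range m.entries {
            if e.key != "" { // skip tombstones (toy convention)
                f(e.key, e.value)
            }
        }
    }

    func main() {
        m := newCompactMap()
        m.Set("id", 1)
        m.Set("name", 2)
        m.Set("tags", 3)
        m.Delete("name")
        m.Each(func(k string, v int) { fmt.Println(k, v) }) // id 1, then tags 3
    }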


The main thrust of your claim obviously can't be true and I'm not sure what confusion could lead you to believe that.

Maybe it's easier to see if we're explicit about what the rules are: OrderedDict (now the Python dict) has exactly the same features as a hypothetical UnorderedDict, except OrderedDict has the additional constraint that if we iterate over it we get the key/values in the order in which they were inserted, while UnorderedDict can do as it pleases here.

This means OrderedDict is a valid implementation of UnorderedDict. So, necessarily OrderedDict does not have, as you claim, "improved performance over an unordered dictionary". At the very worst it's break even and performance is identical. This is why it's remarkable that Python's previous dict was worse.

But that's a pretty degenerate case; we can also see that after deletion, OrderedDict must spend some resources ensuring the ordering constraint is kept. An UnorderedDict needn't do that, so we can definitely do better than OrderedDict.


>The main thrust of your claim obviously can't be true

It's surprising that iterating a dense array is faster than iterating a hashmap? I don't think you are parsing the parent post correctly.

If dictionaries are commonly iterated in python, then iterating an array of 100 items that fits in one cache-line will be faster than iterating a hashmap which might have 100 items in 100 cache lines.


The claim was that: "Actually an ordered dictionary has improved performance over an unordered dictionary"

Having a dense array is not, as it seems both you and chippiewill imagine, somehow a unique property of ordered dictionaries. An unordered dictionary is free to use exactly the same implementation detail.

The choice to preserve order is in addition to using dense arrays. The OrderedDict must use tombstones in order to preserve ordering, and then periodically rewrite the entire dense array to remove tombstones, while a hypothetical UnorderedDict needn't worry because it isn't trying to preserve ordering so it will be faster here despite also having the dense arrays.

"iterating an array of 100 items that fits in one cache-line will be faster"

On today's hardware a cache line is 64 bytes, so fitting 100 "items" (each 3x 64-bit values, so typically total 2400 bytes with today's Python implementation) in a cache line would not be possible. A rather less impressive "almost three" items fit in a cache line.

But, to be sure, the dense array is faster for this operation; the problem is that that's not an optimisation that results from being ordered. It's just an implementation choice, and the UnorderedDict is free to make the same choice.


I enjoy and appreciate the discussion. Is this in the same neighborhood as why the reworked dict implementation in Python 3.6 had insertion order as an implementation detail, but explicitly claimed it was not a feature that should be relied upon? At least until Python 3.7 cemented the behavior as a feature.


The problem with Python here is that CPython is not only the reference implementation but the de-facto specification. So dicts are still "supposed to be" unordered collections, but now dicts must also preserve insertion order as per the docs and the reference implementation, so now all alternative implementations must also conform to this even if it doesn't make sense for them to conform to it, or they must specifically choose to be non-conformant on this point.

Of course in this case, the order-preserving optimization was actually first implemented by an alternative implementation (PyPy), but I don't think that changes the issue.


Since Python 3.7, preserving insertion order has been part of the language specification.

"the insertion-order preservation nature of dict objects has been declared to be an official part of the Python language spec."

https://docs.python.org/3/whatsnew/3.7.html


Right, but that's kind of my point. Adding it to the language spec now creates an additional and frankly somewhat unnecessary point of compliance for other implementations. Python is already so damn big and complicated; my opinion is that we shouldn't make its spec even more complicated, even if its reference implementation adds more features like this.


> Right, the startling thing about Python's previous dict was that it was so terrible that the ordered dict was actually significantly faster.

I've never heard that before and it would be really surprising, given that Python's builtin dict is used for everything from local symbol to object field lookup. Do you have more information?


Here’s a description of the new map implementation and why it’s more efficient: https://www.pypy.org/posts/2015/01/faster-more-memory-effici...


Note that this applies more to Python than to efficient languages. In Python, objects are big and require an indirection. In faster languages, many objects can be smaller than a pointer and stored inline. As such, dictionaries with vectorized lookups can generally be made faster.


`select {...}` statements with multiple ready channel operations also pick a case randomly.

I really like it, it helps you discover (and fix) order-dependent logic WAY earlier. Though I would really like some way to influence how long it blocks before selecting one (to simulate high load scenarios, and trigger more logical races).
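
A tiny sketch that makes the randomness visible (both channels are ready on every iteration):

    package main

    import "fmt"

    func main() {
        a := make(chan int, 1)
        b := make(chan int, 1)
        counts := map[string]int{}

        for i := 0; i < 1000; i++ {
            a <- 1
            b <- 1
            // Both cases are ready; the runtime picks one pseudo-randomly.
            select {
            case <-a:
                counts["a"]++
            case <-b:
                counts["b"]++
            }
            // Drain whichever channel was not selected.
            select {
            case <-a:
            case <-b:
            }
        }
        fmt.Println(counts) // roughly 50/50 between a and b
    }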


you'll need an interrupt chan for that if you're in a select{..}


This burnt me when I wrote an algorithm. I depended on the order of keys in dicts, as it allowed me to reference the value both by index and by key.

I wrote the code in Python 3.7+, and ended up spending a good amount of time debugging it when I ran it in an earlier Python version.


Does Go not have more than one Map implementation in the standard library?


It does not. Maps are not even a real interface you can implement, it's compiler magic encoded in the language spec: https://dave.cheney.net/2018/05/29/how-the-go-runtime-implem...

This is all fallout of not having generics.


I would have never expected that a language (especially a compiled one) does that kind of fuckery behind the scenes.


I fucking hate this so much. Honestly Go isn't a bad language, but I dunno why these kind of things just piss me off.


It seems as though a lot of people view it as hypocritical, e.g., generics for me but not for thee (dated example since there are now generics for everyone).

The fact that they needed to make a map a part of the language in order to allow it to be generic and statically-typed proves that generics are useful and should therefore have been a language feature much earlier than they became one.

There are a variety of things that the standard library or compiler deal with using weird workarounds that seem to indicate missing language features.

The thing is, the features are only "missing" if the language is designed to do the things those features permit. So the counterargument is that Go is a very opinionated language designed to solve a few classes of problem very easily, like writing database-backed web services, and the reason the standard library or compiler teams have to do weird hacks at times is because Go wasn't made for writing those things, and designing to make those use cases easy would pollute the language from the perspective of someone using it for its intended purpose.


It's really hard to take those arguments without a whole serving of salt, because the things that it's ostensibly good at handling really aren't that much easier. Why is (un)marshalling data in a type safe way so damn hard in Go? Why does doing the same thing over and over and over and over never get easier? (Because the language lacks high-level abstractions.)

I used Go extensively at my last job and I was left feeling that there were pretty much always better choices. If you care about developer velocity with your more unskilled engineers, Go is a bad choice for a multitude of reasons. If you're going to need the performance over something else, the JVM is right there, and so too is Rust.


It's fair to just say "Well it's not for that" if you're not a general purpose language.

Like it sure is hard to write a grammar checker in WUFFS. Well, it's not for that, it's a special purpose language, it doesn't even have strings, stop trying to write a grammar checker.

For a general purpose language this is a poor excuse. I think the best argument might be a desire to avoid the Turing Tar-pit where everything is possible but nothing is easy. C++ allows you to write a type that's generic over the floating point values, so e.g. Foo<NaN>. What does that mean? Nothing useful. But they could so they did.

In avoiding the Turing Tar-pit you must make some choices which weren't necessary but were, in your opinion (as Benevolent dictator, Steering Committee, Riotous Assembly Of Interested People, or whatever) more aesthetic, more practical for some particular purpose you had in mind, easier to implement or whatever.

My impression with Go, which I spent a few years programming in but never really loved, was that its main value was in being surprisingly efficient for that type of language. Startup time in particular is good, which would matter for gojq, for example.


Well, keep in mind that it’s out of date; Go does have generics now.


So, golang is no longer suitable to nicely process json documents.


Nobody forces golang programmers to use the built-in map.

Also, some people would argue any map is unsuitable for many use cases of jq. If you want to keep the keys in the order of the input file, it certainly isn’t.

And yes, formally, JSON doesn't change when you reorder keys, but JSON is often treated as text, and then it does. You can use jq, for example, to do some transforms on a JSON file and get a nice git diff.

This tool may (or may not) produce a diff that’s much larger than necessary.


No, it isn’t. “gojq does not keep the order of object keys” isn’t about ordering keys consistently across runs, it’s about keeping them in the order of the input file.


Which it can't do because, as mentioned, Go randomly iterates over maps. That's the data structure that most would use to load arbitrary input files into the program.


If you have a hammer in your toolbox, it doesn't mean you have to use it for every job. If golang's maps don't do what you want, pick a different data structure.

This is like claiming that, because updating native integers isn’t guaranteed to be atomic in a language, you can’t do multi-threaded programming.


GP is correct, per the README:

> gojq does not keep the order of object keys. I understand this might cause problems for some scripts but basically, we should not rely on the order of object keys. Due to this limitation, gojq does not have keys_unsorted function and --sort-keys (-S) option. I would implement when ordered map is implemented in the standard library of Go but I'm less motivated.

And later in the same file:

  gojq does not support some functions intentionally;
  <snip>
  --sort-keys, -S (sorts by default because map[string]interface{} does not keep the order),


No, they aren’t. haasted replied to simonw’s

> "gojq does not keep the order of object keys" is a bit disappointing

with

> “I bet it's an artifact of Go having a randomized iteration order over maps. Getting a deterministic ordering requires extra work.”

But deterministic iteration order doesn't imply that the order of keys is kept the same. There are map implementations whose iteration follows insertion order, but the canonical map does not guarantee that. https://en.wikipedia.org/wiki/Associative_array#Hash_table_i...:

“The most frequently used general purpose implementation of an associative array is with a hash table: an array combined with a hash function that separates each key into a separate "bucket" of the array“

Such implementations iterate over maps in order of hash value (and hash collisions may or may not follow (reverse) insertion order)


I don't think the distinction you're trying to make is helpful here, if I've understood you correctly. A good faith interpretation of haasted's comment would be that they were thinking of "insertion order" when they said "a deterministic ordering". Even if we were being pedantic, their comment is still correct - for iteration order to be the same as input order, deterministic iteration ordering isn't sufficient (this seems to be the point you're making) but it is necessary.

Their first sentence:

> I bet it's an artifact of Go having a randomized iteration order over maps

is correct per Gojq's author [0]:

> gojq cannot implement keys_unsorted function because it uses map[string]interface{} to represent JSON object so it does not keep the order. Currently there is no plan to implement ordered map so I do not implement this function.

It would of course be possible to work around this limitation of Go's built-in map type but that's not the point. The author makes it clear that this limitation is the cause for Gojq's behaviour.

[0] https://github.com/itchyny/gojq/issues/50


Yeah, this is a deal breaker. While technically the key order doesn’t matter, in the real world it really does matter. People have to read this stuff. People have to be able to differentiate between actual changes and stuff moving around just because. Luckily it’s a solved problem and you can write marshalers that preserve order, but it’s extra work and generally specific to an encoding format. It would be nice to have ordered maps in the base library as an option.
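
For example, a minimal sketch of such a marshaler in Go (a hypothetical OrderedMap type, not gojq's code): keep the keys in a slice alongside the map and implement json.Marshaler yourself.

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
    )

    // OrderedMap remembers insertion order in a separate slice.
    type OrderedMap struct {
        keys   []string
        values map[string]any
    }

    func NewOrderedMap() *OrderedMap {
        return &OrderedMap{values: map[string]any{}}
    }

    func (m *OrderedMap) Set(k string, v any) {
        if _, ok := m.values[k]; !ok {
            m.keys = append(m.keys, k)
        }
        m.values[k] = v
    }

    // MarshalJSON writes the keys in insertion order instead of
    // letting encoding/json sort them.
    func (m *OrderedMap) MarshalJSON() ([]byte, error) {
        var buf bytes.Buffer
        buf.WriteByte('{')
        for i, k := range m.keys {
            if i > 0 {
                buf.WriteByte(',')
            }
            kb, err := json.Marshal(k)
            if err != nil {
                return nil, err
            }
            buf.Write(kb)
            buf.WriteByte(':')
            vb, err := json.Marshal(m.values[k])
            if err != nil {
                return nil, err
            }
            buf.Write(vb)
        }
        buf.WriteByte('}')
        return buf.Bytes(), nil
    }

    func main() {
        m := NewOrderedMap()
        m.Set("id", 1)
        m.Set("name", "example")
        out, _ := json.Marshal(m)
        fmt.Println(string(out)) // {"id":1,"name":"example"}
    }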


It does not even provide a `--sort-keys` option. That's like 90% of the reason I ever lean on jq - to standardize API output for my human brain.


Best 3rd party Map library that I've found. https://github.com/cornelk/hashmap


Agree, this is deterring me from this tool. Many languages/tools nowadays guarantee object key order which is convenient in many ways.


For what it's worth, JSON Objects are not guaranteed to be ordered. Maps in many different languages are implemented without an order.


It's not about code, it's about predictable and consistent layout so that you can easily diff.


This. For one project, I even wrote a tool to reorder keys into a specific order. Of course there's no technical reason for this. But I used JSON there for human readability, and so that non-technical people have the best chance to understand and change the data. And for that, starting with id and name at the top matters more than leading with a huge array of data.


Ordered keys in JSON are not only there for cosmetic reasons. If this ever touches disk, you want the ability to diff the files or stash them in git without the whole file changing with every update.


jq used to do this, but changed to preserve key order.


Not implementing key-sorting is a curious decision:

> gojq does not keep the order of object keys. I understand this might cause problems for some scripts but basically, we should not rely on the order of object keys. Due to this limitation, gojq does not have keys_unsorted function and --sort-keys (-S) option. I would implement when ordered map is implemented in the standard library of Go but I'm less motivated.

I feel like --sort-keys is most useful when it is producing output for tools that do not understand JSON - for example, generating diffs or hashes of the JSON string. There is value in the output formatting being deterministic for a given input.
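
Worth noting that Go's encoding/json itself sorts map keys when marshaling, which lines up with gojq sorting by default; a quick illustration:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        m := map[string]int{"zebra": 1, "apple": 2, "mango": 3}
        // encoding/json sorts map keys, so the output is deterministic
        // regardless of Go's randomized map iteration order.
        out, _ := json.Marshal(m)
        fmt.Println(string(out)) // {"apple":2,"mango":3,"zebra":1}
    }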


I agree with you that there's value to sorted keys from a presentational standpoint (we are not beep-boop robots, humans have to read this stuff too), but now there also exists a JSON canonicalization RFC that tools can/should follow (with all the usual caveats about canonicalization being fraught): https://www.rfc-editor.org/rfc/rfc8785


I guess "Informational" is better than /dev/null, but unless everyone adopts it, doesn't that run the risk of it just being My Favorite Canonicalization™?

Either way, I'm guessing if the gojq author has that much heartburn about implementing --sort-keys, --canonical is just absolutely off the table :-(


> unless everyone adopts it doesn't that run the risk of it just being My Favorite Canonicalization

That's true regardless. The IETF has no enforcement arm. Even if people expend the effort to agree a standard, and make whatever signs and follow the rituals, if nobody implements it then de facto that isn't the standard after all.


Thank you for letting me know! I hadn't thought to look.


You can always add this feature but it's problematic to remove it.


Could pipe through gron and sort to re-sort


That helps when you want to sort by key, but not when you want to keep the order of object keys as in the input file.


Gron has the same issue, as it too is written in go and randomizes key order.


I have actually fully replaced my jq installation with gojq (including an `ln -s gojq jq`) for a few years, and no script has broken so far. I'm super impressed by the jq compatibility.

If you are going down this route, do be careful with performance. I don't know which is more performant as I've never really had to work with large data sets, but I can't help but feel jq will be faster than gojq in that case. I have no benchmarks backing this up, but who knows, maybe someone will benchmark both.

One of my favourite features is the fact that error messages are actually legible, unlike jq.


It's very possible it could be faster; jq seems to actually be fairly unoptimized. This implementation in OCaml was featured on HN a while back and it trashes the original jq in performance: https://github.com/davesnx/query-json

After seeing that one I did my own (less-complete) version in Rust and managed to squeeze out even more performance in the operations it supports: https://github.com/brundonsmith/jqr


Working with large JSON files is hard to parallelize. Just filtering the objects in a root array can take a very long time. jqr and gojq both die with OOM when running on large files like

https://dumps.wikimedia.org/wikidatawiki/entities/latest-all...

A fast tool to split a JSON file like that into a format with one JSON document per line would already help a lot.
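
For the top-level-array case, a streaming split is fairly straightforward in Go with json.Decoder, reading one element at a time instead of loading the whole file (a rough sketch; I haven't tried it on the wikidata dump):

    package main

    import (
        "bufio"
        "bytes"
        "encoding/json"
        "log"
        "os"
    )

    // Reads a top-level JSON array on stdin and writes one element per
    // line (JSON Lines), without holding the whole array in memory.
    func main() {
        dec := json.NewDecoder(bufio.NewReader(os.Stdin))
        out := bufio.NewWriter(os.Stdout)
        defer out.Flush()

        // Consume the opening '[' of the top-level array.
        if _, err := dec.Token(); err != nil {
            log.Fatal(err)
        }
        var buf bytes.Buffer
        for dec.More() {
            var elem json.RawMessage // keep each element as raw bytes
            if err := dec.Decode(&elem); err != nil {
                log.Fatal(err)
            }
            // Compact in case the input is pretty-printed across lines.
            buf.Reset()
            if err := json.Compact(&buf, elem); err != nil {
                log.Fatal(err)
            }
            out.Write(buf.Bytes())
            out.WriteByte('\n')
        }
        // Consume the closing ']'.
        if _, err := dec.Token(); err != nil {
            log.Fatal(err)
        }
    }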


Mine can :)


jq is particularly bad at large stream processing


To see if gojq works even with complex jq programs, I tested it on my wsjq[0] Whitespace language interpreter, which uses most of the advanced jq features. It impressively appears to support the full jq language, though I uncovered a bug[1] in gojq.

gojq's arbitrary-precision integer support will be useful (jq just uses 64-bit floating-point), though I suspect it will have performance regressions, since it uses math/big, instead of GMP.

[0]: https://github.com/andrewarchi/wsjq

[1]: https://github.com/itchyny/gojq/issues/186
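
On the precision point, a quick illustration of what float64 loses and math/big keeps (a standalone snippet, not gojq code):

    package main

    import (
        "fmt"
        "math/big"
    )

    func main() {
        // 2^53 + 1 cannot be represented exactly as a float64.
        f := float64(1 << 53)
        f++ // no effect: the value rounds back down to 2^53
        fmt.Printf("%.0f\n", f) // 9007199254740992 (the +1 is lost)

        i := new(big.Int).Lsh(big.NewInt(1), 53)
        i.Add(i, big.NewInt(1))
        fmt.Println(i) // 9007199254740993
    }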


jq uses bison (GNU's yacc), which is a nightmare for error diagnosis. Additionally, the founder (though brilliant - or maybe because brilliant) wouldn't accept improvements in error reporting.


Naming is hard, but please, do not repeat the mistake of many OSS projects in the last 20 years of naming each project by prefixing the name with the stack/environment involved.

Today a "trending" language can catch attention, but tomorrow? Maybe. So the value proposition, and the name that follows from it, should be different (if you want adoption).

For my use case, for a rewrite of jq I would expect one thing only: higher performance... show the numbers ;)


I'd also expect higher performance from a rewrite of jq or, for that matter, of any other tool that works as expected and has been used for a long time.


Last time I profiled jq in my particular use case - querying large GeoJSON files - I discovered it spent practically all of its CPU in assert, and it went a lot faster when built with -DNDEBUG, but since I could not rule out that some of its asserts have side effects I went back to the upstream package.

I think beating the performance of jq would be very easy for anyone who set out with that as a goal. It also has its own internal strtod and dtoa which are easily beaten by ryu or C++'s from/to_chars, so I would start there after dumping the weird asserts.


This looks quite cool! I'm not sure though why I would use this over the original jq. However, I can definitely see the value in embedding this into my own applications, to provide jq scripting inside of them.

Shameless plug: As I'm not a fan of the jq syntax, I've created jql[0] as an alternative to it. It's also written in Go and presents a lispy continuation-based query language (it sounds much scarier than it really is!). This way it has much less special syntax than jq and is - at least to me - much easier to compose for common day-to-day JSON manipulation (which is the use case it has been created for; there are definitely many features of jq that aren't covered by it).

It might seem dead, as it hasn't seen any commit in ages, but to me it's just finished, I still use it regularly instead of jq on my local dev machine. Check it out if you're not a fan of the jq syntax.

[0]: https://github.com/cube2222/jql


One reason to prefer gojq is that gojq's author is one of the most knowledgeable people about the original jq (as seen in GitHub PRs and issues), and his gojq fixes many long-standing issues in jq.

Plus, for my use cases, gojq runtime performance beats jq by a fair margin.


i neither know nor care what language the original jq was implemented in.


I can think of two reasons it matters here:

- Can be used as a library in Go projects (rough sketch below)

- Memory-safe (could be relevant when processing foreign data, esp as a part of some automated process)
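
For the library point, the embedding API looks roughly like this (a sketch adapted from the project README; check the docs for exact signatures):

    package main

    import (
        "fmt"
        "log"

        "github.com/itchyny/gojq"
    )

    func main() {
        query, err := gojq.Parse(".users[] | .id")
        if err != nil {
            log.Fatalln(err)
        }
        input := map[string]any{
            "users": []any{
                map[string]any{"id": 1},
                map[string]any{"id": 2},
            },
        }
        iter := query.Run(input) // results come back through an iterator
        for {
            v, ok := iter.Next()
            if !ok {
                break
            }
            if err, ok := v.(error); ok {
                log.Fatalln(err) // query errors are returned as values
            }
            fmt.Println(v) // 1, then 2
        }
    }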


Yep, Benthos is an example of a cool project that uses gojq for its jq syntax support.


FYI, jq was originally implemented in Haskell.


Why use a special syntax that's hard to remember when you can just use Python?

I wrote a jq-like that accepts Python syntax called pq: https://github.com/dvolk/pq

So you can write stuff like:

    $ echo '{ "US": 3, "China": 12, "UK": 1 }' | pq -c "sum(data.values())"
    16


i deny the premise obviously


Go is not the choice I would make when writing a parser for JSON, good luck though.


why?


RIIR? I mean, RIIG? It would be fun to do so :)

But it seems that being able to embed it into Go applications is a nice positive.



