One of the best tools I've found for manipulating JSON via JsonPath syntax is https://jsonata.org
In addition to simple queries that allow you to select one (or multiple) matching nodes, it also provides some helper functions, such as arithmetic, comparisons, sorting, grouping, datetime manipulation, and aggregation (e.g. sum, max, min).
Thanks for this! The 5-min intro really undersells it, as that part is not that much better than JSONPath. I started watching the London Node User Group video next and was about to switch it off, but decided to hang on a bit longer because I was busy with something anyway. Then it finally started getting into its real differentiators: constructing new objects and reductions.
I even had to click around the documentation link three times until I figured out how to get to the real docs. It's a really well thought out query language that is actually Turing complete. I really want to try it out on some real data in anger to see how well it holds up in practice. I'm also thinking about what could be taken over into PRQL which I'm somewhat involved in.
Not sure if this is a common problem, but I built a tool to help me quickly understand the main schema, and where most of the data is, for a new JSON file given to me. It makes the assumption that sometimes peer elements in a list will have the same structure (eg they'll be objects with similar sets of keys). If that's true, it learns the structure of the file, prints out the heaviest _aggregated_ path (meaning it thinks in terms of a directory-like structure), as well as giving you various size-per-path hints to help introduce yourself to the JSON file:
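Not the actual implementation, just a sketch of the aggregation idea in Python (the file name is a placeholder): serialize each subtree, attribute its size to an "aggregated" path where list indices are collapsed into a single "[]" segment, and print the heaviest paths:

    import json
    from collections import defaultdict

    def path_sizes(node, path="$", sizes=None):
        # attribute serialized size to an aggregated path,
        # collapsing list indices into a single [] segment
        if sizes is None:
            sizes = defaultdict(int)
        sizes[path] += len(json.dumps(node))
        if isinstance(node, dict):
            for key, value in node.items():
                path_sizes(value, f"{path}.{key}", sizes)
        elif isinstance(node, list):
            for item in node:
                path_sizes(item, f"{path}[]", sizes)
        return sizes

    with open("data.json") as f:
        doc = json.load(f)

    for p, size in sorted(path_sizes(doc).items(), key=lambda kv: -kv[1])[:10]:
        print(size, p)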
Is there a general name for the kind of data structure JSON represents?
We see this kind of nested data all over the place (json, yaml, python dictionaries, toml, etc, etc) and I’m thinking wouldn’t it be nice if we had a path language that worked across these structures, just like how we can regex any strings?
So we can have a pathql executable that we can feed yaml and json data to, but I can reuse the query in Python when I want to extract values from a json stream I just deserialized.
Is it anything else than a tree with properties for each node?
I think you could well apply jsonpath to yaml, except for the different data types, which is what makes you need xpath, jsonpath, file paths, css selectors and so on. If you're willing to do some automagic conversions, you could probably do that right now, if you write the code for it.
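For example, with a bit of Python glue (assuming the PyYAML and jsonpath-ng packages; the file name and query are placeholders), the "automagic conversion" is just loading YAML into plain dicts/lists and querying from there:

    import yaml                    # PyYAML
    from jsonpath_ng import parse  # jsonpath-ng

    doc = yaml.safe_load(open("config.yaml"))  # YAML becomes plain dicts/lists
    expr = parse("services[*].image")          # ordinary JSONPath from here on
    print([m.value for m in expr.find(doc)])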
They vary in syntax and how they deal with scalars, objects and collections. Having to write quoted keys ("myKey") in json, for example, is a bit unfortunate. Still, for any document larger than 20 lines yaml will fall apart easily.
Xml (dialects) mark the beginning and end of nodes explicitly, which deviates from json/yaml. Xml can represent nodes within the value, which is impossible for json/yaml. To convert such xml value to the latter, you have to break up such a node in fragments and represent them as a collection in json/yaml.
> Xml can represent nodes within the value, which is impossible for json/yaml. To convert such xml value to the latter, you have to break up such a node in fragments and represent them as a collection in json/yaml.
Are you talking about XML like `<text>Something something <para> inside </para> something else</text>`? I thought this would also be presented as the text element having three children: the text "Something something", the <para> tag with its subtree, and the text "something else". Am I misremembering?
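A quick check with Python's minidom suggests the three-children model is right, for what it's worth:

    from xml.dom.minidom import parseString

    doc = parseString("<text>Something something <para> inside </para> something else</text>")
    for node in doc.documentElement.childNodes:
        print(node.nodeName)
    # #text
    # para
    # #text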
Agreed, a lot of the features that make XML very nice for embedding tags in text documents make it very inefficient at expressing explicit tree data structures.
That is half the reason. JSON was limited to Javascript when SOAP/WSDL was king for data exchange. If I remember correctly, XML APIs predate JSON in Javascript. IE5 and XML were really hot back then. xml.loadString() is equally simple.
Json got introduced because XMLHttpRequest was an IE-only component. Data exchange was done mostly between servers; back then the front-end was really dumb. XML got slowly replaced because, for machine-generated data, JSON proved to be sufficient. Any JSON library is easier to work with than its XML counterpart, exactly because of what we discuss here.
The complicated part is that this tree has different types of nodes. Some support properties and children, some don't. So I would call it a chimera tree or mixed tree, or something like that.
For a query language, you can generalize to all nodes supporting all things. You will be able to write some queries that will never return results, at least in a given input data format. That does not seem so bad.
Coincidentally, this is exactly what XPath and XQuery do. E.g. text nodes don't have children nor attributes, yet it is perfectly legal to do text()/foo or text()/@foo - and the result is an empty sequence.
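A quick way to see that behaviour (here via lxml, but any XPath 1.0 engine should behave the same way):

    from lxml import etree

    root = etree.fromstring("<root>hello<child/></root>")
    print(root.xpath("text()/foo"))   # [] - text nodes have no children
    print(root.xpath("text()/@foo"))  # [] - ...and no attributes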
I think one popular name, independent of the format, is to call them documents. Or more specifically, the type of database which use them often runs under that term. Maybe specify it as data-document to not confuse it with freeform office-documents.
But one problem is that each format works in slightly different ways. Some have nodes which support properties, some don't. This makes building a proper query-language a bit more complicated if it should not end up ugly.
What's the problem with nodes supporting or not supporting some properties? For querying you can treat all nodes as supporting all things, and the user just needs to write a query that actually works, which is what they would need to do anyway.
Yes, if you want to make the query language reject as incorrect queries that specify a property on a node that can't have properties, it is messy. But is this really necessary? You can write faulty queries anyway. And it's not a programming language for complex systems, where it really helps to prevent mistakes early. SQL works fine with no type safety and so on.
The problem is that you can't easily reuse a query across different formats with a shallowly designed language. You will still end up writing format-targeted queries, which then opens the question of why you would even bother using a watered-down language in the first place, instead of a format-optimized one.
XML is for data-exchange, competing with JSON and others. So I have no problem putting it to the data-documents, even though it's more a frankenstein. But HTML is an office-document, used for freetext, nobody really should use it for data, even though sometimes it's used that way.
So are JSON and YAML. The point is whether you have programmatic and structured handling of a document's content, or whether it's random, where every line and word can demand a different parser.
As to cross format path querying, I see limited value in such an endeavour; the reason that one deals with such formats is generally because it is used as an interface to configure some tool, but moving across different tools, besides the path queries, even the configuration schemas differ, and having a cross format converter would do you no good. The fact that many such configuration formats support subtly (or drastically) divergent types of objects does this idea no good either.
Speaking of this topic, I would be very interested in a CLI tool that translated between different dialects of regex. As opposed to general data or configuration, I find the case for cross format conversion very compelling for regexes; the objects that different regex languages deal with are functionally completely identical. I would be very happy to be able to craft a regex to search for something in vim, then convert the regex to grep or use another tool on the shell, then perhaps adapt it to some other scripting languages e.g. Javascript/Ruby/Perl.
I feel like `fq` has a query path language that's kind of generic across lots of file types. It can be fairly verbose for that reason. I was using it to debug MsgPack documents and it was a lot less intuitive than just using some dotted string paths with `jq`.
Hey, fq author here. Happy to hear it's useful! could you elaborate a bit more on how it was less intuitive? fq's query language is jq with some small additions, so i wonder if you might mean the decoded structure is more detailed/verbose as it includes all the "low level" details? maybe you're looking for the "torepr" function that converts the detailed structure into the "represented" value?
Yes that's exactly what I meant, the MsgPack documents had quite a detailed structure.
torepr didn't quite work for me as I was dealing with objects containing large binary blobs and it was awkward.
fq is a great tool and I shouldn't have suggested this was a problem unique to it! I think this kind of "issue" is inevitable when dealing with so many types of input. And to be honest I struggle hard using jq as well for anything other than very basic paths, due to infrequent usage.
I see, thanks for replying and no worries! yeap, some of the "self-describing" formats like msgpack, cbor etc will, because of how fq works, have to be decoded into something more like a meta-msgpack etc.
Although there are similar data structures, not all of them are supersets or subsets of the data structures of JSON. For example, some data types that JSON lacks are:
- Integers (including 64-bit integers and longer)
- Non-finite floating point (Infinity, NaN)
- Keys of types other than strings
- Non-Unicode strings (e.g. byte sequences, TRON code, etc)
- Date/time
- Links
Additionally, they may differ in whether or not the order of keys should be retained.
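The non-finite floats bite surprisingly often in practice; for instance, Python's standard json module will happily emit tokens that strict parsers then reject:

    import json

    print(json.dumps(float("nan")))   # NaN      (not valid JSON)
    print(json.dumps(float("inf")))   # Infinity (not valid JSON either)
    json.dumps(float("inf"), allow_nan=False)  # raises ValueError in strict mode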
grove - tree decorated w/ key-value pairs. eg JSON, XML w/o text nodes
object graph - grove plus object references. eg serialization formats
knowledge graph - object graph where parent-child relations are explicit, using subject-verb-object clauses. like how Java Spring "flattens" object graphs.
The approach is to take the JavaScript object, convert it to an XML DOM, run the query (either using standard XPath, or standard CSS selectors) and then either convert the DOM back into objects, or (another way I've seen it done) keep a register of the original objects and retrieve them from there.
In this way, JSON, and any non-circular JavaScript object, can be sifted and searched and filtered in reliable ways using already-standardized methods, just by using those technologies together in a fun new way.
There is not necessarily a need for inventing a new custom syntax/DSL for querying unless you don't want to make use of CSS and XPath, or have very specific needs.
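A rough sketch of the round trip, in Python rather than the browser DOM and with made-up helper names, just to show the shape of the idea (plain dicts/lists become an element tree, then a standard query engine takes over):

    import xml.etree.ElementTree as ET

    def to_element(name, value):
        # turn nested dicts/lists into elements; list entries become <item>
        el = ET.Element(name)
        if isinstance(value, dict):
            for k, v in value.items():
                el.append(to_element(k, v))
        elif isinstance(value, list):
            for entry in value:
                el.append(to_element("item", entry))
        else:
            el.text = str(value)
        return el

    data = {"users": [{"name": "ada"}, {"name": "grace"}]}
    root = to_element("root", data)
    print([e.text for e in root.findall(".//users/item/name")])  # ['ada', 'grace']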
I've used postgresql's jsonpath support to create user defined filtering rules on db rows. It made things a lot easier than whatever other methods I could come up with.
Is the necessity of tools like JSON Path really just an indication that APIs are increasingly returning too much junk and / or way more data than the client actually requested and / or needs?
In dev mode, our internal APIs return pretty printed JSON so one can inspect via view-source, more, or text editor.
Not necessarily. JSON is used in a lot of places, also for large documents in data lakes and archives. It's useful to be able to query them with tools.
Only as inevitable as the dearth of interpolation/parametrized query primitives… though whether the industry has actually learnt the bitter lessons of SQL injection remains to be seen. I don’t hold my hopes up too much.
The standard was developed because it's being used by so many people in so many ways with so many different implementations that a standard was required to align them all, so I'm not sure the "nobody is really using it" argument holds much weight.
JSONPath is good when it comes to querying large JSON documents. But in my opinion, more than this is the need to simplify reading and writing from JSON documents. We use POJOs / model classes which can become a chore for large JSON documents. While it is possible to read paths, I had not seen any tool using which we could read and write JSON paths in a document without using POJOs. And so I wrote unify-jdocs - read and write any JSON path with a single line of code without ever using POJOs. And also use model documents to replace JSONSchema. You can find this library here -> https://github.com/americanexpress/unify-jdocs.
AAH-HA... That's why this felt familiar to me. I haven't used K8 in over a year now, but I used this all the time at my previous job. Didn't know it was "JSON Path" just knew it as something I used in kubectl often.
I wish there weren’t so many JSON path syntaxes. I’m comfortable with jq, then there’s JSON path, I forget which one AWS CLI is using, MySQL has their own. It’s impossible for me to get muscle memory with any of them.
Has anyone had performance issues using JSONPath? We are processing large pieces of data per request in a node express service and we believe JSONPath is causing our service to lock up and slow down. We’ve seen the issue improve as we have started refactoring out JSONPath usage for vanilla iteration and conditional checks.
There are a lot of factors at play so we can’t quite put our thumb on JSONPath, but it’s the current suspect and curious if others have run into anything similar.
This is exactly the type of thing that LLMs are very good at generating and explaining for you. I've done this countless times to create and understand regex patterns.
We're using JSONPath to annotate parts of JSON fields in PostgreSQL that need to be extracted/replaced for localization. Whilst I'd naturally prefer we didn't store display copy in these structures, it was a fun thing to implement.
There's nothing wrong with that approach if you're working with jsonpaths on the regular. It's all about time management, I guess. With json, and xml, and probably yaml, there is this recurring long-term pattern of:
1. Create a tree-structured document format that is flexible enough to handle all use cases.
2. Write a ton of content in this format.
3. Have to figure out a query pattern to accurately retrieve good info out of these structures.
Generally, I feel we’ve become good at querying normalized table data. But (and maybe it’s just me being stupid) wending through tree-structured data is still tricky. And I recently discovered LLMs are great at solving it, if you ask clearly.
The thing about querying tree-structured data being currently humanly harder than tabular data rings true to me, I always struggle with some very simple tree-sitter queries.
These types of languages are a bad idea, just as XPath was. They are complex enough to be a maintenance/bug risk AND don't bring any additional benefit to just writing code in your normal programming language to do the same thing.
You can take my list comprehensions from my cold, dead hands.
There isn't a use case I've seen where these types of mini languages fit well. Ostensibly, you could give it to a user to write to query JSON in a domain-agnostic way in an app but I think it would just confuse most users as well as not being powerful enough for half of their use cases.
Yeah, I don't think javascript has that function in the standard library. Writing one is not super complicated, but having to put that into every file (or importing it) is not ideal.
You wouldn't use json path as a replacement for any language. It might be marginally useful in configurations or for passing queries between different services. But the complex syntax limits it in both cases, because you cannot easily automatically modify the query. In the case of configs it would be great to analyze hundreds of configs on different systems and change them automatically; same with queries exchanged between services, which might even get stored in a database.
I do not understand why domain languages aren't designed with limited syntax in mind, in the style of lisp for instance. Actually being able to programmatically work with the language is a massive advantage that imo far outweighs your own frustration with typing a parenthesis or two extra.
Yeah, it's not bad. Guido famously preferred the list comprehension style over the functional style, but when you have nested data types and the "monadic style" (ish) functions, it does really make sense. You could imagine it in Python (let's pretend lists have map and filter):
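(A toy sketch; the wrapper class and sample data below are made up just to show the two styles side by side.)

    # pretend lists had these methods, via a tiny wrapper
    class L(list):
        def map(self, f):    return L(map(f, self))
        def filter(self, f): return L(filter(f, self))

    orders = L([{"name": "tea", "price": 3}, {"name": "gin", "price": 24}])

    # chained / "monadic" style:
    cheap = orders.filter(lambda o: o["price"] < 10).map(lambda o: o["name"])

    # comprehension style:
    cheap_too = [o["name"] for o in orders if o["price"] < 10]

    print(list(cheap), cheap_too)  # ['tea'] ['tea']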
I’m a fan, but let’s go even further. JavaScript has pleasant definitions of functions with
filter((x) => x.price < 10)
but why can’t we just write
filter(x.price < 10)
and add a rule to the JS engine that says “when you encounter a ‘syntax error: undeclared identifier x’, rewrite the code to add `(x) => ` in front of where the syntax error occurred, if and only if this rewrite prevents the syntax error”.
You might protest that reacting to syntax errors by inserting extra code and checking if the errors go away is an insane strategy, but I would note that JavaScript is actually a semicolon-terminated language in which most developers never write a semicolon, and the JavaScript engine is already using this insane strategy on nearly every line of modern JS to insert a semicolon whenever it encounters a syntax error, so it’s obviously practical.
The problem with your approach is that `filter(x.price < 10)` is perfectly valid syntax, it's filter with a single boolean arg. You need something else to trigger the magic: change `x` to `it` and you have Kotlin and Groovy's shorthand syntax -- you just can't define a variable called `it` anymore. If you want closure semantics on arbitrary undefined variables, I think I might have to slap you on general principle ;)
I dislike that syntax, but it might be better to instead use a special syntax that can't be the name of any variable, like the asterisk in Raku as mentioned below. In such a case, perhaps the function that would call it should check if the value is already the correct type and use that instead of calling it as a function, since then it would be possible for the filter condition to be a constant true or false that does not depend on the values being filtered (this is not usually useful, but it is more consistent and sometimes it is useful).
I think that the automatic semicolon insertion is a bad feature of JavaScript.
Some languages do this already but with a designated placeholder, like
filter(_.price < 10)
That may not work because a plain underscore is already a valid identifier, but another placeholder could potentially be used, with no need for the parser backtracking / function insertion (which I don't like the idea of; there may be cases where an undeclared identifier was a bug and it shouldn't be turned into a function).
I will never understand why people find the need to cram as much "mystery meat" code into one line as humanly possible. It makes it much harder to understand, debug, and optimize.
Meanwhile JS is still struggling to define even version 1.0 of a pipe operator, and the proposal has been bikeshedded into shabby oblivion with a syntax that's worse than just pulling out lodash pipe() or similar. TC39 does not fill me with hope.
You know, I think you're right. I never realized that Python list comprehension is basically filter/map (and reduce is an external function). That is actually pretty horrible syntax for it from that perspective.
So do you regularly hard code the number 10 in your code?
I think you missed my point.
In realistic code you'd be using string interpolation to put the number 10 into this query language, and worrying about injection vulnerabilities while you did it.
Or even calling another arbitrary function to do the filtering. Which this query language can't handle at all.
> don't bring any additional benefit to just writing code in your normal programming language to do the same thing.
In some cases the advantage is that you don't create new code and you just use some relatively standard tool. You just fetch some public package that handles various edge cases and you just prepare a script that describes what you want to do with some program. This is useful if you work in a containerized environment and configuration exists as json or yaml. Often I just use jq or yq, instead of reinventing the wheel just to read or write some values.
One of the advantages of this sort of DSL is that they're easy to share between languages. One needs one implementation of JSON path per programming language, and then it's easy to, for example, iterate on the query in the REPL of a dynamic language, then copy it to a fast compiled language. Or share a query between the browser and the server. That sort of thing.
Regex has similar advantages, if one sticks to the subset of regex which is commonly understood between languages: so less so, for that very reason.
Another plus is the principle of least power. A JSON Path will halt, and it won't make syscalls. There are circumstances where that's useful.
I agree, it gives the same vibe as wanting to somehow bring back the simplicity of excel formulas rather than having to write normal code, to revive that dream of convenient one-liners.
But there is a reason excel has a ceiling of maintainability that always turns it into a spaghetti mess once it's big enough.
It's pretty much the exact same thing - it lets you specify a file to access with a string (pathname) such as "/foo/bar/cat" rather than having to go step by step: first open/read directory "foo", then open/read directory "bar", then finally access file "cat".
With XPath and JSONPath you're just dealing with DOM nodes and children rather than directories and children.
Except that XPath and JSONPath are overpowered. I'd welcome a simple standard path syntax for JSON, for instance one I'd want to use for reporting schema errors in a JSON document ($.users[10].name must be a string). I can use JSONPath, sure, but even this subset is annoying to parse - compared to a filepath, which you can parse with `path.split('/')`.
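For the simple subset that's enough for error reporting, a few lines of Python will do (a sketch that deliberately ignores everything JSONPath can do beyond plain keys and numeric indices):

    import re

    def parse_simple_path(path):
        # handles only $.key.key[3].key - no wildcards, filters or slices
        tokens = re.findall(r"\.([A-Za-z_]\w*)|\[(\d+)\]", path)
        return [int(idx) if idx else key for key, idx in tokens]

    print(parse_simple_path("$.users[10].name"))  # ['users', 10, 'name']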
The disadvantages in being tied to the execution model of said Turing Complete language are also significant when compared to one that can do optimizations like stream fusion without having to work such optimizations into the language as a whole. But there's no technical reason we can't have both.
One advantage is slightly less typing, which might be useful in CLI tools and UI filters. Also, people don't want to bring a full JS interpreter in those kinds of interfaces for various reasons.
If you read the article you'd know that this is about using OpenAPI Overlays and other use cases where JSONPath is literally a requirement. I think you just read the first paragraph then wrote a few about how it's bad.
More and more parts of the API ecosystem require JSONPath, and just saying "you should write code instead" doesn't actually help anyone write OpenAPI Overlays, so what's the point?
There is possibly a need for a more unified standard across different implementations, particularly from a software development and API design perspective.
During parsing and manipulation of JSON data, the syntactical discrepancies/behaviours between various libraries might need a common specification, for interoperability.
Features like type-aware queries or schema validation may be very helpful.
JSONata is written in JS and can be used in Node or in the browser, and there's also a Python wrapper: https://pypi.org/project/pyjsonata/