I recommend using clickhouse-local[1] for these tasks.
It does SQL; it supports all imaginable data formats, streaming processing, and connecting to external data sources. It also outperforms every other tool[2].
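For a taste, a minimal sketch of the kind of invocation meant here (hypothetical data.tsv with a header row; depending on the install the binary is `clickhouse-local` or `clickhouse local`):

```
$ clickhouse-local --query "
    SELECT name, count(*) AS n
    FROM file('data.tsv', 'TSVWithNames')
    GROUP BY name
    ORDER BY n DESC
    LIMIT 10"
```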
Jesus, this is disgusting. I'm not that picky and don't really complain about "... | sh" usually, but at least I took it for granted that I can always look at the script in the browser and assume that it has no actual evil intentions and doesn't rely on some fucking client-header magic to be modified on the fly.
User agent "sniffing" (I mean, it's right there. It's not exactly subtle.) has been going on since before the IE6 days. That it now extends to make things easier for us as command line users is... kinda convenient? Another site where I've seen it done to good effect is http://ifconfig.me . Hit that with a web browser and get a page of accompanying information. Hit it with curl, and get back your IP in ASCII - not even an extra newline character is returned!
The underlying question is: do you trust clickhouse.com or not? You don't have to; I've never met the team or talked to them, and I can't make that decision for you. But whether you go to the site, laboriously find the download page, right click, download a binary, install the deb/rpm, ask your package manager what files just got installed, then find and run the clickhouse binary, or just let your computer do it for you via a shell script, the end result is the same. Code from clickhouse (and we're sure it was from clickhouse because of TLS) was downloaded to the target machine and then got run. Do things have to be difficult and annoying in order for you to like them? (Psychology studies say yes, actually.)
$ cat foo.tsv
name foo bar
Alice 10 8888
Bob 20 9999
$ cat foo.tsv | sqlite3 -batch \
-cmd ".mode tabs" \
-cmd ".import /dev/stdin x" \
-cmd "select foo from x where bar > 9000;"
20
Guilty as charged... I have so many bookmarks I forget to look at...
Applet idea: an app that picks a random bookmark from your saved ones and sends you a reminder to click it and read it... and lets you schedule when you want to see it - like every day at 7AM, send me a link from my bookmarks. Set an alarm for 15 minutes to get me back on schedule.
So ChatGPT and I made this: https://github.com/erasei/randombookmark Load it in developer mode, and when you click the icon it will open a new tab with a random bookmark.
Thank you! That list is perfect since I don't use these tools often enough to remember them and some are hard to find again without knowing exactly what to search for.
I also discovered pawk on it, which looks interesting. I made a somewhat related tool [1] which is probably out of scope for your list, but it can be used in some of the same ways [2] and may be of interest nonetheless.
Benthos is a really cool tool when you want to take this idea to the next level. It can do the usual text/csv/json/yaml mangling you'd do with awk or jq (since it includes a version of those in addition to its own processing language), but it also has decoders and encoders for a bunch of different binary formats like protobuf or Avro. And in addition to stdin/out and files it can talk to Kafka, MQTT, HTTP, ES and a bunch of other stuff. I was able to put together a log processor that consumes from Kafka, does a bunch of field mangling and then ingests into Elasticsearch in a couple dozen lines of yaml.
I've been getting a lot of mileage out of https://github.com/itchyny/gojq#readme recently due to two things: its vastly superior error messages and the (regrettably verbose) `--yaml-input` option
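For reference, the YAML side looks something like this (the file name and the field path are made up):

```
$ gojq --yaml-input -r '.spec.containers[].image' deployment.yaml
```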
No thank you. I appreciate the power, speed, simplicity and flexibility of UNIX/GNU style text tools. I-Also-Don't-Want-To-Be-Locked-Into-This-Ridiculous-Syntax-Nightmare.
When I read I-Also-Don't-Want-To-Be-Locked-Into-This-Ridiculous-Syntax-Nightmare, bash and sed and cut and the likes are indeed the first things which come to mind. I really feel kind of bad sometimes for once having spent time learning them, only to later find out there are many alternatives and nearly all of them are easier to learn. Many of these tools have zero discoverability as well, meaning you effectively get locked into learning their syntax - and then another one for the next tool. Whereas at least in PS you can try common words and tab completion and sometimes get there before having to reach for 'how do I x in y'.
It's a matter of recognizing your use case. If you're going to write a program that you expect to maintain for years, sure, go ahead and make it as verbose as possible. Unix tools support this with long-form flags (usually prefixed with -- rather than -). On the other hand, if you're doing exploration and iterating interactively on the fly (which bash is best at) then you want very terse syntax to keep lines short.
Ever try this on large files? A lot of PowerShell commands I make like this can take minutes to run when a combination of Linux commands and Awk might take a couple of seconds.
Yes: the memory ceiling is huge, WinRM is buggy and unreliable, performance is "variable", handling of dates and times is totally inconsistent, and being .NET it's UTF-16 internally, so streams are painful to work with. Escape characters, argh. Variable scoping is horrible, and most of the API you have to hit to get stuff done is wrapping ancient bits of weird Win32, DCOM and WMI, and thus has inconsistent error handling and a limited ability to actually do what it is told.
While PowerShell has gotten a lot faster in the most recent versions, it's still pretty slow for anything involving computation; most of the time my equivalent Python code beats the pants off of it.
The expressiveness is nice, but oftentimes modules won't support it or require weird ways of using the data to get the performance you want (mostly by dropping out of the pipe.)
The choices around Format- vs Out- vs Convert- are Very Confusing for new people, the "object in a shell but also text sometimes" way of displaying things is weird, and until recently things like -NoTypeInformation or managing encoding in files were just pointlessly weird.
The module support and package management are still entirely in the stone age, and I regularly see people patching in C# in strings to get the behavior they want.
"Larger" modules tend to get Really Slow - the azure modules especially are just an example of how not to do it.
The way it automatically unwraps collections is cool, but gets weird when you have a 1 vs many option for your output, and you might find yourself defensively casting things to lists or w/e.
The type system in general is nice to get started with, but declaring a type does not fix it in place, so assignment can just break your entire world.
There's still a lot to love about the language: when you are getting things done in a Windows environment it's great for gluing together the various pieces of the system, but I find the code "breaks" more often than equivalent Python code.
My philosophy about any shell language is that if performance is a concern, then you should probably use a real programming language for it. Using shell scripting to handle batch processing tasks just creates dependencies on unmaintainable code.
If performance is too bad, you can only use the shell for toy examples, which means there's no reason to have a real shell at all and you might as well go back to COMMAND.COM or equivalent. It's taking the idea of Unix-style scriptable shells and "improving" the implementation to the point of unusability.
It's also inhibited by typical corporate crapware, but even with a pretty barebones/vanilla config, PowerShell takes a full four to ten seconds just to open a new session on my work computer.
PowerShell isn't even fast enough for basic interactive usage, never mind batch processing.
Simple tools with simple rules will outlast most of the code we'll all build.
Picking things like grep, awk, and sed means your knowledge will be widely applicable going forward, and many people caring about their performance both backwards and forwards in time means your shell can be pretty fast.
I agree that this is something missing from classic UNIX shells: typed output streams.
I had this discussion a while back on HN, though I can't find it ATM (I wish there was a comment history search function). I am far from the first one who thought of that, and there are a few implementations of this idea.
Searching for that comment, I came across relevant stories:
Typed output streams aren’t missing from Unix. The authors of Unix clearly describe, repeatedly, that Unix tools use plain text as the common format, and that everything should look like a file.
You’re right that standard Unix tools don’t have a concept of types in streams, but that decision got made deliberately. Types and formats got left as output details.
Analogously, my MacBook Air doesn't have a fan: by design, not by accidental omission.
Right, that was an unfortunate choice of words. It's not missing, but leads to (IMO) a proliferation of under-specified text-based data and stream exchange formats.
Having human-readable text as the lowest common denominator is a laudable goal. Shell scripting would probably be improved, however, if most tools offered alternative typed streams, or something similar. I am not convinced PowerShell's approach is the best, but it is at least interesting.
Picking nits, but the decision to use plain text as the common data format in Unix does not rule out structured text (e.g. CSV, TSV, XML). Nor does it imply “human-readable.”
The original decision was about not proliferating specialized or proprietary binary formats, which was more of a norm back in the ‘70s than today. The goal was to make small single-purpose tools that communicated through a common interface (files) in a standard format (plain text). Unix succeeded and continues to succeed at that.
Nothing about those design decisions precludes tools using binary formats under Unix — image processing, for example. It just precludes using standard text-oriented tools on those formats.
Despite documentation stating the verbs are fully streaming.
> Fully streaming verbs
> These don't retain any state from one record to the next. They are memory-friendly, and they don't wait for end of input to produce their output.
a) Isn't it written in Golang, which has a GC? Does it do custom buffer-based management?
b) Isn't it supposed to be run on a file and produce some output, as opposed to an interactive session? Why would it matter if it leaks, then, and how could it leak, as the memory is returned to the OS when it ends?
It's indeed written in Go now, which indeed has GC, and ultimately all memory is freed ... but there are indeed some issues around intermediate retention of memory, taking more memory than one would have expected.
This is nice. I use `column` for pretty printing CSV/TSV, but this fixes two tiny gaps: `sort` can't easily skip header lines, and `jq` can't parse CSV input (it supports `@csv` for output conversion but not input).
$ cat example.csv
color,shape,flag,index
yellow,triangle,1,11
red,square,1,15
red,circle,1,16
red,square,0,48
purple,triangle,0,51
red,square,0,77
# pretty printing
$ column -ts',' example.csv
color   shape     flag  index
yellow  triangle  1     11
red     square    1     15
red     circle    1     16
red     square    0     48
purple  triangle  0     51
red     square    0     77
# sorting with skipped headers is a mess.
$ (head -n 1 example.csv && tail -n +2 example.csv | sort -r -k4 -t',') | column -ts','
color   shape     flag  index
red     square    0     77
purple  triangle  0     51
red     square    0     48
red     circle    1     16
red     square    1     15
yellow  triangle  1     11
> `jq` supports `@csv` for output conversion but not input
Actually, `jq` can cope with trivial CSV input like your example: `jq -R 'split(",")'` will turn each line of a CSV into an array. To then sort it in reverse order by the 3rd column and retain the header, the following fell out of my fingers (I'm beyond certain that a more skilled `jq` user than me could improve it):
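Something along these lines - a sketch rather than the exact command, sorting by the numeric index field as in the earlier example:

```
$ jq -rRs '
    [split("\n")[] | select(length > 0) | split(",")]
    | .[:1] + (.[1:] | sort_by(.[3] | tonumber) | reverse)
    | .[] | join(",")
  ' example.csv
```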
I like these command line tools, but I think they can cripple someone actually learning a programming language. For example, here is a short program that does your last example:
The whole point of UNIX userland is to not have to write a custom program for every simple case that just needs recombining some existing basic programs in a pipeline...
That's just it though: the last example is not a simple case, hence why it's awkward by the commenter's own admission. Command line tools are fine, but you need to know when to set the hammer down and pick up the chainsaw.
As far as shell scripting goes, this is hardly anything to write home about. Looks simple enough to me.
It just retains the header by printing the header first as is, and then sorting the lines after the header. It's immediately obvious how to do it to anybody who knows about head and tail.
And with Miller it's even simpler than that, still on the command line...
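Something like this, if I remember the flags right (using the example.csv from upthread; -nr sorts numerically descending and the header stays put):

```
$ mlr --csv sort -nr index example.csv
```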
To me the last example is still simple. When I encounter this in the wild, I don't really care about preserving the header.
tail -n +2 example.csv | sort -r -k4 -t','
Or more often, I just do this and ignore the header
sort -r -k4 -t',' example.csv
Keeping the header feels awkward, but using `sort` to reverse sort by a specific column is still quicker to type and execute (for me) than writing a program.
I thought the Go code looked way too complex and Python would be simpler. Yes and no.
import csv
filename = 'example.csv'
sort_by = 'index'
reverse = True
with open(filename) as f:
    lines = [d for d in csv.DictReader(f)]
for line in lines:
    line['index'] = int(line['index'])
lines.sort(key=lambda line: line[sort_by], reverse=reverse)
print(','.join(lines[0].keys()))
for line in lines:
    print(','.join(str(v) for v in line.values()))
I thought about that but 1) it seemed like cheating to write to standard out, 2) you're assuming that the column to sort by is an integer whereas I broke that code up a little bit.
But yours has the advantage of being able to support more complex CSVs.
Miller is being offered as "like awk, sed, head.." (emphasis on head - mine) and yes it offers more, but it does not behave "like" the *nix tools it refers to.
Miller is designed around the idea of structured data. This is a higher bar than naively manipulating text, and the user has to be more deliberate to extract fields. Doing cli manipulation of a csv that contains quoted commas is challenging with the standard tools.
head just reads lines; it doesn't know anything about CSV rows.
So if you want, say, the first 100 rows of a csv that has multiline rows, miller can give it, while head can't.
head will just print the first N lines - whether those make up 1, 22, 36 or N actual csv rows.
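For example (per the above), something like this counts CSV records rather than raw lines; file.csv is just a stand-in name:

```
$ mlr --csv head -n 100 file.csv
```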
>Miller is being offered as "like awk, sed, head.." (emphasis on head - mine) and yes it offers more, but it does not behave "like" the *nix tools it refers to.
That's the whole idea.
That it behaves in a way more suited to the csv format, and more coherent than 5-6 different text-focused tools.
The claim is not "this is awk, sed, head, sort remade for csv with identical interfaces and behavior" but "this is a tool to work with csv files and do what you'd normally have to jump through hoops to do with awk, sed, head, sort which don't understand csv structure".
This is a cool tool. And it works well with GPT4. I asked ChatGPT4 the following:
```
I'd like to use Miller from John Kerl on the CSV called test.csv. It has 3 columns, Name, City and State. I'd like to filter City where it's either Chicago or Boston. I want to remove the State column and add a new column called NewColumn. This new column shall be filled with the value of the Name column appended with the letter `_new`. Finally the output shall be saved into the file called myfile.csv. Write the command to do that
```
And it replied with
mlr --csv filter '$City == "Chicago" || $City == "Boston"' then cut -x -f State then put '$NewColumn = $Name . "_new"' test.csv > myfile.csv
I'm sure that someone familiar with Miller would write this command faster than writing the text to GPT, but as a newbie I'd have spent much longer. Also, each step is described perfectly.
The syntax is pretty crazy, so there's a learning curve. But once you're over the hump, jq is easy to use for transformations or "querying" of large volumes of json data.
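For instance, a made-up but typical "query" over events.json (an array of objects with a user field):

```
$ jq 'group_by(.user) | map({user: .[0].user, events: length})' events.json
```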
Looks cool! It'd help to have some more full examples (including output) of what some common use-cases do in the README. After looking through this, I'm still scratching my head trying to think of what problem this could solve in my own practice, and I work with a lot of csv files daily
In some sense, there's no answer to your question because everyone's tool workflow is their own.
That said, the most recent invocation for me was `mlr --icsv --ojson cat < a515b308-9a0e-4e4e-99a2-eafaa6159762.csv` to fix up a CloudTrail csv; after that my muscle memory reaches for jq, but conceptually the next step could have been `filter '$eventname == "CreateNodegroup"'`.
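For completeness, the whole thing could plausibly be collapsed into a single Miller step, roughly:

```
$ mlr --icsv --ojson filter '$eventname == "CreateNodegroup"' a515b308-9a0e-4e4e-99a2-eafaa6159762.csv
```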
I was wondering what they meant by that term as well. Does every JSON document in the array have to have the exact same structure (including the ordering of key-value pairs)? What happens if row #1000 introduces a new key-value pair not seen before? What if the value for a key is an array?
It generally means a sequence/array of flat json objects, where each key represents a column name and the value represents the row value for that column. Nested json can also be "flattened".
> "... ordering of key-value pairs"
The order of appearance of key-value pairs in the input does not matter.
> "... if row #1000 introduces a new key-value pair"
This is sparse data. Miller handles this with the "unsparsify" verb:
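For example (a made-up two-record input; by default Miller fills the missing key with an empty value):

```
$ cat sparse.json
[ {"a": 1}, {"a": 2, "b": 3} ]
$ mlr --json unsparsify sparse.json
```

The first record then comes out with an empty "b" as well, so every record ends up with the same set of keys.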
>How is it better than SQL for these tasks on tabular data?
It has far better handling of a CSV/TSV file on the command line directly and is composable in shell pipelines.
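A sketch of the kind of composition meant here, reusing the example.csv from upthread (assuming Miller 6's list-wrapped JSON output):

```
$ mlr --icsv --ojson cat example.csv | jq -r '.[].color' | sort | uniq -c
```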
>My point: if one wants cat, sort, sed, join on tabular data, SQL is exactly that.
SQL is a language for data in the form of tables in relational databases.
While it can do sorting or joining or some changes, it is meant for a different domain than these tools, with other constraints, other concerns, and other patterns of use...
You don't need to load anything to a db, for starters.
You also normally don't care for involving a DB in order to use in a shell script, or for quick shell exploration.
You also can't mix SQL and regular unix userland in a pipeline (well, with enough effort you can, but it's not something people do or need to do).
Yes, I know. It's not that there aren't several ways to do it; it's that it's not really a good fit for the command line, except in the "I want to reuse SQL that I already know" sense.
The problem isn't having a way to use SQL to query the data from the command line; it's that SQL is long-winded, with syntax that doesn't really fit in a traditional shell pipeline.
Because SQL (Structured Query Language) is the language used to access/manipulate the data, not the name for that kind of data. There have been, and are, many databases which essentially use tabular data but are not SQL databases.
ChatGPT is pretty good at giving working code snippets for Pandas: you can describe the transforms you want and a sketch of the data (column names and sample rows), and it will usually give back working code.
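For instance, for the Miller task a few comments up, the kind of snippet it tends to hand back looks roughly like this (a sketch, not an actual ChatGPT transcript):

```
import pandas as pd

# same transform as the mlr example above: keep two cities, drop State,
# add NewColumn = Name + "_new", and write the result out
df = pd.read_csv("test.csv")
df = df[df["City"].isin(["Chicago", "Boston"])]
df = df.drop(columns=["State"])
df["NewColumn"] = df["Name"] + "_new"
df.to_csv("myfile.csv", index=False)
```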
[1] https://clickhouse.com/blog/extracting-converting-querying-l...
[2] https://colab.research.google.com/github/dcmoura/spyql/blob/...