Miller: Like Awk, sed, cut, join, and sort for CSV, TSV, and tabular JSON (github.com/johnkerl)
318 points by ingve on March 16, 2023 | 104 comments



I recommend using clickhouse-local[1] for these tasks.

It does SQL; it supports all imaginable data formats, streaming processing, and connecting to external data sources. It also outperforms every other tool[2].

[1] https://clickhouse.com/blog/extracting-converting-querying-l...

[2] https://colab.research.google.com/github/dcmoura/spyql/blob/...
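For a flavor of it, querying a hypothetical data.csv with a city column looks roughly like this (invocation from memory, so double-check against the docs):

    $ clickhouse local -q "
        SELECT city, count(*) AS n
        FROM file('data.csv', CSVWithNames)
        GROUP BY city
        ORDER BY n DESC"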


> curl https://clickhouse.com/ | sh

Jesus, this is disgusting. I'm not that picky and don't really complain about "... | sh" usually, but at least I took it for granted that I can always look at the script in the browser and assume that it has no actual evil intentions and doesn't rely on some fucking client-header magic to be modified on the fly.


Why be so cantankerous?

    curl https://clickhouse.com/ > install.sh
    cat install.sh


User agent "sniffing" (I mean, it's right there. It's not exactly subtle.) has been going on since before the IE6 days. That it now extends to making things easier for us as command line users is... kinda convenient? Another site where I've seen it done to good effect is http://ifconfig.me . Hit that with a web browser and get a page of accompanying information. Hit it with curl, and get back your IP in ASCII - not even an extra newline character is returned!

The underlying question is: do you trust clickhouse.com or not? You don't have to; I've never met the team or talked to them, and I can't make that decision for you. But whether you go to the site, laboriously find the download page, right click, download a binary, install the deb/rpm, ask your package manager what files just got installed, then find and run the clickhouse binary, or just let your computer do it for you via a shell script, the end result is the same. Code from clickhouse (and we're sure it was from clickhouse because of TLS) was downloaded to the target machine, and then got run. Do things have to be difficult and annoying in order for you to like them? (Psychology studies say yes, actually.)


We provide .deb, .rpm, .tgz, Docker, and single-binary packages for x86-64 and AArch64, on Linux, Mac, and FreeBSD.



pipe it through less


or sqlite for CSV/TSV, haven't tried it for json.


duckdb and being able to write:

   select a,b,c from '*.jsonl.gz'
has been a huge improvement to my workflows.
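If you haven't tried the CLI, you can pipe the SQL straight into the duckdb binary - roughly like this (untested sketch, reusing the query above):

    $ echo "select a, b, c from '*.jsonl.gz' limit 10" | duckdb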


How does it look from command-line for streaming processing of CSV/TSV?


you can pipe to it


Yep:

  $ cat foo.tsv

  name    foo     bar
  Alice   10      8888
  Bob     20      9999

  $ cat foo.tsv | sqlite3 -batch \
    -cmd ".mode tabs" \
    -cmd ".import /dev/stdin x" \
    -cmd "select foo from x where bar > 9000;"

  20


Discussion of a similar tool last month:

yq: command-line YAML, JSON, XML, CSV and properties processor

https://news.ycombinator.com/item?id=34656022

Also mentions gojq, Benthos, xsv, Damsel, a 2nd yq, htmlq, cfn-flip, csvq, zq, and zsv.


I have made a repository cataloguing tools like this: https://github.com/dbohdan/structured-text-tools.


Thanks for this, good job. I always mean to do something similar but just end up bookmarking HN threads I then never look at.


Guilty as charged... I have so many bookmarks I forget to look at...

App idea: an app that picks a random bookmark from your saves and sends you a reminder to click it and read it... and lets you schedule when you want to see it - like every day at 7AM, send me a link from my bookmarks. Set an alarm for 15 minutes to get me back on schedule.


So chatgpt and I made this: https://github.com/erasei/randombookmark Load in developer mode and when you click the icon it will open a new tab with a random bookmark.


There's also GoAWK, which supports CSV natively ("-i csv"): https://github.com/benhoyt/goawk
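For example, with a hypothetical prices.csv (flags from memory: -i csv for CSV input, -H to read the header row so fields can be referenced by name):

    $ cat prices.csv
    item,price
    apple,3
    banana,7

    $ goawk -i csv -H '@"price"+0 > 5 { print @"item" }' prices.csv
    banana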


You should include qsv (https://github.com/jqnatividad/qsv) and goawk (https://github.com/benhoyt/goawk) in the CSV section! GoAWK has a CSV/TSV mode, and qsv is loaded with features and under active development.


Big shouts to visidata -- very underrated tool imo.


Thank you! That list is perfect since I don't use these tools often enough to remember them and some are hard to find again without knowing exactly what to search for.

I also discovered pawk on it, which looks interesting. I made a somewhat related tool [1] which is probably out of scope for your list, but it can be used in some of the same ways [2] and may be of interest nonetheless.

[1] https://github.com/elesiuta/pyxargs

[2] cat /etc/hosts | pyxargs -d \n --im json --pre "d={}" --post "print(dumps(d))" --py "d['{}'.split()[0]] = '{}'.split()[1]"


Benthos is a really cool tool when you want to take this idea to the next level. It can do the usual text/csv/json/yaml mangling you'd do with awk or jq (since it includes a version of those in addition to its own processing language) but it also has decoders and encoders for a bunch of different binary formats like protobuf or Avro. And in addition to stdin/out and files it can talk to Kafka, MQTT, HTTP, ES and a bunch of other stuff. I was able to put together a log processor that consumes from Kafka, does a bunch of field mangling and then ingests into Elasticsearch in a couple dozen lines of yaml.


Anyone else find jq really hard to use? I found the syntax/interface bewildering at times. Although not as bad as awk.


I'm pretty sure that's one big reason for all these other tools!


There's also https://github.com/dkogan/vnlog/ which is a wrapper around the existing coreutils, so all the options work, and there's nothing to learn


Seems like you meant https://github.com/TomWright/dasel (instead of damsel)


Correct; thanks!


the obvious missing reference is jq https://github.com/stedolan/jq


I've been getting a lot of mileage out of https://github.com/itchyny/gojq#readme recently due to two things: its vastly superior error messages and the (regrettably verbose) `--yaml-input` option

I also have https://github.com/01mf02/jaq#readme installed but just haven't needed it


-y and -Y unused so I patch those in as short flags for --yaml-input/output in my copy.


True, for me that one's a given!


Another similar discussion last year where Miller was mentioned:

New(ish) command line tools

https://news.ycombinator.com/item?id=31009313


Even without aliases, I still prefer PowerShell on Windows. Once you've turned the text into an object, you can pipeline it to hell.

Get-Content .\example.csv | ConvertFrom-Csv | Where-Object -Property color -eq red


No thank you. I appreciate the power, speed, simplicity and flexibility of UNIX/GNU style text tools. I-Also-Don't-Want-To-Be-Locked-Into-This-Ridiculous-Syntax-Nightmare.


It's funny to me that people hate how verbose posh is when bash syntax looks like my cat walked over my keyboard


When I read I-Also-Don't-Want-To-Be-Locked-Into-This-Ridiculous-Syntax-Nightmare, bash and sed and cut and the likes are indeed the first things which come to mind. I really feel kind of bad sometimes for once having spent time learning them, only to later find out there are many alternatives and nearly all of them have a gentler learning curve. Many of these tools have zero discoverability as well, meaning you effectively get locked into learning their syntax. And then another one for the next tool. Whereas at least in PS you can try to use common words and tab completion and sometimes get there before having to reach for 'how do I x in y'.


It's a matter of recognizing your use case. If you're going to write a program that you expect to maintain for years, sure, go ahead and make it as verbose as possible. Unix tools support this with long-form flags (usually prefixed with -- rather than -). On the other hand, if you're doing exploration and iterating interactively on the fly (which bash is best at) then you want very terse syntax to keep lines short.
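For example, the same GNU sort invocation both ways (illustrative only, reusing example.csv from the sibling comment):

    # long-form flags, self-documenting for scripts you'll maintain
    sort --reverse --key=4 --field-separator=',' example.csv

    # terse short flags for interactive one-liners
    sort -r -k4 -t',' example.csv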


Thank goodness for tab completion!

I appreciate that it auto-corrects capitalization and slash direction.


Here it is with aliases. `?` means where:

  ipcsv example.csv | ? color -eq red
 
  color shape  flag index
  ----- -----  ---- -----
  red   square 1    15
  red   circle 1    16
  red   square 0    48
  red   square 0    7


Ever try this on large files? A lot of PowerShell commands I make like this can take minutes to run when a combination of Linux commands and Awk might take a couple of seconds.


I much prefer this approach as you only have to parse stuff once during the processing pipeline rather than at every step.

Unfortunately after working with it for a few years I utterly despise PowerShell for many other reasons.


Can you share some of the reasons you don't like it?


Yes: memory ceiling is huge, WinRM is buggy and unreliable, performance is "variable", totally inconsistent handling of dates and times, being .Net it's UTF-16 internally so streams are painful to work with, escape characters argh, variable scoping is horrible, most of the API you have to hit to get stuff done is wrapping ancient bits of weird win32 and DCOM and WMI thus has inconsistent error handling and ability to actually do what it is told.

I've probably forgotten a few things.


While PowerShell has gotten a lot faster in the most recent versions, it's still pretty slow for anything involving computation; most of the time my equivalent Python code beats the pants off of it.

The expressiveness is nice, but oftentimes modules won't support it or require weird ways of using the data to get the performance you want (mostly by dropping out of the pipe.)

The choices around Format- vs Out- vs Convert- are Very Confusing for new people, and the "object in a shell but also text sometimes" way of displaying things is weird, and until recently things like -NoTypeInformation or managing encoding in files were just pointlessly weird.

The module support and package management is still entirely in the stone ages and I regularly see people patching in C# in strings to get the behavior they want.

"Larger" modules tend to get Really Slow - the azure modules especially are just an example of how not to do it.

The way it automatically unwraps collections is cool, but gets weird when you have a 1 vs many option for your output, and you might find yourself defensively casting things to lists or w/e.

The typing system in general is nice to get started on but declaring a type does not fix it, so assignment can just break your entire world.

There's still a lot to love about the language; when you're getting things done in a Windows environment it's great for gluing together the various pieces of the system, but I find the code "breaks" more often than equivalent Python code.


My philosophy about any shell language is that if performance is a concern, then you should probably use a real programming language for it. Using shell scripting to handle batch processing tasks just creates dependencies on unmaintainable code.


If performance is too bad, you can only use the shell for toy examples, which means there's no reason to have a real shell at all and you might as well go back to COMMAND.COM or equivalent. It's taking the idea of Unix-style scriptable shells and Improving the implementation to the point of unusability.


It's also inhibited by typical corporate crapware, but even with a pretty barebones/vanilla config, PowerShell takes a full four to ten seconds just to open a new session on my work computer.

PowerShell isn't even fast enough for basic interactive usage, never mind batch processing.


Simple tools with simple rules will outlast most of the code we'll all build. Picking things like grep, awk, and sed means your knowledge will be widely applicable going forward, and many people caring about their performance both backwards and forwards in time means your shell can be pretty fast.


I agree that this is something missing on classic UNIX shells: typed output streams. I had this discussion a while back on HN, though I can't find it ATM (I wish there was a comment history search function). I am far from the first one who thought of that, and there are a few implementations of this idea.

Searching for that comment, I came across relevant stories:

https://news.ycombinator.com/item?id=27535357

https://news.ycombinator.com/item?id=25225113


Algolia supports searching comments, and while I don't know the "facet" syntax to pin it to your username, thankfully it's a pretty distinct word and https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... coughed up https://news.ycombinator.com/item?id=26797583 from 2 years ago. Is that the one?


Ah, interesting feature I didn't know about, thank you.

It's one of the parents in that comment chain (top comment on the story): https://news.ycombinator.com/item?id=26791597


Typed output streams aren’t missing from Unix. The authors of Unix clearly describe, repeatedly, that Unix tools use plain text as the common format, and that everything should look like a file.

You’re right that standard Unix tools don’t have a concept of types in streams, but that decision got made deliberately. Types and formats got left as output details.

Analogously my Macbook Air doesn’t have a fan, by design, not by accidental omission.


Right, that was an unfortunate choice of words. It's not missing, but leads to (IMO) a proliferation of under-specified text-based data and stream exchange formats.

Having human-readable text as the lowest common denominator is a laudable goal. Shell scripting would however probably be improved if most tools offered alternative typed streams, or something similar. I am not convinced Powershell's approach is the best, but their approach is at least interesting.


Picking nits, but the decision to use plain text as the common data format in Unix does not rule out structured text (e.g. CSV, TSV, XML). Nor does it imply “human-readable.”

The original decision was about not proliferating specialized or proprietary binary formats, which was more of a norm back in the ‘70s than today. The goal was to make small single-purpose tools that communicated through a common interface (files) in a standard format (plain text). Unix succeeded and continues to succeed at that.

Nothing about those design decisions precludes tools using binary formats under Unix — image processing, for example. It just precludes using standard text-oriented tools on those formats.


Great tool.

BUT, leaks memory like crazy.

Despite documentation stating the verbs are fully streaming.

> Fully streaming verbs
>
> These don't retain any state from one record to the next. They are memory-friendly, and they don't wait for end of input to produce their output.

https://miller.readthedocs.io/en/6.7.0/streaming-and-memory/...


>BUT, leaks memory like crazy.

Huh?

a) Isn't it written in Golang, which has a GC? Does it do custom buffer based management?

b) Isn't it supposed to be run on a file and get some output - as opposed to an interactive session? Why would it matter if it leaks, then, and how could it leak, as the memory is returned to the OS when it ends?


It's indeed written in Go now, which indeed has GC, and ultimately all memory is freed ... but there are indeed some issues around intermediate retention of memory, taking more memory than one would have expected.

Some gains were made on https://github.com/johnkerl/miller/pull/1133 and https://github.com/johnkerl/miller/pull/1132




Previous discussion: https://news.ycombinator.com/item?id=28298729 (273 points | Aug 25, 2021 | 66 comments)


Thanks! Macroexpanded:

Miller – tool for querying, shaping, reformatting data in CSV, TSV, and JSON - https://news.ycombinator.com/item?id=29651871 - Dec 2021 (33 comments)

Miller CLI – Like Awk, sed, cut, join, and sort for CSV, TSV and JSON - https://news.ycombinator.com/item?id=28298729 - Aug 2021 (66 comments)

Miller v5.0.0: Autodetected line-endings, in-place mode, user-defined functions - https://news.ycombinator.com/item?id=13751389 - Feb 2017 (20 comments)

Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV - https://news.ycombinator.com/item?id=10066742 - Aug 2015 (76 comments)


This is nice. I use `column` for pretty-printing CSV/TSV, but this fixes two tiny gaps in `sort` (skipping header lines) and `jq` (parsing CSV input; `jq` supports `@csv` for output conversion but not input).

  $ cat example.csv 
  color,shape,flag,index
  yellow,triangle,1,11
  red,square,1,15
  red,circle,1,16
  red,square,0,48
  purple,triangle,0,51
  red,square,0,77
  
  # pretty printing
  $ column -ts',' example.csv 
  color   shape     flag  index
  yellow  triangle  1     11
  red     square    1     15
  red     circle    1     16
  red     square    0     48
  purple  triangle  0     51
  red     square    0     77
  
  # sorting with skipped headers is a mess. 
  $ (head -n 1 example.csv && tail -n +2 example.csv | sort -r -k4 -t',') | column -ts','
  color   shape     flag  index
  red     square    0     77
  purple  triangle  0     51
  red     square    0     48
  red     circle    1     16
  red     square    1     15
  yellow  triangle  1     11


> `jq` supports `@csv` for output conversion but not input

Actually, `jq` can cope with trivial CSV input like your example - `jq -R 'split(",")'` will turn a CSV into an array of arrays. To then sort it in reverse order by the index column and retain the header, the following fell out of my fingers (I'm beyond certain that a more skilled `jq` user than me could improve it):

     jq -R 'split(",")' example.csv | jq -sr '[[.[0]],.[1:]|sort_by(.[3])|reverse|.[]]|.[]|@csv'

    "color","shape","flag","index"
    "red","square","0","77"
    "purple","triangle","0","51"
    "red","square","0","48"
    "red","circle","1","16"
    "red","square","1","15"
    "yellow","triangle","1","11"
NB. there is also an entry in the `jq` cookbook for parsing CSVs into arrays of objects (and keeping numbers as numbers, dealing with nulls, etc) https://github.com/stedolan/jq/wiki/Cookbook#convert-a-csv-f...


Another option (based on https://unix.stackexchange.com/q/11856/109046):

    (sed -u '1q' ; sort -r -k4 -t',') <example.csv | column -ts','


> sorting with skipped headers is a mess

I like these command line tools, but I think they can cripple someone actually learning a programming language. For example, here is a short program that does your last example:

https://go.dev/play/p/9bASZ97lLWv


It's a matter of perspective.

I like programming languages, but I think they can cripple someone actually learning Unix!

At the end of the day, you should just use whatever tools make you the most productive most quickly.


The whole point of UNIX userland is to not have to write a custom program for every simple case that just needs recombining some existing basic programs in a pipeline...


> simple case

that's just it though, the last example is not a simple case, which is why the last example is awkward by the commenter's own admission. command line tools are fine, but you need to know when to set the hammer down and pick up the chainsaw.


>the last example is not a simple case

As far as shell scripting goes, this is hardly anything to write home about. Looks simple enough to me.

It just retains the header by printing the header first as is, and then sorting the lines after the header. It's immediately obvious how to do it to anybody who knows about head and tail.

And with Miller it's even simpler than that, still on the command line...


To me the last example is still simple. When I encounter this in the wild, I don't really care about preserving the header.

  tail -n +2 example.csv | sort -r -k4 -t','
Or more often, I just do this and ignore the header

  sort -r -k4 -t',' example.csv
Keeping the header feels awkward, but using `sort` to reverse sort by a specific column is still quicker to type and execute (for me) than writing a program.


I thought the Go code looked way too complex and Python would be simpler. Yes and no.

  import csv
  
  filename = 'example.csv'
  sort_by = 'index'
  reverse = True
  
  with open(filename) as f:
      lines = [d for d in csv.DictReader(f)]
  
  for line in lines:
      line['index'] = int(line['index'])
  lines.sort(key=lambda line: line[sort_by], reverse=reverse)
  print(','.join(lines[0].keys()))
  for line in lines:
      print(','.join(str(v) for v in line.values()))


Perhaps a `DictWriter` would simplify things:

    import csv
    import sys
    
    filename = "example.csv"
    sort_by = "index"
    reverse = True
    
    with open(filename, newline="") as f:
        reader = csv.DictReader(f)
        writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(sorted(reader, key=lambda row: int(row[sort_by]), reverse=reverse))


I thought about that but 1) it seemed like cheating to write to standard out, 2) you're assuming that the column to sort by is an integer whereas I broke that code up a little bit.

But yours has the advantage of being able to support more complex CSVs.


+1 for easy install (`dnf install miller` on my Fedora).

But seems like it cannot handle a simple use case: CSV without header.

$ mlr --csv head -n 20 pp-2002.csv

mlr: unacceptable empty CSV key at file "pp-2002.csv" line 1.

You have to explicitly pass it (FYI `implicit-csv-header` is a terrible arg name)

$ mlr --csv --implicit-csv-header head -n 20 pp-2002.csv

While plain `head` happily obliges

$ head -n 20 pp-2002.csv


Well, head just reads lines. A CSV row can optionally be multiple lines.

Also head doesn't do anything with the data or the format, aside from printing line by line, so doesn't need to know any column names.


head reads CSV row that has multiple lines.

Miller is being offered as "like awk, sed, head.." (emphasis on head - mine) and yes it offers more, but it does not behave "like" the *nix tools it refers to.


Miller is designed around the idea of structured data. This is a higher bar than naively manipulating text, and the user has to be more deliberate to extract fields. Doing cli manipulation of a csv that contains quoted commas is challenging with the standard tools.
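A quick made-up illustration of the quoted-comma point - cut splits inside the quotes, while mlr parses the field as a unit:

    $ cat quoted.csv
    name,notes
    Alice,"hello, world"

    $ cut -d, -f2 quoted.csv
    notes
    "hello

    $ mlr --csv cut -f notes quoted.csv
    notes
    "hello, world"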


>head reads CSV row that has multiple lines.

It just reads lines, doesn't know anything about rows.

So if you want, say, the first 100 rows of a csv that has multiline rows, miller can give it, while head can't.

Head will just print the "N" first lines - whether those are 1, 22, 36 or N actual csv rows.

>Miller is being offered as "like awk, sed, head.." (emphasis on head - mine) and yes it offers more, but it does not behave "like" the *nix tools it refers to.

That's the whole idea.

That it behaves in a way more suited to the csv format, and more coherent than 5-6 different text-focused tools.

The claim is not "this is awk, sed, head, sort remade for csv with identical interfaces and behavior" but "this is a tool to work with csv files and do what you'd normally have to jump through hoops to do with awk, sed, head, sort which don't understand csv structure".


you could shorten to:

$ mlr -N --csv head -n 20 pp-2002.csv

-N is a shortcut for --implicit-csv-header and --headerless-csv-output


This is a cool tool. And it works well with GPT4. I asked ChatGPT4 the following: ``` I'd like to use Miller from John Kerl on the CSV called test.csv. It has 3 columns, Name, City and State. I'd like to filter City where it's either Chicago or Boston. I want to remove the State column and add a new column called NewColumn. This new column shall be filled with the value of the Name column appended with the letter `_new`. Finally the output shall be saved into the file called myfile.csv. Write the command to do that ```

And it replied with

mlr --csv filter '$City == "Chicago" || $City == "Boston"' then cut -x -f State then put '$NewColumn = $Name . "_new"' test.csv > myfile.csv

I'm sure that someone familiar with Miller would write this command faster than writing the text to GPT, but as a newbie I'd have spent much longer. Also each step is described perfectly.


Csvkit is the usual way to handle working with CSVs (or Excel, or JSON, or.... Postgres): https://csvkit.readthedocs.io/en/latest/index.html#why-csvki...

brew install csvkit and enjoy
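A typical pipeline, assuming a data.csv with City/State columns (commands from memory, so double-check the flags):

    $ csvcut -c City,State data.csv | csvgrep -c City -m Chicago | csvlook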


Shell tool favorite from the past: https://github.com/stefan-schroedl/tabulator


Miller is a welcome addition to my toolbox. I just wish it was as 'easy' to use as jq... part of that might be how infrequently I work with CSVs


Huh? jq has far more difficult syntax...


The syntax is pretty crazy, so there's a learning curve. But once you're over the hump, jq is easy to use for transformations or "querying" of large volumes of json data.
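For example, once the "everything is a filter over a stream" model clicks, a select-and-project like this (made-up records) reads pretty naturally:

    $ cat events.json
    [{"user": "alice", "n": 3}, {"user": "bob", "n": 12}]

    $ jq '.[] | select(.n > 10) | .user' events.json
    "bob"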


Yea agreed. The moment I need to do anything complicated I start getting this gut feeling I’d be better off just writing a quick Go cli tool.


Looks cool! It'd help to have some more full examples (including output) of what some common use-cases do in the README. After looking through this, I'm still scratching my head trying to think of what problem this could solve in my own practice, and I work with a lot of csv files daily


In some sense, there's no answer to your question because everyone's tool workflow is their own.

That said, the most recent invocation for me was `mlr --icsv --ojson cat < a515b308-9a0e-4e4e-99a2-eafaa6159762.csv` to fix up a CloudTrail csv. After that I happen to have more muscle memory with jq, but conceptually the next step could have been `filter '$eventname == "CreateNodegroup"'`.
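i.e. the all-Miller version would have looked something like (untested, same field name as above):

    $ mlr --icsv --ojson filter '$eventname == "CreateNodegroup"' \
        a515b308-9a0e-4e4e-99a2-eafaa6159762.csv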


While I get what they mean by "tabular JSON", that's a bit of a misnomer.


I was wondering what they meant by that term as well. Does every JSON document in the array have to have the exact same structure (including the ordering of key-value pairs)? What happens if row #1000 introduces a new key-value pair not seen before? What if the value for a key is an array?


It generally means a sequence/array of flat json objects, where each key represents a column name and its value the row value for that column. Nested json can also be "flattened".

> "... ordering of key-value pairs"

order of appearance of key-value pairs in input does not matter.

> "... if row #100 introduces a new key-value pair"

this is sparse data. miller handles this with the "unsparsify" verb:

    $ cat in.json
    { "a": 1, "b": 2 }
    { "a": 3, "b": 4 }
    { "a": 5, "b": 6, "c": 7 }
without unsparsify:

    $ cat in.json | mlr --j2p cat
    a b
    1 2
    3 4

    a b c
    5 6 7
with unsparsify:

    $ cat in.json | mlr --j2p unsparsify then cat
    a b c
    1 2 -
    3 4 -
    5 6 7
unsparsify can also set default values:

    $ cat in.json | mlr --j2p unsparsify --fill-with 0 then cat
    a b c
    1 2 0
    3 4 0
    5 6 7
> "... What if the value for a key is an array?"

Array value treatment seems to depend on output format. for output types that can represent arrays, they are preserved:

    $ cat in-array.json
    { "a": [0,1,2], "b": [5,6,7] }
    { "a": [3,4,5], "b": [8,9,0] }

    $ cat in-array.json | mlr --jsonl cat
    {"a": [0, 1, 2], "b": [5, 6, 7]}
    {"a": [3, 4, 5], "b": [8, 9, 0]}
for formats like csv/fixed-width, arrays are flattened into columns, one for each array element:

    $ cat in-array.json | mlr --j2p cat
    a.1 a.2 a.3 b.1 b.2 b.3
    0   1   2   5   6   7
    3   4   5   8   9   0
flatten separator can also be set:

    $ cat in-array.json | mlr --j2p --flatsep _ cat
    a_1 a_2 a_3 b_1 b_2 b_3
    0   1   2   5   6   7
    3   4   5   8   9   0


This is basically what NuShell does (with the shell -- but also with files too). Why don't these guys join the NuShell team?


How is it better than SQL for these tasks on tabular data?

That's the first question I have after reading the title. Haven't read the article.

Edited the first sentence. Originally it was "Why it is not called SQL if works on tabular data?".

My point, if one wants cat, sort, sed, join on tabular data, SQL is exactly that. Awk is too powerful, not sure about it.


>How is it better than SQL for these tasks on tabular data?

It has far better handling of a CSV/TSV file on the command line directly and is composable in shell pipelines.

>My point, if one wants cat, sort, sed, join on tabular data, SQL is exactly that.

SQL is a language for data in the form of tables in relational databases.

While it can do sorting or joining or some changes, it is meant for a different domain than these tools, with other constraints, other concerns, and other patterns of use...

You don't need to load anything to a db, for starters.

You also normally don't care for involving a DB in order to use in a shell script, or for quick shell exploration.

You also can't mix SQL and regular unix userland in a pipeline (well, with enough effort you can, but it's not something people do or need to do).


DuckDB is actually pretty good at this kind of thing.

Doesn't need to load anything to DB

Can be used in shell

Can read from stdin and write to stdout
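e.g. something along these lines drops straight into a pipeline (flags and function names from memory, so treat it as a sketch):

    $ cat example.csv | duckdb -c \
        "copy (select * from read_csv_auto('/dev/stdin') where flag = 1) to '/dev/stdout' (format csv)"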


csvsql is the first google result for "sql in command line for csv"

https://towardsdatascience.com/analyze-csvs-with-sql-in-comm...


Yes, I know. It's not that there aren't several ways to do it, it's that it's not really a good fit for the command line, except in the "I want to reuse SQL that I already know" sense.

The problem isn't having a way to use SQL to query the data from the command line, it's that SQL is long-winded and its syntax is not a natural fit for a traditional shell pipeline.



1. Because it doesn't use SQL syntax.

2. Because it's closer to an amalgamation of the standard shell scripting tools (cut, sort, jq, etc) than it is to a SQL variant.


and miller also has a powerful (and intuitive imo.) dsl should you need to go beyond what the simple command line switches offer.

so, it remains simple and concise for the easy problems, yet can scale up to address the trickier ones as well.

i've used csvkit, sqlite, jq, etc. miller is my favorite tool for data-fu (with sqlite being a close 2nd)
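e.g. (riffing on the example.csv used elsewhere in this thread, untested) a derived column via put:

    $ mlr --icsv --opprint put '$weight = $index * $flag' example.csv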


Because SQL (Structured Query Language) is the language used to access/manipulate the data, not the name for that kind of data. There have been, and are, many databases which essentially use tabular data but are not SQL databases.


Seems cool. Maybe a rainbow / alternating color per column would be nice


I've been using chat-GPT to work with these types of files now.


ChatGPT is pretty good at giving working code snippets for Pandas, you can describe the transforms you want and a sketch of the data (column names and sample rows) and it will usually give back working code.



