I recommend using clickhouse-local[1] for these tasks.
It does SQL; it supports all imaginable data formats, streaming processing, and connecting to external data sources. It also outperforms every other tool[2].
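For a taste, a minimal sketch of the kind of invocation meant here (hypothetical data.tsv with a header row; depending on the install the binary is `clickhouse-local` or `clickhouse local`):

```
$ clickhouse-local --query "
    SELECT name, count(*) AS n
    FROM file('data.tsv', 'TSVWithNames')
    GROUP BY name
    ORDER BY n DESC
    LIMIT 10"
```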
Jesus, this is disgusting. I'm not that picky and don't really complain about "... | sh" usually, but at least I took it for granted that I can always look at the script in the browser and assume that it has no actual evil intentions and doesn't rely on some fucking client-header magic to be modified on the fly.
User agent "sniffing" (I mean, it's right there. It's not exactly subtle.) has been going on since before the IE6 days. That it now extends to make things easier for us as command line users is... kinda convenient? Another site where I've seen it done to good effect is http://ifconfig.me . Hit that with a web browser and get a page of accompanying information. Hit it with curl, and get back your IP in ASCII - not even an extra newline character is returned!
The underlying question is: do you trust clickhouse.com or not? You don't have to; I've never met the team or talked to them, and I can't make that decision for you. But whether you go to the site, laboriously find the download page, right click, download a binary, install the deb/rpm, ask your package manager what files just got installed, then find and run the clickhouse binary, or just let your computer do it for you via a shell script, the end result is the same. Code from clickhouse (and we're sure it was from clickhouse because of TLS) was downloaded to the target machine and then got run. Do things have to be difficult and annoying in order for you to like them? (Psychology studies say yes, actually.)
$ cat foo.tsv
name foo bar
Alice 10 8888
Bob 20 9999
$ cat foo.tsv | sqlite3 -batch \
-cmd ".mode tabs" \
-cmd ".import /dev/stdin x" \
-cmd "select foo from x where bar > 9000;"
20
Guilty as charged... I have so many bookmarks I forget to look at...
Applet idea: an app that picks a random bookmark from your saved ones and sends you a reminder to click it and read it... and lets you schedule when you want to see it - like every day at 7AM, send me a link from my bookmarks. Set an alarm for 15 minutes to get me back on schedule.
So ChatGPT and I made this: https://github.com/erasei/randombookmark Load it in developer mode, and when you click the icon it will open a new tab with a random bookmark.
Thank you! That list is perfect since I don't use these tools often enough to remember them and some are hard to find again without knowing exactly what to search for.
I also discovered pawk on it, which looks interesting. I made a somewhat related tool [1] which is probably out of scope for your list, but it can be used in some of the same ways [2] and may be of interest nonetheless.
Benthos is a really cool tool when you want to take this idea to the next level. It can do the usual text/csv/json/yaml mangling you'd do with awk or jq (since it includes a version of those in addition to its own processing language), but it also has decoders and encoders for a bunch of different binary formats like protobuf or Avro. And in addition to stdin/out and files it can talk to Kafka, MQTT, HTTP, ES and a bunch of other stuff. I was able to put together a log processor that consumes from Kafka, does a bunch of field mangling and then ingests into Elasticsearch in a couple dozen lines of yaml.
I've been getting a lot of mileage out of https://github.com/itchyny/gojq#readme recently due to two things: its vastly superior error messages and the (regrettably verbose) `--yaml-input` option
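For reference, the YAML side looks something like this (the file name and the field path are made up):

```
$ gojq --yaml-input -r '.spec.containers[].image' deployment.yaml
```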
No thank you. I appreciate the power, speed, simplicity and flexibility of UNIX/GNU style text tools. I-Also-Don't-Want-To-Be-Locked-Into-This-Ridiculous-Syntax-Nightmare.
When I read I-Also-Don't-Want-To-Be-Locked-Into-This-Ridiculous-Syntax-Nightmare, bash and sed and cut and the likes are indeed the first things which come to mind. I really feel kind of bad sometimes for once having spent time learning them, only to later find out there are many alternatives and nearly all of them are easier to learn. Many of these tools have zero discoverability as well, meaning you effectively get locked into learning their syntax - and then another one for the next tool. Whereas at least in PS you can try common words and tab completion and sometimes get there before having to reach for 'how do I x in y'.
It's a matter of recognizing your use case. If you're going to write a program that you expect to maintain for years, sure, go ahead and make it as verbose as possible. Unix tools support this with long-form flags (usually prefixed with -- rather than -). On the other hand, if you're doing exploration and iterating interactively on the fly (which bash is best at) then you want very terse syntax to keep lines short.
Ever try this on large files? A lot of PowerShell commands I make like this can take minutes to run when a combination of Linux commands and Awk might take a couple of seconds.
Yes: the memory ceiling is huge, WinRM is buggy and unreliable, performance is "variable", handling of dates and times is totally inconsistent, and being .NET it's UTF-16 internally, so streams are painful to work with. Escape characters, argh. Variable scoping is horrible, and most of the API you have to hit to get stuff done is wrapping ancient bits of weird Win32, DCOM and WMI, and thus has inconsistent error handling and a limited ability to actually do what it is told.
While PowerShell has gotten a lot faster in the most recent versions, it's still pretty slow for anything involving computation; most of the time my equivalent Python code beats the pants off of it.
The expressiveness is nice, but oftentimes modules won't support it or require weird ways of using the data to get the performance you want (mostly by dropping out of the pipe.)
The choices around Format- vs Out- vs Convert- are Very Confusing for new people, the "object in a shell but also text sometimes" way of displaying things is weird, and until recently things like -NoTypeInformation or managing encoding in files were just pointlessly weird.
The module support and package management are still entirely in the stone age, and I regularly see people patching in C# in strings to get the behavior they want.
"Larger" modules tend to get Really Slow - the azure modules especially are just an example of how not to do it.
The way it automatically unwraps collections is cool, but gets weird when you have a 1 vs many option for your output, and you might find yourself defensively casting things to lists or w/e.
The type system in general is nice to get started with, but declaring a type does not fix it in place, so assignment can just break your entire world.
There's still a lot to love about the language: when you are getting things done in a Windows environment it's great for gluing together the various pieces of the system, but I find the code "breaks" more often than equivalent Python code.
My philosophy about any shell language is that if performance is a concern, then you should probably use a real programming language for it. Using shell scripting to handle batch processing tasks just creates dependencies on unmaintainable code.
If performance is too bad, you can only use the shell for toy examples, which means there's no reason to have a real shell at all and you might as well go back to COMMAND.COM or equivalent. It's taking the idea of Unix-style scriptable shells and "improving" the implementation to the point of unusability.
It's also inhibited by typical corporate crapware, but even with a pretty barebones/vanilla config, PowerShell takes a full four to ten seconds just to open a new session on my work computer.
PowerShell isn't even fast enough for basic interactive usage, never mind batch processing.
Simple tools with simple rules will outlast most of the code we'll all build.
Picking things like grep, awk, and sed means your knowledge will be widely applicable going forward, and many people caring about their performance both backwards and forwards in time means your shell can be pretty fast.
I agree that this is something missing from classic UNIX shells: typed output streams.
I had this discussion a while back on HN, though I can't find it ATM (I wish there was a comment history search function). I am far from the first one who thought of that, and there are a few implementations of this idea.
Searching for that comment, I came across relevant stories:
Typed output streams aren’t missing from Unix. The authors of Unix clearly describe, repeatedly, that Unix tools use plain text as the common format, and that everything should look like a file.
You’re right that standard Unix tools don’t have a concept of types in streams, but that decision got made deliberately. Types and formats got left as output details.
Analogously, my MacBook Air doesn't have a fan: by design, not by accidental omission.
Right, that was an unfortunate choice of words. It's not missing, but leads to (IMO) a proliferation of under-specified text-based data and stream exchange formats.
Having human-readable text as the lowest common denominator is a laudable goal. Shell scripting would probably be improved, however, if most tools offered alternative typed streams, or something similar. I am not convinced PowerShell's approach is the best, but it is at least interesting.
Picking nits, but the decision to use plain text as the common data format in Unix does not rule out structured text (e.g. CSV, TSV, XML). Nor does it imply “human-readable.”
The original decision was about not proliferating specialized or proprietary binary formats, which was more of a norm back in the ‘70s than today. The goal was to make small single-purpose tools that communicated through a common interface (files) in a standard format (plain text). Unix succeeded and continues to succeed at that.
Nothing about those design decisions precludes tools using binary formats under Unix — image processing, for example. It just precludes using standard text-oriented tools on those formats.
Despite documentation stating the verbs are fully streaming.
> Fully streaming verbs
> These don't retain any state from one record to the next. They are memory-friendly, and they don't wait for end of input to produce their output.
a) Isn't it written in Golang, which has a GC? Does it do custom buffer-based management?
b) Isn't it supposed to be run on a file and produce some output, as opposed to an interactive session? Why would it matter if it leaks, then, and how could it leak, as the memory is returned to the OS when it ends?
It's indeed written in Go now, which indeed has GC, and ultimately all memory is freed ... but there are indeed some issues around intermediate retention of memory, taking more memory than one would have expected.
This is nice. I use `column` for pretty printing CSV/TSV, but this fixes two tiny gaps: `sort` can't easily skip header lines, and `jq` can't parse CSV input (it supports `@csv` for output conversion but not input).
$ cat example.csv
color,shape,flag,index
yellow,triangle,1,11
red,square,1,15
red,circle,1,16
red,square,0,48
purple,triangle,0,51
red,square,0,77
# pretty printing
$ column -ts',' example.csv
color   shape     flag  index
yellow  triangle  1     11
red     square    1     15
red     circle    1     16
red     square    0     48
purple  triangle  0     51
red     square    0     77
# sorting with skipped headers is a mess.
$ (head -n 1 example.csv && tail -n +2 example.csv | sort -r -k4 -t',') | column -ts','
color   shape     flag  index
red     square    0     77
purple  triangle  0     51
red     square    0     48
red     circle    1     16
red     square    1     15
yellow  triangle  1     11
> `jq` supports `@csv` for output conversion but not input
Actually, `jq` can cope with trivial CSV input like your example: `jq -R 'split(",")'` will turn each line of a CSV into an array. To then sort it in reverse order by the 3rd column and retain the header, the following fell out of my fingers (I'm beyond certain that a more skilled `jq` user than me could improve it):
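Something along these lines - a sketch rather than the exact command, sorting by the numeric index field as in the earlier example:

```
$ jq -rRs '
    [split("\n")[] | select(length > 0) | split(",")]
    | .[:1] + (.[1:] | sort_by(.[3] | tonumber) | reverse)
    | .[] | join(",")
  ' example.csv
```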
I like these command line tools, but I think they can cripple someone actually learning a programming language. For example, here is a short program that does your last example:
The whole point of UNIX userland is to not have to write a custom program for every simple case that just needs recombining some existing basic programs in a pipeline...
That's just it though: the last example is not a simple case, hence why it's awkward by the commenter's own admission. Command line tools are fine, but you need to know when to set the hammer down and pick up the chainsaw.
As far as shell scripting goes, this is hardly anything to write home about. Looks simple enough to me.
It just retains the header by printing the header first as is, and then sorting the lines after the header. It's immediately obvious how to do it to anybody who knows about head and tail.
And with Miller it's even simpler than that, still on the command line...
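Something like this, if I remember the flags right (using the example.csv from upthread; -nr sorts numerically descending and the header stays put):

```
$ mlr --csv sort -nr index example.csv
```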
To me the last example is still simple. When I encounter this in the wild, I don't really care about preserving the header.
tail -n +2 example.csv | sort -r -k4 -t','
Or more often, I just do this and ignore the header
sort -r -k4 -t',' example.csv
Keeping the header feels awkward, but using `sort` to reverse sort by a specific column is still quicker to type and execute (for me) than writing a program.
I thought the Go code looked way too complex and Python would be simpler. Yes and no.
import csv
filename = 'example.csv'
sort_by = 'index'
reverse = True
with open(filename) as f:
    lines = [d for d in csv.DictReader(f)]
for line in lines:
    line['index'] = int(line['index'])
lines.sort(key=lambda line: line[sort_by], reverse=reverse)
print(','.join(lines[0].keys()))
for line in lines:
    print(','.join(str(v) for v in line.values()))
I thought about that but 1) it seemed like cheating to write to standard out, 2) you're assuming that the column to sort by is an integer whereas I broke that code up a little bit.
But yours has the advantage of being able to support more complex CSVs.
Miller is being offered as "like awk, sed, head.." (emphasis on head - mine) and yes it offers more, but it does not behave "like" the *nix tools it refers to.
Miller is designed around the idea of structured data. This is a higher bar than naively manipulating text, and the user has to be more deliberate to extract fields. Doing cli manipulation of a csv that contains quoted commas is challenging with the standard tools.
head just reads lines; it doesn't know anything about CSV rows.
So if you want, say, the first 100 rows of a csv that has multiline rows, miller can give it, while head can't.
head will just print the first N lines - whether those make up 1, 22, 36 or N actual csv rows.
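For example (per the above), something like this counts CSV records rather than raw lines; file.csv is just a stand-in name:

```
$ mlr --csv head -n 100 file.csv
```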
>Miller is being offered as "like awk, sed, head.." (emphasis on head - mine) and yes it offers more, but it does not behave "like" the *nix tools it refers to.
That's the whole idea.
That it behaves in a way more suited to the csv format, and more coherent than 5-6 different text-focused tools.
The claim is not "this is awk, sed, head, sort remade for csv with identical interfaces and behavior" but "this is a tool to work with csv files and do what you'd normally have to jump through hoops to do with awk, sed, head, sort which don't understand csv structure".
This is a cool tool. And it works well with GPT4. I asked ChatGPT4 the following:
```
I'd like to use Miller from John Kerl on the CSV called test.csv. It has 3 columns, Name, City and State. I'd like to filter City where it's either Chicago or Boston. I want to remove the State column and add a new column called NewColumn. This new column shall be filled with the value of the Name column appended with the letter `_new`. Finally the output shall be saved into the file called myfile.csv. Write the command to do that
```
And it replied with
mlr --csv filter '$City == "Chicago" || $City == "Boston"' then cut -x -f State then put '$NewColumn = $Name . "_new"' test.csv > myfile.csv
I'm sure that someone familiar with Miller would write this command faster than writing the text to GPT, but as a newbie I'd have spent much longer. Also, each step is described perfectly.
The syntax is pretty crazy, so there's a learning curve. But once you're over the hump, jq is easy to use for transformations or "querying" of large volumes of json data.
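For instance, a made-up but typical "query" over events.json (an array of objects with a user field):

```
$ jq 'group_by(.user) | map({user: .[0].user, events: length})' events.json
```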
Looks cool! It'd help to have some more full examples (including output) of what some common use-cases do in the README. After looking through this, I'm still scratching my head trying to think of what problem this could solve in my own practice, and I work with a lot of csv files daily
In some sense, there's no answer to your question because everyone's tool workflow is their own.
That said, the most recent invocation for me was `mlr --icsv --ojson cat < a515b308-9a0e-4e4e-99a2-eafaa6159762.csv` to fix up a CloudTrail csv; after that my muscle memory reaches for jq, but conceptually the next step could have been `filter '$eventname == "CreateNodegroup"'`.
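For completeness, the whole thing could plausibly be collapsed into a single Miller step, roughly:

```
$ mlr --icsv --ojson filter '$eventname == "CreateNodegroup"' a515b308-9a0e-4e4e-99a2-eafaa6159762.csv
```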
I was wondering what they meant by that term as well. Does every JSON document in the array have to have the exact same structure (including the ordering of key-value pairs)? What happens if row #1000 introduces a new key-value pair not seen before? What if the value for a key is an array?
It generally means a sequence/array of flat json objects, where each key represents a column name and the value represents the row value for that column. Nested json can also be "flattened".
> "... ordering of key-value pairs"
The order of appearance of key-value pairs in the input does not matter.
> "... if row #1000 introduces a new key-value pair"
This is sparse data. Miller handles this with the "unsparsify" verb:
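For example (a made-up two-record input; by default Miller fills the missing key with an empty value):

```
$ cat sparse.json
[ {"a": 1}, {"a": 2, "b": 3} ]
$ mlr --json unsparsify sparse.json
```

The first record then comes out with an empty "b" as well, so every record ends up with the same set of keys.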
>How is it better than SQL for these tasks on tabular data?
It has far better handling of a CSV/TSV file on the command line directly and is composable in shell pipelines.
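A sketch of the kind of composition meant here, reusing the example.csv from upthread (assuming Miller 6's list-wrapped JSON output):

```
$ mlr --icsv --ojson cat example.csv | jq -r '.[].color' | sort | uniq -c
```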
>My point: if one wants cat, sort, sed, join on tabular data, SQL is exactly that.
SQL is a language for data in the form of tables in relational databases.
While it can do sorting or joining or some changes, it is meant for a different domain than these tools, with other constraints, other concerns, and other patterns of use...
You don't need to load anything to a db, for starters.
You also normally don't care for involving a DB in order to use in a shell script, or for quick shell exploration.
You also can't mix SQL and regular unix userland in a pipeline (well, with enough effort you can, but it's not something people do or need to do).
Yes, I know. It's not that there aren't several ways to do it; it's that it's not really a good fit for the command line, except in the "I want to reuse SQL that I already know" sense.
The problem isn't having a way to use SQL to query the data from the command line; it's that SQL is long-winded, with syntax that doesn't really fit in a traditional shell pipeline.
Because SQL (Structured Query Language) is the language used to access/manipulate the data, not the name for that kind of data. There have been, and are, many databases which essentially use tabular data but are not SQL databases.
ChatGPT is pretty good at giving working code snippets for Pandas: you can describe the transforms you want and a sketch of the data (column names and sample rows), and it will usually give back working code.
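For instance, for the Miller task a few comments up, the kind of snippet it tends to hand back looks roughly like this (a sketch, not an actual ChatGPT transcript):

```
import pandas as pd

# same transform as the mlr example above: keep two cities, drop State,
# add NewColumn = Name + "_new", and write the result out
df = pd.read_csv("test.csv")
df = df[df["City"].isin(["Chicago", "Boston"])]
df = df.drop(columns=["State"])
df["NewColumn"] = df["Name"] + "_new"
df.to_csv("myfile.csv", index=False)
```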
[1] https://clickhouse.com/blog/extracting-converting-querying-l...
[2] https://colab.research.google.com/github/dcmoura/spyql/blob/...