A list of command line tools for manipulating structured text data (github.com/dbohdan)
177 points by networked on May 7, 2016 | 55 comments



Also csvkit.

  in2csv data.xls > data.csv
  in2csv data.json > data.csv
  csvcut -c column_a,column_c data.csv > new.csv
  csvcut -c column_c,column_a data.csv > new.csv
  csvgrep -c phone_number -r "555-555-\d{4}" data.csv > matching.csv
  csvjson data.csv > data.json
  csvsql --query "select name from data where age > 30" data.csv > old_folks.csv
  csvsql --db postgresql:///database --insert data.csv
  sql2csv --db postgresql:///database --query "select * from data" > extract.csv
https://github.com/wireservice/csvkit

(Submitted a pull request.)


Oh man, where were you a few months ago, when I had to write simple apps just to do some of this stuff... this will be so damn handy:

    csvsql --query "select name from data where age > 30" data.csv > old_folks.csv


Also see DBD::CSV, which has been around since 1998! - https://metacpan.org/pod/DBD::CSV


awk is really, really powerful. It is fast, and you can do a lot very efficiently just by playing with the 'FS' variable. And you can find it on all *nix boxes. And it works nicely with other CLI tools such as cut, paste, or datamash (http://www.gnu.org/software/datamash/). As soon as it becomes too complex, though, it is better to resort to a real language (be it python, perl or whatever - my favourite is python + scipy).
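
For example, two equivalent throwaway one-liners against /etc/passwd, just to illustrate FS (a minimal sketch, nothing more):

  # users whose login shell is bash, fields split on ':'
  awk -F: '$7 == "/bin/bash" { print $1 }' /etc/passwd

  # the same thing, setting FS in a BEGIN block
  awk 'BEGIN { FS = ":" } $7 == "/bin/bash" { print $1 }' /etc/passwd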


I use awk and sed (with tr/sort/uniq doing some heavy lifting) for most of my data analysis work. It's a really great way to play around with data to get a feel for it before formalizing it in a different language.
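
A typical throwaway pipeline of that sort (data.txt is just a placeholder for whatever whitespace-separated file is at hand):

  # frequency count of the values in column 2
  awk '{ print $2 }' data.txt | sort | uniq -c | sort -rn | head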

For an interview, I wrote this guy to do a distributed-systems top-ten word count problem. It turned out to be much faster than anything else I wrote when combined with parallel. It's easier to read when split into a bash script :) [0].

  time /usr/bin/parallel -Mx -j+0  -S192.168.1.3,: --block 3.5M --pipe --fifo --bg "/usr/bin/numactl -l /usr/bin/mawk -vRS='[^a-zA-Z0-9]+' '{a[tolower(\$1)]+=1} END { for(k in a) { print a[k],k} }'" < ~/A* |  /usr/bin/mawk '{a[$2]+=$1} END {for(k in a) {if (a[k] > 1000) print a[k],k}}' | sort -nr | head -10
[0] https://github.com/red-bin/wc_fun/blob/master/wordcount.sh


Awk is great at what it does, but I find myself unable to keep it cached in my brain long enough to reuse it. Using awk usually means a Google search for how to use it, which defeats the purpose of working quickly at a terminal.


awk is a really productive constraint to submit to. If a problem looks like a text processing problem, use awk and let the idioms set in. It is an awesome tool to have on one's belt.

It is, in a sense, liberating to not have any library support for common problems: no need to learn a library (hey!), and, by the way, what you need in any given problem is an easy subset of what that library would do anyway.

Awk has made me focus more on the data there is to analyze, rather than the framework to analyze it with.

As the idiomatic use of awk is also very succinct, I can hardly imagine working efficiently on the command line without it.


I take a different stance. I never use awk, because I can use Perl. In using Perl, I also have access to the many modules that extend its usefulness. One of the core reasons Perl was created was to fill in where awk wasn't as useful as it could be.


I use awk only if it's a trivial one-liner. The only awk I can remember is selecting specific columns in whitespace-separated text. If it's just a single column, I'll try to use cut(1).
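
That is, something along these lines (file.txt being a placeholder):

  # third whitespace-separated column
  awk '{ print $3 }' file.txt

  # third tab-separated column with cut(1)
  cut -f3 file.txt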

For anything more complicated, I'll use Python. If it's something that is going to have a somewhat long life, or that I'll need to feed into plotting, I'd rather use Python, because I remember the bad old days of plot(1).


This is a similar argument used by users of languages like J, K and APL. The language is small and concise - rather than learn libraries or frameworks you use the primitives and the emphasis is on the data being manipulated. Common patterns (like +/%# for average) are recognized rather than named.


AWK is a language. I always get upset when people call it a program or a tool. In a very BROAD and general sense, all languages are also programs or tools, but AWK is first and foremost a language. Perl isn't called a tool or a program 1/100th as much as AWK is. Maybe I am just petty?


This is probably because a good majority of the time it's used on the command line directly instead of writing the code to a file and executing that.

In the days when Perl was more commonly used inline (perl -e 'print "\x90"x200;' and friends), it was often referred to in the same way.


HIGHLY Recommended Book:

Data Science at the Command Line

http://datascienceatthecommandline.com/



Thanks, hadn't heard of TXR. Gema[1] is another tool I'm surprised I never hear mentioned.

[1] http://gema.sourceforge.net/


I learned about Gema from a user on the TXR mailing list. Also learned about a similarly named TXL: http://www.txl.ca That similarity is pure coincidence.

As are similarities with some features of Gema, in some cases striking. They are a complete coincidence; I had no awareness of Gema whatsoever when I started TXR in 2009. For instance, the use of the @ character. Check this out: Gema has @bind{var;string}. TXR has @(bind var expression). Complete, utter, double-take, rub-your-eyes coincidence! The bind terminology in TXR comes from Lisp and from pattern matching and unification concepts: variables in patterns bind to the parts of the data which they match. So of course, if you introduce an explicit construct for extending the bindings, you call it bind.

TXR is a project with a much larger scope than the current state of things like Gema and TXL; it provides a complete programming language that is paired with a text extraction language.


I've used some very basic TXR for refactorings that were a bit beyond my IDE's capabilities, which gave me a taste of how powerful it could be. One thing that's slowed me down in experimenting with it is having to save the script, rerun TXR and refresh the output file each time I make a change. Do you have any tips for quickly and interactively building complex scripts?


Not in the TXR pattern language, I'm afraid! TXR Lisp has those interactive properties of just re-loading a module on-the-fly to redefine some functions without redefining the entire program, but the pattern language isn't interactive in that way.

This is pretty much by design, from the ground up, stemming from the way a script is parsed in its entirety before being executed.

You can develop the logic in small pieces and test them in isolation on some sub-segment of the data, then integrate the pieces into a larger script with @(load ...).

Speaking of code refactorings, I also use it for that myself. When using it to interactively rewrite code, I invoke it in a pipeline out of vim. E.g. select some range of code visually with V, then "!txr script.txr -[Enter]" to filter it through. Vim deletes the original text and replaces it with the output. If the output is wrong, I hit u for undo to restore the original text. Then of course I have to fix script.txr and save it, and recall and repeat that filtering step just with ":[up arrow][Enter]".

To avoid refreshing the output file during development, don't have one; let it dump to standard output.

Something else useful: we can monitor the first N lines of an output file that fit on the screen using the watch utility found in many GNU/Linux distros. watch repeatedly executes the command it is given (once every 2.0 seconds by default) and splashes its output onto a cleared screen. If we "watch cat file", we can monitor the changing contents of file.
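
For example (out.txt being whatever file the script writes to):

  # re-run every second, showing the first 40 lines of the output file
  watch -n 1 head -n 40 out.txt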


I didn't think of calling TXR from within vim; thanks, I'll try that.

Watch could get me closer to the workflow I'm used to from web development where the browser auto-reloads whenever a change is made. Your mentioning vim (of which I'm not exactly a power user) actually prompted me to check if it had the ability to run a command on writing a buffer or to automatically reload a file when it changes. Turns out it does indeed have ways to do both these things, so that might be another way to achieve what I'm after.


An example of a more complex script is Tamarind: a CGI application in TXR which lets users log in with their IMAP/SASL credentials and edit their personal list of randomly generated throwaway e-mail aliases.

http://www.kylheku.com/cgit/tamarind/tree

This is a CGI script that runs under Apache. It doesn't use any web framework; everything is there in the code: parsing URL parameters and form data, handling sessions and authentication, etc.

In early versions, I scanned the /etc/passwd file and authenticated users by doing raw crypt calls on the passwords. Then I upgraded to SASL authentication: sending tokens to "saslauthd" over Unix sockets. This is in

http://www.kylheku.com/cgit/tamarind/tree/auth.txr

Then I added IMAP. That was an example of incremental development. I tested the IMAP code in isolation and then just planted it into auth.txr.

I was surprised because I was sure that would be done in Lisp; but the pattern language easily implements the IMAP conversation needed to authenticate.

I expected most of Tamarind to be Lisp; I'm surprised how much of it ended up as TXR pattern language via the path of least resistance.


I actually tried running TXR from a Python web service so that I could work with it via a web interface, but running it with subprocess felt all wrong. It didn't occur to me to do it as CGI as it would have seemed far too ambitious for a beginner, but that looks surprisingly straightforward.


Check out pup for parsing HTML. https://github.com/ericchiang/pup

pup uses CSS selectors to select elements from HTML documents. Used in conjunction with curl, it gives you a very simple and low friction way to scrape data in scripts.
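
For instance, a rough sketch (the URL and selectors here are just placeholders):

  # all link targets on a page
  curl -s https://example.com/ | pup 'a attr{href}'

  # text of every second-level heading
  curl -s https://example.com/ | pup 'h2 text{}'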


I would add to that list Nokogiri, "The Chainsaw". xsltproc is ubiquitous, but writing xslt is akin to having a pack of wild monkeys compose a mural with their excrement.


It isn't easy (perhaps not even possible) to get the name of the fruit Bob grows on his farm using any of these tools and the following data:

  {
    "models": [{
      "title": "fruits",
      "fields": [
        {"name": "Name", "key": "3746"},
        {"name": "Colour", "key": "4867"}
      ],
      "entities": [{
        "_id": "372612",
        "3746": "Orange",
        "4867": "orange"
      }]
    }, {
      "title": "farmers",
      "fields": [
        {"name": "Full name", "key": "8367"},
        {"name": "Address", "key": "3947"},
        {"name": "Fruits", "key": "5243"}
      ],
      "entities": [{
        "_id": "747463",
        "8367": "Bob, the farmer",
        "3947": "Farmland",
        "5243": ["372612"]
      }]
    }]
  }


I've been meaning to learn jq, so I decided to give it a try.

    FRUITS=$(cat input.json | jq '.models | map(select(.title == "fruits")) | .[0]')
    FRUIT_NAME_KEY=$(echo "$FRUITS" | jq '.fields | map(select(.name == "Name")) | .[0].key')
    
    FARMERS=$(cat input.json | jq '.models | map(select(.title == "farmers")) | .[0]')
    FARMER_NAME_KEY=$(echo "$FARMERS" | jq '.fields | map(select(.name == "Full name")) | .[0].key')
    FARMER_FRUITS_KEY=$(echo "$FARMERS" | jq '.fields | map(select(.name == "Fruits")) | .[0].key')
    
    BOB=$(echo "$FARMERS" | jq '.entities | map(select(.['$FARMER_NAME_KEY'] == "Bob, the farmer")) | .[0]')
    BOB_FRUIT_IDS=$(echo "$BOB" | jq '.['$FARMER_FRUITS_KEY'] | .[]' -r)
    
    for BOB_FRUIT_ID in $BOB_FRUIT_IDS; do
        echo "$FRUITS" | jq '.entities | map(select(._id == "'$BOB_FRUIT_ID'")) | .[0] | .['$FRUIT_NAME_KEY']'
    done
There's a bit of bash boilerplate, but honestly it was about what I would expect, given a structure with so many layers of indirection.

Pain points:

* Switching between bash and jq's filtering language led me to use string interpolation with bash variables. Malicious inputs can probably exploit this (and it was just awkward anyway); a possible workaround with --arg is sketched below.

* A "select one" filter would be nice, instead of select + get first element.


jq is powerful enough to express it in one query: it has variables (bound with "expr as $name") which make this thing at least vaguely feasible. That doesn't mean you should, but you could:

    .models
      | (.[]
        | if .title == "farmers"
          then
              (.fields | .[] | if .name == "Fruits" then .key else empty end)
                as $fruits
            | (.fields | .[] | if .name == "Full name" then .key else empty end)
                as $name
            | .entities
            | .[]
            | if .[ $name ] == "Bob, the farmer" then .[ $fruits ] else empty end
          else empty end)
            as $fruits
      | .[]
      | if .["title"] == "fruits"
          then
             ( .fields | .[] | if .name == "Name" then .key else empty end) as $fruit_name
             | .entities | [ .[] | {(._id): .[$fruit_name]} ] | add as $lookup
             | $fruits | .[] | $lookup[.]
          else empty
        end
(I'm not claiming this is the best way to write that query, but it's the first one I came up with.)


Here's the equivalent using jsonaxe[1]. The main difference is the Python-like syntax, which is either good or bad depending on your tastes. The pain point about string interpolation remains, though.

    data=input.json
    
    fruits () {
        jsonaxe 'models.filter(lambda x: x["title"] == "fruits")[0]' "$data"
    }
    
    farmers () {
        jsonaxe 'models.filter(lambda x: x["title"] == "farmers")[0]' "$data"
    }
    
    fruit_name_key=$(fruits |
        jsonaxe --raw 'fields.filter(lambda x: x["name"] == "Name")[0].key')
    
    farmer_name_key=$(farmers |
        jsonaxe --raw 'fields.filter(lambda x: x["name"] == "Full name")[0].key')
    
    farmer_fruits_key=$(farmers |
        jsonaxe --raw 'fields.filter(lambda x: x["name"] == "Fruits")[0].key')
    
    farmers |
    # get bob
    jsonaxe "entities.filter(lambda x: x['$farmer_name_key'] == 'Bob, the farmer')[0]" |
    # get bob's fruits key
    jsonaxe --raw "get('$farmer_fruits_key')" |
    while read fruit_id
    do
        fruits |
        jsonaxe "entities.filter(lambda x: x['_id'] == '$fruit_id')[0]" |
        jsonaxe "get('$fruit_name_key')"
    done
[1] https://github.com/davvid/jsonaxe


First of all, I have little sympathy for people who create JSON like that. You created a messy, hard-to-use JSON "schema"; it should be little surprise to anybody that it's messy and hard to use.

FWIW though, jq can do the query, but I'm not going to spend the time doing it.


I'm with you. Why would anyone intentionally create data like that?


I've seen JSON like this many times, and I also hate it, but I'm hardly going to use jq to fiddle with JSON I've written myself.


I don't really see what your complaint is.

The problem you're describing isn't a shortcoming of jq and the other tools in the article, it's a shortcoming of the brain dead JSON encoding.

If you need to process data like that, you'll have to resign yourself to the fact that it won't be processable as a shell one-liner.


For details, see http://pastebin.com/raw/1W88ywuu

The actual test:

    (defparameter *model*
      (make-model (cl-json:decode-json-from-source #P"/tmp/json.test")))

    (with-table (farmers :title "farmers"
                         :model *model*
                         :accessors ((farmer-name "Full name")
                                     (farmer-fruits "Fruits")))
      (dolist (farmer (table-entities farmers))
        (when (ppcre:scan "Bob" (farmer-name farmer))
          (with-table (fruits :title "fruits"
                              :model *model*
                              :accessors ((fruit-name "Name"))
                              :index t)
            (return
              (loop
                 for fid in (farmer-fruits farmer)
                 collect (fruit-name (iget fruits fid))))))))
The :accessors argument is used to resolve field access once and for all when visiting a table. If :index is true, we build a temporary hash table based on identifiers.


That's true even in Clojure, arguably the simplest and cleanest language ever invented for complex data transformation and extraction.

The Clojure solution to this still ends up requiring temporary variables and some sort of model transformation functionality. (Will try to post my Clojure solution in 5 hours after my next noprocrast timer is up.)

If the data could first be transformed so that it doesn't require temporary variables or ad-hoc transformation function definitions, instead making use of "paths", then it would be easier with command line tools. Such a transformation could be possible as its own command line interface.


Yup, lots of temporaries (I haven't written much Clojure recently, so I'm sure I'm missing lots of simplifying core fns).

  (let [; input
        js (cheshire.core/parse-string (slurp (clojure.java.io/resource "awful.json")))
        ; utils
        get-match (fn [k v coll] (first (filter #(= (get % k) v) coll)))
        model-for (fn [title toplevel] (get-match "title" title (toplevel "models")))
        key-for (fn [name* model] ((get-match "name" name* (model "fields")) "key"))
        ; models
        fruit-model (model-for "fruits" js)
        farmer-model (model-for "farmers" js)
        ; keys
        fruit-name-key (key-for "Name" fruit-model)
        farmer-name-key (key-for "Full name" farmer-model)
        farmer-fruits-key (key-for "Fruits" farmer-model)
        ; values
        bob-entity (get-match farmer-name-key "Bob, the farmer" (farmer-model "entities"))
        bob-fruit-keys (bob-entity farmer-fruits-key)
        bob-fruit-entities (map #(get-match "_id" % (fruit-model "entities")) bob-fruit-keys)
        bob-fruit-names (map #(% fruit-name-key) bob-fruit-entities)]
    bob-fruit-names)

  => ("Orange")


Got this far and gave up:

    (let [models (->> (get (clojure.data.json/read-str json-string) "models")
                      (clojure.walk/keywordize-keys))
          fields (->> models
                      (mapcat :fields)
                      (map (juxt (comp keyword :key) :name))
                      (into {}))
          entities (->> models
                        (mapcat :entities)
                        (map (juxt :_id (fn [entity]
                                          (->> (for [[k v] (dissoc entity :_id)]
                                                 [(get fields k) v])
                                               (into {})))))
                        (into {}))
          ]
      )
Posted for posterity.


That's true even in Clojure, arguably the simplest and cleanest language ever invented for complex data transformation and extraction.

Interesting, I didn't realize Clojure had a reputation for data transformation and extraction.


I've found fsql[1] to be extremely useful in the past.

1: https://metacpan.org/pod/distribution/App-fsql/bin/fsql


Don't forget cut and sed.


pgloader: load from CSV (and other formats) into PostgreSQL.

See http://pgloader.io/howto/quickstart.html


Does anyone know of a tool like ranger[1] for visualizing JSON on the terminal? There is a Chrome Extension[2], but nothing useful to browse JSON on the terminal (it doesn't have to be like ranger, I'm looking for any tool that makes it easier to take a look at a JSON file).

  [1]: https://github.com/hut/ranger
  [2]: https://chrome.google.com/webstore/detail/json-finder/flhdcaebggmmpnnaljiajhihdfconkbj


If you use vim there's https://github.com/elzr/vim-json which gives you folding, highlighting, etc.


not "like ranger" yet emacs has an extensive library to edit, highlight, and tidy various data formats (XML, HTML, yaml), and json is no exception: https://github.com/thorstadt/json-pretty-print.el


jq can pretty print JSON files, including coloring.
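
For example (data.json being any JSON file):

  # pretty-print with colours preserved through the pager
  jq -C . data.json | less -R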


For converting arrays of objects between formats like CSV, JSON, YAML, XML (WIP), etc., I built aoot[1], which stands for "Array of objects to". It's written in Node.js and uses upstream packages whenever possible.

1. https://github.com/montanaflynn/aoot




And not a single tool for s-expressions?


I was after something like jq for s-expressions a while ago, but didn't find anything other than full-blown Lisps, Schemes, or libraries written in other languages: http://stackoverflow.com/questions/31232843/jq-or-xsltproc-a...


xsv, for doing queries against CSV files, probably belongs on the list too: https://github.com/BurntSushi/xsv
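
A few sample invocations (file and column names made up):

  # list the column headers
  xsv headers data.csv

  # keep two columns and view them aligned
  xsv select name,age data.csv | xsv table

  # rows where any field matches a regex
  xsv search 'Bob' data.csv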


yacc, antlr, lemon, bison.....


Definitely. Most software vulnerabilities are from failure to write formal parsers for all inputs. Is there a command-line YACC for compiling simple stuff?


Many inputs don't have well-specified grammars.


"Most software vulnerabilities are from failure to write formal parsers on on all inputs."

That's a good quote


This might also interest you? http://langsec.org/


Thanks. Have you attended the conference?



