An introduction to data processing on the Linux command line (robertelder.org)
203 points by robertelder on Nov 23, 2019 | 71 comments



If you're interested in this space, a great resource can be found at https://www.datascienceatthecommandline.com/ (a free guide that goes along with an O'Reilly book).



Command-line tools are powerful beasts (e.g. awk) and they have always been central to data preprocessing. But do we need to call it data science now?


Yeah this article is about processing text data and not any form of statistics, modeling, etc. I'm guessing they added "data science" because it's in vogue? In any case, the provided title does not reflect the article.


NLU. It relates to extracting intelligence from human language, most of which comes in the form of text.


Regarding more than one mention of UUOC in this thread:

- The original award started in 1995. Even though the Pentium was already out, I think it is safe to say that was the era of 486 PCs. In 2019, for day-to-day shell work (meaning no GBs of file-processing or anything like that), isn't invoking UUOC and pointing out inefficiencies an example of premature optimization [1]?

- Isn't readability a matter of subjectivity, and for some folks isn't 'cat file' more readable than '<file' or a direct use of a processing command (like grep, tail, head, etc.) [2]? (The whole Stack Overflow page is fairly illuminating [3].)

[1] http://wiki.c2.com/?PrematureOptimization

[2] https://chat.stackoverflow.com/rooms/182573/discussion-on-an...

[3] https://stackoverflow.com/questions/11710552


Not really where the author is heading, but I like to configure a backend for matplotlib to render graphics in a terminal, so that when I am SSHed into a remote system I can get inline plots.


Better solution: sixel-gnuplot

Shameless plug: https://github.com/csdvrx/sixel-gnuplot
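For the curious, plain gnuplot can also emit sixels on its own. A minimal sketch, assuming a gnuplot built with the sixelgd terminal and a terminal emulator that understands sixels:

    # plot straight into the terminal over SSH, no X forwarding needed
    gnuplot -e "set terminal sixelgd size 800,480; plot sin(x) title 'sin(x)'"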


Thanks, I will try that.


If you like it, share your terminal configuration!

mlterm works.

mintty had a regression, 3.1.0 may have fixed that



Here are some ways you could simplify some of the tasks in the article, saving on typing:

    cat data.csv | sed 's/"//g'
can be simplified by doing this instead:

    cat data.csv | tr -d '"'

This awk command:

    cat sales.csv | awk -F',' '{print $1}' | sort | uniq
Can be replaced with a simpler (IMO) cut instead:

    cat sales.csv | cut -d , -f 1 | sort | uniq

When using head or tail like this:

    head -n 3
You don't need the -n:

    head -3

Also shout out to jq, xsv, and zsh (extended glob), all nice complements to the typical command line utils.
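A couple of hedged one-liners for those (the JSON file name is made up; sales.csv is the article's example file):

    # jq: pull one field out of an array of JSON objects
    jq -r '.[].name' data.json

    # xsv: quote-aware column selection, then a frequency table
    xsv select 1 sales.csv | xsv frequency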


If you want to simplify things, don't employ "useless use of cat". Pass the file as a command arg or redirect input. And sort has options, so the third/fourth commands can be

sort -u -t, sales.csv

However, those fail with quoted commas.

Also, head -3 is non-POSIX obsolete syntax.

Edit: I don't know why I didn't see other UUOC references initially.


I like cut and tr too, but I try to replace them with sed and awk when I can. It reduces the number of moving parts and lets you increase the complexity slowly.

Ex: | sed -e step1 becomes | sed -e step1 -e step2 instead of adding another pipe and another "moving part" like tr
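A sketch of what that looks like, reusing the article's quote-stripping step (the second expression is made up, and \t in the replacement is a GNU sed extension):

    # one sed process, two expressions: strip quotes, then turn commas into tabs
    sed -e 's/"//g' -e 's/,/\t/g' data.csv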


The awk command

  cat sales.csv | awk -F',' '{print $1}' | sort | uniq
can even be further simplified to

  cut -d, -f1 sales.csv | sort -u


When I was at a genetics lab, I was helping some researchers on something and spent 3 days writing a perl script, which kept failing. I sent an email to one of the guys who wrote the paper the research was being based on, and he said, why not try awk like this? With a little work, I turned 3 days of perl into a 1 line awk that was faster than anything else for the job at the time. That was an inspirational moment for the fundamental power of the unix philosophy and the core utilities in linux for me.

Good introductory article here!


This is a great list and well-written. As a data professional, I use these commands all the time and my job would be much harder without them. I also learned a few new things here (`tee` and `comm`).
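For anyone else meeting those two for the first time, a couple of hedged sketches (the output file names are made up):

    # tee: save an intermediate result while the pipeline keeps going
    cut -d, -f1 sales.csv | sort | tee names_sorted.txt | uniq -c

    # comm: compare two sorted files; -12 keeps only lines common to both
    comm -12 names_sorted.txt other_sorted.txt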

I was lucky that my first job was as a support engineer at a data-centric tech company, which is where I learned these. I've often thought about how to teach them to data analysts coming from a non-engineering background. This is comprehensive but clear and would be a perfect resource for training someone like that. Thank you!


I'll just leave one of my past comments [1] here.

[1] https://news.ycombinator.com/item?id=17324222

P.S.: Not essential, but it really becomes a joy when, as a touch typist, I have turned on vi mode in the shell (e.g., with 'set -o vi'). My fingers never have to leave the home row while I do my shell piping work from start to finish. (no mouse, no arrow keys, etc.)
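If you want vi keys everywhere readline is used, not just in bash, something like this should work:

    # in ~/.bashrc
    set -o vi

    # in ~/.inputrc, picked up by anything built on readline
    set editing-mode vi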


Haha. That’s me. Once you go ‘set -o vi’, you can’t go back


Huh, so it turns out that I've been a 'data scientist' for over 20 years. Who knew?


That was my first thought skimming through this too. Either every *nix admin who is aware of a few text processing tools is a data scientist, or "data scientists" are just as full of it as I'd expected.


Just because a tool can be used for A, B, or C, being an expert at using that tool for A does not make you an expert in B and C.

The whole point of this article is to point out that a lot of common Linux tools can be used for data-science-like work (a significant part of which is preprocessing structured and unstructured text).


Why do people use Linux in place of *nix?

Even worse, most of the tools (cat, grep, awk) are Unix commands, redeveloped by the GNU project in most GNU/Linux distros.


> Why do people use Linux in place of *nix?

I find it more irritating when people try to score greybeard points by saying *nix (or Unix) when it's obvious that they're talking about a Linux-only mechanism and quite possibly haven't ever used Unix (or a direct derivative).


Huh? What is specific to Linux in this post?

Also, the most popular Unix-like OS (far more than Linux) is macOS, basically the least “leet greybeard affectation” thing I can imagine. Your irritation is way off base.


> Huh? What is specific to Linux in this post?

Perhaps nothing? I was responding to the complaint in general terms.

> Your irritation is way off base.

Please allow me to feel irritated when people refer to obvious Linux things as something that's supposedly got something to do with Unix. It happens often enough.


Fair enough; I suppose it’d be annoying for people to talk about “the Unix concept of cgroup namespaces” or something like that.

I had thought that you were directly responding to the original poster.


Most popular Unix-like OS on consumer devices is Linux (Android). Most popular Unix-like OS on servers is Linux. Most popular Unix-like OS in embedded is Linux. Most popular Unix-like OS on supercomputers is Linux. Most popular Unix-like OS on IBM PC compatible computers or notebooks is MS Windows with WSL.

Most popular Unix-like OS on MacBooks is MacOS.


Android is not a Unix-like OS, other than having a kernel that was originally devised as a Unix clone. Beyond that, I’m not sure what your point is. Is it just that “most popular” is ill-defined?

To bring us back to the context of this post: I am quite willing to bet that “grep” and “cat” are used by humans more times per day on macOS than on any other OS.


Linux is a kernel, which is irrelevant; I've run the GNU tools on many different systems, including MS-DOS, over the years. POSIX now defines UNIX anyway.


Yep. And the vast majority of these tools exist on systems that share virtually zero heritage with Linux or GNU. (Like macOS).

Oh well, I guess a lot of people just think all Unix-like systems are called “Linux” now. Perhaps it’s become like the word “Kleenex”.


Speaking of GNU, there's Datamash (https://www.gnu.org/software/datamash/) if you like doing "data science" in the shell.
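A couple of hedged examples of what that looks like (the CSV layout is assumed: column 1 a key, column 2 a number):

    # quick aggregates over a stream of numbers
    seq 10 | datamash sum 1 mean 1

    # group-by on a CSV: sum column 2 per value of column 1
    datamash -t, -s -g 1 sum 2 < sales.csv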


To be fair, in many cases (such as grep), the GNU commands have additional features and are more intuitive to use than the standard POSIX implementations.


Very nice video, and I like the way you combine it with text and examples! :-) Looking forward to reading the other articles on your page as well!


Very useful article. Learned a couple of new things here.

While reading it, the thought "I know most of this, so would that make me a data scientist?" jumped out at me.

But then I quickly recovered from that thought: surely knowing some of the tools someone could use in a certain domain does not make you an expert at that domain.

Might just be the case of same ingredients, different recipes.


This is a little more awk-ish:

    awk -F, '$2 == "F" {$0=(($1-32)*5/9)",C"} {print}'
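A quick check with made-up input:

    printf '98.6,F\n20,C\n' | awk -F, '$2 == "F" {$0=(($1-32)*5/9)",C"} {print}'
    # 37,C
    # 20,C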


I love awk too, but most people don't know much awk. Better to use the regular tools and keep awk for when you absolutely need it.


This is still useful information for data scientists who end up on Linux.



I wouldn't use awk for simple things such as

  cat sales.csv | awk -F',' '{print $1}'
but I'd prefer

  cut -d, -f1 sales.csv


Useless use of cat detected!

Remember, in nearly all cases where you have:

  cat file | some_command and its args ...
you can rewrite it as:

  <file some_command and its args ...
and in some cases, such as this one, you can move the filename to the arglist as in:

  some_command and its args ... file
— Randal L. Schwartz (http://porkmail.org/era/unix/award.html#cat)


hah, I knew someone would point that out (which is why I talked about it in the article).

I actually prefer useless cat because when you're prototyping a pipeline it's very awkward to use non-useless cat. You'll probably start off with something like this to observe the content of the file:

    cat something.txt
Using this doesn't work in bash:

    <something.txt
Then, continuing with useless cat to build on it you do

    cat something.txt | grep stuff
Which you can build up easily by pressing 'up' in your terminal and appending. But if you use non-useless cat you have to re-type the entire thing or move the cursor around:

    grep stuff < something.txt
With useless cat, you can keep adding things and check the result:

    cat something.txt | grep stuff | sed 's/"//g'
Or if you need to insert another filter before the last stage like this, you can just press "up" and insert it:

    cat something.txt | grep -v negmatch | grep stuff
I don't think there is any easily-typed equivalent workflow with non-useless cat.


If you’re working with a lot of data you probably want to pipe it into head anyway, initially, so

<file head -n50 | whatever

Can be the starting command. When you no longer need the head there, just get rid of “head |”.

Although I agree that the pointing out of “useless cat” is usually not particularly useful or constructive.


Using head when there's lots of data makes sense, but I really don't see any advantage to avoiding useless cat. Useless cat is way faster to type and make additions to. I sort of get the feeling that 'useless cat' is really just a fun copypasta, kinda like when people post "I'd just like to interject for a moment. What you're referring to as Linux, is in fact, GNU/Linux, or as I've recently taken to calling it..."


It's still good to be aware of `useless cat`, to save some CPU and I/O when converting one-liners into scripts.


Rather than using head and worrying about the size of the file, it is easier to simply use "useless" cat then ctrl-c the stream of data that comes out.


What makes it useless? It’s functionally the same but makes it easier to author pipelines, which I think is a valid use.


  cat foo | bar
is useless use of cat, since it’s equivalent to

  < foo bar
which is both shorter and starts one less process. Why do you think that “cat” makes it “easier to author pipelines”?


1). It doesn’t work in all shells, so you have to think about your environment before you use it.

2). I often start with `bar < foo`, so if I need to add more arguments to `bar`, I need to always skip over the input.

3). If I just want to delete all processing and look at the input, I can’t just backspace away the processing because `< foo` is invalid.


Good intro to data processing.

tsort and comm were news to me.
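In case tsort is new to others too: it topologically sorts "this must come before that" pairs. A tiny made-up example:

    # each line reads "A B", meaning A has to come before B
    printf 'libc app\ncompiler libc\n' | tsort
    # compiler
    # libc
    # app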


Can somebody explain the advantage of doing it on the command line vs in Python or R? What would a practical use case look like?


The most significant use case for all things command-line IMHO is automation. Also, I would change that from "command line vs in Python or R" to "command line and Python or R". Build a pipeline like I've discussed in the article, then pipe it into Python or R.
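A sketch of that hand-off, assuming column 2 of sales.csv is numeric and there is no header row:

    # the shell does the slicing, Python does the statistics
    cut -d, -f2 sales.csv | python3 -c 'import sys, statistics; print(statistics.mean(float(x) for x in sys.stdin))'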


> Build a pipeline like I've discussed in the article, then pipe it into Python or R.

Why not just do it all in Python or R? That way you also get something that will probably work on non-unix platforms.


Over the years I've found that I usually fall into a pattern of starting with low-fidelity automation in languages like shell and slowly re-writing it over time in higher-level languages, usually Python first, then Java. This way, unimportant tasks can be automated in less than 5 min with one of these shell commands. If it breaks or has errors, no big deal. Python works well for figuring out the structure of the solution as an actual program, and then finally a language with static type checking when it really needs to run without errors.


I go to Python first because it's nice to be able to single-step through the script with a debugger and monitor exactly what's happening. I also know Python a lot better than shell script so it saves me a lot of time as well.


The advantage is that it's faster to prototype/write on the command line and usually ends up being less verbose (although potentially harder to read). It's easy to see what your data is doing as you work with it and incrementally add pipes to new commands.

I like to use command line tools for one-off tasks that I'm unlikely to repeat. If there's a task I know I'll need to repeat, or that is too cumbersome to do in a couple of lines, I'll reach for Python.


Please see my comment to this thread (which links to one of my past comments): https://news.ycombinator.com/item?id=21614511


Hi, (I wrote the article). A few people commented noting that I included "Data Science" in the title, but the content doesn't include any statistics or machine learning, which is closer to the core definition of 'data science'. I still think the title is appropriate since any kind of low-fidelity data science task you do on some ad-hoc data (log files, heaps of text, web pages) is going to start with setting up a processing pipeline that involves these commands. I could have renamed it "An intro to text processing" or "An intro to data processing", but then the people who need to see this content wouldn't associate the title with something they're interested in, so they would never benefit from it. The list of commands was chosen specifically with the question "What Linux commands would someone answering data science/business intelligence questions use?" in mind. These commands are also among the ones that are usually already installed on every system.


Interesting.

For anyone who is interested in going a little deeper into data science, I’d also recommend the “Introduction to Data Science with R” series by David Langer:

https://youtu.be/32o0DnuRjfg


yes, this article describes what we used to call, in the early 2000s, "using Linux".


You forgot an important one: man


Much can be done just with awk.

My pet peeve is the "grep | awk" idiom. No, just use awk.

Awk does map/reduce, relational joins, associative memory, table lookup, and so on. Just use awk arrays, the BEGIN block, and the END block.
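For instance, the pattern can live inside the awk program, and an associative array covers a simple group-by. A hedged sketch against the article's sales.csv (the pattern and column layout are made up):

    # filter and project in one awk process instead of grep | awk
    awk -F, '/widget/ {print $2}' sales.csv

    # group-by with an associative array: sum column 2 per value of column 1
    awk -F, '{sum[$1] += $2} END {for (k in sum) print k, sum[k]}' sales.csv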


If I am going to maintain the script myself and never tweak anything in the middle of the night, sure.

But most people don't know awk. And awk requires more awareness. I break my awk when I fix things when tired.


Now that you mention it, 'data processing' seems more neutral and accurate, so I've put that in the title above.


Ugly UUOC (Useless Use Of Cat). Damn people, please, I appreciate your will to share, but share good content and stop spreading bad shell patterns...


The person shared their knowledge in the best way they know. It might inspire some to go and take a look and figure even better ways to do the task.


Useless use of cat is almost always more readable and better for explanation. It shows the direction of a pipe unambiguously and splits out commands from files at a quick glance.


UUOC is a pedantic Ford-vs-Chevy argument.


Yes, it made the whole article useless.


I assume you're just joking around, but to you and the parent comment, I'd be happy to hear any good arguments for avoiding 'useless' cat. Note that I did mention 'useless cat' in the article, and there is already a comment thread here that contains my opinions on it.


So why should we repeat it?



