An introduction to data processing on the Linux command line (robertelder.org)
203 points by robertelder on Nov 23, 2019 | 71 comments



If you're interested in this space, a great resource can be found at https://www.datascienceatthecommandline.com/ (a free guide that goes along with an O'Reilly book).



Command-line tools are powerful beasts (e.g. awk) and they have always been central to data preprocessing. But do we need to call it data science now?


Yeah this article is about processing text data and not any form of statistics, modeling, etc. I'm guessing they added "data science" because it's in vogue? In any case, the provided title does not reflect the article.


NLU. It relates to extracting intelligence from human language, most of which comes in the form of text.


Regarding more than one mention of UUOC in this thread:

- The original award started in 1995. Even though the Pentium was already out, I think it is safe to say that was the era of 486 PCs. In 2019, for day-to-day shell work (meaning no GBs of file-processing or anything like that), isn't invoking UUOC and pointing out inefficiencies an example of premature optimization [1]?

- Isn't readability a matter of subjectivity, and for some folks isn't 'cat file' more readable than '<file' or a direct use of a processing command (like grep, tail, head, etc.) [2]? (The whole Stack Overflow page is fairly illuminating [3].)

[1] http://wiki.c2.com/?PrematureOptimization

[2] https://chat.stackoverflow.com/rooms/182573/discussion-on-an...

[3] https://stackoverflow.com/questions/11710552


Not really where the author is heading, but I like to configure a backend for matplotlib to render graphics in a terminal, so that when I am SSHed into a remote system I can get inline plots.


Better solution: sixel-gnuplot

Shameless plug: https://github.com/csdvrx/sixel-gnuplot
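For the curious, plain gnuplot can also emit sixels on its own. A minimal sketch, assuming a gnuplot built with the sixelgd terminal and a terminal emulator that understands sixels:

    # plot straight into the terminal over SSH, no X forwarding needed
    gnuplot -e "set terminal sixelgd size 800,480; plot sin(x) title 'sin(x)'"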


Thanks, I will try that.


If you like it, share your terminal configuration!

mlterm works.

mintty had a regression, 3.1.0 may have fixed that



Here are some ways you could simplify some of the tasks in the article, saving on typing:

    cat data.csv | sed 's/"//g'
can be simplified by doing this instead:

    cat data.csv | tr -d '"'

This awk command:

    cat sales.csv | awk -F',' '{print $1}' | sort | uniq
Can be replaced with a simpler (IMO) cut instead:

    cat sales.csv | cut -d , -f 1 | sort | uniq

When using head or tail like this:

    head -n 3
You don't need the -n:

    head -3

Also shout out to jq, xsv, and zsh (extended glob), all nice complements to the typical command line utils.
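A couple of hedged one-liners for those (the JSON file name is made up; sales.csv is the article's example file):

    # jq: pull one field out of an array of JSON objects
    jq -r '.[].name' data.json

    # xsv: quote-aware column selection, then a frequency table
    xsv select 1 sales.csv | xsv frequency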


If you want to simplify things, don't employ "useless use of cat". Pass the file as a command arg or redirect input. And sort has options, so the third/fourth commands can be

sort -u -t, sales.csv

However, those fail with quoted commas.

Also, head -3 is non-POSIX obsolete syntax.

Edit: I don't know why I didn't see other UUOC references initially.


I like cut and tr too, but I try to replace them with sed and awk when I can. It reduces the number of moving parts and lets you increase the complexity slowly.

Ex: | sed -e step1 becomes | sed -e step1 -e step2 instead of adding another pipe and another "moving part" like tr
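A sketch of what that looks like, reusing the article's quote-stripping step (the second expression is made up, and \t in the replacement is a GNU sed extension):

    # one sed process, two expressions: strip quotes, then turn commas into tabs
    sed -e 's/"//g' -e 's/,/\t/g' data.csv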


The awk command

  cat sales.csv | awk -F',' '{print $1}' | sort | uniq
can even be further simplified to

  cut -d, -f1 sales.csv | sort -u


When I was at a genetics lab, I was helping some researchers on something and spent 3 days writing a perl script, which kept failing. I sent an email to one of the guys who wrote the paper the research was being based on, and he said, why not try awk like this? With a little work, I turned 3 days of perl into a 1 line awk that was faster than anything else for the job at the time. That was an inspirational moment for the fundamental power of the unix philosophy and the core utilities in linux for me.

Good introductory article here!


This is a great list and well-written. As a data professional, I use these commands all the time and my job would be much harder without them. I also learned a few new things here (`tee` and `comm`).
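For anyone else meeting those two for the first time, a couple of hedged sketches (the output file names are made up):

    # tee: save an intermediate result while the pipeline keeps going
    cut -d, -f1 sales.csv | sort | tee names_sorted.txt | uniq -c

    # comm: compare two sorted files; -12 keeps only lines common to both
    comm -12 names_sorted.txt other_sorted.txt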

I was lucky that my first job was as a support engineer at a data-centric tech company, which is where I learned these. I've often thought about how to teach them to data analysts coming from a non-engineering background. This is comprehensive but clear and would be a perfect resource for training someone like that. Thank you!


I'll just leave one of my past comments [1] here.

[1] https://news.ycombinator.com/item?id=17324222

P.S.: Not essential, but it really becomes a joy when, as a touch typist, I have turned on vi mode in the shell (e.g., with 'set -o vi'). My fingers never have to leave the home row while I do my shell piping work from start to finish. (no mouse, no arrow keys, etc.)
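If you want vi keys everywhere readline is used, not just in bash, something like this should work:

    # in ~/.bashrc
    set -o vi

    # in ~/.inputrc, picked up by anything built on readline
    set editing-mode vi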


Haha. That’s me. Once you go ‘set -o vi’, you can’t go back


Huh, so it turns out that I've been a 'data scientist' for over 20 years. Who knew?


That was my first thought skimming through this too. Either every *nix admin who is aware of a few text processing tools is a data scientist, or "data scientists" are just as full of it as I'd expected.


Just because a tool can be used for A, B, or C, being an expert at using that tool for A does not make you an expert in B and C.

The whole point of this article is to point out that a lot of common Linux tools can be used for data-science-like work (a significant part of which is preprocessing structured and unstructured text).


Why do people use Linux in place of *nix?

Even worse, most of the tools (cat, grep, awk) are Unix commands, redeveloped by the GNU project in most GNU/Linux distros.


> Why do people use Linux in place of *nix?

I find it more irritating when people try to score greybeard points by saying *nix (or Unix) when it's obvious that they're talking about a Linux-only mechanism and quite possibly haven't ever used Unix (or a direct derivative).


Huh? What is specific to Linux in this post?

Also, the most popular Unix-like OS (far more than Linux) is macOS, basically the least “leet greybeard affectation” thing I can imagine. Your irritation is way off base.


> Huh? What is specific to Linux in this post?

Perhaps nothing? I was responding to the complaint in general terms.

> Your irritation is way off base.

Please allow me to feel irritated when people refer to obvious Linux things as something that's supposedly got something to do with Unix. It happens often enough.


Fair enough; I suppose it’d be annoying for people to talk about “the Unix concept of cgroup namespaces” or something like that.

I had thought that you were directly responding to the original poster.


Most popular Unix-like OS on consumer devices is Linux (Android). Most popular Unix-like OS on servers is Linux. Most popular Unix-like OS in embedded is Linux. Most popular Unix-like OS on supercomputers is Linux. Most popular Unix-like OS on IBM PC compatible computers or notebooks is MS Windows with WSL.

Most popular Unix-like OS on MacBooks is MacOS.


Android is not a Unix-like OS, other than having a kernel that was originally devised as a Unix clone. Beyond that, I’m not sure what your point is. Is it just that “most popular” is ill-defined?

To bring us back to the context of this post: I am quite willing to bet that “grep” and “cat” are used by humans more times per day on macOS than on any other OS.


Linux is a kernel, which is irrelevant; I've run the GNU tools on many different systems, including MS-DOS, over the years. POSIX now defines UNIX anyway.


Yep. And the vast majority of these tools exist on systems that share virtually zero heritage with Linux or GNU. (Like macOS).

Oh well, I guess a lot of people just think all Unix-like systems are called “Linux” now. Perhaps it’s become like the word “Kleenex”.


Speaking of GNU, there's Datamash (https://www.gnu.org/software/datamash/) if you like doing "data science" in the shell.
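A couple of hedged examples of what that looks like (the CSV layout is assumed: column 1 a key, column 2 a number):

    # quick aggregates over a stream of numbers
    seq 10 | datamash sum 1 mean 1

    # group-by on a CSV: sum column 2 per value of column 1
    datamash -t, -s -g 1 sum 2 < sales.csv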


To be fair, in many cases (such as grep), the GNU commands have additional features and are more intuitive to use than the standard POSIX implementations.


Very nice video, and I like the way you combine it with text and examples! :-) Looking forward to reading the other articles on your page as well!


Very useful article. Learned a couple of new things here.

While reading it, the thought "I know most of this, so would that make me a data scientist?" jumped out at me.

But then I quickly recovered from that thought: surely knowing some of the tools someone could use in a certain domain does not make you an expert at that domain.

Might just be the case of same ingredients, different recipes.


This is a little more awk-ish:

    awk -F, '$2 == "F" {$0=(($1-32)*5/9)",C"} {print}'
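A quick check with made-up input:

    printf '98.6,F\n20,C\n' | awk -F, '$2 == "F" {$0=(($1-32)*5/9)",C"} {print}'
    # 37,C
    # 20,C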


I love awk too, but most people don't know much awk. Better to use the regular tools and keep awk for when you absolutely need it.


This is still useful information for data scientists who end up on Linux.



I wouldn't use awk for simple things such as

  cat sales.csv | awk -F',' '{print $1}'
but I'd prefer

  cut -d, -f1 sales.csv


Useless use of cat detected!

Remember, in nearly all cases where you have:

  cat file | some_command and its args ...
you can rewrite it as:

  <file some_command and its args ...
and in some cases, such as this one, you can move the filename to the arglist as in:

  some_command and its args ... file
— Randal L. Schwartz (http://porkmail.org/era/unix/award.html#cat)


hah, I knew someone would point that out (which is why I talked about it in the article).

I actually prefer useless cat because when you're prototyping a pipeline it's very awkward to use non-useless cat. You'll probably start off with something like this to observe the content of the file:

    cat something.txt
Using this doesn't work in bash:

    <something.txt
Then, continuing with useless cat to build on it you do

    cat something.txt | grep stuff
Which you can build up easily by pressing 'up' in your terminal and appending. But if you use non-useless cat you have to re-type the entire thing or move the cursor around:

    grep stuff < something.txt
With useless cat, you can keep adding things and check the result:

    cat something.txt | grep stuff | sed 's/"//g'
Or if you need to insert another filter before the last stage like this, you can just press "up" and insert it:

    cat something.txt | grep -v negmatch | grep stuff
I don't think there is any easily-typed equivalent workflow with non-useless cat.


If you’re working with a lot of data you probably want to pipe it into head anyway, initially, so

<file head -n50 | whatever

Can be the starting command. When you no longer need the head there, just get rid of “head |”.

Although I agree that the pointing out of “useless cat” is usually not particularly useful or constructive.


Using head when there's lots of data makes sense, but I really don't see any advantage to avoiding useless cat. Useless cat is way faster to type and make additions to. I sort of get the feeling that 'useless cat' is really just a fun copypasta, kinda like when people post "I'd just like to interject for a moment. What you're referring to as Linux, is in fact, GNU/Linux, or as I've recently taken to calling it..."


It's still good to be aware of `useless cat`, to save some CPU and I/O when converting one-liners into scripts.


Rather than using head and worrying about the size of the file, it is easier to simply use "useless" cat then ctrl-c the stream of data that comes out.


What makes it useless? It’s functionally the same but makes it easier to author pipelines, which I think is a valid use.


  cat foo | bar
is useless use of cat, since it’s equivalent to

  < foo bar
which is both shorter and starts one less process. Why do you think that “cat” makes it “easier to author pipelines”?


1). It doesn’t work in all shells, so you have to think about your environment before you use it.

2). I often start with `bar < foo`, so if I need to add more arguments to `bar`, I need to always skip over the input.

3). If I just want to delete all processing and look at the input, I can’t just backspace away the processing because `< foo` is invalid.


Good intro to data processing.

tsort and comm were news to me.
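In case tsort is new to others too: it topologically sorts "this must come before that" pairs. A tiny made-up example:

    # each line reads "A B", meaning A has to come before B
    printf 'libc app\ncompiler libc\n' | tsort
    # compiler
    # libc
    # app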


Can somebody explain the advantage of doing it on the command line vs in Python or R? What would a practical use case look like?


The most significant use case for all things command-line IMHO is automation. Also, I would change that from "command line vs in Python or R" to "command line and Python or R". Build a pipeline like I've discussed in the article, then pipe it into Python or R.
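A sketch of that hand-off, assuming column 2 of sales.csv is numeric and there is no header row:

    # the shell does the slicing, Python does the statistics
    cut -d, -f2 sales.csv | python3 -c 'import sys, statistics; print(statistics.mean(float(x) for x in sys.stdin))'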


> Build a pipeline like I've discussed in the article, then pipe it into Python or R.

Why not just do it all in Python or R? That way you also get something that will probably work on non-unix platforms.


Over the years I've found that I usually fall into a pattern of starting with low-fidelity automation in languages like shell and slowly re-writing it over time in higher-level languages, usually Python first, then Java. This way, unimportant tasks can be automated in less than 5 min with one of these shell commands. If it breaks or has errors, no big deal. Python works well for figuring out the structure of the solution as an actual program, and then finally a language with static type checking when it really needs to run without errors.


I go to Python first because it's nice to be able to single-step through the script with a debugger and monitor exactly what's happening. I also know Python a lot better than shell script so it saves me a lot of time as well.


The advantage is that it's faster to prototype/write on the command line and usually ends up being less verbose (although potentially harder to read). It's easy to see what your data is doing as you work with it and incrementally add pipes to new commands.

I like to use command line tools for one-off tasks that I'm unlikely to repeat. If there's a task I know I'll need to repeat, or that is too cumbersome to do in a couple of lines, I'll reach for Python.


Please see my comment to this thread (which links to one of my past comments): https://news.ycombinator.com/item?id=21614511


Hi, (I wrote the article). A few people commented noting that I included "Data Science" in the title, but the content doesn't include any statistics or machine learning, which is closer to the core definition of 'data science'. I still think the title is appropriate since any kind of low-fidelity data science task you do on some ad-hoc data (log files, heaps of text, web pages) is going to start with setting up a processing pipeline that involves these commands. I could have renamed it "An intro to text processing" or "An intro to data processing", but then the people who need to see this content wouldn't associate the title with something they're interested in, so they would never benefit from it. The list of commands was chosen specifically with the question "What Linux commands would someone answering data science/business intelligence questions use?" in mind. These commands are also among the ones that are usually already installed on every system.


Interesting.

For anyone who is interested in going a little deeper into data science, I’d also recommend the “Introduction to Data Science with R” series by David Langer:

https://youtu.be/32o0DnuRjfg


yes, this article describes what we used to call, in the early 2000s, "using Linux".


You forgot an important one: man


Much can be done just with awk.

My pet peeve is the "grep | awk" idiom. No, just use awk.

Awk does map/reduce, relational joins, associative memory, table lookup, and so on. Just use awk arrays, the BEGIN block, and the END block.
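For instance, the pattern can live inside the awk program, and an associative array covers a simple group-by. A hedged sketch against the article's sales.csv (the pattern and column layout are made up):

    # filter and project in one awk process instead of grep | awk
    awk -F, '/widget/ {print $2}' sales.csv

    # group-by with an associative array: sum column 2 per value of column 1
    awk -F, '{sum[$1] += $2} END {for (k in sum) print k, sum[k]}' sales.csv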


If I am going to maintain the script myself and never tweak anything in the middle of the night, sure.

But most people don't know awk. And awk requires more awareness. I break my awk when I fix things when tired.


Now that you mention it, 'data processing' seems more neutral and accurate, so I've put that in the title above.


Ugly UUOC (Useless Use Of Cat). Damn people, please, I appreciate your will to share, but share good content and stop spreading bad shell patterns...


The person shared their knowledge in the best way they know. It might inspire some to go and take a look and figure even better ways to do the task.


Useless use of cat is almost always more readable and better for explanation. It shows the direction of a pipe unambiguously and splits out commands from files at a quick glance.


UUOC is a pedantic Ford-vs-Chevy argument.


Yes, it made the whole article useless.


I assume you're just joking around, but to you and the parent comment, I'd be happy to hear any good arguments for avoiding 'useless' cat. Note that I did mention 'useless cat' in the article, and there is already a comment thread here that contains my opinions on it.


So why should we repeat it?



