Yeah this article is about processing text data and not any form of statistics, modeling, etc. I'm guessing they added "data science" because it's in vogue? In any case, the provided title does not reflect the article.
Regarding the multiple mentions of UUOC in this thread:
- The original award started in 1995. Even though the Pentium was already out, I think it's safe to say that was the era of 486 PCs. In 2019, for day-to-day shell work (meaning no GBs of file processing or anything like that), isn't invoking UUOC and pointing out inefficiencies an example of premature optimization [1]?
- Isn't readability subjective? For some folks 'cat file' is more readable than '<file' or a direct use of a processing command (like grep, tail, or head) [2]. (The whole Stack Overflow page is fairly illuminating [3].)
Not really where the author is heading, but I like to configure a backend for matplotlib to render graphics in a terminal, so when I'm SSHed into a remote system I can get inline plots.
If you want to simplify things, don't employ "useless use of cat". Pass the file as a command argument or redirect input. And sort has options, so the third/fourth commands can be
sort -u -t, sales.csv
However, those fail with quoted commas.
Also, head -3 is obsolete, non-POSIX syntax; the standard form is head -n 3.
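A self-contained sketch of both points, using a throwaway file (the sales.csv contents here are invented for illustration):

```shell
# Make a small throwaway sample file (data invented for illustration)
cat > /tmp/sales.csv <<'EOF'
2019-01-01,widget,10
2019-01-01,widget,10
2019-01-02,gadget,5
EOF

# sort -u deduplicates in one step, no separate uniq needed;
# -t, sets the field separator (useful if you also sort by column with -k)
sort -u -t, /tmp/sales.csv

# POSIX-compliant head: spell the count as -n 3, not the obsolete -3
head -n 3 /tmp/sales.csv
```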
Edit: I don't know why I didn't see other UUOC references initially.
I like cut and tr too, but I try to replace them with sed and awk when I can. It reduces the number of moving parts and lets you increase the complexity slowly.
Ex: | sed -e step1 becomes | sed -e step1 -e step2 instead of adding another pipe and another "moving part" like tr
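For instance (step1/step2 above are placeholders; here's a concrete pair of steps, where sed's y command replaces what tr would do):

```shell
# Version with an extra moving part: sed for the substitution,
# then a second process (tr) for the character translation
printf 'foo,bar\n' | sed -e 's/foo/baz/' | tr ',' ';'

# Same result by growing the sed invocation instead:
# the y command does the character translation in the same process
printf 'foo,bar\n' | sed -e 's/foo/baz/' -e 'y/,/;/'
```

Both print `baz;bar`, but the second form adds complexity to an existing stage rather than adding a new stage.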
When I was at a genetics lab, I was helping some researchers and spent 3 days writing a Perl script, which kept failing. I sent an email to one of the guys who wrote the paper the research was based on, and he said: why not try awk like this? With a little work, I turned 3 days of Perl into a one-line awk script that was faster than anything else for the job at the time. That was an inspirational moment for me about the fundamental power of the Unix philosophy and the core utilities in Linux.
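The actual one-liner isn't given, so the following is only a made-up illustration of the flavor of that kind of replacement: aggregating a count per gene name in a single streaming pass (the gene names and counts are invented):

```shell
# Hypothetical input: tab-separated (gene, count) pairs.
# awk sums the counts per gene in one pass; sort makes the output stable.
printf 'geneA\t3\ngeneB\t5\ngeneA\t2\n' |
  awk -F'\t' '{sum[$1] += $2} END {for (g in sum) print g, sum[g]}' |
  sort
```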
This is a great list and well-written. As a data professional, I use these commands all the time and my job would be much harder without them. I also learned a few new things here (`tee` and `comm`).
I was lucky that my first job was as a support engineer at a data-centric tech company, which is where I learned these. I've often thought about how to teach them to data analysts coming from a non-engineering background. This is comprehensive but clear and would be a perfect resource for training someone like that. Thank you!
P.S.: Not essential, but it really becomes a joy when, as a touch typist, I have turned on vi mode in the shell (e.g., with 'set -o vi'). My fingers never have to leave the home row while I do my shell piping work from start to finish. (no mouse, no arrow keys, etc.)
That was my first thought skimming through this too. Either every *nix admin who is aware of a few text processing tools is a data scientist, or “data scientists” are just as full of it as I’ve expected.
Just because a tool can be used for A, B or C, and you are an expert at using that tool for A does not imply that your expertise at using the tool for A makes you an expert in B and C.
The whole point of this article is that a lot of common Linux tools can be used for data-science-like work (a significant part of which is preprocessing structured and unstructured text).
I find it more irritating when people try to score greybeard points by saying *nix (or Unix) when it's obvious that they're talking about a Linux-only mechanism and quite possibly haven't ever used Unix (or a direct derivative).
Also, the most popular Unix-like OS (far more than Linux) is macOS, basically the least “leet greybeard affectation” thing I can imagine. Your irritation is way off base.
Perhaps nothing? I was responding to the complaint in general terms.
> Your irritation is way off base.
Please allow me to feel irritated when people refer to obvious Linux things as something that's supposedly got something to do with Unix. It happens often enough.
Most popular Unix-like OS on consumer devices is Linux (Android).
Most popular Unix-like OS on servers is Linux.
Most popular Unix-like OS in embedded is Linux.
Most popular Unix-like OS on supercomputers is Linux.
Most popular Unix-like OS on IBM PC compatible computers or notebooks is MS Windows with WSL.
Android is not a Unix-like OS, other than having a kernel that was originally devised as a Unix clone. Beyond that, I’m not sure what your point is. Is it just that “most popular” is ill-defined?
To bring us back to the context of this post: I am quite willing to bet that “grep” and “cat” are used by humans more times per day on macOS than on any other OS.
Linux is a kernel, which is irrelevant; I've run the GNU tools on many different systems, including MS-DOS, over the years. POSIX now defines UNIX anyway.
To be fair, in many cases (such as grep), the GNU commands have additional features and are more intuitive to use than the standard POSIX implementations.
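One small example of the kind of convenience meant here: GNU grep's -o flag (print only the matched part of each line, one match per line) is not in POSIX but is very handy for extraction, and is also present in modern BSD greps:

```shell
# -o prints each match on its own line instead of the whole line;
# a GNU extension, not part of the POSIX grep option set
printf 'id=42 name=x id=7\n' | grep -o 'id=[0-9]*'
```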
Very useful article. Learned a couple of new things here.
While reading, the thought jumped out at me: I know most of this; would that make me a data scientist?
But then I quickly recovered: surely knowing some of the tools someone could use in a certain domain does not make you an expert in that domain.
Might just be the case of same ingredients, different recipes.
hah, I knew someone would point that out (which is why I talked about it in the article).
I actually prefer useless cat because when you're prototyping a pipeline it's very awkward to use non-useless cat. You'll probably start off with something like this to observe the content of the file:
cat something.txt
Using this doesn't work in bash:
<something.txt
Then, continuing with useless cat to build on it you do
cat something.txt | grep stuff
Which you can recall easily by pressing 'up' in your terminal. But if you use non-useless cat you have to re-type the entire thing or move the cursor around:
grep stuff < something.txt
With useless cat, you can keep adding things and check the result:
cat something.txt | grep stuff | sed 's/"//g'
Or if you need to insert another filter before the last stage like this, you can just press "up" and insert it:
cat something.txt | grep -v negmatch | grep stuff
I don't think there is any easily-typed equivalent workflow with non-useless cat.
Using head when there's lots of data makes sense, but I really don't see any advantage to avoiding useless cat. Useless cat is way faster to type and to make additions to. I sort of get the feeling that 'useless cat' is really just a fun copypasta, kind of like when people post "I'd just like to interject for a moment. What you're referring to as Linux, is in fact, GNU/Linux, or as I've recently taken to calling it..."
Rather than using head and worrying about the size of the file, it is easier to simply use "useless" cat and then ctrl-c the stream of data that comes out.
The most significant use case for all things command-line, IMHO, is automation. Also, I would change that from "command line vs Python or R" to "command line and Python or R". Build a pipeline like the ones I've discussed in the article, then pipe it into Python or R.
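A toy sketch of that hand-off, where the shell does the cheap generation/filtering and Python does the arithmetic (the numbers are invented):

```shell
# Shell side produces a stream of numbers; grep -v '^$' drops blank lines;
# a Python one-liner at the end of the pipe computes the mean
printf '1\n2\n3\n4\n' | grep -v '^$' |
  python3 -c 'import sys; xs = [float(l) for l in sys.stdin]; print(sum(xs)/len(xs))'
```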
Over the years I've found that I usually fall into a pattern of starting with low-fidelity automation in languages like shell and slowly rewriting it over time in higher-level languages, usually Python first, then Java. This way, unimportant tasks can be automated in less than 5 minutes with one of these shell commands. If it breaks or has errors, no big deal. Python works well for figuring out the structure of the solution as an actual program, and then finally a language with static type checking when it really needs to run without errors.
I go to Python first because it's nice to be able to single-step through the script with a debugger and monitor exactly what's happening. I also know Python a lot better than shell script so it saves me a lot of time as well.
The advantage is that it's faster to prototype/write on the command line, and it usually ends up being less verbose (although potentially harder to read). It's easy to see what your data is doing as you work with it and incrementally add pipes to new commands.
I like to use command-line tools for one-off tasks that I'm unlikely to repeat. If a task is one I know I'll need to repeat, or is too cumbersome to do in a couple of lines, I'll reach for Python.
Hi (I wrote the article). A few people commented noting that I included "Data Science" in the title, but the content doesn't include any statistics or machine learning, which is closer to the core definition of 'data science'. I still think the title is appropriate, since any kind of low-fidelity data science task you do on some ad-hoc data (log files, heaps of text, web pages) is going to start with setting up a processing pipeline that involves these commands. I could have renamed it "An intro to text processing" or "An intro to data processing", but then the people who need to see this content wouldn't associate the title with something they're interested in, so they'd never benefit from it. The list of commands was chosen specifically with the question "What Linux commands would someone answering data science/business intelligence questions use?" in mind. These commands are also among those that are usually already installed on every system.
For anyone who is interested in going a little deeper into data science, I’d also recommend the “Introduction to Data Science with R” series by David Langer.
Ugly UUOC (Useless Use Of Cat). Damn, people. Please, I appreciate your will to share, but share good content and stop spreading bad shell patterns...
Useless use of cat is almost always more readable and better for explanation. It shows the direction of a pipe unambiguously and splits out commands from files at a quick glance.
I assume you're just joking around, but to you and the parent comment: I'd be happy to hear any good arguments for avoiding 'useless' cat. Note that I did mention 'useless cat' in the article, and there is already a comment thread here that contains my opinions on it.