Useful Unix commands for exploring data (datavu.blogspot.com)
343 points by aks_c on Aug 27, 2014 | 151 comments



"While dealing with big genetic data sets I often got stuck with limitation of programming languages in terms of reading big files."

Hate to sound like Steve Jobs here, but: "You're using it wrong."

Let me elaborate. If you're running into "too big" or "too long" limitations in your language of choice, then you're just a few searches away from being enlightened both on how to solve the task at hand and on how your language works, both of which will prevent you from being hindered next time around when you have to do a similar big-data job.

Perhaps you are more comfortable using pre-defined lego-blocks to build your logic. Perhaps you understand the unix commands better than you do your chosen language. But understand that programming is the same, just in a different conceptual/knowledge space. And remember, always use the right tool for the job!

(I use Unix commands daily as they're quick/dirty in a jiffy, but for complex tasks I am more productive solving the problem in a language I am comfortable in instead of searching through man pages for obscure flags/functionality)


Yeah, I think calling him the "unenlightened" one is pretty off base here.

For performing the tasks outlined by his examples, Unix utilities are easier for the user and execute faster than code you'd write yourself in a general-purpose programming language, unless one puts in the time to tune the implementation.

One could rebuild AWK in C and get similar performance, but why not just use some extremely simple AWK? And is anybody going to replicate the amazing speed of grep without a huge time investment? [1]

There is a right tool for the job, and having seen dozens of programmers be exposed to new big data sets, I can tell you that the ones who become productive quickly and stay more productive are the ones who adopt whatever tool is best for the job, not the ones that stick to their one favorite programming language. In fact, a good sign of somebody who will quickly fail is someone who says "forget those Unix tools, I'm just going to write this in X".

[1] http://ridiculousfish.com/blog/posts/old-age-and-treachery.h...


This is one area where I wish the Unix philosophy (reuse of tools) was taken a bit further. To me, every command should be callable as a C library function. That way you wouldn't have to parse the human-readable output through a pipe. Not only that, there needs to be both human-readable and machine-readable output for all commands. For example, I would love to be able to call "ps" from another script and easily select specific columns from an XML or JSON output.


PowerShell solves this problem by piping around objects instead of strings. It's pretty neat!


Shell scripts can be pretty powerful if you know what you're doing, but I do agree that sometimes the shell script paradigm can be more of a hurdle than a help.

However, your point about every command being callable as a C library is kind of possible already. Some commands do have native language libraries (e.g. libcurl), but you could also fork out to those ELFs if you're feeling really brave (though in all practicality, it's little worse than writing a shell script to begin with). In fact, there are times I've been known to cheat with Perl and run (for example):

    (my $hostname = `hostname`) =~ s/\n//g;
because it's quicker and easier to throw together than using the proper Perl libraries (yeah, it's pretty nasty from an academic perspective, but the additional footprint is minimal while the development time saved is significant).

Of course, any such code that's used regularly and/or depended on will be cleaned up as and when I have the time.

As for your XML or JSON parsing; the same theory as above could be applied:

    use JSON::Parse 'parse_json';
    my $json = `curl --silent http://birthdays.com/myfriends.json`;
    my $bdays = parse_json($json);
    print "derekp7's birthday is $bdays{derekp7}";
Obviously these aren't best practices, but if it's only running locally (i.e. this isn't part of a CGI (etc.) script that's web accessible) and gets the job done in a hurry, then I can't see why you shouldn't use that for ad hoc reporting.


But one of the unix philosophies is to use plain text. To have everything as a C function means everything needs a new API.


Most scripting languages aren't multithreaded, and some aren't pipeline oriented by default.

For example, working with file lines naively in Ruby means reading the whole lot into a giant array and doing transformations an array at a time, rather than in a streaming fashion.

The shell gives you fairly safe concurrency and streaming for free.

Personally, if it's a complex task, I generally write a tool such that it can be put into a shell pipeline.

Knowing the command line well - so that you don't often have to look up man pages for obscure flags / functionality - has its own rewards, as these commands turn into something you use all the time in the terminal. Rather than spending a few minutes developing a script in an editor, you can incrementally build a pipeline over a few seconds. Doing your script in a REPL is a better approximation, but it's a bit less immediate.


You don't have to read all of a file into memory in Ruby. There are a number of facilities for reading only a portion of a file, readpartial[1] for example. Additionally, you have access to all of the native pipe[2] functionality as well. There are plenty of reasons to favor shell tools over Ruby, but those aren't some of them.

[1]: http://www.ruby-doc.org/core-2.1.2/IO.html#method-i-readpart... [2]: http://www.ruby-doc.org/core-2.1.2/IO.html#method-c-popen


Not the case with Ruby at all; if you're reading the whole file into memory there's a good chance you're doing it wrong.

Check out yield and blocks.


The problem is that the most obvious way of doing it - File.readlines('foo.txt').map { ... }.select { ... } etc. - is not stream-oriented.


Arguably, it's trivial to make that stream-oriented:

    open('tmp.rb').each_line.lazy.map {...}.select {...}
The problem with processing big files in Ruby (in my humble experience) is usually that it's still slow enough that preprocessing with grep & uniq is worthwhile.


    > open('tmp.rb').each_line.lazy
    NoMethodError: undefined method `lazy' for #<Enumerator: #<File:Procfile>:each_line>
Not everybody is using Ruby 2.0.


> if you're reading the whole file into memory theres a good chance you're doing it wrong

GP: "working with file lines _naively_ in Ruby"


Ah, my bad, I read that as "natively" and chalked it up to odd wording.


> always use the right tool for the job

Standard grep is much faster on multi-gigabyte files than anything you can figure out how to do in your pet language. By the time you get close to matching grep, you would have reimplemented most of grep, in half-assed fashion at that.

Your delusion is assuming standard command line tools are simple in function because they have a simple interface that Average Joe can use.


Personally I find silversearcher (ag) faster, and reinventing standard commandline tools with a collection of other tools is often slower.

One liner shell commands often turn complicated quickly.


ag is often faster when you're using it interactively, replacing "grep -r" (in particular in version controlled dirs). It's also faster in the sense that for interactive use it will often DWYM.

But it has too many weird quirks to replace grep for data munging. E.g.:

    $ ag verb fullform_nn.txt >/tmp/verbs
    ERR: Too many matches in fullform_nn.txt. Skipping the rest of this file.
Man ag says there's a --max-count option. Let's try that.

    $ grep -c verb fullform_nn.txt
    206077
    $ ag --max-count 206077 verb fullform_nn.txt >/tmp/verbs
    ERR: Too many matches in fullform_nn.txt. Skipping the rest of this file.
Wtf? (and running those two commands with "time" gave ag user 0m0.770s while grep had user 0m0.057s)


Did you report it as a bug?


did now https://github.com/ggreer/the_silver_searcher/issues/483 (though it took me a while to figure out a more precise issue title than "--max-count is taunting me")


I have never used ag, but in most instances where people thought they made a faster grep, it is because it doesn't handle multibyte encodings correctly.

Have you rerun your tests in the "C" locale?


I would allow that for some definition of "too-big".

I wrote a distributed grep impl a few years back to grep my logs and collect output to a central machine (a vague "how many machines had this error" job).

The central orchestration was easy in python, but implementing

zgrep | awk | sort | uniq -c | wc -l

is way faster with the shell and takes way more code in Python (zgrep is awesome for .gz logs).

On the other hand, the shell coordinator was so much harder using pdsh that I reverted to using paramiko and Python thread pools.

Unix tools are extremely composable and present in nearly every machine with the standard behaviour.


> Hate to sound like Steve Jobs here, but: "You're using it wrong."

I don't quite agree — say this individual needs to sort a file by two columns. Should they really load everything into memory to call Python's sorted()? With large genomics datasets this isn't possible. Trying to reimplement sort's on-disk merge sort would be unnecessary and treacherous.

It's easy to forget how much engineering went into these core utilities, which can be very useful when working with big files.
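
For what it's worth, GNU sort handles the two-column case with an on-disk merge and a bounded memory buffer; a minimal sketch, with the file name and column layout made up:

    # sort by column 1 (text), then column 2 (numeric), on a comma-delimited file
    sort -t',' -k1,1 -k2,2n -S 1G -T /tmp variants.csv > variants.sorted.csv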


It's not hard to write an on-disk merge sort using Python... it just may not be that fast.

But really, as I'm sure you know, for genome-scale datasets, the key word is streaming. Disk IO is a major bottleneck. If you're using a large genomic dataset, you shouldn't be sorting your results in text format anyway... it would take way too much time and temporary disk space. What you'd probably want is a row filter to extract out the rows of interest. Or, you'd be calculating some other kind of non-trivial summary statistics. In both of these cases, you'd need to use some kind of custom program. But you'd still should be operating on the stream, not the entire dataset.

(Off the top of my head I can think of only a few instances where you'd need to operate on a column as opposed to a row in genome data - multiple testing correction being the main one)

If you need to sort by two columns, yes, by all means use "sort". It's about as fast as you are going to get. But for "exploratory analysis" on genomic data, you'd better have a really good reason (or small dataset) to use these tools.


> If you're using a large genomic dataset, you shouldn't be sorting your results in text format anyway... it would take way too much time and temporary disk space. What you'd probably want is a row filter to extract out the rows of interest.

For repeated queries, this isn't efficient. This is why we have indexed, sorted BAM files compressed with BGZF (and tabix, which uses the same ideas). Many queries in genomics are with respect to position, and this is O(1) with a sorted, indexed file and O(n) with streaming. Streaming also involves reading and uncompressing the entire file from disk, whereas accessing entries from an indexed, sorted file involves a seek() to the correct offset and decompressing only that particular block, which is far more efficient. I definitely agree that streaming is great, but sorting data is an essential trick in working with large genomics data sets.
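
As a rough sketch of that indexed workflow (file names and regions made up):

    samtools index sample.sorted.bam                      # build the .bai index once
    samtools view sample.sorted.bam chr1:100000-200000    # seek to the right BGZF blocks, decompress only those
    tabix -p vcf calls.vcf.gz                             # build a .tbi index for a bgzipped text file
    tabix calls.vcf.gz chr2:500000-600000                 # region query without streaming the whole file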


I'm well aware of BAM and BGZF. I've even written a parser or two (yay double encoded binary formats). I really like the BGZF format and think it doesn't get used enough outside of sequencing. It's basically gzip frames with enough information to allow random access within the compressed stream. And tabix is a great way to efficiently index otherwise unindexable text files.

However, these are all binary formats. I specifically said that you shouldn't sort genome files in text format. Because while text is easy to parse, binary is faster for this. You aren't going to use any of the standard unix tools once you've crossed over into binary formats. And so you are stuck using domain specific tools. Stuff like http://ngsutils.org (shameless plug).

I have seen people write tools for parsing SAM files using unix core utils, but they are always orders of magnitude slower than a good BAM-specific tool (in almost any language).


Just to point out, "big genetic datasets" can be hundreds of gigabytes to terabytes, sizes which are not possible to process easily in any language. A highly optimized C program like uniq might be the fastest option short of a distributed system.


When you have to deal with (genetic data) files of a few GB on a daily basis, I don't think using Python, R or databases is a good idea for basic data exploration.

    -rwxr-x--- 1 29594528008 out_chr1comb.dose
    -rwxr-x--- 1 27924241334 out_chr2comb.dose
    -rwxr-x--- 1 25684164559 out_chr3comb.dose
    -rwxr-x--- 1 24665680612 out_chr4comb.dose
    -rwxr-x--- 1 21493584686 out_chr5comb.dose
    -rwxr-x--- 1 23626967979 out_chr6comb.dose
    -rwxr-x--- 1 20856136599 out_chr7comb.dose
    -rwxr-x--- 1 18398180426 out_chr8comb.dose
    -rwxr-x--- 1 15864714472 out_chr9comb.dose


As someone that deals with large datasets on a database + Python daily, I'm not quite sure what you mean. You'll have to explain to me what "not a good idea" or "basic data exploring" means.


Consider that I get 10 files of size 3 GB every week, which I am supposed to filter based on a certain column using a reference index and forward to my colleague. Before filtering I also want to check what the file looks like: column names, first few records, etc.

I can use something like the following to explore a few rows and columns:

    $ awk '{print $1,$3,$5}' file | head -10

And then I can use something like sed with the reference index to filter the file. Since I plan to repeat this with different files, databases would be time-consuming (even if I automate loading every file and querying it). Due to the file size, options like R and Python would be slower than Unix commands. I can also save the set of commands as a script and share/run it whenever I need it.
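
For the filtering step, one common pattern is an awk lookup against the reference index (ids.txt and bigfile.txt are made-up names; this assumes the key is the first column of both files):

    # keep only rows of bigfile.txt whose first column appears in ids.txt
    awk 'NR==FNR {keep[$1]; next} $1 in keep' ids.txt bigfile.txt > filtered.txt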

If there is a better way I would be happy to learn.


I think the gain you're seeing there is because it's quicker for you to do quick, dirty ad hoc work with the shell than it is to write custom python for each file. Which totally makes sense, the work's ad hoc so use an ad hoc tool. Python being slow and grep being a marvel of optimization doesn't really matter, here, compared to the dev time you're saving.


I have been doing Python for the last few years, but went back to Perl for this sort of thing recently. You can start with a one-liner and, if it gets complicated, just turn it into a proper script, alongside the Unix commands mentioned. It's just faster when you don't know what you are dealing with yet.


For this kind of thing, it's easiest to bulk-load the files into SQLite and do your exploration and early analysis in SQL.


Some more tips from someone who does this every day.

1) Be careful with CSV files and UNIX tools - most big CSV files with text fields have some subset of fields that are text quoted and character-escaped. This means that you might have "," in the middle of a string. Anything (like cut or awk) that depends on comma as a delimiter will not handle this situation well.

2) "cut" has shorter, easier to remember syntax than awk for selecting fields from a delimited file.

3) Did you know that you can do a database-style join directly in UNIX with common command line tools? See "join" - it assumes your input files are sorted by the join key (there's a sketch after this list).

4) As others have said - you almost inevitably want to run sort before you run uniq, since uniq only works on adjacent records.

5) sed doesn't get enough love: sed '1d' to delete the first line of a file. Useful for removing those pesky headers that interfere with later steps. Not to mention regex replacing, etc.

6) By the time you're doing most of this, you should probably be using python or R.
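
To make a few of these concrete, here is a hedged sketch tying tips 2, 3 and 5 together (file names and column numbers are made up; $'\t' is bash syntax for a literal tab):

    sed '1d' orders.tsv    | sort -t$'\t' -k1,1 > orders.sorted       # 5) drop the header, sort by the join key
    sed '1d' customers.tsv | sort -t$'\t' -k1,1 > customers.sorted
    join -t$'\t' orders.sorted customers.sorted | cut -f1,3,5 | head  # 3) join on column 1, then 2) pick columns with cut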


Actually I would say Perl is more appropriate. I went back to Perl after 4 years for this sort of task, as it has so many features built into the syntax. Plus it can be run as a one liner.


I'm reminded of the old joke, "python is executable pseudocode, while perl is executable line noise."

But seriously, I've got some battle scars from the perl days, and hope not to revisit them. Honestly, there's very little I find I can do with perl and not python, and it's just as easy to express (if not quite as concise) and much simpler to maintain.

But, use the tool that works for you!


I use Python and Django most of the time, and it's true, you can do pretty much the same thing in each language. But for quick hacky stuff manipulating the filesystem a lot, Perl has many more features built into the language. Things like regex syntax, globbing directories, backticks to execute Unix commands, and the fact you can use it directly from the command line as a one-liner. You can do all these (except the last one?) in Python, but Perl is quicker.


>But for quick hacky stuff manipulating the filesystem a lot, Perl has many more features built into the language. Things like regex syntax, globbing directories, backticks to execute Unix commands

All good points.

>you can use it directly from the command line as a one liner. You can do all these (except the last one?) in Python

You can use Python from the command line too, but Perl has more features for doing that, like the -n and -p flags. Then again, Python has the fileinput module. Here's an example:

http://jugad2.blogspot.in/2013/05/convert-multiple-text-file...


> you almost inevitably want to run sort before you run uniq

And then you don't actually want uniq anyway since sort has a -u switch that removes duplicate lines.


What if you want uniq -c? Any simple way to replicate that functionality better than... sort | uniq -c?


Then you run uniq -c (which I do all the time).

But for the examples in the main article sort -u would be fine.


>> If we don't want new file we can redirect the output to same file which will overwrite original file

You need to be a little careful with that. If you do:

    uniq -u movies.csv > movies.csv
The shell will first open movies.csv for writing (the redirect part) then launch the uniq command connecting stdout to the now emptied movies.csv.

Of course when uniq opens movies.csv for consumption, it'll already be empty. There will be no work to do.

There's a couple of options to deal with this, but the temporary intermediate file is my preference provided there's sufficient space - it's easily understood; if someone else comes across the construct in your script, they'll grok it.


The utility to do this is called sponge.

http://linux.die.net/man/1/sponge

    uniq -u movies.csv | sponge movies.csv


sponge is cool. But on debian/ubuntu, it's packaged up in moreutils, which includes a few helpful tools. However a programme called parallel is in moreutils, and that's not as powerful as GNU's parallel. So I often end up uninstalling sponge/moreutils. :(


There's some attempt underway to fix that fwiw: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=749355


GNU Parallel is an indispensable heavy lifter on the command line. I was expecting it to show up in the article.
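
A couple of typical one-liners (hosts.txt is a made-up file of hostnames):

    parallel gzip ::: *.log                           # compress every log file, one job per core
    cat hosts.txt | parallel -j0 --tag ssh {} uptime  # fan a command out over many hosts, prefixing output with the host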


moreutils, for my usage, contains 'chronic', which, prepended to a command, stops cron from alerting on any non-error output. Big fan.


Thanks for throwing this out there. Never heard of this command before! Definitely a good one for my bag o' tricks.


The classic book "The UNIX Programming Environment" by Kernighan and Pike has a tool in it, called 'overwrite', that does this - letting you safely overwrite a file with the result of a command or pipeline, IIRC.


I came here for that. I learned long ago, the hard way, to never ever use the same file for writing as reading. I was wondering if that rule had changed on me.


Thank you for the inputs; how about this?

uniq -u movies.csv > temp.csv

temp.csv > movie.csv

rm temp.csv


long_running_process > filename.tmp && mv filename.tmp filename

The rename is atomic; anyone opening "filename" will get either the old version, or the new version. (Although it breaks one of my other favorite idioms for monitoring log files, "tail -f filename", because the old inode will never be updated.)


> Although it breaks one of my other favorite idioms for monitoring log files, "tail -f filename", because the old inode will never be updated

You should look into the '-F' option of tail; it follows the filename, and not the inode.


You could write directly to uniqMovie.csv in your example. I would do it like below, but ONLY once I am certain it is exactly what I want. Usually I just make one clearly named result file per operation without touching the original.

uniq -u movies.csv > /tmp/temp.csv && mv /tmp/temp.csv movies.csv


  $ temp.csv > movie.csv
  temp.csv: command not found


He forgot his cat.


You might want to use sort -u


My personal favorite is to use this pattern. You can do some extremely cool counts and group-by operations at the command line [1]:

  grep '01/Jul/1995' NASA_access_log_Jul95 | 
    awk '{print $1}' | 
    sort | 
    uniq -c | 
    sort -h -r | 
    head -n 15
Turns this:

  199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
  unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
  199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
  burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
  199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179
Into this:

    623 piweba3y.prodigy.com
    547 piweba4y.prodigy.com
    536 alyssa.prodigy.com
    463 disarray.demon.co.uk
    456 piweba1y.prodigy.com
    417 www-b6.proxy.aol.com
    350 burger.letters.com
    300 poppy.hensa.ac.uk
    279 www-b5.proxy.aol.com
[1] https://sysadmincasts.com/episodes/28-cli-monday-cat-grep-aw...


I think there's something like that in the Kernighan and Pike book I referred to elsewhere in this thread, and also, that code looks similar to this technique:

http://en.wikipedia.org/wiki/Decorate-sort-undecorate

, i.e. Decorate-Sort-Undecorate (DSU), related to the Schwartzian transform.


Similar ideas in Python with Generators and Co-routines:

http://www.dabeaz.com/generators-uk/


For working with complex CSV files, I highly recommend checking out CSVKit https://csvkit.readthedocs.org/en/0.8.0/

I've just started using it, and the only limitation I've so far encountered has been that there's no equivalent to awk (i.e. I want a way to evaluate a python expression on every line as part of a pipeline).
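
In the meantime, the column picking and filtering work well; a small sketch with made-up file and column names:

    csvcut -c name,price data.csv | csvgrep -c price -r '^[0-9]{4}' | csvlook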


Get words starting with "and"

    $ cat /usr/share/dict/words | py -fx 're.match(r"and", x)' | head -5
    and
    andante
    andante's
    andantes
    andiron
https://github.com/Russell91/pythonpy


Sorry, I meant: remove the characters "$" and "," from the 3rd column of a CSV file. Obviously the CSV file is quoted, since it has commas in the 3rd column, and so awk is no longer an acceptable solution.


Not to sound too much like an Amazon product page, but if you like this, you'll probably quite like "Unix for Poets" - http://www.lsi.upc.edu/~padro/Unixforpoets.pdf . It's my favourite 'intro' to text/data mangling using unix utils.


I'd like to repeat peterwwillis in saying that there are very Unixy tools that are designed for this, and update his link to my favorite, csvfix: http://neilb.bitbucket.org/csvfix/

Neat selling points: csvfix eval and csvfix exec

also: the last commit to csvfix was 6 days ago; it's active, mature, and the developer is very responsive. If you can think of a capability it doesn't have yet, tell him and you'll have it in no time :)


If you're on Windows, you owe it to yourself to check out a little known Microsoft utility called logparser: http://mlichtenberg.wordpress.com/2011/02/03/log-parser-rock... It effectively lets you query a CSV (or many other log file formats/sources) with a SQL-like language. Very useful tool that I wish was available on Linux systems.


LogParser is one of the few things I really miss from Windows. I think there are unix equivalents, but I haven't had the time to invest in learning them. Pretty much every example in this article boiled down to 'Take this CSV and run a simple SQL query on it'. Yes, you can do that by piping through various unix utilities, or you could just use a tool meant specifically for the task. I'd like to see the article explore some more advanced cases, like rolling up a column. I actually had to do this yesterday and ended up opening my data in open office and using a pivot table.



LogParser is excellent, so many Windows admins have never heard of it which is a shame.


lnav (http://lnav.org) provides SQL-queries-over-logs in the unix world. It's also a nice viewer for the logs themselves.



First blog post was the inspiration for a book, which is almost finished: http://datascienceatthecommandline.com


O'Reilly is having a 50% sale on all ebooks through 9 September.

http://oreilly.com/

I just bought the early release of that exact book for $13.60, which was 60% off, because you get 60% off if you order $100 worth of prediscount ebooks.

http://shop.oreilly.com/product/0636920032823.do

When the book is finished you get the final version. It's mostly already finished.

"With Early Release ebooks, you get books in their earliest form — the author's raw and unedited content as he or she writes — so you can take advantage of these technologies long before the official release of these titles. You'll also receive updates when significant changes are made, new chapters as they're written, and the final ebook bundle."


You can also find tools designed for your dataset, like csvkit[1] , csvfix[2] , and other tools[3] (I even wrote my own CSV munging Unix tools in Perl back in the day)

[1] http://csvkit.readthedocs.org/en/0.8.0/ [2] https://code.google.com/p/csvfix/ [3] https://unix.stackexchange.com/questions/7425/is-there-a-rob...


caveat: delimiter-based commands are not quote-aware. For example, this is a CSV line with two fields:

    foo,"bar,baz"
However, the tools will treat it as 3 columns:

    $ echo 'foo,"bar,baz"' | awk -F, '{print NF}'
    3


Is there any workaround?


Don't use CSV files...

If I'm working with a datafile where I expect the delimiter to be in one of the fields, there is something wrong.

This is one reason why I always work with tab delimited files. Having an actual tab character isn't very common in free-text fields, at least in the data that I work with. Commas on the other hand, are quite common. Why one would select a field separator that was common in your data is beyond me (I know it's historical).

Your data files might be different, in which case, maybe you should select a different field separator.

Otherwise, no, there is no work around. If you have to quote fields, then you can't use the normal unix command line tools that tokenize fields.


yes!

https://github.com/dbro/csvquote

csvquote allows UNIX tools to work properly with quoted fields that contain delimiters inside the data. It is a simple translation tool that temporarily replaces the special characters occurring inside quotes with harmless non-printing characters. You do it as the first step in the pipeline, then do the regular operations using UNIX tools, and the last step of the pipeline restores those troublesome characters back inside the data fields.
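
Usage looks something like this (the file and column number are made up):

    csvquote data.csv | awk -F, '{print $3}' | sort | uniq -c | csvquote -u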


I primarily deal with Excel (xls) files nowadays. I wrote a command line tool to extract data: https://www.npmjs.org/package/j

In my current workflow, I generate JSON from the Excel files and use the really awesome jq command (http://stedolan.github.io/jq/) to process it.
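
For example (data.json and its fields are made up):

    jq -r '.[] | select(.amount > 100) | [.name, .amount] | @csv' data.json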


csvfix is probably the best tool to deal with it. Csvfix, awk, sed are probably my "first line of data-attack". After that usually I can get to analysing, plotting or whatever I need to do.


Practically all Unix tools consider the comma-separated-but-you-can-use-quotes-to-override CSV file to be an abomination. [1] You have to have a crazy regexp to get around it.

[1] Maybe someday they won't.


Yes. Use lex/flex.

You can write one-off (or reuseable) filters in minutes.

lex/flex should be in every UNIX distribution that has a C compiler, but maybe that's changing.


Use csvkit or something similar.


I am surprised no one has mentioned datamash: http://www.gnu.org/software/datamash/. It is a fantastic tool for doing quick filtering, group-by, aggregations, etc. Previous HN discussion: https://news.ycombinator.com/item?id=8130149
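
A quick, hedged example (comma-delimited file with a header row; column numbers made up):

    # group by column 1, report the mean and count of column 3
    datamash -t, --header-in -s -g 1 mean 3 count 3 < data.csv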


No one gives a shit about cut.

    $ man 1 cut


I'm always surprised when people recommend awk for pulling delimited sections of lines out of a file, cut is so much easier to work with.


That's because cut sucks when fields can be separated by multiple space or tab characters.

    # printf '1 2\t3' | cut -f 2
    3
    # printf '1 2\t3' | awk '{print $2}'
    2
    # printf '1 2\t\t3' | cut -f 2
    
    # printf '1 2\t\t3' | awk '{print $2}'
    2


paste and join are useful too; paste complements cut, and join is like the SQL join, but for text files.


I love Unix pipelines, but chances are your data is structured in such a way that using regex based tools will break that structure unless you're very careful.

You know that thing about not making HTML with regexes? The same rule applies to CSV, TSV, and XLSX. All of these can be created, manipulated and read using Python, which is probably already on your system.


There are unix tools that handle XML or CSV as well ;-) http://wiki.apertium.org/wiki/Xml_grep http://csvkit.readthedocs.org/en/latest/


IOW, use Unix commands and pipes when you can. Don't use them when you can't.


The author states:

    uniq -u movies.csv > temp.csv 
    mv temp.csv movie.csv 

    **Important thing to note here is uniq wont work if duplicate records are not adjacent. [Addition based on HN inputs]  
Would the fix here be to sort the lines using the `sort` command first, then `uniq`?


Yes, and in fact you can just use sort's -u (unique) argument, and avoid uniq altogether.


Thanks for the clarification!


Yes, but not first, rather instead. "sort -u" both sorts and hides duplicates.


except when you need uniq -c


To run Unix commands on terabytes of data, check out https://cloudbash.sh/. In addition to the standard Unix commands, their join and group-by operations are amazing.

We're evaluating replacing our entire ETL with cloudbash!


I use this command very frequently to check how often an event occurs in a log file over time (specifically in 10-minute buckets), assuming the file is formatted like "INFO - [2014-08-27 16:16:29,578] Something something something"

    cat /path/to/logfile | grep PATTERN | sed 's/.*\(2014-..-..\) \(..\):\(.\).*/\1 \2:\3x/' | uniq -c
results in:

    273 2014-08-27 14:5x
    222 2014-08-27 15:0x
    201 2014-08-27 15:1x
    171 2014-08-27 15:2x
    349 2014-08-27 15:3x
    230 2014-08-27 15:4x
    236 2014-08-27 15:5x
    339 2014-08-27 16:0x
    330 2014-08-27 16:1x
This can subsequently be visualized with a tool like gnuplot or Excel.


Use sed -E to get the same extended regexes as awk or grep -E.


Useless use of cat?


"Don't pipe a cat" is how I'm used to describing what you're talking about -- it may have been a performance issue in days past, but these days I think it's simply a matter of style. Not that style is not important.


This was drilled into me back in the usenet days. If you see a cat command with a single argument it's almost always replaceable by a shell redirection, or in this case just by passing the filename as an argument to grep. If you're processing lots of data like in the article there's no point in passing it through a separate command and pipe first.


I think people like reading cat thefile | grepsedawk -opts 'prog' from left to right, and that they think the only alternative is grepsedawk -opts 'prog' thefile.

But there's grep <thefile -opts 're'. I like that one best; it reads the same way you'd tend to think it.


But when you're interactively creating your pipeline, having the cat at the very beginning can save you some shuffling around. For example, while your second command is grep and you're passing your file directly to grep, you might then realize you need another command before grep. So you'll have to move that argument to the other command. With a useless cat in front, you just insert the new command between cat and grep. It doesn't really cause harm.


uniq also doesn't deal well with duplicate records that aren't adjacent. You may need to do a sort before using it.

   sort | uniq
But that can screw with your header lines, so be careful there too.


You can do this without sorting:

    awk '!x[$0]++'


That's usually faster where possible, but it may cause problems on large data sets, since it loads the entire set of unique strings (and their counts) into an in-memory hash table.


I use something like this every day:

    awk '!($0 in a) {a[$0]; print}'

I rarely if ever use uniq to remove duplicates. Sorting is expensive.


    sort -u
sort and uniq in one step.


Indeed. uniq is usually only useful if you're also using -u, -d or -c.


Try `body`: https://github.com/jeroenjanssens/data-science-at-the-comman...

    $ echo 'header\ne\nd\na\nb\nc\nb' | body sort | body uniq
    header
    a
    b
    c
    d
    e


Heh, at first I thought you meant a body command like this one, which I've written in the past:

$ cat body_test.txt

     1  This
     2  is
     3  a
     4  file
     5  to
     6  test
     7  the
     8  body
     9  command
    10  which
    11  is
    12  a
    13  complement
    14  to
    15  head
    16  and
    17  tail.
$ cat `which body`

sed -n $1,$2p $3

$ body 5 10 body_test.txt

     5	to
     6	test
     7	the
     8	body
     9	command
    10	which


Immediately ^'d this post for its usefulness, but I put this in my .bashrc instead:

    function body_alias() {
      sed -n $1,$2p $3
    }

    alias body=body_alias

I used to have little scripts like body in my own ~/bin or /usr/local/bin but I've been slowly moving those to my .bashrc, which I can copy to new systems I log on to.


Glad you liked it. Your alias technique is good too. Plus it may save a small amount of time since the body script does not have to be loaded from its file (as in my case) - unless your *nix version caches it in memory after the first time.


> But that can screw with your header lines, so be careful there two.

    F=filename; (head -n 1 $F ; tail -n +2 $F | sort -u) | sponge $F
To get counts of duplicates, you can use:

    sort filename | uniq -c | awk '$1 != 1'


If you were piping into that bracketed expression (instead of using a real file), you'd need "line", "9 read", "sh -c 'read ln; echo $ln'", or "bash -c 'read; echo $REPLY'" in place of the head since head, sed, or anything else, might use buffered I/O and bite off more than it chews. (and then a plain cat in place of the tail)

"line" will compile anywhere but I only know it to be generally available on Linux. I think it's crazy that such a pipe-friendly way to extract a number of lines, and no more than that, isn't part of some standard.


In the spirit of more options, `pee` comes with moreutils and does the trick:

    cat filename | pee 'head -n 1' 'tail -n +2 | sort -u'


I may as well plug my little program, which takes numbers read line-by-line in standard input and outputs a live-updating histogram (and some summary statistics) in the console!

https://github.com/bmsherman/LiveHistogram

It's useful if you want to, say, get a quick feeling of the distribution of numbers in some column of text.


"rs" for "reshape array". Found only on FreeBSD systems (yes, we are better... smile)

For example, transpose a text file:

    $ cat foo.txt
    a b c
    d e f
    $ cat foo.txt | rs -T
    a d
    b e
    c f

Honestly I have never used in production, but I still think it is way cool.

Also, being forced to work in a non-Unix environment, I am always reminded how much I wish everything were either text files, zipped text files, or a SQL database. I know for really big data (bigger than our typical 10^7 row dataset, like imagery or genetics), you have to expand into things like HDF5, but part of my first data cleaning sequence is often to take something out of Excel or whatever and make a text file from it and apply unix tools.


"Found only on FreeBSD..."

Also found on NetBSD, OpenBSD and DragonFlyBSD.



It's also on Mac OS X by default :D.


You should mention this behavior of uniq (from the man page on my machine):

Note: ’uniq’ does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use ‘sort -u’ without ‘uniq’.

Your movies.csv file is already sorted, but you don't mention that sorting is important for using uniq, which may be misleading.

    $ cat tmp.txt
    AAAA
    AAAA
    BBBB
    DDDD
    BBBB
    $ uniq -d tmp.txt
    AAAA


It's good to note that `uniq -u` does remove duplicates, but it doesn't output any instances of a line which has been duplicated. This is probably not clear to a lot of people reading this.


`uniq` removes duplicates; `uniq -u` only shows unique lines.


Exactly. The point wasn't clear from reading the article.



Just one other thing I'd like to mention before everyone moves on to another topic. Not all of the unix commands are equal, and some have features that others don't.

E.g. I mainly work on AIX, and a lot of the commands are simply not the same as what they are on more standard linux flavors. From what I've heard, this applies between different distros as well.

Not so much the case with standard programming languages that are portable, e.g. Python. Unless you take into account Jython, etc.


For last line, I always did

   tac [file] | head -n 1
Mainly because I can never remember basic sed commands

(Strange, OS X doesn't seem to have tac, but Cygwin does...)
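
For reference, the sed spellings are short enough to memorize:

    sed -n '$p' file    # print only the last line
    sed '$!d' file      # same thing: delete every line except the last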


The BSDish way of pronouncing 'tac' is 'tail -r'.


If you installed coreutils with homebrew, it's called gtac.


Why not just tail -n 1 [file]?


Certain people might miss the point of why to use the command line.

1) I use this before using R or Python, and ONLY when it's something I consistently need done all the time. It makes my R scripts shorter.

2) Some things just need a simple fix, and these commands are great for that.

Learn awk and sed and your toolkit for munging data grows much larger.


Exactly! I had a longish period when I wanted to do everything with the same tool. Now I try to pick the most efficient tool (for me, not the machine) for the job. Csvfix, awk, sed, jq and several other command line goodies make my life easier; the heavy lifting goes to R, gephi, or some ad-hoc Python, Go or C.


Fine if your only tool is Perl.


Using basic Unix commands in trivial ways, am I missing something here?


It's data science these days. Latest buzzword. Old is new.



I built stats-tools for use in place of awk for basic statistics.

https://github.com/jweslley/stats-tools


Then you can make plots by piping to https://github.com/dkogan/feedgnuplot


sort -T your_tmp_dir is very useful for sorting large data.
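
For example, with GNU sort (paths and sizes made up):

    sort -T /mnt/scratch -S 2G --parallel=4 bigfile.tsv > bigfile.sorted.tsv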


There is a command on freebsd for transposing text table rows to columns and vice versa, but I can't remember or find it now. It is in core, fwiw.


awk / gawk is super useful. For C/C++ programmers the language is very easy to learn. Try running "info gawk" for a very good guide.

I've used gawk for many things ranging from data analysis to generate linker / loader code in an embedded build environment for a custom processor / sequencer.

(You can even find a version to run from the Windows command prompt if you don't have Cygwin.)
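
A tiny example of the kind of thing it's handy for (made-up tab-separated file, with a header row and a numeric third column):

    awk -F'\t' 'NR > 1 { sum += $3; n++ } END { if (n) print "total:", sum, "mean:", sum/n }' data.tsv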


sort before you uniq!


Do you have a pastebin of the CSV file? Time to play...


just checking!


Really, HN? If you find yourself depending heavily on the recommendations in this article, you are doing data analysis wrong. Shell foo is relevant to data analysis only as much as regex is. In the same light, depending on these methods too much is digging a deep knowledge ditch that in the end is going to limit and hinder you way more than the initial ingress time required to learn more capable data analytics frameworks, or at least a scripting language.

Still, on international man page appreciation day this is a great reference. The only thing it is missing is gnuplot ASCII graphs.


Use splunk.

'nuff said.


Quote: "While dealing with big genetic data sets ..."

What a great start. Unless he's a biologist, the author means generic, not genetic.

The author goes on to show that he can use command-line utilities to accomplish what database clients do much more easily.


A few blog posts earlier, the author writes about "Network Analysis application in Genetic Studies ", so I am confident to say this isn't a typo.

And for a quick-and-dirty custom analysis of big data sets, the unix tools might be a lot more convenient than databases.


I think that he actually is a biologist. He refers to movies as a parallel universe. In which case, these tools are probably not all that helpful. Biological data is usually in the scale of either "Excel can handle it" (shudder) or "ginormous".

In the latter case, none of these would be all that useful, and CSV is not the standard format for most of the biological data that I see.

Databases are less helpful than you'd imagine for this type of data as the schemas are not well defined. I am curious to know how JSON records would work for these data, because I could see something like that working for processing biological data files.


He is a biologist.



