
uniq also doesn't deal well with duplicate records that aren't adjacent. You may need to do a sort before using it.

   sort | uniq
But that can screw with your header lines, so be careful there too.
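For instance (made-up file contents):

    $ printf 'name\ncarol\nalice\nbob\n' | sort | uniq
    alice
    bob
    carol
    name

The header ends up sorted in with the data.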



You can do this without sorting:

    awk '!x[$0]++'


That's usually faster when it's an option, but it may cause problems on large data sets, since it keeps the entire set of unique strings (and their counts) in an in-memory hash table.
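For anyone unfamiliar with the idiom, it's roughly shorthand for this longer form (the array name is arbitrary):

    awk '{ if (!seen[$0]) print; seen[$0]++ }'

Every distinct line becomes a key in the array, with its running count as the value, which is where the memory use comes from.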


I use something like this every day:

    awk '!($0 in a) { a[$0]; print }'

I rarely if ever use uniq to remove duplicates. Sorting is expensive.


    sort -u
sort and uniq in one step.


Indeed. uniq is usually only useful if you're also using -u, -d or -c.
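For example (access.log here is just a placeholder file name):

    sort access.log | uniq -c | sort -rn   # count each distinct line, most frequent first
    sort access.log | uniq -d              # only the lines that occur more than once
    sort access.log | uniq -u              # only the lines that occur exactly once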


Try `body`: https://github.com/jeroenjanssens/data-science-at-the-comman...

    $ echo -e 'header\ne\nd\na\nb\nc\nb' | body sort | body uniq
    header
    a
    b
    c
    d
    e


Heh, at first I thought you meant a body command like this one, which I've written in the past:

    $ cat body_test.txt
     1  This
     2  is
     3  a
     4  file
     5  to
     6  test
     7  the
     8  body
     9  command
    10  which
    11  is
    12  a
    13  complement
    14  to
    15  head
    16  and
    17  tail.
    $ cat `which body`
    sed -n $1,$2p $3
    $ body 5 10 body_test.txt
     5	to
     6	test
     7	the
     8	body
     9	command
    10	which


Immediately ^'d this post for its usefulness, but I put this in my .bashrc instead:

    function body_alias() {
      sed -n $1,$2p $3
    }
    alias body=body_alias

I used to have little scripts like body in my own ~/bin or /usr/local/bin, but I've been slowly moving those to my .bashrc, which I can copy to new systems I log on to.


Glad you liked it. Your alias technique is good too. Plus it may save a small amount of time since the body script does not have to be loaded from its file (as in my case) - unless your *nix version caches it in memory after the first time.


> But that can screw with your header lines, so be careful there too.

    F=filename; (head -n 1 $F ; tail -n +2 $F | sort -u) | sponge $F
To get counts of duplicates, you can use:

    sort filename | uniq -c | awk '$1 != 1'


If you were piping into that bracketed expression (instead of using a real file), you'd need "line", "9 read", "sh -c 'read ln; echo $ln'", or "bash -c 'read; echo $REPLY'" in place of the head, since head, sed, or anything else might use buffered I/O and bite off more than it can chew (and then a plain cat in place of the tail).

"line" will compile anywhere but I only know it to be generally available on Linux. I think it's crazy that such a pipe-friendly way to extract a number of lines, and no more than that, isn't part of some standard.


In the spirit of more options, `pee` comes with moreutils and does the trick:

    cat filename | pee 'head -n 1' 'tail -n +2 | sort -u'



