
uniq also doesn't deal well with duplicate records that aren't adjacent. You may need to do a sort before using it.

   sort | uniq
But that can screw with your header lines, so be careful there too.
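For instance (made-up file contents):

    $ printf 'name\ncarol\nalice\nbob\n' | sort | uniq
    alice
    bob
    carol
    name

The header ends up sorted in with the data.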



You can do this without sorting:

    awk '!x[$0]++'


That's usually faster when it's an option, but it may cause problems on large data sets, since it keeps the entire set of unique strings (and their counts) in an in-memory hash table.
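For anyone unfamiliar with the idiom, it's roughly shorthand for this longer form (the array name is arbitrary):

    awk '{ if (!seen[$0]) print; seen[$0]++ }'

Every distinct line becomes a key in the array, with its running count as the value, which is where the memory use comes from.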


I use something like this every day:

    awk '!($0 in a) { a[$0]; print }'

I rarely if ever use uniq to remove duplicates. Sorting is expensive.


    sort -u
sort and uniq in one step.


Indeed. uniq is usually only useful if you're also using -u, -d or -c.
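For example (access.log here is just a placeholder file name):

    sort access.log | uniq -c | sort -rn   # count each distinct line, most frequent first
    sort access.log | uniq -d              # only the lines that occur more than once
    sort access.log | uniq -u              # only the lines that occur exactly once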


Try `body`: https://github.com/jeroenjanssens/data-science-at-the-comman...

    $ echo -e 'header\ne\nd\na\nb\nc\nb' | body sort | body uniq
    header
    a
    b
    c
    d
    e


Heh, at first I thought you meant a body command like this one, which I've written in the past:

    $ cat body_test.txt
     1  This
     2  is
     3  a
     4  file
     5  to
     6  test
     7  the
     8  body
     9  command
    10  which
    11  is
    12  a
    13  complement
    14  to
    15  head
    16  and
    17  tail.
    $ cat `which body`
    sed -n $1,$2p $3
    $ body 5 10 body_test.txt
     5	to
     6	test
     7	the
     8	body
     9	command
    10	which


Immediately ^'d this post for its usefulness, but I put this in my .bashrc instead:

    function body_alias() {
      sed -n $1,$2p $3
    }
    alias body=body_alias

I used to have little scripts like body in my own ~/bin or /usr/local/bin, but I've been slowly moving those to my .bashrc, which I can copy to new systems I log on to.


Glad you liked it. Your alias technique is good too. Plus it may save a small amount of time since the body script does not have to be loaded from its file (as in my case) - unless your *nix version caches it in memory after the first time.


> But that can screw with your header lines, so be careful there too.

    F=filename; (head -n 1 $F ; tail -n +2 $F | sort -u) | sponge $F
To get counts of duplicates, you can use:

    sort filename | uniq -c | awk '$1 != 1'


If you were piping into that bracketed expression (instead of using a real file), you'd need "line", "9 read", "sh -c 'read ln; echo $ln'", or "bash -c 'read; echo $REPLY'" in place of the head, since head, sed, or anything else might use buffered I/O and bite off more than it can chew (and then a plain cat in place of the tail).

"line" will compile anywhere but I only know it to be generally available on Linux. I think it's crazy that such a pipe-friendly way to extract a number of lines, and no more than that, isn't part of some standard.


In the spirit of more options, `pee` comes with moreutils and does the trick:

    cat filename | pee 'head -n 1' 'tail -n +2 | sort -u'



