Don't MAWK AWK - the fastest and most elegant big data munging language (anyall.org)
37 points by blasdel on Sept 10, 2009 | 22 comments



Perl was written partly as a replacement for awk, and as such it has command-line switches that make it better suited to this kind of job than it might appear. You could get very similar behaviour with a much shorter implementation using `perl -nai~` (-n loops over the input lines, -a autosplits each line into @F, -i~ edits the input files in place with a ~ backup), something like:

    BEGIN { open(VOCAB, ">vocab"); }
    if (!$imap{$ARGV}{$F[0]}) {
      $imap{$ARGV}{$F[0]} = ++$I{$ARGV};
    }
    if (!$jmap{$F[1]}) {
      $jmap{$F[1]} = ++$J;
      print VOCAB $F[1] . "\n";
    }
    print "$imap{$ARGV}{$F[0]} $jmap{$F[1]} $F[2]\n"
Apart from the BEGIN line, that's almost a direct translation of the awk. A lot uglier, but for one-off things that isn't much of a problem.
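
To run it over the two data files, the invocation would be something like this (sparsify.pl is just a made-up name for the saved script):

  perl -nai~ sparsify.pl file1 file2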

(And if you want to claim awk has a three-line implementation, this is four lines.)

Admittedly, it's not quite the same - instead of putting output from file1 in file1n, it renames file1 to file1~ and puts its output back in file1. If you want to change that, you have to add your own file-handling code. That would only be a few lines. And it's probably never going to be as fast as mawk.

There are other cases where I suspect perl would beat awk, but maybe get beaten by sed. Not to rain on awk's parade or anything - it's still cool. Just not that much cooler than perl. :)


Aha, very nice! I was wondering how to get the awk-style structure in perl; it was unfair of me not to research it.

Maybe it's just me, but I find it much harder to read than the awk syntax, mostly because of the dollar signs. It's also pretty crowded as a four-liner. Awk's condition-action syntax helps a little here too.

    BEGIN { open(VOCAB, ">vocab"); }
    if (!$imap{$ARGV}{$F[0]}) { $imap{$ARGV}{$F[0]} = ++$I{$ARGV}; }
    if (!$jmap{$F[1]}) { $jmap{$F[1]} = ++$J; print VOCAB $F[1] . "\n"; }
    print "$imap{$ARGV}{$F[0]} $jmap{$F[1]} $F[2]\n"


sed, grep and awk are among the major reasons I love Linux so much. It took months before I first used them; now I use them daily, and they've made me so much more productive than before.


Silly bash function I use all the time.

  # print column N of whitespace-separated input: f N
  function f {
    awk '{print $'$1'}'
  }

  cat tab-separated | f 2 > just-the-2nd-column


cut -f2 tab-separated > just-the-2nd-column


I have a script much like cut, but which can reorder and duplicate fields.

http://github.com/ChickenProp/f/tree/master
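
The idea, very roughly, looks like this; a sketch of the concept in Java, not the linked script (the class name and 1-based field numbering are my own choices):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  // Usage: java F 2 1 1 < input   prints field 2, then field 1 twice
  public class F {
    public static void main(String[] args) throws Exception {
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < args.length; i++) {
          if (i > 0) out.append(' ');
          out.append(fields[Integer.parseInt(args[i]) - 1]);
        }
        System.out.println(out);
      }
    }
  }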


doesn't work if it's any-amount-of-whitespace separated


-d'<tab>'


Nope:

  echo 'a b' | cut -d'<tab>' -f2
  a b

If the delimiter doesn't appear in a line at all, cut just prints the whole line.


load "tab-separated" ( :f1:::'\t', :f2:::'\t', :f3::: ) as x;

create "just-the-2nd-column" from x ( :f2::: );

-- http://github.com/shuLhan/vos


1GB isn't exactly "Big Data". I'd expect most truly Big Data tasks to be more I/O bound than computation bound -- at least if your "computation" consists of text parsing and hash table lookups.

That said, it's interesting that mawk is so fast.


Depends. If you do a naive Ruby implementation, then you'll be CPU-bound quite quickly.

  #!/usr/bin/env ruby
  # print the first whitespace-separated field of each line
  while line = STDIN.gets
    puts line.split(/\s+/).first
  end
This pegs my CPU at only 2MB/s, well below the IO capabilities of any modern system. I guess the tool you're using matters, which I think was the original point.


I agree, 1GB is small. I do similar processing on tasks that are more in the 100GB range.

But I should say, this is a large enough dataset size that loading it all into memory is sometimes infeasible, at least in interactive interpreted environments like Python. That's an important boundary point.


A pretty good showing from Java on this, even though Java's I/O system is pretty annoying. I don't think the implementation he shows there is too odious if you're used to Javaland pain.


It's not the end of the world. It just adds a few lines, but for multi-GB text processing the runtime speedup and the decent concurrency support are worth it.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  public class Foo {
    public static void main(String[] args) throws Exception {
      BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
      String line;
      while ((line = reader.readLine()) != null) {
        // process the line here
      }
    }
  }


It's not hard to write a utility that wraps an InputStream in an Iterator so you can do things like:

  for(String line : readLines(System.in)) {
    //do something with line here
  }
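
A minimal sketch of such a wrapper (strictly it returns an Iterable, since that's what for-each wants; the class name is mine, and the checked IOException gets rethrown unchecked):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStream;
  import java.io.InputStreamReader;
  import java.util.Iterator;

  public class Lines {
    // Wrap a stream so each line can be consumed with for-each.
    public static Iterable<String> readLines(InputStream in) {
      final BufferedReader reader = new BufferedReader(new InputStreamReader(in));
      return new Iterable<String>() {
        public Iterator<String> iterator() {
          return new Iterator<String>() {
            private String next = advance();

            private String advance() {
              try {
                return reader.readLine();  // null at end of stream
              } catch (IOException e) {
                throw new RuntimeException(e);
              }
            }

            public boolean hasNext() { return next != null; }

            public String next() {
              String line = next;
              next = advance();
              return line;
            }

            public void remove() { throw new UnsupportedOperationException(); }
          };
        }
      };
    }
  }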


Like org.apache.commons.io.IOUtils.lineIterator? Ultimately, I choose to either be a Maven project and require half the world, or just a single file that's easily compiled.

If it's the latter, I don't bother creating many abstractions.
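
(For reference, the lineIterator route looks roughly like this; written from memory of the commons-io API, so check the javadoc:)

  import org.apache.commons.io.IOUtils;
  import org.apache.commons.io.LineIterator;

  public class Cat {
    public static void main(String[] args) throws Exception {
      LineIterator it = IOUtils.lineIterator(System.in, "UTF-8");
      try {
        while (it.hasNext()) {
          String line = it.nextLine();
          // do something with line here
        }
      } finally {
        LineIterator.closeQuietly(it);
      }
    }
  }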


Yeah, I was happy with how the Java turned out. The important point of comparison is C++, and the Java was WAY easier!


I used mawk a long time ago, but development on it went stale. The last version I used, I believe, was 1.3.3. It was excellent: fast and accurate. I crunch a lot of data, and it always outperformed gawk. I migrated away when it would no longer compile on a Linux system; since I still had gawk, and gawk was fast enough, I left mawk behind.

Now I'll have to see if I can get it to run on OS X. Hmmm... ;)

UPDATE

It's available on MacPorts. It should be on my machines tonight.


I'm installing it right now.


I wonder if the C/C++ code compiled with LLVM/Clang would make a dent in the run time?


Use it in Unicode mode; that kills performance.



