Perl was written partly as a replacement for awk, and as such it has command-line switches that make it more suitable for this than it might appear. You can get very similar behaviour with a much shorter implementation using `perl -nai~`, something like:
BEGIN { open(VOCAB, ">vocab"); }
if (!$imap{$ARGV}{$F[0]}) {
    $imap{$ARGV}{$F[0]} = ++$I{$ARGV};
}
if (!$jmap{$F[1]}) {
    $jmap{$F[1]} = ++$J;
    print VOCAB $F[1] . "\n";
}
print "$imap{$ARGV}{$F[0]} $jmap{$F[1]} $F[2]\n";
Apart from the BEGIN line, it's almost a direct translation of the awk. A lot uglier, but for one-off things that isn't much of a problem.
(And if you want to claim awk has a three-line implementation, this is four lines.)
Admittedly, it's not quite the same - instead of putting output from file1 in file1n, it renames file1 to file1~ and puts its output back in file1. If you want to change that, you have to add your own file-handling code. That would only be a few lines. And it's probably never going to be as fast as mawk.
There are other cases where I suspect perl would beat awk, but maybe get beaten by sed. Not to rain on awk's parade or anything - it's still cool. Just not that much cooler than perl. :)
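For anyone more at home in Python, the same re-indexing logic can be sketched there too. This is my own translation, not anything from the thread; the function name `remap` and the (filename, line) input shape are assumptions:

```python
from collections import defaultdict

def remap(lines):
    """Re-index whitespace-separated triples, as the Perl/awk scripts do.

    `lines` yields (filename, line) pairs.  The first field gets a
    per-file integer id, the second a global id; returns the rewritten
    (i, j, value) triples plus the vocabulary (the VOCAB file's role).
    """
    imap = defaultdict(dict)     # filename -> {row label -> per-file id}
    counters = defaultdict(int)  # filename -> last row id handed out
    jmap = {}                    # column label -> global id
    vocab = []
    triples = []
    for fname, line in lines:
        a, b, val = line.split()
        if a not in imap[fname]:
            counters[fname] += 1
            imap[fname][a] = counters[fname]
        if b not in jmap:
            jmap[b] = len(jmap) + 1
            vocab.append(b)
        triples.append((imap[fname][a], jmap[b], val))
    return triples, vocab
```

More verbose than either the awk or the perl, but the hashes-of-hashes structure carries over one-to-one.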
aha, very nice! I was wondering how to get the awk-style structure in perl; it was remiss of me not to research it.
Maybe it's just me, but I find it much harder to read than the awk syntax, mostly because of the dollar signs. It also gets pretty crowded as a four-liner. Awk's condition-action syntax helps a little here too.
BEGIN { open(VOCAB, ">vocab"); }
if (!$imap{$ARGV}{$F[0]}) { $imap{$ARGV}{$F[0]} = ++$I{$ARGV}; }
if (!$jmap{$F[1]}) { $jmap{$F[1]} = ++$J; print VOCAB $F[1] . "\n"; }
print "$imap{$ARGV}{$F[0]} $jmap{$F[1]} $F[2]\n"
sed, grep and awk are among the major reasons why I love Linux so much. It took months until I first used them, now I use them daily and they made me work so much more productive than before.
1GB isn't exactly "Big Data". I'd expect most truly Big Data tasks to be more I/O bound than computation bound -- at least if your "computation" consists of text parsing and hash table lookups.
Depends. If you do a naive Ruby implementation, then you'll be CPU-bound quite quickly.
#!/usr/bin/env ruby
while line = STDIN.gets
  puts line.split(/\s+/).first
end
This pegs my CPU at only 2MB/s, well below the IO capabilities of any modern system. I guess the tool you're using matters, which I think was the original point.
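A naive Python loop lands in a similar place; how much work each line does matters more than the language's reputation. As a hypothetical tweak (mine, not from the thread), splitting off only the first field avoids building the whole field list:

```python
import sys

def first_field(line):
    """Return the first whitespace-separated field of a line.

    maxsplit=1 stops after the first field, so the rest of the line is
    never split up -- most of the per-line work in the naive version.
    """
    parts = line.split(None, 1)
    return parts[0] if parts else ""

if __name__ == "__main__":
    # Same job as the Ruby snippet: print the first field of each stdin line.
    for line in sys.stdin:
        print(first_field(line))
```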
I agree, 1GB is small. I do similar processing on larger, 100GB-ish tasks.
But I should say, this is a large enough dataset size that loading it all into memory is sometimes infeasible, at least in interactive interpreted environments like Python. That's an important boundary point.
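Streaming sidesteps that boundary: iterate line by line and keep only the aggregate. A minimal sketch, assuming a made-up tally task (counting first fields) rather than anything specific from the thread:

```python
def count_first_fields(lines):
    """Tally how often each first field occurs, one line at a time.

    Accepts any iterable of lines -- including an open file object,
    which Python iterates lazily -- so a multi-GB file never needs to
    fit in memory, unlike read() or readlines().
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if parts:
            counts[parts[0]] = counts.get(parts[0], 0) + 1
    return counts

# Usage: count_first_fields(open("big.txt")) streams the file lazily.
```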
A pretty good showing from Java on this, even though Java's I/O system is pretty annoying. I don't think the implementation he shows there is too odious if you're used to Javaland pain.
It's not the end of the world. It just adds a few lines, but for multi-GB text processing the runtime speedup and the decent concurrency support are worth it.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Foo {
    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = reader.readLine()) != null) {
            // per-line processing goes here
        }
    }
}
Like org.apache.commons.io.IOUtils.lineIterator? Ultimately, I choose to either be a Maven project and require half the world, or just a single file that's easily compiled.
If it's the latter, I don't bother creating many abstractions.
I used mawk a long time ago, but it became stale. Last version I used, I believe, was 1.3.3. It was excellent - fast and accurate. I crunch a lot of data, and it always outperformed gawk. I migrated away from it when it would no longer compile on a Linux system. As I still had gawk, and gawk was fast enough, I left mawk behind.
Now I'll have to see if I can get it to run on OS X. Hmmm... ;)
UPDATE
It's available on MacPorts. It should be on my machines tonight.