In the same vein of 'improvements over things everyone assumes you have to do with the more famous tool', we could say grep '^string' : look :: sort -R : shuf
My point there was that besides producing a true shuffle, shuf also has an algorithmic advantage when you only need a few lines (similar to how look has an advantage when you only need a few matches).
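For example, to pull three random lines (newer GNU shuf can handle the -n case without permuting the entire input):

  sort -R /usr/share/dict/words | head -n 3   # randomizes every line, keeps 3
  shuf -n 3 /usr/share/dict/words             # samples 3 lines directly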
On my machine (Ubuntu 14.04), look only performs a binary search if the "-b" option is passed (unless no file is specified, in which case it behaves as described in the linked manpage). My manpage notes in the compatibility section:
look uses a linear search by default instead of a binary search, which is what most other implementations use by default.
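So on an implementation like that, the fast path has to be asked for explicitly. A sketch, assuming the -b flag from that man page and some sorted file:

  look -b dog /usr/share/dict/words   # binary search, as other looks do by default
  look    dog /usr/share/dict/words   # linear scan on this implementation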
I created a 1.1G file with dictionary words and timed...
% du -h a
1.1G a
% time look 'dog' a | wc -l
53856
real 0m0.021s
user 0m0.020s
sys 0m0.003s
% time grep '^dog' a | wc -l
53856
real 0m28.593s
user 0m0.977s
sys 0m2.223s
OK, grep performed worse than I expected (sort took a long time, though).
That seems abysmally slow for grep. On my machine, with a 1.1G file made of 1200 copies of /usr/share/dict/words, GNU grep 2.16 takes about 1.5 seconds. Try with LC_ALL=C?
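For reference, the setup and the rerun look roughly like this (a sketch; timings obviously depend on the machine and on cache state):

  # build a ~1.1G sorted test file from the system dictionary
  for i in $(seq 1200); do cat /usr/share/dict/words; done | sort > a
  time grep '^dog' a | wc -l
  time LC_ALL=C grep '^dog' a | wc -l   # skip locale-aware matching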
Yeah, there's a lot of that, but using grep for this is like using duct tape to hold something together when you really should be using screws. grep can do everything, but it's not always the best tool for the job.
The "ack" you reference is apparently faster only in that it's easier to specify exactly which files to search or not to search, which obviously can also be done with grep, e.g. combined with 'find'.
As for grep itself, it has had many incarnations over the decades with various pluses and minuses.
The current GNU grep allows 3 kinds of regex, basic/extended/perl -- the latter being what ack supports.
Note that Perl regexes have extensions that go beyond regular languages and are inherently slower than the finite automata that basic and extended regexes compile down to. Power versus speed.
E.g. grep(1): "Back-references are very slow, and may require exponential time."
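A quick illustration of the three flavors (hypothetical file name; the -P line uses a lookahead, which has no finite-automaton equivalent):

  grep    'colou\?r' file       # basic: ? must be escaped (GNU extension in BRE)
  grep -E 'colou?r'  file       # extended
  grep -P 'colou?r(?=ed)' file  # perl: lookahead assertion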
For further info:
> why GNU grep is fast
> Mike Haertel mike at ducky.net
> Sat Aug 21 03:00:30 UTC 2010
Here's a blog post from 2006 about a developer trying to "beat grep" and looking at the algorithms it uses; it goes into a little more detail about the "doesn't need to do the loop exit test at every step" optimization mentioned in this email.
The best writeup is surely by the inimitable Russ Cox, who explains really clearly why, as of 2007, grep was one of the only fast regex implementations:
Regular Expression Matching Can Be Simple And Fast [#1]
(but is slow in Java, Perl, PHP, Python, Ruby, ...)
(This is a 4 part series but IIRC part 1 has the highlights)
I'm sure that various other tools have been strongly influenced by this famous essay, and so many more things may be as fast as grep by now, but still...
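The essay's pathological case is easy to reproduce, assuming a GNU grep built with PCRE support (the -P run may instead abort on a backtracking limit, which makes the same point):

  n=25
  pat="$(printf 'a?%.0s' $(seq $n))$(printf 'a%.0s' $(seq $n))"   # a?^n a^n
  str="$(printf 'a%.0s' $(seq $n))"                               # a^n
  time echo "$str" | grep -E "$pat"   # automaton-based: effectively instant
  time echo "$str" | grep -P "$pat"   # backtracking: exponential in n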
P.S. one of the other high-profile "ack"-like search tools would be "ag", aka "The Silver Searcher".
"The Silver Searcher is a 3-5x faster drop in replacement for ack (which itself is better than grep)."
Seems that most of grep's time was spent waiting on disk. This underlines the reason look may be significantly faster: actually reading all the non-matching lines takes time.
Yes, I am aware of that, but I wanted to understand the usage of 'look' at this point. It was a mistake to try it on my .zshrc file at first, because that led me to misunderstand everything; it was just the first file that came to mind. Since sorting a source file and then "looking" for comments makes no sense in any way, grep really does seem to be the better choice.
Only 12 hours after looking at this did I realize that if your log files have one record per line, you can easily use this to search for all logs that happened in a particular ten minute period.
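For instance, with timestamped, sorted, one-record-per-line logs (hypothetical file and format), a shared prefix covers a ten-minute window:

  look '2015-06-01T14:3' app.log   # everything from 14:30:00 through 14:39:59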
As long as I have been using *nixes, I have never met a machine that didn't have grep. It was written by Ken Thompson himself in 1974, or 41 years ago, and first came out in Unix 4th edition.
It's obvious that look was written as an optimization for this specific case, though. The default file is the dictionary word list, which is interesting. Do you think they didn't know about grep in 1979?
The interesting use case for me is that you can make a sorted index and look up records efficiently, e.g. from within a script. It works with ctags indexes, for instance.
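A sketch of that pattern (hypothetical index file and key; a ctags "tags" file is already kept sorted the same way):

  sort -o index.txt index.txt      # keep the index sorted once
  look "$key" index.txt            # then each lookup is a prefix search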
Binary search is only faster if access to the elements is constant-time, which is not true for a file with varying-length records (lines). Line 1000 (counting from 0) could be at offset 2000 if each line is a single character plus a newline, or at offset 2000000 if each line is 1999 characters plus a newline, or anywhere in between.
You don't need to know the number of the line you're looking at. You jump to the middle of the file, and you compare the nearest line to the search pattern. Then you recurse on the top or bottom half.
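A rough bash sketch of that procedure (not how look itself is implemented; hypothetical file and key, last-line edge cases ignored):

  file=words.sorted; key=dog
  lo=0
  hi=$(stat -c %s "$file")                       # total size in bytes
  while [ $((hi - lo)) -gt 4096 ]; do
      mid=$(( (lo + hi) / 2 ))
      # first complete line at or after byte offset $mid
      line=$(tail -c +$((mid + 1)) "$file" | head -n 2 | tail -n 1)
      if [[ "$line" < "$key" ]]; then lo=$mid; else hi=$mid; fi
  done
  # linear scan from the narrowed region; stop after the block of matches,
  # since the file is sorted and matching lines are contiguous
  tail -c +$((lo + 1)) "$file" |
    awk -v k="$key" 'substr($0, 1, length(k)) == k { print; m=1; next } m { exit }'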
That still requires random seeking; it's not so bad on SSDs, but abysmal on rotating disks, and most of the filesystem and OS caching code is optimised for linear forward reads, so things like prefetching are not going to help at all.
At some point the cost of scanning a long file will overcome the cost of doing log(n) seeks. On short files, it really doesn't matter what you do, searching will be fast regardless.
It's probably much faster for most text files, though.
I also think your constant-time claim sounds too strong. You can spend a lot of time scanning for line boundaries after each probe and still beat a linear/regex search.
As someone else pointed out, look can also exit sooner.