look(1): lines beginning with given string (die.net)
47 points by mtdewcmu on Dec 13, 2015 | 45 comments



The universe of Unix filters never ceases to turn up things I haven't seen before. Other examples for those inclined: tac, sponge, pv, join, paste.
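For anyone who hasn't met them (file names hypothetical):

    tac access.log                    # cat, but last line first
    sort todo.txt | sponge todo.txt   # soak up input, then write back to the same file (moreutils)
    pv big.tar | tar xf -             # progress meter for data flowing through a pipe
    paste names.txt ages.txt          # merge files line by line, tab-separated
    join ids.txt emails.txt           # relational join; both files sorted on the key field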


Some others I like:

rev, vipe, comm, bfr, xxd


Do you know of a good resource listing all of these?


The standardized ones are described in the Shell & Utilities section of POSIX:2008/SUSv4 online at http://pubs.opengroup.org/onlinepubs/9699919799/


Yes, Num command has a quick list: http://www.numcommand.com/doc/helpers.html


`ls /usr/bin/?? /usr/bin/???` can be interesting to skim through.


In the same vein of 'improvements over things everyone assumes you have to do with the more famous tool', we could say grep '^string' : look :: sort -R : shuf


Thanks to @climagic on Twitter, I recently learned that sort -R and shuf are not in fact the same thing. (sort -R doesn't actually shuffle: it sorts by a hash of each line, so duplicate lines end up adjacent.)


My point there was that besides being a true shuffle, 'shuf' also has an algorithmic advantage when you only need a few lines (similar to how 'look' has an advantage when you only need a few matches).
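For instance, pulling 3 random lines (file name hypothetical):

    shuf -n 3 big.txt             # only needs to pick 3 lines, no full sort
    sort -R big.txt | head -n 3   # hashes and sorts the entire file first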


look does a binary search and only works on sorted files, while grep '^string' works on any text file but would be slower.


On my machine, (Ubuntu 14.04) look only performs a binary search if the "-b" option is passed (unless no file is specified, in which case it behaves as described in the linked manpage). My manpage notes in the compatibility section:

> look uses a linear search by default instead of a binary search, which is what most other implementations use by default.
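So on such systems you'd write, e.g.:

    look -b dog /usr/share/dict/words   # force the binary search explicitly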


Is look of any use? Isn't grep fast enough?


A better one-line description would be

"look(1): binary search for lines with a given prefix in a sorted file"

Which tells you exactly what the pros/cons are compared to plain grep.
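E.g. (file name hypothetical):

    sort words.txt -o words.txt   # look's precondition: input must be sorted
    look dog words.txt            # binary search for lines with the "dog" prefix
    grep '^dog' words.txt         # same output, but a full linear scan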


I created a 1.1G file with dictionary words and timed...

  % du -h a  
  1.1G	a
  % time look 'dog' a | wc -l
  53856

  real	0m0.021s
  user	0m0.020s
  sys	0m0.003s
  % time grep '^dog' a | wc -l
  53856

  real	0m28.593s
  user	0m0.977s
  sys	0m2.223s
OK, grep performed worse than I expected (sorting the file beforehand took a long time, though).


'look' can stop as soon as it passes the sorted block of 'dog' lines, whereas grep must continue to the end of the file.

Ignoring memory hierarchy, naturally 'look' would be O(log n) while grep would be O(n).

Special input (sorted, in this case) often calls for a matching algorithm when speed matters.
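Back-of-the-envelope, assuming the 1.1G file above averages ~10 bytes per line:

    1.1e9 bytes / ~10 bytes per line ≈ 1e8 lines
    binary search: log2(1e8) ≈ 27 probes
    linear scan:   up to 1e8 lines read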


That seems abysmally slow for grep. On my machine, with a 1.1G file made of 1200 copies of /usr/share/dict/words, GNU grep 2.16 takes about 1.5 seconds. Try with LC_ALL=C?
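i.e. something like:

    time LC_ALL=C grep '^dog' a | wc -l   # C locale skips locale-aware multibyte handling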


Is grep optimized to be fast, though? There seem to be a lot of "better at grep than grep" tools out there:

http://beyondgrep.com/


There's real effort that went into it: https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug.... I don't know how it compares to all the alternatives.


Yeah, there's a lot of that, but using grep for everything is like using duct tape to hold something together when you really should be using screws. grep can do everything, but it's not always the best tool for the job.


The "ack" you reference is apparently faster only in that it's easier to specify exactly which files to search or not to search, which obviously can also be done with grep, e.g. combined with 'find'.

As for grep itself, it has had many incarnations over the decades with various pluses and minuses.

The current GNU grep allows 3 kinds of regex, basic/extended/perl -- the latter being what ack supports.

Note that Perl regexes have extensions beyond regular languages, and those extensions are inherently slower to match than the finite automata that basic regexes compile to. Power versus speed.

E.g. grep(1): "Back-references are very slow, and may require exponential time."
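To make the distinction concrete (file name hypothetical; GNU grep uses a DFA for patterns without back-references):

    grep -E '(ab)+c' notes.txt    # still a regular language: runs on the fast DFA matcher
    grep '\(ab*\)x\1' notes.txt   # back-reference: forces the backtracking matcher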

For further info:

> why GNU grep is fast

> Mike Haertel, mike at ducky.net

> Sat Aug 21 03:00:30 UTC 2010

> Here's a blog post from 2006 about a developer trying to "beat grep" and looking at the algorithms it uses; it goes into a little more detail about the "doesn't need to do the loop exit test at every step" optimization mentioned in this email.

http://ridiculousfish.com/blog/posts/old-age-and-treachery.h...

via

https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...

The best writeup is surely by the inimitable Russ Cox, who explains very clearly why grep, as of 2007, was one of the only fast regex implementations:

Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

Russ Cox, January 2007

http://swtch.com/~rsc/regexp/regexp1.html

(This is a 4 part series but IIRC part 1 has the highlights)

I'm sure that various other tools have been strongly influenced by this famous essay, and so many more things may be as fast as grep by now, but still...

P.S. one of the other high-profile "ack"-like search tools would be "ag", aka "The Silver Searcher".

"The Silver Searcher is a 3-5x faster drop in replacement for ack (which itself is better than grep)."

2013

https://www.reddit.com/r/programming/comments/16bvah/the_sil...


It seems most of grep's time was spent waiting on disk. This underlines why look can be significantly faster: just reading the non-matching lines takes time.


When running 'look' to get all comments in a file, I only get the first matching line in the output. Can someone explain this?

look "#" .zshrc

There are really a lot more comments (lines starting with '#') in my .zshrc than just that first line.


From the manpage:

> As look performs a binary search, the lines in file must be sorted...

I assume you didn't sort your source file first.


When sorting the file first I get no output anymore:

    sort ~/.zshrc -o /tmp/.zshrc
    look "#" /tmp/.zshrc

Sorry for the noise but I am confused... :|


sort(1) might not use the same ordering as look(1)

Try setting and exporting LC_COLLATE=C, then retry.
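i.e.:

    export LC_COLLATE=C
    sort ~/.zshrc -o /tmp/.zshrc
    look "#" /tmp/.zshrc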


Uh, wow! That was it!

Thanks for this answer...


It doesn't make sense to sort a configuration file; order and context matter. It's easier to just use grep '^#' (or even better, ag).


Yes, I'm aware of that, but at this point I just wanted to understand how 'look' is used. It was a mistake to try it out on my .zshrc first; that's simply the first file that came to mind, and it led me to misunderstand everything. Since sorting a source file and then "looking" for comments makes no sense anyway, grep really is the better choice here.


Your .zshrc is almost certainly not a sorted file.


One of the useful things about look is that it defaults to the dictionary. So if you weren't sure how to spell accommodate, you could type

    look ac | grep date


Only 12 hours after looking at this did I realize that if your log files have one record per line, you can easily use this to search for all logs that happened in a particular ten minute period.
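e.g., assuming timestamp-prefixed (and therefore sorted) log lines, something like:

    look '2015-12-13T14:2' app.log   # every record from 14:20:00 through 14:29:59

(file name and timestamp format hypothetical)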


    look a | wc -l      => 5985
    look -b a | wc -l   => 1228
    look -bf a | wc -l  => 1228

You must sort the /usr/share/dict/words file case-insensitively for the -bf options to work properly.

Edit: I was expecting

    sort -f /usr/share/dict/words | look -bf a | wc -l

to be 5985, but the result is 1228; I don't know why.
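One guess at the cause (an assumption, not verified against every implementation): look(1) reads from the named file, or from the default word list when no file is given, and ignores stdin entirely, so the sorted output of that pipe never reaches it. Writing the sorted copy to a file first should behave as expected:

    sort -f /usr/share/dict/words -o /tmp/words.sorted   # path hypothetical
    look -bf a /tmp/words.sorted | wc -l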


If you don't have this basic linux utility, you can use

    grep '^string' file.txt
but there's a greater chance of having grep installed than look.


As long as I have been using *nixes, I have never met a machine that didn't have grep. It was written by Ken Thompson himself in 1974, or 41 years ago, and first came out in Unix 4th edition.


grep is also specified by POSIX, whereas look isn't.


The interesting thing about look is that it uses binary search, so it is much faster.


Binary search is useless unless the file is sorted. Most often my files are not sorted, so egrep '^line' works just fine.


It's obvious that look was written as an optimization for this specific case, though. The default file being the user dictionary is interesting. Do you think they didn't know about grep in 1979?


The interesting use case for me is that you can make a sorted index and look up records efficiently, i.e. within a script. It works with ctags indexes, for instance.
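e.g. ctags writes a sorted tags file, so a script can pull a symbol's entries directly (tag name hypothetical):

    look 'parse_args' tags | cut -f1,2   # tag name and the file that defines it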


Binary search is only faster if access to the elements is constant-time, which is not true for a file with varying-length records (lines). Line 1000 (counting from 0) could be at offset 2000 if each line is a single character + newline, or it could be at offset 2000000 if each line is 999 characters and a newline, or somewhere in between.


You don't need to know the number of the line you're looking at. You jump to the middle of the file, and you compare the nearest line to the search pattern. Then you recurse on the top or bottom half.
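A rough bash sketch of that idea (an illustration of the strategy, not how look is actually implemented; assumes GNU stat/tail and a file sorted under LC_ALL=C):

    # print every line of sorted file $2 that starts with prefix $1
    bisect_look() {
      local key=$1 file=$2 lo=0 hi mid line
      hi=$(stat -c %s "$file")    # total size in bytes (GNU stat)
      while (( hi - lo > 1 )); do
        mid=$(( (lo + hi) / 2 ))
        # skip the (possibly partial) line at the offset, read the next full one
        line=$(tail -c +"$((mid + 1))" "$file" | sed -n '2{p;q}')
        if [[ -n $line && $line < $key ]]; then lo=$mid; else hi=$mid; fi
      done
      # scan forward from the converged offset; quit once past the prefix block
      if (( lo == 0 )); then cat "$file"
      else tail -c +"$((lo + 1))" "$file" | tail -n +2
      fi | awk -v k="$key" 'index($0, k) == 1 { print; next } $0 > k { exit }'
    }

Each probe costs one seek plus one short read, so a gigabyte file takes ~30 probes instead of a full scan.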


That still requires random seeking; it's not so bad on SSDs, but abysmal on rotating disks, and most of the filesystem and OS caching code is optimised for linear forward reads, so things like prefetching are not going to help at all.


At some point the cost of scanning a long file will overcome the cost of doing log(n) seeks. On short files, it really doesn't matter what you do, searching will be fast regardless.


It's probably much faster in most cases of text files though.

I also think your constant-time claim sounds too strong. You can absorb a lot of end-of-line scanning after the binary search and still beat a linear/regex search.

As someone else pointed out, look can also exit sooner.


According to Wikipedia, grep was first used in "Version 4 AT&T UNIX". According to my 1984 Eunice manual, look was part of the 7th edition of Unix.



