Hacker News
Make grep 50x faster (x-way.org)
49 points by iamtechaddict on Dec 15, 2013 | 25 comments



If you take a look at the comments on the article, most of that speedup is because the LANG=C command was run second and the files were cached.

His estimate accounting for that was 7x, but this is clearly not a benchmark that was carefully thought through.


I am curious what happens when the commands are run in the reverse order, with the LANG=C variation first. I suspect some of the speedup is because the first run just brought the file into memory.
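For anyone who wants to try that, here is a rough sketch that reads the file once untimed and then alternates the two variants, so neither gets the benefit of going second. compare_orders is a made-up helper name, and it uses LC_ALL=C rather than LANG=C on the assumption that LC_ALL might already be set in your environment (it takes precedence over LANG). It relies on bash's time keyword.

```shell
#!/usr/bin/env bash
# Sketch only: compare_orders is a hypothetical helper, not from the article.
compare_orders() {
    file=$1
    pattern=$2
    cat "$file" > /dev/null     # untimed read so every timed run starts from a warm page cache
    for loc in default C default C; do
        echo "locale: $loc"
        if [ "$loc" = C ]; then
            # grep exits 1 when nothing matches, hence the || true
            time LC_ALL=C grep -i -- "$pattern" "$file" > /dev/null || true
        else
            time grep -i -- "$pattern" "$file" > /dev/null || true
        fi
    done
}
```

Calling it as `compare_orders big.log e` prints four timings. If the second and fourth runs are still much faster than the first and third, the difference is not just caching.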


Not only that, the search branches are sitting in the CPU caches.


Wouldn't the cache be flushed by the OS and other background processes in the meantime, plus the display updates in the shell and the shell history?


I believe that's what ack-grep[1] and the silver searcher (AKA ag)[2] do underneath.

Actually, I would recommend that people give those alternatives a try. I haven't had to go back to grep since I started using ack-grep (and now ag).

[1] http://beyondgrep.com/

[2] http://geoff.greer.fm/2011/12/27/the-silver-searcher-better-...


Maybe, but the silver searcher still seems a _lot_ faster than grep with this trick. (About 0.5s vs 32s on an arbitrary test.)


This is what I get on a Precise 32bit Ubuntu:

  stuff$ du -sh big.log
  2.8G	big.log

  stuff$ time grep -i e big.log > /dev/null
  real	0m30.228s
  user	0m12.213s
  sys	0m3.228s

  stuff$ time LANG=C grep -i e big.log > /dev/null
  real	0m30.130s
  user	0m12.105s
  sys	0m3.308s

What is LANG=C supposed to do?


Maybe your default locale is already C?

I'm still surprised that TFA can claim such a speedup. I would have thought IO speed was the bottleneck when you grep through a large amount of data.

As another poster mentioned, I wonder if the speedup is mainly due to disk caching in RAM during the 2nd run.
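A quick way to check the "already C" theory: the locale that character handling actually uses is decided by LC_ALL first, then LC_CTYPE, then LANG. The sketch below just walks that order. effective_ctype is a made-up name, and the final fallback to C is an assumption (POSIX leaves the default implementation-defined; glibc uses C).

```shell
# Sketch only: effective_ctype is a hypothetical helper.
# Precedence for character handling: LC_ALL, then LC_CTYPE, then LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        echo "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        echo "$LANG"
    else
        echo "C"    # assumed fallback when nothing is set
    fi
}
```

If this prints C or POSIX, then prefixing the command with LANG=C changes nothing. It also shows why LANG=C can be a no-op when LC_ALL is set to something else. Running plain `locale` should report the same information.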


FWIW my locale is en_GB.utf-8 and I also get no difference (in fact, the version with locale is slightly faster than without), with GNU grep 2.14 on OSX 10.9.

The built-in BSD grep (2.5.1-FreeBSD) also runs in 30% of the time GNU grep does.


I just did 'strings /dev/urandom > stuff' for 100M. Got it in memory first with untimed grep, then:

    $ time grep -i blah stuff
    bLAH
    blAH

    real	0m7.227s
    user	0m7.194s
    sys	0m0.011s

    $ time LANG=C grep -i blah stuff
    bLAH
    blAH

    real	0m0.486s
    user	0m0.467s
    sys	0m0.016s
So a pretty big difference. My default lang is en_US.utf8.
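If you want a comparable test file without depending on strings(1), something like the sketch below should do. It is a substitute for the command above, not the same thing: it keeps only ASCII letters, spaces and newlines from /dev/urandom, and make_corpus is a made-up helper name. It assumes a head that supports -c.

```shell
# Sketch only: make_corpus is a hypothetical helper.
# Usage: make_corpus BYTES OUTFILE
make_corpus() {
    # tr runs in the C locale so it filters raw bytes instead of trying to decode them
    LC_ALL=C tr -dc 'A-Za-z \n' < /dev/urandom | head -c "$1" > "$2"
}
```

For roughly the same size as above, `make_corpus 100000000 stuff`, then run one untimed grep to get the file into memory before timing anything.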


It sets the process's locale to C rather than whatever the system's default is (if the process is locale-aware, since "C" is the default locale for C programs). The main change[0] in this case is that it also disables encoding (and thus decoding): all text is treated as ASCII rather than whatever the locale specifies (usually UTF-8 these days).

[0] the locale should have an impact on what character ranges match. See http://stackoverflow.com/questions/6799872/how-to-make-grep-... for an example
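A small way to see the decoding difference, using wc -m rather than grep: "é" is two bytes in UTF-8, and whether a tool counts it as one character or two depends on the locale it runs under. chars_in_locale is a made-up helper, and en_US.UTF-8 is only an example locale that may not be installed on your machine.

```shell
# Sketch only: chars_in_locale is a hypothetical helper.
# The input is "é" encoded as UTF-8: bytes 0xC3 0xA9 (octal 303 251).
chars_in_locale() {
    printf '\303\251' | LC_ALL=$1 wc -m | tr -d ' '
}

chars_in_locale C             # C locale: no decoding, each byte counts as a character
chars_in_locale en_US.UTF-8   # UTF-8 locale (if installed): the bytes are decoded first
```

With GNU coreutils I would expect the first call to report 2. The second should report 1 when the locale exists and otherwise fall back to the same behaviour as C.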


It's historical and sets your local character set. The default value on FreeBSD is POSIX, which is an alias for the historical value, C. Desktop Unixes like OSX or Ubuntu set it to UTF-8.

Another note is that this is triggering fgrep, which is already fast due to its fixed-length expression (i.e. no recursion is involved).


>What is LANG=C supposed to do?

It changes the charset so that UTF-8 is not used.


Not supporting Unicode often drives me crazy/sad, especially for a diff/patch tool.


Having control makes me happy/sane. Having unicode set to (on|off) does not affect either diff/patch.


It doesn't actually change the charset; it changes the behaviour of a few libc functions and other things that use them (like perl). Sane programs ignore libc locales, as you do not want the behaviour of your program changing randomly based on some environment variables. That way madness lies.


There was a really interesting post on here a while back on GNU grep vs BSD grep (2010)[1]

The improvement mentioned here also has to do with the Boyer-Moore algorithm. When switching the locale from LANG=whatever to LANG=C, we're reducing the size of the lookup table to a fraction of what it previously was. In this case, the fraction is 1/50th, but, as the author said, this will vary between patterns and platforms.

[1] http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...
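To make the lookup-table idea concrete, here is a rough bash sketch of the Horspool simplification of Boyer-Moore. It is not GNU grep's code. horspool_find and ord are made-up names, and it assumes ASCII text and pattern, so the bad-character table has one entry per byte value (256). How a multibyte locale changes the table inside grep is not something this sketch tries to model.

```shell
#!/usr/bin/env bash
# Sketch only: Horspool variant of Boyer-Moore, assuming ASCII input.
ord() { printf '%d' "'$1"; }   # numeric value of a single character

# Prints the 0-based offset of the first match, or -1 if there is none.
horspool_find() {
    local text=$1 pat=$2
    local n=${#text} m=${#pat} i c
    local -a shift_for=()
    if (( m == 0 )); then echo 0; return; fi

    # Bad-character table: by default, skip a whole pattern length.
    for (( i = 0; i < 256; i++ )); do shift_for[i]=$m; done
    # Characters that occur in the pattern (except its last position) allow a shorter skip.
    for (( i = 0; i < m - 1; i++ )); do
        c=$(ord "${pat:i:1}")
        shift_for[c]=$(( m - 1 - i ))
    done

    i=0
    while (( i <= n - m )); do
        if [[ ${text:i:m} == "$pat" ]]; then echo "$i"; return; fi
        # Skip ahead based on the text character aligned with the pattern's last position.
        c=$(ord "${text:i+m-1:1}")
        i=$(( i + ${shift_for[c]:-$m} ))
    done
    echo -1
}
```

The point is only that the skip table is indexed by character value, so its size depends on how many distinct "characters" the matcher has to account for.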


Note that, at least as of GNU grep 2.14, if you don't use -i, the discrepancy doesn't show up, so it's smart enough to recognize that the UTF-8 search can be correctly performed as a byte search. I suspect the insensitive version can also be done correctly much faster, though.


> it's smart enough to recognize that the UTF-8 search can be correctly performed as a byte search

It shouldn't be that simple: it would also need to confirm that the pattern wouldn't match any combining characters, or normalization would still be necessary.


The non-leading utf-8 bytes all are in an easily detected range that doesn't overlap with ascii.
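Spelled out as a sketch: ASCII bytes are 0x00-0x7F, UTF-8 continuation bytes are 0x80-0xBF (bit pattern 10xxxxxx), and everything from 0xC0 up is either a lead byte or not valid UTF-8 at all. byte_class is a made-up helper that takes a decimal byte value.

```shell
# Sketch only: byte_class is a hypothetical helper.
# Usage: byte_class N   (N is a byte value, 0-255, in decimal)
byte_class() {
    if [ "$1" -lt 128 ]; then
        echo ascii                 # 0x00-0x7F
    elif [ "$1" -lt 192 ]; then
        echo continuation          # 0x80-0xBF, never starts a character
    else
        echo lead-or-invalid       # 0xC0-0xFF
    fi
}
```

So an ASCII pattern byte can never be mistaken for the middle of a multibyte character, which is what makes the byte-level search safe for plain ASCII patterns.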


Yes - it's not hard to do but it does require someone to remember to check before attempting the optimization.



And especially:

http://rg03.wordpress.com/2009/09/09/gnu-grep-is-slow-on-utf...

"Update on 2010/10/28: GNU grep is no longer slow on UTF-8. The problem was fixed with the release of GNU grep 2.7. The rest of the article can now be considered obsolete."


I did not see any version numbers or if we are discussing BSD grep or GNU grep. The grep in OSX is ridiculously slow. Whenever anyone says grep is slow the first thing I ask is if they are using OSX, the answer is almost always yes. GNU grep is a lot faster.

That being said there was a bug with grep and UTF a little while back. Debian lists the bug as present in 2.6 and fixed in 2.8:

"grep ." pathologically slow in UTF-8 locales -- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=604408


There was a write-up about this a while ago: http://www.inmotionhosting.com/support/website/ssh/speed-up-...



