Make grep 50x faster

agf · on Dec 15, 2013

If you take a look at the comments on the article, most of that speedup is because the LANG=C command was run second and the files were cached.

His estimate accounting for that was 7x, but this is clearly not a benchmark that was carefully thought through.

anilshanbhag · on Dec 15, 2013

I am curious what happens when the commands are run in the reverse order, with the LANG=C variation first. I suspect some of the speedup is because the first run had just brought the file into memory.
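
One rough way to separate the caching effect from the locale effect is to read the file once untimed and then time both variants in both orders. The sketch below is only an illustration: the generated sample file and the helper names `now_ms` and `elapsed_ms` are hypothetical, and `date +%s%N` assumes GNU date.

```shell
# Sketch: time grep with and without LANG=C after warming the page cache.
# FILE is a generated sample (hypothetical); substitute a real large log.
FILE=$(mktemp)
awk 'BEGIN { for (i = 0; i < 200000; i++) print "line " i " Some Mixed Case text" }' > "$FILE"

# Milliseconds since the epoch; %N (nanoseconds) is a GNU date extension.
now_ms() { echo $(( $(date +%s%N) / 1000000 )); }

# Run a command, discard its output, and print elapsed wall-clock milliseconds.
elapsed_ms() {
  start=$(now_ms)
  "$@" > /dev/null
  end=$(now_ms)
  echo $(( end - start ))
}

# Untimed read so that neither timed run pays for disk I/O.
cat "$FILE" > /dev/null

# Time each variant once in each position, so an ordering effect would be visible.
# Note: if LC_ALL is set in the environment, it overrides LANG.
c_first=$(elapsed_ms env LANG=C grep -i e "$FILE")
default_first=$(elapsed_ms grep -i e "$FILE")
default_second=$(elapsed_ms grep -i e "$FILE")
c_second=$(elapsed_ms env LANG=C grep -i e "$FILE")

echo "LANG=C:  ${c_first} ms, ${c_second} ms"
echo "default: ${default_first} ms, ${default_second} ms"
rm -f "$FILE"
```

If the two LANG=C timings are close to each other and both differ from the default timings, the difference is probably not just the file cache.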

ye · on Dec 15, 2013

Not only that, the search branches are sitting in the CPU caches.

pygy_ · on Dec 15, 2013

Wouldn't the cache be flushed by the OS and other background processes in the meantime, plus the display updates in the shell and the shell history?

pmelendez · on Dec 15, 2013

I believe that's what ack-grep[1] and the silver searcher(AKA ag)[2] do underneath.

Actually, I would recommend giving those alternatives a try. I haven't had to go back to grep since I started using ack-grep (and now ag).

[1] http://beyondgrep.com/

[2] http://geoff.greer.fm/2011/12/27/the-silver-searcher-better-...

cs02rm0 · on Dec 15, 2013

Maybe, but the silver searcher still seems a _lot_ faster than grep with this trick. (About 0.5s vs 32s on an arbitrary test.)

iagooar · on Dec 15, 2013

This is what I get on a Precise 32bit Ubuntu:

  stuff$ du -sh big.log
  2.8G	big.log

  stuff$ time grep -i e big.log > /dev/null
  real	0m30.228s
  user	0m12.213s
  sys	0m3.228s

  stuff$ time LANG=C grep -i e big.log > /dev/null
  real	0m30.130s
  user	0m12.105s
  sys	0m3.308s

What is LANG=C supposed to do?

simias · on Dec 15, 2013

Maybe your default locale is already C?

I'm still surprised that TFA can claim such a speedup. I would have thought IO speed was the bottleneck when you grep through a large amount of data.

As another poster mentioned, I wonder if the speedup is mainly due to disk caching in RAM during the 2nd run.

masklinn · on Dec 15, 2013

FWIW my locale is en_GB.utf-8 and I also get no difference (in fact, the version with locale is slightly faster than without), with GNU grep 2.14 on OSX 10.9.

The built-in BSD grep (2.5.1-FreeBSD) also runs in 30% of the time GNU grep does.

a8da6b0c91d · on Dec 15, 2013

I just did 'strings /dev/urandom > stuff' for 100M. Got it in memory first with untimed grep, then:

    $ time grep -i blah stuff
    bLAH
    blAH

    real	0m7.227s
    user	0m7.194s
    sys	0m0.011s

    $ time LANG=C grep -i blah stuff
    bLAH
    blAH

    real	0m0.486s
    user	0m0.467s
    sys	0m0.016s

So a pretty big difference. My default lang is en_US.utf8.

masklinn · on Dec 15, 2013

Sets the process's locale to C rather than whatever the system's default is (if the process is locale-aware, as "C" is the default locale for C programs). The main change[0] for this case is that it also disables encoding (and thus decoding): all text is considered to be ASCII rather than whatever the locale specifies (usually UTF-8 these days).

[0] the locale should have an impact on what character ranges match. See http://stackoverflow.com/questions/6799872/how-to-make-grep-... for an example
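
A small way to see the decoding difference, as a sketch: in the C locale grep's `.` matches a single byte, while in a UTF-8 locale it matches a whole decoded character. The helper `sample` is hypothetical, `LC_ALL` is used here because it takes precedence over `LANG`, and `en_US.UTF-8` is an assumed locale name that may not be installed (see `locale -a`).

```shell
# "é" is the two bytes C3 A9 in UTF-8, written here as octal escapes.
sample() { printf 'caf\303\251\n'; }

# In the C locale "." matches one byte, so "caf" must be followed by two dots.
# grep -c exits non-zero when the count is 0, hence the "|| true".
c_one=$(sample | LC_ALL=C grep -c '^caf.$' || true)
c_two=$(sample | LC_ALL=C grep -c '^caf..$' || true)
echo "C locale: '^caf.\$' matched $c_one line(s), '^caf..\$' matched $c_two line(s)"

# In a UTF-8 locale the same two bytes decode to one character.
# en_US.UTF-8 is an assumption; if it is missing, grep falls back to the C locale.
u_one=$(sample | LC_ALL=en_US.UTF-8 grep -c '^caf.$' || true)
echo "UTF-8 locale: '^caf.\$' matched $u_one line(s)"
```

The per-character decoding in the UTF-8 case is the extra work that the C locale avoids.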

UNIXgod · on Dec 15, 2013

It's historical and sets your local character set. The default value on FreeBSD is POSIX, which is an alias for the historical value, C. Desktop Unixes like OSX or Ubuntu set it to UTF-8.

Another note is that this is triggering fgrep, which is already fast due to its fixed-length expression (i.e. no recursion is involved).

pmelendez · on Dec 15, 2013

>What is LANG=C supposed to do?

It changes the charset so that UTF-8 is not used.

ttflee · on Dec 15, 2013

Not supporting Unicode often drives me crazy/sad, especially for a diff/patch tool.

UNIXgod · on Dec 15, 2013

Having control makes me happy/sane. Having unicode set to (on|off) does not affect either diff/patch.

justincormack · on Dec 15, 2013

It doesn't actually change the charset; it changes the behaviour of a few libc functions and other things that use them (like perl). Sane programs ignore libc locales, as you do not want the behaviour of your program changing randomly based on some environment variables; that way madness lies.

blassium · on Dec 15, 2013

There was a really interesting post on here a while back on GNU grep vs BSD grep (2010)[1]

The improvement mentioned here also has to do with the Boyer-Moore algorithm. When switching the locale from LANG=whatever to LANG=C, we're reducing the size of the lookup table to a fraction of what it previously was. In this case, the fraction is 1/50th, but, as the author said, this will vary between patterns and platforms.

[1] http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...

comex · on Dec 15, 2013

Note that, at least as of GNU grep 2.14, if you don't use -i, the discrepancy doesn't show up, so it's smart enough to recognize that the UTF-8 search can be correctly performed as a byte search. I suspect the insensitive version can also be done correctly much faster, though.

acdha · on Dec 15, 2013

> it's smart enough to recognize that the UTF-8 search can be correctly performed as a byte search

It shouldn't be that simple – it'd also need to confirm that the pattern wouldn't match any combining characters, or normalization would still be necessary.

Joeri · on Dec 15, 2013

The non-leading UTF-8 bytes are all in an easily detected range that doesn't overlap with ASCII.
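
To illustrate the byte ranges, here is a minimal sketch that classifies each byte of a short UTF-8 string. The function name `classify` is hypothetical. It relies on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx (0x80 to 0xBF), while ASCII bytes are below 0x80.

```shell
# Classify one byte, given as two hex digits, by its role in UTF-8.
classify() {
  b=$((0x$1))
  if [ "$b" -lt 128 ]; then
    echo ascii                       # 0xxxxxxx
  elif [ $((b & 0xC0)) -eq 128 ]; then
    echo continuation                # 10xxxxxx, i.e. 0x80..0xBF
  else
    echo lead                        # first byte of a multi-byte sequence
  fi
}

# "a", then "é" (C3 A9), then "€" (E2 82 AC), written as octal escapes.
for byte in $(printf 'a\303\251\342\202\254' | od -An -tx1); do
  echo "$byte $(classify "$byte")"
done
```

Since no byte of a multi-byte sequence falls in the ASCII range, a pure-ASCII pattern cannot accidentally match in the middle of a multi-byte character.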

acdha · on Dec 16, 2013

Yes - it's not hard to do but it does require someone to remember to check before attempting the optimization.

tszming · on Dec 15, 2013

See:

[1] https://news.ycombinator.com/item?id=3337411

[2] http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance...

acqq · on Dec 15, 2013

And especially:

http://rg03.wordpress.com/2009/09/09/gnu-grep-is-slow-on-utf...

"Update on 2010/10/28: GNU grep is no longer slow on UTF-8. The problem was fixed with the release of GNU grep 2.7. The rest of the article can now be considered obsolete."

dfc · on Dec 15, 2013

I did not see any version numbers or any mention of whether we are discussing BSD grep or GNU grep. The grep in OSX is ridiculously slow. Whenever anyone says grep is slow, the first thing I ask is if they are using OSX; the answer is almost always yes. GNU grep is a lot faster.

That being said, there was a bug with grep and UTF-8 a little while back. Debian lists the bug as present in 2.6 and fixed in 2.8:

"grep ." pathologically slow in UTF-8 locales -- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=604408

nullanvoid · on Dec 15, 2013

There was a write-up about this a while ago: http://www.inmotionhosting.com/support/website/ssh/speed-up-...