I'm curious what happens if we run the commands in the reverse order, with the LANG=C variation first. I suspect some of the speedup comes from the first run having just brought the file into memory.
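One way to check the caching suspicion is to warm the cache with an untimed read and then time the runs in both orders. This is only a rough sketch: the scratch file, its size (about 1M, far smaller than the files in this thread) and the pattern are assumptions for illustration, and it assumes a shell such as bash where `time` is available.

```shell
# Hypothetical scratch file filled with random ASCII letters and newlines.
f=$(mktemp)
head -c 5000000 /dev/urandom | LC_ALL=C tr -dc 'a-zA-Z\n' > "$f"

# Untimed pass so every timed run below starts with the file already cached.
cat "$f" > /dev/null

# Timed runs: C locale first, then the default locale, then C again.
# "|| true" is there because grep exits with status 1 when nothing matches.
time env LANG=C grep -i blah "$f" > /dev/null || true
time grep -i blah "$f" > /dev/null || true
time env LANG=C grep -i blah "$f" > /dev/null || true

# Sanity check: on ASCII-only data both locales should report the same count.
n_c=$(LANG=C grep -ci blah "$f" || true)
n_default=$(grep -ci blah "$f" || true)
echo "matches: C=$n_c default=$n_default"

rm -f "$f"
```

If the first and third timings agree with each other and differ from the second, the gap is down to the locale rather than the cache. On a file this small the gap may be lost in the noise, so raise the byte count for a meaningful comparison.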
I believe that's what ack-grep[1] and The Silver Searcher (AKA ag)[2] do underneath.
Actually, I'd recommend people give those alternatives a try. I haven't had to go back to grep since I started using ack-grep (and now ag).
stuff$ du -sh big.log
2.8G big.log
stuff$ time grep -i e big.log > /dev/null
real 0m30.228s
user 0m12.213s
sys 0m3.228s
stuff$ time LANG=C grep -i e big.log > /dev/null
real 0m30.130s
user 0m12.105s
sys 0m3.308s
FWIW my locale is en_GB.utf-8 and I also get no difference (in fact, the version with locale is slightly faster than without), with GNU grep 2.14 on OSX 10.9.
The built-in BSD grep (2.5.1-FreeBSD) also runs in 30% of the time GNU grep does.
I just ran 'strings /dev/urandom > stuff' until I had 100M.
Got it into memory first with an untimed grep, then:
$ time grep -i blah stuff
bLAH
blAH
real 0m7.227s
user 0m7.194s
sys 0m0.011s
$ time LANG=C grep -i blah stuff
bLAH
blAH
real 0m0.486s
user 0m0.467s
sys 0m0.016s
So a pretty big difference. My default lang is en_US.utf8.
It sets the process's locale to C rather than whatever the system's default is (if the process is locale-aware; "C" is the default locale for C programs). The main change[0] for this case is that it also disables encoding (and thus decoding): all text is considered to be ASCII rather than whatever the locale specifies (usually UTF-8 these days).
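To see what a process actually ends up with, something like the following can help. It is a sketch that assumes the POSIX `locale` utility is installed; the variable names are made up for illustration. One thing worth knowing is that LC_ALL, when set, overrides LANG and every individual LC_* variable, so on a machine that exports LC_ALL or LC_CTYPE a bare LANG=C prefix may change nothing.

```shell
# A VAR=value prefix applies to that one child process only.
child_lang=$(LANG=C sh -c 'printf %s "$LANG"')
echo "child sees LANG=$child_lang"

# Print the categories in effect with only LANG overridden.
# If LC_ALL or LC_CTYPE is already exported, LC_CTYPE will not change here.
LANG=C locale

# LC_ALL takes precedence over LANG and over the individual LC_* variables.
ctype_line=$(LC_ALL=C locale | grep '^LC_CTYPE=')
echo "$ctype_line"

# Name of the character encoding the C locale reports on this system.
LC_ALL=C locale charmap
```

If the `LANG=C locale` output still shows a UTF-8 LC_CTYPE, use LC_ALL=C for the comparison instead.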
It's historical and sets your local character set. The default value on FreeBSD is POSIX, which is an alias for the historical value, C. Desktop Unixes like OSX or Ubuntu set it to a UTF-8 locale.
Another note is that this is triggering fgrep, which is already fast due to its fixed-length expression (i.e. no recursion is involved).
It doesn't actually change the charset; it changes the behaviour of a few libc functions and other things that use it (like Perl). Sane programs ignore libc locales, as you do not want the behaviour of your program changing randomly based on some environment variables. That way madness lies.
There was a really interesting post on here a while back on GNU grep vs BSD grep (2010)[1].
The improvement mentioned here also has to do with the Boyer-Moore algorithm. When switching the locale from LANG=whatever to LANG=C, we're reducing the size of the lookup table to a fraction of what it previously was. In this case, the fraction is 1/50th, but, as the author said, this will vary between patterns and platforms.
Note that, at least as of GNU grep 2.14, if you don't use -i, the discrepancy doesn't show up, so it's smart enough to recognize that the UTF-8 search can be correctly performed as a byte search. I suspect the case-insensitive version can also be done correctly much faster, though.
> it's smart enough to recognize that the UTF-8 search can be correctly performed as a byte search
It shouldn't be that simple: it'd also need to confirm that the pattern wouldn't match any combining characters, or normalization would still be necessary.
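A small illustration of why -i is the awkward case, and of what LANG=C gives up in exchange for speed. It is a sketch with made-up variable names: the octal escapes are the UTF-8 bytes for É and é, and LC_ALL is used instead of LANG so that nothing in the environment overrides it.

```shell
upper=$(printf '\303\211')   # UTF-8 bytes for É (U+00C9)
lower=$(printf '\303\251')   # UTF-8 bytes for é (U+00E9)

# Case-sensitive: identical byte sequences match, so a byte search is enough.
exact=$(printf '%s\n' "$lower" | LC_ALL=C grep -c "$lower" || true)

# Case-insensitive in the C locale: only ASCII letters are folded,
# so É and é remain different byte sequences and do not match.
# "|| true" because grep exits with status 1 on zero matches.
folded=$(printf '%s\n' "$upper" | LC_ALL=C grep -ci "$lower" || true)

# ASCII letters are still folded as usual.
ascii=$(printf 'bLAH\n' | LC_ALL=C grep -ci blah || true)

echo "exact=$exact folded=$folded ascii=$ascii"
```

In the C locale this prints exact=1 folded=0 ascii=1. With a UTF-8 locale installed and selected, I'd expect the second search to match as well, which is the extra work the -i case has to pay for.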
"Update on 2010/10/28: GNU grep is no longer slow on UTF-8. The problem was fixed with the release of GNU grep 2.7. The rest of the article can now be considered obsolete."
I did not see any version numbers, or whether we are discussing BSD grep or GNU grep. The grep in OSX is ridiculously slow. Whenever anyone says grep is slow, the first thing I ask is whether they are using OSX; the answer is almost always yes. GNU grep is a lot faster.
That being said, there was a bug with grep and UTF-8 a little while back. Debian lists the bug as present in 2.6 and fixed in 2.8:
His estimate accounting for that was 7x, but this is clearly not a benchmark that was carefully thought through.