In-memory compression is another technique worth keeping in mind. If you can efficiently compress the in-memory representations of data you would otherwise have to hit disk or the network for, you can dramatically increase how much of it you can cache in memory, especially if you're aware of the structure and general patterns of the data.
Once I had to cache file paths, millions of them. I got a fairly easy 6x improvement by storing them in a Patricia trie, with each node's string stored as a range of UTF-8 encoded bytes and indexed by arrays of bitfields, i.e. bit-packed arrays. Child nodes were similarly indexed using bitfield arrays. Turning what would be a 32-bit (or worse, 64-bit) pointer into, say, a 28-bit integer really starts to add up when you have millions of them.
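To make the bit-packing concrete, here is a minimal C sketch of the kind of packed index array described above. The 28-bit width, the packed_array type, and the helper names are illustrative assumptions, not the commenter's actual code; in a real trie, arrays like these would hold child-node indices and (offset, length) ranges into a shared UTF-8 byte buffer.

```c
/* Minimal sketch of a bit-packed index array, assuming a 28-bit element
 * width: values are stored back to back inside uint64_t words, so each
 * index costs 3.5 bytes instead of a 4- or 8-byte pointer. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WIDTH 28u                        /* bits per element (illustrative) */
#define MASK  ((1ULL << WIDTH) - 1)

typedef struct {
    uint64_t *words;                     /* packed storage */
    size_t    count;                     /* number of WIDTH-bit elements */
} packed_array;

static packed_array packed_new(size_t count)
{
    packed_array a;
    a.count = count;
    a.words = calloc((count * WIDTH + 63) / 64, sizeof(uint64_t));
    return a;
}

static uint32_t packed_get(const packed_array *a, size_t i)
{
    size_t bit = i * WIDTH, word = bit / 64, off = bit % 64;
    uint64_t v = a->words[word] >> off;
    if (off + WIDTH > 64)                /* element straddles two words */
        v |= a->words[word + 1] << (64 - off);
    return (uint32_t)(v & MASK);
}

static void packed_set(packed_array *a, size_t i, uint32_t value)
{
    size_t bit = i * WIDTH, word = bit / 64, off = bit % 64;
    uint64_t v = (uint64_t)value & MASK;
    a->words[word] = (a->words[word] & ~(MASK << off)) | (v << off);
    if (off + WIDTH > 64) {
        size_t spill = off + WIDTH - 64; /* bits that live in the next word */
        uint64_t hi = (1ULL << spill) - 1;
        a->words[word + 1] = (a->words[word + 1] & ~hi) | (v >> (64 - off));
    }
}

int main(void)
{
    packed_array children = packed_new(1000);  /* 1000 child links */
    packed_set(&children, 42, 123456789u);     /* value must fit in WIDTH bits */
    printf("%u\n", packed_get(&children, 42)); /* prints 123456789 */
    free(children.words);
    return 0;
}
```

At 28 bits per entry, a million child links come to about 3.5 MB of packed words versus 8 MB of 64-bit pointers, which is where the savings start to add up.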
The fact that Google uses these algorithms makes them somewhat special, in that one would assume Google put some thought and experimentation into choosing them. It also means they've had pretty good field-testing.
Any company building enterprise-level products puts thought and experimentation into choosing its algorithms. Google just advertises its research publicly as part of its branding.
Thanks for extracting bmz into a standalone project. I actually went searching for it after reading the article's comments last night and came across your link. One question I have is how up to date this version of bmz is. It says it's from a Hypertable commit dated Jan 25, 2009. Would I be able to find a newer version in Hypertable's source?