In the comments Jarek Duda discusses some of the more technical details of the implementation and of his ANS paper that enabled this writeup. One cool thing I noticed is this:
Another advantage of ANS is that we can slightly perturb
the initialization procedure using a pseudo-random number
generator initialized with cryptographic key (e.g. and
also the number of block): for example choosing between
the lowest weight symbol and the second best one.
This way we get simultaneously a decent encryption for free.
Does anyone know what modern compression tools use for entropy coding? Do they all use huffman and would this help them increase speed? What about gzip for HTTP transactions?
There is a nice treatment on entropy coders in Managing Gigabytes (http://ww2.cs.mu.oz.au/mg/). I highly recommend the text as it gives good background on entropy encodings and it also takes a nice engineering approach to implementation.
I toyed around with entropy encoding at one point based on a republished copy of an old Dr. Dobbs' Journal article I found, and I managed to cleanly reimplement the described algorithm in Python and reproduce the example mentioned in the article... but eventually I hit a problem I wasn't smart enough to debug. I'm eager to see whether this new algorithm is more approachable.
Does anyone know the real history with arithmetic encoders?
A lot of places seem to credit Rissanen, but I remember from when this was actually my field, that Elias did most of the work in the early '60s (and actually went on a tangent that Shannon considered in the '50s), and Rissanen's contribution was getting the finite word length arithmetic considerations properly done.
In fact, when I studied this at the university (early 90s), it was introduced as "Elias (aka Arithemetic) coding".
Glen Langdon, who authored early papers with Jorma Rissanen on Arithmetic Coding was my advisor. I am not sure if I still have notes on what Glen taught in his data compression classes, but I will see what I have.
I've been meaning to get notes from his classes as well as some transform mathematics notes from one of David Huffman's classes online.
So if I understand this correctly, how much it outperforms Huffman greatly depends on the data being compressed. Which makes me wonder: what kind of real-life datasets would benefit the most from this new approach?
Check out the FiniteStateEntropy/test/probaGenerator.c code.
It builds a table of character sequences based on a probability. If you feed it a p=0.5 you get a sequence that halves each time:
0000000011111223 (shortened example, the real total length is 4096 so you get 2048x0, 1024x1 ...)
This is then used as an alphabet probability to build random sequences in the test file.
The algorithm scores better than huff for higher P, meaning that the alphabet shortens with a few dominant ones. This means that (statistically) you get a lot of words in your dictionary that resemble each other. And that's where the fractional probability plays its role.
To try to answer your question I'd guess that this technique shows its muscle for data sets in which the dictionary is large and consists of similar sequences. Example: a csv file with a lot of numbers, gis or kml datasets, Gerber files, ...
In the comments Jarek Duda discusses some of the more technical details of the implementation and of his ANS paper that enabled this writeup. One cool thing I noticed is this:
Does anyone know what modern compression tools use for entropy coding? Do they all use huffman and would this help them increase speed? What about gzip for HTTP transactions?