Hacker News new | past | comments | ask | show | jobs | submit login
Compression with the Burrows-Wheeler Transform (fabpedigree.com)
33 points by ssp on Feb 9, 2010 | hide | past | favorite | 6 comments



Unless I missed it, this page doesn't actually contain a description of how to invert the BWT. Fortunately, Wikipedia has a fairly decent description:

The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters to get the first column. Then, the first and last columns together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text.


This algorithm is highly useful for next-gen sequencing.

Heng Li, the creator of MAQ, has recently released an aligner called BWA. The heart of BWA is the Burrows-Wheeler transform. The benefits? For genome research, they include a small memory footprint (<4GB), and fast exact matching of short reads to a reference.

Text compression, meet genomics. Genomics, meet compression.


In my mind, Burrows-Wheeler is a clever trick, but it's not clearly superior to dictionary-based methods. bzip2 beats gzip on compression not because B-W is better, but because it has a larger window size and because of the entropy encoder at the tail end.

7Zip builds on gzip, greatly expanding the window and using an arithmetic encoder at the tail end. It performs better in both time and space over bzip2, making both gzip and bzip2 obsolete


In tests I've seen, LZMA -1 is even slower than gzip -9 so I wouldn't say gzip is obsolete in situations where compression speed matters.


It’s funny that the article mentions gzip but not bzip2, which is also commonly installed on Unix systems and uses the Burrows–Wheeler block sorting transformation (followed by Huffman coding).


The original Burrows & Wheeler paper, SRC Research Report 124: http://apotheca.hpl.hp.com/ftp/pub/DEC/SRC/research-reports/...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: