Nice writeup! FSE, one of the parts of Zstandard, is a very elegant entropy coder. If you're following this section of the article, you might be wondering how it's encoded, since the encoding process is somewhat less obvious than for Huffman coding.
What causes trouble is simply, "Once you've emitted a symbol, how do you know that the next symbol is in the range [BL, BL+2^NB)?" The answer is that the encoding is performed backwards, starting from the end of the input, and the table is constructed so that for any symbol emitted, you can find a previous symbol so that the current symbol is in the previous symbol's [BL, BL+2^NB) range.
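If it helps to see the "backwards" trick concretely, here is a toy tANS coder in C. This is my own sketch, not zstd's FSE: the table size, frequencies, and naive symbol spread are made up, and the bit output is a simple stack. The encoder walks the input from the end and pushes bits; the decoder starts from the transmitted final state and pops them, producing symbols in forward order:

    /* Toy tANS (tabled ANS) sketch in the spirit of FSE -- illustration only. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define R     6                  /* table log */
    #define L     (1u << R)          /* table size = 64 */
    #define NSYM  3

    /* Normalized frequencies: must sum to exactly L. */
    static const uint32_t freq[NSYM] = { 40, 16, 8 };

    /* Decode table, indexed by state - L for states in [L, 2L). */
    static uint8_t  dec_sym[L];      /* symbol decoded at this state       */
    static uint32_t dec_sub[L];      /* pre-transition sub-state in [f,2f) */
    /* Encode table: for symbol s and sub-state x in [f_s, 2*f_s). */
    static uint32_t enc_state[NSYM][2 * L];

    /* Bit stack: encoder pushes, decoder pops (LIFO). */
    static uint8_t bitstack[1 << 16];
    static int     ntop = 0;
    static void    push_bit(uint8_t b) { bitstack[ntop++] = b; }
    static uint8_t pop_bit(void)       { return bitstack[--ntop]; }

    static void build_tables(void) {
        uint32_t next[NSYM], pos = 0;
        for (int s = 0; s < NSYM; s++) next[s] = freq[s];
        /* Naive spread: symbol s occupies freq[s] consecutive slots.
         * Real FSE scatters symbols through the table for better compression;
         * any spread shared by encoder and decoder stays lossless. */
        for (int s = 0; s < NSYM; s++) {
            for (uint32_t k = 0; k < freq[s]; k++, pos++) {
                uint32_t sub = next[s]++;        /* runs over [f_s, 2*f_s) */
                dec_sym[pos] = (uint8_t)s;
                dec_sub[pos] = sub;
                enc_state[s][sub] = L + pos;     /* inverse mapping        */
            }
        }
    }

    /* Encoding runs over the input BACKWARDS; returns the final state. */
    static uint32_t encode(const uint8_t *in, int n) {
        uint32_t x = L;                          /* any state in [L, 2L)   */
        for (int i = n - 1; i >= 0; i--) {
            uint32_t f = freq[in[i]];
            while (x >= 2 * f) { push_bit(x & 1); x >>= 1; }  /* renormalize */
            x = enc_state[in[i]][x];             /* x was in [f, 2f)       */
        }
        return x;
    }

    /* Decoding runs FORWARDS, starting from the encoder's final state. */
    static void decode(uint32_t x, uint8_t *out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = dec_sym[x - L];
            uint32_t sub = dec_sub[x - L];       /* in [f, 2f)             */
            while (sub < L) sub = (sub << 1) | pop_bit();     /* refill    */
            x = sub;
        }
    }

    int main(void) {
        const uint8_t msg[] = {0,0,1,0,2,0,1,0,0,2,1,0};
        uint8_t out[sizeof msg];
        build_tables();
        uint32_t state = encode(msg, (int)sizeof msg);
        printf("bits used: %d, final state: %u\n", ntop, (unsigned)state);
        decode(state, out, (int)sizeof msg);
        printf("roundtrip %s\n", memcmp(msg, out, sizeof msg) ? "FAILED" : "ok");
        return 0;
    }

The decoder needs the symbol count and final state out of band (here they are just function arguments), which is another reason real formats carry a small frame header.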
There's another writeup here that goes into more depth about FSE itself (rather than Zstandard in general): http://fastcompression.blogspot.com/2013/12/finite-state-ent...
My intuition tells me that there's probably a homomorphism from Huffman coding to FSE that preserves the encoded size exactly, but I haven't done the math to check.
Do you happen to know of any write-ups of ANS/FSE that explain the way it works in a more accessible way? Jarek Duda's work in figuring all of this out is amazing, but it's currently not very easy to grok for mere mortals - even your link says so. Note that this is not a complaint about his output; you probably need to bend your brain into really weird positions with advanced maths to come up with this entropy coding solution in the first place. Plus he deserves a lot of credit for fighting to keep all of his work in the public domain and free of software patents.
Having said that, currently I only have a really superficial understanding of how it works, not even enough to experiment and implement something FSE-like on my own. And I'd like to change that.
> My intuition tells me that there's probably a homomorphism from Huffman coding to FSE that preserves the encoded size exactly, but I haven't done the math to check.
Well Huffman is "optimal" but limited to whole bits, so if it's possible to make a variant FSE encoder that doesn't use fractional bits but "rounds" to whole bits somehow then that probably ends up with the same compression size, no?
Have you gone into the details of Huffman coding and implemented a decoder and encoder?
I remember reading a description of Huffman out of a book, or maybe Wikipedia or somewhere else, and thinking I understood it. It’s conceptually easy. Then I wrote a Deflate decoder, and better understood what it means to use Huffman in practice—how you transmit the codebook, and how limiting the symbol size lets you make a simple and efficient decoder. Later, I wrote a Huffman encoder, and I realized that the “simple” problem of rounding all of the -log P symbol lengths to integer values is actually a tricky integer programming problem, which we only solve approximately. Making one symbol shorter means you make other symbols longer. The decoder doesn’t have to do this, because it gets a copy of the codebook from the encoder.
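To make that concrete, here is a tiny numeric illustration (hypothetical probabilities, not from any real encoder): rounding every -log2 P up always satisfies the Kraft inequality but wastes space, while rounding down can make a prefix-free code impossible, which is why the length assignment has to be solved more carefully.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double p[4] = {0.45, 0.30, 0.15, 0.10};
        double entropy = 0, kraft_up = 0, kraft_down = 0;
        for (int i = 0; i < 4; i++) {
            double ideal = -log2(p[i]);             /* ideal length in bits */
            entropy    += p[i] * ideal;
            kraft_up   += pow(2.0, -ceil(ideal));   /* round length up      */
            kraft_down += pow(2.0, -floor(ideal));  /* round length down    */
            printf("p=%.2f  ideal=%.2f bits  up=%.0f  down=%.0f\n",
                   p[i], ideal, ceil(ideal), floor(ideal));
        }
        printf("entropy             : %.3f bits/symbol\n", entropy);
        printf("Kraft sum, round up : %.3f  (<= 1: valid code, but wasteful)\n", kraft_up);
        printf("Kraft sum, round dn : %.3f  (> 1: no prefix-free code exists)\n", kraft_down);
        return 0;
    }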
From that point, I found it easier to tackle FSE through the blog series I linked. Writing your own implementation helps—the blog makes some comment about how FSE works, and if you’re following along with your own implementation, you’re (usually) on a collision course with the reasoning behind that design decision. Meanwhile, you can also peek at the source code for an FSE implementation and try to understand things from that angle at the same time.
Duda’s work is also not FSE, but ANS. My understanding is that Duda’s work on ANS was adapted and modified to create FSE. So if you read the paper on ANS and then try to make an FSE implementation, it might be a bit confusing.
I think it would be fun to try and write up a “demystifying FSE” article or series of articles, but my best estimate is that it would take at least a couple of months, if not half a year. I most recently reverse engineered an audio compression format called VADPCM, and the writeups for it are time-consuming.
I hear you on writing out explanations taking even longer than grokking the material yourself. But hopefully the path you describe will already help me find an angle to get into the material even if you never get around to it (would love to read such a series though).
I have only done Huffman encoding/decoding in "toy" contexts, although I do think I understand the theory. Also, I was not fully aware that ANS and FSE aren't quite the same thing; I thought the latter was "just" the finite-state machine formulation of the concepts described by the former. Perhaps trying to make sense of FSE first will be an easier entry point.
Thanks for taking the time to write out all of that advice, it's really appreciated!
I wouldn't describe it as a reformulation, but as a different solution to the same problem. The goal of entropy coders in general is to encode each symbol using -log P bits, where P is the probability of that symbol. Huffman, arithmetic encoding, range coding, and ANS / FSE are different systems that solve this problem.
Huffman coding works by using an integer number of bits for each symbol and then creating a prefix-free code. This is computationally cheap but wastes some space.
Arithmetic coding works using interval arithmetic. Each symbol corresponds to an interval, and you multiply each successive interval together, renormalizing after each step. This is close to optimal but computationally expensive.
ANS / FSE work by using a state machine. A particular symbol can be coded using a variable number of bits, but that number is always an integer; the average number of bits is close to -log P. Like Huffman coding, coding and decoding are fast--each symbol uses an integer number of bits, and you don't need to do any multiplication. Like arithmetic coding, it is close to optimal in terms of space used.
These techniques share one thing in common--you only need to record the symbol frequencies in order to recreate the codebook. In Huffman coding, you typically encode each symbol's code length (roughly -log2 of its probability). In FSE and arithmetic coding, you typically encode the symbol frequency as a dyadic rational. You use a deterministic process to recreate the exact same codebook given these frequencies.
(Then the question is, "How do you encode the symbol frequencies efficiently?" For Deflate, the answer is, funnily enough, that the symbol lengths for the Huffman codes are themselves encoded using Huffman codes! The codebook for that Huffman code is then coded using a fixed codebook.)
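For the curious, the "deterministic process" for Huffman in Deflate is the canonical code construction from RFC 1951: given only the code lengths, both sides rebuild identical codes. Here is a small stand-alone sketch of it (using the RFC's own example lengths), not Deflate's actual code:

    #include <stdio.h>

    #define MAX_BITS 15

    /* Rebuild canonical Huffman codes from code lengths alone
       (the procedure from RFC 1951, section 3.2.2). */
    static void canonical_codes(const int *lengths, int nsym, unsigned *codes) {
        int bl_count[MAX_BITS + 1] = {0};
        unsigned next_code[MAX_BITS + 1] = {0};
        unsigned code = 0;

        for (int i = 0; i < nsym; i++) bl_count[lengths[i]]++;
        bl_count[0] = 0;
        for (int bits = 1; bits <= MAX_BITS; bits++) {
            code = (code + bl_count[bits - 1]) << 1;
            next_code[bits] = code;
        }
        for (int i = 0; i < nsym; i++)
            if (lengths[i] != 0)
                codes[i] = next_code[lengths[i]]++;
    }

    int main(void) {
        /* The example from RFC 1951: symbols A..H with lengths 3,3,3,3,3,2,4,4 */
        int lengths[8] = {3, 3, 3, 3, 3, 2, 4, 4};
        unsigned codes[8];
        canonical_codes(lengths, 8, codes);
        for (int i = 0; i < 8; i++) {
            printf("%c (len %d): ", 'A' + i, lengths[i]);
            for (int b = lengths[i] - 1; b >= 0; b--)
                putchar('0' + ((codes[i] >> b) & 1));
            putchar('\n');
        }
        return 0;
    }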
First off, I love zstd and thanks for your work - I've used it multiple times with great success.
My question is...what's up with zstd compression levels? It seems to be impossible to find documentation on what the supported compression levels are, and then there are magic numbers that make things more confusing (I think last time I checked, 0 means 'default' which translates to level 3?)
My notes from when I was trying to track down the minimum compression level through multiple header files seemed to indicate it's MINCLEVEL... which appears to be -131072? But it starts talking about block sizes and target length, and I'm not entirely sure why those would relate to a compression level.
Thanks for the feedback! I've opened an issue to track this [0]
* Levels 1-19 are the "standard" compression levels.
* Levels 20-22 are the "ultra" levels which require --ultra to use on the CLI. They allocate a lot of memory and are very slow.
* Level 0 is the default compression level, which is 3.
* Levels < 0 are the "fast" compression levels. They achieve speed by turning off Huffman compression, and by "accelerating" compression by a factor. Level -1 has acceleration factor 1, -2 has acceleration factor 2, and so on. So the minimum supported negative compression level is -131072, since the maximum acceleration factor is our block size. But in practice, I wouldn't think a negative level lower than -10 or -20 would be all that useful.
We're still reserving the right to fiddle around with the meaning of our negative compression levels. We think that we may be able to offer more compression at the same speeds by completely changing our search strategies for very fast compression speeds. But, there is only so much time in the day, and we haven't had time to investigate it yet. So we don't want to lock ourselves into a particular scheme right now.
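For reference, here is a minimal sketch of how those levels look from the library side (one-shot API from zstd.h; ZSTD_minCLevel/ZSTD_maxCLevel report the supported range, and 0 selects the default):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zstd.h>

    int main(void) {
        const char *src = "some input worth compressing, repeated repeated repeated";
        size_t srcSize = strlen(src);
        size_t dstCap  = ZSTD_compressBound(srcSize);
        void  *dst     = malloc(dstCap);

        printf("supported levels: %d .. %d (0 = default)\n",
               ZSTD_minCLevel(), ZSTD_maxCLevel());

        for (int level = -5; level <= 19; level += 6) {
            size_t csize = ZSTD_compress(dst, dstCap, src, srcSize, level);
            if (ZSTD_isError(csize)) {
                fprintf(stderr, "level %d: %s\n", level, ZSTD_getErrorName(csize));
                break;
            }
            printf("level %3d -> %zu bytes (from %zu)\n", level, csize, srcSize);
        }
        free(dst);
        return 0;
    }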
This is exactly what I was hoping for! If you just copied and pasted this into the documentation directly, that'd be more than enough. Thanks for writing it out so clearly and creating the issue.
Ohh, don't mind if I do! I'm working on CPU libraries to improve compression performance. Does zstd benefit today from ISAs like AVX-512, AVX-2, etc.? Do you foresee the need in the future to offload compression/decompression to an accelerator?
Another Q, what hardware platform(s) do you use to test new versions of zstd for regressions/improvements? I see the Github page shows a consumer CPU in use, but what about server CPUs pushing maximum throughput with many threads/cores?
> Does zstd benefit today from ISAs like AVX-512, AVX-2, etc.?
Zstd benefits mostly from BMI(2); it takes advantage of shlx, shrx, and bsf during entropy (de)coding. We do use SSE2 instructions in our lazy match finder to filter matches based on a 1-byte hash, in a similar way to what F14 and Swiss hash tables do. We also use vector loads/stores to do our copies during decompression.
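In case it helps to picture that, here is a small illustrative sketch (my own, not zstd's match finder) of the Swiss/F14-style filtering trick: compare 16 one-byte tags against a probe tag with SSE2 and get back a bitmask of candidate slots worth a full check.

    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>

    /* Returns a 16-bit mask: bit i is set if tags[i] == probe_tag.
       Only positions with matching tags need the expensive full comparison. */
    static unsigned candidate_mask(const uint8_t tags[16], uint8_t probe_tag) {
        __m128i stored = _mm_loadu_si128((const __m128i *)tags);
        __m128i needle = _mm_set1_epi8((char)probe_tag);
        __m128i eq     = _mm_cmpeq_epi8(stored, needle);
        return (unsigned)_mm_movemask_epi8(eq);
    }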
We don't currently benefit from AVX2. There are strategies to use AVX2 for Huffman & FSE compression, but they don't fit well with zstd's format, so we don't use them there. Additionally, our latest Huffman decoder is very competitive with the AVX2 decoder on real data. And a Huffman format that is fast for AVX2 is often slow for scalar decoding, so it is hard to opt into it unless you know you will only run on CPUs that support AVX2. Lastly, AVX2 entropy decoders rely on a fast gather instruction, which isn't available on AMD machines.
> Do you foresee the need in the future to offload compression/decompression to an accelerator?
Yes, there are trends in this direction. Intel QAT supports zlib hardware acceleration. The PS5 has accelerated decompression. I'm pretty sure Microsoft has also released a paper about a HW accelerated algorithm. Compression & decompression take up a significant portion of datacenter CPU, so it makes sense to hardware accelerate it.
> what hardware platform(s) do you use to test new versions of zstd for regressions/improvements?
We mostly test and optimize for x86-64 CPUs, on a mix of server and consumer CPUs. But, we also test on ARM devices to make sure we don't have major regressions. We've found that optimizing for x86-64 has done a good job of getting good baseline performance on other architectures, though there is certainly room left to improve.
In bioinformatics there is a lot of usage of blocked gzip: concatenated, separately compressed chunks which can be indexed and decompressed independently.
See BAM format/samtools.
There is libzstd-seek[1] which implements one convention[2] and also ZRA[3] which implements its own thing (apparently?) and then there’s the splittability discussion[4]... Clearly people want seekability, clearly the Zstandard maintainers do not want to get locked into a suboptimal solution. But I have to join this question: I can haz zstd seeks plz?
Pardon my ignorance, but isn't this blocking something which could exist completely separately from / orthogonally to any particular compression algorithm? Is this a question of zstd supporting a blocked compressor, or of a blocked compressor supporting zstd?
The advantage of it being built into the compressor is that you have a standard format that works with the standard tools. If you wrap compressed blocks in a container format everyone using the result needs tools that understand the container layout.
For instance gzip with the --rsyncable option: it helps the rsync process and similar but anything that doesn't care can still unpack the compressed file with any old gzip decompression routine. There is a cost in that the dictionary resets result in a small reduction in compression rates for most content, but you've not introduced a new format.
The bioinformatics people noted above don't use this despite using gzip (instead using a container format for gzip packed blocks) because the result isn't directly seekable even though in theory you could start decompression at any point that the dictionary was reset. You could add an external index file that lists “output byte X is the start of a new cycle, at input byte Y” which would solve that without changing the gzip format itself, which might be how I'd have done things, but there are then issues of keeping the content and index file together, and verifying you have the right index, as data is passed around various places.
The zip format resets compression at file boundaries (each input file is compressed separately) and it keeps an index of where each file starts in the compressed output, hence you can seek at the file level and unpack individual files, but it doesn't index within each file, so you can't use this to seek around the compressed version of a single large file. Resetting compression at file boundaries is why zip is particularly inefficient when compressing many small files, one of the reasons .tar.gz is preferred for source code archives even with the --rsyncable option.
So if it were possible (and practical) with the zstd output format, an indexed version of what gzip --rsyncable does could be a good compromise for a standard format that is seekable where it needs to be but with minimal reduction in compression. Assuming of course that zstd becomes as ubiquitous as gzip, available by default in most places, otherwise not needing different tools to read the compressed files is a moot point and the reduced compression becomes just a cost not a compromise.
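To make the "index of reset points" idea concrete, here is a rough sketch (my own, not zstd's experimental seekable format): compress the input as independent zstd frames and keep a tiny (uncompressed offset, compressed offset) index, so a later reader only decompresses the frame covering the bytes it wants. Only the stable one-shot API from zstd.h is used; the chunk size and names are made up.

    #include <stddef.h>
    #include <zstd.h>

    #define CHUNK (256 * 1024)   /* arbitrary reset interval */

    typedef struct { size_t uoff, coff; } IndexEntry;

    /* Compress src as independent frames. dst must be able to hold
       ZSTD_compressBound(CHUNK) per chunk; index[] needs one entry per chunk.
       Returns the number of frames (0 on error) and sets *dstSize. */
    static size_t compress_indexed(const void *src, size_t srcSize,
                                   void *dst, size_t dstCap,
                                   IndexEntry *index, size_t *dstSize) {
        size_t nframes = 0, uoff = 0, coff = 0;
        while (uoff < srcSize) {
            size_t n = srcSize - uoff < CHUNK ? srcSize - uoff : CHUNK;
            size_t c = ZSTD_compress((char *)dst + coff, dstCap - coff,
                                     (const char *)src + uoff, n, 3);
            if (ZSTD_isError(c)) return 0;
            index[nframes].uoff = uoff;
            index[nframes].coff = coff;
            nframes++;
            uoff += n;
            coff += c;
        }
        *dstSize = coff;
        return nframes;
    }

    /* Decompress only the frame containing uncompressed offset `pos`. */
    static size_t read_frame_at(const void *cdata, size_t csize,
                                const IndexEntry *index, size_t nframes,
                                size_t pos, void *out, size_t outCap) {
        size_t f = 0;
        while (f + 1 < nframes && index[f + 1].uoff <= pos) f++;
        size_t cend = (f + 1 < nframes) ? index[f + 1].coff : csize;
        return ZSTD_decompress(out, outCap,
                               (const char *)cdata + index[f].coff,
                               cend - index[f].coff);
    }

The cost is the same as any dictionary-reset scheme: repetitions that span a chunk boundary can't be exploited, so the ratio drops a little as the chunk size shrinks.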
>The bioinformatics people noted above don't use this despite using gzip (instead using a container format for gzip packed blocks) because the result isn't directly seekable
Not sure if we are talking about exactly the same thing, but one of the points of using such blocks is really fast random access to a bunch of records located in some bzip-ed chunk.
This works really well if you fulfill some prerequisites (simplified):
1. the data to be compressed, usually tab/space delimited, must be sorted (typically by chromosome and position)
2. you create an index telling you that the data for chromosome such_and_such in an interval, say 2_000_000-2_100_000, is in chunk 1234, and that this chunk is at position 987654321 in the file.
As long as all you do is "give me all records and all columns from interval X", this works really fast. On the other hand it can be abused/pushed to imho absurd levels where data files with thousands of columns are queried in that way.
Once you've got giant block-compressed files, say tens or hundreds of GB, adding new data (that would be extra columns...) to such monstrosities becomes a real PITA.
> Not sure if we are talking about exactly the same thing
I think we are. The tweak that --rsyncable does to the compression stream reduces the effect of small changes early in the input but still only allows for forward motion through the data when processing. If an index of those reset points was kept it would act like the index you describe so you could random access in the compressed output with source block granularity and only have to decompress the blocks needed.
The original post I replied to mentioned gzip, though you mention bzip - not sure if the latter is a typo or you mean bzip2. bzip2 might make sense for that sort of data being compressed once and read many times (it is slower but with that sort of data likely to get far better compression), though it has no similar option to gzip's --rsyncable, but pbzip2's method of chunking may have the same effect.
Interesting, I've never heard of it before. I was aware that there were better standards than zip, however adoption was always an issue with licenses. If this is as open, is it purely an issue of wide adoption?
We're stuck in the C++ paradigm (I'm sure there is a better name), where everyone agrees this isn't "ideal" and that there is something better, but it's not widely adopted enough?
Unless you have to use deflate/zlib/gzip/zip for compatibility or you're very constrained in amount of memory, you should always prefer zstd over deflate - it's always both faster and compresses better, so there is no tradeoff and using deflate is just wasteful.
Google's brotli is a close competitor and got supported in Chrome first, so HTTP uses that more than it uses zstd.
zstd has a very wide range of compression levels to tradeoff compression ratio for CPU time, much wider than the difference between gzip -1, gzip -6 (default) and gzip -9. zstd -3 is default. zstd -1 is very fast, zstd -19 compresses really well.
zstd also has a long range mode, for inputs that repeat data far apart (say, an archive that contains the same file again a gigabyte later). gzip's window is only 32KB in size, so it can't notice any repetition farther away than that.
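If memory serves, the long range mode is exposed on the CLI as --long and in the library through the advanced parameter API; a rough sketch of enabling it, assuming the ZSTD_c_enableLongDistanceMatching and ZSTD_c_windowLog parameters:

    #include <zstd.h>

    /* Compress with long-distance matching over a 128 MB window (2^27). */
    static size_t compress_long(const void *src, size_t srcSize,
                                void *dst, size_t dstCap) {
        ZSTD_CCtx *cctx = ZSTD_createCCtx();
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3);
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 27);
        size_t csize = ZSTD_compress2(cctx, dst, dstCap, src, srcSize);
        ZSTD_freeCCtx(cctx);
        /* Note: a decoder may need its window limit raised (ZSTD_d_windowLogMax,
           or the CLI's --long/--memory flags) to accept windows this large. */
        return csize;
    }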
LZ4, by the same author as zstd, aims at even faster compression, at the cost of a worse compression ratio. It makes sense to use over a fast network. Google's Snappy and LZO are the usual competitors.
When you want better compression than zstd (at the cost of CPU time, both during compression and during decompression), use PPMD for natural language text or source code and use LZMA (xz) for other things.
This is more of a replacement for deflate / gzip. You can use zstandard inside a zip file using compression type 20, as of zip v6.3.7 (2020), but I suspect not much software will be able to decompress it.
Seen a number of gzip replacements go by, like .bz2 (smaller but requires lots of CPU), .lzma and .xz (strictly an improvement over .bz2), and some others. Zstandard is a fairly solid improvement over these for similar use cases so I think it will get adopted. There's also the "extreme" compression algorithms like .lz4 (extremely fast, don't care about compression ratio) or PPMD (extremely slow, care deeply about compression ratio). My sense is that a large number of projects migrated to .bz2 and then to .xz over the years, and maybe we'll see more .zstd cropping up.
> [going from xz/lzma to zstd] yields a total ~0.8% increase in package size on all of our packages combined, but the decompression time for all packages saw a ~1300% speedup.
Personally, I think the biggest factor for using zstd is that the compression settings don't matter for the decompression speed.
It doesn't matter if I used zstd-fast or zstd-19, the read speed is the same. Hence a lot of Linux distros have adopted it for packaging; they can spend a few days of CPU time compressing their packages at zstd-19 (Arch Linux does, IIRC), and the user can't tell that it's been compressed that hard.
Your questions are answered in the first sentence of the article.
> Zstd or Zstandard (RFC 8478, first released in 2015) is a popular modern compression algorithm. It’s smaller (better compression ratio) and faster than the ubiquitous Zlib/Deflate (RFC 1950 and RFC 1951, first released in 1995).
Is there some smaller version of this, suitable for embedding into programs?
$ size /usr/lib/i386-linux-gnu/libzstd.so.1.3.3
text data bss dec hex filename
508674 492 20 509186 7c502 /usr/lib/i386-linux-gnu/libzstd.so.1.3.3
Yikes; half a meg of code!
Say I just want something in my C program for saving certain files using less space, and only care about decompressing only those files. Is there some small implementation of a subset of Zstandard: say no more than four or five source files, and 64K of machine code?
If I add 500K to the program, I'm going to have to save 500K by compressing the accompanying files, just to begin breaking even.
For comparison, libz may not have the decompression speed or the ratio, but it's more than four times smaller:
$ size /lib/i386-linux-gnu/libz.so.1.2.11
text data bss dec hex filename
115537 588 4 116129 1c5a1 /lib/i386-linux-gnu/libz.so.1.2.11
This is more like it!
$ size /lib/i386-linux-gnu/libbz2.so.1.0.4
text data bss dec hex filename
60511 3520 4 64035 fa23 /lib/i386-linux-gnu/libbz2.so.1.0.4
Text 60511. Is there some configuration of Zstandard which can be crammed into comparable code space?
It's plausible that the lib you checked is the output from the project's default build target (zstd), which "(...) includes dictionary builder, benchmark, and supports decompression of legacy zstd formats"
The project also provides another build target, zstd-small, which is "CLI optimized for minimal size; no dictionary builder, no benchmark, and no support for legacy zstd formats"
Also, take a look at what exactly is bundled with the binary. Odds are you're looking at a lib that statically links optional stuff that makes no sense shipping, let alone on an embedded target.
I looked at the "nm --dynamic" output; I did see some dictionary APIs:
$ nm -D /usr/lib/i386-linux-gnu/libzstd.so.1.3.3 | grep -i -E 'init|create'
[...]
0000fa50 T ZSTD_createCDict
0000d3a0 T ZSTD_createCDict_advanced
0000fae0 T ZSTD_createCDict_byReference
0000d6f0 T ZSTD_createCStream
0000d6c0 T ZSTD_createCStream_advanced
00042890 T ZSTD_createDCtx
000427b0 T ZSTD_createDCtx_advanced
00046a10 T ZSTD_createDDict
00046960 T ZSTD_createDDict_advanced
00046a40 T ZSTD_createDDict_byReference
[...]
Longer term, we want to offer a stripped version of the library that includes the compression code, but only includes some of our compression levels. That way you can save code size for unused compression levels.
We've optimized pretty heavily in favor of speed over code size. But we want to offer better configurability, we just need to find the time to do it. We'd happily take PRs that go in this direction!
I'm not sure if it's significant, but the default zstd build contains legacy format support from 0.5 to 0.7 (which used different magic numbers). Setting `-DZSTD_LEGACY_SUPPORT=0` will completely disable legacy format support and might help you, especially given that you know what you're dealing with is never one of those legacy formats.
As an aside, I'm fairly disappointed that zstd didn't use a single byte header for uncompressible strings, so that they could guarantee that the compressed data will never be more than 1 byte larger than the source data. That property is very useful where lots of tiny strings are being compressed, such as in a database.
The first 4 bytes are the magic number and the last 4 bytes are the checksum [1] which you could always just chop off if you wanted (it's legal to omit the checksum, see the spec). That would get the total overhead down to 5 bytes.
Zstd has a 4-byte magic number, which is used to check if the data is zstd encoded. In addition to that, this example has a 2 byte frame header (including the decompressed size), a 3 byte block header, and a 4 byte checksum at the end (which can be disabled with `--no-check`).
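For anyone who wants to measure those header costs, here is a small sketch using the advanced API's knobs (library defaults: checksum off, content size on; the CLI enables the checksum by default):

    #include <stdio.h>
    #include <string.h>
    #include <zstd.h>

    int main(void) {
        const char *src = "hello";
        char dst[128];
        ZSTD_CCtx *cctx = ZSTD_createCCtx();

        size_t a = ZSTD_compress2(cctx, dst, sizeof dst, src, strlen(src));

        /* Add the optional 4-byte trailing checksum (what --no-check removes). */
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_checksumFlag, 1);
        size_t b = ZSTD_compress2(cctx, dst, sizeof dst, src, strlen(src));

        /* Also drop the decompressed-size field from the frame header. */
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_contentSizeFlag, 0);
        size_t c = ZSTD_compress2(cctx, dst, sizeof dst, src, strlen(src));

        printf("input %zu bytes -> %zu (default), %zu (+checksum), %zu (+checksum, -content size)\n",
               strlen(src), a, b, c);
        ZSTD_freeCCtx(cctx);
        return 0;
    }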
Zstd does have a mode that passes-through incompressible data, both for incompressible literals, and completely incompressible blocks (128 KB chunks).
For small inputs, we recommend using dictionary compression. But even with dictionary compression, because of our header costs, you won't generally see benefits until your data is at least ~50 bytes. But YMMV depending on your data.
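A rough sketch of that dictionary path for lots of tiny records, using ZDICT_trainFromBuffer from zdict.h plus ZSTD_compress_usingDict (the sample layout, dictionary budget, and level here are made up):

    #include <stdlib.h>
    #include <zdict.h>
    #include <zstd.h>

    /* Train a dictionary from concatenated samples, then compress one small
       record with it. `samples` holds the records back to back and
       `sampleSizes[i]` gives each record's length. */
    static size_t compress_with_dict(const void *samples, const size_t *sampleSizes,
                                     unsigned nbSamples,
                                     const void *record, size_t recordSize,
                                     void *dst, size_t dstCap) {
        size_t dictCap  = 16 * 1024;               /* arbitrary dictionary budget */
        void  *dict     = malloc(dictCap);
        size_t dictSize = ZDICT_trainFromBuffer(dict, dictCap,
                                                samples, sampleSizes, nbSamples);
        if (ZDICT_isError(dictSize)) { free(dict); return 0; }

        ZSTD_CCtx *cctx = ZSTD_createCCtx();
        size_t csize = ZSTD_compress_usingDict(cctx, dst, dstCap,
                                               record, recordSize,
                                               dict, dictSize, 3);
        ZSTD_freeCCtx(cctx);
        free(dict);
        return ZSTD_isError(csize) ? 0 : csize;
    }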
LZ followed by entropy coding seems to be the general strategy of choice for many compression algorithms, so this design of Zstandard was quite familiar to me. The FSE reminded me of the Q/MQ/QM-coder algorithm which is used in some image formats (and is surprisingly simple yet powerful, although relatively slow.)
Nonetheless, I like to learn a file format by studying a worked example at the bits and bytes level.
I did the same when I was playing around with various image and video codecs (JPEG, GIF, MPEG-1/2, etc.), and I agree that it definitely helps with understanding as well as debugging; you can get to the point of being able to decode, slowly, by just staring at a hexdump and "reading" the bits and bytes as if it was a new language.
My biggest gripe with zstd is that it is difficult to find a Windows program that can operate on zstd files. I had to manually download zstd and a bunch of dependencies from msys2 in order to decompress.