
Out of curiosity, I modified the benchmark program to simply toss the pixel data into lz4/lz4hc for comparison (patch file: https://paste.debian.net/1220663/).

I'm actually quite impressed that the resulting size is a bit smaller than lz4hc's (and actually not even that far from libpng's), while the encoding speed is quite close to lz4's, despite QOI seemingly not having as many hacks and tweaks under the hood to get there. So the encoder is IMO doing amazingly well.

However, the decoding speed trails liblz4's by roughly an order of magnitude. But again, liblz4 probably benefits from decades of tweaking & tuning to get there. It would be interesting to see how much performance could be gotten out of the qoi decoder.

Here are the totals I get for the "textures" benchmark samples:

          decode ms   encode ms   decode mpps   encode mpps   size kb
  libpng:       2.2        32.4         58.67          3.94       160
  stbi:         2.1        17.0         61.50          7.49       228
  qoi:          0.7         0.7        191.50        170.54       181
  lz4:          0.1         0.6       1226.06        206.40       258
  lz4hc:        0.1        70.9       1029.26          1.80       200
And for the "wallpaper" samples. The spreads seem to be a bit more in favor of qoi here:

          decode ms   encode ms   decode mpps   encode mpps   size kb
  libpng:     131.9      2287.1         66.63          3.84      8648
  stbi:       147.5      1177.1         59.55          7.46     12468
  qoi:         56.3        56.5        156.13        155.60      9981
  lz4:         14.3        53.1        614.53        165.50     18019
  lz4hc:       13.9      1901.7        630.94          4.62     12699



I believe (from quick code inspection) that the symmetry in encode/decode performance for QOI is because it has to generate the hash-table on decode too.

Normal fast encoders use some kind of hash table or search to find matches, but encode the offset of the match in the source rather than in the table. QOI is encoding the offset into the table, which gives much shorter offsets but means the decoder has to maintain the table too.
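The distinction can be sketched in a few lines. This is a hedged toy model, not the real QOI bitstream (which has more chunk types and tighter packing); it only illustrates why the decoder must maintain the same 64-entry index as the encoder when offsets point into the table rather than into the source:

```python
# Toy sketch of table-offset coding, QOI-style: a 64-entry index keyed by
# a pixel hash. Because INDEX chunks reference the table, the decoder has
# to rebuild exactly the same table, so decode does the same hashing work
# as encode -- hence the symmetric performance.

def pixel_hash(px):
    r, g, b, a = px
    # Multiplier-based hash as in the QOI spec; a toy choice here either way.
    return (r * 3 + g * 5 + b * 7 + a * 11) % 64

def encode(pixels):
    index = [(0, 0, 0, 0)] * 64
    out = []
    for px in pixels:
        h = pixel_hash(px)
        if index[h] == px:
            out.append(('INDEX', h))    # short offset into the table
        else:
            out.append(('RGBA', px))    # literal pixel
            index[h] = px
    return out

def decode(chunks):
    index = [(0, 0, 0, 0)] * 64
    pixels = []
    for tag, val in chunks:
        if tag == 'INDEX':
            px = index[val]
        else:
            px = val
            index[pixel_hash(px)] = px  # decoder must update the table too
        pixels.append(px)
    return pixels
```

An LZ-style encoder would instead emit a (distance, length) pair pointing back into already-decoded output, so its decoder is just a memcpy loop with no hashing at all.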

(The slow PPM compressors etc do maintain the same table in both encoder and decoder, and have symmetrical encode/decode performance too. See http://www.mattmahoney.net/dc/text.html)

It's not really in the spirit of QOI to add the complexity, but I'd imagine letting the encoder specify the hash table size would be only a small tweak, and encoding literals in blocks instead of pixel-by-pixel would improve the handling of input that QOI can't compress.
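The literal-block idea can be sketched quickly. This is a hypothetical extension, not part of QOI: consecutive incompressible pixels get grouped under one length prefix instead of each paying a per-pixel tag, which bounds the worst-case overhead on noise-like input:

```python
# Hedged sketch: coding literal *runs* instead of tagging each literal
# pixel individually. For incompressible input this costs one length
# field per run rather than one tag per pixel. Not part of QOI itself.

def encode_literal_runs(pixels, is_compressible):
    """is_compressible(px) decides whether px would get a short code."""
    out, run = [], []
    for px in pixels:
        if is_compressible(px):
            if run:
                out.append(('LITERAL_RUN', list(run)))
                run.clear()
            out.append(('SHORT', px))
        else:
            run.append(px)
            if len(run) == 255:          # cap run length at one byte
                out.append(('LITERAL_RUN', list(run)))
                run.clear()
    if run:
        out.append(('LITERAL_RUN', list(run)))
    return out
```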

I would be curious to know if a planar layout helps or hinders. Perhaps even see what QOI makes of YUV etc. And I want to see 'heat maps' of encoded images, showing which parts of them are cheap and which are expensive, and which block type gets used where. Yeah, going a bit beyond the spirit of QOI :D

And from looking at the code I'm a bit confused how 3-channel is handled for literals because the alpha still looks like it gets encoded. Would have to walk through the code to understand that a bit better. Is it going to deref off the end of the source RGB image? Etc.

(People interested in compression may be interested in the go-to forum for talking about it https://encode.su/threads/3753-QOI-(Quite-OK-Image-format)-l...)


Like the author said, this is completely unoptimized. The natural next step in optimization might be to profile and then SIMD-optimize the slow bits in compression and decompression. This would likely produce a significant speedup and may even bridge the gap with lz4.


The algorithm is extremely resistant to SIMD optimizations.

Every pixel uses a different encoding, 95% of the encodings rely on the value of the previous pixel, or the accumulated state of all previously processed pixels. The number of bytes per pixel and pixels per byte swing wildly.
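The loop-carried dependency is easy to see in a sketch. This is a simplified stand-in for the real encoder (chunk selection is approximated), but it shows why lane N can't start until lane N-1 is done: the byte count for each pixel depends on the previous pixel and on the running index table:

```python
# Sketch of the serial dependency that resists SIMD: each pixel's cost
# depends on the previous pixel and the accumulated index table, so the
# loop cannot be vectorized without redesigning the format.

def bytes_for_pixel(px, prev, index):
    h = (px[0] * 3 + px[1] * 5 + px[2] * 7 + px[3] * 11) % 64
    if px == prev:
        return 1                       # run byte (approximated as 1/pixel)
    if index[h] == px:
        return 1                       # index byte
    dr, dg, db = px[0] - prev[0], px[1] - prev[1], px[2] - prev[2]
    if all(-2 <= d <= 1 for d in (dr, dg, db)) and px[3] == prev[3]:
        return 1                       # small diff
    index[h] = px
    return 5                           # full literal (tag + RGBA)

def encoded_size(pixels):
    prev, index, total = (0, 0, 0, 255), [(0, 0, 0, 0)] * 64, 0
    for px in pixels:                  # inherently sequential
        total += bytes_for_pixel(px, prev, index)
        prev = px
    return total
```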

SIMD optimization would basically require redesigning it from scratch.


SIMD only gets you up to the width that your hardware platform supports, and every SIMD program has to be rewritten for each new width.

Two other immediate avenues are multithreading, which I think could be quite effective for this algorithm, or GLSL, on which I have no opinion.
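The multithreading route could be sketched as strip-parallel encoding. Note this is a hypothetical format change, since each strip resets the encoder state (a real QOI stream is stateful end-to-end); zlib stands in for a per-strip encoder here because lz4/QOI bindings aren't in the Python standard library:

```python
# Hedged sketch: parallel encoding by splitting the raw data into
# contiguous strips, each compressed independently. zlib is only a
# stand-in encoder; it releases the GIL, so threads actually overlap.
from concurrent.futures import ThreadPoolExecutor
import zlib

def encode_strips(raw, n_strips=4):
    size = (len(raw) + n_strips - 1) // n_strips
    strips = [raw[i:i + size] for i in range(0, len(raw), size)]
    with ThreadPoolExecutor(max_workers=n_strips) as pool:
        return list(pool.map(zlib.compress, strips))

def decode_strips(chunks):
    # Strips are independent, so decode could be parallelized the same way.
    return b''.join(zlib.decompress(c) for c in chunks)
```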


I believe libpng's default zlib compression level is 6. libpng may use CRC (or adler) checksums, which account for around 30% of the decode time. In my experience with zlib, the higher compression levels have faster decode times. My reasoning at the time (early '00s) was that higher compression tends to code longer runs of copying from the dictionary: more time in the copy loop, less in the input-and-decode loop.
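Both effects are easy to poke at with Python's zlib module. This is just an illustration, not a libpng profile: higher levels shrink the output, and a raw deflate stream (negative wbits) drops the zlib header and adler32 checksum, the kind of per-byte overhead that shows up in decode profiles:

```python
import zlib

data = b"some fairly repetitive pixel data " * 1000

# Higher compression level: smaller output (and typically longer
# dictionary copies on decode).
fast = zlib.compress(data, 1)
best = zlib.compress(data, 9)
assert len(best) <= len(fast)

# Raw deflate (negative wbits) omits the zlib wrapper and its adler32
# checksum entirely -- the checksum work is what the comment above
# estimates at ~30% of PNG decode time.
co = zlib.compressobj(level=6, wbits=-15)
raw = co.compress(data) + co.flush()
assert zlib.decompress(raw, wbits=-15) == data
```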


> liblz4 probably benefits from decades of tweaking & tuning to get there

Decade, singular. LZ4 appeared around 2011. But the real reason it's faster is that it's not doing nearly as much work as QOI. LZ4 is just about as asymmetric as you can get.


What about qoi followed by lz4?
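The composition is easy to prototype. This is a hedged toy: a QOI-ish reversible filter pass (per-byte delta, which turns smooth gradients into runs of small values) followed by a byte-oriented LZ pass, with zlib standing in for lz4 since lz4 isn't in the Python standard library:

```python
import zlib

def delta_filter(raw):
    # First pass: delta against the previous byte. Smooth gradients become
    # long runs of identical small deltas, easy prey for the LZ pass.
    prev, out = 0, bytearray()
    for b in raw:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)

def delta_unfilter(filtered):
    prev, out = 0, bytearray()
    for d in filtered:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)

def compress(raw):
    # zlib at level 1 as a stand-in for a fast LZ like lz4.
    return zlib.compress(delta_filter(raw), 1)

def decompress(blob):
    return delta_unfilter(zlib.decompress(blob))
```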


Any possibility you could repost the patch? Was it pulled because of a bug?



