
Out of curiosity, I modified the benchmark program to simply toss the pixel data into lz4/lz4hc for comparison (patch file: https://paste.debian.net/1220663/).

I'm actually quite impressed that the resulting size is a bit smaller than lz4hc's (and actually not even that far from libpng's), while the encoding speed is quite close to lz4's, despite QOI seemingly not having as many hacks and tweaks under the hood to get there. So the encoder is IMO doing amazingly well.

However, the decoding speed trails liblz4's by roughly an order of magnitude. But again, liblz4 probably benefits from decades of tweaking & tuning to get there. It would be interesting to see how much performance could be gotten out of the qoi decoder.

Here are the totals I get for the "textures" benchmark samples:

          decode ms   encode ms   decode mpps   encode mpps   size kb
  libpng:       2.2        32.4         58.67          3.94       160
  stbi:         2.1        17.0         61.50          7.49       228
  qoi:          0.7         0.7        191.50        170.54       181
  lz4:          0.1         0.6       1226.06        206.40       258
  lz4hc:        0.1        70.9       1029.26          1.80       200
And for the "wallpaper" samples. The spreads seem to be a bit more in favor of qoi here:

          decode ms   encode ms   decode mpps   encode mpps   size kb
  libpng:     131.9      2287.1         66.63          3.84      8648
  stbi:       147.5      1177.1         59.55          7.46     12468
  qoi:         56.3        56.5        156.13        155.60      9981
  lz4:         14.3        53.1        614.53        165.50     18019
  lz4hc:       13.9      1901.7        630.94          4.62     12699



I believe (from quick code inspection) that the symmetry in encode/decode performance for QOI is because it has to generate the hash-table on decode too.

Normal fast encoders use some kind of hash table or search to find matches, but encode the offset of the match in the source rather than in the table. QOI is encoding the offset into the table, which gives much shorter offsets but means the decoder has to maintain the table too.
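The distinction can be sketched in a few lines. This is a hedged toy model, not the real QOI bitstream (which has more chunk types and tighter packing); it only illustrates why the decoder must maintain the same 64-entry index as the encoder when offsets point into the table rather than into the source:

```python
# Toy sketch of table-offset coding, QOI-style: a 64-entry index keyed by
# a pixel hash. Because INDEX chunks reference the table, the decoder has
# to rebuild exactly the same table, so decode does the same hashing work
# as encode -- hence the symmetric performance.

def pixel_hash(px):
    r, g, b, a = px
    # Multiplier-based hash as in the QOI spec; a toy choice here either way.
    return (r * 3 + g * 5 + b * 7 + a * 11) % 64

def encode(pixels):
    index = [(0, 0, 0, 0)] * 64
    out = []
    for px in pixels:
        h = pixel_hash(px)
        if index[h] == px:
            out.append(('INDEX', h))    # short offset into the table
        else:
            out.append(('RGBA', px))    # literal pixel
            index[h] = px
    return out

def decode(chunks):
    index = [(0, 0, 0, 0)] * 64
    pixels = []
    for tag, val in chunks:
        if tag == 'INDEX':
            px = index[val]
        else:
            px = val
            index[pixel_hash(px)] = px  # decoder must update the table too
        pixels.append(px)
    return pixels
```

An LZ-style encoder would instead emit a (distance, length) pair pointing back into already-decoded output, so its decoder is just a memcpy loop with no hashing at all.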

(The slow PPM compressors etc do maintain the same table in both encoder and decoder, and have symmetrical encode/decode performance too. See http://www.mattmahoney.net/dc/text.html)

It's not really in the spirit of QOI to add the complexity, but I'd imagine letting the encoder specify the hash table size would be only a small tweak, and encoding literals in blocks instead of pixel-by-pixel would improve the handling of input that QOI can't compress.
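The literal-block idea can be sketched quickly. This is a hypothetical extension, not part of QOI: consecutive incompressible pixels get grouped under one length prefix instead of each paying a per-pixel tag, which bounds the worst-case overhead on noise-like input:

```python
# Hedged sketch: coding literal *runs* instead of tagging each literal
# pixel individually. For incompressible input this costs one length
# field per run rather than one tag per pixel. Not part of QOI itself.

def encode_literal_runs(pixels, is_compressible):
    """is_compressible(px) decides whether px would get a short code."""
    out, run = [], []
    for px in pixels:
        if is_compressible(px):
            if run:
                out.append(('LITERAL_RUN', list(run)))
                run.clear()
            out.append(('SHORT', px))
        else:
            run.append(px)
            if len(run) == 255:          # cap run length at one byte
                out.append(('LITERAL_RUN', list(run)))
                run.clear()
    if run:
        out.append(('LITERAL_RUN', list(run)))
    return out
```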

I would be curious to know if a planar layout helps or hinders. Perhaps even see what QOI makes of YUV etc. And I want to see 'heat maps' of encoded images, showing which parts of them are cheap and which are expensive, and which block type gets used where. Yeah, going a bit beyond the spirit of QOI :D

And from looking at the code I'm a bit confused how 3-channel is handled for literals because the alpha still looks like it gets encoded. Would have to walk through the code to understand that a bit better. Is it going to deref off the end of the source RGB image? Etc.

(People interested in compression may be interested in the go-to forum for talking about it https://encode.su/threads/3753-QOI-(Quite-OK-Image-format)-l...)


Like the author said, this is completely unoptimized. The natural next step in optimization might be to profile and then SIMD-optimize the slow bits in compression and decompression. This would likely produce a significant speedup and may even bridge the gap with lz4.


The algorithm is extremely resistant to SIMD optimizations.

Every pixel uses a different encoding, 95% of the encodings rely on the value of the previous pixel, or the accumulated state of all previously processed pixels. The number of bytes per pixel and pixels per byte swing wildly.
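The loop-carried dependency is easy to see in a sketch. This is a simplified stand-in for the real encoder (chunk selection is approximated), but it shows why lane N can't start until lane N-1 is done: the byte count for each pixel depends on the previous pixel and on the running index table:

```python
# Sketch of the serial dependency that resists SIMD: each pixel's cost
# depends on the previous pixel and the accumulated index table, so the
# loop cannot be vectorized without redesigning the format.

def bytes_for_pixel(px, prev, index):
    h = (px[0] * 3 + px[1] * 5 + px[2] * 7 + px[3] * 11) % 64
    if px == prev:
        return 1                       # run byte (approximated as 1/pixel)
    if index[h] == px:
        return 1                       # index byte
    dr, dg, db = px[0] - prev[0], px[1] - prev[1], px[2] - prev[2]
    if all(-2 <= d <= 1 for d in (dr, dg, db)) and px[3] == prev[3]:
        return 1                       # small diff
    index[h] = px
    return 5                           # full literal (tag + RGBA)

def encoded_size(pixels):
    prev, index, total = (0, 0, 0, 255), [(0, 0, 0, 0)] * 64, 0
    for px in pixels:                  # inherently sequential
        total += bytes_for_pixel(px, prev, index)
        prev = px
    return total
```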

SIMD optimization would basically require redesigning it from scratch.


SIMD only gets you up to the width that your hardware platform supports, and every SIMD program has to be rewritten for each new width.

Two other immediate avenues are multithreading, which I think could be quite effective for this algorithm, or GLSL, on which I have no opinion.
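The multithreading route could be sketched as strip-parallel encoding. Note this is a hypothetical format change, since each strip resets the encoder state (a real QOI stream is stateful end-to-end); zlib stands in for a per-strip encoder here because lz4/QOI bindings aren't in the Python standard library:

```python
# Hedged sketch: parallel encoding by splitting the raw data into
# contiguous strips, each compressed independently. zlib is only a
# stand-in encoder; it releases the GIL, so threads actually overlap.
from concurrent.futures import ThreadPoolExecutor
import zlib

def encode_strips(raw, n_strips=4):
    size = (len(raw) + n_strips - 1) // n_strips
    strips = [raw[i:i + size] for i in range(0, len(raw), size)]
    with ThreadPoolExecutor(max_workers=n_strips) as pool:
        return list(pool.map(zlib.compress, strips))

def decode_strips(chunks):
    # Strips are independent, so decode could be parallelized the same way.
    return b''.join(zlib.decompress(c) for c in chunks)
```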


I believe libpng's default zlib compression level is 6. libpng may use CRC (or adler) checksums, which account for around 30% of the decode time. In my experience with zlib, the higher compression levels have faster decode times. My reasoning at the time (early '00s) was that higher compression tends to code longer runs of copying from the dictionary: more time in the copy loop, less in the input-and-decode loop.
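Both effects are easy to poke at with Python's zlib module. This is just an illustration, not a libpng profile: higher levels shrink the output, and a raw deflate stream (negative wbits) drops the zlib header and adler32 checksum, the kind of per-byte overhead that shows up in decode profiles:

```python
import zlib

data = b"some fairly repetitive pixel data " * 1000

# Higher compression level: smaller output (and typically longer
# dictionary copies on decode).
fast = zlib.compress(data, 1)
best = zlib.compress(data, 9)
assert len(best) <= len(fast)

# Raw deflate (negative wbits) omits the zlib wrapper and its adler32
# checksum entirely -- the checksum work is what the comment above
# estimates at ~30% of PNG decode time.
co = zlib.compressobj(level=6, wbits=-15)
raw = co.compress(data) + co.flush()
assert zlib.decompress(raw, wbits=-15) == data
```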


> liblz4 probably benefits from decades of tweaking & tuning to get there

Decade, singular. LZ4 appeared around 2011. But the real reason it's faster is that it's not doing nearly as much work as QOI. LZ4 is just about as asymmetric as you can get.


What about qoi followed by lz4?
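The composition is easy to prototype. This is a hedged toy: a QOI-ish reversible filter pass (per-byte delta, which turns smooth gradients into runs of small values) followed by a byte-oriented LZ pass, with zlib standing in for lz4 since lz4 isn't in the Python standard library:

```python
import zlib

def delta_filter(raw):
    # First pass: delta against the previous byte. Smooth gradients become
    # long runs of identical small deltas, easy prey for the LZ pass.
    prev, out = 0, bytearray()
    for b in raw:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)

def delta_unfilter(filtered):
    prev, out = 0, bytearray()
    for d in filtered:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)

def compress(raw):
    # zlib at level 1 as a stand-in for a fast LZ like lz4.
    return zlib.compress(delta_filter(raw), 1)

def decompress(blob):
    return delta_unfilter(zlib.decompress(blob))
```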


Any possibility you could repost the patch? Was it pulled because of a bug?



