Quite OK Audio Format (phoboslab.org)
89 points by picture on April 29, 2023 | 43 comments



Interesting find: one of the implementations is written in “Ć”, which transpiles cleanly into multiple languages: https://github.com/pfusik/cito

Haxe comes to mind but this seems a lot simpler.


Lots of discussion from a couple years ago on the Ć Programming Language (although most of it is about the name):

https://news.ycombinator.com/item?id=28805825


Related:

QOA, the Quite OK Audio Format - https://news.ycombinator.com/item?id=34625573 - Feb 2023 (78 comments)


Feels like it gets posted every other week.


This is an interesting case because it looks like this is the entire list of submissions on the topic:

Quite Ok Audio Format Benchmark Results and File Format Specification - https://news.ycombinator.com/item?id=35729036 - April 2023 (0 comments)

QOA Benchmark Results and File Format Specification - https://news.ycombinator.com/item?id=35721419 - April 2023 (0 comments)

The Quite OK Audio Format [pdf] - https://news.ycombinator.com/item?id=35203976 - March 2023 (1 comment)

QOA, the Quite OK Audio Format - https://news.ycombinator.com/item?id=34625573 - Feb 2023 (78 comments)

A sequence like that, if you happen to see all of the posts, can make a topic feel over-represented, but I'd say this isn't really excessive, wouldn't you? Keep in mind that on HN, reposts are fine if a topic hasn't had significant attention in the last year or so (https://news.ycombinator.com/newsfaq.html).

In this case there was a major thread back in Feb but I decided not to treat the current post as a dupe because the two articles seemed to differ significantly.


Maybe it is being confused with the series of posts on the quite ok image format? We had a number of posts on that one.

(I don’t mind. The comments are always interesting.)


  static inline int qoa_clamp_s16(int v) {
      if ((unsigned int)(v + 32768) > 65535) {
      ...
  }
This code is wrong. It assumes int is at least 32 bits wide, which C doesn't guarantee. It should use long or int32_t instead.


That might be true in theory, but not relevant in practice unless you're on a really old or really esoteric CPU (one could extend that to: the C standard is important in theory, but usually not relevant in practice - the only thing that matters in the real world is how specific C compilers behave).

In the real world, for this type of code (audio code, most likely for games), only a very small number of CPUs (Intel and ARM) and C compilers (Clang, GCC and MSVC) matter, and on those, int is 32 bits wide.

PS: For new code I would also suggest using the "new" C99 fixed-width integer types though.
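
For example, a portable variant of the clamp quoted above could look like this (just a sketch; the elided body of the original function is assumed, and doing the addition in unsigned arithmetic also sidesteps signed-overflow UB for extreme inputs):

  #include <stdint.h>

  /* Sketch: same unsigned-compare trick, but written against C99
     fixed-width types so it doesn't depend on int being 32 bits wide. */
  static inline int32_t clamp_s16_portable(int32_t v) {
      /* (uint32_t)v + 32768 lands outside [0, 65535] exactly when
         v is outside [-32768, 32767], so one compare catches both ends */
      if ((uint32_t)v + 32768u > 65535u) {
          return (v < 0) ? -32768 : 32767;
      }
      return v;
  }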


Don't a lot of 8-bit micros still use 16-bit ints?


Might be (Arduino maybe?). AFAIK there's still quite a bit of C code written in the home computer retro scene, using specialized C compilers like sdcc or cc65, but IME this requires so much manual tweaking and so many platform-specific hints that it is very far away from 'standard C', and most common open source libraries wouldn't work even if the code were fully standard-compliant C, just because of memory constraints.


But what about the game boy!


I think you’re okay there, the Game Boy doesn’t have any sound output channel this could work with anyway :)



I was aware of this, but I don't think a 32 sample 4-bit wave RAM is quite good enough to make something representative :)


It's possible to play ~8kHz+ 4-bit PCM. Just need to update the wave RAM four times per frame using an hblank interrupt handler. Or write completely deterministic code that unpacks the audio and writes it out at the correct time intervals.


It's also odd that everything is big-endian, but then the official format is C structures (which will be LE on every PC and phone that would ever run this)

And in that C struct... there's `char` and also `uint8_t`. I can never remember if `char` is signed or not. It should be `int8_t` or `uint8_t`.

This reminds me of the Farbfeld image format. So simple that I'm content to just make my own little-endian ad-hoc implementation and ignore the official spec.


> I can never remember if `char` is signed or not.

char is effectively a third, distinct type alongside signed char and unsigned char. Only use it when signedness is inconsequential. This is addressed in C23 with char8_t, expressly to avoid the problems that happen when UTF-8 bytes end up as accidentally signed chars.


> I can never remember if `char` is signed or not.

Technically it's neither (e.g. 'char', 'signed char' and 'unsigned char' are three different types), but in practice this also usually doesn't matter (though you're more likely to be bitten by this specific detail than by int not being 32 bits).
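
A tiny illustration of that (the byte value here is arbitrary): whether the first number prints as negative is up to the implementation, since plain char's signedness is implementation-defined:

  #include <stdio.h>

  int main(void) {
      char c           = (char)0xE9;          /* plain char: signedness is implementation-defined */
      signed char sc   = (signed char)0xE9;   /* -23 on two's-complement targets */
      unsigned char uc = 0xE9;                /* always 233 */
      printf("%d %d %d\n", c, sc, uc);        /* "-23 -23 233" where char is signed, "233 -23 233" where it's unsigned */
      return 0;
  }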


> which will be LE on every PC and phone that would ever run this

Why would you want to run this on a modern high-end PC or phone? There's almost no point using a simple codec like this if you have the computing resources or dedicated hardware to use a "proper" codec.

The codec is probably a reasonable choice for a microcontroller where you do want to save memory but you barely have CPU cycles to process the audio data. Many microcontrollers are big endian.


For cross-platform 'gamey code' (e.g. game jam stuff or smaller indie games), having an audio file format which is very simple to load and decode in your own code without having to use the plethora of platform-specific audio decoder APIs, or having to integrate decoder libraries with often questionable API design choices (which is often a challenge of its own) has a lot of advantages.

Same arguments for using the Quite OK Image Format over things like PNG or JPEG for instance (although since stb_image.h this isn't such a big issue anymore).


Your post: "Why would you want to run this on a modern high end PC or phone?"

Another post in the same thread: "In the real world, for this type of code ..., only a very small number of CPUs (Intel and ARM) and C compilers (Clang, GCC and MSVC) matter ..."

These posts are strongly disagreeing with each other but neither of you directly replied to the other, so I'm not sure if you've seen it.

I'm replying to you because I think, between the two of you, you're the one that's wrong and the other poster is right. The original announcement doesn't mention microcontrollers at all and does mention computer games[1]. Further, their defense for the endianness decision was simply that they like it, rather than that they expect any target CPU to be big endian. I think GP is right--this will almost never be run on a big endian CPU.

Indeed, the start of this thread was a piece of code that won't work at all on smaller microcontrollers. I think it's likely this code has never been run on a microcontroller.

[1] https://phoboslab.org/log/2023/02/qoa-time-domain-audio-comp...


Sure, there is more diversity in microcontrollers and there are big-endian ones, but many more are, or can be configured as, little endian: ARM, RISC-V, AVR, MIPS, MSP430...


Yeah. Bit me not too long ago...


On what CPU and compiler did this happen if I may ask? (I guess it was either in the area of vintage computing, or embedded stuff?)


Embedded stuff. For example MSP430 has 16-bit ints.


See also, the Quite OK Image format: https://qoiformat.org/qoi-specification.pdf (discussed previously at https://news.ycombinator.com/item?id=34035024).


A lossy codec that has only a single bitrate/quality? How does it compare to other lossy codecs in terms of audio quality?

All values, including the slices, are big endian.

In a world where LE basically won, that's an astonishingly stupid decision --- especially if you care about tiny differences in performance.


I'm adapting QOA to the Dreamcast for streaming music. I've modified it to use 32-bit little endian values instead of 64-bit since (besides avoiding having to byteswap) the CPU in the Dreamcast is badly suited for 64-bit shifts.

To convert a "slice" from the 64-bit format (which has a 4-bit value in the highest bits and 20 3-bit values below it, read from high to low), I treated it as two separate 32-bit values. The bottom two bits of each are combined to get the 4-bit value, and the rest of each word contains ten 3-bit values organized low to high.

So this:

  sample = (slice >> 57) & 0x7;  //Get sample
  slice <<= 3;  //Advance to next sample (64-bit shift)
Became this:

  sample = slice & 0x7;  //Get sample
  slice >>= 3;  //Advance to next sample (32-bit shift)
I think the conversion to 32-bit little endian was something like an 8-10% speed up, with no other changes? (This was a few months ago, and I didn't write down what the exact difference was.) The overhead of big-endian conversion and 64-bit shifts would become even more noticeable after further optimizations. (Replacing qoa_lms_predict with inline asm to use an integer multiply-accumulate instruction that GCC can't generate got something like another 20-25% speedup. Properly written asm would be even faster.)
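
A rough sketch of what unpacking that 32-bit layout might look like (this assumes the two scale-factor bits sit in the low bits of each word and are shifted out before reading residuals, matching the description above; names are illustrative, not the actual Dreamcast port):

  #include <stdint.h>

  /* Hypothetical layout: each QOA slice stored as two little-endian
     uint32_t words. The low 2 bits of each word hold half of the 4-bit
     scale factor; the upper 30 bits hold ten 3-bit residuals, low to high. */
  static void unpack_slice_le32(const uint32_t w[2], int *scalefactor, int residuals[20]) {
      *scalefactor = (int)((w[0] & 0x3u) | ((w[1] & 0x3u) << 2));
      for (int half = 0; half < 2; half++) {
          uint32_t bits = w[half] >> 2;                      /* drop the scale-factor bits */
          for (int i = 0; i < 10; i++) {
              residuals[half * 10 + i] = (int)(bits & 0x7u); /* get sample */
              bits >>= 3;                                    /* advance to next sample (32-bit shift) */
          }
      }
  }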


BE is still common in very resource-constrained applications like microcontrollers. You know, applications where having to do an endianness swap actually has a significant impact? Applications where it might actually make sense to use this kind of codec?

I was really thankful that the Doom WAD file was big endian (side effect of being developed on NeXT?) when I ported Doom to an nRF microcontroller.


Isn't endianness either little or selectable on most (if not all) microcontrollers these days? For example https://infocenter.nordicsemi.com/pdf/nRF52832_PS_v1.3.pdf is little endian.

Considering that little endian is objectively better for more types of data structures than big endian is [1], one would expect things to eventually converge on little endian (barring the entrenched big endian network protocols that hold everything back). Building for the eventuality of little endian would make more sense in a modern format.

[1] https://www.technicalsourcery.net/posts/on-endianness/


If you're exporting assets for a game, and the swap would be significant, shouldn't it really be generating big or little endian depending on the target? Many audio engines do this, using e.g. RIFX on big endian targets and RIFF on little.


I believe that in the case of Doom, and many other games that released their code as open source, the game assets are not in the public domain and remain closed.


I would argue Big Endian won in network protocols and file formats.


For network protocols, that's true only at the lower layers, and largely due to historical reasons.

The majority of file formats I've worked with are LE.


Looks like this fixes the issue raised in the last discussion [1] by no longer re-quantizing the decoder state at the start of every frame. I still think it's not worth the extra complexity over a simpler fixed set of second order linear predictors like BRR or ADX, though that clocks in at over 4 bits per sample. Might still be ok with 3 and more predictors?

Regarding clamping, might it be better to require the encoder to never generate QOA that would overflow? I've seen at least one ADPCM implementation (GameCube "AFC") that takes this approach. Seems like it wouldn't produce more distortion than clipping would.

[1] https://news.ycombinator.com/item?id=34630543
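
For reference, a fixed second-order linear predictor of the kind mentioned is tiny on its own; this sketch uses made-up coefficients rather than BRR's or ADX's actual filter tables:

  #include <stdint.h>

  typedef struct { int prev1, prev2; } lpc2_state;

  /* Sketch: predict each sample from the previous two with fixed
     coefficients (expressed here in 1/64ths, purely illustrative),
     add the dequantized residual, then clamp to the 16-bit range. */
  static int16_t lpc2_decode_sample(lpc2_state *st, int residual) {
      int prediction = (115 * st->prev1 - 52 * st->prev2) >> 6;
      int sample = prediction + residual;
      if (sample < -32768) sample = -32768;
      if (sample >  32767) sample =  32767;
      st->prev2 = st->prev1;
      st->prev1 = sample;
      return (int16_t)sample;
  }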


Semi-relatedly, is there anyone working on hardware encoders or decoders for audio codecs that are plausibly patent-free (e.g. mp3, vorbis, flac, opus)?

AIUI decoding for some of these formats can be a battery drain, particularly for devices with tiny batteries like ear buds, watches, ipod nanos.


Why is this approach better than adding an extension with the same compression scheme to AIFF?


I guess because the implementations aren't intending to support a lot of different input codecs, so there's no benefits to the flexibility of AIFF/AIFC. Also avoids having to deal with an 80-bit extended precision float just for the sample rate.
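
For illustration, decoding that 80-bit extended sample-rate field into a double is small but fiddly; a sketch, assuming the usual big-endian layout of a 2-byte sign/exponent followed by an 8-byte mantissa with an explicit integer bit:

  #include <stdint.h>
  #include <math.h>

  /* Sketch: convert the 10-byte big-endian "extended" field from an AIFF
     COMM chunk to a double. Exponent bias is 16383; the mantissa's integer
     bit is explicit, so the value is mantissa * 2^(exponent - 16383 - 63). */
  static double aiff_extended_to_double(const uint8_t b[10]) {
      int sign = b[0] >> 7;
      int exponent = ((b[0] & 0x7F) << 8) | b[1];
      uint64_t mantissa = 0;
      for (int i = 2; i < 10; i++) {
          mantissa = (mantissa << 8) | b[i];
      }
      if (mantissa == 0) {
          return 0.0;
      }
      double value = ldexp((double)mantissa, exponent - 16383 - 63);
      return sign ? -value : value;
  }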


Hijacking this comment in case you (or someone from HN) knows.

Is there any particular reason that AIFF uses an 80-bit extended float precision for the sample rate, instead of a 32-bit integer like WAVE?

I assume this was an oversight at the time, but I'm curious if there are actual applications where sub-Hz (or 2^32Hz) precision is needed.


How well does it handle less bit depth? (like 8 bit)

What about lower sample frequency? (such as 28kHz)

If it could scale down, it'd be extremely useful for new productions for "retro" platforms.


If I had to guess, the codec isn't meant to handle this well. To me this looks like a codec that wins because it is optimized for one specific use case (when you need "quite ok" audio on a modern-day platform).

It could be that it handles slower sampling rates well, but I am not sure about the bits.


From a glance at the 1-page PDF, it seems the sample frequency is variable, but the depth is hardcoded to 16-bit.


I came here to say the extension should be .QOK and pronounced “kwok” and that’s that.



