Let's Build an MP3 Decoder (bjrn.se)
201 points by petercooper on Feb 5, 2012 | 24 comments



In the same genre and likewise in Haskell, I recommend Exploring JPEG:

http://www.imperialviolet.org/binary/jpeg/

On the decoder side, for someone with basic familiarity with compression and signal processing, the obstacles to understanding MP3 are the polyphase filter banks and the modified discrete cosine transform (MDCT), which are not explained in much depth in this article. Both take advantage of alias cancellation properties (in the frequency domain for the filter bank and in the temporal domain for the MDCT) that are, in my opinion, pretty subtle. The rest of the components in an MP3 decoder have close analogues in image compression (e.g. joint stereo is a decorrelation transform analogous to going from RGB to a luminance-chrominance color space), so they shouldn't be difficult to grasp.
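
To make the joint stereo analogy concrete, here is a rough Haskell sketch of a mid/side transform (my own toy version, not taken from the article; the sqrt 2 scaling is just one common convention):

    -- Mid/side stereo as a decorrelation transform: the audio analogue
    -- of going from RGB to a luma/chroma color space. Perfectly invertible.
    toMidSide :: (Double, Double) -> (Double, Double)
    toMidSide (l, r) = ((l + r) / sqrt 2, (l - r) / sqrt 2)

    fromMidSide :: (Double, Double) -> (Double, Double)
    fromMidSide (m, s) = ((m + s) / sqrt 2, (m - s) / sqrt 2)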

For those trickier signal processing parts, Pan's IEEE article A Tutorial on MPEG/Audio Compression is a great place to start:

http://www.ee.columbia.edu/~dpwe/e6820/papers/Pan95-mpega.pd...


Wow, that takes me back. One of the first "serious" programs I wrote was an MP3 decoder in Pascal. It took about 30 minutes per minute of audio to decode on my P90. I felt like king of the whole world.


Yeah, the good thing is that you can do it in essentially any language and it will be fast enough for real time today.

At least as long as you design it reasonably well.


Bubble sort!


Javascript is the new bubble sort?


Great read! (although I admit I'm not completely done with it)

I think the whole process of taking something audible and "real" (like sound) and digitizing it is simply amazing.


I don't think the "A Haskell tutorial" part of the submission title is accurate. It's an mp3 tutorial... which happens to use Haskell. Not everything which uses Haskell is specifically about doing so. :-)


Was that actually in the submission title here at HN? I didn't put that into it and it's not there now, so if it was.. we know an admin was fidgeting with it! :-)


What does it mean when a codec is "lossless"?


This is in no way rigorous, but hopefully the analogy helps.

Imagine you want to encode something like y = 2 * sin(2 * x) + .1 * sin(10 * x) to save space.

http://www.wolframalpha.com/input/?i=sin%28x%29+%2B+.1*sin%2...

That is composed of two different sine waves added together - one low-frequency, large-amplitude and one high-frequency, low-amplitude. That can be encoded two different ways - lossy or lossless.

A lossy encoder gets rid of unimportant data and keeps a "close enough" representation of the original. In this example, .1 * sin(10 * x) is a minor component of the overall signal (plot them separately if you need to compare the difference), so the encoder chooses to delete it and only save 2 * sin(2 * x). In a real-world sound, this would be like throwing away noise that is so high pitched we can't hear it, or so quiet we can't detect it. A real encoder has to decide what the "small enough that it can be deleted" threshold is, and it's rarely black and white.

A lossless encoder looks at that signal and thinks "there has got to be a more efficient way to store that data". They are both sine functions, so that doesn't need to be stored twice. "x" is completely unnecessary, because the decoder knows the sine functions depend on some variable. So, it could write out something like "sin (2,2) (.1,10)". All of the original data is still there, as long as whoever receives the data (the decoder) knows how to interpret it.
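
A toy Haskell sketch of that analogy (the representation is made up purely for illustration, nothing like what a real codec stores):

    -- A "signal" as a list of (amplitude, frequency) sine components.
    type Component = (Double, Double)

    signal :: [Component]
    signal = [(2, 2), (0.1, 10)]   -- 2*sin(2x) + .1*sin(10x)

    -- Lossy: drop components too small to matter (made-up threshold).
    encodeLossy :: Double -> [Component] -> [Component]
    encodeLossy threshold = filter (\(amp, _) -> abs amp >= threshold)

    -- Lossless: keep every component, just stored compactly; decoding is exact.
    decode :: [Component] -> Double -> Double
    decode comps x = sum [amp * sin (freq * x) | (amp, freq) <- comps]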


Note: A Fourier transform is never lossless - it's an approximation (usually a fairly good one though).


Not necessarily true. If you have a particular bitstream that approximates a waveform, and you use a Fourier transform to generate an approximation with sufficient accuracy that it reproduces the same bitstream, you have something lossless.

Also, even if you don't go that far, many lossless algorithms operate by first providing an approximation and then including the (usually small) deltas between the approximation and the actual bitstream.
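
A minimal Haskell sketch of that "approximate, then store the deltas" idea, with a trivial previous-sample predictor (real lossless audio codecs use much better predictors):

    -- Encode each sample as its difference from the previous one.
    -- The residuals are usually small and compress well, yet the
    -- decoder reconstructs the original samples exactly.
    encodeResiduals :: [Int] -> [Int]
    encodeResiduals xs = zipWith (-) xs (0 : xs)

    decodeResiduals :: [Int] -> [Int]
    decodeResiduals = tail . scanl (+) 0

    -- decodeResiduals (encodeResiduals [5, 7, 6]) == [5, 7, 6]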


The Fourier transform itself is an exact mathematical transformation with an exact inverse transform. There's nothing lossy about it.
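
For example, a naive DFT / inverse-DFT pair in Haskell round-trips its input (up to floating-point rounding, which is a separate, practical issue):

    import Data.Complex

    dft :: [Complex Double] -> [Complex Double]
    dft xs = [ sum [ x * cis (-2 * pi * fromIntegral (j * k) / n) | (x, j) <- zip xs [0 ..] ]
             | k <- [0 .. length xs - 1] ]
      where n = fromIntegral (length xs)

    idft :: [Complex Double] -> [Complex Double]
    idft ys = [ sum [ y * cis (2 * pi * fromIntegral (j * k) / n) | (y, j) <- zip ys [0 ..] ] / fromIntegral (length ys)
              | k <- [0 .. length ys - 1] ]
      where n = fromIntegral (length ys)

    -- idft (dft [1, 2, 3, 4]) is [1, 2, 3, 4] up to rounding error.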


Mathematically, you are correct. Practically, you are not, if you consider the source of the transform.


What do you mean? If the original source is analog and sampled below its Nyquist rate in the analog-to-digital conversion, the process is indeed irreversible. But that all happens before any transforms from the time domain to the frequency domain are in play, so it's a separate issue.

Beyond that, discrete Fourier and cosine transforms as usually implemented are not fully reversible because of loss in precision. A colleague of mine blogged about the issue in the context of Haar transforms a few years ago: http://cbloomrants.blogspot.com/2008/09/09-08-08-1.html. By decomposing an orthogonal transform into shears as explained by Charles, you can design reversible fixed-precision variants of the DCT like binDCT: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.8...
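
The lifting trick is small enough to sketch in Haskell; this is an integer-reversible Haar (S-transform) step, my own toy version rather than binDCT itself:

    -- One reversible Haar step built from lifting steps.
    -- `div` rounds toward negative infinity, and the inverse undoes
    -- exactly the same rounding, so no precision is lost.
    haarForward :: Int -> Int -> (Int, Int)
    haarForward a b = (low, high)
      where high = a - b
            low  = b + high `div` 2

    haarInverse :: (Int, Int) -> (Int, Int)
    haarInverse (low, high) = (a, b)
      where b = low - high `div` 2
            a = b + high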


Sorry, it is a separate issue - I should have been clearer.

The original point the OP made was a bad example. Unfortunately I made a poor attempt at explaining that.


That's the principle, but not how it really works. As you said, it's only an analogy, but one that comes close to sounding like the real thing and is as such potentially confusing.

Audio codecs don't try to find functions that describe waves, they sample the waves.

What you get from a microphone or an electric guitar is a changing voltage, a wave of some sort. To digitize that wave (i.e. to save it using numbers and not vinyl discs or magnetic tape) you take samples (i.e. measure the voltage) at regular intervals. You typically might do that 48k times or 96k times per second (i.e. have a sampling frequency of 48kHz or 96kHz).

What you do then with those samples is quantize them, i.e. each sample gets an integer depending on how high or low the voltage is, for example a 16-bit number.

That's how an analog-digital converter works on a basic level (you also need a filter to get rid of all the frequencies you don't want to digitize). This process is of course lossy. It's not possible to reconstruct the wave that was digitized perfectly, it's only an approximation. All analog-digital converters are lossy (in some sense of the word). If you sample at 48kHz, all frequencies higher than half of that (24kHz) are irrevocably lost (look up the Nyquist-Shannon sampling theorem if you want to know more). Quantization also means that you have to round the voltage you are measuring up or down; the audible effect this has on the sound is that it adds noise.
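
To make the quantization step concrete, a toy Haskell version (mapping a sample in [-1, 1] to a signed 16-bit value; the rounding here is exactly the irreversible part that shows up as noise):

    quantize16 :: Double -> Int
    quantize16 x = round (max (-1) (min 1 x) * 32767)   -- clamp, then round

    dequantize16 :: Int -> Double
    dequantize16 n = fromIntegral n / 32767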

The parameters are, however, usually chosen in such a way that a human ear can't tell the difference. A healthy and young ear can hear frequencies up to about 20kHz; that's why the music on CDs is sampled at 44.1kHz (twice 20kHz is 40kHz, plus some safety margin).

But that's only the conversion and it has not much to do with lossy codecs.

A lossless codec takes the result of this digitization (let's say 48k 16Bit numbers for a one second sound) and encodes them in such a way that those exact same 48k 16Bit numbers can be reconstructed after decoding.

Re-encoding a sound file with a lossless codec can consequently still be lossy: if you change the parameters (e.g. you halve the sampling rate) you are still losing information - but it's at least possible to preserve the result of the original digitization perfectly.

A lossy codec doesn't aim for perfect reconstruction of all the numbers. It uses quirks in human hearing to be more or less precise about how those numbers are reconstructed. For example: A really loud sound masks quieter sounds. It's not possible for humans to hear that quieter sound, so why bother with all that precision for that quiet sound? That's where, for example, MP3 can dynamically decrease the quality. It's quite ingenious, really.
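
A very rough sketch of that masking idea in Haskell (the thresholds and step size are made up; real MP3 bit allocation is far more elaborate):

    -- Spend less precision on frequency bands that are much quieter
    -- than the loudest band, and drop bands that are fully masked.
    -- Assumes a non-empty list of band magnitudes.
    quantizeBands :: [Double] -> [Int]
    quantizeBands bands = map quant bands
      where loudest = maximum (map abs bands)
            step    = loudest / 127                      -- made-up step size
            quant x
              | abs x < 0.01 * loudest = 0               -- inaudible next to the loudest band
              | otherwise              = round (x / step)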

It is of course inevitable that during that process information is lost. It's not possible to get those original 48k numbers back, only something that sounds more or less like them to a human ear.


> that's why the music on CDs is sampled at 44.1kHz (40kHz with some safety margin)

A nice explanation of the 44.1kHz is here: http://www.cs.columbia.edu/~hgs/audio/44.1.html

Initially, digital audio was recorded onto video recorders, and as such the sample rates were constrained by the video formats commonly in use and the playback devices that came with them.


It means that there's no loss of fidelity. If you convert raw audio (a WAV for instance) to MP3, vorbis, or other lossy formats and then convert it back, you'll end up with different (lower fidelity) data. Doing the same with FLAC or another lossless format gives you the exact same data back.

In the image space, PNGs are likewise lossless, whereas JPEG causes a loss of fidelity.


Another word for lossless is "reversible", i.e. you can get back out what you put in, which isn't the case with mp3, but is with zip or flac. If you don't worry about it being bit exact you can make it a fair bit smaller though, particularly if any of the original data is difficult/impossible to hear.


Think of it more like RAR or ZIP compression: the file size is reduced, but no data is lost.

Unlike MP3, which uses a number of lossy compression techniques, including psychoacoustics, that are specifically designed to reduce the bitrate of an audio stream with minimal artifacts.


Lossless: zip/unzip.

Lossy: jpeg.


Or:

Lossless: PNG

Lossy: JPEG


Using this codec would probably be illegal in some countries. I guess it is all right to play with it, but don't incorporate it in your applications.



