Totally awesome presentation! I've never seen anyone describe this stuff this cleanly (even though I studied it in college).
Plus, I love the audio work on that video. The positioning of his voice to tell you what he is doing while you don't see him is pure genius. Also, note how you can hear he is facing away from the microphone even if you can't see him. Great stuff. I wish more videos would use effects like these.
> The positioning of his voice to tell you what he is doing while you don't see him is pure genius.
I'm told that this is pure accident based on using a stereo mic and having the input-side equipment be on the left and the output on the right, but I thought it was a nice touch too. :)
It was not exactly an accident, though the image was a bit 'wider' than I'd have preferred. Tradeoffs.
I don't like wearing lavalier mics for this sort of thing (an extra layer of complexity, and they have to be headsets to keep vocal amplitude from changing), and the stereo image reduces the apparent echoey-ness of the space. The main goal was to sound better than the last vid :-)
Could you perhaps make a quick write-up about the making of the video?
I was more impressed with the fact that it was produced using all F/OSS software than anything. Other than the type on the title cards looking goofy, the production value is higher than anything I've ever seen come out of a non-fruity unix box.
FCP is one of the very last reasons I still have a mac...
Only a few things have changed since then. I've done some more work within Cinelerra (I wrote a new resampler and a new color grading filter for it this time around). I'm also using a pair of matched suspension microphones (AT853a Unipoints) instead of the Crown boundary piezos, which I've decided just aren't suited to anything I do.
Cinelerra is not for the faint of heart, and I don't casually recommend it. It is very powerful, rather unfriendly, and picky about the formats it uses. However, it will do things none of the other FOSS editors can.
This is the clearest presentation I have ever seen about digital audio in general, even if it's not trying to be that general.
It really is peculiar how many otherwise technically minded people with experience in DSP and/or audio somehow think there is something "magical" beyond what can easily be proved, as this video demonstrates. I don't doubt that there will still be those who contest the results, regardless.
I did especially enjoy the use of analog test equipment, in an effort to put to rest any conspiracy theory that the talk was flawed by being digital. This must have taken a lot of time to put together, but it's certainly worth it. I'm going to drop this URL into a lot of debates in the future.
An entire signal processing tutorial in 25 minutes. This guy is a master of explaining AND of using tech, so the combination is very powerful. You will probably learn more useful stuff from this video than from taking an entire signals and systems ugrad course.
As a layperson there were a couple of things I didn't quite get.
I don't 100% get the band limited signal bit. How does band limiting imply that there's only a single possible reconstruction of the digital signal? I can kind of picture the Fourier transform meaning that there's only one representation, but it's a bit of a leap for me. Can the converter itself not create the signal incorrectly?
Isn't there also an argument that frequencies above 19khz can be heard by some people so need to be accurately represented?
I also don't understand how you can necessarily say that because it's OK for a sine wave, it's OK for all waves. With the square wave, the output is a sort of digital approximation, right? Presumably that's a perceptible difference from the original analogue version? This video was focusing on differences in digital processing, so maybe that's just not relevant to the discussion.
Would be interesting to see a complex analogue signal being fed through the digital pipeline and then inverted against the original to see the differences. Again, maybe that's a known difference and not really part of the argument here.
Just holes in my understanding really - credit to this video though, it's forced me to actually have to think about this :)
> I don't 100% get the band limited signal bit. How does band limiting imply that there's only a single possible reconstruction of the digital signal? I can kind of picture the Fourier transform meaning that there's only one representation, but it's a bit of a leap for me. Can the converter itself not create the signal incorrectly?
Band limiting means that there's a maximum frequency. You can think of it as a limit on how fast the signal can change. There's only one possible way to draw a curve that goes through all the points without changing too quickly.
It's something like graphs of polynomials. There's only one possible line (first-degree polynomial) going through any two points. Three points completely determine a second-degree polynomial (parabola), etc. There are an infinite number of parabolas you can draw through two points, just as there are an infinite number of squiggles you could draw through the digital sample points. But you can't create a different solution without adding higher-order terms to the polynomial/higher frequencies to the signal. The number of points you have is sufficient to completely specify the curve.
Though there's only one theoretical solution, the converter can still "create it incorrectly".
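If it helps to see it concretely, here's a minimal numpy sketch of the idea (the sample rate, signal, and sample count are arbitrary choices for illustration): sample a band-limited signal, then rebuild the curve as a sum of sincs, which is the unique band-limited interpolant through those points.

    import numpy as np

    fs = 100.0                      # sample rate in Hz (arbitrary)
    T = 1.0 / fs
    n = np.arange(64)               # sample indices

    def x(t):
        # band-limited test signal: 5 Hz + 23 Hz, both well below fs/2 = 50 Hz
        return np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 23 * t)

    samples = x(n * T)

    # Whittaker-Shannon: the one band-limited curve through the samples is a
    # sum of sinc functions, one centered on each sample point.
    t = np.linspace(10 * T, 53 * T, 1000)   # stay away from the edges
    recon = np.array([np.dot(samples, np.sinc((ti - n * T) / T)) for ti in t])

    print(np.max(np.abs(recon - x(t))))     # small; shrinks with more samples

Draw any other curve through the same points and you've necessarily added content above fs/2.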
> Band limiting means that there's a maximum frequency. You can think of it as a limit on how fast the signal can change. There's only one possible way to draw a curve that goes through all the points without changing too quickly.
This was incredibly helpful for my understanding. This is the one corner of this story that was always fuzzy for me. In the video he seemed to be making this point when he drew silly stuff all over the screen but it didn't quite click. Thank you.
It's also the same deal as the "wagon wheel" effect in video. At 30fps, a wheel rotating at 1800rpm (30rps) will appear to be standing still, and ones at, say, 2400rpm and 600rpm will appear to be rotating at the same speed.
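The arithmetic is easy to check; a quick Python sketch using those same numbers:

    fps = 30.0
    for rpm in (1800, 2400, 600):
        turns_per_frame = (rpm / 60.0) / fps   # rotation between frames
        print(rpm, "rpm ->", turns_per_frame % 1.0, "turns/frame apparent")
    # 1800 rpm -> 0.0 turns/frame (looks frozen); 2400 and 600 rpm both
    # -> 1/3 turn/frame, so they're indistinguishable once sampled at 30 fps.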
"I don't 100% get the band limited signal bit. How does band limiting imply that there's only a single possible reconstruction of the digital signal?"
The band limiting here means removing frequencies above the Nyquist limit. This is what allows for a unique solution. If these higher frequencies aren't removed, you get aliasing: the content above the limit folds back into the lower frequencies, causing distortion, and there is no longer a single signal consistent with the samples.
You are probably already familiar with aliasing in the visual domain. This is the effect that causes tires to look like they are rotating backwards once they reach a fast enough forward spin.
"Isn't there also an argument that frequencies above 19khz can be heard by some people so need to be accurately represented?"
"Would be interesting to see a complex analogue signal being fed through the digital pipeline and then inverted against the original to see the differences."
The square wave is fairly complex and included in the video. While he doesn't invert the signal, he does show it fed through multiple times and then compensated for the delay.
Wow, I've never made the connection between "aliasing" wrt images and audio, and "no unique solution" before! That word choice suddenly makes a lot more sense, haha!
> Can the converter itself not create the signal incorrectly?
Certainly the converter can be bad or malfunctioning... but there is only one bandlimited signal that passes through the sampling points.
The ideal behavior is perfect and described by simple (if surprising) math, and we can measure any converter against the ideal. It turns out that _very_ good converters (ones whose variation from the ideal is at the level of the thermal noise in the electronics) are commonly and cheaply available.
There was another chapter of the video, one that compared the performance of a couple of consumer-grade DACs/ADCs and basically showed they were all very good, that got dropped due to length, the desire to get something shipped, and a tone that didn't quite fit with the rest. Hopefully it will make its way out as 'bonus material' sometime soon.
> I don't 100% get the band limited signal bit. How does band limiting imply that there's only a single possible reconstruction of the digital signal?
I agree that was a little unclear. I think what he's saying is that since humans can't hear above ~20 kHz, frequencies above that are lost. The 'wobbles' in the square wave are what happens when you take a square wave (which is a sum of infinitely many sine waves at frequencies that are odd integer multiples of the fundamental) and drop the frequencies above the ~20 kHz cutoff.
In other words, it has nothing to do with any analog/digital conversion. It's just what happens if you ignore frequencies above 20 kHz, which we can't hear anyway. We can't hear any difference between a 'perfect' square wave and the band-limited one.
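You can reproduce the wobbles without any sampling at all: just sum the square wave's sine components up to a cutoff. A rough numpy sketch (the 1 kHz fundamental is an arbitrary choice):

    import numpy as np

    f0 = 1000.0                     # fundamental (arbitrary)
    cutoff = 20000.0
    t = np.linspace(0, 2 / f0, 4000)

    wave = np.zeros_like(t)
    k = 1
    while k * f0 < cutoff:          # a square wave has only odd harmonics
        wave += np.sin(2 * np.pi * k * f0 * t) / k
        k += 2
    wave *= 4 / np.pi               # Fourier-series amplitude scaling

    # 'wave' now shows the ripples near each edge -- purely from dropping
    # the harmonics above 20 kHz, with no digitization anywhere.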
I also don't completely understand the part about there being only one possible solution. It makes sense if you know that the signal is a single sine wave, but since in general the signal is a sum of an unknown number of sine waves, it's not clear. I guess this is something he didn't have time to get into in more depth in the quick overview :)
> Isn't there also an argument that frequencies above 19khz
The rule of thumb I've heard is 20 kHz. I also wondered if this is an issue. It'd be interesting to know the distribution of how far above that people can actually perceive. I think, though, that 20 kHz is already fairly far out on the long tail, and most people's cutoffs are actually lower.
> I also don't completely understand the part about there being only one possible solution. It makes sense if you know that the signal is a single sine wave, but since in general the signal is a sum of an unknown number of sine waves, it's not clear. I guess this is something he didn't have time to get into in more depth in the quick overview :)
I'll give this a stab. Assume you're connecting the dots (samples) with a pencil. You can see that the high-frequency filter would not let the pencil go up and down several times between two adjacent dots, because any movement at that rate would be too high a frequency, right? Now extend that thinking: any movement not along one specific path (the one possible solution) would produce frequencies above the limit of the filter. Even a teeny tiny bend in that line would result in a little bit of energy at a frequency above the filter, and hence is not allowed. Make sense? Experts: is that accurate?
"I can kind of picture the fourier transform meaning that there's only one representation but it's a bit of a leap for me.Can the converter itself not create the signal incorrectly?"
Well, the converter always adds small amounts of noise to the signal, which could accumulate to one least significant bit or more.
With some special signals, mostly man-made ones like radar or sonar, you can add more error (the signal accumulates so much energy in selected frequencies at exact multiples that there is a strong bias), but most natural audio signals don't have much trouble.
"Isn't there also an argument that frequencies above 19khz can be heard by some people so need to be accurately represented?"
No, most kids can hear above 19 kHz and most adults can't; there's no argument about that. It's something you can test with a speaker and an electronic workbench, using your kids or family members.
Nyquist proved that if you have a SAMPLED signal, with the FT you can reconstruct EXACTLY the same band-limited signal, provided that you don't add quantization errors and the like (which are small). You can even reconstruct square waves (with signal at all frequencies).
If there were differences, you would hear those transformation errors if they were significant (they are not); otherwise you have to turn up the volume, and you will just hear a "sssssshhhhhhhhhh" sound.
> I don't 100% get the band limited signal bit. [...]
If I can venture yet another explanation for this. Consider the projection

π: signals --> signals,

which corresponds to band limiting a signal to 20k[Hz]. Given any signal in the "time basis" x(t), we can transform it to the "frequency basis" X(f), and in the frequency basis the action of the filter is to leave all frequencies f < 20k unchanged and to set to zero all frequencies f > 20k. This is similar to the way the projection onto the x axis leaves the x coordinate of vectors unchanged and sets the y coordinate to zero.
> Can the converter itself not create the signal incorrectly?
Let x be the signal and π(x) be the band limited version of x. The video is talking about the general properties of projections, namely:

π(π(x)) = π(x).

The first time you project, you change the signal (restricting it to the subspace f < 20k in the frequency domain), but there is no further harm done by projecting multiple times.
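Here's a small numpy sketch of that projection property, using an idealized brick-wall FFT filter as the band limiter (the cutoff fraction is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)            # an arbitrary full-band signal

    def bandlimit(sig, keep=0.45):
        spec = np.fft.rfft(sig)
        spec[int(len(spec) * keep):] = 0.0   # zero everything above the cutoff
        return np.fft.irfft(spec, n=len(sig))

    once = bandlimit(x)
    twice = bandlimit(once)
    print(np.allclose(once, twice))          # True: pi(pi(x)) == pi(x)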
> Would be interesting to see a complex analogue signal being fed through the digital pipeline and then inverted against the original to see the differences.
It's not a real demo but there's a drawing of it at 9:45 or so. The "quantization noise" is the noise introduced by digitization. And of course the rest of that whole section is the same thing, albeit with a simple 8-bit signal.
It's not obvious, but that's actually a real plot of a real signal example. Have a look in the source code for the cairo animations, it's generated from data.
It was easier to use a real example than try to draw something by hand that was close to correct :-)
I misread the poster's question. Yes, this video does answer his/her question as stated. I thought s/he meant why signal shape and frequency variations at very high frequencies don't matter, which I answered. At 20 kHz you'd have an extremely hard time noticing any difference between a sine and a square wave, or the different tone of 20.5 kHz vs 20.1 kHz. Even though, as the video shows, the wave retains full fidelity and detail at 20 kHz, he's zooming in to show that; our ears can't "zoom", they hear it at the super high, tightly packed scale. The logical extension of this is why a square wave at 30 kHz sounds exactly the same as a sine wave: both are silence to us, thus the same.
> I don't 100% get the band limited signal bit. How does band limiting imply that there's only a single possible reconstruction of the digital signal? I can kind of picture the Fourier transform meaning that there's only one representation, but it's a bit of a leap for me. Can the converter itself not create the signal incorrectly?
Nyquist's theorem shows that you can reproduce a sine-wave, provided that you have strictly greater than two samples per cycle (and implicitly assume that you have a sine wave).
Any periodic signal (which audio, being AC-coupled, always is) can be represented exactly by its Fourier transform. The FT is simply a different representation of the same function, namely a sum of sine waves. In general the FT of a signal has many terms (i.e., many spectral components or harmonics). However, in a band-limited system, only certain harmonics are allowed to pass. If your audio system claims to have, say, 20 Hz to 20 kHz bandwidth, then any Fourier component outside that range will be greatly attenuated (if the attenuation puts that component below the noise floor, we could say that the component has been completely eliminated). It is the act of attenuating those out-of-band components that causes the square wave to go squiggly.
This means that for any arbitrary signal which has been band-limited to half your sampling frequency, you can exactly reproduce the band-limited copy.
The sampling system usually (for consumer audio, always) has an anti-aliasing filter between the signal source and the sampler (not having an AA filter is what allowed Tektronix sampling oscilloscopes to show multi-GHz signals in the 1960s, by effectively heterodyning the signal frequency down to something usable). That means that the digitizer only ever sees the band-limited signal. The bandwidth of the AA filter is chosen so that the maximum frequency passed to the sampler is half (or less) of the sampling rate.
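To make the anti-aliasing step concrete, here's a SciPy sketch (the rates, tone frequencies, and filter length are invented for the example; firwin's fs keyword needs a reasonably recent SciPy):

    import numpy as np
    from scipy import signal

    fs = 96000                    # original rate
    t = np.arange(9600) / fs
    # 1 kHz tone plus a 30 kHz tone that would alias at a 48 kHz rate
    x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 30000 * t)

    aa = signal.firwin(255, 20000, fs=fs)    # FIR lowpass, cutoff 20 kHz
    filtered = signal.lfilter(aa, 1.0, x)

    good = filtered[::2]          # 48 kHz samples; 30 kHz tone already removed
    bad = x[::2]                  # no AA filter: 30 kHz folds back to 18 kHz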
> Isn't there also an argument that frequencies above 19khz can be heard by some people so need to be accurately represented?
That argument has been put forward, yes. "Back in the day" Bell Labs performed actual experiments with human test subjects to produce "equal loudness" curves. The upshot is that the power of an audio source must increase as the fundamental tone rises (starting from about 4kHz, or so). Then you must consider the effect of loud sounds on the human ear. Hearing damage is cumulative and the rate of damage increases with sound power, so it's hard to put an exact upper bound on the upper power limit of human hearing. However, at some point in the 18kHz to 22kHz range, the equal loudness curve crosses the damage curve. At that point you have to turn up the power of the audio source so high that you damage your hearing listening to it: So there is a definite upper frequency limit to human hearing. It may be slightly higher for some people than others, but 22.05kHz (the Nyquist limit for CD-rate sampling) is almost certainly greater than many standard deviations above the mean.
It's difficult to admit for a group of people (engineers and scientists) who are always striving to make things better, but we have come to a point where audio reproduction is "good enough." For a modest amount of money, the sampling, storage and reproduction is for all intents and purposes "perfect." Your living room is never going to sound like a concert hall with a live orchestra, but it's not because you don't have enough bits or samples or bandwidth.
Wow. That video is fantastic. The conceptual leap between DSP theory and audio behaviour is not a trivial thing to understand. This is the best educational demonstration of these concepts I've ever seen. I hope high school and university educators embrace this shining example of online learning.
Is there a Xiph Foundation or something I can sign up for when new stuff is published? Material of this quality is a great way to raise awareness of Xiph's core goals.
Amazingly well done. Clear, great animations, and does a great job of showing you exactly what is going on. I wish all learning material/tutorials were this well produced.
Really great video. He goes through things quite fast, but he does nearly every explanation with a demo! I can't think of many of my uni lectures on this topic that were like that. I hope they release more. Are there any planned?
They take an insane amount of work, so they are slow in coming. This latest one had about 50kloc of code written to build the demo software, animations, and to improve the FOSS video editing software that was used.
Though I do find the choice of using Cinelerra to edit the video a bit odd. Apparently there was an update to the project less than a year ago, but I have personally never been able to even import a video without it crashing.
Using something like OpenShot, KDenLive, PiTiVi, or even Blender would make more sense to me.
But Cinelerra obviously worked out for them; the productions are absolutely spectacular. Loved every bit of this one. The clearest explanation one could hope to get.
The sad truth is that all the open source video editors are like that depending on what you're doing (with the possible exception of OpenShot, which I found to be reliable but limited).
In short, I found no FOSS video editor that did what I needed reliably out of the box. Given that problem, Cinelerra at least came close, the code was readable, and the design sound. Its biggest problem is that its file loaders/exporters are all old and bitrotting so they have become unreliable and crash-prone. For my stuff, I use raw video and so don't hit those problems.
In the process of making Episode 2, I wrote a new compositor resampler and a new color grading filter for Cinelerra (named Blue Banana). These should be appearing officially in the next major release. I hope to have some more file loader work done by then too :-)
There is a lot of ignorance in that thread though. This is one of those subjects where you should be very careful about who you listen to, given all the disinformation floating around on the internet.
Honestly, the ignorance and arrogance regarding audio and its playback is one of the distasteful things about the audio engineering/audiophile community. Even in a room full of engineers, you will still get people arguing to the death about things outside of human perception (my family, full of electrical engineers, is unfortunately filled with this type). For whatever reason, the idea that their ears may be just as limited in the audio spectrum as their eyes are in the light spectrum is a concept just too foreign to grasp.
You have places like gearslutz, which is, as far as I know (though I haven't been there in a long time), the biggest audio engineering forum out there. The community as a whole seems to disagree with the very concept of an AB or ABX test. I mean, it approaches flat-out superstition.
The thread that finally made me throw up my hands and leave the place was when a big post erupted after a very well respected member of the community had the gall to suggest that microphones were not filled with magic, and thus were limited by their physical properties. It devolved into personal attacks against the guy and his education, and people asserting that "science can't measure what I'm hearing."
I recently checked out a few online courses on Signal Processing and EE. (Not the MIT ones, though). Khan Academy is good too, but doesn't cover advanced topics like this.
Nothing touches this.
When I see this, I get this tremendous optimism. Online video is going to be a fantastic complement to the internet as a learning platform.
Huge kudos to the xiph.org team for producing this!
Excellent video, though I do wonder about the treatment of pixels as point samples. I've seen it before, and it appears to be a popular viewpoint among code designers, but it doesn't agree with how I use and create pixels, or with my understanding of how they are sampled by cameras. So it feels like a convenient misrepresentation.
Given that, I also wonder about the point-sample nature of acoustic signals espoused in this video. I know nothing about the sampling, but I imagine a stairstep function could be more accurate for that, yet reasonably represented as point samples transformed using a theoretical understanding of sound.
The square wave example provides a further interesting representational quandary. Given that it's a simple function, the bandlimiting results in a surprisingly large signal loss at a notable increase in bit depth. Combined with the signal loss that results from assuming the frequencies a priori, I wonder if there is enough signal lost and converted to noise in the digitization process to spend processing power looking for non-sinusoidal, perhaps merely skewed, signals.
The math treats pixels as point samples, because that's what sampling is.
You can quibble about how a CCD cell in a camera converts its narrow field-of-view into its output but at the conclusion of the snapshot you will have a finite set of data points from that cell (the "pixel"), nothing more or less.
His point is that after this sampling process, all you can say about the original data is that the cell at X,Y recorded a value of RRGGBB (expand to as many bits as appropriate for your camera). That is a "point sample" by definition: an ordered pair of [ Point, Sample-Value ]. It doesn't matter how special the code that uses or creates the pixels is; this is a simple fact-of-life of the physical sampling process itself.
Given that you admittedly know nothing about the sampling, I would imagine that you would be careful about how you judge the adequacy of the fit for a mapping from the sampled data to arbitrary functions intersecting with the sample.
But I didn't see his point as being that; I saw him arguing that pixels are correctly interpreted as representing an area of zero, which is very convenient when you're fitting a function to interpolate between them (since that way you don't have to correct for integrals).
I would argue that this approach is what gets you JPEG artifacts, and that there might be an analogous effect in audio to explore, since the bitrate avenue has been exhausted. It's perfectly possible that more accurate bases for the decomposition step would result in unjustifiable bitrates, but I want to hear of it.
As for judging the fit, my aim is to judge the mapping from input to output, not from sample to output (the ideal mapping from sample to output would of course be an identity function). This naturally requires recognizing some assumptions.
JPEG artifacts are unrelated to the definition of pixels. They're an artifact of using lossy compression, much as how MP3 exhibits some distinctive artifacts at lower bitrates. Monty is only talking about plain old uncompressed sampled data.
Since you're focused on the "size of a pixel" you might also be thinking of resampling artifacts, which is also beyond the scope of the video. Resampling is the process of changing the effective sample rate of data by reconstructing a new, larger or smaller set of samples from the original data. Monty alludes to this when he says "the samples are not stairstepped", because when we go to reconstruct samples we have to figure out a method that is NOT stairstepped, but rather:
1. Reasonably represents the original signal
2. Remains band-limited
Fortunately there's theory that explains what a mathematically ideal resampler should look like. It's good study material for thinking about sampling theory in more depth and understanding some outcomes that develop from defining samples as infinitely small points. It also explains why pixel art doesn't respond well to standard image resampling algorithms.
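For the curious, here's a toy windowed-sinc interpolator in Python, roughly the shape that theory points to. It's a sketch for upsampling only: a real resampler also lowers the sinc cutoff when downsampling so the output stays band-limited.

    import numpy as np

    def resample(x, ratio, half_width=32):
        """Toy windowed-sinc resampler; ratio = output rate / input rate."""
        out = np.empty(int(len(x) * ratio))
        for m in range(len(out)):
            pos = m / ratio                        # position in input samples
            n0 = int(np.floor(pos))
            n = np.arange(n0 - half_width + 1, n0 + half_width + 1)
            n = n[(n >= 0) & (n < len(x))]         # clip at the signal edges
            k = pos - n                            # offsets, |k| < half_width
            taper = 0.5 + 0.5 * np.cos(np.pi * k / half_width)  # Hann window
            out[m] = np.sum(x[n] * np.sinc(k) * taper)
        return out

Each output sample is just the band-limited interpolant (sum of sincs, as above) evaluated at a fractional position in the input.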
JPEG artifacts are separate from the definition of a pixel, yes, but not to the extent you propose. The artifacts are specifically a result of pixels becoming corrupted during the compress-decompress cycle, and feature most prominently where the source function has discontinuities. Treating a pixel as a point sample of a smooth function removes our ability to consider such things.
Quantization is a form of lossy compression, which is why it adds noise. Monty certainly spoke of that, and of randomization as a recourse in the face of systematic error.
I wasn't thinking about resampling, no, but it certainly fits the topic. Could you expand on how the ideal resampler explains why standard resamplers are occasionally poor, and whether this can be affected by treating the sample as filtered?
> The artifacts are specifically a result of pixels becoming corrupted during the compress-decompress cycle, and feature most prominently where the source function has discontinuities. Treating a pixel as a point sample of a smooth function removes our ability to consider such things.
The pixels are not "corrupted", the source pixels are instead deliberately replaced by a pixel set that compresses better. The way the JPEG encoder chooses this replacement pixel set is why you see the artifacts in areas with significant edges: JPEG deliberately removes the image counterpart to high-frequency data, much the same way that MPEG audio encoding removes high-frequency audio.
What you see is the "Gibbs effect" in the visual realm from deliberately throwing away that high frequency data, which is not at all due to "treating a pixel as a point sample" (since as before, that's the very definition of a pixel, a value at a point). The solution is easy enough: force JPEG to keep the extra data, by driving the quality to the maximum permitted level (at the penalty of horrible compression).
JPEG has other artifacts too (e.g. you can see the block size used by the encoder), but those are also not inherent to treating pixels as point samples but are instead inherent to the design of the encoder.
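A toy 1-D version of the effect, for anyone who wants to see it (JPEG really works on 8x8 blocks with a 2-D DCT; this just mimics one row):

    import numpy as np
    from scipy.fft import dct, idct

    block = np.array([0, 0, 0, 0, 255, 255, 255, 255], dtype=float)  # hard edge

    coeffs = dct(block, norm='ortho')
    coeffs[3:] = 0.0                 # discard the high-frequency terms
    approx = idct(coeffs, norm='ortho')

    print(np.round(approx))          # over/undershoot (ringing) next to the edge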
Deliberate corruption is still corruption. As for your comparison to MPEG audio, I feel that's symptomatic: high-frequency components are perceptually negligible in audio, but stand out in video (until the stroboscopic effect). You wouldn't want to use the same encoder for both.
The Gibbs effect is absolutely due to discontinuities not being gracefully handled by a Fourier transform. Luckily point samples from a continuous function don't have to worry about that like a fully defined discrete function might, which is why I expect people want to define pixels as the former.
A camera might produce a pixel by computing the average light input over a square, and in that sense you could consider the pixel itself a square.
But in signal processing you interpret it as a two step procedure: First, for every mathematical point of the scene, compute the average light input of the surrounding square to produce a band limited signal. Second, for every pixel report the value at the corresponding point in the band limited image.
The first step is filtering (in this case with a box filter), the second is point sampling.
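In code, the two steps look something like this (a made-up 1-D "scene" on a fine grid, standing in for the continuous image):

    import numpy as np

    scene = np.zeros(1000)                   # finely gridded stand-in scene
    scene[500:] = 1.0                        # a hard edge

    cell = 100                               # one pixel spans 100 grid points
    box = np.ones(cell) / cell
    filtered = np.convolve(scene, box, mode='same')   # step 1: box filtering

    centers = np.arange(cell // 2, len(scene), cell)
    pixels = filtered[centers]               # step 2: point sampling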
I've heard that one of the problem areas is that you can't make a perfect, colorless filter: one where 20 kHz is allowed through but 20.001 kHz is not. A filter has to slope down at least somewhat gradually, and the steeper you make the filter, the more it colors (distorts) the signal around the cutoff frequency. One of the purported benefits of a higher sampling rate is that you can make the high-frequency filter with a more gradual slope, which introduces less distortion.
I realize any differences in this area would be very slight if they exist at all. Anyone know more about that?
Edit: Apparently no one can tell a difference in listening tests with well-made equipment. Theoretically, if a device was built using an extremely cheap filter you'd be better off with the higher sampling rate, but I can't find any real examples.
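One way to see the trade-off numerically: using SciPy's Kaiser-window estimate, ask how long an FIR filter has to be to pass 20 kHz and be heavily attenuated by Nyquist at each rate (a rough sketch only, not a statement about any real DAC):

    from scipy import signal

    for fs in (44100, 96000):
        nyquist = fs / 2.0
        width = (nyquist - 20000.0) / nyquist        # transition band as a
                                                     # fraction of Nyquist
        numtaps, beta = signal.kaiserord(90, width)  # ~90 dB stopband target
        print(fs, "Hz:", numtaps, "taps")
    # The 96 kHz case has a huge transition band to work with, so the
    # filter can be far shorter and gentler.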
In my experience, another source of great DSP explanations is Professor fred harris [1]. I attended a 2-day lecture by him (20 years ago) and it was beautiful in its clarity.
That was the digital audio version of the Feynman Lectures on Physics. After years of being bombarded with mediocre TED talks, this is a shining example of pure excellence.
You might be interested to know that there were about 50k lines of code[1] written for the demonstrations, all of which is available.
Also, don't miss the original video[2], or the article on HD audio[3].
[1] https://wiki.xiph.org/Videos/Digital_Show_and_Tell#Use_The_S...
[2] http://xiph.org/video/vid1.shtml
[3] http://people.xiph.org/~xiphmont/demo/neil-young.html