This is pretty cool. If you couldn't tell, he is using a dual-trace oscilloscope in X-Y mode [1]. The left audio channel is driving the horizontal deflection. You can usually pick up one of these old analog scopes for under 100 bucks if you look around!
This gets even neater if you take a moment to ponder the theory / math behind how it actually works.
The same signal is driving the video as is driving the audio. Since the oscilloscope is in X-Y mode, the horizontal position of the dot tracks the voltage of the left audio channel and the vertical position tracks the voltage of the right channel.
So if you drew a fixed dot somewhere on the screen, you'd get DC in both audio channels, which doesn't produce an audible tone. If you want a sound, you must move the dot. But of course that will affect the picture too. The two are not independent.
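To make that concrete, here is a rough Python sketch of the mapping. It is idealized: the function names, the 48 kHz sample rate, and the unit-amplitude scaling are my own assumptions, not anything taken from the video. A fixed dot comes out as a constant (DC) value on both channels, while a circle is just a cosine on the left and a sine on the right, so drawing it necessarily produces a tone at the rate the dot goes around.

```python
import math

def circle_samples(freq_hz, sample_rate=48000, duration_s=0.01):
    """Return (left, right) sample pairs that trace a unit circle in X-Y mode.

    Hypothetical helper: left drives X (horizontal), right drives Y (vertical).
    The dot completes freq_hz laps per second, so each channel carries a
    sinusoid at freq_hz.
    """
    n = round(sample_rate * duration_s)
    samples = []
    for i in range(n):
        phase = 2 * math.pi * freq_hz * i / sample_rate
        left = math.cos(phase)   # horizontal deflection
        right = math.sin(phase)  # vertical deflection
        samples.append((left, right))
    return samples

def dot_samples(x, y, sample_rate=48000, duration_s=0.01):
    """Return (left, right) pairs for a dot parked at (x, y): pure DC."""
    n = round(sample_rate * duration_s)
    return [(x, y)] * n

circle = circle_samples(440.0)
dot = dot_samples(0.3, -0.2)
```

Every sample of `circle` lies on the unit circle, and every sample of `dot` is identical, which is the "no movement, no sound" case.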
So what the guy has done is create a single signal that is carefully crafted to both look and sound good. (It's kind of like those clever polyglot programs that compile in two languages.)
Once you appreciate that, it sounds nearly impossible, but I believe there are some tricks involved that make it more tractable. You've still got the time dimension to play with, so that gives you some freedom. Due to persistence of vision (and the persistence of the phosphor on the CRT), you can, for example, move the dot back and forth between two positions quickly. The eye can't tell how quickly, so you can vary the frequency of a sound without changing the picture. I'm sure there's a lot more to it than that, but that's just one trick that it seems like he must be using.
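Here is a small Python sketch of that two-position trick, again under my own assumptions (hypothetical function names, a 48 kHz sample rate, an instantaneous jump between the two points). It is only an illustration of the idea, not a claim about how the actual piece was made. The set of positions the dot visits is the same at either rate, so the picture is unchanged, but the number of jumps per second, and hence the pitch, is different.

```python
def two_point_samples(p0, p1, freq_hz, sample_rate=48000, duration_s=0.1):
    """Alternate the dot between p0 and p1, freq_hz full cycles per second.

    Hypothetical helper: each point is an (x, y) pair, i.e. (left, right).
    The dot sits at p0 for the first half of each cycle and p1 for the second.
    """
    n = round(sample_rate * duration_s)
    samples = []
    for i in range(n):
        cycle_pos = (freq_hz * i / sample_rate) % 1.0
        samples.append(p0 if cycle_pos < 0.5 else p1)
    return samples

def count_jumps(samples):
    """Count how many times the dot changes position (a crude pitch proxy)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)

p0, p1 = (-0.5, 0.0), (0.5, 0.0)
low = two_point_samples(p0, p1, 220.0)
high = two_point_samples(p0, p1, 440.0)

same_picture = set(low) == set(high)                 # same two dots on screen
higher_pitch = count_jumps(high) > count_jumps(low)  # but more jumps per second
```

On a real scope the beam would also leave a faint line while travelling between the two points, so this is a simplification.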
[1] https://en.wikipedia.org/wiki/Oscilloscope#X-Y_mode