Interesting. I wonder how well a logistic regression that spits out masks would perform in the source separation task.
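Out of curiosity, here's roughly what that baseline might look like: a per-bin logistic regression that predicts an ideal binary mask over the mixture spectrogram, then applies the soft probabilities as a mask. Everything here (the two-sine toy signals, the feature choice) is made up for illustration, not from the paper:

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.linear_model import LogisticRegression

sr = 8000
t = np.arange(sr * 2) / sr

# Two toy "sources": a low pure tone and a high tone with slow FM wobble
src_a = np.sin(2 * np.pi * 220 * t)
src_b = 0.8 * np.sin(2 * np.pi * 1800 * t + np.sin(2 * np.pi * 3 * t))
mix = src_a + src_b

# STFT magnitudes of both sources and the mixture
_, _, Za = stft(src_a, fs=sr, nperseg=256)
_, _, Zb = stft(src_b, fs=sr, nperseg=256)
_, _, Zm = stft(mix, fs=sr, nperseg=256)
Ma, Mb, Mm = np.abs(Za), np.abs(Zb), np.abs(Zm)

# Ideal binary mask: 1 where source A dominates a time-frequency bin
ibm = (Ma > Mb).astype(int)

# Crude per-bin features: log-magnitude plus normalized frequency index
logmag = np.log1p(Mm)
freq_idx = np.repeat(np.arange(Mm.shape[0])[:, None], Mm.shape[1], axis=1)
X = np.stack([logmag.ravel(), (freq_idx / Mm.shape[0]).ravel()], axis=1)
y = ibm.ravel()

clf = LogisticRegression(max_iter=1000).fit(X, y)
soft_mask = clf.predict_proba(X)[:, 1].reshape(Mm.shape)

# Mask the mixture spectrogram and invert back to a waveform
_, est_a = istft(soft_mask * Zm, fs=sr, nperseg=256)
```

On a toy like this it does fine because the sources occupy disjoint frequency bands; real mixtures overlap heavily in time-frequency, which is presumably where the convnet earns its keep.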
Also a bit surprising to see that they had to STFT the audio before feeding it into a convnet. I thought half the point of convnets was that they figure out how to do spectral domain representations on their own...
I've been messing around with audio nets and that was the first thing that surprised me too - convnets don't work nearly as well (out of the box) for audio as they do for images. This article has some good reasons why audio data is different from image data: https://towardsdatascience.com/whats-wrong-with-spectrograms...
Isn't that supposed to be the magic of convnets though? They _figure out_ the right representation. Instead of doing feature engineering, like STFTs and mel warping, you build convolution layers into an ANN and let it sort things out?
For the sound splitting it's probably not extremely useful. You could, however, use this kind of frequency analysis from video to synchronise audio with a video (e.g. a music video where you want to pair a studio audio recording with a video clip).
I haven’t tried my hand at any machine learning, but I’m impressed that it could work with only 60 hours of training data. Perhaps the input clips were fairly short, which would increase the total number of videos.
Unlike reading, I don't think audio can convey the same meaning in a different sensory format. At best, they perceive it but in an alien way to most people. It's like describing a painting in musical notes.
Not OP, but I imagine the idea must be that they can point at the thing they want to hear in a noisy environment and software can amplify the sound generated at that location.
Does anyone remember Gerry Anderson? He designed relays attached to puppets, which made the jaws clack in time to the soundtrack being played while they filmed the puppets for Thunderbirds Are Go in the 1960s. Look at me, Ma! My puppet is speaking!
Thus, it only took us 50-odd years to write the reverse compiler...
Incidentally, due to the way a lot of stereo tracks are mixed, it's often possible to mostly remove the vocal track from a song. I'm more curious if this algorithm could perform the reverse task - playing the vocals only. My intuition is that the results would be poor because of the wide human vocal range and the fact that words need to be discernible, not just notes. But I would love to be proven wrong here.
Well, that's not quite true. The point, I believe, is that vocals are generally put right in the center of the soundstage, so they play equally in the left and right channels. Thus right - left is most of the rest of the song, with the vocals cancelled out.
However, the right - left mix isn't exactly the song minus the vocals; it's an odd off-version. So subtracting it from the original song will leave mostly the vocals, but with artifacts from the difference between the song truly without vocals and the right - left approximation of it.
Yes, certainly. My point is that if you have "mostly no vocals" you can subtract that from the left + right mix to get "mostly the vocals". It won't be exactly right, sure.
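To make that concrete, here's a naive sketch on a synthetic stereo mix. One wrinkle: subtracting (left - right) from (left + right) directly in the time domain just gives back 2 × right by plain algebra, so the subtraction has to happen on STFT magnitudes instead. The toy signals and parameters below are all made up:

```python
import numpy as np
from scipy.signal import stft, istft

sr = 16000
t = np.arange(sr * 2) / sr

# Toy mix: "vocal" dead center, two "instruments" panned differently
vocal = np.sin(2 * np.pi * 440 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 2 * t))
gtr = np.sin(2 * np.pi * 196 * t)
keys = np.sin(2 * np.pi * 660 * t)
left = vocal + 0.9 * gtr + 0.3 * keys
right = vocal + 0.3 * gtr + 0.9 * keys

mid = left + right    # vocals doubled, plus all instruments
side = left - right   # vocals cancel; skewed instrument residue remains

# Time-domain mid - side would literally equal 2 * right, so instead
# subtract magnitudes per STFT bin and resynthesize with the mid phase.
nper = 1024
_, _, Zm = stft(mid, fs=sr, nperseg=nper)
_, _, Zs = stft(side, fs=sr, nperseg=nper)

mag = np.clip(np.abs(Zm) - np.abs(Zs), 0.0, None)
_, vocals_est = istft(mag * np.exp(1j * np.angle(Zm)), fs=sr, nperseg=nper)
```

The estimate keeps the centered vocal but still carries instrument residue (here, whatever of each instrument wasn't cancelled by the left - right difference), which matches the "mostly the vocals, with artifacts" expectation above.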