Interesting. I wonder how well a logistic regression that spits out masks would perform in the source separation task.
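Out of curiosity, here's roughly what that baseline might look like: a per-bin logistic regression that predicts an ideal binary mask over the mixture spectrogram, then applies the soft probabilities as a mask. Everything here (the two-sine toy signals, the feature choice) is made up for illustration, not from the paper:

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.linear_model import LogisticRegression

sr = 8000
t = np.arange(sr * 2) / sr

# Two toy "sources": a low pure tone and a high tone with slow FM wobble
src_a = np.sin(2 * np.pi * 220 * t)
src_b = 0.8 * np.sin(2 * np.pi * 1800 * t + np.sin(2 * np.pi * 3 * t))
mix = src_a + src_b

# STFT magnitudes of both sources and the mixture
_, _, Za = stft(src_a, fs=sr, nperseg=256)
_, _, Zb = stft(src_b, fs=sr, nperseg=256)
_, _, Zm = stft(mix, fs=sr, nperseg=256)
Ma, Mb, Mm = np.abs(Za), np.abs(Zb), np.abs(Zm)

# Ideal binary mask: 1 where source A dominates a time-frequency bin
ibm = (Ma > Mb).astype(int)

# Crude per-bin features: log-magnitude plus normalized frequency index
logmag = np.log1p(Mm)
freq_idx = np.repeat(np.arange(Mm.shape[0])[:, None], Mm.shape[1], axis=1)
X = np.stack([logmag.ravel(), (freq_idx / Mm.shape[0]).ravel()], axis=1)
y = ibm.ravel()

clf = LogisticRegression(max_iter=1000).fit(X, y)
soft_mask = clf.predict_proba(X)[:, 1].reshape(Mm.shape)

# Mask the mixture spectrogram and invert back to a waveform
_, est_a = istft(soft_mask * Zm, fs=sr, nperseg=256)
```

On a toy like this it does fine because the sources occupy disjoint frequency bands; real mixtures overlap heavily in time-frequency, which is presumably where the convnet earns its keep.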
Also a bit surprising to see that they had to STFT the audio before feeding it into a convnet. I thought half the point of convnets was that they figure out how to do spectral domain representations on their own...
I've been messing around with audio nets and that was the first thing that surprised me too - convnets don't work nearly as well (out of the box) for audio as they do for images. This article has some good reasons why audio data is different from image data: https://towardsdatascience.com/whats-wrong-with-spectrograms...
Isn't that supposed to be the magic of convnets though? They _figure out_ the right representation. Instead of doing feature engineering, like STFTs and mel warping, you build convolution layers into an ANN and let it sort things out?
For the sound splitting it's probably not extremely useful. You could, however, use this kind of frequency analysis from video to synchronise audio with a video (e.g. a music video where you want to pair a studio audio recording with a video clip).
I haven’t tried my hand at any machine learning, but I’m impressed that it could work with only 60 hours of training data. Perhaps the input clips were fairly short, which would increase the total number of videos.
Unlike reading, I don't think audio can convey the same meaning in a different sensory format. At best, they perceive it but in an alien way to most people. It's like describing a painting in musical notes.
Not OP, but I imagine the idea must be that they can point at the thing they want to hear in a noisy environment and software can amplify the sound generated at that location.
Does anyone remember Gerry Anderson? He designed relays attached to puppets, which made the jaws clack in time to the soundtrack being played while they filmed the puppets for Thunderbirds Are Go in the 1960s. Look at me, Ma! My puppet is speaking!
Thus, it only took us 50-odd years to write the reverse compiler...
Incidentally, due to the way a lot of stereo tracks are mixed, it's often possible to mostly remove the vocal track from a song. I'm more curious if this algorithm could perform the reverse task - playing the vocals only. My intuition is that the results would be poor because of the wide human vocal range and the fact that words need to be discernible, not just notes. But I would love to be proven wrong here.
Well, that's not quite true. The point, I believe, is that vocals are generally put right in the center of the soundstage, so they play equally in the left and right channels. Thus right - left is most of the rest of the song, with the vocals cancelled out.
However, the right - left mix isn't exactly the song minus the vocals; it's an odd off-version. So subtracting it from the original song will leave mostly the vocals, but with artifacts from the difference between the song truly without vocals and the right - left approximation of it.
Yes, certainly. My point is that if you have "mostly no vocals" you can subtract that from the left + right mix to get "mostly the vocals". It won't be exactly right, sure.
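To make that concrete, here's a naive sketch on a synthetic stereo mix. One wrinkle: subtracting (left - right) from (left + right) directly in the time domain just gives back 2 × right by plain algebra, so the subtraction has to happen on STFT magnitudes instead. The toy signals and parameters below are all made up:

```python
import numpy as np
from scipy.signal import stft, istft

sr = 16000
t = np.arange(sr * 2) / sr

# Toy mix: "vocal" dead center, two "instruments" panned differently
vocal = np.sin(2 * np.pi * 440 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 2 * t))
gtr = np.sin(2 * np.pi * 196 * t)
keys = np.sin(2 * np.pi * 660 * t)
left = vocal + 0.9 * gtr + 0.3 * keys
right = vocal + 0.3 * gtr + 0.9 * keys

mid = left + right    # vocals doubled, plus all instruments
side = left - right   # vocals cancel; skewed instrument residue remains

# Time-domain mid - side would literally equal 2 * right, so instead
# subtract magnitudes per STFT bin and resynthesize with the mid phase.
nper = 1024
_, _, Zm = stft(mid, fs=sr, nperseg=nper)
_, _, Zs = stft(side, fs=sr, nperseg=nper)

mag = np.clip(np.abs(Zm) - np.abs(Zs), 0.0, None)
_, vocals_est = istft(mag * np.exp(1j * np.angle(Zm)), fs=sr, nperseg=nper)
```

The estimate keeps the centered vocal but still carries instrument residue (here, whatever of each instrument wasn't cancelled by the left - right difference), which matches the "mostly the vocals, with artifacts" expectation above.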