Hacker News
The Sound of Pixels (csail.mit.edu)
131 points by myth_drannon on Sept 24, 2018 | 26 comments



Interesting. I wonder how well a logistic regression that spits out masks would perform in the source separation task.

Also a bit surprising to see that they had to STFT the audio before feeding it into a convnet. I thought half the point of convnets was that they figure out how to do spectral domain representations on their own...
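For reference, the STFT preprocessing step being discussed is only a few lines. A minimal sketch with `scipy.signal` (the sine-wave "audio" and the low-pass mask here are made up for illustration; a real separation net would predict the mask):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16_000                              # assumed sample rate
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 440 * t)      # stand-in for a 1-second clip

# STFT: time-domain samples -> complex time-frequency grid
f, frames, Z = stft(audio, fs=fs, nperseg=1024)
mag = np.abs(Z)                          # magnitude spectrogram fed to the convnet

# a separation network would predict a mask over this grid;
# here, a dummy binary mask keeping only bins below 1 kHz
mask = (f < 1000)[:, None].astype(float)
_, recovered = istft(Z * mask, fs=fs, nperseg=1024)
```

Feeding `mag` (or its log) to the network instead of raw samples is the feature engineering the parent is questioning.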


I've been messing around with audio nets and that was the first thing that surprised me too - convnets don't work nearly as well (out of the box) for audio as they do for images. This article has some good reasons why audio data is different from image data: https://towardsdatascience.com/whats-wrong-with-spectrograms...


In theory, yes, but in practice, giving the network the full information in the right format is crucial for it to train well and quickly.


Isn't that supposed to be the magic of convnets, though? They _figure out_ the right format. Instead of doing feature engineering like STFTs and mel warping, you build convolution layers into an ANN and let it sort things out.


Cool stuff. Any real world usage/benefits for this? I can't think of any.


It would be great for practicing songs: you would click on the instrument you want to turn off.


Matching faces with voices in a surveillance situation: which person is talking?



That's actually a whole subfield, not just this paper.


For the sound splitting, it's probably not extremely useful. You can, however, use this frequency analysis of the video to synchronise audio with video (e.g. for a music video where you want to pair a studio audio recording with a video clip).
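The sync idea can be sketched with plain cross-correlation; a toy example with numpy (the signals and the offset are made up, and both "recordings" are assumed to capture the same performance):

```python
import numpy as np

rng = np.random.default_rng(42)
studio = rng.standard_normal(48_000)                  # pretend studio recording
offset = 1_234                                        # unknown lag to recover
camera = np.concatenate([np.zeros(offset), studio])   # camera audio starts late

# FFT-based cross-correlation: the peak position gives the lag
n = len(studio) + len(camera) - 1
corr = np.fft.irfft(np.fft.rfft(camera, n) * np.conj(np.fft.rfft(studio, n)), n)
lag = int(np.argmax(corr))
# lag == offset, so shifting the studio track by `lag` samples aligns it
```

In the paper's setting you would correlate against features predicted from the video rather than a second audio track, but the alignment machinery is the same.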


I haven’t tried my hand at any machine learning, but I’m impressed that it could work with only 60 hours of training data. Perhaps the input clips were fairly short, which would increase the total number of videos.


It reminds me of apps like PhonoPaper[0], Nature - Oscillator[1] and PixiVisor[2]

[0] http://warmplace.ru/soft/phonopaper

[1] http://warmplace.ru/soft/nosc

[2] http://warmplace.ru/soft/pixivisor/


This is going to be huge for the hard of hearing.


Unlike reading, I don't think audio can convey the same meaning in a different sensory format. At best, people perceive it, but in a way alien to most. It's like describing a painting in musical notes.


And the "some" to that "most" are those with synesthesia


Or describing a piano piece in musical notes.


how


Not OP, but I imagine the idea is that they can point at the thing they want to hear in a noisy environment and software can amplify the sound generated at that location.


Does anyone remember Gerry Anderson? He designed relays attached to puppets that made the jaws clack in time to the soundtrack being played while they filmed the puppets for Thunderbirds Are Go in the 1960s. Look at me, ma! My puppet is speaking!

Thus, it only took us 50-odd years to write the reverse-compiler.


I wonder if this can segregate vocals from instrumentals in a mix? That would be great for mashups.


Incidentally, due to the way a lot of stereo tracks are mixed, it's often possible to mostly remove the vocal track from a song. I'm more curious if this algorithm could perform the reverse task - playing the vocals only. My intuition is that the results would be poor because of the wide human vocal range and the fact that words need to be discernible, not just notes. But I would love to be proven wrong here.


If you can remove the vocals from a piece, you can then subtract that from the original to get just the vocals.


Well, that's not quite true. The point is, I believe, that vocals are generally put right in the center of the soundstage, so they play equally in the left and right channels. Thus right - left is most of the rest of the song, with the vocals cancelled out.

However, the right - left mix isn't exactly the song minus the vocals; it's an odd off-version. Subtracting it from the original song leaves mostly the vocals, but with artifacts from the difference between the song truly without vocals and the right - left approximation of it.


Yes, certainly. My point is that if you have "mostly no vocals" you can subtract that from the left + right mix to get "mostly the vocals". It won't be exactly right, sure.
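This whole exchange fits in a few lines of numpy. The toy "stems" below are random noise, with the vocal panned dead center as the thread assumes and the instruments panned hard to each side:

```python
import numpy as np

rng = np.random.default_rng(0)
vocals = rng.standard_normal(1_000)   # center-panned: identical in L and R
guitar = rng.standard_normal(1_000)   # panned hard left
drums  = rng.standard_normal(1_000)   # panned hard right

left, right = vocals + guitar, vocals + drums

karaoke = left - right                # vocals cancel exactly...
# ...but what remains is guitar - drums, not the true instrumental
assert np.allclose(karaoke, guitar - drums)

mono = (left + right) / 2             # vocals + (guitar + drums) / 2
vocals_est = mono - karaoke / 2       # = vocals + drums: "mostly vocals" + leakage
assert np.allclose(vocals_est, vocals + drums)
```

The asserts make the point concrete: each subtraction is exact in this toy setup, yet neither result is a clean stem, because anything not symmetric between the channels leaks through.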


Fair enough!


Interesting, I was wondering what to do with my audiblepixel.com domain.



