End-to-end Music Source Separation (jordipons.me)
80 points by homarp on Nov 1, 2018 | 24 comments



code is here https://github.com/jordipons/source-separation-wavenet

"Currently the project requires Keras 2.1 and Theano 1.0.1, the large dilations present in the architecture are not supported by the current version of Tensorflow"


Interesting this should come up now. I just bought Izotope RX7 which has a built in 'Music Rebalancer' and, according to their PR, this uses machine learning techniques. I've been playing around with it this week in spare time and have been quite impressed with how well it works on quite disparate sources from my music collection. It's extremely useful when you want to transcribe details from instrumental tracks that are hard to discern past the vocals in the overall mix. It seems to work less well on material that's been crushed in the mix/mastering, and material that has a lot of upper-mid content competing with the vocals. On clean, open material, it works remarkably well.


"Music source separation" appears to mean "vocal removal" but also setting aside the vocals into their own track.

That's all I can glean from context, anyway.


If only the recording studios would provide unmixed tracks ...

It might even be a significant revenue source, as there are plenty of audiophiles who would pay to be able to change the volume of the string section relative to the flute section, et cetera. Or DJs making remixes.


It's not that easy, unfortunately. Mixing is a lot more complex and involved than it looks. It's a balancing act, making tracks stand out from the background or blend into the background, and finally putting it all into a coherent-sounding whole.

For example, many mix engineers - most, probably - use a buss compressor as "glue" to help blend the whole mix together. Take out something loud like lead vocals or drums, and the buss compressor behavior changes, changing the rest of the mix. A lot of mixers also use sidechain compression to duck parts depending on other parts (like the kick drum taking a few db out of the bass).
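For readers unfamiliar with sidechain ducking, here is a toy sketch of the kick-ducks-the-bass behaviour described above, in plain NumPy on mono float arrays. Parameter values are arbitrary, and a real compressor also has attack/release smoothing and a ratio:

    import numpy as np

    def sidechain_duck(bass, kick, sr, threshold=0.1, reduction_db=4.0,
                       window_ms=10.0):
        # Measure the kick's short-term RMS level (the "sidechain" signal)
        # and pull the bass down by a few dB whenever the kick is loud.
        win = max(1, int(sr * window_ms / 1000.0))
        env = np.sqrt(np.convolve(kick ** 2, np.ones(win) / win, mode='same'))
        gain = np.where(env > threshold, 10.0 ** (-reduction_db / 20.0), 1.0)
        return bass * gain

The point being: remove or re-balance the kick after the fact and this gain behaviour, baked into the mix, no longer makes sense.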

And besides the mix, it needs to be mastered (which costs money) and packaged separately from the rest of the album, making it a different product to be sold separately.

On the other hand, some fields actually do instrumental tracks with vocals stripped out as regular products - karaoke and hip-hop tracks are often made from the original mixes.

Personally, as a mix engineer, I don't want some home genius second-guessing my mixing decisions.


> Personally, as a mix engineer, I don't want some home genius second-guessing my mixing decisions.

But the home user could start from your default settings?


Not really. Not unless they have my DAW (mixing software), and the plugins I use on it. There's a lot more to mixing than just relative levels. It's kind of a black art that takes years to learn. I've been at it seriously for a decade now, and I feel I'm just beginning to get good (although I've gotten lucky in the past, it wasn't because I really understood what I was doing).

Here's an example. I listen critically to a lot of mixes, in order to learn from them, so I hear flaws everywhere. One badly flawed classic song, to my ear, is Al Stewart's "Year of the Cat". It's extremely sibilant on the vocals. Listen to it on bright-sounding headphones, and every S sound in the vocal is shrill and hissy-sounding. Adjusting the level won't fix that. I want to run a de-esser on the vocal. And it's not a knock on the mix in general, which is mostly gorgeous, lush, classic 1970s Abbey Road sound. 98% of it is magic beyond my meager skills. But oh god the sibilance.

edit: I should add here that individual tracks, solo'd outside the context of a mix, can sound really weird and wrong - they get manipulated to blend into the whole, not to sound great on their own. For example, I high-pass most guitars at 250-300 Hz, even though the low E string is way down at 82 Hz. The reason is to get them out of the bass space, freeing it for the bass guitar/synths and kick drum. So if you solo the guitar, it can sound thin and wrong. In the mix, though, you'll never notice the difference - but you'll hear clearer bass.
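As an illustration of that kind of high-pass move, a minimal sketch with SciPy (the cutoff and filter order are arbitrary; the fs argument needs SciPy >= 1.2):

    from scipy.signal import butter, sosfiltfilt

    def highpass(audio, sr, cutoff_hz=250.0, order=4):
        # High-pass a mono track (e.g. a rhythm guitar) to keep it out of
        # the low end occupied by the bass and kick drum.
        sos = butter(order, cutoff_hz, btype='highpass', fs=sr, output='sos')
        return sosfiltfilt(sos, audio)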


https://www.native-instruments.com/en/specials/stems/

That's exactly this, targeted for DJs, already available and widely used.


That was something Glenn Gould was always on about as well - that recording should go beyond eliminating the "non-take-twoness" on the performers' end to allowing the listener to participate actively in the performance.


That's one application, but in general it is the problem of "unmixing" a piece of music - separating the percussion, bass, chords, melody, and so forth out from a single recording.


Source separation is much more general than just music. For instance, a common application is hearing assistance; can a hearing aid separate the voice of the person you want to listen to, from the background noise of many other perfectly valid conversations?

I can't speak to the research presented in this link in context, but the first segment provides a good introduction to the problem domain: http://spandh.dcs.shef.ac.uk/chat2017/presentations/CHAT_201...


I'm Jordi Pons, one of the coauthors of the paper.

It's very interesting that you mention that! Actually, we used source separation to remix music to improve the musical experience of cochlear implant users!

Basically, music perception remains generally poor for cochlear implant users (due to the complexity of music signals). In order to simplify music for them, we remove the accompanying instruments to enhance the vocals and beat (drums and bass), which is what they perceive best.
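For context, the remix step itself is just a weighted sum of the separated stems. A toy sketch follows; the gain values here are arbitrary placeholders, not the ones used in the study linked below:

    def remix_for_ci(vocals, drums, bass, other,
                     vocal_gain=1.0, beat_gain=1.0, other_gain=0.25):
        # Keep vocals and beat at full level, attenuate the remaining
        # accompaniment.  Inputs are time-aligned stems as NumPy arrays.
        return (vocal_gain * vocals
                + beat_gain * (drums + bass)
                + other_gain * other)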

This was a nice source separation application that helped many people! :)

https://asa.scitation.org/doi/full/10.1121/1.4971424


Very cool! I was introduced to the problem by a former colleague of mine: http://ryanmcorey.com/ . His current research on using multichannel microphone arrays to improve real-time source separation, particularly for human listening applications, might be of interest to your team!


Nice, thanks for the link. I'm only familiar with it in the context of music analysis (ISMIR etc), though it's obvious that many of the core algorithms were developed originally for speech analysis.


I'm Jordi Pons, one of the coauthors of the paper.

Note that in our second part of the demo, besides separating vocals, we also separate drums and bass. See: http://jordipons.me/apps/end-to-end-music-source-separation/


So, I'm not an expert in music or signal processing, but: on some of the vocal separation examples, we can hear the rhythm instruments in the background. Would it not be easy to detect those instruments and remove them from the data? It's not like those are random noises; they are predictable signals. Is it a limitation of using neural networks, or is the problem harder than it looks?


I believe the vocal-removal algorithms already in use rely mostly on the pitch and tonal spectrum of the voice.

As for your example, what would be done when vocals sync up with rhythm instruments (straight 1/8 or 1/4 notes played on guitar and sung as well)? It happens all the time, often only a bar at a time, but sometimes longer. E.g. in The Strokes - Last Nite (https://youtu.be/TOypSnKFHrE?t=62) at 1:03, "they don't under-stand" (repeated) could trick the detection into thinking it's some sort of rhythm instrument.

My point is, all music is extremely rhythmic and requires everything else in the band to be rhythmic as well.
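For reference, a classical (non-neural) baseline for splitting rhythmic from tonal content is harmonic-percussive source separation. A minimal sketch with librosa and soundfile; the file names are placeholders, and this is not the method from the paper:

    import librosa
    import soundfile as sf

    # Median-filtering HPSS: separate the sustained/tonal part of a mixture
    # from its transient/percussive part.
    y, sr = librosa.load('mixture.wav', sr=None, mono=True)
    harmonic, percussive = librosa.effects.hpss(y)
    sf.write('harmonic.wav', harmonic, sr)
    sf.write('percussive.wav', percussive, sr)

As the comments above suggest, this kind of split breaks down exactly where vocals and rhythm instruments share transient, rhythmic content.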


Why don't they just use audio stems as training data?


I'm Francesc Lluis, one of the coauthors of the paper.

The reason we don't use audio stems as training data is that, during preparation of the MUSDB dataset, conversion to WAV can sometimes halt because an ffmpeg process, used within the musdb python package to identify the dataset's MP4 audio streams, freezes. This seems to be an error occurring in the subprocess.Popen() call used deep within the stempeg library. Due to its random nature, it is not currently known how to fix this.
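For anyone curious what reading the stems directly looks like, here is a sketch with the musdb python package. Attribute and argument names follow its documentation and may differ across versions; the dataset path is a placeholder:

    import musdb

    mus = musdb.DB(root='path/to/musdb18', subsets='train')

    for track in mus:
        mixture = track.audio                       # (n_samples, 2) stereo mix
        vocals = track.targets['vocals'].audio      # isolated vocal stem
        accompaniment = track.targets['accompaniment'].audio
        # ... build (mixture, target) training pairs here ...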


This is awesome. I've been thinking about attempting something like this in order to improve the accuracy of samples I want for tracks. Looking forward to seeing how it works out.


This mentions ICA and NMF, but in contrast to those the proposed method is supervised learning, not unsupervised.

I'd suggest the authors try something like an autoencoder in the waveform domain. That would be a closer analog to the ICA methods.


I think it only mentions ICA/NMF to say that they generally aren't applied to time-domain signals, which are not non-negative and have phase as a confounding factor.
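To make the non-negativity point concrete, here is a toy NMF separation sketch on a magnitude spectrogram using librosa and scikit-learn. The component count and the split into "sources" are arbitrary, and the mixture phase is simply reused, which is exactly the confounding factor mentioned above:

    import numpy as np
    import librosa
    from sklearn.decomposition import NMF

    y, sr = librosa.load('mixture.wav', sr=None, mono=True)
    S = librosa.stft(y)
    magnitude, phase = np.abs(S), np.angle(S)   # NMF needs the non-negative part

    nmf = NMF(n_components=16, init='nndsvd', max_iter=400)
    W = nmf.fit_transform(magnitude)            # spectral templates
    H = nmf.components_                         # per-frame activations

    # Rebuild a "source" from a subset of components, borrowing the mixture phase.
    source_mag = W[:, :8] @ H[:8, :]
    source = librosa.istft(source_mag * np.exp(1j * phase))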

Here's another intriguing (very different) recent paper on time-domain source separation: https://arxiv.org/abs/1810.12679


I'm Jordi Pons, one of the coauthors of the paper.

You both are right! We basically mention ICA/sparse coding as prior work on waveform front-ends for source separation.

Our method is supervised, and we did not explore the unsupervised learning approach. However, some people are doing that! Check out S. Venkataramani and P. Smaragdis's work! https://scholar.google.es/citations?user=hCSSNZwAAAAJ&hl=es&...

Although we did our best by comparing against DeepConvSep and Wave-U-Net, I agree that it would be useful to properly benchmark all of that!


You could use retro game music (specifically SNES) as training data, since it's very easy to use programs to render the individual channels to WAV.

Note: I'm the author of towave-j, a tool for game music splitting.
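A hypothetical sketch of turning per-channel renders into training pairs; the file names and channel count are placeholders, and the channels are assumed to be equal-length mono WAVs:

    import numpy as np
    import soundfile as sf

    # Sum the isolated channel renders to form the mixture; each channel is
    # then a ground-truth separation target for that mixture.
    channels, sr = [], None
    for i in range(8):
        audio, sr = sf.read('channel_%d.wav' % i)
        channels.append(audio)

    mixture = np.sum(channels, axis=0)
    sf.write('mixture.wav', mixture, sr)
    for i, audio in enumerate(channels):
        sf.write('target_%d.wav' % i, audio, sr)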




