Spleeter – Music Source-Separation Engine (deezer.io)
258 points by jph98 on May 19, 2020 | 63 comments



Once this technology gets incorporated into DJ mixers / CDJs, this is going to make DJing much more creatively interesting.

Historically, blending between mixed stereo tracks has been limited to mixing EQ bands, but now DJs will be able to layer and mix the underlying stems themselves -- like putting the vocal from one track onto an instrumental section of another (even if a cappella / instrumental versions were never released).

It also opens up a previously unreachable world for amateur remixing in general; for instance, creating surround sound mixes from stereo or even mono recordings for playback in 3D audio environments like Envelop (https://envelop.us) [disclaimer: I am one of the co-founders of Envelop]


Disclaimer: my hobby is correcting people on the Internet when they say "disclaimer" but really mean "disclosure" :)


How are those windmills doing, Don Quijote?


Hey, don't knock him. I'd never thought about it before. Being taught or corrected is great as long as people aren't dicks about it. Even then it has some value :)

I shall use this knowledge and endeavour to share it where possible.


Wouldn't it be much more efficient for everyone (and even lucrative for the owners) to also provide the studio stems at a slightly higher/different price?

(not that some of these are not already available when you know where to search, but it's not very... structured)


Native Instruments tried that with their Stems [0] project. It didn't seem to get all that far, though.

[0]: https://www.native-instruments.com/en/specials/stems/


This one is proprietary to them.

The "open source" format/practice exists already: just bounce your mix into separate audio files, one for each track or group, into a folder, zip it, ship.

Only, it's not (yet) much embraced on the commercial side. When you pay up to 20 boxes to get a full album, why couldn't you pay, say, 100 to get the same album with separated tracks for your own use, plus instructions as to who to contact (and how) for any other kind of use?


This is a thing specifically in the contemporary Christian music industry, so that churches can pick-and-choose parts from the original song to use as backing tracks for live performance. See e.g. https://www.multitracks.com/songs/Hillsong-Young-And-Free/Th...


For anyone who wants to try Spleeter in a version that "just works" without having to install TensorFlow and mess with offline processing, Spleeter has been built into a wave editor called Acoustica from Acon Digital. It's been working really well for me, and the whole package is solid competition for editors like iZotope RX:

https://acondigital.com/products/acoustica-audio-editor/


I've been trying for months to make redistributable Spleeter "binaries" that I can bundle with user-facing applications. Happy to see someone's succeeded where I've failed. Really sad they've chosen not to share their changes :(

I emailed them requesting more info on how their implementation works. I think this might be a violation of the MIT license?


The MIT license isn't copyleft: there's no obligation to share modifications or provide source, just to acknowledge the copyright / give credit.

But they seem friendly and proactive (from my experience on music forums anyway), so hopefully you'll get a helpful reply.


Does Acoustica offer full ability to rebind the hotkeys?


It's something they're working on. You can already change most keyboard shortcuts, but there are a few corner cases that people have been asking for (shortcuts with arrow keys are a problem at the moment). The developers have been extremely responsive to feature requests on the Gearslutz forum, though; I've seen some feature requests implemented in just a few days:

https://www.gearslutz.com/board/product-alerts-older-than-2-...


Previous discussion, where I posted a demo using a full song (legally under Creative Commons):

https://news.ycombinator.com/item?id=21431071

Note: I'm not affiliated with this project; I just think it's cool.


I'm going to pretend that we didn't see this (otherwise extremely helpful) link to a major discussion from 6 months ago, so as not to have to mark the current post a dupe.


I often have voice recordings with a lot of background noise (e.g. a public lecture in a room with poor acoustics, recorded from a phone in the audience; there are usually sounds of paper rustling, noise from the street, etc.). Is this "source separation" the sort of thing that could help, or does anyone have other tips? The best thing I have so far is based on this https://wiki.audacityteam.org/wiki/Sanitizing_speech_recordi...

(1) Open the file in Audacity and switch to Spectrogram view, (2) set a high-pass filter with ~150 Hz, i.e. filter out frequencies lower than that (which tend to be loud anyway), (3) don’t remove the higher frequencies (which aren’t loud), because they are what make the consonants understandable (apparently), (4) look for specific noises, select the rectangle, and use “Spectral Edit Multi Tool”.
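If you'd rather script step (2) than do it by hand in Audacity, something like this is the idea (a rough sketch using scipy and soundfile, which you'd need to install; the filename is just a placeholder):

    # Scripted rough equivalent of the ~150 Hz high-pass step (sketch only).
    import soundfile as sf
    from scipy.signal import butter, sosfilt

    audio, sr = sf.read("lecture.wav")  # placeholder filename

    # 4th-order Butterworth high-pass at 150 Hz: attenuates rumble and handling
    # noise below the voice range, leaving the consonant-carrying highs alone.
    sos = butter(4, 150, btype="highpass", fs=sr, output="sos")
    filtered = sosfilt(sos, audio, axis=0)

    sf.write("lecture_highpassed.wav", filtered, sr)

The spectral-edit step (4) is still easier to do visually in Audacity.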

But if machine learning can help that would be really interesting! This Spleeter page does mention “active listening, educational purposes, […] transcription” so I'm excited.


I'd generally try iZotope RX for cleaning up audio - Dialogue Isolate is probably the exact feature you'd want (and I gather it's often used in movies to clean up on-location dialogue), but it's only in the most expensive Advanced version:

https://www.izotope.com/en/products/rx/features/dialogue-iso...

Cheaper versions of RX still have various noise reduction tools, de-verb for reducing reverb and room echo, and a range of spectral editing tools as well.


You could give the Nvidia RTX Voice plugin a shot if you have one of the compatible cards. I'm not sure how it deals with low-level background noise; the YouTube reviews mostly tested it with over-the-top cases like a vacuum cleaner next to the speaker.



https://krisp.ai uses machine learning to remove background noise. I've used them with Zoom calls and it works really well. I think they don't currently have an "upload audio" feature for existing recordings, but it would be awesome if they offered this in the future.

Sorry it's not something you can use now, but I just thought I would mention it! I also did a quick Google search but unfortunately I couldn't find any AI noise removal tools that might solve this problem.


Is the processing happening remotely? I can't use any software that sends data (especially communications) off premises.


There is a Max for Live / Ableton Live plugin version here, which makes it much easier to experiment with Spleeter artistically.

https://github.com/diracdeltas/spleeter4max/releases/


Nice! I had this idea and was too lazy to do it haha, glad someone else wasn’t


Needs a Reaper version as well!


Another recent open source contender for source separation is Open Unmix: https://github.com/sigsep/open-unmix-pytorch/

I’ve not had time to try it yet but have read good things.


Just tried this and it's really impressive, I'd say it does a nicer job on vocals than Spleeter. Less of the "underwater" effect compared to what I remember of Spleeter.


Unfortunately, it doesn't look like they've got out-of-the-box Windows support.


Very cool!

I was even able to run it on their notebook https://colab.research.google.com/github/deezer/spleeter/blo... without setting anything up locally.
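For the curious, the separation cell boils down to something like this (a sketch based on Spleeter's documented Python API; the input path is a placeholder):

    # Two-stem separation (vocals / accompaniment) with Spleeter's Python API.
    from spleeter.separator import Separator

    separator = Separator('spleeter:2stems')          # downloads the pretrained 2-stem model
    separator.separate_to_file('audio_example.mp3',   # placeholder input file
                               'output/')             # writes vocals and accompaniment WAVs

There are also 4-stem and 5-stem pretrained models if you want drums, bass, etc. split out.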

The results of vocal separation were quite impressive.


Can you share the outputs please?


Sorry, I tried it on a (typical) copyrighted song.


Here's the sample output, for those who are curious:

- Sample track: https://files.catbox.moe/56op27.mp3

- Spleeted vocals: https://files.catbox.moe/4d9aru.wav

- Spleeted accompaniment: https://files.catbox.moe/y67g23.wav


A local radio station has a four-hour broadcast. They are required by the station to play a certain number of music tracks (about 6 per hour), but there has been demand to make the broadcast available as a podcast without the music.

Could this make it possible to automatically remove the music from the MP3 file they have available? With 6 tracks per hour times 4 hours, manually removing the music is time-consuming.

I doubt it, as it seems all vocals are output to a single file...

Is there any other tool someone can recommend?


Presumably they own the rights to the broadcast material, so they'd have to be directly involved in the podcast production. That given, it would probably be more straightforward to take the microphone feeds from their broadcast desk (via "aux out", perhaps) and record only the spoken output separately.

Sox etc. could be used for silence detection, probably best done in post (scriptable), but could be piped through after experimenting with settings. Otherwise, even old desks can trigger when a mic channel fader is raised, so that too is a possibility for pausing the recording during music.
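If you'd rather script the silence detection in Python than with SoX, here's a rough sketch with pydub (assumes pydub and ffmpeg are installed; the filename and thresholds are placeholders to tune by experiment), run against a mic-only recording such as the aux-out feed described above:

    # Keep only the spans where someone is actually talking on the mic feed.
    from pydub import AudioSegment
    from pydub.silence import detect_nonsilent

    show = AudioSegment.from_mp3("broadcast_mic_feed.mp3")  # placeholder file

    # Spans louder than -40 dBFS, separated by at least 2 s of silence.
    spans = detect_nonsilent(show, min_silence_len=2000, silence_thresh=-40)

    # Stitch the spoken spans back together and export the music-free cut.
    speech_only = AudioSegment.empty()
    for start_ms, end_ms in spans:
        speech_only += show[start_ms:end_ms]
    speech_only.export("broadcast_speech_only.mp3", format="mp3")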


> Is there any other tool someone can recommend?

Audacity. I can think of two ways.

0. Import into Audacity.

1. Play back the recording at 4x (1 hour of playback for 4 hours of real-time broadcast). Mark the edges where music stops and starts. You have to do that 12 times for 6 songs. You'll have to slow down near the changes in order to catch the precise time of an edge. Delete the music between the two edges. Repeat 5 more times.

There may be Audacity plugins that do what you want, or something close to it.

2. Use some combination of low-pass and high-pass filters to remove the music. It's not going to be perfect, and you'll still need to edit out the filtered music anyway.


At this point it'd be easier to just duplicate the sources to an external recorder, right?


Leveraging a state-of-the-art source separation algorithm for music information retrieval

https://www.youtube.com/watch?time_continue=42&v=JIR6HJISrtY...


Now we can create all-star bands that never existed. For example:

Neal Schon from Journey -- lead guitar

The Heart sisters -- lead vocals and lead/rhythm guitar

Flea from the Chili Peppers -- bass guitar

Neil Peart from Rush -- drums

Tony Kay from Genesis -- keys

The only difficulty is they must all be playing the same song. Then we can extract, transpose if needed, and remix together.


We can deep-fake vocals and redraw your photos as if they were painted by Van Gogh... I'm sure someone has trained something that immortalizes different artists as their AI instrumental avatar.

If not, I'm sure if you ask nicely Amazon will give you a few credits to burn on a pandemic art project.


Tony Banks?


I couldn't find any examples, so I was wondering, for anyone who's tried this: are the results better than using a bandpass filter and an equalizer to isolate frequencies, or one of those auto-karaoke things?

Because the ability to separate any song into individual tracks would be amazing. The ability to remix any song, or just play with any instrument or vocal track, would be awesome. But does it have the same poor quality and limitations as most frequency-based source separation?


Yeah, the results are a lot better than filtering... deep learning has pushed the state of the art in source separation quite a lot recently.

It isn't magical and the results still have artefacts (mostly that kind of slightly underwater sound of a low-bitrate MP3, I believe due to the way the audio is reconstructed from FFTs), and some songs trip it up entirely, but it's definitely worth playing around with, and I think it could potentially have applications for DJ/remix use if you added enough effects etc.
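My (possibly wrong) understanding of where the underwater sound comes from, as a toy sketch: these models estimate a time-frequency mask over the mixture's spectrogram and reuse the mixture's phase when resynthesising, so the estimated magnitudes and the borrowed phase don't quite agree (librosa; the filename and the mask here are placeholders):

    import numpy as np
    import librosa
    import soundfile as sf

    y, sr = librosa.load("mix.wav", sr=44100, mono=True)  # placeholder file
    D = librosa.stft(y, n_fft=4096, hop_length=1024)      # complex mixture STFT

    # A real separator predicts a soft mask in [0, 1] per time-frequency bin;
    # here it's faked with a crude band-shaped placeholder.
    mask = np.zeros(D.shape)
    mask[50:400, :] = 1.0

    # Estimated magnitude, original (mixture) phase -> the "underwater" artefacts.
    vocals_hat = librosa.istft(mask * D, hop_length=1024, length=len(y))
    sf.write("vocals_estimate.wav", vocals_hat, sr)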

It’s fairly easy to install and runs quickly without GPU, or you can try their Collab notebook, or seems someone has hosted a version at https://ezstems.com/


Had a play with the Colab and it's quite good indeed. The authors claim "100x real time speed", which is mighty impressive, but I'd be more interested in seeing a "Try Really Hard" mode, trading off quality and speed. Is that a thing that can be done in the current code, I wonder?


If you're trying to run it on Windows with Python 3.8, add numpy and cython to the dependencies, and change Tensorflow's requirement to be >= rather than ==.

Though then you'll run into compatibility errors like "No module named 'tensorflow.contrib'" which you'll have to fix.


While this is awesome, it's trained on MUSDB18-HQ, which as far as I can tell is proprietary. zenodo.org claims it is available; however, I have filled out their "request access" page half a dozen times. Does anyone know of a training dataset that's possible to obtain?

Here is the zenodo response:

Your access request has been rejected by the record owner.

Message from owner: no justification given

Record: MUSDB18-HQ - an uncompressed version of MUSDB18 https://zenodo.org/record/3338373

The decision to reject the request is solely under the responsibility of the record owner. Hence, please note that Zenodo staff are not involved in this decision.


This reminds me of this open-source project (and its predecessor manyears, and the open hardware projects 8/16SoundsUSB).

https://github.com/introlab/odas

https://github.com/introlab/manyears

https://github.com/introlab/16SoundsUSB

Website of the team behind these:

https://introlab.3it.usherbrooke.ca/


Out of interest, and to put this in context - your brain can only do this for conversation, not music.

You routinely suppress background noise and room acoustics when listening to someone speaking. But you don't do the same thing when listening to music. At best you can focus on individual elements in a track, and you can parse them musically (and maybe lyrically).

But you don't suppress the rest to the point where you don't hear it.


Maybe _your_ brain can, but mine can't.

To somebody with APD, it sounds like science fiction, although it does require more suspension of disbelief than faster than light travel or teleportation.


Is there some research behind this or are you opining on how YOUR brain works?


https://en.wikipedia.org/wiki/Cocktail_party_effect

Not sure how this pertains to music, but this ability normally requires localizing different voices and noises.


Once you have obtained just the guitar from a track, are there any tools out there which can work out the tablature (e.g. https://www.ultimate-guitar.com//top/tabs) so you can play along?



Well, it seems neural networks have started to appear for vocal and instrumental track isolation ^^ Recently I discovered https://www.lalal.ai and it works quite well.


I tried using the 2-stem model to remove the music from an audio recording of two people talking. However, it kept sucking in some of the music whenever someone started talking. Is there a better model to use for that?


It says it can be 100 times faster than real time.

So can it be run in real time?

I am thinking about extracting features for music visualization, but it could also make a DJ happy.


Sometimes the distinction is made between "real-time" and "online" processing.

The first one refers to the speed of processing in relation to the length of the recording - so if, say, you can process a 10-minute recording in 1 minute, then you're 10x real-time. However, your analysis might require the full track to be available for the best outcome, so you cannot really start processing until the full source is available.

The latter is what "online" processing refers to: the ability to process on the fly, in parallel with the recording. Obviously, this cannot be faster than real-time ;-) but hopefully it is not slower, either. Oftentimes, though, you get a (somewhat constant and) hopefully small offset, i.e., you can process a 10-minute recording online in the same time, but you need another 10 seconds on top of that.

This is, by the way, not restricted to source separation, it applies to other disciplines as well, say, automatic speech recognition.


Exactly. While fast, if this method needs to parse the full track before starting to generate results, then it can't be used in real time.

To be used with arbitrary audio in real time, after initialization and setup you need an API that looks something like:

ProcessAudio(samples, num_samples)

And it would return n packets of num_samples samples, one packet for each generated track.


I experimented with the Spleeter architecture quite a bit, and I would say it is not suitable for real-time audio processing. The reason is that the model needs at least 512 frames of audio samples to produce output usable for source separation. This adds a ton of latency. I tried with smaller windows, but the results are very bad.
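To put a rough number on that (assuming the hop size and sample rate from Spleeter's published config; treat these values as approximations):

    # Back-of-the-envelope latency for needing 512 STFT frames before any output.
    hop_length = 1024        # samples advanced per STFT frame (assumed default)
    sample_rate = 44100      # Hz (assumed default)
    frames_needed = 512      # frames the model wants before separating

    latency_seconds = frames_needed * hop_length / sample_rate
    print(f"{latency_seconds:.1f} s buffered before the first output")  # ~11.9 s

Roughly 12 seconds of buffering before anything comes out, which is far too much for a live insert effect.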


This person (https://github.com/diracdeltas/spleeter4max) created a native Max for Live version of Spleeter and demos it here:

https://www.youtube.com/watch?v=4pcJoI5CUOA&feature=youtu.be

It's way faster than real time; I'm not sure why slowing it down would be an advantage. As a DJ, you still need to take the resulting data and do things with it, and faster is better.


You could try Spleeter in the cloud here: https://voxremover.com


The output appears to cut off after 10 minutes. How do you make it operate on longer files, like in the 100 minute range?


Deezer is pretty useless if all supported hardware requires your phone to stream.

They should spend dev time on something that matters.


This is very cool. I have started using it to experiment with creating hardstyle dance remixes of popular songs.


This is ultra-cool .. I have a few terabytes of jam-session recordings that I'm going to throw at this. If it ends up being usable to the point that I can re-do vocals over some of the greatest moments in the archive, I'll be praising whatever Spleeter deity makes itself visible to me at the time, most highly ..



