
Without giving away your secret sauce, what are your approaches to the cleaning process? Is it a combination of different passes of algos or is it something more generic and "sausage machine-like" like a neural network?



The audio is edited in several phases. It uses different algorithms, but most of them are deep learning based. It is surely overengineered, but as a data scientist I find the ML the most fun part.


How is the latency and, if it's sufficiently low, could this realistically be applied to "nearly live" content?

That scenario seems really appealing for conferences, even if it just quietens down the verbal tics, but I'm guessing that if the lag is too great it would feel like a bad lip-sync issue.


How does real time make sense in the first place for an algorithm that takes 1 minute of audio and gives you back 50s? You're going to have to fill the gaps with something not meaningful anyway.


An awareness of your point was precisely why I mentioned "quietening down verbal tics" (i.e. 1 minute gives you back 1 minute, but with the tics removed/muffled).

To me this seems like it could be worthwhile even if it just results in silence or less prominent umms and other filler. I've sat through enough conference talks by technically gifted people whom I very much wanted to hear, but who unfortunately make their talks much harder to follow because of the tics. It might even help relax some nervous speakers if they knew that any tics that crept in were being suppressed.


I understand now. Very niche, but I applaud the effort to give voice to people who have something to say, instead of those who know how to talk in public.


Silence is meaningful, but pretty awkward when not deliberate!


Tools like this are designed to remove awkward silences.

What it sounds like the GP is after is something more like hiss and pop removal (to use an old vinyl analogy), and that's a different, and also simpler, problem to solve. I'd wager there are already tools on the market for that.
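To make the "simpler problem" claim concrete: classical de-clicking can be done without any ML by flagging samples that sit far from their local median and patching them. This is a toy sketch only; the window size and threshold are arbitrary assumptions, and real de-click tools are far more sophisticated.

```python
import numpy as np

def declick(signal, window=5, threshold=4.0):
    """Replace outlier samples (clicks/pops) with the local median.

    A sample is flagged when its distance from the median of its
    `window`-sample neighbourhood is much larger than the average
    such distance across the whole signal.
    """
    sig = np.asarray(signal, dtype=float)
    pad = window // 2
    padded = np.pad(sig, pad, mode="edge")
    # Local median centred on each sample.
    med = np.array([np.median(padded[i:i + window]) for i in range(len(sig))])
    dev = np.abs(sig - med)
    scale = np.mean(dev) + 1e-12
    out = sig.copy()
    mask = dev > threshold * scale
    out[mask] = med[mask]           # patch flagged samples
    return out

# Demo: a clean sine with one injected pop.
t = np.linspace(0, 1, 200)
x = np.sin(2 * np.pi * 5 * t)
x[50] += 3.0                        # the "pop"
y = declick(x)
```

After running this, the pop at index 50 is replaced by a value close to its smooth neighbours, while the rest of the signal passes through untouched.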


Very insightful :). Now I need an AI to tell me when silence is deliberate or not. :)


It would be a huge engineering endeavour, which I wouldn't be capable of doing. That said, things like background noise and some sounds can be removed. See Krisp.ai


Nvidia RTX Voice does something similar. It's much like other technology in that it focuses on removing background noise, and it actually works very well. It would definitely be interesting to see it filter the speech itself, but I feel this would be hard to do without introducing extra latency: if someone says "umm" or some other filler before a word, you kind of need to know what that word will be to determine whether it's filler or not. So it almost can't be done without adding latency, since it needs some future speech to make that call.
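The lookahead point can be made concrete: a streaming filter that needs N future frames to decide whether the current frame is filler must delay its output by exactly N frames. A minimal sketch, with a trivial stand-in classifier (nothing here reflects how RTX Voice or the OP's tool actually works):

```python
from collections import deque

def stream_with_lookahead(frames, lookahead, is_filler):
    """Yield non-filler frames with a fixed lookahead delay.

    `is_filler(current, future)` is a hypothetical detector that may
    inspect up to `lookahead` future frames; the output therefore
    lags the input by exactly `lookahead` frames.
    """
    buf = deque()
    for frame in frames:
        buf.append(frame)
        if len(buf) > lookahead:
            current = buf.popleft()
            if not is_filler(current, list(buf)):
                yield current
    # Flush: the last frames see fewer than `lookahead` future frames.
    while buf:
        current = buf.popleft()
        if not is_filler(current, list(buf)):
            yield current

# Stand-in classifier: drop frames literally labelled "umm".
out = list(stream_with_lookahead(["so", "umm", "hello"], 2,
                                 lambda cur, future: cur == "umm"))
```

The delay is structural, not an implementation detail: the first output frame cannot be emitted until `lookahead` further frames have arrived.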


To do this, the speaker would have to wear an EEG cap. You're talking about cutting the mic before a verbal tic happens.

With an EEG cap, though, I bet a smart person familiar with the methods could bash something together in a day that would work.


True. You don't even need a full cap, just some channels over the visual cortex (with more advanced AI). So you would just need to wear a headband, or one of those EEG headsets that look more elegant.


iZotope plugins already do some of these things, but not all. In particular, their de-clicking algorithm is pretty good, but it's definitely not automatic or low latency.


Do you do any audio segmentation to remove the filler words and such?


Based on the OP's username, surely one of the deep learning algorithms is a denoising autoencoder, right?
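For anyone unfamiliar: a denoising autoencoder is just a network trained to map a corrupted input back to the clean original. Here's a toy NumPy version on synthetic sine "frames" — purely illustrative, since the OP hasn't said what their architecture actually is, and every size and hyperparameter below is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 256, 32, 16                # snippets, frame length, hidden units

# Random-phase sine snippets stand in for "clean audio" frames.
t = np.linspace(0, 2 * np.pi, d)
phase = rng.uniform(0, 2 * np.pi, (n, 1))
clean = np.sin(phase + t)
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

# One tanh hidden layer, linear output, full-batch gradient descent.
W1 = 0.1 * rng.standard_normal((d, h))
W2 = 0.1 * rng.standard_normal((h, d))
lr = 0.05
for _ in range(8000):
    z = np.tanh(noisy @ W1)          # encode the noisy frame
    out = z @ W2                     # decode
    err = out - clean                # MSE loss against the CLEAN frame
    gW2 = (z.T @ err) / n
    gW1 = (noisy.T @ (err @ W2.T * (1 - z ** 2))) / n
    W1 -= lr * gW1
    W2 -= lr * gW2

mse_before = np.mean((noisy - clean) ** 2)
mse_after = np.mean((np.tanh(noisy @ W1) @ W2 - clean) ** 2)
```

The key trick is in the loss: the input is noisy but the target is clean, so the network has to learn to strip the noise rather than reproduce its input.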



