
Without giving away your secret sauce, what are your approaches to the cleaning process? Is it a combination of different passes of algos or is it something more generic and "sausage machine-like" like a neural network?



The audio is edited in several phases. It uses different algorithms, but most of them are deep learning based. It is surely overengineered, but as a data scientist I find the ML the most fun part.


How is the latency and, if it's sufficiently low, could this realistically be applied to "nearly live" content?

That scenario seems really appealing for conferences, even if it just quietens down the verbal tics, but I'm guessing that if the lag is too great it would feel like a bad lip-sync issue.


How does real time make sense in the first place for an algorithm that takes 1 minute of audio and gives you back 50s? You're going to have to fill the gaps with something not meaningful anyway.


An awareness of your point was precisely why I mentioned "quietening down verbal tics" (i.e. 1 minute gives you back 1 minute, but with the tics removed/muffled).

To me this seems like it could be worthwhile even if it just results in silence or less prominent umms and other filler. I've sat through enough conference talks by technically gifted people whom I very much wanted to hear, but who unfortunately make their talks much harder to follow because of the tics. It might even help relax some nervous speakers if they knew that any tics that crept in were being suppressed.


I understand now. Very niche, but I applaud the effort to give voice to people who have something to say, instead of those who know how to talk in public.


Silence is meaningful, but pretty awkward when not deliberate!


Tools like this are designed to remove awkward silences.

What it sounds like the GP is after is something more like hiss and pop removal (to use an old vinyl analogy), and that's a different, and also simpler, problem to solve. I'd wager there are already tools on the market for that.
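To make the "simpler problem" claim concrete: classical de-clicking can be done without any ML by flagging samples that sit far from their local median and patching them. This is a toy sketch only; the window size and threshold are arbitrary assumptions, and real de-click tools are far more sophisticated.

```python
import numpy as np

def declick(signal, window=5, threshold=4.0):
    """Replace outlier samples (clicks/pops) with the local median.

    A sample is flagged when its distance from the median of its
    `window`-sample neighbourhood is much larger than the average
    such distance across the whole signal.
    """
    sig = np.asarray(signal, dtype=float)
    pad = window // 2
    padded = np.pad(sig, pad, mode="edge")
    # Local median centred on each sample.
    med = np.array([np.median(padded[i:i + window]) for i in range(len(sig))])
    dev = np.abs(sig - med)
    scale = np.mean(dev) + 1e-12
    out = sig.copy()
    mask = dev > threshold * scale
    out[mask] = med[mask]           # patch flagged samples
    return out

# Demo: a clean sine with one injected pop.
t = np.linspace(0, 1, 200)
x = np.sin(2 * np.pi * 5 * t)
x[50] += 3.0                        # the "pop"
y = declick(x)
```

After running this, the pop at index 50 is replaced by a value close to its smooth neighbours, while the rest of the signal passes through untouched.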


Very insightful :). Now I need an AI to tell me when silence is deliberate or not. :)


It would be a huge engineering endeavour, which I wouldn't be capable of doing. That said, things like background noise and some sounds can be removed. See Krisp.ai


Nvidia RTX Voice does something similar. It's much like other technology in that it focuses on removing background noise, and it actually works very well. It would definitely be interesting to see it filter the speech itself, but I feel this would be hard to do without introducing extra latency: if someone says "umm" or some other filler before a word, you kind of need to know what that word will be to determine whether it's filler or not. So it almost can't be done without adding latency, since it needs some future speech to make that call.
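The lookahead point can be made concrete: a streaming filter that needs N future frames to decide whether the current frame is filler must delay its output by exactly N frames. A minimal sketch, with a trivial stand-in classifier (nothing here reflects how RTX Voice or the OP's tool actually works):

```python
from collections import deque

def stream_with_lookahead(frames, lookahead, is_filler):
    """Yield non-filler frames with a fixed lookahead delay.

    `is_filler(current, future)` is a hypothetical detector that may
    inspect up to `lookahead` future frames; the output therefore
    lags the input by exactly `lookahead` frames.
    """
    buf = deque()
    for frame in frames:
        buf.append(frame)
        if len(buf) > lookahead:
            current = buf.popleft()
            if not is_filler(current, list(buf)):
                yield current
    # Flush: the last frames see fewer than `lookahead` future frames.
    while buf:
        current = buf.popleft()
        if not is_filler(current, list(buf)):
            yield current

# Stand-in classifier: drop frames literally labelled "umm".
out = list(stream_with_lookahead(["so", "umm", "hello"], 2,
                                 lambda cur, future: cur == "umm"))
```

The delay is structural, not an implementation detail: the first output frame cannot be emitted until `lookahead` further frames have arrived.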


To do this, the speaker would have to wear an EEG cap. You're talking about cutting the mic before a verbal tic happens.

With an EEG cap, though, I bet a smart person familiar with the methods could bash something together in a day that would work.


True. You don't even need a full cap, just some channels over the visual cortex (with more advanced AI). So you would just need to wear a headband, or one of those EEG headsets that look more elegant.


iZotope plugins already do some of these things, but not all. In particular, their de-clicking algorithm is pretty good, but it's definitely not automatic or low latency.


Do you do any audio segmentation to remove the filler words and such?


Based on the OP's username, surely one of the deep learning algorithms is a denoising autoencoder, right?
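For anyone unfamiliar: a denoising autoencoder is just a network trained to map a corrupted input back to the clean original. Here's a toy NumPy version on synthetic sine "frames" — purely illustrative, since the OP hasn't said what their architecture actually is, and every size and hyperparameter below is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 256, 32, 16                # snippets, frame length, hidden units

# Random-phase sine snippets stand in for "clean audio" frames.
t = np.linspace(0, 2 * np.pi, d)
phase = rng.uniform(0, 2 * np.pi, (n, 1))
clean = np.sin(phase + t)
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

# One tanh hidden layer, linear output, full-batch gradient descent.
W1 = 0.1 * rng.standard_normal((d, h))
W2 = 0.1 * rng.standard_normal((h, d))
lr = 0.05
for _ in range(8000):
    z = np.tanh(noisy @ W1)          # encode the noisy frame
    out = z @ W2                     # decode
    err = out - clean                # MSE loss against the CLEAN frame
    gW2 = (z.T @ err) / n
    gW1 = (noisy.T @ (err @ W2.T * (1 - z ** 2))) / n
    W1 -= lr * gW1
    W2 -= lr * gW2

mse_before = np.mean((noisy - clean) ** 2)
mse_after = np.mean((np.tanh(noisy @ W1) @ W2 - clean) ** 2)
```

The key trick is in the loss: the input is noisy but the target is clean, so the network has to learn to strip the noise rather than reproduce its input.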



