I am really impressed with what Nvidia is doing here.
I think there is a huge market for improving sound quality in video calls.
For me, roughly every second call I make suffers from some kind of "bad audio" problem. Breathing, reverb, noise, clipping, too quiet: there are so many things that can go wrong.
And this really harms the productivity of video calls.
I have started building tools to detect all of these sources of bad audio and am collecting them at https://www.tinydrop.io. Maybe these APIs can help people improve their setup. But if software like Nvidia's comes along and just fixes the problem once and for all, that's great as well!
Disclosure: I'm the author of the blog post and co-founder at 2Hz.
This is a guest post on NVIDIA Developer Blog. The author of the technology is a startup called 2Hz (2hz.ai). Our passion is to improve voice audio quality in audio/video calls. It's a tough problem but also fun to work on.
Agree, breathing, reverb, noise are all problems and should be fixed. We started with noise and already shipped a product you can try on your Mac. The app is called Krisp (krisp.ai).
Hi! As someone who seems to struggle more than most to understand people on video calls, I'd like to give you my impressions.
Something struck me about the sample video. The very first sample included background noise, but it was very easy to understand regardless of the noise, probably because it was recorded with a pro microphone rather than a phone. Every other sample was far more difficult, regardless of noise removal. Noise removal doesn't really seem to help; in fact, any imperfections in the noise removal process actually make the audio more difficult to understand, because I have to guess not only at the speaker's voice and the noise but also at what the noise-removal algorithm did.
What does help me is low frequency pickup. I think the first sample is easy because there are plenty of low frequency components that are later lost through the phone.
Low frequencies are presumably difficult to pick up due to the size of the microphone in a phone, but could there be a way to restore those frequencies through audio processing? It would be interesting to analyze the response of specific microphones to specific low frequencies and find patterns that an audio processor could use to restore the low frequency components.
Anyway, kudos for doing some very interesting work. I don't know how representative my experience is.
In my experience it's the loss (or masking) of high frequencies that are the most problematic for understanding speech. The most important sounds in speech are consonants, which are higher frequency sounds. Combine this with foreign accents, and more often than not conference calls quickly degenerate into an unintelligible babble (for me, at least).
> I don't know how representative my experience is.
As someone who works with speech content, this seems unusual. Typically, low frequencies are reduced because there's not much useful voice signal there—for example, NPR typically rolls off frequencies below 250 Hz.
Here's something concrete: the first phrase in the video ends with "small demonstration", but starting with the second instance, I distinctly hear "sall" instead of "small". In the version with the noise, the "m" sounds like an aberration of the noise and is detectable. With the noise removed, the "m" is replaced with a blip that sounds like an encoding error.
Hi, I don't know if you're taking unsolicited requests, but here are some toggle options I would want a system like this to offer (enabled by default):
* Do not send whispers. If I am a primary speaker, and I switch to address someone local to my side of the call via a whisper, that audio should be effectively muted to the other side.
* Focus muting. If I look away from the screen and begin addressing someone off camera, away from the mic, mute that as well.
* Bark and siren filtering. Specifically able to ID and mute barks and sirens. (Planes, motorcycles and trucks would be awesome)
---
What is your impression of the company and app Temi?
Sorry, I noticed that too late. I'm currently reading the paper linked in the blog post. Are there other resources on the topic that you can recommend?
Ironically I've had the audio cleanup filters get in the way.
I was trying to make a program which would FFT sounds from my mic and trigger on certain frequencies or combinations of frequencies. The idea was to have audio files on my phone act as a sort of poor man's remote control. Yes, I know there are wifi and bluetooth ways to do it, but I wanted to experiment a bit with sound.
Anyway, I was pissing around for hours with packages and settings and couldn't get the damn thing to work. Turns out my computer came with some super fancy Beats Audio™ sound system which actively suppresses constant-frequency microphone input under the assumption that it's an unwanted buzzing sound.
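For what it's worth, the trigger part of that idea fits in a few lines of Python. This is a hedged sketch with made-up names and thresholds; it assumes you can read raw mic samples (e.g. via the sounddevice package) and that no "smart" audio driver is suppressing steady tones before they reach you.

    import numpy as np
    import sounddevice as sd

    RATE = 44100          # samples per second
    CHUNK = 4096          # samples per analysis window
    TARGET_HZ = 1000.0    # tone we want to trigger on (assumed value)
    THRESHOLD = 50.0      # magnitude threshold, needs tuning per setup

    def tone_present(samples):
        # Magnitude spectrum of one chunk of mono audio
        spectrum = np.abs(np.fft.rfft(samples))
        freqs = np.fft.rfftfreq(len(samples), d=1.0 / RATE)
        # Look at the bin closest to the target frequency
        bin_idx = np.argmin(np.abs(freqs - TARGET_HZ))
        return spectrum[bin_idx] > THRESHOLD

    with sd.InputStream(samplerate=RATE, channels=1, blocksize=CHUNK) as stream:
        while True:
            samples, _overflowed = stream.read(CHUNK)
            if tone_present(samples[:, 0]):
                print("trigger!")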
And don't forget hearing aids. That's a market that's only going to keep getting bigger over time. The first people to ship a super-power-efficient ASIC for wind/restaurant denoising which allows reasonable hearing aid battery life are going to make a well-deserved fortune.
Some unsolicited feedback: when you list your features, could you put the text under the image? It's a bit tedious to click on each image to get a quick idea about the product. The text is worth a lot more than the picture but is completely hidden.
Hey, yeah, this is a terrible experience currently and I am planning to fix it, of course. I'm simply still using the stock images that came with the theme...
A mic element costs about thirty cents, and the processing power required for noise cancellation already exists in the CPU of the mobile device.
I think it's particularly interesting that Amazon has made microphone arrays remarkably cheap, due to Alexa. MiniDSP offers a microphone array for under $100, which is an unheard-of price considering what these cost ten years ago.
Apple sold over 200 million iPhones last year. That's $60 million in saved BOM costs. At the scale that smartphones are sold, saving pennies here and there adds up to millions in additional revenue.
Obviously there's costs to running servers somewhere, but that hasn't stopped companies from making similar decisions for a variety of other services.
Isn't it addressed in the article that multi-microphone filtering works when there is physical distance between the mics, which on a phone is around 4 inches?
What is that on a watch? An inch? Maybe one can be put on the band. Also on a phone the other mic is on the opposite side of the phone so it isn’t directly in the line of fire from your voice.
The Matrix Creator has an 8-mic array as well, plus a bunch of other hardware goodies, for about the same price. No experience with it, but it looks cool.
How does multi microphone filtering work? I guess they localize different sound sources by cross-correlation (to get the timings) and triangulation (based on the timings and the speed of sound)?
I think the key phrase (or perhaps just one of them) is "beamforming". A single microphone element has a certain sensitivity pattern (e.g. it may be a very directional microphone, or be equally sensitive in all directions). With multiple pick-ups, you can emulate different sensitivity patterns.
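A minimal delay-and-sum sketch of that idea (toy two-mic geometry and assumed numbers, not how any specific product does it): delay one channel so sound arriving from the steered direction adds in phase, which boosts that direction relative to everything else.

    import numpy as np

    SPEED_OF_SOUND = 343.0   # m/s
    RATE = 16000             # Hz
    MIC_SPACING = 0.10       # metres between the two mics

    def delay_and_sum(ch_left, ch_right, angle_deg):
        # Extra path length to the far mic for a source at angle_deg
        # (0 degrees = broadside, 90 degrees = along the mic axis).
        delay_sec = MIC_SPACING * np.sin(np.radians(angle_deg)) / SPEED_OF_SOUND
        delay_samples = int(round(delay_sec * RATE))
        # Shift one channel and average; signals from the steered direction
        # line up, everything else is partially cancelled.
        # (np.roll wraps around at the edges, which is fine for a sketch.)
        shifted = np.roll(ch_right, -delay_samples)
        return 0.5 * (ch_left + shifted)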
A related idea in radar is synthetic-aperture radar (SAR).
A lot of the interesting things in audio were inspired by radar. Dan Wiggins at Sonos used to work on radar, and Don Keele created a loudspeaker technology called "CBT" that's based on radar technology.
Because microphones are basically the inverse of loudspeakers, what works in loudspeaker arrays can also work in microphone arrays.
When you record with a single microphone, you are going to pick up a great deal of background noise. This is because the mic will pick up the person speaking AND the background noise; there's no way to differentiate the two.
With two microphones, we know the following:
1) we know where the microphones are
2) we have a general idea where the person's mouth is, because we know how they hold the phone
Based on that, we have a good idea of how long it should take for the sound to arrive, because the speed of sound is a fixed number.
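Back-of-the-envelope, with assumed distances, that timing argument looks like this:

    # Assumed geometry: mouth ~5 cm from the main mic, ~15 cm from the second.
    speed_of_sound = 343.0                  # m/s
    extra_path = 0.15 - 0.05                # metres of extra travel to mic 2
    delay = extra_path / speed_of_sound     # ~0.29 ms
    samples_at_48k = delay * 48000          # ~14 samples
    # Sound that doesn't show roughly this inter-mic delay is probably not
    # coming from the talker's mouth, so it can be attenuated as noise.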
The first time I ever heard a dual mic phone was when one of my coworkers made a call from the inside of our data center. Typically, he'd have to shout into the phone, because the data center was so noisy, and worst of all, the noise was completely random and broadband. But with dual mics, poof, background noise is gone. It was almost like he was speaking in a quiet room.
Amazon Alexa takes this quite a bit further, and uses something called "beamforming." What beamforming allows you to do is to determine WHERE the person is in the room, based on the arrival times of the sound. It's sort of the inverse of a dual mic setup; in a dual mic setup we can 'clean up' the signal because we know where the person speaking is. In a beamforming arrangement, we can use the arrival times to FIGURE OUT where the person is in the room.
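Roughly, the "figure out where the person is" step can be sketched like this (toy Python with assumed geometry, not any product's actual pipeline): estimate the time difference of arrival between two mics with cross-correlation, then convert the delay into a direction.

    import numpy as np

    RATE = 16000              # Hz
    MIC_SPACING = 0.10        # metres between the two mics
    SPEED_OF_SOUND = 343.0    # m/s

    def estimate_angle(ch_a, ch_b):
        # Peak of the cross-correlation gives the lag (in samples) of ch_a
        # relative to ch_b.
        corr = np.correlate(ch_a, ch_b, mode="full")
        lag = np.argmax(corr) - (len(ch_b) - 1)
        delay_sec = lag / RATE
        # Convert the delay into a direction of arrival relative to broadside.
        sin_theta = np.clip(delay_sec * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
        return np.degrees(np.arcsin(sin_theta))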
If some security company was clever, they could probably use a beamforming microphone array to train a camera on people in the room.
And keep in mind, Alexa beamforming is two dimensional, but you could go crazy and do a 3D beamforming array if you wanted to! (Alexa only knows where you are on a horizontal plane.)
That does sound neat. Sounds like it could be combined with 3D localization to allow eavesdropping on particular sound sources, even in large rooms with lots of people talking. Multipath/ghosting might make precise localization difficult, though.
You can do this with ICA (independent component analysis, a somewhat lesser-known, non-Gaussian cousin of principal component analysis). Basically, you take data with multiple mixed components and break it down into its constituent parts.
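A toy illustration using scikit-learn's FastICA, with synthetic signals standing in for voice and noise (real multi-mic separation is messier, but the shape of the computation is the same): two "microphones" each hear a different mixture of two sources, and ICA recovers the sources.

    import numpy as np
    from sklearn.decomposition import FastICA

    rate = 16000
    t = np.arange(rate * 2) / rate
    voice_like = np.sin(2 * np.pi * 300 * t)          # stand-in for speech
    noise_like = np.sign(np.sin(2 * np.pi * 50 * t))  # stand-in for hum/noise

    # Each "microphone" hears a different linear mixture of the two sources.
    mixing = np.array([[1.0, 0.5],
                       [0.4, 1.0]])
    mics = np.c_[voice_like, noise_like] @ mixing.T   # shape (samples, 2)

    ica = FastICA(n_components=2, random_state=0)
    separated = ica.fit_transform(mics)               # shape (samples, 2)
    # One column of `separated` is now mostly voice, the other mostly noise
    # (up to scaling and ordering, which ICA cannot determine).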
This is amazing! Full props to the Nvidia team that accomplished this.
I downloaded the Mac app they provided [1], which I highly suggest everyone with a Mac try out. I ran it on my old MacBook Air 2013 using daily.co. It worked like a charm. Definitely using this in the next group chat, where there is always someone who forgets to turn off their microphone.
One cool side effect is that it actually removes the feedback that happens when you have two computers on the same call, where the mics keep picking up the output of the other computer and a loud high-pitched noise builds up (which I'm sure we've all experienced). The system simply removed it, and I didn't even know it was there until I turned off the app.
Amazing work, and I really hope that Skype, Apple, Google, etc. implement this into their voice apps, or even that phone providers build this into phones. Maybe in the future we can actually have phone conversations in windy weather and on the streets.
An interesting human problem that I imagine would come up here is that the speaker could be getting distracted with all the noise (crying baby/siren/etc.) while the listener would have no idea what's going on and think the speaker is being confused/dumb/slow/etc... very curious how this would play out in real conversations!
I downloaded the Mac app, configured a virtual device to send the system output to the "Krisp Speaker", and verified that it cuts most of the music out of what I'm listening to, leaving only the voice (at somewhat degraded quality). I wish I could configure it to _cancel_ ambient noise, not just remove it from the input signal.
It won't be quick enough for phase cancellation, but presumably you could diff the output against the input, phase-invert it, and get the signal you want that way.
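Something like this, assuming you can capture both the raw input and the cleaned output as aligned arrays at the same sample rate (which is the hard part in practice):

    import numpy as np

    def extract_noise(raw, cleaned):
        # The residual is everything the denoiser removed; phase-inverting it
        # just flips the sign, which is what you'd feed to a cancellation path.
        residual = raw - cleaned
        return -residual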
I can imagine a codec which would suppress noise, recognize the speech, and send the text along with the voice information, so that in case of a broken signal the codec on the receiving end could reconstruct the speech from the text, applying the speaker's voice features via style transfer.
So, if my voice is distorted on a broken line, it would be reconstructed from the text, and the reconstruction would sound like me.
I guess it will be the ultimate 1kbps codec.
My friend is an airline mechanic. One thing his coworkers all had was the Jawbone headset. This was back in like 2008-2009. He said he could call up a mechanic working right next to a turbine while it was running and hear him crystal clear. I wonder if any of that technology, paired with software like this, will make it so there is zero noise in calls. Maybe an implanted bone mic.
IIRC, that technology originated in fighter jets and worked its way down to consumer goods. The company sold Bluetooth headsets for a while, but multi-mic solutions were cheaper and worked better. They tried to hang in there for a few years, diversifying into consumer electronics like the "Jambox."
Really impressive results, though I wish they had gone more into the deep learning part of it (but I guess that's probably the secret sauce).
Can't help but notice how well Nvidia is positioned for what appears to be a growing wave of demand for GPUs. Surprised this isn't reflected in their share price (feels like they could be the next Intel, but what do I know).
Dedicated chips for machine learning (inference) are being developed by many companies.
The hope is that these will be used instead of (or in addition to) GPUs for ML tasks.
Not that Nvidia is poorly positioned. In fact, I expect that if dedicated ML chips work out, Nvidia will also put one on the market.
Have you tried SoliCall Pro (http://solicall.com/solicall-pro/)? Once installed, it uses virtual audio device technology to improve the audio, with multiple options like NR, PNR, RNR, and more.
Does this deep learning noise cancelling also work for music with headphones? If so, can we ditch proprietary noise-cancelling headphones and just use our phones?
Not really. What works inside noise canceling headphones is a very different technology, called Active Noise Cancellation (ANC). You don't necessarily need Machine Learning to solve the problem.
The technology described in the blog post is for suppressing the noise which goes from your surrounding environment to the other participants of the call (and vice versa).
What others haven't mentioned, and I think is important, is that active noise-cancelling headphones don't try to modify your incoming audio. They listen to the outside environment and try to cancel external sound that you'd otherwise hear. For example, listening to a podcast on a noisy bus ride.
This sort of noise canceling tries to remove the unwanted noise that is already mixed in with the wanted sound in the same recording. For example, recording a podcast on a noisy bus ride.
If I understand that figure correctly, at 8 kHz with no latency one gets 12 dB of cancellation. With 50 ms latency, 0 dB. And above that frequency, the cancellation actually makes the noise worse. An analog cutoff filter would be needed.
Love it. Don’t really love the idea of audio contents of conversations being routed to a cloud server for processing though — needs to stay on-device for privacy.
Thanks for the heads up! I really like what you're doing - not only is it great for the general public, it's a game changer for people with difficulties hearing.
Their ultimate goal must be to be acquired by Apple, Google, or similar. This will never fly as a third-party app/install, even if it's "promised" to be on-device (such promises can change). Moreover, the average user isn't going to know about, care about, or seek out such an app, let alone pay for it. There's no widespread reach or profit in selling direct to consumer.
The only way this works is for it to be built into each device/OS (ie: firmware shipped by manufacturer). If the tech has merit, we'll be seeing it in a couple of years, whether that involves an acquisition here or independent R&D/patenting.
Agree, I wasn't suggesting users would download a special app to use it - I'd like to see the tech make its way into the OS for all voice input.
I can see why the cloud processing makes sense for certain applications / licensers / acquirers (e.g. a VoIP provider like Xoom), but voice comms is really the domain of smartphones, and my hunch is most are plenty powerful enough to do this processing locally.
> devblogs.nvidia.com uses an invalid security certificate. Certificates issued by GeoTrust, RapidSSL, Symantec, Thawte, and VeriSign are no longer considered safe because these certificate authorities failed to follow security practices in the past.
> Certificates issued by GeoTrust, RapidSSL, Symantec, Thawte, and VeriSign are no longer considered safe because these certificate authorities failed to follow security practices in the past.
Those are some pretty big names. Names where reasonable companies could believe that nobody would ever dare enforce the rules against them, because it would break the Web.
Who says nobody ever got fired for buying IBM?
Heck. If the for-pay CAs keep screwing up, Let's Encrypt could become the sane, reasonable, conservative choice, even among the most Enterprise of Enterprise Enterprises.
The story title is "AI powered Noise Cancellation" but the text never uses the term "AI" at all. It's deep (machine) learning. It doesn't need the useless marketing bonus term "AI" to make it better — it's already interesting enough without.