
For those who haven't dug into the audio analysis and synthesis field, it is really hard to understand why speech recognition is so complex. As living beings, we interpret sounds in a fundamentally different way than computers. Our brain has the ability to interpret sound-waves and provide meaning and context to them without us even realizing it. Once you hear a cat's meow, even if it happens in a different frequency range later, you can still associate it as a cat's meow.

This is what computers "hear": http://www.ling.ed.ac.uk/images/waveform.gif - in fact, this is ALL they can hear. When you hear 44.1 kHz audio, that is 44,100 samples of information per second (for 8-bit mono audio, each sample is one byte holding a value between -128 and +127). There is no hidden metadata behind the waveform. At 100% zoom level, the waveform IS the audio. Theoretically, you could take a screenshot of a 100% zoomed waveform and convert it back to actual sound with absolutely no loss of data (of course, you'd need a really high resolution to show a graph 44100x256 pixels in size). Now, given such a waveform, how would you convert it to plain text?
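
To make that concrete, here is a minimal sketch in Python (the values are made up; it assumes 8-bit mono audio as described above) of what one second of sound boils down to: a row of plain numbers and nothing else.

    import numpy as np

    sample_rate = 44100                       # samples per second
    t = np.arange(sample_rate) / sample_rate  # one second of timestamps
    tone = np.sin(2 * np.pi * 440 * t)        # a 440 Hz tone, values in [-1, 1]

    # Quantize to 8-bit signed integers (-128..127), as described above.
    samples = np.round(tone * 127).astype(np.int8)

    print(samples[:10])   # just numbers, no hidden metadata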

As an example, try recording "ships" and "chips" into a mic and view the waveforms. See if there are any patterns you can identify between the two. I've done it over a hundred times. There is no easy way to discern whether the sound was "sh" or "ch", yet our brain does it effortlessly thousands of times every day. So, failing easy pattern recognition, we have to use frequency analysis, DFT, and tons of AI.
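
For the curious, this is roughly what the first step of that frequency analysis looks like. It's only a sketch: the filenames are made up, and it assumes two mono recordings at the same sample rate.

    import numpy as np
    from scipy.io import wavfile

    def frame_spectrum(path, frame_ms=25):
        rate, samples = wavfile.read(path)       # 1-D array of samples (mono)
        frame_len = int(rate * frame_ms / 1000)  # ~25 ms analysis window
        frame = samples[:frame_len].astype(float)
        frame *= np.hanning(frame_len)           # taper the edges before the DFT
        return np.abs(np.fft.rfft(frame))        # magnitude spectrum

    ships = frame_spectrum("ships.wav")          # hypothetical recordings
    chips = frame_spectrum("chips.wav")
    print("spectral distance:", np.linalg.norm(ships - chips))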

My unscientific gut feeling is that we're going about all of this the wrong way. We are using the wrong tools. Discrete, digital computers will never be able to tackle problems like pattern recognition in their current state, and switching to analog isn't going to improve anything either. I don't know what the correct instruments/devices will be, but I know programming them will be very different.




As much as I agree with you in general... why this? "So, failing easy pattern recognition, we have to use frequency analysis, DFT, and tons of AI."

It's not like we really hear the waves themselves. We mainly hear pitch and volume (demonstrated nicely by playing simple impulses at rates above ~10 Hz). So it's not that we fail at easy pattern recognition on waves; I'd say the raw waveform is something we should not be looking at. Ever. Especially since many different waveforms sound the same (they sound the same because they look nearly identical in the frequency domain).

Using frequency analysis, the DFT, etc. is the way to approach it.
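
To illustrate the parenthetical above, here's a small sketch: two waveforms that clearly differ in the time domain but have the same magnitude spectrum (only the phase of one component differs), and which sound essentially identical.

    import numpy as np

    rate = 44100
    t = np.arange(rate) / rate
    a = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 300 * t)
    # The same two tones, with the 300 Hz component shifted by a quarter cycle:
    b = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 300 * t + np.pi / 2)

    print(np.max(np.abs(a - b)))              # about 1.41: the waves look different
    print(np.allclose(np.abs(np.fft.rfft(a)),
                      np.abs(np.fft.rfft(b)),
                      atol=1e-6))             # True: identical magnitude spectra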


> Using frequency analysis, the DFT, etc. is the way to approach it.

Indeed. I meant to say there is no shortcut that immediately makes speech recognition easy to solve. Hence, we have to use a lot of advanced math that works pretty well but, like the article says, not as well as a typical human.

However, I don't think humans are that great either. The proof becomes evident when working on projects with people from around the world. Ask 10 people from around the world to dictate 10 different paragraphs that the other nine have to write down. I doubt they'll reach the 98% accuracy that the article states, especially if there is no domain constraint. Understanding what another person says is hard. Put a Texan and a South Indian in the same room and see what I mean. Of course, this doesn't mean it's not fun and interesting for computer scientists. It's just hard for most people to realize why the stupid automated voice attendant can't understand that they said "next" and not "six".


It works quite well if you don't have to care about the accent though.

Trivia: "Automatic computer speech recognition now works well when trained to recognize a single voice, and so since 2003, the BBC does live subtitling by having someone re-speak what is being broadcast." (http://en.wikipedia.org/wiki/Closed_captioning)

But I'm pretty sure they started using it more broadly at some point, because you can see the recognition quality drop when there's an interview with someone from outside the UK.


"... Understanding what the other person says is hard. Put a Texan and South Indian in the same room and see what I mean. ..."

RP [0] [1] is an attempt to solve this "regional dialect" problem, and therein may lie a clue.

[0] http://www.bbc.co.uk/dna/h2g2/classic/A657560 [1] http://en.wikipedia.org/wiki/Received_Pronunciation


It's not actually clear what we hear at this point. There is evidence that we respond to something like frequencies, pitch, volume, and other features, but the jury is still out on the low-level signal processing that occurs in the cochlea and the primary auditory cortex. What's happening further downstream in the brain is even less clear.


The speech recognition example from a Nexus One phone elsewhere in the comment thread was pretty good at the level of individual words. Pattern matching learned from a massive training corpus seems to go pretty far.

The example didn't make sense at the level of sentences, though. Problems start with homophones. Speech is pretty noisy, and there are words whose sounds you can't distinguish even in theory, but which still need to be picked correctly when transcribing speech. My guess is that humans are good at this because they reconstruct a word string that roughly matches the sounds heard and also seems to be saying something sensible. (You could test this easily by having one person read out computer-generated word-soup nonsense and having another person listen and transcribe what they hear.)

The big problem with speech recognition would then be that recognizing meaningful messages among the various likely interpretations of the speech sounds is pretty much AI-complete in the general case. You could try the Google Translate approach and use a huge amount of existing text to teach the system about the kinds of sentences people write a lot, but this is still a bit limited.
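
A toy sketch of that idea (the corpus and the candidate transcriptions are invented): among several acoustically plausible word strings, prefer the one whose adjacent word pairs show up most often in existing text.

    from collections import Counter

    corpus = "i want to write a letter to my friend and i want two stamps".split()
    bigrams = Counter(zip(corpus, corpus[1:]))

    def score(sentence):
        words = sentence.split()
        # How familiar are the adjacent word pairs, given the existing text?
        return sum(bigrams[pair] for pair in zip(words, words[1:]))

    candidates = ["i want to write", "i want two write", "i want too write"]
    print(max(candidates, key=score))   # -> "i want to write"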

Things might work better if the language being recognized is assumed to be somewhat artificial to begin with. There are a lot of stereotypical phrases the system could learn to expect in various professional voice protocols, like the ones doctors or aviators use. The system might also work with a computer command language that has a regular grammar, rejecting unparseable interpretations of the speech sounds, but that might not be very comfortable to use.
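
Something along these lines, say (the grammar and the candidate transcriptions are invented for illustration): any interpretation that doesn't parse under the command grammar is simply thrown away.

    import re

    # A tiny regular command language, e.g. "call home", "dial 4 2", "play next".
    GRAMMAR = re.compile(r"call (home|office)|dial( \d)+|play (next|previous)")

    # Transcriptions an acoustic model might consider plausible:
    candidates = ["dial for two", "dial 4 2", "tile 4 2"]

    accepted = [c for c in candidates if GRAMMAR.fullmatch(c)]
    print(accepted)   # -> ['dial 4 2']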


I'd like to recommend a book called 'The Acoustical Foundations of Music'; it has some excellent chapters on how the ear works and what sound does when it enters the ear.

No speech recognition software that I'm aware of uses the 'raw' waveform; there is always a domain transformation before analysis even begins, usually to frequency/amplitude using an FFT.
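
Roughly like this (a bare-bones short-time Fourier transform sketch, not any particular recognizer's front end): slice the signal into short overlapping frames and take an FFT of each, turning the raw wave into a time-by-frequency magnitude array.

    import numpy as np

    def spectrogram(samples, rate, frame_ms=25, hop_ms=10):
        frame = int(rate * frame_ms / 1000)
        hop = int(rate * hop_ms / 1000)
        window = np.hanning(frame)
        frames = [samples[i:i + frame] * window
                  for i in range(0, len(samples) - frame, hop)]
        return np.abs(np.fft.rfft(frames, axis=1))  # rows: time, cols: frequency

    # Synthetic input (a 1 kHz tone); real input would be recorded speech.
    rate = 16000
    t = np.arange(rate) / rate
    spec = spectrogram(np.sin(2 * np.pi * 1000 * t), rate)
    print(spec.shape)   # (number of frames, frame_length // 2 + 1)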


Indeed, computers are not good at listening. Even pitch detection is hard. Even monophonic pitch detection is surprisingly so (I just finished reading a bunch of papers about that).
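
For anyone curious, here's a bare-bones autocorrelation pitch detector (a sketch, not a robust one). It works on a clean synthetic tone, but this is exactly the kind of approach that trips over octave errors, noise, and note onsets.

    import numpy as np

    def detect_pitch(samples, rate, fmin=50.0, fmax=1000.0):
        samples = samples - samples.mean()
        corr = np.correlate(samples, samples, mode="full")[len(samples) - 1:]
        lo, hi = int(rate / fmax), int(rate / fmin)
        lag = lo + np.argmax(corr[lo:hi])   # strongest repetition period
        return rate / lag                   # period -> frequency in Hz

    rate = 44100
    t = np.arange(rate // 10) / rate        # 100 ms of a 220 Hz tone
    print(detect_pitch(np.sin(2 * np.pi * 220 * t), rate))   # roughly 220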


What word(s) are we looking at in the .gif?


I just linked to a random Google image search result for "waveform" to highlight what I was talking about. However, looking at the gif, you can't even tell whether it's music, a human voice, machine noise, or something completely different.


This is precisely why no one uses the raw waveform. Speech recognizers and phoneticians alike use something which looks more like this: http://commons.wikimedia.org/wiki/File:Spectrogram_-iua-.png Speech recognition uses a couple more fancy transforms, plus the derivatives, but you get the idea.
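
Very roughly, the feature pipeline has this shape. This is only a sketch of the idea (frame-wise log spectra plus their frame-to-frame derivatives); real recognizers use mel-scaled filterbanks and cepstral coefficients, which this skips.

    import numpy as np

    def features(spectra):
        # spectra: 2-D array, rows = frames, columns = frequency bins
        logspec = np.log(spectra + 1e-8)        # compress the dynamic range
        delta = np.diff(logspec, axis=0,        # per-frame derivative,
                        prepend=logspec[:1])    # padded to keep the frame count
        return np.hstack([logspec, delta])      # static + dynamic features

    spectra = np.abs(np.random.randn(100, 129))  # stand-in for real spectral frames
    print(features(spectra).shape)               # -> (100, 258)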

And the claim that it can't be right because computers are discrete is just ludicrous. Neuron firing patterns are just as discrete as bits in a computer; and at a higher level, it doesn't matter much whether you're looking at (continuous) firing rates or (continuous) floating-point numbers.



