
For those who haven't dug into the audio analysis and synthesis field, it is really hard to understand why speech recognition is so complex. As living beings, we interpret sounds in a fundamentally different way than computers. Our brain has the ability to interpret sound-waves and provide meaning and context to them without us even realizing it. Once you hear a cat's meow, even if it happens in a different frequency range later, you can still associate it as a cat's meow.

This is what computers "hear": http://www.ling.ed.ac.uk/images/waveform.gif - in fact, this is ALL they can hear. When you hear 44.1 kHz audio, that is 44,100 samples of information per second (for 8-bit mono audio, each sample is one byte holding a value between -128 and +127). There is no hidden metadata behind the waveform. At 100% zoom level, the waveform IS the audio. Theoretically, you could take a screenshot of a 100% zoomed waveform and convert it back to actual sound with absolutely no loss of data (of course, you'd need a really high resolution to show a graph 44100x256 pixels in size). Now, given such a waveform, how would you convert it to plain text?
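
To make that concrete, here is a minimal sketch in Python (the values are made up; it assumes 8-bit mono audio as described above) of what one second of sound boils down to: a row of plain numbers and nothing else.

    import numpy as np

    sample_rate = 44100                       # samples per second
    t = np.arange(sample_rate) / sample_rate  # one second of timestamps
    tone = np.sin(2 * np.pi * 440 * t)        # a 440 Hz tone, values in [-1, 1]

    # Quantize to 8-bit signed integers (-128..127), as described above.
    samples = np.round(tone * 127).astype(np.int8)

    print(samples[:10])   # just numbers, no hidden metadata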

As an example, try recording "ships" and "chips" into a mic and view the waveforms. See if there are any patterns you can identify between the two. I've done it over a hundred times. There is no easy way to discern whether the sound was "sh" or "ch", yet our brain does it effortlessly thousands of times every day. So, failing easy pattern recognition, we have to use frequency analysis, DFT, and tons of AI.
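
For the curious, this is roughly what the first step of that frequency analysis looks like. It's only a sketch: the filenames are made up, and it assumes two mono recordings at the same sample rate.

    import numpy as np
    from scipy.io import wavfile

    def frame_spectrum(path, frame_ms=25):
        rate, samples = wavfile.read(path)       # 1-D array of samples (mono)
        frame_len = int(rate * frame_ms / 1000)  # ~25 ms analysis window
        frame = samples[:frame_len].astype(float)
        frame *= np.hanning(frame_len)           # taper the edges before the DFT
        return np.abs(np.fft.rfft(frame))        # magnitude spectrum

    ships = frame_spectrum("ships.wav")          # hypothetical recordings
    chips = frame_spectrum("chips.wav")
    print("spectral distance:", np.linalg.norm(ships - chips))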

My unscientific gut feeling is that we're going about all of this the wrong way. We are using the wrong tools. Discrete, digital computers will never be able to tackle problems like pattern recognition in their current state, and switching to analog isn't going to improve anything either. I don't know what the correct instruments/devices will be, but I know programming them will be very different.




As much as I agree with you in general... why this? "So, failing easy pattern recognition, we have to use frequency analysis, DFT, and tons of AI."

It's not like we really hear the waves themselves. We mainly hear pitch and volume (demonstrated nicely by playing simple impulses at rates above ~10 Hz). So it's not that we fail at easy pattern recognition on waves; I'd say the raw waveform is something we should not be looking at. Ever. Especially since many different waveforms sound the same (they sound the same because they look nearly identical in the frequency domain).

Using frequency analysis, the DFT, etc. is the way to approach it.
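
To illustrate the parenthetical above, here's a small sketch: two waveforms that clearly differ in the time domain but have the same magnitude spectrum (only the phase of one component differs), and which sound essentially identical.

    import numpy as np

    rate = 44100
    t = np.arange(rate) / rate
    a = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 300 * t)
    # The same two tones, with the 300 Hz component shifted by a quarter cycle:
    b = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 300 * t + np.pi / 2)

    print(np.max(np.abs(a - b)))              # about 1.41: the waves look different
    print(np.allclose(np.abs(np.fft.rfft(a)),
                      np.abs(np.fft.rfft(b)),
                      atol=1e-6))             # True: identical magnitude spectra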


> Using frequency analysis, the DFT, etc. is the way to approach it.

Indeed. I meant to say there is no shortcut that immediately makes speech recognition easy to solve. Hence, we have to use a lot of advanced math that works pretty well but, like the article says, not as well as a typical human.

However, I don't think humans are that great either. The proof becomes evident when working on projects with people from around the world. Ask 10 people from around the world to dictate 10 different paragraphs that the other nine have to write down. I doubt they'll reach the 98% accuracy that the article states, especially if there is no domain constraint. Understanding what another person says is hard. Put a Texan and a South Indian in the same room and see what I mean. Of course, this doesn't mean it's not fun and interesting for computer scientists. It's just hard for most people to realize why the stupid automated voice attendant can't understand that they said "next" and not "six".


It works quite well if you don't have to care about the accent though.

Trivia: "Automatic computer speech recognition now works well when trained to recognize a single voice, and so since 2003, the BBC does live subtitling by having someone re-speak what is being broadcast." (http://en.wikipedia.org/wiki/Closed_captioning)

But I'm pretty sure they started using it more broadly at some point, because you can see the recognition quality drop when there's an interview with someone from outside the UK.


"... Understanding what the other person says is hard. Put a Texan and South Indian in the same room and see what I mean. ..."

RP [0] [1] is an attempt to solve this "regional dialect" problem, and therein may lie a clue.

[0] http://www.bbc.co.uk/dna/h2g2/classic/A657560 [1] http://en.wikipedia.org/wiki/Received_Pronunciation


It's not actually clear what we hear at this point. There is evidence that we respond to something like frequencies, pitch, volume, and other features, but the jury is still out on the low-level signal processing that occurs in the cochlea and the primary auditory cortex. What's happening further downstream in the brain is even less clear.


The speech recognition example from a Nexus One phone elsewhere in the comment thread was pretty good at the level of individual words. Pattern matching learned from a massive training corpus seems to go pretty far.

The example didn't make sense at the level of sentences, though. Problems start with homophones. Speech is pretty noisy, and there are words whose sounds you can't distinguish even in theory, but which still need to be picked correctly when transcribing speech. My guess is that humans are good at this because they reconstruct a word string that roughly matches the sounds heard and also seems to be saying something sensible. (You could test this easily by having one person read out computer-generated word-soup nonsense and having another person listen and transcribe what they hear.)

The big problem with speech recognition would then be that recognizing meaningful messages among the various likely interpretations of the speech sounds is pretty much AI-complete in the general case. You could try the Google Translate approach and use a huge amount of existing text to teach the system about the kinds of sentences people write a lot, but this is still a bit limited.
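
A toy sketch of that idea (the corpus and the candidate transcriptions are invented): among several acoustically plausible word strings, prefer the one whose adjacent word pairs show up most often in existing text.

    from collections import Counter

    corpus = "i want to write a letter to my friend and i want two stamps".split()
    bigrams = Counter(zip(corpus, corpus[1:]))

    def score(sentence):
        words = sentence.split()
        # How familiar are the adjacent word pairs, given the existing text?
        return sum(bigrams[pair] for pair in zip(words, words[1:]))

    candidates = ["i want to write", "i want two write", "i want too write"]
    print(max(candidates, key=score))   # -> "i want to write"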

Things might work better if the language being recognized is assumed to be somewhat artificial to begin with. There are a lot of stereotypical phrases the system could learn to expect in various professional voice protocols, like the ones doctors or aviators use. The system might also work with a computer command language that has a regular grammar, rejecting unparseable interpretations of the speech sounds, but that might not be very comfortable to use.
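
Something along these lines, say (the grammar and the candidate transcriptions are invented for illustration): any interpretation that doesn't parse under the command grammar is simply thrown away.

    import re

    # A tiny regular command language, e.g. "call home", "dial 4 2", "play next".
    GRAMMAR = re.compile(r"call (home|office)|dial( \d)+|play (next|previous)")

    # Transcriptions an acoustic model might consider plausible:
    candidates = ["dial for two", "dial 4 2", "tile 4 2"]

    accepted = [c for c in candidates if GRAMMAR.fullmatch(c)]
    print(accepted)   # -> ['dial 4 2']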


I'd like to recommend a book called 'The Acoustical Foundations of Music'; it has some excellent chapters on how the ear works and what sound does when it enters the ear.

No speech recognition software that I'm aware of uses the 'raw' waveform; there is always a domain transformation before analysis even begins, usually to frequency/amplitude using an FFT.
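
Roughly like this (a bare-bones short-time Fourier transform sketch, not any particular recognizer's front end): slice the signal into short overlapping frames and take an FFT of each, turning the raw wave into a time-by-frequency magnitude array.

    import numpy as np

    def spectrogram(samples, rate, frame_ms=25, hop_ms=10):
        frame = int(rate * frame_ms / 1000)
        hop = int(rate * hop_ms / 1000)
        window = np.hanning(frame)
        frames = [samples[i:i + frame] * window
                  for i in range(0, len(samples) - frame, hop)]
        return np.abs(np.fft.rfft(frames, axis=1))  # rows: time, cols: frequency

    # Synthetic input (a 1 kHz tone); real input would be recorded speech.
    rate = 16000
    t = np.arange(rate) / rate
    spec = spectrogram(np.sin(2 * np.pi * 1000 * t), rate)
    print(spec.shape)   # (number of frames, frame_length // 2 + 1)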


Indeed, computers are not good at listening. Even pitch detection is hard. Even monophonic pitch detection is surprisingly so (I just finished reading a bunch of papers about that).
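
For anyone curious, here's a bare-bones autocorrelation pitch detector (a sketch, not a robust one). It works on a clean synthetic tone, but this is exactly the kind of approach that trips over octave errors, noise, and note onsets.

    import numpy as np

    def detect_pitch(samples, rate, fmin=50.0, fmax=1000.0):
        samples = samples - samples.mean()
        corr = np.correlate(samples, samples, mode="full")[len(samples) - 1:]
        lo, hi = int(rate / fmax), int(rate / fmin)
        lag = lo + np.argmax(corr[lo:hi])   # strongest repetition period
        return rate / lag                   # period -> frequency in Hz

    rate = 44100
    t = np.arange(rate // 10) / rate        # 100 ms of a 220 Hz tone
    print(detect_pitch(np.sin(2 * np.pi * 220 * t), rate))   # roughly 220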


What word(s) are we looking at in the .gif?


I just linked to a random Google image search result for "waveform" to highlight what I was talking about. However, looking at the gif, you can't even tell whether it's music, a human voice, machine noise, or something completely different.


This is precisely why no one uses the raw waveform. Speech recognizers and phoneticians alike use something which looks more like this: http://commons.wikimedia.org/wiki/File:Spectrogram_-iua-.png Speech recognition uses a couple more fancy transforms, plus the derivatives, but you get the idea.
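
Very roughly, the feature pipeline has this shape. This is only a sketch of the idea (frame-wise log spectra plus their frame-to-frame derivatives); real recognizers use mel-scaled filterbanks and cepstral coefficients, which this skips.

    import numpy as np

    def features(spectra):
        # spectra: 2-D array, rows = frames, columns = frequency bins
        logspec = np.log(spectra + 1e-8)        # compress the dynamic range
        delta = np.diff(logspec, axis=0,        # per-frame derivative,
                        prepend=logspec[:1])    # padded to keep the frame count
        return np.hstack([logspec, delta])      # static + dynamic features

    spectra = np.abs(np.random.randn(100, 129))  # stand-in for real spectral frames
    print(features(spectra).shape)               # -> (100, 258)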

And the claim that it can't be right because computers are discrete is just ludicrous. Neuron firing patterns are just as discrete as bits in a computer; and at a higher level, it doesn't matter much whether you're looking at (continuous) firing rates or (continuous) floating-point numbers.



