Hacker News
Rest in Peas: The Unrecognized Death of Speech Recognition (robertfortner.posterous.com)
138 points by robertfortner on May 3, 2010 | 86 comments



This comment is voice posted to my nexus 1, without ed its.

I find that the speech recognition on my next 1 is adequate 4 basic search queries. I tried old freezes listed in the article as search query. Rest in peace high st correctly. Sb inspiration came out of sudan inspiration. Serve as the installation, remarkably, king out exactly correct. Saving 1 into the phone give me a number instead of a word. Saying recognize speech came out okay.

The problem with speech recognition of long passages things to beat that there is a large amount of information beyond the worst insults. This looks like that speak for example. Humans are also very sensitive to misplaced woods. That would be in the last sentence completely changes the meaning of this. I also found the speaking twin machine feels very natural. I have to stop and pause between each sentence because i can't remember what i'm thinking about.

As you can see from descon and, speech recognition has a long way to go to it. But you can at least sort of get the gist of the conversation.


I'm going to forget what I actually meant to say above by morning, so here's the translation typed out:

"This comment is voice-posted from my Nexus One, without edits.

"I find that the speech recognition on my Nexus One is adequate for basic search queries. I tried all the phrases listed in the article as search queries. 'Rest in peace' parsed correctly. 'Serve as the inspiration' came out as 'Sudan inspiration'. 'Serve as the installation', remarkable, came out exactly correct. Saying 'one' into the phone gave me a number instead of a word. Saying 'recognize speech' came out okay.

"The problem with speech recognition of long passages seems to be that there is a large amount of information beyond the words themselves. This looks like netspeak, for example. Humans are also very sensitive to misplaced words. The 'woods' in the last sentence completely changes the meaning of it. I also found that speaking to a machine feels very unnatural. I have to stop and pause between each sentence because I can't remember what I'm thinking about.

"As you can see from this comment, speech recognition has a long way to go before it becomes practical. But you can at least sort of get the gist of the conversation.


> you can at least sort of get the gist of the conversation.

Or: you can get the opposite of the intended meaning, e.g., when you said 'UNnatural' it heard 'natural'.


The unfortunate part is that I didn't even know it made that error until I posted and re-read my comment from my laptop. The Nexus One's text boxes seem to have issues with a lot of text...when the page is big enough to scroll and the text is also big enough to scroll, it's hard to scroll the text without moving the page. So everything after the second paragraph basically happened off-screen with no visual feedback.


If you say that it parsed "rest in peace" correctly, does that mean that it mis-parsed "rest in peas" and so committed the error mentioned in the article of replacing the unexpected word that carries most of the meaning with the most likely word?


I said "rest in peace" and it came out as "rest in peace". I just tried "rest in peas" and it also came out as "rest in peace", so yeah, the N1 does commit the error that the article mentioned.


It's interesting that the article said human recognition is up to 98%. We're not perfect. So perhaps another tack is for the computer to correct mistakes in a more natural way, e.g., by asking, or by checking whether the word makes sense in terms of subsequent context. BTW, I didn't notice that "woods" was a speecho, but read it as "words".

I agree that linguists are sometimes too theory-focussed to notice the data. Pinker's excellent but self-consciously clever The Language Instinct has examples of nested phrases that he claims are understandable - but I can't parse them using my native speech recognition technology (I can parse them using linguistic theory):

The rapidity that the motion that the wing has has is remarkable. ["has" is repeated]

In other words: my native human grammar does not nest arbitrarily; the linguistic theory does. I'm going with the theory being wrong.

Anyway, as has been said, we'll have speech recognition when we have speech comprehension, i.e. strong AI.


The article highlighted that recognition accuracy has plateaued around 80% and hasn't really improved much in the 2000s. Glancing over your trial, it seems that the Nexus One is below 80% recognition accuracy. Thus, if the claim of the article is to be believed, speech recognition might never become practical.
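
The usual way to put a hard number on a trial like that is word error rate: the word-level edit distance between the reference transcript and the recognizer's output, divided by the reference length. A minimal Python sketch, with the example strings just stand-ins taken from the trial above:

    # Minimal word error rate (WER) sketch: word-level edit distance / reference length.
    # Example strings are illustrative stand-ins, not real benchmark data.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("serve as the inspiration", "sudan inspiration"))  # 0.75
    print(wer("rest in peas", "rest in peace"))                  # ~0.33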


Wow, I thought you were purposefully making a joke post and inserting fake errors. Those results are terrible.


I understand your meaning, but as speech recognition goes they're actually pretty good.

http://www.youtube.com/watch?v=IkeC7HpsHxo


Does the nexus 1 do speech recognition, or does it send audio to Google servers to do it?


It sends the audio to Google's servers.


I used NaturallySpeaking quite a lot for many years due to severe RSI, and while it's nowhere near perfect, it's way better than the Google voice recognition that you get through Google Voice's transcription. I assume the Nexus One works on the same technology.

It's not a fair comparison, though, because for DNS to work well you (a) have to have a good noise-canceling microphone and a good aural environment, and (b) have to talk like a newscaster.

The 80% in the article is a very pessimistic figure, in my experience. I guess the question is what they mean by "conversational". If you speak like you would to another fluent speaker of the language, you're bound to fail. The closest I can compare it to is to imagine you're speaking to a foreigner with only basic understanding of the language. The same issues with homophones and inability to correctly separate an utterance into the correct word boundaries trip up human learners, too.


I've actually found that if I speak slowly and clearly, the nexus one figures out exactly what I'm saying for more than 90% of the sentences. This is very useful, and much better than any voice recognition program I'd seen about ten years ago.


For those who haven't dug into the audio analysis and synthesis field, it is really hard to understand why speech recognition is so complex. As living beings, we interpret sounds in a fundamentally different way than computers do. Our brain has the ability to interpret sound waves and provide meaning and context for them without us even realizing it. Once you hear a cat's meow, even if it later occurs in a different frequency range, you can still recognize it as a cat's meow.

This is what computers "hear": http://www.ling.ed.ac.uk/images/waveform.gif - in fact, this is ALL they can hear. When you hear 44.1 kHz audio, that is 44,100 samples of information per second (at 8 bits, each sample is a value between -128 and +127). There is no hidden metadata behind the waveform. At 100% zoom level, the waveform IS the audio. Theoretically, you could take a screenshot of a 100% zoomed waveform and convert that back to actual sound with absolutely no loss of data (of course, you'd need a really high resolution to show a graph 44100x256 pixels in size). Now, given such a waveform, how would you convert it to plain text?

As an example, try recording "ships" and "chips" into a mic and view the waveform. See if there are any patterns you can identify between the two waveforms. I've done it over a hundred times. There isn't an easy way to discern if the letter was "sh" or "ch". Yet our brain does it so very easily thousands of times every day. So, failing easy pattern recognition, we have to use frequency analysis, DFT, and tons of AI.
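
Here's roughly what that experiment looks like in code, if anyone wants to reproduce it: a numpy/scipy sketch that compares where the energy sits in the spectrum. The filenames are placeholders for whatever you record; the point is that the sh/ch difference shows up in the frequency domain rather than in the raw waveform view.

    # Sketch of the "ships" vs "chips" experiment: compare average magnitude
    # spectra of two recordings. Filenames are placeholders for your own clips.
    import numpy as np
    from scipy.io import wavfile

    def avg_spectrum(path, frame=2048, hop=512):
        rate, samples = wavfile.read(path)
        if samples.ndim > 1:                       # mix stereo down to mono
            samples = samples.mean(axis=1)
        samples = samples.astype(np.float64)
        frames = [samples[i:i + frame]
                  for i in range(0, len(samples) - frame, hop)]
        spectra = [np.abs(np.fft.rfft(f * np.hanning(frame))) for f in frames]
        return rate, np.mean(spectra, axis=0)

    rate, ships = avg_spectrum("ships.wav")
    _,    chips = avg_spectrum("chips.wav")
    freqs = np.fft.rfftfreq(2048, d=1.0 / rate)

    # Fraction of spectral energy per band -- the sh/ch difference lives up high.
    for lo, hi in [(0, 2000), (2000, 5000), (5000, 10000)]:
        band = (freqs >= lo) & (freqs < hi)
        print(lo, hi, ships[band].sum() / ships.sum(), chips[band].sum() / chips.sum())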

My unscientific gut-feeling is that we're going about all of this the wrong way. We are using the wrong tools. Discrete, digital computers will never be able to tackle problems like pattern recognition in their current state. Switching to analog isn't going to improve anything either. I don't know what the correct instruments/devices will be but I know programming them will be very different.


As much as I agree with you in general... why this? "So, failing easy pattern recognition, we have to use frequency analysis, DFT, and tons of AI."

It's not like we really hear the waves. We mainly hear pitch and volume (demonstrated nicely by playing simple impulses at >10 Hz). So it's not that we fail at easy pattern recognition on waves; I'd say the raw waveform is something we should not be looking at, ever. Especially when you can have many representations that sound the same (they sound the same because they look really similar in the frequency domain).

Using the frequency analysis, DFT, etc. is the way to approach it.


> Using the frequency analysis, DFT, etc. is the way to approach it.

Indeed. I meant to say there is no shortcut that immediately makes speech recognition easy to solve. Hence, we have to use a lot of advanced math that works pretty well but like the article says, not as well as a typical human.

However, I don't think humans are that great either. The proof becomes evident when working on projects with people from around the world. Ask 10 people from around the world to dictate 10 different paragraphs that the other nine have to write down. I doubt they'll reach the 98% accuracy that the article states, especially if there is no domain constraint. Understanding what another person says is hard. Put a Texan and a South Indian in the same room and see what I mean. Of course, this doesn't mean it's not fun and interesting for computer scientists. It's just hard for most people to realize why the stupid automated voice attendant can't understand that they said "next" and not "six".


It works quite well if you don't have to care about the accent though.

Trivia: "Automatic computer speech recognition now works well when trained to recognize a single voice, and so since 2003, the BBC does live subtitling by having someone re-speak what is being broadcast." (http://en.wikipedia.org/wiki/Closed_captioning)

But I'm pretty sure they have sometimes started using it more broadly, because you can see the recognition quality fall when there's an interview with someone from outside the UK.


"... Understanding what the other person says is hard. Put a Texan and South Indian in the same room and see what I mean. ..."

RP [0] [1] is an attempt to solve this "regional dialect" problem, and in that may lie a clue.

[0] http://www.bbc.co.uk/dna/h2g2/classic/A657560 [1] http://en.wikipedia.org/wiki/Received_Pronunciation


It's not actually clear what we hear at this point, there is evidence that we respond to something like frequencies, pitch, volume, and other things. But the jury is still out on the low-level signal processing that is occurring in the cochlea and the primary auditory cortex. What's happening further downstream in the brain is even less clear.


The speech recognition example from a Nexus One phone elsewhere in the comment thread was pretty good at the level of individual words. Pattern matching learned from a massive training corpus seems to go pretty far.

The example didn't make sense on the level of sentences though. Problems start with homophonic words. Speech is pretty noisy, and there are words whose sounds you can't distinguish even in theory, but which still need to be picked correctly when transcribing speech. My guess is that humans are good at this by reconstructing a word string that roughly matches the sounds heard and seems to be saying something sensible. (You could test this easily by having one person read out computer generated word soup nonsense and have another person listen and transcribe what they hear.)

The big problem with speech recognition would then be that recognizing meaningful messages from the various likely interpretations of the speech sounds is pretty much AI-complete for the general case. You could try the Google Translate approach and use a huge amount of existing text to teach the system about the types of sentences people write a lot, but this is still a bit limited.
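
That "reconstruct a word string that roughly matches the sounds and says something sensible" step is essentially what an n-gram language model does: among transcriptions that fit the audio about equally well, keep the one whose word sequence is most plausible in a large text corpus. A toy sketch, with all counts invented for illustration (a real system estimates them from billions of words):

    # Toy bigram language model: among acoustically similar candidate
    # transcriptions, keep the one with the most plausible word sequence.
    # All counts are invented for illustration.
    import math

    bigram_counts  = {("rest", "in"): 900, ("in", "peace"): 800, ("in", "peas"): 2}
    unigram_counts = {"rest": 1000, "in": 50000, "peace": 900, "peas": 40}
    VOCAB = len(unigram_counts)

    def log_prob(words):
        # Bigram probabilities with add-one smoothing for unseen pairs.
        total = 0.0
        for prev, cur in zip(words, words[1:]):
            num = bigram_counts.get((prev, cur), 0) + 1
            den = unigram_counts.get(prev, 0) + VOCAB
            total += math.log(num / den)
        return total

    # Two candidates the acoustic model can't tell apart:
    candidates = [["rest", "in", "peace"], ["rest", "in", "peas"]]
    print(max(candidates, key=log_prob))   # ['rest', 'in', 'peace']

Which is also exactly why the Nexus One above turned "rest in peas" into "rest in peace": the statistically likely reading wins even when the unlikely one is what was actually said.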

Things might work better if the language for the recognized speech is assumed to be somewhat artificial to begin with. There are a lot of stereotypical phrases which the system could know to expect in various professional voice protocols like the ones doctors or aviators use. The system might also work with a computer command language that has a regular grammar by rejecting unparseable interpretations of the speech sounds, but that might not be very comfortable to use.


I'd like to recommend a book called 'The Acoustical Foundations of Music'; it has some excellent chapters on how the ear works and what sound does when it enters the ear.

No speech recognition software that I'm aware of uses the 'raw' waveform, there is always a domain transformation before even beginning analysis, usually to freq/amplitude using FFT.


Indeed, computers are not good at listening. Even pitch detection is hard. Even monophonic pitch detection is surprisingly so (I just finished reading a bunch of papers about that).


What word(s) are we looking at in the .gif?


I just linked to a random Google Image Search result for "waveform" to highlight what I was talking about. However, looking at the gif, you can't even tell whether it's music, human voice, machine noise, or something completely different.


This is precisely why no one uses the raw waveform. Speech recognizers and phoneticians alike use something which looks more like this: http://commons.wikimedia.org/wiki/File:Spectrogram_-iua-.png Speech recognition uses a couple more fancy transforms, plus the derivatives, but you get the idea.
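
For the curious, getting from the raw waveform to that kind of time-frequency picture takes only a few lines with scipy (filename is a placeholder); the extra transforms in real recognizers are usually mel filtering and cepstral coefficients, with their time derivatives appended, layered on top of something like this:

    # Sketch: short-time Fourier analysis turns a waveform into a spectrogram,
    # the kind of representation recognizers actually start from.
    # Filename is a placeholder.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, samples = wavfile.read("utterance.wav")
    if samples.ndim > 1:
        samples = samples.mean(axis=1)             # mix down to mono

    # ~25 ms windows stepped every 10 ms are typical for speech
    freqs, times, power = spectrogram(samples.astype(np.float64), fs=rate,
                                      nperseg=int(0.025 * rate),
                                      noverlap=int(0.015 * rate))
    log_power = 10 * np.log10(power + 1e-10)       # dB scale, like the linked image
    print(log_power.shape)                         # (frequency_bins, time_frames)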

And the claim that it can't be right because computers are discrete is just ludicrous. Neuron firing patterns are just as discrete as bits in a computer; and at the higher level, it doesn't matter much if you're looking at (continuous) firing rates or (continuous) floating-point numbers.


I think this article conflates two quite different things: speech recognition and understanding language.

The problem of speech recognition, it seems clear now, is unsolvable without understanding language, because only with an understanding of language are the ambiguities inherent in speech properly resolved.

But having a workable understanding of language would seem to rely on us having some kind of model of a mind. Mere statistical relations between words or sentence structures aren't enough to map human language into mental models of the world, but having a consistent internal model of the world seems to be the key to understanding language. We disambiguate by minimizing the internal inconsistency.

Having a computational model for such an internal view of the world seems to be a Hard AI problem. We probably won't make much further progress with speech recognition until we've made headway there.


Actually, most research effort in speech is on the language side rather than on the signal processing of the speech signal. So I think many people have an intuition similar to yours.

Bear in mind, though, that humans significantly outperform machines in tasks where isolated syllables or streams of nonsense syllables are spoken: e.g. "badagaka" is said, and humans can pick out the syllables whereas computers can have a lot of difficulty (in noise in particular).

Computers start approaching human performance most when there is a lot of linguistic context to an utterance. So it appears that humans are doing something other than using semantics.


Good points, but I think we underestimate how much situational context humans use when they interpret language. Sometimes we can communicate with very little language simply because we know what the purpose of the interaction is.

Another thing I keep wondering about is why so little emphasis is put on dialog. When humans don't understand something, they ask, or offer an interpretation and ask whether it's the right one.

Speech recognition systems don't seem to do that. They say "Sorry, I could not understand what you said. Please repeat." That's not very helpful for the computer, of course. It should say: "Huh, peas? Why would anyone rest in peas, for heaven's sake??" Then the human could sharpen his s's and say "PeaCCCEE!!! Not peas. I'm not talking about food, I'm talking about dying!"


Context is huge for human interpretation. If you've ever had someone address you in a different language than you were expecting, you know what I mean. It's almost like you can imagine the search just going deeper and deeper without finding anything that makes sense, until it swaps in the other language and goes: Ah, you said "good morning"! :-)


Especially embarrassing when somebody addresses you in your native language, and you expected something different.


It is true that humans do use situational context. In the cases where semantics is important and complex for understanding an utterance a computer will fail even more because it won't get the semantics or the speech signal.

On the topic of dialog, this is arguably the area where speech recognition has gained over the last nine years. Prior to 2001 there were not many usable dialog systems, and now (depending on your definition of "usable") there are many dialog systems deployed in call centers around the world.

Most call center dialog systems have a rudimentary system asking for people to repeat things when it doesn't understand. Although, if it asks more than once the callers tend to get very angry.


Nobody would use a system that interrogated you on every fifth word. That would actually be a step worse than silent failure on every fifth word.


It shouldn't interrupt you once every 5 words of course. What it should try to do is to create a model of what you meant to say. At some point, if the system is unsure, it should ask you to confirm or correct what it has understood so far.


Where does it conflate those? I thought it discussed exactly the point you've just made.


I did an internship working on speech recognition, and we worked on exactly the statistical relations between words as you describe it. "Rest in peas" is a perfect example of a sentence that would have been corrected.


Two anecdotes:

First, I worked with a quadriplegic engineer who relied on speech recognition. While it wasn't perfect, he certainly did well with it. The trick is that he trained himself as much as he did the computer. He spoke more slowly, and enunciated the places where he knew of problem points. I don't even know how much of this was conscious - I only watched a few times and never asked him.

If we treat our voice recognition similarly, we get much better results.

Second, due to a few too many loud concerts and bars, I have partial hearing loss and can miss words myself. The difference between me and a computer is that the computer is expected to output immediately rather than being allowed to wait for more context clues, and the computer isn't allowed to interrupt to ask about words.

These two strategies, allowing for a more conversational flow, may be what we need to improve speech recognition.


Surprising fact, despite what your smartphone tells you: speech recognition in general situations hasn't improved since 2001. The article is a good exposition of why Norvig and Google's approach of using statistical analysis may be approaching a dead end.


I think something must be wrong with Norvig's approach, because humans can learn a lot faster. We don't need a billion voice samples to learn to understand speech. So there must be a better way than the ginormous-dataset approach.


Why? Perhaps your genes have been trained (i.e., evolved) on big datasets to produce good speech recognizers.


But Hinton recently made significant advances in speech recognition using RBMs.

Google Tech Talk: http://www.youtube.com/watch?v=VdIURAu1-aU


He hasn't made any advance that has made its way into a full word recognizer; he's merely recognizing phonemes (which are linguistic subunits of words), and several researchers in the field have criticized his methods. Additionally, none of the top five phoneme recognizers has ever been deployed as a word recognizer, and there is little chance that they will be in the next few years.


The concept of phonemes isn't undisputed either. When analyzing actual speech it becomes clear that there are no real steady states, but rather a lot of coarticulation between the "segments". Of course, part of it could be attributed to the fact that speech sounds are produced by articulatory gestures, which necessarily overlap in time. On the other hand, these coarticulation patterns are not language-independent, so a purely (articulatory/auditory) phonetic explanation of why these differences exist is rather unlikely. I know this seems rather off-topic with regard to speech recognition, but the question of the basic building blocks of language is kind of at the heart of the problem.


I agree that it's at the heart of it (and I'm presently writing a paper where I'm using articulatory-phonetic features rather than phonemes). Unfortunately, there is no large-vocabulary speech recognizer that uses articulatory phonetics (yet!). Every large-scale speech recognizer, and most small-scale ones, use phonemes and are trained on speech that has been transcribed into phonemes. There is almost no data annotated with articulatory phonetics (a problem I'm working on right now).


I guess that's in part because it's even more difficult to (manually) transcribe speech into articulatory-phonetic elements based on the acoustic signal (laryngeal gestures?? Clearly they are there in articulation, but their acoustic correlates are masked to some extent).

Automatic alignment methods are probably quite hard to implement, given the various coarticulation patterns in the signal depending on context/prosodic position etc.

Could you provide a link to papers or other materials dealing with articulatory features in speech recognition?

I guess I should take another look at Browman/Goldstein's Articulatory Phonology


Nice post. Question: is there a relation between RBMs and HMMs?


Kind of. Both RBMs and HMMs are a means to squeeze statistics out of something so that you can take the something away and still know what it looked like. RBMs are a bit more involved (and hence a whole lot slower, which still matters in speech recognition), while HMMs are simpler but good enough for people to stick to them (frustratingly for all the people who propose something fancier).
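
If it helps to make the HMM half of that concrete, here's a tiny numpy sketch of the forward algorithm: the transition and emission tables are the "squeezed-out statistics", and scoring a new observation sequence only needs those tables, not the original data. All the numbers are made up; in a real recognizer there's a model like this per phoneme and the observations are acoustic feature vectors rather than three discrete symbols.

    # Tiny HMM forward-algorithm sketch. The tables below are the model's
    # "squeezed-out statistics"; all numbers are invented for illustration.
    import numpy as np

    start = np.array([0.6, 0.4])            # P(initial hidden state)
    trans = np.array([[0.7, 0.3],           # P(next state | current state)
                      [0.2, 0.8]])
    emit  = np.array([[0.5, 0.4, 0.1],      # P(observed symbol | state)
                      [0.1, 0.3, 0.6]])

    def likelihood(obs):
        """P(observation sequence | model), summed over all hidden-state paths."""
        alpha = start * emit[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ trans) * emit[:, o]
        return alpha.sum()

    print(likelihood([0, 1, 2]))   # one observation sequence
    print(likelihood([2, 2, 2]))   # another one scores differently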


While the article focuses on speech recognition for arbitrary speech, it misses the fact that speech interfaces for specific situations are now actually useful. Today, I told my car "play track 'severed head'," and it actually played the correct song. I asked my phone "what is my next appointment?" and heard my calendar. I then said "dial 206 421 8989," my phone dialed properly, and so on and so forth. This is all without any explicit training on my part. No need to read Mark Twain or anything like that.

There are still problems here, but the technology for speech interfaces has gone from terrible to OK in the last seven years. I'm looking forward to seeing where it goes next.


"But sticking to a few topics, like numbers, helped. Saying “one” into the phone works about as well as pressing a button, approaching 100% accuracy. But loosen the vocabulary constraint and recognition begins to drift, turning to vertigo in the wide-open vastness of linguistic space."

"As with speech recognition, parsing works best inside snug linguistic boxes, like medical terminology, but weakens when you take down the fences holding back the untamed wilds."

No, it did not miss it. It was a core point of its argument; the entire arc of the article is about how we made steady progress on the small cases but crapped out on the general case.


Totally anecdotal, but I thought I'd pass this on.

Whenever I use GOOG-411 or Dragon, my results improve if I speak like a standard radio announcer. You know, that kind of jokey, fake lilting voice they use in car commercials? I hate doing it because I sound like such an ass, but whenever I've done so, Google understands what I'm saying. YMMV.


Speech recognition is pattern recognition. Computers are inefficient and relatively limited in their capacity to discern relevant data from complex data sets when the size of these sets begins to soar exponentially. That's why weather prediction in many areas isn't worth much more than a week. Consider how well the human brain can synthesize disparate data in the relatively simple form of sentences with scrambled word spellings. Add to this, the nearly infinite combinations of vocabulary, grammar, voice tone and context of speech and we can begin to see the scale of the task of matching with programs and algorithms, what the human mind does unconsciously in real time with a high degree of accuracy.

http://simplebits.com/notebook/2004/01/16/mipellssed-wdors/

Mipellssed Wdors Posted on January 16th, 2004 at 1:44 pm

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a total mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh?

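
Generating that kind of text is a three-line trick, for anyone who wants to play with it: keep the first and last letter of each word and shuffle the middle. A quick sketch (punctuation handling is deliberately crude):

    # Scramble the interior letters of each word, keeping the first and last
    # in place, as in the quoted paragraph. Punctuation handling is crude.
    import random

    def scramble_word(word):
        if len(word) <= 3:
            return word
        middle = list(word[1:-1])
        random.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]

    def scramble(text):
        return " ".join(scramble_word(w) for w in text.split())

    print(scramble("According to a researcher at Cambridge University"))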


MSR's MindNet, mentioned in the article, is online:

http://stratus.research.microsoft.com/mnex/Main.aspx


It's not dead. It is in fact used a lot more than ten years ago. I've yet to get a voice message that google voice was not able to decode sufficiently, and most of the transcribed messages have been 100% accurate.

General speech recognition has a long way to go, but the general case will be the last thing to fall. Meanwhile, in the real world, speech recognition is noticeably better for everyday use than it was a decade ago.


I have conflicting anecdotes: (1) it's used heavily and successfully in medical dictation situations, i.e. medical professionals dictating case notes; (2) I was so excited about speech control in OS X, and I tried it, and it was shit, because I had to speak with a terrible impression of an American accent to get it to do anything.


I'd say they aren't really conflicting. Commercial speech recognition really started in the medical field. Accuracy is much higher with a controlled vocabulary.

Speaker-specific characteristics, e.g. accent, by contrast are much harder for recognizers.


Medical dictation is a success for recognition, but it's important to remember that no such system can function by itself. Instead, they're often advertised as making the human transcriptionist's job easier -- reducing the number and type of fixes a human has to make when correcting the dictation results.


I paid my way through college doing medical transcription. Throughout my time there, the hospital was experimenting with speech recognition, but the success rate was very low, and the doctors spent a lot of time going over and correcting the output. I believe they finally ditched the voice recognition software (though I do not know for sure).

I'm guessing that the reason the experiment failed at this hospital is that the language was non-English. In addition, most of the doctors were foreign, speaking with wildly differing accents. Also, many of the doctors would mix two languages when dictating, in addition to the usual Latin terms.

Basically, it was easier to just pay some humans to parse it.


The ones where it is successful invest quite a bit in training the software, and the doctors use macros heavily.


My anecdotal evidence from a surgeon in the family is that the medical dictation is still very bad. His practice of 4 doctors spent a bunch of money on Dragon, fought with it for a month, and then went back to old-fashioned human dictation.


From what I remember, you might be right that Dragon is not that great; PowerScribe is the one I remember being more popular -- I was doing RIS (Radiology Information System) work at the time, about a year ago.


Interesting about it being used heavily and "successfully" in the medical field. That's the one field where I'd want 100% transcription perfection. Mistakes there could kill.


It could hardly be worse than doctor's handwriting...


I think the core obstacle to further progress in speech recognition is semantic understanding.

One of the reasons computers aren't good at it yet is sheer lack of computing power, which forces them to use less context to decide word meanings. There is a reason humans think and remember things mostly in the form of stories: it provides more context for memory and decoding cues.

The article talks about how much computing power has increased over the last decade without any increase in transcription accuracy ("freakish" is the word it uses), but without mentioning the fact that it is still enormously behind human processing capability.


Is the problem in guessing phonemes or guessing which word is being said based on the phonemes? I'd assume that phonemes are easy to guess at since it's such a small set compared to the number of possible words.


I disagree. This is another example of underestimating how powerful the human brain is: it is an exquisitely designed piece of dedicated hardware with more power than even our super-compute-clusters. Hardware that's dedicated, to a large extent, to language processing. I think current statistical approaches, ramped up on future computing resources, will continue to chip away at 'word error rate'.


My favorite hard-to-understand phrase is "Isle of Lucy", which is pretty tough to get right when spoken human-to-human.


I wouldn't draw so much from what looks like a local minimum - especially when the abilities of people imply that there is a suitable algorithm. The problem is that it's no longer a successful marketing gimmick - it won't help to sell software or hardware anymore until it works as well as people do.


So the question remains as to how to get that last 10-15% of transcription accuracy. It sounds as if existing methodology has run out of steam and that a paradigm shift is required to get to human-like speech recognition.


Was speech recognition software ever any good for anything other than having fun with your kids?


"It's hard to wreck a nice beach"


Well, so much for HAL reading lips - he can't even hear what I'm saying. I call bullshit - HAL's way smarter than that!


Perhaps they had better (and now forgotten) techniques back in 2001?


Great article.... Like a lot of things it all hinges on P=NP.


Thanks for reading. But I think a key take-away is that even with infinite computing power, you only increase the chance of being right a little. It's still a guess.

That's the real surprise. It's not a question of computing power.


Your articles seem to follow a theme of "no new ideas/breakthroughs". The anecdote of the NASA head imploring people to come up with ideas was striking. Any guesses as to why this is happening now? After all, we _were_ having critical breakthroughs through much of the last century.


To some degree, I'm critiquing Ray Kurzweil who generalizes about progress. So, for better or worse, I take a lesson from that which is to be empiricist: look closely at each field to see whether it's vaulting forward or not.

It's not happening in space (although hypersonics seem to be getting more real). It's not happening in AI or speech recognition/understanding. It's not happening in medicine.

So the net empirical situation looks like the opposite of what Kurzweil says. Only IT and Moore's Law are going totally nuts. One can conjecture about why, but I think that's conjecture and there aren't deep fundamental principles about scientific and technological progress, at least that we've found so far.


Nice article with a lot of information. Honest question here: you make some strong claims, have you been involved in speech recognition research, or are you an informed outsider? It seems like the latter? Poking fun at AI optimists like Kurzweil is fun, but not insightful at this time. The field has taken a different direction since then and I'm not sure that people know where corpus based statistical methods are going to go. Again, honest question.


Correct surmise: I'm not a speech recognition researcher. I'm a science and technology writer.

I was very surprised to find that recognition accuracy stopped getting better a long time ago. I do poke fun, but I also think Kurzweil (as I said in the piece) very reasonably believed that Moore's Law would get us a long way to AI. And surprisingly, it hasn't.

I like to think I don't have a stake in the argument and just let the actual evidence decide whether or not we're going to get to computer understanding of language and/or AI.


Yes, infinite computing power with current methodologies will never work. The article mentions that the number of possible sentences is on the order of 10^570, and current methods attempt to crack that solution domain (or sub-domains of it) primarily using probabilities of association. This method will never work, except on small sub-domains (e.g. numbers or medical-field transcription), as the article mentioned.

However, if P=NP then there are better algorithms that can solve the problem. Does P=NP? Nobody knows, but to me it seems to be true, because when anyone interprets speech they definitely are not searching a 10^570 domain to solve the problem.


Obviously, if you run the wrong algorithm, even infinite computing power won't help. Hence I don't see how you arrive at that conclusion.


Not really. The human brain is more or less a computer, and it does speech recognition in hardware.

We have empirical proof that a discrete amount of hardware (1 brain), is enough to solve the problem.


Yes, but what is the brain doing?


Whatever it is, it plays by the same rules as everyone else.


I'm curious, what in my comment would make you think that I think otherwise?


A total horseshit article... I mean to conclude from a lack of progress for a couple of years that AI is impossible.


That was a quote from the article, btw.



