Google has developed speech-recognition technology that actually works. (slate.com)
101 points by pelf on April 8, 2011 | 49 comments



Sorry to be a killjoy, but the premise of this article--that Google developed the speech-recognition technology--hurts my feelings (to say the least) and underestimates the contributions of the NLP community.

Speech recognition, like machine translation, is academic in origin, and much of the work is still carried out in academia. For example, Google did not "invent" machine translation. No, Google Translate is an adapted version of academic systems. Perhaps the phrase tables are sharded and so is the language model, but the general algorithms are the same. Sure, one of Berkeley's NLP grads is working there, but it's basically an adapted version of what's available. They publish papers like "Stupid Backoff" [1], but that makes them as much a contributor as any other member of the NLP community.
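
For reference, the scoring rule in that paper really is that simple. A minimal sketch in Python (here `counts` maps n-gram tuples of every order to corpus frequencies and `total` is the corpus size, both assumed precomputed; the names are mine, not the paper's):

    ALPHA = 0.4  # the fixed backoff factor from the paper

    def stupid_backoff(ngram, counts, total):
        # Relative-frequency score of ngram[-1] given ngram[:-1].
        # Deliberately NOT a normalized probability -- hence "score".
        if len(ngram) == 1:
            return counts.get(ngram, 0) / total      # unigram base case
        if counts.get(ngram, 0) > 0:
            return counts[ngram] / counts[ngram[:-1]]
        return ALPHA * stupid_backoff(ngram[1:], counts, total)  # back off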

Speech recognition is the same thing. Google is the company that takes existing research and adapts it.

To claim that Google developed the speech recognition technology is to discredit the contributions of everyone else in the NLP community. Google has been generous in funding NLP research at the university level. Do you consider the results of that research Google's?

Ultimately, the main difference is that Google has orders of magnitude more data and the physical capacity to handle it, not that it solved some systems or architectural bottleneck that has been limiting us. Someone once said that all you need is a crappy model and great data to build a good ML-based algorithm...

[1] http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf


Having done only the smallest amount of work trying to apply academic research, I have to say that the standard approach across a lot of AI is to develop good ideas only as far as needed to get a few papers out (though I don't know NLP in particular).

The work of putting this stuff together into a system which works consistently and at scale is hard.

Basically, it is unfortunate that the standard in academia is publishing papers (or PDF files) rather than publishing libraries. With this standard, academics can't even readily use each other's algorithms.

But when academics hold to this standard, it seems dumb to hear the complaint "uh no, you didn't do anything but apply our ideas..."


I am not an expert on speech recognition, but I am somewhat involved with machine translation. For MT, it's very much architectural and systems issues that limit us, first in trying different models and algorithms, but also in general.

How do you know ideas are "good" enough to be publishable unless you do plenty of experiments involving billion- (or trillion-) word corpora? I have a hard time imagining that research in other fields doesn't require validation.


I'm not saying that the papers that get published aren't good or valid. "Good enough to publish a paper" is indeed good.

It's just that once the paper is published, it becomes a cul-de-sac, a nice little city with no roads leading in or out. Other researchers can only use the result by reproducing the idea by hand (or at best through crufty Matlab code).

Yes, I'm sure the papers I've scanned involved considerable work and data (I worked in computer vision). But that work is often, if not generally, unavailable to the reader of the paper.

The point is that in creating a working system, Google has to do more than extend academic research, even if academic research involved good ideas that had been given some thorough tests in isolation.


> Basically, it is unfortunate that the standard in academia is publishing papers (or PDF files) rather than publishing libraries. With this standard, academics can't even readily use each other's algorithms.

I am not sure code should be the uniform standard for judging (computer) scientific work. It is much harder to review or validate a theory when it is supplied in the form of code than on paper.

Of course, code can serve as an addendum to your work or may be useful as a demonstration of it. (Many academic authors do seem to do this.)


Speech recognition, like machine translation, is academic in nature.

I don't understand why you're trying to make this an "us vs. them" thing. What is it about academia that makes it so that Google's research can't be considered a part? Does the fact that there are students around somehow make it different?

Google is the company that takes existing research and adapts it.

I also think you're missing an important differentiator, that between science and technology. The research you're talking about could be summed up as "learning why the world is the way it is"; that's what a scientist does.

But the novel things that Google has done are largely technological. They take that understanding that the scientists have given us, and find a way to apply that to the real world. By and large, Google is full of really good engineers, as opposed to scientists.


I don't understand why you're trying to make this an "us vs. them" thing. What is it about academia that makes it so that Google's research can't be considered a part? Does the fact that there are students around somehow make it different?

Perhaps I was being unclear with "...in nature". I meant that it was academic in origin, and that much of the work is still academic. I did not claim that Google has not contributed; rather, I was defending the community against the misconception that Google invented/developed something that is partially the work of NLP academics. Please re-read my comment. In particular:

The premise of this article--that Google developed the speech-recognition technology--hurts my feelings (to say the least) and underestimates the contributions of the NLP community.

They publish papers like "Stupid Backoff" [1], but that makes them as much a contributor as any other member of the NLP community.

I am not saying that Google has not contributed to speech. I am arguing that the premise of the article, that Google DEVELOPED the technology, is misleading and insulting.

As for science vs. technology:

NLP is very systems- and technology-heavy. If you haven't done much NLP research, you might think that it's very theoretical. No, it's very much about how to stuff a larger language model into memory, how to take an NP-hard decoding problem and produce a linear-time approximation algorithm, how to take an optimization problem over multiple parameters and optimize them computationally, etc.
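
To make the decoding point concrete: exact search is intractable, so decoders settle for approximations like beam search, keeping only the k best partial hypotheses at each step. A toy, system-agnostic sketch (every name here is a placeholder I made up, not any real decoder's API; it assumes at least one hypothesis eventually finishes):

    import heapq

    def beam_decode(start, expand, score, beam_width=10):
        # Generic beam search: prune to the `beam_width` best partial
        # hypotheses per step, so time grows linearly in output length
        # rather than exponentially. `expand(h)` yields (successor, done)
        # pairs and `score(h)` is the model score -- both supplied by
        # whatever real system sits around this.
        beam, finished = [start], []
        while beam:
            successors = []
            for h in beam:
                for h2, done in expand(h):
                    (finished if done else successors).append(h2)
            beam = heapq.nlargest(beam_width, successors, key=score)
        return max(finished, key=score)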

I agree there is a distinction between science and technology, but to argue that NLP in the academic world is also not "largely technological" is misleading at best.


Perhaps I was being unclear with "...in nature". I meant that it was academic in origin, and still much of the work is academic.

Right, and I'm wondering why you don't consider Google's work to be academic? I'm trying to nail down your definition of "academic".

Is there something about Google's work that is qualitatively different? Is it simply that it's not performed on the campus of a school?

Beyond the scope of NLP, suppose we're discussing two chemists. One works at, say, Merck, while the other works at a university (perhaps on a grant from Merck). What's the difference between the work that the two are doing?

EDIT: in the interest of not being snarky, I may as well come out and say it: your comments gave me every impression of an elitist attitude.


Ah, I meant academic, as in "came out of a university." Sorry I left that unclear.

I wasn't saying it was "qualitatively different," but that Google did not develop speech recognition. It was the work of many, MANY brilliant people, and to say Google developed it would be glossing over so many details.

The same with the chemists. There's nothing inherently different simply because they work at different places. Suppose the chemist at the university develops chemical A for some process X and the chemist at Merck develops chemical B for the same process X. Both are equal contributions; certainly it would be wrong to say that chemist 2 solved process X.

Is that more sensible?


Personally, I read it more as Google _applied_ -- developed useful software around -- speech-recognition technology that works. I can recall testing countless programs like Dragon Talk (mentioned in the article) to assist with writing school papers, to no avail. Those programs were sluggish and woefully inaccurate -- though this may no longer be the case, as I haven't given them a second chance since.

I don't think the article disparages past and present work from the NLP community, but you have to expect Google to catch the limelight for the application of such technology. I don't really see anyone else attempting what Google is attempting at scale.

Why wouldn't Google be praised (or an "owner") for being a generous benefactor to a community? Would you not consider research at the Googleplex (or any of the hundreds of satellite campuses) to be Google's?


Oh yeah, I agree that Google should be praised for taking research and producing useful applications.

I am not disparaging Google in the least. I respect them for bringing NLP to the masses. I meant to dispel the myth that Google invented speech recognition, or that it developed it. The reality is that it is as much of a developer as anyone else in the NLP community.


Actually, Google Translate is the best academic system there is. They never lost a single NIST evaluation (for an example see http://www.itl.nist.gov/iad/mig//tests/mt/2005/doc/mt05eval_... ), as far as I know, and they consistently publish intermediate results and details of their models.

By your logic, nobody ever writes software, as most of the building blocks were really invented by other people or adapted from other people's inventions.

And I say this as an NLP researcher.


This is a problem with reporting (and public perception) of computer software/services in general. I don't think it's widely appreciated that "breakthroughs" by consumer software/services companies generally are the tip of a decades-deep research iceberg.


Yes, Google is using other people's work. They didn't invent the computer, for example. So what? If their part of it was negligible, somebody else would have given me a useful voice recognition system by now. Why doesn't notepad have a microphone icon on it? If notepad added a microphone icon tomorrow, do you think it would work well? Check your priorities. Delivering a working solution that I can actually use matters a lot more to me than writing a paper.


The whole point of the article is that the data-driven approach turns out to be a lot more robust than existing academic initiatives. Could anyone else have done it? No. So I think it is fair to say that they developed this technology.


That's not the "whole point of the article".

The "approaches" are academic in origin; the difference is that Google has more data to train those approaches. Their "approaches" are not different from "existing academic initiatives." Academia often deals with issues like "how to deal with sparsity" of data (i.e. smoothing); at Google, those issues are less important. Having data doesn't not mean having developed the technology.


I tried it up. It doesn't work went well services dept.

Maybe this is my voice doesn't give me a couple of words wrong in all 50 states. Did welcome center for dental bar kansas city missouri.

----

I tried it out. It doesn't work quite as well as the article suggests.

Maybe it's just my voice, but it gets at least a couple of words wrong in almost every statement. "It", "welcome", and "thoughts", for example, are consistently misheard.


You can really see the "search term bias" in that snippet.


I've found the Android speech recognition works great for voice search and sucks for everything else. I've written a HN post or two with it; it either needed much hand editing or an outright translation to be intelligible.


Very interesting observation!

It didn't immediately seem obvious to me why there were references to "welcome center", "dental", etc.


I've got one year's worth of transcribed Google Voice VMs that back you up on this. The way Google does "best fit" matching of words sometimes results in messages that are just hilariously mistranscribed. You can usually get the gist of what the person actually said, which makes the transcription useful, but once in a while the message just diverts off into total nonsense.

Here's a recent one:

Hey George, it's me does not seem to be working and it's going, but it doesn't seem to be having issue tonight and I was just wondering if that was. If there's anything that we did like If we're not getting any call. Walker, I don't if it's just been forever and it's not Brian anything okay if you get this call. Bye.


I tried it out it doesn't work what is well is the article suggest

baby it is just my voice how to get to least a couple of words wrong almost every statement it welcome thoughts for example are consistently miss heard

--

Worked pretty well for me. Not perfect but close.


It got "how much wood would a woodchuck chuck if a wouldchuck would chuck wood" right for me. I am impressed now.

But it couldn't handle "Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn". Maybe I'm not pronouncing it quite right.


"how much wood would a woodchuck chuck if a wouldchuck would chuck wood"

That isn't really the best example, considering how these systems tend to work. If their system has a giant bank of text it's using to predict the statistical likelihood of your next word, then once you've said "how much wood...", "would" is a very high-probability candidate for the next word, and once you get to "woodchuck" the rest is statistically almost inevitable.
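
You can see the effect with even a toy bigram model (invented mini-corpus, nothing from any real system):

    from collections import Counter, defaultdict

    # Count, for each word, what follows it in a token sequence.
    def train_bigrams(tokens):
        succ = defaultdict(Counter)
        for w1, w2 in zip(tokens, tokens[1:]):
            succ[w1][w2] += 1
        return succ

    tokens = ("how much wood would a woodchuck chuck "
              "if a woodchuck would chuck wood").split()
    model = train_bigrams(tokens)
    # After "woodchuck" this model has only ever seen "chuck" or "would",
    # so the tongue twister is close to a best case for the recognizer.
    print(model["woodchuck"].most_common())  # [('chuck', 1), ('would', 1)]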

A better 'difficult' test would be something along the lines of 'colorless green ideas sleep furiously', although we can't actually use that one since that example is so famous it would likely turn up in a web-derived corpus many times.


How is that even a test for it? If you input garbage into a system like that, you're going to get garbage out. And for fun, walk up to a stranger and say something like that and see if the stranger can provide useful information.


Yes, it kind of failed on "movie times for 'source code'". That's a sentence that's hard to parse without context ;)


Are you sure you pronounced all the schwas?


It seems strange to me that the article makes no attempt to survey other current examples of speech-recognition technology in order to support the unstated implication of its lede. That is: "speech-recognition technology developed by those other than Google does not work."

I just downloaded the Bing app for iPad last night, and noticed it has a pretty decent speech-recognition engine from a company they acquired a few years back: TellMe. I tried all the examples given in the Slate article, and they were recognized just fine.

This makes me curious, are there a number of current-generation speech-recognition technologies that work at the level of Google's?

I should note that I didn't receive the desired behavior once my speech was recognized. When I asked the math question, a link to Wolfram Alpha was given, but I would have to click that link to get the answer. I had to go to maps to get any kind of relevant answer for "Directions to McDonalds," and I had to just say McDonalds and then click Directions to get the actual information. This failing appears to be a trait of the iOS app itself. Hand typing the math query into Bing on a proper browser did give me Google Calculator style results.


That's because it's Google PR (http://www.paulgraham.com/submarine.html). Nuance Communications[1] (which makes Dragon and N other speech-rec systems) is really the leader in this space. They've done server-based recognition solutions ~forever, and theirs is the solution Siri uses for its app, for example.

[1] Note: this is not the same as the Nuance Mike Cohen started, but a conglomeration of speech companies purchased over the years which inherited the name.

EDIT: Added some more detail


Yeah, but... I've tried Google Voice Search and it works incredibly well, whereas Dragon on the iPad did not work well enough to actually use for its intended purpose (although it was fairly impressive; I had low expectations).


I wonder at what point Apple is going to have to start building these kinds of technologies. Purchasing Siri was supposed to be a step in that direction but nothing has come of it yet.

I worry that it's not in Apple's DNA to build products they can't directly charge for. Apple doesn't do freemium. The hope would be that third parties would pick up the slack.

However, I'm not sure that startups can match Google in big data. So how will Apple catch up in voice recognition? In mapping (which also uses Android data to improve things like traffic and rerouting)?


Apple will make great speech recognition, maps and email when Google makes a great UI.


I honestly can't see Apple doing speech recognition. It's good, but it's only 99% good, and Apple is known for being perfectionist.

I'm not sure about mapping, it also doesn't seem like an Apple product to me, but I can't really articulate why.


That is a cliche, and an incorrect one. Take autocorrect for instance - hardly a perfect product. (http://damnyouautocorrect.com/)


You sometimes see one or two Apple people at speech conferences; at least that was true 2-3 years ago, and I cannot imagine that has changed.


Whatever anyone wants to say negatively about this article, it's bang on. I just got my first Android phone, it has 2.2, and the little microphone is my new best friend. I talk out todos, write emails and texts, search YouTube, search everything; it blows my mind. The only thing it's not good at is uncommon names and places (like my name "Micah" and "Quebec"). Rock on, Google.


"It even works if you've got an accent."

I have an indeterminate accent and my voice is on the low side of bass, so I trip up Google Voice pretty badly - it's mostly useless for me. FWIW, Rock Band also can't make sense of what I say. I do wonder when it'll be good enough to understand me.


I hate to break it to you, but Rock Band doesn't score you based on what you say. It just scores you based on whether you're hitting the right notes. :P


Does anyone happen to know if there are significant companies in the speech-recognition space besides Dragon and Google?


There's Microsoft. They purchased TellMe some time back and have had speech recognition in mobile Bing for some time now. I think it may predate Google's (though I'm not sure), with similar accuracy.

http://www.microsoft.com/en-us/tellme/

In Vista, MS included speech recognition as part of the shipping OS:

http://en.wikipedia.org/wiki/Windows_Speech_Recognition


TellMe is also used in Windows Phone 7 and it works great.


One assumes that the NSA is doing a lot.


When I took a speech rec class last semester, we had a guy from BBN (a subsidiary of Raytheon) give a talk about large-scale, extremely fast audio transcription. As in, systems that could process audio 30-40,000 times faster than real time. They traded off recognition accuracy to get this, so their accuracy was around 50-60%, as I recall. I asked why something like that would be useful, and he said if you're looking at a lot of data (which I heard as: eavesdropping on an entire telephone network) then all you need is a general idea of what people are saying and a few keywords before you can zoom in on specific clips for more thorough analysis.

So yes, I'm sure NSA is interested.


Yes:

http://en.wikipedia.org/wiki/Room_641A

I doubt they're inspecting all that traffic just to watch the pretty patterns.



What are the chances that Google can pick up enough from our voices to biometrically identify us from a crowd in future?

E.g. on a Google phone conversation, or when licensed to a surveillance company.


Hopefully they'll be able to apply this to Google Voice. The voicemail transcriptions are almost always hilariously wrong.


Google Voice actually uses the same technology and same dataset, AFAIK, which is why I was so confused by reading this article.

Google's stuff is pretty good some of the time, but they've hardly solved this problem to the degree the article suggests (as anyone who has actually used this for more than 5 minutes could tell you).


Long-form transcription is a pretty different problem for language models than parsing search queries. There's lots of audio-processing overlap, sure, but parsing a voicemail definitely has different, harder challenges.



