Google voice search: faster and more accurate (googleresearch.blogspot.com)
188 points by gok on Sept 25, 2015 | hide | past | favorite | 67 comments



Google voice search also has anaphora, or backreferences to previous searches. For example voice search for "Who was the first president of the United States?" then after you get the result do another voice search for "Who was his vice president?" and Google will infer you are still talking about George Washington.
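As a toy illustration of that idea (the function, pronoun list, and substitution logic here are invented for the example, not Google's actual implementation), a follow-up query resolver might substitute the previously answered entity for a pronoun:

```python
# Toy pronoun resolution for follow-up voice queries; purely illustrative.
def resolve_followup(query, previous_entity):
    pronouns = {"his", "her", "their", "its"}
    words = [previous_entity + "'s" if w.lower() in pronouns else w
             for w in query.split()]
    return " ".join(words)

print(resolve_followup("Who was his vice president?", "George Washington"))
# → Who was George Washington's vice president?
```

Real anaphora resolution has to track multiple candidate entities and decide when the context has changed, which is the hard part.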

One step closer to conversational interfaces!


I'm doing a lot of voice searches recently. It's just easier and feels more natural. It's amazing how accurate Google's voice recognition is. I'm a non-native English speaker, and sometimes I'm not even sure I pronounced something correctly, but it gets it!


I've used it to figure out the spelling of a word I couldn't understand.

There's a reporter on NPR who sounds like he always introduces himself as "han zhi lu wong". I say that to google voice search, and it corrects it to "Hansi Lo Wang".

So google voice recognition works even if you don't know what words you're saying!


Well, sure, that's precisely how speech transcription works. When you say the words out loud you aren't speaking in letters, you're speaking phonetic sequences. It's a recognizer's job to decode the best word sequence spelling (from its list of known word spellings and their pronunciations) from the input pronunciation.
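A toy sketch of that idea (the lexicon entries and similarity scoring are made up for illustration): match a heard phone string against a pronunciation dictionary and return the closest known word.

```python
import difflib

# Tiny pronunciation lexicon; entries invented for illustration.
lexicon = {
    "Hansi Lo Wang": "HH AE N S IY L OW W AA NG",
    "Han Solo": "HH AE N S OW L OW",
}

def best_word(heard_phones):
    # Score each known pronunciation by string similarity to the input phones.
    scores = {word: difflib.SequenceMatcher(None, heard_phones, pron).ratio()
              for word, pron in lexicon.items()}
    return max(scores, key=scores.get)

print(best_word("HH AE N S IY L OW W AA NG"))  # → Hansi Lo Wang
```

A real recognizer scores phone sequences probabilistically and searches over whole word sequences with a language model, but the lexicon-lookup idea is the same.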


My point is, it's amazing that it's better at decoding phonetic sequences (which are presumably garbled by passing through a human being, me, who doesn't understand them) than a human being who has evolved to use language and has over 30 years of fluent experience.


Any language that uses the Latin alphabet has different rules for how pronunciation is derived from or encoded with the letters.

e.g. with your example, the letter pair ‘si’ is pronounced differently in Mandarin than it is in English. So it's not surprising that you couldn't write it down properly in pinyin, since you don't know how to transcribe Mandarin into pinyin. But Google does.

Another example - without knowing how French spelling works there is no way that as an English speaker you could work out how to correctly spell ‘peut’ (can) just from hearing it.


The last time I checked, it pronounces Minot as "minnow", whereas the North Dakota town rhymes its name with "why not?", so you'd say it as "my-not".

Now Google recognizes I'm talking about Minot, but it says "minnow" back to me.


That's really cool, but I don't know if I'd describe that example as being better at decoding phonetic symbols. You might still be better at mapping phonetic symbols to words and names you know -- which I think is a separate skill from mapping them to new words or names -- it's just that you didn't know that name and the software did. And it might also be better at knowing lots of words, which is yet another skill/property.

If you both have learned the word, who is better at recognizing it? And who is better at learning a new word they don't know? These aren't very exacting questions because the comparison between human and machine knowing and learning hasn't been defined. Maybe the machine needs more examples and more contexts to learn a word as well as a human and correctly map to the word as often as a human, but it can also examine more examples and contexts than a human can per unit time and can do a lot of scaling in this regard using increased energy that a human cannot.


Except that his phonetic sequence probably wasn't entirely correct. We don't repeat what we hear, we repeat what we think we hear, after we put our native tongue's filter over it to try to make sense of it. So what he said to Google voice probably wasn't completely accurate, but it still knew what he was talking about.


Presumably Google taps into its gigantic web index to find sensible terms. I played with Google Translate by typing DragonBall Z names phonetically, and Google suggested the actual names at the top of the list. I can't help thinking it cross-references queries, trends and such.


I've had the exact same experience with Lakshmi Singh. Pronounce the sounds, and get the result....


Yes, ditto, and Jian Ghomeshi, Sylvia Poggioli, and Neda Ulaby. Seriously, what is it with NPR commentators??


[deleted]


> Imagine being implicated for a crime because Google thought you said something, but you said something differently.

That seems really unlikely to ever happen. Note that by default Google stores the audio of your searches - including the few seconds before you said "ok google" - indefinitely: https://history.google.com/history/audio


The paper describing this work: http://arxiv.org/pdf/1507.06947v1.pdf

They trained on 3 million "utterances" of average duration of 4 seconds. These were distorted by noise to get 20 variations (so the training set was 60 million utterances total).
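The noise-distortion step can be sketched roughly like this (the SNR handling, parameter names, and mixing recipe here are my assumptions for illustration, not the paper's exact procedure):

```python
import numpy as np

def augment_with_noise(clean, noise_bank, num_variants=20, snr_db=10.0, seed=0):
    """Mix one clean utterance with random noise snippets at a target SNR."""
    rng = np.random.default_rng(seed)
    variants = []
    for _ in range(num_variants):
        noise = noise_bank[rng.integers(len(noise_bank))][:len(clean)]
        # Scale the noise so the mixture hits the requested signal-to-noise ratio.
        p_sig = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
        variants.append(clean + scale * noise)
    return variants
```

Applied to 3 million utterances, 20 variants each gives the 60 million total.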

I don't understand if these were labeled somehow. There's a section on clustering into 9287 phones, but it isn't clear to me if these were used as labels.


AFAIK current ASR systems use grapheme transcriptions for training their acoustic models. So they have the speech and the transcription "Hello World", and during training these are automatically converted to phones, something like "hələʊ wɜ:ld", using extensive phonetic dictionaries and some algorithms. The 9287 are context-dependent phones: like "a", but with a "b" on the left and an "r" on the right. Theoretically, for 40 phones you end up with 40^3 context-dependent phones, but in practice this number is much lower.
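Enumerating triphones for a toy phone inventory shows why the theoretical count is cubic (the clustering down to 9287 units is what keeps the real number manageable):

```python
# Context-dependent (triphone) units for a toy 4-phone inventory.
phones = ["a", "b", "r", "sil"]
triphones = [(left, center, right)
             for left in phones for center in phones for right in phones]
print(len(triphones))  # → 64, i.e. 4**3; for 40 phones it would be 64,000
```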


This is great and all, because I love Google voice search, but the problem I REALLY wish they'd solve is false triggers of "Okay, Google."

I often listen to podcasts on my car bluetooth and on a bluetooth speaker at home. On my commute, I'll get at minimum 5-15 "Okay, Google" triggers in a 50 minute drive just from people on the podcast saying things like "and" or "okay" or phrases that sound nothing like "Okay Google". I have even done the voice training so it's only supposed to listen for my voice. On the other side of the coin, I'll sit in my car screaming "Okay Google!" over and over with no response.


You should consider retraining your voice model. When I first set up my phone to only recognize my voice, there were people talking in the background, and it made it behave exactly how you're describing.

Shortly after, I retrained the voice model in a quiet room by myself, and now it works flawlessly.


I've retrained it in a quiet room in my house and it didn't help, but it's worth trying again. I'll give it a shot, thanks.


You can choose a different phrase in the settings. Wouldn't that help? Something without 'okay' maybe?


Unfortunately that's only a feature of Motorola phones that they built on top of Google Now. It's really brilliant and I wish Google would allow it on stock.


Wow! Now it would be great to have this speech recognition service available as an API.


Chrome provides it to in-browser applications via the Web Speech API. However, Google doesn't allow any other browsers or services to use that endpoint (except for Chromium development, and then only with an extremely limited quota).


There's an API for it on Android as well that app developers can use.


I've seen Android Wear apps on the Play Store that utilize it as well.


50 API reqs/day. A very small quota, which is sad, because it is over twice as good as the next best competitor I've found (IBM Bluemix).


That is really too bad. I'd have loved an offline application that could reliably recognize my speech, like the Amazon echo but not connected to the Internet.


Yeah, good luck with that. I'd say Google's voice recognition is one of their most valuable assets. Siri is the next best, but it isn't really close.

It's a shame they haven't done more with it really.


I don't know about Siri, but there are various means to integrate commands into both Google speech recognition (at least on Android) and Amazon Alexa (Echo). Amazon at least is looking to get third-party devices integrated. Probably not what you were hoping for in either case, but it is enough for a lot of potential uses.


Google voice search is impressive and, anecdotally for me, more accurate than Siri and Cortana in how it interprets my voice to text. Is there any insight into the hardware needed to run their neural network and store the trained model?


Will these improvements be used when using voice input in Google Docs?

http://gizmodo.com/you-can-now-type-with-your-voice-in-googl...


Yes.


While I appreciate the technological advancement that this signifies, I absolutely think I'd sound like an idiot if I talked like I write.


It takes some practice to learn how to dictate a response the way that you would type it. However, it has a lot of advantages. For instance, I was able to dictate this response to my smartphone in a lot less time than it would have taken to type it.


> now used for voice searches and commands in the Google app (on Android and iOS)

Is this available offline, or must one be connected?


It looks like they added offline speech recognition to Android. After the Google search app updated today, it downloaded speech files for the default language. There is now a settings section for downloading languages. The voice recognition works in airplane mode. Some voice actions, like opening apps, work offline.


>It's amazing how accurate Google voice recognition is.

I can't say how accurate it is, since I've used it very little so far. But adding my 2c:

I first tried it some time ago (2+ years) on my mid-range (at the time) Android phone, and it was not really usable. I set it aside for a while. Then I tried it recently, on the same phone, mind you, which is now 2 or more years older, so not recent at all. Surprisingly, it worked a lot better than before (based on a small sample of tests, note). Going to experiment with it more.

Something that might be known to many readers here, but mentioning it:

Peter Norvig, Director of Research at Google, has said in the past that by training the voice recognition software on huge amounts of data (at Google scale) with statistical algorithms, they have managed to improve it a lot. (Similarly for spelling-correction suggestions in Google Web Search.)

Related: A couple of simple experiments by me with voice recognition (speech-to-text) and speech synthesis (text-to-speech) using Python:

1:

https://code.activestate.com/recipes/578839-python-text-to-s...

http://jugad2.blogspot.in/2014/03/speech-synthesis-in-python...

2:

http://jugad2.blogspot.in/2014/03/speech-recognition-with-py...


Google Voice Search: broken.

Some time in September, Google made a server-side change to Voice Search that causes the Android Google Search client, at least some versions of it, to crash. Android handsets get a pop-up with "Unfortunately, Google Search has stopped."[1][2][3][4]. This also breaks voice dialing and texting. Some people who had voice input as the default found they could no longer text at all until they disabled Google Voice Search. It's not a change on the client side; it's happening even for phones that don't have over-the-air updates enabled.

The usual suggestions, involving clearing caches and resetting various settings, have been made, and they're as useless as usual. The problem appeared a few weeks ago, and has been reported for at least T-Mobile and AT&T, and for at least ZTE and LG phones. So it's not carrier-specific or handset-maker specific.

Did this "faster and more accurate" change involve a change to the wire protocol? A recent change is clearly crashing the client side in the phone.

[1] https://productforums.google.com/forum/#!topic/websearch/0ZM...

[2] http://forums.androidcentral.com/general-help-how/582873-why...

[3] https://forums.att.com/t5/Android/Google-Search-has-stopped/...

[4] https://support.t-mobile.com/message/518061#518061


The rating on this posting has been going up and down every few minutes. It's amusing to see what happens when you criticize Google or Apple. Apple seems to have a response time of about an hour before criticisms get down-voted. Google is faster.


Google Search unfortunately removed a great feature on Android: "okay google, search <blah> on Spotify." Now instead of opening the native app with the search intent, it goes to the web result. :/


When was this removed? I'm running Android v5.1.1, just tested this and it works fine. Thanks for the tip!


I wish there were an easier way to report outliers or wrong results. For instance, I asked to see photos of Tony Cruz. It showed me photos of Toni Kroos. Understandable that a soccer player may be more popular than a baseball player, so I restated and asked for photos of Tony Cruz of the St. Louis Cardinals.

It took the query and showed the same photos. lol

Collecting these sort of results into a larger data set could help refine the results.


Hah, that's a shame.

I had the same thing happen when I asked it to play Bulerias, which it kept hearing as some common variation of that word: "Blue rays", etc.

As soon as I provided context and said "flamenco bulerias", that fixed it.


I just tried these exact same two queries and it showed me photos of Toni Kroos for the first one and Tony Cruz for the second.


I've been trying to use the voice-search feature, but my phone is a moto G 2 and maybe the phone is too slow or my internet connection is too weak, but I find the long delay after "OK Google" makes it just too clumsy to use naturally.


I'm curious how much of the accuracy gains come from only having to run the decoder when the LSTM emits a phoneme rather than for each 10 ms frame, which presumably allows the language model search to be much more aggressive.


Where are you seeing that? In the paper[1] it says:

Acoustic features are generated every 10ms, but are concatenated and downsampled for input to the network: 8 frames are stacked for unidirectional (top) and 3 for bidirectional models (bottom).

[1] http://arxiv.org/pdf/1507.06947v1.pdf
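The frame stacking the paper describes can be sketched like this (the stride value is an assumption on my part; the quote only says frames are concatenated and downsampled):

```python
import numpy as np

def stack_frames(feats, stack=8, stride=8):
    """Concatenate `stack` consecutive 10 ms feature frames, stepping by `stride`."""
    n, dim = feats.shape
    stacked = [feats[i:i + stack].reshape(-1)
               for i in range(0, n - stack + 1, stride)]
    return np.array(stacked)
```

With stack=8 and stride=8, the network sees one 80 ms super-frame for every eight 10 ms frames, which is where much of the speedup would come from.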


I'm not talking about the input, but the output:

"…predicted word sequence where the word with highest probability is taken ignoring repetitions and the blank label with no language model or decoding."

Which I took to mean: when the acoustic model emits a blank symbol, they don't run the decoder again until a non-blank symbol comes out.
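The quoted evaluation ("ignoring repetitions and the blank label") corresponds to the standard greedy collapse of CTC output: merge consecutive repeats, then drop blanks. A minimal sketch:

```python
def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse a per-frame CTC label sequence: merge repeats, drop blanks."""
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Per-frame labels for "hello" with 0 as the blank; the blank between the
# two 3s is what lets the double letter survive the collapse.
print(ctc_greedy_collapse([1, 1, 0, 2, 3, 3, 0, 3, 4]))  # → [1, 2, 3, 3, 4]
```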


I think that is just a technology showcase: the technique is so good that, for a small-to-mid-size pronunciation dictionary, it can work quite well without using a language model. Similar results were reported recently (ASR systems without language models, or with character models). But it seems the real system still uses a graph generated with a language model and context-dependent phones.


Please Google release an "Echo" and I'll buy it right away...


Is there a way to run this without an internet connection?


Yes, the model can be loaded onto your phone. Details are here: http://stackoverflow.com/a/21329845


I highly doubt the neural nets required would fit in the memory of whatever device you're using, high end PC or not. And, there's absolutely no way they would put these highly valuable neural nets on anything that would allow them to be copied.


I think this is great; previously it took a long time to open after saying "OK, Google".


Does this mean that voice recognition is a solved problem?

If not, what problems are still left to be solved?


I just asked Google about the weather in my city tomorrow and it said it doesn't know any city named "Banana".

So no, it's certainly not a solved problem, especially if you want to use that kind of functionality outside the few English-speaking countries.


I'm not sure this is a discrete, static problem that can ever be 100% solved. We're talking about the speech patterns of billions of people...different accents, vocabularies, dialects, health issues (ex, stroke victims), etc. I'm sure it'll surpass human-level recognition, but there will seemingly always be use-cases that will present unresolved challenges to the AI.


Not even close. This was a small incremental gain (maybe a 10% relative reduction in errors, e.g. a word error rate of 8% dropping to roughly 7.2%). It's cool work, but if you're an outsider, this specific announcement shouldn't lead to any conclusions.

And this is close-talk speech. The holy grail (20 feet away at a party with an error rate below human) is decades away.


Seems like they are getting close. But some remaining issues:

- Not yet human-level accuracy, even for clean speech.

- Vocabulary limits are still an issue (but they are much better than before).

- Recognition in very noisy environments.

- Recognition with multiple overlapping voices.

- Heavy accents.

- Spontaneous speech.

- Languages other than English are not yet at the same level.


French here; I can guarantee you it's not a solved problem!


It's heavily optimised for quick contextual queries. I hit the microphone and said "<supermarket name> <my town> opening hours" today and it simply replied (aloud) "<supermarket> is open until 21:00". This is great stuff, but it still feels like a voice interface to Google vs a personal assistant like Cortana or Siri


Feel free to say "Hello" to get a quick tutorial of the assistant features. The Google app understands searches as well as assistant-like features (like "send a text" or "open Facebook")


how to test this? just speak to android-search without any update needed?


yes. the speech processing is done server-side, there's nothing to update in the app.


It will be great once we can run the generated model locally. It would save a bunch of latency and bandwidth, not to mention the privacy implications of not having speech saved in the cloud.


What do you mean? Google Voice Search works for me with my phone in airplane mode.


I agree with you, but will it ever happen? I would imagine that Google benefits from capturing the audio, so they really have no incentive to allow you to use their voice recognition without them getting the audio.


Great, now can we please have Select All for our Google Voice inbox so those of us who forward our texts and calls to Gmail don't have a Google Voice page that says "Inbox (8821)"?



