Hey Siri: An On-Device DNN-Powered Voice Trigger for Apple’s Personal Assistant (machinelearning.apple.com)
134 points by gok on Oct 18, 2017 | 81 comments



This is interesting and a feature I didn't know about. Hey Siri often fails to trigger for me in the car while driving. Now I know to retry, which gives me a better chance of triggering it.

"We compare the score with a threshold to decide whether to activate Siri. In fact the threshold is not a fixed value. We built in some flexibility to make it easier to activate Siri in difficult conditions while not significantly increasing the number of false activations. There is a primary, or normal threshold, and a lower threshold that does not normally trigger Siri. If the score exceeds the lower threshold but not the upper threshold, then it may be that we missed a genuine “Hey Siri” event. When the score is in this range, the system enters a more sensitive state for a few seconds, so that if the user repeats the phrase, even without making more effort, then Siri triggers. This second-chance mechanism improves the usability of the system significantly, without increasing the false alarm rate too much because it is only in this extra-sensitive state for a short time."


This is brilliant. Apple could also cut out the “hey” part to make it more natural.

“hey Siri... Siri”


"Hey Siri" has always seemed like such an awkward phrase to me. "Alexa" flows so much better, besides the general awkwardness if someone in the room has that name.


Alexa is actually a difficult word to pronounce for non-English speakers.


Really? Are you referring to non-European languages? Many European languages have their own versions of the name Alexander for both men and women, so I'd imagine Alexa wouldn't cause too many issues.


The 'x' in 'Alexander' is (typically) voiced, whereas the one in 'Alexa' is not. I can see that making a substantial difference for speakers of languages where those phonemes are differently composed.


What do you mean by voiced? I put the stress differently, but otherwise I would pronounce Alexa as Alexander cut short. I can imagine the vowels being pronounced differently in various accents, but I can't imagine an accent where the 'x' in those two names is different.


"Voiced" in the phonetic sense [1], i.e., spoken with the vocal cords vibrating. Voiced 'x' sounds like the /gz/ in "eggs", /ɛgz/; voiceless 'x' sounds like the /ks/ in the American English pronunciation of 'x' itself, /ɛks/.

Many, if not all, American English dialects pronounce the words in the fashion I describe. In them, the name 'Alexander' would be

    /ˌæ.lɛˈ(gz)æn.dər/
while 'Alexa' would be

    /əˈlɛ.(ks)ə/
- in both of which, the phoneme corresponding to the letter 'x' is parenthesized.

Generally in English 'x' is voiced when it precedes a stressed vowel, which it does in 'Alexander'; in 'Alexa', 'x' precedes a reduced vowel, and therefore would always take the unvoiced pronunciation. (It'd sound very odd to an anglophone ear otherwise - say /əˈlɛ.gzə/ one time out loud and see if you don't feel the same.)

That said, it wouldn't be incorrect to pronounce 'Alexander' in American English with an unvoiced 'x', as

    /ˌæ.lɛˈksæn.dər/
but, while I believe some dialects of English may default to this pronunciation, certainly not all do. (Neither of the dialects I speak does so, at the very least.) This pronunciation also produces a "hitch" or break in the word between the unvoiced 'x' and its preceding vowel, which would tend to make it a little odd both to hear and to say.

[1] https://en.wikipedia.org/wiki/Voice_(phonetics)


> I put the stress differently, but otherwise I would pronounce Alexa as Alexander cut short.

In “Alexander”, the normal pronunciation of the letter “x” is the voiced consonant cluster /gz/; in “Alexa”, it's usually the unvoiced cluster /ks/. (Also, the second “a” is usually different between the two, being /æ/, like the “a” in “pad”, in “Alexander” and /ɑ:/, like the “a” in “father”, in “Alexa”.)

The difference in the “x” is the normal way that the pronunciation of “x” differs when following a stressed vs. unstressed vowel in English.


Particularly difficult for Chinese speakers. I've heard Alessa a lot.


Works fine in Spanish, but I can imagine other languages having trouble.

I'm actually pretty impressed by Alexa's voice recognition so far.


Alexa definitely flows a bit better than Hey Siri, but being Norwegian, I find the phrase "ok google" almost comical. My mouth just can't make the sounds quickly enough for it to not sound ridiculous. Hey Siri and Alexa both flow well by comparison.


I 100% agree, as a native English speaker (western Canada). "Hey Google" has a much nicer flow, although I say "Google" too much on a daily basis for me to want that to be a trigger (practically speaking).


"echo" is even better. So weird to me that they call the device "echo", already a pretty strong brand name, but then insist on the "alexa" naming of the assistant.


I'd imagine it's to give the product/voice a more human-like feel to it. "Echo" sounds like I'm talking to a robot, or a dog (or a robot dog). "Alexa" could be a human.


That seemingly simple repeat functionality would require a much less simple NN, from what I understand.

That would trade off both security and privacy for very limited gain.


I occasionally use the "hey Siri" feature while in bed to mess with alarms in the morning (i.e. cancel whatever I had set and set a later one ;). Apparently my groggy morning voice always requires a second "hey Siri".

As an aside, alarms are literally the only thing I've ever used Siri for.


same, I use Siri every day, and only ever for two reasons:

"Hey Siri, set an alarm for {seven,eight,nine} a.m." "Hey Siri, call {mom,dad,[friend]}"

I've created unambiguous nicknames for my friends so I don't get name collisions


Odd, Hey Siri always triggers for me in the car while driving.


With the iPhone 8, Siri is finally as useful as my Google Home speaker in terms of reliability and utility. For example, to find your iPhone you can say "hey Siri, where are you," or talk to Siri in the car to change songs or ask her to play similar songs. She finally understands almost every command I send her.


And yet, if you ask Siri for directions to your next appointment, she's still clueless! It's one of the only things that I would actually use Siri for, and for the last three releases of iOS I've hoped for it. How in the world does this not work?!?!

She knows my next appointment, and she knows how to give directions, but she refuses to give me directions to my next appointment. If I could throttle her for her stubborn insubordination, I would. lol


Maybe you've got a nicer (quieter) car =D.

It usually works for me too, but sometimes it doesn't. Now I know to retry.


Heh, nah, lots of road noise in my car, actually. But it still works far more often than not.


I own multiple iOS devices, a few Echos, and a Google Home. One of the things I noticed after getting an Echo was how much more fluid and simple it was to invoke the "assistant". Simply asking, "Alexa, what's the weather today" just seemed so much more natural than having to prefix everything with "Hey" or "Okay".

Using Siri or Google Assistant for more than one question at a time quickly makes me feel like I'm going insane. "Hey Siri.. Hey Siri.. Hey... Hey..."

I'm hoping Google and Apple fix these subtle annoyances. Or maybe it's just me.


I think the reason the 'catchphrases' for Siri and Google Assistant are longer is that the cost of false positives is higher. Phones are subject to much more varied conditions than a cylinder that lives on your nightstand. Anecdotally, I have seen Alexa activate incorrectly much more often than Siri; I just care a lot less.


Hey Google and Hey Siri are not longer than Alexa... still three syllables.


Isn't part of the point of the prefix to avoid collisions with sounds that are part of everyday speech not intended for the assistant? "Alexa" becomes problematic when Echo is used in an office/home in which someone is named Alexa -- among the top 100 most popular female baby names since 1995 [0] -- which is why the wakeup word can be changed to "Echo" or "Amazon" or "Computer".

But the sound of the wake word, whether it's just "Alexa" or "Hey Siri" vs. "Siri", doesn't seem to deal with the main issue of your complaint, which to me is how limited "conversation" is with the assistant.

If you ask Alexa for the weather, you'll still have to say her name for any followup questions within that immediate context, i.e. "Alexa, what's the weather today? Alexa, what's the weather this weekend?".

Though there are a few functional exceptions in which Alexa will prompt you for additional information without needing to be re-awakened, e.g.

You: "Alexa, set my alarm for 6 'o clock"

Alexa: "Is that 6 'o clock in the morning, or in the evening?"

[0] https://www.ssa.gov/OACT/babynames/index.html


I always felt that Siri was named with the intent that it could be used as a wake word (so you could say "Siri what's the weather today?") because Siri itself is a rare given name and [si ri] are sounds rarely said at the beginning of a sentence.

And then at some point Apple realized they had to make a longer wake word to cut down the number of false positives ("Siri" -> "Hey Siri", from 2 syllables to 3).

Google probably went through the same process ("Google" -> "Okay Google", from 2 syllables to 4).

Amazon probably deliberately chose a 3 syllable name with "Alexa" for the same reason.

I can imagine future improvements where we can have the originally imagined wake words "Siri", "Google" and "Alexa", and at that point I would be most happy with "Siri" because it would be short and not-corporate.


Siri (the company) was a spin-off from SRI – Stanford Research Institute.

The initial product, before the Apple acquisition, was an iPhone app with a chat interface. I don't recall that it supported voice input.

It is still possible that the founders were thinking ahead to voice input and wake words when they named the company.


The technology was initially developed under the DARPA CALO research program (https://en.m.wikipedia.org/wiki/CALO).


It's my understanding that Siri came from SRI Labs, which sold the technology to Apple.


> and [si ri] are sounds rarely said at the beginning of a sentence.

seriously?


Yes, because the first two phonetic syllables of "Seriously" are [sɪ rɪə], which makes it different.


I have my Echo set to use "computer" as the wake word because it's just way more fun that way. But, of course, "computer" is a pretty common word these days. It hasn't yet been annoying enough for me to change it.


You can change the prompt it listens for


Plus, "Ok, Google" is just a terrible sequence of consonants/syllables. Just doesn't work in my mouth.


Glad I'm not the only one -- I just can't say that phrase without garbling it: "Okayggglglele"


"Hey Google" is slowly becoming a secondary option: http://www.androidpolice.com/2017/10/17/hey-google-hotword-n...


This is what I usually use for my Google Home these days, but in my anecdotal experience it seems to fail to recognize it much more often than "OK Google" did. I wish it offered more options in terms of activation words, or better yet, let us set and train Google Home to use our own custom activation words.


I’m still baffled that we can’t name our personal assistants the way we see fit. Why can’t I trigger it with « Hey Jarvis », « Alfred », « Marvin », « Come in Scotty », « Computer », or whatever hot word I fancy, especially when it’s supposed to respond only to my voice? Is that because of some pre-training of the NN?


You can train Assistant on your phone to a custom word; I have to imagine it's on the roadmap for Google Home at some point.


If you're doing more than one interaction, you can just hold the home button to activate Siri for followups. And you don't need to prefix 'hey siri' if you're activating Siri via home button or screen button.

I've seen so many people prefix 'Hey siri' when they don't really need to:

Holds home, Siri activates "Hey Siri, how's the weather?"

It could just be "How's the weather?" in that instance.


I shouldn’t need to click a button to effectively interact with my -voice- assistant :)


But it seems a reasonable interface concession given the relative infrequency of multi-query conversations, plus the huge undesirability of having Siri figure out for itself (i.e. continue listening and sending data to Apple's servers) whether the conversation has actually ended. Also, for many conversations you'll need to look at the phone anyway (i.e. have it in your hand) to see Siri's answers.


When one person makes a concession it's good business. When millions of users make a concession it's bad design.


I wonder how that works on the iPhone X


The sleep/wake button on the right hand side of the device now triggers Siri when you hold it down for long enough.


Then how do you turn the thing off?


Depends on what you need.

You can now reboot from Settings in iOS11 if the device is responsive and unlocked.

If it's locked or you just want to properly shutdown quickly, press and hold a volume button and the lock button to bring up the SOS mode, which includes the shut down slider.

If the device is unresponsive, it's much less obvious now: https://ios.gadgethacks.com/how-to/force-restart-iphone-x-wh... Volume up, then down, then press and hold the lock button.


They added shut down and reboot to the settings app in iOS 11.


Probably hold it down even longer.


> you can just hold the home button to activate Siri for followups

Unhelpful while I am driving...

Requiring physical controls massively reduces a voice assistant's utility and undermines its core purpose.


“Requires”?


If this is the only designed in way to make multiple queries without saying "Hey Siri" repeatedly, then it is required for that operation. You literally have people above calling this the "solution" to the issue raised.

It is like Schrödinger's UI design: it is simultaneously claimed as the solution to the problem being raised and called entirely optional. Sounds to me like there is no actual solution, and this is a band-aid.


It'll do multi-part queries without pressing the button (or saying "Hey Siri") between each part. On the other hand, if you want to check the weather AND set an alarm, you need to activate Siri twice.


“If this is the only way to do it, except for that other way of doing it, then there’s only one way to do it”!


I think it comes down to the fact that, for all the « conversational » publicity stunt, the current implementations insist on the voice assistants being non-modal. When you think about it, every vocal interaction you have with people is explicitly (through voice) or implicitly (through body language) modal. How hard would it be for the mode to end with « thank you Siri », a sensible time-out, or, now with the iPhone X, loss of directed attention (and even be activated again by the return of directed attention within a reasonable timeframe)?


Agreed. The constant "hey Siri" is exhausting compared with Alexa's approach. Though, as others noted, Siri was developed for a phone and devices on the go. Echo is a device that stays in your home, where you don't have to worry about someone saying "Alexa" unless a family member or friend is named that, in which case you can change the word. HomePod will be Apple's first chance to change that, but I'm pretty sure they are going to stick with "Hey Siri" for consistency and reliability.


What if someone in the room is named Alexa?


Then you change Alexa to one of the other options (Amazon, Echo, Computer)


Pretty cool how they reduce the power consumption - when it first came out, "Hey Siri" required your device to be plugged in:

> To avoid running the main processor all day just to listen for the trigger phrase, the iPhone’s Always On Processor (AOP) (a small, low-power auxiliary processor, that is, the embedded Motion Coprocessor) has access to the microphone signal (on 6S and later). We use a small proportion of the AOP’s limited processing power to run a detector with a small version of the acoustic model (DNN). When the score exceeds a threshold the motion coprocessor wakes up the main processor, which analyzes the signal using a larger DNN.

> Apple Watch uses a single-pass “Hey Siri” detector with an acoustic model intermediate in size between those used for the first and second passes on other iOS devices. The “Hey Siri” detector runs only when the watch motion coprocessor detects a wrist raise gesture, which turns the screen on. At that point there is a lot for WatchOS to do—power up, prepare the screen, etc.—so the system allocates “Hey Siri” only a small proportion (~5%) of the rather limited compute budget. It is a challenge to start audio capture in time to catch the start of the trigger phrase, so we make allowances for possible truncation in the way that we initialize the detector.
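
A rough sketch of that two-pass cascade (the function names and threshold values here are hypothetical; the post doesn't publish them):

    def cascade_detect(frames, small_dnn, large_dnn,
                       first_pass_threshold=0.4, second_pass_threshold=0.9):
        """Two-pass detection: a tiny model screens every frame cheaply
        (the AOP's job); only promising frames are re-scored with the
        larger acoustic model (i.e. the main processor is woken up)."""
        triggers = []
        for i, frame in enumerate(frames):
            if small_dnn(frame) >= first_pass_threshold:       # first pass, low power
                if large_dnn(frame) >= second_pass_threshold:  # second pass, main CPU
                    triggers.append(i)
        return triggers

The point of the cascade is that the expensive model only runs on the small fraction of audio the cheap model flags, which is what keeps the always-on power budget tiny.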

Another interesting nugget:

> There are thousands of sound classes used by the main recognizer, but only about twenty are needed to account for the target phrase (including an initial silence), and one large class for everything else. The training process attempts to produce DNN outputs approaching 1 for frames that are labelled with the relevant states and phones, based only on the local sound pattern. The training process adjusts the weights using standard back-propagation and stochastic gradient descent. We have used a variety of neural network training software toolkits, including Theano, Tensorflow, and Kaldi.
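
For anyone curious what that frame-level setup looks like in code, here's a toy TensorFlow/Keras version (layer sizes, feature dimensions, and the data are invented; only the overall shape, roughly twenty target classes plus a catch-all trained with cross-entropy and SGD, follows the description):

    import numpy as np
    import tensorflow as tf

    NUM_CLASSES = 21        # ~20 "Hey Siri" states/phones + 1 class for everything else
    FEATURE_DIM = 40 * 11   # e.g. 40 filterbank channels x 11 stacked frames (assumed)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="sigmoid", input_shape=(FEATURE_DIM,)),
        tf.keras.layers.Dense(128, activation="sigmoid"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    # Cross-entropy pushes the output for the labelled class toward 1;
    # weights are adjusted by back-propagation with stochastic gradient descent.
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy")

    # Placeholder data: per-frame features with per-frame class labels.
    frames = np.random.randn(1000, FEATURE_DIM).astype("float32")
    labels = np.random.randint(0, NUM_CLASSES, size=1000)
    model.fit(frames, labels, batch_size=32, epochs=1)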


Just in case (judging from the comments) non-iPhone owners aren't aware: this isn't a new feature in the iPhone at all; Siri has had voice-activated prompts since late 2015 [0]. What it looks like is that the team finally got the OK to share technical details about how this works with the public, and that's what we're seeing here.

Apple doesn't usually do tech writeups; I imagine a pinhead at the corporate level decided it wasn't worth the risk of leaking any "secret sauce" until now.

0: https://www.cultofmac.com/390181/5-ways-hey-siri-will-change...


That pinhead was Steve Jobs.


Pretty cool of Apple to post such technical blog posts as this. I love it when companies do this.


Indeed. Quietly but surely, Apple is changing its corporate policies around developers. We used to never see Apple developers at conferences with an official affiliation, let alone onstage as speakers. I once ran into an engineer at a Ruby conference who demurred about who he worked for, and when he finally told me he worked for Apple, he had to recite the whole "what I say does not reflect nor represent..." preamble. That was as late as 2014.

Great to see such corporate changes toward developer-friendliness.


Great explanation; this technique has a lot of applications for extracting event triggers from audio streams. Even though I've trained my iPad repeatedly, it is still a bit too eager to answer others who talk to it (either on purpose or by accident).

At some point I expect Apple to design an audio neural network processor to put on their CPU chips, which will allow them to do both phrase recognition and highly accurate speaker-dependent speech-to-text on their devices. It will be yet another way that people who don't build silicon won't be able to compete.


Does anyone know of papers and/or example implementations of similar DNNs for acoustic modeling using TensorFlow or some other framework?


Google's Deep KWS paper [1] is kind of similar, although they don't use an HMM.

[1] https://static.googleusercontent.com/media/research.google.c...


Are there any open source projects that do this? I mean just the triggering part, not the command recognition later.


There are Snowboy and uSpeech on GitHub. I haven't used them yet.


Snowboy isn’t open source, though.


Your connection is not private

Attackers might be trying to steal your information from machinelearning.apple.com (for example, passwords, messages, or credit cards). Learn more NET::ERR_CERT_COMMON_NAME_INVALID

Access Denied

You don't have permission to access ".../machinelearning.apple.com/2017/10/01/hey-siri.html" on this server. Reference #...


2012: OK Google / 2014: Alexa / 2017: Hey Siri


Hey Siri has been an iPhone feature since the 6s, which came out in 2015.


It's a feature on older iPhones as well (such as my regular 6) but requires being on a power source since those lack the specialized, low-power chip.


Not sure what you're trying to say with that timeline. Siri has been available (on iPhones) since 2011.


I think the point is that on-phone DNN wake-word detection has been around for many years. Google shipped it like ... 3? years ago. A lot of Apple's technical ML posts feel a bit lackluster for that reason. Targeted at people who aren't active in the field.


Apple’s Hey Siri wake word isn’t newly available this year either. I think it has been available since the iPhone 6s.


For context, google published a paper on their on-device wakeword system 1.5 years before the 6s came out.

https://static.googleusercontent.com/media/research.google.c...


Exactly! Thank you.



