Hey Siri: An On-Device DNN-Powered Voice Trigger for Apple’s Personal Assistant (machinelearning.apple.com)
134 points by gok on Oct 18, 2017 | 81 comments



This is interesting and a feature I didn't know about. Hey Siri often fails to trigger for me in the car while driving. Now I know to retry, which gives me a better chance of triggering it.

"We compare the score with a threshold to decide whether to activate Siri. In fact the threshold is not a fixed value. We built in some flexibility to make it easier to activate Siri in difficult conditions while not significantly increasing the number of false activations. There is a primary, or normal threshold, and a lower threshold that does not normally trigger Siri. If the score exceeds the lower threshold but not the upper threshold, then it may be that we missed a genuine “Hey Siri” event. When the score is in this range, the system enters a more sensitive state for a few seconds, so that if the user repeats the phrase, even without making more effort, then Siri triggers. This second-chance mechanism improves the usability of the system significantly, without increasing the false alarm rate too much because it is only in this extra-sensitive state for a short time."


This is brilliant. Apple could also cut out the “hey” part to make it more natural.

“hey Siri... Siri”


"Hey Siri" has always seemed like such an awkward phrase to me. "Alexa" flows so much better, besides the general awkwardness if someone in the room has that name.


Alexa is actually a difficult word to pronounce for non-English speakers.


Really? Are you referring to non-European languages? Many European languages have their own versions of the name Alexander for both men and women, so I'd imagine Alexa wouldn't cause too many issues.


The 'x' in 'Alexander' is (typically) voiced, whereas the one in 'Alexa' is not. I can see that making a substantial difference for speakers of languages where those phonemes are differently composed.


What do you mean by voiced? I put the stress differently, but otherwise I would pronounce Alexa as Alexander cut short. I can imagine the vowels being pronounced differently in various accents, but I can't imagine an accent where the 'x' in those two names is different.


"Voiced" in the phonetic sense [1], i.e., spoken with the vocal cords vibrating. Voiced 'x' sounds like the /gz/ in "eggs", /ɛgz/; voiceless 'x' sounds like the /ks/ in the American English pronunciation of 'x' itself, /ɛks/.

Many, if not all, American English dialects pronounce the words in the fashion I describe. In them, the name 'Alexander' would be

    /ˌæ.lɛˈ(gz)æn.dər/
while 'Alexa' would be

    /əˈlɛ.(ks)ə/
- in both of which, the phoneme corresponding to the letter 'x' is parenthesized.

Generally in English 'x' is voiced when it precedes a stressed vowel, which it does in 'Alexander'; in 'Alexa', 'x' precedes a reduced vowel, and therefore would always take the unvoiced pronunciation. (It'd sound very odd to an anglophone ear otherwise - say /əˈlɛ.gzə/ one time out loud and see if you don't feel the same.)

That said, it wouldn't be incorrect to pronounce 'Alexander' in American English with an unvoiced 'x', as

    /ˌæ.lɛˈksæn.dər/
but, while I believe some dialects of English may default to this pronunciation, certainly not all do. (Neither of the dialects I speak does so, at the very least.) This pronunciation also produces a "hitch" or break in the word between the unvoiced 'x' and its preceding vowel, which would tend to make it a little odd both to hear and to say.

[1] https://en.wikipedia.org/wiki/Voice_(phonetics)


> I put the stress differently, but otherwise I would pronounce Alexa as Alexander cut short.

In “Alexander”, the normal pronunciation of the letter “x” is the voiced consonant cluster /gz/; in “Alexa”, it's usually the unvoiced cluster /ks/. (Also, the second “a” is usually different between the two, being /æ/, like the “a” in “pad”, in “Alexander” and /ɑ:/, like the “a” in “father”, in “Alexa”.)

The difference in the “x” is the normal way that the pronunciation of “x” differs when following a stressed vs. unstressed vowel in English.


Particularly difficult for Chinese speakers. I've heard Alessa a lot.


Works fine in Spanish, but I can imagine other languages having trouble.

I'm actually pretty impressed by Alexa's voice recognition so far.


Alexa definitely flows a bit better than Hey Siri, but being Norwegian, I find the phrase "ok google" almost comical. My mouth just can't make the sounds quickly enough for it to not sound ridiculous. Hey Siri and Alexa both flow well by comparison.


I 100% agree, as a native English speaker (western Canada). "Hey Google" has a much nicer flow, although I say "Google" too much on a daily basis for me to want that to be a trigger (practically speaking).


"echo" is even better. So weird to me that they call the device "echo", already a pretty strong brand name, but then insist on the "alexa" naming of the assistant.


I'd imagine it's to give the product/voice a more human-like feel to it. "Echo" sounds like I'm talking to a robot, or a dog (or a robot dog). "Alexa" could be a human.


That seemingly simple repeat functionality would require a much less simple NN, from what I understand.

That would trade off both security and privacy for very limited gain.


I occasionally use the "hey Siri" feature while in bed to mess with alarms in the morning (i.e. cancel whatever I had set and set a later one ;). Apparently my groggy morning voice always requires a second "hey Siri".

As an aside, alarms are literally the only thing I've ever used Siri for.


same, I use Siri every day, and only ever for two reasons:

"Hey Siri, set an alarm for {seven,eight,nine} a.m." "Hey Siri, call {mom,dad,[friend]}"

I've created unambiguous nicknames for my friends so I don't get name collisions


Odd, Hey Siri always triggers for me in the car while driving.


With the iPhone 8, Siri is finally as useful as my Google Home speaker in terms of reliability and utility. For example, to find your iPhone you can say "hey Siri, where are you," or talk to Siri in the car to change songs or ask her to play similar songs. She finally understands almost every command I send her.


And yet, if you ask Siri for directions to your next appointment, she's still clueless! It's one of the only things that I would actually use Siri for, and for the last three releases of iOS I've hoped for it. How in the world does this not work?!?!

She knows my next appointment, and she knows how to give directions, but she refuses to give me directions to my next appointment. If I could throttle her for her stubborn insubordination, I would. lol


Maybe you've got a nicer (quieter) car =D.

It usually works for me too, but sometimes it doesn't. Now I know to retry.


Heh, nah, lots of road noise in my car, actually. But it still works far more often than not.


I own multiple iOS devices, a few Echos, and a Google Home. One of the things I noticed after getting an Echo was how much more fluid and simple it was to invoke the "assistant". Simply asking, "Alexa, what's the weather today" just seemed so much more natural than having to prefix everything with "Hey" or "Okay".

Using Siri or Google Assistant for more than one question at a time quickly makes me feel like I'm going insane. "Hey Siri.. Hey Siri.. Hey... Hey..."

I'm hoping Google and Apple fix these subtle annoyances. Or maybe it's just me.


I think the reason the 'catchphrases' for Siri and Google Assistant are longer is that the cost of false positives is higher. Phones are subject to much more varied conditions than a cylinder that lives on your nightstand. Anecdotally, I have seen Alexa activate incorrectly much more often than Siri; I just care a lot less.


Hey Google and Hey Siri are not longer than Alexa... still three syllables.


Isn't part of the point of the prefix to avoid collisions with sounds that are part of everyday speech not intended for the assistant? "Alexa" becomes problematic when Echo is used in an office/home in which someone is named Alexa -- among the top 100 most popular female baby names since 1995 [0] -- which is why the wakeup word can be changed to "Echo" or "Amazon" or "Computer".

But the sound of the wake word, whether it's just "Alexa" or "Hey Siri" vs. "Siri", doesn't seem to deal with the main issue of your complaint, which to me is how limited "conversation" is with the assistant.

If you ask Alexa for the weather, you'll still have to say her name for any followup questions within that immediate context, i.e. "Alexa, what's the weather today? Alexa, what's the weather this weekend?".

Though there are a few functional exceptions in which Alexa will prompt you for additional information without needing to be re-awakened, e.g.

You: "Alexa, set my alarm for 6 'o clock"

Alexa: "Is that 6 'o clock in the morning, or in the evening?"

[0] https://www.ssa.gov/OACT/babynames/index.html


I always felt that Siri was named with the intent that it could be used as a wake word (so you could say "Siri what's the weather today?") because Siri itself is a rare given name and [si ri] are sounds rarely said at the beginning of a sentence.

And then at some point Apple realized they had to make a longer wake word to cut down the number of false positives ("Siri" -> "Hey Siri", from 2 syllables to 3).

Google probably went through the same process ("Google" -> "Okay Google", from 2 syllables to 4).

Amazon probably deliberately chose a 3 syllable name with "Alexa" for the same reason.

I can imagine future improvements where we can have the originally imagined wake words "Siri", "Google" and "Alexa", and at that point I would be most happy with "Siri" because it would be short and not-corporate.


Siri (the company) was a spin-off from SRI – Stanford Research Institute.

The initial product, before the Apple acquisition, was an iPhone app with a chat interface. I don't recall that it supported voice input.

It is still possible that the founders were thinking ahead to voice input and wake words when they named the company.


The technology was initially developed under the DARPA CALO research program (https://en.m.wikipedia.org/wiki/CALO).


It's my understanding that Siri came from SRI Labs, which sold the technology to Apple.


> and [si ri] are sounds rarely said at the beginning of a sentence.

seriously?


Yes, because the first two phonetic syllables of "Seriously" are [sɪ rɪə], which makes it different.


I have my Echo set to use "computer" as the wake word because it's just way more fun that way. But, of course, "computer" is a pretty common word these days. It hasn't yet been annoying enough for me to change it.


You can change the prompt it listens for


Plus, "Ok, Google" is just a terrible sequence of consonants/syllables. Just doesn't work in my mouth.


Glad I'm not the only one -- I just can't say that phrase without garbling it: "Okayggglglele"


"Hey Google" is slowly becoming a secondary option: http://www.androidpolice.com/2017/10/17/hey-google-hotword-n...


This is what I usually use for my Google Home these days, but in my anecdotal experience it seems to fail to recognize it much more often than "OK Google" did. I wish it offered more options in terms of activation words, or better yet, let us set and train Google Home to use our own custom activation words.


I’m still baffled that we can’t name our personal assistants the way we see fit. Why can’t I trigger it with « Hey Jarvis », « Alfred », « Marvin », « Come in Scotty », « Computer », or whatever hot word I fancy, especially when it’s supposed to respond only to my voice? Is that because of some pre-training of the NN?


You can train Assistant on your phone to a custom word; I have to imagine it's on the roadmap for Google Home at some point.


If you're doing more than one interaction, you can just hold the home button to activate Siri for followups. And you don't need to prefix 'hey siri' if you're activating Siri via home button or screen button.

I've seen so many people prefix 'Hey siri' when they don't really need to:

Holds home, Siri activates "Hey Siri, how's the weather?"

It could just be "How's the weather?" in that instance.


I shouldn’t need to click a button to effectively interact with my -voice- assistant :)


But it seems a reasonable interface concession given the relative infrequency of multi-query conversations, plus the huge undesirability of having Siri figure out for itself (i.e. continue listening and sending data to Apple's servers) whether the conversation has actually ended. Also, for many conversations you'll need to look at the phone anyway (i.e. have it in your hand) to see Siri's answers.


When one person makes a concession it's good business. When millions of users make a concession it's bad design.


I wonder how that works on the iPhone X


The sleep/wake button on the right hand side of the device now triggers Siri when you hold it down for long enough.


Then how do you turn the thing off?


Depends on what you need.

You can now reboot from Settings in iOS11 if the device is responsive and unlocked.

If it's locked or you just want to properly shutdown quickly, press and hold a volume button and the lock button to bring up the SOS mode, which includes the shut down slider.

If the device is unresponsive, it's much less obvious now: https://ios.gadgethacks.com/how-to/force-restart-iphone-x-wh... Volume up, then down, then press and hold the lock button.


They added shut down and reboot to the settings app in iOS 11.


Probably hold it down even longer.


> you can just hold the home button to activate Siri for followups

Unhelpful while I am driving...

Requiring physical controls massively reduces a voice assistant's utility and undermines its core purpose.


“Requires”?


If this is the only designed in way to make multiple queries without saying "Hey Siri" repeatedly, then it is required for that operation. You literally have people above calling this the "solution" to the issue raised.

It is like Schrödinger's UI design: it is simultaneously claimed as the solution to the problem being raised and called entirely optional. Sounds to me like there is no actual solution, and this is a band-aid.


It'll do multi-part queries without pressing the button (or saying "Hey Siri") between each part. On the other hand, if you want to check the weather AND set an alarm, you need to activate Siri twice.


“If this is the only way to do it, except for that other way of doing it, then there’s only one way to do it”!


I think it comes down to the fact that, for all the « conversational » publicity stunt, the current implementations insist on the voice assistants being non-modal. When you think about it, every vocal interaction you have with people is explicitly (through voice) or implicitly (through body language) modal. How hard would it be for the mode to end with « thank you Siri », a sensible time-out, or, now with the iPhone X, loss of directed attention (and even be activated again by the return of directed attention within a reasonable timeframe)?


Agreed. The constant "hey Siri" is exhausting compared with Alexa's approach. Though, as others noted, Siri was developed for a phone and devices on the go. Echo is a device that stays in your home, where you don't have to worry about someone saying "Alexa" unless a family member or friend is named that, in which case you can change the word. HomePod will be Apple's first chance to change that, but I'm pretty sure they are going to stick with "Hey Siri" for consistency and reliability.


What if someone in the room is named Alexa?


Then you change Alexa to one of the other options (Amazon, Echo, Computer)


Pretty cool how they reduce the power consumption - when it first came out, "Hey Siri" required your device to be plugged in:

> To avoid running the main processor all day just to listen for the trigger phrase, the iPhone’s Always On Processor (AOP) (a small, low-power auxiliary processor, that is, the embedded Motion Coprocessor) has access to the microphone signal (on 6S and later). We use a small proportion of the AOP’s limited processing power to run a detector with a small version of the acoustic model (DNN). When the score exceeds a threshold the motion coprocessor wakes up the main processor, which analyzes the signal using a larger DNN.

> Apple Watch uses a single-pass “Hey Siri” detector with an acoustic model intermediate in size between those used for the first and second passes on other iOS devices. The “Hey Siri” detector runs only when the watch motion coprocessor detects a wrist raise gesture, which turns the screen on. At that point there is a lot for WatchOS to do—power up, prepare the screen, etc.—so the system allocates “Hey Siri” only a small proportion (~5%) of the rather limited compute budget. It is a challenge to start audio capture in time to catch the start of the trigger phrase, so we make allowances for possible truncation in the way that we initialize the detector.
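
A rough sketch of that two-pass cascade (the function names and threshold values here are hypothetical; the post doesn't publish them):

    def cascade_detect(frames, small_dnn, large_dnn,
                       first_pass_threshold=0.4, second_pass_threshold=0.9):
        """Two-pass detection: a tiny model screens every frame cheaply
        (the AOP's job); only promising frames are re-scored with the
        larger acoustic model (i.e. the main processor is woken up)."""
        triggers = []
        for i, frame in enumerate(frames):
            if small_dnn(frame) >= first_pass_threshold:       # first pass, low power
                if large_dnn(frame) >= second_pass_threshold:  # second pass, main CPU
                    triggers.append(i)
        return triggers

The point of the cascade is that the expensive model only runs on the small fraction of audio the cheap model flags, which is what keeps the always-on power budget tiny.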

Another interesting nugget:

> There are thousands of sound classes used by the main recognizer, but only about twenty are needed to account for the target phrase (including an initial silence), and one large class for everything else. The training process attempts to produce DNN outputs approaching 1 for frames that are labelled with the relevant states and phones, based only on the local sound pattern. The training process adjusts the weights using standard back-propagation and stochastic gradient descent. We have used a variety of neural network training software toolkits, including Theano, Tensorflow, and Kaldi.
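
For anyone curious what that frame-level setup looks like in code, here's a toy TensorFlow/Keras version (layer sizes, feature dimensions, and the data are invented; only the overall shape, roughly twenty target classes plus a catch-all trained with cross-entropy and SGD, follows the description):

    import numpy as np
    import tensorflow as tf

    NUM_CLASSES = 21        # ~20 "Hey Siri" states/phones + 1 class for everything else
    FEATURE_DIM = 40 * 11   # e.g. 40 filterbank channels x 11 stacked frames (assumed)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="sigmoid", input_shape=(FEATURE_DIM,)),
        tf.keras.layers.Dense(128, activation="sigmoid"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    # Cross-entropy pushes the output for the labelled class toward 1;
    # weights are adjusted by back-propagation with stochastic gradient descent.
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy")

    # Placeholder data: per-frame features with per-frame class labels.
    frames = np.random.randn(1000, FEATURE_DIM).astype("float32")
    labels = np.random.randint(0, NUM_CLASSES, size=1000)
    model.fit(frames, labels, batch_size=32, epochs=1)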


Just in case (judging from the comments) non-iPhone owners aren't aware: this isn't a new feature in the iPhone at all; Siri has had voice-activated prompts since late 2015 [0]. What it looks like is that the team finally got the OK to share technical details about how this works with the public, and that's what we're seeing here.

Apple doesn't usually do tech writeups; I imagine a pinhead at the corporate level decided it wasn't worth the risk of leaking any "secret sauce" until now.

0: https://www.cultofmac.com/390181/5-ways-hey-siri-will-change...


That pinhead was Steve Jobs.


Pretty cool of Apple to post such technical blog posts as this. I love it when companies do this.


Indeed. Quietly but surely, Apple is changing its corporate policies around developers. We used to never see Apple developers at conferences with an official affiliation, let alone onstage as speakers. I once ran into an engineer at a Ruby conference who demurred about who he worked for, and when he finally told me he worked for Apple, he had to recite the whole "what I say does not reflect nor represent..." preamble. That was as late as 2014.

Great to see such corporate changes toward developer-friendliness.


Great explanation; this technique has a lot of applications for extracting event triggers from audio streams. Even though I've trained my iPad repeatedly, it is still a bit too eager to answer others who talk to it (either on purpose or by accident).

At some point I expect Apple to design an audio neural network processor to put on their CPU chips, which will allow them to do both phrase recognition and highly accurate speaker-dependent speech-to-text on their devices. It will be yet another way that people who don't build silicon won't be able to compete.


Does anyone know of papers and/or example implementations of similar DNNs for acoustic modeling using TensorFlow or some other framework?


Google's Deep KWS paper [1] is kind of similar, although they don't use an HMM.

[1] https://static.googleusercontent.com/media/research.google.c...


Are there any open source projects that do this? I mean just the triggering part, not the command recognition later.


There are Snowboy and uSpeech on GitHub. I haven't used them yet.


Snowboy isn’t open source, though.


Your connection is not private

Attackers might be trying to steal your information from machinelearning.apple.com (for example, passwords, messages, or credit cards). Learn more NET::ERR_CERT_COMMON_NAME_INVALID

Access Denied

You don't have permission to access ".../machinelearning.apple.com/2017/10/01/hey-siri.html" on this server. Reference #...


2012: OK Google / 2014: Alexa / 2017: Hey Siri


Hey Siri has been an iPhone feature since the 6s, which came out in 2015.


It's a feature on older iPhones as well (such as my regular 6) but requires being on a power source since those lack the specialized, low-power chip.


Not sure what you're trying to say with that timeline. Siri has been available (on iPhones) since 2011.


I think the point is that on-phone DNN wake-word detection has been around for many years. Google shipped it like ... 3? years ago. A lot of Apple's technical ML posts feel a bit lackluster for that reason. Targeted at people who aren't active in the field.


Apple’s Hey Siri wake word isn’t newly available this year either. I think it has been available since the iPhone 6s.


For context, google published a paper on their on-device wakeword system 1.5 years before the 6s came out.

https://static.googleusercontent.com/media/research.google.c...


Exactly! Thank you.



