This is interesting, and a feature I didn't know about. Hey Siri often fails to trigger for me in the car while driving; now I know to retry, and I'll have a better chance at triggering it.
"We compare the score with a threshold to decide whether to activate Siri. In fact the threshold is not a fixed value. We built in some flexibility to make it easier to activate Siri in difficult conditions while not significantly increasing the number of false activations. There is a primary, or normal threshold, and a lower threshold that does not normally trigger Siri. If the score exceeds the lower threshold but not the upper threshold, then it may be that we missed a genuine “Hey Siri” event. When the score is in this range, the system enters a more sensitive state for a few seconds, so that if the user repeats the phrase, even without making more effort, then Siri triggers. This second-chance mechanism improves the usability of the system significantly, without increasing the false alarm rate too much because it is only in this extra-sensitive state for a short time."
"Hey Siri" has always seemed like such an awkward phrase to me. "Alexa" flows so much better, besides the general awkwardness if someone in the room has that name.
Really? Are these non-European languages that you're referring to? Many European languages have their own versions of the name Alexander for both men and women, so I'd imagine Alexa wouldn't cause too many issues.
The 'x' in 'Alexander' is (typically) voiced, where that in 'Alexa' is not. I can see that making a substantial difference for speakers of languages where those phonemes are differently composed.
What do you mean by voiced? I put the stress differently, but otherwise I would pronounce Alexa as Alexander cut short. I can imagine the vowels being pronounced differently in various accents, but I can't imagine an accent where the 'x' in those two names is different.
"Voiced" in the phonetic sense [1], i.e., spoken with the vocal cords vibrating. Voiced 'x' sounds like the /gz/ in "eggs", /ɛgz/; voiceless 'x' sounds like the /ks/ in the American English pronunciation of 'x' itself, /ɛks/.
Many, if not all, American English dialects pronounce the words in the fashion I describe. In them, the name 'Alexander' would be
/ˌæ.lɛˈ(gz)æn.dər/
while 'Alexa' would be
/əˈlɛ.(ks)ə/
- in both of which, the phoneme corresponding to the letter 'x' is parenthesized.
Generally in English 'x' is voiced when it precedes a stressed vowel, which it does in 'Alexander'; in 'Alexa', 'x' precedes a reduced vowel, and therefore would always take the unvoiced pronunciation. (It'd sound very odd to an anglophone ear otherwise - say /əˈlɛ.gzə/ one time out loud and see if you don't feel the same.)
That said, it wouldn't be incorrect to pronounce 'Alexander' in American English with an unvoiced 'x', as
/ˌæ.lɛˈksæn.dər/
but, while I believe some dialects of English may default to this pronunciation, certainly not all do. (Neither of the dialects I speak does so, at the very least.) This pronunciation also produces a "hitch" or break in the word between the unvoiced 'x' and its preceding vowel, which would tend to make it a little odd both to hear and to say.
> I put the stress differently, but otherwise I would pronounce Alexa as Alexander cut short.
In “Alexander”, the normal pronunciation of the letter “x” is the voiced consonant cluster /gz/, in “Alexa”, it's usually the unvoiced cluster /ks/. (Also, the second “a” is usually different between the two, being /æ/, like the “a” in “pad”, in “Alexander” and /ɑ:/, like the “a” in “father”, in “Alexa”.)
The difference in the “x” is the normal way that the pronunciation of “x” differs when following a stressed vs. unstressed vowel in English.
Alexa definitely flows a bit better than Hey Siri, but being Norwegian, I find the phrase "ok google" to be almost comical. My mouth just can't make the sounds quick enough for it to not sound ridiculous. Hey Siri and Alexa both flow comparably very well.
I 100% agree, as a native English speaker (western Canada). "Hey Google" has a much nicer flow, although I say "Google" too much on a daily basis for me to want that to be a trigger (practically speaking).
"echo" is even better. So weird to me that they call the device "echo", already a pretty strong brand name, but then insist on the "alexa" naming of the assistant.
I'd imagine it's to give the product/voice a more human-like feel to it. "Echo" sounds like I'm talking to a robot, or a dog (or a robot dog). "Alexa" could be a human.
I occasionally use the "Hey Siri" feature while in bed to mess with alarms in the morning (i.e. cancel whatever I had set and set a later one ;). Apparently my groggy morning voice always requires a second "Hey Siri".
As an aside, alarms are literally the only thing I've ever used Siri for.
With the iPhone 8, Siri is finally as useful as my Google Home speaker in terms of reliability and utility. For example, to find your iPhone you can say "Hey Siri, where are you," or talk to Siri in the car to change songs or ask her to play similar songs. She finally understands almost every command I send her.
And yet, if you ask Siri for directions to your next appointment, she's still clueless! It's one of the only things that I would actually use Siri for, and for the last three releases of iOS I've hoped for it. How in the world does this not work?!?!
She knows my next appointment, and she knows how to give directions, but she refuses to give me directions to my next appointment. If I could throttle her for her stubborn insubordination, I would. lol
I own multiple iOS devices, a few Echos, and a Google Home. One of the things I noticed after getting an Echo was how much more fluid and simple it was invoking the "assistant". Simply asking, "Alexa, what's the weather today" just seemed so much more natural than having to prefix everything with "Hey" or "Okay".
Using Siri or Google Assistant for more than one question at a time quickly makes me feel like I'm going insane. "Hey Siri.. Hey Siri.. Hey... Hey..."
I'm hoping Google and Apple fix these subtle annoyances. Or maybe it's just me.
I think the reason why the 'catchphrases' for Siri and Google Assistant are longer is that the cost of false positives is higher. Phones are subject to much more varied conditions than a cylinder that lives on your nightstand. Anecdotally, I have seen Alexa activate incorrectly much more than Siri; I just care a lot less.
Isn't part of the point of the prefix to avoid collisions with sounds that are part of everyday speech not intended for the assistant? "Alexa" becomes problematic when Echo is used in an office/home in which someone is named Alexa --
among the top 100 most popular female baby names since 1995 [0] -- which is why the wake word can be changed to "Echo" or "Amazon" or "Computer".
But the sound of the wake word, whether it's "Alexa" or "Hey Siri" vs. "Siri", doesn't seem to deal with the main issue of your complaint, which to me is how limited "conversation" is with the assistant.
If you ask Alexa for the weather, you'll still have to say her name for any followup questions within that immediate context, i.e. "Alexa, what's the weather today? Alexa, what's the weather this weekend?".
Though there are a few functional exceptions in which Alexa will prompt you for additional information without needing to be re-awakened, e.g.
You: "Alexa, set my alarm for 6 o'clock"
Alexa: "Is that 6 o'clock in the morning, or in the evening?"
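To illustrate that follow-up behavior, here is a toy sketch of an intent handler that re-prompts for a missing detail while keeping the microphone open, rather than requiring the wake word again; the class and field names are hypothetical, not Amazon's actual API:

    class SetAlarmIntent:
        """Toy slot-filling handler: if a required piece of information is
        missing, ask for it and keep listening for the answer."""

        def handle(self, hour, meridiem=None):
            if meridiem is None:
                # Re-prompt; the device stays in listening mode for the reply,
                # so the user does not have to say the wake word again.
                return {"reprompt": "Is that %d o'clock in the morning, or in the evening?" % hour}
            return {"say": "Alarm set for %d %s." % (hour, meridiem)}

    handler = SetAlarmIntent()
    print(handler.handle(6))                 # -> re-prompt for AM/PM
    print(handler.handle(6, meridiem="AM"))  # -> confirmation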
I always felt that Siri was named with the intent that it could be used as a wake word (so you could say "Siri what's the weather today?") because Siri itself is a rare given name and [si ri] are sounds rarely said at the beginning of a sentence.
And then at some point Apple realized they had to make a longer wake word to cut down the number of false positives ("Siri" -> "Hey Siri", from 2 syllables to 3).
Google probably went through the same process ("Google" -> "Okay Google", from 2 syllables to 4).
Amazon probably deliberately chose a 3 syllable name with "Alexa" for the same reason.
I can imagine future improvements where we can have the originally imagined wake words "Siri", "Google" and "Alexa", and at that point I would be most happy with "Siri" because it would be short and not-corporate.
I have my Echo set to use "computer" as the wake word because it's just way more fun that way. But, of course, "computer" is a pretty common word these days. It hasn't yet been annoying enough for me to change it.
This is what I usually use for my Google Home these days, but in my anecdotal experience it seems to fail to recognize it much more often than when I used "OK Google". I wish it offered more options in terms of activation words, or better yet, allowed us to set and train Google Home to use our own custom activation words.
I’m still baffled that we can’t name our personal assistants the way we see fit. Why can’t I trigger it with « Hey Jarvis », « Alfred », « Marvin », « Come in Scotty », « Computer », or whatever hot word I fancy, especially when it’s supposed to respond only to my voice? Is that because of some pre-training of the NN?
If you're doing more than one interaction, you can just hold the home button to activate Siri for followups. And you don't need to prefix 'hey siri' if you're activating Siri via home button or screen button.
I've seen so many people prefix 'Hey Siri' when they don't really need to:
(Holds home button, Siri activates) "Hey Siri, how's the weather?"
It could just be "How's the weather?" in that instance.
But it seems a reasonable interface concession given the relative infrequency of multi-query conversations, plus the huge undesirability of having Siri figure out for itself (i.e. continue listening and sending data to Apple's servers) whether the conversation has actually ended. Also, for many conversations, you'll be needing to look at the phone anyway (i.e. have it in your hand) to see Siri's answers.
You can now reboot from Settings in iOS 11 if the device is responsive and unlocked.
If it's locked or you just want to properly shutdown quickly, press and hold a volume button and the lock button to bring up the SOS mode, which includes the shut down slider.
If this is the only designed-in way to make multiple queries without saying "Hey Siri" repeatedly, then it is required for that operation. You literally have people above calling this the "solution" to the issue raised.
It is like Schrödinger's UI design. It is both being claimed as the solution to the problem being raised while at the same time being called entirely optional. Sounds to me like there is no actual solution, and this is a bandaid.
It'll do multi-part queries without pressing the button (or saying "Hey Siri") between each part. On the other hand, if you want to check the weather AND set an alarm, you need to activate Siri twice.
I think it comes down to the fact that, for all the « conversational » publicity, the current implementations insist on the voice assistants being non-modal. When you think about it, every vocal interaction you have with people is explicitly (through voice) or implicitly (through body language) modal. How hard would it be for the mode to end with « thank you Siri », a sensible time-out, or, now with the iPhone X, loss of directed attention (and even be activated again by a return of directed attention within a reasonable timeframe)?
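For what it's worth, such a modal session is easy to sketch; the end phrases, timeout, and method names below are all hypothetical:

    import time

    class ModalSession:
        """Hypothetical modal conversation: once woken, keep listening until
        an explicit closing phrase or a timeout, instead of requiring the
        wake word before every query."""

        END_PHRASES = {"thank you siri", "that's all"}
        TIMEOUT_SECONDS = 8.0

        def __init__(self):
            self.active = False
            self.last_interaction = 0.0

        def wake(self):
            self.active = True
            self.last_interaction = time.monotonic()

        def handle_utterance(self, text):
            now = time.monotonic()
            if self.active and now - self.last_interaction > self.TIMEOUT_SECONDS:
                self.active = False              # session quietly times out
            if not self.active:
                return None                      # not addressed to the assistant
            if text.strip().lower() in self.END_PHRASES:
                self.active = False              # explicit hand-back
                return "Goodbye."
            self.last_interaction = now
            return "Answering: " + text          # stay modal for follow-ups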
Agreed. The constant "Hey Siri" is exhausting compared to Alexa's approach. Though, as others noted, Siri was developed for a phone and devices on the go. Echo is a device that lives in your home, where you don't have to worry about someone saying Alexa unless a family member or friend is named that, in which case you can change the word. HomePod will be Apple's first chance to change that, but I'm pretty sure they'll stick with "Hey Siri" for consistency and reliability.
Pretty cool how they reduce the power consumption - when it first came out, "Hey Siri" required your device to be plugged in:
> To avoid running the main processor all day just to listen for the trigger phrase, the iPhone’s Always On Processor (AOP) (a small, low-power auxiliary processor, that is, the embedded Motion Coprocessor) has access to the microphone signal (on 6S and later). We use a small proportion of the AOP’s limited processing power to run a detector with a small version of the acoustic model (DNN). When the score exceeds a threshold the motion coprocessor wakes up the main processor, which analyzes the signal using a larger DNN.
> Apple Watch uses a single-pass “Hey Siri” detector with an acoustic model intermediate in size between those used for the first and second passes on other iOS devices. The “Hey Siri” detector runs only when the watch motion coprocessor detects a wrist raise gesture, which turns the screen on. At that point there is a lot for WatchOS to do—power up, prepare the screen, etc.—so the system allocates “Hey Siri” only a small proportion (~5%) of the rather limited compute budget. It is a challenge to start audio capture in time to catch the start of the trigger phrase, so we make allowances for possible truncation in the way that we initialize the detector.
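A rough sketch of the two-pass cascade described in the first paragraph above; the class, the score() interface, and the thresholds are placeholders, not Apple's actual implementation:

    class TriggerCascade:
        """A cheap always-on detector gates a larger, more accurate one."""

        def __init__(self, small_dnn, large_dnn,
                     small_threshold=0.6, large_threshold=0.9):
            self.small_dnn = small_dnn      # small acoustic model on the AOP
            self.large_dnn = large_dnn      # larger model on the main processor
            self.small_threshold = small_threshold
            self.large_threshold = large_threshold

        def process(self, audio_buffer):
            # First pass: the low-power detector scores every buffer.
            if self.small_dnn.score(audio_buffer) < self.small_threshold:
                return False                # main processor stays asleep
            # Second pass: wake the main processor and re-check with the
            # larger acoustic model before actually launching Siri.
            return self.large_dnn.score(audio_buffer) >= self.large_threshold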
Another interesting nugget:
> There are thousands of sound classes used by the main recognizer, but only about twenty are needed to account for the target phrase (including an initial silence), and one large class for everything else. The training process attempts to produce DNN outputs approaching 1 for frames that are labelled with the relevant states and phones, based only on the local sound pattern. The training process adjusts the weights using standard back-propagation and stochastic gradient descent. We have used a variety of neural network training software toolkits, including Theano, Tensorflow, and Kaldi.
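As a toy illustration of that kind of frame classifier and its SGD update (the layer sizes, feature dimension, and data are made up, and a real system would use the toolkits mentioned rather than hand-rolled numpy):

    import numpy as np

    N_FEATURES = 40    # e.g. one frame of filter-bank features
    N_HIDDEN = 128
    N_CLASSES = 21     # ~20 target states plus one catch-all class

    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.1, (N_FEATURES, N_HIDDEN))
    b1 = np.zeros(N_HIDDEN)
    W2 = rng.normal(0, 0.1, (N_HIDDEN, N_CLASSES))
    b2 = np.zeros(N_CLASSES)

    def forward(x):
        h = np.maximum(0.0, x @ W1 + b1)        # ReLU hidden layer
        logits = h @ W2 + b2
        p = np.exp(logits - logits.max())       # softmax over sound classes
        return h, p / p.sum()

    def sgd_step(x, label, lr=0.01):
        """One back-propagation / SGD update on a single labelled frame."""
        global W1, b1, W2, b2
        h, p = forward(x)
        dlogits = p.copy()
        dlogits[label] -= 1.0                   # cross-entropy gradient
        dW2, db2 = np.outer(h, dlogits), dlogits
        dh = (W2 @ dlogits) * (h > 0)           # back-prop through ReLU
        dW1, db1 = np.outer(x, dh), dh
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

    # Example: one update on a random frame labelled as the catch-all class.
    sgd_step(rng.normal(size=N_FEATURES), label=N_CLASSES - 1)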
Just because (judging from the comments) non-iPhone owners may not be aware: this isn't a new feature in the iPhone at all; Siri has had voice-activated prompts since late 2015 [0]. What it looks like is that the team finally got the OK to share technical details about how this works with the public, and that's what we're seeing here.
Apple doesn't usually do tech writeups; I imagine a pinhead at the corporate level decided it wasn't worth the risk of leaking any "secret sauce" until now.
Indeed. Quietly but surely, Apple is changing its corporate policies around developers. We used to never see Apple developers at conferences with an official affiliation, let alone having them onstage as speakers. I once ran into an engineer at a Ruby conference who demurred about who he worked for, and when he finally told me he worked for Apple, he had to recite the whole "what I say does not reflect nor represent..." preamble. That was as late as 2014.
Great to see such corporate changes toward developer-friendliness.
Great explanation; this technique has a lot of applications for extracting event triggers out of audio streams. Even though I've trained my iPad repeatedly, it is still a bit too eager to answer others who talk to it (either on purpose or by accident).
At some point I expect Apple to design an audio neural network processor to put on their CPU chips, which will allow them to do both phrase recognition and highly accurate, speaker-dependent speech-to-text on their devices. It will be yet another way that people who don't build silicon won't be able to compete.
Attackers might be trying to steal your information from machinelearning.apple.com (for example, passwords, messages, or credit cards). Learn more
NET::ERR_CERT_COMMON_NAME_INVALID
Access Denied
You don't have permission to access ".../machinelearning.apple.com/2017/10/01/hey-siri.html" on this server.
Reference #...
I think the point is that on-phone DNN wake-word detection has been around for many years. Google shipped it like ... 3? years ago. A lot of Apple's technical ML posts feel a bit lackluster for that reason. Targeted at people who aren't active in the field.
"We compare the score with a threshold to decide whether to activate Siri. In fact the threshold is not a fixed value. We built in some flexibility to make it easier to activate Siri in difficult conditions while not significantly increasing the number of false activations. There is a primary, or normal threshold, and a lower threshold that does not normally trigger Siri. If the score exceeds the lower threshold but not the upper threshold, then it may be that we missed a genuine “Hey Siri” event. When the score is in this range, the system enters a more sensitive state for a few seconds, so that if the user repeats the phrase, even without making more effort, then Siri triggers. This second-chance mechanism improves the usability of the system significantly, without increasing the false alarm rate too much because it is only in this extra-sensitive state for a short time."