According to an episode of Acquired[0], none of Siri's original technology is present in Apple's current products. Siri originally licensed voice recognition technology from Nuance (itself another SRI spin-out) and built some comparatively simple tech on top of it, never changing out the speech-to-text engine. All current players (Google, Amazon, etc.) use neural nets, so Apple scooped up many Nuance employees to double down and compete[1]. Rosenthal and Gilbert agreed that Siri was a failure: "the only thing I use Siri for is setting alarms." I'd have to agree from personal experience. They chalked the failure up to the disadvantages of a UI where the full extent of the functionality is hidden and difficult to understand, leading to bad experiences. I imagine this technology will have many of the same struggles.
> They chalked the failure up to the disadvantages of a UI where the full extent of the functionality is hidden and difficult to understand, leading to bad experiences. I imagine this technology will have many of the same struggles.
This is super hard. Not only do you have hidden functionality (what a great insight!) but they are trying to do something better than the current state of the art (which is human-to-human). What do I mean? Just listen to yourself when you order something over the phone. You don't call the restaurant and say "give me a reservation for four at 8." You typically use an interactive process of query and response ("can you fit four in at 8pm?" "Yes, today." "No, not five, four."). Then they repeat it back to you just in case.
We get annoyed at how crappy the voice recognition systems are, and they are crappy, but human voice recognition reminds me of a TCP session negotiation, or maybe the baud rate negotiation of a Bell-compatible Hayes-style modem. It's hardly a one-shot process, yet we want our machines to be.
Which is probably why solutions like OpenTable exist and succeed. For me at least, avoiding this kind of inexact interaction is desirable. Simple forms on web pages can provide a smoother UX than talking to humans for this type of interaction. If these can provide that one-shot process, then that inevitably becomes the expectation for any replacement UI.
The solution to that problem will be AI to AI instead of AI to human, using standardized interfacing (likely one of a few big common standards, with each AI being capable of utilizing any of them) so the AIs can easily understand each other.
In the next ten years there will be a large battle - which will play out on HN daily - over the AI-to-AI standards; it'll remind older developers of the XML vs. JSON days.
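To make that concrete, here's a toy sketch of what an AI-to-AI intent message could look like. Everything in it (the schema name, field names, intent string) is invented for illustration; it's not any existing standard, just the shape of the thing that would get standardized.

```python
# Hypothetical sketch of an AI-to-AI "intent interchange" message. The schema
# and field names are invented for illustration, not taken from any standard.
import json

reservation_request = {
    "schema": "example.org/intent/v1",   # hypothetical schema identifier
    "intent": "restaurant.reservation.create",
    "parameters": {
        "party_size": 4,
        "time": "2016-05-14T20:00:00-07:00",
        "venue_id": "rest-8675309",
    },
    "requires_confirmation": True,        # let the receiving agent ask a human if needed
}

# Either assistant can serialize and parse this without any NLP involved.
wire_format = json.dumps(reservation_request)
print(json.loads(wire_format)["intent"])  # restaurant.reservation.create
```

The hard part won't be the serialization, it'll be agreeing on the vocabulary of intents and parameters, which is where the standards fight will actually happen.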
The beginning of the chain is a human -- the user! And at the end of the day this is all about the user!
Once you know what the user wants it's "easy" (well, somewhat easy anyway :-). It's the iterative process of understanding the user in an unconstrained opaque ad hoc UI that is big-H Hard.
The problem with these spoken interfaces is not just that we don't know what we don't know; often we don't even know what we want to accomplish!
I've never seen anything suggesting Siri is any smarter than just looking for keywords. One thing I've tried using it for is a timer while cooking, which is already a bit rubbish because iOS only supports one. But here's a sample of how that goes:
Open Siri (hands-free "Hey Siri" works for me 5% of the time at best)
"timer 20 minutes"
"Okay, 20 minutes and counting"
A little later, oops that'll need 5 minutes longer than I thought
"add 5 minutes to the timer"
"are you sure you want to change the timer?"
"yes, change it"
"Okay, 5 minutes and counting" Great...
Other than simple keyword commands like that, its voice recognition doesn't seem to be good enough for things like transcribing messages anyway.
It's sometimes quicker than Spotlight for a web search, because of that weird way Spotlight hides the web search button for a few seconds when it can't find anything to show.
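The "add 5 minutes" exchange above is exactly what you'd expect from plain keyword matching. Purely speculative (I have no idea what Siri actually does internally), but something as simple as this would reproduce that behaviour:

```python
# Speculative sketch of keyword-style command handling, to show why
# "add 5 minutes to the timer" could be treated the same as "set a timer for
# 5 minutes": the duration keyword is matched, the verb "add" is ignored.
import re

def handle_command(text, timer_state):
    match = re.search(r"(\d+)\s*minute", text)
    if "timer" in text and match:
        minutes = int(match.group(1))
        # No concept of "add" vs "set": any duration just resets the timer.
        timer_state["remaining"] = minutes * 60
        return f"Okay, {minutes} minutes and counting"
    return "Sorry, I didn't get that"

timer = {"remaining": 0}
print(handle_command("timer 20 minutes", timer))           # Okay, 20 minutes and counting
print(handle_command("add 5 minutes to the timer", timer))  # Okay, 5 minutes and counting
```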
I think the biggest problem with voice commands is that if the command doesn't work the first time, it would have been faster to simply input the command yourself. I'm not confident that "get me directions to <place>" will actually get me those directions. Because of that, I also use Siri exclusively to set alarms (which does work every time).
This is the #1 problem with anything based on voice recognition and natural language processing, IMO. It has to work 99% of the time to overcome this issue, but it seems like the nature of the technologies ensures it will only asymptotically approach this threshold, and it is currently stuck at 75% reliability at best.
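A rough back-of-the-envelope shows why the bar is so high. The timings below are made up, but the structure is the point: a failed voice command costs the attempt plus noticing and undoing the mistake, and then you do the task by hand anyway, so the break-even success rate climbs fast when failures are expensive.

```python
# Back-of-the-envelope (made-up timings) for how reliable a voice command has
# to be before it beats just doing the task by hand.
t_voice = 4    # seconds to speak a command
t_manual = 8   # seconds to do it by hand
t_recover = 30 # seconds to notice the failure and undo it (e.g. a wrong call)

# Voice wins on expected time when:
#   p*t_voice + (1-p)*(t_voice + t_recover + t_manual) < t_manual
# Solving for p gives the break-even success rate:
p_breakeven = 1 - (t_manual - t_voice) / (t_recover + t_manual)
print(f"break-even success rate: {p_breakeven:.0%}")  # ~89% with these numbers
```

Push the recovery cost higher (say, an accidental call to someone you haven't spoken to in years) and the required reliability heads into the high nineties, which is exactly the "has to work 99% of the time" intuition.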
Plus, even if it did work, it's just a really inexact and inefficient way to do anything. You always end up trying to see through the abstraction layer (natural language in this case) to the much more well-defined hidden system underneath. It's like reverse engineering an expert system, as a UI paradigm.
Programming languages made to resemble natural language face a similar poison pill, AppleScript being a prime example.
I totally get the appeal of trying to make software understand people better instead of forcing people to understand software, but in practice it just ends up being one more abstraction layer users have to struggle to get past in order to unlock the functionality of the underlying system.
The funny thing is that voice recognition is no longer the problem. For years the reason you couldn't do voice command interfaces was that transcription didn't work, but today transcription works amazingly well. My commands are transcribed perfectly, even in my car when I'm doing 65 on the freeway with my phone stuck in a cupholder.
The problem now is Android randomly kills the "OK Google" background listening process, or it fails because it can't handle a handover between wifi and cell data, or it "can't open microphone" even though it just heard me say OK Google, or any one of many other Android problems rears its ugly head.
The reliability of voice transcription now is far better than the general reliability of Android on my Nexus 5X, and the Android team ought to feel pretty bad about that.
Exactly this - there is a bug where, if your phone is locked and in a position where it thinks it's in landscape, saying 'OK Google' makes the phone unlock in portrait, start listening, rotate to landscape (killing the app listening to your voice in the process), then rotate back to portrait and lock itself again.
It's also incredibly easy to fall off the blessed path where you can interact solely with voice - when setting a reminder, for example, if it doesn't hear 'Yes' when it's expecting to, it'll just sit there and you need to touch the screen to continue - defeating the whole point of voice interaction.
Google's voice search has incredible voice recognition and text to speech and can tap into an amazing amount of information through the knowledge graph, but they don't seem to be capable of fixing the basic bugs and UX issues preventing all that technology from actually being usable.
> It's also incredibly easy to fall off the blessed path where you can interact solely with voice...
I had this happen with Google Maps the other day. I was driving into San Francisco with the turn-by-turn directions when it said something like, "There is a faster route available in two miles. It will save 5 minutes. Tap 'Accept' to take this route."
So I had to fiddle with my phone in a hurry, in traffic (after all, traffic was why there was a faster route coming up), hoping I wouldn't tap the wrong button or crash into anyone while looking for it.
Why couldn't I just have said "Yes! Please and thank you. Of course I want the faster route, why wouldn't I? Oh, sorry, I mean 'Accept'!"
The funny part was that the "faster route" was the way I usually go into that part of town anyway, but Maps had been sending me a different way because of congestion on the usually-faster route. Why did it even ask me to "accept" the faster route instead of just redirecting me the way it does automatically on most occasions?
Like the time I was heading south on 101 through Morgan Hill and Gilroy and Maps had me get off the freeway and take a side street for a couple of miles because traffic was stopped on the freeway due to a car fire. It didn't ask me to tap Accept then, it just gave me some very practical directions on the fly. That's how it should work.
I was recently trying to find why this happens. For me, sometimes the phone hears "Ok Google" but then just sits there as if there is no microphone, while being in portrait mode all the time, so, no screen rotation. Other times, as you said, being screen-locked makes it deaf. I have to manually unlock it to restart OK Google.
We're not suffering from AI that's too stupid; we're suffering from stupid app design and laziness when it comes to adding more variations to conversation. Just put a team of script writers on imagining thousands of replies. It is not that hard; people have been writing this kind of rudimentary chat bot script for decades now.
I'd like to be able to install an App and suddenly, new commands are available to me.
I have to disagree. Mostly you're right, but I have a very specific fail case that happens every time with Siri. I can't call my wife by name. I always have to call her "wife", because it thinks I want to change my name. Her name is in the address book, but it never matches against it.
I can also use it to dictate an email or search the web, and occasionally it fails (maybe 10% of the time?), but it almost never gets my wife's name right.
Well, I wouldn't be surprised if Google's voice recognition is better than Apple's (which is powered by Nuance, right?). Also, recognition for search queries and voice actions tends to work better than completely unconstrained text transcription. Of course "perfect" is an exaggeration, but my point is that Google's voice query transcription has passed a threshold and it is no longer the least reliable part of the system.
On a somewhat related note, I've noticed people searches on some websites fail for Vietnamese names because surnames get interpreted as prepositions and dropped as stop words.
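A toy illustration (not any particular site's code) of how that happens: standard English stop-word lists include words like "to", "do" and "an", which are also Vietnamese surnames and given names, so a naive tokenizer silently throws them away.

```python
# Naive stop-word filtering breaking people search: "to" and "an" are common
# English stop words but also a Vietnamese surname and given name.
STOP_WORDS = {"a", "an", "and", "the", "to", "do", "of", "in", "on"}

def tokenize(query):
    return [t for t in query.lower().split() if t not in STOP_WORDS]

print(tokenize("Nguyen Van An"))  # ['nguyen', 'van'] -- the given name "An" is dropped
print(tokenize("Linh To"))        # ['linh'] -- the surname "To" disappears entirely
```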
More and more often I come back to my Moto G to find the OK Google widget open because something on TV triggered it. Meanwhile, it rarely understands me when I actually want it to. Fun.
> It's like reverse engineering an expert system, as a UI paradigm.
The rate of adoption will skyrocket when we reach a certain threshold, people will see other people speaking to their phones and they will learn to form useful voice commands as well. It has a big social factor.
I also use "Call my wife" and "Text my wife" or "Tell my wife that ...". They work every time. Music playing is disappointing - if I want to open YouTube to play something, it can't.
I don't understand why they didn't at least add a few commands for YouTube. At least "play Band", "play Song by Band", "play Music-Genre", "play something like X", "I feel sad/happy/lonely/bored play some music".
If you want to understand the extent to which Siri is useful, you just have to look at the integration it has with the main apps. It's just Alarms, Calendar, Phone, SMS, Music, Weather and similar. It's not really groundbreaking AI, just a voiced command list to the main apps.
I am eagerly waiting for the moment when we could have more in-depth chats with our bots, but the current crop is dragging its feet when it comes even to simple commands sent to apps. Maybe they should ask the app creators to add hooks for voice commands, to make Siri more useful.
For example, if I can digress a little, Google Now can't start playing the first video in the YouTube app on Android. If I am already in the YT app and say "OK Google, play the first track", it goes to web search instead and searches for "the first track" on the web! The same company made the chatbot, the app, and the OS, and they don't work well enough together. At least let users add new functions to the bot if they can't be bothered to do it themselves.
>I don't understand why they didn't at least add a few commands for YouTube. At least "play Band", "play Song by Band", "play Music-Genre", "play something like X", "I feel sad/happy/lonely/bored play some music".
Because YouTube ain't a music player? [1] And besides, they have their own music app.
[1] Seriously, even though some people use it as such, this is far from most. According to YouTube itself, the average user only spends 1 hour per month listening to music on YouTube -- consumption is mostly non-music videos.
This is why I'm very skeptical of this current craze for chat bots. It seems to me that they're really only useful if they can infer a wide range of intent from the user's input, and that seems like it's still very much an unsolved problem. Otherwise they're just glorified menu systems.
This is easy to do badly and very hard to do well.
It's not just about inferring intent. Most discourse systems have little or no model of the consequences of their actions. Systems that can act on the user's behalf need to know if they're potentially making a big mistake or a little one, as guidance of when to ask for clarification. This requires some degree of "common sense". AI systems still suck at that.
Systems that actually do something need to be much better at this than ones that just answer questions. There needs to be some cost/benefit analysis of whether it's necessary to repeat back info and ask for confirmation. Without that, the system will either ask for confirmation too much, or screw up too much. The system needs to have a model of how clear the user is in their intent.
Pizza ordering is relatively low-risk. Travel reservations are a higher-stakes item. Medical is a long way off.
As I wrote last time, I recommend watching "The Devil Wears Prada", the scenes of Andy's first day as Miranda's personal assistant. That's the level of performance you want.
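To sketch what that cost/benefit gate might look like in practice (the risk scores and thresholds here are invented; this is just the shape of the policy, not anyone's actual system):

```python
# Sketch of a risk-weighted confirmation policy: whether to ask depends on
# both parser confidence and how costly a mistake would be. Numbers are made up.
RISK = {                      # rough cost of getting it wrong, arbitrary scale
    "set_timer": 1,
    "order_pizza": 3,
    "book_flight": 8,
    "refill_prescription": 10,
}

def needs_confirmation(action, parse_confidence):
    # Higher-stakes actions demand more confidence before acting silently.
    required = 0.5 + 0.05 * RISK[action]   # 0.55 for set_timer, 1.0 for refill_prescription
    return parse_confidence < required

print(needs_confirmation("set_timer", 0.80))            # False -- cheap mistake, just do it
print(needs_confirmation("book_flight", 0.80))          # True  -- 0.80 < 0.90, ask first
print(needs_confirmation("refill_prescription", 0.99))  # True  -- always confirm
```

The hard part, of course, is that real systems don't get a clean risk number for free; estimating it is exactly the "common sense" piece that's missing.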
Good point. I've been thinking so much about the difficulty of inferring intent that I hadn't really considered the consequences of getting it wrong. The more responsibility we give these chat bots, the greater the risks become.
This is why I think that when people eventually do bring this technology up to a useful level, it's only going to be the kinds of companies that can make huge time and money investments and have access to huge datasets. Not really startup territory.
Indeed. I'd like to see this the other way round: a nice web page representing a business's phone tree and scripts, some forms I can fill the answers into, and then I can have a smart text-to-speech robot wait in the queue and answer the questions on my behalf.
Your idea is mostly right. Instead of a robot waiting in the queue, though, you should receive the call with the information you already entered presented to the agent. Some companies do this.
It's not just the hidden functionality; for me Siri will frequently assume I want to call somebody when I ask it something. Explaining why I accidentally called someone I hadn't talked to in years (and, by the way, how are you doing?) makes playing with Siri a high-risk endeavour.
I wanted to prevent this and only allow calling my favorite contacts (under 10) instead of my whole phone book which is over 1000 contacts. There is no way to set this option in Google Now or Siri.
I've been saved most of the time by having lots of similar-sounding names in my address book. The other night I was lying in bed setting a reminder and it almost called one of my coworkers (or someone I trained with a few years ago).
So the obvious solution is either to have duplicate contact entries or only associate with people who sound like those you already know, so that Siri has to go to the "do you mean Person A or Person B?" state :)
It sounds like they still have to partner with the different services to create custom integrations with their APIs so they can interact only with these preprogrammed services.
But the really impressive AI thing would be if it could figure out how to place an order with a service on its own: mimicking how humans do it by using forms on web pages, writing e-mails, or making phone calls for reservations. That would make the interaction look like just a regular order from the service provider's perspective, so there would be no need to create a partnership and API integration with each service provider.
Having bots fill out forms and write emails is optimistic, but Google already publishes how to structure your (HTML) data to make it readable by their bots so that they can extract useful information for their knowledge graph.
Ideally we would have a standardized declarative service description language that “digital assistants” can use to not just answer questions (already possible) but also purchase services and goods.
Of course, we can't just trust any "service" discovered on the internet; someone would need to actually vouch for it before a digital assistant could use it.
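As a sketch of what such a declarative description might look like (every field and name here is hypothetical, there is no such standard today): something a registry could vouch for and an assistant could match against a parsed intent.

```python
# Hypothetical declarative service description for an assistant to consume.
# All names, fields and URLs are invented for illustration.
pizza_service = {
    "service": "ExamplePizza",
    "endpoint": "https://api.example-pizza.test/v1/orders",
    "verified_by": "assistant-vendor-registry",   # the "vouching" step mentioned above
    "actions": {
        "order_pizza": {
            "required": {"size": ["small", "medium", "large"], "address": "postal_address"},
            "optional": {"toppings": "list[string]"},
            "payment": "tokenized_card",
        }
    },
}

def can_fulfill(description, intent, slots):
    """Check whether a described service can handle a parsed user intent."""
    action = description["actions"].get(intent)
    return action is not None and set(action["required"]) <= set(slots)

print(can_fulfill(pizza_service, "order_pizza", {"size": "large", "address": "123 Example St"}))  # True
```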
The main problem is that Siri wasn't opened up to third party apps to hook into, and thus remained rather limited and stupid.
Each app could have bundled a voice interface, and Apple's phone would have been the first futuristic and extensible "star trek computer". Oh well, opportunity missed. And by a company whose founder had the Mac speak onstage. I think Jobs would have gone in this direction and demoed the crap out of it!
As usual, there is a systems level challenge to solve for building a foundation for app developers. Namely, how to make a fair and EFFECTIVE way for app developers to all share the same namespace / tree of commands?
If I were in charge of the Siri team, I would have made the following changes (a rough sketch of the registration idea follows the list):
1) Fork OpenEars or another open source package and spearhead it as a first pass on the phone, to eliminate the need for an internet connection.
2) Have apps register prefixes for commands
3) Have apps register for "voice intents" and verbs that connect like they have for inter-app audio and app extensions
4) When an app is open, have a way to speak to the app through the iOS library. This could be used to issue commands, dictate an email, etc.
5) Feature apps that make ingenious use of voice commands and have them pitch PR stories about how the iPhone is becoming like Star Trek and is far ahead of Android.
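To illustrate what I mean by points 2-4, here's a rough sketch in Python (obviously a real iOS API would be Swift or Objective-C, and every name here is made up). The point is just the shape: apps register a prefix and verb handlers, and the system routes recognized speech to whichever app claimed it.

```python
# Hypothetical voice-intent registry: apps claim a prefix and register verb
# handlers; the system dispatches recognized utterances to them.
REGISTRY = {}  # prefix -> {verb: handler}

def register_voice_intent(prefix, verb, handler):
    REGISTRY.setdefault(prefix.lower(), {})[verb.lower()] = handler

def dispatch(utterance):
    words = utterance.split()
    prefix, verb, args = words[0].lower(), words[1].lower(), words[2:]
    handler = REGISTRY.get(prefix, {}).get(verb)
    return handler(args) if handler else "No app handles that command"

# A hypothetical email app registers under the prefix "mail":
register_voice_intent("mail", "compose", lambda args: "Drafting email to " + " ".join(args))
register_voice_intent("mail", "search", lambda args: "Searching mail for " + " ".join(args))

print(dispatch("mail compose Alice"))       # Drafting email to Alice
print(dispatch("music shuffle favorites"))  # No app handles that command
```

The fairness problem from the comment above is exactly who gets to own short, desirable prefixes and verbs in that shared namespace.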
Hopefully they'll stop focusing on conversational UI, as it's wildly inefficient. I'd much rather interact visually and spatially.
I'll give these guys the benefit of the doubt, but it seems like all of these businesses miss the big picture and try to emulate the way humans communicate. In trying to deal with the UI fragmentation problem, they're greatly degrading the user experience. We don't need a chat bot, we need an app that does everything, where the GUI is the language.
I don't want to describe a location verbally, I want to point to it in a map. Most of the time, I don't know what I precisely want, and prefer to browse options rather than have a tedious conversation.
There is no doubt in my mind that the UI of the future will be more like Akinator/Tinder (Yes/Maybe/No) than English.
All the pieces already exist. All we need is some interface over a knowledge graph. Why don't people see this?
Agreed 100%. I felt the same way 10 years ago when people would always bring up the Minority Report UI as some sort of holy grail we should all be working towards. It looks cool (or demos well, in the case of conversational UIs), but outside of a few very specific use cases it would be horribly inefficient to use all the time.
Conversational UIs almost seem like the ultimate form of skeuomorphism to me - it's only an illusion that it works like something from the non-computer world that you're already familiar with (talking to another person, in this case), but the reality is something far more complicated and far less robust than it leads you to believe.
Communication is subtle, and most post-2000s computer-inspired attempts at it are very bad. I want mostly spatial interaction, with a bit of tactile, visual, and audio cues/feedback. Nowadays it's very much vision-centered and high latency, high focus (you need to pay attention to the screen and to your movements). Bad.
I always find these articles (and the response to them) a bit strange.
The hard part of this problem isn't the voice-part - that is doable. The hard part is parsing the question or order! That is so, so far from being solved. Accuracy rates on QA tasks in academia are around 60-70% (depending on the task), which is much worse than the 95-99% accuracy achievable with speech-to-text.
That's why the pizza ordering thing is impressive - the software has to understand intent.
But it didn't sound any easier than calling the pizza shop, which is something I hated doing before the internet solved that problem for me. The unpleasant experience was caused by the cognitive load of having an unpredictable real-time interaction with something that wasn't laid out in front of me, not by the fact that the thing I was conversing with was human.
The example of the pizza ordering is technologically impressive, yes... But from an end-user perspective it is at best "as hard" as calling the pizza place and talking to a human taking your order.
There are many use cases where it could actually provide value, but I found it funny that they chose to demonstrate one to the press that really afforded end-users no advantage.
The point they emphasize in the article is that Viv is supposed to allow you to say "get me pizza, flowers, a bottle of wine, and two tickets to the opera" and then it helps you figure out all that stuff and handles everything for you without you having to make any phone calls, type anything, or open any apps.
I can already order my pizza with an app or with Yo Dominoes... and they cost the same...
Not sure if it's because the person on the phone is doing other stuff at the same time or if it's more like eBooks where they cost the same as Paperbacks because "more profit".
That is the case. Only when the store is very busy will they have dedicated phone order takers. The rest of the time the drivers who are waiting for the next delivery, the pizza makers, and/or managers handle the phones.
Used to work at a Dominos, though it was pre-internet days. Not sure how much online business the typical store now gets, I know I still always call when I order pizza because it's faster than dealing with the website.
Annoyingly at Dominos they now also have an automated attendant answer the call initially, which reads off the daily specials. That's a big reason I don't often order Dominos anymore but instead call a competitor who still has live humans answering the phone.
When I was at Dominos we had a standard of no more than two rings before a person picked up. People would sprint across the store to get the phone before the third ring.
Why not get headsets like they have at fast food places? Edit: I feel like that would be a lot cheaper than even one workers' comp claim from slipping on a flour-covered floor in a pizza place.
I think it's optimistic to think the pizza making process at Domino's involves any loose flour. I would have imagined frozen and/or refrigerated lumps of premade dough on a rack.
I don't know about Dominos, but when I worked for Papa Johns many years ago, loose flour was involved in the process of shaping the refrigerated lumps of premade dough into a pizza crust.
The real savings would come if you have a pizza bot that made pizzas without much human intervention and a drone picked it up and dropped it off and flew back to base.
Automated pizza making machines already exist[1] (skip to near the end of the video to see one in action). Proof of concept: simply take the end product, pick it up with a multi-rotor, and have it autonomously deliver it to the next room.
I've definitely eaten some frozen pizzas that were as good as or better than the $5 pizzas from Dominos. I'm in Australia, YMMV.
I imagine if this was more profitable than the existing technology - humans placing toppings and delivering them - it would already be done.
Dominos and Pizza Hut are awful in Australia and New Zealand. $5 here is pretty much poverty food; even McDonald's costs way more. In the UK Domino's is OK, but costs around 15 pounds delivered.
I agree that Dominos is awful now. I always thought Pizza Hut was pretty bad.
I think when Dominos changed from real baking ovens to what they now use (effectively a conveyorized giant hair dryer) there was a marked change. They had to change the dough so it would not bubble, because the point of the automatic oven is that you would not need a skilled person to monitor the baking.
The conveyor ovens do save labor. When you have 600-degree baking ovens stacked in decks, and you have to watch for and pop bubbles and turn the pies halfway through, and the difference between a pizza being "done" and "burnt" is about 30 seconds, it requires an attentive and coordinated person (sometimes two or three if they are busy) to manage it all. But the pizzas taste better.
I wish them luck, and the ability to hook into 3rd party APIs is good, but ordering a pizza is essentially just straightforward form filling, and quite frankly, something I could do 30 years ago. SoundHound's Hound[0] shows refinement and retargeting, which is much harder because you have to do entity resolution on implied subjects/objects and pronouns. Now maybe Viv can do that too, but that's not what they demoed. What they demoed isn't actually all that interesting.
> It was their first real test of Viv, the artificial-intelligence technology that the team had been quietly building for more than a year.
I think they mean "first real demonstration." It would have been tested thousands of times before reaching the board room.
That gripe aside, the real test (for me, at least) of whether a service like this will be usable is whether it can run offline without uploading my entire life to the mothership. I won't use Google Now or Siri or anything similar because it would make my life (even more of) an open book to a single huge provider. It's bad enough that Google (through Gmail) has access to my emails, I don't want them having a log of my moment-to-moment location, voice conversations, routines, etc.
I have my doubts when it comes to replacing colorful, tactile user interfaces with voice conversations.
I don't like the idea at all. I want to interact with my device using a visual interface, which I can "feel" and interact with, rather than a metallic voice.
I want to see what that pizza looks like, not imagine it.
Voice interfaces are totally unusable in places where there are several other people - an office, a train, a bus, a doctor's office - exactly the places where we tend to dive into our devices.
You're right, I'm lumping them together, as per the article:
> Then, a text from Viv piped up: "Would you like toppings with that?"
That's not a very good UX in my opinion.
> Think star trek 'Computer show me a map of the omnicrom system and highlight possible habitable planets within 10 light years of our current position'
Yes, it would be a nice additional input method, although sometimes pointing things out (like selecting with the mouse or a finger) is a lot more efficient than trying to explain it - "highlight that one... no, no, not that one, the other one! ... stupid toaster!"
I can see voice input being useful when combined with all the other input methods plus good realtime visual feedback.
My prediction is that we'll have really interesting programming environments and programming languages based on voice input in the next couple of years.
Still, voice input is practical in an isolated space, where there are no other people using voice input as well. So although I see a niche, I don't see it replacing mobile devices just yet.
Another example of government-funded invention being converted to dollars using capitalist innovation. People still don't understand how much our government is involved in bringing us the tech we have now. The government invests, and about 10-20 years later we see it in our homes.
Because the list of words the speech recognition system has to understand reliably is extremely limited. There are probably what, 200-300 words the system has to understand for a pizza service? It's fairly simple to fine-tune speech recognition if you have a small dictionary of keywords you've given the engine extra training on. A pizza demo is perfect because it appears to be non-trivial, but I assume the engine has been fine-tuned a lot specifically for the demo. The interesting aspects of speech recognition are in handling natural sentences from the entire language, rather than a subset that has been heavily tuned to weigh specific words as being more expected for a given domain.
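Here's a toy example of the effect (a deliberate oversimplification: real systems bias a language model rather than doing a literal nearest-phrase lookup, but it shows why a narrow domain flatters the recognizer). With only a handful of expected phrases, even a badly garbled transcript snaps to the right one.

```python
# Toy illustration of constrained-vocabulary recognition: instead of decoding
# open-ended speech, pick the closest match from a short list of expected
# phrases. Phrase list and inputs are made up for the example.
import difflib

PIZZA_PHRASES = [
    "large pepperoni pizza", "medium margherita pizza",
    "extra cheese", "thin crust", "deliver to my address",
]

def snap_to_domain(noisy_transcript):
    """Map a possibly garbled transcript onto the closest expected phrase."""
    best = difflib.get_close_matches(noisy_transcript, PIZZA_PHRASES, n=1, cutoff=0.0)
    return best[0]

print(snap_to_domain("large peperoni piza"))  # 'large pepperoni pizza'
print(snap_to_domain("thing crust"))          # 'thin crust'
```

Open-vocabulary dictation gets no such safety net, which is why the same engine that nails a pizza demo can still mangle a free-form message.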
[0] http://www.acquired.fm/episodes/2015/12/14/episode-5-siri
[1] http://9to5mac.com/2014/06/30/why-is-apple-hiring-nuance-eng...