
The problem is that voice interfaces break down pretty quickly once you try to do anything complicated. The Echo has pretty solid voice recognition--far better than anything else I've ever used--but it's still hard to get it to do anything useful once you get beyond a pretty narrow script. (e.g. what's the weather forecast, play this artist, etc.)



I've found that the voice recognition on Android phones works well enough to be useful in a wide variety of circumstances. Navigating, getting directions, setting alarms, taking notes, sending text messages, sending emails, searching for things, and many more. When I was still using my Moto X I did the majority of every-day tasks with voice recognition.

The iPhone is catching up fast too...my wife's taken to sending emails via Siri (to avoid strain on her hands), and most of the time it gets things perfectly.

The biggest problem is privacy. One of the nice things about touchscreens is that you have a personal dialog with the device that can't be overheard by anyone nearby. That doesn't apply to voice recognition systems, and it can be pretty awkward to dictate an e-mail to a phone in a crowded place.


Being overheard isn't the only privacy concern. Most of these solutions offload the speech recognition and language parsing functions to corporate servers. I like texting with Siri but I'm not exactly keen on having Apple record everything. It also seems limiting in that I can't use voice commands without a network.

It would be nice for voice recognition to start being built into the device itself. I know there's training data that's needed, but there's real convenience to be gained.


I think the processing requirements for handling on-device Siri would destroy battery life.


This actually doesn't seem to be the case. Take a look at Google Translate's offline voice recognition AND translation - it's really amazing, considering it's all happening on your device.


I forget where it was, but they published something about training a very small very fast neural network that could fit comfortably in the phone's memory. Tricky tricky. :D


Plus the only way to train these things at scale is to upload the recordings once you have some usage.


Worse for battery life than firing up the radio?


And devices that listen to you 100% of the time are yet another privacy concern... even if they don't send everything to a remote server.


If you have a human assistant who does that job, he also listens 100% of the time.


But he or she is less vulnerable to being automatically hacked by a three letter agency, foreign government, and/or hacker gathering data for identity theft.

The privacy concern _isn't_ necessarily about having something to hide. It's about the consistent hacking of major systems, and exposure of personal data.


And you don't think there are privacy concerns with that? It is a /very/ intimate relationship, and generally requires some ritualized/formalized interaction, and a very high degree of trust.


Just on the note of hand strain, without knowing anything about your wife's condition, a way that could help alleviate it is to critically analyse hand position/technique. As a pianist, I have been trained to have a very supple hand position when operating any device but I notice this isn't at all the case for many people I observe in their day to day activities.

Historically this probably wasn't much of an issue, but given that most people now spend hours at a desk on a keyboard, it's likely to become more of a problem. Think of it as akin to paying attention to your posture.


The use of Google Now from my bluetooth'd helmet has really improved my motorcycling experience.

Real easy to say: "Okay Google... navigate to California Academy of Sciences."

What's missing for me is spotify/app specific integration.


> What's missing for me is spotify/app specific integration.

For that to really happen in a robust way, I think Google needs to open up Custom Voice Actions [0].

[0] https://developers.google.com/voice-actions/custom-actions


"Ok Google.... Play <artist> on Spotify" works for me.

I agree discovery of these magic phrases needs work.


Yeah, some of that can be done through system actions (which I think that one is), and it sounds like custom actions have been implemented by selected partners. I just mean they need to open up custom actions to enable more general app-specific integration.


I thought this already worked.

"Okay Google... Play music" will start the Music app.

"Okay Google... Start Radio" will start the NPR app.


I can say "Open Spotify" and it will open the app. Then I have a button on the helmet that sends the Play command. But I can't do anything robust like playing a specific artist.

Perhaps if I used Google Music the integration would be built out.


On my phone "Play <artist>" uses Google Music. "Play <artist> on Spotify" makes it use Spotify.


On my Nexus 6p saying "OK Google play 'artist'" will open Spotify and start playing the top songs of that artist. This does not work to play specific playlists though.


Define "works well"? It doesn't work well if you're not connected to the Internet, if you speak quickly, or if you interrupt it, and it can only do limited follow-up.


>The problem is that voice interfaces break down pretty quickly once you try to do anything complicated

I've done a fair bit of interface engineering for the web. Between that and using so much software over the course of my life, I'd say that this applies to GUIs just as much as voice interfaces.


Yes, but GUIs have two or three dimensions available (up/down, left/right, time) whereas voice just has the one (time). We humans can also full-duplex GUIs much more easily than voice-based interface. And GUIs at least can be hooked up to full-powered grammar-based interfaces whereas voice, somewhat ironically considering the nature of human communication, has more trouble with it.

(I'd suggest this is actually a combination of the still-non-trivial nature of NLP, combined with a lack of feedback, combined with the fact that giving instructions is quite hard. Humans overestimate human language's ability to communicate clear directions, as anyone who has done tech support over a phone understands.)


Just as the mouse input has evolved to include multitouch and 3d touch gestures, voice input can also evolve. The full range of tone, inflection, pitch, etc is available from the human voice.

I wonder if NLP research should have started as our ancestors did, with grunts and hoots and cries. Instead it's focused on recognizing full words and sentences while almost completely ignoring inflection.

Another dimension to add with vocal input is directional. If you have mics in all corners of a room, which direction you speak in can affect whether "turn off" operates your TV, your lights or your oven.
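
Just to sketch that dispatch idea (the bearing estimate, device names, and angle ranges here are all made up, not any real product's API):

    # Toy sketch of direction-aware commands: a mic array gives us a bearing
    # estimate in degrees, and we pick whichever device the speaker is facing.
    DEVICES_BY_BEARING = {
        (315, 45): "tv",       # north wall
        (45, 135): "lights",   # east wall
        (135, 225): "oven",    # south wall
        (225, 315): "stereo",  # west wall
    }

    def target_device(bearing_deg):
        """Pick the device the speaker is facing, bearing in [0, 360)."""
        b = bearing_deg % 360
        for (lo, hi), device in DEVICES_BY_BEARING.items():
            if lo <= hi and lo <= b < hi:
                return device
            if lo > hi and (b >= lo or b < hi):   # range wraps past 0 degrees
                return device
        return "unknown"

    def handle(command, bearing_deg):
        if command == "turn off":
            return "turning off the " + target_device(bearing_deg)
        return "sorry, I didn't catch that"

    print(handle("turn off", 10))    # -> turning off the tv
    print(handle("turn off", 180))   # -> turning off the oven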


Very good points. I can't wait until devices can read the emotions or inflections in my voice. I can voice-to-text most of my short messages, but anything that requires punctuation or, god forbid, emoji still requires manual input. And I don't want to have to say "period" or "exclamation mark" to indicate my desired punctuation. If I say something unusually loudly, insert an exclamation mark. If I pause at the end of a sentence (Word has been able to recognize a grammatically complete sentence for decades) and don't say "um" or "uh", put a period. If my inflection goes up or there is a question word in the sentence, add a question mark.

There is a lot of room for improvement in voice processing across several dimensions of the voice.
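
To sketch what those heuristics might look like in code (the loudness score, pause length, and pitch flag are hypothetical inputs some upstream recognizer would have to supply; the thresholds are invented):

    QUESTION_WORDS = {"who", "what", "when", "where", "why", "how"}
    FILLERS = {"um", "uh"}

    def punctuate(text, loudness, pause_sec, pitch_rose):
        words = text.lower().split()
        if not words or words[-1] in FILLERS:
            return text                      # speaker isn't done yet
        if pitch_rose or words[0] in QUESTION_WORDS:
            return text + "?"
        if loudness > 0.8:                   # unusually loud -> exclamation
            return text + "!"
        if pause_sec > 0.7:                  # long pause -> end of sentence
            return text + "."
        return text

    print(punctuate("where are you", loudness=0.4, pause_sec=0.9, pitch_rose=True))
    # -> where are you?
    print(punctuate("that was amazing", loudness=0.95, pause_sec=0.9, pitch_rose=False))
    # -> that was amazing!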


And copy and paste. People seem to always forget the power of it. It's the GUI equivalent of "Search for that on Google" or "Now, SSH to this IP I found digging through AWS." Copying and pasting text from application to application is the clunky Unix pipe. It's universal and deeply important.

Taking sections of the last response, or hell, even having every response essentially be wrapped up in some sort of object you can reference in your next query to the interface is what all of these lack.
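
Purely as a toy illustration of what "responses as referenceable objects" might look like (the Result class, the "that"/"it" resolution, and the fake lookup are all invented for the example):

    from dataclasses import dataclass

    @dataclass
    class Result:
        kind: str    # e.g. "artist", "address", "ip_address"
        value: str

    class Assistant:
        def __init__(self):
            self.last = None   # most recent Result, if any

        def ask(self, query):
            words = query.split()
            if self.last is not None:
                # resolve simple references to the previous result
                words = [self.last.value if w in ("that", "it") else w for w in words]
            query = " ".join(words)
            # a real lookup would happen here; fake one result for the demo
            self.last = Result(kind="artist", value="The Kinks")
            return "OK: " + query

    a = Assistant()
    print(a.ask("who played You Really Got Me"))   # remembers "The Kinks"
    print(a.ask("search for that on Google"))      # -> OK: search for The Kinks on Google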

Even Android's "Search this artist" doesn't quite get there. The lack of context between queries is what murders Siri for me. That, and her seemingly random selection of what goes to Google and what goes to Wolfram Alpha. Sometimes even the "wolfram" verb prepended to a query just doesn't go to Wolfram, no matter what.


I've often postulated that copy and paste is perhaps the biggest productivity enhancement in the history of computing.


I know some software maintainers who might disagree. But I like PopClip (https://pilotmoon.com/popclip/) as an enhancement on top of that one.


I second PopClip as a fantastic product, incredibly useful. Their DropShelf[0] tool is also useful, but not nearly as much as PopClip. But definitely worth the money.

0: https://pilotmoon.com/dropshelf/


I use KDE Connect to enable seamless copy and paste between my PC and my phones. It's the single best thing I've installed in the last year or two.


Sure, but the difference is that it's (almost) always obvious what actions are possible in a GUI. With voice interfaces you're back to trial-and-error.


There is still a fundamental problem with voice: it has to understand your words.

A text field, in contrast, doesn't need any intelligence, nor do buttons. This is particularly important, for instance, for people living in non-English-speaking countries but using English in specific contexts (work, gaming, minor hobbies, etc.). Switching languages in audio applications is generally a PITA. And even when you do switch languages every time, the engines still have huge performance gaps between languages.

Software has become extremely tolerant of multiple languages, IMO. Voice recognition interfaces are not so mature yet, in my experience.


I'm not so sure about that. Check this out. One of the toughest fights in one of the toughest games performed with only voice commands. https://www.youtube.com/watch?v=5m2a2dLdZ0M

Now, granted, this is a specific use case, but, you know... "explore the space" and all that. (more cowbell!)


> One of the toughest fights in one of the toughest games performed with only voice commands. https://www.youtube.com/watch?v=5m2a2dLdZ0M

After 111 failed attempts :)

Still, it's a hell of an achievement.

EDIT: to be fair, Ornstein & Smough is a very tough fight even with normal controls.

Also notice the voice recognition fails to recognise some words like "item" even though they are spoken clearly. Almost gets the guy killed at one point.


The "play some good 60s rock" example isn't a VUI breakdown, it's a functionality gap in the backend. One that will probably be fixed pretty quickly, given the way things are headed.

A VUI breakdown would be inability to understand accents, or non-responsiveness to commands. As a user input, Alexa is pretty well buttoned up.


Sounds like the Enterprise computer:

Geordi: Computer, subdued lighting.

(computer turns the lights off)

Geordi: No, that's... that's too much. I don't want it dark. I want it cozy.

Computer: Please state your request in precise candlepower.

(The scene: https://www.youtube.com/watch?v=OPZnR3Ue1n4)


There will certainly be some aspects of the computer training the human, too. Just using this as an example, I don't know how much candlepower I want, but computers don't get bored or annoyed by my requests. I could start with 1 candlepower and move up to 10 if it's not bright enough. 100 might be too bright, so now I know what range I'm looking at. Next time I could just say "computer, 12 candlepower lighting, please".

Computers train users on how to use the computer all the time. It's less ideal than having the computer know everything, but once you know what you can expect from a computer, it's easier to get a good result.


I think that cuts both ways. If the computer can be trained to understand the user's intent, that seems like a better solution than forcing the user to think a different way.

Which would you rather do? Be forced to state your lighting preferences in candlepower, or have the computer learn that when you say "subdued lighting", you mean "12"?


Very true, but this is one simple example. Look at what Wolfram Alpha tries to do for even more complicated examples. If I put in "if I am traveling at 60 miles per hour how many hours does it take to go one hundred miles" it gives me an answer of 6000 seconds (1.66 hours). Very intuitive, and it actually ruined my example because I did not expect the site to understand what I was saying.

But if I type in "how fast do I need to go to travel 100 miles in 6000 seconds", now it has no idea what I'm talking about and instead gives me a comparison of time from 6000 seconds to the half life of uranium-241.

Now, when I get that result, I don't usually just give up on trying to figure out the answer. Instead I try to figure out what the computer expects me to say. Through some trial and error, I can shorten the query to "100 miles in 6000 seconds" and boom, I get the answer of 60 miles per hour. Instead of natural language, I'm using the search engine like a calculator.
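
For what it's worth, the arithmetic behind both queries is the same unit conversion run in opposite directions:

    # "traveling at 60 mph, how long to go 100 miles?"
    speed_mph = 60.0
    distance_miles = 100.0
    hours = distance_miles / speed_mph        # 1.666... hours
    seconds = hours * 3600                    # 6000 seconds

    # "100 miles in 6000 seconds" -> back to miles per hour
    mph = distance_miles / (seconds / 3600)   # 60.0 mph

    print(hours, seconds, mph)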

The computer has just taught me how to use it. Ideal? No, but we work within the reality we're given. 12 candlepower is dim for you but for someone with decreased vision, that might be completely dark. The computer doesn't know unless it's taught, and we know from looking at history that users would rather the computer train the user than the user having to train the computer.


You asked: "how fast do I need to go to travel 100 miles in 6000 seconds", which is equivalent to asking "at what rate do I need to go to travel {rate}". It's a nonsense question; you already know the answer: you need to go 100 miles per 6000 seconds.

What you should have asked is "100 miles per 6000 seconds to miles per hour", which it will happily do, converting the rate you gave into the one you really wanted.

I guess what you're saying is it should be able to figure that out, but at some point the old phrase "garbage in, garbage out" surfaces... You never told it to convert the unit.


Wolfram is, and has always been, much more inclined to understand you if you work out exactly what you are trying to calculate beforehand.

Some phrases exist as a "wow, 1 million people phrase this problem this way, let's throw that in." The fact it can take an easily dictated, albeit strictly phrased problem, and get you your answer is really what I love about it. Now if Siri would just stop sending stuff to Google. -_-


What if you could define the equivalent of Bash aliases via voice control? This would allow users to tailor their experience from the default (possibly complex/unintuitive) commands to their own personalized ones.

Example format: "Computer, define X as Y"

"Computer, define subdued lighting as set lighting to candle power twelve"

Then the VUI just adds a new entry to the voice commands where saying X results in Y.
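
A minimal sketch of that alias table (the "define X as Y" syntax and the command names are made up for the example):

    aliases = {}   # spoken phrase -> expanded command

    def handle(utterance):
        if utterance.startswith("define ") and " as " in utterance:
            name, expansion = utterance[len("define "):].split(" as ", 1)
            aliases[name.strip()] = expansion.strip()
            return "OK, '%s' now means '%s'" % (name.strip(), expansion.strip())
        # expand a known alias, otherwise pass the command through unchanged
        return "executing: " + aliases.get(utterance, utterance)

    print(handle("define subdued lighting as set lighting to candlepower twelve"))
    print(handle("subdued lighting"))   # -> executing: set lighting to candlepower twelve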


So unrealistic. They'd use candelas.


You're thinking too much like an engineer :-) It's not a speech recognition breakdown but it's certainly a voice interface breakdown in the sense of I can't get the device to do what I want it to do. As a user, I don't care where in the pipeline my attempts to communicate a desired action break down. I just know that they do.


Exactly. We're used to dealing either with humans, who are intuitive and highly adaptive, or with technology, which we manipulate and have total control over (so long as the system displays its status, we can find our way). We're not used to systems that expect us to interact with them in natural language but have very specific criteria around what we ask for.

It still feels a lot like the old text-based RPGs, in that you spend most of your time trying to figure out how to phrase something to accomplish a basic need, while angrily thinking "it would have just been easier/faster to pick up my phone."

It's 2016. How are we still OK with the unreasonable constraints of technology that make us jump through a hoop like a trained poodle to get the treat?


The same can be said for GUIs. Remove the search engine concept and you're only left with playlists and song/artist names on such sites.

We don't have an audio search engine equivalent yet, but that day is also not far off.


That's the thing. It is a use case with voice commands that map to specific actions. In the case of music, I can give Echo the name of a specific artist or maybe a playlist. But it breaks down pretty quickly if I tell it to play "some good 60s rock."


Ok, that is pretty damn cool. I've played Dark Souls so I can appreciate how difficult that must have been. Very impressive.

Devil's advocate though: this seems more like a case of the guy being good enough at the game to win in spite of the voice controls rather than because of them. Compared to a regular controller/keyboard+mouse/whatever there's just no contest in terms of input speed and precision. Not all genres are a good fit for this either. I'd be really interested to see if anyone could make it work with, say, a competitive FPS game.


Never mind that using a voice service requires you to speak at a rate slower than many can type, all while demanding that the people in the room hush up so it won't get confused. Repeat if there was a mistake.


Try Hound. It's faster than anything I've tried, and its context management is just impressive as hell. The Echo's lack of negative clauses is really, really frustrating.


I just can't stand talking to a computer. Never liked the idea of it. I loathe voice-controlled telephone menus. I can type faster than I can talk (if you include the inevitable revisions -- even without it's pretty close). I don't even like to leave messages on voicemail. I don't think voice interfaces are anything I will ever use if there's another option.


That holds true with pretty much all first-generation products of this type. The first "smart phones" couldn't do a whole lot of things. Over time, the Echo will improve and you'll be able to hold conversations with it.


My children are quite young. The world is going to be an amazingly interesting place when they are my age.

I can recall the first time I ever saw a computer and how primitive they now look.

Now we have little bots that listen to you and reply with info.

When my two-year-old is forty, we will have Ghost in the Shell.

It's crazy, beautiful, and scary to me that so many of us grew up reading cyberpunk fiction and watching anime (not all of us did), and yet pretty much all of us are actually building that future.

There is a balance between dystopia and utopia though.

We are all working at the Great Game - and the future is going to be interesting, but we can never turn back. So hopefully we keep the balance and get it right.

My worry is that at this nascent stage of the technology, we'll fuck it up by not fighting hard enough for privacy policy.

We need privacy policy that is thinking at least 50 years in advance.

Those in control of the government apparatus are thinking in advance. I personally feel that the tech sector's vision is myopically focused on today's profits rather than on the future, where it should be looking, with the exception of this most recent case between Apple and the FBI. At least Cook's comments were salient, forward-thinking, and truly for the greater good... Let's hope that invigorates the tech industry as a whole to think about where we are headed.


Speech recognition has improved dramatically over the past few years through using cloud back-ends. It's actually usable for many tasks.

However, we still seem to be pretty far from natural language interfaces that make sensible inferences about the actions you're requesting and perhaps join multiple data sources to answer your query. There have been a lot of advances--don't get me wrong. But it's a very hard problem that people have been working on for a very long time.


Just like you hold conversations with Siri, Cortana and Google Now?


Are they not first-gen?


Well, I mean, they aren't fixed artifacts like a piece of hardware. I'm pretty sure they have been updated a few times.


Is it better than Google's voice recognition? Siri is completely useless for me, but Google's recognizes everything I say (love my new iPhone 6s, but I wish I could say "hey Siri" and have it actually work).


The other issue is that it becomes less useful when more than one person is active in the room. Small party? The interface no longer functions, as talking in the background interferes.


And if you do get beyond a narrow range, does the user spend a lot of time thinking about how to craft a question so that the machine can understand it?


How complicated is controlling a TV or a radio? And voice is much easier for a variety of tasks than remote controls.



