
The transition from primarily visual UX towards an auditorial UX is really powerful.

Looking at screens to get key information distracts me from my surroundings and seems archaic.

My wife is a sound designer who has opened my eyes to the importance of sound both in film and in the world. It's not that I was unaware of sounds, but I didn't realize how important they are to centering me in this world and in the made-up worlds of films and games. Try watching a scary movie with the sound turned off; it turns into a comedy.

I think it's unexplored territory that has huge potential to impact the way we interact with the real world, even more so than Glass or HoloLens.

When I listen to music as I walk down the street, I change: my mood, my posture, and the way I look at the world. The music augments the reality around me in a way that visual UX never can, because a visual UX is a lens between my eyes and the world.




The problem is that voice interfaces break down pretty quickly once you try to do anything complicated. The Echo has solid voice recognition--far better than anything else I've ever used--but it's still hard to get it to do anything useful once you get beyond a fairly narrow script (e.g., "what's the weather forecast," "play this artist").


I've found that the voice recognition on Android phones works well enough to be useful in a wide variety of circumstances: navigating, getting directions, setting alarms, taking notes, sending text messages, sending emails, searching for things, and many more. When I was still using my Moto X, I did the majority of everyday tasks with voice recognition.

The iPhone is catching up fast too...my wife's taken to sending emails via Siri (to avoid strain on her hands), and most of the time it gets things perfectly.

The biggest problem is privacy. One of the nice things about touchscreens is that you have a personal dialog with the device that can't be overheard by anyone nearby. That doesn't apply to voice recognition systems, and it can be pretty awkward to dictate an e-mail to a phone in a crowded place.


Being overheard isn't the only privacy concern. Most of these solutions offload the speech recognition and language parsing functions to corporate servers. I like texting with Siri but I'm not exactly keen on having Apple record everything. It also seems limiting in that I can't use voice commands without a network.

It would be nice for voice recognition platforms to start being built into the device itself. I know there's training data that's needed, but on-device recognition affords some real convenience.


I think the processing requirements for handling on-device Siri would destroy battery life.


This actually doesn't seem to be the case. Take a look at Google Translate's offline voice recognition AND translation - it's really amazing, considering it's all happening on your device.


I forget where it was, but they published something about training a very small, very fast neural network that could fit comfortably in the phone's memory. Tricky tricky. :D
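As a toy illustration of why a network like that can fit (the layer sizes below are invented, not from whatever paper that was, and quantization is just one common trick for shrinking models):

    # Hypothetical small acoustic model: three dense layers.
    layer_shapes = [(40 * 16, 512), (512, 512), (512, 2048)]
    n_params = sum(rows * cols for rows, cols in layer_shapes)

    fp32_mb = n_params * 4 / 1e6   # 32-bit float weights
    int8_mb = n_params * 1 / 1e6   # 8-bit quantized weights
    print(f"{n_params:,} weights: {fp32_mb:.1f} MB fp32 vs {int8_mb:.1f} MB int8")
    # -> 1,638,400 weights: 6.6 MB fp32 vs 1.6 MB int8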


Plus the only way to train these things at scale is to upload the recordings once you have some usage.


Worse for battery life than firing up the radio?


And devices that listen to you 100% of the time are yet another privacy concern... even if they don't send everything to a remote server.


If you have a human assistant who does that job, he also listens 100% of the time.


But he or she is less vulnerable to being automatically hacked by a three-letter agency, a foreign government, and/or a hacker gathering data for identity theft.

The privacy concern _isn't_ necessarily about having something to hide. It's about the consistent hacking of major systems and the exposure of personal data.


And you don't think there are privacy concerns with that? It is a /very/ intimate relationship, and generally requires some ritualized/formalized interaction, and a very high degree of trust.


Just on the note of hand strain: without knowing anything about your wife's condition, one thing that could help alleviate it is to critically analyse hand position and technique. As a pianist, I have been trained to keep a very supple hand position when operating any device, but I notice this isn't at all the case for many people I observe in their day-to-day activities.

Historically this probably wasn't much of an issue, but given that most people now spend hours at a desk on a keyboard, it's likely to become more of a problem. Think of it as akin to paying attention to your posture.


The use of Google Now from my Bluetooth'd helmet has really improved my motorcycling experience.

Real easy to say: "Okay Google... navigate to California Academy of Sciences."

What's missing for me is spotify/app specific integration.


> What's missing for me is spotify/app specific integration.

For that to really happen in a robust way, I think Google needs to open up Custom Voice Actions.

[0] https://developers.google.com/voice-actions/custom-actions


"Ok Google.... Play <artist> on Spotify" works for me.

I agree discovery of these magic phrases needs work.


Yeah, some of this can be done through system actions (which I think that is), and it sounds like custom actions have been implemented by selected partners. I just mean they need to open up custom actions to enable more general app-specific integration.


I thought this already worked.

"Okay Google... Play music" will start the Music app.
"Okay Google... Start Radio" will start the NPR app.


I can say "Open Spotify" and it will open the app. Then I have a button on the helmet that sends the Play command. But I can't do anything robust like playing a specific artist.

Perhaps if I used Google Music the integration would be built out.


On my phone "Play <artist>" uses Google Music. "Play <artist> on Spotify" makes it use Spotify.


On my Nexus 6p saying "OK Google play 'artist'" will open Spotify and start playing the top songs of that artist. This does not work to play specific playlists though.


Define "works well"? It doesn't work well if you're not connected to the Internet, if you speak quickly, or if you interrupt it, and it can only do limited follow-up.


>The problem is that voice interfaces break down pretty quickly once you try to do anything complicated

I've done a fair bit of interface engineering for the web. Between that and using so much software over the course of my life, I'd say that this applies to GUIs just as much as voice interfaces.


Yes, but GUIs have two or three dimensions available (up/down, left/right, time) whereas voice has just the one (time). We humans can also full-duplex GUIs much more easily than voice-based interfaces. And GUIs at least can be hooked up to full-powered grammar-based interfaces, whereas voice, somewhat ironically considering the nature of human communication, has more trouble with it.

(I'd suggest this is actually a combination of the still-non-trivial nature of NLP, combined with a lack of feedback, combined with the fact that giving instructions is quite hard. Humans overestimate human language's ability to communicate clear directions, as anyone who has done tech support over a phone understands.)


Just as the mouse input has evolved to include multitouch and 3d touch gestures, voice input can also evolve. The full range of tone, inflection, pitch, etc is available from the human voice.

I wonder if NLP research should have started as our ancestors did, with grunts and hoots and cries. Instead it's focused on recognizing full words and sentences while almost completely ignoring inflection.

Another dimension to add with vocal input is directional. If you have mics in all corners of a room, which direction you speak in can affect whether "turn off" operates your TV, your lights or your oven.
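Roughly, in code -- a minimal sketch assuming each corner mic is paired with the device nearest it (the device names and the loudness heuristic are made up; a real system would estimate direction from time differences of arrival between mics):

    import numpy as np

    # Hypothetical wiring: each corner mic is associated with one device.
    DEVICES = {"mic_nw": "tv", "mic_ne": "lights", "mic_sw": "oven", "mic_se": "stereo"}

    def route_command(recordings):
        """recordings: dict of mic id -> numpy array of audio samples.
        Route 'turn off' to the device whose mic heard the speaker loudest,
        a crude stand-in for which direction the speaker was facing."""
        rms = {mic: np.sqrt(np.mean(samples.astype(float) ** 2))
               for mic, samples in recordings.items()}
        return DEVICES[max(rms, key=rms.get)]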


Very good points. I can't wait until devices can read the emotions or inflections in my voice. I can voice-to-text most of my short messages, but anything that requires punctuation or, god forbid, emoji still requires manual input. And I don't want to have to say "period" or "exclamation mark" to indicate my desired punctuation. If I say something unusually loudly, insert an exclamation mark. If I pause at the end of a sentence (Word has known what a grammatically correct sentence looks like for decades) and don't say "um" or "uh", put a period. If my inflection goes up or there's a question word in the sentence, add a question mark.
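Those rules are simple enough to sketch out; here's a toy version (the decibel threshold and the prosody features taken as inputs are invented for illustration):

    QUESTION_WORDS = ("who", "what", "when", "where", "why", "how")

    def punctuate(sentence, volume_db, baseline_db, pitch_rising):
        """Pick end-of-sentence punctuation from crude prosody features."""
        if volume_db > baseline_db + 6:            # said unusually loudly
            return sentence + "!"
        if pitch_rising or sentence.lower().startswith(QUESTION_WORDS):
            return sentence + "?"
        return sentence + "."                      # pause + complete sentence

    print(punctuate("where are my keys", 60, 60, pitch_rising=True))
    # -> where are my keys?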

There's a lot of room for improvement in voice processing, across several dimensions of the voice.


And copy and paste. People always seem to forget the power of it. It's the GUI equivalent of "search for that on Google" or "now, SSH to this IP I found digging through AWS." Copying and pasting text from application to application is the clunky Unix pipe of the GUI world. It's universal and deeply important.

Taking sections of the last response -- or hell, having every response wrapped up in some sort of object you can reference in your next query to the interface -- is what all of these lack.

Even Android's "search this artist" doesn't quite get there. The lack of context between queries is what murders Siri for me. That, and her seemingly random selection of what goes to Google and what goes to Wolfram Alpha. Sometimes even the "wolfram" verb prepended to a query just doesn't go to Wolfram, no matter what.
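A sketch of what that could look like, with every response kept as an object the next query can splice in (the "$last" placeholder and the stub backend are mine, not any real assistant's API):

    def backend(query):
        """Stand-in for a real recognizer/answer engine."""
        return f"<answer to: {query}>"

    class Session:
        def __init__(self):
            self.history = []                    # every response, newest last

        def ask(self, query):
            # "$last" splices the previous response into this query --
            # the voice-UI analogue of piping one command into the next.
            if "$last" in query and self.history:
                query = query.replace("$last", self.history[-1])
            response = backend(query)
            self.history.append(response)
            return response

    s = Session()
    s.ask("what's the tallest building in Chicago")
    print(s.ask("search for $last on Google"))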


I've often postulated that copy and paste is perhaps the biggest productivity enhancement in the history of computing.


I know some software maintainers who might disagree. But I like PopClip (https://pilotmoon.com/popclip/) as an enhancement on top of that one.


I second PopClip as a fantastic product, incredibly useful. Their DropShelf[0] tool is also useful, though not nearly as much as PopClip. Still definitely worth the money.

0: https://pilotmoon.com/dropshelf/


I use KDE Connect to enable seamless copy and paste between my PC and my phones. It's the single best thing I've installed in the last year or two.


Sure, but the difference is that it's (almost) always obvious what actions are possible in a GUI. With voice interfaces you're back to trial-and-error.


There is still a fundamental problem with voice: it has to understand your words.

A text field, in contrast, doesn't need any intelligence, nor do buttons. This is particularly important, for instance, for people living in non-English-speaking countries but using English in specific contexts (work, gaming, minor hobbies, etc.). Switching languages in audio applications is generally a PITA. And even if you do switch languages every time, the engines still have huge performance gaps between languages.

Software has become extremely tolerant of multiple languages, IMO. Voice recognition interfaces are not so mature yet, in my experience.


I'm not so sure about that. Check this out. One of the toughest fights in one of the toughest games performed with only voice commands. https://www.youtube.com/watch?v=5m2a2dLdZ0M

Now, granted, this is a specific use case, but, you know... "explore the space" and all that. (more cowbell!)


> One of the toughest fights in one of the toughest games performed with only voice commands. https://www.youtube.com/watch?v=5m2a2dLdZ0M

After 111 failed attempts :)

Still, it's a hell of an achievement.

EDIT: to be fair, Ornstein & Smough is a very tough fight even with normal controls.

Also notice the voice recognition fails to recognise some words like "item" even though they are spoken clearly. It almost gets the guy killed at one point.


The "play some good 60s rock" example isn't a VUI breakdown, it's a functionality gap in the backend. One that will probably be fixed pretty quickly, given the way things are headed.

A VUI breakdown would be an inability to understand accents, or non-responsiveness to commands. As a user input system, Alexa is pretty well buttoned up.


Sounds like the Enterprise computer:

Geordi: Computer, subdued lighting.

(computer turns the lights off)

Geordi: No, that's... that's too much. I don't want it dark. I want it cozy.

Computer: Please state your request in precise candlepower.

(The scene: https://www.youtube.com/watch?v=OPZnR3Ue1n4)


There will certainly be some aspects of the computer training the human, too. Just using this as an example, I don't know how much candlepower I want, but computers don't get bored or annoyed by my requests. I could start with 1 candlepower and move up to 10 if it's not bright enough. 100 might be too bright, so now I know what range I'm looking at. Next time I could just say "computer, 12 candlepower lighting, please".

Computers train users on how to use the computer all the time. It's less ideal than having the computer know everything, but once you know what you can expect from a computer, it's easier to get a good result.


I think that cuts both ways. If the computer can be trained to understand the user's intent, that seems like a better solution than forcing the user to think a different way.

Which would you rather do? Be forced to state your lighting preferences in candlepower, or have the computer learn that when you say "subdued lighting", you mean "12"?


Very true, but this is one simple example. Look at what Wolfram Alpha tries to do for even more complicated examples. If I put in "if I am traveling at 60 miles per hour how many hours does it take to go one hundred miles" it gives me an answer of 6000 seconds (1.66 hours). Very intuitive, and it actually ruined my example because I did not expect the site to understand what I was saying.

But if I type in "how fast do I need to go to travel 100 miles in 6000 seconds", it has no idea what I'm talking about and instead gives me a comparison of 6000 seconds to the half-life of uranium-241.

Now, when I get that result, I don't usually just give up on trying to figure out the answer. Instead I try to figure out what the computer expects me to say. Through some trial and error, I can shorten the query to "100 miles in 6000 seconds" and boom, I get the answer of 60 miles per hour. Instead of natural language, I'm using the search engine like a calculator.
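For what it's worth, the arithmetic in that shortened query does check out:

    miles, seconds = 100, 6000
    hours = seconds / 3600      # 6000 s = 1.67 h
    print(miles / hours)        # -> 60.0 (mph)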

The computer has just taught me how to use it. Ideal? No, but we work within the reality we're given. 12 candlepower is dim for you but for someone with decreased vision, that might be completely dark. The computer doesn't know unless it's taught, and we know from looking at history that users would rather the computer train the user than the user having to train the computer.


You asked: "how fast do I need to go to travel 100 miles in 6000 seconds" Which is equivilent to saying "at what rate do I need to go to travel {rate}". It's a nonsense question, you already know the answer. You need to go 100 miles per 6000 seconds.

What you should have asked is: "100 miles per 6000 seconds to miles per hour", and it will happily convert the rate you gave into the one you really wanted.

I guess what you're saying is that it should be able to figure that out, but at some point the old phrase "garbage in, garbage out" surfaces. You never told it to convert the units.


Wolfram is, and has always been, much more inclined to understand you if you work out exactly what you are trying to calculate beforehand.

Some phrases exist as a "wow, 1 million people phrase this problem this way, let's throw that in." The fact that it can take an easily dictated, albeit strictly phrased, problem and get you your answer is really what I love about it. Now if Siri would just stop sending stuff to Google. -_-


What if you could define the equivalent of Bash aliases via voice control? This would allow users to tailor their experience from the default (possibly complex/unintuitive) commands to their own personalized ones.

Example format: "Computer, define X as Y"

"Computer, define subdued lighting as set lighting to candle power twelve"

Then the VUI just adds a new entry to the voice commands where saying X results in Y.
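In its simplest form that's just a lookup table in front of the command dispatcher. A minimal sketch, assuming the "define X as Y" phrasing (the grammar and function names are made up):

    import re

    ALIASES = {}

    def handle(utterance):
        """Record "define X as Y" aliases; expand known aliases before dispatch."""
        m = re.match(r"define (.+?) as (.+)", utterance, re.IGNORECASE)
        if m:
            ALIASES[m.group(1).lower()] = m.group(2)
            return f"OK, '{m.group(1)}' now means '{m.group(2)}'."
        return ALIASES.get(utterance.lower(), utterance)  # pass through if unknown

    handle("define subdued lighting as set lighting to candlepower twelve")
    print(handle("subdued lighting"))
    # -> set lighting to candlepower twelve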


So unrealistic. They'd use candelas.


You're thinking too much like an engineer :-) It's not a speech recognition breakdown but it's certainly a voice interface breakdown in the sense of I can't get the device to do what I want it to do. As a user, I don't care where in the pipeline my attempts to communicate a desired action break down. I just know that they do.


Exactly. We're used to dealing with either humans, who are intuitive and highly adaptive, or technology, which we manipulate and have total control over (so long as the system displays its status, we can find our way). We're not used to systems that expect us to interact with them in natural language but have very specific criteria around what we ask for.

It still feels a lot like the old text-based RPGs, in that you spend most of your time trying to figure out how to phrase something to accomplish a basic need, while angrily thinking "it would have just been easier/faster to pick up my phone."

It's 2016. How are we still OK with the unreasonable constraints of technology that make us jump through a hoop like a trained poodle to get the treat?


The same can be said for GUIs. Remove the search engine concept and you're left with only playlists and song/artist names on such sites.

We don't have an audio equivalent of the search engine yet, but that day is not far off either.


That's the thing. It is a use case with voice commands that map to specific actions. In the case of music, I can give Echo the name of a specific artist or maybe a playlist. But it breaks down pretty quickly if I tell it to play "some good 60s rock."


Ok, that is pretty damn cool. I've played Dark Souls so I can appreciate how difficult that must have been. Very impressive.

Devil's advocate though: this seems more like a case of the guy being good enough at the game to win in spite of the voice controls rather than because of them. Compared to a regular controller/keyboard+mouse/whatever there's just no contest in terms of input speed and precision. Not all genres are a good fit for this either. I'd be really interested to see if anyone could make it work with, say, a competitive FPS game.


Never mind that using a voice service requires you to speak at a rate slower than many people can type, all while demanding that everyone in the room hush up so it won't get confused. Repeat if there was a mistake.


Try Hound. It's faster than anything else I've tried, and its context management is just impressive as hell. The Echo's lack of negative clauses is really, really frustrating.


I just can't stand talking to a computer. Never liked the idea of it. I loathe voice-controlled telephone menus. I can type faster than I can talk (if you include the inevitable revisions -- even without, it's pretty close). I don't even like to leave messages on voicemail. I don't think voice interfaces are anything I will ever use if there's another option.


That holds true for pretty much all first-generation products of this type. The first "smart phones" couldn't do a whole lot of things. Over time, the Echo will improve and you'll be able to hold conversations with it.


My children are quite young. The world is going to be an amazingly interesting place when they are my age.

I can recall the first time I ever saw a computer and how primitive they now look.

Now we have little bots that listen to you and reply with info.

When my two-year-old is forty, we will have Ghost in the Shell.

It's crazy, beautiful, and scary to me that we grew up reading cyberpunk fiction and watching anime -- not all of us did, but pretty much all of us are actually building that future.

There is a balance between dystopia and utopia though.

We are all working at the Great Game - and the future is going to be interesting, but we can never turn back. So hopefully we keep the balance and get it right.

My worry is that at this nascent stage of the technology, we'll fuck it up by not fighting hard enough for privacy policy.

We need privacy policy that is thinking at least 50 years in advance.

Those in control of the government apparatus are thinking in advance. I personally feel the tech sector's vision is myopically focused on today's profits rather than on the future, where it should be looking, with the exception of this most recent case between Apple and the FBI. At least Cook's comments were salient, forward-thinking, and truly for the greater good... Let's hope that invigorates the tech industry as a whole to think about where we are headed.


Speech recognition has improved dramatically over the past few years through using cloud back-ends. It's actually usable for many tasks.

However, we still seem to be pretty far from natural language interfaces that make sensible inferences about the actions you're requesting, or that join multiple data sources to answer your query. There have been a lot of advances--don't get me wrong. But it's a very hard problem that's been worked on for a very long time.


Just like you hold conversations with Siri, Cortana and Google Now?


Are they not first-gen?


Well, I mean, they aren't fixed artifacts like a piece of hardware. I'm pretty sure they have been updated a few times.


Is it better than Google's voice recognition? Siri is completely useless for me, but Google recognizes everything I say (I love my new iPhone 6s, but I wish I could say "hey Siri" and have it actually work).


The other issue is that it becomes less useful when more than one person is active in the room. Small party? The interface no longer functions, because talking in the background interferes.


And if you do get beyond a narrow range, does the user spend a lot of time thinking about how to craft a question so that the machine can understand it?


How complicated is controlling a TV or a radio? And voice is much easier for a variety of tasks than remote controls.


I think the main problem with voice interfaces is that they're not discoverable. You need a good understanding of what the system can and cannot do, its current state, etc., before even speaking.

CLIs have the same issue, but at least you can run "man xxx", which I imagine works a lot better in text than it does in audio.


I think the goal is that the system gets to be good enough that nobody worries about discoverability any more.

I think Google is quickly getting there with their search interface. I'm always amazed at what a good job Google does when I ask it a question like "what's the name of the instrument powered by steam" and milliseconds later it's showing me info about calliopes.


I really liked how this was done in the movie 'Her.' There's something especially nice about only having your attention distracted audibly and not visually, especially in public.

I wonder if the smartphone age will go away as quickly as it came. I picture a world where we just have smart wearables like a watch which has a tiny visual interface, but a powerful audio one (speaker, earpiece, put watch up to ear, etc). It seems a lot less intrusive. I imagine as we get better with AI and voice recognition, it'll be as practical as a phone. What I'm able to do with Google Now on my watch is fairly impressive today. We already have the technology to understand things in context like "Navigate to Katz's deli" brings up Google Maps to the deli as opposed to a google search results page about navigating to a cat themed deli, which was the status quo not too long ago with voice search.

I imagine carrying around this big selfie/Facebook machine, constantly charging it, whipping it out all the time, etc. will seem pretty gauche if wearable-only solutions become competitive.


For many functional tasks, I can see an auditory UI being superior. But currently most people use their smartphone to skim content. I don't want the equivalent of listening to voicemail for everything.

Not to say that content can't shift for the medium, just as it always does. What would an audio Facebook sound like?


Well, I do that now, sorta, on my watch with its small screen. I scroll through notifications, but no, I don't get the full FB web or mobile experience. I'm not sure how many people actually want that; I often hear complaints about how phones and apps aren't simple anymore. I also believe we really haven't figured out the best way to use these small screens. I'm surprised at how usable my watch sometimes is with its 320x320 screen at 1.8". For reference, the original iPhone was 3.5" at 320x480.

For teens and such I can see the big phone never going away, but for most adults, having an inconspicuous wearable just seems like a more refined experience. I imagine there's a logical progression here from desktop > traditional laptop > ultrabook laptop/convertible > tablet > mobile > wearable. You lose functionality with every step, but depending on the use case, it doesn't really matter. For people in my peer group, a wearable that could work without a phone would sell like hotcakes.


> The transition from primarily visual UX towards an auditorial UX is really powerful.

It's also less accessible. I'm sure auditory UI is useful in many cases, but it also seems to be more cumbersome in others. In any case, I hope that pervasive auditory UI doesn't become any sort of standard without an accompanying visual/physical interface.

> Try watching a scary movie with the sound turned off, it turns into a comedy

Allow me to be pedantic and say that it is being fully immersed in the context of the movie that really matters. You could probably achieve a similar suspenseful effect with silence plus subtitles, although I'm sure the experience isn't identical. Otherwise, the deaf could never enjoy scary movies, including me.


>It's also less accessible.

For whom? To the blind this would be a godsend. From a practical medical perspective, audio is superior because we have decades of experience with effective ear implants to help the hard of hearing and the deaf, but the visual equivalent still eludes us.


> To the blind this would be a godsend.

Actually, I'd imagine that a good old-fashioned tty is pretty good for a blind person: it's TUIs and GUIs that get progressively more painful.

Source: am blind without my glasses; can imagine preferring ed to emacs, vim, Atom, SublimeText if I had to use an audio interface.


> To the blind this would be a godsend

For sure. Different interfaces disadvantage different classes of people. There is no silver bullet; I'm trying to point out that an exclusively audio/voice-driven UI would not be desirable.

> we have decades of experience with effective ear implants

The problem is multi-faceted. Hearing loss, especially from a young age, often leads to difficulty speaking -- it is no use if a voice-driven system can't understand you in the first place.

And while cochlear implant technology has helped a lot of people, it is by no means a cure, and there are many, many others that don't benefit enough from assistive technology to achieve functional equivalence (which is the key phrase when talking about accessibility). I have a cochlear implant and haven't worn it in years, because it really doesn't help.


> It's also less accessible

Well, I think blind people would disagree with you.

> I hope that pervasive auditory UI doesn't become any sort of standard without an accompanying visual/physical interface.

Any speech interface could be trivially translated to a text interface, right?


> Well, I think blind people would disagree with you.

Answered downthread.

> Any speech interface could be trivially translated to a text interface, right?

Pretty much, which is why UIs should not be exclusively auditory, that is, delivered without an accompanying visual interface (text or otherwise). Ordering the Echo Dot verbally is a cute gimmick given its premise, but it would really suck if otherwise useful products and services were only usable through audio.

Hopefully the audio UI trend does not follow the obsession with touch screens: a rapidly adopted de facto standard, driven by tastemakers, that leaves little consideration for others who might prefer an actual keyboard or other physical affordances.


> Allow me to be pedantic and say it is that being fully immersed in the context of the movie that really matters.

I hope to not be a super pedantic ass for pointing out that the 'immersive' media in films is the audio, not the visual components.


> the 'immersive' media in films is the audio, not the visual components

That's a non-falsifiable opinion, really (even if it does apply to the majority of the population). I'm living proof you can enjoy movies without the audio.

It's the sum of our experience that colors our perception -- almost irrevocably in this case, since I imagine it would be difficult for the typical person to really be able to enjoy something in complete and utter silence.


> I'm living proof you can enjoy movies without the audio.

I am not looking to equate immersion with enjoyment, and by no means do I intend to disrespect the manner by which you enjoy a type of media. My apologies for coming off that way!

When I refer to 'immersive media' I am referring to the 360-degree omnidirectional dispersion pattern of sounds and our similarly omnidirectional hearing of those sounds. This is an 'immersive experience', as opposed to the 2-dimensional or stereoscopic experience we get with visual media. Television and film screens fire light directly at the eyes; even in IMAX theaters the film is never experienced behind us. That isn't immersive, whereas, say, a VR headset can potentially offer this type of immersion. But since that technology is still in its infancy, I think it's too early to call it fully immersive the way audio is.


> 'immersive media' I am referring to the 360-degree omnidirectional dispersion pattern

Then that is splitting hairs over a definition of immersion, and quite unrelated to how the word was used in my original comment. Had I instead said "fully engrossed," my point would still hold, and you would not have one.

I understand you were being "super pedantic," but if you're going to do that, then you should be super precise in the pedantry, otherwise you're arguing a strawman.


>> auditorial

Don't you mean oral or aural?


You would probably be interested in what we've been building over at https://www.narro.co.


It would be nice if it could extract forum discussions, like YC and Reddit. Sometimes I like to hear the text I am reading; it helps with concentration.


Yes, I'd like to see the possibility to select text, right click and select "read out loud".


I think the browsers on OS X support that using the system text-to-speech (edit: Safari and Chrome do, not Firefox).


I'm using Linux. It seems that Linux is falling behind in the area of speech input/output. I hope they will catch up.


Voice will become an important, if not the primary, interface to home/car audio/video.


"computer lights on" "dimmer"

No thank you... I will use my hand.


A device to change the channel on my TV? No thanks; I'll just use the dial on the TV.


Let me pick up my phone, open the app for light control, dial in some settings, and hope the app doesn't crash.

TV remotes are awesome because they have physical buttons and they're fairly dumb... almost no chance of issues.


And if you're on the couch watching a movie and the light switch is on the other side of the room? Or you want to switch on the porch light for guests. Or switch off outside lights?


I get my non-lazy ass up.

For the few times when I may need to walk a bit more around the house, it's a non-issue.


And what if you weren't so mobile?


They should announce an Amazon Echo for the deaf, which would just be a screen.


... with a couple of Kinect-type devices to monitor one's signs.





