Who wants to have a black box in their home (or bedroom!) that listens to every sound, connects to the internet and over which you have no control? Sure I believe Amazon when they say they only send clips of the actual commands, but how long before these things get rooted left and right?
I built my own version of this, which I can customize as I want. I can get the weather (my central home automation server pulls forecasts from wunderground.com, current temp from my own sensors, the voice control unit then pulls data from there), open/close curtains and turn lights on/off, and it says the time. I have a small list of other 'dialogues' (as they're called in my system) I'm going to add when I have some time, but I'm still figuring out what functionality is worthwhile.
My software is based on pocketsphinx on raspi, so it's easy to put one in each room (I have only one now myself). I'm using a mxl ac404 teleconf mic, which works ok; you do have to speak up for it to pick up the commands. I'd love to get an echo to see how much better it works. I have a primitive 'tts' system that plays back prerecorded messages, and falls back to festival for unknown words. I paid a voice actor a bit through fiverr to get the 100 or so words I need. Sounds better than full synthetic tts systems, although I need to work on improving the timing between words.
This is only a few days' worth of work, too. It's not hard to make. I'm not claiming my system works as well as an echo, but I prefer the control mine gives me.
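For anyone curious how the "prerecorded words, fall back to festival" idea might look, here is a minimal sketch. It is not the poster's actual code: the word list, the function name `plan_speech`, and the step format are all made up for illustration. It only decides which words come from a clip and which need the synthesizer; actually playing the audio is left out.

```python
import re

# Hypothetical set of words for which a prerecorded clip exists.
RECORDED_WORDS = {"the", "time", "is", "ten", "thirty", "lights", "on", "off"}


def plan_speech(sentence, recorded=RECORDED_WORDS):
    """Split a sentence into ("clip", word) and ("synth", text) steps.

    Words with a recording become clip steps. Runs of consecutive
    unknown words are grouped so the synthesizer is called once per run.
    """
    steps = []
    for word in re.findall(r"[a-z0-9']+", sentence.lower()):
        if word in recorded:
            steps.append(("clip", word))
        elif steps and steps[-1][0] == "synth":
            # Extend the previous synthesized run instead of starting a new one.
            steps[-1] = ("synth", steps[-1][1] + " " + word)
        else:
            steps.append(("synth", word))
    return steps


if __name__ == "__main__":
    print(plan_speech("The time is ten forty five"))
```

Grouping the unknown words into one run is a design guess: it should keep the synthetic voice from being started and stopped for every single word, which would probably make the timing problem mentioned above worse.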
> Sure I believe Amazon when they say they only send clips of the actual commands
I worked on speech parsing software at Audible, where much of the data was originally gathered to make Alexa and the Echo possible.
They are not telling the truth when they say that they only send clips of the actual commands.
EDIT:
This is not to say that they constantly stream audio data, either. But they send much more than just the voice extractions of the commands themselves. They have to in order to build a profile of the users' voices, habits, etc., which aid in the quick processing of incoming speech data.
The service is not selective enough to only pick up on voice information preceded by a correct utterance of "Alexa." Amazon is "customer-obsessed" - and one of the product execs asked on a phone call:
> I am an Alexa user, and call out her name but stutter slightly. I still intended to say "Alexa," so why shouldn't she respond to me? That's a bad user experience. Customers want to be understood, not ignored.
This is paraphrased, but the consequences of this question were enormous. It basically ensured that Alexa users would not ever have privacy again.
Does the audible/amazon stt engine train its own language model and thus 'learn' the user's speech peculiarities? If so, how does it know what the user really meant and thus should train towards?
I can't speak to the training of the models. I'm not a data science guy, I specialize in signal processing and feature extraction.
What I can say, though, which may give good insight into your question, is this: The "Hushpuppy" project, which became known as the WhisperSync and Immersion Reading technologies, was not self-trained or self-reinforced. QA engineers in India were contracted to ensure synchronization between the book and the audiobook. These technologies, though, were the building blocks upon which Alexa was developed. The data collected from these operations became essential knowledge for the engineering team that developed Alexa.
EDIT: As an aside, Amazon bases all of its services around that Amazon account that almost everyone in the civilized world now has. Alexa is not the first, and certainly not the last Amazon service that learns and improves based upon the user's Amazon account.
> I specialize in signal processing and feature extraction.
In this context, do I speculate in the right direction when I think about isolating voice from background noise? Any tips to share that could help my diy tinkering?
Legitimately confused by the downvotes on this one... Is there another way I should express disbelief and request more information? A drive-by claim that Amazon is lying to its customers about what Echo sends to the Amazon mothership is a substantial accusation that should be supported with either evidence for or a method for independently confirming the veracity of said claim.
So, I'll try another pass:
Dear AndrewUnmuted: would you kindly elaborate on what else, other than the audio that follows the wake word, is transmitted to Amazon that ensures "[Echo] users would not ever have privacy again."
I think we can all stipulate to the fact that there is an audio transmission beginning after the wake word and ending when we see the spinning light pattern that indicates Echo is waiting for a server-side response.
While I can't speak to what AndrewUnmuted is referring to with Amazon, I know Google Now transmits more than just the command / audio after the wake word: it actually transmits 2 - 3 seconds before and including the wake word.
You can confirm this yourself: if you visit this page at Google, you can hear every Google Now voice query you've made, and verify it includes audio before & including the "OK Google" trigger phrase:
I believe that I have provided sufficient background from which you can acquire your own evidence. My involvement with Amazon's speech services is something I regret and wish to help others make better sense of, but I also am bound to contractual obligations that prohibit me from going into the explicit detail that you ask for.
If you've ever worked for Amazon, you'd know how dangerous it would be to go into the level of specificity that you are demanding. I refuse to satisfy some loud-mouthed internet user just because of the "seriousness" of my claim.
It would be extremely easy for you to test my claims yourself.
Please refrain from suggesting I'm some "loud-mouthed" internet user. You do not know me.
I, as well, do not know you, so I'm unable to accept a drive-by accusation that having an Echo means I'll "not ever have privacy again", a bold claim indeed. Lots of people shout out things that seem plausible, even flash a bit of confidence-inspiring bravado, and have no intention of actually helping clear the air on a topic like this.
I stand corrected on one thing; I do see that Echo transmits a fraction of a second of audio before the wake word. I also know, from when I was doing my own debugging/hacking a while back, that there are requests made in cleartext to download packaged code during firmware updates, and based on the limited console hacking I did way back then (though I don't know enough about signal processing to have gotten more than that from the test pads on the bottom), Echo appears to have two bootable partitions, and swaps between them upon a successful upgrade - pretty standard stuff, it didn't appear that one partition was hiding anything, just that it had older packages.
My network analysis at the time was inconclusive in some respects due to some traffic being encoded/encrypted in a way I didn't immediately recognize -- frankly, my efforts at that time were focused on what was coming back to the Echo, and I will certainly investigate more of what's going out.
Your commentary thus far does not suggest anything privacy destroying (they know who I am and what I've bought? So?) but I will be happy to share what I find if I find anything. I know of many people smarter than I who've intercepted and analyzed the traffic and didn't find anything that suggests users "would not ever have privacy again", so I don't have high hopes of finding enlightenment without the assistance of someone who professes to have the answers.
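On the point raised upthread that clips can include audio from before the wake word: the usual way to get that is a small rolling buffer that is always being overwritten, and only kept once the trigger fires. The sketch below illustrates that general technique only. It is not taken from Amazon's or Google's code, and the class name, frame representation, and `wake_detected` flag are assumptions made for the example.

```python
from collections import deque


class PreRollRecorder:
    """Keep the last `pre_roll` frames so a clip can start before the trigger."""

    def __init__(self, pre_roll):
        self.history = deque(maxlen=pre_roll)  # oldest frames fall off the front
        self.clip = None  # None while idle, a list while recording

    def feed(self, frame, wake_detected=False):
        if self.clip is not None:
            # Already recording: everything goes into the clip.
            self.clip.append(frame)
            return
        self.history.append(frame)
        if wake_detected:
            # Start the clip from the buffered frames, wake word included.
            self.clip = list(self.history)

    def finish(self):
        """Return the recorded clip (or None if idle) and go back to idle."""
        clip, self.clip = self.clip, None
        self.history.clear()
        return clip


if __name__ == "__main__":
    rec = PreRollRecorder(pre_roll=3)
    for frame in ["a", "b", "c", "d"]:
        rec.feed(frame)
    rec.feed("wake", wake_detected=True)
    rec.feed("cmd1")
    print(rec.finish())
```

Note that in this scheme nothing older than the buffer length survives, so how much pre-trigger audio ends up in a clip is set entirely by `pre_roll`.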
True, but for some reason I'm less concerned about this attack vector. I'll agree that that feeling is just that, and not so much based on rational arguments.
Every tech has its downsides. I guess the hackers would rise up to meet the challenge and provide viable alternatives... but as of late, like with most things, Corporations have come to dominate the day-to-day life of normal human-beings.
Who wants to host their content in the cloud on servers they don't control?
Who wants to buy a netbook with nothing but Google Apps on them, collecting data to show even more targeted Ads?
Who wants to install Chrome (a closed-source browser) when Firefox and others provide much more transparency and privacy-controls?
Who wants a phone that tracks your location and reports it to the world's most powerful AI company at every instant?
Who wants an on-demand-taxi app that tracks your movement inch-perfect, stores your payment and personal information, and keeps the record of your usage till, probably, end of time instead of... well... hailing a cab right off the corner?
Who wants to trust entities that hold your money in their coffers and in return show "digits" on-screen as a proof of presence of "actual" money?
Who wants to fly in giant metal-tubes, over which one has no control whatsoever-- from who's flying, to what food that would be served, to what kind of hygiene is maintained, to the safety procedures followed, to the transparency into ways in which your baggage is handled, or your booking information, for that matter?
Who wants to drive a relatively smaller metal-tube on roads with other drunkards and addicts? Why are tax-dollars wasted on initiatives that present such grave dangers-- where someone else's mistake ends up costing someone else entirely?
Sure, there's always a real and present danger. You gain some, you lose some.
--
For me, it was about time someone made ubiquitous computing main-stream. It can only be good for the tech-ecosystem. Google seems in a prime-position to better Amazon's offering, if it hasn't done already through its home-products division, Nest.
I do, because I paid for it, and I have no reason to believe my data is used for any purpose other than Amazon selling me more stuff, and that they're not interested in poisoning the well by surreptitiously recording audio when I'm not actively engaging in a dialog with my Echo.
That said, regarding the RasPi project, I did something similar for home automation before Echo/Hue/SmartThings/etc were around, and I used the Acoustic Magic Voicetracker I array microphone, which still works well for an updated use of that (now it sits on my desk for dictation and VoIP calls). Perhaps that microphone would be useful to cover an entire large room for you? The manufacturer claims 30 feet of usable distance between the array and the speaker, and in practice, I found that to be pretty accurate.
EDIT: it should be pointed out that the microphone (https://www.acousticmagic.com/products/voice-tracker-i-detai...) is more expensive than an actual Amazon Echo. The price hasn't changed in the 12 years that I've owned it, either. But I did compare the audio samples my Echo sends with audio recorded via the array microphone, and the standalone microphone was far superior at all distances in terms of quality, which would be important if you're doing speech-to-text processing with PocketSphinx.
The mic I have works across the room, but e.g. our living room is arranged in such a way that sometimes you sit with your back towards where the mic is. In that case, you have to turn around and address the mic, as it were. Also, when you're all the way across the room, you have to speak up - not just talk at conversational level, or mumble under your breath as I tend to do.
How well does the Acousticmagic work in such cases? Can you speak away from it, and will it still pick it up? Furthermore, how well does it work wrt background sound filtering? Does it hear you when the tv is on, or when children are playing or similar cases? Those are the main cases in which my setup requires speaking up.
I've built most of a very similar thing, for similar reasons. I recently moved the software to a CHIP instead of a Raspberry Pi, so I could buy a bunch of them and put one or two in each room. The MXL AC404 is really expensive if you have to get more than one, though! Does anyone know of a cheaper microphone that's still somewhat decent for this purpose? (Doesn't have to be USB, audio in is ok too.)
I think the MXL is the cheapest conference mic around that is USB and gets good reviews. Also in my experience the quality of the mic is more important than the voice recognition software/model.
This is something I would both love to develop something for but could never allow into my house. I know I could do both but it feels kind of wrong to make anything that might encourage other people to have one given the terrifying privacy implications.
What I dislike about Jasper is how it deals with modality: you have to say 'Jasper', wait for it to recognize that and confirm to you it's gone into 'listen' mode, then say the actual command. This delay is what made it unacceptable from a UX perspective. I prefer to have all my commands prefixed with the keyword (I use 'computer' but from what I read online, 'jarvis' is a more popular choice...)
Jasper also writes audio to disk, then runs command line tools on those files. I haven't tested if this is a significant source of latency.
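To make the "keyword as prefix" approach concrete, here is a rough sketch that works on the text a recognizer hands back. It assumes the recognizer returns one transcript per utterance; the command table and handler names are hypothetical and not from Jasper or from the poster's system.

```python
KEYWORD = "computer"  # the prefix keyword mentioned above


def lights_on():
    return "lights on"


def lights_off():
    return "lights off"


def say_time():
    return "time"


# Hypothetical command table: phrase after the keyword -> handler.
COMMANDS = {
    "lights on": lights_on,
    "lights off": lights_off,
    "what time is it": say_time,
}


def handle_transcript(transcript, keyword=KEYWORD, commands=COMMANDS):
    """Run the matching handler, or return None if the utterance is ignored."""
    words = transcript.lower().split()
    if not words or words[0] != keyword:
        return None  # not addressed to the system
    handler = commands.get(" ".join(words[1:]))
    if handler is None:
        return None  # keyword heard, but no known command followed it
    return handler()


if __name__ == "__main__":
    print(handle_transcript("computer lights on"))
    print(handle_transcript("lights on"))
```

Because keyword and command arrive in one utterance, there is no separate 'listen' mode to wait for, which is the delay being complained about above.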
I use this. Yes, latency is kind of big, but it's tolerable. The big difference for me is that you have to program every command. I've never used the Echo, but my understanding is that it has a ton of pre-built commands that you can use: set a timer, what is the weather, play somesong, etc.
For jasper (pocketsphinx) you have to manually program the action for all of these. So it's a lot more setup. I still like it and use it all the time though.
> The monitoring for keywords is done in the hardware or firmware. Nothing ever gets sent to Amazon until "Alexa" or "Echo" is recognized.
I think that what you mean is that the monitoring is done in the hardware or firmware using closed-source code that can and will be regularly updated remotely and hopefully securely. And that Amazon told us that it would wait until it thought it heard "Alexa" or "Echo" or anything that sounds sort of like it, or whatever they decide to change the software on your particular device to listen for in the future.
Yes, that's definitely a more precise statement of the facts as we know them today. Please allow me to expand on it:
Amazon has told us that this product we paid to have in our homes won't spy on us, and has (to my knowledge) given me or anyone else ZERO indication that they'd suddenly decide: "Privacy? Fuck that! Let's see if someone is saying something salacious in that bedroom in Watertown, NY; that customer seems to be buying a lot of lube." Or, less sarcastically, violate their paying customers' expectation of privacy to suit their own ends, whatever those may be.
Google, however, has "snuck in" code to actively listen to the microphone in their browser, which we don't pay for. I won't use the old "if you're not the buyer, you're the product" routine here, but I will say that I trust the privacy protections of a free browser with portions of black-box, closed-source code a hell of a lot less than I trust the same protections of a paid-for product with portions of black-box, closed-source code.
There are enough people out there hacking the Echo and looking at the data getting sent back and forth that anything suspect would be all over HN and reddit within hours of the update that caused it. It has some closed-source bits, but watching the traffic is pretty trivial. Not saying they're not doing or won't do anything sneaky, but there's a good chance it'll get noticed if they do. Hell, Amazon already has so much info on me just from what I've willingly given them in account details and activity that I'd almost be interested to see what more they think they'd get from eavesdropping on my everyday life. Maybe my ads will start being for things I want instead of things I just bought.
Is there something like echo I can build myself for a local network?
What is some advice you can give for me to build my own "Jarvis" like Zuck is doing but be reasonably sure that the components aren't going to "phone home" somewhere or be rooted?
I've wanted something like this for years, and I can easily see why this is awesome. I considered the Echo seriously when it came out. However, even as an Amazon Prime member, I'm hesitant to have Amazon in my living room. I don't want to tie that deeply into a single ecosystem. Therefore, I am eagerly waiting for my Mycroft[1]... One for every room! If you are willing to muck about with hardware, the Jasper project does a lot of this already.
My main reason for not getting one is that it's tied to services that are not available in my country (Amazon Prime, Spotify...). It seems like the rest of services ("Skills") are third-party citizens on this device. Although I can appreciate the sleek design of Echo, I'd rather get an opensource variant (Jasper and Mycroft are ones I heard of).
Ideally Echo would be able to control HomeKit devices. iPhone users now have no Nest integration, no Echo. Starting to kill the idea of an open ecosystem.
I bit the privacy bullet and got an Echo for my home theater. It's nice, but there is a mountain of unrecognized value because of Amazon's desire to control everything.
As a for instance, it would be nice if keyword enablement could mute the stereo. It'd be nice to Chromecast audio to it. It'd be nice if I could use it to play things on a fire stick.
If it really wanted to be a device of the future, it would link to other Echoes in the house and allow for intercom and localized audio tracking.
Some of these things will either come or never see the light of day because Amazon hates interacting with other companies (see: Android).
I've been trying to find a usb array mic/software combo that works better for far field sound capture than a single mic - to use with either OK Google on old android phones, or Jasper running on Rpi.. Any recommendations?
I would also love having something like this. I found the Playstation Eye camera for PS3 has an array of four mics, but there's no accompanying software, so it's not as useful. I'd buy a hardware/software combo that did far-field sound capture well.
Essentially, I want a better mic so I can run this:
Yeah, same here, PS3 Eye but no drivers/software for array mics or beamforming. I wanted to create something like in that video too. Looks promising with OK Google - a little laggy/glitchy but the idea definitely has legs.
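Since beamforming came up: the simplest version, delay-and-sum, is small enough to sketch. This is only an illustration of the idea, not a driver for the PS3 Eye or any real array. It assumes you already have the channels as lists of samples and already know the per-microphone delays (in whole samples) for the direction you care about; estimating those delays is the hard part and is not shown.

```python
def delay_and_sum(channels, delays):
    """Average the channels after shifting each one by its delay in samples.

    channels: list of equal-length sample lists, one per microphone.
    delays:   how many samples later the target sound reaches each mic.
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        total = 0.0
        for samples, d in zip(channels, delays):
            idx = t + d  # where this mic recorded the same wavefront
            if 0 <= idx < n:
                total += samples[idx]
        out.append(total / len(channels))
    return out


if __name__ == "__main__":
    # One click that reaches mic 0 first, then mic 1, then mic 2.
    mics = [
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 0],
    ]
    print(delay_and_sum(mics, [0, 1, 2]))  # steered at the source
    print(delay_and_sum(mics, [0, 0, 0]))  # not steered
```

With the right delays the click lines up across the three channels and adds coherently; with the wrong ones it is smeared out and attenuated, which is what gives an array its directionality.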
I think in a couple of years this stuff will become ubiquitous
Yeah, I'm pretty sure we'll have state-of-the-art open source deep learning based speech recognition libraries this year that will run on local hardware without significant latency.
Baidu open sourced their warp CTC a few weeks ago, give it another few months before someone will release a trained English network for it
Would you be able to go into some detail about the setup? Are you using Rpi or another board? What speech recognition library/APIs you are using? Are you using the array mic setup on the kinect?
Judging from the questions listed in the article, I would assume that there's a way smaller ROI for a European homebody (fewer services that can be queried/integrated, usually worse language recognition, no need to constantly check the weather, public transport).
I've got one, I find it mostly useless as I can't set my location to London. The speaker is fine - but I have a much better system in the same room with an apple tv (v2) connected with a dac. I guess it's just me, but I also like controlling what I'm listening to and having some agency over it, so just mindlessly listening to playlists or automated radio isn't really my thing.
I think it is cool that it's programmable, but I'm not that impressed thus far.
(I should add, the reason I have one is that it was a gift from AMZN after attending an event of theirs last year, I didn't buy it)
I've had the Echo a bit longer (6 months? We were in the first batch that went out to Prime customers) and we love it as well. I also want to get one for the whole house instead of just the basement, as we have now. In addition to music and trivia and timers, we also use ours to control our lighting (via the Hue stuff from Philips) and we love that as well. Overall it's amazing.
New AppleTV 4th gen has an IR blaster that'll turn your TV on and off even if your TV doesn't support CEC. It seems to figure out the right IR codes automatically.
I note that Siri on my iPhone does much of this, and is also always listening. Siri switches my lights on, tells me what time it is in London, makes calls, and sets timers for me, and with the newer phones, doesn't require the phone to be plugged in.