Google admits listening to some smart speaker recordings

owlninja · on July 11, 2019

Similar discussion: https://news.ycombinator.com/item?id=20402070

carlosdp · on July 11, 2019

I like to imagine sometimes how these kinds of "revelations" happen in tech newsrooms.

Reporter: Tell it to me straight, do you listen to the recordings? Google: Well yea, that's how we train the... Reporter: WE GOT 'EM!

It's like the "Apple admits throttling CPU when battery starts dying" story all over again. It wasn't a secret, you just didn't ask before.

danShumway · on July 11, 2019

In some ways, this is missing the point. To people working in machine learning, this isn't a revelation. To ordinary people, it is.

A common refrain that comes up in discussions about privacy is that ordinary consumers don't care about stuff like Google Home. They don't care about privacy, only weird tech people care about privacy.

However, the fact that articles like this get traction shows that a substantial portion of ordinary people don't understand what privacy they're giving up when they use Google Home. They didn't understand when they were installing the devices that a human was going to be able to listen to their recordings. And when they do understand that a human might be listening, that creeps them out.

This implies two things:

a) if properly informed and educated, normal people probably would care about privacy more. Part of the reason why it's mostly tech-people complaining about Google Home and Alexa is because it's mostly tech-people who understand what these devices do.

b) consumers aren't being properly informed about the privacy implications of devices like Google Home and Alexa, or else they wouldn't be surprised by any of this. If this news story is getting traction, it means that Google did not do a good enough job informing users about who had access to their data.

yongjik · on July 11, 2019

As an ex-Googler, I'd like to say that "ordinary people" still won't understand what privacy they're giving up. Before the news, they underestimated it; now they overestimate it, but still without clear understanding of what's happening.

A group of researchers listening to a random sample of audio clips with no way to identify actual speakers is very different from someone being able to look up your name and address and pull down your conversation for leisure. (To be fair, the latter is technically not impossible - it's just that such an act will likely trigger half a dozen alarms, and the perpetrator will be fired quickly. Unless it's the government secretly asking for your information - but then, if the government is specifically looking for you, all bets are off anyway.)

It's basically the same as Google search. If you type anything into Google's search box, your search will be recorded and preserved forever so that Google's engineers can analyze usage patterns. How else would they improve their search algorithm?

Edit: I probably shouldn't have used "forever" - I don't know exactly how long your search results will be preserved. If it helps, consider it replaced by "long enough that someone can write a TechCrunch article that enrages people".

reaperducer · on July 11, 2019

How else would they improve their search algorithm?

There are lots of ways. Ways that other industries test and refine without violating privacy.

One example: Having a group of people who sign up to be part of your testing. That way there's informed consent.

Zhyl · on July 11, 2019

>If you type anything into Google's search box, your search will be recorded and preserved forever so that Google's engineers can analyze usage patterns. How else would they improve their search algorithm?

Or only retain the data for a set period of time. You can refine indexes with 6-12 months of data. You'd only want to keep unaggregated data for more than that if you were building profiles of individual people.
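
To make the idea concrete, here is a minimal sketch of a fixed retention window. Everything in it is an assumption for illustration: the `purge_expired` helper, the 12-month `RETENTION` value, and the shape of a log entry are hypothetical and say nothing about how Google actually stores search logs.

```python
from datetime import datetime, timedelta

# Assumed retention window: 12 months, the upper end of the range suggested above.
RETENTION = timedelta(days=365)


def purge_expired(entries, now, retention=RETENTION):
    """Keep only log entries whose timestamp falls inside the retention window."""
    cutoff = now - retention
    return [entry for entry in entries if entry["timestamp"] >= cutoff]


if __name__ == "__main__":
    # Hypothetical log entries; the field names are made up for this sketch.
    log = [
        {"query": "weather tomorrow", "timestamp": datetime(2019, 1, 1)},
        {"query": "old query", "timestamp": datetime(2018, 1, 1)},
    ]
    kept = purge_expired(log, now=datetime(2019, 7, 11))
    print([entry["query"] for entry in kept])
```

Anything older than the cutoff is dropped wholesale, so only aggregates computed before the purge would survive it.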

ablatner · on July 12, 2019

You can delete your data from your Google account though.

UncleMeat · on July 11, 2019

That's what happens now. You check the "use my data to improve xyz services" box.

GrapeFriedNiggr · on July 11, 2019

Well yeah, but it would be way more expensive and you would have orders of magnitude less data. There are also issues with biased sampling.

Point being, the search wouldn't be as good.

dang · on July 12, 2019

Trollish usernames aren't allowed on HN because they end up trolling each thread you post to.

https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...

I've banned this account for now, but if you want to email hn@ycombinator.com a new username, we can rename the account and unban it for you.

ryandrake · on July 11, 2019

This assumes a fixed definition of “good.” I would argue that the search would be better because it was done using data collected only from parties who have given consent.

You can define “good” in terms of cost, revenue, sampling bias, relevance, etc. I can define it in terms of ethical data collection and usage. Who’s to say which definition is correct?

nocturnial · on July 11, 2019

>A group of researchers listening to a random sample of audio clips with no way to identify actual speakers

Actually the journalists who obtained those audio clips identified several people. Those who were willing to talk on camera confirmed it was indeed their voices.

mtgx · on July 11, 2019

> now they overestimate it

I disagree with this because you're assuming that Google "just training its systems with this data" doesn't mean that it will do "anything nefarious" with it.

Oh, yes, it totally will. That's the entire point. I don't see Google as much more ethical than say Facebook these days. They'll screw their users for an extra buck in the next quarterly results just as much.

Why do you think they're now helping China spy on their citizens? They didn't have to do it. But the call of (more) money is irresistible.

everbane · on July 12, 2019

>if the government is specifically looking for you, all bets are off anyway.

That's not really how it's done though. The government would pipe all the data into something like Palantir, not request individual data piecemeal.

danso · on July 11, 2019

How would you explain to an ordinary user why it's necessary for "your search will be recorded and preserved forever" in order to improve the search algorithm? As opposed to a limited timeframe?

godelski · on July 11, 2019

I'd also like to add that most people don't understand what can be done with all this data. If you start explaining even simple examples to people, they think you're a conspiracy theorist, unless they're already a conspiracy theorist, in which case they "already know". Take a simple example like the story of Target mailing coupons to a house with a pregnant teenage daughter. Many people still think that's impossible, or that the father had to be really dumb not to notice the girl was pregnant. As techies we look at that story and say "well duh, that's not even that difficult" while non-techies have a hard time comprehending how that's possible. This is a huge disconnect. So I'd argue that even though some people understand that they're giving up their data, they don't understand how it can be used.

My favorite example is just waiting for someone to mention how Facebook must be listening to them because they had a conversation in private with their friend about something, say an extremely new found interest that "they've never talked about before ever". I'll explain how the process works and how we can connect certain things together, how there is proximity, and knowing social groups and structures. That while the microphone would be useful, it isn't necessary for a good guess (and that it is a guess). I think most people here would say "well duhh" but try it with your relatives, see how crazy they think you are and that it has to be a microphone.

There's a big disconnect, this is a problem.

Ididntdothis · on July 11, 2019

"In some ways, this is missing the point. To people working in machine learning, this isn't a revelation. To ordinary people, it is."

The non-tech people I know assume that this data is used by highly qualified people in tightly secured areas and not by underpaid contractors somewhere in the world.

ben_jones · on July 11, 2019

Or to put it a little simpler: Way fewer units would've been sold if consumers knew third world contractors would be listening to their private home conversations. If the way you make your meat is messy, you have an obligation to share that at the dinner table if you are a good actor.

safog · on July 11, 2019

Why does it matter if it's Americans listening to it vs third world contractors?

Also Google and Amazon are large American companies that mostly try to do the right thing in terms of user privacy. I'd be more concerned about the random IoT microwaves and fridges made by a small company in China or Japan with their own assistant features built in sending audio to god knows where. At least I can reasonably trust the wakeword implementations for Alexa and Home; you can't even trust that with other h/w.

It again comes back to, yes, researchers listen to a carefully chosen random selection of audio clips and they're not really targeting a single user. It's impossible to explain that nuance in marketing material.

malvosenior · on July 11, 2019

> Also Google and Amazon are large American companies that mostly try to do the right thing in terms of user privacy.

Google has an atrocious record when it comes to user privacy. They're the people who put a hidden microphone in Nest thermostats for instance.

Someone1234 · on July 11, 2019

> To people working in machine learning, this isn't a revelation. To ordinary people, it is.

The product's tagline/marketing lead is literally: "Get answers from Google." Why would ordinary people who purchased a product to "Get answers from Google" be surprised when it sends the questions to Google to get answers from Google..?

You don't need to be an ML expert to read the product's marketing and use basic reasoning to conclude that in order to get answers from X then X will need to know the questions. That has nothing to do with ML/AI/speech recognition. It is just basic common sense.

> However, the fact that articles like this get traction shows that a substantial portion of ordinary people don't understand what privacy they're giving up when they use Google Home.

They gain traction in niche tech circles, where people pretend that the general public doesn't grasp that Google/Amazon/Apple is on the other end and claim they're defending others' privacy.

The traction isn't proof of anything, except a handful of people really dislike these conveniences (and use others' purported ignorance as justification for their whole argument).

danso · on July 11, 2019

You might need to talk to more ordinary people. Most people are very surprised at what data of theirs is stored, even for blindingly obvious features -- e.g. the list of everyone you've ever blocked on Facebook, which, obviously, is a list FB must store at some point in order to enforce it. Find an ordinary person and ask them to download their data from FB and Google and see if they aren't surprised.

Someone1234 · on July 11, 2019

Being surprised about things you've forgotten you did on a service, and reading a product's own marketing (and core functionality) aren't really comparable.

I just asked a non-tech person behind me if they thought that when you asked Google Assistant a question it was sent to Google to answer, and they shrugged and said yes like it was a stupid question.

What ordinary person doesn't understand that when you asked Google a question that Google knows the question you asked? It doesn't make sense.

danso · on July 11, 2019

I think ordinary people have a vague sense of how computer memory and computation works. Because online queries happen near-instantaneously (relatively speaking), people don’t assume there’s a need for this data to ever be stored, or be accessed by anything else besides the “Google” program at the point in time of the query.

Someone1234 · on July 11, 2019

Your point was that "ordinary people" don't grasp the concept that it is Google, or Amazon, or Apple that is providing the answers when they use a Voice Assistant. In spite of their entire marketing saying exactly that.

They don't need to understand "computer memory or how computation works" to grasp that the person/entity you ask a question to, has to know that question in order to provide an answer. In fact you could have never used a computer and grasp that concept.

You're trying to make this more complex to mask the fact it is a logically flawed premise.

danShumway · on July 11, 2019

"Being sent to Google" is absolutely the wrong way to phrase this question. Of course people understand that the device with the word Google written on it sends information to Google. That has nothing to do with the current story.

Better questions:

"Do you understand that when you ask Google Assistant a question, a human might listen to it and not just an AI?"

"Do you understand that the human might not be a direct employee sitting in a Google office -- that they might work for a 3rd-party company that Google just contracts out to?"

windexh8er · on July 11, 2019

There are three types of lies: lies of commission, lies of omission and lies of influence.

Just because nobody "asked" Google (your example) doesn't mean Google didn't lie (and/ or mislead) about it (lie of omission). Most people assume machines are doing the voice processing. The game changes when there's a human with a, potentially, subjective thought process to what they're hearing.

throwaway2048 · on July 11, 2019

The biggest self deception the industry has about computer algos and especially machine learning is that it is somehow objective.

cblades · on July 11, 2019

No one that gives it more than a passing thought could fail to realize that there has to be some human interaction. Someone has to do manual transcription to check against output for some portion of the data.

rayiner · on July 11, 2019

Why would ordinary people assume that this testing is done on live customer data? I understand that this is a common practice in the industry, but it's not the only way to do the testing, and I'd bet it would surprise most people.

RHSeeger · on July 11, 2019

I would think most people would think that they write software that can understand human speech, the same way they write software that can make a web page.

For the people that do realize it needs to be trained, most of them probably think that it's trained using internal data. Data created at Google or such.

Only people that really understand ML are likely to realize they need the scale of data only their customers can provide to do a really good job (_if_ they even do need that).

NikolaNovak · on July 11, 2019

That is fundamentally incorrect.

My 74-year-old father-in-law uses Google Assistant and Google Home way more than I do; they are a brilliant way for him to interact with technology.

He has absolutely no inkling that any of that leaves the little gray device in his kitchen, let alone any "Cloud" thing in the background, let alone that anybody is listening to it.

He would be genuinely, completely surprised, and is likely to be very uncomfortable, to be presented with any of these facts.

I will venture that he's more representative of an average user than you or I; as techies, we must consciously remind ourselves that most people are not techies, otherwise our view of society will be skewed in the extreme. :-/

windexh8er · on July 11, 2019

I don't buy that. If Google trains models with humans then that should be transparent. Data collection sorely needs policies to cover transparency of processing and disclose all of the forms in a very straightforward manner.

this_was_posted · on July 11, 2019

But is it also natural to assume that this is done without explicit consent? With telemetry data for applications you can usually disable this kind of inspection of your data.

davemp · on July 11, 2019

> No one* that gives it more than a passing thought

*who has an understanding of ML

rvnx · on July 11, 2019

the adjacent question is, are humans watching Google Nest cameras to train ML models ? the answer is not so clear

rando444 · on July 11, 2019

But we know how this revelation happened.

One of their "language experts" (subcontractors) shared the recordings with the news media.

He did this because the recordings he was concerned about were of people who didn't know they were being recorded or talking to the speaker.

This article and your comment manage to spin this entire story into Google talking points.

sp332 · on July 11, 2019

https://news.ycombinator.com/item?id=20402070 "People who install Google Home or Assistant are not advised that people can listen to the voice commands."

JLK_121416 · on July 11, 2019

It's probably a Belgian television news item that initiated this; it was all over the news today. +1000 clips were leaked to the reporter, who found some addresses in the clips, and went to these locations to confront the people living there. They also had some Google employees anonymously speaking about the things they hear: identifiable information, private conversations or activities... Nothing shocking really, if you are a bit into IT stuff, but apparently most people aren't.

https://www.vrt.be/vrtnws/nl/2019/07/10/google-luistert-mee/

inflatableDodo · on July 11, 2019

I'd love to see this tactic played out in divorce proceedings. - "I wasn't having a secret affair. You just never asked if I was sleeping around."

Medicalidiot · on July 11, 2019

Finding out how the sausage is made is uncomfortable.

vezycash · on July 11, 2019

If Google and its outsourced partner can listen to the recordings, 3 lettered agencies can too. Even if they currently can't, they'll soon make a law to grant them access to Google's and Amazon's.

After this, a hijacking, explosion, or terrorist attack would occur. They would argue that it could have been prevented if place X had a listening device installed. And that's all it would take to push a law mandating the installation of such devices in open spaces, eventually private ones.

UPDATE

On second thought, there's a faster way to the goal. Just make smartphones listen all the time.

ehsankia · on July 11, 2019

> If Google and its outsourced partner can listen to the recordings, 3 lettered agencies can too.

Just to be clear, if this were true, it would only apply to specific queries you make, not all audio data in your house. There is no evidence of any of these devices recording/sending voice data outside of when a query is going on.

okmokmz · on July 11, 2019

>There is no evidence of any of these devices recording/sending voice data outside of when a query is going on.

This isn't an example of a google device doing so, but there have been multiple cases of these "smart" home devices recording and/or sending audio when not prompted

https://techcrunch.com/2018/05/24/family-claims-their-echo-s...

cblades · on July 11, 2019

It was prompted, it's just that the match for the activation phrase was incorrectly identified.

That's not the same thing as the near-universally believed internet truism that these devices are always listening and sending recordings home.

okmokmz · on July 11, 2019

>It was prompted, it's just that the match for the activation phrase was incorrectly identified.

Yes, so it recorded and sent audio when the user did not want it to. That's not any better. It'd be like if your gun went off while holstered, and you said that it was prompted to fire because the hammer hit the bullet despite the trigger not being pulled. It's very clearly not the intended functionality, so trying to argue that the conversation recording/sending was prompted when there was no intent from the user to activate the device is a major stretch.

blueboo · on July 11, 2019

This analogy fails with the holster — as it is an explicitly deactivated state more aptly comparable to the voice assistants being muted/off.

Maybe you would be less surprised if the gun went off because you were waggling your finger inside the trigger guard rather than methodically squeezing it. The solution might be a higher trigger pressure threshold — akin to a more accurate voice trigger.

RHSeeger · on July 11, 2019

Mr Google Developer,

See here where you have it react and start phoning home when they say "Hey Google"? Right there, we're going to have you add some more terms; things like "jihad", "bomb", and a few others. We'll send you over the zip file of the full list later; it's only a few meg.

Yours Truly, TLA

Edit: I'm curious why the downvotes. It seems possible to me that the TLAs could have Google add more triggers for recording. If they did, we'd have no way of knowing. Sure, it's not likely, but it certainly seems possible.

UncleMeat · on July 11, 2019

Downvotes because the NSA isn't magic. We know how the legal boundaries of their power work from the Snowden leaks. It wouldn't be possible for them to force Google or whoever to do this.

RHSeeger · on July 11, 2019

The problem is that the Snowden leaks taught us that they completely ignore their legal boundaries on a regular basis.

UncleMeat · on July 11, 2019

That's not true at all. The key element of the Snowden leaks was that the legal boundaries were not where we expected them to be. The telephony metadata collection program, for example, was fully legal but had a way wider impact than people considered to be right.

RHSeeger · on July 12, 2019

There's some muddy water between - What the written laws allow (or don't disallow) and - What the constitution allows

I would say the various TLAs have been working outside "the law" insofar as what the constitution allows.

ehsankia · on July 11, 2019

Worth noting that you still get a visual cue, and you can also enable an audible cue. You can also see exactly what it recorded and delete it in the dashboard for both companies. That's exactly how most of these articles actually find out that an accidental trigger happens, by seeing it on the dashboard.

These are also fairly rare bugs, so it's still very far from "device listening to everything you do", which implies malice. If it was trying to spy on you, it wouldn't light up and make a beep every time it was listening... That sounds like very bad spying to me.

twmahna · on July 11, 2019

>> After this, a hijacking, explosion, or terrorist attack would occur. They would argue that it could have been prevented if place X had a listening device installed. And that's all it would take to push a law mandating the installation of such devices in open spaces, eventually private ones.

I agree that this hypothetical is an accurate depiction of the future, but I think eventually we'll see this as the better of two evils. As the pace of technology accelerates, the average crazy man is going to be capable of killing more and more innocent people. Eventually, creating a safe society is going to involve closely monitoring every citizen and weeding out bad actors. That, or, eventually, some crazy man is going to build a nuclear bomb in their garage and obliterate the world.

okmokmz · on July 11, 2019

The irony of an account named smallgovt making this comment.... Personally, I'd rather risk getting nuked by a crazy guy in his garage, as you say, than live in more of a police/surveillance state than we already do now

jefftk · on July 11, 2019

Depends on how high the risk is, no? If surveillance moved the risk from 90% to 10% I think most people would consider it worth it.

okmokmz · on July 12, 2019

malvosenior · on July 11, 2019

> On second thought, there's a faster way to the goal. Just make smartphones listen all the time.

The good thing about trying this is people would definitely notice the increased bandwidth usage of their phone streaming audio in real time, all the time.

cblades · on July 11, 2019

A large number of people already believe that devices and apps are constantly listening and sending recordings home, despite there being no evidence for that. It's a rampant belief supported by "I was talking to Jimmy about these shoes, and suddenly I got ads for those shoes!"

smattiso · on July 11, 2019

Not necessarily, speech to text engines are pretty decent so it's conceivable a listening device could just send the transcript.

crooked-v · on July 11, 2019

If the point is to train speech to text analysis, just sending the transcript doesn't help.

bluGill · on July 11, 2019

People would notice the battery drain though.

flattone · on July 11, 2019

People. Like 15% of hn users and the like

swiley · on July 11, 2019

Between the larger phone bill and the decreased performance and battery life I feel like most people would notice.

Not to mention most phones heat up when they stream anything.

hombre_fatal · on July 11, 2019

It's a shame this doesn't happen when people's IoT devices become botnet nodes on their "unlimited" wifi. It would totally change the DDoS market and IoT security for the better.

"0/5 stars. This product doubled my bandwidth bill."

swiley · on July 12, 2019

I will never ever ever go back to paying by the byte though.

No one writes sane enough software (or even freaking documents) to make that doable.

hombre_fatal · on July 12, 2019

I think it's necessary in the long run. It's the only way anything will ever change. And bandwidth is a finite, shared resource. You also pay per electron that flows through your blender and drop of water you drink from your faucet.

Right now the internet is basically built on a bunch of "gentlemen's agreements" to play fair, all the way from BGP down to ISPs ("please filter egress, pretty please!") to end-users ("please don't actually use all of your bandwidth because we're overprovisioned!").

Of course, the fee has to be tiny. For example, you don't reconsider hydrating yourself to save some money. But you may reconsider leaving your heater on 24/7 when you can just wear a jacket indoors. It would have to be priced similar to that. Though other things will have to change as well, like we'd need the ability to shop between ISPs in a region.

Most people are already paying per byte, but in the most roundabout, marketing-misled way, and I see that as a problem that only really helps ISPs. Imagine if it was a fully commoditized utility instead.

On the other hand, look at the effects of the status quo. People have so little insight into their usage that you can buy enough residential botnet egress to take down any site for $5 which empowers centralization like Cloudflare DDoS protection. I suppose you either don't see this as a problem or you can think of other solutions.

malvosenior · on July 11, 2019

I think everyone would notice the much higher data usage bill.

jascii · on July 11, 2019

Since said 3 letter agencies often work together with the telcos, I assume it wouldn't be too hard to keep that bandwidth outside of the user's bookkeeping.

jstarfish · on July 11, 2019

More and more carriers are moving to "unlimited"/throttled plans. The days of worrying about data usage will soon join the days of worrying about how many minutes you have left.

imglorp · on July 11, 2019

There have already been individual subpoenas of this device data. Only question now is, constant feed for fishing or targeted for individuals?

https://techcrunch.com/2017/02/23/alexa-free-speech

https://nest.com/legal/transparency-report

Personally I'm okay with court-ordered disclosures scoped to single individuals. There's oversight. But I'm not okay with unlimited constant data feeds for dubious "precrime" sweeps.

slg · on July 11, 2019

>On second thought, there's a faster way to the goal. Just make smartphones listen all the time.

Which is why all the complaints about these smart speakers have always seemed a little silly to me. Tapping into the audio of the phone we all carry with us everywhere is infinitely more valuable than tapping into a stationary speaker that is probably sitting in a room alone 90% of the time. It is also much easier to hide all the bandwidth associated with capturing all that extra audio since it has access to a network that the user doesn't completely control. Plus it has access to sensors like a camera and GPS that can reveal info that is even more private than audio.

TL;DR - If you don't trust Google's smart speakers, why are you trusting their phone software?

Slartie · on July 11, 2019

> It is also much easier to hide all the bandwidth associated with capturing all that extra audio since it has access to a network that the user doesn't completely control.

However, it is pretty hard to hide the large drain on the battery if your phone constantly streams audio to someone via cell network. And I'd say, since most people are on a volume limited data plan, streaming gigabytes of data will not go unnoticed - people will wonder why their budget is being eaten up so ridiculously fast.

In contrast, smart speakers usually are connected to the power grid and WiFi, which effectively means unlimited bandwidth and power for eavesdropping purposes. The only way in which this might be noticed is if someone sniffs network traffic and wonders why this speaker sends so much encrypted data. But I'd argue that the probability for this case is much lower than the probability of raised eyebrows because of the aforementioned cell data plan being drained constantly.

slg · on July 11, 2019

The whole premise of the original line of thinking is that corporations are lying to you and/or they are compromised by the government or some other malicious entity. All bets are off once that is accepted as a possibility. Who's to say that the bandwidth used by the compromised phone will show up on your account? Furthermore why should we trust the battery percentage displayed on the phone? If the phone is lying to you about the microphone being on, maybe it is lying to you about the current battery percentage or what specifically is draining your battery.

The phone also doesn't need to constantly be listening and streaming audio 24/7. Phones generally have a lot more processing power available to them than smart speakers. Maybe some of the audio processing is done on the device. Maybe the phone is only recording and waits to process and upload the data once it is charging and the battery drain is easier to hide. There are countless ways to try to mask this type of activity.

hyperman1 · on July 11, 2019

Let's see. According to https://www.lifewire.com/megabytes-for-one-minute-conversati... , VoIP takes around 240 KB/minute, which is about 10 GB/month.

Lowering quality and simply not recording when no one talks should bring that down seriously, say to 1 GB/month.

Which is a lot today, but prices are falling rapidly.

If it's mandatory by law, implementing it in hardware would make sense, which will also lower the battery drain.

I don't like what I'm saying here, but recording and sending everything by law seems technically viable in the near future to me.
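
For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. It assumes the quoted 240 figure means kilobytes per minute, a 30-day month, and decimal gigabytes; the `duty_cycle` parameter is a made-up knob standing in for "lower quality and skip silence", and 0.1 is only a guess that happens to land near 1 GB.

```python
# Assumption: the quoted VoIP rate is 240 kilobytes per minute of audio.
KB_PER_MINUTE = 240
MINUTES_PER_MONTH = 60 * 24 * 30  # 30-day month


def monthly_gigabytes(kb_per_minute, duty_cycle=1.0):
    """Estimate monthly upload volume in decimal GB.

    duty_cycle is the fraction of the full-rate stream actually sent
    (hypothetical knob covering lower bitrate and skipped silence).
    """
    total_kb = kb_per_minute * MINUTES_PER_MONTH * duty_cycle
    return total_kb / 1_000_000


if __name__ == "__main__":
    print(f"{monthly_gigabytes(KB_PER_MINUTE):.1f} GB")       # prints "10.4 GB"
    print(f"{monthly_gigabytes(KB_PER_MINUTE, 0.1):.1f} GB")  # prints "1.0 GB"
```

So getting from roughly 10 GB to roughly 1 GB needs about a tenfold reduction from compression and silence skipping combined.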

maemilius · on July 11, 2019

People have to charge their phones eventually. The easy way to do this is to store the recordings locally and only transmit when the phone is already charging.

If we're already considering government intervention, why wouldn't they just require that internet providers of all kinds _not_ meter data sent to them?

Unless our hypothetical government agency _really_ needs real-time data, there are plenty of ways they can mask their activity in a way that users won't ever notice.

newsbinator · on July 11, 2019

If the mobile network is in on it, one's only flag is the battery drain.

Apple could add a "microphone actively/recently in use" icon to the top of the phone.

jascii · on July 11, 2019

There is still such a thing as physics: having a continuous audio stream would deplete the target's phone battery in short order, potentially alerting the target that "something's fishy". A plugged in device would not have such problems.

dev_dull · on July 11, 2019

Is this honestly surprising to any developer? How can you develop a product like Siri/OK Google without being able to look closely at edge cases? And if you have to record conversations to troubleshoot even extremely rare edge cases, you still end up with a system that allows this eavesdropping.

stefan_ · on July 11, 2019

You recruit people for testing and obtain their informed consent?

How do people here think, say, medicine or research is developed?!

thfuran · on July 11, 2019

I think medical software works on pretty much exactly the model the parent comment describes, where software developers commonly have access to (anonymized) patient data and less commonly and less readily to unanonymized patient data.

davesmith1983 · on July 11, 2019

I worked for a year with Patient Medical Data in the NHS in the UK. Typically it is as simple as a script to replace patient names after downloading a live backup. You then delete the backup of the DB that has the real names.

That is pretty much it.
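A minimal sketch of what such a name-replacement step might look like, not the actual NHS script. The record layout, the `patient_name` field, and the `pseudonymise` helper are all hypothetical, and it only touches the name field, so anything else identifying in a record is left as-is.

```python
import itertools


def pseudonymise(records, name_field="patient_name"):
    """Return copies of the records with real names swapped for stable placeholders.

    The same real name always maps to the same placeholder, so records
    belonging to one patient can still be linked after the swap.
    """
    counter = itertools.count(1)
    placeholders = {}
    cleaned = []
    for record in records:
        real_name = record[name_field]
        if real_name not in placeholders:
            placeholders[real_name] = f"Patient {next(counter):05d}"
        # Build a new dict so the original record is not modified.
        cleaned.append({**record, name_field: placeholders[real_name]})
    return cleaned


# Hypothetical rows standing in for a restored backup.
sample = [
    {"patient_name": "Alice Example", "ward": "A"},
    {"patient_name": "Bob Example", "ward": "B"},
    {"patient_name": "Alice Example", "ward": "C"},
]
scrubbed = pseudonymise(sample)
```

After this runs you would work from `scrubbed` and delete the copy that still holds the real names, as described above.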

UncleMeat · on July 11, 2019

And now you don't have broad coverage of accents.

dzader · on July 11, 2019

That's because medical research has real effects on people. Someone listening to you ask Google to turn the lights on doesn't matter. At all. (Save the "omg, well, they might hear something important": no they won't, and no, the person on the other end doesn't give a shit even if they do.)

burkaman · on July 11, 2019

https://www.wired.com/2017/02/murder-case-tests-alexas-devot...

mherdeg · on July 11, 2019

Well, you could get your test data from paid participants who consent to constant whole-home recording. Like the "Nielsen Families" of old.

This would be worse data than just using all available inputs to train your model, but it's an alternative. There are alternatives to storing all input data for later training; they likely produce worse models and cost more, but they are available.

RHSeeger · on July 11, 2019

The same way people debug things like web pages and applications that don't call home. You get reports of issues, you reproduce it locally, and you fix it.

watwut · on July 11, 2019

I work on software without having access to customers' data. Where I can access it, it is with their consent and a signed NDA.

Wtf. Of course software can be made without listening to private data of customers.

lota-putty · on July 11, 2019

Tangent: I used to work for a Telecom S/W company a decade ago, SMS/MMS/USSD products. We were privy to SMS/MMS exchanged between customers.

PKI is a must these days. Albeit, anything that connects to the internet can now be monitored in some way or other, no?

Rapzid · on July 11, 2019

Doesn't seem much different from the disclaimer given when calling into a company that the conversation is being recorded for "QA and training" purposes (more like performance and CYA haha).

I wonder how sensationalized people will get with the headlines for click-bait purposes. While "recordings" is technically correct, "interactions" may be a slightly more precise word to use in the headline. I'm imagining a ton of headlines designed to make people believe Google Home is making random recordings of home conversations and letting people listen to them.

jdsully · on July 11, 2019

I don’t recall any google home device informing me that conversations may be recorded. Informing the user is the difference - not the actual recording.

davesmith1983 · on July 11, 2019

It is probably buried in their T&Cs.

jdsully · on July 11, 2019

Since I only see them at friends' houses - I'm not a party to any of those contracts.

davesmith1983 · on July 12, 2019

Yes, however, it is their property and their device, so there is implicit consent.

jdsully · on July 12, 2019

I could consent to it using AI to answer. But how could I consent to it recording me when I was never informed?

colpabar · on July 11, 2019

I'd say the bigger difference is that a phone recording is initiated by making a phone call. Aren't Google Home devices listening 24/7/365 so they can know when to respond?

jdsully · on July 12, 2019

They are only listening locally - no data is sent until the control word is said. However I don't really mind sending the data to Google either, the trouble is with the undisclosed recording.

bork1 · on July 11, 2019

I'm struggling to figure out how they have a Security and Response team to deal with the fallout of these issues without having enough privacy/security/customer-focused developers/product folks to proactively bring up these concerns. Google _seems_ like the type of company to do at least a little bit of risk modeling before the release of software. If they knew they were going to listen to recordings, how did this concern not get brought up? If it was brought up, did folks just decide it wasn't important enough to protect against?

jhayward · on July 11, 2019

They have the security and response team activated because someone disclosed that they do this, not to investigate the fact that they do it. They're there to plug the leak.

azinman2 · on July 11, 2019

Except the privacy policy always warned about this. Everyone doing speech recognition is doing this — you have to in order to get any kind of QA.

The caveat is that it should be both anonymized and recorded only in response to the wake-up command. It seems to be both, so I don’t see the problem.

Actually I do — the editorializing of these headlines makes it seem nefarious when it’s not.

vokep · on July 11, 2019

If the privacy policy was written in a language I could read (just because it's English doesn't mean it's readable English) then maybe I would have known that.

gibba999 · on July 11, 2019

It is pretty nefarious. In traditional research and product development protocols, you would have people opt into something like this, and optionally pay them for it.

If Google gave out a hundred thousand Google Home units for free to test subjects, with informed consent, there would be no big deal. It would cost Google $2.5 million, and it'd probably be enough data.

If my web site policy discloses "I may randomly send a thug to your house to shoot your children," and you come, visit, click through the license which warned you, and then I shoot your family, that doesn't mean I'm not doing something super-evil.

Google seems to be doing something super-evil here. Their response -- plugging the leak -- seems equally evil. People have a right to know what's being done with their data, and at least under European law, Google has a legal and ethical obligation to disclose things like this in language people can understand.

GDPR is rather well-written here. It looks like Google is breaking it, and currently trying to shoot the whistle-blower.

Thank you whistle-blower!

dataflow · on July 11, 2019

> If my web site policy discloses "I may randomly send a thug to your house to shoot your children," and you come, visit, click through the license which warned you, and then I shoot your family, that doesn't mean I'm not doing something super-evil.

You kinda had me until you lost me here. Analogies need to make sense. If you have to go this far with your analogy then that says more about your own argument than the other side's.

vharuck · on July 11, 2019

>If you have to go this far with your analogy then that says more about your own argument than the other side's.

I never got this argument. In mathematical proofs, reductio ad absurdum is an acceptable method of showing an assumption false. It shows that a statement ("Users agreed to TOS, so it's not malign") has an exception. The example is extreme to make sure nobody can argue the statement's still valid.

He's not saying the punishment should be on par with murder. He's just saying there is a line of moral acceptability, but where it lies is up for debate.

saagarjha · on July 11, 2019

You're missing the point, which is that you can slip anything into a privacy policy or other long agreement, no matter how outrageous it may be, and nobody will read it. Putting anything there does not make it ethical or legally binding.

UncleMeat · on July 11, 2019

It also doesn't make it unethical. Putting privacy related issues in a privacy policy makes sense to me.

gibba999 · on July 11, 2019

A privacy policy is definitely the right place for privacy issues. My point is exactly as vharuck made above: Putting something there makes it neither ethical nor unethical. A contract or license is not an excuse for bad behavior.

* If my privacy policy is a copy of HIPAA, that's an ethical privacy policy.

* If my privacy policy is as Google's here, it seems unethical without clear informed consent (which a disclaimer in a novel-long privacy policy doesn't provide).

* If your privacy policy says you'll collect incriminating information about me, and sell it to the highest bidder for use in blackmail, it's unethical even with attempts at informed consent.

saagarjha · on July 11, 2019

Putting it in a privacy policy that you expect nobody to read, and using language that they are not accustomed to, is unethical.

gibba999 · on July 11, 2019

You're confusing an analogy with a counterexample.

Analogies need to be analogous. Counterexamples can be extreme (and it is often helpful if they are; then they're obvious counterexamples).

Please take a minute to reread the discussion.

Coincidentally, I've noticed a pretty consistent pattern of downvotes on anything criticizing Google on Hacker News. Either a lot of readers from Google who drank the Kool-Aid, or astroturf -- I'm not quite sure which.

azinman2 · on July 12, 2019

What’s specifically super evil about humans transcribing random and anonymous commands to the google assistant? They’re hired and expected to be professional with their own contractual agreements around their own behavior and ethical standing.

Literally all the major companies in speech rec (aka assistants) do exactly this. The accuracy of the speech models would be extremely poor otherwise.

maccam94 · on July 11, 2019

Come on, I'm sure Google's privacy policy allows them to listen to audio with no metadata in order to improve their service. The team is responding to the public leak of the audio, which is a violation of Google's privacy policy.

jhayward · on July 12, 2019

How does that contradict what I wrote above?

maccam94 · on July 15, 2019

> the security and response team activated because someone disclosed that they do this

They're not chasing down a whistle blower for notifying the public that human transcription takes place. That information was already in the public domain in Google's privacy policy. The team is investigating the source of the leaked audio files, which was a violation of user privacy.

mda · on July 11, 2019

What do you mean? All tech companies train their algorithms by annotating real world samples. It was not a secret.

mcbuilder · on July 11, 2019

I think that the outrage doesn't come from supervised learning, rather that it's contracted out to a third party and seems to be done in an irresponsible way. I think that the fact that most of the public that uses these devices would be surprised by the fact that their voice is being recorded and transcribed is the irresponsible part. Of course an ML engineer is going to want to go the route of human labeling the audio data, and these folks seem to have won. You can blame the public for being uninformed, but it's new technology, and most aren't going to have read up on the methods used, even if it's no secret. Many have suggested other opt-in methods, which would also provide (arguably less real world) data. I think that many would prefer to trade a less accurate service for not having strangers listen to their conversations.

Mystrl · on July 11, 2019

Honestly what risk is there? Outside of some internet echo chambers no one really cares and all this will be forgotten by tomorrow if it even takes that long.

danbruc · on July 11, 2019

Only 0.2 %? 1 out of every 500? That seems like a lot to me, especially given that there must be millions if not billions of interactions. How many of those things are out there? And how many interactions does the average user perform? And they keep them all? Forever? I could probably find the numbers myself or at least estimate them, I just don't care enough. I would however be happy to learn them if someone happens to know them.

danbruc · on July 11, 2019

In response to a deleted comment that said the following. [2]

Google is not training one language model, but many of them (I'd estimate ~70 language models from the voice settings menu on my phone). So 0.2% in total doesn't sound too unrealistic to me as this should be closer to 0.002% per language.

There is a large variation in the number of speakers between different languages. What would they want to do? Aim for the same number of training points for each language? Then for a language with 20 times fewer speakers - Thai compared to English - they would have to look at 4 % [1] of all interactions in Thai. Add to this that the distribution across languages is most likely very skewed, i.e. languages spoken in poorer regions of the world have a lot fewer users than languages spoken in richer regions.

Or maybe they want more training points for more frequently used languages, then, if they aim for a number of training points proportional to the number of interactions, every interaction has a 0.2 % chance of being used as a training sample regardless of the language. If you perform two interactions per day - and I will happily admit that I have not the slightest clue whether this is even on the right order of magnitude, I have never used any such system - then you reach 500 interactions within one year, which means that after one year of usage you have a reasonable chance that at least one of your interactions has become a training point.
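To put a number on "reasonable chance", here is a small sketch. It assumes each interaction is sampled independently at a flat 0.2 %, which is my simplification and not something Google has stated; the function name is made up.

```python
def chance_of_at_least_one_sample(sample_rate: float, interactions: int) -> float:
    """Probability that at least one of `interactions` is picked,
    assuming each is picked independently with probability `sample_rate`."""
    return 1.0 - (1.0 - sample_rate) ** interactions


SAMPLE_RATE = 0.002  # the 0.2 % figure from the article

# Equal-sized training sets: a language with 20 times fewer speakers
# would need a 20 times higher rate.
smaller_language_rate = SAMPLE_RATE * 20

# After 500 interactions, and after a full year at two interactions per day.
after_500 = chance_of_at_least_one_sample(SAMPLE_RATE, 500)
after_one_year = chance_of_at_least_one_sample(SAMPLE_RATE, 2 * 365)

print(f"rate for the smaller language: {smaller_language_rate:.1%}")
print(f"after 500 interactions: {after_500:.1%}")
print(f"after one year at 2/day: {after_one_year:.1%}")
```

Under that independence assumption, 500 interactions gives a bit under a two-in-three chance of having been sampled at least once, and a full year at two per day gives roughly three in four.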

[1] Probably not actually true because due to the large number of English speakers the percentage for English would most likely be less than 0.2 % but right now I can not be bothered figuring out the correct numbers.

[2] Meta question - would this generally be considered acceptable without naming the user that made the comment? Or should deleted be deleted?

scarejunba · on July 11, 2019

I have a Google Home in every room of my home. This only records after the trigger word, right? And it isn’t tied to my account? Well, then, I’m fine with it.

nocturnial · on July 11, 2019

If you want raw data: out of the 1000 audio recordings the journalists could get their hands on, 153 were recorded without the trigger word.

Google provided those audio recordings to language experts for transcription without account information. The journalists managed to track down several people using only the audio and confronted those people with the audio. They confirmed it was indeed their voices.

scarejunba · on July 11, 2019

Like for the context-keeping? I know it’ll record post-response without an additional trigger phrase instance and that’s fine.

nocturnial · on July 11, 2019

It mentioned it was probably due to misinterpreting some words as the trigger phrase. They formulated it as "any word that remotely sounds like google could trigger it"

scarejunba · on July 11, 2019

Oh, that’s unfortunate. I’d definitely want them to first have humans view the trigger phrase time before they okay viewing the remainder.

punnerud · on July 11, 2019

As long as you don’t enable the option that lets Google Home listen for some seconds after you are finished. This is for follow-up questions without the trigger, but then other voices are also recorded for a while after.

Hope they don’t listen without the trigger word. Easy to check with Wireshark (or similar) if they are sending enough data to equal sound transfer. Anybody checked this?
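A rough sketch of how one might eyeball this once you have per-second upload byte counts for the device, however you export them from your capture tool. The threshold reuses the 240 kB/minute VoIP figure quoted elsewhere in this thread (about 4,000 bytes/second), which is an assumption, and the sample numbers are invented. Since the traffic is presumably encrypted, volume is all this looks at.

```python
# Assumed threshold: 240 kB/minute of voice audio is 4,000 bytes/second.
AUDIO_BYTES_PER_SECOND = 240_000 / 60


def longest_audio_like_run(bytes_per_second, threshold=AUDIO_BYTES_PER_SECOND):
    """Length in seconds of the longest stretch at or above the threshold."""
    longest = current = 0
    for count in bytes_per_second:
        if count >= threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Invented per-second upload counts: one quiet period, one with a spoken query.
idle_capture = [120, 0, 80, 300, 0, 0, 150, 90]
query_capture = [100, 5200, 6100, 4800, 5000, 200, 0, 0]

print("idle:", longest_audio_like_run(idle_capture), "s above threshold")
print("query:", longest_audio_like_run(query_capture), "s above threshold")
```

Long audio-sized runs when nobody has said the trigger word would be the thing to look for. A device that buffers locally and uploads later would not show up this way, so a clean result is only weak evidence.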

blueboo · on July 11, 2019

Is the implication that we as a society would be horrified by a “Chinese room” voice assistant — a human responding to voice commands and performing an assistant’s tasks?

A service whereby this function is automated 100% of the time and very rarely requests are transcribed for QC by trusted partners (a failure point apparently) seems .. reasonable?

A4ET8a8uTh0 · on July 11, 2019

After the Nest hidden microphone debacle I sent a message to 'my' senator (this time I will send a postcard, I guess). I got a non-answer after a month. How much money do you need to own a senator? I am not even joking. There has to be a way to crowdsource this.

pikapikamtf · on July 11, 2019

surely no one is surprised by this? why would you put a device in your home that is constantly listening to you? crazy.

blueboo · on July 11, 2019

Convenience outweighs infinitesimal downside

dvaun · on July 11, 2019

The posted article did cite their sources; the original report was made by vrtNWS [1]. I recommend reading the discussion which occurred yesterday for an array of arguments [2].

-----

One thing I personally believe, and find apparent about these discussions revolving around data privacy, is that the general public (at least in the USA) doesn't understand the capabilities of modern apps and other tech they use. Sure, there's news coverage of tech giants and adtech firms mining/utilizing user data for profit; the picture painted, though, seems to only show a black box to most people who aren't involved with the tech industry, and who don't have the background to really understand certain concepts (e.g. networking, ML, etc).

There are definitely growing concerns about personal privacy [3][4]; continuing forward, however, the availability of user data and the growth with social media will probably rise and have an even larger impact on society as my generation (and younger ones) has already been tied into using these platforms [5].

I'm hopeful that this can be changed with changes to the education system. I would love to see computer education added as a general topic covered (ie it would be a very clear example of how useful math is...it would help others understand why they should learn beyond add/sub/div etc).

There is already a push to modernize schools and prepare students for the adult world; chromebooks are very common in schools that receive grants and funding for tech initiatives. Home econ, carpentry, and other courses used to be common (maybe still common in some places?) in highschool. How about computer education, specifically?

[1]: https://outline.com/2PmPtH
[2]: https://news.ycombinator.com/item?id=20402070
[3]: https://www.pewresearch.org/fact-tank/2018/03/27/americans-c...
[4]: https://www.pewresearch.org/fact-tank/2018/09/05/americans-a...
[5]: https://www.pewresearch.org/fact-tank/2019/04/10/share-of-u-...

bprasanna · on July 11, 2019

Should we be surprised?