Ask HN: Does Google use the text inside gdocs and Gmail for training AI models?
66 points by tikkun on Feb 15, 2023 | 71 comments
Google searches = used for training AI models

Apple notes = private

Google docs = ?

Siri requests = used for training models

Emails you send in gmail = ?

I'm seeking to understand which things people might think are private (because they're not posted on the open web) but are nonetheless used for training AI models.




I don't know how everyone else approaches it, but I just assume that anything uploaded to Google's servers will be snooped on using some legal loophole or another. I don't have time to study the TOS for every user-hostile application in the world, so guilty until proven innocent is the only sane position I can see to take.


This. And not just Google's servers, but anyone's servers.


Large IT departments generally look for third party certification that their data will be handled in some defined way before using a service.

https://workspace.google.com/learn-more/security/security-wh...

https://learn.microsoft.com/en-us/compliance/regulatory/offe...

https://compliance.salesforce.com/en/soc-2

https://slack.com/trust/compliance


Right, you don't have to trust the good faith and intentions of these huge companies, just that they don't want to get sued to hell and back by the other huge companies that rely on their products. You can't trust them to do (or not do) anything that they haven't legally promised to do (or not do), but outright lying isn't as profitable as people think.


Big companies generally do stick to their ToS.



Yes, and you know when they've stopped following it, because they send you an email saying "we've changed our ToS" [after our legal department learned what our engineers have been doing all year].


> Big companies generally do stick to their ToS.

I am the co-founder of a small-by-design company offering cloud services, and we spend a lot of time making sure that our ToS reflects what happens operationally. We also spend money on it, since we review it with a lawyer specialised in open source software. There are some other entities from chatons.org that care about such documents as well.


I know we all assumed this when Gmail was launched; I wonder how that flies under GDPR?


Maybe we need an open source effort to track and simplify popular services' TOS.

TLDR TOS.


https://support.google.com/docs/answer/10381817 states:

> Google Docs, Sheets, & Slides uses data to improve your experience

> To provide services like spam filtering, virus detection, malware protection and the ability to search for files within your individual account, we process your content.

So yes, they do use your data for things like training AI. However, it does not seem to be for general AI like Bard, but for ML systems within the product in question.


Well, it gets a bit muddy, though, if you factor in the Workspace terms, which offer a completely different set of processing agreements.


Workspace data is siloed off and treated separately.


If you're talking about Google Bard, they were very clear in the LaMDA 2 paper that they only used public sources.

"...from public dialog data and other public web documents..."

LaMDA 2 paper: https://arxiv.org/abs/2201.08239

My overview of Google Bard including dataset: https://lifearchitect.ai/bard/

My overview of Google PaLM and Pathways family including dataset: https://lifearchitect.ai/pathways/

Compare with other models including the use of DeepMind's MassiveWeb/MassiveText and EleutherAI's Pile dataset: https://lifearchitect.ai/whats-in-my-ai/


No it does not, it would be irresponsible to do that on private data. There's a very clear line between data posted publicly and data held privately, especially in terms of copyright. I doubt it will ever be default opt-in for something as sensitive as e-mail and docs.

One exception to that is scanning for CSAM and Terrorism and DMCA. And with DMCA, it's automated based on file hash, and you still maintain access to your files; you are just limited from sharing them. Ads in Gmail aren't based on content, but on other online activity while logged in.
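
For anyone curious, "automated based on file hash" is conceptually simple. Here's a minimal, hypothetical sketch (placeholder blocklist value and made-up function names; obviously not Google's actual pipeline):

    import hashlib

    # Hypothetical blocklist of SHA-256 digests (the placeholder below is just
    # the hash of empty data); a real system would use a large curated database.
    BLOCKED_HASHES = {
        "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    }

    def file_sha256(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def is_sharing_restricted(path: str) -> bool:
        # The owner keeps access to the file; only sharing would be limited.
        return file_sha256(path) in BLOCKED_HASHES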

I think the other exception to that is smart compose. AI models do use email content for training data, but the output of those models is strictly for use locally while writing emails. I imagine it's also siloed per user.

EDIT: Not a google employee, I apologize if my assertions seem too strong.

EDIT2: https://en.wikipedia.org/wiki/Federated_learning
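
To make the federated learning link concrete, here's a toy sketch of the federated-averaging idea in plain NumPy; the linear model and names are made up for illustration, and this is not Google's actual Smart Compose training setup:

    import numpy as np

    def local_update(weights, x, y, lr=0.1):
        # One gradient step of least-squares regression on a single user's data.
        grad = x.T @ (x @ weights - y) / len(y)
        return weights - lr * grad

    def federated_round(weights, user_datasets):
        # The server only ever sees the locally computed weights, never the
        # raw (x, y) data, and averages them into the shared model.
        updates = [local_update(weights, x, y) for x, y in user_datasets]
        return np.mean(updates, axis=0)

    # Toy data standing in for per-user, on-device datasets.
    rng = np.random.default_rng(0)
    users = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(5)]
    w = np.zeros(3)
    for _ in range(10):
        w = federated_round(w, users)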

"We have always maintained that you control your data and we process it according to the agreement(s) we have with you. Furthermore, we will not and cannot look at it without a legitimate need to support your use of the service -- and even then it is only with your permission. Here are some of the additional measures we take to ensure your privacy: (reference: GCP Terms).

In addition to these commitments, for AI/ML development, we don’t use data that you provide us to train our own models without your permission. And if you want to work together to develop a solution using any of our AI/ML products, by default our teams will work only with data that you have provided and that has identifying information removed. We work with your raw data only with your consent and where the model development process requires it. "

https://cloud.google.com/blog/products/ai-machine-learning/g...

https://support.google.com/mail/answer/6603?hl=en

https://arxiv.org/abs/1906.00080

https://ai.googleblog.com/2017/04/federated-learning-collabo...


You mentioned it would be "irresponsible to do that on private data", but it seems Google was doing exactly that until at least 2017. Or am I incorrect about that?

"Google Will Keep Reading Your Emails, Just Not for Ads" - https://variety.com/2017/digital/news/google-gmail-ads-email...

Edit: Updated comment. Misread it as being posted from somebody working at Google.


The person you are responding to spoke assertively and quoted Google websites, but I don't think they work or worked for Google.


We have no evidence either way. And referring to someone who's weirdly in favor of a multi-billion-dollar corporation over the rights of individuals as at least an agent of that group seems like appropriate verbiage.


Thanks. I will rephrase my comment.


> it would be irresponsible to do that on private data.

If they do it on searches not meant for publication then are they irresponsible?


People want spam filtering, so of course they will read your emails.


>No it does not, it would be irresponsible to do that on private data.

Doing irresponsible things on private data is the hot business model of the day. I'm not saying it's Google; I'm saying common expectations about "responsibility" are worse than useless.

>We have always maintained that you control your data and we process it according to the agreement(s) we have with you.

Ah the "we surveil you fair and square, get over it" clause.


Nothing in the quotes you posted precludes them from using my emails to train AI. They say things like "we process your data according to the agreements we have with you," but that isn't a denial. A denial would be "we don't store or process your data to train AI".

They're just implying they don't process your data, while actually saying "the answer to your question lies in the text of the service agreements". Evasive at best.

In the absence of a denial, and the presence of an obvious motive to do so, I have to say my guess is they do use your gdocs and gmail data to train AI.


Idk why you posted the GCP terms of service as evidence. I also don't think Google uses emails and docs, but there is definitely a higher bar with company data than with normal user data.


> I think the other exception to that is smart compose. AI models do use email content for training data, but the output of those are strictly for use while writing emails. I imagine it's also siloed per user.

So if there is an AI model for each user, trained on the user's writing, does that mean Google now also has the means to forge convincing emails?


>especially in terms of copyright

Not true in general. Something I write in a diary I keep under my pillow has the exact same copyright status as something I publish on my blog (assuming no Creative Commons, etc., licenses).


> One exception to that is scanning for CSAM and Terrorism and DMCA

It's enough to have one exception. We have to assume that a language model will be trained on that data and that government officials will be using it soon. Just think about how much it could tell them about the next planned terrorist events and their organizers.


> scanning for CSAM and Terrorism and DMCA

There is no such thing as "scanning for DMCA". DMCA prescribes a process for reacting to complaints of copyright infringement, not a process for preemptively scanning content for material that might potentially generate such complaints.


>No it does not, it would be irresponsible to do that on private data. [...] One exception to that is [describes two exceptions] I think the other exception to that is [...]

kek


tl;dr: officially, of course not. You trust us, right?


Email contents are used to generate a model for Smart Compose in Gmail [1]. I assume that Google Docs works similarly.

[1] https://arxiv.org/pdf/1906.00080.pdf


Yes, in a similar way to using voice data from Android and Home devices to train speech-to-text models.


Well, since Google offers an auto-complete AI, I'd assume any text Google has (or at the very least, text that can be auto-completed) gets fed into their AIs. I have no evidence for this, and I should really read the ToS sometime, but I digress.


The Smart Compose paper does say that it was trained on user emails, and that each user gets a model trained for their personal style.


A related question: are the 40+ million full-text books [1] used?

OpenAI is using the book1 (BookCorpus?) and book2 sources. By the number of tokens, this seems to be less than a million books in total.

[1] https://www.blog.google/products/search/15-years-google-book...


It's "BooksCorpus" (with an 's'), a 800M word dataset described in Zhu et al. (2015) IEEE ICCV, and also available on AWS at: https://aws.amazon.com/marketplace/pp/prodview-d3ghxqzkitn6y

The Google BERT paper (Devlin et al., 2018) also references it: https://aclanthology.org/N19-1423/

Privacy questions aside (as important as they of course are), it's very important to know what a model was trained on exactly: if Wikipedia was used in the training set, you can't use questions from Wikipedia to test it (as that would be cheating) - test data must be as "unseen" as a good exam.
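
To make the contamination point concrete, here's a rough sketch of the kind of n-gram overlap check used to flag test questions that were probably "seen" in training (toy helpers and toy data, not any lab's actual dedup pipeline):

    def ngrams(text, n=8):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def contamination_score(test_question, train_ngram_set, n=8):
        # Fraction of the question's n-grams that also occur in the training text.
        q = ngrams(test_question, n)
        return len(q & train_ngram_set) / len(q) if q else 0.0

    # Toy usage: a high score suggests the "exam question" was in the training set.
    train = ngrams("the quick brown fox jumps over the lazy dog " * 3)
    print(contamination_score("the quick brown fox jumps over the lazy dog today", train))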


Actually, it's 'BookCorpus'. OpenAI spelt it wrong in their GPT-1 paper.

It has also been analyzed here and here:

https://arxiv.org/abs/2105.05241

https://lifearchitect.ai/whats-in-my-ai/


About emails:

No for ads.

Yes for training models for Smart Reply, from: https://arxiv.org/pdf/1606.04870.pdf

> Privacy: Note that all email data (raw data, preprocessed data and training data) was encrypted. Engineers could only inspect aggregated statistics on anonymized sentences that occurred across many users and did not identify any user. Also, only frequent words are retained. As a result, verifying model’s quality and debugging is more complex.
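
For intuition, the "aggregated across many users / only frequent words" part roughly amounts to a k-user threshold before anything enters the shared vocabulary. A toy sketch of that idea (my guess at the general approach, not the paper's actual pipeline):

    from collections import defaultdict

    K = 5  # minimum number of distinct users (placeholder threshold)

    def build_shared_vocab(emails_by_user):
        # emails_by_user: {user_id: [email_text, ...]}
        users_per_token = defaultdict(set)
        for user_id, emails in emails_by_user.items():
            for email in emails:
                for token in email.lower().split():
                    users_per_token[token].add(user_id)
        # Keep only tokens seen across at least K distinct users, so rare,
        # user-identifying strings never enter the shared training vocabulary.
        return {tok for tok, users in users_per_token.items() if len(users) >= K}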


A different question should be asked:

Do their Terms of Service, etc., allow them to use that text for training models?


Call me crazy, but I don't think I'd trust a company like Google to necessarily follow their own TOS, or to not try and craft a vague TOS that gives them a lot of legal flexibility to do most of what they want.


I would assume so; all email content is surfaced for serving ads, so why not AI training?


That's not true; the ads shown are related to your online activity, not the content of your mail messages. See https://support.google.com/mail/answer/6603?hl=en

"We will not scan or read your Gmail messages to show you ads."



> This article is more than 9 years old


Google says that email content is not used to show ads

https://support.google.com/mail/answer/10434152


Google was sued about 8 years ago and stated that it uses emails for "tailored experiences": https://arstechnica.com/information-technology/2014/04/googl...


Your premise is incorrect: Email contents are not used for targeting ads.


But...how do you (we) know?


Because they've made public statements to that effect, including in the privacy policy.

Now, your next question is presumably going to be "why don't they just lie?".

First, the lie would be very expensive. What do you think privacy regulators around the world would think of it? What do you think lying to customers about how their data is used would do to their $25 billion/year enterprise business?

Second, the lie would get discovered very quickly, because somebody would leak it. Plenty of much less serious things seem to leak weekly.

Third, there's very little to be gained by using the data and lying about this compared to the alternatives (say nothing; tell the truth; don't use the data). "Use the data but lie about it" is the absolute worst strategy possible.


You have a great deal more trust in Google than I. I think that if Google lied about this and got caught, the end result wouldn't be that damaging to Google.

Not saying they are lying, of course, but I don't think the incentives against it are as strong as you do.

But then, I'm skeptical enough about privacy policies that I've stopped reading them. The vast majority of them, including the ones I've read from Google, leave more than enough room for the companies to do pretty much anything they want with your data.


But this isn't a situation where the privacy policy is vague or ambiguous, or leaves room for them to do anything they want.

From https://policies.google.com/privacy?hl=en-US

> We don’t show you personalized ads based on your content from Drive, Gmail, or Photos.

It is a direct and unambiguous statement in plain language saying that Gmail content is not used for ad targeting.

Given your objection was based on an invalid premise, are you now willing to change your opinion?


What is an example of Google’s privacy policies that make you skeptical?


Like most privacy policies (I'm certainly not saying Google is unique here), they include phrases such as:

> We collect information to provide better services to all our users

Which is vague to the point of uselessness. Google could argue that pretty much any purpose they put that information to is to provide better services to all their users.

This sort of thing is why I ignore privacy policies. They tend to have quite a lot of verbiage of that sort. Even when they appear to solidly say they won't do something, often there's wording elsewhere that provides an exception.

The only way I'd put any stock in a privacy policy is if I have my attorney review it and explain to me what it really says (not being a lawyer, I am not capable of adequately interpreting contracts). Since it's entirely unrealistic to have an attorney review every privacy policy, the safest approach is to assume that they allow the company to do whatever they wish in the end.

And that's all assuming that companies put any real effort into adhering to their privacy policies.


EU privacy regulators already effectively made Google (and most other US companies) illegal 8 years ago, after Google et al. gave personal data away to the NSA in breach of the EU's fundamental rights:

https://en.wikipedia.org/wiki/Max_Schrems#Schrems_I

(You will notice that the enforcement has been very slow moving on that one, since despite this, US companies are still used quite a lot in EUrope.)


What does that have to do with using email content for ad targeting?


Ok. I'm convinced! :)


If they used text inside Google Docs for training AI models, they would leak confidential customer information in an incredibly obvious way. So I'm guessing no.


And a related question: do any/all of these programs use other copyrighted material as their training data, and is it a breach of copyright?


I wouldn't be surprised at all - gotta read the ToS. The entire reason Google went heavy on Gmail, despite it being a fun 10% project initially, was so that they could read your messages and use them to send more targeted ads.


That hasn't been true for several years at this point. Just a meme.


It is a free product and we ALL know how free products are "paid for".


Yes, Gmail shows ads. Those ads aren't related to the content inside Gmail.


I imagine the ad revenue on gmail is trivial and it's just subsidized by Workspace revenue. Initially and possibly now it was/is subsidized by the incremental search quality enjoyed by logged-in users.


I'm positive that the lawyers for Google have ensured that ads shown in Gmail don't relate to your email's content.

Whether it's being scanned for other reasons is another question.


Google uses all the data it can for training its models; the only rule is that the data should not track back to a specific user.


Can you have anti-spam AI models?


I'm not sure how else you'd do it at this point.


Google says it doesn't use email text for targeting, but there are always "bugs."


And bugs are features too.


They are literally training their future enslaver.


Why would it matter? If you're concerned about your privacy, there are millions of other reasons to avoid Google.



