Ask HN: Does Google use the text inside gdocs and Gmail for training AI models?
66 points by tikkun on Feb 15, 2023 | 71 comments
Google searches = used for training AI models

Apple notes = private

Google docs = ?

Siri requests = used for training models

Emails you send in gmail = ?

I'm seeking to understand which things people might think are private (because they're not posted on the open web) but are nonetheless used for training AI models.




I don't know how everyone else approaches it, but I just assume that anything uploaded to Google's servers will be snooped on using some legal loophole or another. I don't have time to study the TOS for every user-hostile application in the world, so guilty until proven innocent is the only sane position I can see to take.


This. And not just Google's servers, but anyone's servers.


Large IT departments generally look for third party certification that their data will be handled in some defined way before using a service.

https://workspace.google.com/learn-more/security/security-wh...

https://learn.microsoft.com/en-us/compliance/regulatory/offe...

https://compliance.salesforce.com/en/soc-2

https://slack.com/trust/compliance


Right, you don't have to trust the good faith and intentions of these huge companies, just that they don't want to get sued to hell and back by the other huge companies that rely on their products. You can't trust them to do (or not do) anything that they haven't legally promised to do (or not do), but outright lying isn't as profitable as people think.


Big companies generally do stick to their ToS.



Yes, and you know when they've stopped following it, because they send you an email saying "we've changed our ToS" [after our legal department learned what our engineers have been doing all year].


> Big companies generally do stick to their ToS.

I am the co-founder of a small-by-design company offering cloud services, and we spend a lot of time making sure that our ToS reflects what happens operationally. We also spend money on it, since we review it with a lawyer specialised in open source software. There are some other entities from chatons.org that care about such documents as well.


I know we all assumed this when Gmail was launched; I wonder how that flies under GDPR?


Maybe we need an open source effort to track and simplify popular services' TOS.

TLDR TOS.


https://support.google.com/docs/answer/10381817 states:

> Google Docs, Sheets, & Slides uses data to improve your experience

> To provide services like spam filtering, virus detection, malware protection and the ability to search for files within your individual account, we process your content.

So yes, they do use your data for things like training AI. However, it does not seem to be for general AI like Bard, but for ML systems within the product in question.


Well, it gets a bit muddy, though, if you factor in the Workspace terms, which offer a completely different set of processing agreements.


Workspace data is siloed off and treated separately.


If you're talking about Google Bard, they were very clear in the LaMDA 2 paper that they only used public sources.

"...from public dialog data and other public web documents..."

LaMDA 2 paper: https://arxiv.org/abs/2201.08239

My overview of Google Bard including dataset: https://lifearchitect.ai/bard/

My overview of Google PaLM and Pathways family including dataset: https://lifearchitect.ai/pathways/

Compare with other models including the use of DeepMind's MassiveWeb/MassiveText and EleutherAI's Pile dataset: https://lifearchitect.ai/whats-in-my-ai/


No it does not, it would be irresponsible to do that on private data. There's a very clear line between data posted publicly and data held privately, especially in terms of copyright. I doubt it will ever be default opt-in for something as sensitive as e-mail and docs.

One exception to that is scanning for CSAM and Terrorism and DMCA. And with DMCA, it's automated based on file hash, and you still maintain access to your files; you are just limited from sharing them. Ads in Gmail aren't based on content, but on other online activity while logged in.
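
For anyone curious, "automated based on file hash" is conceptually simple. Here's a minimal, hypothetical sketch (placeholder blocklist value and made-up function names; obviously not Google's actual pipeline):

    import hashlib

    # Hypothetical blocklist of SHA-256 digests (the placeholder below is just
    # the hash of empty data); a real system would use a large curated database.
    BLOCKED_HASHES = {
        "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    }

    def file_sha256(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def is_sharing_restricted(path: str) -> bool:
        # The owner keeps access to the file; only sharing would be limited.
        return file_sha256(path) in BLOCKED_HASHES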

I think the other exception to that is smart compose. AI models do use email content for training data, but the output of those models is strictly for use locally while writing emails. I imagine it's also siloed per user.

EDIT: Not a google employee, I apologize if my assertions seem too strong.

EDIT2: https://en.wikipedia.org/wiki/Federated_learning
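
To make the federated learning link concrete, here's a toy sketch of the federated-averaging idea in plain NumPy; the linear model and names are made up for illustration, and this is not Google's actual Smart Compose training setup:

    import numpy as np

    def local_update(weights, x, y, lr=0.1):
        # One gradient step of least-squares regression on a single user's data.
        grad = x.T @ (x @ weights - y) / len(y)
        return weights - lr * grad

    def federated_round(weights, user_datasets):
        # The server only ever sees the locally computed weights, never the
        # raw (x, y) data, and averages them into the shared model.
        updates = [local_update(weights, x, y) for x, y in user_datasets]
        return np.mean(updates, axis=0)

    # Toy data standing in for per-user, on-device datasets.
    rng = np.random.default_rng(0)
    users = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(5)]
    w = np.zeros(3)
    for _ in range(10):
        w = federated_round(w, users)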

"We have always maintained that you control your data and we process it according to the agreement(s) we have with you. Furthermore, we will not and cannot look at it without a legitimate need to support your use of the service -- and even then it is only with your permission. Here are some of the additional measures we take to ensure your privacy: (reference: GCP Terms).

In addition to these commitments, for AI/ML development, we don’t use data that you provide us to train our own models without your permission. And if you want to work together to develop a solution using any of our AI/ML products, by default our teams will work only with data that you have provided and that has identifying information removed. We work with your raw data only with your consent and where the model development process requires it. "

https://cloud.google.com/blog/products/ai-machine-learning/g...

https://support.google.com/mail/answer/6603?hl=en

https://arxiv.org/abs/1906.00080

https://ai.googleblog.com/2017/04/federated-learning-collabo...


You mentioned it would be "irresponsible to do that on private data", but it seems Google was doing exactly that until at least 2017. Or am I incorrect about that?

"Google Will Keep Reading Your Emails, Just Not for Ads" - https://variety.com/2017/digital/news/google-gmail-ads-email...

Edit: Updated comment. Misread it as being posted from somebody working at Google.


The person you are responding to spoke assertively and quoted Google websites, but I don't think they work or worked for Google.


We have no evidence either way. And referring to someone who's weirdly in favor of a multi-billion-dollar corporation over the rights of individuals as at least an agent of that group seems like appropriate verbiage.


Thanks. I will rephrase my comment.


> it would be irresponsible to do that on private data.

If they do it on searches not meant for publication then are they irresponsible?


People want spam filtering, so of course they will read your emails.


>No it does not, it would be irresponsible to do that on private data.

Doing irresponsible things on private data is the hot business model of the day. I'm not saying it's Google; I'm saying common expectations about "responsibility" are worse than useless.

>We have always maintained that you control your data and we process it according to the agreement(s) we have with you.

Ah the "we surveil you fair and square, get over it" clause.


Nothing in the quotes you posted precludes them from using my emails to train AI. They say things like "we process your data according to the agreements we have with you," but that isn't a denial. A denial would be "we don't store or process your data to train AI".

They're just implying they don't process your data, while actually saying "the answer to your question lies in the text of the service agreements". Evasive at best.

In the absence of a denial, and the presence of an obvious motive to do so, I have to say my guess is they do use your gdocs and gmail data to train AI.


Idk why you posted the GCP terms of service as evidence. I also don't think Google uses emails and docs, but there is definitely a higher bar with company data than with normal user data.


> I think the other exception to that is smart compose. AI models do use email content for training data, but the output of those are strictly for use while writing emails. I imagine it's also siloed per user.

So if there is an AI model for each user, trained on the user's writing, does that mean Google now also has the means to forge convincing emails?


>especially in terms of copyright

Not true in general. Something I write in a diary I keep under my pillow has the exact same copyright status as something I publish on my blog (assuming no Creative Commons, etc., licenses).


> One exception to that is scanning for CSAM and Terrorism and DMCA

It's enough to have one exception. We have to assume that a language model will be trained on that data and that government officials will be using it soon. Just think about how much it could tell them about the next planned terrorist events and their organizers.


> scanning for CSAM and Terrorism and DMCA

There is no such thing as "scanning for DMCA". DMCA prescribes a process for reacting to complaints of copyright infringement, not a process for preemptively scanning content for material that might potentially generate such complaints.


>No it does not, it would be irresponsible to do that on private data. [...] One exception to that is [describes two exceptions] I think the other exception to that is [...]

kek


tl;dr: officially, of course not. You trust us, right?


Email contents are used to generate a model for Smart Compose in Gmail [1]. I assume that Google Docs works similarly.

[1] https://arxiv.org/pdf/1906.00080.pdf


Yes, in a similar way to using voice data from Android and Home devices to train speech-to-text models.


Well, since Google offers an auto-complete AI, I'd assume any text Google has (or at the very least, text that can be auto-completed) gets fed into their AIs. I have no evidence for this, and I should really read the ToS sometime, but I digress.


The Smart Compose paper does say that it was trained on user emails, and that each user gets a model trained for their personal style.


A related question: are the 40+ million full-text books [1] used?

OpenAI is using the book1 (BookCorpus?) and book2 sources. By the number of tokens, this seems to be less than a million books in total.

[1] https://www.blog.google/products/search/15-years-google-book...


It's "BooksCorpus" (with an 's'), a 800M word dataset described in Zhu et al. (2015) IEEE ICCV, and also available on AWS at: https://aws.amazon.com/marketplace/pp/prodview-d3ghxqzkitn6y

The Google BERT paper (Devlin et al., 2018) also references it: https://aclanthology.org/N19-1423/

Privacy questions aside (as important as they of course are), it's very important to know what a model was trained on exactly: if Wikipedia was used in the training set, you can't use questions from Wikipedia to test it (as that would be cheating) - test data must be as "unseen" as a good exam.
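
To make the contamination point concrete, here's a rough sketch of the kind of n-gram overlap check used to flag test questions that were probably "seen" in training (toy helpers and toy data, not any lab's actual dedup pipeline):

    def ngrams(text, n=8):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def contamination_score(test_question, train_ngram_set, n=8):
        # Fraction of the question's n-grams that also occur in the training text.
        q = ngrams(test_question, n)
        return len(q & train_ngram_set) / len(q) if q else 0.0

    # Toy usage: a high score suggests the "exam question" was in the training set.
    train = ngrams("the quick brown fox jumps over the lazy dog " * 3)
    print(contamination_score("the quick brown fox jumps over the lazy dog today", train))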


Actually, it's 'BookCorpus'. OpenAI spelt it wrong in their GPT-1 paper.

It has also been analyzed here and here:

https://arxiv.org/abs/2105.05241

https://lifearchitect.ai/whats-in-my-ai/


About emails:

No for ads.

Yes for training models for Smart Reply, from: https://arxiv.org/pdf/1606.04870.pdf

> Privacy: Note that all email data (raw data, preprocessed data and training data) was encrypted. Engineers could only inspect aggregated statistics on anonymized sentences that occurred across many users and did not identify any user. Also, only frequent words are retained. As a result, verifying model’s quality and debugging is more complex.
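
For intuition, the "aggregated across many users / only frequent words" part roughly amounts to a k-user threshold before anything enters the shared vocabulary. A toy sketch of that idea (my guess at the general approach, not the paper's actual pipeline):

    from collections import defaultdict

    K = 5  # minimum number of distinct users (placeholder threshold)

    def build_shared_vocab(emails_by_user):
        # emails_by_user: {user_id: [email_text, ...]}
        users_per_token = defaultdict(set)
        for user_id, emails in emails_by_user.items():
            for email in emails:
                for token in email.lower().split():
                    users_per_token[token].add(user_id)
        # Keep only tokens seen across at least K distinct users, so rare,
        # user-identifying strings never enter the shared training vocabulary.
        return {tok for tok, users in users_per_token.items() if len(users) >= K}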


A different question should be asked:

Do their Terms of Service, etc., allow them to use that text for training models?


Call me crazy, but I don't think I'd trust a company like Google to necessarily follow their own TOS, or to not try and craft a vague TOS that gives them a lot of legal flexibility to do most of what they want.


I would assume so; all email content is surfaced for serving ads, so why not AI training?


That's not true; the ads shown are related to your online activity, not the content of your mail messages. See https://support.google.com/mail/answer/6603?hl=en

"We will not scan or read your Gmail messages to show you ads."



> This article is more than 9 years old


Google says that email content is not used to show ads

https://support.google.com/mail/answer/10434152


Google was sued about 8 years ago and stated that it uses emails for "tailored experiences": https://arstechnica.com/information-technology/2014/04/googl...


Your premise is incorrect: Email contents are not used for targeting ads.


But...how do you (we) know?


Because they've made public statements to that effect, including in the privacy policy.

Now, your next question is presumably going to be "why don't they just lie?".

First, the lie would be very expensive. What do you think privacy regulators around the world would think of it? What do you think lying to customers about how their data is used would do to their $25 billion/year enterprise business?

Second, the lie would get discovered very quickly, because somebody would leak it. Plenty of much less serious things seem to leak weekly.

Third, there's very little to be gained by using the data and lying about this compared to the alternatives (say nothing; tell the truth; don't use the data). "Use the data but lie about it" is the absolute worst strategy possible.


You have a great deal more trust in Google than I. I think that if Google lied about this and got caught, the end result wouldn't be that damaging to Google.

Not saying they are lying, of course, but I don't think the incentives against it are as strong as you do.

But then, I'm skeptical enough about privacy policies that I've stopped reading them. The vast majority of them, including the ones I've read from Google, leave more than enough room for the companies to do pretty much anything they want with your data.


But this isn't a situation where the privacy policy is vague or ambiguous, or leaves room for them to do anything they want.

From https://policies.google.com/privacy?hl=en-US

> We don’t show you personalized ads based on your content from Drive, Gmail, or Photos.

It is a direct and unambiguous statement in plain language saying that Gmail content is not used for ad targeting.

Given your objection was based on an invalid premise, are you now willing to change your opinion?


What is an example of Google’s privacy policies that make you skeptical?


Like most privacy policies (I'm certainly not saying Google is unique here), they include phrases such as:

> We collect information to provide better services to all our users

Which is vague to the point of uselessness. Google could argue that pretty much any purpose they put that information to is to provide better services to all their users.

This sort of thing is why I ignore privacy policies. They tend to have quite a lot of verbiage of that sort. Even when they appear to solidly say they won't do something, often there's wording elsewhere that provides an exception.

The only way I'd put any stock in a privacy policy is if I have my attorney review it and explain to me what it really says (not being a lawyer, I am not capable of adequately interpreting contracts). Since it's entirely unrealistic to have an attorney review every privacy policy, the safest approach is to assume that they allow the company to do whatever they wish in the end.

And that's all assuming that companies put any real effort into adhering to their privacy policies.


EU privacy regulators already effectively made Google (and most other US companies) illegal 8 years ago, after Google et al. gave personal data away to the NSA in breach of the EU's fundamental rights:

https://en.wikipedia.org/wiki/Max_Schrems#Schrems_I

(You will notice that the enforcement has been very slow moving on that one, since despite this, US companies are still used quite a lot in EUrope.)


What does that have to do with using email content for ad targeting?


Ok. I'm convinced! :)


If they used text inside Google Docs for training AI models, they would leak confidential customer information in an incredibly obvious way. So I'm guessing no.


And a related question: do any/all of these programs use other copyrighted material as their training data, and is it a breach of copyright?


I wouldn't be surprised at all - gotta read the ToS. The entire reason Google went heavy on Gmail, despite it being a fun 10% project initially, was so that they could read your messages and use them to send more targeted ads.


That hasn't been true for several years at this point. Just a meme.


It is a free product and we ALL know how free products are "paid for".


Yes, Gmail shows ads. Those ads aren't related to the content inside Gmail.


I imagine the ad revenue on gmail is trivial and it's just subsidized by Workspace revenue. Initially and possibly now it was/is subsidized by the incremental search quality enjoyed by logged-in users.


I'm positive that the lawyers for Google have ensured that ads shown in Gmail don't relate to your email's content.

Whether it's being scanned for other reasons is another question.


Google uses all the data it can for training its models; the only rule is that the data should not track back to a specific user.


Can you have anti-spam AI models?


I'm not sure how else you'd do it at this point.


Google says it doesn't use email text for targeting, but there are always "bugs."


And bugs are features too.


They are literally training their future enslaver.


Why would it matter? If you're concerned about your privacy, there are millions of other reasons to avoid Google.



