Docusign just admitted that they use customer data to train AI (twitter.com/nixcraft)
223 points by lopkeny12ko 6 months ago | 107 comments



Repeat after me:

There's no sure-fire way to automatically anonymize arbitrary customer data.

There's no sure-fire way to automatically anonymize arbitrary customer data.

There's no sure-fire way to automatically anonymize arbitrary customer data.

You can't anonymize it by hand. You can't anonymize it by machine. If you're taking blobs of customer data and dumping it into a machine, customer data can come out of the machine. Neither humans nor machines are error-free. If you believe DocuSign is going to do a perfect job, I've got a bridge to sell you.
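
To make that concrete, here's a toy sketch of naive pattern-based scrubbing (a stand-in, not whatever pipeline DocuSign actually runs): the patterns only catch what they anticipate, and free-form contract text is full of identifiers they don't.

    import re

    # patterns for the obvious, well-structured identifiers
    PATTERNS = {
        "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    }

    def scrub(text):
        for label, pattern in PATTERNS.items():
            text = pattern.sub("[%s REDACTED]" % label.upper(), text)
        return text

    doc = ("Settlement between Jane Q. Doe (SSN 123-45-6789) and Acme Corp. "
           "Severance of $180,000 payable to the account ending 4417.")
    print(scrub(doc))
    # The SSN is caught, but the name, employer, dollar amount, and account
    # digits all survive -- exactly the residue that can later leak out of a model.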


Hey, can you recommend a super-simple way to automatically anonymize customer data?


The heat death of the universe


Source? I don't think physicists agree. For a while there was an idea gaining popularity that black holes erase information, but in the past year I've seen some articles and youtube videos (didn't actually care to read/watch) with titles saying otherwise.

To my understanding, the idea behind the heat death of the Universe is that nothing interesting happens anymore, not that there is absolutely nothing.


1. “The heat death of the universe” is my favorite HN comment of the decade.

2. The heat death of the universe does not mean one gigantic black hole. I’m just a hobbyist but my understanding of the theory is that black holes will continue to form, but through Hawking radiation, they eventually radiate out all their energy until it is all dispersed, ultimately leading to uniformity across the entire universe, max entropy, where “work” can no longer take place.

(It is an interesting question then whether information is actually destroyed through Hawking radiation?)


Sorry, perhaps I should have marked the black hole as a tangent. The heat death of the universe is sometime after all black holes have "evaporated".


> 1. “The heat death of the universe” is my favorite HN comment of the decade

Thank you for saying that :)


> nothing interesting happens anymore, not that there is absolutely nothing.

Sorry, are you talking about the heat death of the universe or the last five Marvel movies?


Yes sirs, they are the same picture.


Why do I never think of these things


Microwave ovens, strong magnets (depending on the medium), shredders, blast furnaces, the hydraulic press from that YouTube channel, firing it into the sun...


rm is a good start, but you also need to ensure the disks are not sold to anyone.


When you enable TRIM, it's zeroed automatically, so you're 99% there with an rm.


No, the point of enabling TRIM is to not induce further write amplification. The zeroing is deferred until the trimmed block is overwritten. It is released from the indefinite garbage collection that exists in a pure SSD implementation that does not know which blocks are important. Some firmware answers with nothing when a TRIMed block is requested; other firmware answers with the content of the block as it is on NAND. Either way, the bits are most likely still on NAND.


I think it's dependent on the GC scheme of the SSD in question. From what I have seen, some drives defer it until the sectors are about to be written, some zero them the moment they're TRIMmed, and some wait for idle plus a timeout before working on those sectors.

I have a couple of external SSDs and USB flash drives that start to flash their busy lights after being left idle for a couple of minutes. These "bursts of activity" are generally proportional to the I/O work they did before the idle period.

One of my flash drives did that GC/zeroing thing when it was ejected. After ejection and drive disconnect, its busy light flashed for some time depending on the work that needed to be done. If you didn't hammer it when you used it, it did nothing, but if you did tons of work, it kept working for up to a minute or so after being ejected.


Step 1: generate a uuid. Step 2: encrypt the customer data. Step 3: associate the encrypted data with the uuid; keep the uuid in the (encrypted) metadata. Step 4: use the uuid instead of the data.

Obviously there are potential hiccups. But a lot of good can be done with encryption and metadata.
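
Something like this minimal sketch of those steps, assuming Fernet from the `cryptography` package for the encryption and an in-memory dict standing in for a real datastore:

    import uuid
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()   # kept server-side, never shipped with the data
    fernet = Fernet(key)
    store = {}                    # uuid -> encrypted blob

    def tokenize(customer_data):
        token = str(uuid.uuid4())                      # Step 1: generate uuid
        store[token] = fernet.encrypt(customer_data)   # Steps 2-3: encrypt and associate
        return token                                   # Step 4: hand out only the uuid

    def detokenize(token):
        return fernet.decrypt(store[token])

    t = tokenize(b"Jane Q. Doe, SSN 123-45-6789")
    print(t)              # safe to pass around downstream
    print(detokenize(t))  # only holders of the key get the data back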


A black hole, and wait for some of it to radiate back out as Hawking radiation


Not even certain this will be enough. Hawking radiation encodes the information that was thrown in. Sure, at present it's a perfect mess, but who knows?


Yeah, right after I figure out how to make a regex match open tags except XHTML self-contained tags.

If you're one of the lucky 10,000 that don't know the reference:

https://stackoverflow.com/questions/1732348/regex-match-open...

And if you're one of the lucky 10,000 that don't know the other reference:

https://xkcd.com/1053/


I dunno, I read someplace that there’s no sure-fire way to do that.


There is, depending on the type, format, and volume of data.

Simply classify the data into buckets for each value of each variable, such that each unique combination of buckets across all variables contains at least N items, for a suitably large N. It is then mathematically impossible to match an individual data point to any group of fewer than N original data points.
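
A toy version with made-up records, using age and zip code as the only variables and N = 3:

    from collections import Counter

    N = 3  # the "suitably large" group size

    def bucket(record):
        # generalize each variable: age into decades, zip into a 3-digit prefix
        return (record["age"] // 10 * 10, record["zip"][:3])

    records = [
        {"age": 34, "zip": "94110", "salary": 105000},
        {"age": 36, "zip": "94107", "salary": 99000},
        {"age": 31, "zip": "94103", "salary": 120000},
        {"age": 58, "zip": "10001", "salary": 80000},   # unique combination
    ]

    counts = Counter(bucket(r) for r in records)
    released = [
        {"bucket": bucket(r), "salary": r["salary"]}
        for r in records
        if counts[bucket(r)] >= N   # suppress any combination shared by fewer than N people
    ]
    print(released)  # the three (30, "941") rows survive; the lone 50-something is withheld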


What if I've been the plaintiff in thousands of labor disputes, and my name and SSN appear in legal documents thousands of times? Couldn't that potentially expose both my name and SSN 5213 times, demonstrating them to be related?

I know this is a bit of an absurd example, but I'm trying to understand the mechanics more than to provide a realistic counterexample.


> Simply classify data into buckets for each value of each variable such that each unique combination of buckets for every variable contains at least N items for a suitably large number of N.

What are the variables and what is the value of N when you're talking about "freeform text contracts containing PII"?


I’ve read this three times now and I still don’t understand. Can you explain it in a different way?


https://en.wikipedia.org/wiki/K-anonymity Basically, keeping only chunks of text which are not unique and are contributed by many users.


> Given person-specific field-structured data

That's not what this data is and that's broadly not what LLMs are trained on.


By chunking and binning the data by user contribution count, you are structuring it as described. Pretraining is often done on similarly chunked text. k-anonymity is not perfect, though, nor state of the art; the Wikipedia article explains several known attacks.


it's still leaking information. let's say there are 5-6 companies that all work in the mystical cloud-painting sector; you lump them together and publish data about the sector. before, there was nothing available; now there's an average.

and obviously partial information is very useful for exactly those with some other partial information.

this is why OSINT is very effective. it's possible to piece together enough constraints on these datasets that in the end the likelihood of a successful deanonymization is unexpectedly high.
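
a toy differencing example with made-up numbers: publish only the sector average, and anyone who already knows four of the five firms recovers the fifth exactly.

    revenues = {"A": 12.0, "B": 9.5, "C": 14.0, "D": 8.5, "E": 11.0}  # secret, in $M

    published_average = sum(revenues.values()) / len(revenues)  # the "anonymized" release

    # partial outside knowledge (filings, leaks, OSINT) of four of the five firms
    known = {"A": 12.0, "B": 9.5, "C": 14.0, "D": 8.5}
    inferred_E = published_average * len(revenues) - sum(known.values())
    print(inferred_E)  # 11.0 -- firm E's "private" revenue, recovered exactly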


There's no sure-fire way to make a human who is processing a document manually never leak the information they read.


Sure, but if there's a law forbidding you from exceeding a particular speed on the road, you can't just break it and say "you can't be perfectly safe anyway".

The analogy here is: there are laws regarding confidentiality that probably were broken here.


I agree. But at the same time, insisting on 100% certainty/safety would mean to not do anything at all and stick to the status quo forever. It boils down to cost-benefit-calculations.

While I agree that it is unacceptable to use customer data without consent (as suggested by OP's post), I disagree with the implicit assumption behind the comment that I responded to:

Namely the implicit assumption that human/biological intelligence/agents are somehow superior to artificial intelligence/agents when it comes to secrecy/confidentiality.

It boils down to the question whether it's possible or not to create algorithms that outperform humans in tasks involving secrecy and confidentiality.

While I can't think of any reason why that should not be possible in general, I agree that the current SOTA of generative LLMs is not sufficient for that.

Is throwing lots and lots of data and RLHF training at an LLM enough to make the probability of customer data leaks small enough to be acceptable?

I don't know. But I don't trust MBAs who salivate with dollar signs in their eyes to know either. And I fear that their lack of technical understanding will lead to bad decisions. And I fear those might lead to scandals that make Gemini's weird biases in image generation pale in comparison.


> It boils down to cost-benefit-calculations.

Yes, the user bears the cost when their confidential data is leaked and the company derives the economic benefit of mishandling it, which is why this keeps happening.


I used to work with extremely sensitive data. My employer made it a point to hire people with memory disorders and intellectual disabilities to deal with raw data.

There was a young lady I had to reintroduce myself to every week or so. I think of her every so often.

I’m certain she doesn’t think of me.


Actually there is. But let us not go there.


What happened to differential privacy?


It works.

But understand what you are doing: you are adding noise to individuals' data to make them less identifiable within a specified set of data. The amount of noise you need to add depends on the size of the set they are within.
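
For a counting query the noise-adding step looks roughly like this minimal sketch (the sensitivity and the privacy budget epsilon are assumed here; calibrating them for a real data set is where the hard part lives):

    import random

    def laplace_noise(scale):
        # the difference of two i.i.d. exponentials is Laplace(0, scale)
        return random.expovariate(1 / scale) - random.expovariate(1 / scale)

    def private_count(true_count, epsilon=0.5, sensitivity=1.0):
        # adding or removing one person changes a count by at most 1 (the sensitivity),
        # so Laplace noise with scale = sensitivity / epsilon gives epsilon-DP
        return true_count + laplace_noise(sensitivity / epsilon)

    print(private_count(1342))  # e.g. 1340.7 -- useful in aggregate, fuzzy per person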

The OP is also right: For arbitrary customer data (within arbitrary sets of data) you can't guarantee anonymization.

There are also multi-party computation systems, homomorphic encryption and federated learning approaches that all provide different approaches for parts of the private machine learning problem.

It's hard.


Worth investigating "k-anonymity" as well.

https://en.wikipedia.org/wiki/K-anonymity

k-anonymity is an attempt to solve the problem "Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful."

One serious problem with that approach is that no matter how well you k-anonymise your data set, someone else's dataset may deanonymise yours.


It doesn't work on arbitrary data. Hand-crafted metrics on structured data, yeah. Free text, no. Data that you don't know anything about, also not really.


We already know this. How about you tell the government and public officials.


To be pedantic, you can trivially anonymize arbitrary data by returning a constant. You can't automatically anonymize arbitrary customer data and have the result be useful.


Meh, so it puts Robert Picardo Stewart somewhere.


Or it puts your name and the name of one of your confidential clients in someone else's contract. Or it oopsies your name and salary into a job offer for someone else. Or it sees the name of your ex-wife and hallucinates details about your assets based on your previous divorce papers and she sues you. I can come up with these all day.


AI as the system copying notes for the test from another system.

Nothing in there is anywhere near impossible unless the system doesn't have the data.


And?

If reliability is zero, so is the risk.


Just because my ex-wife might incorrectly sue me doesn't mean it's not an incredible pain in the ass. Just because someone incorrectly got a contract that hilariously includes all the private details of my employment doesn't mean I won't be royally pissed. The risk is "I don't fucking want this to happen". So no, it's not zero.


Why a xitter post of a screenshot with no link when the full article is public? https://support.docusign.com/s/document-item?language=en_US&...


Because that article is published by docusign and they could change it.


https://web.archive.org (there's a "take a screenshot" option if you have a login and it's JS madness the WBM can't handle), https://archive.ph, etc. (hopefully captured by more than one if possible) - random screenshots are proof of absolutely nothing.

(here's one for the page above: https://archive.is/e6K4K)


In that case a link to an archived copy is preferable to a screenshot of the partial document hosted by Twitter.


An archive link should be added as a comment automatically for all posts... in some cases it would even allow you to bypass paywalls...


Because twitter downranks any tweet with a link, so now users put a pic/screenshot in the first tweet with the link in the next tweet.


Also because xitter provides a broader discussion platform and reach compared to only HN (even though I highly appreciate the HN content). But yeah, the xitter post should preferably also contain that url to the source.


Why does docusign need such sophisticated and powerful AI models, to the point that they’re willing to risk their customers’ ire by training on their data (yes I know, even though it’s anonymized per “industry best practices”)? Are they in some additional line of business besides making it possible to sign a contract electronically?


The AI gods smile favorably upon worshipers that invoke them at the shrines of finance these days. They shower them with the same good fortunes that once came from the blockchain false gods.


And before that the Big Data false gods, and before that the Web 2.0 false gods, and before that the microservices false gods.


The obvious angle is AI contract generation. Docusign are uniquely positioned to build up a real training set of contracts.


Going to be absolutely wild the first time it spits out a specific enough paragraph to identify the company whose data was used to train the model.


Or spits out the terms of someone's confidential settlement agreement.


Yes, and extending existing contracts based on new requirements.

But also contract validation, like checking if something is missing or finding loopholes. If they are sneaky, they could sell that information to lawyers, who can definitely make use of it.

I'm not inherently pro-AI, but there are indeed valid and interesting use cases for such models, if the data can be anonymized correctly. And I'm not sure if it makes sense to single out DocuSign, because there are probably many companies which train their models on similar data.


If they're uniquely positioned to do it, it probably doesn't need to be done. Who is going to usurp them if they don't?


I find this comment interesting in the context of having read a lot of brain-dead takes on "capitalism", some of which liken it to feudalism.

I find it interesting because it shows how different those two are.

In a feudal, zero-sum world your question would make complete sense, but in a capitalist, positive-sum world, it seems to me the answer is just "a lot of profit would be left on the table" (i.e. positive-sum, mutual-benefit transactions), which explains why docusign would do it.


I think a lot of danger can come from competing against imaginary forces. You get things like atom bombs and paranoia.


You can get more people to invest in your company if you put out a press release that says you're using AI.


I feel so small when a company that... does document signing... already has investors thinking it's worth $11b (down from over $50b when many were convinced touching a piece of paper was a common covid vector of transmission...)


It’s not a “need”; they are just riding the coattails of “AI”. Probably getting a fat check in B2B by selling to OpenAI or one of the many VC-funded AI companies.

Throw any AI into a company and the vulture capitalists will load you up on cheap money


Data the new $


I think you can pretty much assume that all data in SaaS products is mined for commercial purposes and soon will be used in AI training if it isn't already, consent or lack thereof notwithstanding. They'll just do it, the temptation is too large and the task is too easy.

If you want your data to be private, encrypt it or better yet keep it in house.


Time for another California legislation. Let's put a "Don't train AI models on my personal information" link at the bottom of websites.


I propose a useless header to send with browser requests as a toothless opt-out mechanism.


Or other dropdowns for gdpr popups


GDPR explicitly states "opt out must be as easy as opt-in". Unfortunately enforcement has been asleep at the wheel.


The problem with your suggestion is that this is so ubiquitous you cannot avoid it, unless you go live as a hermit in a cave.


I don't really see how this admits they are training AI on user data (edit: user documents, which I feel is what is implied; user app-usage data probably does end up in training data). They are processing data with AI, both with off-the-shelf models and their own models, sure.

This is no different from any "old school" ML fraud detection, customer journey analysis, etc., etc.


The way to fight this is to inform the legal counsel at a Really-Big-Company (or alternatively a Really-Litigious-Company) that uses DocuSign. There's absolutely no way they would approve using a platform like this for anything that has even a slight chance of divulging confidential or sensitive information to DocuSign, either about the company, its employees or customers.


Why is everyone so concerned about AI training on their data in the first place?

Has anyone really pondered the consequences or are we just repeating the concerns of others without thinking for ourselves again?

What are real actual consequences? Especially if I’m a small company? Except of course it repeating or regurgitating your content, which is a valid concern.


There's the matter of material concerns, but there's also the matter of consent. These companies have presented their products as being functionally identical to doing things in house, but with the advantage of not having to. Their whole selling point is privacy, compliance, security, etc.

It's like when people found out that Ring/Alexa staff were viewing their recordings and got upset. Even if they make super sure not to commit any further violations with their access, the act of consuming and making use of my data is a violation in and of itself. It violates my privacy, and my ownership of my data. Even if my data can't be retrieved from the product you use it to make, it still feels icky that you changed the terms of our deal without my permission, and took advantage of our relationship to make extra money off of stuff that is fundamentally mine, and I have no recourse.

If my landlord surveilled my apartment and sold the footage on PPV but only in China, the material consequences to my quality of life would be negligible. There's virtually no way that any of the people watching the footage would ever contact me or interact with me in any way, especially if they did it with all of their tenants and randomized the footage such that it's hard to track one specific person. But it would still feel super weird and bad, and it would still be a violation of our relationship.


You already stated the problem in your last sentence. There's also the risk that the model could end up with your / your customers' PII in it which could lead to serious consequences.


Repeating or regurgitating your content. To anyone, somewhere. You got it. Think about it from the perspective of creatives and professionals who deal in words.


> Why is everyone so concerned about AI training on their data in the first place?

Unless I give my consent, unhindered, my data is my data and cannot be used for any purpose at all unless I approve. This also means you cannot make blanket statements in your Terms of Use saying I grant you permission to use my data. I don't.


> Except of course it repeating or regurgitating your content, which is a valid concern.

That is exactly it.

LLMs are nothing more than engines that predict the next token based on a mind-boggling large set of statistics of what tokens follow what. They're essentially Markov Chains on a massive dose of steroids.

If the training data includes one document that includes password rules like "The password shall be at least 12 characters long. The password shall include [...]", and then you have another document that includes a line of "The password is fdFjkkl1@!#", then if you ask the LLM to generate a document that includes password rules, there's a chance it may output "The password is fdFjkkl1@!#" in that document, because it knows that "is fdFjkkl1@!#" sometimes comes after "The password".
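
A toy bigram version of that failure mode (not how an LLM works internally, but the memorization risk is the same in spirit):

    import random
    from collections import defaultdict

    corpus = [
        "the password shall be at least 12 characters long",
        "the password is fdFjkkl1@!#",   # someone's real secret in the training set
    ]

    model = defaultdict(list)
    for doc in corpus:
        words = doc.split()
        for prev, nxt in zip(words, words[1:]):
            model[prev].append(nxt)      # record what follows each token

    def generate(start, length=8):
        out = [start]
        for _ in range(length):
            options = model.get(out[-1])
            if not options:
                break
            out.append(random.choice(options))   # predict the next token
        return " ".join(out)

    print(generate("the"))
    # roughly half the time this prints "the password is fdFjkkl1@!#" --
    # the secret comes straight back out of the "model"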


I started at "oh boy this is gonna be bad for docusign" but after thinking about it I assume Docusign is removing any filled form fields / actual values and just training on the contract structure sans actual values. So all the complaints about "anonymization is impossible" are probably pretty specious.

I guess we'll all find out if/when people use the model to get it to disclose other users information but I won't be holding my breath.


"Will my data be used for AI model training and improvement?

If you have given contractual consent, we may use your data to train DocuSign's in-house, proprietary AI models. Your data will be anonymized and de-identified before model training."

This seems to be opt-in? Am I missing something?


Does "contractual consent" necessarily mean conscious opt-in?

I'd suspect it can mean some clause buried in a contract document.

Also, it's some party's interpretation of that clause, not what some other party understood it to mean.

However, in this particular case, maybe most relevant is that I'd assume Docusign is facing a lot of corporate lawyers among its customers. So I'd guess the company can't get away with behaving like a stereotypical sketchy startup, towards its customers, and I'm sure Docusign has legal advisors who would tell them that if they ever needed to be told.


Doesn't "contractual consent" simply mean that it was part of the contract, likely adhesion one?

In their privacy policy (https://www.docusign.com/privacy) they say their lawful basis is "Consent (where required under applicable law)", which certainly sounds like if applicable law doesn't require it, they would use it without consent.


"Contractual consent" != "opt in."

That just means you clicked "I agree" at the end of 10 pages of legalese.


There is no such thing as "contractual consent". Consent under GDPR cannot be "contractually" given as part of a blanket clause in a contract. It requires criteria like being informed, easily retractable, uncoerced, etc.

GDPR also has the concept of "contractual necessity": a legal basis where data must be processed for the performance of a contract. Say you buy a t-shirt from an American online store and they need to process your shipping address. That's contractual necessity.

They seem to be mixing the two into "contractual consent", as if you are able to consent in a contract. Under GDPR, you can't!

In no other contexts can there be a more legible interpretation.


Probably it is in the terms and conditions, so =yes


Feds asleep at the wheel, again. US government sold out to private companies.


It's by design. That's how the US works. We the corporations.


A few years ago, my band was putting our manager on our LLC account. The credit union, then Alaska USA but now Global, sent us all docusign links to add her.

She had moved out of state, but either way I think they would've done it that way. Even skimming the EULA there was no way I was gonna esign through their service.

This confused the bank, and it was honestly a bit of a pain in the ass with some emailing back and forth. But they got me the forms and I signed analog, and it went through eventually.

Just sent them my I Told Ya So heh


Yes, but they already have a copy of your contract/document, yes?

By which I mean to point out the problem that the people signing the contract/document aren’t the ones choosing the signing service company.


Docusign response:

> Thank you for the feedback. We are updating our AI FAQ to be more clear: We only collect data for training from customers who have given explicit contractual consent as part of their participation in AI Extension for CLM, AI Labs and select beta programs using AI. When we train models using customer data to improve the accuracy of our AI features, we only use data from customers who have given consent, and that is de-identified and anonymized before training occurs.

https://x.com/DocuSign/status/1763647141875192065?s=20


This is especially concerning because many businesses use Docusign for job paperwork. If you want/need a job and they use Docusign there’s no getting around this. If you need to eat you’re forced to sign away your data to train AI.


When I worked at a company that built software for lawyers, this was one thing I remember having multiple discussions about, with different people including the CEO, as something that would be monumentally stupid for us to do.


If they have admitted to doing this, then it's reasonable to assume they have done other vile things with this super sensitive information. Don't ask for proof, ask yourself.


Alternatives to DocuSign that aren’t doing this? How would we know?


https://www.docuseal.co/

You can self-host


So... What? They must provide up-to-date mechanisms to stop malicious actors and find anomalies to protect against cyber-attacks. Fake signatures? Cropped photo submitted multiple times under different names? Photoshopped materials? AI-generated materials? How do you detect this without training on real data?


Docwesign


I’ll love how all those incompetent lawyers’ work will go into training the supposedly super AI, and somehow it will spit out lawyery BS


The headline seems far out of step with the content: they use OpenAI/MS for chat AI, and they use the standard paid API that doesn't retain data.

Beyond that: "If you have given contractual consent, we may use your data to train DocuSign's in-house, proprietary AI models." -- I'm open to hearing that "contractual consent" actually means "literally all of you, mwahahaha", but that's not borne out by the content.


[flagged]


The literally millions of people who have sensitive information in DocuSign, including things like:

- salaries

- proprietary business information

- divorce agreements

- non-disclosure agreements

Do you think people don't put private information into contracts?


You're in digital advertising. How would you feel if they did a poor job of de-identifying one of your contracts (maybe because the client's company name is also a common noun) and their AI regurgitated some private business dealings word for word to one of your competitors?

Yeah, it would suck.


Do you actually think this is a non-issue? Or do you think just knee-jerk dismissal of anything related to AI makes you look cool or edgy?


> I have lots of experience with digital advertising

Yeah, you wouldn't.

Meanwhile I have yet to see any ad that actually made me buy something because of the ad. The whole advertising industry seems like a racket, at best a case of the Emperor With No Clothes, and at worst a front for spyware.


>I have yet to see any ad that actually made me buy something because of the ad

That's a somewhat superficial understanding of how advertising works. Sure, Barry might see an ad for McDonald's and go out and buy a Big Mac, but a lot of the work is subliminal.


The goal of the advertising industry is mostly to get their brand into your brain for when they do become relevant because you remember or recognize them.



