
I have yet to see a real whistleblower report by someone deep in the ad industry revealing what's really going on there. There are almost no details about what's really happening on a technical level (beyond the little things we already know), or about whether large data sets are abused in the sense that they get routinely de-anonymised.



I worked for a time in marketing and targeting modeling within retail finance.

From what I saw, the data privacy rules and client contracts are followed scrupulously, and infractions are noted and remediated. Encryption slowed us down at least 10X, and bureaucracy another 10X.

While frustrating at times, I appreciated that most, if not all, actually wanted to stay well within the legal guardrails.

Of course, this is one person's experience, and it obviously can't apply to every company. Genie and bottle, toothpaste and tube, and all that.


> legal guardrails

What many companies do is sprinkle some magic "anonymous" pixie dust on their data, and then tons of laws no longer apply, even if the data is trivially re-identifiable. And location data is often of that nature.


Sometimes the will is there to attempt this, which is why following risk procedures is tied to performance ratings at better companies. Companies also delete data after a specified time, as numerous legal regimes require.

And they engage with third-party matching services, so that personally identifying information stays salted and hashed all the way to the print shop.

Etc.
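
For concreteness, here's a minimal sketch of salted-hash matching (the salt scheme and all names are my own assumptions, not any particular vendor's): parties who share the salt can match records, while the hashes are useless to anyone who doesn't.

    import hashlib

    # Hypothetical per-campaign secret, shared only between the two
    # matching parties; anyone without it cannot reproduce the keys.
    CAMPAIGN_SALT = b"per-campaign-secret"

    def match_key(email: str, salt: bytes = CAMPAIGN_SALT) -> str:
        # Normalize so "Alice@Example.com " and "alice@example.com" match.
        normalized = email.strip().lower().encode()
        return hashlib.sha256(salt + normalized).hexdigest()

    print(match_key("alice@example.com"))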


Do you know whether there are actual audits happening by third parties?

I've never had anything to do with ad tech, but the (few) certification processes I've been involved with were largely documentation and "yes, we do have backups. no, unauthorized persons cannot access our servers" promises without anybody actually auditing/testing.

Do they have independent auditors regularly look at the tech and operation to verify that they actually do conform with the laws and don't just claim to?


Many. It represents a massive cost. Hence the emphasis on employee incentive to do the right thing in the first place.

Further, any time there is a broad news story (e.g. Wells Fargo's bogus accounts) or a legal item in the pipeline with broad impact, you can be assured every company looks to make sure they aren't in bad shape.


What the author is referring to is usually branded, so it's difficult for people on the sales side to give details without revealing who it is (unless you're in the technical weeds), and I'm sure you can appreciate why they might not want to. So here's the gist:

- Companies collect a number of "records" keyed to an email address. How they do this varies, from running microsites/forms that collect data directly from people to scraping LinkedIn and resumes.

- Data vendors will share this data with each other by hashing the email address. Almost always with MD5, and rarely salted.

This means you can enrich data either anonymously (relying on the fact that the hash is unsalted, so every party derives the same key from the same address) or in an identifying way (because one of the parties holds the real email address). In either case, it's just a LEFT JOIN.
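
A minimal sketch of what that looks like (the table contents are hypothetical, and pandas stands in for whatever warehouse actually runs the join):

    import hashlib
    import pandas as pd

    def email_key(email: str) -> str:
        # Unsalted MD5: every party holding the same email derives the same key.
        return hashlib.md5(email.strip().lower().encode()).hexdigest()

    # Vendor A holds raw emails plus contact data.
    vendor_a = pd.DataFrame({
        "email": ["alice@example.com", "bob@example.com"],
        "job_title": ["CTO", "Analyst"],
    })
    vendor_a["email_md5"] = vendor_a["email"].map(email_key)

    # Vendor B shares only hashed keys plus behavioral data.
    vendor_b = pd.DataFrame({
        "email_md5": [email_key("alice@example.com")],
        "segment": ["sneaker-intender"],
    })

    # Enrichment really is just a left join on the shared hash.
    enriched = vendor_a.merge(vendor_b, on="email_md5", how="left")
    print(enriched)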


It's much worse than that.

I've been approached by a company that trades e-commerce transaction data for personal data tied to IPs.

As in, you give me your customer data, and I'll tell you who visited your site based solely on the IP address.

We declined, but it's tempting.


No, that's exactly what I'm talking about:

They'll buy the MD5-email-to-cookie mappings from one provider (e.g. Lotame, LiveRamp, etc.), then use those to onboard the email+contact data they've purchased from companies that map email addresses to personal data (e.g. MVF, ZoomInfo, etc.).
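
Roughly like this, with hypothetical table contents (only the provider names above are real; I'm not claiming this is their actual data format):

    import pandas as pd

    # From a LiveRamp/Lotame-style provider: cookie IDs keyed to MD5(email).
    cookie_map = pd.DataFrame({
        "cookie_id": ["abc123"],
        "email_md5": ["0c83f57c786a0b4a39efab23731c7ebc"],  # hypothetical hash
    })

    # From a ZoomInfo/MVF-style vendor: MD5(email) keyed to contact data.
    contacts = pd.DataFrame({
        "email_md5": ["0c83f57c786a0b4a39efab23731c7ebc"],
        "name": ["A. Example"],
        "employer": ["Example Corp"],
    })

    # The chain: every impression served to this cookie is now attributable
    # to a named, employed person.
    identified = cookie_map.merge(contacts, on="email_md5", how="left")
    print(identified)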

If that were done purely as a lead-qualification step, it would be a reasonable way to turn legitimate content syndication done by a third party into direct marketing. But there's no technical reason the records need to be leads or have any level of qualification -- and only a weak market force (poor conversion rates) prevents it from being more widespread.


This comes from discussions and lectures by people working on engineering at some ad aggregators. At least one said that, in the end, they have a table with 300 million IDs, one for each person in the country, and they key all the data they can link to a person to that ID.

In principle this data is anonymized, but does that make a difference? In health care, at least, people worry about HIPAA and run audits to minimize re-identification risk; I'm not sure ad-tech companies do anything like that. So yes, a good data scientist can find any person they want in the data, but even otherwise, I think these companies can operate on a fairly meaningless definition of anonymization to get away with all this crap.


It's basically impossible to anonymize data. There are numerous papers about how little data is enough to uniquely identify people. Things like the zip code where you start your commute and the zip code where your commute ends are enough to identify the vast majority of people.
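
A toy illustration with made-up records: group by the quasi-identifier pair and count, and any group of size 1 is uniquely re-identifiable. The claim in those papers is that real commute data looks much like this.

    import pandas as pd

    # Hypothetical commute records: home and work zip codes only.
    trips = pd.DataFrame({
        "home_zip": ["10001", "10001", "94103", "60601"],
        "work_zip": ["10017", "10018", "94105", "60606"],
    })

    # How many records share each (home, work) pair?
    group_sizes = trips.groupby(["home_zip", "work_zip"]).size()
    unique = (group_sizes == 1).sum()
    print(f"{unique}/{len(trips)} records are unique on the zip pair alone")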


In our case, our (Canadian) postal code plus almost any other piece of data is uniquely identifying. Through a quirk in the layout of our street, my wife and I have the only house in our postal code. Add age, gender, birth month, hair colour, t-shirt size... pretty much anything, and you’ve reduced from 2 possibilities to 1.

I still want to try dropping a letter in a mailbox from a different city with just our postal code written on it and see if it arrives.


It's impossible to anonymize some data. If you're including demographics and locations then yeah, it's going to be hard or impossible to anonymize. If you're using surveys on emotional state or perhaps newsgroup comments? That's not so hard.


> I have yet to see a real whistleblower report by someone deep into the ad industry revealing whats really going on there.

It's because nobody cares. Until Congress starts hauling in software developers and "data scientists" from tech startups, everyone is content to watch Zuckerberg be the face of society's discontent.

As long as they keep asking the wrong people the wrong questions there is no need for everyone else to acknowledge their role in the problem.


There's nothing to whistleblow.

Advertising is applied sociology. As such, advertisers want to aggregate large data sets into large segments that are easy to manipulate statistically. (Where the central limit theorem starts working.)

There is no demand for personal data or de-anonymization because that stuff doesn't sell.

The personal data collection done by Google, Facebook, et al. is not for advertising purposes. They're collecting it because they view it as a resource and a currency in a future de-anonymized world. (Think China's "social credit" system, except on a larger scale.)

Source: I've worked in the ad industry for over 15 years.


> There is no demand for personal data or de-anonymization because that stuff doesn't sell.

Say what?? I’ve also worked in the ad industry, and de-anonymized personal data is shared and sold routinely. You speak of statistics and large segments, but every advertiser I’ve interacted with is either doing individual-level targeting or striving towards it.


To wit: a few weeks ago there was a discussion here about a method for figuring out how fast a browser/machine could compute a SHA-512 hash, and this was being used to fingerprint even users who had cookies, images, and JavaScript disabled.
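
I don't remember the exact delivery mechanism, but the timing signal itself is easy to sketch. Here's a hedged stand-in in Python (the real thing ran against the browser, not like this):

    import hashlib
    import time

    def hash_timing_signal(iterations: int = 100_000) -> float:
        # Time a fixed amount of SHA-512 work; the elapsed time is a coarse
        # hardware/load signal that needs no cookies or client-side storage.
        data = b"fingerprint-probe"
        start = time.perf_counter()
        for _ in range(iterations):
            data = hashlib.sha512(data).digest()
        return time.perf_counter() - start

    print(f"elapsed: {hash_timing_signal():.4f}s")  # varies by machine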


This is why technical solutions have been, and always will be, the wrong approach. If you're worried about privacy, work to make tracking illegal. That's what this article is doing.


Was it just a proof of concept demonstration, or was there evidence that this method is being used in the wild by advertisers?


They stated that they were using it in production for that purpose.


Gotcha, thanks. That seems... kinda wild to me. That method has to be super imprecise, and it wastes the resources of everybody involved.


I would like to see more. How can you get their computer to compute the hash? Is it somewhere in the HTTPS interaction?


I don't recall the details, but it had something to do with a would-be security feature in the browser that computes the hash of something before following a link.


> every advertiser I’ve interacted with is either doing individual-level targeting or striving towards it.

Only if they're clueless.

For example: Nike really wants a dataset of "people who buy expensive sneakers for fashion purposes".

This dataset is probably hundreds of millions of anonymous people, and not personal data. If there was a way to get this dataset directly, Nike would do that in a heartbeat.

Unfortunately, as of 2019, the only way to get something like this is by, e.g., cross-referencing credit card purchase info with Twitter browsing logs, which leaks a shitload of sensitive private data.

For ad purposes personal data collection is a bug, not a feature.


There are many sites that have quite accurate PII for large fractions of the American population. Think job boards, for instance. One approach that I have seen used successfully is simply buying whatever data such companies are willing to sell.

I’m not necessarily talking about demographics, but rather clickstream data, and anything categorical that you can get your hands on. You join that to your CRM and build a model to predict buyers. A really good predictive dataset for marketing purposes is simply a list of time stamps and names of visited domains. With the right feature engineering, that becomes an excellent proxy for demographic data, current buying appetite, and a whole lot more.

At the end of the day, you don’t even necessarily need to know what the data means, as long as it’s predictive. And there are plenty of brokers out there who will let you test their data for free, with an agreement to pay if you end up using it at scale. All of this revolves around using PII for matching.
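
A compressed sketch of that pipeline (all data, keys, and features here are hypothetical, and scikit-learn stands in for whatever modeling stack is actually used):

    import hashlib
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def email_key(email: str) -> str:
        # PII (the email) is only used to build the match key.
        return hashlib.md5(email.strip().lower().encode()).hexdigest()

    # Purchased clickstream: one row per (person, timestamp, domain).
    clicks = pd.DataFrame({
        "email_md5": [email_key("alice@example.com")] * 3
                     + [email_key("bob@example.com")],
        "ts": pd.to_datetime(["2019-05-01", "2019-05-02",
                              "2019-05-03", "2019-04-01"]),
        "domain": ["sneakers.example", "sneakers.example",
                   "news.example", "news.example"],
    })

    # Feature engineering: visit volume and recency per person.
    feats = clicks.groupby("email_md5").agg(
        visits=("domain", "size"),
        days_since=("ts", lambda s: (pd.Timestamp("2019-05-04") - s.max()).days),
    ).reset_index()

    # Your CRM: who actually bought.
    crm = pd.DataFrame({
        "email": ["alice@example.com", "bob@example.com"],
        "bought": [1, 0],
    })
    crm["email_md5"] = crm["email"].map(email_key)

    # Join on the hash and fit a buyer-prediction model.
    train = crm.merge(feats, on="email_md5", how="left").fillna(0)
    model = LogisticRegression().fit(train[["visits", "days_since"]],
                                     train["bought"])
    print(model.predict_proba(train[["visits", "days_since"]]))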

I’m sure what you’re saying is true for some marketers, but there are billions being made on PII-keyed data.


Are you reading what I'm writing? No?

Let me repeat again. PII is a crutch used for matching, because current matching/segmenting technologies are crude.

Advertisers don't want PII. What they want is target audiences with predictive power, which means data sets where the central limit theorem holds sway. (I.e., thousands and millions of people lumped together.)

If advertisers could get at these segments directly without PII, they'd do it in a second.


I think in some circles the meaning of the term "whistle-blowing" has drifted enough that people use it interchangeably with "reveal."

Given your 15 years of experience, what resources do you recommend to HN so that we can learn more? Can you give us a "life of an advertising bit"? E.g.: a person visits a website on their phone; that request is accompanied by X data from their phone; it goes to the initial ad server; this information is compiled against data from sources a, b, c; etc.


My question to someone from the ad industry would be whether there are known "intersections" between the anonymised data sets ad tech uses to sell as much as possible and the companies who buy those data sets to connect them to real identities.

Especially because the web is full of privacy notices people agree to, and I guess in some of those, people actually agree to have their anonymous browsing data connected to their real identities.


"Known"? Probably not.

Like I said, knowing real identities is the last thing on the list of ad tech priorities.

If shadowy entities are collecting "real identities" then it's not for ad purposes.



