Apache Kafka and GDPR compliance (landoop.com)
84 points by Antwnis on Dec 4, 2017 | 46 comments



I'm interested in the right to be forgotten section, but I'm confused as to what this article is saying...

How exactly do you "forget" the data on the logs?

One interesting solution that kills two birds with one stone is if you encrypt the personally identifiable information then delete the private key if there is a request to be forgotten. Has the added benefit of also effectively destroying the data in backup copies too.


> How exactly do you "forget" the data on the logs?

If we think through the options, you can have one of:

i) eventual deletion (log retention policy)
ii) compacted topics (and push null values)
iii) expensive re-processing of the entire log
iv) expensive segment re-write operation

with each option bringing in a new set of challenges


Encrypt with a user specific key when the data enters the log. You can effectively delete all the user specific data by throwing the key away. No tracking down files or reprocessing necessary.
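A minimal sketch of that idea, under loud assumptions: `KeyStore` is a hypothetical in-memory stand-in for wherever the per-user keys would really live (an HSM or KMS, say), the list `log` stands in for an append-only topic, and the SHAKE-256 keystream is only a toy so the example needs nothing beyond the standard library. A real system would presumably use a vetted authenticated cipher instead.

```python
import hashlib
import secrets


class KeyStore:
    """Hypothetical per-user key store (stand-in for an HSM or KMS)."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        # Create a key on first use, then reuse it for that user.
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id):
        return self._keys.get(user_id)

    def forget(self, user_id):
        # Erasure request: destroy the key, leave the log untouched.
        self._keys.pop(user_id, None)


def _keystream(key, nonce, length):
    # Toy keystream for illustration only, not a vetted cipher.
    return hashlib.shake_256(key + nonce).digest(length)


def encrypt(key, plaintext):
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def decrypt(key, blob):
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def read_record(keys, record):
    user_id, blob = record
    key = keys.get(user_id)
    if key is None:
        return None  # key destroyed, so the payload cannot be recovered
    return decrypt(key, blob)


keys = KeyStore()
log = []  # stands in for an append-only topic
log.append(("alice", encrypt(keys.key_for("alice"), b"alice@example.com")))
log.append(("bob", encrypt(keys.key_for("bob"), b"bob@example.com")))

keys.forget("alice")  # the log itself is never rewritten
readable = [read_record(keys, record) for record in log]
```

After `keys.forget("alice")` the log still holds two records, but only Bob's can be decrypted. Any backup copy of the log is in the same position, as long as the key was not backed up along with it.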


Today’s encryption is strong. Who knows about tomorrow?


Is that acceptable within GDPR requirements?



The GDPR contains exceptions when deleting individual user data is infeasible or if the data in question is a backup, in which case you only need to keep a log somewhere so you don't forget to delete again after a restore.


Dear System Administrator,

We've just hacked your server and wiped the crypto keys for your users. As you know, all your backups are now useless.

Send us $1 million in Bitcoin to get your crypto keys back.

Sincerely,

Hacker McHackface


If somebody managed to hack into your servers deep enough to access private keys, you are f*cked anyway (they could just as well delete or encrypt all your data), so it's not an argument against user data encryption. Actually, storing private keys safely is easier than storing bulk data, because you can use dedicated hardware for that: an HSM.


This "right to be forgotten" requirement is quite staggering in scope. Do I need to dig out all of my offsite tape backups and re-transcribe them to edit out my user's data every time a user requests to be forgotten?

Sibling comments mention a cunning scheme with encryption, but that doesn't really help an enterprise with an existing non-GDPR-compliant backup archive.


I asked a GDPR consultant we hired about backups. The answer was basically that the letter of the law requires data to be deleted from backups, but people aren't going to do that, and there is some language about "reasonable" effort or something like that.

This might be useful: http://www.davidfroud.com/does-right-to-erasure-include-back...


Deletion doesn’t have to be immediately complete; if you rotate backups on a six month period then that’s okay.

At least, that’s what I heard about how a BigCo is doing GDPR.


My understanding is that the GDPR “right to be forgotten” does not cover backups. There may be some exceptions, but there are practical limits on its reach.


I believe your understanding is incorrect. GDPR certainly includes storage and processing, both of which backups probably trigger.

Anyway, think about the spirit of the law, and then think about how that interacts with backups. If someone asks to be deleted from your system, you do so, and then you restore a backup with their data, you have clearly violated the intent.


Keep a log of deleted users and re-delete upon restore.

The GDPR contains exceptions for data storage for which it is infeasible or outside reasonable effort to delete individual records or you have legal compliances to uphold.


Isn't the log of deleted users subject to the GDPR then?


You can make a log of deleted users without it containing personally identifiable information, by just storing the IDs.
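A rough sketch of how that could look, assuming records are keyed by an opaque numeric user ID. The `restore` helper and the sample data are made up for illustration:

```python
def restore(backup, deleted_ids):
    """Return the backup's records minus every user on the deletion log."""
    return {
        user_id: record
        for user_id, record in backup.items()
        if user_id not in deleted_ids
    }


# Hypothetical backup taken before user 102 asked to be forgotten.
backup = {
    101: {"email": "a@example.com"},
    102: {"email": "b@example.com"},
    103: {"email": "c@example.com"},
}

# The deletion log holds only opaque IDs, no names or emails.
deleted_ids = {102}

live = restore(backup, deleted_ids)
```

The deletion log has to be stored outside the backup being restored, otherwise a restore would roll it back too.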


No, since the GDPR exempts things you need for legal compliance; a list of users who have asked to be deleted is fine if it's being used to ensure compliance.


Given the fact that backup is a must for almost any system, it would be silly for GDPR to not "cover" backup files.


Interesting, if that's so, can we redefine the underlying Kafka topics as "backups" and achieve compliance by having the stream processors drop "forgotten" records when replaying a topic?


Good luck selling that one to a judge. Let us know how it goes!


We've been doing a lot of thinking about how to support GDPR at Snowplow (Kafka and Kinesis but plenty of other logs and stores) - for our first phase we're just going to support irreversible pseudonymization of tagged PII:

https://github.com/snowplow/snowplow/issues/3472

For later phases, yes user-specific encryption of PII or hashing-with-lookup table are the way to go...
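To make the hashing-with-lookup-table option concrete, here is a small sketch. It is not Snowplow's implementation: `SECRET`, `tag`, `forget`, and the in-memory `lookup` dict are all hypothetical names, and HMAC-SHA256 is just one plausible choice of keyed hash.

```python
import hashlib
import hmac
import secrets

SECRET = secrets.token_bytes(32)  # hypothetical pseudonymization key
lookup = {}  # token -> original value, stored apart from the event log


def pseudonym(value):
    # Keyed hash, so tokens cannot be recomputed without SECRET.
    return hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()


def tag(event, pii_fields):
    """Replace the tagged PII fields with tokens and record the mapping."""
    out = dict(event)
    for field in pii_fields:
        if field in out:
            token = pseudonym(out[field])
            lookup[token] = out[field]
            out[field] = token
    return out


def forget(value):
    # Erasure request: drop the mapping; logged events keep only the token.
    lookup.pop(pseudonym(value), None)


event = {"email": "alice@example.com", "page": "/pricing"}
stored = tag(event, ["email"])

resolved_before = lookup.get(stored["email"])
forget("alice@example.com")
resolved_after = lookup.get(stored["email"])
```

Dropping the row only blocks reverse lookup. Anyone holding `SECRET` and a candidate email can still recompute the token, so whether this counts as erasure probably depends on what happens to the key.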


I wish you wouldn’t call it irreversible. Every large public claim of that sort has proven false. Consider the Netflix case, where the separate IMDB review dataset allowed reconstruction of pseudonymous movie watching records.

These approaches may help with compliance, but they’re the opposite of real safety.


I'm wondering if anyone has thought about a GDPR extension covering machine learning, i.e. being forgotten would mean the model "unlearning" my data (or being retrained on a dataset from which my data was removed).


I would consider that already covered under the GDPR. Most machine learning approaches today make little to no guarantees about differential privacy and allow for (partial) extraction of the training dataset, which would mean that the request for deletion was never fully fulfilled.


So do you mean that GDPR allows for a request for removal from the model, or is there an exemption for data mining results?


I think that it allows for a request for removal from the model unless it can be proven that the PII cannot be retrieved from the model.

(This should not be considered legal advice by me.)


If your PII has been incorporated into a model, let's say one giving a likelihood to buy red cars based on 100 data points, then it's fairly safe to assume your PII is anonymised. I cannot imagine a way back from the model to each input.


With GDPR do we need to get consent from users before we can set any cookies?


GDPR itself doesn't specify cookie use. The "cookie law" is defined in the ePrivacy Directive (2002/58/EC), which is to be replaced by the ePrivacy Regulation, an addendum to GDPR. Actually, it's going to be a much saner approach than the joke that the current "cookie law" is:

"Simpler rules on cookies: the cookie provision, which has resulted in an overload of consent requests for internet users, will be streamlined. The new rule will be more user-friendly as browser settings will provide for an easy way to accept or refuse tracking cookies and other identifiers. The proposal also clarifies that no consent is needed for non-privacy intrusive cookies improving internet experience (e.g. to remember shopping cart history) or cookies used by a website to count the number of visitors."[0]

To answer your question "do we need to get consent from users before we can set any cookies?"

It depends: yes for tracking cookies, no for others. How to tell them apart is another question..

0. https://en.wikipedia.org/wiki/EPrivacy_Regulation_(European_...


Isn't that already an EU requirement?


>The right to be forgotten, becomes one of the hardest challenges because of data immutability. Apache Kafka does not support deleting records, and although some eventual deletion is supported, it requires

This always seemed like an incredibly toxic decision to me. It's one that crops up in all sorts of systems, large and small. What, none of these people /ever/ foresaw the need to delete some data?


As with everything in life, to gain something you need to sacrifice something else. With an RDBMS you get mutability, but to go 10x or 100x faster or larger you need to make hard decisions.

HDFS, S3 and other systems have immutability built in. Immutability is not bad per se, as it gives (some) assurance that data has not been tampered with, and although mutability could be implemented, the system cost could be significant.

Striking the right balance is the challenge.


It's not that simple. For example, in my business we may give some money to help someone "once in their life" (the law says so). Therefore, if the person asks to be deleted, we might no longer be able to apply the law, because we won't remember the decision... I think GDPR is a good thing, but at some point, in my business, those who write the laws will have to be aware of this (and the legal team is miles away from the IT stuff, sadly).


The GDPR offers exceptions to the right to erasure; these mostly cover legal compliance (banks), the interest of legal claims, or cases where data cannot easily be deleted as individual records. It also does not affect any non-digital documents which aren't filed. This is all laid out very thoroughly in the legal documents relating to this.


I must admit I didn't read the section about removal thoroughly. But I did read the articles about the "categories of data", which are the major pain point right now 'cos they force you to, well, find appropriate categories of data. It's a very interesting thing to do but, in my organization, it leads to many loooong discussions :-)


GDPR has an exemption related to the legal requirement to process data that might cover this (and related) scenarios.

> ...(unless) processing is necessary for compliance with a legal obligation to which the controller is subject;


Does this mean that someone can game 1-time special offers by repeatedly signing up and then demanding to be forgotten?

There's probably no legal obligation to enforce once-only cashback sign-up offers, so the right to be forgotten would presumably have to be followed.


There is an exception category for “legitimate business interest” so we’ll probably have to wait and see what the courts have to say.


Where integrity matters, you never want data to be mutated with no trace. An audit trail is almost always needed - it’s not an extreme leap, then, to say “why don’t we just replay the audit trail to arrive at the current state?”
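As a toy illustration of "replaying the audit trail", assume each entry is a `(kind, key, value)` tuple. The event shapes and the `apply` function are made up for this sketch:

```python
from functools import reduce


def apply(state, event):
    """Fold one audit-trail entry into the state without mutating the input."""
    kind, key, value = event
    new_state = dict(state)
    if kind == "set":
        new_state[key] = value
    elif kind == "delete":
        new_state.pop(key, None)
    return new_state


# Append-only trail: nothing is ever edited in place.
audit_trail = [
    ("set", "alice", {"plan": "free"}),
    ("set", "bob", {"plan": "pro"}),
    ("set", "alice", {"plan": "pro"}),
    ("delete", "bob", None),
]

# Current state is just the trail replayed from the beginning.
current = reduce(apply, audit_trail, {})
```

Here `current` ends up holding only Alice's latest record, yet Bob's earlier `set` entry is still sitting in `audit_trail`. That is exactly the tension with erasure requests this thread is about.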


>What, none of these people /ever/ foresaw the need to delete some data?

It's a performance trade off, and not a very surprising one. Hard Disk Drives have always been known to never actually delete data (if you want the data gone, you overwrite it with 0s). It's not unimaginable that this performance trade-off found its way up the stack.

And just like a regular HDD, you can forcibly delete the data, it's just a very expensive operation that isn't needed 95% of the time.


>And just like a regular HDD, you can forcibly delete the data

Except apparently not, because the linked article is literally saying it's not supported.

I get wanting an audit trail, and I get wanting to not delete data if you don't have to for performance reasons, but neither of those things is the same as saying "it's literally not possible to delete stuff".


> Except apparently not, because the linked article is literally saying it's not supported.

Not entirely true. Kafka, out of the box (and as far as I know, I'm no expert) will keep the records for 7 days and delete them afterwards.

Most people I know (including myself) use Kafka to keep records for longer, and a good option is to use what the article describes, which is to compact the logs. In that case the log, after a configured period of time (or when it reaches a configured size), gets compacted: only the latest message with a given id is kept, and all previous messages with the same id get deleted (that's why the process needs a message with a null value to perform the "delete").
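A simplified simulation of that behaviour, to show why the null value matters. This is plain Python rather than Kafka's actual cleaner: the `compact` function is hypothetical and drops null-valued records immediately, whereas real Kafka keeps such tombstones around for a configurable period (`delete.retention.ms`) before removing them.

```python
def compact(log):
    """Keep the latest record per key; drop keys whose latest value is None."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones

    kept = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None  # a null value marks the key for deletion
    ]
    kept.sort(key=lambda item: item[0])  # preserve original log order
    return [(key, value) for _, key, value in kept]


log = [
    ("user-1", "v1"),
    ("user-2", "v1"),
    ("user-1", "v2"),
    ("user-2", None),  # null value: "forget user-2"
    ("user-3", "v1"),
]

compacted = compact(log)
```

After compaction only the latest `user-1` record and the `user-3` record remain. Every earlier `user-2` message is gone along with its null marker.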

Only in the case when you want to keep the data forever and can't use compaction (compaction assumes that your messages always contain the full state of an entity, so the last message will always contain the current state and the previous can be deleted with no side effects), then there's no way to delete a specific message. I'd have to read the exceptions for backups included in GDPR, but you could make the case that, in this case, the Kafka log is maintained only as a backup of the data, to be able to replay it again in case something downstream gets broken.


>Except apparently not, because the linked article is literally saying it's not supported.

I see your point, but you are mistaken (or you took the wrong impression from the article): it is supported, you just wouldn't want to do it day-to-day, and for the context of the article it might as well not exist. In Kafka, if you want to forcibly delete the data, you could simply force topic compaction after a delete. Depending on the size of your data, a regular delete could take hours, which would likely blow the resource usage on any decently sized deployment.

I bring this up because a lot of shiny "BigData" databases use Log Structured Merge Trees, which are immutable and deletes are mostly "soft-deletes" until a "compaction".


What is your thought on this?

> Encrypt with a user specific key when the data enters the log. You can effectively delete all the user specific data by throwing the key away. No tracking down files or reprocessing necessary.

from https://news.ycombinator.com/item?id=15847674


From a technical perspective, that seems like a sensible compromise to me.



