I'm not sure if this is relevant, but Ben Goldacre in the UK is working on how to get access to NHS patient data in a privacy-preserving manner [0]. My understanding is that you essentially submit your analysis and it is run against live data, but you only receive summary results. I'm not sure if this could be adapted to training.
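For the curious, the pattern looks roughly like this. A minimal sketch in Python, assuming a hypothetical gatekeeper (run_submitted_analysis and MIN_CELL_COUNT are my own illustrative names, not OpenSAFELY's actual API):

    import pandas as pd

    MIN_CELL_COUNT = 10  # suppress aggregates computed from fewer patients

    def run_submitted_analysis(analysis_fn, patient_df: pd.DataFrame) -> pd.DataFrame:
        """Run researcher-supplied code against live data, but release only
        aggregate results that clear a small-number-suppression threshold."""
        result = analysis_fn(patient_df)  # executes inside the secure environment

        # Only summary tables leave the enclave: require a count column and
        # redact rows derived from too few patients to release safely.
        if "n" not in result.columns:
            raise ValueError("analysis must return aggregates with an 'n' column")
        return result[result["n"] >= MIN_CELL_COUNT]

    # Example submission: event rates by age band, never row-level data.
    def my_analysis(df: pd.DataFrame) -> pd.DataFrame:
        grouped = df.groupby("age_band").agg(n=("patient_id", "count"),
                                             event_rate=("had_event", "mean"))
        return grouped.reset_index()

The point is that row-level data never leaves the enclave; only thresholded aggregates do.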
The problem with this approach is that it's very hard to do data science with messy clinical data when you have no mechanism for investigating the data yourself.
Yep, I'm in the rare disease space. "impossible" is pretty appropriate.
It's tricky. On the one hand, it's obviously not appropriate to be flippant about patient privacy. On the other, it's clear that advances in human health are being hindered by our current approach to (dis)allowing researchers access to data.
For me it's a situation of "once bitten, twice shy". What are the odds the medical data intended for research will be handled correctly and not used outside of its intended purpose?
What are the potential downsides to misuse of health data? Genuinely asking - I'm not sure what someone malicious would do with my health records, especially if it's anonymized.
Insurance companies refusing to pay because of $reason based on deanonymized data. Ad companies or big pharma bombarding you with pitches because they want to sell more pills. Blackmail because you have an embarrassing disease.
There's a lot of money being spent on deanonymizing data, and with that much incentive I would never count on any dataset remaining anonymous.
There are several examples of anonymized data turning out to be not so anonymous after all and being traced back to individuals. As for what someone malicious could do with health records: you have a contingent of people in some states hunting down women for having abortions, so that might be something you don't want getting out there. Or you might live in a very religious area and not want people finding out you're getting AIDS treatment.
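To make the re-identification point concrete: the classic Sweeney-style attack is just a join on quasi-identifiers. A toy sketch with entirely synthetic data (column names are my assumptions):

    import pandas as pd

    # An "anonymized" health table: names removed, quasi-identifiers kept.
    anonymized_health = pd.DataFrame({
        "zip": ["02139", "02139", "90210"],
        "birth_date": ["1965-07-31", "1971-02-14", "1980-01-01"],
        "sex": ["F", "F", "M"],
        "diagnosis": ["hypertension", "HIV", "asthma"],
    })

    # A public record, e.g. a voter roll or data-broker file.
    public_records = pd.DataFrame({
        "name": ["Alice Example", "Bob Example"],
        "zip": ["02139", "90210"],
        "birth_date": ["1971-02-14", "1980-01-01"],
        "sex": ["F", "M"],
    })

    # The join re-attaches names to diagnoses wherever the quasi-identifier
    # combination is unique -- no hacking required, just a merge.
    reidentified = anonymized_health.merge(
        public_records, on=["zip", "birth_date", "sex"])
    print(reidentified[["name", "diagnosis"]])

With just three quasi-identifiers, two of the three "anonymous" rows come back with names attached.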
I understand the theoretical concerns in these cases, but IMO it does not weigh heavily against the (conservatively) hundreds of thousands of annual deaths due to hindered medical research.
It's hard to overstate how impossible it is to do even basic research across institutional health datasets, even if you're a giant organization with a compliance team. It's soul-draining, and frankly it's the reason a lot of smart people jump ship and work in finance or crypto or whatever, where you can accomplish something even if it's goofy.
You're not addressing the root concern which is that healthcare is notoriously insecure. Approaching this as "who cares if things get leaked" instead of improving security of records is why getting data is impossible.
Whether it's my biggest concern or yours is irrelevant. Patient data security is why you can't get the data you say you need. That is just a fact, and I would think energy is better spent improving the handling of patient data if you want easier access to that data for research purposes.
Those are both theoretical examples of what people might want to do with re-identified medical data. They are not demonstrated harms of things that happened in real life.
Develop tools for health insurance companies to abuse patients. Instead of denying coverage to patients based on real life symptoms, they can deny coverage due to model outputs that are “based” on real life data.
Since these models are black boxes, it’s easy to hide biases within them.
Or worse: people similar to you have been shown to develop certain conditions, so we're going to charge you now for what we think you might develop later.
It's the same negative attached to pre-crime in policing: because people who wear the same clothes, drive the same car, listen to the same music, and share other sames have committed crimes, we think you will too. Someday.
In the USA, health plans aren't allowed to deny coverage to patients based on genetics or pre-existing conditions. They aren't stupid enough to try to break those laws. Employees can't keep a secret. And most of the claims costs are directly passed on to employers (group buyers) anyway, so the major health insurance companies have little direct incentive to deny coverage; with minimum limits on the medical loss ratio it's rather the opposite.
Companies can just not hire you for "culture fit" or some other reason based on leaked data about your health problems, in order to keep their premiums low or just to avoid hiring certain types of people (fill in the blank here).
And there's the problem: in theory this protection is possible, but in reality there is no such thing.
You also assume that your data will be correct. If data integrity were so easy and common, people wouldn't be encouraged to repeatedly check their credit reports for mistakes.
One bit flip and you go from "perfectly healthy" to "about to die" and suddenly you can't get life insurance, your credit score tanks and you can't get a job.
The downsides are there, of course, and other users have already given you the theoretical risks. Unfortunately the discussion only ever centers on the downsides, with fear-mongering aplenty, rather than treating the situation the same as any other situation in life: a risk-benefit trade-off.
"Your OneMedical account by Meta-Amazon LLC has been deactivated due to suspicious activity based on analysis of your genome and online browsing habits. Please proceed to the nearest fresh location for mandatory euthanasia."
If a researcher can get the data, then so could someone else with less altruistic motives. So the good actor is slowed down because of the bad actor. Unfortunately, there's very little way to prove the good actor is good and not crossing their fingers behind their back.
I interviewed with a company that built an Electronic Medical Records system optimized for cancer treatment, licensed it cheaply, then took the data and paid medical professionals to spend time normalizing the free-text notes from the doctors (this was about a decade ago, when LLMs were not really a thing, so they used trained professionals instead). It was all a play to get the level of data they wanted for training AI to make new discoveries.
They got acquired by a major pharmaceutical company, but I don't know of any major discoveries made from their data. Because even with real data this is a hard problem.