Ask HN: What medical datasets do you need?
223 points by danicgross on April 4, 2017 | 133 comments
We recently announced YC AI (https://blog.ycombinator.com/yc-ai/). This is only the first step. Our long term goal is to democratize AI development. We want to make it easier for startups to compete with the big companies.

One thing large companies have is data. We're experimenting with ways to allow startups to get similar assets, and we're starting with medical data.

If you're working on AI and need medical data, please help us by filling out this form: https://goo.gl/Dr9FzB.




Please oh please, Chronic Fatigue Syndrome. The potential for ML is absolutely huge. Some people have already collected or are working to collect genetic and other data on a large scale [1] [2] [3], so it does exist.

CFS is interesting because:

a) Patients' symptoms appear to fluctuate "randomly" but are actually typically a complex function of genetics, blood markers, exercise, diet, medication and other factors.

b) There is considerable low-hanging fruit for pattern recognition, since despite the prevalence of the disease almost nobody has done serious ML work in this space.

c) Huge market opportunity - prevalence is comparable to HIV, and specialists often cite CFS as causing more disability [4] [5].

[1] http://simmaronresearch.com/

[2] http://www.nova.edu/nim/research/mecfs-genes.html

[3] https://med.stanford.edu/chronicfatiguesyndrome.html

[4] https://consults.blogs.nytimes.com/2009/10/15/readers-ask-a-...

[5] Dr. Daniel Peterson (Introduction to Research and Clinical Conference, Fort Lauderdale, Florida, October 1994; published in JCFS 1995:1:3-4:123-125)


For many autoimmune-related conditions, patients' symptoms also appear to fluctuate randomly, with symptoms such as pain and fatigue coming seemingly out of the blue. This includes chronic diseases such as Lupus, Rheumatoid Arthritis, Fibromyalgia and a long tail of other conditions [1].

People with these lifelong illnesses typically experience a roller-coaster of recurring symptom flare-ups, wreaking havoc with their lives. Yet there are patterns to the flare-ups. This is an opportunity to make a big difference for millions of people [2].

[1] https://www.aarda.org/disease-list/

[2] https://www.aarda.org/autoimmune-information/autoimmune-stat...


The key to making sense of the data for these diseases is a record of patients' symptoms. Assembling useful datasets is not only a question of access, but also of resolving the human factors involved in collecting the essential information from patients.

A major challenge is to get a large number of patients to continuously track their symptoms. Most want to know what’s in it for them. It takes substantial incentives for people to regularly report outcomes and use wearables for data collection. Until we can make the marginal cost hit zero, they need to benefit from their efforts and investment, preferably instantly.


If you are interested in working on ME/CFS, please contact me.


It also goes hand in hand with chronic anxiety and ADHD.


Dermatology, eye conditions, blood cells, tissue, viruses, urine, saliva: everything that can allow an app to give you a first diagnosis before heading to the doctor.

I foresee that in less than ten years we will have a doctor in our pockets. No, it won't cure us and it won't replace a doctor, but it will give us all the information we need to be 99% certain of our condition.

--

Second batch for animals and their conditions.

Third batch, agriculture. Take a pic of a plant and tell me all the info: fertilizers, cultivation, etc. Bonus for pest ID and treatment.

Pocket computers should be able to diagnose every living creature.


While I think it'll happen eventually, medicine is not at all black and white.

Anecdotal, but I had a suspicious mole looked at; the doctor couldn't decide, got a second opinion from their colleague, and he was still only 90% sure. And that's a relatively simple example. Doctors are some of the smartest, hardest-working people in society, and if they can still make mistakes, medical-grade AI is a long way off.


I'm with you, but moles in particular are pretty difficult to identify, especially on a first visit. (One of the 5 major criteria for a melanoma diagnosis is "evolving" which, by definition, requires more than one visit to identify.) Then you have complexities like basal or squamous cell carcinomas, UV induced AK, etc.

The benefit of medical-grade "AI" (in particular, multi-layered convolutional neural networks) is that, given an aggregate of information (say, if you equipped every derm and oncologist with high resolution cameras and a set of parameters to standardize each datum), a trained professional[1] would be able to use that corpus as a very useful resource (used, obviously, in conjunction with their formal training and years of medical experience).

That being said, this is a question you should really be asking those who practice medicine or are actively in research for a living. Go pick up the last year's issues of Nature Methods to see what problems they're encountering, and which technological gaps[2] (if any) they may have where YC AI might be applicable.

----

[1] Needless to say, this is something I'd be very reluctant to release into the populace's hands, lest you have someone whip together some node.js backend and React iOS app :: "Do I have Skin Cancer? $10 to find out!". One shudders to think what sort of hysteria might follow.

[2] You would be surprised at how technologically adept some of the members of those research teams are -- within a month of DeepMind making the headlines on our tech blogosphere, I was seeing RNNs being applied to their industry.


I agree, moles are a special case, and they are hard to diagnose with 100% certainty.

I've had this discussion with my SO (who is a doctor) many times. She strongly believes the breadth+depth of knowledge required for such an AI would be too great, but perhaps it could work as a tool for GPs, or for specific, easy ailments (i.e. telling a patient what they don't have, and whether it's serious enough to raise the issue with an actual doctor).


"Medical-grade AI" has been possible for some diagnosing tasks since 1968. A simple linear model that takes in some features from an X-ray was able to outperform the best doctor in one study: https://goo.gl/yMP7sU


I think you will see much less "disagreement" between AIs than between doctors. I.e., subjectivity of diagnosis can be much more easily accounted for in an algorithm than in a person...


If you train all of them on the same data, you will get similar answers. That doesn't mean those answers will be more right than those of a less-sure doctor.


Indeed. However this post is a "request for data" and I assume the algorithms of the future will be trained adequately :)


I suppose all algorithms will be trained on all available data, making the results pretty similar.


One potential problem with this: the question of liability, i.e. who is responsible for diagnostic accuracy? In the case of some "Lab on a Chip" device providing a patient directly with diagnostic information without the vetting of a human doctor, liability would sit with the company.

IBM's Watson at the MD Anderson Cancer Center did not work out very well for them. In other words, using AI in the realm of medical diagnostics is very difficult.


What about the treatment side? Once you have a diagnosis, could we use AI to review the patient's medical record, compare outcomes of past patients with the same diagnosis and similar histories, and suggest adjustments of personalized treatments to optimize outcomes?

Overall, of course, you're right. Liability is the problem with my suggestion. Doctors prescribe to treat, but they also prescribe to meet the legally mandated standard of care and to minimize second-guessing later. Looking at each patient as a unique snowflake -- or at least as part of a thinner-sliced group -- helps with the first goal, but directly undercuts the second. Such an approach would probably need to originate outside the U.S.


Fair points.

Extracting data from the EMR is very difficult because EMRs were originally intended only as a place to store data; they were not designed to output data back to a user.


I agree that the current excitement over medical AI seems wildly optimistic, but the problems with Watson at MD Anderson were more due to gross mismanagement on the MD Anderson side. https://arstechnica.com/science/2017/02/ibms-watson-proves-u...


This seems like an easy question to answer. The liability lies with the user until they seek the professional advice of a doctor.

Say a user decides to self-diagnose and damages themselves (either through inaction or self-medication or whatever). What difference does it make whether they diagnosed themselves by browsing symptoms on Wikipedia or by using an advanced diagnosis AI on their phone?


It's plausible that providing an advanced diagnosis AI on the phone (unlike purely passive information, e.g. Wikipedia) falls under existing laws that regulate medical services, and would make the provider liable for violating these regulations even if no one has damaged themselves yet.


Watson at MDACC isn't a good example; it's not the technology that's the (main) problem.


> everything that can allow an app to give you a first diagnosis before heading to the doctor.

Why do you want to have the diagnosis before seeing a physician? (I'm legitimately interested in the answer.)

In my experience, it's about 50-50 between helping and hurting when it comes to making the correct diagnosis and providing accurate treatment.


Maybe the fact that you know your diagnosis is not the important part. If the app can verify the diagnostic data is accurate (e.g., temperature, AI image recognition of visible indicators like rashes and pupil dilation, audio analysis of heartbeat and breathing) then it can save the provider time taking these metrics in the office. Case notes are full of "subjective temp" vs "objective temp," meaning, "what the patient reported" and "what I measured on the visit." Successfully removing this distinction would be difficult, but would save provider time and paperwork. All this info could be sent directly to the doctor for diagnosis. (Possibly with AI giving a suggestion?)


Seeing a doctor can be expensive and inconvenient (another form of expensive).


I can understand inconvenient, but expensive?

Even in the US where you actually have to pay for medical stuff, it can't be more than $10 or so for a quick consultation?


Not everyone has that kind of insurance.

You also have no idea of the cost of various diagnostic actions they might take. I once had blood drawn and got charged $600 for it because the lab they chose was out of network. I didn't even know there was a cost, let alone which labs to request.


$150 to have a nurse put a popsicle stick on my tongue and tell me that my throat infection would probably go away on its own.


> $150 to have a nurse put a popsicle stick on my tongue and tell me that my throat infection would probably go away on its own.

Or you could see it this way :

$150 to have a nurse put a popsicle stick on your tongue and tell you that you are very lucky to have presented today because you have a suspicious growth - which on examination turns out to be a rare form of cancer that is easily cured if detected early, but most certainly fatal otherwise


Depends on whether you count opportunity costs under 'inconvenient' or 'expensive', I guess. If you have to take time off work to travel to a doctor, wait for an appointment, etc, that's all costs.


>it can't be more than $10 or so for a quick consultation

It's times like this I realize just how different the rest of the world is.


Dunno about the US, but in Ireland it is EUR60 for a 10-minute GP visit.


> Second batch for animals and their conditions.

We can learn a lot from animals, and clearly the data should not be protected as much as for humans. Therefore, I'm wondering, where is all the animal data? Could we already start to build models using that data? This could be useful for researching conditions that are more difficult to get data for in the case of humans.


> to have a 99% certainty of our condition

That seems way high. Correct diagnosis by doctors isn't 99%, or even close.


Yup. I think it can give you 99% certainty about what condition you don't have, but not what you do have. Especially not without a physical examination.


FYI, as far as I know, the Harvard Personal Genome Project is one of the only publicly available resources that has whole genome (and other) data along with health record information available for free use (CC0 licensed) [1]. Open Humans [2] and OpenSNP [3] have data along with various degrees of health record and phenotype information as well.

[1] http://www.personalgenomes.org/

[2] https://www.openhumans.org/

[3] https://opensnp.org/



Any tagged data sets, like CCDs with SNOMED/LOINC encoding. Basically anything serialized in HL7/FHIR for a large enough population, longitudinally. What matters is the time-oriented set of population data for a region, like a major health center over a period of five to ten years or better.
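
For anyone unfamiliar with what that looks like on the wire, here is a toy LOINC-coded FHIR Observation built in Python; the values are illustrative, not from a real record:

  import json

  observation = {
      "resourceType": "Observation",
      "status": "final",
      "code": {
          "coding": [{
              "system": "http://loinc.org",
              "code": "2339-0",  # LOINC: Glucose [Mass/volume] in Blood
              "display": "Glucose [Mass/volume] in Blood",
          }]
      },
      "subject": {"reference": "Patient/example"},  # hypothetical patient id
      "effectiveDateTime": "2017-04-04T09:30:00Z",
      "valueQuantity": {"value": 95, "unit": "mg/dL"},
  }
  print(json.dumps(observation, indent=2))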


Yes this ^^

Sources like MIMIC are certainly interesting and valuable, but it'd be great to get longitudinal records spanning years of coverage.

https://mimic.physionet.org/


I agree, FHIR serialized data would be great, it's easy to move it around.


Yeah, you could do some great work with this data. Pre-diabetes identification springs to mind.


IBS data. Since the DoD threw money at this problem after the Iraq war, we've discovered that IBS occurs in about 1 in 10 people who have had food poisoning. This is the biggest advancement made in the field in decades. We are close to putting this to bed but just need more data.


Costs of all the procedures for each hospital. Whatever people get charged.


Medicare provider utilization and payment data (which is a large percentage of the total market in the United States) has been publicly available[1] for several years now. The Wall Street Journal won a Pulitzer Prize[2] for the analysis they did of the public data sets.

[1]: http://graphics.wsj.com/medicare-billing/

[2]: https://www.wsj.com/articles/wsj-new-york-times-win-pulitzer...


That's Medicare only, unfortunately. I have that dataset.


Second this, have been trying to get my hands on this data for a while... NOT EASY :D


You are asking for the hospital master charge sheet. Impossible to get. I tried to get around this, and partially succeeded. Happy to share more details if needed.



California made all charge masters public: https://www.oshpd.ca.gov/chargemaster/


I won't comment on what, but on how:

- If the datasets are imaging, there should be enough per class for typical ML techniques. Otherwise you just get people over-fitting models on sets of 500 images and the illusion of progress.

- I'm quite happy with the Kaggle datasets generally, but why do others make consuming data so difficult? Heck, if we've already received the data, let's just take it the last mile and make it consumable, with obvious labeling, standard formats, etc. This is such a pet peeve of mine that, if you need help taking datasets to the last mile, I'm volunteering; ask me to help make it presentable. Ideally it should be pull-able via curl etc., unzippable, and able to go into a pipeline without manual effort -- something like the sketch below.
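
Concretely, "last mile" consumable means URL-to-dataframe in a few lines. A sketch in Python; the URL and file name are hypothetical:

  import io, urllib.request, zipfile
  import pandas as pd

  URL = "https://example.org/datasets/imaging-labels-v1.zip"  # hypothetical

  with urllib.request.urlopen(URL) as resp:
      archive = zipfile.ZipFile(io.BytesIO(resp.read()))

  # A well-packaged dataset ships a labels.csv with documented columns.
  labels = pd.read_csv(archive.open("labels.csv"))
  print(labels.head())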


Re imaging, throw away the community hospital crap that IBM's been peddling. We need quality imaging, with diagnostics, with followup data, from major quaternary care research centers.


For those interested in global health, we've tried to collate as much data as possible at http://ghdx.healthdata.org/ (disclosure: I'm the director of data science at IHME, which hosts this).

Note that most of this data is population level epidemiologic and administrative stuff, not the detailed biomedical measurements I see most people requesting - but I promise you there's some really interesting things that can be done with it nonetheless!


An awful lot of medical data is complex.

Here is what you really want: large amounts of curated/quality-controlled data with ground truth that you can aggregate and share, preferably with multiple studies and time points and/or followup. That is stated in rough order of difficulty to acquire.

Here is what you typically get fed into a learning pipeline: data 1-2 orders of magnitude too small, with all kinds of noise, and no truth data (i.e. at best a bad proxy).

Hand-waving about unsupervised learning won't solve many of the really difficult problems (although it has uses, obviously). Neither will hand-waving about transfer learning. In some areas most retrospective data sets will never be really available because of consenting issues. QA is hard - the sheer variability of clinical systems in the field, not to mention protocol and practice differences, is often astonishing.

So where does that leave us? To make a real dent fast, I suspect you need to focus on data availability, not the problem. Ask the question:

What are the fastest paths to collecting large volumes of clinically representative data with some QA in place, consented for the ways we want to use it, and with real clinical truth or a decent proxy we can get at in an automated or semi-automated fashion? 1000 bonus points if real outcome data will be available in the future.


I'm the cofounder of a startup building a new EHR to help solve this problem (we just applied to YCS17).

We will use NLP and AI to provide structured data from unstructured medical data (encounter notes, etc...) stored in the EHR for both analysis and integration. For example, one of our partners right now wants to integrate directly into our EHR in order to run computer vision algorithms on top of uploaded eye exam images in order to help diagnose eye diseases. We give them access to the eye image and other patient data, including the encounter, diagnoses, etc. After they have trained their algorithms, we then allow them to hook directly into the encounter workflow to send alerts live to the doctors during the appointment. We want to be a platform to help other startups and researchers connect with medical data both for analysis and also to help make a meaningful impact directly to doctors' workflows and patient care.

We would love to help out and/or learn about any use cases others might have requiring medical data. If you would like medical data or want to integrate directly into doctors' workflows in their EHRs based on NLP/AI hooks, we would love to hear from you. You can reach out to me directly at ginn@stanford.edu
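
To make the NLP step concrete, here is the general shape of entity extraction from a free-text encounter note, sketched with off-the-shelf spaCy rather than our actual pipeline (a clinical model such as scispaCy would fit better; the note is invented):

  import spacy

  nlp = spacy.load("en_core_web_sm")
  note = ("Patient presents with blurred vision in the left eye for two weeks. "
          "History of type 2 diabetes. Fundus exam shows scattered microaneurysms.")

  # Print every entity the model finds, with its label.
  for ent in nlp(note).ents:
      print(ent.text, ent.label_)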


Anonymized patient records, preferably with information about the doctor performing the diagnosis as well. I've only been able to find small datasets of some tens of thousands of records; I would like tens of millions. You can't learn much from what amounts to one small town's medical records, whether in terms of finding accurate diagnoses or identifying the places and situations that produce better doctors.


>Anonymized patient records, preferably with information about the doctor performing the diagnosis as well.

"Anonymized" is more like it ...


Health and doctor visit information which has been cross-correlated with food purchases and exercise type and frequency.

Right now we have no way of determining which interactions lead to which conditions, so we generalise based on the 3 inputs independently, when in reality it is perfectly normal to eat more when doing lots of exercise, or need doctor visits when doing exercise with inadequate nutrition.


We are working on this from another angle - sequencing plants to map out nutrient biosynthesis pathways. Then determining how those nutrients affect human health.

With that info we can start doing "personalized nutrition", such as (totally made up): "If you have diabetes, then you should eat more broccoli and radishes because they have nutrients that mediate sugar uptake."


OP: You need more doctor data.

Given you have surgeon [x], what are the odds of a successful surgery with [x]? THIS is the guarded secret -- yet the most valuable.

If you have medical data (or want to be a cofounder), please email me: ransom1538 at gmail.com. A prototype that surfaces data on this very question: https://www.opendoctor.io


In the UK, under the NHS, summary data regarding a surgeon's performance is published [1].

There is an interesting debate about whether this is a good thing or not. One argument is that it improves transparency and allows patients to make a better, more informed choice of who operates on them.

The counterargument is that most patients don't understand that there is an element of probability distribution involved. Perhaps more importantly, the thought process of a surgeon may change from "performing surgeries to the best of my abilities and knowing I will lose my job if I am dangerous" to "all my results are published for public scrutiny, so I need a survival rate as close to 100% as possible, as that is all the public comprehend". That may lead to surgeons only being willing to take on cases which are very likely to be successful, since taking a difficult or last-chance case carries a high probability of mortality and will therefore hurt their numbers. This would be a loss for many people. I don't know if any research has been done to determine whether that has borne out or not, though.

[1] https://www.nhs.uk/service-search/Consultants/performanceind...


> Given you have surgeon [x] what are odds of a successful surgery with [x]. THIS is the guarded secret -- yet the most valuable.

What is a successful surgery? One with the least complications? No complications? Shortest recovery time? Best return to function? Best outcome for that particular patient?

It's not an easy question to answer.


this is pretty cool. where did you pull this data from? is it publicly available?


Blood glucose levels as time series from continuous glucose monitors, with additional tagged data like food intake, exercise, and sleep. Each record needs the obvious human data: sex, age, weight, nationality/ethnicity, type 1 or type 2 diabetes.
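
Roughly, one record in such a dataset could look like this; the field names are my own guess, not an existing standard:

  from dataclasses import dataclass
  from datetime import datetime
  from typing import List, Tuple

  @dataclass
  class CGMRecord:
      subject_id: str
      sex: str
      age: int
      weight_kg: float
      ethnicity: str
      diabetes_type: int                      # 1 or 2
      glucose: List[Tuple[datetime, float]]   # (timestamp, mg/dL), ~5-min intervals
      meals: List[Tuple[datetime, str]]       # tagged food intake
      exercise: List[Tuple[datetime, str]]    # tagged activity
      sleep: List[Tuple[datetime, datetime]]  # (start, end)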


This is a big ask and I'm not working in AI, but I'd like as comprehensive a list as possible of treatments offered at facilities, with pricing. I'd build a website that looks up the treatment you require and compares the estimated cost of travel to each location that offers it (keeping in mind exchange rates) to find the lowest total price -- see the sketch below. This data should be global to be as effective as possible.

It could even offer suggestions like "Spend $200 more and recover on a island paradise!"

If globalism is good for low-wage workers, it certainly should be good for the medical profession.
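
The core comparison is trivial once the data exists. A sketch; all prices, exchange rates and facilities are made up:

  def total_cost_usd(local_price, fx_to_usd, travel_usd):
      return local_price * fx_to_usd + travel_usd

  facilities = [
      # (name, price in local currency, local->USD rate, est. travel cost in USD)
      ("US clinic",        22000.0, 1.000,     0.0),
      ("Indian hospital", 260000.0, 0.015, 1400.0),  # price in INR
      ("Thai hospital",   320000.0, 0.029, 1200.0),  # price in THB
  ]

  for name, price, fx, travel in sorted(
          facilities, key=lambda f: total_cost_usd(f[1], f[2], f[3])):
      print(f"{name}: ${total_cost_usd(price, fx, travel):,.0f}")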


Hormone problems are extremely difficult to tease out.

In my opinion, large datasets testing wide spectrums of hormones in a large population, tagged with any diagnosed endocrinological condition, would be extremely valuable. I bet that with this information we could learn a lot without conducting actual physical studies, simply by sectioning the data appropriately.

I'm not a doctor though, so I don't know exactly what would need to be recorded, but having dealt with bizarre endocrine disorders that doctors don't really have any answers to, my gut feeling is that such a data set would be incredibly useful.


Trauma is the leading cause of mortality for people under 40 years old in the US, yet it is very poorly funded in terms of research dollars compared to things like cancer, HIV, etc.

Datasets are limited and expansion with AI would be huge.

One specific application: determining the cost-effectiveness of placing tourniquets in public places, much like the idea of having defibrillators at the mall. And funding community training; see the "Stop the Bleed" campaign.


Anything I can drill into on bipolar disorder: treatment, outcome and quality of life. I came across this https://blog.23andme.com/23andme-research/what-patients-say-... in 2013. Most studies are qualitative, not quantitative, and the data is not released.


+1


Lyme disease frequency, given that the CDC grossly underestimated it (and admitted to that) [1].

And frequency by state, especially in the Western States where it is under-diagnosed.

[1] http://www.cbsnews.com/news/cdc-lyme-disease-rates-10-times-...


This is an excellent idea! Lyme disease has more uncertainty about it than most people realize.


So, the last startup I was a fulltime engineer at actually worked in this area.

What I would suggest to be maximally useful would be to focus on physiological data: EKG/ECG, EMG, glucose, SpO2, maybe various blood work counts.

All of those are data that are both well understood and thrown away regularly, and if fed into a computer with modern ML methods, we could maybe see some really cool stuff.

I'd suggest staying away from unstructured data and things that are primarily of interest to only the business side of healthcare--insurance figures, billing codes, EMR/EHR shit.

If you really wanted to get in there, putting up a minimal and standardized format for representing labs and medications would go a looooong way.
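
For instance, even something this small would already help; the field names are a guess, not an existing spec:

  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class LabResult:
      loinc_code: str        # what was measured, e.g. "718-7" (hemoglobin)
      value: float
      unit: str              # e.g. "g/dL"
      observed_at: str       # ISO 8601 timestamp
      reference_low: Optional[float] = None
      reference_high: Optional[float] = None

  @dataclass
  class Medication:
      rxnorm_code: str       # what was prescribed
      dose: str              # e.g. "500 mg"
      frequency: str         # e.g. "twice daily"
      started_at: str        # ISO 8601 timestamp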

~

The problem in healthcare isn't the medical stuff--it's that people get bogged down in the inefficiencies of the system and zoom off solving problems that are removed from the immediate task of "what the fuck is wrong with this patient from the instruments I have at hand?"


> I'd suggest staying away from unstructured data and things that are primarily of interest to only the business side of healthcare--insurance figures, billing codes, EMR/EHR shit.

I would argue these are the most important areas to target. We need tremendous reform in this area, and if we can demonstrate meaningful improvement over what we have now, maybe we can let doctors get back to actually treating patients and not spending the majority of their time checking boxes on a computer.


Curious why you say stay away from the billing codes and EHR side? That is one area I'm looking at that is ripe for AI and deep learning to optimize.


SNOMED, LOINC, CPT, ICD9, ICD10, gender, race and ethnicity codes. On top of that, getting all the CCD section-specific OIDs.


I did some investigation into prescription data; however, prescription data is usually aggregated at the surgery level. Also, the reason for prescribing the drug (even at a high general level) is not recorded. If prescription data were available at the LSOA level (http://www.datadictionary.nhs.uk/data_dictionary/nhs_busines...) then you would be able to study epidemiology and potentially identify urban/rural areas where certain diseases are prevalent.


Combined phenotype/genotype datasets. These are (with some good reason) very difficult for anyone outside the medical-research establishment to get access to, but the net result is that it creates market barriers supporting the existing big players.


Not sure if this is exactly what you're after, but Gene2Phenotype has some genetic association data: https://www.ebi.ac.uk/gene2phenotype/downloads


These are quite difficult to get (and very expensive to produce) even inside the medical-research establishment. Which big players are you referring to?


Large research universities and hospitals, Pharma, and, of course, the Broad.

I understand pharma not sharing but much of the hospital/broad/university data is produced with (at least some) public money.


(Breast) Cancer biopsies, with histology and outcome reports.

While it isn't my research project, I've been trying to use computer vision and some naive AI to identify early breast cancer lesions in images from mouse tissue, with mixed success, but it's something that could very much be accelerated with a large human dataset with outcomes.

(If you work in the field and want to help/hire me with/for something like this, kindly send a message to hn AT naj-p.com)

There are understandably some ethical guidelines that need to be worked out for this sort of thing, but seeing as there are public repositories of not-so-dissimilar information (e.g. mammograms), it should be workable.


You're probably aware, but CAD is a staple of modern mammo interpretation workflows. Products like Hologic ImageChecker CAD.


Yup :) Though I'm more interested in biopsy information, because biopsies give a better understanding of cellular architecture and, if they're stained against markers, the molecular biology of the cancer.

Mammogram analysis is an essential first-line, but I think doctors need better insight in treatment options and finer stratification. An early Atypical Ductal Hyperplasia, for example, is usually treated as pre-pre-cancer, but we might be able to identify a subtype of these lesions that progress on to more aggressive stages.


Which was shipped in the 90s, well before Hologic acquired it.


STDs by congress person


I'm wondering how an automated diagnosis could work in practice.

The data probably contains a number of symptoms or measurements (bloodwork), and a diagnosis by a doctor.

I can see how you can train a deep-learning model for that.

What if the patient is prescribed medication? Is the condition of the patient over time (after giving the medication) tracked by doctors?

Personally, I have found that once a doctor prescribes me some medication, he never asks me how things are going (except maybe once). So how accurate can the data be?


Images. Brain scans. Mammograms. Eye scans.

Patient history would help too. (I know there's HIPAA to comply with, but as much as we can get can help train better classifiers.)





I really think having price transparency across providers for both medical treatments and for medicine would be a game changer for the industry.


I'm not sure such a thing exists: "large companies with lots of medical data". Medical data is often confidential and belongs to hospitals.


I used to work for a medical lab and worked on a lot of projects that involved aggregating and cleaning medical data to sell. Often it would just go to pharma companies so they could target the best places to sell their drugs.

Anyway, it's completely legal. You just have to scrub the data pretty thoroughly before you sell it.


I think health insurers know every little medical detail about you from every interaction you have with the health care system that they process a claim for.

The dataset spanning all of them is likely to be in the tens or hundreds of TB range, if not PB.


I've worked in a large company that has "lots of medical data" which we used for analytics purposes. What we would do is "Anonymize" the data. Unfortunately that's a pretty herculean task.


What's involved? It sounds fun.


So the naive method is to just remove any information such as name or birth date. However, the medical data is still identifiable, so we needed to alter that too. But we did analytics, and altering it might ruin the results. So we needed to design ways to scramble the data so that it wasn't identifiable, but so that in aggregate the results stayed the same.
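
The simplest flavor of that is additive zero-mean noise: individual rows change, but aggregates survive. A toy example (not our actual scheme):

  import random

  ages = [34, 58, 41, 67, 29, 50, 45, 62, 38, 55]
  noisy = [a + random.gauss(0, 5) for a in ages]  # zero-mean Gaussian noise

  print(sum(ages) / len(ages))    # true mean
  print(sum(noisy) / len(noisy))  # close to the true mean, closer as n grows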


Did you use something like Differential Privacy?


I would be very interested to know if anyone is using that in practice. It seems like a promising concept.


Apparently, DP has some detractors. I was told by my signal processing professor that differential privacy wasn't really a solution for privacy preserving data analysis. He said something along these lines: "if I know something about the underlying data distribution (Gaussian, etc.), it is possible to wash out the randomness."

Now, I don't understand DP well enough and information theory/signal processing still seems a bit like "dragons be here" to me. But, I want to take a stab at trying to reason why he said that.

For example, take randomized response (the only DP technique I understand). It is vulnerable to a longitudinal attack: a person can query repeatedly to wash out the randomness. If you think about it, isn't that almost the inverse of a repetition code (error correction)? There, you're trying to use redundancy (repetition) to remove noise.
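
To make that concrete, here is randomized response plus the longitudinal attack in a few lines of Python:

  import random

  def randomized_response(truth):
      # Flip a coin: heads -> answer truthfully; tails -> report a second coin flip.
      if random.random() < 0.5:
          return truth
      return random.random() < 0.5

  def estimate_true_rate(responses):
      # E[yes] = 0.5*p + 0.25, so p = 2*(mean - 0.25).
      return 2 * (sum(responses) / len(responses) - 0.25)

  # 30% of 100,000 people truly have the sensitive attribute.
  population = [random.random() < 0.3 for _ in range(100_000)]
  print(estimate_true_rate([randomized_response(t) for t in population]))  # ~0.30

  # The longitudinal attack: ask the SAME person repeatedly and the noise
  # averages out -- exactly like decoding a repetition code.
  repeats = [randomized_response(True) for _ in range(1000)]
  print(sum(repeats) / 1000)  # ~0.75 if True, ~0.25 if False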


You're right that with repeated sampling you can learn more about the data set. If I understand it correctly, I think the solution to that in DP is that you have a limit to how many times you can make your repeated query before you have spent your privacy budget and are cut off. The idea is that for a given budget level you are limited to fewer queries than you would need to learn enough to de-anonymize the data set.

If your signal processing professor was already taking that into account then I would be curious to know how that attack would work.


In the US hospital systems are increasingly huge companies.


Drug molecule datasets would be an absolute boon.


I work in respiratory therapy. I would like real-time ventilator telemetry data: volume, flow, pressure, SpO2. Alarms. Setting changes to medical devices (ventilators specifically). The condition requiring ventilation (ARDS, COPD, premature birth, etc.). Clinical assessment of patient outcome.


I'm neither working on AI nor a medical expert, but it would be nice to have a dataset with pictures of suspected melanomas, labeled as cancerous or not, to build an app similar to https://skinvision.com/.


We have over a billion data points at http://semantic.md with high value context that we use to service companies in the space.

Would be exciting/somewhat disruptive if YC democratized access to it.


EKG data. I've never had an EKG done where the computer was even close to predicting correctly what was going on. As someone who does DSP, this is not that difficult of a problem given an RNN and lots of data.
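
A sketch of the kind of model I mean, assuming fixed-length single-lead windows labeled with a rhythm class; shapes and class count are placeholders:

  import tensorflow as tf

  SAMPLES_PER_WINDOW = 3000  # e.g. 10 s at 300 Hz
  N_CLASSES = 4              # e.g. normal, AF, other, noisy

  model = tf.keras.Sequential([
      tf.keras.layers.Conv1D(32, 16, strides=4, activation="relu",
                             input_shape=(SAMPLES_PER_WINDOW, 1)),  # downsample raw signal
      tf.keras.layers.LSTM(64),
      tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
  ])
  model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
  # model.fit(ecg_windows, rhythm_labels, ...) once the data exists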


Decoupling medical notes from billing would relieve a huge burden from the modern practice of medicine. I would like to have a robust set of clinic notes with the corresponding outgoing billing documents.


Structured longitudinal patient data (diagnosis and procedure codes, lab data, step data from Fitbits, etc.), though with AI, unstructured data may become more useful as well. This is probably an opportunity in itself.


This. Inpatient daily notes with charge codes/ICD10.


For my work, I need information on public health in developing countries, especially in Africa. There's a lot of information from WHO, but it's not properly machine readable.


At https://www.aidlab.com we use PhysioNet for our filtration research, but an additional database would be lovely!


A comprehensive listing of foods and the allergies associated with them (a listing of food ingredients with tags like peanuts, shellfish, etc.).
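
Even a toy version shows the shape of the data; these entries are illustrative, not a vetted allergen database:

  ALLERGEN_TAGS = {
      "peanut oil":   {"peanuts"},
      "whey":         {"milk"},
      "soy lecithin": {"soy"},
      "surimi":       {"fish"},
  }

  def allergens_in(ingredients):
      return set().union(*(ALLERGEN_TAGS.get(i.lower(), set()) for i in ingredients))

  print(allergens_in(["Whey", "peanut oil", "salt"]))  # {'milk', 'peanuts'}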


Spirometry data for people with or without respiratory diseases/conditions. For example, healthy male/female data from age 5 to 95.

Then those with diseases or conditions.


Vital signs data during surgical cases, with anesthesia data.


A dataset with all the doctors in the world. And their ranks, if possible. And what they worked as before becoming doctors - paid feature.


Why start with medical data? With HIPAA, it seems (rightfully so) to contain some of the most heavily guarded types of data out there.


Hospitalist/intake physician hospital notes matched to ICD10 codes. Extremely useful and difficult to find.


Illness rates by region to build a consumer facing 'google trends' for health.



Viruses for sure, to curb the next epidemic - and put an end to HIV, HPV and the flu.


Easy: the picnichealth.com database. A curated medical database. This thing is a gold mine.


ECG data at the signal level.


There is some ECG data available on PhysioNet:

https://www.physionet.org/search-results.shtml?q=ECG&sa=Sear...


This is helpful - thank you!


Dermatology is complicated. A labeled image dataset of skin conditions?


Tagged medical notes from Medical Records. The same way there is ImageNet.


Sounds awesome.

Anything geospatial would be superb. Disease transmission for example.


How about pricing data?


Pricing, such as total cost of care at a Covered California region level, is available here: http://costatlas.iha.org/

(I built this tool, so please be nice ;)


I'm interested in retail prices from Blue Shield. For example, their Silver 70 PPO plan, when you get a price quote from Blue Shield, is nearly $140 a month more expensive in San Francisco than in Los Angeles (Beverly Hills).

  SF Premiums - $479
  LA Premiums - $340
I gotta get out of this sanctuary city.


That type of data is hard to get and hard to release publicly. But if you play with the tool, you can see that across both HMOs and PPOs, San Francisco County is way more expensive than LA (even when taking into account a risk adjustment). So this may be indicative of a geographic market issue.

http://costatlas.iha.org/map?m0=TCCCOMP&p0=HMOPOS&m1=TCCCOMP...


Do you mean hospital chargemaster data?


This data doesn't come from Chargemasters - it comes from claims data, aggregated and averaged over all members, so it is real data showing what the average cost per member is over a year.


Blood test reports of all kinds, images of smears.



