
I worked on healthcare ML solutions, as part of my PhD & also as a consultant to a telemedicine company.

My experience in dealing with the data (we had a sufficient, somewhat well-labeled amount) & methods made me realize that a lot of the predictions human doctors make are multimodal - and that is something deep learning will struggle with for the time being. For example, say in detecting a disease X, physicians factor in blood work, family history, imaging, racial genealogy, general symptoms (like hoarseness, gait, sweating, etc.), sometimes even the texture & palpation of affected regions, before narrowing down on a set of assessments & making diagnostic decisions.

If we just add more dimensions of data to the model, it only makes the search space sparser, not easier. Throwing in more data will likely just fit the more common patterns & classes well, whereas rarer combinations of symptoms may be treated as outliers and mispredicted.
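Roughly the effect I mean, in a minimal NumPy sketch with made-up numbers: keep the number of patients fixed and keep adding feature dimensions, and every patient's nearest neighbour drifts further away - the data gets sparser, not richer.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000                              # fixed number of "patients"

    for d in (2, 10, 50, 200):            # growing number of feature dimensions
        X = rng.uniform(size=(n, d))
        sq = (X ** 2).sum(axis=1)
        dist = np.sqrt(np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None))
        np.fill_diagonal(dist, np.inf)    # ignore self-distances
        print(f"{d:3d} dims -> mean nearest-neighbour distance: "
              f"{dist.min(axis=1).mean():.2f}")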

We humans are incredibly good at eliminating factors & doing differential diagnosis. The findings don't surprise me. There is much more ground to be covered. For straightforward conditions with limited, clear-cut symptoms these models are showing promising advancements, but they cannot be trusted across a wide array of diagnoses - especially when models don't know what 'they do not know'.




Are you really sure the doctors are doing a better job when they go through the motions of incorporating a wide range of data? Or do we just convince ourselves they're better?

I suspect we massively underestimate the amount of misdiagnosis due to incorrect analysis of data using fairly naive medical mental models of disease.


> Are you really sure the doctors are doing a better job when they go through the motions of incorporating a wide range of data? Or do we just convince ourselves they're better?

Personal story: I was diagnosed with a rare genetic disease in 2019. If I ran the symptoms through an ML gauntlet, I am sure they would cancel each other out or make little sense: chest CT (clean), fever (high), TB test (negative), latent TB marker (positive), vision difficulty (nothing unusual yet), edema in the eye socket (yes), WBC count (normal), tumors (none), hormones (normal) & retina images (severely abnormal).

My condition was zeroed in on within 5 minutes of a visit to a top retina specialist, after regular ophthalmologists were in a fix over two conflicting conditions. This was based on differential diagnosis, even though the genetic assay hadn't come back yet (it later came back in agreement). I cannot overemphasize how good the human brain is at recalling information & connecting sparse dots into logical conclusions.

(I am one of the 0.003% of unlucky ones among all ophthalmological cases & the only active patient with that affliction in one of the busiest hospitals in the country. My data is part of a 36-person NIH study & ophthalmology residents are routinely called in to see me as a case study when I go in for my quarterly follow-ups.)


How many specialists did you go to before it was identified?

How many other people with the condition were misidentified?

I only say this because of a family member with a rare genetic condition. For years they were told it was something else, or told 'it was in their head'. The family member started keeping a detailed journal of their medical conditions and experiences, then brought that to their primary care physician, who sent them to a specialist; this specialist wasn't sure and sent them to another specialist who had a 3-month wait. After 5+ years of living with increasing severity of the condition, it was identified.

So, just saying, it's just as likely that the condition was identified because you kept a detailed list (on paper or in your mind) of the ailments and presented them in a manner that helped with the final diagnosis.


> How many specialists did you go to before it was identified?

2 ophthalmologists, 1 internal medicine doctor, 1 retina super-specialist & finally someone from USC Davey

> How many other people with the condition were misidentified?

Historical data: I don't know. Cases are fairly evenly divided between two types, one being zoonotic & the other tied to the IL2 gene. I am told this distinction between pathways was identified in 2007.

> [..] you kept a detailed list (on paper or in your mind) of the ailments and presented them in a manner that helped with the final diagnosis.

I might have been a better-informed patient, but I went in with a complaint of pink eye, flu & mild light sensitivity. Never imagined that visit would change my life forever. Thank you though, for expressing your concern & support.


This escalation through a range of increasingly specialized experts is a critical part of a difficult diagnosis. The process brings in different, unique perspectives.


That all sounds shitty but I don't see how that's valuable information. They did eventually solve the problem and there's no comparison to some ML success story.

Human minds can be really good at diagnostics and still fail sometimes when faced with very difficult cases.

In my experience, ML would just classify everything as a very common disease and people would call it a success because it has an 80% effectiveness rate.

The problem that needs to be solved is a case like your example, not diagnosing the common cold.


>They did eventually solve the problem

One of the points to take away is that, whether via ML or human diagnostics, these rare problems are either left undiagnosed for long periods of time, reducing quality of life, or diagnosed posthumously.

Reducing those outcomes is what we should measure when making meat-vs-machine efficiency comparisons.


Maybe I'm reading it wrong, but doesn't your story confirm that doctors aren't good at diagnosis? It took a cream-of-the-crop, top-tier specialist to correctly diagnose your condition.


What is the name of this disease?


Modern medicine already incorporates wide ranges of data. Doctors use flowcharts, scales, point systems, etc, to diagnose certain conditions because those tools have been developed by studying and considering a lot of cases.

However, there's a lot that isn't covered with data. The "middle of the scale", the "almost but not quite there", the "this is weird"... Doctors are good at that, through experience, and those are the difficult cases. Those are the ones where ML will not only likely fail, but won't even explain why it fails. We're talking about human lives here. If anything, I think software engineers massively overestimate the performance of ML and underestimate doctors.


Yes, notwithstanding those factors you described, it is not uncommon for tests to reveal false-negative or false-positive results due to their intrinsic specificity and sensitivity. A normal value is not always indicative of health, either.
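A small worked example with purely hypothetical numbers: even a test with 90% sensitivity and 95% specificity gives mostly false positives when the condition is rare, which is part of why a single 'positive' or 'normal' value can mislead.

    # Bayes' rule with made-up sensitivity, specificity and prevalence
    sensitivity, specificity, prevalence = 0.90, 0.95, 0.01

    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    ppv = sensitivity * prevalence / p_pos               # P(disease | positive)
    npv = specificity * (1 - prevalence) / (1 - p_pos)   # P(healthy | negative)

    print(f"P(disease | positive test) = {ppv:.2f}")     # ~0.15
    print(f"P(healthy | negative test) = {npv:.4f}")     # ~0.999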


My view on this is framed a bit differently, but it probably ends up at a similar perspective:

I think it's probably going to be a long time before models only using quantifiable measurements can even meet the performance of top doctors. I can't recommend enough that someone experiencing issues doctor-shop if they haven't gotten a well-explained diagnosis from their current doctor.

But I'm very curious how good one has to be in order to be better than a below-average doctor, or a 50th-percentile doctor, or a 75th...

But I also think there may be weird failure modes similar to today's not-fully-self-driving cars along the lines of "if even the 75th-percentile-doctor uses the tool and sees an output that stops them from asking a question they otherwise might have, can it hurt things too?"


> But I'm very curious how good one has to be in order to be better than a below-average doctor, or a 50th-percentile doctor, or a 75th.

In dermatology, which I was working on, models were better than 52% of GPs at detecting skin cancers, going by images alone. In the famous Nature paper by Esteva et al., the TPR for detecting melanomas was 74%. There is a catch which probably got underreported: skin cancer positivity was strongly correlated with clinical markings in the photos, and the models didn't do quite as well when 'clean' holdout data were used.

But the nature of the information in all these models was skin deep (pun intended). They were designed with a calibrated objective in place, unlike clinical diagnostics, which doctors approach as open-ended problems.
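To illustrate that catch, a minimal synthetic sketch (not the Esteva et al. setup, just made-up features): when a 'clinical marking' feature leaks the label, the model looks great on similarly contaminated data and degrades on a clean, marking-free holdout.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    def make_cohort(n, markings=True):
        y = rng.integers(0, 2, n)
        lesion_signal = y + rng.normal(scale=2.0, size=n)   # weak genuine signal
        marking = y * (rng.random(n) < 0.9) if markings else np.zeros(n)
        return np.column_stack([lesion_signal, marking]), y

    X_tr, y_tr = make_cohort(5000)
    X_dirty, y_dirty = make_cohort(2000)                    # markings present
    X_clean, y_clean = make_cohort(2000, markings=False)    # clean holdout

    clf = LogisticRegression().fit(X_tr, y_tr)
    for name, X, y in [("with markings", X_dirty, y_dirty),
                       ("clean holdout", X_clean, y_clean)]:
        auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
        print(f"AUC, {name}: {auc:.2f}")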


Isn't it a tad unfair to compare an ML model for dermatology, working only with pictures, against general practitioners? IMHO comparing said model against dermatologists would be a better approach. And a model working only from images is not necessarily a dermatological model, but rather an image analysis model.


> We test its performance against 21 board-certified dermatologists

From Andre Esteva's paper:

https://www.nature.com/articles/nature21056


Interesting, could you explain more about the clinical markings? Was this mentioned in the paper itself or was it later commentary?


I remember a similar New Yorker article ~5 years ago about medical imaging ML/AI where they realized its good hit rate was actually a data artifact from training. Something along the lines that essentially all the positives had secondary scans, so there was a set of known positives which had, say, run through the same machine/lab and had similar label colors & text markings in the margins of the imaging.

When they went back and tested with clean images that didn't basically say "I'm a positive 'cuz I have this label over here in the margin", the hit rate dropped below that of humans.

It was an article with anecdotes about some of the hospice cats that seemingly are able to detect when a patient is about to die. Entirely possible, as they have a keen sense of smell and patients' tumors likely give off detectable odors.

Nonetheless, the ML model & the cat were similarly inscrutable.


Nice, a model identifying the cases a doctor had enough doubts about to have a second test run.


Statistical mortality models are already more accurate than physicians for an average patient.


Not this ignorant comment again. AI will replace software engineers long before it replaces doctors. There is an arrogant ignorance of what doctors do that always shows up in comments when topics like this pop up.

And yes, I'm a physician and an MLE, so I understand both worlds clearly.


Well look at that! If it isn’t another member of the medical mafia on HN.

It is kind of funny to see comments complaining about the lack of perfect sensitivity and specificity of their physicians.

We complain about the same thing with the various ML techniques in radiology, which currently are pitiful and a gigantic waste of time and money. When I went into rads I was pretty worried about ML - not anymore.

I'm hoping this upcoming recession will dry up a lot of institutional use of ML. In radiology it's not that helpful and there's no technical fee increase for it. But you can advertise with it, I guess? Commercials with lasers, robots, and AI with pleasant voiceovers about cutting-edge techniques and getting the care you need in the 21st century and blah blah blah.


But we all saw House, Grey's Anatomy and Emergency Room. So we know exactly what doctors do!


> I suspect we massively underestimate the amount of misdiagnosis due to incorrect analysis of data using fairly naive medical mental models of disease.

I suspect software engineers massively underestimate the value of skills outside their domain.


The good ones don't; they also realize the value of working with experts from the field they develop software for and with. And yes, that includes business people. The bad ones think they can replace decades' worth of experience, training and education in a given field by deducing the necessary insight through first-principles thinking.

The same applies not just to software devs, but to every other domain as well.


I wouldn't call it a massive underestimate -- if I recall the research of Meehl et al. correctly, clinicians, on most types of cases considered independently, underperform simple arithmetical models of 2-5 variables by something on the order of 10%. So not a huge effect, but humans also aren't as good as they think. (They do get lucky though! Sometimes some people get very lucky and accidentally get a long string of cases right.)
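For concreteness, a sketch of the kind of simple arithmetical model those studies use: a unit-weighted sum of a handful of indicators with a fixed cut-off. The variables and threshold below are hypothetical, not taken from any particular study.

    RISK_FACTORS = ("age_over_60", "abnormal_ecg", "prior_cardiac_event",
                    "diabetic", "smoker")

    def actuarial_prediction(patient: dict) -> bool:
        """Predict 'high risk' when at least 3 of the 5 indicators are present."""
        score = sum(int(bool(patient.get(f, False))) for f in RISK_FACTORS)
        return score >= 3

    print(actuarial_prediction(dict(age_over_60=True, abnormal_ecg=True,
                                    prior_cardiac_event=True)))   # True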


I'm confused by your comment, because these are exactly the types of problems that humans generally do a really poor job of classifying.


Most modern ML techniques do a poor job on these types of problems too, unless they have a lot of data (hence the reference to sparsity) or assume structure that requires domain-specific modeling to capture.


It could be that after we train a biological neural net for decades, it can get pretty good at intuiting things even if it can't explain how.

The neural net in question is the doctor's brain.


> We humans are incredibly good at elimination of factors & differential diagnosis.

I don't automatically buy this.

Didn't heart attack care in the ER get dramatically better when people started following checklists? That suggests that human doctors aren't that great at even getting the basics correct.

In addition, most doctors are below average. So, maybe the best doctors are better than the AI. However, I may not have access to that doctor and the AI may be better than all the doctors I have access to.


Using checklists means standardization, and that makes results comparable and reduces the risk of forgetting something under stress. Checklists have nothing to do with ML or AI, though.


It wasn't just "forgetting". Every doctor had their own take on diagnosis and the checklist was actually better than a lot of them since the checklist was constructed from data.


It's interesting you say this. I read a book several years ago - I don't remember which one - but it talked about how there used to be a lot of questions that were used to determine if someone was having a heart attack.

They then did a lot of stats and number crunching and determined that with just 2 questions they could accurately predict whether or not someone was having a heart attack 95% of the time, which was a considerable improvement.

I wonder how many problems we are making worse, rather than better, by throwing more data at them?


For ML to really make a dent in medicine, the whole system needs to be altered, in a similar vein to how building roads tailored to self driving cars would make them much more successful. Most medical diagnoses are only made when severe physiological derangement has already occurred. If we had access to longitudinal streams of data, then ML would be essential to detecting anomalies which point to early evidence of disease. For example, streaming in regular noninvasive measurements on breath and urine; wearable readouts on heart rate, oxygen saturation, blood pressure, temperature, movement; neurocognitive streams based on analysis of email, video calls, text messages etc. But this is a fundamental change on many levels, with many barriers. It will also probably fail to improve the health of people who need it the most. I’m not sure it’s even that appealing as an alternative to the current meatspace system.
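As a toy illustration of what such longitudinal streams would enable (hypothetical data and thresholds): keep a rolling per-person baseline for each vital sign and flag readings that drift several standard deviations away from that individual's own history.

    import numpy as np

    rng = np.random.default_rng(0)
    resting_hr = 62 + rng.normal(scale=2.5, size=365)   # a year of daily readings
    resting_hr[350:] += 15                              # sustained rise, e.g. onset of illness

    window, threshold = 60, 3.0
    for day in range(window, len(resting_hr)):
        hist = resting_hr[day - window:day]
        z = (resting_hr[day] - hist.mean()) / hist.std()
        if abs(z) > threshold:
            print(f"day {day}: resting HR {resting_hr[day]:.0f} bpm, z = {z:.1f} -> flag")
            break   # in practice, alerts would accumulate and trigger follow-up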


> We humans are incredibly good at eliminating factors & doing differential diagnosis. The findings don't surprise me. There is much more ground to be covered. For straightforward conditions with limited, clear-cut symptoms these models are showing promising advancements, but they cannot be trusted across a wide array of diagnoses - especially when models don't know what 'they do not know'.

If it were presented this way, while accurate and honest, it would in no way get the media hype, and thus the funding, from both state actors and private investors looking to be part of the 'winner-take-all' model.

As a person studying AI and ML at the undergrad level, is there any advice you have on how to avoid the pitfalls this industry has fallen into?


Thanks for sharing. My belief is that we need to figure out a way to make humans interact with prediction models in a virtuous way. Prediction models suck at "connecting the dots" or considering multiple sources of information (for example: multiple models predicting different outcomes). Until we get true general artificial intelligence, I think the way to go forward is to try to quantify those unknowns through confidence intervals (conformal prediction seems to be quite nice for many models) plus some multiple hypothesis testing to handle the multiple outcomes / multiple models.

This then needs to be implemented in a real workflow where humans and prediction models interact (for example: approve these cases automatically, send these others to humans to review).
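A rough sketch of that flow under toy assumptions (synthetic data, a plain classifier, split conformal prediction): each case gets a prediction set at a target coverage level; single-label sets are auto-approved and ambiguous ones are routed to a human.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                               random_state=0)
    train, cal, test = np.split(rng.permutation(len(y)), [2000, 2500])

    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])

    # Calibration: nonconformity score = 1 - probability assigned to the true class
    cal_scores = 1 - model.predict_proba(X[cal])[np.arange(len(cal)), y[cal]]
    alpha = 0.1                                          # target 90% coverage
    qhat = np.quantile(cal_scores,
                       np.ceil((len(cal) + 1) * (1 - alpha)) / len(cal))

    # Prediction sets for new cases; route ambiguous ones to a human
    probs = model.predict_proba(X[test])
    pred_sets = [np.where(1 - p <= qhat)[0] for p in probs]
    auto = sum(len(s) == 1 for s in pred_sets)
    print(f"auto-approved: {auto}, sent for human review: {len(pred_sets) - auto}")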


Personally, knowing the hit rate of these ML models & their non-explanatory nature, weighed against their low cost, I'd argue they should be used as a default automated second opinion alongside the radiologist's read.

Recently went through a pet cancer death, so the medical imaging, diagnostic testing, specialist escalation and second opinion workflows are pretty fresh in my mind. There is a shortage of specialists, a backlog for appointments and many astonishingly bad practitioners out there.


In my experience - multiple knee MRIs, the liver, the intestine, the ankle... - radiologists by default are the second opinion to the specialist, e.g. an oncologist, who sent you to get the scan in the first place. I never ever had a radiologist come up with a diagnosis by himself.


The system itself should be built around these capabilities, not the other way around. Instead of collecting data at regular intervals, we wait until symptoms appear before going to the doctor. This is why the dataset is so sparse.


Exactly this. The features (or limitations) of medical data are inherent in the process of clinical practice, but this seems to be oftentimes overlooked.



