Some background info on mammograms (x-rays of the breast). For whatever reason breast cancer is highly political in the US. As a result there are federal laws regarding communication of results and scheduling of follow-up exams that apply to mammograms and to no other radiology or medical test. To comply with these laws radiologists set up a uniform grading system with highly specific jargon. The formatting and language used in mammogram reads by radiologists are therefore much more consistent than, say, a chest x-ray screening for lung cancer. This makes mammograms the ideal studies to be tackled by AI.
This is actually why it isn't ideal for AI to tackle. The fact is there's no market for the product.
Right now much of the mammogram reading is extremely straightforward. Radiologists can fly through those exams with high quality, especially when you consider the macros they have set up that allow them to write most of the report with a two-second voice command.
This is one of the biggest things radiologists complain about: from their perspective most AI is solving problems that aren't actually useful to radiologists. I think the companies that will do best are the ones that augment radiologists with a focus on making them more accurate and efficient, rather than trying to cut them out of the loop (I'm a bit biased though, as my previous company, Rad AI, is taking the augmentation approach and it has worked out rather well).
> This is actually why it isn't ideal for AI to tackle. The fact is there's no market for the product.
There's a difference between "not ideal to tackle" and "not ideal to build a business around."
It is absolutely ideal to tackle, which will hopefully build confidence in a system that can then be expanded to solve harder problems. If you can't even solve this problem, I don't see how you're getting public confidence in the harder ones.
Why? The defining feature of the third world is cheap labor.
The third world might need more effective training, so they can convert more cheap labor into specialists, but I don't see why per se they would prefer to use a computer sold at Western prices to a local specialist.
That's empirically not true; the first such product approved for clinical use entered the market in the late 90s, and they've had small-to-medium success since. Definitely none of them changed things radically, but it's pretty common CAD functionality now and has both detractors and champions.
> Radiologists can fly through those exams with high quality,
This is also not entirely true. One of the biggest problems for screening (not diagnostic) programs is that the false negative rate for radiologists is high. They do fly through these scans, but that's to make it economically feasible. Also, the average radiologist's performance on screening mammo is not very good - most have to do it regularly (probably hundreds per month) to do it well.
You are right that augmentation is the more plausible path to clinical success, but that's what most of the clinical systems have aimed to do anyway: provide a sanity check on the radiologist.
I have no idea why everyone tries to solve the hardest problem (true positives, no false positives) with machine learning, instead of solving the much easier problem (true negatives, no false negatives) that often delivers similar value.
E.g. If a system could tell me which 967 results out of 1000 are normal, and send the other 33 for human review.
True positives require solving every problem in the solution space. True negatives only require solving the most frequent problems.
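To make the rule-out idea concrete, here's a rough sketch (the classifier scores, the threshold-selection routine, and the numbers are all illustrative assumptions, not any real product): pick the operating point on a validation set so essentially no positives fall below it, then auto-clear only the exams scored below it.

```python
import numpy as np

def pick_rule_out_threshold(val_probs, val_labels, required_sensitivity=1.0):
    """Choose the highest threshold that still keeps the required fraction of
    validation positives above it, so exams scored below it can be auto-cleared."""
    positive_probs = np.sort(val_probs[val_labels == 1])
    # To keep `required_sensitivity` of positives above the threshold,
    # it must sit just below the k-th smallest positive score.
    k = int(np.floor((1.0 - required_sensitivity) * len(positive_probs)))
    return positive_probs[k] - 1e-9  # nudge below so that positive is still flagged

def triage(probs, threshold):
    """Boolean mask: True = send to human review, False = auto-clear as normal."""
    return probs >= threshold

# Illustrative numbers only: 1000 exams, ~3% true positives, stand-in model scores.
rng = np.random.default_rng(0)
labels = (rng.random(1000) < 0.03).astype(int)
probs = np.clip(0.7 * labels + 0.2 * rng.random(1000), 0, 1)

thr = pick_rule_out_threshold(probs, labels)
flagged = triage(probs, thr)
print(f"Auto-cleared: {(~flagged).sum()} of 1000, flagged for review: {flagged.sum()}")
```

The whole trick is that you only ever tune the threshold for negative predictive value; anything remotely ambiguous still lands on a human's desk.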
Why do I feel the AMA was a bigger problem than liability?
Plus--wouldn't the software company be liable? (I don't think the tech is ready for office use, but when it is--I still think it will be a battle.)
Kinda like all medical devices. Whenever there's a problem, the equipment is looked at first?
Personally, I don't think the AMA wants computers taking away their doctor's precious income, and will only acquiesce when the tech makes the doctor more money, or is so good politicians/insurance companies start demanding it.
Insurance companies will be the ones to push it. And they'll push it when the government (through Medicare) mandates it / cost reductions.
But IMHO medical law in general has a big device / software blind spot. It's become decent at figuring out liability for humans, while permitting normal operations.
But all that machinery and case law doesn't really exist for black box software systems (no human in the loop). And if there's one thing that medical providers and insurance companies hate, it's unbounded liability.
Just commenting to agree with this. I've done a lot of ML use case ideation work, and what happens so often is the ML folks latch on to something that ML may be able to do, but that doesn't move the needle at all in terms of doing the work faster or better.
Often, the ML provides an extra layer that slows things down, even if it's working at state of the art AUROC or whatever we're trying for.
I'm really surprised by the pushback you're getting. I'd be curious the background of the people that have come back with replies about how it's an ideal use case, etc. This misunderstanding of the problem domain is exactly why these projects always fail - people come up with a "solution" that solves something that isn't the problem, despite whatever clever horse analogies people can come up with.
Is there an argument to be made that decreasing the workload on radiologists could improve their results in more difficult cases? I don't think it's a stretch to imagine that less tired/busy doctors would be able to perform their job more effectively.
There's also the fact that radiologists get paid a pretty decent salary. If I could devise a system that can do a highly paid worker's job faster, cheaper, and at a similar level of effectiveness, why wouldn't I? It also opens up that skilled worker's schedule and allows them to tackle more difficult to interpret scans.
That argument is exactly the one made by my former company, Rad AI. Rather than focusing on trying to "solve" radiology they're focused on augmenting radiologists. Their main goal is to reduce the amount of time a radiologist spends on each report while increasing accuracy and reducing errors.
The point I'm making is that focusing on mammograms is not the way to do that. It's already a very optimized specialty. If AI researchers worked more closely with practicing radiologists (not academic ones) I imagine there's all sorts of areas where existing technology today can improve things.
"If AI researchers worked more closely with practicing radiologists (not academic ones) I imagine there's all sorts of areas where existing technology today can improve things."
That's almost certainly true, and surely there must be people in the medical industry already working on this. In hospitals, there are internal organisations that vaguely attempt to link up technology workers with practicing doctors to improve medicine using technology. One of my most competent former developer colleagues retrained as a doctor and worked in medical digitisation, as well as being a practicing doctor.
However:
* Often coming at a problem from another angle and ignoring existing professionals can be useful too. Existing workers are very often stuck in their ways and unwilling to accept that the way they look at a problem may not be the best. I have heard from many in medicine that doctors are extremely distrustful and hesitant towards technology, as well as fiercely protective of their jobs.
* The medical industry is internally highly dysfunctional and trying to work with practicing doctors can be extremely politically difficult. The hospital bureaucracy often actively tries to prevent practicing doctors from working directly with technologists, preferring instead to silo them and direct all communication via senior management - who are often neither technology nor medical experts. Consequently it seems more likely that major changes will come from outside.
* AI has occasionally (rarely?) shown great success in solving problems without needing domain experts. See AlphaGo.
> AI has occasionally (rarely?) shown great success in solving problems without needing domain experts. See AlphaGo.
Games are always orders of magnitude easier than analyzing reality. They are highly constrained to a narrow domain and space of possibilities. Reality isn't. There is work on AI that learns physical laws, but it's a very long path, and those systems are much less successful than their artificial-reality colleagues.
In just a few decades we've gone from 'chess is ai complete' to 'go is ai complete?' to 'games are too easy to be meaningful.'
In game domains we have the ability to generate infinite meaningful training data and have easily measured outcomes. Waymo is generating near-infinite data for driving, and doing quite well with it. In medicine, we have piddly-ass datasets because of privacy concerns. "Reality is hard" has nothing to do with it.
"Reality is hard" is still a thing. Google has the ability to spend billions (or tens of billions) on something like Waymo, and they still have to spend billions (or tens of billions) more to properly commercialize it.
> The medical industry is internally highly dysfunctional and trying to work with practicing doctors can be extremely politically difficult. The hospital bureaucracy often actively tries to prevent practicing doctors from working directly with technologists, preferring instead to silo them and direct all communication via senior management - who are often neither technology nor medical experts. Consequently it seems more likely that major changes will come from outside.
This is one of the areas where working with radiologists actually makes things easier. Radiology groups are normally independent practices that contract with multiple hospitals- they link in with hospital systems but mostly run their own thing. They also tend to have really solid IT departments and enough funding to make things work. Working with hospitals is a nightmare though, as you say.
I think my point is that mammograms are a (relatively) easy to solve problem using today's technology. Why wouldn't it be worthwhile to solve that problem such that it has very limited human involvement?
It's the reason professional baseball is experimenting with electronic strike detection, despite the fact that umpires are very accurate (~94%) calling pitches.
I get that there are useful applications of AI that could augment a radiologist. I'm simply pointing out that if one could completely remove a radiologist from the equation, that could be a super helpful tool. I suppose my view is that removing/limiting the human element in routine inspections is a very worthwhile goal.
There's also the side benefit of increasing trust for these technologies. I think a lot of people are rightfully scared of AI technologies; showing that they can be as effective as a radiologist for routine screenings would be very beneficial.
The problem is, unless this AI is 100% perfect it will not provide any value to a radiologist. It is their reputation on the line with each interpretation.
Having them read an AI's diagnosis before assessing the mammogram themselves to confirm the AI's finding adds an unnecessary step to their workflow.
> This is actually why it isn't ideal for AI to tackle. The fact is there's no market for the product.
I think you're disagreeing with a point that the parent comment didn't make.
It's worthwhile in general to continue AI research. The properties of this application make it ideal for testing and progressing AI technology. But if AI doesn't end up being used in this particular area in the short or medium term, that's another matter entirely.
Hey, I've also worked at a previous company that was doing exactly this for DK, UK, etc. As far as I remember, one of the bigger things there was tissue density and masking effects, where things might become not as clear-cut for a radiologist. We saw a lot of opportunity in that area and in prognosis - the whole health economics aspect of frequency of visits.
> Right now much of the mammogram reading is extremely straightforward. Radiologists can fly through those exams with high quality, especially when you consider the macros they have set up that allow them to write most of the report with a two-second voice command.
That to me sounds like an ideal automation use case. You have an entire class of highly skilled professionals performing a repetitive task. Sure, they may be able to do it quickly, but I imagine they must have to do a bunch of them each day.
You could imagine a world where hospitals didn't have radiologists on staff, but rather they acted in a consulting capacity for difficult cases.
But coding is a creative task. Breast cancer screening is more like classifying breeds of dogs or determining if there's a fire hydrant in the picture.
This shows clear ignorance of what radiologists do, it's like a fish telling someone how to improve land transportation.
For some reason CS people fixate on radiology for automation just because it is imaging, but there's a lot of context behind it. There's a reason a radiologist/pathologist is called a doctor's doctor, and no one in the field is worried about automation.
For perspective: Training to be a radiologist is 5+1 years, your family doc trains for 3.
> For some reason CS people fixate on radiology for automation just because it is imaging
Precisely because it's imaging. Training data is abundant and has the potential to be well labelled. And one of the most active fields of AI research is... computer vision. So it's no wonder the low-hanging fruit would be medical imaging.
> There's a reason a radiologist/pathologist is called a doctor's doctor, and no one in the field is worried about automation.
Spoken like Garry Kasparov.
> For perspective: Training to be a radiologist is 5+1 years, your family doc trains for 3.
What I love is how the immediate knee-jerk reaction isn't to explain why the ML approaches won't work but to immediately retreat behind gatekeeping, in this case the title and number of years of schooling.
> What behind the scenes context would make it impossible?
The MD context. One first needs to train AI to perform the job of a general MD before you can get into stuff like radiology (that is, their real job, not what some novice CS grads, or not-so-novice AI experts like Hinton, imagine it to be - i.e. not segmenting things into funky shapes or running some funky black-box magic that spits out "tumor/not tumor" with no context whatsoever. No. Actually diagnosing real people, where a life is on the line, and if you fuck up enough times, your career).
No, identifying the tumor is only step 1, and is the easiest step. Most non-radiologists can identify whether a tumor is present. The harder part (and the true value of radiologist reads) is everything that comes after finding the tumor: what structures are the tumor invading? Is there spread to lymph nodes? Are there secondary findings that might affect the diagnosis or treatment?
These questions and their relevance change for every individual case, and while each question by itself may be approachable with AI, getting a detailed and relevant report without meaningless noise from an AI ensemble is a very, very hard problem.
Finally an answer that's not just throwaway accounts flagging a submission!
These are all interesting problems where I could see an AI struggling. I guess the next step, once tumor identification becomes a solved problem, will be to train the AI on treatment data and follow-up, ie, this is an example where there was spread to lymph nodes.
Human doctors will not only tell you if there's something funky in the image, but will also interpret it in light of a patient's medical history, symptoms, possible diagnoses, etc.
Subtle shading near some structure involved in one of two possible conditions might be very important, but an obvious cyst in an unrelated organ likely means nothing. People are weird close-up!
This is an excellent point. An AI may be able to give you a diagnosis, but a doc can do diagnosis plus post-diagnosis, with the benefit of medical history. Theoretically an AI could possibly do this as well, but we're nowhere close.
This AMA cabal/Luddite radiologist narrative is ridiculous. Many work on ML actively, we would love to have assist technologies, radiology residents are scared shitless about AI. I would really recommend you speak with more practicing radiologists outside of academia.
Though it’s plausible that ML could eventually take on some screening or diagnostic tasks from highly trained radiologists or pathologists, there is the hurdle of accountability that will need to be tackled. As with self-driving technology, the models will have to be good enough to take on the liability risk at scale.
It depends on the specific situation. Screening mammography for instance carries significant financial (and emotional) risk of missing a call that ends up being cancer.
Liability will be delegated away from the hospital and onto the service provider, same way they are with diagnostics on clinical lab equipment. Hospitals love to continue to get paid while shouldering less risk.
Breast cancer isn't political. But there's continuing debate in the US medical establishment over the proper way to screen for it -- how to test, how frequently, at what age range, what constitutes a positive result, and how should the doctor follow up if the test is positive? The recommended answers to these questions seem to change every few years, and the substantiation behind the answers is often inconsistent.
The same issues bedevil prostate cancer screening.
The science is pretty well established on *generalized* breast cancer screening not improving health outcomes despite being costly. (In short: cancer is rare, tests have false positives that trigger biopsies, biopsies are invasive and can be harmful, overall biopsies and pointless surgery on slow growth cancers do as much harm as cancer.)
But since screening has been framed as caring for women, pointing out the flaws with screening is automatically seen as hating women.
While Ronald Reagan was in office, he had a colectomy for a colon tumor in 1985 (turned out it was not cancer) [0], and his wife had a mastectomy for breast cancer in 1987 [1]:
While Reagan did a lot to try to defund cancer research regardless, the first lady's mastectomy drew a lot of attention to a previously taboo topic.
Things like: It's normal to breast feed in public. Or the nation does not go bananas when a nipple is exposed on TV. Or how normal it is to sunbathe topless. Arguably this also relates to how normal it is for kids to walk around naked and families to be naked around each other.
If you're really interested, watch some older Dutch movies to see the normalness of nakedness, like Turkish Delight [0]. Or even have a closer look at the relation between the Professor and Raquel in the recent La Casa de Papel. It's different from US series. More respectful and mature with respect to women, if you ask me. More emphasis on intelligence. Or on a beautiful woman in her 40s with normal wrinkles. The US, as on many other topics, seems to be polarized, caught between the hyper-sexuality of Cardi B and the generally prudish nature of American culture.
In particular, the pearl-clutching gender-specific moralising: it is absurd that an organ whose specific purpose is to feed infants by being inserted into their faces is specially censored from minors, whereas the non-functional copy of the organ in men, which can't even do that, is apparently acceptable (modulo dress code, weather, etc. - but you can walk around or swim or sunbathe topless as a man, not as a woman).
Also the largest and best known breast cancer fundraising organization, Susan G Komen, has, in the words of wikipedia, "been mired by controversy over pinkwashing, allocation of research funding, and CEO pay."
I don’t think “political” is meant to mean “controversial” here. There were just some extremely successful awareness campaigns for it in the 80s and 90s (to the point where it’s the stereotypical example of an awareness campaign for many of us), so people care a lot about it.
One thing that will cause mammograms to be political, in several countries, is that there's a difference in perception of the downsides vs upsides of such screening programmes, and we're bad at communicating the trade to a population who lack the statistical literacy to understand it intuitively.
So our best shot from a public health perspective is to say "Here's what we recommend for everybody" and pay for that.
All screening programmes have two difficulties, which must be balanced against the benefit, and this trade is somewhat personal, so when the balance is quite fine the arguments can be vociferous as a result.
1. The screening itself may seem unpleasant. One woman may find it a very mild annoyance, a drive ten minutes out of her way, the staff are very pleasant, the scan itself is far less traumatic than a bra fitting, and she receives easy to understand results after not very long and isn't anxious about them; but for another maybe it's an hour's bus journey to the city hospital, the staff there are short-tempered and say she has the wrong paperwork, then another hour in a queue, she feels like she's just meat, squashed around for the convenience of the machine for what seems like forever, and then after anxiously waiting for what seems to be too long the results are confusing to her and she has to have a friend interpret them.
2. Over-treatment is always a problem. Screening by definition detects something that isn't causing noticeable symptoms. If you have a noticeable lump, or mysterious bleeding, you don't need screening you need a doctor's appointment. So a positive screening result might be nothing important. However either you've now got the burden of a diagnosis you ignored or, you accept the medical advice and are treated, even though it's possible (not likely, but possible) that you would have been just fine without treatment.
So, screening programmes are set up based on guessing how to trade these factors plus a third: how much should we spend on this medical intervention? After all, in some sense every dollar spent on breast cancer screening is a dollar you don't have available to cure blindness in poor orphans (or, of course, to bomb somewhere).
If your experience of a screening programme is that it's a minor inconvenience at most, and yet you know people who died of undetected disease, more screening seems like a no brainer. Particularly if you live somewhere where screening stops at age 50, and somebody you know died of undetected disease aged 54, you might reason that the screening should go to age 55 or 60 to detect such cases, no matter the public cost.
On the other hand if your experience is that it's an awful ordeal even when negative, and you know people who spent their last years horribly scarred by surgery as a result of suspected disease but then they died in their sleep from something else anyway, you may feel that there's already too much screening and it should be trimmed back, not to save money but the extra money for other programmes is welcome.
The other issue is that the screening isn't risk free. I don't recall the exact numbers, but for every 3000 cases of breast cancer detected, one is in someone who wouldn't have got breast cancer at all if she hadn't had all those screenings. It is still worth doing because it saves a lot of lives, but the more you do it the more cases you will cause, so you need to find the right balance.
Most women don't get breast cancer; the 1 case in 3000 (again, I don't know the exact number, but this is a reasonable range) caused by x-rays includes screenings of all women, including those who never get it.
However, if you are a doctor trying to figure out how often to screen women, the danger of x-rays is a strong reason not to do it too often. Daily screenings would catch breast cancer a lot sooner on average, but just aren't worth the risk even if the screening were otherwise free.
We're not even talking about treatment: just regular screening and detection, the same as you get when you go to your doctor and he sticks a finger up your butt and tells you to cough.
Also, maybe there are more scans, so larger-N datasets to train on?
Our hospital bought an AI stroke detector (viz.ai). The "AI" part is laughable, but it allows the hospital to collect some extra HHS/Medicare fee for using "imaging algorithms" or something. I suspect that company is potentially using anonymized scans to continue training their models, because the initial product was trained on some tiny sample of brain CT scans, like under 500.
The one plus for actual physicians and nurses is that at least they wrote a non-sucky PACS imaging interface for the iPhone so we can just pull up the plain scans and view them easier.
Additionally, breast cancer screening is a high-volume, low-prevalence task, and CAD applications have been developed for decades (although not with the performance of the latest CNN algorithms).
This type of evaluation (both the meta-evaluation and the underlying studies themselves) is the exact sort of thing AI enthusiasts should want researchers to do. It cuts through the hype and gives everyone a clear assessment of where the technology currently stands.
Clearly there is room for improvement. Maybe this study will also spur the development of new types of systems which augment human/radiologist decision-making.
AI isn't a stand-alone panacea for this. We've done lots of studies with other radiologists. Taking the union of two "bad" radiologists outperforms a single "good" radiologist. Humans aren't machines; we have bad days. I think the long-term benefit of this kind of AI is as an audit AFTER the radiologist has done their assessment. If there is a mismatch, it should trigger a follow-up with a different radiologist (without telling them the result of the first read or the AI) to build consensus.
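A minimal sketch of that audit workflow, assuming each study carries the first reader's call and an AI score (the Study fields, the threshold, and the function name are made up purely for illustration):

```python
from dataclasses import dataclass

@dataclass
class Study:
    study_id: str
    radiologist_call: bool   # first reader's read: True = suspicious
    ai_probability: float    # AI malignancy score in [0, 1]

def needs_blinded_second_read(study: Study, ai_threshold: float = 0.5) -> bool:
    """Flag the study for a second, blinded radiologist only when the AI
    and the first reader disagree. Neither the AI score nor the first call
    is shown to the second reader, to avoid anchoring."""
    ai_call = study.ai_probability >= ai_threshold
    return ai_call != study.radiologist_call

# Example: disagreement triggers the consensus workflow, agreement does not.
queue = [
    Study("A-001", radiologist_call=False, ai_probability=0.82),  # mismatch -> 2nd read
    Study("A-002", radiologist_call=False, ai_probability=0.03),  # agreement -> sign off
]
for s in queue:
    print(s.study_id, "second read" if needs_blinded_second_read(s) else "sign off")
```

The key design choice is that the second reader stays blinded to both the AI and the first read, so the consensus isn't anchored to either.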
This! I wish we did this type of usage a lot more in a variety of areas. I previously proposed this at a job to enhance filtering out of false-positive security scan results. I think it doesn't get used because it would increase the immediate cost, not reduce it.
I completely agree. In fact, one wonders if the biggest use of AI/ML in medicine will actually be in medical notes and record keeping -- voice transcriptions, medical scribing, and interpreting writing, free-form text into standard diagnostic categories.
AI which is slightly less effective than a well-educated, well-trained, expensive radiologist is better than nothing.
There are places in this world which could afford a $XXK x-ray machine, but don't have access to a good radiologist.
Agreed! It’s easy to get sucked into the (intriguing) human vs. machine question and forget that the best solution is often both or some orthogonal improvement.
My first step in doing such a meta-study would be to throw away any paper with fewer than 1000 training examples (or choose your threshold). If the n value is insufficient to train the number of parameters in the model, the paper isn't good.
When I was CTO of DocHuddle (ML + Radiology), we reviewed every paper and preprint we could find and a huge number appeared to be overfit on tiny datasets.
Any meta-study which doesn't throw away obviously poor attempts will produce badly skewed meta-findings.
That's a real shame. It's been years since "The Unreasonable Effectiveness of Data" has been published and out in the world. You'd think that using a large N for statistical learning methodologies would be standard practice by now.
Do Hospitals keep radiological images? Are they researching this stuff? Presumably they'd have a large enough sample size.
Yes, hospitals keep imaging on their PACS systems and VNAs (Vendor Neutral Archives). Large systems have millions of images, though not necessarily same body part. Very large systems have millions of the same body part alone.
Getting images is one thing. Getting labels is harder. Getting annotations on regions-of-interest is yet harder. Often the labels are stuck in unstructured data (notes.)
When I was doing my startup (https://www.dochuddle.com/) we trained our classifier and object detector on 1.2 million images. We worked with a large sovereign on getting the images, reports, and annotations.
Just dealing with 20+ TB of images was a job in itself... but it was indeed unreasonably effective. The success was only a technical and ML success.
Even harder -- commercializing it effectively after accounting for legal/contract costs. Yet harder -- getting past conflicts of interest inherent in the US medical system. We were not successful on this front. Possibly too early (we started in 2014)
Actually, the real shame is the total lack of really big (millions of records) datasets on pretty much anything medical. And this is of course due to our love of privacy. Probably millions of people died too soon because of that, and many more will follow in the future.
I'd agree that millions suffer due to a lack of care, or a lack of affordable care, or a lack of early diagnosis.
I'd disagree this is due to privacy. IMHO it is due to conflicts of interest in major medical systems (certainly in the US) where incentives are skewed away from efficiency towards more billing.
I remember a friend in an allied discipline talking to me about what they did for cervical cancer where again the initial AI outcomes are weak. I don't know if this is now gold standard clinical practice, or still an experimental, but here's how I remember it:
We screen people with a cervix for cancer by scraping away a tiny sample of the cells periodically and having that examined at a laboratory for anomalies. If caught early, cervical cancer isn't fun but it's extremely survivable. You don't technically need a cervix (after all about 50% of humans don't have one) so in the worst case hysterectomy (removal of the womb and cervix) is an option.
Machines aren't very good at looking at a slide full of cells from the human cervix and giving it a score like 1-5 where 1 is "Fine" and 5 is "Cancer". This is a standardised task that her (human) team do every day, and she works with other bodies across the continent to ensure they're all doing a roughly similar job by looking at each others examples and checking they get the same number, so this way a doctor in Paris and one in Birmingham should get interchangeable results despite using different labs to process the samples.
However, it turns out the machines are stellar at a related task. "Is this sample infected with HPV?". Humans do not find this task easy and historically this test wasn't done anyway.
So maybe a human would do the first thing, and, if it was a bit borderline, then they'd check the second thing. Cervical cancer is almost always caused by HPV, so if you don't have HPV then you almost certainly don't have cervical cancer. But since the machines are great at that second problem you can reverse things. The machine processes every sample for HPV and then a human only looks at the positive ones, rating those.
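A toy sketch of that reversed triage, with the HPV assay and the human grader stubbed out (both stubs are purely illustrative stand-ins, not how the real lab pipeline works):

```python
import random

def machine_hpv_test(sample):
    """Stand-in for the automated HPV assay; the real test is a lab instrument,
    not this random stub."""
    return random.random() < 0.1   # ~10% positive rate, purely illustrative

def human_grade(sample):
    """Stand-in for a cytologist assigning the 1-5 score."""
    return random.randint(1, 5)

def screen(sample):
    """Reversed triage described above: the machine screens every sample for HPV,
    and only HPV-positive samples reach a human for the 1-5 grading."""
    if not machine_hpv_test(sample):
        return {"hpv": False, "grade": None}       # no human time spent
    return {"hpv": True, "grade": human_grade(sample)}

results = [screen(s) for s in range(1000)]
print(sum(r["hpv"] for r in results), "of 1000 samples needed a human review")
```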
What seems to be coming up repeatedly is that machine learning techniques are pretty good at being almost adequate replacements for recognition problems. Able to catch blunders but also missing at a rate high enough that you need the expert human anyway. In reality just replacing human kinds of error with machine kinds of error.
Automation, though, can also make experts stupid by depriving them of the practice needed to keep their skills sharp.
That's why most uses of AI in medicine that I've seen have focused on performing a sanity check on the physician (e.g. checking a new prescription for contraindications like conflicting meds or maladies). Medical insurance companies have embraced this practice for obvious economic reasons.
Doing this early also helps people get used to the idea of AI replacing/complementing doctors in the future. Objectively, it would be better to be diagnosed by a machine instead of a human doctor, if the machine is better on average at giving the right diagnosis. In practice, there is a large legal and emotional gap to close to get acceptance for this.
The gains could be incredible though, not only would we get better diagnosis, it could be done faster, cheaper, and remotely.
> Objectively, it would be better to be diagnosed by a machine instead of a human doctor, if the machine is better on average at giving the right diagnosis.
"On average"?
What if the machine is better than average for common things but consistently, 100%, misses uncommon conditions with very high short-term mortality rates?
I'm also generally sceptical of the current AI hype, but I don't see exactly how these results cut through the hype. Why isn't it seen as an accomplishment that 6% of those systems performed better than an actual radiologist?
I think this headline is bad. When you read the abstract you find that what this is really saying is that there are essentially no studies that are fit for purpose to evaluate AI systems for breast screening in general and those that are the closest to being suitable show the worst results. That's the real take-away here - we're not even at the point where we're collecting good data about AI accuracy, let alone actually producing accurate AI.
I would suggest these two things might be linked - the only way you get publishable results is by failing to do rigorous studies.
You can get away with publishing models with worse accuracies if you can pitch a more fair evaluation scheme, but it's way harder, and you don't get to do flashy presentations where you state you're SotA.
I hope I don't offend any other people in the field, but I think historically fields like histology/radiology/MRI screening via AI have been subject to less scrutiny by virtue of their multi-disciplinary nature... particularly in the past, it was hard to find reviewers that both (A) understood the biological/clinical validation (B) the technical validation.
I think things have drastically changed over even the last 5 years, which is why you're more likely to see headlines like these. We have a lot of discussions like these during our lab's journal clubs. The optimists among us (e.g. myself) argue that this opens up a lot of opportunities whereby more fair validations/benchmarks allows us to compete with "SotA" methods that are actually fragile and easy to surpass on even ground. More pessimistic members are quick to note that it's far harder to introduce new, fairer validations and that reporting lower metrics is far less buzz worthy (all true points).
This reminds me of a fundamental error that was recently found in methods that tried to predict protein-protein interactions. It was found that the train/test/validation method used by ostensibly every paper was leaking huge amounts of data [0]. When we plugged that leak, we saw much more modest metrics, but focusing on regularisation allowed us to beat the competition [1].
The kicker? The information leak was identified in 2012, yet you'll see papers written every year that have the same leak and report >90% accuracies and >0.9 ROCs.
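For anyone unfamiliar with that kind of leak, the fix is roughly to split by protein rather than by pair, accepting that straddling pairs get discarded. A hedged sketch of that idea (not the exact protocol from the cited papers):

```python
import random

def split_ppi_pairs_without_leakage(pairs, test_fraction=0.2, seed=0):
    """Split protein-protein interaction pairs so that no protein appears in
    both train and test. Pairs that straddle the two partitions are dropped,
    which is the price of a leakage-free evaluation. Illustrative sketch only."""
    proteins = sorted({p for pair in pairs for p in pair[:2]})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = int(len(proteins) * test_fraction)
    test_proteins = set(proteins[:n_test])

    train, test = [], []
    for a, b, label in pairs:
        if a in test_proteins and b in test_proteins:
            test.append((a, b, label))
        elif a not in test_proteins and b not in test_proteins:
            train.append((a, b, label))
        # else: the pair straddles the split; discard it to avoid leakage
    return train, test

# Toy example: pairs are (protein_a, protein_b, interacts?)
pairs = [("P1", "P2", 1), ("P2", "P3", 0), ("P4", "P5", 1), ("P1", "P5", 0)]
train, test = split_ppi_pairs_without_leakage(pairs, test_fraction=0.4)
print(len(train), "train pairs,", len(test), "test pairs")
```

The leaky version - randomly splitting pairs - lets the model memorize per-protein statistics that reappear in the test set, which is where the inflated >90% numbers come from.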
Also, publications are not what determines whether AI gets deployed in clinical practice. That's the job of the FDA and of the millions of dollars spent on validation, like clinical trials and quality management systems.
No, this paper is a review of other studies, and what I'm saying is it finds that there are quite a few problems with the studies they're reviewing - they point out that the smaller studies that show stronger results are actually less likely to be generalizable, and the studies that are broader show weaker results but even then they still have flaws in the method. There is quite a good table (figure 2) that shows how they rate their "concerns" for a number of metrics, and it's a sea of red where they have lots of concerns.
This implies that there were a couple of AI systems that actually beat a radiologist, which I take as extremely promising for the field of AI radiology.
Like any domain in applied AI, there will be a lot of approaches that miss the mark, or are simply stepping stones to better approaches. There are thousands and thousands of papers on language modeling, but we only needed one superior approach (GPT) to change the game entirely.
The search through any cutting edge problem space is messy and full of failure, and that's fine. You only need one breakthrough.
> This implies that there were a couple of AI systems that actually beat a radiologist,
Without any more details about the error rates, we can't be sure how likely this is due to chance. I would caution making any conclusion about AIs without better understanding the underlying statistics.
FTA:
> Thirty four (94%) of 36 AI systems evaluated in these studies were less accurate than a single radiologist, and all were less accurate than consensus of two or more radiologists.
So yeah, no AI system beat consensus of two radiologists. That's pretty damning.
Depending on how correlated the human and AI assessments are, this could be used as a verification system to determine whether consensus needs to happen, i.e. always run the ML system and only ask for a consensus read if the ML system disagrees with the diagnosis. This could still provide a lot of value, I would assume.
Not a single AI model is better, but what about the consensus of the 36 AI models? Ensembling different models is a common technique for improving machine learning models; did they test that?
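They don't appear to report an ensemble, but the standard soft-voting version is straightforward. Here's a sketch assuming each model outputs a per-exam malignancy probability (the weights, threshold, and numbers are illustrative assumptions):

```python
import numpy as np

def ensemble_predictions(model_probs, weights=None, threshold=0.5):
    """Average the per-exam malignancy probabilities from several models
    (optionally weighted, e.g. by each model's validation AUC) and apply a
    single decision threshold. A simple soft-voting ensemble."""
    model_probs = np.asarray(model_probs)            # shape: (n_models, n_exams)
    if weights is None:
        weights = np.ones(model_probs.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    averaged = weights @ model_probs                 # weighted mean per exam
    return averaged, averaged >= threshold

# Toy example: 3 models scoring 4 exams, weighted by made-up validation AUCs.
probs = [[0.2, 0.9, 0.4, 0.7],
         [0.1, 0.8, 0.6, 0.6],
         [0.3, 0.7, 0.5, 0.9]]
avg, calls = ensemble_predictions(probs, weights=[0.85, 0.80, 0.78])
print(avg.round(2), calls)
```

Whether that would close the gap to a radiologist consensus is an empirical question; it only helps when the models' errors aren't strongly correlated.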
Indeed. And we all know how quickly radiologists are improving at their job. At this rate the 6% of AI systems that beat one radiologist will be down to 0% in no time.
I'd push back on the 6% of AI systems being better than a radiologist and calling that a success, but you are right in the meta.
It's fair to say that yes, AI systems aren't good enough yet. On the other hand, it's pretty clear some technological approach will outperform a radiologist at pattern recognition at some point in the future - whether that's "AI" or "if statements" or some third option.
Another interesting subtlety is that there are only a finite number of radiologists, and they're generally concentrated in wealthy countries/areas.
AI based analysis - whether it’s better than a human radiologist or not - is far more scalable and cost effective. Even if used as a screening mechanism to be escalated to a human radiologist, this approach will be very helpful to much of the world.
I just meant that it's not clear from this that the 6% are 'overall better' just that the 94% are 'overall worse.' More data is needed, but it does appear that progress is being made, and I'm excited by that.
Here, the context appears to be a somewhat arbitrary selection of published algorithms. All they've really determined is that at least 94% are not ready to replace radiologists.
That's pretty much confirmation of the default assumption. If they were, they'd all be trying to get these into hospitals, and they're not.
And how do we know that this small handful that beat the radiologists didn't just get lucky? You really need to know the sampling distribution of what's being measured here.
It doesn't imply anything of the sort. Until a well powered randomized controlled clinical trial shows an overall mortality benefit from an AI screening program, the field hasn't contributed meaningfully to medicine. I'm not saying it won't happen, but we are almost certainly very far from that goal.
Clinical AI (which is currently regulated as a CAD medical device by the FDA) won't replace radiologists but will be treated as an additional clinical vendor application integrated into existing software, similar to the speech recognition dictation that Nuance has provided for decades.
I think the problem with this is that the AI is being used on metrics and images and tests that have meaning to humans. But if we started to take diagnostics that had higher dimensions and resolution, and just trained blackbox AI on that data, it probably stands to reason that it could do a better job. Especially with a tight feedback loop that would identify interesting regions and rescan them instantly, instead of having to bring patients back into the room at a later date.
It's the same problem I have with self-driving cars. We are teaching cars to behave around (and like) humans, instead of just fencing off the highways and inventing an autonomous system of vehicles that doesn't have this much tougher constraint. I think AI can do a lot of good things today, if the people trying to apply it looked to completely redesign existing systems around AI, instead of trying to replace the humans operating the existing systems. The latter is much more difficult.
"But if we started to take diagnostics that had higher dimensions and resolution, and just trained blackbox AI on that data, it probably stands to reason that..."
...that we would have no way to evaluate the systems' accuracy because no other non-statistical system can understand the input data?
The Oracle at Delphi is always correct. Even when the Oracle is wrong, the Oracle is correct.
"...just fencing off the highways and inventing an autonomous system of vehicles that doesn't have this much tougher constraint...
Kind of expensive to build a second interstate highway system, though.
We can evaluate over time. If the system repeatedly predicts that people will get lung cancer in 10 years based on scans, and in 10 years we finally detect that cancer, that says there must be something there even if we don't know what. Finding that "what" then becomes important (along with figuring out how to treat lung cancer 10 years before we can currently detect it).
Of course this is all assuming there is something there. It might be there is nothing to find. I wouldn't be surprised either way.
> instead of just fencing off the highways and inventing an autonomous system of vehicles
As a popular joke goes: the best place for AI cars is on special roads. You can then make those cars drive very close together, even touching. Oh, and you can then move the special road behind buildings instead of in front, giving room for more pedestrian-friendly, pleasant streets. Then you may as well move the road underground.
We have an automated subway where I live, and while it's nice and modern, I understand why people would want the privacy and comfort of a personal car. That's one of the points that isn't mentioned in your joke.
Another is that you could have your car use its AI on the special roads in the city and on highways, and drive yourself the rest of the time, where AI is harder to implement. A bit like some hybrid cars, where you can do most of the day-to-day commute on electricity but use gas for longer road trips.
Sort of. I think there's a good argument to be made for part-time self-driving cars with specially-designed highways in places where it makes sense. So you'd drive your car the old-fashioned way from your home on the residential streets until you get to the Automatic Interstate. There, you maneuver into the "pattern" lane and the road and the AI pull you into traffic and drive you to your exit. At that point you're directed out of the pattern and control is given back to you, where you take the side streets to wherever you were going.
It has the advantage that it covers the very common use case where someone says they can't take public transportation because they need mobility at their destination.
Is there not a space between "fully automated subway" and "automationless road" where practical solutions exist? Case in point, what if we changed the materials composition of road paint? Or, what if cars were designed to better alert other cars of their location and speed?
Are you going to attach these transmitters to every deer and every pedestrian? If not, we're back to 'elevated or buried road so we don't have to worry about animals and people- why not just build a train?' or 'AI which can solve the whole problem on its own' or 'If I recognize something unusual hand off back to a human, hope that they were paying attention and know what to do.'
And that still leaves aside tricky things like construction sites, double parked cars, etc.
Commercial airline autopilot has a lot more straightforward job: at normal operating altitudes there are no animals to worry about, no construction or other unexpected obstacles, and few other aircraft all of which can be essentially guaranteed to be squawking transponder codes of their own, and they still are mandated to have two pilots in them at all times. This is not because airlines like it- look how fast the flight engineer disappeared once flight management and navigation software became sufficient. It is instead because airlines have learned over the decades, from a lot of blood, that sometimes you need to have a pilot ready to intervene RIGHT THIS INSTANT, not 3 minutes from now when you've regained situational awareness after zoning out for a bit- and the AF447 disaster shows that they are right about that. And the only way to make sure you have such focus and alertness over many hours is to have two pilots, so they can trade off the responsibility of maintaining SA.
And that's a much easier problem which has been actively attacked for decades.
Let's call street harassment what it is: an epidemic. According to a survey of subway riders conducted by the Office of the Manhattan Borough President in 2007, roughly two-thirds of respondents, mostly women, experienced sexual harassment on the subway; 69% reported feeling sexually threatened
Well, if you did that, you probably wouldn't even need AI for the control system. If you know the parameters and have some control over unknowns, you want a deterministic control system. "AI" will hopefully handle the cases where you are in an uncontrolled environment. Current "AI" doesn't seem quite up to the task of self-driving. It will probably get there, but when is anyone's guess.
> you probably wouldn't even need AI for the control system
Wouldn't need it, no, but could use it to analyze traffic patterns and optimize flow for fuel efficiency, travel time, and safety. Those are much more constrained problems for which we have some useful tools.
As with most things, trying to automate everything is really hard because things are rarely built to be easily automated. And sometimes it's very hard to extract the easy to automate part from the whole. Driving is a very good example.
> It's the same problem I have with self-driving cars. We are teaching cars to behave around (and like) humans, instead of just fencing off the highways and inventing an autonomous system of vehicles ... to redesign existing systems around AI, instead of trying to replace the humans ...
Of course, fencing off existing roads to exclude human-driven cars is an economic impossibility, like converting all our electrical systems to use compressed water instead.
Regardless, today's autocars are clearly NOT ready to fly solo in the absence of humans, as the Tesla that crashed into an overturned truck on a clear day a few months ago so clearly demonstrated.
I think the tight feedback loop for multiple scans is a good idea worth exploring, but the larger idea of building alternate systems has a problem that there's no clear path to a viable end-state.
With diagnoses, collating a consistent set of data is not a requirement you can impose because of how fractured the healthcare system is. If you had great AI systems in place maybe they could be convinced to invest in these systems, but you don't and so creating them is quite difficult.
Similarly, in the case of cars having their own highways and roads: how do we do this in a way that people find acceptable? If you prevent current highways from being driven on by normal cars, you cut off the majority of people from their daily commute which is a nonstarter. Making new highways in most places will also be a nonstarter for both cost and space reasons. At best you might hope that decades from now every single car has the means for self-driving in the limited environment and then one day the government turns it on, but that also feels iffy to me.
Building parallel systems incrementally creates harder technical problems but allows you to side step even harder coordination problems.
When AlphaGo went against Lee Sedol, it often made moves in the later stages of the game which were completely inexplicable to almost all game analysts and commentators, it seemed like a bad/stupid move, or at the very minimum, it seemed like a move which was hard to build upon. But it eventually won.
I think we should let an AI model do its thing, even if it possibly seems absurd to traditional human understanding. The aim shouldn't be to make them like us (or better than us); it should rather be to make them the best form of themselves (which I hope will eventually be better than us).
Also, presumably x-rays are harmful if done too often or too strongly, so being able to replace them with something less harmful (like some kind of 3D sonogram) that AI can interpret to the same degree would be a net benefit, potentially becoming a home device for people with susceptibility.
If you shift the curve you could detect earlier and reduce false positives.
It's not an issue of resolution but of generalizability. Populations and scanners shift over time, and the biggest issue in clinical AI is the changing data distribution, such as data acquired at different times at different institutions. Medical devices (which AI software is considered to be) are also more regulated than self-driving cars.
"just fencing off the highways and inventing an autonomous system of vehicles"
And then they can platoon together to reduce air drag 4x, be electric and be built on a special surface to reduce friction 4x compared to 'normal' road?
I was a bit confused by exactly what the percentage in the headline was, so here:
>> In two of the largest retrospective cohort studies of AI to replace radiologists in Europe (n=76 813 women), all AI systems were less accurate than consensus of two radiologists, and 34 of 36 AI systems were less accurate than a single reader
I'm still a little unclear how you measure the accuracy of a consensus of radiologists, without relying on the consensus of other radiologists. Maybe just a bigger consensus, or by using known end results but looking at imagery that could have been an early warning?
Not necessarily. If you were running a casino and someone was able to significantly beat the odds on 34 of your 36 tables, would you assume they cheated at all other tables but not those two, or just that other players at those tables were really good?
It could be that 2 of those systems are actually better. It could also be that the state of the art in deep learning just isn't there yet for radiology. But this is a strong suggestion that research is being dramatically overstated, so I would get real careful with my investments in this area overall.
Or that 2 of those systems got lucky, none of them actually work for their stated purpose, and investing in radiology AI is about as wise as investing in the Keely Motor.
We've since changed the title. The submitted title ("94% of 36 AI systems evaluated were less accurate than a single radiologist") broke the site guidelines by editorializing. Submitters: please don't do that.
"Please use the original title, unless it is misleading or linkbait; don't editorialize."
Today I got my chest x-rayed to search for a broken rib. When we looked at the x-rays, I was wondering how the hell someone could recognize anything there. The doctor didn't find anything and just said if it still hurts in 3 weeks, we'll do another one, maybe we missed it.
Same last year, when I clearly broke a rib: first x-rays revealed nothing, one week later a second set revealed a really nice oblique displaced fracture, which was still hard to find.
Today the tibia also got x-rayed and it was a pleasure to look at, but as soon as it has to do with the chest, where all the organs are messing with the x-ray, one can really hope that AI will finally solve this issue, since every city has a couple of doctors with x-ray machines, unlike MRI/CT scanners.
Come on, most of us are just sitting around burnishing our fingertips on a small piece of glass. Any more details on how you wound up looking at an x-ray of your ribs twice in ~1 year?
I started mountain biking around 2 years ago. I think my front suspension is too soft for my weight, so both times the front wheel got in a trough(?) and the suspension gave in too much, making me fly over the handlebar. The tibia was because the bike landed on my leg after flying over me.
I'm reconsidering my life choices regarding MTB, since I'm already in my 40s and did zero exercise for at least 20 years.
But this time I learned. I will never risk it in the summer, since those sunny days are way too valuable to spend recovering instead of on the bike. I told myself this last year, and this time it definitely sunk in. "Be careful, you're not a teenager."
But it's so much fun. It is so good to breathe in that fresh air and to exercise hard; I always end up with a smile. Too bad I didn't start with this 20 years ago. I had the chance.
Thank you!!! I'm in the same boat, maybe worse b/c I'm deep into my 40s, but I just never really did much outdoor activity. When I was a kid there were all sorts of federal lands around that we could roll around in however we liked. Where I'm at now it's a 40-minute drive to trails, so it's just streets and other people's property. :/
Just now getting set up to spend more time outside. Sounds like I should take a lesson from your pain. :) Hope you get feeling better soon and get back out there (if maybe at a tick lower kinetic energy :)
(also i meant to put a smiley in my first reply, sorry for coming across as a dick)
Stage 0: "A computer will never beat a human at chess, the game is simply too complex"
Stage 1: "So one managed to do it, but 96% of implementations still can't"
Stage 2: "Well, looks like chess is pretty much dominated by computers. But the Go Game is still way too complex for a mere machine to tackle: only the human mind can"
Stage 3: "It was one tournament that AlphaGo won and really..."
Stage 4: "Yeah, the AI can play better Go and Chess than humans. These are solved problems."
I guess we're at stage 1 for self driving and apparently for mammograms as well. Only difference is, chess players didn't have a legalized cartel working for them in the 90's.
I wonder how long it took for the written interpretation in each case? I work in a rural ER where the "stat" reads often take about an hour, sometimes up to 5 hours, and the quality is... such that I've gotten a lot of practice at doing my own reads.
Something that is "not as good as a radiologist but maybe better than me," and with results in minutes instead of hours, would be HUGE, if even to just prompt me to review the images myself before (instead of after) seeing the next patient.
I actually have a bet with a radiologist that AI will be as good or better than radiologists at reading images in a fairly near future.
The reason I think this (as a non-specialist) is this: when I upload a random photo to Google or Facebook, those systems recognize the faces of people in those pictures without any prompt. This seems like a much more constrained problem (given a pretty specific set of images of chest cavities, state the probability of a cancerous tumor.)
I am guessing that there's nothing inherent in this problem that is much more difficult than other image recognition problems. I suspect that this outcome is because we have fairly new technology competing against highly specialized humans doing a very specific task and doing it well.
I suspect what will happen is that the technology will relatively quickly catch up to humans and do equally well, after which point it's just a speed-of-response and economics question, where the technology wins over the person.
I'm with @greazy on this one. The assumption made is that a set of labeled images is enough. Was the training set validated against a diagnosis based on a biopsy of some kind? I.e., for the images that were trained on, marked by some radiologist (more likely multiple radiologists), how were they validated to contain cancer or be cancer-free? Did someone follow up a few years later and confirm the diagnosis was correct?
Also, different modalities will produce different quality images, how was that accounted for in training models? Did they use all the images for a single scan or a subset of the images of a single scan?
The problem is you're trying to train a model where you have many images for a single scan, like slices. Depending on the modality, you'll get different resolutions, different "visual" inclusions, etc. etc. So labeling the individual images and labeling the entire collection is really hard.
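To make the slice-vs-scan labeling mismatch concrete, here's a minimal sketch (plain Python; the scores and threshold are invented for illustration) of the common workaround: score each image and aggregate to a scan-level call.

    # Hypothetical per-image scores from some model, aggregated to a scan-level call.
    def scan_level_prediction(per_image_scores, threshold=0.5):
        # Flag the scan if any single image looks suspicious (max-pooling).
        scan_score = max(per_image_scores)
        return scan_score, scan_score >= threshold

    # e.g. one scan with four slices of varying quality/resolution
    score, flagged = scan_level_prediction([0.02, 0.11, 0.63, 0.08])
    # A scan-level label ("cancer somewhere in this exam") says nothing about which
    # slice carries the finding, which is exactly the labeling problem described above.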
> I am guessing that there's nothing inherent in this problem that is much more difficult than other image recognition problems.
This is factually wrong. Your assumption, I guess, is that it's a data issue: that if we could just get more data we'd solve it.
The reality is that these diseases are complex, their presentation is complex, and that's compounded by differences in technology. There's also the issue of data complexity: x-rays contain less information than a picture of your dog.
You've misunderstood my point, I think. How a disease presents is a different issue from the complexity of faces. A photo of a face has more information encoded than an x-ray image. This lack of information is part of the problem that I think more data doesn't solve, and it's why humans are able to identify cancer: we have better image recognition 'software'. And that's my point: better algos, not more data.
There are two sides to this: enhance human decision (by accelerating it) or try replacing it.
I played around with a number of image segmentation models a couple of years back and I can only imagine how much more subtle densitometry stuff can be, but overall I'm not surprised accuracy can be way off because _any model_ can be way off depending on what it's presented with.
Most commercial Machine Learning doesn't actually learn on the job, and is actually positioned for the triage/faster handling scenario. It will spot what it was trained for, but _most likely only in the conditions it was trained for_. Change the angle, change the patient's age, add any other visible medical conditions, etc., and ratios are likely to drop.
Then again, maybe the study could be a little more systematic (as other comments point out). I'd like to see larger numbers, if only because it would help statistical significance.
If only it was as simple as that. In medicine a wrong diagnosis can have very bad real world consequences so 90% cheaper and 99% faster has a very real possibility of being a net negative.
Yes, and these are typically the best systems, where the doctor is assisted rather than replaced. I've looked at several over the last years and the comparison doctor+augmentation tool to doctor alone and tool alone is invariably superior. For now I think that is the most productive and the safest route.
The lifetime incidence of breast cancer is 13%, and on any given annual exam it's under 1%. So you could build a >99% accurate AI radiologist by hard coding it to always produce negative results.
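A toy calculation of that point (the 0.7% per-exam prevalence is my assumption for "under 1%"):

    # Why raw accuracy is meaningless at screening prevalence.
    prevalence = 0.007                            # assumed per-exam cancer rate, "under 1%"
    accuracy_always_negative = 1 - prevalence     # 0.993 -> ">99% accurate"
    sensitivity_always_negative = 0.0             # ...while catching zero cancers
    print(accuracy_always_negative, sensitivity_always_negative)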
Way cheaper than 90%, and way faster than 99%. Radiologists are crazy expensive, at least $50 per scan. Say you use a $1,000 GPU (which you don't even need, since these models are usually tiny in practice because of the limited size of medical datasets, and you're not doing hundreds of scans per second; a CPU is very likely fine). Then, just 20 scans in, you've made your money back. Say the license costs $5 per scan: you still break even very fast.
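Spelling out that back-of-the-envelope math (all figures are the guesses above, not real prices):

    radiologist_cost_per_scan = 50.0    # guessed
    gpu_cost = 1000.0                   # guessed one-off hardware cost
    license_cost_per_scan = 5.0         # guessed

    print(gpu_cost / radiologist_cost_per_scan)                             # 20 scans to cover the GPU
    print(gpu_cost / (radiologist_cost_per_scan - license_cost_per_scan))   # ~22 scans with a per-scan license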
Nothing surprising here for me to be honest. Data is almost always misrepresented, even in AI research. I haven't seen many papers showing the full distribution of result measures, only showing statistics about those measures, which can be disguising what's really going on (also made worse by implications of distributions which are not correct). My AI days have been over for a few years now since I finished my masters, but until then I *never* could reproduce anything published in papers, even using official packages. Data selection bias was (is?) a real problem.
It would be interesting to compare the results of AI + 1 radiologist to 2 radiologists (and, with a larger sample, to AI + 2 radiologists). Does the AI add a useful perspective? Does it add enough to be a better backup check than a second radiologist? If not, does it add anything to a pair of radiologists? It would presumably be fast and cheap, so it could still be a good outcome to add it into the process.
Another interesting question, if you had unlimited resources for studies: how does AI compare to 1 or 2 poorly trained 3rd world radiologists?
For screening, it depends on the false positive rate. A radiologist will have to check every positive prediction. Although, I believe in Europe they have approved AI to be used as a second reader.
> 94% of 36 AI systems evaluated were less accurate than a single radiologist
Today 94% of 36 AI systems evaluated were less accurate than a single radiologist.
FTFY
The beginning of the eDiscovery industry was like this too. Lawyers read through insane amounts of PDFs, TIFFs, and emails, and they were better than the AI/ML software on offer.
Today most courts will reject human eDiscovery if AI/ML is available, and no proper legal team will use lawyers to search through terabytes of data 'manually'. The false negative/positive rates are significantly better with AI/ML.
Would be interesting to see the time advantage. Mammography is a high-volume, low-prevalence task with standards such as BI-RADS. While AI will not replace radiologists, breast cancer screening is a prime application to assist with radiologist burnout. I believe Europe has already approved AI to be used as a second reader.
A lot of the systems seem to be trained on very few images. I work for a ML commercialization company. People have brought us models trained on 10 tumors.
Title isn't remotely clear here. "94% of 36 AI systems evaluated were less accurate" could mean the 36 systems were less accurate in 94% of cases, or it could mean that 34 of the 36 systems were less accurate.
From the actual article's abstract, here is what was meant:
> Thirty four (94%) of 36 AI systems evaluated in these studies were less accurate than a single radiologist, and all were less accurate than consensus of two or more radiologists.
I wonder if the HN title was written by one of these AI systems.
I don't know how much guessing is involved in making a diagnosis but some AIs could have been lucky. It would be hard for the doctor to win all 'games' even if better.
That's the main conclusion to draw. Better studies are needed until we can know whether Geoff Hinton was right [1].
____________
[1] “Machine learning pioneer Geoffrey Hinton said, ‘If you work as a radiologist, you are like Wile E. Coyote in the cartoon; you’re already over the edge of the cliff, but you haven’t looked down.’
“Hinton went so far as to recommend that med schools stop training radiologists right now.”
I see consistent mention of "2 radiologists review". Is there some feasible way to combine two of these models into some sort of metamodel and get a similar benefit? Like, is two-radiologist review a mechanical way of combining the opinions of two radiologists (in which case let's just do that with the models), or do the two radiologists actually sit down together and look at the thing, which you can't necessarily replicate by just using two models?
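For the "mechanical" version, a minimal sketch of what combining two models usually looks like (the models and thresholds here are hypothetical; each is assumed to return a probability of cancer):

    def double_read(image, model_a, model_b, threshold=0.5):
        p_a, p_b = model_a(image), model_b(image)
        recall = (p_a >= threshold) or (p_b >= threshold)  # OR rule: recall if either model flags the case
        avg_score = (p_a + p_b) / 2                        # or average the probabilities instead
        return recall, avg_score

What a sketch like this can't capture is the second, non-mechanical mode: two readers discussing a case and arbitrating a disagreement.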
I've mentioned before that I track technological progress by the change in criticisms.
It appears that AI screening of breast cancer has reached the "not always better than a human expert" stage which is pretty good, that would be sci-fi like 10 years ago.
Makes me wonder how a human-AI "centaur" or the averaged output of multiple independent AIs would do, since the standard already seems to be multiple humans to catch everything.
You have to assume a context where the number of radiologists is extremely limited, otherwise the exercise is pointless, but:
a different question I would have is: do the AI models know when they are unsure? If so, you could use the AI, keep the prediction when it is certain, and call a radiologist otherwise. Obviously, in this scenario, you'd also need to have measured the accuracy of the certainty prediction.
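A minimal sketch of that triage idea, assuming the model emits a calibrated probability (which, in practice, is a big assumption), with cutoffs invented for illustration:

    # Keep the AI's call when it's confident, otherwise defer to a radiologist.
    def triage(p_cancer, low=0.02, high=0.90):
        if p_cancer <= low:
            return "auto-report negative"
        if p_cancer >= high:
            return "flag for radiologist recall"
        return "send to radiologist for a full read"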
That's the neat thing about statistical AI techniques: you give it a training set with two conclusions, yes and no. It will then give you results of yes or no, with absolute certainty.
For example, check the photo-recognition systems where minor changes to the picture change the recognition from a parrot to a Ferrari.
Is this a thing that AI can't be good at, or a thing where the training data set is too small? I can imagine clearing all the privacy and legal hurdles to get 100 million annotated mammograms to be quite the feat, and most AI have to train on very little data.
HIPAA compliance has many loopholes. Providers and carriers (insurance) are bound by them, but random private companies are not generally. Most covered entities are going to insist anyone they share data with also be compliant, but it's not necessarily a requirement. Not to say it's trivial, either, but it's not exactly a huge challenge to get 100M annotated anything, really.
HIPAA is about not identifying the patient. It doesn't say you can't take 100M mammograms, strip out the PII, and add "this image is of a patient that has disease X". I'm not sure how much other demographically interesting metadata you can add, like age, gender, height/weight, etc.
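Roughly, de-identification along the lines of the HIPAA Safe Harbor rule means stripping the 18 identifier categories while keeping coarse clinical metadata. A toy sketch (field names invented; the Safe Harbor details are from memory, not legal advice):

    # Toy sketch of Safe Harbor-style de-identification; record fields are invented.
    def deidentify(record):
        return {
            "study_key": record["research_key"],     # randomly re-assigned research key, not an original identifier
            "label": record["biopsy_confirmed"],     # e.g. "malignant" / "benign"
            "age": min(record["age"], 90),           # ages over 89 get aggregated into a single 90+ bucket
            "sex": record["sex"],
            "year": record["exam_date"].year,        # dates reduced to year only
        }
        # dropped: name, MRN, full dates, address/ZIP, phone, and the rest of the identifiers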
'I am a software engineer and I think general AI is upon us but hindered by nasty doctors who want to get rich, therefore I will make every effort to believe that this paper shows that AI is better'
Is that it? I don't think so. That wouldn't make any sense. It seems to me it's trying to suggest the radiologists had 100% prediction accuracy. But that can't possibly be true.
Or, like perpetual motion machines or squaring the circle with compass and straightedge, perhaps the entire premise of the technology is flawed and unfixable.
So, pick the 2 (out-of-36) that did better than a solo radiologist & run them everywhere, at nearly zero marginal cost.
Also check if ensembles of the lesser-performing systems (with other systems or with radiologists) can do better, and whether any of the systems can rapid-classify 'easy cases' so expensive radiologists need only check the tougher ones.
What about the ensemble of multiple AI systems, compared to a single radiologist?
In my mind, many AI systems will still be faster than a single radiologist (though this assumption is a completely random guess on my part; I don't know how fast these analysis robots can work).
I quickly read through some of the claims that the articles make about the studies.
While I agree wholeheartedly that there are a lot of false claims about medical AI swirling about, I don’t trust the authors of this paper to tell us where it’s happening, based on the mistakes I’ve seen here.
“The remaining studies used enrichment leading to breast cancer prevalence (ranging from 7.4% to 73.8%), which is atypical of screening populations. Five studies used reading under “laboratory” conditions at risk of introducing bias because radiologists read mammograms differently in a retrospective laboratory experiment than in clinical practice. Only one of the studies used a prespecified test threshold which was internal to the AI system to classify mammographic images.”
I’m very close to the authors of McKinney et al. and I happen to know that 2 of these 3 claims are either false or specious:
1. The operating point was decided ahead of time, before the reader study that was conducted in the paper. I was literally there, saw Scott choose it, and saw his methodology for doing so. So I have first-hand evidence that the claim that they did not use a prespecified threshold is false.
2. Enrichment:
Enriching for positives is absolutely standard practice, and it has mathematically zero impact on the computed metrics (sensitivity and specificity are per-class rates, so they don’t depend on prevalence). It’s not possible to conduct a reader study without enrichment, because cancer is so rare that you would never be able to recruit readers to your study. It just doesn’t make sense at this stage of research not to enrich, because regardless of what you do the study will still be retrospective.
The paper also shows that the results on the unenriched data are basically the same.
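A toy check of that claim, with invented counts, showing that enrichment moves prevalence-dependent numbers like PPV but leaves sensitivity/specificity alone (assuming the model behaves the same on the extra positives):

    def metrics(tp, fn, tn, fp):
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        ppv = tp / (tp + fp)
        return sensitivity, specificity, ppv

    print(metrics(tp=80,  fn=20,  tn=9500, fp=400))   # ~1% prevalence: sens 0.80, spec 0.96, PPV 0.17
    print(metrics(tp=800, fn=200, tn=9500, fp=400))   # 10x enriched positives: sens 0.80, spec 0.96, PPV 0.67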
I think the understanding that we had at Google and that was mentioned in the paper is that these types of retrospective studies are not sufficient to deploy the AI. It’s probably a good thing this paper is stressing those points.
We need prospective studies, but practically speaking it makes sense to first do really solid retrospective studies. Otherwise a hospital system is not going to just let you run your AI on their scans.
Google is proceeding to do this sort of testing with the work that this paper is just falsely thrashing:
I feel that this article has good intent (stressing the importance of prospective trials rather than just retrospective studies).
However, the execution is significantly flawed, and the outcome seems to be that, in this particular forum, people now believe AI will never work for this application.
These things take time and there is a pipeline of testing and trials that takes many years to go through.
1. Research and development
2. Retrospective testing: as a replacement system (so you don’t have to figure out the UX of how to make it help the human)
3. Retrospective testing: as a helping system (generally scientific papers have been less interested in this step and it would be nice to see more interest here)
4. Prospective testing as an assistive system.
I think the Google system is on step 4 now, and this paper is looking at evidence published from stage 2 and claiming the system does not work.
It’s probably a fair claim, because the system has not completed all the testing needed to be deployed. There is as yet no intent to deploy it as a replacement system, but rather as some sort of assistive system, so that prospective evidence can be collected.
However the way they’ve done this is by misrepresenting the paper, and I really think they’ve overstated their case here.
2. Mammographers directly interact with patients to relay findings and biopsy lesions, and they must collaborate with referring providers and pathologists. Breast cancer is a highly emotional topic in the United States due to a combination of public personas ('Angelina Jolie effect') and political maneuvering (see the history of the federal Mammography Quality Standards Act and Program - MQSA). You can't be a "behind the scenes, in the dark" radiologist if you go into mammography.
From a computer science perspective, I one-hundred-percent agree that mammography is a "solvable" problem by AI. Standardization of mammography images (CC and MLO screening views) alongside BI-RADS makes for an amazingly well-labeled dataset. I fully expect AI to surpass the abilities of one-or-more mammographers in the near future.
And yet, amongst radiologists, mammographers are the least likely to be replaced. Why? For the same reason mammography is lucrative to begin with: litigation, emotional patient-doctor interactions, and the social/political climate. Any one of these is a steep hill by itself. Most AI companies want none of the legal risk, and practically zero of them offer patient-facing results. And I wonder if any of the AI startups have considered hiring a celebrity as their spokesperson. Perhaps politics/legislation might work in favor of AI if it becomes standard of care, but I wouldn't want to bank my startup on a federal regulatory change.
We've since changed the title. The submitted title ("94% of 36 AI systems evaluated were less accurate than a single radiologist") broke the site guidelines by editorializing. Submitters: please don't do that.
I work at an AI company where we screen for diabetic retinopathy. The company is a decade old, and its tool has been validated in huge studies (over 100k patients). It's a hard problem that we have all worked very hard to solve. At the same time, it's easy to build an AI tool that looks at an image and says healthy or unhealthy (or poor quality image). So the bar is low for making something that appears functional.
But hardly anyone makes good AI tools, so studies that look at different AI systems see lots of the bad ones. It's always a bummer when the headlines are dismissive of AI in general. Googling "diabetic retinopathy AI", the top result is "Artificial Intelligence Falls Short in Detecting Diabetic Eye Disease" (https://healthitanalytics.com/news/artificial-intelligence-f...), yet if you read the article it says one tool is better than humans, which, to me, is the real takeaway.
I agree. I think the implied conclusion that “none of the AI know what they’re doing!” is disingenuous.
This is like the comparisons of 200 top hedge funds to index performance and concluding that hedge funds can’t beat the market. Ignoring the other conclusion: most hedge funds suck
"Hedge funds" generally aren't supposed to beat the market. They're supposed to do different to the market. That might be, for example, generally a bit worse but hugely better if the market suffers a catastrophic loss for a simple example.
> This is like the comparisons of 200 top hedge funds to index performance and concluding that hedge funds can’t beat the market.
I think you've misread what conclusion you're supposed to draw from this kind of assessment.
For both the hedge funds and this the conclusion isn't that the job is impossible - but that we lack the tools to predict who will do the job. You can't know which hedge fund will outperform (though some will) and we don't seem to be able to predict which AI models will perform well enough to use. Also that your odds of picking a winner "from the field" are quite bad.
It doesn't mean there's anything wrong with using a hedge fund - but there's a level of risk that one shouldn't ignore.
Like, my conclusion from this is that I would not accept a pure AI solution (because most of them are bad), but I would be interested in AI assistance. For many people, though, the promise is in replacing, not supplementing, doctors, and to them this result reads as failure.
Oh no, that could be the conclusion. Ever study quantum physics? Lots of people have tried to come up with some kind of model with inner "hidden" variables that turns HEP physics into a completely deterministic, almost classical, theory. By your logic, if we just trained an AI with enough subatomic interactions, it would eventually be able to predict with 100% accuracy the results of a quantum process.
That's utter bunk.
Markets are the same way: people think there must be some rules that can be worked out to make them 100% predictable, when in reality there is an element of randomness that makes cryptographers jealous. It will never be possible to predict the next number in a random process. If you flip a coin 99 times and get 99 heads, you still can't predict what the 100th toss will be (and before someone "well ackshually"s with something about an unfair coin: don't).
I mean, the job certainly could be impossible, but those studies don't prove it. As it relates to the discussion at hand we know that we can do better at evaluating mammography because humans do it.
Those "hedge funds can't beat index funds or the market" are marketing/PR messaging for index ETFs. Many, if not most, hedge funds are not trying to "beat the market". For some hedge funds, if they did beat the market, then they would be sued by investors for taking far too much risk. Investing has more goals than "make me the most money possible" but most naive investors don't understand this. That said, retail investors don't invest with hedge funds anyway, so most hedge funds don't really care about such misleading messaging.
The article is simplified (a retrospective metastudy) and might not be indicative of real-life performance. Even reader studies (which would be more rigorous) skip so much that would be crucial to actual deployment (integration into the clinical workflow being one such critical factor).
Exactly. PLUS: I am sure the radiologists used for this paper paid extra attention to their diagnoses, because they knew it was a competition and colleagues were watching.
I am sure the average radiologist, tired and bored at 8 in the morning, is much worse.
This is why the standard practice today for detecting cancer is to have two radiologists look at every image: to account for that sort of variability. And you can see that a small number of AIs were better than a single doctor, because of that human variability and frailty. None of them were better than two, however, because that human variability can be corrected for by a second reader, and the computers' errors can't (yet). Maybe they will get there eventually, but I'm much less optimistic than I was <mumble mumble> years ago when I was 23 and working on vehicle autonomy.
Under-rated comment here. The question is not whether some AI beats experts, even though the conclusion of this paper is clearly that the state-of-the-art AI does in fact beat them, but whether they are going to beat the radiologist at the free clinic in some poorly-served place.
The US healthcare system has little correlation between price and care quality. More expensive providers don't reliably deliver better patient outcomes.
Are you claiming that they found two AI systems that beat a skilled radiologist consistently? Because if they had, that would be the headline. People aren't stupid.
I never had anything serious, just sebaceous cysts and fascia inflammation. Two radiologists worked on both of those, and everyone was pretty sure it was not cancerous or anything. I think two radiologists are standard for most things where they may recommend a biopsy.
I asked about AI and they couldn't use the one they had on the cyst because it was not trained for ankles.
That just means the scans were easy to interpret and/or the consequences of error were not mortal. While that's most scans, the ones where we care most about accurate results are the ones the AI is worst at.
Selecting the survivors of a comparative study without understanding where the performance comes from gives you no guarantee of future performance - which is more or less the issue with all these complex multivariate regressions that journalists, startups, and venture capitalists insist on calling AIs.
> Thirty four (94%) of 36 AI systems evaluated in these studies were less accurate than a single radiologist, and all were less accurate than consensus of two or more radiologists.
The guidelines say you should use the title as-is, but my attempt might put the focus on the "and 100% were worse than 2 radiologists" part, to mitigate exactly the problem of "well let's use the 6% that actually work" comments that exist without the context.
Not GP, but I'd prefer '34' over '94%'. (Especially since the latter was parenthesised - I'd remove the parenthetical before anything else.)
(Or just 94%, and then have people say 'but N= only 36!'.)
Percentages for small numbers, in particular below 100, are just annoying and often designed to mislead. (Though I'm not suggesting that here, it's standard in medicine if not other fields.)
I admit that radiology sounds like a tractable problem for machine learning; they'll probably get better.
It strikes me, without any real knowledge but this is the internet, that it just goes down the typical rabbit hole of human specialization. Make the data portable, move the data handling overseas, add computer optimization over time. You do have to wonder if radiologists (any here?) really need general medical knowledge, or if highly specialized training with pictures would be enough.
Huh? 6% of the systems outdoing a human who spent years training for her profession is supposed to be bad news? That's awesome! Especially considering that these are not supposed to replace radiologists (yet), but merely to augment expert judgment.
That's assuming they can consistently do it, that the results in the paper were not just random luck, and not taking into account the costs and risks associated with inaccurate diagnoses.
Sure, but this is all _very_ early. You should see the kind of AI "techniques" some of those papers are using. It's not Imagenet, but people just take "academic" models from papers (sometimes from prehistoric times, like 7-10 years ago) and try their luck with them. News flash: model architectures and training regimes are overfitted quite severely to academic datasets such as Imagenet, OpenImages and MSCOCO at this point.
When you see a bear riding a bicycle, you don't complain that the bear rides poorly. You're in awe that it can ride at all.
Wait until people who know what they're doing move in. Right now a lot of those people are in academia, trying to squeeze the last 0.01% of lift out of Imagenet dog breeds via architecture searches at $100K a pop.
Only more accurate than a single radiologist's review. Current practice is review by 2 radiologists. 0% were more accurate when compared to the current practice.
Title could be something like "Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy" (which is the original title)