International evaluation of an AI system for breast cancer screening (nature.com)
86 points by superfx on Jan 1, 2020 | 26 comments



A bit unrelated, but I'm surprised by how second readings are performed.

_In the UK, each mammogram is interpreted by two readers, and in cases of disagreement, an arbitration process may invoke a third opinion. These interpretations occur serially, such that each reader has access to the opinions of previous readers._

I wonder if it's sensible to condition the opinion of the second radiologist this way. My first impression is that this should be done blindly, or the temptation to just confirm the previous reading out of fatigue/laziness or to avoid confrontation could severely affect the results.

Similarly, a future in which the radiologist gets too much support could simply impair their skills and turn them into a sort of mechanical worker who just accepts what the AI overlord has decided.


I’m not a huge fan of turning the BI-RADS classification scheme into a ROC curve. From what I’ve seen, BI-RADS is something like a yes / no / maybe scheme for mammograms. I don’t think it was designed to be treated like a test score, so using it to generate a ROC curve feels like an unfair comparison between the AI system and current clinical practice.

What they’re doing is interesting, but it’s still very academic. I have little doubt that eventually some sort of AI system will benefit clinical practice, but based on the sheer number of studies that fail to make it over the line, I’m not sure I have high hopes for this one. What they’ve done so far is the equivalent of “It works in vitro...”


As I understand it, for comparison they also converted the AI system's output to BI-RADS classes and then fit an ROC curve from those.


Huh, where does it say that in the article? I don’t think I spotted that.

All the same, it feels sort of beside the point to me. It just doesn’t feel right to take a medical diagnostic tool, whose intended purpose is communication among doctors, and treat it as a test score. That’s just... not what it was designed for.


On page 3: "Readers rated each case using the forced BI-RADS scale, and BI-RADS scores were compared to ground-truth outcomes to fit an ROC curve for each reader. The scores of the AI system were treated in the same manner (Fig. 3)."

This isn't as clear as I'd like it to be, but Fig. 3 shows both an "AI system" and an "AI system (non-parametric)" ROC curve. My understanding is that the former is fit from the discrete BI-RADS classes, and the latter from the "raw" output.
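To make the distinction concrete, here's a rough sketch (toy data I made up, nothing from the paper, and it uses the empirical ROC rather than whatever parametric fit the authors apply to the reader scores). The point is that an ordinal BI-RADS-style score only gives you a handful of operating points, while a raw continuous score gives the usual fine-grained curve:

    # Toy sketch (made-up data, not the paper's): empirical ROC from an
    # ordinal BI-RADS-style reader score vs. a raw continuous model score.
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    rng = np.random.default_rng(0)
    y = rng.binomial(1, 0.1, size=1000)  # hypothetical ground-truth outcomes

    # A reader's forced BI-RADS-style rating: a noisy ordinal score, here 1-5.
    reader = np.clip(np.round(2 + 2 * y + rng.normal(0, 1, size=1000)), 1, 5)
    # A model's raw continuous malignancy score on the same cases.
    model = y + rng.normal(0, 0.8, size=1000)

    # The ordinal score yields only a few ROC operating points (one per
    # threshold between adjacent score values); the continuous score gives
    # the familiar fine-grained "non-parametric" staircase curve.
    fpr_r, tpr_r, thr_r = roc_curve(y, reader)
    fpr_m, tpr_m, thr_m = roc_curve(y, model)
    print("reader operating points:", len(thr_r))  # only a handful
    print("model operating points: ", len(thr_m))  # many more
    print("reader AUC:", round(roc_auc_score(y, reader), 3))
    print("model  AUC:", round(roc_auc_score(y, model), 3))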


> Notably, the additional cancers identified by the AI system tended to be invasive rather than in situ disease.

This is probably due to invasive cancers being more common (~80%) than in situ cancers. I am not sure why this natural explanation was not suggested.


> Screening mammography aims to identify breast cancer at earlier stages of the disease, when treatment can be more successful. Despite the existence of screening programmes worldwide, the interpretation of mammograms is affected by high rates of false positives and false negatives. Here we present an artificial intelligence (AI) system that is capable of surpassing human experts in breast cancer prediction. [...]

> In an independent study of six radiologists, the AI system outperformed all of the human readers: the area under the receiver operating characteristic curve (AUC-ROC) for the AI system was greater than the AUC-ROC for the average radiologist by an absolute margin of 11.5%. We ran a simulation in which the AI system participated in the double-reading process that is used in the UK, and found that the AI system maintained non-inferior performance and reduced the workload of the second reader by 88%. This robust assessment of the AI system paves the way for clinical trials to improve the accuracy and efficiency of breast cancer screening.

So, there you have it: AI is not "either/or" with humans, but both in conjunction, a composition of the best of both worlds.
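For the curious, here is one plausible way the double-reading simulation could be wired up (this is my guess at the setup, with toy numbers; it is not the paper's actual protocol or code): the AI stands in as the second reader, and a human second read or arbitration is only requested when it disagrees with the first reader.

    # Rough sketch, toy numbers only: AI acts as the second reader; only
    # disagreements with the first reader get escalated to a human.
    import numpy as np

    rng = np.random.default_rng(1)
    n_cases = 10_000

    # Hypothetical recall decisions (1 = recall for further assessment).
    first_reader = rng.binomial(1, 0.06, size=n_cases)

    # Pretend the AI flips the first reader's decision 12% of the time,
    # i.e. it agrees on ~88% of cases (contrived to echo the paper's figure).
    ai_second = np.where(rng.random(n_cases) < 0.12, 1 - first_reader, first_reader)

    # Only disagreements are escalated to a human second reader (and possibly
    # arbitration, as in the UK pathway described elsewhere in the thread).
    escalated = first_reader != ai_second
    print(f"second-reader workload reduction: {1 - escalated.mean():.0%}")

The interesting part of the paper's claim is of course that performance stays non-inferior under this substitution, which a toy like this can't show; it only illustrates where the workload saving comes from.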

At the very least, that's how civilization will massively and intimately introduce true assistant AI.

It's also somewhat counter-intuitive that the most specialized tasks are the low-hanging fruit; i.e. that what is "difficult" for us, the culmination of years of training and experience (e.g. how to read a medical scan), may be "easy" for the machine, given its natural advantages (like speed and parallelism).

That space (where machine expertise is cheaper than human expertise) roughly maps to the immense value attributed to the rise of industrial-age narrow AI. Therein lies not a way to replace humans (we never did that in history; we merely destroyed jobs to create ever more) but a way to augment ourselves once more to whole new levels of performance.

Anything more than this is AGI-level, science fiction so far, and there's not even a shred of evidence that it's theoretically a sure thing, or even possible in the first place. Which is not to say that AI safety research isn't extremely important even for the narrow kind (manipulation comes to mind), but we shouldn't go as far as to bet future economic growth on its existence. Like fusion or interstellar travel, we just don't know yet, and won't for the foreseeable future, because of scale.


Exactly this. This is where I see AI possibly going: to be a complementary tool or second pair of eyes that speeds up the work for professionals rather than replacing them. I also see this research as a very positive step forward for using AI for good, especially in bringing highly accurate results that can be used as an aid for health professionals.

However, given that this research used a deep learning (DL) based AI system in the medical setting, there are still questions around the system explaining itself and its internal decision process for the sake of transparency, which will almost certainly be ignored by news sites that will focus only on the accuracy. DL-based AI systems will remain a concern for both patients and clinicians, and I would expect explainability to be a focus point in the future, welcome and interesting as these results are.

Other than the transparency issues behind the AI system, I'd say this is a great start to the new decade for AI.


Agreed. The ability of someone (or an AI) to explain their decision-making process is critical in determining whether such a decision has been adequately thought out or not. If a PhD candidate must go through a viva, surely it is also incumbent on anybody pushing "AI" to be able to "survive" such a viva. Otherwise, we might as well go back to the days of reading entrails, flipping coins, etc. [edit: typo on viva]


Note that the system does produce localization. "In addition to producing a classification decision for the entire case, the AI system was designed to highlight specific areas of suspicion for malignancy."


How many years did centaurs reign supreme over pure AI in chess? 5-10 maybe? This "both" stuff is just a temporary stop on the way to meat obsolescence.


Agreed. At some point, doctors will be a completely redundant step in analyzing these scans. Even before then, the AI will reduce the amount of labor needed and partially commoditize some medical professionals.


The only issue is that humans don't seem to do well at jobs in which another agent is at least plausibly reliable. Tesla's Autopilot is an example of that: we tend to disconnect pretty quickly.

Another thing I find interesting is that Google was able to train a neural network on retinas that can reliably distinguish sex from a retinal image alone... something ophthalmologists basically can't do. So not only are these systems approaching human capability at tasks we can do, they can do things we can't. As medical data becomes more freely flowing (presumably) over the next couple of decades, I think we'll find that 'AI' can become even more reliable.


I think the machines' advantage can be summarized as "good at aggregating weak signals". Humans excel at analyzing complex signals, but basically can't use signals below a certain strength. Machines have no trouble with weak signals.
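A toy illustration of that intuition (made-up numbers): each individual feature below is barely better than a coin flip on its own, yet pooling a couple of hundred of them separates the classes quite well.

    # Toy example: many individually weak signals, aggregated by a simple mean.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(2)
    n, d = 2000, 200
    y = rng.binomial(1, 0.5, size=n)

    # Each of the 200 features carries a tiny signal buried in noise.
    X = rng.normal(0, 1, size=(n, d)) + 0.08 * y[:, None]

    print("single weak feature AUC:", round(roc_auc_score(y, X[:, 0]), 2))          # ~0.52
    print("pooled (mean of 200) AUC:", round(roc_auc_score(y, X.mean(axis=1)), 2))  # ~0.79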


Perfectly describes what I was struggling to put into words, thank you.


There were something like 80 different vendors showing assisted AI in many different spaces at RSNA 2019.

It's well known that every major PACS vendor is looking at assisted findings (some have been available for ages, e.g. iCAD BI-RADS findings for telling the rad to check) or even just case prioritization for radiologists (e.g. Aidoc has an algorithm for brain bleeds used for case prioritization, not diagnosis).

They're all really employing machine learning (Zebra Medical claims 30 million scans processed).

Medical "AI"/"algorithm" companies have grown massively over the past few years.


> e.g. Aidoc has an algorithm for brain bleeds used for case prioritization, not diagnosis

Eh... well, it's more complicated than that. These systems CAN diagnose, but their regulatory approval is only for use as an aid, not as a diagnostic tool.


Paywalled scientific research, no source code, and reviewers who don't mind run-of-the-mill CNNs being called "our AI system".

How far has Nature fallen these days? How long before Nature is merely a PR agency for big tech?


Unfortunately the biomedical field gets a bit too excited when Google publishes some work, so what should be minimum requirements get thrown out the window for fear of some other journal getting to publish it.


Not only is this "no code", it is also "no data"! At least a data availability statement is now required, so we know the data is not publicly available...



Related from 2 days ago: https://news.ycombinator.com/item?id=21917747. This looks like different work though?


Yes, it is different work. The current one, published in Nature, is from DeepMind.

It is interesting to note the differences. For example, DeepMind notes "In our reader study, all of the radiologists were eligible to interpret screening mammograms in the USA, but did not uniformly receive fellowship training in breast imaging." whereas DeepHealth notes "All readers were fellowship trained in breast imaging", so +1 to DeepHealth.

On the other hand, DeepMind says "Where data were available, readers were equipped with contextual information typically available in the clinical setting, including the patient’s age, breast cancer history, and previous screening mammograms." while DeepHealth says "Radiologists did not have any information about the patients (such as previous medical history, radiology reports, and other patient records)", so +1 to DeepMind. And so on. These differences make direct comparison between studies very difficult.


This "+1" thing is damaging and incorrect.

Depending on the context the model ends up being used in, something that appears good may not be. For example, the fellowship training point: these non-fellowship-trained radiologists are doing this task now, so it is absolutely reasonable to assess against them to test real-world performance.

It would be interesting to see if the fellowship-trained radiologists did actually perform better in all circumstances (in some fields the better-trained radiologists end up not using their skills on as broad a range of patients, so their performance is actually worse on some subsets of data).


The +1 was mostly to indicate whether you should adjust the reported result up or down to make it comparable with other studies. I didn't mean to imply anything about clinical relevance.


Yeah that is fair.



