As a founder of another company in this field, let me start by saying that this approval is a big deal. Kudos to IDx. This is the very first time the FDA has approved a fully automated CADx (computer-aided diagnosis) device. Eyenuk is also on its way to an FDA approval, and conducting the prospective clinical trials is a lot of work.
There are some misconceptions on the thread, so let me help clear them up. A screening test is indicated on an annual basis for anyone with diabetes who does NOT have visual symptoms. Diabetic retinopathy (DR) progresses without any symptoms and is preventable if detected early, but despite its preventable nature, DR is the leading cause of blindness in working-age adults even in the developed world.
The test is for screening rather than providing a full diagnosis and is not intended to replace a dilated ophthalmologist examination. You don't need a specialist to screen, but you need a specialist to diagnose and treat. Sensitivity is the percentage of times the test correctly identifies the presence of more than mild diabetic retinopathy (in this case, 87.4 percent of the time), and specificity is the percentage of times the test correctly identifies those patients who do not have more than mild diabetic retinopathy (in this case, 89.5 percent of the time). Note that neither sensitivity nor specificity is the same as accuracy. The sensitivity and specificity generally compare well to those achieved by humans.
That's correct. The UK is the only major (in some vague sense) country where diabetic retinopathy is not the leading cause of preventable blindness in adults. This is very likely because they are able to screen more than 80% (nearing 85%) of their diabetic population, an impressive feat. This leads to another issue: they need to consistently grade the retinal images of over 2.2 million patients with diabetes. This is where AI could help -- in improving consistency and turnaround time, and we are working with the NHS in the UK to explore this.
I believe self-service kiosks would be quite feasible. There are two key components: (1) automated non-mydriatic (not requiring dilation of the pupil) retinal imaging and (2) automated grading of images using AI.
The technology is there, but more work would be needed for a self-service kiosk to be FDA approved. Another thing that is not clear is whether it is a commercially good idea at this time, given that only a single disease (diabetic retinopathy) is approved. I can see a future where one can use such kiosks to look for multiple conditions and assess risks for various diseases including cardiovascular disease, neurodegenerative diseases, stroke, and hypertension.
>> In one clinical trial that used more than 900 images, IDx-DR correctly detected retinopathy about 87 percent of the time, and could correctly identify those who didn’t have the disease about 90 percent of the time.
I read that as .87 accuracy, .9 specificity (True Negative Rate). However, I can't find the sensitivity (recall, or True Positive Rate) in the link provided in the article above.
I'm guessing it goes a bit like this (assuming perfectly balanced classes, which in reality they aren't):
                Predicted +   Predicted -   Total
    Actual +        378            72        450
    Actual -         45           405        450
    ---------------------------------------------
    Total           423           477        900
Accuracy: 0.8700
Error: 0.1300
True Positive Rate: 0.8400
True Negative Rate: 0.9000
Precision: 0.8936
Recall (TPR): 0.8400
F-Score: 0.8660
I'm not sure how good or bad 10% false positives and 16% false negatives are for that kind of diagnosis. The linked trial page says that 40% of diabetes patients have some degree of diabetic retinopathy (DR), that early treatment reduces vision loss "by as much as 52%", and that only some 50%-60% of people with diabetes have a yearly eye exam.
Off the top of my head, it looks like automated screening will do some good and probably more good than harm, but without knowing how doctors judge good vs harm there's no way to know for sure how useful this device will really be.
This is not completely correct. The 87.4% figure is sensitivity, which is not the same as accuracy, and the 89.5% figure is specificity (i.e., the true negative rate).
Sensitivity generally gives an indication of the safety of a medical device, and specificity gives a general indication of its effectiveness.
The original article on The Verge reported "87% accuracy" and a 90% figure that sounded like the TNR. The new link points to an FDA page that makes it more likely that ".87" is actually sensitivity:
> IDx-DR was able to correctly identify the presence of more than mild diabetic retinopathy 87.4 percent of the time and was able to correctly identify those patients who did not have more than mild diabetic retinopathy 89.5 percent of the time.
So, I guess, something like this:
                Predicted +   Predicted -   Total
    Actual +        393            57        450
    Actual -         47           403        450
    ---------------------------------------------
    Total           440           460        900
Accuracy: 0.8844
Error: 0.1156
True Positive Rate: 0.8733
True Negative Rate: 0.8956
Precision: 0.8932
Recall (TPR): 0.8733
F-Score: 0.8831
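For anyone who wants to play with these numbers, here is a minimal sketch of the arithmetic behind the tables above (the even 450/450 split of 900 patients is my simplifying assumption, not the trial's actual design):

    # Back-of-envelope reconstruction of the confusion matrices above,
    # assuming an even 450/450 split of the 900 subjects.
    def confusion_from_rates(sensitivity, specificity, n_pos, n_neg):
        tp = round(sensitivity * n_pos)
        fn = n_pos - tp
        tn = round(specificity * n_neg)
        fp = n_neg - tn
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return {
            "TP": tp, "FN": fn, "FP": fp, "TN": tn,
            "accuracy": (tp + tn) / (n_pos + n_neg),
            "precision": precision,
            "recall": recall,
            "f_score": 2 * precision * recall / (precision + recall),
        }

    # FDA-reported figures: sensitivity 87.4%, specificity 89.5%
    print(confusion_from_rates(0.874, 0.895, 450, 450))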
Thanks for this analysis. The balance in the real world is more like 20-80, i.e. 20% of typically screened patients would have referable retinopathy (screen positives).
What's crazy is that if AI is better than doctors by some significant degree, what do we do when the doctor and AI disagree? Like if doctors are right 85% of the time, but the AI is 90%.
I guess we treat it as another doctor? Like if we have 4 opinions that agree, we go with that one regardless of the source of those opinions (as long as they meet some minimum competence threshold).
As others have said - perform more tests. I've worked around molecular pathology testing for cancer (which comes up with a diagnosis based on data analysis after running DNA or RNA sequencing). If the molecular report differs from what the surgical pathologist saw when looking under the microscope, it's typically not 'it's cancer' vs 'it is not cancer', but more 'this is molecularly non-small cell lung cancer' vs 'this appears to be prostate cancer'. So what will happen is they'll do more staining on the sample, specific to what the molecular report came out with - and a lot of times - bam. That tumor in someone's prostate is actually lung cancer and needs to be treated that way.
Currently, what happens is that if a diagnostic test comes back and it suggests something serious, say cancer, and the doctor does not pursue it, then the doctor would be liable if it did turn out to be cancer.
So if a machine disagreed with a doctor, then I would assume that the doctor will grudgingly have to investigate further until there is enough evidence to rule out that diagnosis.
#headache
What I can see happening is that patients will go to this machine for a second opinion. And if it then returns an opinion that contradicts the primary physician, an entire can of (legal) worms will be opened.
--
Addendum:
To elaborate further, there is sometimes what's called the benefit of history.
Say a patient visits 10 doctors. The 10th doctor has an unfair advantage over the first 9 simply because he/she will have prior knowledge of which diagnoses and treatments were incorrect.
Similarly, for an AI vs. human doctor situation, incorporating that additional information (for the AI) would require a considerable amount of data to train on in order to recognize prior history, failed treatments, and such.
For image specific diagnoses (eg. recognizing melanoma, retinopathy), these do lend themselves to AI very nicely. For other diagnoses that contain a significant amount of, shall we say, "human factors", then less so.
Doctors aren't liable for failing to predict the future or for making an imperfect diagnosis.
If a doctor reviews the available data, reasonably concludes that it shouldn't be pursued further, and it later does turn out to be cancer, then that by itself does not mean that the doctor is liable for anything. Malpractice requires actual culpable negligence, such as missing something obvious, not interpreting a questionable situation in a manner that turns out to be wrong. The existence of a second, contrary opinion doesn't change that.
This isn't a new issue; there have been CAD systems that outperformed average clinicians (on very specific tasks) since at least the mid-90s. At the end of the day, liability drives the resolution process in some jurisdictions, efficiency in others.
>> Like if doctors are right 85% of the time, but the AI is 90%.
There is always the possibility that the doctors and the device are both right 90% of the time, but not the same 90% of the time.
Or that either the doctors or the device are right most of the time for the most severe cases but the other party is right only for the milder cases, etc.
I know this is a totally whack-job comment, but the TV show "The Good Doctor" is kinda leaning this way. Instead of relying on a ton of personal bias, the main character generally diagnoses more the way an ML system would; obviously there's no way to establish any foundation for this since it's based on a TV show. But it offers a vision of what you're saying, except instead of an AI it's an individual with savant syndrome making better judgments than the rest of the doctors. That being said, I would imagine a savant being placed in a role like that is less likely than the show portrays, so where does that leave AI?
This is definitely going to be an issue. Even in cases where you're measuring your tool against "expert consensus" (often 3-5 physicians), there's a reasonable likelihood that the consensus may be wrong in certain types of cases.
Though even in those cases, you might be looking to show that your tool agrees with physicians at least as often as physicians agree with each other. Malpractice is usually about failing to offer the standard of care, and if you can show a reasonable level of concurrence with the standard of care in research and trials, you may be able to move forward and reach those higher levels of accuracy.
a) Usage still requires the presence of the Doctor.
b) The doctor does nothing but relay the AI’s message.
c) The doctor continues to charge the same and treat the same number of patients.
d) Everyone who expresses “hey isn’t the doctor redundant now? Shouldn’t we be treating more patients for cheaper” gets ridiculed as “one of those people”.
e) Edit: Also, the doctors’ association devotes significant resources to come up with memetically virulent reasons why the world would end if we took doctors out of the loop.
I mean, that’s how a lot of obviated jobs are currently treated...
Insurance companies assume the liability for the doctor's diagnoses. I'm not sure why they'd be unwilling to do the same for the software's diagnosis. Somewhere, an actuary is willing to estimate that risk.
The company is saying you don't need a specialist, but after applying Bayes' theorem (using 90% TN, 87% TP, and a prior of 200,000 with the complication out of 29,100,000 with diabetes), the chance you actually have the condition when the machine says you do is only about 6 percent.
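For concreteness, here is a minimal sketch of that Bayes calculation (the 200k/29.1M prior is the figure quoted above; as other comments point out, the prevalence in an actually screened population is far higher, around 20%):

    # Positive predictive value from sensitivity, specificity and prevalence.
    def ppv(sensitivity, specificity, prevalence):
        true_pos = sensitivity * prevalence
        false_pos = (1 - specificity) * (1 - prevalence)
        return true_pos / (true_pos + false_pos)

    prior = 200_000 / 29_100_000       # prevalence assumed in the comment above
    print(ppv(0.87, 0.90, prior))      # ~0.06: most positives are false at that prior
    print(ppv(0.874, 0.895, 0.20))     # ~0.68 at a ~20% screening prevalence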
If the images are of sufficient quality, the software provides the doctor with one of two results: (1) “more than mild diabetic retinopathy detected: refer to an eye care professional” or (2) “negative for more than mild diabetic retinopathy; rescreen in 12 months.” If a positive result is detected, patients should see an eye care provider for further diagnostic evaluation and possible treatment as soon as possible.
Are you interpreting the Verge or talking about some other statement?
> IDx-DR founder Michael Abràmoff told Science News. “It makes the clinical decision on its own.”
But I would guess a specialist would still need to be involved since it's not a fool-proof system. A specialist might take other symptoms or variables into account when making the diagnosis or order further tests. While this tool might be useful for blanket screening considering that it is harmless, it seems like it's hardly going to be "making the decision on its own" and prescribing treatment.
90% of people are accurately detected as not having the disease, i.e. a 10% FP rate. So ~3m people would be falsely flagged as having the complication, versus the ~200k who actually have the disease.
10% isn't a great number, but it isn't clear from this coverage whether this complication is generally asymptomatic or not. If there are symptoms to go with it, the numbers may be far better.
Also keep in mind that if there are positives, general practitioners would refer to a specialist anyway for treatment. These specialists would be more than equipped to detect false positives. Teleretina imaging is becoming more and more prevalent as well, with eyePACS and Welch Allyn having dedicated interpretation services, so patients wouldn't necessarily have to go somewhere for verification.
I'm more worried about the 87% accurately detected as having the disease, i.e. a 13% false negative (FN) rate. I don't know how many general practitioners would actually send a patient to a specialist if the device did not detect changes.
The retina seems well suited to AI approaches, though, so I'd be interested in what comes next from companies like this, DeepMind, and other researchers/organizations (look out for Lee et al over at the University of Washington)
Perhaps; that's one way to compare non-inferiority. Few if any primary care physicians take the time to look in one's eyes (they don't know how to use a direct ophthalmoscope and don't have other specialized fundus cameras in the clinic). Given that, if this tool forces them to take more retinal photographs of all patients, maybe we could detect diabetic disease before it is usually seen.
The standard practice today is that if a patient is determined to be diabetic then they get referred to an ophthalmologist visit once a year. In that case would comparing those rates of diagnosis be useful?
They aren't. With some luck their errors do not correlate with those of the machine, which gives us the best of both worlds: cheap diagnosis for healthy people and precise diagnostics for unhealthy ones.
Keep in mind that the test wouldn't be administered on any randomly chosen diabetic patient but presumably rather on those who already exhibit some type of vision impairment consistent with the disease. As a result, the prior of 200k / 29M is not quite right and I'm guessing that the true prior is likely much higher.
>Keep in mind that the test wouldn't be administered on any randomly chosen diabetic patient but presumably rather on those who already exhibit some type of vision impairment consistent with the disease.
Not true. The aim is to treat patients before they become symptomatic. Outcomes are much worse otherwise.
Any machine or specialist that wants to make a diagnosis based on a test result will have to consider other factors apart from the result.
The false-positive Bayesian math is a good illustrative example, but reality is more complicated. And no doctor will base their diagnosis solely on one number.
Since a lot of the discussion was objecting to the title and/or exaggerated coverage, and the press release is more factual (let that sink in for a second), we changed the URL from https://www.theverge.com/2018/4/11/17224984/artificial-intel... to the press release.
Generally, the FDA (or the government body responsible for certifying medical devices) does not conduct code reviews in the sense of looking at the code and trying to find bugs.
The way it works is: the manufacturer of a medical device assesses the harm that can be caused by a software malfunction, and assigns it a safety classification (class A, B or C). Class A is used when no injury is possible, and class C is used when death or serious injury is possible (e.g. a surgical robot). The manufacturer also provides a "failure modes and effect analysis" document that looks at everything that could go wrong, what is the likelihood of the failure happening and what is the effect on the patient.
Based on the safety classification, IEC 62304 requires different levels of rigour. For example, the standard only requires blackbox testing for class B software, whereas for class C software it requires whitebox tests as well.
The manufacturer also needs to come up with a software development plan that ensures that all of the requirements of the standard are met, and an "argument" (supported by test reports, process documentation, source control history, etc) that the software was developed according to the plan.
And that is what the FDA audits: they look at the development process of a given feature and they check that the plan was followed. I think they rarely delve into the details of the implementation and are generally just checking that the safety arguments are sound and supported by evidence.
Worth noting that this is a rough description of how IEC 62304 looks at the problem, but adherence to that standard is not required by many regulatory bodies (including FDA, although there is guidance). It's a good approach to this, but there are others.
More generally, the regulatory body will be looking for you to have a formal engineering process in place and be able to demonstrate its efficacy. Part of that will be looking for how you do hazard and risk analysis, how you handle CAPA (corrective and preventative actions in FDA-speak), how you do system trace, design history file generation, etc. etc. That you have a software development plan and can demonstrate how you follow it.
So they aren't really interested in code reviews per se, but they are very interested in how you view code reviews, how you perform them when you do, what gets documented, how you perform trace and V&V, etc.
Seems like they are using this to get a proper referral to a specialist rather than using it as the sole diagnosis. The code itself is probably 15 lines using TensorFlow or another framework, I'm guessing, but I could be wrong.
Oddly enough that's what they're aiming to claim a monopoly over.
Here's[1] a patent they have filed toward the system. Claims 1-18 and 20 are focused on the training of the neural network. Looks like Claims 1-18 are going to be granted soon, largely in that form, judging from PAIR[2].
I got a book on TF last year and saw all the examples were fairly short and was impressed with the power. However, it took me a good three months before I could reason about why each line was there. It's pretty complex stuff that takes a while to even understand, let alone actually write.
A lot of machine learning in particular is like that. The best practitioners have very strong intuitive understandings of the data and the modeling problem, which is what allows them to construct effective models (wrt capacity, training algorithm, topology, use of certain optional training procedure additions). I would liken their job to that of a doctor writing a prescription: it doesn't take much to write the prescription itself, but it takes mastery to be able to analyze the problem and write the best prescription
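To put the "short code" point in perspective, here is a hedged sketch of what a transfer-learning model definition for fundus images can look like in TensorFlow. None of this reflects IDx's actual system; the architecture, image size, and metric are placeholders. The point is that the model definition is the easy part; the data, validation, and regulatory work are not:

    # Illustrative only: a small transfer-learning classifier for retinal images.
    import tensorflow as tf

    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", pooling="avg",
        input_shape=(512, 512, 3))
    base.trainable = False                       # freeze the pretrained backbone

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # referable DR vs. not
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.SensitivityAtSpecificity(0.9)])
    # model.fit(train_ds, validation_data=val_ds, epochs=10)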
They generally treat computing devices as black boxes. Performance is all that matters. But once an artifact is submitted, it's locked down. That and only that will be approved. You change a resistor or an if-then, and it goes back to the FDA.
In general though approval to market for particular indications is for one fixed configuration of a product, so your model parameters won't change.
All of this is in the process of being hashed out, but I expect for a while at least if you are doing on-line learning it will be in non-clinical configurations only and you will end up releasing an update periodically. Depending on the changes this may need a new 510(k) or not, but would definitely need a formal release.
As someone in this field (AI+medicine), I think this is the best analogy. Though a key distinction is that the human body is also a person whereas a plane is not. Physicians are there to promote the health of the person not just their mechanical parts. I've seen doctors fudge billing codes to help poor patients afford care. I've seen doctors pick up on domestic violence situations based on small social cues. There is a certain degree that healthcare relies on the humanness and empathy of physicians to promote human flourishing.
I'm a doctor and very much pro ML/AI. I'd love to have an autopilot I could watch in awe. There are still a lot of practical tasks, though, that will be much harder to automate. And for the first few generations of AIs, I guess someone will have to babysit them.
I think a lot of commercial pilots would take offense to this. Commercial pilots can still fly without autopilot. Just as much as you can drive on the highway without cruise control.
For those interested in the research side of this, Google Brain actually published a study in JAMA on the same topic. They did clinical trials in India and should be publishing those results eventually.
In terms of how well the "experts" perform: "For moderate or worse DR, the majority decision of ophthalmologists had a sensitivity of 0.838 and specificity of 0.981"
There are several comments about the accuracy of the algorithm, but doctors also struggle with diagnosis. In the DeepMind study on DR they found that for about 20% of referable cases, doctors disagreed on the diagnosis about 40-50% of the time. To combat this they had at least 3 and up to 7 ophthalmologists grade each image.
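For anyone curious what that adjudication looks like in practice, here is a toy sketch of a majority-vote consensus label (the grader count and label names are illustrative, not the study's actual protocol):

    # Toy consensus labelling: each image gets several ophthalmologist grades
    # and the reference label is the majority vote.
    from collections import Counter

    def consensus_label(grades):
        counts = Counter(grades)
        label, votes = counts.most_common(1)[0]
        return label, votes / len(grades)        # label plus level of agreement

    print(consensus_label(
        ["referable", "referable", "non-referable", "referable", "referable"]))
    # -> ('referable', 0.8)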
I wonder what factors made this decision possible? I love the idea of automated diagnosis but the performance rates are 87% true positive and 90% true negative in the article. Seems a bit low.
Maybe people aren't getting diagnosed at very high rates? That would be a reasonable justification for deployment with somewhat less than perfect accuracy. Anyone have any insight?
For new stuff, FDA performs a holistic cost/benefit analysis.
In this case, they might weigh:
* How many new cases are caught by expanding access to specialist tools
* What fail safes exist in current course of care — how does a false negative result in a worse outcome for a patient than if they had had no diagnostic at all
* etc.
The summary of their decision is public record, but not the detailed analysis.
It also seems like a quick way to screen for this condition, vs the level of concern that a primary care doc might need to have before referring someone to an ophthalmologist. In other words, you could screen more people, even those that are asymptomatic (but have diabetes) and potentially catch retinal disease earlier. Even without mind-blowing sensitivity / specificity, this could preserve a lot of people’s vision.
The status quo is screening at the eye doctor. This enables screening at primary care visits. People mostly go to primary care more often than the eye doctor.
The medical risk is that people will forgo other screening for 12 months when given a negative result. The cost of additional screening for false positives is the other big downside (this is all the machine does, recommend a specialist visit or to rescreen in a year).
> But of course, not having a specialist “looking over the shoulder,” as Abràmoff puts it, raises the question of who will be responsible when the diagnosis is wrong
Ultimately, accountability and transparency will be the Achilles heel.
Ultimately it has to be the company providing the technology that is liable, but I bet you they have a terms of use clause that puts the responsibility on the clinics using the technology.
The more important question is what they're actually held to. In the case of a diagnostic test like this, the most reasonable thing is to hold them to the false negative and false positive rates that they say the product has. No real-world diagnostic test is perfect. Instead, we design the systems around the tests to accommodate the fact that they do have errors. As long as the device manufacturer is correct about their probabilities of errors, then we can incorporate them into a system that works better than what we had before. Or we can know that we cannot use them in a useful way, and just ignore it.
I feel like this has the same promise as self driving cars - raise the floor for the quality of service while also experiencing random unexplained failures that its human supervisors fail to notice in time.
Based on my experience watching doctors (good ones) trying to diagnose "weird things", they are just manually executing an expert system algorithm anyway. They aren't doing what an engineer might expect -- working from a basic understanding of how the body works and looking for plausible explanations. They're instead simply pattern matching against a database of facts.
A lot of diagnostic skill in complex cases is based on clinical intuition developed over many years of practice. That's qualitatively different from an expert system executing defined rules against facts.
You'd think, but honestly in my experience it wasn't like that. An expert system would have done as well, or better. Possibly my experiences were colored by the cross-specialization nature of the issue -- specialists seem reluctant to engage any thinking outside of their area, I found. Like a software engineer would never consider that the problem they're investigating is caused by memory bits flipping randomly.
The company's board (politicos) and syndicate (there is none) are a bit weird for an FDA approval... maybe there are some caveats or scope notes that I have not seen.
"The 207,130 images collected were reduced to the 108,312 OCT images (from 4686 patients) and used for training the AI platform. Another subset of 633 patients not in the training set was collected based on a sample size requirement of 583 patients to detect sensitivity and specificity at 0.05 marginal error and 95% confidence. The test images (n = 1000) were used to evaluate model and human expert performance." --
I don't see the big deal. We've been using 'AI' in medical equipment for nearly a decade. Look at da Vinci surgical devices. It's just that only recently has the technology been called AI.
Well, sure, but the software's architecture probably isn't too particular to this use-case; it's just computer vision. Given FDA approval for using CV for this, we'll probably quickly see many other companies attempting to drive similar technologies to market.
Or, rather, we would if there were any existing companies in this space that could take advantage of this. The creator of this device, IDx (https://www.eyediagnosis.net), seems to be rather unique in being an entrepreneurial medical-device manufacturer; MDMs are a rather hide-bound lot.
Honestly, if there's going to be a wave of innovation in this space, I might expect it to come from the inorganic-chem-focused pharma companies, since they have the expertise in both materials science and machine-learning (from doing novel small-molecule detection studies) required to come up with the innovations. I expect they'd likely partner with one or more of the MDMs to build the hardware, but they would write the software.
I think the title is fine, but a lot of the comments are applying what the software does to medicine in general.
What this software is used for is very specific, but also very useful in that it is a common medical problem. It is used only to help diagnose diabetic retinopathy (ie eye damage caused by diabetes).
This is AI vision software used to analyze a photograph of someone's retina to detect damage. In essence it is much more like the recently published programs that analyze chest X-rays to detect pneumonia. Where this is useful is that it can probably cut out a lot of human work in diagnosing retinopathy; however, it is an incremental step. Even when I was a resident in a primary care clinic years ago the process was somewhat automated like this, with our medical assistants taking a photograph with a special machine, and then this photograph would be digitally sent to a specialist (I presume an ophthalmologist but I could be wrong, maybe optometrists can be licensed to do this) for interpretation.
What this isn't, is diagnosing a patient based on taking a history and inputting examination findings and labs, etc... We are still quite a bit of a way from that but I'm sure people are working hard on that as well.
EDIT: In my opinion, where AI could really make a huge difference for my work as a hospitalist (a doctor who admits and rounds on hospital patients) is in voice recognition software, with eventual language processing to help me write notes faster. First, give me a program like Dragon Dictate but which I can use in the patient's room (obviously one would have to figure out the HIPAA compliance issues) that transcribes my voice and the patient's or family member's into a readable text file that I can review when I write their note.
Next step would be that same program giving me its attempt to summarize our interview into a reasonable note, which I can edit for accuracy. This would in effect be an AI scribe. A scribe, for those who don't know the medical jargon, is a hired person whose only job is to listen to a doctor interview a patient and help write medical notes; they are usually young pre-medical students. It's a relatively new position that was created as the burden of documenting in electronic medical records limited the amount of time providers could spend with patients. Very common in emergency medicine, where high output is needed, and sometimes also in primary care.
Next step, is you have a company with all this protected medical transcription data and eventual medical outcomes, and you use ML to find algorithms to try and tease out what variables ended up being the most useful for accurate diagnosis. Before that you could have the program prompt the doctor for questions that it thinks would be helpful, etc... Again, huge medico-legal barriers to this but there is a roadmap to becoming a billionaire in my opinion.
Eric Schmidt (former Google chairman) proposed building an AI scribe during his HIMSS 2018 conference keynote address. He claimed it should be possible within 10 years. I hope he's right, but I suspect he underestimates the difficulty of building reliable clinical systems.
> What this isn't, is diagnosing a patient based on taking a history and inputting examination findings and labs, etc... We are still quite a bit of a way from that but I'm sure people are working hard on that as well.
This is one of the classic AI application domains -- for example, this is why Stanford's Knowledge Systems Lab was near the medical school and why the mainframe Zork was developed on was owned by MIT's Medical Decision Making group.
PHI is a real barrier to your wish list; it's harder to get right than most people think, and it is very, very hard to get right in a distributed environment. And all the current advances in these general areas (voice recognition, transcription, semantic reasoning, etc.) are leaning heavily on distributed processing.
It's definitely a well identified market and there are people working on it, but I haven't seen much progress (not that I'm looking terribly closely at the moment).
From the point of the view of the last step, the ML one, I suspect the consenting is as much an issue as the PHI. Getting any real traction on this will likely require massive data sets and significant clinician time to aid training.
The barrier is high enough that I heard that Microsoft is working on this with UW by using pretend patient encounters in an attempt to bypass PHI and get a database.