AI Detects Heart Failure from One Heartbeat: Study (forbes.com/sites/nicholasfearn)
126 points by rusht on Oct 18, 2019 | 67 comments



Wasn't a similar claim made about an AI detecting skin cancer from moles? Once the AI was deployed in the real world it failed miserably. I think it produced a ton of false positives because it was trained on images where the cancerous moles all had rulers in the frame and the benign ones didn't, so it just picked up on the ruler as a cancer indicator.


Would you happen to have a source for that story? My workplace has really swallowed the AI Kool-Aid lately, so I would like to have some cautionary counterexamples to demonstrate potential pitfalls of the technology.

It's got a lot of interesting applications for our field which I am excited about, but there seems to be a tendency among non-experts to consider it a magic bullet that can solve any sort of problem. In particular, I am concerned about applications where conventional approaches have already converged on an optimal solution that's used operationally, but somebody wants to throw AI at it because they thought it might be cool without first understanding the implications.


Citation: https://jamanetwork.com/journals/jamadermatology/article-abs...

From the abstract:

This study’s findings suggest that skin markings significantly interfered with the CNN’s correct diagnosis of nevi by increasing the melanoma probability scores and consequently the false-positive rate. A predominance of skin markings in melanoma training images may have induced the CNN’s association of markings with a melanoma diagnosis. Accordingly, these findings suggest that skin markings should be avoided in dermoscopic images intended for analysis by a CNN.


Stolen from another HN comment about using AI on medical records.

"Finasteride is a compound that is used in two drugs. Proscar is used for prostate enlargement. It is old, out-of-patent, and has cheap generics. Propecia is used for hair loss. It is a newer, and (at the time) very expensive. The only difference is that Propecia is a lower-dose formulation.

What people did was to ask their doctors to prescribe generic Proscar, and then break the pills up to take for hair loss. Doctors would then justify the prescription by "diagnosing" enlarged prostate. This would enter the patient's health records. If you apply deep learning without being aware of this "trick", you would learn that a lot of young men have enlarged prostates, and that Proscar is an effective, well-tolerated treatment for it.

Health records are often political-economic documents rather than medical."


Check this link out http://132.206.230.229/e706/gaming.examples.in.AI.html

There was a hn discussion about it a while back: https://news.ycombinator.com/item?id=18415031


What a great resource! The first example is perfect:

> Aircraft landing

> Evolved algorithm for landing aircraft exploited overflow errors in the physics simulator by creating large forces that were estimated to be zero, resulting in a perfect score


Very amusing. Some of my favorites:

> In an artificial life simulation where survival required energy but giving birth had no energy cost, one species evolved a sedentary lifestyle that consisted mostly of mating in order to produce new children which could be eaten (or used as mates to produce more edible children).

> (Tetris) Agent pauses the game indefinitely to avoid losing

> Agent kills itself at the end of level 1 to avoid losing in level 2

> Robot hand pretending to grasp an object by moving between the camera and the object


Yeah. Long term it still looks like that idea might work out, but that was a funny story.

This paper offers like 1/1000th the evidence of that one.


> Once the AI was deployed in the real world it failed miserably.

They didn't show that.


As a doctor as opposed to an AI researcher, so many of the choices this study makes are baffling to me.

First of all, why just one heartbeat? You never capture just one heartbeat on an ECG anyway, and "Is the next heartbeat identical to the first one?" is such an important source of information that it seems completely irrational to exclude it. At least pick TWO heartbeats. If you're gonna pick one random heartbeat, how do you know you didn't pick an extrasystole by accident? (Extrasystoles look different, and often less healthy, than "normal" heartbeats, as they originate from different regions of the heart.)

Secondly, why heart failure and not a heart attack? One definition of heart failure is "the heart is unable to pump sufficiently to maintain blood flow to meet the body's needs," which can be caused by all sorts of factors, many of them external to the actual function of the heart - do we even know for sure that there are ANY ECG changes definitely tied to heart failure? Why not instead try to detect heart attacks, which cause well-defined and well-researched ECG changes?

(I realize AIs that claim to be able to detect heart attacks already exist. None of the ones I've personally worked with have ever been usable. The false positive rate is ridiculously high. I suppose maybe some research hospital somewhere has a working one?)


To add to this, looking at figure 4, why is their "average" heartbeat so messed up? That's not what a normal average heartbeat looks like. P is too flat, Q is too big, R is blunted, and there's an extra wave between S and T that's not supposed to be there at all. If their "healthy patient" ECGs were bad enough to produce this mess on average, it's no surprise their AI had no trouble telling the data sets apart.

(For comparison, the "CHF beat" looks a lot more like a healthy heartbeat.)


> First of all, why just one heartbeat?

I think it's a sort of academic machismo. "Look what we can do - isn't it amazing?"

I saw the same thing in robotics recently. An academic came to give a talk on localisation using computer vision: they cross-referenced shop signs that were seen by a robotic camera with the shop's location on a map to get a rough estimate of where the robot was. My first question was "what is the incremental benefit of this approach when combined with GPS?". It turned out that the researchers just hadn't used GPS at all - almost like they considered it to be "cheating".

I feel like many academic disciplines have unwritten 'rules' that you need to follow if you want to be included in the conversation. Not all of those rules are sensible.


I'm going to sound like a skeptical jerk here, but 490,000 heartbeats is how many patients? From what I recall these public ECG datasets are like 20 patients who underwent longitudinal ECGs. 500k heart beats is like 5 person-days of ECG recordings.

Ninja Edit: N=~30 patients. For something like ECGs, which are readily available, they really should have tried to get more patients. A single clinic anywhere does more than 30 EKGs per day. Suggesting this is clinically applicable is ridiculous. It's way too easy to overfit. Chopping up a time series from one patient into 1000 pieces doesn't give you 1000x the patients.

I even think this approach probably will work. Very reasonable given recent work from Geisinger and Mayo. But why are ML people doing press releases about such underwhelming studies?
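
To make the chopping-up point concrete, here is a toy sketch (Python with scikit-learn; the arrays are synthetic stand-ins, not the paper's data) of how a beat-level split leaks patient identity into the test set while a patient-level split does not:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupShuffleSplit, train_test_split

    # Synthetic stand-in: 30 "patients", 300 beats each. Every patient gets a
    # random beat "fingerprint", and the label depends only on the patient.
    rng = np.random.default_rng(0)
    patient_id = np.repeat(np.arange(30), 300)
    fingerprint = rng.normal(size=(30, 20))
    X = fingerprint[patient_id] + 0.5 * rng.normal(size=(len(patient_id), 20))
    y = patient_id % 2  # which patients are "sick" is arbitrary here

    clf = RandomForestClassifier(n_estimators=100, random_state=0)

    # Beat-level split: beats from the same patient land in train AND test,
    # so memorizing the patient is enough to ace the test set.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    print("beat-level split:   ", clf.fit(X_tr, y_tr).score(X_te, y_te))

    # Patient-level split: every beat from a given patient is in exactly one
    # set, and the score collapses to chance because there is no real signal.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
    tr, te = next(gss.split(X, y, groups=patient_id))
    print("patient-level split:", clf.fit(X[tr], y[tr]).score(X[te], y[te]))

Same features, same model; the only thing that changes is what the split respects.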


Yes, Table 5 shows that N is 18 without CHF and 15 with CHF. These come from separate data sets that have EKG data sampled at different frequencies.

Basically, they took 18 electrocardiographic tracings (sampled at 128 Hz) from participants without CHF, of whom 13 come from women. They compared them to 15 electrocardiographic tracings (sampled at 250 Hz) from participants with CHF, of whom 4 come from women.

Hard to even know where to begin with this one.


Statistically insignificant sample sizes showing desirable outcomes are the bread and butter of getting more funding in medical research


A lot of machine learning people don't really understand study design or power or things like that. It's gotten a little better over the past decade or so, but this is an area where the field has a lot of room to improve.


I disagree. Machine learning education almost always involves a lot of focus on design of experiments, causal inference, A/B testing and related topics.

I could agree with your claim if you meant bootcamp programs or data science sorts of coursework, but machine learning is generally grounded in both measure theoretic probability theory and a robust understanding of applied statistics before moving on. After that will be the basics of pattern classification, clustering, regression and dimensionality reduction. Last of all will be very domain-specific tools for NLP, computer vision, audio processing involving e.g. deep neural networks.


"A lot" of ML people perhaps, but also the overwhelming majority of clinical scientists and the near totality of doctors.


There are plenty of academic and academic center trained physicians that understand study design and are competent in research. They aren’t typically primary care/general practitioners so you just don’t encounter them as much. And yes they are the minority. But it’s not the totality.

Clinical research that isn’t making ridiculous claims tends to get much less press.

Furthermore, of all places to lap that crap up... this hacker news site is frankly one of the worst.

I mean, look at this submission. Yes, it's true people are pillorying it here (including some doctors), but I don't recall much interesting, well-designed medical research being discussed here (though arguably maybe not the place for it).


And also different sources for positive and negative samples.


So they built a model that can tell the difference between the two different sources.


Yeah, and recorded at different frequencies. Could easily be a downsampling artifact.
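
If one wanted to at least rule out the sampling-rate confound before modelling, a minimal sketch (Python with SciPy; the traces below are placeholders, not the study's recordings) would resample everything to a common rate first. It obviously doesn't fix differences in hardware or filtering:

    import numpy as np
    from scipy.signal import resample_poly

    FS_TARGET = 128  # Hz; the lower of the two native rates

    def to_common_rate(sig, fs_native, fs_target=FS_TARGET):
        # Resample a 1-D ECG trace from fs_native to fs_target using
        # polyphase filtering (rational factor fs_target/fs_native).
        g = np.gcd(int(fs_native), int(fs_target))
        return resample_poly(sig, up=fs_target // g, down=fs_native // g)

    # Placeholder traces: one minute of "CHF" ECG at 250 Hz and one minute
    # of "control" ECG at 128 Hz.
    chf_trace = np.random.randn(250 * 60)
    control_trace = np.random.randn(128 * 60)

    chf_128 = to_common_rate(chf_trace, 250)        # now 7680 samples at 128 Hz
    control_128 = to_common_rate(control_trace, 128)  # rate unchanged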


I mean, the issue would be in the structure of the cross-validation approach. Say, set training = 29 patients, test = 1, build the models, etc.: how well did you do on the held-out one? Rinse, wash hands, repeat 30x. This is your cross-validation error rate.

It's not that difficult.
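
In scikit-learn terms that protocol is nearly a one-liner. A sketch with placeholder arrays, not the paper's data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

    # Placeholders: per-beat features, per-beat labels, and the ID of the
    # patient each beat came from (30 patients, 100 beats each).
    rng = np.random.default_rng(0)
    patient_id = np.repeat(np.arange(30), 100)
    X = rng.normal(size=(len(patient_id), 16))
    y = patient_id % 2

    # One fold per patient: train on 29 patients, test on the held-out one,
    # repeat 30 times. The mean held-out score is the honest error estimate.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             groups=patient_id, cv=LeaveOneGroupOut())
    print(f"leave-one-patient-out accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")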


That would not solve this


Didn't they mention that they sampled only 5 minutes of heartbeats per patient? That would be n=1633, assuming a heart rate of 60.


> I'm going to sound like a skeptical jerk here

Well you're on the right site for it!

At least you didn't claim that you could come up with something better by tinkering on a rainy Sunday afternoon.


Probably could though


there it is


So the first clarification is that heart failure != heart attack. Heart failure is a chronic condition where the heart is unable to pump hard enough to keep blood flowing through the body. It typically results in blood pooling in the legs, shortness of breath, etc.

The study avoids the obvious pitfall, which is to put different slices of one patient's data into both training and test. The press also reports the training accuracy (100%) when the test accuracy/sensitivity/precision metrics are all at around 98%.

Another encouraging sign is that when you dig into the 2% error rate, a majority of those errors turned out to be mislabeled data.

The study also acknowledges the following:

"Our study must also be seen in light of its limitations... First the CHF subjects used in this study suffer from severe CHF only...could yield less accurate results for milder CHF."

I think this is a good proof of concept but that the severe CHF and tiny sample size (33 patients) means that we're a long ways away from clinical usage.


The study looks at 33 patients total, and the cases and controls come from entirely different data sets, with data coming from different devices that recorded signal at different frequencies.

There is nothing to see here.


Apologies - I didn't catch that the HF / healthy patients are from two different datasets. Agreed that this essentially invalidates the result.

The missing experiment is to have a third dataset from yet another machine, with both positive/negative examples, and use it as the test dataset. Then transferability questions are at least somewhat addressed.


No apologies needed, we're all in agreement on the broad strokes! Your proposal is good: I would be quite surprised if it generalized, but that is definitely the way to find out.


You realize the fact that the validation dataset is independent of the training dataset is a YUGE red flag right?


Where do they say that the validation dataset is independent of the training dataset?

The case dataset is independent of the control dataset.


Well, the issues are fundamental to the calculation of their error statistics. These models and their error rates are, well, crap. If any of the people in my group came back to me with this as an error assessment, they'd be re-doing their work.


Pff, I can detect heart failure with no heartbeats at all.


Care to explain?


It sort of spoils the joke but if there are heartbeats it hasn't failed yet. So no heartbeats = failure.


Oh I really didn't get it. Thanks for explaining. I still don't find it funny.


That's the joke.


[::4kdb wooshing sound::]



The Reddit thread is worth a read. There's a healthy dose of scepticism about the paper there.


Claiming 100% accuracy with a single heartbeat is just hard to believe


And it's also just clearly wrong on its face.


This is interesting, but more because it indicates that there's adequate data in a single heartbeat to do such diagnosis. In practical terms it's probably not nearly so relevant, because it sounds like they were working with the raw data, not a tracing. By the time you have a patient hooked up to the proper equipment to do this diagnosis, you're going to be getting adequate data anyway.

The main impact might be that, if this holds up, people could be tested with a short hook-up in an office instead of 24-hour monitoring where they have to bring back a Holter device the next day. Of course, that 24-hour dataset may have independent value of its own for further diagnostics beyond just whether the patient has CHF.


The study is not worth paying attention to.

The datasets for positive cases and negative cases come from different databases. n=30 patients, on top of it.

All this does is recognize the patient/ECG technician who recorded the data. It's basically certain it doesn't generalize.


IMO, the important part is Section 3.3 of the [paper](https://www.sciencedirect.com/science/article/pii/S174680941...), particularly the image at [https://ars.els-cdn.com/content/image/1-s2.0-S17468094193017.... To my eye, the difference in shape of the orange and green signals could also be found through more traditional signal processing/statistical means than machine learning.

In a past job I did a combination of manual and machine-learning-based analysis of cardiac signals. We didn't have ECG, but did have PPG (blood flow) and PCG (sound) signals, and a pretty large study group. I recall there being one study participant whose signals were very clearly indicative of heart failure, enough that we raised the issue with our medical advisor about whether the subject should be deanonymized and contacted. In the paper they state that "the CHF subjects used in this study suffer from severe CHF only"; my suspicion is that a simpler, "hand-rolled" model based on features of the ECG could compete very well with this CNN approach for finding the same level of pathology in the ECG signal, without the "black box" of a CNN casting doubt on the technique.
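
For what it's worth, a hand-rolled baseline along those lines can be sketched in a few lines (Python/scikit-learn; the beats, labels, and morphology features below are all illustrative placeholders, not anything from the paper):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def beat_features(beat):
        # Crude morphology features for one fixed-length, baseline-corrected
        # beat: R amplitude, R-S range, a rough post-QRS segment level, and
        # overall variability. Purely illustrative choices.
        r_idx = int(np.argmax(beat))
        r_amp = beat[r_idx]
        s_amp = beat[r_idx:r_idx + len(beat) // 8].min()
        seg_level = beat[r_idx + len(beat) // 8:r_idx + len(beat) // 4].mean()
        return [r_amp, r_amp - s_amp, seg_level, beat.std()]

    # Placeholder data: 200 beats of 128 samples with a fake "R wave" bump.
    rng = np.random.default_rng(0)
    beats = rng.normal(scale=0.1, size=(200, 128))
    beats[:, 60:68] += np.hanning(8)
    labels = rng.integers(0, 2, size=200)

    X = np.array([beat_features(b) for b in beats])
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X, labels)

Features like these are at least auditable by a cardiologist, which is rather the point.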


Congestive heart failure can also be detected fairly reliably from a sudden increase in weight, because it causes fluid retention. There are several programs underway to give Internet-connected scales to high-risk patients, which report their weight every day.
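
The alerting logic on top of such a scale can be very simple. A sketch, with made-up thresholds that are not clinical guidance:

    from datetime import date

    DAILY_GAIN_KG = 1.0   # illustrative: flag ~1 kg gained since yesterday
    WEEKLY_GAIN_KG = 2.0  # illustrative: or ~2 kg gained over the past week

    def weight_alert(history):
        # history: list of (date, weight_kg), one entry per day, oldest first.
        # Returns True if the recent trend warrants a follow-up call.
        if len(history) < 2:
            return False
        weights = [w for _, w in history]
        day_jump = weights[-1] - weights[-2]
        week_jump = weights[-1] - min(weights[-7:])
        return day_jump >= DAILY_GAIN_KG or week_jump >= WEEKLY_GAIN_KG

    # Example: fluid retention creeping up over a week.
    log = list(zip([date(2019, 10, d) for d in range(10, 17)],
                   [82.0, 82.1, 82.3, 82.8, 83.2, 83.7, 84.3]))
    print(weight_alert(log))  # True: ~2.3 kg gained over seven days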


It could also be kidney failure. My weight ballooned by 30 lbs when I was progressing towards it. Even the doctor was saying the usual "you need to exercise and eat healthy" until he got my blood test results.


Heart failure patient here. This is kinda cool but tempered a bit by the fact that I've seen multiple cardiologists make a diagnosis by just glancing at a 12 lead ECG sheet. There are some pretty recognizable hallmarks.


Perhaps the title/premise might best be summarized as: based on everything we know, we can detect heart failure from monitoring one's body for one heartbeat.

Along with doing a lot of good and making a lot of early catches, I suspect that relying on AI to do medical analysis is going to bring into sharp relief just how much medical science DOESN'T know about the human body and its mysteries. I think we're a long, long way from handing medical science over to AI, and the real fun of AI-guided exploration is about to begin.


Clickbait title aside, I find that the ethical issues around AI raised by Musk et al. shouldn't be about AI taking over the planet, but rather about overfitted or otherwise unrealistic models being pushed for PR, irresponsibly playing with people's health and hopes.


One of our simplest "screening" questions for DS roles at my company is: "Your model is 100% accurate. How do you feel?" If the answer is anything other than deep skepticism (data leakage, trivial dataset, etc.), it's a big red flag.
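
As a rough illustration of the kind of answer that passes (Python/scikit-learn, everything here hypothetical): before celebrating, screen for columns that reproduce the label on their own and for rows duplicated across the train/test boundary.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def leakage_checks(X_train, y_train, X_test):
        # Usage: leakage_checks(X_train, y_train, X_test) on NumPy arrays.
        # 1. Does any single column, by itself, nearly reproduce the label?
        #    A depth-1 tree per column is a cheap way to ask.
        for j in range(X_train.shape[1]):
            acc = cross_val_score(DecisionTreeClassifier(max_depth=1),
                                  X_train[:, [j]], y_train, cv=5).mean()
            if acc > 0.95:
                print(f"column {j} alone scores {acc:.2f} -- possible target leak")
        # 2. Do identical rows appear on both sides of the split
        #    (e.g. duplicated records)?
        train_rows = {row.tobytes() for row in X_train}
        dupes = sum(row.tobytes() in train_rows for row in X_test)
        print(f"{dupes} test rows are exact duplicates of training rows")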


How long until society becomes Gattaca? Sorry citizen, our "AI" has detected genetic anomalies; you will be a disposable factory worker. The rich would surely pay for their children to be genetically altered.


What exactly guarantees that reality? Why can't these tools be used for good moving forward? e.g. detecting heart problems. Seems a little arbitrary to spin technological advancements as progress towards some inevitable dystopian AI-driven future.


I'd say it's fairly certain given a capitalist society and human nature. Remember pre-existing conditions and insurance denials pre-Obamacare? Yeah, insurance companies would love to get their hands on your genetic data and tailor rates to your likelihood of future healthcare cost. At the same time, rich parents pretty much always pursue any advantage they can for their children - that's why private schools and homes in good school districts are so expensive. Add in the fact that we already have massive inequality and dropping social mobility in the US, and a Gattaca-like future starts looking pretty damn likely.


Last year we started cloud-recording obstetric ultrasound videos. We add more than 17,000 ultrasound exams to our platform each month. It's probably the largest dataset of obstetric ultrasound videos in the world (~300,000 exams). Reading news like this makes me think about how we could explore our dataset using ML/AI to help produce better diagnoses. I have no idea how (we're not an AI company).

If someone here wants to start a project with AI on top of ultrasounds, I'm all in.

let me know at hn at angra.ltd and I can give more details


I'm not sure that data set will mean anything without human-drawn conclusions about the patient (diagnoses, abnormalities, etc.)


Obstetric ultrasound is very standardized and easy to evaluate with Hadlock (https://www.sciencedirect.com/science/article/abs/pii/000293...)

We, however, have access to the report too.


You absolutely can. Hitting you up now.


I guess adding "Congestive" to the title would've ruined the click bait.

Also how can the detection of a progressive disease be 100% accurate? I guess details ruin the click bait too.


Don't see any mention of how many false-positives they had in the article so... Yeah we'll see how effective this actually is.


Guy detects bullshit from one headline.


Or zero heartbeats


If it doesn't beat, the heart has failed.


Dear Elizabeth Holmes,

I found a great startup opportunity for you.

- Recruiter



