Large language models are poor medical coders (nejm.org)
16 points by the_decider 5 months ago | 7 comments



They're poor at recalling the exact descriptions of the ~150,000 ICD codes, which is why these other approaches [1, 2] supply the known information (the codes) to the model in some form and let it handle the task of _assigning_ them to discharge notes, which is the hard part of the task. (A rough sketch of the idea follows the references below.)

(Disclaimer: I am an author of one of these papers)

1. https://arxiv.org/abs/2311.13735

2. https://arxiv.org/abs/2310.06552
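
To make the distinction concrete, here's a minimal sketch of the general "provide the codes, let the model assign" pattern (not the exact method from either paper). The tiny code table and the `complete` helper are hypothetical stand-ins:

    # Hypothetical miniature code table; a real system loads all ~150k codes.
    CANDIDATE_CODES = {
        "E11.9": "Type 2 diabetes mellitus without complications",
        "I10": "Essential (primary) hypertension",
        "J18.9": "Pneumonia, unspecified organism",
    }

    def build_prompt(note: str) -> str:
        # Put the exact descriptions in the prompt so the model matches
        # against them instead of recalling them from memory.
        menu = "\n".join(f"{c}: {d}" for c, d in CANDIDATE_CODES.items())
        return (
            "Assign the applicable ICD-10 codes to this discharge note.\n"
            f"Choose only from this list:\n{menu}\n\n"
            f"Note:\n{note}\n\nAnswer with codes only, comma-separated."
        )

    def assign_codes(note: str, complete) -> list[str]:
        # `complete` is any text-in/text-out LLM call (a stand-in here).
        raw = complete(build_prompt(note))
        # Drop anything not in the table, guarding against invented codes.
        return [c.strip() for c in raw.split(",") if c.strip() in CANDIDATE_CODES]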


"XCode is a poor tool to create apps that do X"

"We tried a particular approach using XCode to create an app that does X but were unsuccessful. Therefore, nobody is able to use Xcode to create apps that do X"

Even with 150,000 ICD codes, an agent that leverages LLMs along the way may be able to accomplish this. LLMs may well be usable as _part of a process_ that successfully accomplishes the task.


I'd be curious whether you tried AI Studio and Gemini 1.5: you could context-stuff a few thousand PDF pages of ICD codes and ICD application guidance before beginning testing. I imagine performance would improve dramatically with a million tokens of ICD material in play.
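
For anyone who wants to try it, a rough sketch of that setup, assuming the google-generativeai Python client; the file names and prompt are placeholders, not anything from the study:

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    # Gemini 1.5 Pro's ~1M-token window can hold a large slice of ICD-10.
    model = genai.GenerativeModel("gemini-1.5-pro")

    # Hypothetical local files: the ICD-10 tabular list and one note.
    with open("icd10_tabular_list.txt") as f:
        icd_reference = f.read()
    with open("discharge_note.txt") as f:
        note = f.read()

    response = model.generate_content(
        f"Reference material:\n{icd_reference}\n\n"
        "Using only codes from the reference above, assign ICD-10 codes "
        f"to this discharge note:\n{note}"
    )
    print(response.text)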


>> Our study reaffirms the limitations of LLM tokenization

Because they used data that needs to be tokenized differently and didn't really tune the models for that data. That's not really a limitation of LLM tokenization per se.

>> We did not evaluate strategies known to improve LLM performance, including ... retrieval augmented generation

Which is a shame, because this is exactly the kind of use case RAG is supposed to be good for, and they largely observed the problems it's supposed to help with.
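
To illustrate what RAG would add, here's a bare-bones sketch of the retrieval step: narrow the 150,000 code descriptions down to a handful before the LLM ever sees them. The four-code table is a hypothetical toy; a real index would cover all of ICD-10:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    CODES = {
        "E11.9": "Type 2 diabetes mellitus without complications",
        "I10": "Essential (primary) hypertension",
        "J18.9": "Pneumonia, unspecified organism",
        "N39.0": "Urinary tract infection, site not specified",
    }

    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(list(CODES.values()))

    def retrieve(note: str, k: int = 2) -> list[str]:
        # Rank all code descriptions by similarity to the note and keep
        # the top k; these then go into the prompt as candidates.
        sims = cosine_similarity(vectorizer.transform([note]), matrix)[0]
        ranked = sorted(zip(CODES, sims), key=lambda p: p[1], reverse=True)
        return [code for code, _ in ranked[:k]]

    print(retrieve("Admitted with community-acquired pneumonia, now resolved"))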

Looking at the authors, they all seem to be subject-matter experts in medicine and digital medicine, but their conclusion is the one that favors medical professionals, and they really don't seem to have tried very hard to get good deep-learning results.

I've had nightmares every time I've seen a doctor in the US, frequently because of things not being coded correctly. So honestly I'd just love to see a rigorous study of how often the human staff is messing it up too.


This seems almost deceptive. Given how corrupt US medicine is, it makes me think any sort of "automation bad" message is just the various special-interest groups holding onto power a little while longer.

You know the physician cartel is going to find some nonsensical reasoning to be anti-LLM even when LLMs diagnose correctly more often.

When every other industry is finding uses, "medicine can't, it's too hard" is going to be a standard line from the industry that still uses faxes.


Yeah, but it's going to be easy to tell they're bullshitting. They'll always be two steps behind the state of the art, so they'll make straw-man arguments like judging LLMs without using RAG to provide a hierarchical index of medical codes. Anyone current on AI papers will immediately spot the issues, while investors and the public will not.
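
A hedged sketch of that hierarchical-index idea: instead of retrieving over all codes flat, descend ICD-10's own chapter/block/code hierarchy one level at a time, with the model choosing among a short list at each step. The tree and the word-overlap chooser below are hypothetical toys standing in for a real index and a real LLM call:

    # Tiny hypothetical excerpt of the ICD-10 hierarchy.
    ICD_TREE = {
        "Chapter IX: Diseases of the circulatory system": {
            "I10-I16: Hypertensive diseases": {
                "I10": "Essential (primary) hypertension",
            },
        },
        "Chapter X: Diseases of the respiratory system": {
            "J09-J18: Influenza and pneumonia": {
                "J18.9": "Pneumonia, unspecified organism",
            },
        },
    }

    def drill_down(note, tree, choose):
        # At each level, `choose` (an LLM call in a real system) picks
        # the best branch from a handful of options.
        node = tree
        while True:
            pick = choose(note, list(node))
            if isinstance(node[pick], dict):
                node = node[pick]
            else:
                return pick, node[pick]  # (code, description)

    def overlap_chooser(note, options):
        # Crude stand-in for the model: most shared words wins.
        words = set(note.lower().split())
        return max(options, key=lambda o: len(words & set(o.lower().split())))

    print(drill_down("hypertension, circulatory system", ICD_TREE, overlap_chooser))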

This leaves a huge opportunity for open-source systems that provide better diagnostics for free, but it will require us to hustle. Then the open-source users among us, at least, can pioneer free and open AI-augmented health care. That could become a popular option once there's strong evidence that our virtual doctors are simply more competent and understanding.

I am excited about the idea that one day soon I'll have a doctor that listens to me, understands the current science and evidence about my conditions, and does not sideline me because I'm autistic, anxious, and obsessive. And we might get to help build it and make it free for everyone!


The limitations section mentions that the study omitted RAG and focused on base performance as a key bottleneck. But given the usefulness of RAG and the weakness of base LLMs for this kind of task, base recall performance is not necessarily relevant, nor the key bottleneck preventing accurate coding.

Adding even some slapdash RAG attempts would have produced a more realistic, and still disappointing, result, since assisted LLMs are still only around 75% accurate (see the RAG paper another author shares in their comment). I suppose the space of possible RAG solutions makes it hard to represent fairly, so it is reasonably left to further research.

I appreciate the test of base performance, with a STRONG proviso that a relevant conclusion requires more work along the lines of RAG and other tools. I wish this were communicated more clearly in the intro and abstract, and I wonder whether the authors had unstated reasons for not being more explicit about it.

The study does provide interesting value: its benchmark is open source and extensive, should be easy to adapt and replicate in other systems, and could become a target benchmark for tool- and retrieval-enhanced medical-coding LLMs.
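
If someone does adapt it, the scoring side is simple enough to sketch: set-based precision/recall over each note's assigned codes. A minimal version, with the function name and metric choice being my assumptions rather than the study's actual schema:

    def score(predicted: set, gold: set) -> dict:
        # Per-note precision/recall/F1 over assigned ICD codes.
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return {"precision": precision, "recall": recall, "f1": f1}

    print(score({"E11.9", "I10"}, {"I10", "J18.9"}))
    # -> {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}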



