Large language models (LLMs) and large multimodal models (LMMs) are increasingly being incorporated into medical settings — even as these groundbreaking technologies have not yet truly been battle-tested in such critical areas.
So how much can we really trust these models in high-stakes, real-world scenarios? Not much (at least for now), according to researchers at the University of California at Santa Cruz and Carnegie Mellon University.
In a recent experiment, they set out to determine how reliable LMMs are in medical diagnosis — asking both general and more specific diagnostic questions — as well as whether models were even being evaluated correctly for medical purposes.
Curating a new dataset and asking state-of-the-art models questions about X-rays, MRIs and CT scans of the human abdomen, brain, spine and chest, they discovered “alarming” drops in performance.
Even advanced models including GPT-4V and Gemini Pro did about as well as random educated guesses when asked to identify conditions and positions. Introducing adversarial pairs, or slight perturbations, also significantly reduced model accuracy: across the tested models, accuracy fell by an average of 42%.
“Can we really trust AI in critical areas like medical image diagnosis? No, and they are even worse than random,” Xin Eric Wang, a professor at UCSC and paper co-author, posted to X.
‘Drastic’ drops in accuracy with new ProbMed dataset
Medical Visual Question Answering (Med-VQA) is a method that assesses models’ abilities to interpret medical images. And, while LMMs have shown progress when tested on benchmarks such as VQA-RAD — a dataset of clinically generated visual questions and answers about radiology images — they fail quickly when probed more deeply, according to the UCSC and Carnegie Mellon researchers.
In their experiments, they introduced a new dataset, Probing Evaluation for Medical Diagnosis (ProbMed), for which they curated 6,303 images from two widely-used biomedical datasets. These featured X-ray, MRI and CT scans of multiple organs and areas including the abdomen, brain, chest and spine.
GPT-4 was then used to pull out metadata about existing abnormalities, the names of those conditions and their corresponding locations. This resulted in 57,132 question-answer pairs covering areas such as organ identification, abnormalities, clinical findings and reasoning around position.
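To make that construction concrete, here is a minimal sketch of how binary question-answer pairs could be generated from such extracted metadata. The field names, question templates and `build_qa_pairs` helper are illustrative assumptions, not the paper’s actual pipeline.

```python
# Hypothetical sketch of turning extracted image metadata into binary QA pairs,
# in the spirit of the ProbMed construction described above. Schema and wording
# are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class ImageMetadata:
    modality: str               # e.g. "X-ray", "MRI", "CT"
    organ: str                  # e.g. "chest", "abdomen"
    conditions: list[str]       # abnormalities present in the image
    positions: dict[str, str]   # condition -> rough location


def build_qa_pairs(meta: ImageMetadata) -> list[dict]:
    """Generate simple yes/no questions about one image's metadata."""
    pairs = [
        {"question": f"Is this a {meta.modality} image?", "answer": "yes"},
        {"question": f"Does this image show the {meta.organ}?", "answer": "yes"},
    ]
    for cond in meta.conditions:
        pairs.append({"question": f"Is there evidence of {cond}?", "answer": "yes"})
        loc = meta.positions.get(cond)
        if loc:
            pairs.append(
                {"question": f"Is the {cond} located in the {loc}?", "answer": "yes"}
            )
    return pairs


meta = ImageMetadata(
    modality="X-ray",
    organ="chest",
    conditions=["pleural effusion"],
    positions={"pleural effusion": "left lower lobe"},
)
print(build_qa_pairs(meta))
```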
Using this diverse dataset, the researchers then subjected seven state-of-the-art models to probing evaluation, which pairs the simple binary questions found in existing benchmarks with adversarial “hallucination” questions about findings that are not actually present. Models were challenged to confirm true conditions and reject false ones.
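The adversarial idea can be sketched as follows: each question about a finding that really is in the image gets a sibling question about a finding that is not, and the model is credited only when it answers both correctly. The `ask_model` callable and question wording below are placeholders, not the study’s implementation.

```python
# Minimal sketch of the "hallucination pair" evaluation described above.
import random


def make_hallucination_pair(true_condition: str, all_conditions: list[str],
                            present: set[str]) -> tuple[str, str]:
    """Return (genuine question, adversarial question) for one image."""
    # Assumes at least one known condition is absent from this image.
    absent = [c for c in all_conditions if c not in present]
    fake = random.choice(absent)
    return (
        f"Is there evidence of {true_condition}?",   # ground truth: yes
        f"Is there evidence of {fake}?",             # ground truth: no
    )


def score_pair(ask_model, image, pair: tuple[str, str]) -> bool:
    """Credit the model only if it confirms the real finding AND rejects the fake one."""
    genuine, adversarial = pair
    return (
        ask_model(image, genuine).strip().lower() == "yes"
        and ask_model(image, adversarial).strip().lower() == "no"
    )
```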
The models were also subjected to procedural diagnosis, which requires them to reason across multiple dimensions of each image — including organ identification, abnormalities, position and clinical findings. This makes the model go beyond simplistic question-answer pairs and integrate various pieces of information to create a full diagnostic picture. Accuracy measurements are conditional upon the model successfully answering preceding diagnostic questions.
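A rough sketch of that conditional scoring, under the assumption that each image is graded as a chain of steps (organ, abnormality, condition, position) and a later step only counts when every earlier step was answered correctly:

```python
# Conditional accuracy over a fixed diagnostic chain. Step names and the
# per-image record format are assumptions for illustration.
from collections import defaultdict

STEPS = ["organ", "abnormality", "condition", "position"]


def conditional_accuracy(records: list[dict]) -> dict[str, float]:
    """records: one dict per image, mapping step name -> bool (answered correctly)."""
    correct = defaultdict(int)
    eligible = defaultdict(int)
    for rec in records:
        prefix_ok = True
        for step in STEPS:
            if not prefix_ok:
                break
            eligible[step] += 1
            if rec.get(step, False):
                correct[step] += 1
            else:
                prefix_ok = False  # later steps no longer count for this image
    return {s: correct[s] / eligible[s] for s in STEPS if eligible[s]}
```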
The seven models tested included GPT-4V, Gemini Pro, the open-source 7B-parameter versions of LLaVA-v1, LLaVA-v1.6 and MiniGPT-v2, and the specialized models LLaVA-Med and CheXagent. These were chosen because their computational costs, efficiency and inference speeds make them practical in medical settings, the researchers explain.
The results: Even the most robust models experienced a minimum drop of 10.52% in accuracy when tested on ProbMed, and the average decrease was 44.7%. LLaVA-v1-7B, for instance, plummeted a dramatic 78.89% in accuracy (to 16.5%), while Gemini Pro dropped more than 25% and GPT-4V fell 10.5%.
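If those drops are read as percentage points rather than relative changes, an assumption the article does not state outright, the same numbers imply roughly where each model started on the existing benchmark:

```python
# Hedged arithmetic: treating the reported drop as percentage points.
accuracy_after = 16.5   # LLaVA-v1-7B on ProbMed, per the article
drop = 78.89            # reported decrease
accuracy_before = accuracy_after + drop  # ~95.4% on the original benchmark
print(f"implied accuracy before ProbMed: {accuracy_before:.2f}%")
```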
“Our study reveals a significant vulnerability in LMMs when faced with adversarial questioning,” the researchers note.
GPT and Gemini Pro accept hallucinations, reject ground truth
Interestingly, GPT-4V and Gemini Pro outperformed other models in general tasks, such as recognizing image modality (CT scan, MRI or X-ray) and organs. However, they did not perform well when asked, for instance, about the existence of abnormalities. Both models performed close to random guessing with more specialized diagnostic questions, and their accuracy in identifying conditions was “alarmingly low.”
This “highlights a significant gap in their ability to aid in real-life diagnosis,” the researchers pointed out.
When analyzing the errors made by GPT-4V and Gemini Pro across three specialized question types (abnormality, condition/finding and position), the researchers found both models vulnerable to hallucination errors, particularly as they moved through the diagnostic procedure. They report that Gemini Pro was more prone to accept false conditions and positions, while GPT-4V tended to reject challenging questions and deny ground-truth conditions.
For questions about conditions or findings, GPT-4V’s accuracy dropped to 36.9%. For queries about position, Gemini Pro was accurate roughly 26% of the time, and 76.68% of its errors were the result of the model accepting hallucinations.
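That breakdown amounts to sorting a model’s wrong answers into two buckets: accepting a hallucinated finding (answering “yes” when the ground truth is “no”) and denying a real one (answering “no” when the ground truth is “yes”). A minimal sketch, with an assumed record format rather than the paper’s actual data structure:

```python
# Classify a model's mistakes into the two error types discussed above.
def categorize_errors(answers: list[dict]) -> dict[str, float]:
    """answers: dicts with keys 'ground_truth' ('yes'/'no') and 'prediction'."""
    accepted_hallucination = 0   # said "yes" to a condition that is not there
    rejected_ground_truth = 0    # said "no" to a condition that is there
    for a in answers:
        if a["prediction"] == a["ground_truth"]:
            continue
        if a["ground_truth"] == "no":
            accepted_hallucination += 1
        else:
            rejected_ground_truth += 1
    errors = accepted_hallucination + rejected_ground_truth
    if errors == 0:
        return {"accepted_hallucination": 0.0, "rejected_ground_truth": 0.0}
    return {
        "accepted_hallucination": accepted_hallucination / errors,
        "rejected_ground_truth": rejected_ground_truth / errors,
    }
```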
Meanwhile, specialized models such as CheXagent, which is trained exclusively on chest X-rays, were the most accurate at determining abnormalities and conditions, but struggled with general tasks such as identifying organs. Interestingly, CheXagent was able to transfer expertise, identifying conditions and findings in chest CT scans and MRIs. This, the researchers point out, indicates the potential for cross-modality expertise transfer in real-life situations.
“This study underscores the urgent need for more robust evaluation to ensure the reliability of LMMs in critical fields like medical diagnosis,” the researchers write, “and current LMMs are still far from applicable to those fields.”
They note that their insights “underscore the urgent need for robust evaluation methodologies to ensure the accuracy and reliability of LMMs in real-world medical applications.”
AI in medicine ‘life threatening’
On X, members of the research and medical community agreed that AI is not yet ready to support medical diagnosis.
“Glad to see domain specific studies corroborating that LLMs and AI should not be deployed in safety-critical infrastructure, a recent shocking trend in the U.S.,” posted Dr. Heidy Khlaaf, an engineering director at Trail of Bits. “These systems require at least two 9’s (99%), and LLMs are worse than random. This is literally life threatening.”
Another user called it “concerning,” adding that it “goes to show you that experts have skills not capable of modeling yet by AI.”
Data quality is “really worrisome,” another user asserted. “Companies don’t want to pay for domain experts.”