ChatGPT’s medical diagnoses are accurate less than half of the time, a new study reveals.
Scientists asked the artificial intelligence (AI) chatbot to assess 150 case studies from the medical website Medscape and found that GPT-3.5 (which powered ChatGPT when it launched in 2022) gave a correct diagnosis only 49% of the time.
Previous research showed that the chatbot could scrape a pass in the United States Medical Licensing Exam (USMLE) — a finding hailed by its authors as “a notable milestone in AI maturation.”
But in the new study, published Jul. 31 in the journal PLOS ONE, scientists cautioned against relying on the chatbot for complex medical cases that require human discernment.
“If people are scared, confused, or just unable to access care, they may be reliant on a tool that seems to deliver medical advice that’s ‘tailor-made’ for them,” senior study author Dr. Amrit Kirpalani, a doctor in pediatric nephrology at the Schulich School of Medicine and Dentistry at Western University, Ontario, told Live Science. “I think as a medical community (and among the larger scientific community) we need to be proactive about educating the general population about the limitations of these tools in this respect. They should not replace your doctor yet.”
ChatGPT’s ability to dispense information is based on its training data. Scraped from the repository Common Crawl, the 570 gigabytes of text data fed into the 2022 model amounts to roughly 300 billion words, which were taken from books, online articles, Wikipedia and other web pages.
AI systems spot patterns in the words they were trained on to predict what may follow them, enabling them to provide an answer to a prompt or question. In theory, this makes them helpful for both medical students and patients seeking simplified answers to complex medical questions, but the bots’ tendency to “hallucinate,” or make up responses entirely, limits their usefulness in medical diagnoses.
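The pattern-spotting idea can be sketched with a toy next-word predictor. This is only an illustration under loud assumptions: the tiny corpus is invented, the function names are hypothetical, and real chatbots use large neural networks rather than simple word-pair counts.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    followers = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Invented toy corpus, not real medical guidance.
corpus = (
    "fever and cough suggest flu . "
    "fever and rash suggest measles . "
    "fever and cough suggest flu ."
)
model = train_bigrams(corpus)

print(predict_next(model, "suggest"))  # the most common continuation in the corpus
print(predict_next(model, "kidney"))   # unseen word, so there is no prediction
```

The predictor returns whatever most often followed a word in its training text, whether or not that is right for the case at hand, which hints at why fluent answers can still be wrong.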
To assess the accuracy of ChatGPT’s medical advice, the researchers presented the model with 150 varied case studies (including patient history, physical exam findings and images taken from the lab) that were intended to challenge the diagnostic abilities of trainee doctors. The chatbot chose one of four multiple-choice outcomes before responding with its diagnosis and a treatment plan, which the researchers rated for accuracy and clarity.
The results were lackluster: ChatGPT got more diagnoses wrong than right, and it gave complete and relevant answers just 52% of the time. Nonetheless, the chatbot’s overall accuracy was much higher, at 74%, meaning that it could identify and discard wrong multiple-choice answers much more reliably than it could pick the right one.
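A short calculation shows how a 49% diagnosis rate and 74% overall accuracy can coexist. It assumes, and the article does not spell this out, that overall accuracy is scored per answer option: every correctly rejected wrong option counts as a hit, alongside every correctly chosen right one. The function name and the scoring rule are illustrative, not taken from the study.

```python
def per_option_accuracy(p_correct, n_options=4):
    """Share of answer options judged correctly when exactly one option is picked per case.

    Correct pick: all n_options are classified correctly.
    Wrong pick: the chosen wrong option and the missed right option are both errors,
    so only n_options - 2 options are classified correctly.
    """
    hits_when_right = p_correct * n_options
    hits_when_wrong = (1 - p_correct) * (n_options - 2)
    return (hits_when_right + hits_when_wrong) / n_options


# 49% correct diagnoses with four options per case, as reported in the article.
print(round(per_option_accuracy(0.49), 3))  # about 0.745
```

Under this assumption, even a wrong pick still rejects two of the three wrong options correctly, so the per-option score comes out at roughly 0.745, close to the reported 74%.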
The researchers said that one reason for this poor performance could be that the AI wasn’t trained on a large enough clinical dataset, leaving it less able than human doctors to juggle results from multiple tests and to avoid dealing in absolutes.
Despite its shortcomings, the researchers said that AI and chatbots could still be useful in teaching patients and trainee doctors, provided the AI systems are supervised and their proclamations are accompanied by some healthy fact-checking.
“If you go back to medical journal publications from around 1995, you can see that the very same discourse was happening with ‘the world wide web.’ There were new publications about interesting use cases and there were also papers that were skeptical as to whether this was just a fad,” Kirpalani said. “I think with AI and chatbots specifically, the medical community will ultimately find that there’s a huge potential to augment clinical decision-making, streamline administrative tasks, and enhance patient engagement.”