ChatGPT Health — OpenAI’s new health-focused chatbot — frequently underestimated the severity of medical emergencies, according to a study published last week in the journal Nature Medicine.
In the study, researchers tested ChatGPT Health’s ability to triage, or assess the severity of, medical cases based on real-life scenarios.
Previous research has shown that ChatGPT can pass medical exams, and nearly two-thirds of physicians reported using some form of AI in 2024. But other research has shown that chatbots, including ChatGPT, don’t provide reliable medical advice.
ChatGPT Health is separate from OpenAI’s general ChatGPT chatbot. The program is free, but users must sign up specifically to use the health program, which currently has a waitlist to join. OpenAI says ChatGPT Health uses a more secure platform so users can safely upload personal medical information.
Over 40 million people globally use ChatGPT to answer health care questions, and nearly 2 million weekly ChatGPT messages are about insurance, according to OpenAI. In a detailed description of ChatGPT Health on its website, OpenAI says that it is “not intended for diagnosis or treatment.”
In the study, the researchers fed 60 medical scenarios to ChatGPT Health. The chatbot’s responses were compared with the responses of three physicians who also reviewed the scenarios and triaged each one based on medical guidelines and clinical expertise.
Each of the scenarios had 16 variations, changing details such as the race or gender of the patient.
The variations were designed to “produce the exact same result,” according to lead study author Dr. Ashwin Ramaswamy, an instructor of urology at The Mount Sinai Hospital in New York City. This meant that an emergency case involving a man should still be classified as an emergency if the patient was a woman. The study didn’t find any significant differences in the results based on demographic changes.
The researchers found that ChatGPT Health “under-triaged” 51.6% of emergency cases. That is, instead of recommending the patient go to the emergency room, the bot recommended seeing a doctor within 24 to 48 hours.
The emergencies included a patient with a life-threatening diabetes complication called diabetic ketoacidosis and a patient going into respiratory failure. Left untreated, both can be fatal.
“Any doctor, and any person who’s gone through any degree of training, would say that that patient needs to go to the emergency department,” Ramaswamy said.
In cases like impending respiratory failure, the bot seemed to be “waiting for the emergency to become undeniable” before recommending the ER, he said.
Emergencies like stroke, with unmistakable symptoms, were correctly triaged 100% of the time, the study found.
A spokesperson for OpenAI said the company welcomed research on the use of AI in health care but that the new study didn’t reflect how ChatGPT Health is typically used or how it’s designed to function. The chatbot is designed for people to ask follow-up questions and give more context in medical situations, rather than to give a single response to a medical scenario, the spokesperson said.
ChatGPT Health is currently available only to a limited number of users, and OpenAI is still working to improve the model’s safety and reliability before making the chatbot more widely available, the spokesperson said.
Compared with the doctors in the study, the bot also over-triaged 64.8% of nonurgent cases, recommending a doctor’s appointment when it wasn’t necessary. The bot told a patient with a three-day sore throat to see a doctor in 24 to 48 hours, when at-home care was sufficient.
“There’s no logic, for me, as to why it was making recommendations in some areas versus others,” Ramaswamy said.
In suicidal ideation or self-harm scenarios, the bot’s response was also inconsistent.
When a user expresses suicidal intent, ChatGPT is supposed to refer users to 988, the suicide and crisis hotline. ChatGPT Health works the same way, the OpenAI spokesperson said.
In the study, however, ChatGPT Health sometimes referred users to 988 when they didn’t need it and failed to refer them when they did.
Ramaswamy called the bot “paradoxical.”
“It was inverted to clinical risk,” he said. “And it was kind of backwards.”
‘A medical therapist’
Dr. John Mafi, an associate professor of medicine and a primary care physician at UCLA Health who wasn’t involved with the research, said more testing is needed on chatbots that can make health decisions.
“The message of this study is that before you roll something like this out, to make life-affecting decisions, you need to rigorously test it in a controlled trial, where you’re making sure that the benefits outweigh the harms,” Mafi said.
Both Mafi and Ramaswamy said they’ve seen a number of their own patients using AI for medical questions.
Ramaswamy said people may turn to AI for health advice because it’s easy to access and has no limit on the number of questions a person can ask.
“You can go through every question, every detail, every document that you want to upload,” Ramaswamy said. “And it fulfills that need. People really, really want not just medical advice, but they also want a partner, like a medical therapist.”
OpenAI said in a January report that a majority of ChatGPT’s health-related messages occur outside of a doctor’s normal working hours, and over half a million weekly messages came from people living 30 or more minutes away from a hospital.
“A doctor can spend 15, 20 minutes with you in the room,” Ramaswamy said. “They’re not going to be able to address and answer every single question.”
Risks of using a chatbot for medical advice
Despite the benefits of that constant availability, when asked whether chatbots can currently provide health and medical advice safely, Ramaswamy said no.
Dr. Ethan Goh, executive director of ARISE, an AI research network, said that in many instances, AI can provide safe health and medical advice, but that it’s not a substitute for a physician’s advice.
“The reality is chatbots can be helpful for a vast number of things. It’s really more about being thoughtful and being deliberate and understanding that it also has severe limitations,” he said.
Monica Agrawal, an assistant professor in the department of biostatistics and bioinformatics and the department of computer science at Duke University, said it’s largely unknown how AI models are trained and what data is used to train them.
She said some training benchmarks may not indicate a bot’s potential to help.
“A lot of [OpenAI’s] earlier evaluations were based on, ‘We do this well on a licensing exam,’” she said. “But there’s a huge difference between doing well on a medical exam and actually practicing medicine.”
She added that the information people give chatbots is not always clear and can contain biases.
“Large language models are known for being sycophantic,” she said. “Which means they tend to agree with opinions posited by the user, even if they might not be correct. And this has the ability to reinforce patient misconceptions or biases.”
Mafi said AI tools are “designed to please you,” but as a doctor, “sometimes you have to say something that may not please the patient.”
Ramaswamy said people should not rely on AI in an emergency and that using it in conjunction with a physician is key to preventing harm. He said collaborations between tech and health care companies are important for creating safer AI products.
“If these models get better and better, I can see the benefits of a patient-AI-doctor relationship, especially in rural scenarios, or in areas of global health,” he said.