AI chatbots oversimplify scientific studies and gloss over critical details — the newest models are especially guilty

By Lisa D. Sparks

5 July 2025

More advanced AI chatbots are more likely to oversimplify complex scientific findings based on the way they interpret the data they are trained on, a new study suggests.


(Image credit: Getty Images/peshkov)

Large language models (LLMs) are becoming less “intelligent” in each new version as they oversimplify and, in some cases, misrepresent important scientific and medical findings, a new study has found.

Scientists discovered that versions of ChatGPT, Llama and DeepSeek were five times more likely to oversimplify scientific findings than human experts in an analysis of 4,900 summaries of research papers.
When given a prompt for accuracy, chatbots were twice as likely to overgeneralize findings as when prompted for a simple summary. The testing also revealed an increase in overgeneralizations among newer chatbot versions compared with previous generations.


The researchers published their findings April 30 in the journal Royal Society Open Science.

“I think one of the biggest challenges is that generalization can seem benign, or even helpful, until you realize it’s changed the meaning of the original research,” study author Uwe Peters, a postdoctoral researcher at the University of Bonn in Germany, wrote in an email to Live Science. “What we add here is a systematic method for detecting when models generalize beyond what’s warranted in the original text.”
The effect is like a photocopier with a broken lens that makes each subsequent copy bigger and bolder than the original. LLMs filter information through a series of computational layers, and along the way some information can be lost or change meaning in subtle ways. This is especially true for scientific studies, since scientists must frequently include qualifications, context and limitations in their results, and producing a summary that is both simple and accurate becomes quite difficult.
“Earlier LLMs were more likely to avoid answering difficult questions, whereas newer, larger, and more instructible models, instead of refusing to answer, often produced misleadingly authoritative yet flawed responses,” the researchers wrote.

Related: AI is just as overconfident and biased as humans can be, study shows
In one example from the study, DeepSeek turned a description into a medical recommendation by changing the phrase "was safe and could be performed successfully" to "is a safe and effective treatment option."
Another test in the study showed Llama broadened the scope of effectiveness for a drug treating type 2 diabetes in young people by eliminating information about the dosage, frequency, and effects of the medication.
If published, this chatbot-generated summary could cause medical professionals to prescribe drugs outside of their effective parameters.
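The study itself does not publish a detection tool, but the kind of wording shift in the DeepSeek example can be illustrated with a rough heuristic. The short Python sketch below is purely hypothetical (the regular expressions and the function name are illustrative, not taken from the paper): it flags sentences that state a finding as a generic, present-tense fact while lacking the hedging cues that tie it to a specific trial.

```python
import re

# Hypothetical heuristic, not the authors' method: flag summary sentences that
# state findings as timeless, generic facts ("is a safe and effective treatment")
# rather than as qualified, past-tense results ("was safe in this trial").
GENERIC_PATTERN = re.compile(
    r"\b(is|are)\s+(a\s+)?(safe|effective|beneficial|recommended)\b", re.IGNORECASE
)
HEDGED_PATTERN = re.compile(
    r"\b(was|were|in this (trial|study|cohort)|among \d+ (patients|participants))\b",
    re.IGNORECASE,
)

def flag_possible_overgeneralization(sentence: str) -> bool:
    """Return True if a sentence reads as a generic claim with no hedging cues."""
    return bool(GENERIC_PATTERN.search(sentence)) and not HEDGED_PATTERN.search(sentence)

original = "The procedure was safe and could be performed successfully in this trial."
rewritten = "The procedure is a safe and effective treatment option."

for text in (original, rewritten):
    print(f"{flag_possible_overgeneralization(text)!s:5}  {text}")
# Prints False for the hedged original wording and True for the generic rewrite.
```

A pattern match like this is only a toy; the paper's systematic method instead compares each summary against what the original text actually warrants.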
Unsafe treatment options
In the new study, researchers worked to answer three questions about 10 of the most popular LLMs (four versions of ChatGPT, three versions of Claude, two versions of Llama, and one of DeepSeek).
They wanted to see whether, when presented with a human summary of an academic journal article and prompted to summarize it, the LLM would overgeneralize the summary and, if so, whether asking it for a more accurate answer would yield a better result. The team also aimed to determine whether the LLMs overgeneralize more than humans do.
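In outline, that design can be sketched as a small evaluation loop. The code below is an illustrative skeleton rather than the authors' pipeline: ask_chatbot and counts_as_overgeneralized are hypothetical stand-ins for a real model API call and for the paper's scoring of overgeneralization.

```python
from typing import Callable

# Illustrative skeleton of the study's design, not the authors' code.
# `ask_chatbot` and `counts_as_overgeneralized` are hypothetical stand-ins for
# a real chatbot API call and for the paper's overgeneralization scoring.

PROMPTS = {
    "simple": "Summarize this research abstract.",
    "accuracy": "Summarize this research abstract. Be as accurate as possible.",
}

def overgeneralization_rate(
    model: str,
    abstracts: list[str],
    ask_chatbot: Callable[[str, str], str],
    counts_as_overgeneralized: Callable[[str, str], bool],
) -> dict[str, float]:
    """Fraction of summaries judged overgeneralized, per prompt condition."""
    rates = {}
    for condition, instruction in PROMPTS.items():
        flagged = 0
        for abstract in abstracts:
            summary = ask_chatbot(model, f"{instruction}\n\n{abstract}")
            if counts_as_overgeneralized(abstract, summary):
                flagged += 1
        rates[condition] = flagged / len(abstracts)
    return rates
```

Comparing the two rates for each model, and comparing both against the rate for human-written summaries, mirrors the three questions the researchers set out to answer.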
The findings revealed that, with the exception of Claude, which performed well on all testing criteria, LLMs given a prompt for accuracy were twice as likely to produce overgeneralized results. LLM summaries were nearly five times more likely than human-written summaries to render generalized conclusions.
The researchers also noted that the most common overgeneralizations, and those most likely to lead to unsafe treatment recommendations, occurred when LLMs converted quantified data into generic statements.
These transitions and overgeneralizations have led to biases, according to experts at the intersection of AI and healthcare.
“This study highlights that biases can also take more subtle forms — like the quiet inflation of a claim’s scope,” Max Rollwage, vice president of AI and research at Limbic, a clinical mental health AI technology company, told Live Science in an email. “In domains like medicine, LLM summarization is already a routine part of workflows. That makes it even more important to examine how these systems perform and whether their outputs can be trusted to represent the original evidence faithfully.”
Such discoveries should prompt developers to create workflow guardrails that identify oversimplifications and omissions of critical information before putting findings into the hands of public or professional groups, Rollwage said.
While comprehensive, the study had limitations; future studies would benefit from extending the testing to other scientific tasks and non-English texts, as well as from testing which types of scientific claims are more subject to overgeneralization, said Patricia Thaine, co-founder and CEO of Private AI — an AI development company.
Rollwage also noted that “a deeper prompt engineering analysis might have improved or clarified results,” while Peters sees larger risks on the horizon as our dependence on chatbots grows.
“Tools like ChatGPT, Claude and DeepSeek are increasingly part of how people understand scientific findings,” he wrote. “As their usage continues to grow, this poses a real risk of large-scale misinterpretation of science at a moment when public trust and scientific literacy are already under pressure.”

For other experts in the field, the core challenge is that specialized knowledge and protections are being ignored.
“Models are trained on simplified science journalism rather than, or in addition to, primary sources, inheriting those oversimplifications,” Thaine wrote to Live Science.
“But, importantly, we’re applying general-purpose models to specialized domains without appropriate expert oversight, which is a fundamental misuse of the technology which often requires more task-specific training.”

In December 2024, Future Publishing agreed a deal with OpenAI in which the AI company would bring content from Future’s 200-plus media brands to OpenAI’s users. You can read more about the partnership here.

Lisa D. Sparks

Lisa D. Sparks is a freelance journalist for Live Science and an experienced editor and marketing professional with a background in journalism, content marketing, strategic development, project management, and process automation. She specializes in artificial intelligence (AI), robotics, electric vehicles (EVs) and battery technology, and also covers trends in semiconductors and data centers.
