An AI Model for Scientists
OpenScholar Synthesizes Scientific Research, Cites Sources as Well as Human Experts

Aaron Aupperlee
Wednesday, February 4, 2026

A team of researchers led by an incoming SCS professor built an open-source AI model specifically designed to synthesize current scientific research.

The Breakdown

  • OpenScholar is an AI model developed to help scientists keep up with new discoveries.
  • The model was trained on 45 million scientific papers and can incorporate new ones.
  • Scientists preferred OpenScholar answers to human answers 51% of the time.

***

A team of researchers led by an incoming Carnegie Mellon University professor has built an open-source artificial intelligence model designed specifically to synthesize current scientific research.

Akari Asai, who will join the Language Technologies Institute in CMU's School of Computer Science as an assistant professor this fall, said the new tool, OpenScholar, will help researchers stay on top of the latest scientific breakthroughs.

"Scientists see so many papers coming out every day that it's impossible to keep up," said Asai, the lead author on the paper describing OpenScholar. "But the existing AI systems weren't designed for scientists' specific needs. We've already seen a lot of scientists using OpenScholar and because it's open-source, others are building on this research and improving on our results."

Asai is currently a research scientist at The Allen Institute for AI (Ai2) and completed the OpenScholar research as a doctoral student in the University of Washington's Paul G. Allen School of Computer Science and Engineering. Asai was a visiting student at CMU in 2024 and worked with Graham Neubig, an associate professor in the LTI, who helped guide the OpenScholar project with other senior advisers.

Current AI models have shown promise in quickly synthesizing vast amounts of information, but they still tend to make things up, or hallucinate. When the team studied a recent OpenAI model, GPT-4o, they found it fabricated 78% to 90% of its research citations. And general-purpose AI models like ChatGPT often can't access papers published after their training data was collected.

"Early on, we experimented with using an AI model with Google's search data, but we found it wasn't very good on its own," Asai said. "It might cite some research papers that weren't the most relevant, cite just one paper, or randomly pull from a blog post. We realized we needed to ground this in scientific papers. We then made the system flexible so it could incorporate emerging research through results."

Researchers trained the model and then built a datastore of 45 million scientific papers that OpenScholar could pull from to ground its answers in established research. They coupled this with a technique called retrieval-augmented generation, which lets the model search for new sources, incorporate them and cite them after it has been trained.
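In broad strokes, retrieval-augmented generation means fetching the passages most relevant to a query and asking the model to answer using, and citing, only those passages. The sketch below illustrates that general pattern only; it is not OpenScholar's actual retriever, index or prompt format. The tiny in-memory datastore, the bag-of-words embed() function and the prompt wording are all invented for the example.

    # A minimal, illustrative retrieval-augmented generation loop (Python).
    # The toy embed() and the tiny in-memory datastore are stand-ins, not
    # OpenScholar's retriever or its 45-million-paper corpus.

    from collections import Counter
    from math import sqrt

    # Hypothetical datastore: a handful of paper snippets keyed by citation id.
    DATASTORE = {
        "[1]": "Retrieval-augmented generation grounds language model outputs in retrieved documents.",
        "[2]": "Dense retrievers encode queries and passages into a shared vector space.",
        "[3]": "Large language models can hallucinate citations when asked for references.",
    }

    def embed(text: str) -> Counter:
        """Toy bag-of-words 'embedding'; a real system would use a trained dense encoder."""
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity between two sparse bag-of-words vectors."""
        dot = sum(a[t] * b[t] for t in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
        """Return the k passages most similar to the query."""
        q = embed(query)
        scored = sorted(DATASTORE.items(), key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
        return scored[:k]

    def build_prompt(query: str) -> str:
        """Assemble a prompt that asks the model to answer using only the cited passages."""
        passages = retrieve(query)
        context = "\n".join(f"{cid} {text}" for cid, text in passages)
        return (
            "Answer the question using only the passages below, "
            "and cite them by their bracketed ids.\n\n"
            f"{context}\n\nQuestion: {query}\nAnswer:"
        )

    if __name__ == "__main__":
        # The assembled prompt would then be sent to a language model for generation.
        print(build_prompt("How does retrieval-augmented generation reduce hallucinated citations?"))

Because the retrieval step runs at question time, new papers can be added to the datastore and cited without retraining the model, which is how systems of this kind stay current.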

"After we started this work, we put the demo online, and quickly, we got a lot of queries, far more than we'd expected," said senior author Hanna Hajishirzi, an associate professor in the Allen School and senior director at Ai2. "When we started looking through the responses, we realized our colleagues and other scientists were actively using OpenScholar. It really speaks to the need for this sort of open-source, transparent system that can synthesize research."

To evaluate their system, the team created ScholarQABench, a benchmark for testing AI systems on scientific literature search. They gathered 3,000 queries and 250 long-form answers written by experts in computer science, physics, biomedicine and neuroscience.

"AI is getting better and better at real-world tasks," Hajishirzi said. "But the big question ultimately is whether we can trust that its answers are correct."

The team compared OpenScholar against other state-of-the-art AI models, such as OpenAI's GPT-4o and two models from Meta. ScholarQABench automatically evaluated the models' answers on metrics such as accuracy, writing quality and relevance.

OpenScholar outperformed all the systems it was tested against. The team also had 16 scientists review answers from the models and compare them with human-written responses. The scientists preferred OpenScholar's answers to the human answers 51% of the time. When the team paired OpenScholar's retrieval and citation pipeline with GPT-4o, a much larger model, the scientists preferred the AI-written answers to the human answers 70% of the time; they preferred answers from GPT-4o on its own only 32% of the time.

Asai said the team is already working on a follow-up model, DR Tulu, which builds on OpenScholar's findings and performs multistep search and information gathering to produce more comprehensive responses.

The team published its findings Feb. 4 in Nature. The project's code, data and a demo are publicly available and free to use.

For More Information

Aaron Aupperlee | 412-268-9068 | aaupperlee@cmu.edu