FACTS Benchmark Suite: Systematically evaluating the factuality of large language models

As large language models (LLMs) become our go-to sources for information across countless applications, their factual accuracy is no longer just a feature—it’s a foundational requirement. To truly advance in this critical area, we need a deeper understanding of where models stumble and better tools to measure their performance. That’s where the new FACTS Benchmark Suite comes in.

Today, in collaboration with Kaggle, we’re excited to introduce the comprehensive FACTS Benchmark Suite. This initiative builds upon the earlier FACTS Grounding Benchmark by adding three new, rigorous evaluations:

  • The Parametric Benchmark: Tests a model’s core ability to accurately recall and apply its internal knowledge to answer fact-based questions, without any external help.
  • The Search Benchmark: Evaluates how effectively a model can use a web search tool to find, retrieve, and synthesize information to answer complex queries.
  • The Multimodal Benchmark: Assesses a model’s skill in providing factually correct text responses to prompts that include images.

We’ve also released an updated Grounding Benchmark (v2) to further test a model’s capacity to ground its answers directly in the provided context. In total, this suite offers 3,513 carefully curated examples, now publicly available. Following standard practice, a private evaluation set is held back. The overall FACTS Score is calculated as the average accuracy across all four benchmarks on both public and private sets. Kaggle will manage the suite, test leading models, and host a public leaderboard.
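To make the aggregation concrete, here is a minimal sketch of how an overall score could be computed from per-benchmark accuracies. The benchmark names, function, and input format are illustrative assumptions, not part of the official Kaggle tooling, which may differ in detail.

```python
# Illustrative sketch: averaging per-benchmark accuracy into an overall score.
# Benchmark keys and the aggregation interface are assumptions for this example.

def facts_score(accuracies: dict[str, float]) -> float:
    """Average accuracy (0-1) across the four FACTS benchmarks.

    `accuracies` maps each benchmark name to its accuracy over the combined
    public and private examples for that benchmark.
    """
    benchmarks = ["parametric", "search", "multimodal", "grounding_v2"]
    missing = [b for b in benchmarks if b not in accuracies]
    if missing:
        raise ValueError(f"missing benchmark results: {missing}")
    return sum(accuracies[b] for b in benchmarks) / len(benchmarks)


# Hypothetical example values, not actual leaderboard numbers.
print(facts_score({
    "parametric": 0.71,
    "search": 0.66,
    "multimodal": 0.58,
    "grounding_v2": 0.80,
}))  # -> 0.6875
```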

Diving into the Benchmarks

The Parametric Benchmark focuses on pure knowledge recall. It presents “trivia-style” questions answerable via Wikipedia—a common LLM training source—to see how well a model can tap its internal memory. A typical question might be: “Who played harmonica on ‘The Rockford Files’ theme song?” The benchmark includes 1,052 public and 1,052 private items.
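As a rough illustration of what this closed-book setup implies, the sketch below evaluates a question-answering loop in which the model is queried with no tools or retrieved context, so any correct answer must come from its internal knowledge. The `ask_model` and `answers_match` callables are placeholders, not the FACTS harness or any specific API.

```python
from typing import Callable

# Closed-book evaluation sketch: no search tool and no provided context,
# so correct answers must come from the model's parametric knowledge.

def evaluate_parametric(
    items: list[dict],                          # each item: {"question": ..., "answer": ...}
    ask_model: Callable[[str], str],            # placeholder model call with tools disabled
    answers_match: Callable[[str, str], bool],  # placeholder grader (e.g., an LLM judge)
) -> float:
    correct = 0
    for item in items:
        prediction = ask_model(item["question"])
        if answers_match(prediction, item["answer"]):
            correct += 1
    return correct / len(items)
```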

The Search Benchmark shifts the challenge to information retrieval. It tests a model’s ability to use a provided web search tool to answer questions that often require piecing together multiple facts. To ensure fair comparison, all models use the same search tool. This benchmark is designed to be tough, even with web access. An example prompt: “What is the sum of the birth years of the British boxer who defeated Vazik Kazarian at the 1960 Summer Olympics, the Moroccan boxer who also competed in the men’s light welterweight event at those same Olympics, and the Danish boxer who competed in both the 1960 and 1964 Summer Olympics?” It consists of 890 public and 994 private prompts.
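The search setting adds a tool loop: the model can issue queries against a fixed search backend and then answer once it has gathered enough evidence. The sketch below shows one common way such a loop is wired up; `model_step` and `web_search` are stand-ins for whatever model interface and shared search tool the suite actually uses.

```python
from typing import Callable

# Tool-use loop sketch: the model alternates between issuing search queries
# and producing a final answer. Every model under evaluation would be given
# the same `web_search` backend so that retrieval quality is held constant.

def answer_with_search(
    question: str,
    model_step: Callable[[str, list[str]], dict],  # placeholder: returns {"action": "search"|"answer", "text": ...}
    web_search: Callable[[str], str],              # placeholder shared search tool
    max_searches: int = 5,
) -> str:
    evidence: list[str] = []
    for _ in range(max_searches):
        step = model_step(question, evidence)
        if step["action"] == "answer":
            return step["text"]
        # The model asked for another search; append the results as evidence.
        evidence.append(web_search(step["text"]))
    # Out of search budget: force a final answer from the evidence gathered so far.
    return model_step(question, evidence + ["(no further searches allowed)"])["text"]
```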

The Multimodal Benchmark tackles the frontier of visual understanding. It evaluates how accurately a model can generate text about an input image, requiring it to integrate visual perception with its world knowledge. For instance, given an image of an animal, the prompt might ask: “What genus does this animal belong to?” This benchmark includes 711 public and 811 private items.
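Mechanically, each multimodal item pairs an image with a text prompt and grades the textual answer. A minimal sketch follows, with `ask_multimodal_model` and `answers_match` standing in for any vision-capable model API and grader rather than the actual FACTS pipeline.

```python
from pathlib import Path
from typing import Callable

# Multimodal sketch: one benchmark item pairs an image with a question,
# and the model's text answer is graded against a reference.

def score_multimodal_item(
    image_path: str,
    question: str,                                      # e.g. "What genus does this animal belong to?"
    reference: str,
    ask_multimodal_model: Callable[[bytes, str], str],  # placeholder vision-language model call
    answers_match: Callable[[str, str], bool],          # placeholder grader
) -> bool:
    image_bytes = Path(image_path).read_bytes()
    prediction = ask_multimodal_model(image_bytes, question)
    return answers_match(prediction, reference)
```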

Initial Results and the Road Ahead

We’ve put leading LLMs through this rigorous new suite. The results are telling and highlight both progress and the challenges ahead.

Gemini 3 Pro leads the pack with an overall FACTS Score of 68.8%. The leap from its predecessor is significant, showing a 55% reduction in error rate on the Search benchmark and a 35% reduction on the Parametric benchmark. It’s worth noting that the Multimodal benchmark presented the toughest challenge for all models evaluated. Crucially, no model yet breaks the 70% overall accuracy barrier, signaling ample room for growth across the industry.

This progress isn’t isolated. Gemini’s improved factuality also shows up on the SimpleQA Verified benchmark, which tests concise parametric knowledge: its accuracy jumped from 54.5% to 72.1%.
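For readers unfamiliar with the error-rate framing used above: assuming “reduction in error rate” refers to the relative drop in (1 − accuracy), the SimpleQA Verified numbers work out as shown below. The per-benchmark accuracies behind the 55% (Search) and 35% (Parametric) figures are not restated in this post, so the SimpleQA accuracies serve as the worked example.

```python
# Relative error-rate reduction, illustrated with the SimpleQA Verified
# accuracies quoted above (54.5% -> 72.1%). Presumably the same formula
# underlies the Search and Parametric reduction figures.

def error_rate_reduction(old_accuracy: float, new_accuracy: float) -> float:
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    return (old_error - new_error) / old_error

print(f"{error_rate_reduction(0.545, 0.721):.1%}")  # -> 38.7%
```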

The journey toward perfectly factual LLMs is ongoing. The FACTS Benchmark Suite and the results from models like Gemini 3 Pro underscore a committed push toward making information not just accessible, but reliably useful. We hope this framework sparks deeper research and collaboration, ultimately leading to more accurate and trustworthy AI tools for everyone.