Benchmarking AI Models

Summary of Benchmarking AI Models

by Ben Jaffe and Katie Malone

29mMarch 30, 2026

Overview of Linear Digressions — "Benchmarking AI Models"

This episode (hosts Ben Jaffe and Katie Malone) explains how researchers evaluate progress in large language models (LLMs) using benchmarks. It covers what benchmarks are, two canonical examples (MMLU and SWEbench), common failure modes (Goodhart’s law, data contamination, saturation, ambiguity/underspecification, non-determinism), and practical mitigation strategies (canary strings, encryption, human filtering). The hosts also propose a recurring mini‑series diving into individual benchmarks.

Key topics discussed

  • What benchmarks are and why they’re used (standardized tests to compare model capabilities).
  • MMLU (Massive Multitask Language Understanding): multi‑subject multiple‑choice exam used as a canonical benchmark.
  • SWEbench (Sweebench / SWEbench Verified): software engineering benchmark using real GitHub issues + repo tests.
  • Core limitations of many benchmarks:
    • Goodhart’s law (optimizing to the metric undermines its value).
    • Signal contamination / data leakage (evals appearing in training data).
    • Saturation / ceiling effects (models cluster near top accuracy).
    • Ambiguous or underspecified questions (multiple defensible answers).
    • Non‑determinism of LLM outputs.
  • Technical mitigations: canary strings, encryption of eval datasets, human‑verified subsets.
  • Practical recommendations for users and developers.

Examples & illustrations

MMLU (Massive Multitask Language Understanding)

  • 57 subject areas (history, law, medicine, philosophy, math, etc.), ~14k multiple‑choice questions.
  • Example questions mentioned:
    • Medicine: “Glucose is transported into the muscle cell …” (correct: GLUT4).
    • Law: a nuanced trespass/intent question (illustrates ambiguity).
    • Anatomy: embryological origin of the hyoid bone (shows domain‑knowledge questions).
  • Issues highlighted: ambiguous answers, not aligned with real‑world reasoning, contamination and saturation over time.

Sweebench (SWEbench)

  • Uses real open‑source repositories: issues and corresponding PR fixes.
  • Evaluation via repo unit tests to confirm a fix.
  • Advantages: simulates realistic software engineering tasks (comprehension, navigating repo context).
  • Challenges: multiple valid fixes may exist but tests/useful human context might only accept one; contamination and saturation issues exist here too.
  • SWEbench Verified: human‑filtered subset to address bad tests or underspecified tasks.

Limitations & failure modes (concise)

  • Goodhart’s law: benchmarks become targets; models get tuned to the test rather than true capabilities.
  • Data leakage: benchmarks or their answers end up in training corpora, inflating scores.
  • Saturation: once models cluster near perfect scores, the benchmark loses discriminative power.
  • Ambiguity/underspecification: some questions lack single objectively correct answers; context may be missing.
  • Non‑determinism: model outputs can vary across runs, making single scores noisy.

Mitigations & best practices

  • Canary strings: embed unique random tokens/phrases in eval data to detect or exclude contamination. Users can query models to see if they complete canary strings (a simple contamination test).
  • Encrypt or tightly control access to benchmark datasets and decryption keys to reduce accidental leak into training corpora.
  • Human‑verified benchmarks: curate or filter tasks/tests to remove broken or underspecified evaluations.
  • Create harder, more diverse benchmarks that place most models in the distributed middle (avoid early saturation).
  • Use task‑specific benchmarks (e.g., coding, medical, legal) and prefer ones aligned to your real workload.
  • Run your own representative evaluations for production/model selection rather than relying solely on public leaderboards.

Practical recommendations for listeners (developers / users)

  • Choose models optimized for your task (e.g., code‑specialized variants like “Codex” for programming).
  • Don’t rely solely on single public benchmarks to choose a model—validate on your own representative tasks and data.
  • Be skeptical of headline performance improvements: ask whether gains might be due to contamination or test‑tuning.
  • If evaluating a model yourself: include randomized trials (to account for non‑determinism), and consider canary checks for contamination.
  • Prefer human‑verified or curated benchmark subsets if you need more trustworthy evaluations.

Notable quotes & insights

  • “Benchmarks are basically tests for LLMs.”
  • “Goodhart’s law: as soon as a metric becomes the goal, it stops being a good measure of progress.”
  • Canary trick: embed unique identifiers in eval data to both exclude them from training and detect contamination by model completion.

Bottom line / Takeaways

  • Benchmarks are a necessary tool for tracking LLM progress but have important and evolving limitations.
  • Be aware of Goodhart effects, data leakage, and saturation when interpreting benchmark gains.
  • Use task‑aligned, curated benchmarks and run your own evaluations before adopting models for production.
  • Technical countermeasures (canaries, encryption, human curation) improve reliability but no benchmark is perfect—ongoing benchmark design and critical evaluation remain essential.

Further actions / next steps suggested by the episode

  • Expect the hosts to run a mini‑series (“better know a benchmark”) that dives into individual benchmarks in depth.
  • If you evaluate models, add canary checks, human‑verification, and representative in‑house tests to your evaluation pipeline.
  • Subscribe to the podcast’s Substack newsletter for episode highlights and extra content (they mentioned Linear Digressions’ Substack).