Overview of Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill‑Smith
A Latent Space interview with George Cameron and Micah Hill‑Smith (founders of Artificial Analysis) about how their independent benchmarking operation started, what they measure today, how they make money, and where they’re headed. The conversation covers AA’s public intelligence index and private benchmarking business, technical benchmarking tradeoffs (cost, repeats, parsing, endpoints), new evals (omniscience/hallucination, GDPVal/agentic tasks, Critical Point), openness and hardware indices, and the big trends they track (cost of intelligence vs. total spend, token/turn efficiency, sparsity).
Key takeaways
- Artificial Analysis (AA) is an independent benchmarking house that runs public, reproducible model and hosting comparisons and also offers paid enterprise reports and private benchmarking.
- The public site remains free and independent; commercial work is in enterprise subscriptions and custom private benchmarking.
- AA runs its own evals (not just reported lab numbers) to ensure consistent prompts, parsing, and measurement of trade‑offs (quality vs. throughput vs. cost).
- New evals and indices: Omniscience (knowledge + hallucination penalty), GDPVal (agentic, multi‑turn work tasks using an ELO approach), Critical Point (hard physics problems), and an Openness Index (disclosure of data/methods/training).
- Costs to run robust benchmarks have risen substantially (repeats for confidence intervals, many more models, more complex evals); AA publishes a simplified cost figure, but its actual runs are costlier.
- Big industry trend: cost per unit of intelligence has fallen dramatically (100x–1000x for some tiers), yet total spend can go up because agentic workflows and long‑context uses consume many more tokens and compute.
- Practical tools: AA open‑sourced an agentic harness called Stirrup; they publish many interactive charts and allow users to customize which models are highlighted.
How Artificial Analysis makes money
- Two main revenue streams:
- Enterprise benchmark & insight subscriptions: standardized reports and deep guidance (e.g., model deployment choices: serverless vs. managed vs. owning hardware).
- Custom/private benchmarking engagements: bespoke benchmarks, private test harnesses, workshops for companies and AI providers.
- Public site and public benchmarking remain free and independent — being on the public charts is not pay‑to‑play.
Origin story and evolution
- Started as a side project in late 2023, launching publicly around January 2024, to help developers choose models by comparing accuracy, speed, and cost consistently.
- Grew fast as new open‑weights models (e.g., Mistral) and new serverless providers expanded the landscape.
- The founders recognized that paper numbers and lab‑reported metrics were inconsistent (different prompts, formats, etc.), so AA runs standardized, independent evaluations across providers.
- Now ~20 people; the evals have evolved from simple Q&A to multi‑dimensional indices and agentic benchmarks.
Benchmarking methodology & core evals
- Intelligence Index (AII): a single synthesized score composed of ~10 datasets (Q&A, agentic tasks, long‑context reasoning, use‑case‑focused datasets). It’s a “best single number”, but AA also publishes full breakdowns and charts.
- Key methodological principles:
- Run evaluations consistently across all models/providers (same prompts, parsing rules, repeat counts).
- Parse model outputs robustly (answer extraction, regexes, LLM answer extractors where appropriate).
- Run many repeats so reported scores carry tight 95% confidence intervals (this adds cost but reduces variance).
- Mystery‑shopper policy for private endpoints: AA runs blind tests to verify that endpoints labs provide for testing behave the same as the publicly available ones.
- New and notable evals:
- Omniscience index: measures embedded factual knowledge and penalizes confidently wrong answers (scores range from −100 to +100), encouraging “I don’t know” behavior over confident falsehoods (see the scoring sketch after this list).
- Hallucination/Calibration: evaluating whether models decline to answer when unsure (separate from raw intelligence).
- Critical Point: hard physics problems / frontier research questions (very low model success rates; sometimes hallucination is useful for brainstorming).
- GDPVal (AA’s agentic take on OpenAI’s GDPval benchmark): multi‑turn, multi‑file agentic tasks (zip inputs, spreadsheets, presentations). Because outputs are documents and media rather than single correct answers, scoring uses an LM‑based judge plus ELO ratings (see the ELO sketch after this list).
- Agent harness “Stirrup”: open‑sourced harness for generalist agent evaluations (context management, web search/browsing, code execution, minimal toolset).
- Evaluator approach: for GDPVal, AA uses LM judges (e.g., Gemini 3 Pro preview) to compare two outputs against the task criteria; the grading pipeline was validated against human reference judgments.
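The conversation doesn't spell out the exact Omniscience formula, so the snippet below is only a minimal sketch of the idea described above: correct answers score positively, confidently wrong answers score negatively, and abstentions are neutral, with the result rescaled to the −100..+100 range. The `score_omniscience` name, the +1/0/−1 weighting, and the rescaling are illustrative assumptions, not AA's published method.

```python
# Minimal sketch of an Omniscience-style score: correct answers count for,
# confidently wrong answers count against, and abstentions ("I don't know")
# are neutral. The +1 / 0 / -1 weighting and the -100..+100 rescaling are
# illustrative assumptions, not AA's published formula.

def score_omniscience(results):
    """results: list of 'correct' | 'wrong' | 'abstain' per question."""
    if not results:
        return 0.0
    raw = sum(
        1 if r == "correct" else -1 if r == "wrong" else 0
        for r in results
    )
    return 100.0 * raw / len(results)  # scale to the -100..+100 range

# A model that answers 60% correctly, gets 30% wrong, and abstains on 10%
# scores 30; abstaining on those wrong 30% instead would have scored 60.
print(score_omniscience(["correct"] * 60 + ["wrong"] * 30 + ["abstain"] * 10))
```

Under any scheme of this shape, saying "I don't know" strictly dominates guessing wrong, which is exactly the behavior the index is meant to reward.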
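Because GDPVal deliverables are judged pairwise by an LM rather than checked against a single ground truth, an ELO-style rating turns the stream of judge verdicts into comparable scores. The sketch below is a generic ELO update over such verdicts, not AA's actual pipeline; the K-factor, starting rating, and verdict format are arbitrary choices for illustration.

```python
# Generic ELO-style rating over pairwise LM-judge verdicts. This is a sketch
# of the approach described above, not AA's actual GDPVal pipeline; the
# K-factor of 32 and the 1000-point starting rating are arbitrary choices.

from collections import defaultdict

K = 32  # update step size

def expected(r_a, r_b):
    """Probability model A beats model B under the ELO logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, model_a, model_b, a_won):
    """Apply one judge verdict: a_won is 1.0 if the judge preferred A, else 0.0."""
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (a_won - e_a)
    ratings[model_b] += K * ((1.0 - a_won) - (1.0 - e_a))

ratings = defaultdict(lambda: 1000.0)
# verdicts: (model_a, model_b, 1.0 if the LM judge preferred A's deliverable)
verdicts = [("model-x", "model-y", 1.0), ("model-y", "model-x", 0.0)]
for a, b, a_won in verdicts:
    update(ratings, a, b, a_won)
print(dict(ratings))
```

Like arena-style leaderboards, this converts many subjective pairwise preferences into a single rating per model, which is why it suits judged, open-ended outputs.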
Technical, cost and integrity considerations
- Costs:
- Initial costs were low (hundreds of dollars) when models were few and evaluations simple; today costs have increased nonlinearly (many models, repeats, complex agentic tasks).
- AA reports a simplified cost metric on the site (often assuming a single run per eval), but actual internal costs include many repeats for confidence intervals.
- Variance & repeats:
- Short benchmark tests (small question sets, single runs) show high variance; AA runs many repeats to achieve tight confidence intervals (the sketch after this section shows how repeats narrow the interval).
- Parsing & format issues:
- Models return varied formats; deciding whether to penalize format noncompliance depends on the metric’s purpose.
- Endpoint integrity:
- Labs sometimes provide private endpoints for testing; AA uses mystery‑shopper tests and transparency policies to ensure those endpoints aren’t getting special treatment relative to what public customers receive.
- Avoiding benchmark gaming:
- AA recognizes that measured metrics become targets (researchers optimize to the benchmark) and counters this by continually evolving its evals and focusing on metrics tied to real developer use cases.
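As a concrete illustration of the variance point above, here is a minimal sketch of how repeated runs turn a noisy benchmark score into a mean with a 95% confidence interval. It uses a plain normal approximation, and the per-run accuracies are made-up numbers.

```python
# Why repeats matter: a 95% confidence interval on a benchmark score from
# repeated runs. Uses a normal approximation (1.96 * standard error); the
# per-run accuracies below are made-up numbers for illustration.

import statistics

def mean_with_ci95(run_scores):
    """Return (mean, half-width of a 95% CI) over repeated eval runs."""
    n = len(run_scores)
    mean = statistics.mean(run_scores)
    if n < 2:
        return mean, float("inf")  # a single run gives no variance estimate
    sem = statistics.stdev(run_scores) / n ** 0.5  # standard error of the mean
    return mean, 1.96 * sem

single = [71.0]                                  # one run: no error bar at all
repeated = [71.0, 68.5, 73.0, 69.5, 72.0, 70.5]  # six repeats of the same eval
print(mean_with_ci95(single))    # (71.0, inf)
print(mean_with_ci95(repeated))  # roughly (70.75, +/-1.3)
```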
Openness index and hardware benchmarking
- Openness Index (score out of 18): measures how open models are beyond weights/licensing — includes disclosure of pre/post‑training data, training code, and license terms. Purpose: give a holistic view of “how open” a model is.
- Tradeoffs: more openness ≠ better intelligence; openness is scored objectively (what was released), not weighted by industry impact.
- Hardware benchmarking:
- AA added hardware/system benchmarks and renamed its throughput metrics to distinguish per‑query output speed from total system throughput.
- Blackwell (NVIDIA) brings significant gains vs Hopper (~2–3x observed in AA’s framing), but the exact efficiency depends heavily on workload, serving speed targets, model sparsity and serving configuration.
- Sparse/mixture‑of‑experts models change the “active parameter” economics, while total parameter count still correlates strongly with factual knowledge recall (omniscience); a toy calculation below illustrates the arithmetic.
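To make the active-parameter point concrete, here is a toy calculation with an entirely hypothetical MoE configuration: per-token compute tracks the active parameters, while stored knowledge (and hence omniscience-style recall) tracks the total parameter count.

```python
# Toy illustration of MoE "active parameter" economics. All numbers are
# hypothetical; the point is that per-token compute scales with active
# parameters while stored knowledge scales with total parameters.

total_params_b   = 400   # total parameters, in billions (hypothetical)
shared_params_b  = 40    # attention / shared layers always active (hypothetical)
expert_params_b  = 360   # parameters living in expert FFN blocks (hypothetical)
experts_total    = 64    # experts per MoE layer
experts_per_tok  = 4     # experts routed to for each token

active_params_b = shared_params_b + expert_params_b * experts_per_tok / experts_total
print(f"active per token: ~{active_params_b:.0f}B of {total_params_b}B total "
      f"({100 * active_params_b / total_params_b:.0f}% of weights doing the FLOPs)")
# -> active per token: ~62B of 400B total (16% of weights doing the FLOPs)
```

Roughly speaking, serving cost and speed behave closer to the active slice, while factual recall benefits from all of the stored weights, which is the shift in economics noted above.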
Trends, the “smiling curve” and cost dynamics
- Two simultaneous, seemingly contradictory trends:
- The cost to access a given tier of intelligence has dropped dramatically (100x–1000x for some GPT‑4‑level equivalents) thanks to model improvements, cheaper models, and hardware gains.
- Total spend on inference can be much higher now because agentic applications, long‑context reasoning, and multi‑turn workflows consume many more tokens and runtime. Companies are willing to spend more per employee on powerful agentic workflows.
- Practical result: lower per‑unit intelligence cost but higher overall spend when tackling complex, multi‑step tasks at scale — the “smiling curve” where both sides move.
- Token efficiency vs. number of turns:
- It’s important to measure both tokens per turn and the total number of turns needed to solve a task: a model with a higher per‑token price that resolves a task in fewer turns can be cheaper overall (see the worked example after this list).
- Benchmarks are evolving from single‑turn to multi‑turn/agentic evaluations to better capture real application costs.
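A worked example of the turns-versus-price tradeoff flagged in the list above. All prices and token counts are invented; the takeaway is that total task cost is turns × tokens per turn × price per token, so a pricier model that finishes in fewer, tighter turns can still cost less.

```python
# Worked example: total task cost = turns * tokens-per-turn * price-per-token.
# All prices and token counts below are invented for illustration.

def task_cost(turns, tokens_per_turn, usd_per_million_tokens):
    return turns * tokens_per_turn * usd_per_million_tokens / 1_000_000

# "Cheap" model: lower per-token price, but needs many turns and chattier outputs.
cheap  = task_cost(turns=12, tokens_per_turn=6_000, usd_per_million_tokens=1.0)
# "Pricey" model: 10x the per-token price, but solves the task in 3 tight turns.
pricey = task_cost(turns=3,  tokens_per_turn=2_000, usd_per_million_tokens=10.0)

print(f"cheap model:  ${cheap:.3f} per task")   # $0.072
print(f"pricey model: ${pricey:.3f} per task")  # $0.060
```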
What’s new (recent launches) and roadmap (v3 → v4)
- Recent launches discussed:
- Omniscience (hallucination/knowledge index, with held‑out test data to avoid contamination).
- Stirrup (open source agent harness).
- GDPVal integration (agentic ELO scoring).
- Openness index & expanded hardware/performance benchmarks.
- Roadmap (v4 of the Intelligence Index):
- Incorporate GDPVal agentic performance, Critical Point (physics), omniscience/hallucination metrics.
- Careful re‑weighting and versioning to avoid misleading comparisons across versions — v4 will reset weighting choices; historical continuity will require paying attention to version tags.
- Broader direction:
- Move beyond raw intelligence to measure hallucination, calibration, personality/behavioral axes, and use‑case‑specific capabilities important to developers and enterprises.
How developers and enterprises can use Artificial Analysis
- Public site:
- Use the interactive charts to compare models across intelligence, output speed, cost, and openness; you can customize which models are highlighted.
- Check arenas for multimodal (image/video) preference votes and suggest new categories/prompts.
- Paid services:
- Buy benchmark & insight subscriptions for standardized enterprise reports (deployment strategy, cost tradeoffs).
- Commission custom/private benchmarks if you need tailored, confidential evaluation and comparisons.
- Community / tools:
- Try Stirrup (the open‑source agent harness) as a starting point for building or evaluating generalist agents; a generic agent‑loop sketch follows this section.
- Submit requests to AA for new arena categories or target behaviors you want measured — metrics drive labs’ priorities.
- Practical benchmarking advice from AA:
- Don’t trust single‑run, single‑metric comparisons; ask for repeatability, confidence intervals, identical prompts, and full tradeoff charts (accuracy + latency + cost).
- For application cost estimates, use metrics that reflect multi‑turn workflows and token/turn efficiency.
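For readers who want a mental model of what a generalist agent harness does before cloning Stirrup, here is a deliberately generic single-agent loop: keep a running context, let the model choose from a minimal toolset, execute the tool, feed the observation back, and repeat until the model finishes. This is not Stirrup's actual API; every function name, tool name, and message shape below is an illustrative assumption.

```python
# A deliberately generic agent loop: the model picks a tool, the harness runs
# it, the result goes back into context, repeat until the model says it's done.
# Not Stirrup's actual API; every name here is an illustrative assumption.

from typing import Callable

def run_agent(task: str,
              call_model: Callable[[list[dict]], dict],
              tools: dict[str, Callable[[str], str]],
              max_turns: int = 20) -> str:
    messages = [{"role": "user", "content": task}]   # running context
    for _ in range(max_turns):
        action = call_model(messages)                # e.g. {"tool": "web_search", "input": "..."}
        if action.get("tool") == "finish":
            return action["input"]                   # final answer / deliverable
        observation = tools[action["tool"]](action["input"])  # run search, code, etc.
        messages.append({"role": "assistant", "content": str(action)})
        messages.append({"role": "tool", "content": observation})
    return "max turns reached without a final answer"

# A minimal toolset in the spirit described above (stubs for illustration):
tools = {
    "web_search": lambda q: f"(search results for {q!r})",
    "run_python": lambda code: "(stdout of executing the code)",
}

# Tiny fake model that finishes immediately, just to show the call shape:
print(run_agent("say hi", lambda msgs: {"tool": "finish", "input": "hi"}, tools))
```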
Notable quotes and insights
- “No one pays to be on the website. We've been very clear about that from the very start because there's no use doing what we do unless it [is] independent AI benchmarking.”
- “Once an eval becomes the thing that everyone's looking at, the [scores] can get better on it without there being a reflection of overall generalized intelligence… The only way to counter that is to keep building new evals.”
- “The cost of intelligence has been falling dramatically… but it is clearly possible to spend quite a lot more on AI inference now than it was a couple of years ago” — the tradeoff between lower per‑unit cost and larger total spend for agentic workflows.
Practical next steps (action items)
- If you’re a developer or startup:
- Explore AA’s public charts and customize model comparisons relevant to your latency/cost/accuracy targets.
- Clone Stirrup to prototype agentic workflows and re‑use AA’s harness for your own testing.
- If you’re an enterprise:
- Consider an AA benchmark & insight subscription to guide architecture decisions (serverless vs managed vs owning hardware) and to commission private tests.
- Ask for custom agentic benchmarks that mirror your actual workflows (multi‑turn, multi‑file inputs).
- If you’re a researcher/open‑source contributor:
- Contribute new datasets or propose arena categories where models are under‑measured (e.g., specific hallucination cases, domain‑specific agentic tasks).
Final note
Artificial Analysis has evolved quickly from a side project to a 20‑person independent benchmarking house focused on reproducible, multi‑dimensional model comparisons. Their emphasis is on transparent public metrics plus paid enterprise services — they aim to keep benchmarks useful, hard to game, and aligned with real developer/use‑case needs as the field keeps moving fast.
