Overview of Practical AI — “AI incidents, audits, and the limits of benchmarks”
This episode of the Practical AI podcast (hosts Daniel Whitenack and Chris Benson) features Sean McGregor — co‑founder and lead research engineer at the AI Verification and Evaluation Research Institute (AVERY) and founder of the AI Incident Database. The conversation covers what counts as an “AI incident,” how incidents are collected and used to improve safety, the role of third‑party audits versus benchmarks, failures that arise when systems are composed (e.g., guard model + base model), and lessons learned from a DEF CON red‑teaming exercise.
Key topics discussed
- Sean McGregor’s background (reinforcement learning for wildfire policy, chip work, test & evaluation company, sale to Underwriters Laboratories, formation of AVERY).
- The AI Incident Database: what it contains and why incident reporting matters.
- Definitions and terminology around incidents, vulnerabilities, harms.
- Sources of incident data (journalism, voluntary reports) and the limits of voluntary reporting vs. need for mandatory reporting.
- Third‑party audits and verification for frontier/general‑purpose models.
- Benchmarks vs. audits: why benchmark scores can mislead in practical deployments.
- DEF CON Generative Red Team exercise (To Err as AI): methodology, findings, and the need for statistical / systematic evidence of vulnerabilities.
- Recommendations for improving evaluation, reporting, and operational safety.
Main takeaways
- Incident = an event where a harm has taken place. It’s intentionally broad to capture safety, security, abuse, and unintended harms.
- The AI Incident Database has thousands of human‑annotated reports (over 5,000 annotations across >1,000 incident records) and plays a role like aviation/medical adverse‑event databases: learn from incidents so they don't repeat.
- Most publicly available incident reports today come from journalism; that creates coverage bias and makes it hard to estimate rates. Mandatory reporting (e.g., proposed EU rules) would improve visibility and measurement.
- Benchmarks are often created for research and understanding, not as guarantees for deployment. They frequently fail to represent the distribution or context of a specific real‑world application.
- Meta‑evaluation (“benchmarking the benchmarks”) is necessary — audits should check whether evaluation evidence supports claims being made.
- Composed systems (e.g., guard model + base model) create exploitable handoffs. These integration points are under‑tested and commonly cause failures.
- Security communities (hackers) bring attack skills; combining that with statistical rigor (to show systematic vulnerabilities) produces more useful flaw reports than anecdotes alone.
- Practical safety and business viability go together: incidents can and do affect companies’ reputations and stock prices.
Notable quotes / insights
- “You don't want a bad thing to happen, and you don't want that bad thing to produce a harm.” — definition rationale for “incident.”
- “We need to switch from voluntary reporting to more mandatory reporting” — on improving incident visibility and measuring rates.
- “Benchmarks were produced for research purposes, not practical AI purposes.” — on why benchmark leaderboards can mislead deployers.
- “Manage what you measure.” — emphasis on measuring risk to manage it.
- Anecdote ≠ data: security exploits must be demonstrated as systematic (statistical evidence), not one‑off occurrences, to inform remediation.
Case studies & concrete examples
- Traffic citation error: a traffic camera system misread a shirt/purse strap pattern as a license plate, producing wrongly mailed citations — demonstrates real‑world brittleness and edge cases.
- DEF CON Generative Red Team 2: a challenge pairing a 7B open language model with a guard model. Attackers found the guard/base‑model handoff configuration exploitable, but many submissions were anecdotal, so organizers required statistical evidence that a vulnerability was systematic. Outcome: demonstrated the need for formalized flaw reporting and adjudication methods that go beyond “look, it broke once.”
- Benchmark meta‑evaluation (Ventress project): found that many benchmarks lack practical guarantees and fail to capture real‑world failure modes and distributional differences.
Why benchmarks can be misleading (concise)
- Purpose mismatch: many benchmarks are for research/knowledge, not deployment assurance.
- Distribution mismatch: benchmark data and prompts may not match a deployer’s user population or prompts.
- Overreliance on leaderboard points ignores context and integration effects.
- Benchmarks often don’t evaluate composed systems or the guard/model handoff.
- Meta‑evaluation is required to check whether benchmark evidence supports claimed properties.
Practical recommendations / action items for organizations
- Treat incident reporting as part of safety: log incidents, encourage reporting, and push for standardized internal processes.
- Adopt third‑party audits for high‑risk or general‑purpose models (audits validate claims and check evidence).
- Run pilot programs in your deployment context — don’t assume benchmark results generalize.
- Test composition points: explicitly evaluate the interfaces between guards, filters, and base models.
- Demand statistical evidence of systematic vulnerabilities (not only anecdotes) when doing red teaming.
- Establish or participate in a flaw‑reporting / bug‑bounty program adapted for ML (flaw reports for model/data issues).
- Track metrics and risks you care about; “manage what you measure.”
- Monitor regulation developments (e.g., EU severe‑incident reporting rules) and prepare for mandatory reporting requirements.
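The “statistical evidence” recommendation above can be sketched as a simple hypothesis test: instead of reporting a single successful jailbreak, repeat the attack under controlled conditions, count successes, and test whether the observed success rate exceeds a rate you would tolerate as noise. A minimal stdlib-only sketch in Python; the counts and the 5% baseline are illustrative assumptions, not figures from the episode:

```python
from math import comb

def binomial_p_value(successes: int, trials: int, baseline_rate: float) -> float:
    """One-sided p-value: probability of seeing at least `successes` exploit
    successes in `trials` attempts if the true success rate were only
    `baseline_rate` (the level we'd dismiss as anecdotal noise)."""
    return sum(
        comb(trials, k) * baseline_rate**k * (1 - baseline_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Illustrative numbers: 14 guard-model bypasses in 50 scripted attempts,
# tested against a tolerated background rate of 5%.
p = binomial_p_value(14, 50, 0.05)
print(f"p-value: {p:.2e}")  # a small p-value indicates a systematic flaw
```

A report built this way ("14/50 attempts succeed, p < 0.001 against a 5% baseline") gives a vendor something to triage and remediate, which a single screenshot of one broken response does not.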
Resources mentioned / next steps
- AI Incident Database (founder: Sean McGregor) — the incident collection used to analyze harms.
- AI Verification and Evaluation Research Institute (AVERY) — third‑party audit and meta‑evaluation work.
- DEF CON AI Village / Generative Red Team (To Err as AI) — example red‑teaming exercise and paper.
- Underwriters Laboratories (UL) — referenced as the organization where Sean worked after the sale of his test & evaluation company.
- Practical AI website: practicalai.fm (webinars and episode notes).
- Prediction Guard (partner/support noted on the podcast).
Who should listen / why it matters
- Product leaders and engineers deploying ML/LLM systems — to understand limits of benchmarks and the need for pilots, audits, and integration testing.
- Security and compliance teams — to learn why incidents need reporting and how red teams should provide statistical proof of exploitability.
- Research and evaluation teams — to consider meta‑evaluation and building practical benchmarks with deployment context in mind.
- Executives and investors — to appreciate why safety/auditing is becoming table stakes and may affect liability and investor decisions.
If you want to prioritize one practical step from this episode: start by inventorying your AI systems and defining a simple internal flaw‑reporting and logging process that captures incidents, handoff points, and context, so you can begin measuring the specific risks you face.
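That closing step (inventory your systems, then log incidents with their handoff points and context) can be sketched as a tiny append-only log. A minimal sketch in Python, where the field names, example values, and JSONL layout are my assumptions rather than a schema from the episode:

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class IncidentRecord:
    """One internal incident/flaw report. Fields follow the episode's advice
    to capture the system, the handoff point, and the deployment context."""
    system: str           # which deployed AI system (from your inventory)
    handoff_point: str    # e.g. "guard model -> base model"
    context: str          # deployment context where the failure occurred
    description: str      # what happened, and what harm (if any) resulted
    systematic: bool      # backed by repeated trials, or a one-off anecdote?
    reported_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def log_incident(record: IncidentRecord, path: str = "incidents.jsonl") -> None:
    """Append the record as one JSON line, so reports accumulate into a
    dataset you can later measure rates against ("manage what you measure")."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_incident(IncidentRecord(
    system="support-chatbot-v2",
    handoff_point="content filter -> LLM",
    context="customer support, EU users",
    description="Filter bypassed via pasted invoice text; no harm reached users.",
    systematic=False,
))
```

Even a log this simple covers the handoff points the episode flags as under-tested, and it leaves you positioned for the mandatory-reporting requirements discussed above.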
