Overview of AI Agent Failure Modes (The Agents Season, Episode 6)
Ben Jaffe and Katie Malone take a reality-check look at why AI agents fail, how those failures compound over multi-step workflows, and what current benchmarks actually say about agent reliability. The episode argues that while agents are improving quickly—especially on narrow coding tasks—they are still far from consistently dependable on realistic, multi-step work, and multi-agent setups introduce their own coordination failures.
Why AI Agent Failures Compound
A central theme of the episode is that agent failures are multiplicative, not additive.
The math of multi-step tasks
- If an agent is 90% reliable per step (treating steps as independent), then:
  - A 10-step task has only about 35% end-to-end success
  - A 100-step task becomes effectively 0%
- Even 99% per step only gets you to about 36.6% success over 100 steps
- To get above 90% end-to-end on a 100-step task, you need roughly 99.9% per-step reliability
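The arithmetic behind these figures can be reproduced in a few lines. This is a minimal sketch that assumes every step succeeds or fails independently with the same probability, which real agent workflows only approximate; the function names are illustrative, not from the episode.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach a target end-to-end success rate."""
    return target ** (1.0 / steps)


# Per-step reliability and task length pairs discussed in the text.
for p, n in [(0.90, 10), (0.90, 100), (0.99, 100)]:
    print(f"{p:.0%} per step over {n} steps -> {end_to_end_success(p, n):.1%}")

# Threshold for 90% end-to-end success on a 100-step task.
print(f"needed per step: {required_per_step(0.90, 100):.3%}")
```

Under this independence assumption the threshold for 90% end-to-end success over 100 steps works out to just under 99.9% per step.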
Why one mistake poisons later steps
- A failure early in a workflow changes the state that later steps depend on
- The agent often doesn’t realize it made a mistake, so it keeps reasoning from a corrupted state
- This makes agent errors harder to notice than chatbot errors, because the damage can stay hidden until much later in the process
What the Benchmarks Show
The episode uses benchmarks to show both real progress and remaining limitations.
TAU-bench: customer service and tool use
- TAU-bench evaluates agentic customer-service tasks like:
  - Returning purchases
  - Changing airline reservations
- It emphasizes pass^k: whether the agent succeeds on all k repeated runs of a task, not just once
- Early results:
  - Best models in 2024 were below 50% on the base benchmark
  - pass^8 in retail was below 25%
- Current leaderboard performance is much better:
  - Top models are now around 60–70%
  - But reliability still drops sharply when you require repeated success
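The gap between succeeding once and succeeding every time can be made concrete. The sketch below contrasts two per-task estimators computed from n recorded trials with c successes: "all k sampled trials succeed" (commonly written pass^k) and "at least one of k sampled trials succeeds" (commonly written pass@k). The function names and the example counts are assumptions for illustration, not figures from the benchmark.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that k trials drawn from n recorded trials ALL succeeded."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that AT LEAST ONE of k trials drawn from n recorded trials succeeded."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("need 0 < k <= n and 0 <= c <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 6 successes out of 8 recorded trials.
n_trials, n_success = 8, 6
for k in (1, 2, 4, 8):
    print(k, round(pass_hat_k(n_trials, n_success, k), 3),
          round(pass_at_k(n_trials, n_success, k), 3))
```

With 6 successes in 8 trials, the single-run rate is 0.75, but the "all 8 succeed" score is 0, which is why a consistency metric can look much worse than a headline success rate. A benchmark-level score would average these per-task values across tasks.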
SWE-bench Verified: coding agents on software issues
- A major coding benchmark built from real GitHub issues
- Performance has improved dramatically:
  - Claude 2 in 2023 resolved <2%
  - By late 2024, top systems passed 50%
  - In early 2026, reported scores were 80%+
  - Most recent result mentioned was about 94%
- This is a huge jump, but the hosts caution that it does not mean coding agents are “solved”
SWE-bench Mobile: harder, more realistic engineering
- A tougher benchmark for more industry-like mobile app development tasks
- Even the best systems were only around 12% success
- This shows how much benchmark results depend on task scope and realism
Important caveat: benchmark scores are not just model scores
The hosts stress that benchmark performance depends on:
- The base model
- The agent scaffold/harness
- Tools and APIs
- Memory management
- Evaluation setup
In some cases, the same model's score can vary by up to 6x depending on the surrounding agent architecture.
Common Failure Modes in Multi-Agent Systems
The episode highlights a 2025 UC Berkeley paper that analyzed more than 1,600 execution traces across seven agent frameworks and proposed a failure taxonomy called MAST (Multi-Agent System Failure Taxonomy). It groups failures into three categories.
1. Specification and system design failures
These happen when the setup itself is flawed or the agent diverges early. Examples include:
- Ignoring or violating task instructions
- Repeating steps unnecessarily
- Getting stuck in loops
- Losing conversation history or context
- Poor task framing or context management
2. Inter-agent misalignment
These are failures unique to multi-agent systems, where coordination breaks down. Examples include:
- One agent hands off output that another agent cannot parse
- Agents reach contradictory conclusions with no resolution
- An orchestrator gives ambiguous instructions, causing sub-agents to diverge
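One common mitigation for unparseable handoffs is to validate every message against an agreed schema before the receiving agent acts on it. The sketch below is an illustration only: the `Handoff` fields, the allowed status values, and the JSON wire format are all assumptions, not something the episode or the paper prescribes.

```python
import json
from dataclasses import dataclass

# Hypothetical contract between two agents.
REQUIRED_FIELDS = ("task_id", "status", "result")
ALLOWED_STATUS = {"done", "failed"}


@dataclass(frozen=True)
class Handoff:
    task_id: str
    status: str
    result: str


def parse_handoff(raw: str) -> Handoff:
    """Parse and validate a handoff message; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"handoff is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("handoff must be a JSON object")
    for name in REQUIRED_FIELDS:
        if not isinstance(data.get(name), str):
            raise ValueError(f"handoff field {name!r} is missing or not a string")
    if data["status"] not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {data['status']!r}")
    return Handoff(data["task_id"], data["status"], data["result"])


print(parse_handoff('{"task_id": "t1", "status": "done", "result": "refund issued"}'))
```

Rejecting a bad message at the boundary turns a silent downstream failure into an explicit error the orchestrator can retry or escalate.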
3. Task verification and termination failures
These are “knowing when you’re done” problems. Examples include:
- Thinking the task is complete when it isn’t
- Producing outputs that look finished but don’t satisfy requirements
- Overshooting and doing more work than asked for
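These failures suggest keeping the "am I done?" decision outside the agent's own judgment. The sketch below shows one way to do that under simple assumptions: an external `verify` check decides completion, and a step budget bounds the work. `run_until_verified`, `toy_step`, and `toy_verify` are hypothetical stand-ins, not an API from any agent framework.

```python
from typing import Callable, List, Optional


def run_until_verified(
    step: Callable[[List[str]], str],
    verify: Callable[[str], bool],
    max_steps: int = 5,
) -> Optional[str]:
    """Call step() until verify() accepts its output or the step budget runs out."""
    history: List[str] = []
    for _ in range(max_steps):
        output = step(history)
        if verify(output):
            return output  # stop as soon as the requirement is met
        history.append(output)  # keep rejected attempts as context for the next try
    return None  # budget exhausted: report failure instead of claiming success


def toy_step(history: List[str]) -> str:
    """Stand-in for an agent call: only the third attempt includes tests."""
    return "draft" if len(history) < 2 else "draft with tests"


def toy_verify(output: str) -> bool:
    """Stand-in for a requirement check, e.g. running a test suite."""
    return "tests" in output


print(run_until_verified(toy_step, toy_verify))
print(run_until_verified(toy_step, toy_verify, max_steps=2))
```

Returning on the first verified output guards against overshooting, and returning `None` when the budget runs out avoids reporting unfinished work as complete.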
Are Multi-Agent Systems Better?
Not necessarily.
The episode points out that while multi-agent systems can provide:
- Specialization
- Parallelism
- Cross-checking
they also introduce:
- Communication overhead
- Handoff errors
- Misalignment between agents
In the Berkeley study, multi-agent setups sometimes performed worse than single-agent systems on the same tasks. The takeaway: coordination benefits are real, but they come with coordination costs.
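A toy model shows how coordination costs can cancel out specialization gains. It assumes a linear pipeline in which every agent and every handoff must succeed independently. All probabilities below are invented for illustration and are not results from the Berkeley study.

```python
from typing import List


def pipeline_success(agent_reliabilities: List[float], handoff_reliability: float) -> float:
    """End-to-end success of a linear multi-agent pipeline under independence."""
    success = 1.0
    for p in agent_reliabilities:
        success *= p
    # A pipeline of m agents has m - 1 handoffs between them.
    handoffs = max(len(agent_reliabilities) - 1, 0)
    return success * handoff_reliability ** handoffs


single_agent = 0.80        # hypothetical generalist doing the whole task
specialists = [0.95] * 3   # hypothetical specialists, each better at its own part

for handoff in (0.90, 0.99):
    multi = pipeline_success(specialists, handoff)
    print(f"handoff {handoff:.0%}: multi-agent {multi:.1%} vs single-agent {single_agent:.0%}")
```

In this made-up setup the three specialists lose to the single agent when handoffs are 90% reliable and beat it when handoffs are 99% reliable, so the quality of the coordination layer decides the outcome.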
Main Takeaways
- Agent failures compound quickly in multi-step workflows
- Benchmark progress is real, especially for narrow coding tasks
- But high benchmark scores do not mean general reliability
- Real-world task complexity still exposes major gaps
- Multi-agent systems are not automatically better than single-agent ones
- The best results often come from a combination of:
  - Better models
  - Better scaffolding
  - Better task scoping
  - Better human usage patterns
Practical Implications for Builders and Users
For builders
- Measure end-to-end reliability, not just single-step accuracy
- Be cautious about introducing multi-agent coordination unless it clearly helps
- Design robust verification and stopping conditions
- Treat benchmark scores as context-dependent, not universal truths
For users
- Scope tasks carefully
- Check in on agent output at appropriate intervals
- Use agents for tasks where their strengths match the problem
- Expect much better results when you learn how to prompt, supervise, and structure the work effectively
Bottom Line
AI agents are getting much better, but they are still not consistently reliable on complex, multi-step work. The episode’s core message is that agent failure is both a mathematical problem and an architectural one: success rates drop as tasks get longer, and multi-agent coordination introduces new failure modes. The result is a landscape where agents can be impressively useful in the right settings, but still fragile enough that careful design and human oversight matter a lot.