Summary of How to get multiple agents to play nice at scale Podcast Episode by The Stack Overflow Podcast

Overview of How to get multiple agents to play nice at scale

This Stack Overflow Podcast episode features Ryan Donovan interviewing Stephen Kalesha (Staff Software Engineer) and Chase Ruzin (Engineering Manager) from Intuit about the design, operational, and evaluation challenges of running multiple AI agents at enterprise scale. The guests describe Intuit’s composable AI architecture (built on prior investments like GenOS), why Intuit moved from many bespoke agents to a flattened “skills & tools + central planner” model, and how they handle determinism, evaluation, observability, cost, and uptime in production.

Core topics covered

Intuit’s history and investment in AI infrastructure (GenOS, platform teams, security, guardrails)
The shift from independent agents to a skills-and-tools architecture with a central orchestrator
Similarities and differences between agent architectures and microservices
Evaluations: offline, online, and human-in-the-loop approaches
Handling determinism (math/financial correctness) via tools and primitives
Orchestration patterns: plan-execute, progressive disclosure, and tool calls
Operational concerns: capacity, latency, token cost, observability, fallbacks, load testing
Product/UX considerations: when to surface humans and “done-for-you” actions

Main takeaways

Prior platform investments (security, composition, observability) let teams move quickly; reusable building blocks (like Lego pieces) accelerate product work.
For cross-domain customer queries, a flattened system (central planner that sees all skills/tools) outperforms many siloed sub-agents.
Determinism is achieved by using deterministic tools/primitives for calculations, reports, and other exact tasks, and by storing large context in data files instead of the LLM context window.
Evaluation-first engineering is essential—Intuit runs offline LLM judges, online customer feedback, and human expert labeling to validate agent behavior and guide tuning.
Operational excellence requires new practices: weekly load/performance tests, three-level observability (LLM, gateway, platform), token/cost monitoring, and FMEA-style thinking about fallback models.
UX should balance automation and human oversight: surface humans for sensitive/complex financial answers and allow customers to opt into human review when needed.

Architecture & technical approach

Composable AI “operating system”: central orchestrator + skills & tools + guardrails (security and policy).
Skills and tools model:
- Flatten intelligence into capabilities (skills) that reference deterministic tools.
- Central planner (Intuit Intelligence) sees all skills and creates multi-step plans for cross-domain tasks.
- Progressive disclosure: expose skill metadata/front matter to the planner so it can plan appropriately.
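
Progressive disclosure can be sketched as a registry that shows the planner only short front matter and loads a skill's full instructions once that skill is selected. This is a minimal illustration, not Intuit's implementation: the `Skill` class, the registry layout, and the skill names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    summary: str       # short front matter the planner always sees
    instructions: str  # full body, disclosed only when the skill is selected


# Hypothetical registry; skill names and text are illustrative only.
REGISTRY = {
    "invoice_aging": Skill(
        "invoice_aging",
        "Report overdue invoices by age bucket.",
        "Fetch open invoices, bucket by days overdue, call the report tool.",
    ),
    "payroll_summary": Skill(
        "payroll_summary",
        "Summarize payroll totals for a pay period.",
        "Load payroll runs for the period, total them with the sum tool.",
    ),
}


def planner_view(registry):
    """Return only the lightweight front matter for every skill."""
    return [{"name": s.name, "summary": s.summary} for s in registry.values()]


def load_skill(registry, name):
    """Disclose the full instructions for one selected skill."""
    return registry[name].instructions


catalog = planner_view(REGISTRY)
details = load_skill(REGISTRY, "invoice_aging")
```

The planner prompt would carry `catalog` for every skill, while `details` is fetched only for the skills a plan actually uses, which keeps the context small as the number of skills grows.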
Plan-execute workflow: planner creates sequential steps, calls deterministic tools as needed, and synthesizes a single output for the user.
Determinism strategy:
- Primitive tools handle math, reporting, and exact operations to avoid hallucinations.
- Persist large datasets to files so LLMs reference stable data without bloating context windows.
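
The plan-execute pattern with deterministic primitives can be sketched as follows. The plan format, the tool names (`sum_amounts`, `apply_rate`), and the `$name` reference convention are assumptions for illustration; the plan is hand-written where a planner model would normally produce it, and no LLM is called.

```python
from decimal import Decimal

# Deterministic primitives: plain functions, so the arithmetic never
# depends on a model's output.
def sum_amounts(amounts):
    return sum((Decimal(a) for a in amounts), Decimal("0"))


def apply_rate(amount, rate):
    return (amount * Decimal(rate)).quantize(Decimal("0.01"))


TOOLS = {"sum_amounts": sum_amounts, "apply_rate": apply_rate}


def execute_plan(plan):
    """Run steps in order; a "$name" argument refers to an earlier step's result."""
    results = {}
    for step in plan:
        args = [
            results[a[1:]] if isinstance(a, str) and a.startswith("$") else a
            for a in step["args"]
        ]
        results[step["id"]] = TOOLS[step["tool"]](*args)
    return results


# A plan as a planner might emit it (hypothetical shape).
plan = [
    {"id": "total", "tool": "sum_amounts", "args": [["100.10", "200.20"]]},
    {"id": "tax", "tool": "apply_rate", "args": ["$total", "0.0825"]},
]
results = execute_plan(plan)
```

In this sketch the model's job is limited to choosing tools and ordering steps; the numbers in `results` come entirely from the tool functions, and a final synthesis step would only phrase them for the user.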

Evaluations & monitoring

Evaluation types

Offline evals: teams contribute golden datasets; LLM judges run tests to check determinism, tool usage, and conversation quality.
Online evals: sampling of real user interactions and metrics to confirm that customer expectations are being met.
Human evals: expert labeling and human review for critical cases and building ground truth.
Continuous tuning: iterate on judges, prompts, intent detection, and agents based on evaluation feedback.
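
An offline evaluation run over a golden dataset might look like the sketch below. Everything here is a stand-in: `stub_agent` replaces the real system under test, `exact_judge` performs only exact-match checks (an LLM judge for conversation quality would be an additional judge function), and the golden cases are invented for the example.

```python
# Hypothetical golden cases: a prompt, the tool the agent should call,
# and the exact answer expected.
GOLDEN = [
    {"prompt": "Total of 2 and 3?", "expected_tool": "add", "expected_answer": "5"},
    {"prompt": "Total of 10 and 5?", "expected_tool": "add", "expected_answer": "15"},
]


def stub_agent(prompt):
    """Stand-in for the system under test; a real run would call the agent."""
    tokens = [tok.strip("?") for tok in prompt.split()]
    numbers = [int(tok) for tok in tokens if tok.isdigit()]
    return {"tool": "add", "answer": str(sum(numbers))}


def exact_judge(case, output):
    """Deterministic checks on tool usage and the final answer."""
    return {
        "tool_ok": output["tool"] == case["expected_tool"],
        "answer_ok": output["answer"] == case["expected_answer"],
    }


def run_offline_eval(agent, judge, cases, repeats=3):
    report = []
    for case in cases:
        outputs = [agent(case["prompt"]) for _ in range(repeats)]
        verdicts = [judge(case, out) for out in outputs]
        report.append({
            "prompt": case["prompt"],
            "passed": all(all(v.values()) for v in verdicts),
            # The same answer on every repeat is a cheap determinism signal.
            "stable": len({out["answer"] for out in outputs}) == 1,
        })
    return report


report = run_offline_eval(stub_agent, exact_judge, GOLDEN)
```

Repeating each case several times is one simple way to check determinism alongside correctness; the repeat count of 3 is arbitrary.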

Observability & telemetry

Three observability layers: LLM-level (tokens, traces), gateway-level, and platform-level analytics.
Monitor token usage, response latencies, tool invocation counts, and model fallbacks.
Regular load/performance tests to validate capacity and behavior under scale.
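
The signals listed above could be captured per call and compared against a baseline, as in this sketch. The `CallRecord` fields, the sample numbers, and the 1.5x spike threshold are assumptions chosen for illustration, not values from the episode.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallRecord:
    tokens_in: int
    tokens_out: int
    latency_ms: float
    tool_calls: int
    used_fallback: bool


def summarize(records):
    """Aggregate the per-call signals into one snapshot."""
    return {
        "avg_tokens": mean(r.tokens_in + r.tokens_out for r in records),
        "max_latency_ms": max(r.latency_ms for r in records),
        "avg_tool_calls": mean(r.tool_calls for r in records),
        "fallback_rate": sum(r.used_fallback for r in records) / len(records),
    }


def token_spike(baseline_avg, current_avg, factor=1.5):
    """Flag when average token usage exceeds factor x the baseline (factor is arbitrary)."""
    return current_avg > factor * baseline_avg


# Invented sample data: before and after a hypothetical prompt change.
before = summarize([
    CallRecord(1000, 200, 800.0, 2, False),
    CallRecord(1200, 300, 950.0, 3, False),
])
after = summarize([
    CallRecord(2500, 400, 1400.0, 2, True),
    CallRecord(2700, 300, 1500.0, 2, False),
])
spiked = token_spike(before["avg_tokens"], after["avg_tokens"])
```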

Operational & cost considerations

Capacity constraints differ from classic microservices: LLM token costs, variable latency, and model availability change operational planning.
Token spend and cloud infra are significant and variable; costs depend on input/output token volume and frequency of usage (e.g., large PDFs).
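
The dependence of cost on token volume can be made concrete with a small calculation. The per-1,000-token prices and the token counts below are hypothetical placeholders, not any provider's actual rates.

```python
from decimal import Decimal


def request_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one request: tokens / 1000 x price, summed over input and output."""
    cost_in = Decimal(input_tokens) / 1000 * Decimal(price_in_per_1k)
    cost_out = Decimal(output_tokens) / 1000 * Decimal(price_out_per_1k)
    return cost_in + cost_out


# Hypothetical prices per 1,000 tokens.
PRICE_IN, PRICE_OUT = "0.003", "0.015"

short_query = request_cost(500, 300, PRICE_IN, PRICE_OUT)
large_pdf = request_cost(80_000, 1_000, PRICE_IN, PRICE_OUT)
ratio = large_pdf / short_query
```

Under these assumed numbers a single large-document request costs about 40 times as much as a short query, which is why input size and request frequency dominate spend.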
Best practices:
- Prioritize customer value first; optimize costs iteratively.
- Maintain observability to detect prompt or tool changes that spike token usage.
- Evaluate fallback models: test secondary models ahead of time, since their outputs and personality can differ from the primary model's.
- Weekly performance/load tests and FMEA-style analysis for model and service failures.
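
A model fallback chain can be sketched as trying each model in order and recording which one answered, so the fallback rate can be monitored and fallback outputs evaluated separately. The `ModelUnavailable` exception and the stub model functions are hypothetical; a real client library would raise its own error types.

```python
class ModelUnavailable(Exception):
    """Hypothetical error raised when a model endpoint cannot serve a request."""


def call_with_fallback(prompt, models):
    """Try each (name, callable) in order; return (name, response) from the first success."""
    errors = []
    for name, call in models:
        try:
            return name, call(prompt)
        except ModelUnavailable as exc:
            errors.append((name, str(exc)))
    raise RuntimeError(f"all models failed: {errors}")


# Stub models standing in for real endpoints.
def primary(prompt):
    raise ModelUnavailable("capacity exceeded")


def secondary(prompt):
    return f"[secondary] answer to: {prompt}"


served_by, response = call_with_fallback(
    "Summarize overdue invoices", [("primary", primary), ("secondary", secondary)]
)
```

Returning the model name with the response makes it possible to tag telemetry and run the same evaluations against fallback traffic as against primary traffic.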

When to involve humans

Use sampling and complexity heuristics to decide when to route to a human.
For financial/legal/regulated contexts, give customers an option to bring in an expert or escalate to human review.
Humans are integral both in production support and in the evaluation loop (human labeling to improve judgments).
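
One way to combine customer opt-in, complexity heuristics, and sampling into a routing decision is sketched below. The topic list, the step threshold, and the 5% sample rate are illustrative assumptions, not rules described in the episode.

```python
import hashlib

SENSITIVE_TOPICS = {"tax", "legal", "payroll"}  # hypothetical list


def sampled(conversation_id, rate):
    """Deterministically sample a fixed fraction of conversations by hashing the id."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


def route(conversation_id, topics, plan_steps, customer_requested_expert,
          sample_rate=0.05, max_steps=5):
    # The customer's explicit request for an expert always wins.
    if customer_requested_expert:
        return "human"
    # Complexity heuristic: sensitive topic plus a long plan (threshold is arbitrary).
    if topics & SENSITIVE_TOPICS and plan_steps > max_steps:
        return "human"
    # A sampled slice goes to human review to build ground truth.
    if sampled(conversation_id, sample_rate):
        return "human_review_sample"
    return "automated"


decision = route("conv-1", {"tax"}, plan_steps=8, customer_requested_expert=False)
```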

Future direction (what Intuit is focusing on)

Continue maturing the skills & tools approach to reduce customer effort (move toward “done for you” actions where work is completed automatically).
Expand evaluation tooling and operational guardrails to keep pace with rapidly evolving foundation models.
Keep architecture nimble to adopt and evaluate frontier models and maintain product velocity.

Notable quotes / concise insights

“It’s like you want to build an organization and those are your employees — how do we make them all work together?” — on thinking of agents as employees with roles.
“Tools are definitely the way that we deal with determinism.” — use deterministic primitives for correctness.
“We’re evaluation-first here, first and foremost.” — emphasis on continuous testing and human-in-the-loop evaluation.
“The ideal utopia is you come in and it’s like, hey, the work’s done for you.” — product goal of full automation where appropriate.

Actionable recommendations (for teams building multi-agent systems)

Invest early in platform-level building blocks: security, telemetry, and composable services.
Prefer a central planning layer that can see available skills/tools when cross-domain reasoning is required.
Implement deterministic tool primitives for calculations, reports, and any safety-critical outputs.
Build a three-tier evaluation pipeline: offline automated tests, online sampling, and human labeling for edge/sensitive cases.
Add strong observability (token usage, latencies, traces) and run regular load tests to uncover scaling issues.
Plan for model fallbacks and evaluate alternate models’ outputs/personality—not just the primary model.
Prioritize customer correctness and safety first; optimize cost and performance iteratively.

One-sentence summary: Intuit moved from many isolated agents to a central planner plus skills-and-tools architecture, backed by rigorous evaluation, deterministic primitives, and enterprise observability, so it can answer cross-domain customer questions safely and at scale.
