Overview of “How can you test your code when you don’t know what’s in it?” (The Stack Overflow Podcast)
This episode (host Ryan Donovan; guest Fitz Nolan, VP of AI & Architecture at SmartBear) explores the testing challenges introduced by agentic LLM workflows and the Model Context Protocol (MCP). The core problem: agents and LLMs choose tool invocations dynamically, which breaks many traditional assumptions about deterministic workflows and makes standard testing approaches (unit tests, rigid assertions) insufficient on their own. Fitz explains practical strategies for testing MCP-driven systems, the evolving role of unit tests, what AI-native QA looks like, and what companies can still monetize as AI lowers the barrier to building basic apps.
Key takeaways
- MCP and agentic workflows introduce non-determinism: LLMs decide which tools to call and when, so workflows are probabilistic rather than strictly deterministic.
- Two primary testing approaches:
  - Skeleton/workflow assertions: validate that a sequence or pattern of tool invocations happens (ordered or unordered).
  - Evals (LLM-based evaluations): use LLMs to judge the outputs of other LLMs/agents; open-ended and iterative to get right.
- Unit tests still matter mainly for regression/continuity, but are less useful as sole proof that AI-authored code does the intended thing.
- AI-native QA platforms must operate at higher abstraction (intent, functionality, requirements) and combine automated AI checks with human oversight.
- Data is the new defensible asset: “data locality” and “data construction” (how you compose/transform data with AI) are likely to remain monetizable differentiators.
- Expect a spectrum of adoption: regulated sectors and large legacy systems will move slower; on-prem/local agent deployments and privacy-focused desktop solutions will grow.
Topics discussed
- Fitz Nolan’s background (CS PhD, startup Reflect, acquisition by SmartBear; now focuses on AI features across SmartBear products like Swagger/OpenAPI and Bugsnag)
- What MCP is and why it changes testing assumptions
- Non-determinism in agent decision-making and routing
- Strategies for making tests reproducible (workflow skeletons, evals)
- Role and limits of unit tests when code and tests may both be AI-generated
- AI-native QA: testing for intent, functionality, and requirements with human-in-the-loop oversight
- Vision models/OCR improvements enabling richer UI-level/visual testing
- Business implications: commoditization of CRUD, monetizable edges (data locality/construction), possible move to on-prem/local agents for privacy and control
- Legacy software and industries (finance, defense, healthcare) likely slower to adopt full AI-driven code generation/testing
Challenges and recommended approaches
Challenge: Non-deterministic tool invocation
- Problem: Agents may take multiple valid paths; you can’t rely on deterministic sequences.
- Recommended approach:
  - Define named workflows (user journeys) and assert skeletons (e.g., Tool A → Tool B → Tool C) rather than exact syntactic matches.
  - Create routing prompts/heuristics for common cases, but avoid overfitting to brittle “magical incantation” prompts.
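The skeleton idea above can be sketched as a plain assertion helper. This is a minimal illustration, not anything from the episode: the tool names and trace format are hypothetical, and a real harness would capture the trace from the agent's MCP tool-call log.

```python
# Sketch: assert that an agent's observed tool-call trace matches a workflow
# "skeleton". Extra calls are allowed; only the skeleton steps are asserted.

def matches_skeleton(trace, skeleton, ordered=True):
    """Return True if every skeleton step appears in the trace.

    ordered=True  -> steps must appear as a subsequence (extra calls allowed)
    ordered=False -> steps may appear anywhere, in any order
    """
    if ordered:
        it = iter(trace)
        # `step in it` consumes the iterator, enforcing relative order.
        return all(step in it for step in skeleton)
    return all(step in trace for step in skeleton)

# The agent may take different valid paths; we assert the skeleton,
# not the exact sequence. Tool names below are illustrative.
observed = ["search_flights", "summarize_results", "check_availability", "book_flight"]

assert matches_skeleton(observed, ["search_flights", "check_availability", "book_flight"])
assert matches_skeleton(observed, ["book_flight", "search_flights"], ordered=False)
assert not matches_skeleton(observed, ["book_flight", "search_flights"])  # wrong order
```

The ordered variant tolerates the non-determinism the episode describes: the agent can insert extra tool calls between skeleton steps without failing the test.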
Challenge: Evaluating correctness when LLMs can generate passing-but-meaningless tests
- Problem: An LLM can write unit tests that always pass (assert true) or craft tests to match its own code.
- Recommended approach:
  - Use higher-level, application-oriented tests (interaction-level: UI clicks, CLI commands, end-to-end flows).
  - Use separate eval LLMs or human reviewers to assess correctness and commonsense behavior.
  - Keep unit tests for regression/protection against unintended changes, not as sole proof of correctness.
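An eval harness along these lines can be sketched with a pluggable judge. Everything here is an assumption for illustration: `keyword_judge` is a trivial stand-in, and in practice `judge_fn` would call a *separate* LLM with a rubric prompt, as the episode suggests, so the model under test cannot grade its own work.

```python
# Sketch of an "eval" harness: a judge function scores an agent's open-ended
# output against a rubric, instead of a brittle exact-match assertion.

def run_eval(output, rubric, judge_fn, threshold=0.7):
    """Score an output with a judge; pass if the score clears the threshold."""
    score = judge_fn(output, rubric)  # expected to return a 0.0-1.0 score
    return {"score": score, "passed": score >= threshold}

def keyword_judge(output, rubric):
    """Trivial stand-in judge: fraction of rubric keywords present.
    A real judge would be a different LLM given a grading prompt."""
    hits = sum(1 for kw in rubric["must_mention"] if kw.lower() in output.lower())
    return hits / len(rubric["must_mention"])

rubric = {"must_mention": ["refund", "7 days", "original payment method"]}
answer = "Refunds are issued to the original payment method within 7 days."

result = run_eval(answer, rubric, keyword_judge)
assert result["passed"] and result["score"] == 1.0
```

Keeping the judge behind a function boundary makes it easy to swap the keyword stand-in for an LLM call, or to route borderline scores to a human reviewer.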
Challenge: Rapid dev velocity and maintaining QA coverage
- Problem: Dev velocity increases dramatically; QA must scale.
- Recommended approach:
  - Build AI-native QA agents that can run at scale, validating functionality against the latest PR/spec.
  - Implement human oversight for configuration, edge-case review, and governance.
  - Combine deterministic checks (APIs, schemas) with probabilistic checks (evals, commonsense validation).
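The "deterministic plus probabilistic" layering can be sketched as two checks run in sequence. The field names and the commonsense heuristic below are illustrative assumptions; in a real pipeline the schema check might use a JSON Schema validator, and the commonsense layer would escalate to an eval LLM or a human reviewer rather than hard-coded rules.

```python
# Sketch: layer a deterministic schema check under a probabilistic
# "does this make sense?" check on an agent's API response.

def check_schema(payload, required_fields):
    """Deterministic: the response must contain every required key."""
    missing = [f for f in required_fields if f not in payload]
    return (len(missing) == 0, missing)

def commonsense_check(payload):
    """Probabilistic stand-in: flag obviously nonsensical values.
    A real system might route flagged cases to an eval LLM or a human."""
    issues = []
    if payload.get("total_price", 0) < 0:
        issues.append("negative price")
    if payload.get("eta_minutes", 0) > 24 * 60:
        issues.append("ETA longer than a day")
    return (len(issues) == 0, issues)

payload = {"order_id": "A-123", "total_price": 19.99, "eta_minutes": 45}

ok_schema, missing = check_schema(payload, ["order_id", "total_price", "eta_minutes"])
ok_sense, issues = commonsense_check(payload)
assert ok_schema and ok_sense
```

Running the cheap deterministic check first means the expensive probabilistic layer (eval calls, human review) only sees structurally valid outputs.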
Practical actions / checklist for teams
- Identify and document key user journeys; implement workflow-skeleton tests for those journeys.
- Build eval suites: use LLMs (and humans) to judge open-ended outputs rather than relying solely on brittle assertions.
- Keep unit tests for regression: ensure continuity when models or AI-generated changes alter behavior.
- Avoid “prompt overfitting”: design prompts and constraints that generalize, so quality improves with newer models instead of degrading.
- Invest in UI/visual testing using vision-capable models (OCR and image understanding are improving rapidly).
- Protect data and differentiate your offering by:
  - Emphasizing data locality (unique, rich customer data).
  - Packaging unique multi-step prompt/data constructions as product features.
- Plan for on-prem or private-agent offerings for sensitive customers (finance, healthcare, regulated industries).
- Maintain human-in-the-loop processes for governance, compliance, and high-risk domains.
Notable quotes / insights
- “You really want to kind of meet the model, not beat the model and kind of grow with it.” — Fitz Nolan
- “Unit tests will ensure the next change the AI writes won't break the existing functionality, but they won't tell you that the software is actually doing what it's supposed to be doing.” — Fitz Nolan
- Two monetizable things in an AI-native world: data locality and data construction (how you compose and transform data).
Where SmartBear fits / perspective
- SmartBear acquired Reflect (end-to-end automated web testing) and Fitz is working to infuse AI across SmartBear’s testing, API, and observability products (includes Swagger/OpenAPI and Bugsnag).
- Their stance: support the full spectrum of customers—from conservative, compliance-heavy organizations that need on-prem deployments and testing stability, to advanced AI-native organizations wanting agentic QA platforms.
Contacts & episode notes
- Host: Ryan Donovan — podcasts at stackoverflow.com (also on LinkedIn)
- Guest: Fitz Nolan — fitz.nolan@smartbear.com (or contact via SmartBear / LinkedIn)
- Community shoutout: Alexander — answer winner mentioned in show notes
