DevDay 2025 Summary: Apps SDK, AgentKit, MCP, Codex, and Why Prompting Is More Important than Ever
Hosts: swyx + Alessio
Guests: Sherwin & Christina (OpenAI Open Platform team)
Overview
This interview covers OpenAI’s recent DevDay product launches and platform direction: the Apps SDK (ChatGPT-first integrations), AgentKit (builder + SDK + runtime + evals), adoption of the MCP protocol, evals and prompt optimization, internal Codex usage, reliability tooling, and ecosystem/portability trade-offs. The conversation focuses on how these pieces fit together to help developers build, deploy, evaluate, and iterate on agents and conversational apps.
Key points & main takeaways
- Platform philosophy: OpenAI sees APIs and developer tooling as essential to distributing AI benefits broadly; recent launches are iterative steps to empower external builders.
- Apps SDK (ChatGPT-first integrations): inverts the old website-with-a-chatbot pattern; ChatGPT becomes the top layer and embeds the app inside it, while developers retain brand control through custom UI components and polished widgets.
- MCP adoption: OpenAI integrated MCP (originally created by Anthropic) into the Agents SDK in March; they credit MCP as a useful, open protocol and participate in its steering.
- AgentKit (Builder + SDK + Runtime + Evals):
- Agent Builder (a visual canvas) is intended as both a development playground and a deployment path (export to code or run via ChatKit).
- Includes templates/playbooks (customer support, document discovery, data enrichment, planning, internal knowledge, etc.).
- Supports human-in-the-loop approval nodes and stateful workflows; roadmap includes richer modalities (voice, multimodal) and more complex approval workflows.
- Evals improvements:
- Evals now support running agents and grading full agent traces; the roadmap includes breaking traces into stages and applying rubrics or human-in-the-loop evaluation to each stage.
- Evals can target multiple model providers (via OpenRouter) to compare performance.
- Prompt optimization is increasingly central: OpenAI is investing in automated prompt tuning tied to evals; “prompts are not dying, they’re more important than ever.”
- Codex/internal developer workflow:
- Codex is used heavily internally for feature implementation, PR previews, and PR review assist; tip: trust the model more (let it write larger chunks).
- ChatKit & widgets:
- ChatKit is an embeddable, opinionated chat iframe (kept evergreen by OpenAI; not planned to be open-sourced) that provides a polished, consumer-grade chat UX with widgets; a widget studio exists to create UI components quickly.
- Portability & multi-model: OpenAI intends to support third-party and open models (evals can compare many providers), and is thinking about portability standards for stateful APIs.
- Reliability & observability: New per-org service health dashboard (personal SLOs, token velocity, TPM, response codes) to help customers monitor integrations; OpenAI is aggressively improving SRE to meet high availability goals.
- Cost / BYOK (bring-your-own-key):
- Many developers ask for BYOK for inference cost control; it is not available out-of-the-box but is top-of-mind and a common ask.
- Caution: state stores attached to stateful APIs may end up repurposed as general-purpose databases; watch the cost, scale, and operational impact if you go down that path.
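The trace-grading direction described in the evals bullets above can be illustrated with a minimal sketch. The trace format, stage names, and rubric checks below are hypothetical stand-ins (a real setup would use the Evals product, typically with model-based graders per stage), but the shape — grade each stage of a full agent trace against a rubric — is the idea discussed.

```python
from dataclasses import dataclass

# Hypothetical trace step: one model turn or tool call in an agent run.
@dataclass
class TraceStep:
    stage: str    # e.g. "plan", "tool_call", "final_answer"
    content: str

def grade_trace(trace, rubric):
    """Grade a full agent trace stage by stage against a rubric.

    rubric maps a stage name to a predicate over that stage's steps;
    a stage passes only if it occurred and every step satisfies its check.
    """
    results = {}
    for stage, check in rubric.items():
        steps = [s for s in trace if s.stage == stage]
        results[stage] = bool(steps) and all(check(s) for s in steps)
    return results

trace = [
    TraceStep("plan", "look up order status, then draft a reply"),
    TraceStep("tool_call", "orders.lookup(id=123) -> shipped"),
    TraceStep("final_answer", "Your order 123 has shipped."),
]

rubric = {
    "plan": lambda s: "order" in s.content,
    "tool_call": lambda s: "orders.lookup" in s.content,
    "final_answer": lambda s: "shipped" in s.content,
}

print(grade_trace(trace, rubric))
# → {'plan': True, 'tool_call': True, 'final_answer': True}
```

Per-stage results like these are what make the roadmap item above (rubrics and human review applied to parts of a trace, rather than one pass/fail on the whole run) actionable.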
Notable quotes & insights
- “It’s kind of inverted — there’s ChatGPT at the top layer and then the website embedded inside of it.” — on the Apps SDK experience.
- “Prompting is more important than ever.” — recurring theme: prompts + evals + optimization remain central to building effective agents.
- “Trust the model to do more.” — Codex power-user tip: let the model produce bigger, riskier outputs and iterate.
Topics discussed
- Apps SDK: intent, developer experience, brand-preserving integrations, widgets & ChatKit
- AgentKit: Agent Builder canvas, SDK, connectors, templates, human approval, export-to-code, deployment
- MCP protocol adoption and ecosystem (Anthropic origin, steering participation)
- Responses API, stateful APIs, and porting considerations
- Evals: agent traces, grading, rubrics, multi-model comparison
- Prompt optimization & automated prompt tuning
- Codex: productivity tips and internal adoption patterns
- Widgets, embeddable chat iframe, and trade-offs of open-sourcing
- Service health dashboard and reliability improvements
- BYOK, cost control, and state-as-database caution
- Roadmap expectations: more modalities (voice, multimodal), deeper human-in-loop support, and broader third-party model support
Action items / Recommendations (for developers)
- Try Agent Builder as a playground:
- Use it to prototype agents, iterate on prompts, and export to the Agents SDK when ready.
- Leverage provided templates (customer service, document discovery, data enrichment) to accelerate builds.
- Integrate evals into development:
- Capture agent traces and run evals to measure end-to-end behavior; begin defining rubrics for long agentic tasks.
- Use evals to compare models (including open-source ones via OpenRouter) and to drive automated prompt-optimization loops.
- Invest in prompt engineering:
- Treat prompts as first-class, iteratively optimize them (automated tuning where possible), and expect maintenance as models evolve.
- Use Codex to accelerate dev workflows:
- Try delegating larger chunks of code; use Codex-assisted PR previews and reviews to speed context switching.
- Monitor reliability:
- Enable and watch the service health dashboard to get personal SLOs and real-time telemetry for your org.
- Plan for cost & keys:
- Expect to manage inference costs; watch for possible BYOK options in the future and design guardrails (rate limiting, allow-lists).
- Give feedback:
- OpenAI is actively seeking developer input on Agent Builder trade-offs (deterministic vs LM-driven nodes, types of logical nodes, modality priorities).
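The cost guardrails suggested above (rate limiting, allow-lists) can be sketched in a few lines. This is a generic client-side pattern, not an OpenAI API feature; the model names and `guarded_call` helper are illustrative placeholders.

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter for outbound inference calls."""
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1):
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

ALLOWED_MODELS = {"gpt-4.1-mini", "gpt-4.1"}  # example allow-list entries

def guarded_call(model, bucket):
    """Hypothetical wrapper: enforce the allow-list and rate limit first."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model {model!r} is not on the allow-list")
    if not bucket.allow():
        raise RuntimeError("rate limit exceeded; retry later")
    # ... issue the actual inference request here ...
    return "ok"

bucket = TokenBucket(rate_per_sec=5, capacity=2)
print(guarded_call("gpt-4.1-mini", bucket))
```

Wrapping every inference call this way keeps spend bounded even before any platform-side BYOK or budgeting features land.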
Additional technical notes
- The Agents SDK and Responses API launched earlier; MCP was integrated into the Agents SDK around March.
- Evals now can evaluate long traces from agent runs; future improvements include finer-grained part-by-part evaluation and multimodal evals.
- ChatKit is an embeddable iframe that OpenAI optimizes and keeps evergreen, so developers avoid rebuilding their chat UI for model or modality changes.
- OpenAI participates in MCP steering and collaborates with other vendors to promote open protocols and multi-model ecosystem support.
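The multi-provider comparison mentioned in the notes above can be sketched as a small harness: run the same eval cases against each provider and compare pass rates. The provider callables here are toy stubs; a real harness would call each provider's API (e.g. routed through OpenRouter) and use a proper grader.

```python
def run_comparison(providers, cases, grader):
    """Score each provider on the same eval cases; return pass rates."""
    scores = {}
    for name, generate in providers.items():
        passed = sum(grader(case, generate(case["prompt"])) for case in cases)
        scores[name] = passed / len(cases)
    return scores

# Stub "models": real providers would be API calls behind the same interface.
providers = {
    "model-a": lambda p: p.upper(),
    "model-b": lambda p: p,
}

cases = [
    {"prompt": "hello", "expect": "HELLO"},
    {"prompt": "world", "expect": "WORLD"},
]

grader = lambda case, output: output == case["expect"]

print(run_comparison(providers, cases, grader))
# → {'model-a': 1.0, 'model-b': 0.0}
```

Keeping cases and grader fixed while swapping the provider is what makes scores comparable across open and closed models.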
