Building AI Agents That Survive Production

MLOps.community · Beginner · 🤖 AI Agents & Automation · 2w ago
Haytham Abuelfutuh, Co-founder and CTO of Union.ai and co-author of the open-source orchestrator Flyte, opens the AI Agents 2026 conference in Seattle with a brutally simple message: stop trying to design AI agents that never fail. Build agents that fail cheaply and recover automatically. In this 25-minute talk, Haytham walks through the three design principles every production agent needs (the 3 D's: Dynamic, Durable, and Defended) and shows what each one actually requires from your platform. He grounds it in a real case study with Dragonfly, who took a laptop prototype to a production agent system indexing 250,000+ products in a single sitting on Flyte 2.

Topics covered:
- The travel agent thought experiment: what 18 years of human agents teach us about long-running sessions, dropped calls, and not asking the user the same question twice
- The show-of-hands problem: why so many teams build agents but so few ever ship them
- The full taxonomy of agent failure: semantic errors, infrastructure errors, network errors, API throttling, and corrupt context
- Dynamic: why agent platforms must run native Python instead of forcing you into a constrained DSL for branching and loops
- Durable: declaring infrastructure inside your code so agents can react to OOMs, spot machine preemption, and crashes
- Crash recovery for long-running sessions: caching non-deterministic LLM calls and tool calls so agents can resume from the last checkpoint
- Cross-session caching: when to share LLM outputs across users and when to recompute
- Defended: sandboxing agent-generated code with Pydantic Monty and network-isolated execution environments
- Human-in-the-loop bailouts when the agent has exhausted its retries
- Dragonfly case study: a four-tier agent architecture (catalog, coordinator, researcher, tools) for product recommendation across 250K+ products
- Q&A: why Union.ai uses Go and Rust under the Python SDK, and how platform teams can shift agent infrastructure left to developers wi
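
The crash-recovery and human-in-the-loop topics above can be illustrated with a small standard-library sketch. It is not Flyte's or Union.ai's API: `Checkpoint`, `NeedsHuman`, and `flaky_llm` are hypothetical names invented for this example, and the JSON file stands in for whatever durable store a real platform would provide. The idea shown is only the general one from the talk summary: record the result of each non-deterministic call so that a restarted session replays it instead of recomputing, and hand off to a person once retries are exhausted.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class NeedsHuman(Exception):
    """Hypothetical signal: retries are exhausted, escalate to a person."""


class Checkpoint:
    """Hypothetical file-backed cache of step results, keyed by step name and inputs."""

    def __init__(self, path):
        self.path = Path(path)
        # Load results left behind by an earlier (possibly crashed) run.
        self.results = json.loads(self.path.read_text()) if self.path.exists() else {}

    def run(self, step, inputs, fn, max_retries=3):
        raw_key = json.dumps([step, inputs], sort_keys=True).encode()
        key = hashlib.sha256(raw_key).hexdigest()
        if key in self.results:
            # Replay the recorded output instead of calling the model again.
            return self.results[key]
        last_error = None
        for _ in range(max_retries):
            try:
                result = fn(**inputs)
                break
            except Exception as exc:  # a real agent would catch narrower errors
                last_error = exc
        else:
            # The loop never hit `break`, so every attempt failed.
            raise NeedsHuman(f"step {step!r} failed {max_retries} times") from last_error
        self.results[key] = result
        self.path.write_text(json.dumps(self.results))  # persist before returning
        return result


calls = {"n": 0}


def flaky_llm(prompt):
    """Stand-in for a non-deterministic LLM call that fails on its first attempt."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated network error")
    return f"answer #{calls['n']} to {prompt}"


ckpt_path = Path(tempfile.mkdtemp()) / "session.json"

# First run: one failed attempt, one successful retry, result persisted.
first = Checkpoint(ckpt_path).run("plan", {"prompt": "find flights"}, flaky_llm)

# Simulated restart: a fresh Checkpoint reads the file and replays the result.
second = Checkpoint(ckpt_path).run("plan", {"prompt": "find flights"}, flaky_llm)
```

In this sketch the second run returns the stored answer without calling `flaky_llm` again, which is what lets a long session resume from its last completed step. Keying the cache on the step name plus its inputs is a simplifying assumption; whether results should also be shared across users is the separate cross-session question the talk raises.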
