Building AI Agents That Survive Production

Name: Building AI Agents That Survive Production
Uploaded: 2026-05-07T00:20:24Z
Channel: MLOps.community

MLOps.community · Beginner · 🤖 AI Agents & Automation · 2w ago

Skills: Agent Foundations 80%

Haytham Abuelfutuh, Co-founder and CTO of Union.ai and co-author of the open-source orchestrator Flyte, opens the AI Agents 2026 conference in Seattle with a brutally simple message: stop trying to design AI agents that never fail. Build agents that fail cheaply and recover automatically. In this 25-minute talk, Haytham walks through the three design principles every production agent needs, the 3 D's (Dynamic, Durable, and Defended), and shows what each one actually requires from your platform. He grounds it in a real case study with Dragonfly, who took a laptop prototype to a production agent system indexing 250,000+ products in a single sitting on Flyte 2.

Topics covered:
- The travel agent thought experiment: what 18 years of human agents teach us about long-running sessions, dropped calls, and not asking the user the same question twice
- The show-of-hands problem: why so many teams build agents but so few ever ship them
- The full taxonomy of agent failure: semantic errors, infrastructure errors, network errors, API throttling, and corrupt context
- Dynamic: why agent platforms must run native Python instead of forcing you into a constrained DSL for branching and loops
- Durable: declaring infrastructure inside your code so agents can react to OOMs, spot machine preemption, and crashes
- Crash recovery for long-running sessions: caching non-deterministic LLM calls and tool calls so agents can resume from the last checkpoint
- Cross-session caching: when to share LLM outputs across users and when to recompute
- Defended: sandboxing agent-generated code with Pydantic Monty and network-isolated execution environments
- Human-in-the-loop bailouts when the agent has exhausted its retries
- Dragonfly case study: a four-tier agent architecture (catalog, coordinator, researcher, tools) for product recommendation across 250K+ products
- Q&A: why Union.ai uses Go and Rust under the Python SDK, and how platform teams can shift agent infrastructure left to developers wi...
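The crash-recovery topic above (caching non-deterministic LLM and tool calls so a session can resume from its last checkpoint) can be illustrated with a minimal plain-Python sketch. This is not Flyte's or Union.ai's API: `CheckpointCache`, `run_session`, and `fake_model` are hypothetical names invented for illustration, and the "model" is a stand-in that crashes once to simulate a dropped session.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class CheckpointCache:
    """Disk-backed cache for results of non-deterministic steps (hypothetical helper)."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, step, payload):
        # Key each result by the step name plus its JSON-serialized input.
        raw = json.dumps({"step": step, "payload": payload}, sort_keys=True)
        digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def run(self, step, payload, fn):
        """Return the stored result if present; otherwise call fn and persist it."""
        path = self._path(step, payload)
        if path.exists():
            return json.loads(path.read_text())
        result = fn(payload)
        path.write_text(json.dumps(result))
        return result


def run_session(cache, questions, call_model):
    """Answer each question in order, reusing answers saved before a crash."""
    return [cache.run("answer", q, call_model) for q in questions]


calls = []  # records which questions actually reached the model
crash_once = {"armed": True}


def fake_model(question):
    # Stand-in for a real LLM call; fails once on the second question.
    if question == "q2" and crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated crash")
    calls.append(question)
    return f"answer to {question}"


cache = CheckpointCache(tempfile.mkdtemp())
questions = ["q1", "q2", "q3"]

try:
    run_session(cache, questions, fake_model)
except RuntimeError:
    pass  # the session died after q1 was checkpointed

# Resuming replays q1 from disk and only calls the model for q2 and q3.
answers = run_session(cache, questions, fake_model)
```

In this sketch the cache key is derived from the step name and its input, so a resumed session skips work that already completed instead of re-running the model. A production platform would also need to decide when a cached result may be shared across sessions and when it must be recomputed, which the sketch does not address.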
