Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize

AI Engineer · Intermediate · 🤖 AI Agents & Automation
Most agents get tested by running a few queries and checking whether the output looks right. Laurie calls this the vibes problem: it doesn't catch regressions, doesn't run in CI, and doesn't tell you whether a prompt fix broke three other things. This workshop builds a complete eval pipeline from scratch on a financial analysis agent: tracing with Phoenix, reading traces before writing a single eval, categorizing failures by root cause, then building code evals, built-in LLM-as-a-judge evals, and a custom rubric with labeled examples. The sharpest lesson: choosing the right eval matters more than tuning it. A correctness eval scored 0 out of 13 on the same agent that a faithfulness eval scored 13 out of 13, because the judge model doesn't know what year it is and can't verify forward-looking financial data. The workshop closes on the thing most eval content skips: experiments that let you prove a prompt change actually worked, rather than eyeballing it and calling it a win.

Speaker info:
- https://x.com/seldo
- https://www.linkedin.com/in/seldo/
- https://github.com/seldo
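The code-eval and experiment steps in the summary above can be illustrated with a minimal sketch. It is not the workshop's code and does not use the Phoenix API. Everything here (`Example`, `covers_all_tickers`, `run_experiment`, the two stub agents, and the three-question dataset) is hypothetical and stands in for a real agent and a real dataset. The idea is only that a deterministic check, run over a fixed set of inputs, turns "the new prompt looks better" into a number you can compare.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    question: str
    tickers: list[str]  # tickers the answer is expected to cover (hypothetical schema)


def covers_all_tickers(example: Example, answer: str) -> bool:
    """Code eval: a deterministic check with no LLM involved."""
    tokens = set(re.findall(r"[A-Z]+", answer.upper()))
    return all(t.upper() in tokens for t in example.tickers)


def run_experiment(agent: Callable[[str], str], dataset: list[Example]) -> float:
    """Run one agent variant over a fixed dataset and return its pass rate."""
    passed = sum(covers_all_tickers(ex, agent(ex.question)) for ex in dataset)
    return passed / len(dataset)


def _tickers_in(question: str) -> list[str]:
    # Crude stand-in for the agent's own parsing: runs of 2-5 capital letters.
    return re.findall(r"\b[A-Z]{2,5}\b", question)


def baseline_agent(question: str) -> str:
    # Stub for the "before" prompt: only discusses the first ticker it sees.
    return f"{_tickers_in(question)[0]} looks solid."


def candidate_agent(question: str) -> str:
    # Stub for the "after" prompt: discusses every ticker in the question.
    return "Analysis of " + ", ".join(_tickers_in(question)) + "."


DATASET = [
    Example("Compare AAPL and MSFT revenue growth.", ["AAPL", "MSFT"]),
    Example("Summarize NVDA's outlook.", ["NVDA"]),
    Example("How do TSLA and GM differ on margins?", ["TSLA", "GM"]),
]

if __name__ == "__main__":
    print("baseline :", run_experiment(baseline_agent, DATASET))
    print("candidate:", run_experiment(candidate_agent, DATASET))
```

Because both variants run against the same dataset and the same eval, the two pass rates are directly comparable, and the same script can run in CI to catch regressions. In a real pipeline the eval would be chosen after reading traces, and some checks would be LLM-as-a-judge evals rather than code.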
