Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize
Most agents get tested by running a few queries and checking whether the output looks right. Laurie calls this the vibes problem: it doesn't catch regressions, doesn't run in CI, and doesn't tell you whether a prompt fix broke three other things. This workshop builds a complete eval pipeline from scratch on a financial analysis agent: tracing with Phoenix, reading traces before writing a single eval, categorizing failures by root cause, then building code evals, built-in LLM-as-a-judge evals, and a custom rubric with labeled examples.
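As a rough sketch of what that pipeline looks like with the open-source arize-phoenix package (assuming OpenAI credentials and a running Phoenix instance; the project name, dataframe contents, and model choice here are illustrative, not taken from the workshop):

```python
import pandas as pd
from phoenix.otel import register
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
)

# Send OpenTelemetry traces from the agent to Phoenix so you can read
# them before writing any evals. "financial-agent" is an example name.
tracer_provider = register(project_name="financial-agent")

# ...run the agent, then collect its inputs/outputs for evaluation.
# A faked single-row dataset stands in here; the built-in hallucination
# template expects "input", "reference", and "output" columns.
df = pd.DataFrame(
    {
        "input": ["What was ACME's Q3 revenue?"],
        "reference": ["ACME reported Q3 revenue of $4.2B."],
        "output": ["ACME's Q3 revenue was $4.2B."],
    }
)

# Built-in LLM-as-a-judge: label each row factual vs. hallucinated
# relative to the retrieved reference, with explanations for debugging.
results = llm_classify(
    df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(results["label"].value_counts())
```

llm_classify returns one label and one explanation per row, which is what makes a judge's disagreements inspectable rather than a bare score.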
The sharpest lesson: choosing the right eval matters more than tuning it. A correctness eval scored the agent 0 out of 13 while a faithfulness eval scored the same agent 13 out of 13, because the model doesn't know what year it is and can't verify forward-looking financial data. The workshop closes on the thing most eval content skips: experiments that let you prove a prompt change actually worked, rather than eyeballing it and calling it a win.
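A minimal sketch of that experiments step, assuming Phoenix's experiments API; run_agent is a hypothetical stand-in for the agent under test, and the dataset name, columns, and evaluator are illustrative:

```python
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

# A tiny labeled regression set; names are illustrative.
df = pd.DataFrame(
    {
        "question": ["What was ACME's Q3 revenue?"],
        "expected_answer": ["$4.2B"],
    }
)
dataset = px.Client().upload_dataset(
    dataset_name="financial-agent-regressions",
    dataframe=df,
    input_keys=["question"],
    output_keys=["expected_answer"],
)

# The task wraps the agent under test; swap the candidate prompt in here.
# run_agent is a hypothetical entry point for your agent.
def task(input):
    return run_agent(input["question"])

# A trivial code eval: does the answer contain the expected figure?
def contains_expected(output, expected):
    return float(expected["expected_answer"] in output)

# Each run is recorded against the same dataset, so a prompt change is
# compared to the previous run instead of eyeballed.
experiment = run_experiment(dataset, task, evaluators=[contains_expected])
```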
Speaker info:
- https://x.com/seldo
- https://www.linkedin.com/in/seldo/
- https://github.com/seldo