Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize
Most agents get tested by running a few queries and checking whether the output looks right. Laurie calls this the vibes problem: it doesn't catch regressions, doesn't run in CI, and doesn't tell you whether a prompt fix broke three other things. This workshop builds a complete eval pipeline from scratch on a financial analysis agent: tracing with Phoenix, reading traces before writing a single eval, categorizing failures by root cause, then building code evals, built-in LLM-as-a-judge evals, and a custom rubric with labeled examples.
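As a rough sketch of what that pipeline looks like with the open-source arize-phoenix package (assuming OpenAI credentials and a running Phoenix instance; the project name, dataframe contents, and model choice here are illustrative, not taken from the workshop):

```python
import pandas as pd
from phoenix.otel import register
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
)

# Send OpenTelemetry traces from the agent to Phoenix so you can read
# them before writing any evals. "financial-agent" is an example name.
tracer_provider = register(project_name="financial-agent")

# ...run the agent, then collect its inputs/outputs for evaluation.
# A faked single-row dataset stands in here; the built-in hallucination
# template expects "input", "reference", and "output" columns.
df = pd.DataFrame(
    {
        "input": ["What was ACME's Q3 revenue?"],
        "reference": ["ACME reported Q3 revenue of $4.2B."],
        "output": ["ACME's Q3 revenue was $4.2B."],
    }
)

# Built-in LLM-as-a-judge: label each row factual vs. hallucinated
# relative to the retrieved reference, with explanations for debugging.
results = llm_classify(
    df,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(results["label"].value_counts())
```

llm_classify returns one label and one explanation per row, which is what makes a judge's disagreements inspectable rather than a bare score.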
The sharpest lesson: choosing the right eval matters more than tuning it. A correctness eval scored the agent 0 out of 13 while a faithfulness eval scored the same agent 13 out of 13, because the model doesn't know what year it is and can't verify forward-looking financial data. The workshop closes on the thing most eval content skips: experiments that let you prove a prompt change actually worked, rather than eyeballing it and calling it a win.
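A minimal sketch of that experiments step, assuming Phoenix's experiments API; run_agent is a hypothetical stand-in for the agent under test, and the dataset name, columns, and evaluator are illustrative:

```python
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

# A tiny labeled regression set; names are illustrative.
df = pd.DataFrame(
    {
        "question": ["What was ACME's Q3 revenue?"],
        "expected_answer": ["$4.2B"],
    }
)
dataset = px.Client().upload_dataset(
    dataset_name="financial-agent-regressions",
    dataframe=df,
    input_keys=["question"],
    output_keys=["expected_answer"],
)

# The task wraps the agent under test; swap the candidate prompt in here.
# run_agent is a hypothetical entry point for your agent.
def task(input):
    return run_agent(input["question"])

# A trivial code eval: does the answer contain the expected figure?
def contains_expected(output, expected):
    return float(expected["expected_answer"] in output)

# Each run is recorded against the same dataset, so a prompt change is
# compared to the previous run instead of eyeballed.
experiment = run_experiment(dataset, task, evaluators=[contains_expected])
```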
Speaker info:
- https://x.com/seldo
- https://www.linkedin.com/in/seldo/
- https://github.com/seldo