The $7,000 AI Mistake That Changed How I Evaluate Every Model

Shane | LLM Implementation · Intermediate · 🛡️ AI Safety & Ethics · 7mo ago
A chatbot cost Air Canada $7,000. ChatGPT got lawyers sanctioned in court. These aren't edge cases. They're what happens when you "vibe check" your AI. This video will save you from the same mistakes.

In 10 minutes, you'll learn:
✅ The "PhD thesis problem" that makes LLM evaluation fundamentally different.
✅ Why Perplexity is your first line of defense (with a 5-line Python demo).
✅ How to catch a model "cheating" on benchmarks using one simple test.
✅ The exact script used: distilgpt2 + Hugging Face evaluate library.
✅ Where Perplexity fails (and what to use instead).

CRITICAL: This metric won't tell you if your model is truthful or helpful. But it WILL tell you if it actually understands language, or if it's just memorizing.

📊 Actual demo results:
HTML code: Perplexity = 7.07 (highly predictable)
Creative prose: Perplexity = 102.04 (14.4x more unpredictable!)

🎬 Full Evaluation Series:
Part 1: The Stethoscope (Perplexity) - You are here!
Part 2: The Two Pillars (Coming Soon)
Part 3: The AI Judge (In Development)

💻 Resources:
Demo Code: https://github.com/LLM-Implementation/Practical-LLM-Implementation/blob/main/AI-Engineering/demo/perplexity/demo.py
Models: distilgpt2 (free to run)

Chapters:
0:00 - The AI Evaluation Crisis: Preventing Costly Mistakes
0:37 - The Sin of "Vibe Checks"
1:27 - The Engineer's Stethoscope
2:00 - What is Perplexity? Measuring "Surprise"
3:16 - DEMO: Calculating Perplexity in 5 Lines of Python
4:45 - The Lie Detector: Spotting Benchmark "Cheaters"
5:51 - When The Stethoscope Isn't Enough
6:42 - Your Full Evaluation Toolkit

🔔 **Subscribe for practical AI insights** - we're breaking down how modern AI actually works, one video at a time.

This presentation is inspired by the core concepts in the book "AI Engineering" by Chip Huyen. If you want a deeper dive into these topics, I highly recommend checking it out.

💬 **Questions?** Drop them in the comments - I read and respond to every one.

🎓 Join our FREE AI Engineering Community
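
For readers who want the "surprise" idea in code before watching: the sketch below is not the demo script linked above (that one uses distilgpt2 with the Hugging Face evaluate library). It is a minimal, dependency-free illustration, assuming you already have the probability a model assigned to each actual next token. The function name `perplexity` and the two probability lists are made up for illustration; only the 7.07 and 102.04 figures come from the description.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence, given the probability the model
    assigned to each token that actually occurred."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_nll)


# Hypothetical per-token probabilities, invented for illustration only.
predictable = [0.5, 0.5, 0.5, 0.5]      # model is fairly sure of each token
surprising = [0.01, 0.01, 0.01, 0.01]   # model is rarely right

ppl_predictable = perplexity(predictable)
ppl_surprising = perplexity(surprising)

# Ratio of the two demo results quoted in the description.
ratio_from_description = 102.04 / 7.07

print(f"predictable text: {ppl_predictable:.2f}")
print(f"surprising text:  {ppl_surprising:.2f}")
print(f"demo ratio:       {ratio_from_description:.1f}x")
```

Lower perplexity means the model found the text easier to predict, which is why repetitive HTML scores far lower than creative prose in the demo results.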

