From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

📰 arXiv cs.AI

arXiv:2604.14137v1 Announce Type: cross

Abstract: Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal, experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to supp
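The abstract cuts off before the formalization itself, but as a rough illustration of the kind of structure it gestures at (a user comparing two models on prompts drawn from their own workflow and recording preferences), here is a minimal Python sketch of a pairwise vibe-test harness. Everything in it is an assumption for illustration, not the paper's method: the `VibeTest` class, the `Model` alias, and the stand-in models in the usage example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical: a "model" here is just any prompt -> response callable,
# so any real API client can be wrapped to fit.
Model = Callable[[str], str]


@dataclass
class VibeTest:
    """A minimal, structured pairwise vibe-test: the user supplies prompts
    from their own workflow and judges each pair of outputs by hand."""
    prompts: list[str]
    results: dict[str, int] = field(default_factory=dict)

    def run(self, name_a: str, model_a: Model, name_b: str, model_b: Model) -> None:
        # Tally user preferences across all prompts.
        self.results = {name_a: 0, name_b: 0, "tie": 0}
        for prompt in self.prompts:
            out_a, out_b = model_a(prompt), model_b(prompt)
            print(f"\n--- Prompt ---\n{prompt}")
            print(f"\n[A: {name_a}]\n{out_a}")
            print(f"\n[B: {name_b}]\n{out_b}")
            choice = input("Prefer which response? (a/b/t): ").strip().lower()
            if choice == "a":
                self.results[name_a] += 1
            elif choice == "b":
                self.results[name_b] += 1
            else:
                self.results["tie"] += 1


# Hypothetical usage: lambdas stand in for real model calls.
if __name__ == "__main__":
    test = VibeTest(prompts=["Write a pytest fixture for a temp SQLite DB."])
    test.run("model-x", lambda p: "response from x",
             "model-y", lambda p: "response from y")
    print(test.results)
```

The point of recording prompts and tallies rather than judging on the fly is that the same informal comparison becomes repeatable and aggregable across users, which is the gap between ad hoc vibe-testing and a formalized version of it that the abstract identifies.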

Published 16 Apr 2026