Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
📰 arXiv cs.AI
The paper introduces a reliability science framework for long-horizon LLM agents that measures consistent success across repeated attempts on tasks of varying duration.
Action Steps
- Identify the limitations of existing benchmarks that measure only single-attempt capability (e.g., pass@1)
- Develop a reliability science framework with metrics that capture consistent success across repeated attempts (a minimal estimator sketch follows this list)
- Apply the framework to long-horizon LLM agents to evaluate their reliability
- Use the framework to inform model development and deployment decisions
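The digest doesn't name the framework's exact metrics, so here is a minimal sketch under one common formulation: from n independent trials with c successes per task, estimate both pass@k (at least one of k attempts succeeds, a capability measure) and an all-k-attempts-succeed variant (a reliability measure, sometimes written pass^k in the literature). The function names and the 7-of-10 example are illustrative assumptions, not taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: probability that at least one of k sampled attempts
    succeeds, estimated without bias from n trials with c successes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_all_k(n: int, c: int, k: int) -> float:
    """Reliability: probability that ALL k sampled attempts succeed,
    i.e., consistent success across repeats."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Illustrative record: a model solving a task on 7 of 10 trials.
n, c = 10, 7
for k in (1, 3, 5):
    print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}, "
          f"all-k-succeed={pass_all_k(n, c, k):.3f}")
```

On the same 7/10 record, pass@5 rounds to 1.0 while the all-five-succeed estimate is about 0.08, illustrating how capability and reliability metrics can rank the same model very differently.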
Who Needs to Know This
AI researchers and engineers working on LLM agents can apply this framework to evaluate model reliability, and product managers can use it to inform deployment decisions
Key Insight
💡 Reliability and capability are distinct properties that diverge as task duration grows, and existing benchmarks are insufficient for evaluating long-horizon LLM agents
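A toy model, not necessarily the paper's formulation, shows why the gap widens with duration: if a task requires T steps and each step succeeds independently with probability p, task-level success is p**T, so per-step differences that are negligible at short horizons become decisive at long ones.

```python
# Toy independence assumption (not from the paper): a T-step task
# succeeds only if every step succeeds, so task success = p ** T.
for p in (0.99, 0.999):
    for T in (10, 100, 1000):
        print(f"per-step p={p}, horizon T={T:>4}: task success ~ {p**T:.4g}")
```

Two agents at p = 0.99 and p = 0.999 are nearly indistinguishable at T = 10 (about 0.90 vs 0.99) but differ by four orders of magnitude at T = 1000 (about 4e-5 vs 0.37).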
Share This
🚀 Introducing a reliability science framework for long-horizon LLM agents to measure consistent success across repeated attempts 🤖
DeepCamp AI