Robust Reasoning Benchmark

📰 ArXiv cs.AI

arXiv:2604.08571v1 Announce Type: cross

Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate the robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open-weight models […]
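The abstract does not enumerate the 14 perturbation techniques, but the general idea of a formatting-perturbation pipeline can be sketched. The example below is a minimal illustration assuming two hypothetical perturbations, whitespace injection and Unicode homoglyph substitution; neither is confirmed by the abstract, and the function names are illustrative only.

```python
# Illustrative sketch of formatting perturbations for robustness testing.
# The specific techniques (whitespace injection, homoglyph substitution)
# are assumptions, not the paper's actual pipeline.
import random

# Cyrillic characters that are visually near-identical to Latin ones.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}


def perturb_whitespace(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly double spaces between tokens without changing the tokens."""
    rng = random.Random(seed)
    out = []
    for tok in text.split(" "):
        out.append(tok)
        if rng.random() < rate:
            out.append("")  # empty token yields a doubled space on join
    return " ".join(out)


def perturb_homoglyphs(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap selected Latin letters for look-alike Cyrillic characters."""
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )


if __name__ == "__main__":
    problem = "Find the sum of all positive integers n that divide 2024."
    print(perturb_whitespace(problem, rate=0.3))
    print(perturb_homoglyphs(problem, rate=0.3))
```

Perturbations like these preserve the mathematical content of a problem while breaking its surface form, which is what lets a benchmark separate genuine reasoning from formatting-specific pattern matching.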

Published 13 Apr 2026
Read full paper →