COMPOSITE-Stem

📰 ArXiv cs.AI

arXiv:2604.09836v1 Announce Type: new Abstract: AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathe

Published 14 Apr 2026

Read full paper → ← Back to Reads