ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

📰 ArXiv cs.AI

arXiv:2604.01591v2 Announce Type: replace Abstract: We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or criti…
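The abstract describes alternating pairs of GRPO steps: a solve phase, then a refine phase over the model's own solutions, both scored with the same binary correctness reward. A minimal sketch of that alternation, assuming hypothetical `sample_solutions`, `sample_refinements`, and `is_correct` helpers (not from the paper), and using the standard GRPO group-relative advantage (reward normalized by the group mean and standard deviation):

```python
import statistics

def group_relative_advantages(rewards):
    # GRPO-style advantage: normalize each reward against its sampled group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

def thinktwice_pair_step(prompts, sample_solutions, sample_refinements, is_correct):
    """One pair of ThinkTwice training steps (illustrative sketch only).

    Phase 1 optimizes on solving; phase 2 on refining the model's own
    solutions to the same prompts. Both phases reuse one binary reward.
    """
    updates = []
    for phase, sampler in (("solve", sample_solutions), ("refine", sample_refinements)):
        for prompt in prompts:
            group = sampler(prompt)  # G candidate answers for this prompt
            rewards = [1.0 if is_correct(prompt, g) else 0.0 for g in group]
            updates.append((phase, prompt, group_relative_advantages(rewards)))
    return updates  # advantages would weight the policy-gradient update
```

The stub samplers and correctness check stand in for model rollouts and the task verifier; the point is only the paired solve/refine structure sharing one reward function.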

Published 8 Apr 2026