Sober Reasoning Leaderboard 🍷

Andreas Hochlehnert*1, Hardik Bhatnagar*1, Vishaal Udandarao1,2, Samuel Albanie, Ameya Prabhu1, Matthias Bethge1

1Tübingen AI Center - University of Tübingen    2University of Cambridge

The leaderboard reports Pass@1 accuracy (mean ± std) on six math benchmarks under a standardized evaluation setup. Scores are computed over 10 seeds for AIME24, AIME25, and AMC23, and over 3 seeds for MATH500, Minerva, and OlympiadBench.
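As a rough illustration of how a "mean ± std" entry could be derived from per-seed results, the sketch below computes Pass@1 for each seed and then aggregates across seeds. The function names (`pass_at_1`, `summarize`) and the toy data are hypothetical, and the use of the sample standard deviation is an assumption; the leaderboard's actual evaluation code may differ.

```python
from statistics import mean, stdev


def pass_at_1(correct_flags):
    """Fraction of problems answered correctly in one run (one sample per problem)."""
    return sum(correct_flags) / len(correct_flags)


def summarize(per_seed_flags):
    """Return (mean, std) of Pass@1 in percent across seeds.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    scores = [100.0 * pass_at_1(flags) for flags in per_seed_flags]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Toy data: 3 seeds, 4 problems each (hypothetical, not real benchmark results).
runs = [
    [True, False, True, True],
    [True, True, False, False],
    [True, False, False, True],
]
m, s = summarize(runs)
print(f"{m:.1f} ± {s:.1f}")
```

With small benchmarks such as AIME (30 problems), a single problem shifts Pass@1 by more than 3 points, which is one reason to average over more seeds there than on larger sets such as MATH500.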

Model | Organization | Based on | Link | AIME'24 | AIME'25 | AMC'23 | MATH500 | Minerva | OlympiadBench | Average