The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

Authors: Jasper Dekoninck*, Ivo Petrov*, Kristian Minchev, Mislav Balunovic, Martin Vechev
Dataset Contributors: Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev

We present the Open Proof Corpus (OPC), the largest open dataset of its kind to date, comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC is specifically designed for broad applicability in proof generation research and allows us to investigate the capabilities and limitations of LLMs in generating mathematical proofs.

Key Finding: LLMs are great at judging proof correctness

Models are impressive judges of proof correctness, and fine-tuning a small 8B model on the OPC matches the performance of the best closed-source model, Gemini-2.5-Pro.

Table 1: Benchmarking LLMs as proof graders. Cost for running the model on the entire subset is given in USD.

Judge                  pass@1  maj@5  Cost (USD)
Human                  90.4    -      N/A
Gemini-2.5-Pro         85.4    88.1   135.47
OPC-R1-8B              83.8    88.1   N/A
o4-mini                83.8    85.3   29.57
o3                     83.1    84.3   93.30
Qwen3-235B-A22B        81.8    84.6   3.79
DeepSeek-R1            80.9    82.6   27.35
DeepSeek-R1-Qwen3-8B   70.7    71.3   N/A
Qwen3-8B               64.4    63.6   N/A
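
For reference, the sketch below shows one way the two metrics in Table 1 can be computed for a binary proof judge: pass@1 as the average agreement of individual verdicts with the human label, and maj@5 as the agreement of a five-way majority vote. The record structure and field names are illustrative only and do not reflect the actual OPC schema or evaluation code.

```python
from collections import Counter

def pass_at_1(judgments: list[bool], human_label: bool) -> float:
    """Average agreement of individual judge verdicts with the human label."""
    return sum(j == human_label for j in judgments) / len(judgments)

def maj_at_5(judgments: list[bool], human_label: bool) -> bool:
    """Majority vote over (up to) five verdicts, compared to the human label."""
    vote = Counter(judgments[:5]).most_common(1)[0][0]
    return vote == human_label

# Hypothetical records: one proof each, with a human label and five judge verdicts.
records = [
    {"human": True,  "judge": [True, True, False, True, True]},
    {"human": False, "judge": [False, True, False, False, False]},
]

pass1 = sum(pass_at_1(r["judge"], r["human"]) for r in records) / len(records)
maj5 = sum(maj_at_5(r["judge"], r["human"]) for r in records) / len(records)
print(f"pass@1: {pass1:.3f}  maj@5: {maj5:.3f}")
```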

Dataset Statistics

The OPC consists of 1,010 problems drawn from 20 national and international mathematical competitions. It contains 5,062 solutions generated by 6 state-of-the-art LLMs. 43% of the proofs in the OPC are correct.


To investigate important open questions regarding LLM proof generation capabilities, the dataset consists of five key splits (a sketch for filtering by split follows the list):

  • MathArena: Problems from MathArena to compare final-answer correctness and proof validity. Approximately 30 problems from SMT 2025 will be published only once they have been released by the competition organizers.
  • PutnamBench: Problems from PutnamBench to compare formal and natural language proof generation.
  • Best-of-n: Problems from various mathematical competitions to investigate best-of-n selection strategies.
  • Generic (Train): Problems from various mathematical competitions to further research in the area of proof generation.
  • Generic (Test): Problems from various mathematical competitions to test judging capabilities of current LLMs.
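
As a rough illustration of how one might work with these splits, here is a minimal sketch that filters solutions by split and tallies per-model correctness. The file name, split identifier, and field names (split, model, is_correct) are assumptions for illustration and may not match the released OPC format.

```python
import json

# Hypothetical layout: one JSON record per solution, with fields such as
# "split", "problem", "solution", "model", and a human "is_correct" label.
# The actual field names in the released OPC may differ.
with open("opc_solutions.jsonl") as f:
    solutions = [json.loads(line) for line in f]

# Keep only the held-out judging split and group correctness labels by model.
test = [s for s in solutions if s["split"] == "generic_test"]
by_model: dict[str, list[bool]] = {}
for s in test:
    by_model.setdefault(s["model"], []).append(s["is_correct"])

for model, labels in sorted(by_model.items()):
    rate = sum(labels) / len(labels)
    print(f"{model}: {rate:.1%} correct over {len(labels)} proofs")
```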

Conclusions and Insights

Based on our analysis of the OPC, we draw four additional key conclusions regarding the current capabilities and limitations of LLMs in natural language proof generation.

  • LLMs demonstrate strong potential for generating correct proofs, with the best model currently being Gemini-2.5-Pro.
  • Natural language proof generation outperforms formal proof generation by a factor of 11.
  • There is a significant gap between final-answer correctness and proof validity, and the size of this gap varies across models.
  • Best-of-n selection strategies are very effective, with the strongest strategies relying on pairwise comparisons between candidate proofs (see the sketch after this list).
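
To make the pairwise best-of-n idea concrete, the sketch below implements a simple round-robin selection scheme: a judge compares every pair of candidate proofs, and the proof with the most pairwise wins is returned. The judge_pair function is a stand-in for an actual LLM judge call, and this is a generic illustration rather than the exact selection strategy studied in the OPC.

```python
import itertools
import random

def judge_pair(problem: str, proof_a: str, proof_b: str) -> int:
    """Stand-in for an LLM judge call: return 0 if proof_a is judged better, 1 otherwise.
    In practice this would prompt a judge model with the problem and both proofs."""
    return random.randint(0, 1)  # random placeholder so the sketch runs end-to-end

def best_of_n_pairwise(problem: str, proofs: list[str]) -> str:
    """Round-robin tournament: compare every pair once and select the proof
    with the most pairwise wins."""
    wins = [0] * len(proofs)
    for i, j in itertools.combinations(range(len(proofs)), 2):
        winner = i if judge_pair(problem, proofs[i], proofs[j]) == 0 else j
        wins[winner] += 1
    return proofs[max(range(len(proofs)), key=lambda k: wins[k])]

candidates = [f"Candidate proof {k}" for k in range(4)]
print(best_of_n_pairwise("Show that ...", candidates))
```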

Dataset Examples

This section showcases several examples from our dataset. More examples, presented in the same format shown to the judges, can be found at judgeproofs.xyz/sample.

Citation

@article{openproofcorpus2025,
  title={The Open Proof Corpus: A Large-Scale Human Study of LLM Proofs},
  author={Jasper Dekoninck and Ivo Petrov and Kristian Minchev and Mislav Balunovic and Martin Vechev and Miroslav Marinov and Maria Drencheva and Lyuba Konova and Milen Shumanov and Kaloyan Tsvetkov and Nikolay Drenchev and Lazar Todorov and Kalina Nikolova and Nikolay Georgiev and Vanesa Kalinkova and Margulan Ismoldayev},
  journal={arXiv},
  year={2025},
}