Our First Proof submissions
TL;DR
OpenAI tested an internal AI model on 10 challenging First Proof math problems, with expert review suggesting at least 5 proof attempts are likely correct. This demonstrates progress in AI's ability to handle complex, research-level reasoning requiring expert verification.
Key Takeaways
- OpenAI's internal model attempted all 10 First Proof problems, with at least 5 proof attempts (problems 4, 5, 6, 9, 10) considered likely correct by experts.
- The First Proof challenge tests AI's ability to produce end-to-end arguments in specialized domains where correctness is hard to establish without expert review.
- The model showed tangible improvement during training, solving increasingly difficult problems over time with limited human supervision.
- This work builds on previous AI achievements in mathematical reasoning, including IMO gold medal-level performance and scientific collaboration case studies.
- The authors acknowledge the need for more rigorous evaluation frameworks and look forward to community engagement on assessing research-grade AI reasoning.
We ran an internal model on all 10 First Proof problems, a research-level math challenge designed to test whether AI systems can produce correct, checkable proof attempts. Unlike short-answer or competition-style math, these problems require building end-to-end arguments in specialized domains, and correctness is hard to establish without expert review. The authors of the First Proof problems are leading experts in their respective fields, and at least a couple of the problems were open for years before the authors found solutions. An academic department with substantial expertise in the relevant subject areas could conceivably solve many of the problems in one week.
We shared our proof attempts on Saturday, February 14, 2026 at 12:00 AM PT. Based on feedback from experts, we believe at least five of the model’s proof attempts (problems 4, 5, 6, 9, and 10) have a high chance of being correct, and several others remain under review. We initially believed our attempt for problem 2 was likely correct. Based on the official First Proof commentary and further community analysis, we now believe it is incorrect. We’re grateful for the engagement and look forward to continued review. Our full set of proof attempts can be found here. The preprint includes all ten proof attempts, plus a newly added appendix with prompt patterns and examples that aim to simulate our manual interactions with the models during the process.
We believe novel frontier research is perhaps the most important way to evaluate capabilities of next generation AI models. Benchmarks are useful, but they can miss some of the hardest parts of research: sustaining long chains of reasoning, choosing the right abstractions, handling ambiguity in problem statements, and producing arguments that survive expert scrutiny. Frontier challenges like First Proof help us stress-test those capabilities in settings where correctness is nontrivial to verify and the failure modes are informative.
“We’re currently training a new model for which a primary focus is increasing the level of rigor in its thinking, with the goal that the model can think continuously for many hours and remain highly confident in its conclusions. When the First Proof problems were announced, it seemed like the perfect testbed, so over the weekend I tried it out. Already it was able to solve two of the problems (#9 and #10). As it trained, it became increasingly capable, eventually solving, in our estimation, at least three more. We were particularly pleased when it solved #6 and then, two days later, #4, as those problems were from fields familiar to many of us. It’s pretty incredible to watch a model get tangibly smarter day by day.”
– James R. Lee (OpenAI Researcher, Reasoning)
We ran the model with limited human supervision. When prompting versions of the model at various points during training, we sometimes suggested that it retry strategies that had appeared fruitful in earlier attempts. For some attempts, we asked the model to expand or clarify parts of a proof after receiving expert feedback, to make the reasoning easier to verify. We also facilitated a back-and-forth between this model and ChatGPT for verification, formatting, and style. For some problems, we present the best of a few attempts, selected by human judgment. This was a fast sprint, and our process was not as clean as we would like in a properly controlled evaluation. We look forward to discussions with the First Proof organizers about a more rigorous experiment and evaluation framework for future iterations.
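To make this kind of workflow concrete, here is a minimal, hypothetical sketch of a best-of-a-few loop with reviewer feedback. The function `best_of_n_with_review` and the `prover` and `reviewer` callables are illustrative stand-ins, not our actual tooling; as noted above, the real process was largely manual and interactive, and final selection among candidates was made by human judgment.

```python
from typing import Callable


def best_of_n_with_review(
    problem: str,
    prover: Callable[[str], str],        # hypothetical: returns a proof attempt for a problem statement
    reviewer: Callable[[str, str], str], # hypothetical: returns critique of an attempt
    n_attempts: int = 3,
    max_rounds: int = 2,
) -> list[str]:
    """Collect several proof attempts, each refined through a short reviewer
    loop, and return the candidates for selection by human judgment.

    Illustrative sketch only: `prover` and `reviewer` stand in for whatever
    models or manual steps produce and critique proof attempts.
    """
    candidates: list[str] = []
    for _ in range(n_attempts):
        attempt = prover(problem)
        for _ in range(max_rounds):
            feedback = reviewer(problem, attempt)
            if "no further issues" in feedback.lower():
                break
            # Ask for an expanded or clarified attempt in light of the feedback,
            # mirroring the "expand or clarify parts of a proof" step described above.
            attempt = prover(
                f"{problem}\n\nPrevious attempt:\n{attempt}\n\nReviewer feedback:\n{feedback}"
            )
        candidates.append(attempt)
    return candidates


# Example usage with trivial stand-ins (real runs would call actual models):
attempts = best_of_n_with_review(
    "Hypothetical problem statement",
    prover=lambda p: "Proof attempt for: " + p[:40],
    reviewer=lambda p, a: "No further issues.",
)
```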
This work builds on earlier results from frontier reasoning models in math and science. In July 2025, we reached gold medal-level performance on the International Mathematical Olympiad with a general-purpose reasoning model (35/42 points). In November 2025, we shared “Early experiments in accelerating science with GPT‑5”, a set of case studies where GPT‑5 helped researchers make concrete progress across math, physics, biology, and other fields, along with the limitations we observed. And most recently, we reported a physics collaboration where GPT‑5.2 proposed a candidate expression for a gluon-amplitude formula that was then formally proved by an internal model and verified by the authors.
We look forward to deeper engagement with the community on how to evaluate research-grade reasoning, including expert feedback on these attempts, and we’re excited to make these new capabilities available in future public models.