Formal Verification Is Having Its “GPT Moment”
Language models for theorem proving — it’s bigger than you think
If you squint hard enough, theorem proving is just glorified devops. Math verification—especially when formalized through tools like Lean or Isabelle—looks an awful lot like debugging some ancient, sacred, mathematically brittle codebase. So it’s no surprise that the same energy that’s gone into automating software testing and code generation is now being unleashed on the kingdom of proof.
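To make the analogy concrete, here’s a minimal sketch of proof-as-code in Lean 4 (the lemma deliberately duplicates the standard library’s Nat.zero_add, purely for illustration): every step either compiles or the build fails, exactly like a broken test in CI.

```lean
-- A tiny library lemma, proved by induction on n.
-- Remove either rewrite and the "build" breaks.
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl                          -- 0 + 0 reduces to 0 by definition
  | succ k ih => rw [Nat.add_succ, ih]   -- 0 + (k+1) unfolds to (0 + k) + 1; apply ih
```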
Kaiyu Yang, one of the leading researchers in the space (Meta FAIR; previously Caltech), offered a sober reminder in an April 7, 2025 talk: we’re in the middle of an AI arms race in math and coding. And this isn’t just about tidy formalization anymore; models like DeepSeekMath, AlphaGeometry, and FunSearch are making headline-worthy breakthroughs.
DeepSeekMath 7B, an open-source model, scores over 51% on the competition-level MATH benchmark, beating most proprietary systems (arXiv:2402.03300). DeepMind’s AlphaGeometry solved 25 of 30 recent IMO geometry problems, roughly on par with an average gold medalist (Nature, 2024; MIT Tech Review, 2024), and their FunSearch contributed novel code-based insights to the long-standing cap set problem (Nature, 2023).
The twist? This stuff works best in clean, tightly scoped domains: Olympiad-style problems, symbolic logic, well-bounded constraint spaces. It’s a far cry from the messy institutional ambiguity of finance or healthcare.
If you’re trying to align a model in a system with edge cases, conflicting incentives, and no single ground truth, formal methods still leave a lot on the table. Still, I’m bullish. This new wave of math-focused models isn’t about building superhuman mathematicians so much as building auditable intelligence, and laying the groundwork for holding future models accountable: not because they’re correct, but because they can prove it.
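As a sketch of what that accountability loop could look like (hypothetical setup, using only core Lean 4 lemmas): someone states the theorem as a spec, a model proposes the proof, and the Lean kernel, not the model, is the auditor.

```lean
-- The audit in miniature: the statement is the spec, the term after
-- `:=` is the model's proposed proof, and the kernel accepts or
-- rejects it. Trust rests on the checker, not on the model.
theorem model_proposed (a b : Nat) : a * b = b * a :=
  Nat.mul_comm a b  -- swap this for `sorry` and Lean flags the gap
```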
Adapted from Yang’s Spring 2025 presentation to the UC Berkeley Advanced Large Language Model Agents MOOC, for which he served as a co-instructor.