AGI-complete in general? Obviously. But still, a lot can be done. See e.g.:
- The Busy Beaver Challenge deciders
Not to be confused with the tutoring company "Axiom Maths", which shows up at the top of Google results: axiommaths.com/ lol fuck.
They seem to do autoformalization, automated theorem proving and code generation, and they use Lean a lot. Sounds fun.
www.math.inc/careers
Suppose that today is June 1, 2025. We call a date "square" if all of its components (day, month, and year) are perfect squares. I was born in the last millennium, and my next birthday (relative to that date) will be the last square date in my life. If you sum the square roots of the components of that upcoming square birthday (day, month, year), you obtain my age on June 1, 2025. My mother would have been born on a square date if the month were a square number; in reality it is not a square date, but both the month and day are perfect cubes. When was I born, and when was my mother born?
One-shotted by GPT-5.1, though possibly contaminated of course:
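For reference, the constraints are easy to brute force. Below is a minimal Python sketch of my own (not a model's output), assuming that "last millennium" means a birth year of 2000 or earlier and that the "last square date in my life" clause selects the latest candidate day:

```python
# Brute-force sketch for the "square date" puzzle above. Assumptions (mine):
# - a date is "square" if day, month and year are all perfect squares;
# - "last millennium" means birth year <= 2000;
# - age on 2025-06-01 is counted in completed years.
from math import isqrt

def is_square(n: int) -> bool:
    return isqrt(n) ** 2 == n

def is_cube(n: int) -> bool:
    r = round(n ** (1 / 3))
    return r ** 3 == n

# The next birthday after 2025-06-01 that is a square date: 2025 = 45**2 is a square
# year, the only square month after June is 9, and the day must be a square day.
for day in (1, 4, 9, 16, 25):
    age = isqrt(day) + isqrt(9) + isqrt(2025)  # sum of square roots of the components
    birth_year = 2025 - age - 1  # the September birthday has not happened yet on June 1
    if birth_year <= 2000:
        print(f"candidate: born {birth_year}-09-{day:02d}, age {age} on 2025-06-01")
# The "last square date in my life" clause presumably selects the latest candidate day.

# Mother: the date would be square if the month were square, so day and year are
# squares; additionally both month and day are perfect cubes.
for month in range(1, 13):
    for day in (1, 4, 9, 16, 25):
        if is_cube(month) and not is_square(month) and is_cube(day):
            # plausible square birth years for her generation include 1936 = 44**2
            print(f"mother candidate: day={day}, month={month}, year = any square year")
```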
www.principialabs.org
We combine large-scale pretraining with reinforcement learning to create models that can rederive and learn from the entire corpus of human mathematics. Our goal is automated mathematical discovery: AI that does the creative, generative work that was previously only possible for the world's best researchers—and can be deployed on the hardest problems in science and engineering.
Quick list: x.com/AlexKontorovich/status/1997051032384446629
deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/
AI achieves silver-medal standard solving International Mathematical Olympiad problems
"Autoformalization" refers to automatically converting a traditional human readable mathematical proof to a formal proof.
This section is about benchmarks designed to test mathematical reasoning.
Even more than in other areas of benchmarking, in maths, where an answer is simply right or wrong and good problems are costly to come up with, some benchmarks have adopted private test data sets.
The situation is kind of sad: ideally we would have open data sets and only test models whose training data was exclusively published before the problems' publication date.
However, this is not practical for the following reasons:
Perhaps the ideal scenario is therefore what ARC-AGI has done: publish a sizeable public dataset that you feel is highly representative of the difficulty level of the private test data, while at the same time holding out some private test data. Half and half seems reasonable.
This way, reproducible models can actually test themselves reliably on the open data, while the closed data can be used for the cases where the open data can't be.
How they "ensure" that models are not contaminated:
Most of their problems come from high school knowledge olympiads and they are therefore completely irrelevant for 2025 LLMs.
Not too exciting because of the high school knowledge olympiad level, but respectable.
- Every problem has one final integer answer. Also, unlike Project Euler and like the IMO, only limited computation is required, i.e. you are not expected to do full-blown program generation to reach the final answer, which makes this even less exciting.
This one doesn't seem too exciting to be honest, but it might be useful. Sample question:
and it expects the correct answer down to the cent:
53892.27
It should be noted that Project Euler has such "precision matters" problems.
This project, initiated by Terence Tao, aims to find the relations between various statements in abstract algebra using a combination of automated theorem proving and human effort. As mentioned by Terence himself, this is a bit similar to the idea of the Busy Beaver Challenge:
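To give a flavour of the kind of statement involved, here is an illustrative Lean sketch of my own (not the project's actual code or equation numbering): the project asks, for pairs of equational laws over a magma, whether one law implies the other.

```lean
-- Illustrative sketch only: a magma is a type with a single binary operation,
-- and the project catalogues implications between equational laws over it.
class Magma (α : Type) where
  op : α → α → α

infixl:65 " ◇ " => Magma.op

-- Two example laws (names made up here):
def LawA (α : Type) [Magma α] : Prop := ∀ x y : α, x ◇ y = y ◇ x
def LawB (α : Type) [Magma α] : Prop := ∀ x y z : α, (x ◇ y) ◇ z = (y ◇ x) ◇ z

-- A trivial implication of the kind being catalogued:
theorem lawA_implies_lawB (α : Type) [Magma α] (h : LawA α) : LawB α := by
  intro x y z
  rw [h x y]
```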
Paper: arxiv.org/abs/2411.04872
arstechnica.com/ai/2024/11/new-secret-math-benchmark-stumps-ai-models-and-phds-alike/ mentions what the official website fails to clearly state:
The design of FrontierMath differs from many existing AI benchmarks because the problem set remains private and unpublished to prevent data contamination
The expected answer output for all problems is one single SymPy expression, which is kind of a cool approach: it allows both for large integers, like Project Euler, and for irrational expressions to be given, e.g. "An optimization problem in BMO space" from the sample problems has answer:
Of course, when the output is not an integer, this raises the question of simplification and equivalence. Also, like Project Euler, solutions essentially expect you to write and execute code.
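For instance, comparing a submitted SymPy expression against a reference answer could look something like this (a sketch of my own, not the benchmark's actual grading code):

```python
# Sketch of the simplification/equivalence problem with SymPy answers.
import sympy as sp

x = sp.symbols("x")

reference = (x + 1) ** 2       # hypothetical reference answer
submission = x**2 + 2 * x + 1  # mathematically equal, syntactically different

# Naive structural comparison fails:
print(reference == submission)                   # False

# Semantic checks:
print(sp.simplify(reference - submission) == 0)  # True
print(reference.equals(submission))              # True
# In general symbolic equality is undecidable, so graders may also fall back
# to numerical spot checks.
```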
The most interesting aspect of this benchmark is the difficulty. Mathematical olympiad coach Evan Chen comments:[ref]
Problems in [the International Mathematical Olympiad] typically require creative insight while avoiding complex implementation and specialized knowledge [but for FrontierMath] they keep the first requirement, but outright invert the second and third requirement
Creator of FrontierMath.
Socials:
We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants.
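The "functional variation" idea can be sketched roughly as follows (a toy illustration with a made-up template, not Putnam-AXIOM's actual pipeline):

```python
# Toy sketch: generate "functional variants" by perturbing constants in a
# problem template; the template and answer function here are made up.
import random

TEMPLATE = "How many real solutions does x^2 - {a}x + {b} = 0 have?"

def make_variant(seed: int) -> tuple[str, int]:
    rng = random.Random(seed)
    a, b = rng.randint(1, 10), rng.randint(1, 10)
    disc = a * a - 4 * b
    answer = 2 if disc > 0 else 1 if disc == 0 else 0
    return TEMPLATE.format(a=a, b=b), answer

if __name__ == "__main__":
    for seed in range(3):
        problem, answer = make_variant(seed)
        print(problem, "->", answer)
```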
AI code generation benchmark in which part of the task is producing a formal Lean proof about the implementation. Sweet.
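Roughly the flavour of such a task (an illustrative example of my own, not taken from the benchmark): implement a function, then prove a specification about the implementation in Lean.

```lean
-- Illustrative only: implement a function, then prove a property of the implementation.
def myAbs (n : Int) : Int :=
  if n < 0 then -n else n

theorem myAbs_nonneg (n : Int) : 0 ≤ myAbs n := by
  unfold myAbs
  split <;> omega
```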
Articles by others on the same topic
Automated theorem proving (ATP) is a branch of artificial intelligence and mathematical logic concerned with the development of algorithms and software that can automatically prove mathematical theorems. The goal of ATP systems is to determine the validity of logical statements and derive conclusions based entirely on formal logical reasoning, without human intervention.
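A minimal taste of the refutation approach, using the Z3 SMT solver's Python bindings rather than a dedicated first-order prover such as E or Vampire (names below are mine):

```python
# Prove the propositional tautology ((p -> q) and p) -> q by refutation:
# assert its negation and check that it is unsatisfiable.
# Requires: pip install z3-solver
from z3 import And, Bools, Implies, Not, Solver, unsat

p, q = Bools("p q")
claim = Implies(And(Implies(p, q), p), q)  # modus ponens as a formula

solver = Solver()
solver.add(Not(claim))          # look for a counterexample
print(solver.check() == unsat)  # True: no counterexample, so the claim is valid
```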