Math AI benchmark

This section is about benchmarks designed to test mathematical reasoning.

Bibliography:

mathscholar.org/2025/02/deepseek-a-breakthrough-in-ai-for-math-and-everything-else/

Closed AI math benchmark

Even more than in other areas of benchmarking, in maths where you only have a right or wrong answer, and it is costly to come up with good sample problems, some benchmarks have adopted private test data sets.

The situation is kind of sad, in that ideally we should have open data sets and only test them on models that were trained on data exclusively published before the problem publish date.

However this is not practical for the following reasons:

some of the best models are closed source and don't have a reproducible training with specified cutoff
having a private test set allows you to automatically check answers from untrusted sources. If they get answers right, they are onto something, you don't even need to check their methodology

Perhaps the ideal scenario therefore is what ARC-AGI has done: give a sizeable public dataset, which you feel is highly representative of the difficulty level of the private test data, while at the same time holding out some private test data. Half half seems reasonable.

This way, reproducible models can actually self test themselves reliably on the open data, while the closed data can then be used for the cases where the open data can't be used.

List of math AI benchmarks

 0  0

MathArena

 0  0

matharena.ai/

This project tests various models against various competitions.

How they "ensure" that models are not contaminated:

By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination

Most of their problems come from high school knowledge olympiads and they are therefore completely irrelevant for 2025 LLMs.

MathArena Apex

 0  0

matharena.ai/apex/

A subsets of problems that they curate from competitions.

AI Mathematical Olympiad

 0  0

aimoprize.com

Not too exciting because of the high school knowledge olympiad level, but respectable.

www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-3/overview is round 3.
Every problem has one final integer answer:
In this competition, every ground-truth label is an integer between 0 and 99999
Non-integer results like square roots are just rounded off to produce an integer they mention:
$1 0^{4} 2 = 14142$
Also unlike Project Euler and like IMO, all only limited computations are required, i.e. you are not expected to do full blown program generation to reach a final answer. Which makes this further less exciting.

ORCA Benchmark (2025)

 0  0

arxiv.org/abs/2511.02589

This one doesn't seem to exciting to be honest, but it might be useful. Sample question:

If I deposit $50,000 at 5% APR, compounded weekly, what will my balance be after 18 months?

and it expects the correct answer down to the cents:

53892.27

It should be noted that Project Euler has such "precision matters" problems.

Equational theories project (2024)

 0  0

This project initiated by Terence Tao aims to find the relations between various statements in abstract algebra by using a combination of automated theorem proving and human effort. As mentioned by Terence himself, this is a bit similar to the idea of the Busy Beaver Challenge:

First Proof (2026)

 0  0

1stproof.org/

FrontierMath (2024)

 0  0

epoch.ai/frontiermath

Paper: arxiv.org/abs/2411.04872

arstechnica.com/ai/2024/11/new-secret-math-benchmark-stumps-ai-models-and-phds-alike/ mentions what the official website is unable to clearly state out:

The design of FrontierMath differs from many existing AI benchmarks because the problem set remains private and unpublished to prevent data contamination

The expected answer output for all problems is one single SymPy expression, which is kind of a cool approach which allows either for large integers like Project Euler, but also for irrational expressions to be given, e.g. "An optimization problem in BMO space" from the sample problems has answer:

\frac{3}{36} + \frac{3}{6} e^{- 203 - \frac{1}{6}}

(1)

Of course, when the output is not an integer, this leads to the question of simplification equivalence questions. Also, like Project Euler, solutions essentially expect you to write and execute code.

The most interesting aspect of this benchmark is the difficulty. Mathematical olympiad coach Evan Chen comments:^[ref]

Problems in [the International Mathematical Olympiad] typically require creative insight while avoiding complex implementation and specialized knowledge [but for FrontierMath] they keep the first requirement, but outright invert the second and third requirement

Elliot Glazer

 0  0

Creator of FrontierMath.

Socials:

x.com/ElliotGlazer

IMProofBench (2025)

 0  0

improofbench.math.ethz.ch/

Paper: arxiv.org/html/2509.26076v1

Apparently also has human review as part of the process. Newbs. Just require Lean solutions and be done with it... They do address it in a section of the paper "Formal math benchmarks" but still meh. Review must be fully automated, none of that asking humans bullshit.

From: improofbench.math.ethz.ch/guidelines/

Required Characteristics
PhD-level difficulty: Suitable for qualifying exams, research papers, or advanced seminars
Requires genuine insight: Not solvable by routine application of known algorithms
Clear proof-based main question: Answer should be a complete mathematical argument, not just a number
2-3 unique-answer subquestions: Enable automated evaluation (e.g., "Is the statement true for n=5?", "What is the rank of this group?")

Example problem:

Example 1: Stable Graphs
Main question: Find a closed formula for the number $N (g)$ of stable graphs of genus $g$ with no legs and precisely 3 edges, for all $g \geq 2$ .
Subquestions:
What is $N (3)$ ?
What is $N (8)$ ?
What is $N (1000)$ ?

LiveBench (2025)

 1  0

livebench.ai

Math almost saturated as of 2025 release, so meh:

modified questions based on high school math competitions from the past 11 months, as well as harder versions of AMPS questions

Putnam-AXIOM (2025)

 0  0

openreview.net/forum?id=kqj2Cn3Sxr

We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants.