This project tests various models against various competitions.
How they "ensure" that models are not contaminated:
By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination
Most of their problems come from high school knowledge olympiads and they are therefore completely irrelevant for 2025 LLMs.
A subset of problems that they curate from competitions.
Not too exciting because of the high school knowledge olympiad level, but respectable.
This one doesn't seem too exciting to be honest, but it might be useful. Sample question:
If I deposit $50,000 at 5% APR, compounded weekly, what will my balance be after 18 months?
and it expects the correct answer down to the cent:
53892.27
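For reference, the expected answer matches a direct compound-interest calculation, assuming "18 months" is interpreted as 52 × 1.5 = 78 weekly compounding periods:
```python
# Sanity check of the sample question above, assuming "18 months" means
# 52 * 1.5 = 78 weekly compounding periods.
principal = 50_000
apr = 0.05
periods_per_year = 52
years = 1.5

balance = principal * (1 + apr / periods_per_year) ** (periods_per_year * years)
print(round(balance, 2))  # 53892.27
```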
It should be noted that Project Euler also has such "precision matters" problems.
arstechnica.com/ai/2024/11/new-secret-math-benchmark-stumps-ai-models-and-phds-alike/ mentions what the official website is unable to state clearly:
The design of FrontierMath differs from many existing AI benchmarks because the problem set remains private and unpublished to prevent data contamination
The expected answer for every problem is a single SymPy expression, which is kind of a cool approach: it allows both for large integers, as in Project Euler, and for irrational closed-form expressions, as in the sample problem "An optimization problem in BMO space".
Of course, when the output is not an integer, this raises the question of checking equivalence under simplification. Also, like Project Euler, solutions essentially expect you to write and execute code.
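One way around the equivalence question is to compare expressions symbolically rather than as strings. A minimal SymPy sketch, illustrative only and not the benchmark's actual grading code:
```python
# Minimal sketch of checking a submitted answer against a reference SymPy
# expression by symbolic equivalence rather than string equality.
# Illustrative only; not the benchmark's actual grader.
import sympy as sp

reference = sp.sqrt(2) / 2
submitted = 1 / sp.sqrt(2)  # equivalent value, written differently

# A zero difference after simplification means the two forms are equivalent;
# exact large integers also compare correctly this way.
print(sp.simplify(reference - submitted) == 0)  # True
```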
The most interesting aspect of this benchmark is the difficulty. Mathematical olympiad coach Evan Chen comments:[ref]
Problems in [the International Mathematical Olympiad] typically require creative insight while avoiding complex implementation and specialized knowledge [but for FrontierMath] they keep the first requirement, but outright invert the second and third requirement
Creator of FrontierMath.
Math was already almost saturated as of the 2025 release, so meh:
modified questions based on high school math competitions from the past 11 months, as well as harder versions of AMPS questions
We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants.
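The "functional variants" idea can be pictured with a small sketch (a made-up toy template, not Putnam-AXIOM's actual code): the statement is a template whose constants get perturbed, and the ground-truth answer is recomputed for each variant:
```python
# Toy illustration of programmatically perturbed problem variants:
# randomize the constants of a problem template and recompute the answer.
# Made-up example, not taken from Putnam-AXIOM.
import random

def make_variant(seed: int):
    rng = random.Random(seed)
    a = rng.randint(2, 9)  # perturbed constant
    n = rng.randint(3, 9)  # perturbed constant
    statement = f"Compute the sum of {a}*k for k = 1, ..., {n}."
    answer = a * n * (n + 1) // 2  # ground truth recomputed for this variant
    return statement, answer

print(make_variant(0))
```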
AI code generation benchmark in which part of the task is producing a formal Lean proof of the implementation's correctness. Sweet.
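To make that format concrete, here is a minimal Lean 4 sketch of pairing an implementation with a machine-checked proof of its specification. It is a made-up toy example, not a problem from the benchmark, and assumes a recent Lean toolchain where the omega tactic is available:
```lean
-- Toy example of the "implementation + formal proof" format; not taken from
-- the benchmark itself.

/-- Implementation: double a natural number by adding it to itself. -/
def double (n : Nat) : Nat := n + n

/-- Specification, machine-checked against the implementation above. -/
theorem double_spec (n : Nat) : double n = 2 * n := by
  -- `double n` unfolds definitionally to `n + n`; linear arithmetic closes it.
  show n + n = 2 * n
  omega
```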
