GPQA by Ciro Santilli 37 2025-12-02
Questions available to anyone under Hugging Face login / .zip with password, but you have to promise not to post them online. Lol. Either do the thing or don't.
LiveBench by Ciro Santilli 37 2025-12-02
Math almost saturated as of 2025 release, so meh:
modified questions based on high school math competitions from the past 11 months, as well as harder versions of AMPS questions
Project Euler problem 948 by Ciro Santilli 37 Created 2025-12-01 Updated 2025-12-02
Numerical solution:
1033654680825334184
Programs:
Poetiq by Ciro Santilli 37 2025-12-01
In 2025 they announced huge improvements on ARC-AGI-2, but they only tested on the public dataset, so the potential for contamination is overwhelming.
Not too exciting because of the high school knowledge olympiad level, but respectable.
This section is about formalization efforts of specific fields of mathematics.
ORCA Benchmark by Ciro Santilli 37 Created 2025-11-19 Updated 2025-11-30
This one doesn't seem to exciting to be honest, but it might be useful. Sample question:
If I deposit $50,000 at 5% APR, compounded weekly, what will my balance be after 18 months?
and it expects the correct answer down to the cents:
53892.27
It should be noted that Project Euler has such "precision matters" problems.
Closed AI math benchmark by Ciro Santilli 37 Created 2025-11-19 Updated 2025-11-30
Even more than in other areas of benchmarking, in maths where you only have a right or wrong answer, and it is costly to come up with good sample problems, some benchmarks have adopted private test data sets.
The situation is kind of sad, in that ideally we should have open data sets and only test them on models that were trained on data exclusively published before the problem publish date.
However this is not practical for the following reasons:
  • some of the best models are closed source and don't have a reproducible training with specified cutoff
  • having a private test set allows you to automatically check answers from untrusted sources. If they get answers right, they are onto something, you don't even need to check their methodology
Perhaps the ideal scenario therefore is what ARC-AGI has done: give a sizeable public dataset, which you feel is highly representative of the difficulty level of the private test data, while at the same time holding out some private test data. Half half seems reasonable.
This way, reproducible models can actually self test themselves reliably on the open data, while the closed data can then be used for the cases where the open data can't be used.
Video 1.
3D Printed Guns Are Easy To Make And Impossible To Stop by VICE News (2018)
Source.

There are unlisted articles, also show them or only show them.