GPQA 2025-12-02
Questions are available to anyone via a Hugging Face login or a password-protected .zip, but you have to promise not to post them online. Lol. Either do the thing or don't.
LiveBench 2025-12-02
Math almost saturated as of 2025 release, so meh:
modified questions based on high school math competitions from the past 11 months, as well as harder versions of AMPS questions
Poetiq 2025-12-01
In 2025 they announced huge improvements on ARC-AGI-2, but they only tested on the public dataset, so the potential for contamination is overwhelming.
AI Mathematical Olympiad 2025-11-30
Not too exciting because the problems only require high school olympiad level knowledge, but respectable.
Formalization of X 2025-11-30
This section is about formalization efforts of specific fields of mathematics.
ORCA Benchmark Created 2025-11-19 Updated 2025-11-30
This one doesn't seem too exciting to be honest, but it might be useful. Sample question:
If I deposit $50,000 at 5% APR, compounded weekly, what will my balance be after 18 months?
and it expects the correct answer down to the cents:
53892.27
It should be noted that Project Euler has such "precision matters" problems.
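The sample answer above can be reproduced with a minimal sketch, assuming the usual reading of the question: the 5% APR is a nominal annual rate split evenly over 52 weekly periods, and 18 months counts as 1.5 years, i.e. 78 periods. The function name `compound_balance` and its parameters are hypothetical, not part of the benchmark. `Decimal` is used so that the rounding to cents is explicit rather than left to binary floats:

```python
from decimal import Decimal, ROUND_HALF_UP


def compound_balance(principal, apr, periods_per_year, years):
    """Balance after compounding a nominal annual rate `apr`.

    Assumption (not stated by the benchmark): the rate per period is
    apr / periods_per_year and the number of periods is
    periods_per_year * years, truncated to a whole number.
    """
    periods = int(Decimal(periods_per_year) * Decimal(years))
    rate_per_period = Decimal(apr) / Decimal(periods_per_year)
    balance = Decimal(principal) * (1 + rate_per_period) ** periods
    # Round to cents, half up, as a bank statement would typically show it.
    return balance.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# $50,000 at 5% APR, compounded weekly (52 periods/year), for 18 months.
balance = compound_balance("50000", "0.05", 52, "1.5")
print(balance)
```

Under these assumptions the result agrees with the quoted 53892.27. The unrounded value sits at roughly 53892.2655, so about a twentieth of a cent of error would flip the last digit, which is presumably the point of asking for cents.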
