Closed AI math benchmark Created 2025-11-19 Updated 2025-12-13
Even more than in other areas of benchmarking, in maths, where an answer is simply right or wrong and it is costly to come up with good sample problems, some benchmarks have adopted private test data sets.
The situation is kind of sad, in that ideally we would have open data sets and only test models that were trained exclusively on data published before the problems' publication date.
However, this is not practical for the following reasons:
  • some of the best models are closed source and don't have reproducible training with a specified data cutoff
  • a private test set allows you to automatically check answers from untrusted sources: if they get the answers right, they are onto something, and you don't even need to check their methodology (a minimal checker sketch follows below)
Perhaps the ideal scenario therefore is what ARC-AGI has done: publish a sizeable public dataset that you believe is highly representative of the difficulty level of the private test data, while still holding out some private test data. A half and half split seems reasonable.
This way, reproducible models can reliably test themselves on the open data, while the closed data can be used in the cases where the open data can't be.
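To make the automatic checking point concrete, here is a minimal sketch of such a checker. Everything in it is an assumption for illustration (the file names, the JSON answer-key format, exact string matching), not any particular benchmark's actual submission pipeline.
```python
import json

def grade_submission(submission_path, private_answers_path):
    """Compare untrusted answers against a privately held answer key."""
    with open(submission_path) as f:
        submitted = json.load(f)       # e.g. {"problem_17": "42", ...}
    with open(private_answers_path) as f:
        answers = json.load(f)         # held out, never published

    correct = sum(
        1 for pid, answer in answers.items()
        if str(submitted.get(pid, "")).strip() == str(answer).strip()
    )
    return correct / len(answers)

if __name__ == "__main__":
    score = grade_submission("submission.json", "private_answers.json")
    print(f"Accuracy on private set: {score:.1%}")
```
The key property is that whoever holds the private answer key can grade arbitrary submissions without trusting, or even reading, the submitter's methodology.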
Giotto.ai Created 2025-04-24 Updated 2025-07-16
www.giotto.ai/
At Giotto.ai, our technology is designed to bridge the gap between current AI capabilities and the promise of Artificial General Intelligence (AGI).
Their website doesn't clearly explain their technology as of 2025.
They claim to have done some work on ARC-AGI, which is cool, but there are no clear references to what they did or whether anything about it is public.
NDEA Created 2025-03-28 Updated 2025-07-16
ndea.com/
We believe program synthesis holds the key to unlocking AGI.
Cool. The founders are also very interested in ARC-AGI.
This section is about unofficial ARC-AGI-like problem sets.
These are interesting from two points of view:
  • practical point of view: they provide more training data for potential solvers, assuming of course that you believe they are representative
  • theoretical point of view: they might help to highlight missing or excessive assumptions in the official datasets
github.com/neoneye/arc-dataset-collection contains a fantastic collection of such datasets, with visualization at: neoneye.github.io/arc/
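For reference, here is a minimal sketch of loading such tasks, assuming they follow the same JSON layout as the official ARC-AGI dataset: each task file has "train" and "test" lists of {"input": grid, "output": grid} pairs, where a grid is a list of lists of small integers. The directory path below is just a placeholder.
```python
import json
from pathlib import Path

def load_tasks(dataset_dir):
    """Load every *.json task file found under a dataset directory."""
    tasks = {}
    for path in Path(dataset_dir).rglob("*.json"):
        with open(path) as f:
            tasks[path.stem] = json.load(f)
    return tasks

# Placeholder path: point it at any dataset from the collection above.
tasks = load_tasks("path/to/some/arc/dataset")
for name, task in list(tasks.items())[:3]:
    print(f"{name}: {len(task['train'])} train pairs, {len(task['test'])} test pairs")
```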
Updates / ARC-AGI-2 Created 2025-10-18 Updated 2025-10-21
I've created a quick fork of ARC-DSL, which defines a hand-crafted Domain Specific Language (DSL) to help solve ARC-AGI problems.
I basically just merged outstanding pull requests on the original repo that were needed to make things run.
It would be cool to see if those rules also solve ARC-AGI-2 problems well, but I'm too lazy for now.
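To illustrate the general idea of the DSL approach, here is a toy sketch: a few grid primitives plus a brute-force search over short compositions that fit every training pair of a task. The primitives below are made up for the example; the real ARC-DSL defines a much larger, carefully hand-crafted set.
```python
from itertools import product

# A handful of toy grid primitives; grids are lists of lists of ints.
def identity(g): return g
def rot90(g):    return [list(r) for r in zip(*g[::-1])]  # rotate clockwise
def hmirror(g):  return g[::-1]                           # flip top-bottom
def vmirror(g):  return [r[::-1] for r in g]              # flip left-right

PRIMITIVES = [identity, rot90, hmirror, vmirror]

def solve(task, max_depth=3):
    """Search compositions of primitives that map every train input to its output."""
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            def program(g, combo=combo):
                for f in combo:
                    g = f(g)
                return g
            if all(program(p["input"]) == p["output"] for p in task["train"]):
                return program
    return None

# Usage on a trivial task whose hidden rule is "rotate 90 degrees clockwise".
task = {"train": [{"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]}],
        "test":  [{"input": [[0, 2], [0, 0]]}]}
program = solve(task)
if program:
    print(program(task["test"][0]["input"]))  # -> [[0, 0], [0, 2]]
```
The real thing differs mainly in scale: many more primitives and a much smarter search than brute-force enumeration.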
ARC-AGI-2 is a very interesting benchmark which mixes symbolic and visual elements, is readily solvable by non-expert humans, and has so far largely resisted transformer-based models.
Part of me would like to focus more on less visual aspects of AI, but it is still of interest.
It is funny how many early (semi-)retired fintech/bigtech bros are interested in the project; I saw several of them on the forums.
I must confess I'd be tempted too if I were in that position. Maybe in 15 years' time for me, the way things are looking.
Kudos to those people who do something cool and open when they don't need the money: www.reddit.com/r/Fire/comments/15x4w7r/comment/jx7dn16/ The same is true of Jimmy Wales of Wikipedia, for example, who used to work in finance.