AI Math benchmark
This section is about benchmarks designed to test mathematical reasoning.
LLM benchmark
Benchmarking LLMs is an extremely difficult problem.
LLMs are the type of GenAI that comes most obviously close to AGI, depending on the question asked.
Therefore, there is a difficult gap between what is easy for current models, what a human can reliably do, and what AGI will one day do.
Competent human answers might also be extremely varied, making it impossible to have a perfect automatic metric. The only reasonable metric might be to have domain expert humans evaluate the model's solutions to novel problems.
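Math benchmarks are a partial exception: because answers are often a single number, a crude automatic metric does exist, namely final-answer exact matching. Even there the limitation shows: formatting differences must be normalized away, and semantically equivalent answers (e.g. "1/2" vs "0.5") still slip through. A minimal sketch of such a grader, with all normalization rules being assumptions rather than any specific benchmark's rules:

```python
# Minimal sketch of final-answer exact-match grading, as commonly used
# for math benchmarks. The normalization rules below are illustrative
# assumptions, not the specification of any particular benchmark.

def normalize(answer: str) -> str:
    """Strip whitespace, thousands separators and a trailing period,
    so '1,000.' compares equal to '1000'."""
    return answer.strip().rstrip(".").replace(",", "").replace(" ", "")

def exact_match(prediction: str, reference: str) -> bool:
    """True if the normalized final answers are identical strings.
    Note: '0.5' and '1/2' would NOT match, showing the metric's limits."""
    return normalize(prediction) == normalize(reference)

def score(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose final answer matches the reference."""
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Open-ended proofs, by contrast, have no such normal form, which is why expert human grading remains the fallback.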