{tag=Computer benchmark}

Benchmarking LLMs is extremely difficult.

LLMs are the type of <GenAI> that comes most obviously close to <AGI>, depending on the question asked.

Therefore, there is a difficult gap between what is easy, what a human can always do, and what <AGI> will do one day.

Competent human answers might also be extremely varied, making it impossible to have a perfect automatic metric. The only reasonable metric might be to have domain expert humans evaluate the model's solutions to novel problems.
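To illustrate why varied but competent answers defeat a simple automatic metric, here is a minimal sketch in Python. The `exact_match` function, the reference answer, and the candidate answers are all hypothetical, made up for illustration, and not taken from any real benchmark. All three candidates are correct, but a naive string comparison only accepts the one that happens to be worded like the reference:

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Hypothetical naive metric: case-insensitive string equality."""
    return prediction.strip().lower() == reference.strip().lower()


# Made-up reference answer and three correct candidate answers.
reference = "The derivative of x^2 is 2x."
answers = [
    "The derivative of x^2 is 2x.",
    "d/dx(x^2) = 2x",
    "It is 2x, by the power rule.",
]

# Only the first answer matches, even though all three are correct.
scores = [exact_match(answer, reference) for answer in answers]
print(scores)
```

Fuzzier metrics can reduce this problem, but presumably not remove it, which is why evaluation by a domain expert might remain the only reasonable option.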

Bibliography:
* https://www.reddit.com/r/LocalLLaMA/comments/1b933of/llm_benchmarks_are_bullshit/


LLM benchmark

{c}
{tag=Computer benchmark}

= AI Math benchmark
{c}
{synonym}

This section is about benchmarks designed to test mathematical reasoning.

Bibliography:
* https://mathscholar.org/2025/02/deepseek-a-breakthrough-in-ai-for-math-and-everything-else/


Math AI benchmark

{tag=Benchmark}
{wiki}

* <CPU> benchmark: https://askubuntu.com/questions/634513/cpu-benchmarking-utility-for-linux/701532#701532
* <GPU> benchmark: https://askubuntu.com/questions/31913/how-to-perform-a-detailed-and-quick-3d-performance-test


Ciro Santilli @cirosantilli
