Source: cirosantilli/llm-benchmark

= LLM benchmark
{tag=Computer benchmark}

Benchmarking LLMs is extremely difficult.

LLMs are the type of <GenAI> that most obviously comes close to <AGI>, depending on the question asked.

Therefore, there is a gap that is difficult to measure between what is easy, what a human can always do, and what <AGI> will one day do.

Competent human answers may also be extremely varied, making a perfect automatic metric impossible. The only reliable metric might be to have human domain experts evaluate the model's solutions to novel problems.
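
For example, a naive automatic metric such as exact string match scores a correct but differently worded answer as wrong. A minimal Python sketch of this failure mode, where the reference and both candidate answers are made up for illustration rather than taken from any real benchmark:

``
# Minimal sketch of why naive automatic LLM metrics break down.

def exact_match_score(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    return sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    ) / len(references)

references = ["The speed of light is 299792458 m/s."]

# Both candidate answers are factually correct, but only the first
# matches the reference verbatim, so the second is scored as wrong.
predictions_a = ["The speed of light is 299792458 m/s."]
predictions_b = ["Light travels at about 3.0e8 meters per second in vacuum."]

print(exact_match_score(predictions_a, references))  # 1.0
print(exact_match_score(predictions_b, references))  # 0.0
``

Softer metrics such as BLEU or embedding similarity reduce this problem somewhat, but none of them fully determine whether a differently phrased answer is actually correct.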

Bibliography:
* https://www.reddit.com/r/LocalLLaMA/comments/1b933of/llm_benchmarks_are_bullshit/