Benchmarking LLMs is extremely difficult. LLMs are the type of GenAI that comes closest to AGI, depending on the question asked. There is therefore a wide gap between what is easy, what a human can always do, and what an AGI will one day do.
Competent human answers to the same question can also vary enormously, which makes a perfect automatic metric impossible.
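To make this concrete, here is a minimal sketch, assuming exact-match scoring, one of the simplest automatic metrics; the reference answer and the candidate answers are hypothetical, not drawn from any particular benchmark:

```python
# Minimal sketch: why exact-match scoring fails when several
# differently-worded answers are equally correct.
# (All strings below are hypothetical examples.)

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 only if the prediction matches the reference verbatim."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

reference = "The algorithm runs in O(n log n) time."

# Three answers a competent human might give; all convey the same fact.
predictions = [
    "The algorithm runs in O(n log n) time.",
    "Its time complexity is O(n log n).",
    "It takes about n log n steps, up to a constant factor.",
]

for p in predictions:
    print(f"{exact_match(p, reference):.1f}  {p}")

# Only the verbatim answer scores 1.0; the equally correct
# paraphrases score 0.0, so the metric underestimates competence.
```

Softer metrics such as n-gram overlap or embedding similarity reduce this effect, but they still cannot certify that a differently structured answer is actually correct.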
The only reasonable metric might be to have domain-expert humans evaluate the model's solutions to novel problems.