Benchmarking LLMs is an extremely difficult problem.
There is a wide gap between what is easy, what a human can always do, and what AGI will one day do.
Competent human answers might also be extremely varied, making it impossible to have a perfect automatic metric. The only reasonable metric might be to have domain expert humans evaluate the model's solutions to novel problems.
This pattern works well:
# Assumes an interactive program showing the prompt ">>> " was spawned earlier in the script.
set prompt ">>> "
# Do not echo the spawned program's output; print only what we capture below.
log_user 0
send "What is quantum field theory?\r"
# Capture everything up to the next prompt.
expect -re "(.+)$prompt"
# Drop the first line (the echoed command), strip trailing \r, print the rest.
puts -nonewline [join [lrange [lmap line [split $expect_out(1,string) \n] {regsub {\r$} $line ""}] 1 end] "\n"]
Then stdout will contain only the output of the command and nothing else.
Bibliography:
- unix.stackexchange.com/questions/239161/get-the-output-from-expect-script-in-a-variable/792645#792645
- stackoverflow.com/questions/45210358/expect-output-only-stdout-of-the-command-and-nothing-else/79517903#79517903
- stackoverflow.com/questions/57975853/how-to-read-the-send-command-output-in-expect-script (the title is misleading: the OP apparently wants the exit status, not stdout)
Object detection model.
You can get some really sweet pre-trained versions of this, typically trained on the COCO dataset.
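For example, here is a minimal sketch of loading one such COCO pre-trained detector with torchvision and running it on a dummy image (the exact weights argument depends on the torchvision version; this assumes a reasonably recent one):
import torch
import torchvision

# Faster R-CNN pre-trained on COCO; "weights" replaced the older "pretrained" flag
# in recent torchvision versions.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A random 3-channel image just to show the input/output shapes.
image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([image])

# One dict per input image, with COCO label indices, bounding boxes and confidence scores.
print(predictions[0]["labels"])
print(predictions[0]["boxes"])
print(predictions[0]["scores"])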
As Ciro Santilli starts typing in the omnibox, sometimes the window freezes and the dreaded "is not responding" dialog shows up. This has been happening a few times a day on Ubuntu 24.10 with Chromium 133, and it also happened on previous versions of Ubuntu and Chromium.
The only somewhat similar reports that Ciro Santilli could find as of 2025:
Their most interesting subset, the -hard one, appears to be present at: huggingface.co/datasets/bigcode/bigcodebench-hard in Parquet format. OMG why.
The tests make free usage of the Python standard library and other major external libraries, e.g. huggingface.co/datasets/bigcode/bigcodebench-hard/viewer/default/v0.1.0_hf?views%5B%5D=v010_hf&row=0 uses ftplib. Kind of cool.
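For reference, a minimal sketch of loading those Parquet files with the Hugging Face datasets library; the split name is an assumption taken from the viewer URL above:
from datasets import load_dataset

# Split name guessed from the dataset viewer URL; adjust if the dataset uses another one.
ds = load_dataset("bigcode/bigcodebench-hard", split="v0.1.0_hf")
print(len(ds))
print(ds[0].keys())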
They even test graph plotting, e.g. huggingface.co/datasets/bigcode/bigcodebench-hard/viewer/default/v0.1.0_hf?views%5B%5D=v010_hf&row=11. How do they evaluate that?
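One plausible way such plotting tasks can be tested automatically, an assumption here rather than something verified against that row: the task function returns the matplotlib Axes it drew on, and the unit test asserts on the Axes' properties without ever rendering anything:
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

def task_func(data):
    # Hypothetical task: plot the data and return the Axes for inspection.
    fig, ax = plt.subplots()
    ax.plot(data)
    ax.set_title("Data")
    return ax

# The test can then check the plot's structure instead of its pixels.
ax = task_func([1, 2, 3])
assert ax.get_title() == "Data"
assert len(ax.lines) == 1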
The tests are present in a gzipped JSONL file inside the Git repo: github.com/openai/human-eval/blob/master/data/HumanEval.jsonl.gz
To get a quick overview of the problems with jq, after extracting the file (e.g. gunzip --keep HumanEval.jsonl.gz):
jq -r '"==== \(.task_id) \(.entry_point)\n\(.prompt)"' <HumanEval.jsonl
The first two problems are:
==== HumanEval/0 has_close_elements
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
==== HumanEval/1 separate_paren_groups
from typing import List

def separate_paren_groups(paren_string: str) -> List[str]:
    """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
    separate those group into separate strings and return the list of those.
    Separate groups are balanced (each open brace is properly closed) and not nested within each other
    Ignore any spaces in the input string.
    >>> separate_paren_groups('( ) (( )) (( )( ))')
    ['()', '(())', '(()())']
    """
So we understand that it takes as input an empty function with a docstring, and you have to fill in the function body.
The paper also shows that there can be other defined functions besides the one you have to implement.
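For the record, a minimal sketch of how a candidate completion can be checked against a problem, assuming the JSONL fields prompt, test and entry_point; the official harness in the repo does essentially this concatenation, but inside a sandboxed subprocess with a timeout:
import gzip
import json

with gzip.open("HumanEval.jsonl.gz", "rt") as f:
    problem = json.loads(f.readline())  # HumanEval/0, has_close_elements

# A hypothetical (correct) model completion for the function body.
completion = (
    "    for i, a in enumerate(numbers):\n"
    "        for b in numbers[i + 1:]:\n"
    "            if abs(a - b) < threshold:\n"
    "                return True\n"
    "    return False\n"
)

# prompt + completion + test defines both the candidate function and a check() function.
program = (
    problem["prompt"]
    + completion
    + "\n"
    + problem["test"]
    + f"\ncheck({problem['entry_point']})\n"
)
exec(program)  # raises AssertionError if the completion fails the tests
print("passed")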
It appears to be a rather small number of newly created problems: 164 of them.
- OpenAI's GPT-4 Turbo can generate and run Python code if it detects that the prompt would be better answered with Python, e.g. for maths questions