LLM benchmark Created 2025-03-20 Updated 2025-07-16
Benchmarking LLMs is extremely difficult.
Depending on the question asked, LLMs are the type of GenAI that comes closest to AGI.
There is therefore a wide gap between what is easy for them, what a competent human can always do, and what AGI will one day do.
Competent human answers can also be extremely varied, making a perfect automatic metric impossible. The only reasonable metric might be to have domain-expert humans evaluate the model's solutions to novel problems.
Get output of send command on expect Created 2025-03-20 Updated 2025-07-16
This pattern works well:
# Assumes the interactive program was already spawned, e.g. an LLM REPL
# such as: spawn ollama run llama3 (hypothetical, any ">>> " REPL works)
set prompt ">>> "
# Suppress echoing of the interaction to stdout.
log_user 0
send "What is quantum field theory?\r"
# Capture everything up to the next prompt.
expect -re "(.+)$prompt"
# Strip trailing \r from each line and drop the first line, which is the echoed command.
puts -nonewline [join [lrange [lmap line [split $expect_out(1,string) \n] {regsub {\r$} $line ""}] 1 end] "\n"]
Then stdout will contain only the output of the command and nothing else.
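For comparison, a sketch of the same pattern in Python with the pexpect library rather than Tcl Expect (assumptions: pexpect is installed, and the spawned command is a hypothetical LLM REPL whose prompt is ">>> "):
import pexpect

PROMPT = ">>> "
# Hypothetical REPL command; generous timeout since LLMs can be slow.
child = pexpect.spawn("ollama run llama3", encoding="utf-8", timeout=120)
# Wait for the first prompt before sending anything.
child.expect_exact(PROMPT)
child.sendline("What is quantum field theory?")
# Everything printed before the next prompt lands in child.before.
child.expect_exact(PROMPT)
# Drop the first line, which is the echoed command, like the Tcl version.
print("\n".join(child.before.splitlines()[1:]))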
You Only Look Once Created 2025-03-20 Updated 2025-07-16
You can get some really sweet pre-trained versions of this, typically trained on the COCO dataset.
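As a minimal sketch of using one of those pre-trained models, assuming the third-party ultralytics Python package and its COCO-pretrained yolov8n.pt weights (both assumptions; other YOLO releases work similarly):
from ultralytics import YOLO

# Downloads COCO-pretrained weights on first use.
model = YOLO("yolov8n.pt")
# Run detection on any test image.
results = model("bus.jpg")
for box in results[0].boxes:
    # Print class name and confidence for each detected object.
    print(results[0].names[int(box.cls)], float(box.conf))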
AlexNet Created 2025-03-20 Updated 2025-07-16
Became notable for performing extremely well on the ImageNet challenge (ILSVRC) starting in 2012.
It is also notable for being one of the first architectures to make successful use of GPU training rather than CPU training.
Expect HOWTO Created 2025-03-20 Updated 2025-07-16
Expect Created 2025-03-20 Updated 2025-07-16
List of convolutional neural networks Created 2025-03-20 Updated 2025-07-16
Value of life Created 2025-03-20 Updated 2025-07-16
Chromium bug Created 2025-03-20 Updated 2025-07-16
HumanEval Created 2025-03-20 Updated 2025-07-16
The test problems are present in a gzip inside the Git repo: github.com/openai/human-eval/blob/master/data/HumanEval.jsonl.gz
To get a quick overview of the problems with jq, after extracting the gzip:
jq -r '"==== \(.task_id) \(.entry_point)\n\(.prompt)"' <HumanEval.jsonl
The first two problems are:
==== HumanEval/0 has_close_elements
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

==== HumanEval/1 separate_paren_groups
from typing import List


def separate_paren_groups(paren_string: str) -> List[str]:
    """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
    separate those group into separate strings and return the list of those.
    Separate groups are balanced (each open brace is properly closed) and not nested within each other
    Ignore any spaces in the input string.
    >>> separate_paren_groups('( ) (( )) (( )( ))')
    ['()', '(())', '(()())']
    """
So we understand that the model takes as input an empty function with a docstring, and has to fill in the function body.
The paper also shows that there can be other defined functions besides the one you have to implement.
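The evaluation flow, as sketched in the repo's README, is to generate completions for every task, dump them to a JSONL file, and score that file with the repo's evaluate_functional_correctness script; generate_one_completion below is a hypothetical stand-in for your model call:
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt):
    # Hypothetical: call your LLM here and return only the function body.
    raise NotImplementedError

# task_id -> problem dict with "prompt", "entry_point", etc.
problems = read_problems()
samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Then score with: evaluate_functional_correctness samples.jsonl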
Image segmentation Created 2025-03-20 Updated 2025-07-16
AI code generation framework that tries to run code Created 2025-03-20 Updated 2025-07-16
  • OpenAI's GPT-4-turbo can generate and run Python code if it detects that the prompt would be better answered by Python, e.g. for maths questions.
Fastest gun in the West problem Created 2025-03-20 Updated 2025-07-16
