Bibliography:
The test cases are present as a gzipped JSONL file inside the Git repo: github.com/openai/human-eval/blob/master/data/HumanEval.jsonl.gz
To get a quick overview of the problems with jq (after decompressing the file):
jq -r '"==== \(.task_id) \(.entry_point)\n\(.prompt)"' <HumanEval.jsonl
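Or the same thing in Python, reading the gzipped file directly (quick sketch, assuming HumanEval.jsonl.gz sits in the current directory):

import gzip
import json

# Print the same quick overview as the jq one-liner above.
with gzip.open("HumanEval.jsonl.gz", "rt") as f:
    for line in f:
        task = json.loads(line)
        print(f"==== {task['task_id']} {task['entry_point']}")
        print(task["prompt"])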
The first two problems are shown below; from them we understand that each task gives you an empty function with a docstring as input, and you have to fill in the function body.
==== HumanEval/0 has_close_elements
from typing import List
def has_close_elements(numbers: List[float], threshold: float) -> bool:
""" Check if in given list of numbers, are any two numbers closer to each other than
given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
"""
==== HumanEval/1 separate_paren_groups
from typing import List
def separate_paren_groups(paren_string: str) -> List[str]:
""" Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
separate those group into separate strings and return the list of those.
Separate groups are balanced (each open brace is properly closed) and not nested within each other
Ignore any spaces in the input string.
>>> separate_paren_groups('( ) (( )) (( )( ))')
['()', '(())', '(()())']
"""
The paper also shows that the prompt can contain other pre-defined functions besides the one you have to implement.
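The JSONL entries also carry canonical_solution and test fields, so a naive version of the evaluation is just to concatenate prompt + completion + tests and execute the result. A rough sketch only: the official harness in the repo additionally sandboxes the execution and applies timeouts.

import gzip
import json

def check(task, completion):
    # Build one program: signature + docstring from the prompt, the model's
    # body, then the dataset's test code, which defines check(candidate).
    program = (
        task["prompt"]
        + completion
        + "\n"
        + task["test"]
        + "\n"
        + f"check({task['entry_point']})\n"
    )
    try:
        exec(program, {})  # WARNING: executes untrusted code, no sandbox
        return True
    except Exception:
        return False

# Sanity check: every canonical solution should pass its own tests.
with gzip.open("HumanEval.jsonl.gz", "rt") as f:
    for line in f:
        task = json.loads(line)
        assert check(task, task["canonical_solution"])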
This one focuses on improving the speed of important numerical algorithms compared to popular existing implementations.
The general pattern can be seen by looking at one of the optimization runs, e.g.: algotune.io/aes_gcm_encryption_anthropic_claude-opus-4-1-20250805.html which shows the chat that the system had with the model.
They define an OS-like interface right in the prompt for editing files and running code, and at each stage they tell the model how many credits are left for the given API and what the speedup is so far. Amazing. Each task has a $1 budget per provider. Their software then parses commands out of the LLM output and sends formatted responses back. Quite amazing that it works at all.
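I haven't read their actual implementation, so the command syntax below is made up, but the parsing step they describe is presumably something along these lines:

import re

def parse_command(reply: str):
    # Pull the first fenced command block out of a raw LLM reply,
    # e.g. a block starting with ```run or ```edit (hypothetical syntax).
    match = re.search(r"```(\w+)\n(.*?)\n```", reply, re.DOTALL)
    if match is None:
        return None
    kind, body = match.groups()
    return kind, body

reply = """Let me benchmark that.
```run
python solver.py --check
```"""
print(parse_command(reply))  # ('run', 'python solver.py --check')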
All the code seems to be in Python, and the speedups come mainly from reaching for more advanced external computing machinery: compiling with Cython, calling faster pre-compiled external libraries, or parallelizing more. So it is not that impressive from a purely algorithmic point of view, but it is not bad either.
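To illustrate the pattern with a made-up example: the same arithmetic done in a pure-Python loop versus a single pre-compiled NumPy call.

import numpy as np

def dot_python(a, b):
    # Baseline: interpreted Python loop.
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_numpy(a, b):
    # "Optimized": same computation, but done in pre-compiled C code.
    return float(np.dot(a, b))

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
assert np.isclose(dot_python(a, b), dot_numpy(a, b))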
Correctness is checked automatically by comparing the output of the optimized solution to that of the original non-optimized one, likely on a fixed set of inputs.
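I.e. presumably something along these lines, with made-up stand-ins for the two solutions:

import numpy as np

def reference_solve(x):
    # Stand-in for the original, slow implementation.
    return sorted(x)

def optimized_solve(x):
    # Stand-in for the LLM's faster implementation.
    return list(np.sort(x))

# Compare the two on a batch of random inputs.
rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.random(1000).tolist()
    assert np.allclose(reference_solve(x), optimized_solve(x))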
Their most interesting subset, BigCodeBench-Hard, appears to be present at: huggingface.co/datasets/bigcode/bigcodebench-hard in Parquet format. OMG why. The tests make free use of the Python standard library and other major external libraries, e.g. huggingface.co/datasets/bigcode/bigcodebench-hard/viewer/default/v0.1.0_hf?views%5B%5D=v010_hf&row=0 uses ftplib. Kind of cool.
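A sketch of loading it with the Hugging Face datasets library; I haven't verified the exact split name, it is just what the viewer URL above suggests:

from datasets import load_dataset

# Split name taken from the dataset viewer URL; may need adjusting.
ds = load_dataset("bigcode/bigcodebench-hard", split="v0.1.0_hf")
print(ds.column_names)
print(ds[0])  # first task as a plain dict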
They even test graph plotting? huggingface.co/datasets/bigcode/bigcodebench-hard/viewer/default/v0.1.0_hf?views%5B%5D=v010_hf&row=11 How do they even evaluate that?
By Princeton people.
This one aims to solve GitHub issues. It appears to contain 2,294 real-world GitHub issues and their corresponding pull requests.
Evaluation is simply based on "does the proposed patch make some pre-written failing test cases pass".
The dataset appears to be at: huggingface.co/datasets/princeton-nlp/SWE-bench in Parquet format.
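A sketch of poking at it with the datasets library, assuming the usual test split and the column names shown in the dataset viewer:

from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(ds))  # should be the 2,294 issue instances
row = ds[0]
print(row["repo"], row["instance_id"])
print(row["problem_statement"][:500])  # the GitHub issue text
print(row["FAIL_TO_PASS"])  # the pre-written tests the patch must make pass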
Tasks from Upwork.