mlcommons.org/en/ Their homepage is not amazingly organized, but it does the job.

Benchmark focused on deep learning. It has two parts:

training: produces a trained network
inference: uses the trained network

Furthermore, a specific network model is specified for each benchmark in the closed category: so it goes beyond just specifying the dataset.

Results can be seen e.g. at:

Those URLs broke as of 2025 of course, now you have to click on their Tableau down to the 2.1 round and there's no fixed URL for it:

And there are also separate repositories for each:

E.g. on mlcommons.org/en/training-normal-21/ we can see what the the benchmarks are:

Dataset	Model
ImageNet	ResNet
KiTS19	3D U-Net
OpenImages	RetinaNet
COCO dataset	Mask R-CNN
LibriSpeech	RNN-T
Wikipedia	BERT
1TB Clickthrough	DLRM
Go	MiniGo

MLperf v2.1 ResNet

 0  0

Instructions at:

Ubuntu 22.10 setup with tiny dummy manually generated ImageNet and run on ONNX:

sudo apt install pybind11-dev

git clone https://github.com/mlcommons/inference
cd inference
git checkout v2.1

virtualenv -p python3 .venv
. .venv/bin/activate
pip install numpy==1.24.2 pycocotools==2.0.6 onnxruntime==1.14.1 opencv-python==4.7.0.72 torch==1.13.1

cd loadgen
CFLAGS="-std=c++14" python setup.py develop
cd -

cd vision/classification_and_detection
python setup.py develop
wget -q https://zenodo.org/record/3157894/files/mobilenet_v1_1.0_224.onnx
export MODEL_DIR="$(pwd)"
export EXTRA_OPS='--time 10 --max-latency 0.2'

tools/make_fake_imagenet.sh
DATA_DIR="$(pwd)/fake_imagenet" ./run_local.sh onnxruntime mobilenet cpu --accuracy

Last line of output on P51, which appears to contain the benchmark results

TestScenario.SingleStream qps=58.85, mean=0.0138, time=0.136, acc=62.500%, queries=8, tiles=50.0:0.0129,80.0:0.0137,90.0:0.0155,95.0:0.0171,99.0:0.0184,99.9:0.0187

where presumably qps means queries per second, and is the main results we are interested in, the more the better.

Running:

tools/make_fake_imagenet.sh

produces a tiny ImageNet subset with 8 images under fake_imagenet/.

fake_imagenet/val_map.txt contains:

val/800px-Porsche_991_silver_IAA.jpg 817
val/512px-Cacatua_moluccensis_-Cincinnati_Zoo-8a.jpg 89
val/800px-Sardinian_Warbler.jpg 13
val/800px-7weeks_old.JPG 207
val/800px-20180630_Tesla_Model_S_70D_2015_midnight_blue_left_front.jpg 817
val/800px-Welsh_Springer_Spaniel.jpg 156
val/800px-Jammlich_crop.jpg 233
val/782px-Pumiforme.JPG 285

where the numbers are the category indices from ImageNet1k. At gist.github.com/yrevar/942d3a0ac09ec9e5eb3a see e.g.:

817: 'sports car, sport car',
89: 'sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita',

and so on, so they are coherent with the image names. By quickly looking at the script we see that it just downloads from Wikimedia and manually creates the file.

TODO prepare and test on the actual ImageNet validation set, README says:

Prepare the imagenet dataset to come.

Since that one is undocumented, let's try the COCO dataset instead, which uses COCO 2017 and is also a bit smaller. Note that his is not part of MLperf anymore since v2.1, only ImageNet and open images are used. But still:

wget https://zenodo.org/record/4735652/files/ssd_mobilenet_v1_coco_2018_01_28.onnx
DATA_DIR_BASE=/mnt/data/coco
export DATA_DIR="${DATADIR_BASE}/val2017-300"
mkdir -p "$DATA_DIR_BASE"
cd "$DATA_DIR_BASE"
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip val2017.zip
unzip annotations_trainval2017.zip
mv annotations val2017
cd -
cd "$(git-toplevel)"
python tools/upscale_coco/upscale_coco.py --inputs "$DATA_DIR_BASE" --outputs "$DATA_DIR" --size 300 300 --format png
cd -

Now:

./run_local.sh onnxruntime mobilenet cpu --accuracy

fails immediately with:

No such file or directory: '/path/to/coco/val2017-300/val_map.txt

The more plausible looking:

./run_local.sh onnxruntime mobilenet cpu --accuracy --dataset coco-300

first takes a while to preprocess something most likely, which it does only one, and then fails:

Traceback (most recent call last):
  File "/home/ciro/git/inference/vision/classification_and_detection/python/main.py", line 596, in <module>
    main()
  File "/home/ciro/git/inference/vision/classification_and_detection/python/main.py", line 468, in main
    ds = wanted_dataset(data_path=args.dataset_path,
  File "/home/ciro/git/inference/vision/classification_and_detection/python/coco.py", line 115, in __init__
    self.label_list = np.array(self.label_list)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (5000, 2) + inhomogeneous part.

TODO!

Run MLperf v2.1 ResNet on Imagenette

 0  0

Let's run on this Imagenet10 subset called Imagenette.

First ensure that you get the dummy test data run working as per MLperf v2.1 ResNet.

Next, in the imagenette2 directory, first let's create a 224x224 scaled version of the inputs as required by the benchmark at mlcommons.org/en/inference-datacenter-21/:

#!/usr/bin/env bash
rm -rf val224x224
mkdir -p val224x224
for syndir in val/*: do
  syn="$(dirname $syndir)"
  for img in "$syndir"/*; do
    convert "$img" -resize 224x224 "val224x224/$syn/$(basename "$img")"
  done
done

and then let's create the val_map.txt file to match the format expected by MLPerf:

#!/usr/bin/env bash
wget https://gist.githubusercontent.com/aaronpolhamus/964a4411c0906315deb9f4a3723aac57/raw/aa66dd9dbf6b56649fa3fab83659b2acbf3cbfd1/map_clsloc.txt
i=0
rm -f val_map.txt
while IFS="" read -r p || [ -n "$p" ]; do
  synset="$(printf '%s\n' "$p" | cut -d ' ' -f1)"
  if [ -d "val224x224/$synset" ]; then
    for f in "val224x224/$synset/"*; do
      echo "$f $i" >> val_map.txt
    done
  fi
  i=$((i + 1))
done < <( sort map_clsloc.txt )

then back on the mlperf directory we download our model:

wget https://zenodo.org/record/4735647/files/resnet50_v1.onnx

and finally run!

DATA_DIR=/mnt/sda3/data/imagenet/imagenette2 time ./run_local.sh onnxruntime resnet50 cpu --accuracy

which gives on P51:

TestScenario.SingleStream qps=164.06, mean=0.0267, time=23.924, acc=87.134%, queries=3925, tiles=50.0:0.0264,80.0:0.0275,90.0:0.0287,95.0:0.0306,99.0:0.0401,99.9:0.0464

where qps presumably means "querries per second". And the time results:

446.78user 33.97system 2:47.51elapsed 286%CPU (0avgtext+0avgdata 964728maxresident)k

The time=23.924 is much smaller than the time executable because of some lengthy pre-loading (TODO not sure what that means) that gets done every time:

INFO:imagenet:loaded 3925 images, cache=0, took=52.6sec
INFO:main:starting TestScenario.SingleStream

Let's try on the GPU now:

DATA_DIR=/mnt/sda3/data/imagenet/imagenette2 time ./run_local.sh onnxruntime resnet50 gpu --accuracy

which gives:

TestScenario.SingleStream qps=130.91, mean=0.0287, time=29.983, acc=90.395%, queries=3925, tiles=50.0:0.0265,80.0:0.0285,90.0:0.0405,95.0:0.0425,99.0:0.0490,99.9:0.0512
455.00user 4.96system 1:59.43elapsed 385%CPU (0avgtext+0avgdata 975080maxresident)k

TODO lower qps on GPU!