{file}
{tag=CPU microbenchmark}

This is the only way that we've managed to reliably get a loop with a single `inc` instruction: by using <inline assembly>, e.g. on <x86> we do:
``
loop:
  inc %[i];
  cmp %[max], %[i];
  jb loop;
``
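As a minimal sketch of how the loop above could be embedded in C with GCC extended inline assembly (the function name `inc_loop`, the numeric label `1:`, and the non-x86 fallback are assumptions made for illustration, not necessarily what `c/inc_loop_asm.c` does):

```c
#include <stdint.h>

/* Counts from 0 up to max using one inc per loop iteration.
 * Hypothetical helper, not taken from c/inc_loop_asm.c. */
uint64_t inc_loop(uint64_t max) {
    uint64_t i = 0;
    /* The asm loop is do-while shaped, so handle max == 0 separately. */
    if (max == 0) return 0;
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile (
        "1:\n\t"                 /* local numeric label instead of "loop" */
        "inc %[i]\n\t"           /* i++ */
        "cmp %[max], %[i]\n\t"   /* AT&T operand order: flags from i - max */
        "jb 1b\n\t"              /* repeat while i < max (unsigned) */
        : [i] "+r" (i)           /* i is read and written, kept in a register */
        : [max] "r" (max)        /* max is read-only, kept in a register */
        : "cc"                   /* inc and cmp modify the flags */
    );
#else
    /* Assumed fallback so the sketch still builds on other targets. */
    do { i++; } while (i < max);
#endif
    return i;
}
```

The numeric label `1:` with `jb 1b` is used here instead of a named `loop:` label because a named label can clash if the compiler emits the asm block more than once, e.g. after inlining. The `"+r"` constraints ask the compiler to keep the operands in registers, so the loop body stays at the three instructions shown.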

For a runtime of about 1s on <Ciro Santilli's hardware/P14s>, <Ubuntu 25.04>, GCC 14.2 -O0, x86_64, we need about 5 billion iterations:
``
time ./inc_loop_asm.out 5000000000
``
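The `time` command measures the whole process. As a rough alternative, the sketch below measures CPU time around the loop only, using the standard C `clock()` function, and converts it to iterations per second. Both `iterations_per_second` and the plain C `counting_loop` stand-in are hypothetical names introduced here. In a real measurement the stand-in would be replaced by the inline assembly loop.

```c
#include <stdint.h>
#include <time.h>

/* Plain C stand-in for the loop under test (assumption: the real
 * benchmark would pass the inline assembly loop instead).
 * volatile keeps the compiler from removing the loop. */
uint64_t counting_loop(uint64_t max) {
    volatile uint64_t i = 0;
    while (i < max) i++;
    return i;
}

/* Runs loop(n) once and returns completed iterations per CPU second,
 * or 0.0 if the run was too short for clock() to resolve. */
double iterations_per_second(uint64_t (*loop)(uint64_t), uint64_t n) {
    clock_t start = clock();
    uint64_t done = loop(n);
    clock_t end = clock();
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0) return 0.0;
    return (double)done / seconds;
}
```

With the numbers quoted above, 5 billion iterations in about 1s corresponds to roughly 5e9 iterations per second.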


c/inc_loop_asm.c

{file}
{tag=CPU microbenchmark}

This is a quick <Microarchitectural benchmark> to try and determine how many <functional units> our CPU has that can do an `inc` instruction at the same time due to <superscalar architecture>.

The generated programs do loops like:
``
loop:
  inc %[i0];
  inc %[i1];
  inc %[i2];
  ...
  inc %[i_n];
  cmp %[max], %[i0];
  jb loop;
``
with different numbers of `inc` instructions.
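As an illustration, one instance of such a generated loop could look like the following sketch for 4 `inc` instructions. The function name `inc_loop_4`, the numeric label, and the non-x86 fallback are assumptions, and the actual programs produced by the generator script may differ. Each `inc` works on its own register, so the counters do not depend on each other and a superscalar CPU is free to execute several of them in the same cycle.

```c
#include <stdint.h>

/* Runs max iterations of a loop with 4 independent inc instructions.
 * Hypothetical helper, returns the sum of all counters (4 * max). */
uint64_t inc_loop_4(uint64_t max) {
    uint64_t i0 = 0, i1 = 0, i2 = 0, i3 = 0;
    /* The asm loop is do-while shaped, so handle max == 0 separately. */
    if (max == 0) return 0;
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile (
        "1:\n\t"
        "inc %[i0]\n\t"           /* each inc uses a different register */
        "inc %[i1]\n\t"
        "inc %[i2]\n\t"
        "inc %[i3]\n\t"
        "cmp %[max], %[i0]\n\t"   /* only i0 controls loop termination */
        "jb 1b\n\t"               /* repeat while i0 < max (unsigned) */
        : [i0] "+r" (i0), [i1] "+r" (i1), [i2] "+r" (i2), [i3] "+r" (i3)
        : [max] "r" (max)
        : "cc"
    );
#else
    /* Assumed fallback so the sketch still builds on other targets. */
    do { i0++; i1++; i2++; i3++; } while (i0 < max);
#endif
    return i0 + i1 + i2 + i3;
}
```

If the runtime for a fixed `max` stays roughly flat as more `inc` instructions are added and then starts to grow, the point where it starts growing suggests how many `inc` instructions the CPU can execute in parallel.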

\Image[https://raw.githubusercontent.com/cirosantilli/media/refs/heads/master/c/inc_loop_asm_n_manual.png]
{title=<c/inc_loop_asm_n.sh>{file} results for a few CPUs}
{description=
Quite clearly:
* <AMD 7840U> can run INC on 4 functional units
* <Intel i7-7820HQ> can run INC on 2 functional units
Both also show effects at low instruction counts that destroy performance: AMD at 3, and Intel at 3 and 5. TODO it would be cool to understand those better.

Data from multiple CPUs was manually collated and plotted with \a[c/inc_loop_asm_n_manual.sh].
}
{height=480}


Ciro Santilli @cirosantilli 37
