c/inc_loop_asm.c
This is the only way we've managed to reliably get a loop with a single inc instruction: inline assembly. E.g. on x86 we do:
loop:
  inc %[i];
  cmp %[max], %[i];
  jb loop;
To run for about 1 s on P14s, Ubuntu 25.04, GCC 14.2, -O0, x86_64, we need about 5 billion iterations:
time ./inc_loop_asm.out 5000000000
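As a rough sketch of how the inc/cmp/jb loop from this section might be embedded in C with GCC-style extended inline assembly (not necessarily how c/inc_loop_asm.c does it): the function name inc_loop is hypothetical, the numeric local label 1: stands in for the named label loop so that the compiler can duplicate the asm safely, and the non-x86 fallback is an added assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper: counts i from 0 up to max, one inc per iteration. */
uint64_t inc_loop(uint64_t max) {
    uint64_t i = 0;
    /* The loop body always runs at least once, so max == 0 is not supported. */
    assert(max > 0);
#if defined(__x86_64__)
    __asm__ volatile (
        "1:\n\t"
        "inc %[i]\n\t"
        "cmp %[max], %[i]\n\t"  /* AT&T operand order: sets flags from i - max */
        "jb 1b\n\t"             /* jump back while i < max (unsigned) */
        : [i] "+r" (i)          /* read and written by the asm */
        : [max] "r" (max)       /* read only */
        : "cc"                  /* condition flags are clobbered */
    );
#else
    /* Assumed fallback for other architectures: here the compiler,
     * not us, decides which instructions implement the loop. */
    while (i < max) i++;
#endif
    return i;
}
```

A main that parses max from argv (e.g. with strtoull) and calls inc_loop(max) could then be timed as in the time command above.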
c/inc_loop_asm_n.sh
This is a quick microarchitectural benchmark to try to determine how many functional units our CPU has that can execute an inc instruction at the same time, thanks to its superscalar architecture.
The generated programs do loops like:
loop:
  inc %[i0];
  inc %[i1];
  inc %[i2];
  ...
  inc %[i_n];
  cmp %[max], %[i0];
  jb loop;
with different numbers of inc instructions.
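To illustrate how such loop bodies could be generated for a given number of inc instructions, here is a minimal sketch in C. It is only an illustration: the actual generation is done by the shell script, and the function name gen_inc_loop, the buffer-based interface and the operand naming i0 to i(n-1) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical generator: writes the asm text of a loop with n inc
 * instructions into buf. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if buf is too small. */
int gen_inc_loop(char *buf, size_t size, int n) {
    size_t used = 0;
    int w;

    assert(n >= 1); /* i0 doubles as the loop counter, so it must exist */

    w = snprintf(buf, size, "loop:\n");
    if (w < 0 || (size_t)w >= size) return -1;
    used = (size_t)w;

    /* One inc per independent counter i0, i1, ... */
    for (int k = 0; k < n; k++) {
        w = snprintf(buf + used, size - used, "  inc %%[i%d];\n", k);
        if (w < 0 || (size_t)w >= size - used) return -1;
        used += (size_t)w;
    }

    /* Only i0 is compared against max to decide when to stop. */
    w = snprintf(buf + used, size - used,
                 "  cmp %%[max], %%[i0];\n  jb loop;\n");
    if (w < 0 || (size_t)w >= size - used) return -1;
    used += (size_t)w;

    return (int)used;
}
```

Each counter uses its own register, so the inc instructions have no data dependencies on each other within an iteration, which is what lets a superscalar CPU issue several of them in the same cycle.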
Figure 1. c/inc_loop_asm_n.sh results for a few CPUs.
Quite clearly:
and both have low instruction count effects that destroy performance, AMD at 3 and Intel at 3 and 5. TODO it would be cool to understand those better.
Data from multiple CPUs was collated and plotted manually with c/inc_loop_asm_n_manual.sh.