Ubuntu 25.04 GCC 14.2 -O0 x86_64 produces a horrendous loop that goes through the stack on every iteration:
    11c8:       48 83 45 f0 01          addq   $0x1,-0x10(%rbp)
    11cd:       48 8b 45 f0             mov    -0x10(%rbp),%rax
    11d1:       48 3b 45 e8             cmp    -0x18(%rbp),%rax
    11d5:       72 f1                   jb     11c8 <main+0x7f>
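For reference, here is a minimal sketch of C source that would plausibly compile to a loop like the one above at -O0. The real inc_loop.c is not shown here, so the function name `inc_loop` and the exact types are assumptions; the unsigned 64-bit counters are inferred from the `addq` and the unsigned `jb` comparison.

```c
#include <stdint.h>

/* Hypothetical reconstruction, not the actual inc_loop.c.
 * Counts i from 0 up to max, one increment per iteration.
 * At -O0 both i and max live on the stack, hence the memory operands. */
uint64_t inc_loop(uint64_t max) {
    uint64_t i;
    for (i = 0; i < max; i++)
        ; /* empty body: the increment is the whole workload */
    return i;
}
```

In the real program, max presumably comes from argv[1] via strtoll, since a strtoll call shows up in the -O3 disassembly.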
To run for about 1s on P14s we need 2.5 billion iterations:
time ./inc_loop.out 2500000000
and running it under perf stat:
perf stat ./inc_loop.out 2500000000
gives:
          1,052.22 msec task-clock                       #    0.998 CPUs utilized             
                23      context-switches                 #   21.858 /sec                      
                12      cpu-migrations                   #   11.404 /sec                      
                60      page-faults                      #   57.022 /sec                      
    10,015,198,766      instructions                     #    2.08  insn per cycle            
                                                  #    0.00  stalled cycles per insn   
     4,803,504,602      cycles                           #    4.565 GHz                       
        20,705,659      stalled-cycles-frontend          #    0.43% frontend cycles idle      
     2,503,079,267      branches                         #    2.379 G/sec                     
           396,228      branch-misses                    #    0.02% of all branches
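As a sanity check on the counters above, assuming the command-line argument 2500000000 is the iteration count, the ratios work out as:

```latex
\begin{aligned}
\text{instructions per iteration} &= \frac{10{,}015{,}198{,}766}{2{,}500{,}000{,}000} \approx 4.01 \\
\text{cycles per iteration} &= \frac{4{,}803{,}504{,}602}{2{,}500{,}000{,}000} \approx 1.92 \\
\text{instructions per cycle} &= \frac{10{,}015{,}198{,}766}{4{,}803{,}504{,}602} \approx 2.08
\end{aligned}
```

About 4 instructions per iteration matches the four-instruction loop body in the disassembly, and the last ratio matches the 2.08 insn per cycle that perf reports.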
With -O3 it manages to eliminate the loop entirely (not unroll it), producing:
    1078:       e8 d3 ff ff ff          call   1050 <strtoll@plt>
}
    107d:       5a                      pop    %rdx
    107e:       c3                      ret
so it is smart enough to just return the return value of strtoll directly, as is, in rax.
Inline assembly is the only way we've managed to reliably get a single inc instruction loop, e.g. on x86 we do:
loop:
  inc %[i];
  cmp %[max], %[i];
  jb loop;
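A minimal sketch of how such a loop could be wrapped in GCC extended inline assembly follows. The function name `inc_loop_asm` is hypothetical, and the numeric local label `1:` is swapped in for `loop:` so that the block can be inlined or duplicated without symbol clashes. A plain C fallback is included for non-x86_64 targets.

```c
#include <stdint.h>

/* Hypothetical wrapper, not the actual source of inc_loop_asm.out.
 * Runs inc/cmp/jb until i reaches max. The body runs at least once
 * (do-while semantics), so max == 0 returns 1. */
uint64_t inc_loop_asm(uint64_t max) {
    uint64_t i = 0;
#if defined(__x86_64__)
    __asm__ volatile (
        "1:\n\t"
        "inc %[i]\n\t"          /* i++ */
        "cmp %[max], %[i]\n\t"  /* compare i with max (AT&T operand order) */
        "jb 1b\n\t"             /* loop while i < max, unsigned */
        : [i] "+r" (i)          /* i is read and written, kept in a register */
        : [max] "r" (max)
        : "cc"                  /* flags are clobbered by inc and cmp */
    );
#else
    /* Fallback with the same semantics for other architectures. */
    do {
        i++;
    } while (i < max);
#endif
    return i;
}
```

The "+r" constraint keeps the counter in a register, which is what avoids the stack traffic seen in the -O0 output.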
For 1s on P14s Ubuntu 25.04 GCC 14.2 -O0 x86_64 we need about 5 billion iterations:
time ./inc_loop_asm.out 5000000000
This is a quick microarchitectural benchmark that tries to determine how many functional units of our CPU can execute an inc instruction at the same time, thanks to its superscalar architecture.
The generated programs do loops like:
loop:
  inc %[i0];
  inc %[i1];
  inc %[i2];
  ...
  inc %[i_n];
  cmp %[max], %[i0];
  jb loop;
with different numbers of inc instructions.
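To illustrate how the loop body could be generated for a given number of inc instructions, here is a small sketch in C. The actual generator is the shell script c/inc_loop_asm_n.sh, whose contents are not shown here, so the function `gen_inc_loop` and its interface are purely hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical generator: writes a loop body with n independent inc
 * instructions (operands i0 .. i(n-1)) into buf.
 * Returns the number of characters written, or -1 if buf is too small. */
int gen_inc_loop(char *buf, size_t size, unsigned n) {
    size_t off = 0;
    int r = snprintf(buf, size, "loop:\n");
    if (r < 0 || (size_t)r >= size) return -1;
    off += (size_t)r;
    for (unsigned k = 0; k < n; k++) {
        /* One inc per counter; the counters do not depend on each other. */
        r = snprintf(buf + off, size - off, "  inc %%[i%u];\n", k);
        if (r < 0 || (size_t)r >= size - off) return -1;
        off += (size_t)r;
    }
    /* Only i0 is compared against max to decide when to stop. */
    r = snprintf(buf + off, size - off, "  cmp %%[max], %%[i0];\n  jb loop;\n");
    if (r < 0 || (size_t)r >= size - off) return -1;
    off += (size_t)r;
    return (int)off;
}
```

Because the counters are independent, a superscalar CPU is free to execute several of the inc instructions in the same cycle. In practice n is bounded by the number of free general-purpose registers if every counter is to stay in a register.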
Figure 1. c/inc_loop_asm_n.sh results for a few CPUs.
Quite clearly:
and both have low-instruction-count effects that destroy performance: AMD at 3, and Intel at 3 and 5. TODO: it would be cool to understand those better.
Data from multiple CPUs was manually collated and plotted with c/inc_loop_asm_n_manual.sh.
