Quantum Information course of the University of Oxford Hilary 2023 / 1 / 1 / a
It is the norm induced by the complex dot product over :
CPU functional unit
Cavendish Professor of Physics
As beautifully put in The Eighth Day of Creation:
For more than a hundred years, the Cavendish Professorship has been the chair of experimental physics in the University of Cambridge. The man in that chair rules the university's research in physics. Indeed, for most of that hundred years the Cavendish Professor was preeminent in British science, with an authority that made him, as it were, the archbishop of physics
Superscalar processor
University of Cambridge alumnus
Intel CPU
C-peptide
B chain of insulin
A chain of insulin
Proinsulin
c/inc_loop_asm_n.sh
This is a quick microarchitectural benchmark that tries to determine how many functional units of our CPU can execute an inc instruction at the same time, thanks to its superscalar architecture.
The generated programs do loops like:
loop:
  inc %[i0];
  inc %[i1];
  inc %[i2];
  ...
  inc %[i_n];
  cmp %[max], %[i0];
  jb loop;
with different numbers of inc instructions.
Figure 1. c/inc_loop_asm_n.sh results for a few CPUs.
Quite clearly:
and both have low instruction count effects that destroy performance, AMD at 3 and Intel at 3 and 5. TODO it would be cool to understand those better.
Data from multiple CPUs was manually collated and plotted with c/inc_loop_asm_n_manual.sh.
c/inc_loop_asm.c
This is the only way that we've managed to reliably get a single inc instruction loop, by using inline assembly, e.g. on x86 we do:
loop:
  inc %[i];
  cmp %[max], %[i];
  jb loop;
To run for about 1s on P14s with Ubuntu 25.04 GCC 14.2 -O0 x86_64 we need about 5 billion iterations:
time ./inc_loop_asm.out 5000000000
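For reference, a minimal sketch of how such a loop might be wrapped in GCC/Clang extended inline assembly. This is not the actual c/inc_loop_asm.c: the function name inc_loop_asm, the default iteration count and the non-x86 fallback are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: increment i until it reaches max, return the final i. */
uint64_t inc_loop_asm(uint64_t max) {
    uint64_t i = 0;
#if defined(__x86_64__)
    __asm__ volatile (
        "1:\n\t"                  /* local numeric label instead of "loop" */
        "inc %[i]\n\t"
        "cmp %[max], %[i]\n\t"    /* AT&T operand order: compares i with max */
        "jb 1b\n\t"               /* loop again while i < max (unsigned) */
        : [i] "+r" (i)            /* i is read and written, kept in a register */
        : [max] "r" (max)
        : "cc"                    /* inc and cmp modify the flags */
    );
#else
    /* Assumed portable fallback with the same do-while shape. Unlike the asm
     * version, the compiler is free to transform or remove this loop. */
    do {
        i++;
    } while (i < max);
#endif
    return i;
}

int main(int argc, char **argv) {
    uint64_t max = 1000000; /* arbitrary default, not from the original */
    if (argc > 1) max = strtoull(argv[1], NULL, 0);
    printf("%llu\n", (unsigned long long)inc_loop_asm(max));
    return 0;
}
```

The numeric label 1: with the backward reference 1b avoids duplicate symbol errors if the compiler emits the asm block more than once. The "+r" constraint keeps the counter in a register, so the loop body stays at exactly the three instructions listed above.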
c/inc_loop.c
Ubuntu 25.04 GCC 14.2 -O0 x86_64 produces a horrendous loop that increments the counter in memory on every iteration:
    11c8:       48 83 45 f0 01          addq   $0x1,-0x10(%rbp)
    11cd:       48 8b 45 f0             mov    -0x10(%rbp),%rax
    11d1:       48 3b 45 e8             cmp    -0x18(%rbp),%rax
    11d5:       72 f1                   jb     11c8 <main+0x7f>
To run for about 1s on P14s we need 2.5 billion iterations:
time ./inc_loop.out 2500000000
and:
perf stat ./inc_loop.out 2500000000
gives:
          1,052.22 msec task-clock                       #    0.998 CPUs utilized             
                23      context-switches                 #   21.858 /sec                      
                12      cpu-migrations                   #   11.404 /sec                      
                60      page-faults                      #   57.022 /sec                      
    10,015,198,766      instructions                     #    2.08  insn per cycle            
                                                  #    0.00  stalled cycles per insn   
     4,803,504,602      cycles                           #    4.565 GHz                       
        20,705,659      stalled-cycles-frontend          #    0.43% frontend cycles idle      
     2,503,079,267      branches                         #    2.379 G/sec                     
           396,228      branch-misses                    #    0.02% of all branches
With -O3 GCC manages to optimize the loop away entirely, producing:
    1078:       e8 d3 ff ff ff          call   1050 <strtoll@plt>
}
    107d:       5a                      pop    %rdx
    107e:       c3                      ret
so it is smart enough to just return the return value from strtoll directly, as it is already in rax.
Inline assembly
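A hypothetical C source consistent with both disassembly listings above might look like the sketch below. The actual c/inc_loop.c is not shown here, so the function name count_to, the default value and the exact types are assumptions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Count from 0 up to max with an empty loop body and return the counter. */
uint64_t count_to(uint64_t max) {
    uint64_t i;
    for (i = 0; i < max; i++) {
        /* Intentionally empty: the loop only increments i. */
    }
    /* The loop exits exactly when i == max, so an optimizing compiler may
     * replace the whole loop with "return max". */
    return i;
}

int main(int argc, char **argv) {
    /* Assumed default of 0 so that running without arguments exits with 0. */
    uint64_t max = 0;
    if (argc > 1) max = (uint64_t)strtoll(argv[1], NULL, 0);
    /* Returning the count keeps it observable. Only the low bits of the
     * value survive in the process exit status. */
    return (int)count_to(max);
}
```

Because the counter is unsigned and starts at 0, its final value always equals max. This is why the optimized build can hand the strtoll result straight back without executing any loop.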
