For a "classic" GPT-2-style model, the following estimates the number of multiplications needed to generate one new token.
For each layer (L):
- for each attention head (h):
- K = d_model * d_head (takes embedding of one token and converts to vector of length d_head)
- Q = d_model * d_head (same)
- K Q dot products for the attention pattern: n_ctx * d_head (n_ctx dot products of vectors of size d_head: the new token's Q against every cached K. The new K against earlier tokens' Qs is zeroed out by causality.)
- new value vector for new token: d_model * d_model
- new updates: n_ctx * d_model (multiply each value vector by the new attention column scalar)
- fully connected: d_model * d_ff + d_ff * d_model (converts the embedding to the hidden layer size and then back)
So the total sum is:
L * (
h * (
2 * d_model * d_head +
n_ctx * d_head +
d_model * d_model +
n_ctx * d_model
) +
2 * d_model * d_ff
)
This is coded at: llm_count_mults.py.
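The sum above can be sketched as a small Python function. The function and parameter names below are assumptions for illustration, not necessarily what llm_count_mults.py actually contains; the example values are those of GPT-2 small:

```python
def llm_count_mults(L, h, d_model, d_head, n_ctx, d_ff):
    """Estimate multiplications to generate one new token at full context."""
    per_head = (
        2 * d_model * d_head  # K and Q projections for the new token
        + n_ctx * d_head      # K Q dot products for the attention pattern
        + d_model * d_model   # new value vector for the new token
        + n_ctx * d_model     # value vectors scaled by attention weights
    )
    per_layer = h * per_head + 2 * d_model * d_ff  # all heads + fully connected
    return L * per_layer

# GPT-2 small: 12 layers, 12 heads, d_model=768, d_head=64, n_ctx=1024, d_ff=3072
print(llm_count_mults(12, 12, 768, 64, 1024, 3072))  # 278396928
```

So on the order of 3 * 10^8 multiplications per token for GPT-2 small under this accounting.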
This example attempts to keep temperature to a fixed point by turning on a fan when a thermistor gets too hot.
If you are not somewhere too hot, you can easily test it by holding the thermistor between your fingers: body heat should be enough to turn on the fan.
In Ciro's ASCII art circuit diagram notation:
+----------FAN-----------+
| |
| |
RPI_PICO_W__gnd__gpio26Adc__3.3V@36__gpio2
| | |
| | |
| | |
| +-THERMISTOR
| |
| |
R_10-+
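The control logic could be sketched in Python as follows. The threshold and hysteresis values are hypothetical, and the MicroPython hardware loop is shown only in comments since it needs the actual board; the sketch assumes the thermistor is on the 3.3V side of the divider with R_10 to ground, so the ADC reading rises as temperature rises:

```python
def fan_should_run(adc_u16, currently_on, on_threshold=40_000, hysteresis=2_000):
    """Decide fan state from a raw 16-bit ADC reading.

    Hysteresis keeps the fan from chattering on/off near the threshold:
    once on, it only turns off after the reading drops a bit further.
    """
    if currently_on:
        return adc_u16 > on_threshold - hysteresis
    return adc_u16 > on_threshold

# Hypothetical MicroPython main loop on the Pico W:
# from machine import ADC, Pin
# import time
# adc = ADC(26)          # gpio26Adc in the diagram above
# fan = Pin(2, Pin.OUT)  # gpio2 drives the fan
# on = False
# while True:
#     on = fan_should_run(adc.read_u16(), on)
#     fan.value(on)
#     time.sleep(0.5)
```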
For inferencing just a single prompt, things appear to be very obviously memory bound, i.e. bound by the speed of transferring model parameters from VRAM into GPU cache so they can be used, supposing that the model fits in VRAM, which is the case for many popular models.
It is however possible to make fuller use of the GPU's compute power by running multiple independent queries in parallel: you load a subset of the model weights once, and then use it to advance the inference of multiple input prompts. With this it should be possible to reach full utilization.
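A back-of-the-envelope lower bound on single-prompt latency follows from the observation that every parameter must be read from VRAM at least once per generated token. The model size and bandwidth figures below are just illustrative assumptions:

```python
def min_seconds_per_token(n_params, bytes_per_param, vram_bandwidth_bytes_per_s):
    # Memory-bound floor: every weight crosses VRAM -> GPU once per token.
    return n_params * bytes_per_param / vram_bandwidth_bytes_per_s

# E.g. a 7B-parameter model in fp16 (2 bytes/param) on a GPU with 1 TB/s
# of VRAM bandwidth:
t = min_seconds_per_token(7e9, 2, 1e12)
print(t, 1 / t)  # ~0.014 s/token, i.e. at most ~71 tokens/s for one prompt
```

No amount of extra compute helps a single prompt past this floor; only more bandwidth, fewer bytes per parameter (quantization), or batching does.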
Bibliography:
This can be used to overcome the fact that most single prompt inference will be heavily memory bound, see also: Section "Theoretical peak performance of GPT inference". Batching helps increase GPU compute utilization and balance it against memory bandwidth.
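The effect of batching can be quantified as arithmetic intensity: how many FLOPs each byte of weights loaded from VRAM gets to perform. The ridge-point figure below is a hypothetical value for illustration; real GPUs vary:

```python
def flops_per_weight_byte(batch_size, bytes_per_param=2):
    # Each loaded parameter does one multiply-add (2 FLOPs) per sequence
    # in the batch, so intensity grows linearly with batch size.
    return 2 * batch_size / bytes_per_param

# Hypothetical GPU needing 100 FLOPs per byte loaded to saturate compute:
ridge = 100
for b in (1, 32, 128):
    print(b, flops_per_weight_byte(b), flops_per_weight_byte(b) >= ridge)
```

Under these assumptions, batch size 1 wastes almost all of the compute, and only at around batch size 100 does the workload stop being memory bound.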
This section discusses techniques that can be used to make LLMs infer with lower latency or greater throughput.
There were still no amazing open source implementations as of 2025.
This section is about emulation setups that simulate both the microcontroller as well as the electronics it controls.
Bibliography: