Video 1.
5 Years of GPTs by Finbarr Timbers
. Source. 2023. Good talk.
Video 2.
Attention in transformers, step-by-step by 3Blue1Brown
. Source. 2024. Uses GPT-3 as the basis.
Video 3.
How might LLMs store facts by 3Blue1Brown
. Source. Follow-up to the above video.
When inferencing just a single prompt, things appear to be very clearly memory bound: limited by the speed at which model parameters can be transferred from VRAM into GPU cache so they can be used. This assumes the model fits in VRAM, which is the case for many popular models.
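As a rough sanity check, the memory-bound floor on per-token time can be estimated as weight bytes divided by bandwidth. The numbers below (GPT-2's 124M parameters, fp16 weights, a 1 TB/s VRAM bandwidth) are illustrative assumptions, not measurements:

```python
# Illustrative lower bound: if every parameter must be read from VRAM once
# per generated token, token time >= weight_bytes / bandwidth.
n_params = 124e6          # GPT-2 small parameter count
bytes_per_param = 2       # fp16 weights (assumption)
bandwidth = 1e12          # hypothetical 1 TB/s VRAM bandwidth

weight_bytes = n_params * bytes_per_param
min_time_per_token_s = weight_bytes / bandwidth
print(f"{min_time_per_token_s * 1e3:.3f} ms per token")  # 0.248 ms
```

Even at this small model size, the bound is dominated by data movement rather than arithmetic, which is what "memory bound" means here.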
It is however possible to make fuller use of the GPU's compute power by running multiple independent queries in parallel: each subset of model weights is loaded once and then used to advance the inference of several input prompts at the same time. With enough parallel queries it should be possible to reach full utilization.
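The effect of batching can be framed as arithmetic intensity: each fp16 weight (2 bytes) read from VRAM performs one multiply-add (2 FLOPs) per prompt in the batch, so intensity is roughly batch_size FLOPs/byte. The GPU peak figures below are approximately those of an NVIDIA A100 (~312 TFLOPS fp16, ~2 TB/s HBM) and are assumptions used only to locate the crossover:

```python
# Sketch: batch size at which single-token decoding stops being memory bound.
# Intensity = (2 FLOPs * batch) / (2 bytes) = batch_size FLOPs per byte.
peak_flops = 312e12   # ~A100 fp16 peak (assumption)
bandwidth = 2e12      # ~A100 HBM bandwidth in bytes/s (assumption)

ridge = peak_flops / bandwidth  # FLOPs/byte needed to saturate compute
for batch_size in (1, 16, 156, 512):
    bound = "compute" if batch_size >= ridge else "memory"
    print(f"batch {batch_size:4d}: {bound} bound")
```

Under these assumptions, on the order of 100+ simultaneous prompts are needed before the GPU's arithmetic units, rather than its memory system, become the bottleneck.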
The following estimates the number of multiplications for attention in a "classic" GPT-2-style model.
Summing the per-head attention terms and the feed-forward terms across each of the L layers, the total is:
L * (
  h * (
    2 * d_model * d_head +
    n_ctx * d_head +
    d_model * d_model +
    n_ctx * d_model
  ) +
  2 * d_model * d_ff
)
This is coded at: llm_count_mults.py.
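The sum above translates directly into a few lines of Python. The sketch below (not necessarily identical to llm_count_mults.py) plugs in GPT-2 small's dimensions (L = 12, h = 12, d_model = 768, d_head = 64, d_ff = 3072, n_ctx = 1024):

```python
def count_mults(L, h, d_model, d_head, d_ff, n_ctx):
    """Total multiplication count from the sum above, term by term."""
    per_head = (2 * d_model * d_head
                + n_ctx * d_head
                + d_model * d_model
                + n_ctx * d_model)
    # h attention-head terms plus the two feed-forward matrices per layer
    per_layer = h * per_head + 2 * d_model * d_ff
    return L * per_layer

# GPT-2 small dimensions: 12 layers, 12 heads, d_model 768, context 1024
print(count_mults(L=12, h=12, d_model=768, d_head=64, d_ff=3072, n_ctx=1024))
```

For GPT-2 small this gives 278,396,928 multiplications, i.e. on the order of 10^8 per forward pass through the attention and feed-forward blocks.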
Homepage: www.llama.com/
