The following estimates the number of attention multiplications for a "classic" GPT-2-style model.
Summing the per-layer cost over all L layers, the total is:
L * (
  h * (
    2 * d_model * d_head +
    n_ctx * d_head +
    d_model * d_model +
    n_ctx * d_model
  ) +
  2 * d_model * d_ff
)
This is implemented in llm_count_mults.py.
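As a rough sanity check, here is a minimal Python sketch of the same count (not necessarily identical to llm_count_mults.py), evaluated with GPT-2 small's hyperparameters (L = 12, h = 12, d_model = 768, d_head = 64, d_ff = 3072, n_ctx = 1024) as an illustrative example:

def count_mults(L, h, d_model, d_head, d_ff, n_ctx):
    # Per-head attention terms, as in the formula above.
    per_head = (
        2 * d_model * d_head
        + n_ctx * d_head
        + d_model * d_model
        + n_ctx * d_model
    )
    # Per layer: sum over the h heads, plus the 2 * d_model * d_ff term.
    per_layer = h * per_head + 2 * d_model * d_ff
    return L * per_layer

# Example: GPT-2 small hyperparameters.
print(count_mults(L=12, h=12, d_model=768, d_head=64, d_ff=3072, n_ctx=1024))

With those numbers the estimate comes out to roughly 2.8 * 10^8 multiplications.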
