Attention in transformers, step-by-step by 3Blue1Brown. Source. 2024. Uses GPT-3 as its basis.

For inferencing just a single prompt, things appear to be very obviously memory bound, i.e. bound by the transfer speed from VRAM to GPU cache needed to load the model parameters into the GPU so they can be used, supposing that the model fits in VRAM, which is the case for many popular models.
It is however possible to make fuller use of the GPU's compute power by running multiple independent queries in parallel: this way you load the subset of model weights that you need once, and then use it to do that part of the inference for several input prompts at the same time. With this it should be possible to reach full utilization.
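As a rough back-of-the-envelope check of this claim, the following sketch compares the memory side and the compute side of decoding one token. All numbers are illustrative assumptions, not measurements: a 7B-parameter fp16 model, roughly 1 TB/s of VRAM bandwidth, and roughly 100 TFLOP/s of usable compute.

```python
# Why single-prompt decoding is memory bound, and how batching helps.
# All hardware/model numbers below are assumptions for illustration only.

param_count = 7e9          # assumed 7B-parameter model
bytes_per_param = 2        # fp16
vram_bandwidth = 1.0e12    # assumed ~1 TB/s VRAM bandwidth
gpu_flops = 100e12         # assumed ~100 TFLOP/s of usable compute

weight_bytes = param_count * bytes_per_param

# Every generated token has to stream essentially all weights through the GPU once.
time_memory = weight_bytes / vram_bandwidth   # seconds per token, memory side
flops_per_token = 2 * param_count             # ~2 FLOPs per parameter per token
time_compute = flops_per_token / gpu_flops    # seconds per token, compute side

print(f"memory-bound tokens/s (batch=1):  {1 / time_memory:.0f}")
print(f"compute-bound tokens/s (batch=1): {1 / time_compute:.0f}")

# Batching B prompts reuses the same weight load for B tokens, so the compute
# side grows with B while the memory side stays roughly flat:
for batch in (1, 8, 64):
    t = max(time_memory, batch * time_compute)
    print(f"batch={batch}: ~{batch / t:.0f} tokens/s total")
```

With these assumed numbers the weight transfer alone caps a single prompt at a few tens of tokens per second, while the compute side could do thousands, which is why batching independent prompts recovers so much throughput.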
The following estimates the number of multiplications per generated token for a "classic" GPT-2-style model.

For each layer (L):
- for each attention head (h):
  - K = d_model * d_head (takes the embedding of one token and converts it to a vector of length d_head)
  - Q = d_model * d_head (same)
  - K Q dot product for the attention pattern: n_ctx * d_head (n_ctx dot products of vectors of size d_head: the new Q against every cached K; the new K against every old Q is zeroed out by causality)
  - new value vector for the new token: d_model * d_model
  - new updates: n_ctx * d_model (multiply each value vector by the corresponding scalar from the new attention column)
- fully connected: d_model * d_ff + d_ff * d_model (converts the embedding to the hidden layer size and then back)
So the total sum is:

L * (
  h * (
    2 * d_model * d_head +
    n_ctx * d_head +
    d_model * d_model +
    n_ctx * d_model
  ) +
  2 * d_model * d_ff
)
This is coded at: llm_count_mults.py.
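For illustration, here is a minimal sketch of such a count (not the actual llm_count_mults.py), plugging in the standard GPT-2 small hyperparameters:

```python
# Minimal sketch of the multiplication count above, assuming GPT-2 small
# hyperparameters: 12 layers, 12 heads, d_model=768, d_head=64, d_ff=4*d_model,
# n_ctx=1024. Counts multiplications to generate one new token.

def attention_mults(L, h, d_model, d_head, d_ff, n_ctx):
    per_head = (
        2 * d_model * d_head   # K and Q for the new token
        + n_ctx * d_head       # new Q dotted against every cached K
        + d_model * d_model    # value vector for the new token
        + n_ctx * d_model      # weight each value vector by its attention scalar
    )
    per_layer = h * per_head + 2 * d_model * d_ff  # plus the fully connected layers
    return L * per_layer

print(attention_mults(L=12, h=12, d_model=768, d_head=64, d_ff=3072, n_ctx=1024))
```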