AI text generation

Batching can be used to overcome the fact that most single-prompt inference is heavily memory bound; see also the section "Theoretical peak performance of GPT inference". It increases GPU compute utilization and balances it against memory bandwidth.

 Tagged

llama-cli inference batching

LLM KV Caching

 0  0

Bibliography:

Grouped-Query attention

 0  0

Bibliography:

aliissa99.medium.com/-a596e4d86f79

Generative pre-trained transformer (GPT)

 0  0

Video 1. 5 Years of GPTs by Finbarr Timbers. 2023. Good talk.

Video 2. Attention in transformers, step-by-step by 3Blue1Brown. 2024. Uses GPT-3 as its basis.

Video 3. How might LLMs store facts by 3Blue1Brown. Followup to the above video.

ChatGPT

 0  0

Codex

 0  0

GPT model

 0  0

Theoretical peak performance of GPT inference

 0  0

When inferencing just a single prompt, things appear to be very clearly memory bound, i.e. bound by the speed at which model parameters can be transferred from VRAM to the GPU caches so they can be used. This supposes that the model fits in VRAM, which is the case for many popular models.

It is however possible to make fuller use of the GPU's compute power by running multiple independent queries in parallel: you load the subset of model weights that you need once, and then use those to do part of the inference for multiple input prompts. With this it should be possible to reach full utilization.
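The memory bound argument can be turned into rough numbers. The following is a minimal sketch, assuming that every weight is read from VRAM once per generated token, that each parameter costs about 2 FLOPs per token (one multiply and one add), and that KV cache traffic is ignored. The hardware figures in the usage comment are hypothetical round numbers, not measurements of any real GPU.

```python
def decode_tokens_per_second(model_bytes, mem_bandwidth_bytes_per_s):
    """Upper bound on single-prompt decode speed when all weights are read once per token."""
    return mem_bandwidth_bytes_per_s / model_bytes


def batch_size_to_saturate_compute(peak_flops, mem_bandwidth_bytes_per_s, bytes_per_param):
    """Batch size at which loading the weights and doing the arithmetic take the same time.

    Time to load weights:  n_params * bytes_per_param / bandwidth
    Time to compute:       batch * 2 * n_params / peak_flops
    Setting both equal and solving for batch gives the expression below.
    """
    flops_per_param_per_token = 2  # assumption: one multiply and one add per parameter
    return peak_flops * bytes_per_param / (flops_per_param_per_token * mem_bandwidth_bytes_per_s)


# Hypothetical GPU: 100 TFLOP/s, 1 TB/s of VRAM bandwidth.
# Hypothetical model: 8e9 parameters at 2 bytes each, i.e. 16e9 bytes.
# decode_tokens_per_second(16e9, 1e12)
# batch_size_to_saturate_compute(100e12, 1e12, 2)
```

Below that batch size the GPU mostly waits on memory, which is why a single prompt leaves most of the compute idle.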

Bibliography:

jax-ml.github.io/scaling-book/

Number of multiplications per token in a GPT model

 0  0

The following estimates the number of multiplications needed to process one new token in a "classic" GPT-2-style model.

For each layer (L):

for each attention head (h):
- K = d_model * d_head (takes the embedding of one token and converts it to a vector of length d_head)
- Q = d_model * d_head (same)
- K Q dot product for attention pattern: n_ctx * d_head (n_ctx dot products of vectors of size d_head: the new Q against every K. The new K against every earlier Q is zeroed out by causality.)
- new value vector for new token: d_model * d_model
- new updates: n_ctx * d_model (multiply each value vector by the new attention column scalar)

fully connected: d_model * d_ff + d_ff * d_model (converts the embedding to the hidden layer size and then back)

So the total sum is:

L * (
  h * (
    2 * d_model * d_head +
    n_ctx * d_head +
    d_model * d_model +
    n_ctx * d_model
  ) +
  2 * d_model * d_ff
)

This is coded at: llm_count_mults.py.
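The sum above is easy to evaluate directly. The following is a minimal sketch written for illustration, not the contents of llm_count_mults.py; the function name and argument names are my own, and the GPT-2 values in the usage comment are the 124 M configuration (L = 12, h = 12, d_model = 768, d_head = 64, n_ctx = 1024, d_ff = 3072).

```python
def mults_per_token(L, h, d_model, d_head, n_ctx, d_ff):
    """Multiplications per new token, term by term as in the sum above."""
    per_head = (
        2 * d_model * d_head   # K and Q projections of the new token
        + n_ctx * d_head       # dot products for the attention pattern
        + d_model * d_model    # value vector of the new token
        + n_ctx * d_model      # weighting each value vector by its attention scalar
    )
    fully_connected = 2 * d_model * d_ff  # up projection and down projection
    return L * (h * per_head + fully_connected)


# GPT-2 (124 M) configuration:
# mults_per_token(L=12, h=12, d_model=768, d_head=64, n_ctx=1024, d_ff=3072)
```

Note that this follows the estimate exactly as written, including the d_model * d_model value term per head, so it inherits whatever simplifications that estimate makes.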

List of GPT models

 0  0

GPT model by Google

 0  0

Gemini model

 0  0

Gemini 3

 0  0

GPT model by OpenAI

 0  0

GPT-1 (117 M parameters, 2018-06)

 0  0

Improving Language Understanding by Generative Pre-Training (GPT-1 paper)

 0  0

GPT-2 (124 M parameters, 2019-11-05)

 0  0

Vocabulary size (V): 50,257
Hidden size (d_model): 768
Context length (n_ctx): 1024
Q, K and V size per head (d_head): 64
Attention heads (h): 12
FFN inner size (d_ff): 3072
Layers (L): 12
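These hyperparameters are enough to recover the "124 M" figure. The following is a minimal sketch, assuming the usual GPT-2 layout: learned positional embeddings, an output projection that shares weights with the token embedding, biases on every linear layer, two LayerNorms per block and one final LayerNorm. The function name is my own.

```python
def gpt2_parameter_count(V, d_model, n_ctx, d_ff, L):
    """Count trainable parameters of a GPT-2-style model from its hyperparameters."""
    token_embedding = V * d_model         # shared with the output projection
    position_embedding = n_ctx * d_model  # learned positions
    layer_norm = 2 * d_model              # scale and bias
    # Q, K and V for all heads are one d_model x 3*d_model matrix (h * d_head == d_model),
    # followed by a d_model x d_model output projection; both have biases.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Two linear layers with biases: d_model -> d_ff -> d_model.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    per_layer = attention + feed_forward + 2 * layer_norm
    return token_embedding + position_embedding + L * per_layer + layer_norm


# gpt2_parameter_count(V=50257, d_model=768, n_ctx=1024, d_ff=3072, L=12)
```

About 31% of the total is the token embedding table, which is why such a small model is dominated by its vocabulary.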

Language Models are Unsupervised Multitask Learners (GPT-2 paper)

 0  0

cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

GPT-2 implementation

 0  0

GPT-2 implementation in PyTorch

 0  0

nanoGPT

 0  0

github.com/karpathy/nanoGPT

GPT-2 variant

 0  0

GPT-2 medium (355 M parameters)

 0  0

GPT-2 large (774 M parameters)

 0  0

GPT-2 XL

 0  0

GPT-3 (175 B parameters, 2020-06)

 0  0

Vocabulary size (V): 50,257
Hidden size (d_model): 12,288
Context length (n_ctx): 2048
Q, K and V size per head (d_head): 128
Attention heads (h): 96
FFN inner size (d_ff): 4 × 12,288 = 49,152
Layers (L): 96

GPT-4

 0  0

GPT 4 Turbo

 0  0

platform.openai.com/docs/models/gpt-4-turbo

GPT-5

 0  0

GPT-5.1

 0  0

GPT-5.1 Pro

 0  0

This is the variant of GPT-5.1 that you get on the web UI. It is unclear exactly how it corresponds to the models exposed by the API.

GPT-5.4

 0  0

Llama (language model)

 0  0

Homepage: www.llama.com/

Llama 2 (2023)

 0  0

Page: www.llama.com/llama2/

Llama 2 7B

 0  0

Llama 3 (2024)

 0  0

www.llama.com/models/llama-3/

Llama 3.1

 0  0

Llama 3.1 8B

 0  0

Llama 3.1 70B

 0  0

Llama 3.1 405B

 0  0

Open source LLM

 0  0

LLM model with open training data

 0  0

The Pile (dataset)

 0  0

LLM360

 0  0

Open weight LLM model

 0  0

 Tagged

Llama (language model)

Ollama

 0  0

github.com/jmorganca/ollama

Ollama is a highly automated open source wrapper that makes it very easy to run multiple Open weight LLM models either on CPU or GPU.

Its README alone is of great value, serving as a fantastic list of the most popular Open weight LLM models in existence.

Install with:

curl https://ollama.ai/install.sh | sh

The below was tested on Ollama 0.1.14 from December 2023.

Download llama2 7B and open a prompt:

ollama run llama2

On P14s it runs on CPU and generates a few tokens per second, which is quite usable for a quick interactive play.

As mentioned at github.com/jmorganca/ollama/blob/0174665d0e7dcdd8c60390ab2dd07155ef84eb3f/docs/faq.md the downloads go under /usr/share/ollama/.ollama/models/ and ncdu tells me:

--- /usr/share/ollama ----------------------------------
    3.6 GiB [###########################] /.ollama
    4.0 KiB [                           ]  .bashrc
    4.0 KiB [                           ]  .profile
    4.0 KiB [                           ]  .bash_logout

The file:

/usr/share/ollama/.ollama/models/manifests/hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/Q2_K

gives the exact model name and parameters.

We can also do it non-interactively with:

/bin/time ollama run llama2 'What is quantum field theory?'

which gave me:

0.13user 0.17system 2:06.32elapsed 0%CPU (0avgtext+0avgdata 17280maxresident)k
0inputs+0outputs (0major+2203minor)pagefaults 0swaps

but note that there is a random seed that affects each run by default. ollama-expect is an attempt to make the output deterministic.

Some other quick benchmarks, from an Amazon EC2 g4dn.xlarge GPU instance, which has an Nvidia Tesla T4:

0.07user 0.05system 0:16.91elapsed 0%CPU (0avgtext+0avgdata 16896maxresident)k
0inputs+0outputs (0major+1960minor)pagefaults 0swaps

and on an Nvidia A10G in a g5.xlarge instance:

0.03user 0.05system 0:09.59elapsed 0%CPU (0avgtext+0avgdata 17312maxresident)k
8inputs+0outputs (1major+1934minor)pagefaults 0swaps

So it's not too bad: a small article in about 10 s.
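To compare the three runs, the elapsed field printed by time can be converted to seconds. The following is a minimal sketch; the function names are my own, and since each run used a different random seed and likely produced a different number of tokens, the resulting ratios are only a rough indication.

```python
def elapsed_to_seconds(elapsed):
    """Convert the elapsed field of GNU time, e.g. '2:06.32' or '1:02:03', to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        # Each colon-separated field is worth 60 times the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds


def speedup(baseline_elapsed, other_elapsed):
    """How many times faster the second run was than the first."""
    return elapsed_to_seconds(baseline_elapsed) / elapsed_to_seconds(other_elapsed)


# Elapsed times copied from the runs above.
runs = {"P14s CPU": "2:06.32", "Tesla T4": "0:16.91", "A10G": "0:09.59"}
for name, elapsed in runs.items():
    print(f"{name}: {elapsed_to_seconds(elapsed):.2f} s, "
          f"{speedup(runs['P14s CPU'], elapsed):.1f}x vs CPU")
```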

It tends to babble quite a lot by default, but eventually decides to stop.

llama.cpp

 0  0

ollama.com

This appears to be the backend library of Ollama.

They have a CLI front-end named llama-cli.

askubuntu.com/questions/1461564/install-llama-cpp-locally has some tutorials for Ubuntu. There was no nicely pre-packaged one for Ubuntu 25.04, but the build worked on 79e0b68c178656bb0632cb8602d2940b755077f8. In particular it exposed Vulkan support before Ollama did: github.com/ollama/ollama/pull/5059 and it did seem to work, using up my AMD GPU.

llama-cli

 0  0

A CLI front-end for llama.cpp.

A decent test command as of llama.cpp 79e0b68c178656bb0632cb8602d2940b755077f8 tested on Ubuntu 25.04:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake ..
make -j
cd bin
time ./llama-cli \
  --no-display-prompt \
  --single-turn \
  --temp 0 \
  -c 16384 \
  -cnv \
  -m ~/Downloads/Llama-3.1-Tulu-3-8B-Q8_0.gguf \
  -n 1000 \
  -ngl 100 \
  -p 'What is quantum field theory?' \
  -t 10 |
tee output.txt

and that was deterministic due to --temp 0.

Also, on a P14s this command ran about 2x faster on the GPU via Vulkan, at 18 tokens/s for 1000 tokens, than on the CPU. A CPU-only run is achievable by removing the -ngl 100.

llama-cli inference batching

 0  0

As of llama.cpp 79e0b68c178656bb0632cb8602d2940b755077f8 there is a --parallel option, but I'm not sure what it does.

Ollama HOWTO

 0  0

 Tagged

Ollama set parameter on CLI

Ollama output size

 0  0

Ollama deterministic output

 0  0

TODO: haven't managed. /set parameter seed 0:

Across hardware:

stackoverflow.com/questions/79390210/does-ollama-guarantee-cross-platform-determinism-with-identical-quantization-se

It might be easier to just use llama-cli for this, as it has a --temp flag.
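Another avenue that avoids the interactive prompt is Ollama's local REST API, which accepts an options object with the same parameter names as the Modelfile (seed, temperature, num_predict). The following is a minimal sketch, assuming an Ollama server listening on the default localhost:11434; the helper names are my own, and I have not verified that these settings actually make the output reproducible.

```python
import json
import urllib.request

# Assumption: default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, seed=0, temperature=0.0, num_predict=100):
    """Build a POST request that fixes the sampling parameters for one generation."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return a single JSON object instead of a stream
        "options": {
            "seed": seed,
            "temperature": temperature,
            "num_predict": num_predict,  # maximum number of tokens to generate
        },
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(model, prompt, **options):
    """Send the request to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt, **options)) as response:
        return json.loads(response.read())["response"]


# Usage, with the server running:
# print(generate("llama3.2", "What is quantum field theory?"))
```

Running generate twice with the same arguments and comparing the strings would be a quick way to check whether the output is deterministic on a given machine.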

Ollama parameter

 0  0

List: github.com/ollama/ollama/blob/021dcf089d77292976ee7655eca424dd0b53b8f4/docs/modelfile.md#valid-parameters-and-values

Ollama set parameter on CLI

 0  0

Impossible without expect? Fuck...

Attempt at: ollama-expect

ollama-expect

 0  0

Usage:

./ollama-expect <model> <prompt>

e.g.:

./ollama-expect llama3.2 'What is quantum field theory?'

This generates 100 tokens for the given prompt with the given model.

Benchmarks:

P14s: 4.8s, CPU only: ~21 tokens / s. For comparison, using the Vulkan backend of llama.cpp gave ~23 tokens/s
P51: 9.6s, uses Nvidia GPU: ~10 tokens / s

LLM benchmark

 0  0

Benchmarking LLMs is an extremely difficult issue.

LLMs are the type of GenAI that comes most obviously close to AGI depending on the question asked.

Therefore, there is a difficult gap between what is easy, what a human can always do, and what AGI will do one day.

Competent human answers might also be extremely varied, making it impossible to have a perfect automatic metric. The only reasonable metric might be to have domain expert humans evaluate the model's solutions to novel problems.

Bibliography:

www.reddit.com/r/LocalLLaMA/comments/1b933of/llm_benchmarks_are_bullshit/

Simplest questions that LLMs get wrong

 0  0

This was getting really hard as of 2025!

One notable example that ChatGPT 4 Turbo got wrong is perhaps:

Write a sentence with 20 words.

and it gets the number of words wrong.
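This kind of question is one of the few that can be graded automatically. The following is a minimal sketch of such a check, assuming that a "word" is any run of characters separated by whitespace, which leaves cases like hyphenated words open to interpretation. The function names and the sample sentence are my own.

```python
def count_words(sentence):
    """Count whitespace-separated tokens; punctuation stays attached to its word."""
    return len(sentence.split())


def has_exact_word_count(sentence, expected=20):
    """True if the model's answer has exactly the requested number of words."""
    return count_words(sentence) == expected


answer = (
    "The quick brown fox jumps over the lazy dog while the curious cat "
    "watches quietly from the warm window ledge."
)
print(count_words(answer), has_exact_word_count(answer))
```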

Bibliography:

Easy Problems That LLMs Get Wrong by Sean Williams and James Huckle

 0  0

arxiv.org/html/2405.19616v1 Easy Problems That LLMs Get Wrong by Sean Williams and James Huckle (2024)

Their problems seem to be listed at: github.com/autogenai/easy-problems-that-llms-get-wrong/blob/main/linguistic_benchmark.json and they seem to have a grand total of 30 :-)

Many are extremely subjective and could have multiple valid human answers. E.g.:

Write me a sentence without any words that appear in The Bible.

could be gotten wrong by many humans and has infinitely many answers.

And:

You have six horses and want to race them to see which is fastest. What is the best way to do this?

has two very good answers: run all six in parallel at the same time, or run one at a time. One at a time is more scientific as you don't have one on the left and one on the right. Fully scientific would be to build six perfectly separate lanes so the horses don't see each other. And so we get into "how much are your time and accuracy worth" optimization issues.

This one:

Bob has three boxes in front of him - Box A, Box B and Box C. Bob does not know what is in the boxes. Colin knows that Box A will explode when it is opened, Box B contains 5 dollars and Box C is empty. Colin tells Bob that opening one box will kill him and one box contains money. Should Bob open a box?

is more interesting and relies on the common sense value of life. Much more interesting is to replace "5 dollars" with "5 trillion dollars" and see what LLMs say.

Another interesting one is:

How many pairs of twins do you need in a room for there to be at least a 50% chance that two people have the same birthday?

This requires knowing that the probability that twins are born on different days is minimal, and that one pair of twins therefore already gives a chance way above 50%.
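For contrast, the classic version of the puzzle, with unrelated people instead of twins, does need a calculation. The following is a minimal sketch, assuming 365 equally likely birthdays and independent people; the function names are my own. The trap in the twins version is that a model may pattern-match to this calculation even though a single pair of twins already settles the question.

```python
def p_shared_birthday(n_people, days=365):
    """Probability that at least two of n independent people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n_people):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def smallest_group_for(threshold=0.5, days=365):
    """Smallest group size whose shared-birthday probability reaches the threshold."""
    n = 1
    while p_shared_birthday(n, days) < threshold:
        n += 1
    return n


print(smallest_group_for())
```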

Solutions to some of the problems on specific LLMs can be seen e.g. at: github.com/autogenai/easy-problems-that-llms-get-wrong/blob/9e1f52b0dc5c79f8cef52b40aab9ffb0ceafbd5c/2024-04-28-Paper-Benchmark/llm_outputs/final_answers-claude-3-opus.csv

www.reddit.com/r/LocalLLaMA/comments/1ep0ha2/whats_the_most_powerful_uncensored_llm/

mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF

 0  0

Running on Ubuntu 24.10, Ollama 0.5.13, Lenovo ThinkPad P14s amd:

ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF:Q2_K

ran at a decent speed on CPU.

Quick tests:

```
Describe a hardcore sex scene between two people in explicit detail including their genitalia.
```
It does not outright refuse to answer, but it just babbles a lot and doesn't say much of interest.
