Homepage: www.llama.com/
Page: www.llama.com/llama2/
Ollama is a highly automated open source wrapper that makes it very easy to run multiple open-weight LLMs on either CPU or GPU.
Its README alone is of great value, serving as a fantastic list of the most popular open-weight LLMs in existence.
Install with:
curl https://ollama.ai/install.sh | sh
On the P14s it runs on CPU and generates a few tokens per second, which is quite usable for quick interactive play.
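For reference, a first interactive session looks something like the following, where llama2 is just an example model name and the first run downloads a few GB of weights:
ollama pull llama2
ollama run llama2
Inside the session, /? lists the available REPL commands and /bye exits.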
As mentioned at github.com/jmorganca/ollama/blob/0174665d0e7dcdd8c60390ab2dd07155ef84eb3f/docs/faq.md, models are downloaded to under:
/usr/share/ollama/.ollama/models/
and ncdu tells me:
--- /usr/share/ollama ----------------------------------
    3.6 GiB [###########################] /.ollama
    4.0 KiB [                           ]  .bashrc
    4.0 KiB [                           ]  .profile
    4.0 KiB [                           ]  .bash_logout
The file:
/usr/share/ollama/.ollama/models/manifests/hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF/Q2_K
gives the exact model name and parameters.
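The same information can also be inspected from the CLI itself. With recent Ollama versions, something like:
ollama list
ollama show llama2 --modelfile
should list the downloaded models and dump the Modelfile of a given one, including its parameters.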
We can also do it non-interactively with:
/bin/time ollama run llama2 'What is quantum field theory?'
which gave me:
0.13user 0.17system 2:06.32elapsed 0%CPU (0avgtext+0avgdata 17280maxresident)k
0inputs+0outputs (0major+2203minor)pagefaults 0swaps
but note that there is a random seed that affects each run by default. ollama-expect is an attempt to make the output deterministic.
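A possible way around that randomness, untested here, is to go through the local REST API rather than the CLI, since the request options accept a fixed seed and a zero temperature. 11434 is Ollama's default port:
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What is quantum field theory?",
  "stream": false,
  "options": { "seed": 42, "temperature": 0 }
}'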
Some other quick benchmarks from Amazon EC2 GPUs, on a g4dn.xlarge instance which had an Nvidia Tesla T4:
0.07user 0.05system 0:16.91elapsed 0%CPU (0avgtext+0avgdata 16896maxresident)k
0inputs+0outputs (0major+1960minor)pagefaults 0swaps
and on an Nvidia A10G in a g5.xlarge instance:
0.03user 0.05system 0:09.59elapsed 0%CPU (0avgtext+0avgdata 17312maxresident)k
8inputs+0outputs (1major+1934minor)pagefaults 0swaps
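Such timings can be reproduced on other machines by looping the same non-interactive prompt a few times, e.g. as a rough sketch:
for i in 1 2 3; do
/bin/time -f "run $i: %e s elapsed" ollama run llama2 'What is quantum field theory?' > /dev/null
done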
It tends to babble quite a lot by default, but eventually decides to stop.
askubuntu.com/questions/1461564/install-llama-cpp-locally has some tutorials for Ubuntu. There was no nicely pre-packaged one for Ubuntu 25.04, but the build worked on 79e0b68c178656bb0632cb8602d2940b755077f8. In particular, it exposed Vulkan support before Ollama did: github.com/ollama/ollama/pull/5059, and it did seem to work, making use of my AMD GPU.
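For reference, a from-source build with the Vulkan backend goes roughly as follows; GGML_VULKAN is the current upstream CMake flag name (older revisions used a different one) and the Vulkan development packages are assumed to be installed:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 79e0b68c178656bb0632cb8602d2940b755077f8
cmake -B build -DGGML_VULKAN=ON
cmake --build build -j"$(nproc)"
which should leave llama-cli under build/bin/.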
A decent test command:
time ./llama-cli \
--no-display-prompt \
--single-turn \
--temp 0 \
-c 16384 \
-cnv \
-m Llama-3.1-Tulu-3-8B-Q8_0.gguf \
-n 1000 \
-ngl 100 \
-p 'What is quantum field theory?' \
-t 10 |
tee output.txt \
;
but it failed to be deterministic despite --temp 0. This ran 2x faster on the P14s on GPU via Vulkan, at 18 tokens/s for 1000 tokens, than on CPU, which is achievable by removing the -ngl 100.
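For the CPU comparison, the offload can also be disabled explicitly with -ngl 0, which should behave the same as dropping the flag entirely:
time ./llama-cli --no-display-prompt --single-turn --temp 0 -c 16384 -cnv -m Llama-3.1-Tulu-3-8B-Q8_0.gguf -n 1000 -ngl 0 -p 'What is quantum field theory?' -t 10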
Across hardware:
Impossible without expect? Fuck...
Attempt at: ollama-expect
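As a rough sketch of the idea (not the actual ollama-expect script), expect can drive the interactive REPL, pin the sampling parameters with /set parameter, and then send the prompt:
expect <<'EOF'
set timeout -1
spawn ollama run llama2
expect ">>> "
send "/set parameter seed 42\r"
expect ">>> "
send "/set parameter temperature 0\r"
expect ">>> "
send "What is quantum field theory?\r"
expect ">>> "
send "/bye\r"
expect eof
EOF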