llama.cpp

This appears to be the backend library of Ollama.

The project provides a CLI front-end named llama-cli.

askubuntu.com/questions/1461564/install-llama-cpp-locally has some tutorials for Ubuntu. There was no nicely pre-packaged version for Ubuntu 25.04, but building from source worked at commit 79e0b68c178656bb0632cb8602d2940b755077f8.

In particular, llama.cpp exposed Vulkan support before Ollama did (github.com/ollama/ollama/pull/5059), and it did seem to work, making use of my AMD GPU.
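Since Vulkan is the point of interest here, the following is a minimal sketch of a build with the Vulkan backend switched on. It assumes the GGML_VULKAN CMake option is available at the commit in use, and that the Vulkan development files and a GLSL compiler are already installed. The function name build_llama_vulkan is made up for illustration, and I have not confirmed that this is exactly how the build mentioned above was configured.

```shell
# Sketch of a Vulkan-enabled build, to be run from the llama.cpp checkout.
# Assumption: the GGML_VULKAN CMake option exists at the commit in use.
# build_llama_vulkan is a hypothetical helper name, not part of llama.cpp.
build_llama_vulkan() {
  # Configure into ./build with the Vulkan backend enabled,
  # then compile using parallel jobs.
  cmake -B build -DGGML_VULKAN=ON &&
  cmake --build build -j
}
```

Calling build_llama_vulkan from the repository root should leave the binaries, including llama-cli, under build/bin.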

Table of contents
- llama-cli
  - llama-cli inference batching

llama-cli

 0  0

A CLI front-end for llama.cpp.

A decent test command as of llama.cpp commit 79e0b68c178656bb0632cb8602d2940b755077f8, tested on Ubuntu 25.04:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake ..
make -j
cd bin
time ./llama-cli \
  --no-display-prompt \
  --single-turn \
  --temp 0 \
  -c 16384 \
  -cnv \
  -m ~/Downloads/Llama-3.1-Tulu-3-8B-Q8_0.gguf \
  -n 1000 \
  -ngl 100 \
  -p 'What is quantum field theory?' \
  -t 10 |
tee output.txt

The output of that command was deterministic due to --temp 0.
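One way to check the determinism claim is to capture the output of two runs and compare them byte for byte. The sketch below does that with cmp. The helper name outputs_match and the file names run1.txt and run2.txt are made up for illustration.

```shell
# Hypothetical helper (not part of llama.cpp): succeeds only if the two
# captured outputs are byte-identical.
outputs_match() {
  cmp -s "$1" "$2"
}

# Usage sketch: run the same llama-cli command twice, saving standard output
# to run1.txt and run2.txt (for example via tee), then compare:
#   outputs_match run1.txt run2.txt && echo "identical output"
```

If the two files differ, outputs_match returns a non-zero status and nothing is printed.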

On a P14s, the llama-cli command above ran about 2x faster on the GPU via Vulkan, at 18 tokens/s for 1000 tokens, than on the CPU. To run on the CPU instead, remove -ngl 100.
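To put those numbers in perspective, here is a small arithmetic sketch of what they mean in generation time. The helper names are made up, and the CPU rate of 9 tokens/s is an assumption that takes "2x" as exactly half of the 18 tokens/s GPU figure.

```shell
# Hypothetical helper: seconds needed to generate N tokens at a given rate.
seconds_for_tokens() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

# Hypothetical helper: tokens per second from a token count and elapsed seconds.
tokens_per_second() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n / s }'
}

seconds_for_tokens 1000 18   # GPU rate from the text: prints 55.6
seconds_for_tokens 1000 9    # assumed CPU rate (half of the GPU): prints 111.1
```

Note that the wall-clock time reported by time covers the whole process, including model loading, so passing it to tokens_per_second would give a somewhat lower rate than the pure generation speed.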

llama-cli inference batching

 0  0

As of llama.cpp commit 79e0b68c178656bb0632cb8602d2940b755077f8, there is a --parallel option, but I'm not sure what it does.
