ollama-expect Created 2025-05-21 Updated 2025-07-19
Usage:
./ollama-expect <model> <prompt>
e.g.:
./ollama-expect llama3.2 'What is quantum field theory?'
This generates 100 tokens for the given prompt with the given model.
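The script itself is not shown here; a minimal sketch of what such a wrapper could look like, assuming it drives ollama's HTTP API on its default port 11434 (the `ollama_generate` name and the curl-based approach are illustrative assumptions, not the actual script):

```shell
# Hypothetical sketch of an ollama-expect-style wrapper, assuming the
# ollama server's /api/generate endpoint; num_predict caps output at
# 100 tokens, matching the behavior described above.
ollama_generate() {
  model=$1
  prompt=$2
  payload=$(printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_predict":100}}' \
    "$model" "$prompt")
  curl -s http://localhost:11434/api/generate -d "$payload"
}

# usage: ollama_generate llama3.2 'What is quantum field theory?'
```

Note this assumes prompts contain no characters needing JSON escaping; a real script would escape them.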
Benchmarks:
llama-cli Created 2025-07-16 Updated 2025-08-08
A CLI front-end for llama.cpp.
A decent test command as of llama.cpp 79e0b68c178656bb0632cb8602d2940b755077f8:
time ./llama-cli \
  --no-display-prompt \
  --single-turn \
  --temp 0 \
  -c 16384 \
  -cnv \
  -m Llama-3.1-Tulu-3-8B-Q8_0.gguf \
  -n 1000 \
  -ngl 100 \
  -p 'What is quantum field theory?' \
  -t 10 |
  tee output.txt \
;
but output was not deterministic despite --temp 0. On a P14s, this ran about 2x faster on the GPU via Vulkan (18 tokens/s over 1000 tokens) than on the CPU, which can be selected by removing -ngl 100.
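The determinism failure above can be checked mechanically by running a command twice and comparing its output; `check_deterministic` below is a hypothetical helper, not part of these notes:

```shell
# check_deterministic CMD [ARGS...]: run a command twice, capture
# stdout from each run, and report whether the two runs matched.
check_deterministic() {
  "$@" > /tmp/det_run1.$$ 2>/dev/null
  "$@" > /tmp/det_run2.$$ 2>/dev/null
  if cmp -s /tmp/det_run1.$$ /tmp/det_run2.$$; then
    echo deterministic
  else
    echo nondeterministic
  fi
  rm -f /tmp/det_run1.$$ /tmp/det_run2.$$
}

# trivially deterministic command, for illustration:
check_deterministic echo hello
```

Applying it to the llama-cli invocation above (with --no-display-prompt so only generated text is compared) would surface the nondeterminism described.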