llama-cli inference batching 2025-08-08
llama.cpp Created 2025-07-16 Updated 2025-07-16
askubuntu.com/questions/1461564/install-llama-cpp-locally has some tutorials for Ubuntu. There was no nicely pre-packaged option for Ubuntu 25.04, but building from source worked at commit 79e0b68c178656bb0632cb8602d2940b755077f8. Notably, llama.cpp exposed Vulkan support before Ollama did (github.com/ollama/ollama/pull/5059), and it did seem to work, making use of my AMD GPU.
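A minimal sketch of a source build like the one above, assuming CMake and the Vulkan SDK (e.g. the `libvulkan-dev` and `glslc` packages) are already installed; the exact commit checkout matches the hash noted above:

```shell
# Fetch llama.cpp and pin to the commit that built cleanly
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout 79e0b68c178656bb0632cb8602d2940b755077f8

# Configure with the Vulkan backend enabled, then build
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```

After a successful build the binaries land under `build/bin/`, and running `llama-cli` with a model should report a Vulkan device in its startup log if the GPU was detected.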
Ollama deterministic output Created 2025-03-20 Updated 2025-07-16