{tag=LLM inference batching}

As of <llama.cpp> 79e0b68c178656bb0632cb8602d2940b755077f8 there is a `--parallel` option, but I'm not sure what it does.

Bibliography:
* https://github.com/ggml-org/llama.cpp/discussions/3222
* https://www.reddit.com/r/LocalLLaMA/comments/12aj0ze/what_is_batchsize_in_llamacpp_also_known_as_n/
* https://www.reddit.com/r/LocalLLaMA/comments/12gtanv/batch_queries/
* related for server:
  * https://www.reddit.com/r/LocalLLaMA/comments/1f19t2l/parallel_requests_using_llamaserver


<llama-cli> inference batching

{c}

<LLM inference batching> means running multiple independent queries in parallel on a given model.

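This can be used to overcome the fact that most single-prompt inference is heavily <memory bound>, see also: <theoretical peak performance of GPT inference>{full}. When generating one token for a single prompt, all the model weights have to be read from memory, but each weight is used for only a handful of operations. With batching, each weight read is shared across all the queries in the batch, which increases GPU compute utilization and balances it against memory bandwidth.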
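The effect can be illustrated with a rough roofline-style estimate. The following is a minimal sketch, not a measurement: the model size, bytes per parameter, memory bandwidth and peak FLOP/s are all made-up round numbers, and the function names are hypothetical. It also assumes that weights are read once per decoding step whatever the batch size, and it ignores KV cache reads, which grow with batch size in practice.

```python
def decode_step_seconds(
    batch_size,
    n_params=7e9,         # assumed model size (hypothetical 7B model)
    bytes_per_param=2,    # assumed 16-bit weights
    mem_bandwidth=1e12,   # assumed memory bandwidth in bytes/s
    peak_flops=1e14,      # assumed peak compute in FLOP/s
):
    """Estimate the time of one decoding step for the whole batch."""
    # All weights are read once per step, independently of the batch size.
    memory_time = n_params * bytes_per_param / mem_bandwidth
    # Each sequence costs about 2 FLOPs per parameter (multiply and add).
    compute_time = 2 * n_params * batch_size / peak_flops
    # The step takes as long as the slower of the two.
    return max(memory_time, compute_time)


def tokens_per_second(batch_size, **hardware):
    """Total tokens generated per second, summed over the batch."""
    return batch_size / decode_step_seconds(batch_size, **hardware)


for batch_size in (1, 8, 100, 200):
    print(batch_size, round(tokens_per_second(batch_size)))
```

Under these assumptions the step time stays constant while the workload is memory bound, so total throughput grows linearly with batch size. Once the compute time overtakes the memory time, throughput stops improving.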

Bibliography:
* https://medium.com/@yohoso/llm-inference-optimisation-continuous-batching-2d66844c19e9
* https://www.hyperstack.cloud/technical-resources/tutorials/static-vs.-continuous-batching-for-large-language-model-inference


Ciro Santilli @cirosantilli 40

 Tagged: LLM inference batching