LLM inference batching
ID: llm-inference-batching
Batching can be used to overcome the fact that single-prompt inference is typically heavily memory bound, see also: Section "Theoretical peak performance of GPT inference". The model weights have to be streamed from memory for each forward pass regardless of how many prompts are in flight, so processing several prompts per pass amortizes that memory traffic, increases GPU compute utilization, and brings compute and memory bandwidth closer into balance.
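As a rough illustration (not tied to any particular serving stack), the PyTorch sketch below shows how batching amortizes the weight read of a single linear layer over several prompts; the layer size and batch size are made-up example values.

```python
import torch

# Illustrative dimensions only; real models have many such layers.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
d_model, batch = 4096, 8

W = torch.randn(d_model, d_model, device=device, dtype=dtype)  # one weight matrix

# Unbatched decoding: each prompt contributes a (1, d_model) activation,
# so the full weight matrix is streamed from memory once per prompt.
xs = [torch.randn(1, d_model, device=device, dtype=dtype) for _ in range(batch)]
ys_unbatched = [x @ W for x in xs]

# Batched decoding: the prompts share a single (batch, d_model) matmul,
# so the same weight read is reused across all of them.
X = torch.cat(xs, dim=0)
Y_batched = X @ W

print(Y_batched.shape)  # (batch, d_model); same math, far fewer weight reads
```

The arithmetic cost grows with the batch size while the weight traffic stays roughly constant, which is why batching moves a memory-bound decode step toward being compute bound.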