= LLM inference batching
{c}
<LLM inference batching> means running multiple independent queries in parallel on a given model.
It can be used to overcome the fact that single-prompt inference is typically heavily <memory bound>, see also: <theoretical peak performance of GPT inference>{full}. By processing several prompts per forward pass, batching increases GPU compute utilization and balances it against memory bandwidth.
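As a concrete illustration, the following Python sketch shows static batching with the Hugging Face transformers library: several prompts are padded to a common length and served by a single generate call. The model name gpt2 and the generation parameters are arbitrary placeholders chosen for this sketch, not something prescribed here.
```
# Minimal static-batching sketch using Hugging Face transformers.
# Model name and generation parameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Decoder-only models need left padding for batched generation,
# and GPT-2 has no pad token, so reuse the EOS token.
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompts = [
    "The capital of France is",
    "In machine learning, a transformer is",
    "The speed of light in vacuum is",
]

# Tokenize all prompts together into one [batch, seq_len] tensor,
# so a single forward pass serves every query at once.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,
    )

for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```
This is static batching: all requests in the batch start and finish together. Continuous batching, discussed in the links below, instead swaps completed sequences out for new requests at each decoding step to keep the GPU busy.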
Bibliography:
* https://medium.com/@yohoso/llm-inference-optimisation-continuous-batching-2d66844c19e9
* https://www.hyperstack.cloud/technical-resources/tutorials/static-vs.-continuous-batching-for-large-language-model-inference