This section discusses techniques that can be used to make LLM inference run with lower latency or higher throughput.
LLM inference batching means running multiple independent queries in parallel on a given model.
Batching helps overcome the fact that single prompt inference is usually heavily memory bound, see also: Section "Theoretical peak performance of GPT inference". Because the model weights only need to be read from memory once per forward pass regardless of how many queries are in the batch, batching increases GPU compute utilization and brings it closer into balance with memory bandwidth.
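A minimal sketch of the idea, assuming the `transformers` and `torch` packages and the small `gpt2` checkpoint are available: several independent prompts are padded into a single batch so that one forward pass per generated token serves all of them.

```python
# Batched inference sketch with Hugging Face transformers (assumed available).
# Several independent prompts are padded to the same length and run in one batch,
# so the model weights loaded from GPU memory are reused across all queries.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token; reuse EOS and pad on the left for generation.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompts = [
    "The capital of France is",
    "2 + 2 =",
    "GPUs are fast because",
]

# Tokenize all prompts together, padding to the longest one.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,
    )

for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```

Compared to looping over the prompts one at a time, the batched call does roughly the same amount of memory traffic for the weights but several times more useful compute, which is exactly the trade-off the paragraph above describes.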
