OurBigBook About$ Donate
 Sign in Sign up

LLM inference batching

Ciro Santilli (@cirosantilli, 37) ... Generative AI Generative AI by modality AI text generation Text-to-text model Large language model LLM inference optimization
2025-08-08  0 By others on same topic  0 Discussions Create my own version
LLM inference batching means running multiple independent queries in parallel on a given model.
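As a minimal sketch of what this looks like in practice (assuming the Hugging Face transformers library and the small gpt2 checkpoint, both of which are just illustrative choices, not anything prescribed by this article), several independent prompts are padded into one batch and decoded together, so each forward pass serves all of the queries at once:

```python
# Minimal batched-inference sketch. Assumes the Hugging Face "transformers"
# library and the "gpt2" checkpoint; any causal LM checkpoint would do.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
tokenizer.padding_side = "left"             # left-pad so generation continues from the prompt

model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Independent queries batched together.
prompts = [
    "The capital of France is",
    "A large language model is",
    "GPU memory bandwidth matters because",
]

# One padded batch of shape (batch_size, max_prompt_len).
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```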
Batching can be used to overcome the fact that single-prompt inference is usually heavily memory bound, see also: Section "Theoretical peak performance of GPT inference". Batching increases GPU compute utilization, bringing the compute load into better balance with the memory bandwidth being consumed.
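To see why batch size 1 tends to be memory bound, here is a rough back-of-the-envelope sketch. The parameter count and GPU figures below are illustrative assumptions, not numbers from this article: decoding one token reads essentially every weight once (about 2 bytes per parameter in fp16) while doing about 2 FLOPs per parameter, so the arithmetic intensity is around 1 FLOP/byte, far below what a modern GPU needs to keep its compute units busy; a batch of B prompts reuses each weight read B times.

```python
# Back-of-the-envelope arithmetic intensity of one decoding step.
# All numbers are illustrative assumptions: a 7B-parameter fp16 model
# and rough A100-class GPU specs.
n_params = 7e9            # model parameters
bytes_per_param = 2       # fp16 weights
flops_per_param = 2       # ~one multiply-add per parameter per token

gpu_flops = 312e12        # ~peak fp16 FLOP/s
gpu_bandwidth = 2e12      # ~HBM bytes/s

for batch_size in (1, 8, 64, 256):
    flops = flops_per_param * n_params * batch_size  # work per decoding step
    bytes_read = bytes_per_param * n_params          # weights are read once per step
    intensity = flops / bytes_read                   # FLOPs per byte of weights
    t_compute = flops / gpu_flops
    t_memory = bytes_read / gpu_bandwidth
    bound = "memory" if t_memory > t_compute else "compute"
    print(f"batch={batch_size:4d}  intensity={intensity:6.1f} FLOP/byte  {bound}-bound")
```

This sketch ignores KV-cache reads, which grow with batch size and context length, so the real crossover point between memory bound and compute bound depends on the model and workload.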
Bibliography:
  • medium.com/@yohoso/llm-inference-optimisation-continuous-batching-2d66844c19e9
  • www.hyperstack.cloud/technical-resources/tutorials/static-vs.-continuous-batching-for-large-language-model-inference
