Batching can be used to overcome the fact that single-prompt inference is heavily memory bound, see also: Section "Theoretical peak performance of GPT inference". Processing several prompts at once increases GPU compute utilization, because the model weights are read from memory once per batch rather than once per prompt, balancing compute against memory bandwidth.
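A rough way to see this is through arithmetic intensity (FLOPs per byte moved). At batch size 1, a weight matrix multiply does about 2 FLOPs for every 2 bytes of fp16 weights loaded, far below the GPU's compute-to-bandwidth ratio; larger batches reuse the same weights and raise intensity proportionally. A minimal back-of-the-envelope sketch, where the hardware numbers are approximate A100 figures and the matrix dimension is an assumption for illustration:

```python
def arithmetic_intensity(batch, d_in, d_out, bytes_per_param=2):
    """FLOPs per byte for a dense (d_in x d_out) fp16 matmul at a given batch."""
    flops = 2 * batch * d_in * d_out              # multiply + add per weight
    bytes_moved = d_in * d_out * bytes_per_param  # weight reads dominate at small batch
    return flops / bytes_moved

# Approximate A100 specs (assumptions, not exact datasheet values):
peak_flops = 312e12   # fp16 tensor-core peak, FLOP/s
bandwidth = 2.0e12    # HBM bandwidth, bytes/s
ridge = peak_flops / bandwidth  # intensity above which compute dominates

for b in (1, 16, 256):
    ai = arithmetic_intensity(b, 12288, 12288)
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(f"batch={b:4d}  intensity={ai:6.1f} FLOP/byte  -> {bound}")
```

With 2-byte weights the intensity works out to exactly the batch size, so the batch must exceed the GPU's FLOP/s-to-bandwidth ratio (here roughly 156) before the multiply stops being memory bound.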