This section discusses techniques that can be used to make LLM inference run with lower latency or higher throughput.
LLM inference batching means running multiple independent queries in parallel on a given model.
Batching helps overcome the fact that single prompt inference is usually heavily memory bound, see also: Section "Theoretical peak performance of GPT inference". Because the model weights only need to be read from memory once per forward pass regardless of how many queries are in the batch, batching increases GPU compute utilization and brings it closer into balance with memory bandwidth.
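A minimal sketch of the idea, assuming the `transformers` and `torch` packages and the small `gpt2` checkpoint are available: several independent prompts are padded into a single batch so that one forward pass per generated token serves all of them.

```python
# Batched inference sketch with Hugging Face transformers (assumed available).
# Several independent prompts are padded to the same length and run in one batch,
# so the model weights loaded from GPU memory are reused across all queries.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token; reuse EOS and pad on the left for generation.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompts = [
    "The capital of France is",
    "2 + 2 =",
    "GPUs are fast because",
]

# Tokenize all prompts together, padding to the longest one.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,
    )

for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```

Compared to looping over the prompts one at a time, the batched call does roughly the same amount of memory traffic for the weights but several times more useful compute, which is exactly the trade-off the paragraph above describes.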
