LLM Engineering

LLM Serving

Important metrics

Note (Databricks: [LLM Inference Performance Engineering: Best Practices](https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices))

Time To First Token (TTFT): How quickly users start seeing the model’s output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
Time Per Output Token (TPOT): Time to generate an output token for each user querying the system. This metric corresponds to how each user perceives the “speed” of the model. For example, a TPOT of 100 milliseconds/token would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated using the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated).
Throughput: The number of output tokens per second an inference server can generate across all users and requests.
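
The latency formula and the TPOT example above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names and the sample numbers are made up for illustration, and the 0.75 words-per-token ratio is an assumption implied by the "10 tokens per second ≈ 450 words per minute" figure, not a measured value.

```python
# Assumed ratio implied by the example above (10 tok/s ~ 450 words/min).
WORDS_PER_TOKEN = 0.75


def response_latency_s(ttft_s: float, tpot_s: float, num_output_tokens: int) -> float:
    """latency = TTFT + TPOT * (number of tokens to be generated)."""
    return ttft_s + tpot_s * num_output_tokens


def tokens_per_second_per_user(tpot_s: float) -> float:
    """Per-user generation speed is the reciprocal of TPOT."""
    return 1.0 / tpot_s


def words_per_minute(tpot_s: float) -> float:
    """Convert TPOT to an approximate reading-speed figure."""
    return tokens_per_second_per_user(tpot_s) * 60.0 * WORDS_PER_TOKEN


# Hypothetical request: 0.5 s to first token, 100 ms per token, 200 output tokens.
latency = response_latency_s(ttft_s=0.5, tpot_s=0.1, num_output_tokens=200)
print(f"latency: {latency:.1f} s")
print(f"speed:   {tokens_per_second_per_user(0.1):.0f} tokens/s per user")
print(f"reading: {words_per_minute(0.1):.0f} words/min")
```

Note that throughput is not derived here: it is measured across all concurrent users and requests, so it cannot be computed from a single request's TTFT and TPOT.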

Optimization

NVIDIA: Mastering LLM Techniques: Inference Optimization

Speculative Sampling

DeepMind: Accelerating large language model decoding with speculative sampling¹
Google: Fast inference from transformers via speculative decoding²

DeepMind and Google published papers on speculative sampling one after the other. The two are quite similar, though there are a few differences.

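feifeibear/LLMSpeculativeSampling provides PyTorch implementations of both algorithms
jaymody/speculative-sampling provides a JAX implementation of the DeepMind algorithm
ai-glimpse/toyllm provides a PyTorch implementation of the DeepMind algorithm (validated with GPT-2)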
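
As a rough sketch of the idea (not taken from any of the repositories above), the core of speculative sampling is an accept/reject rule applied to each token proposed by the small draft model: accept the token with probability min(1, target/draft), otherwise resample from the normalized positive part of (target − draft). The example below covers only this single-token rule in plain Python. The function names and the two toy distributions are hypothetical, and the full algorithms additionally draft several tokens per step and score them with the target model in one forward pass.

```python
import random


def accept_or_resample(target_probs, draft_probs, draft_token, rng):
    """Return (token, accepted) for one drafted token.

    Assumes draft_token was sampled from draft_probs, so its draft
    probability is strictly positive.
    """
    accept_prob = min(1.0, target_probs[draft_token] / draft_probs[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # Rejected: resample from the positive part of (target - draft).
    # random.choices normalizes the weights internally.
    residual = [max(0.0, p - q) for p, q in zip(target_probs, draft_probs)]
    token = rng.choices(range(len(target_probs)), weights=residual, k=1)[0]
    return token, False


def sample_one(target_probs, draft_probs, rng):
    """Draft one token from the draft distribution, then verify it."""
    draft_token = rng.choices(range(len(draft_probs)), weights=draft_probs, k=1)[0]
    return accept_or_resample(target_probs, draft_probs, draft_token, rng)


# Toy distributions over a 3-token vocabulary (made up for illustration).
target_probs = [0.6, 0.3, 0.1]
draft_probs = [0.2, 0.5, 0.3]

rng = random.Random(0)
num_samples = 50_000
counts = [0] * len(target_probs)
accepted = 0
for _ in range(num_samples):
    token, was_accepted = sample_one(target_probs, draft_probs, rng)
    counts[token] += 1
    accepted += was_accepted

frequencies = [c / num_samples for c in counts]
print("empirical frequencies:", frequencies)
print("acceptance rate:", accepted / num_samples)
```

The empirical frequencies should approach `target_probs`, which is the point of the rule: the output follows the target model's distribution even though tokens were proposed by the draft model. The acceptance rate should approach the sum of the element-wise minimum of the two distributions, 0.6 for these toy values.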

KV Cache

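The article 图解大模型推理优化之 KV Cache ("An Illustrated Guide to LLM Inference Optimization: KV Cache") gives a detailed yet concise explanation. The HuggingFace code it references comes from transformers/models/decision_transformer/modeling_decision_transformer.py: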

    # Reshape the projections to (batch, num_heads, seq_len, head_dim)
    query = self._split_heads(query, self.num_heads, self.head_dim)
    key = self._split_heads(key, self.num_heads, self.head_dim)
    value = self._split_heads(value, self.num_heads, self.head_dim)

    # Append the new key/value to the cached ones along the sequence axis
    if layer_past is not None:
        past_key, past_value = layer_past
        key = torch.cat((past_key, key), dim=-2)
        value = torch.cat((past_value, value), dim=-2)

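The core logic: if past keys and values exist, the current key and value are concatenated onto them along the sequence dimension. Keys and values for earlier tokens are therefore reused rather than recomputed, which reduces the amount of computation per decoding step.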
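
The concatenation logic above can be illustrated with a dependency-free toy. This is a minimal sketch, assuming single-head attention and fixed scaling factors standing in for the learned Q/K/V projections. `KVCache`, `project`, `decode_step`, and `decode_without_cache` are hypothetical names, not part of transformers. It checks that decoding with a cache gives the same output as recomputing all keys and values at every step.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all given positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def project(x, factor):
    """Toy stand-in for a learned linear projection (assumption)."""
    return [factor * xi for xi in x]


class KVCache:
    """Stores the keys and values of all tokens processed so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(x, cache):
    """Process one new token: compute K/V only for it, reuse the rest."""
    query, key, value = project(x, 1.0), project(x, 0.5), project(x, 2.0)
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


def decode_without_cache(xs):
    """Baseline: recompute K/V for every token seen so far."""
    keys = [project(x, 0.5) for x in xs]
    values = [project(x, 2.0) for x in xs]
    return attend(project(xs[-1], 1.0), keys, values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
for step, x in enumerate(tokens):
    cached_out = decode_step(x, cache)
    full_out = decode_without_cache(tokens[: step + 1])
    assert all(math.isclose(a, b) for a, b in zip(cached_out, full_out))
print("cached positions:", len(cache.keys))
```

With the cache, each step performs one key/value projection instead of one per token seen so far, at the cost of keeping all past keys and values in memory.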

MQA

Multi-Query Attention (MQA)

GQA

Grouped-Query Attention (GQA)

Flash Attention

PagedAttention

1. Charlie Chen et al., “Accelerating Large Language Model Decoding with Speculative Sampling,” arXiv preprint arXiv:2302.01318, 2023.
2. Yaniv Leviathan et al., “Fast Inference from Transformers via Speculative Decoding,” International Conference on Machine Learning, 2023, 19274–86.