LLMs don’t just get slower under load; per-request latency climbs disproportionately as you push concurrency higher, a tail effect that simple batching often masks behind healthy throughput numbers.
Let’s watch a real-time inference request flow through a typical LLM serving setup. Imagine a user sends a prompt: "Tell me a story about a brave knight."
    {
      "request_id": "req-abc123",
      "timestamp": "2023-10-27T10:00:00Z",
      "prompt": "Tell me a story about a brave knight.",
      "max_tokens": 100,
      "temperature": 0.7
    }
This request hits an API Gateway, which forwards it to a load balancer. The load balancer, say Nginx, might have a configuration like this:
    http {
        upstream llm_servers {
            server 10.0.1.10:8000 weight=1;
            server 10.0.1.11:8000 weight=1;
            server 10.0.1.12:8000 weight=1;
        }
        server {
            listen 80;
            location /infer {
                proxy_pass http://llm_servers;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_connect_timeout 5s;
                proxy_send_timeout 10s;
                proxy_read_timeout 10s;
            }
        }
    }
The request lands on one of the inference servers, say 10.0.1.10, which is running a Python FastAPI application. This app uses a framework like vLLM to manage the model and handle concurrent requests.
    import asyncio

    from fastapi import FastAPI
    from vllm import LLM, SamplingParams

    app = FastAPI()
    # The model is loaded once per process, at import time.
    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")


    @app.post("/infer")
    async def infer(request: dict):
        prompt = request["prompt"]
        # Build SamplingParams per request; mutating a shared module-level
        # object would leak settings between concurrent requests.
        sampling_params = SamplingParams(
            temperature=request.get("temperature", 0.7),
            max_tokens=request.get("max_tokens", 100),
        )
        # LLM.generate is synchronous, so run it in the default thread pool
        # to keep the event loop responsive. For serving, vLLM's
        # AsyncLLMEngine is the API designed to batch concurrent requests.
        loop = asyncio.get_running_loop()
        results = await loop.run_in_executor(
            None, lambda: llm.generate(prompt, sampling_params)
        )
        return {
            "text": results[0].outputs[0].text,
            "request_id": request.get("request_id"),
        }

    # To run this: uvicorn main:app --host 0.0.0.0 --port 8000
    # (each additional uvicorn worker would load its own copy of the model)
When vLLM receives this request, it doesn’t just run it immediately. It queues it up. If multiple requests arrive in quick succession, vLLM intelligently batches them together to improve GPU utilization. This is where the magic and the latency issues begin.
The core problem is that LLM inference is sequential per token. Even with batching, the GPU has to run one generation step per output token, so a request with max_tokens=100 can need up to 100 decoding steps. Suppose a single request takes 500ms on its own and 10 requests arrive together. Batching them might finish all 10 in, say, 600ms, roughly an 8x throughput gain over running them back to back (5000ms), but every request in that batch now takes about 600ms. Per-request latency has not improved, and P99 can be higher still once queueing delay, batching overhead, or a poorly chosen batch size come into play.
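The arithmetic behind that example (500ms alone, 10 requests, 600ms as one batch) can be worked through in a few lines. This is only a sketch of the trade-off using the illustrative numbers above; `batching_tradeoff` is a hypothetical helper, not part of vLLM.

```python
def batching_tradeoff(solo_ms: float, batched_ms: float, num_requests: int) -> dict:
    """Compare serving requests one at a time against serving them as one batch."""
    # Total wall-clock time if the requests ran back to back.
    sequential_total_ms = solo_ms * num_requests
    return {
        "sequential_total_ms": sequential_total_ms,
        # How many times more work per unit time the batch achieves.
        "throughput_gain": sequential_total_ms / batched_ms,
        # Extra latency each request pays compared with running alone.
        "latency_change_ms": batched_ms - solo_ms,
    }


if __name__ == "__main__":
    print(batching_tradeoff(solo_ms=500, batched_ms=600, num_requests=10))
```

Throughput improves by a large factor while each individual request gets slower, which is why throughput dashboards alone say little about tail latency.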
The metric we’re optimizing is the end-to-end journey of a single user’s request, measured from when their prompt is sent to when they receive the first token of the generated response (time to first token, or TTFT). Our goal is for 99% of requests to receive that first token within a specific Service Level Objective (SLO), say 1000ms.
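Checking a P99 target against measured latencies might look like the following. It is a minimal sketch that assumes you already collect per-request latencies in milliseconds and uses the nearest-rank percentile definition; `percentile` and `meets_slo` are hypothetical helper names.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(latencies_ms: Sequence[float], slo_ms: float = 1000, pct: int = 99) -> bool:
    """True when the pct-th percentile latency is within the SLO."""
    return percentile(latencies_ms, pct) <= slo_ms


if __name__ == "__main__":
    # 98 fast requests, one slow request, and one outlier.
    samples = [100.0] * 98 + [900.0, 1500.0]
    print(percentile(samples, 99), meets_slo(samples))
```

A single outlier among 100 requests does not break a P99 target, but two would, which is why tail behavior needs far more samples than an average does.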
The most surprising truth about LLM latency is that it’s not just about raw computational speed; it’s a complex interplay of model architecture, hardware, request patterns, and crucially, the inference serving framework’s ability to manage concurrency and memory.
The system works by breaking down the prompt into tokens, feeding these tokens into the transformer layers of the LLM, and then predicting the next token based on the context. This process repeats until an end-of-sequence token is generated or max_tokens is reached. The key challenge is that each token generation step requires a forward pass through the entire network, which is computationally expensive.
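The loop described above can be sketched with a stand-in for the model. Nothing here is a real LLM: `countdown_model` and the `EOS` token id are assumptions made for illustration, and the point is only the control flow of one forward pass per generated token.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def generate(
    next_token: Callable[[List[int]], int],
    prompt_tokens: List[int],
    max_tokens: int,
) -> List[int]:
    """Autoregressive decoding: one model call per output token."""
    context = list(prompt_tokens)
    output: List[int] = []
    for _ in range(max_tokens):
        token = next_token(context)  # stands in for a full forward pass
        if token == EOS:
            break  # natural stopping point before max_tokens
        output.append(token)
        context.append(token)  # the new token becomes part of the context
    return output


def countdown_model(context: List[int]) -> int:
    """Toy model: emits the last token minus one until it reaches EOS."""
    return max(context[-1] - 1, EOS)


if __name__ == "__main__":
    print(generate(countdown_model, [5], max_tokens=100))
    print(generate(countdown_model, [5], max_tokens=2))
```

The first call stops early on the end-of-sequence token, and the second is cut off by max_tokens, mirroring the two stopping conditions in the text.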
To achieve P99 SLOs, you need to understand and control several levers:
- Model Quantization: Reducing the precision of model weights (e.g., from FP16 to INT8 or even INT4) can significantly speed up computation and reduce memory bandwidth requirements.
  - Diagnosis: Use `nvidia-smi` to monitor GPU utilization and memory usage. If memory is consistently high and utilization is moderate, quantization might help. Compare inference times of FP16 vs. quantized models on representative prompts.
  - Fix: Load a quantized model. For example, use vLLM with `quantization="awq"` or `quantization="gptq"`: `llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")`. This reduces model size and speeds up matrix multiplications.
  - Why it works: Lower precision requires fewer bits to represent weights and activations, leading to faster arithmetic operations and less data to move around.
- Kernel Optimization (e.g., FlashAttention): Standard attention mechanisms are memory-bandwidth bound. Optimized kernels can reduce memory reads/writes.
  - Diagnosis: Observe GPU memory bandwidth usage. If it’s maxed out while compute utilization is not, optimized kernels can help.
  - Fix: Ensure your inference framework (like vLLM) is compiled with or automatically uses optimized kernels like FlashAttention-2. For vLLM, this is often automatic if your CUDA version and hardware support it. You might need to ensure PyTorch is built with CUDA support and that the `vllm` package is installed correctly with appropriate CUDA toolkit versions.
  - Why it works: FlashAttention reorders the computation to avoid materializing large intermediate attention matrices, reducing memory I/O.
- Batching Strategies: Dynamic batching (grouping requests that arrive close in time) is crucial. Continuous batching, as implemented by vLLM, is even better as it allows requests to enter and leave the batch dynamically, avoiding static batching’s inefficiencies.
  - Diagnosis: Monitor the queue length and the average batch size within your inference server. If requests are frequently waiting in a queue even when GPUs are not fully saturated, batching might be suboptimal.
  - Fix: Use a framework like vLLM, which implements continuous batching. Ensure your `max_num_batched_tokens` and `max_num_seqs` parameters are tuned for your workload: `llm = LLM(model="...", max_num_batched_tokens=2048, max_num_seqs=128)`. These parameters control how many tokens and sequences can be processed concurrently.
  - Why it works: Continuous batching maximizes GPU utilization by always having work to do, as new requests can join a running batch as soon as a sequence within that batch finishes.
- KV Cache Optimization: The Key-Value cache stores intermediate attention states, which can consume a lot of GPU memory. Efficient management is key.
  - Diagnosis: Monitor GPU memory. If you can only serve a few concurrent requests before hitting OOM errors, KV cache is likely the bottleneck.
  - Fix: Use techniques like PagedAttention (used by vLLM). Ensure vLLM is configured to use it; vLLM’s default settings leverage PagedAttention. If you were using a different framework, you might need to implement or enable similar mechanisms.
  - Why it works: PagedAttention manages the KV cache in a more memory-efficient way, similar to operating system virtual memory, by dividing it into fixed-size blocks.
- Request Scheduling and Prioritization: For mixed workloads (e.g., interactive vs. batch), prioritizing shorter requests or those with tighter latency budgets can improve P99.
  - Diagnosis: Analyze latency distributions for different request types or lengths. If short requests are frequently delayed by long ones, a scheduler can help.
  - Fix: Implement a multi-queue scheduler or use an inference server that supports request prioritization. For example, one might have a "priority" queue for interactive requests and a "fair-share" queue for background tasks, with the priority queue always being serviced first.
  - Why it works: Ensures that high-priority or low-latency-budget requests are processed without being significantly blocked by longer-running ones.
- Hardware and Network: Under-provisioned GPUs or high network latency between your application and the inference servers can be major bottlenecks.
  - Diagnosis: Use `nvidia-smi` for GPU metrics and `ping`/`traceroute` for network latency. High CPU utilization on the inference server might indicate it’s bottlenecked by data loading or pre/post-processing.
  - Fix: Upgrade GPUs, ensure your inference servers are in the same availability zone/region as your application, or use faster network interconnects (e.g., InfiniBand if applicable). Ensure the inference server itself isn’t CPU-bound by optimizing data handling.
  - Why it works: Faster hardware can perform computations quicker, and lower network latency means less time waiting for data to travel.
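The two-queue idea from the Request Scheduling and Prioritization lever can be sketched with a single heap keyed on priority class and arrival order. This is an illustrative, framework-independent sketch: `PriorityScheduler`, `INTERACTIVE`, and `BACKGROUND` are hypothetical names, and a real deployment would hook this into whatever admits requests to the inference engine.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional

INTERACTIVE, BACKGROUND = 0, 1  # lower value is served first


@dataclass(order=True)
class _Entry:
    priority: int
    arrival: int  # tie-breaker that keeps FIFO order within a class
    request_id: str = field(compare=False)


class PriorityScheduler:
    """Strict-priority scheduler: interactive requests always go first."""

    def __init__(self) -> None:
        self._heap: List[_Entry] = []
        self._arrivals = itertools.count()

    def submit(self, request_id: str, priority: int) -> None:
        heapq.heappush(self._heap, _Entry(priority, next(self._arrivals), request_id))

    def next_request(self) -> Optional[str]:
        """Return the next request id to run, or None when idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap).request_id


if __name__ == "__main__":
    scheduler = PriorityScheduler()
    scheduler.submit("bg-1", BACKGROUND)
    scheduler.submit("chat-1", INTERACTIVE)
    scheduler.submit("bg-2", BACKGROUND)
    scheduler.submit("chat-2", INTERACTIVE)
    while (request_id := scheduler.next_request()) is not None:
        print(request_id)
```

Strict priority is the simplest policy to reason about, but it can starve background work under sustained interactive load, so some form of aging or a reserved share for the background queue may be worth adding.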
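The most overlooked aspect of LLM latency optimization is the impact of sampling parameters like temperature and top_p on the number of decoding steps actually taken. max_tokens only sets an upper bound: generation ends as soon as the model emits an end-of-sequence token. With low temperatures, decoding concentrates on the most likely tokens, so the model may stop well before the limit once end-of-sequence becomes the most probable choice. Conversely, higher temperatures lead to more "exploratory" token choices, which can increase the number of steps taken before a natural stopping point. Either way, output length, and with it total response time, varies with the sampling settings, so measure latency under the settings you actually run in production.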
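A toy calculation can make the link between temperature and step count concrete. It rests on a strong simplifying assumption that no real model satisfies: the logits, including the end-of-sequence logit at index 0, are the same at every step. Under that assumption the step count is a capped geometric variable, and the sketch shows that the direction of the temperature effect depends on whether end-of-sequence is already a likely token.

```python
import math
from typing import Sequence


def eos_probability(logits: Sequence[float], eos_index: int, temperature: float) -> float:
    """Probability of sampling EOS from a temperature-scaled softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return weights[eos_index] / sum(weights)


def expected_steps(p_eos: float, max_tokens: int) -> float:
    """Expected decode steps when each step emits EOS with probability p_eos, capped at max_tokens."""
    return sum((1.0 - p_eos) ** k for k in range(max_tokens))


# Toy logits (assumed constant across steps); index 0 is EOS.
EOS_LIKELY = [2.0, 1.0, 0.0]
EOS_UNLIKELY = [0.0, 1.0, 2.0]

if __name__ == "__main__":
    for name, logits in (("EOS likely", EOS_LIKELY), ("EOS unlikely", EOS_UNLIKELY)):
        for temperature in (0.5, 1.5):
            p = eos_probability(logits, 0, temperature)
            steps = expected_steps(p, max_tokens=100)
            print(f"{name}, T={temperature}: p_eos={p:.3f}, expected steps={steps:.1f}")
```

When end-of-sequence is the most likely token, a low temperature sharpens it and shortens the output; when it is unlikely, a low temperature suppresses it and lengthens the output. That dependence on context is a reason to measure output lengths empirically rather than assume a direction.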
Once you’ve hit your P99 SLOs, the next challenge is often managing the cost of serving these optimized LLMs at scale, which involves deep dives into efficient deployment strategies and autoscaling.