The most surprising thing about GPU cost per token is that it’s often cheaper to run a smaller, more efficient model on a high-end GPU than a massive model on a cheaper one, because the higher hourly price is spread across far more tokens per second.

Let’s see this in action. Imagine we’re running a hypothetical LLM inference workload on an NVIDIA A100 GPU.

Here’s a simplified view of the cost calculation:

  • GPU Cost: Let’s say an A100 costs $10 per hour.
  • Inference Speed: The model can process 1000 tokens per second.
  • Cost Per Second: $10 / 3600 seconds = $0.00278 per second.
  • Cost Per Token: $0.00278 / 1000 tokens = $0.00000278 per token.

Now, consider a different scenario: a smaller, more optimized model running on the same A100, but achieving 2000 tokens per second.

  • GPU Cost: Still $10 per hour.
  • Inference Speed: 2000 tokens per second.
  • Cost Per Second: $0.00278 per second.
  • Cost Per Token: $0.00278 / 2000 tokens = $0.00000139 per token.

This smaller model, despite potentially having lower "quality" for some tasks, costs half as much per token: the GPU bill is identical, but it is divided across twice as many tokens.
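The arithmetic in the two scenarios can be sketched in a few lines of Python. The function name `cost_per_token` is just an illustrative choice, and the $10/hour price and the throughput figures are the hypothetical values used above, not measured numbers.

```python
SECONDS_PER_HOUR = 3600


def cost_per_token(gpu_hourly_cost: float, tokens_per_second: float) -> float:
    """Dollars per generated token for a GPU billed by the hour."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return gpu_hourly_cost / SECONDS_PER_HOUR / tokens_per_second


# Hypothetical figures from the two scenarios above.
large = cost_per_token(gpu_hourly_cost=10.0, tokens_per_second=1000)
small = cost_per_token(gpu_hourly_cost=10.0, tokens_per_second=2000)

print(f"larger model:  ${large:.8f} per token (${large * 1e6:.2f} per million)")
print(f"smaller model: ${small:.8f} per token (${small * 1e6:.2f} per million)")
```

Quoting the result per million tokens (roughly $2.78 versus $1.39 here) is usually easier to read than a string of leading zeros.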

The problem this metric solves is understanding the true operational cost of serving LLM requests. It’s not just about the raw compute power of the GPU, but how efficiently that power is turned into generated tokens. Cost per token is the fundamental unit economics for any LLM inference service.

Internally, this calculation hinges on two primary levers:

  1. GPU Instance Cost: This is the most straightforward. It’s the hourly or per-second cost of renting or owning the GPU hardware. This varies wildly by cloud provider, GPU type (e.g., A100, H100, L4, T4), and region. You can find these prices on cloud provider websites or through cost management tools. For example, an AWS p4d.24xlarge instance with 8 A100 GPUs might cost around $32 per hour, meaning each A100 is roughly $4/hour.
  2. Inference Throughput (Tokens Per Second): This is the critical, workload-dependent factor. It’s a measure of how many tokens your LLM can generate within a given second on that specific GPU. This is influenced by:
    • Model Size and Architecture: Larger models (more parameters) generally require more computation per token.
    • Quantization: Using lower-precision data types (e.g., INT8 or INT4 in place of FP16) can significantly speed up computation and reduce memory traffic, often with only a small quality loss.
    • Batch Size: Processing multiple requests concurrently (batching) can improve GPU utilization, but too large a batch can increase latency.
    • KV Cache Optimization: Efficiently managing the Key-Value cache for attention mechanisms is crucial for throughput.
    • Software Stack: The inference server (e.g., vLLM, TGI, Triton) and underlying libraries (e.g., CUDA, cuDNN) play a massive role.
    • Prompt Length: Longer prompts require more computation for the initial processing phase.
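To see how the two levers interact, here is a minimal sketch that derives a per-GPU hourly rate from an instance price and then sweeps a few throughput values. The $32/hour and 8-GPU figures come from the example above; the throughput values are made up for illustration, and the helper names are hypothetical.

```python
def per_gpu_hourly_cost(instance_hourly_cost: float, gpus_per_instance: int) -> float:
    """Split an instance's hourly price evenly across its GPUs."""
    return instance_hourly_cost / gpus_per_instance


def cost_per_million_tokens(gpu_hourly_cost: float, tokens_per_second: float) -> float:
    """Dollars per one million generated tokens on a single GPU."""
    return gpu_hourly_cost / 3600 / tokens_per_second * 1_000_000


# Roughly $4/hour per A100, using the approximate instance price quoted above.
gpu_cost = per_gpu_hourly_cost(instance_hourly_cost=32.0, gpus_per_instance=8)

# Hypothetical throughputs; real values must come from benchmarking.
for tps in (500, 1000, 2000):
    print(f"{tps:>5} tok/s -> ${cost_per_million_tokens(gpu_cost, tps):.2f} per million tokens")
```

Doubling throughput halves the cost, exactly as halving the hourly price would, which is why the throughput factors listed above deserve as much attention as the price sheet.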

To get accurate numbers, you need to benchmark your specific model on your target hardware. Tools like nvidia-smi can show GPU utilization, but utilization is not token throughput; for that, you’ll need your inference framework’s metrics. For instance, vLLM periodically logs average prompt and generation throughput and exposes metrics for monitoring, and its API responses include token counts that you can divide by elapsed time.
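If your framework does not report throughput directly, a rough client-side measurement is easy to sketch. The code below assumes a `generate` callable that takes a prompt and returns the generated token ids; that interface is an assumption for illustration, not any particular library's API, and `fake_generate` is a stand-in so the example runs without a model.

```python
import time
from typing import Callable, List


def measure_tokens_per_second(
    generate: Callable[[str], List[int]], prompts: List[str]
) -> float:
    """Time `generate` over all prompts and return generated tokens per second."""
    start = time.perf_counter()
    total_tokens = 0
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate(prompt: str) -> List[int]:
    """Stand-in for a real model call: sleeps briefly, returns 50 dummy token ids."""
    time.sleep(0.01)
    return list(range(50))


tps = measure_tokens_per_second(fake_generate, ["hello"] * 5)
print(f"measured throughput: {tps:.0f} tokens/second")
```

This loop sends requests one at a time, so it understates what a batching server can deliver. A realistic benchmark would issue concurrent requests with prompt and output lengths that match production traffic.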

The actual calculation is:

Cost per Token = (GPU Hourly Cost / 3600 seconds) / Tokens per Second

If you’re using a managed service that abstracts away the GPU, you’re looking at the provider’s per-token API cost, which already bakes in their hardware and operational expenses. But when self-hosting, this direct calculation is your key to profitability.
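One way to use the formula when comparing against a managed API is to invert it: given a GPU price and the API's per-token price, what sustained throughput does self-hosting need to break even? The sketch below assumes a hypothetical API price of $2.00 per million tokens and the roughly $4/hour A100 figure from earlier; it ignores engineering time, networking, and idle capacity.

```python
def break_even_tokens_per_second(
    gpu_hourly_cost: float, api_price_per_million_tokens: float
) -> float:
    """Sustained tokens/second at which self-hosting matches the API's per-token price."""
    api_price_per_token = api_price_per_million_tokens / 1_000_000
    return gpu_hourly_cost / 3600 / api_price_per_token


# Both inputs are illustrative assumptions, not quoted prices.
needed = break_even_tokens_per_second(
    gpu_hourly_cost=4.0, api_price_per_million_tokens=2.0
)
print(f"break-even throughput: {needed:.0f} tokens/second")
```

Any sustained throughput above that figure makes the self-hosted GPU cheaper per token than the API under these assumptions; anything below it makes the API the better deal.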

A crucial, often overlooked aspect of GPU cost per token is the impact of intermittent high utilization versus sustained moderate utilization. A GPU might show 99% utilization during a burst of tokens, but if that burst is short and followed by idle time, the average cost per token over a longer period will be higher than if the GPU maintained a steady 70% utilization generating tokens continuously. This is because the hourly cost of the GPU is constant, regardless of whether it’s fully busy or partially idle. Therefore, optimizing for sustained throughput, even if it means slightly lower peak utilization, is often more cost-effective.
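The effect of idle time is easy to quantify by dividing the bill by the tokens actually produced over the billed period. In this sketch the workloads are hypothetical: a bursty service that runs at 2000 tokens per second but is busy for only 20 minutes of each hour, and a steady one that runs at 1400 tokens per second (70% of that peak) for the full hour. Treating utilization as proportional to throughput is a simplification.

```python
def effective_cost_per_token(
    gpu_hourly_cost: float, tokens_generated: int, billed_hours: float
) -> float:
    """Dollars per token over a billed period, idle time included."""
    return gpu_hourly_cost * billed_hours / tokens_generated


GPU_HOURLY_COST = 10.0  # hypothetical price used earlier in the article

# Bursty: 2000 tok/s, but only 20 busy minutes in the billed hour.
bursty_tokens = 2000 * 20 * 60
# Steady: 1400 tok/s for all 60 minutes.
steady_tokens = 1400 * 60 * 60

bursty = effective_cost_per_token(GPU_HOURLY_COST, bursty_tokens, billed_hours=1.0)
steady = effective_cost_per_token(GPU_HOURLY_COST, steady_tokens, billed_hours=1.0)

print(f"bursty: ${bursty * 1e6:.2f} per million tokens")
print(f"steady: ${steady * 1e6:.2f} per million tokens")
```

With these numbers the steady workload produces more than twice as many tokens for the same bill, so its cost per token is less than half that of the bursty one despite the lower peak rate.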

The next concept you’ll grapple with is optimizing for latency versus throughput, and how they directly conflict and influence your cost per token.
