LLM Quantization: INT4, INT8, GPTQ, AWQ Compared (2026)

LLM quantization is less about saving disk space than about fitting a model into the memory of less powerful hardware, which it does by reducing the numerical precision of the model's weights.

Let's see what this looks like in practice. Take a large language model such as Llama 2 70B, with its parameters (weights) stored as 16-bit floating-point numbers (FP16). Each weight takes 2 bytes, so 70 billion parameters need about 140 GB of memory for the weights alone. That does not fit on a single GPU, even a high-end one with 48 GB or 80 GB of memory, unless you shard the model across devices or offload part of it to CPU memory.

Quantization attacks this by reducing the number of bits used to represent each weight. The most common targets are 8-bit integers (INT8) and 4-bit integers (INT4).
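The arithmetic behind these figures is easy to check. The sketch below counts only the raw weights in decimal gigabytes; the function name `weight_memory_gb` is made up for illustration, and real quantized checkpoints come out slightly larger because they also store scale factors and other metadata.

```python
# Weight-only memory footprint at different precisions (1 GB = 1e9 bytes).
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: float, bits: int) -> float:
    """Return the memory needed for the weights alone, in GB."""
    return num_params * bits / 8 / 1e9

num_params = 70e9  # Llama 2 70B, rounded as in the text
footprints = {name: weight_memory_gb(num_params, bits)
              for name, bits in BITS_PER_WEIGHT.items()}

for name, gb in footprints.items():
    print(f"{name}: {gb:.0f} GB")
# FP16: 140 GB
# INT8: 70 GB
# INT4: 35 GB
```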

INT8 Quantization: Reduces each weight to 1 byte. For Llama 2 70B, this cuts the model size down to approximately 70 GB. This is often achievable with minimal performance degradation and can be done in a few ways.
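- Post-Training Quantization (PTQ): This is the simplest. You take a fully trained FP16 model and convert its weights to INT8 with no retraining. The command below is illustrative; the exact module path and flags depend on your toolkit and version, so check them against its documentation.
```
# Illustrative invocation: module path and flags may differ in your toolkit
python -m llama_recipes.inference.quantization --model_name /path/to/llama-2-70b-hf --quantization_type int8 --output_dir /path/to/llama-2-70b-int8
```
  This works because the weights of a trained LLM are not spread evenly over their range: most of them cluster tightly around zero. With a scale factor per channel or per group, the 256 levels of INT8 cover that occupied range with little rounding error.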
- Quantization-Aware Training (QAT): This involves simulating the quantization process during training. The model learns to compensate for the precision loss. This yields better accuracy but requires retraining.
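To make the post-training conversion concrete, here is a minimal NumPy sketch of symmetric "absmax" INT8 quantization with one scale per row. It illustrates the general idea only and is not the code path of any particular library; the function names and the toy weight matrix are assumptions.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-row absmax quantization to INT8."""
    # One scale per row, chosen so the largest magnitude in the row maps to 127.
    scales = np.abs(weights).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero rows
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_int8(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Map INT8 codes back to approximate floating-point weights."""
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
# Toy weights clustered around zero, as in a trained network.
weights = rng.normal(0.0, 0.02, size=(4, 1024)).astype(np.float32)

q, scales = quantize_int8(weights)
restored = dequantize_int8(q, scales)
max_error = np.abs(weights - restored).max()

print(q.nbytes, weights.astype(np.float16).nbytes)  # 4096 8192
print("max abs error:", max_error)
```

The rounding error of each weight is bounded by half of its row's scale, which is why a narrow, zero-centered weight distribution quantizes well.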

INT4 Quantization: This is where things get even more aggressive, reducing each weight to just 0.5 bytes. Llama 2 70B would then be around 35 GB. This offers the biggest memory savings but also the highest risk of accuracy loss. Several methods exist to mitigate this:
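The 0.5 bytes per weight comes from packing two 4-bit values into each byte. The sketch below shows one possible packing (low nibble first, values shifted to an unsigned range); actual libraries each use their own layout, so treat the details as assumptions.

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit integers in [-8, 7] two per byte (even length required)."""
    unsigned = (values.astype(np.int16) + 8).astype(np.uint8)  # shift to [0, 15]
    return (unsigned[0::2] | (unsigned[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the signed 4-bit integers from the packed bytes."""
    low = (packed & 0x0F).astype(np.int8) - 8
    high = (packed >> 4).astype(np.int8) - 8
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = low
    out[1::2] = high
    return out

values = np.array([-8, -1, 0, 7, 3, -4], dtype=np.int8)
packed = pack_int4(values)
print(packed.nbytes)          # 3 bytes for 6 weights
print(unpack_int4(packed))    # [-8 -1  0  7  3 -4]
```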

GPTQ (Generative Pre-trained Transformer Quantization): This is a very popular PTQ method. It quantizes weights layer by layer, using a second-order approximation to minimize the quantization error. It’s computationally intensive to create the GPTQ model but fast to run.

```python
# Example using the AutoGPTQ library
import json

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name_or_path = "/path/to/llama-2-70b-hf"
quantized_dir = "/path/to/llama-2-70b-gptq"

quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit weights
    group_size=128,     # one scale/zero-point per group of 128 weights
    desc_act=False,     # True can improve accuracy at some inference-speed cost
    damp_percent=0.01,  # Hessian damping for numerical stability
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Calibration data: representative text samples, one JSON object per line.
# The "text" field name is an assumption about the file's layout.
with open("/path/to/calibration_data.jsonl") as f:
    examples = [
        tokenizer(json.loads(line)["text"], truncation=True, max_length=2048)
        for line in f
        if line.strip()
    ]

# Load the FP16 model, then quantize it layer by layer using the calibration samples
model = AutoGPTQForCausalLM.from_pretrained(model_name_or_path, quantize_config)
model.quantize(examples)

model.save_quantized(quantized_dir, use_safetensors=True)
tokenizer.save_pretrained(quantized_dir)
```

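GPTQ quantizes the weights of each layer one column at a time. It uses the Hessian (the matrix of second derivatives) of that layer's output reconstruction error, estimated from calibration inputs, to measure how much each rounding error matters. After quantizing a weight, it adjusts the weights that have not been quantized yet to compensate, so earlier rounding errors are absorbed by later weights instead of accumulating.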
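The error-compensation idea can be illustrated on a single toy layer. The following is a simplified NumPy sketch, not the AutoGPTQ implementation: it uses a plain matrix inverse instead of the Cholesky-based, blocked updates of the real algorithm, a single scale per row, and made-up function names and data. It compares plain round-to-nearest with column-by-column quantization that pushes each column's error onto the remaining columns.

```python
import numpy as np

def quantize_to_grid(w, scale):
    """Round to the nearest signed 4-bit level and return the dequantized value."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_like_quantize(W, X, scale, damp=0.01):
    """Quantize columns left to right, compensating on the not-yet-quantized columns."""
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    H = X @ X.T                                      # layer-wise Hessian, up to a constant
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)   # damping for numerical stability
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(d_in):
        Q[:, i] = quantize_to_grid(W[:, i], scale[:, 0])
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        # Shift the remaining weights so the layer output changes as little as possible.
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
        # Remove column i from the inverse Hessian of the remaining problem.
        Hinv -= np.outer(Hinv[:, i], Hinv[i, :]) / Hinv[i, i]
    return Q

def output_error(W, Q, X):
    """Squared error of the layer output caused by replacing W with Q."""
    return np.linalg.norm((W - Q) @ X) ** 2

rng = np.random.default_rng(0)
d_out, d_in, n = 16, 32, 256
X = rng.normal(size=(d_in, d_in)) @ rng.normal(size=(d_in, n))  # correlated inputs
W = rng.normal(size=(d_out, d_in))
scale = np.abs(W).max(axis=1, keepdims=True) / 7.0              # one scale per row

Q_rtn = quantize_to_grid(W, scale)
Q_gptq = gptq_like_quantize(W, X, scale)
rtn_err = output_error(W, Q_rtn, X)
gptq_err = output_error(W, Q_gptq, X)
print(f"round-to-nearest error: {rtn_err:.1f}")
print(f"GPTQ-like error:        {gptq_err:.1f}")
```

Both results use the same 4-bit grid; the difference lies only in which grid points are chosen. The compensation pays off when input features are correlated, which is why the toy inputs are mixed through a random matrix.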

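AWQ (Activation-aware Weight Quantization): This method is also a PTQ technique but focuses on protecting "salient" weights. It observes that not all weights are equally important; weights that are multiplied by large activation values have a bigger impact on the output. AWQ identifies these salient weight channels from activation statistics. Rather than keeping them in higher precision, which would complicate inference kernels, it scales them up before quantization so they lose relatively less precision, and folds the inverse scale into the preceding operation. All weights end up in the same low-bit format.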

```python
# Example using the AutoAWQ library
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path/to/llama-2-70b-hf"
quant_path = "/path/to/llama-2-70b-awq"
# Common AWQ config: 4-bit weights, groups of 128, asymmetric (zero-point) quantization
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}

# Load the model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize the model. AutoAWQ falls back to a default calibration dataset
# unless you pass your own samples via calib_data.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and the tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

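AWQ's core idea is that the magnitude of activations matters. It runs a small calibration sample through the model, finds the small fraction of weight channels (roughly 1% or fewer) that see the largest activations, and protects them by searching for per-channel scaling factors that minimize the quantization error of each layer's output.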
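A toy version of this scaling search fits in a few lines of NumPy. This is a sketch under simplifying assumptions, not the AutoAWQ code: one quantization scale per row instead of per group, a plain grid search over an exponent `alpha`, and made-up names and data. Because `alpha = 0` reproduces plain round-to-nearest, the best result found can never be worse than that baseline on the calibration data.

```python
import numpy as np

def quantize_rows_int4(W):
    """Symmetric per-row 4-bit round-to-nearest; returns dequantized weights."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(W / scale), -8, 7) * scale

def awq_like_quantize(W, X, alphas=np.linspace(0.0, 1.0, 11)):
    """Search a per-input-channel scale s = mean|x|**alpha that minimizes output error."""
    act_mag = np.abs(X).mean(axis=0)           # average activation magnitude per channel
    reference = X @ W.T                        # full-precision layer output
    best_err, best_alpha, best_W = np.inf, None, None
    for alpha in alphas:
        s = act_mag ** alpha                   # alpha = 0 gives s = 1 (plain rounding)
        W_hat = quantize_rows_int4(W * s) / s  # scale up, quantize, fold the scale back
        err = np.linalg.norm(reference - X @ W_hat.T) ** 2
        if err < best_err:
            best_err, best_alpha, best_W = err, alpha, W_hat
    return best_err, best_alpha, best_W

rng = np.random.default_rng(0)
n, d_in, d_out = 128, 64, 32
X = rng.normal(size=(n, d_in))
X[:, :4] *= 20.0                               # a few channels with large activations
W = rng.normal(size=(d_out, d_in))

rtn_err = np.linalg.norm(X @ W.T - X @ quantize_rows_int4(W).T) ** 2
best_err, best_alpha, W_awq = awq_like_quantize(W, X)
print(f"round-to-nearest error: {rtn_err:.1f}")
print(f"AWQ-like error:         {best_err:.1f} (alpha = {best_alpha:.1f})")
```

In a real model the division by `s` is not applied to the stored weights; the inverse scale is absorbed into the preceding layer so that the scaled, quantized weights can be used directly at inference time.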

The trade-off is always between model size/speed and accuracy. INT4 models, especially those produced by GPTQ or AWQ, can be remarkably close to their FP16 counterparts for many tasks, making them viable for consumer hardware.

The next hurdle after getting a quantized model to load is managing the activation memory, which can still be substantial and lead to CUDA out of memory errors even with a quantized model.