LLM quantization is not about making models smaller to save disk space; it’s about making them runnable on less powerful hardware by reducing the precision of their weights.
Let’s see what this looks like in practice. Imagine a massive language model, like Llama 2 70B, with its parameters (weights) stored as 16-bit floating-point numbers (FP16). Each weight then takes 2 bytes, so 70 billion parameters need a staggering 140 GB of memory just for the weights. Loading this onto a single GPU, even a high-end one with 48 GB or 80 GB of memory, is impossible without special techniques.
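The arithmetic behind that figure is simple enough to check by hand, and a small helper makes it easy to repeat for other bit widths. The sketch below is only an illustration: the function name `weight_memory_gb` is made up for this example, it uses the nominal 70 billion parameter count, and it counts weights only (no activations, no quantization metadata).

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Return the memory needed for the weights alone, in decimal gigabytes."""
    bytes_per_weight = bits_per_weight / 8
    return num_params * bytes_per_weight / 1e9


LLAMA2_70B_PARAMS = 70e9  # nominal parameter count, as used in the text

for bits in (16, 8, 4):
    size_gb = weight_memory_gb(LLAMA2_70B_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {size_gb:.0f} GB")
```

At 16 bits this gives the 140 GB quoted above; halving the bit width halves the footprint each time.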
Quantization attacks this by reducing the number of bits used to represent each weight. The most common targets are 8-bit integers (INT8) and 4-bit integers (INT4).
- INT8 Quantization: Reduces each weight to 1 byte. For Llama 2 70B, this cuts the model size down to approximately 70 GB. This is often achievable with minimal accuracy degradation and can be done in two main ways.
- Post-Training Quantization (PTQ): This is the simplest. You take a fully trained FP16 model and convert its weights to INT8.
This works because the distribution of weights in a trained LLM is often not uniform, and many weights cluster around zero. INT8 can represent this range effectively.

For example, with a llama-recipes-style command line (the module path and flags are shown as given here; check them against your installed version):

```bash
python -m llama_recipes.inference.quantization \
    --model_name /path/to/llama-2-70b-hf \
    --quantization_type int8 \
    --output_dir /path/to/llama-2-70b-int8
```

- Quantization-Aware Training (QAT): This involves simulating the quantization process during training. The model learns to compensate for the precision loss. This yields better accuracy but requires retraining.
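To make the PTQ conversion concrete, here is a minimal sketch of one common scheme, symmetric "absmax" quantization with a single scale for the whole tensor. This is an assumption chosen for illustration, not necessarily what any particular tool does: real libraries typically use per-channel or per-group scales and handle outliers separately. The function names and the sample weights are made up.

```python
def quantize_int8(weights):
    """Symmetric (absmax) quantization of a list of floats to [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Map the integer codes back to approximate float weights."""
    return [v * scale for v in q]


# Hypothetical weights, mostly clustered around zero with one larger value.
weights = [0.02, -0.01, 0.003, -0.25, 0.11, 0.0, 0.07, -0.04]

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"scale={scale:.6f}, max abs error={max_err:.6f}")
```

The rounding error of each weight is bounded by half a quantization step (`scale / 2`), which is why a single large outlier hurts: it inflates the scale and therefore the error on every other weight.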
- INT4 Quantization: This is where things get even more aggressive, reducing each weight to just 0.5 bytes. Llama 2 70B would then be around 35 GB. This offers the biggest memory savings but also the highest risk of accuracy loss. Several methods exist to mitigate this:
- GPTQ (Generative Pre-trained Transformer Quantization): This is a very popular PTQ method. It quantizes weights layer by layer, using a second-order approximation to minimize the quantization error. It’s computationally intensive to create the GPTQ model but fast to run.
GPTQ works through the weights of each layer one column at a time. It uses second-order (Hessian) information about the layer output error, estimated from calibration data, and after quantizing each weight it adjusts the remaining unquantized weights to compensate for the error just introduced.

```python
# Example using the AutoGPTQ library
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name_or_path = "/path/to/llama-2-70b-hf"
quantized_model_dir = "/path/to/llama-2-70b-gptq"

quantize_config = BaseQuantizeConfig(
    bits=4,             # 4-bit weights
    group_size=128,     # one scale/zero-point per group of 128 weights
    desc_act=False,     # True enables activation-order quantization (slower, often more accurate)
    damp_percent=0.01,  # Hessian dampening; adjust as needed
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Calibration data: a list of tokenized, representative text samples,
# e.g. loaded from /path/to/calibration_data.jsonl.
# The single sentence below is only a placeholder.
examples = [tokenizer("Quantization reduces the precision of model weights.")]

# Load the FP16 model together with the quantization config, then quantize
model = AutoGPTQForCausalLM.from_pretrained(model_name_or_path, quantize_config)
model.quantize(examples)

model.save_quantized(quantized_model_dir, use_safetensors=True)
tokenizer.save_pretrained(quantized_model_dir)
```

- AWQ (Activation-aware Weight Quantization): This method is also a PTQ technique but focuses on protecting "salient" weights. It observes that not all weights are equally important; weights that are multiplied by large activation values have a bigger impact on the output. Keeping those weights in higher precision would help, but it would also force a mixed-precision format, so AWQ instead rescales the salient weight channels before quantization so that they lose less precision.
AWQ’s core idea is that the magnitude of activations matters. It analyzes a small calibration sample of activations to find the small fraction of weight channels (on the order of 1%) that see the largest activations, then scales those channels so that quantization introduces less error on them.

```python
# Example using the AutoAWQ library
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path/to/llama-2-70b-hf"
quant_path = "/path/to/llama-2-70b-awq"

# Common AWQ config: 4-bit weights, groups of 128, zero-point (asymmetric) quantization
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}

# Load the model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize the model. AutoAWQ falls back to a default calibration dataset
# unless you pass your own via the calib_data argument.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer explicitly
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
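Both methods end up storing 4-bit codes that share a scale and offset per group of weights. The sketch below shows only that storage format, using plain round-to-nearest quantization; it is the naive baseline that GPTQ and AWQ improve on, not an implementation of either. All function names, the tiny group size, and the sample weights are assumptions made for illustration.

```python
def quantize_group_int4(group):
    """Asymmetric round-to-nearest quantization of one group to codes 0..15."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [max(0, min(15, round((w - lo) / scale))) for w in group]
    return codes, scale, lo


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte (low nibble first)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles; count drops any padding code."""
    codes = []
    for b in packed:
        codes.extend((b & 0x0F, b >> 4))
    return codes[:count]


def dequantize_group_int4(codes, scale, lo):
    """Map 4-bit codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


GROUP_SIZE = 4  # tiny for readability; real configurations often use 128
weights = [0.02, -0.01, 0.003, -0.25, 0.11, 0.0, 0.07, -0.04]

restored, total_bytes = [], 0
for start in range(0, len(weights), GROUP_SIZE):
    group = weights[start:start + GROUP_SIZE]
    codes, scale, lo = quantize_group_int4(group)
    packed = pack_nibbles(codes)
    total_bytes += len(packed)
    restored += dequantize_group_int4(unpack_nibbles(packed, len(group)), scale, lo)

print(f"{total_bytes} bytes for {len(weights)} weights")
print([round(r, 3) for r in restored])
```

The packed codes take 0.5 bytes per weight, matching the figure above. The per-group scale and offset add a small overhead on top, which shrinks as the group size grows.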
The trade-off is always between model size/speed and accuracy. INT4 models, especially those produced by GPTQ or AWQ, can be remarkably close to their FP16 counterparts for many tasks, making them viable for consumer hardware.
The next hurdle after getting a quantized model to load is managing the activation memory, which can still be substantial and lead to CUDA out of memory errors even with a quantized model.