LLM quantization is not about making models smaller to save disk space; it’s about making them runnable on less powerful hardware by reducing the precision of their weights.

Let’s see what this looks like in practice. Imagine a massive language model, like Llama 2 70B, with its parameters (weights) stored as 16-bit floating-point numbers (FP16). This means each weight takes 2 bytes. For 70 billion parameters, that’s a staggering 140 GB of memory just for the weights. That does not fit on a single GPU, even a high-end one with 48 GB or 80 GB, unless you fall back on techniques such as CPU offloading or sharding across several GPUs.
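
The arithmetic above is easy to reproduce. The following is a minimal sketch, not part of any library: the function name `weight_memory_gb` is made up for illustration, and it counts the weights only, using decimal gigabytes (1 GB = 10^9 bytes) to match the figures in the text.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return num_params * bytes_per_weight / 1e9


if __name__ == "__main__":
    params = 70e9  # Llama 2 70B, rounded as in the text
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{label}: {weight_memory_gb(params, bits):.0f} GB")
```

For 70 billion parameters this gives 140 GB at 16 bits, 70 GB at 8 bits, and 35 GB at 4 bits. Real quantized checkpoints come out slightly larger, because they also store scales (and sometimes zero points) alongside the integer weights.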

Quantization attacks this by reducing the number of bits used to represent each weight. The most common targets are 8-bit integers (INT8) and 4-bit integers (INT4).

  • INT8 Quantization: Reduces each weight to 1 byte. For Llama 2 70B, this cuts the model size down to approximately 70 GB. This is often achievable with minimal loss of accuracy, and there are two main ways to do it.
    • Post-Training Quantization (PTQ): This is the simplest. You take a fully trained FP16 model and convert its weights to INT8.
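      # One common route, assuming Hugging Face transformers with bitsandbytes installed
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig
      model = AutoModelForCausalLM.from_pretrained("/path/to/llama-2-70b-hf", quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map="auto")
      model.save_pretrained("/path/to/llama-2-70b-int8")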
      
      This works because the weights of a trained LLM mostly cluster in a narrow range around zero. With a scale factor chosen per tensor or per channel, the 256 levels of INT8 cover that range with little rounding error.
    • Quantization-Aware Training (QAT): This involves simulating the quantization process during training. The model learns to compensate for the precision loss. This yields better accuracy but requires retraining.
  • INT4 Quantization: This is where things get even more aggressive, reducing each weight to just 0.5 bytes. Llama 2 70B would then be around 35 GB. This offers the biggest memory savings but also the highest risk of accuracy loss. Several methods exist to mitigate this:
    • GPTQ (Generative Pre-trained Transformer Quantization): This is a very popular PTQ method. It quantizes weights layer by layer, using a second-order approximation to minimize the quantization error. It’s computationally intensive to create the GPTQ model but fast to run.
      # Example using the AutoGPTQ library
      import json
      from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
      from transformers import AutoTokenizer
      
      model_name_or_path = "/path/to/llama-2-70b-hf"
      quantize_config = BaseQuantizeConfig(
          bits=4,
          group_size=128,     # one scale/zero point per group of 128 weights
          desc_act=False,     # True can improve accuracy at some inference cost
          damp_percent=0.01,  # Hessian dampening; this is the library default
      )
      tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
      model = AutoGPTQForCausalLM.from_pretrained(model_name_or_path, quantize_config)
      
      # Calibration data: representative text samples, tokenized.
      # Assumes one JSON object per line with a "text" field.
      with open("/path/to/calibration_data.jsonl") as f:
          examples = [tokenizer(json.loads(line)["text"]) for line in f]
      
      model.quantize(examples)
      model.save_quantized("/path/to/llama-2-70b-gptq", use_safetensors=True)
      tokenizer.save_pretrained("/path/to/llama-2-70b-gptq")
      
      GPTQ works through the weights of a layer one column at a time. It uses the Hessian (second-order derivatives) of the layer's output reconstruction error, estimated from the calibration data, to adjust the weights that are not yet quantized so that they compensate for the error each rounding step introduces.
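    • AWQ (Activation-aware Weight Quantization): This method is also a PTQ technique but focuses on protecting "salient" weights. It observes that not all weights are equally important; weights that are multiplied by large activation values have a bigger impact on the output. AWQ identifies these salient weight channels from activation statistics. Rather than keeping them in higher precision, which would require mixed-precision kernels, it scales them up before quantization and scales the matching activations down. This reduces their relative rounding error while every weight is still stored at the same low bit width.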
      # Example using the AutoAWQ library
      from awq import AutoAWQForCausalLM
      from transformers import AutoTokenizer
      
      model_path = '/path/to/llama-2-70b-hf'
      quant_path = '/path/to/llama-2-70b-awq'
      # Common AutoAWQ config: 4-bit weights, groups of 128, zero-point (asymmetric) quantization
      quant_config = { "w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM" }
      
      # Load the model
      model = AutoAWQForCausalLM.from_pretrained(model_path)
      tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
      
      # Quantize the model. AutoAWQ falls back to a default calibration dataset
      # when none is passed; calib_data=... overrides it.
      model.quantize(tokenizer, quant_config=quant_config)
      
      # Save explicitly; quantize() does not write anything to disk
      model.save_quantized(quant_path)
      tokenizer.save_pretrained(quant_path)
      
      AWQ’s core idea is that the magnitude of activations matters. It analyzes a small sample of activations to find the roughly 1% of weight channels that see the largest activations, then searches for per-channel scaling factors that shield those channels from rounding error.
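
Both GPTQ and AWQ improve on a simpler baseline: round each weight to the nearest point on a small integer grid, with one scale shared by a group of weights (the group size of 128 in the AWQ config above). The sketch below shows only that baseline, in plain Python, so the role of the bit width and the group size is visible. The function names are hypothetical, the scheme is symmetric (no zero point), and real libraries pack the integers into bytes and run on tensors rather than lists.

```python
import random


def quantize_group(values, bits):
    """Symmetric round-to-nearest quantization of one group of floats.

    Returns (integer codes, scale). Codes lie in [-qmax, qmax].
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_groupwise(weights, bits=4, group_size=128):
    """Split a flat weight list into groups, each with its own scale."""
    groups = []
    for start in range(0, len(weights), group_size):
        groups.append(quantize_group(weights[start:start + group_size], bits))
    return groups


def dequantize_groupwise(groups):
    """Reconstruct approximate float weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(512)]  # synthetic weights
    groups = quantize_groupwise(weights, bits=4, group_size=128)
    restored = dequantize_groupwise(groups)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"groups: {len(groups)}, worst absolute error: {worst:.5f}")
```

Within a group, the rounding error of any weight is at most half the group's scale. Smaller groups therefore track local weight ranges more closely, at the cost of storing more scales.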

The trade-off is always between model size/speed and accuracy. INT4 models, especially those produced by GPTQ or AWQ, can be remarkably close to their FP16 counterparts for many tasks, making them viable for consumer hardware.
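
One rough way to see this trade-off is to measure how much rounding error plain round-to-nearest quantization introduces as the bit width shrinks. The sketch below does that on synthetic Gaussian weights with a single scale for the whole list. The helper name `rms_roundtrip_error` is an assumption for illustration, and weight reconstruction error is only a proxy: it says nothing definitive about task accuracy, which is what methods like GPTQ and AWQ try to preserve.

```python
import math
import random


def rms_roundtrip_error(weights, bits):
    """RMS error after symmetric round-to-nearest quantization with one scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    squared = [(w - round(w / scale) * scale) ** 2 for w in weights]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic weights
    for bits in (8, 4, 3):
        err = rms_roundtrip_error(weights, bits)
        print(f"{bits}-bit: {bits / 8:.3f} bytes/weight, RMS error {err:.6f}")
```

Each step down in bit width shrinks the weights further but widens the gaps between representable values, so the error grows.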

The next hurdle after getting a quantized model to load is managing the activation memory, which can still be substantial and lead to CUDA out of memory errors even with a quantized model.
