The most surprising thing about quantizing LLMs is that you can make them dramatically smaller, and in some setups faster, by forcing their weights into just 4 or 8 bits, and still get almost the same answers.
Let’s see this in action. Imagine we have a pre-trained model, llama-2-7b-chat-hf. We’ll use the transformers library and bitsandbytes to load it directly into INT8.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
model_name = "meta-llama/Llama-2-7b-chat-hf"
# You'll need to log in to Hugging Face CLI: `huggingface-cli login`
# and have access to Llama 2.
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load model in INT8
# The `load_in_8bit=True` flag is the magic here. Recent transformers releases
# expect it inside a BitsAndBytesConfig rather than as a direct keyword argument.
# `device_map="auto"` places the model on the available GPUs (and CPU if needed).
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
# Now, let's generate some text. The model will use its INT8 weights internally.
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model_8bit.device)  # Put inputs on the model's device
with torch.no_grad():
    outputs = model_8bit.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
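When you run this and print model_8bit, you’ll notice that its linear layers show up as bitsandbytes Linear8bitLt modules rather than plain nn.Linear. The memory footprint of model_8bit, which you can check with model_8bit.get_memory_footprint(), will be roughly half that of the FP16/BF16 model (and about a quarter of an FP32 one).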
INT4 quantization is even more aggressive. You can load an LLM in 4-bit by passing a BitsAndBytesConfig with load_in_4bit=True and specifying a bnb_4bit_compute_dtype for computation (usually torch.bfloat16). This roughly halves the weight memory again compared to INT8.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
model_name = "meta-llama/Llama-2-7b-chat-hf"
# Configure 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Use bfloat16 for computations
    bnb_4bit_quant_type="nf4",  # nf4 is a common and effective type
    bnb_4bit_use_double_quant=True,  # Enables double quantization for further savings
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load model in INT4
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)
prompt = "Tell me a short story about a brave knight."
inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit.device)  # Put inputs on the model's device
with torch.no_grad():
    outputs = model_4bit.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The core problem quantization solves is the immense memory and computational cost of large language models. A 70B-parameter model in FP16 needs about 140 GB of VRAM for the weights alone, before counting activations and the KV cache. Quantizing to INT8 reduces this to ~70 GB, and INT4 to ~35 GB. This makes it feasible to run large models on far fewer GPUs, to fit smaller models on consumer hardware, or to significantly increase batch sizes on server GPUs.
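The figures above follow from simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. The sketch below reproduces them; the helper name weight_memory_gb is made up for this illustration, and it counts only the weights, ignoring quantization constants, activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical helper applied to the 70B example from the text.
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B parameters in {label}: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

Real footprints come out somewhat higher, since some layers are usually kept in higher precision and each quantized block carries its own scale.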
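Internally, bitsandbytes performs post-training quantization: the pre-trained weights are converted at load time, with no retraining. The basic idea is to map a range of floating-point values onto a small set of integer codes. bitsandbytes does this with absmax scaling, where each block (or row) of weights is divided by its largest absolute value and that scale is stored alongside the codes so the weights can be dequantized later; other schemes add a zero-point offset to handle asymmetric ranges. The weights stay in quantized form in memory and are not re-quantized after use. In the 4-bit path they are dequantized on the fly to the compute dtype (FP16 or BF16) for each matrix multiplication, while the 8-bit path (LLM.int8()) performs most of the multiplication in INT8 and routes the few outlier feature dimensions through FP16. The "nf4" (4-bit NormalFloat) quantization type is particularly effective because its 16 levels are placed at quantiles of a normal distribution, which matches the roughly normal distribution of neural-network weights better than evenly spaced levels do. Setting bnb_4bit_use_double_quant=True goes one step further by quantizing the quantization constants themselves, saving roughly another 0.4 bits per parameter.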
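To make the mapping from floats to integer codes concrete, here is a minimal sketch of absmax quantization on a single block of weights. It is a simplification for illustration, not the bitsandbytes implementation: the function names and example weights are made up, the levels are evenly spaced (NF4 uses a non-uniform codebook instead), and real kernels work on tensors in blocks rather than Python lists.

```python
def quantize_absmax(block, bits=8):
    """Map floats to signed integer codes, scaled by the block's largest magnitude."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(w) for w in block)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [round(w / scale) for w in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes and the stored scale."""
    return [c * scale for c in codes]


# Made-up weights for illustration.
weights = [0.12, -0.53, 0.07, 0.91, -0.26, 0.002, -0.88, 0.40]
codes, scale = quantize_absmax(weights, bits=8)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

Because each weight is rounded to the nearest code, the round-trip error per weight is at most half the scale. Dropping to bits=4 leaves only 15 usable levels, which is why the placement of those levels matters so much at 4-bit.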
This works because inference tolerates small errors in individual weights. Each output value is a sum over thousands of weight-activation products, so rounding errors tend to average out as long as the overall distribution of the weights and the relationships between them are preserved. In the 4-bit path the computation itself happens in a higher precision (like BF16) after a quick dequantization step, so the quantization error stays confined to the stored weights instead of compounding through the arithmetic, which prevents catastrophic accuracy loss.
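A small experiment illustrates the point. The sketch below quantizes a vector of random weights to 4 bits with evenly spaced absmax levels, dequantizes it, and compares a dot product against the unquantized result. Everything here is an assumption made for illustration: the weights and activations are synthetic Gaussian samples, the whole vector is treated as one block, and the names are made up.

```python
import random


def quantize_absmax(block, bits):
    """Evenly spaced absmax quantization (a simplification, not NF4)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in block) / qmax
    return [round(w / scale) for w in block], scale


random.seed(0)
n = 4096
weights = [random.gauss(0.0, 0.02) for _ in range(n)]      # synthetic weights
activations = [random.gauss(0.0, 1.0) for _ in range(n)]   # synthetic inputs

codes, scale = quantize_absmax(weights, bits=4)
dequantized = [c * scale for c in codes]  # back to floats before multiplying

exact = sum(w * x for w, x in zip(weights, activations))
approx = sum(w * x for w, x in zip(dequantized, activations))
error = abs(exact - approx)

# If every weight were off by the maximum half-step in the worst direction:
worst_case = (scale / 2) * sum(abs(x) for x in activations)

print("exact dot product: ", exact)
print("4-bit approximation:", approx)
print("actual error:       ", error)
print("worst-case bound:   ", worst_case)
```

The actual error is guaranteed to stay under the worst-case bound, and in practice it usually lands well below it because rounding errors of opposite sign cancel.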
A common misconception is that INT4/INT8 quantization always involves a significant accuracy drop. Some degradation is to be expected, especially for very sensitive tasks or smaller models, but advances in quantization algorithms (like nf4) and the use of compute dtypes like bfloat16 during inference have made the loss negligible on many benchmarks, often around 1% or less on metrics like perplexity or downstream task performance. Results vary by model and task, so it is worth measuring on your own workload.
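When measuring this yourself, perplexity is the exponential of the average per-token negative log-likelihood, and the usual comparison is the relative change between the baseline and the quantized model. A minimal sketch follows; the function names are made up and the loss values are hypothetical placeholders, not measured results.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_increase(baseline, quantized):
    """Fractional change of the quantized metric relative to the baseline."""
    return (quantized - baseline) / baseline


# Hypothetical per-token losses, for illustration only (not real measurements).
baseline_nlls = [1.70, 1.62, 1.75, 1.68]
quantized_nlls = [1.71, 1.63, 1.76, 1.69]

ppl_base = perplexity(baseline_nlls)
ppl_quant = perplexity(quantized_nlls)
print(f"baseline perplexity:  {ppl_base:.3f}")
print(f"quantized perplexity: {ppl_quant:.3f}")
print(f"relative increase:    {relative_increase(ppl_base, ppl_quant):.2%}")
```

In a real evaluation the losses would come from running both models over the same held-out text with the same tokenizer.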
The next frontier you’ll likely encounter is optimizing the speed of quantized inference further, perhaps by exploring custom kernels or different quantization schemes for specific layers.