QLoRA lets you fine-tune large language models on consumer-grade GPUs by storing the frozen base weights in a 4-bit format and training only small LoRA adapters on top.

Let’s see QLoRA in action. Imagine we have a base Llama 2 7B model and we want to fine-tune it on a dataset of customer support tickets to make it better at answering common questions.

First, we need to install the necessary libraries.

pip install transformers datasets peft bitsandbytes accelerate trl

Now, let’s load the base model, but crucially, we’ll load it with 4-bit quantization enabled.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "meta-llama/Llama-2-7b-hf" # Or your chosen model

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 # Or torch.float16
)

# Load the model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto" # Automatically distributes the model across available GPUs
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # Set padding token

The BitsAndBytesConfig is where the magic starts. load_in_4bit=True tells bitsandbytes to store the linear-layer weights in 4 bits. bnb_4bit_quant_type="nf4" selects NormalFloat4, a 4-bit data type whose levels are spaced for normally distributed values, which generally suits neural network weights better than plain 4-bit floats (the "fp4" option). bnb_4bit_use_double_quant=True applies a second quantization step to the quantization constants themselves, saving roughly 0.4 bits per parameter. bnb_4bit_compute_dtype=torch.bfloat16 means the weights are dequantized to a higher-precision format (bfloat16 or float16) for each computation to maintain accuracy, even though they are stored in 4-bit. device_map="auto" lets Accelerate place the quantized model on the single available GPU, or spread it across several if you have more than one.
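To make "stored in 4-bit, computed in higher precision" concrete, here is a minimal sketch of blockwise absmax quantization in plain Python. It is an illustration, not the bitsandbytes implementation: the uniform 16-level codebook is an assumption standing in for the real NF4 levels, and the block size is shrunk so the output is readable.

```python
# Illustrative blockwise 4-bit quantization (NOT the bitsandbytes implementation).
# Assumption: a uniform 16-level codebook stands in for the real NF4 levels.

BLOCK_SIZE = 4  # tiny for readability; real implementations use larger blocks
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]  # 16 levels in [-1, 1]


def quantize_block(block):
    """Return (absmax scale, list of 4-bit codebook indices) for one block."""
    scale = max(abs(w) for w in block) or 1.0  # fall back to 1.0 for an all-zero block
    indices = [
        min(range(16), key=lambda i: abs(CODEBOOK[i] - w / scale))
        for w in block
    ]
    return scale, indices


def dequantize_block(scale, indices):
    """Recover approximate weights, as done before each matrix multiplication."""
    return [CODEBOOK[i] * scale for i in indices]


weights = [0.12, -0.40, 0.05, 0.31, -0.02, 0.08, -0.09, 0.01]
blocks = [weights[i:i + BLOCK_SIZE] for i in range(0, len(weights), BLOCK_SIZE)]
quantized = [quantize_block(b) for b in blocks]
restored = [w for scale, idx in quantized for w in dequantize_block(scale, idx)]

max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized)
print(f"max reconstruction error: {max_error:.4f}")
```

Each block keeps one scale (the quantization constant) plus a 4-bit index per weight. Double quantization compresses those per-block scales in turn.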

Next, we set up the Parameter-Efficient Fine-Tuning (PEFT) configuration, specifically for QLoRA.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank of the update matrices
    lora_alpha=32, # Alpha scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Modules to adapt: the attention projections (q_proj, k_proj, v_proj, o_proj)
    # plus the MLP projections (gate_proj, up_proj, down_proj) in each Llama block
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

# Apply LoRA to the quantized model
peft_model = get_peft_model(model, lora_config)

# Print trainable parameters to see the reduction
peft_model.print_trainable_parameters()

prepare_model_for_kbit_training(model) is a helper that gets the quantized model ready for training: it freezes the base parameters, casts the remaining non-quantized parameters (such as layer norms) to float32 for stability, and enables gradient checkpointing along with gradients on the input embeddings so that backpropagation can reach the adapters. The LoraConfig defines the LoRA parameters: r (rank) controls the size of the adapter matrices, lora_alpha is a scaling factor (the adapter update is scaled by lora_alpha / r), and lora_dropout adds regularization. The crucial part is target_modules, which tells PEFT which layers of the original model should have LoRA adapters added. For transformer models, this typically includes the attention projection matrices (q_proj, k_proj, v_proj, o_proj) and sometimes the feed-forward network layers (gate_proj, up_proj, down_proj). peft_model.print_trainable_parameters() will show you that only a tiny fraction of the total parameters are trainable, typically less than 1%.
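You can sanity-check that "less than 1%" figure by hand. The sketch below counts the LoRA parameters for the configuration above, assuming the published Llama 2 7B shapes (hidden size 4096, MLP size 11008, 32 layers) and a base parameter count of about 6.74 billion. Those numbers are assumptions taken from the model's config, not something the code reads from the loaded model.

```python
# Rough count of LoRA parameters for r=16 on all seven projection layers.
# Assumption: Llama 2 7B shapes (hidden 4096, MLP 11008, 32 layers).

R = 16
HIDDEN, INTERMEDIATE, NUM_LAYERS = 4096, 11008, 32

# (in_features, out_features) for each targeted linear layer in one block
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """LoRA adds A with shape (r, in_features) and B with shape (out_features, r)."""
    return r * in_features + out_features * r


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total_lora = per_layer * NUM_LAYERS
BASE_PARAMS = 6_738_415_616  # assumed base parameter count for Llama 2 7B

share = 100 * total_lora / (BASE_PARAMS + total_lora)
print(f"LoRA parameters: {total_lora:,}")
print(f"Share of all parameters: {share:.2f}%")
```

Under these assumptions the adapters come to roughly 40 million parameters, or about 0.6% of the total. Doubling r doubles that count.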

Now, you’d prepare your dataset and use a trainer (like trl’s SFTTrainer or Hugging Face’s Trainer) to run the fine-tuning.

from trl import SFTTrainer
from datasets import load_dataset

# Load your dataset
dataset = load_dataset("your_dataset_name", split="train") # Replace with your actual dataset

# Define training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit", # Use a memory-efficient optimizer
    learning_rate=2e-4,
    fp16=False, # Set to False if using bf16
    bf16=True, # Enable bfloat16 training
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    logging_steps=50,
    save_steps=500,
)

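# Initialize the Trainer
# Note: the keyword arguments below follow older trl releases. Newer ones move
# several of them (such as dataset_text_field and max_seq_length) into SFTConfig,
# so check the version you have installed.
trainer = SFTTrainer(
    model=peft_model, # Already wrapped with LoRA above, so no peft_config is passed
    train_dataset=dataset,
    dataset_text_field="text", # The column in your dataset containing the text
    max_seq_length=1024, # Adjust based on your dataset and GPU memory
    tokenizer=tokenizer,
    args=training_args,
    packing=False, # Set to True for more efficient packing if your dataset allows
)

# Start training
trainer.train()

# Save the trained adapter (weights plus adapter_config.json) to output_dir
trainer.save_model()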

The SFTTrainer (Supervised Fine-Tuning Trainer) from trl simplifies this. It takes the PEFT-wrapped model, your dataset, and training arguments. optim="paged_adamw_8bit" selects a bitsandbytes AdamW variant that keeps the optimizer state in 8 bits and can page it out to CPU memory during GPU memory spikes. bf16=True enables training in bfloat16, which is often preferred over float16 for stability on modern GPUs, provided your hardware supports it.
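The dataset_text_field="text" setting assumes each record already has a single text column. Support tickets rarely arrive that way, so here is one possible way to build that column. The "question" and "answer" field names and the prompt template are hypothetical, so adjust them to your own schema.

```python
# Hypothetical ticket schema: each record has "question" and "answer" fields.

def format_ticket(example):
    """Join one support ticket into a single prompt/response string."""
    text = (
        "### Question:\n"
        f"{example['question'].strip()}\n\n"
        "### Answer:\n"
        f"{example['answer'].strip()}"
    )
    return {"text": text}


sample = {
    "question": "  How do I reset my password? ",
    "answer": "Open Settings, choose Security, then select Reset password.",
}
print(format_ticket(sample)["text"])

# With a Hugging Face dataset this would typically be applied as:
# dataset = dataset.map(format_ticket)
```

Whatever template you choose, use the same one when prompting the model at inference time.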

The core idea of QLoRA is that instead of fine-tuning all billions of parameters of a large model, you freeze the original 4-bit quantized weights and train only small, low-rank adapter matrices. The output of each adapter is added to the output of the frozen layer it wraps, both during training and at inference. This drastically reduces memory requirements because gradients and optimizer state are needed only for the adapter weights, not for the entire model. The 4-bit quantization itself reduces the memory footprint of the base model by roughly 4x compared to FP16 or BF16, and the LoRA adapters are small (tens to a few hundred megabytes, depending on rank and target modules) compared to the gigabytes of the base model.
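The adapter arithmetic is simple enough to write out by hand. The following is a minimal sketch on plain Python lists, with toy shapes and values chosen purely for illustration. It computes y = W x + (alpha / r) * B (A x) and checks that folding the adapter into W gives the same result.

```python
# Minimal LoRA layer on plain Python lists (illustration only, no framework).

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def merge(W, A, B, alpha, r):
    """Fold the adapter into the weight: W' = W + (alpha / r) * B A."""
    BA = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


r, alpha = 1, 2
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight, shape 2 x 3
A = [[0.5, -0.5, 1.0]]                    # trainable, shape r x 3
B = [[0.0], [0.0]]                        # trainable, shape 2 x r, starts at zero
x = [1.0, 2.0, 3.0]

y_initial = lora_forward(W, A, B, x, alpha, r)  # equals W x while B is zero
print(y_initial)

B = [[0.1], [-0.2]]                       # pretend training has updated B
y_adapter = lora_forward(W, A, B, x, alpha, r)
y_merged = matvec(merge(W, A, B, alpha, r), x)
print(y_adapter, y_merged)
```

Because B starts at zero, the adapted model initially behaves exactly like the base model, and training moves it away from there. In QLoRA, W is additionally stored in 4 bits and dequantized on the fly, which this sketch leaves out.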

The most surprising thing about QLoRA is how little memory is required to fine-tune models that would otherwise need multiple A100s. The QLoRA paper reports fine-tuning a 65B parameter model on a single 48GB GPU, and models around 33B parameters on a 24GB consumer GPU. This is achieved by combining 4-bit NormalFloat quantization, double quantization, and paged optimizers with the low-rank adaptation technique.

The entire process boils down to:

  1. Quantize: Load the base LLM into 4-bit precision.
  2. Adapt: Add small, trainable LoRA adapter layers to specific parts of the quantized model.
  3. Train: Fine-tune only these adapter layers, keeping the quantized base model frozen.
  4. Merge (Optional): For inference, you can merge the trained adapter weights back into the base model to get a single, fine-tuned model file. This is usually done against a 16-bit copy of the base model, since merging into 4-bit weights adds rounding error. Alternatively, you can load the base model and apply the adapter weights dynamically.

The real power comes from the combination of these techniques. The 4-bit quantization drastically reduces the base model’s memory footprint. LoRA drastically reduces the number of trainable parameters. Together, they make fine-tuning accessible.
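To put a number on the memory savings, here is a back-of-the-envelope estimate of weight memory for a 7B model. It assumes the block sizes described in the QLoRA paper (64 weights per quantization constant, and 256 constants per second-level constant) and that every parameter is quantized. In practice, embeddings and norms stay in higher precision, so real usage will be somewhat higher.

```python
# Back-of-the-envelope weight memory for a 7B model (weights only).
# Assumptions: 64-weight blocks with one fp32 constant each; double
# quantization stores those constants in 8 bits, with one fp32 constant
# per 256 of them.

NUM_PARAMS = 7e9
GIB = 1024 ** 3


def weight_gib(bits_per_param):
    """Convert an average bits-per-parameter figure into GiB of weights."""
    return NUM_PARAMS * bits_per_param / 8 / GIB


fp16_bits = 16
nf4_bits = 4 + 32 / 64                      # 4.5 bits per weight
nf4_dq_bits = 4 + 8 / 64 + 32 / (64 * 256)  # about 4.127 bits per weight

rows = [
    ("fp16/bf16", fp16_bits),
    ("nf4", nf4_bits),
    ("nf4 + double quant", nf4_dq_bits),
]
for name, bits in rows:
    print(f"{name:>20}: {bits:6.3f} bits/param, {weight_gib(bits):5.2f} GiB")
```

Once the quantization constants are counted, the reduction relative to 16-bit weights comes out a little under 4x. Activations, adapter gradients, and optimizer state come on top of this.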

When you eventually want to use your fine-tuned model for inference, you’ll typically load the base 4-bit model and then load your trained PEFT adapters on top.

from peft import PeftModel

# Load the base model again (or ensure it's loaded)
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

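# Load the PEFT adapter on top of the base model.
# The path must be the directory holding adapter_config.json and the adapter
# weights (for example, one written by trainer.save_model() or a checkpoint
# subdirectory), not an individual weights file.
ft_model = PeftModel.from_pretrained(base_model, "./results")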

# Now ft_model is ready for inference

This allows you to apply the fine-tuned knowledge without needing to store a full copy of the fine-tuned model. The next step after successfully fine-tuning is often optimizing the model for faster inference, which might involve merging the adapters into the base weights, further quantization, or model compilation.
