PEFT and LoRA allow you to fine-tune massive language models on consumer-grade hardware by only training a tiny fraction of the model’s parameters.

The example below fine-tunes GPT-2 small (124M parameters) on a few lines of Shakespeare. With only ten lines of training text the output is far from perfect, but it starts to pick up the style.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load a small model and tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

# Sample data (replace with your actual dataset)
texts = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune,",
    "Or to take arms against a sea of troubles,",
    "And by opposing end them: to die: to sleep;",
    "No more; and by a sleep to say we end",
    "The heart-ache and the thousand natural shocks",
    "That flesh is heir to, 'tis a consummation",
    "Devoutly to be wish'd. To die, to sleep;",
    "To sleep: perchance to dream: ay, there's the rub;",
]
# Tokenize the data
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=128)

# LoRA Configuration
lora_config = LoraConfig(
    r=8,  # Rank of the update matrices
    lora_alpha=16,  # Alpha scaling factor
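    target_modules=["c_attn"],  # GPT-2 fuses the query/key/value projections into one layer named c_attn
    fan_in_fan_out=True,  # GPT-2 uses Conv1D layers, which store weights as (fan_in, fan_out)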
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Wrap the model with PEFT
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

# Training Arguments
training_args = TrainingArguments(
    output_dir="./peft_lora_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
    save_steps=50,
    fp16=torch.cuda.is_available(),  # Use mixed precision only when a CUDA GPU is present
    report_to="none", # Disable reporting for simplicity
)

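# Build a dataset of dicts; Trainer needs "labels" to compute the causal LM loss
labels = inputs["input_ids"].clone()
labels[inputs["attention_mask"] == 0] = -100  # Ignore padding positions in the loss
train_dataset = [
    {"input_ids": ids, "attention_mask": mask, "labels": lab}
    for ids, mask, lab in zip(inputs["input_ids"], inputs["attention_mask"], labels)
]

# Trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
)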

# Train the model
trainer.train()

# Save the LoRA adapters
peft_model.save_pretrained("./lora_adapters")

# --- Inference Example ---
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(model_name)
# Load the LoRA adapters
from peft import PeftModel
inference_model = PeftModel.from_pretrained(base_model, "./lora_adapters")

# Merge adapters (optional, can improve inference speed)
# inference_model = inference_model.merge_and_unload()

# Generate text
prompt = "To be, or not to be"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move model and input to GPU if available
if torch.cuda.is_available():
    inference_model = inference_model.to("cuda")
    input_ids = input_ids.to("cuda")

with torch.no_grad():
    generation_output = inference_model.generate(
        input_ids=input_ids,
        max_length=50,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

print("Generated Text:")
print(tokenizer.decode(generation_output[0], skip_special_tokens=True))

The core problem PEFT and LoRA solve is the cost of fine-tuning full-sized LLMs. Take a 70-billion-parameter model. Training it from scratch takes a large GPU cluster running for weeks. Even fine-tuning it on a new task normally means updating all 70 billion parameters, which requires storing gradients and optimizer state for every one of them and is still prohibitively expensive for most teams. PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques designed to drastically reduce this cost. LoRA (Low-Rank Adaptation) is a specific, highly effective PEFT method.
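To put rough numbers on that cost, here is a back-of-the-envelope sketch. It assumes mixed-precision training with Adam, for which a commonly quoted rule of thumb is about 16 bytes per trainable parameter (fp16 weights and gradients, plus fp32 master weights and two optimizer moments). Activations are ignored, and the function name is made up for illustration.

```python
def training_memory_gb(n_params, bytes_per_param=16):
    """Rough memory for weights, gradients, and Adam states (activations excluded)."""
    return n_params * bytes_per_param / 1e9

full_finetune = training_memory_gb(70e9)                    # everything trainable
weights_only = training_memory_gb(70e9, bytes_per_param=2)  # fp16 weights alone

print(f"Full fine-tuning state for 70B parameters: ~{full_finetune:,.0f} GB")
print(f"Frozen fp16 weights alone:                 ~{weights_only:,.0f} GB")
```

Under these assumptions, what a parameter-efficient method shrinks is the gradient and optimizer state. The frozen weights still have to fit in memory.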

LoRA works by freezing the original weights of the pre-trained LLM and injecting small, trainable "adapter" matrices into specific layers. Crucially, these adapter matrices are low-rank. If a weight matrix $W$ has dimensions $d \times k$, LoRA factors the update $\Delta W$ into two smaller matrices, $B$ (dimensions $d \times r$) and $A$ (dimensions $r \times k$), where the rank $r$ is much smaller than $d$ and $k$. The adapted weight is $W' = W + BA$, so the product $BA$ has the same $d \times k$ shape as $W$. Since $r$ is small, the number of trainable parameters ($d \times r + r \times k$) is far smaller than the original $d \times k$. For example, if $d = k = 4096$ and $r = 8$, the original matrix has $4096^2 \approx 16.8$ million parameters, while the two LoRA factors together have $4096 \times 8 + 8 \times 4096 = 65{,}536$ parameters. This is a reduction of over 99%.
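The parameter arithmetic above can be checked with a few lines of Python. This is only a counting sketch: it assumes one low-rank factor of shape d x r and one of shape r x k, and the helper name is illustrative.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora, fraction_saved) for a d x k weight matrix at rank r."""
    full = d * k            # parameters in a dense update of W
    lora = d * r + r * k    # parameters in the two low-rank factors
    return full, lora, 1 - lora / full

for r in (4, 8, 64):
    full, lora, saved = lora_param_counts(4096, 4096, r)
    print(f"r={r:>2}: {lora:>9,} of {full:,} parameters ({saved:.2%} fewer)")
```

The trainable count grows linearly with r, so even a rank of 64 stays far below the size of the dense matrix.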

In Hugging Face's peft library, applying LoRA is straightforward. You first define a LoraConfig object, specifying parameters like r (the rank), lora_alpha (a scaling factor: the update is multiplied by lora_alpha / r, and lora_alpha is often set to $2r$), target_modules (which layers to inject adapters into; GPT-2 fuses the query, key, and value projections of self-attention into a single layer named c_attn, so that is the usual choice), lora_dropout, bias (whether to train bias terms), and task_type. Then, you wrap your pre-trained model using get_peft_model(model, lora_config). This function replaces the specified layers with LoRA-adapted versions, freezes the original weights, and returns the wrapped model (the base model object is modified in place). The print_trainable_parameters() method will then show you that only a tiny percentage of parameters are trainable.
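How r and lora_alpha interact can be sketched in plain Python. The sketch assumes the default scaling of lora_alpha / r and the standard LoRA initialization, in which the up-projection starts at zero so the adapter is initially a no-op. The helper names and the toy weights are made up for illustration, and real implementations use batched tensor operations.

```python
def matvec(m, x):
    """Multiply matrix m (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def lora_forward(W, down, up, x, r, lora_alpha):
    """Frozen base output plus the scaled low-rank update: W x + (alpha / r) * up (down x)."""
    scale = lora_alpha / r
    base = matvec(W, x)
    delta = matvec(up, matvec(down, x))
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pre-trained weight (out x in)
down = [[0.5, -0.5]]           # r x in, randomly initialized in practice
x = [1.0, 2.0]

# The up-projection starts at zero, so the wrapped layer matches the base layer.
up_init = [[0.0], [0.0]]       # out x r
print(lora_forward(W, down, up_init, x, r=1, lora_alpha=2))

# Once training moves the up-projection away from zero, the update is scaled by alpha / r.
up_trained = [[1.0], [0.0]]
print(lora_forward(W, down, up_trained, x, r=1, lora_alpha=2))
```

Because the scale is lora_alpha / r, raising r without raising lora_alpha shrinks the weight given to the update, which is one reason the two are often tuned together.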

When training, you use the standard Hugging Face Trainer as usual. The peft_model is passed to the Trainer, and only the LoRA adapter weights are updated during backpropagation. After training, you save the adapters using save_pretrained(), which writes only the adapter weights and config rather than a full copy of the model. To use the fine-tuned model, you load the original base model and then load the saved adapters on top using PeftModel.from_pretrained(base_model, adapter_path). At this point the adapters remain separate modules whose output is added to that of the frozen layers on every forward pass. For inference, you can optionally call merge_and_unload() on the PeftModel to fold the adapter weights into the base weights and get back a plain model, which removes the extra adapter computation and can make inference slightly faster.
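Why merging does not change the model's output can be seen in a small numeric sketch. Folding the scaled low-rank product into the base weight once gives the same result as adding the adapter's contribution on every forward pass. Plain Python lists stand in for tensors, and all names and numbers are illustrative rather than taken from peft.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge(W, up, down, scale):
    """Return W + scale * (up @ down), the weight a merged layer would store."""
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (out x in)
up = [[1.0], [2.0]]            # out x r
down = [[0.5, -0.5]]           # r x in
scale = 2.0                    # plays the role of lora_alpha / r
x = [[1.0], [2.0]]             # input as a column vector

# Unmerged: base path plus a separate adapter path on every call.
adapter = matmul(up, matmul(down, x))
unmerged = [[b[0] + scale * a[0]] for b, a in zip(matmul(W, x), adapter)]

# Merged: a single matmul with the fused weight.
merged = matmul(merge(W, up, down, scale), x)

print(unmerged, merged)
```

The toy values here are exactly representable, so the two results match exactly. With real floating-point tensors the merged and unmerged outputs can differ by small rounding errors.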

The most surprising thing about LoRA is that despite only training a minuscule fraction of parameters, it often achieves performance comparable to full fine-tuning, especially for tasks that are not drastically different from the pre-training objective. This is because the pre-trained LLM already possesses immense general knowledge, and fine-tuning primarily involves adapting this knowledge to a specific style or domain. The low-rank adapters are surprisingly adept at capturing these necessary adjustments without needing to modify the entire model.

The target_modules parameter is critical. For different model architectures (like Llama, Mistral, BERT), the names of the linear layers within the attention mechanisms or feed-forward networks will differ. You need to inspect the model’s architecture (e.g., by printing model) to identify the correct layer names that correspond to query, key, value, output projections, or feed-forward layers where you want to inject LoRA adapters. Commonly targeted modules are q_proj, k_proj, v_proj, o_proj (for attention) and gate_proj, up_proj, down_proj (for feed-forward networks) in more recent architectures.
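One way to shortlist candidate names is to walk model.named_modules() and count the leaf names of the linear-style layers. The sketch below is written against plain (name, module) pairs so it runs without downloading a model. The helper name is hypothetical, the Conv1D and Dropout classes are empty stand-ins for the real layer classes, and the module paths mimic a single GPT-2 block.

```python
from collections import Counter

def candidate_target_modules(named_modules, layer_types=("Linear", "Conv1D")):
    """Count leaf names of modules whose class name is in layer_types."""
    counts = Counter()
    for name, module in named_modules:
        if type(module).__name__ in layer_types:
            counts[name.rsplit(".", 1)[-1]] += 1
    return counts

# Empty stand-ins so the example runs without torch or transformers.
class Conv1D: ...
class Dropout: ...

demo_modules = [
    ("transformer.h.0.attn.c_attn", Conv1D()),
    ("transformer.h.0.attn.c_proj", Conv1D()),
    ("transformer.h.0.attn.attn_dropout", Dropout()),
    ("transformer.h.0.mlp.c_fc", Conv1D()),
    ("transformer.h.0.mlp.c_proj", Conv1D()),
]
print(candidate_target_modules(demo_modules))

# With a real model loaded: candidate_target_modules(model.named_modules())
```

Leaf names that repeat once per transformer block are the usual candidates for target_modules. Names that appear only once, such as an output head, are typically left alone.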

The next step after mastering LoRA is exploring other PEFT methods like QLoRA, which combines LoRA with 4-bit quantization to further reduce memory requirements, or Prompt Tuning and P-Tuning, which involve training only a small set of "soft prompt" embeddings prepended to the input.
