The surprising truth about fine-tuning a large language model is that you’re not teaching it a new language; you’re teaching it how to perform a specific task in a language it already understands.
Imagine you have a powerful, general-purpose language model like BERT or RoBERTa. It knows grammar, vocabulary, and a vast amount of world knowledge. Fine-tuning is like giving it a specialized instruction manual for one particular job: in this case, question answering. We’re not retraining its core language understanding; we’re showing it how to map its existing knowledge onto the task of answering questions from a given context.
Let’s see this in action. We’ll use the datasets library to load a question answering dataset, like SQuAD (Stanford Question Answering Dataset).
from datasets import load_dataset
squad_dataset = load_dataset("squad", split="train")
print(squad_dataset[0])
This outputs something like:
{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a bronze statue of Christ the Teacher. In addition, known as the "Mother of all University Chapels", the Sacred Heart Church is located on the south side of the quad. Immediately to the north of the Main Building is the LaFortune Student Center, and to the south of the Main Building is the Hesburgh Library. The school also has an outdoor amphitheater, the Washington Hall, which is located behind the Main Building and to the west of the Hesburgh Library.', 'question': 'To whom did the Virgin Mary statue in the Main Building\'s gold dome allude?', 'answers': {'text': ['the Virgin Mary'], 'answer_start': [111]}}
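Here, you have a context (a passage of text), a question about that context, and the answers. Each answer is stored as its text plus answer_start, the character offset at which that text begins inside the context. Note that this is a character index, not a token index.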
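A quick sanity check on this format is to confirm that slicing the context at answer_start really yields the answer text. The sketch below uses a small hypothetical record with the same fields as a SQuAD example (the record and the helper name answer_matches_offset are made up for illustration, not part of the datasets library).

```python
# Hypothetical record with the same fields as a SQuAD example.
example = {
    "context": "Atop the Main Building's gold dome is a golden statue of the Virgin Mary.",
    "question": "What sits atop the Main Building's gold dome?",
    "answers": {"text": ["a golden statue of the Virgin Mary"], "answer_start": [38]},
}


def answer_matches_offset(record):
    """Return True if every answer text is found at its stated character offset."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(answer_matches_offset(example))
```

Running a check like this over a whole dataset before training is a cheap way to catch records whose offsets have drifted, for example after text cleaning.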
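To fine-tune, we need a model capable of extractive question answering. Hugging Face’s transformers library provides AutoModelForQuestionAnswering, which puts a span-prediction head on top of a pre-trained encoder. The checkpoint below is a BERT model that has already been fine-tuned on SQuAD, which makes it convenient for checking that the pipeline works end to end. To run the fine-tuning yourself, substitute a base checkpoint such as bert-base-uncased; its question-answering head then starts out randomly initialized and is learned during training.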
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
The core of fine-tuning involves preparing your data so the model can consume it and then training on it. For question answering, this means tokenizing the question and context together as one input sequence and, crucially, identifying the start and end tokens of the answer within that sequence. The model does not score every possible span directly. Instead, it learns to predict two distributions over token positions, one for where the answer starts and one for where it ends.
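As a rough illustration of that input layout, the sketch below packs a question and a context the way BERT-style models expect: [CLS] question [SEP] context [SEP]. It uses whitespace-split words as a stand-in for real subword tokens, and pack_pair is a hypothetical helper, not a transformers function. The point to notice is that the answer's token positions are shifted by the question and the special tokens in front of the context.

```python
def pack_pair(question_tokens, context_tokens):
    """Build a BERT-style input: [CLS] question [SEP] context [SEP]."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment ids: 0 for [CLS], the question, and the first [SEP]; 1 for the rest.
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    # Index of the first context token inside the packed sequence.
    context_offset = len(question_tokens) + 2
    return tokens, segment_ids, context_offset


# Whitespace "tokens" stand in for real subword tokens (an assumption).
question = "who wrote hamlet ?".split()
context = "hamlet was written by william shakespeare .".split()
tokens, segment_ids, context_offset = pack_pair(question, context)

# The answer covers context tokens 4..5 ("william shakespeare"), so its
# positions in the packed sequence are shifted by context_offset.
start_position = context_offset + 4
end_position = context_offset + 5
print(tokens[start_position:end_position + 1])
```

These shifted start_position and end_position values are the labels the model is trained against.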
Our goal is to train the model to output two sets of logits: one for the start of the answer span and one for the end. The loss function (typically cross-entropy, computed separately for the start and the end and then averaged) compares these predicted logits to the actual start and end token positions of the correct answer in the training data.
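To make the loss concrete, here is a minimal pure-Python sketch of that computation for a single example. It assumes the start and end losses are simply averaged; the function names are illustrative, and a real training loop would use the framework's batched tensor operations instead.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of target_index under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def span_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start and end cross-entropy losses (assumed weighting)."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return 0.5 * (start_loss + end_loss)


# Made-up logits for a 4-token sequence whose answer spans tokens 1..2.
confident = span_loss([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], 1, 2)
mistaken = span_loss([5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0], 1, 2)
print(confident < mistaken)
```

The loss is small when the logits peak at the true positions and grows as probability mass moves elsewhere, which is the signal that drives the weight updates.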
The Trainer API in transformers simplifies this. You’ll define a data collator to handle batching and padding, and then configure the TrainingArguments.
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
# (Assuming you've preprocessed your dataset to align with model input format)
# For example, you'd tokenize contexts and questions, and map answer spans to token indices.
# Placeholder for actual training setup
# training_args = TrainingArguments(...)
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized_train_dataset,
#     eval_dataset=tokenized_eval_dataset,
#     data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
# )
# trainer.train()
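To show what the collator step amounts to, here is a simplified pure-Python sketch of dynamic padding. It is not the DataCollatorWithPadding implementation: pad_batch, the pad id of 0, and the token ids are assumptions chosen for illustration. Each batch is padded on the right to its own longest example, so the start and end labels stay valid.

```python
def pad_batch(features, pad_token_id=0):
    """Right-pad each example's input_ids to the longest length in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "start_positions": [], "end_positions": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels are token indices; right-padding leaves them unchanged.
        batch["start_positions"].append(f["start_positions"])
        batch["end_positions"].append(f["end_positions"])
    return batch


# Made-up token ids for two examples of different lengths.
features = [
    {"input_ids": [5, 6, 7], "start_positions": 1, "end_positions": 2},
    {"input_ids": [5, 6, 7, 8, 9], "start_positions": 2, "end_positions": 4},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short batches cheap, which is the main reason to pad in the collator instead of at tokenization time.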
The model learns by adjusting its internal weights. It’s not learning new words, but rather learning to recognize patterns that signal the boundaries of an answer. For instance, it might learn that phrases following "The answer is" or specific grammatical structures often contain the answer.
The magic happens in how the model predicts the answer span. It outputs two vectors of scores (logits), each with one entry per token in the input sequence. The first vector scores each token as the start of the answer, and the second scores each token as the end. These are unnormalized scores; applying a softmax to each vector turns them into probabilities. We then find the token pair (i, j) where i <= j that maximizes start_logits[i] + end_logits[j]. This pair corresponds to the predicted answer span.
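A common pitfall is failing to map the character-based answer_start indices from datasets like SQuAD to token-based start and end positions after tokenization. A character offset and a token index are different units, and subword tokenizers may split a single word into several tokens, so the two rarely line up by accident. The fast tokenizers in transformers can return each token's character range when called with return_offsets_mapping=True; use those ranges to find the tokens that cover the answer, and remember that the question and special tokens placed before the context shift every position.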
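The alignment itself can be sketched in a few lines. The example below assumes you already have one (start, end) character range per context token, which is the kind of information an offset mapping provides; a whitespace splitter stands in for a real subword tokenizer, and both helper names are hypothetical.

```python
import re


def whitespace_offsets(text):
    """Stand-in tokenizer: (start, end) character offsets of each whitespace token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, answer_start, answer_text):
    """Map a character-based answer to the first and last token that overlap it."""
    answer_end = answer_start + len(answer_text)  # exclusive character index
    token_start = token_end = None
    for idx, (tok_start, tok_end) in enumerate(offsets):
        # A token belongs to the answer if its range overlaps the answer's range.
        if tok_start < answer_end and tok_end > answer_start:
            if token_start is None:
                token_start = idx
            token_end = idx
    return token_start, token_end


context = "The library opened in 1963 on campus."
offsets = whitespace_offsets(context)
print(char_span_to_token_span(offsets, 22, "1963"))    # (4, 4)
print(char_span_to_token_span(offsets, 30, "campus"))  # (6, 6)
```

The second call shows the overlap rule handling an answer that covers only part of a token ("campus" inside "campus."). In a full pipeline, the returned indices would still need to be shifted by the number of tokens that precede the context.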
After fine-tuning, you can use the model to answer new questions. You provide the context and question, tokenize them, and pass them to the model. The model outputs the start and end logits, from which you select the highest-scoring valid span. Finally, you map that span's token indices back to character offsets and slice the original context to recover the answer text.
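The last step, going from logits back to text, might look like the sketch below. It makes several simplifying assumptions: the logits are made up and cover only the context tokens, whitespace splitting stands in for the real tokenizer, and extract_answer is a hypothetical helper rather than a transformers function.

```python
import math
import re


def softmax(logits):
    """Convert logits to probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extract_answer(context, offsets, start_logits, end_logits):
    """Pick the most probable valid span and slice it out of the context."""
    start_probs, end_probs = softmax(start_logits), softmax(end_logits)
    candidates = (
        (start_probs[i] * end_probs[j], i, j)
        for i in range(len(offsets))
        for j in range(i, len(offsets))
    )
    prob, i, j = max(candidates)
    # Character range: start of the first token to end of the last token.
    return context[offsets[i][0]:offsets[j][1]], prob


context = "The tower was completed in March 1889 ."
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]
start_logits = [0.0, 0.0, 0.0, 0.0, 0.5, 4.0, 1.0, 0.0]  # peaks at "March"
end_logits = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 4.0, 0.5]    # peaks at "1889"
answer, prob = extract_answer(context, offsets, start_logits, end_logits)
print(answer, round(prob, 3))
```

Slicing the original context by character offsets, rather than joining decoded tokens, preserves the source text's exact casing and spacing.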
The next step is often to explore different strategies for selecting the answer span to improve robustness, or to investigate how to use a smaller model effectively for question answering.