Vision Transformers (ViTs) can classify images surprisingly well even when fine-tuned on datasets far smaller than those usually needed to train deep image models from scratch.

Let’s see how this plays out with a practical example. Imagine we have a small dataset of cat and dog images. We want to fine-tune a pre-trained ViT, like google/vit-base-patch16-224, to distinguish between them.

from transformers import ViTImageProcessor, ViTForImageClassification
from datasets import load_dataset

# Load a pre-trained ViT model and its image processor
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor; same interface)
model_name = "google/vit-base-patch16-224"
image_processor = ViTImageProcessor.from_pretrained(model_name)
# The checkpoint ships with a 1,000-class ImageNet head, so request a fresh
# 2-class head and let the mismatched head weights be re-initialized.
model = ViTForImageClassification.from_pretrained(
    model_name,
    num_labels=2,
    ignore_mismatched_sizes=True,
)

# Load a small dataset (e.g., a subset of 'cats_vs_dogs').
# Assumption: the Hub copy ("microsoft/cats_vs_dogs") exposes an `image` column
# of PIL images and an integer `labels` column; adjust the names if yours differ.
dataset = load_dataset("microsoft/cats_vs_dogs", split="train")
# Shuffle before subsetting so the sample is not drawn from a single class
dataset = dataset.shuffle(seed=42).select(range(100))  # 100 images for demonstration

# Preprocess the dataset
def preprocess_data(examples):
    # Some files may be grayscale or CMYK; the model expects 3-channel RGB
    images = [image.convert("RGB") for image in examples["image"]]
    pixel_values = image_processor(images=images, return_tensors="pt").pixel_values
    # The existing `labels` column is left untouched
    return {"pixel_values": pixel_values}

processed_dataset = dataset.map(preprocess_data, batched=True, remove_columns=["image"])
processed_dataset.set_format("torch")

# Now, `processed_dataset` contains tensors ready for training.
# You would typically split this into train/validation sets and then
# use a Trainer object from Hugging Face to handle the fine-tuning loop.
# For brevity, we'll skip the full training loop here.

The core idea behind ViT’s effectiveness on smaller datasets is transfer learning. A ViT trained on a massive dataset like ImageNet has already learned powerful, general-purpose visual features. When you fine-tune it, you’re not teaching it to see from scratch; you’re adapting its existing knowledge to your specific task. The model adjusts its weights, particularly in the later layers, to focus on the discriminative features relevant to your cat vs. dog classification.

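The ViTForImageClassification class in Hugging Face adds a classification head (a single linear layer) on top of the pre-trained ViT backbone. Note that google/vit-base-patch16-224 already carries a 1,000-class ImageNet head, so for a two-class task you load it with num_labels=2 and ignore_mismatched_sizes=True, which replaces that head with a randomly initialized one. During fine-tuning, the new head is trained from scratch, while the backbone’s pre-trained weights are further adjusted via backpropagation.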

The ViTFeatureExtractor (renamed ViTImageProcessor in newer transformers releases, where the old name is deprecated) is crucial. It handles the necessary preprocessing steps: resizing images to the expected input dimensions (e.g., 224x224), rescaling and normalizing pixel values, and converting them into the tensor format the model expects. This keeps your inputs consistent with the preprocessing the pre-trained model was originally trained with.
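To make the preprocessing arithmetic concrete, here is a minimal pure-Python sketch of the per-pixel normalization and of what the "patch16-224" part of the checkpoint name implies. The mean and standard deviation of 0.5 per channel are an assumption about this checkpoint's preprocessor config (you can check the loaded processor's image_mean and image_std attributes), and the helper names are hypothetical, not part of the transformers API.

```python
# Illustrative constants; IMAGE_MEAN and IMAGE_STD are assumed values,
# verify them against the preprocessor config of the checkpoint you load.
IMAGE_SIZE = 224   # input height and width in pixels
PATCH_SIZE = 16    # side length of each square patch
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5


def normalize_pixel(value_0_to_255):
    """Rescale an 8-bit channel value to [0, 1], then standardize it."""
    scaled = value_0_to_255 / 255.0
    return (scaled - IMAGE_MEAN) / IMAGE_STD


def num_patches(image_size=IMAGE_SIZE, patch_size=PATCH_SIZE):
    """Number of non-overlapping patches the image is cut into."""
    per_side = image_size // patch_size
    return per_side * per_side


print(normalize_pixel(0), normalize_pixel(255))  # darkest and brightest values
print(num_patches())                             # patches per 224x224 image
```

With these assumed statistics, pixel values end up in the range [-1, 1], and a 224x224 image is split into a 14x14 grid of patches. Feeding the model images preprocessed differently (for example, left in the 0 to 255 range) would not match what it saw during pre-training.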

The key levers you control in this process are:

  • Model Choice: Selecting a pre-trained ViT variant (base, large, etc.) and its specific checkpoint (e.g., google/vit-base-patch16-224). Larger models might offer better performance but require more resources.
  • Dataset Size and Quality: Even with transfer learning, more high-quality, relevant data generally leads to better fine-tuning results.
  • Hyperparameters: Learning rate, batch size, number of epochs, and optimizer choice significantly impact how well the model adapts. A smaller learning rate is often preferred for fine-tuning to avoid catastrophic forgetting of the pre-trained features.
  • Classification Head: You can customize the classification head if needed, though ViTForImageClassification provides a sensible default.

A subtle but important aspect of ViT fine-tuning is the learning rate. Because the model has already learned so much, you typically want to use a much smaller learning rate for fine-tuning than you would for training from scratch. This prevents the model from drastically altering its well-learned weights too quickly, which could lead to it "forgetting" its general visual understanding. A common starting point for fine-tuning is a learning rate around 1e-5 or 5e-5, whereas training from scratch might start at 1e-3.
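A toy calculation can illustrate why the smaller rate is gentler on pre-trained weights. The sketch below applies a single plain gradient-descent step at each of the two learning rates mentioned above and compares how far the weights move. The weight and gradient values are made up for illustration, and plain gradient descent is a simplifying assumption (in practice an adaptive optimizer such as AdamW is common).

```python
import math


def sgd_step(weights, grads, lr):
    """Return the weights after one plain gradient-descent step."""
    return [w - lr * g for w, g in zip(weights, grads)]


def relative_change(old, new):
    """L2 norm of the update divided by the L2 norm of the original weights."""
    update_norm = math.sqrt(sum((n - o) ** 2 for o, n in zip(old, new)))
    weight_norm = math.sqrt(sum(o ** 2 for o in old))
    return update_norm / weight_norm


# Hypothetical values standing in for a few pre-trained weights and gradients
pretrained_weights = [0.8, -1.2, 0.5, 0.3]
gradients = [0.4, -0.1, 0.9, -0.6]

change_fine_tune = relative_change(
    pretrained_weights, sgd_step(pretrained_weights, gradients, lr=5e-5)
)
change_scratch = relative_change(
    pretrained_weights, sgd_step(pretrained_weights, gradients, lr=1e-3)
)

print(f"relative change at lr=5e-5: {change_fine_tune:.2e}")
print(f"relative change at lr=1e-3: {change_scratch:.2e}")
print(f"ratio: {change_scratch / change_fine_tune:.1f}")
```

For the same gradient, the step taken at 1e-3 is 20 times larger than the step at 5e-5, so over many steps the larger rate can push the weights much further from their pre-trained values.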

Once you’ve fine-tuned your ViT, the next logical step is to evaluate its performance on a held-out test set using metrics like accuracy, precision, recall, and F1-score.
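As a minimal sketch of that evaluation step, the four metrics can be computed directly from true and predicted labels. The function name and the example labels below are hypothetical; the sketch assumes a binary task where label 1 (say, "dog") is treated as the positive class, and in practice you might use a library such as scikit-learn instead.

```python
def binary_classification_metrics(y_true, y_pred, positive_label=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    tn = sum(1 for t, p in pairs if t != positive_label and p != positive_label)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Fall back to 0.0 when a denominator is zero (no positive predictions or labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for eight test images: 0 = cat, 1 = dog
true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = binary_classification_metrics(true_labels, predicted_labels)
print(metrics)
```

Reporting precision and recall alongside accuracy matters most when the classes are imbalanced, since accuracy alone can look good even if one class is rarely predicted correctly.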
