Hugging Face’s Trainer is surprisingly flexible about how it batches data: the step that turns individual samples into a batch is a plain callable that you can swap out.

Let’s see this in action with a simple example. Imagine your dataset yields tokenized sequences of different lengths, one sample at a time. The default collator just stacks the samples into tensors, which is fine when every sample already has the same shape, but what if you need something more specific?

Here’s a dataset that yields dictionaries, each with a single input_ids list:

from datasets import Dataset
from torch.utils.data import DataLoader

data = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
dataset = Dataset.from_dict(data)

# Default collator in action
from transformers import default_data_collator
dataloader_default = DataLoader(dataset, batch_size=2, collate_fn=default_data_collator)

for batch in dataloader_default:
    print(batch)
    break

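Output:

ValueError: expected sequence of length 3 at dim 1 (got 2)

Instead of a batch, you get an error. default_data_collator does not pad: it only converts each field to a tensor and stacks the samples, so every sequence in a batch must already have the same length. To pad at all, whether to the longest sequence in the batch or to a fixed length, and to control which token is used for padding, you need a different collator. This is where custom collators shine.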
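A quick way to check how default_data_collator treats sequence lengths is to call it directly on a list of samples, without a DataLoader. The sketch below uses two made-up batches (the names equal and ragged are only illustrative) and assumes the PyTorch backend is installed.

```python
from transformers import default_data_collator

# Two hypothetical batches: one with equal-length samples, one ragged.
equal = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5, 6]}]
ragged = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]

# Equal lengths: the lists are stacked into one (batch, seq_len) tensor.
stacked = default_data_collator(equal)
print(stacked["input_ids"].shape)

# Ragged lengths: no padding is applied, so building the tensor fails.
try:
    default_data_collator(ragged)
    ragged_error = None
except ValueError as err:
    ragged_error = err
print(type(ragged_error).__name__, ragged_error)
```

The first call succeeds because both samples have three tokens; the second raises because the collator never pads.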

To build a custom collator, you need a callable (a function, or an object with a __call__ method) that accepts a list of dictionaries, each dictionary representing one sample from your dataset, and returns a single dictionary of batched tensors. You pass this callable to the Trainer through its data_collator argument.
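As a minimal sketch of that contract, here is a plain function that pads each batch to its longest sequence using only PyTorch. The name pad_collate and the pad_id default of 0 are assumptions made for this example, not part of any library.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_collate(batch, pad_id=0):
    """Collate a list of {"input_ids": [...]} samples into padded tensors."""
    seqs = [torch.tensor(item["input_ids"], dtype=torch.long) for item in batch]
    lengths = torch.tensor([len(s) for s in seqs])
    # Right-pad every sequence to the longest one in this batch.
    input_ids = pad_sequence(seqs, batch_first=True, padding_value=pad_id)
    # 1 for real tokens, 0 for padding positions.
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)
    attention_mask = (positions < lengths.unsqueeze(1)).long()
    return {"input_ids": input_ids, "attention_mask": attention_mask}


samples = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]
batch = pad_collate(samples)
print(batch["input_ids"])
print(batch["attention_mask"])
```

A function like this can be passed as collate_fn to a DataLoader or as data_collator to the Trainer; a class is only needed when the collator has to carry state such as a tokenizer.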

Let’s create a collator that pads to a fixed length of 10, using the tokenizer’s pad token ([PAD], token ID 0 for bert-base-uncased).

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") # Or any tokenizer

class CustomCollator:
    def __init__(self, tokenizer, padding_length):
        self.tokenizer = tokenizer
        self.padding_length = padding_length

    def __call__(self, batch):
        input_ids = [item['input_ids'] for item in batch]
        # Pad every sequence to a fixed length with the tokenizer's pad token
        padded_input_ids = self.tokenizer.pad(
            {"input_ids": input_ids},
            padding='max_length',
            max_length=self.padding_length,
            return_tensors='pt'
        )
        # pad() fills with tokenizer.pad_token_id ([PAD], ID 0 for bert-base-uncased);
        # it has no argument for choosing a different token. For a tokenizer that
        # has no pad token (GPT-2, for example), assign one before building the collator:
        #     tokenizer.pad_token = tokenizer.eos_token
        return padded_input_ids

# Using the custom collator
custom_collator = CustomCollator(tokenizer, padding_length=10)
dataloader_custom = DataLoader(dataset, batch_size=2, collate_fn=custom_collator)

for batch in dataloader_custom:
    print(batch)
    break

Output:

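{'input_ids': tensor([[1, 2, 3, 0, 0, 0, 0, 0, 0, 0],
        [4, 5, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

Here, the input_ids are padded to a length of 10 with the pad token ID 0, and tokenizer.pad has also added an attention_mask that marks the real tokens. The tokenizer.pad method is your best friend here; it handles the logic of extending each sequence and converting the result to tensors. Note that it only pads: a sequence longer than max_length is not truncated, so truncate beforehand if that can happen. It returns a dictionary-like BatchEncoding of tensors, which is what the Trainer expects.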

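The Trainer hands the data_collator argument to the DataLoader it builds internally, as that loader’s collate_fn, so the collator receives a list of samples and returns the batch. If you don’t specify one, the Trainer falls back to default_data_collator, or to DataCollatorWithPadding when you also give it a tokenizer. When you provide a custom callable (like an instance of our CustomCollator class), the Trainer will use that instead.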

This pattern is incredibly powerful. You can use it to:

  • Pad to a specific maximum length: As shown, useful for models with fixed input sizes.
  • Concatenate sequences: For tasks where you need to combine multiple pieces of text into one input.
  • Create attention masks dynamically: Although tokenizer.pad often handles this for you, you might have custom masking needs.
  • Augment data on the fly: Injecting noise or other transformations directly into the batching process.
  • Handle complex data structures: If your dataset yields more than just input_ids (e.g., labels, special tokens, metadata), your collator can assemble them into the correct batched format.
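To make the last item concrete, here is a sketch of a collator for samples that carry per-token labels next to input_ids, as in token classification. The function name collate_with_labels and the field names are assumptions for this example. Padded label positions are filled with -100, the default ignore_index of PyTorch’s cross-entropy loss, so they do not contribute to the loss.

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def collate_with_labels(batch, pad_id=0):
    """Pad input_ids with pad_id and per-token labels with IGNORE_INDEX."""
    max_len = max(len(item["input_ids"]) for item in batch)
    input_ids, attention_mask, labels = [], [], []
    for item in batch:
        n_tokens = len(item["input_ids"])
        n_pad = max_len - n_tokens
        # Each field is a Python list, so padding is plain list concatenation.
        input_ids.append(item["input_ids"] + [pad_id] * n_pad)
        attention_mask.append([1] * n_tokens + [0] * n_pad)
        labels.append(item["labels"] + [IGNORE_INDEX] * n_pad)
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
    }


samples = [
    {"input_ids": [1, 2, 3], "labels": [0, 1, 0]},
    {"input_ids": [4, 5], "labels": [1, 1]},
]
batch = collate_with_labels(samples)
print(batch["labels"])
```

The same structure extends to other fields: decide how each one should be padded or stacked, then return them all in one dictionary.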

The most surprising thing is how seamlessly this custom logic integrates. You don’t need to modify the Trainer class itself. You’re simply providing a different function that performs the batching step, and the Trainer treats it as just another batch provider. It’s a pluggable architecture that lets you control the very first step of your training pipeline after data loading.

The tokenizer.pad method can pad several keys at once (input_ids, attention_mask, token_type_ids) if your batch contains them. It does not pad other fields such as labels, so handle those yourself in the collator. Just ensure your collator returns a dictionary where all values are PyTorch tensors.
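The sketch below shows tokenizer.pad working on two keys in one call. To stay self-contained it builds a toy tokenizer from a six-entry, made-up vocabulary instead of loading bert-base-uncased; only the pad token matters for this example, and the token IDs in features are arbitrary.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

# Toy vocabulary, made up for this example; [PAD] gets ID 0.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=Tokenizer(WordLevel(vocab, unk_token="[UNK]")),
    pad_token="[PAD]",
    unk_token="[UNK]",
)

# Already-tokenized features with two keys and different lengths.
features = {
    "input_ids": [[2, 4, 3], [2, 4, 5, 3]],
    "token_type_ids": [[0, 0, 0], [0, 0, 0, 0]],
}

# padding=True pads to the longest sequence in the batch.
padded = tokenizer.pad(features, padding=True, return_tensors="pt")
print(padded["input_ids"])
print(padded["token_type_ids"])
print(padded["attention_mask"])
```

Both keys are padded to length 4, and an attention_mask is created even though none was passed in.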

The next hurdle you’ll likely face is handling varying sequence lengths across different tasks, where padding strategies become even more critical.
