The Hugging Face tokenizers library is a high-performance tokenization library written in Rust with Python bindings, designed to be fast and flexible for modern NLP tasks.

Let’s see it in action. Imagine we have a small corpus of text and we want to train a custom WordPiece tokenizer on it.

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
from tokenizers.processors import TemplateProcessing

# Initialize a new tokenizer with WordPiece model
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# Customize pre-tokenization and decoding
# Whitespace splits on whitespace and also separates punctuation from words
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# The decoder handles converting token IDs back to strings
tokenizer.decoder = decoders.WordPiece()

# Define a trainer for the WordPiece model
# We'll set a vocabulary size and special tokens
trainer = trainers.WordPieceTrainer(vocab_size=1000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Sample corpus
files = ["my_corpus.txt"]
with open(files[0], "w", encoding="utf-8") as f:
    f.write("This is the first sentence.\n")
    f.write("This is another sentence for training.\n")
    f.write("Custom tokenizers are powerful.\n")
    f.write("Hugging Face makes NLP easier.\n")

# Train the tokenizer
tokenizer.train(files, trainer=trainer)

# Now we can use the tokenizer
print("Vocabulary size:", tokenizer.get_vocab_size())
print("Vocabulary:", tokenizer.get_vocab())

encoded = tokenizer.encode("This is a custom tokenizer example.")
print("Encoded:", encoded.tokens)
print("Encoded IDs:", encoded.ids)

# Decode the tokens back to text
decoded = tokenizer.decode(encoded.ids)
print("Decoded:", decoded)

# Add a post-processor for BERT-style input
# This adds [CLS] at the beginning and [SEP] at the end of a sequence
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
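    special_tokens=[
        # IDs must match the trained vocabulary, so look them up instead of hardcoding them
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],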
)

encoded_with_postprocessing = tokenizer.encode("This is a sentence.")
print("\nEncoded with post-processing:", encoded_with_postprocessing.tokens)
print("Encoded IDs with post-processing:", encoded_with_postprocessing.ids)

This code initializes a WordPiece tokenizer, sets up whitespace pre-tokenization and WordPiece decoding, defines a trainer with a specific vocabulary size and special tokens, and then trains the tokenizer on a small text file. Finally, it demonstrates encoding and decoding text, and adds a BERT-style post-processor.

The core problem this solves is efficiently creating subword tokenizations tailored to your specific dataset. A pre-trained tokenizer's vocabulary may fragment the terms that matter in your domain (e.g., medical texts, code, or highly specialized jargon) into many short pieces. Training a tokenizer on your own data produces subwords that match your text, which can improve model performance. The tokenizers library achieves this by implementing algorithms like WordPiece, BPE, and Unigram in Rust, making the training and inference phases significantly faster than pure Python implementations.

Internally, the process involves several stages. First, pre-tokenization breaks raw text into initial words or units, typically by splitting on whitespace and punctuation or with regular expressions. Next, the model learns a vocabulary of subword units, either by iteratively merging frequent character sequences (BPE, WordPiece) or by pruning a large candidate vocabulary with a probabilistic model (Unigram). At encoding time, this learned vocabulary is used to convert each pre-tokenized unit into a sequence of token IDs, and post-processing can then add special tokens or structure sequences for specific model architectures. Decoding reverses the process, turning token IDs back into text.
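The last two stages are simple enough to sketch in plain Python. The functions below are illustrative stand-ins, not the library's implementation: `apply_template` and `decode_wordpiece` are hypothetical names, and the sketch assumes the BERT-style conventions used in the example above ("##" marks a continuation piece, [CLS] and [SEP] frame each sequence).

```python
# Illustrative only: plain-Python stand-ins for the post-processing and
# decoding stages. These are hypothetical helpers, not tokenizers API calls.

def apply_template(tokens_a, tokens_b=None):
    """Wrap one or two token sequences BERT-style; return (tokens, type_ids)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        second = tokens_b + ["[SEP]"]
        tokens += second
        type_ids += [1] * len(second)  # second segment gets type id 1
    return tokens, type_ids


def decode_wordpiece(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Join tokens into text, gluing '##' pieces onto the previous word."""
    words = []
    for token in tokens:
        if token in special_tokens:
            continue  # special tokens carry structure, not text
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


tokens, type_ids = apply_template(["token", "##izer", "##s"], ["are", "fast"])
print(tokens)    # ['[CLS]', 'token', '##izer', '##s', '[SEP]', 'are', 'fast', '[SEP]']
print(type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1]
print(decode_wordpiece(tokens))  # tokenizers are fast
```

The type IDs correspond to the `:1` suffixes in the pair template: everything belonging to the second sequence, including its closing [SEP], is marked as segment 1.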

The trainer object is where you control the learning process of the subword vocabulary. Key parameters include vocab_size, which dictates the maximum number of unique tokens in your final vocabulary, and special_tokens, which are reserved tokens with specific meanings (like unknown words, start/end of sequence, padding, etc.). The trainer iterates through your corpus, counting how often tokens and adjacent token pairs occur, and applies the chosen subword algorithm's training procedure to build the vocabulary. (WordPiece’s greedy longest-match-first strategy comes into play later, when the trained model segments words at encoding time.)

The pre_tokenizer is crucial because it defines how the raw text is initially segmented before the subword model even sees it. For instance, Whitespace splits on whitespace and separates punctuation from words, while WhitespaceSplit splits on whitespace only and leaves punctuation attached to the neighboring word. ByteLevel pre-tokenization, which maps text onto a byte-based alphabet, is often used with BPE to ensure that any string can be tokenized, even if it contains characters not seen during training. The choice of pre-tokenizer directly impacts what the subword model has to work with.
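To make the difference concrete, here is a small standard-library sketch that segments the same string three ways. It only approximates the library's behavior: the regular expression is the pattern documented for the Whitespace pre-tokenizer, but Python's `re` module may treat some Unicode characters differently from the Rust implementation, and the byte-level line shows only the underlying idea, not the ByteLevel pre-tokenizer's actual output.

```python
import re

text = "Hello, world!"

# Splitting on whitespace only: punctuation stays attached to the words.
whitespace_only = text.split()

# Runs of word characters and runs of punctuation become separate pieces.
word_or_punct = re.findall(r"\w+|[^\w\s]+", text)

# Byte-level view: any string reduces to values in 0..255, so no input can
# fall outside the base alphabet.
byte_values = list("é!".encode("utf-8"))

print(whitespace_only)  # ['Hello,', 'world!']
print(word_or_punct)    # ['Hello', ',', 'world', '!']
print(byte_values)      # [195, 169, 33]
```

With whitespace-only splitting, "Hello," and "Hello" are different units to the subword model, which tends to waste vocabulary entries on word-plus-punctuation combinations.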

When you train a WordPiece tokenizer, it doesn’t just randomly pick subwords. It starts with individual characters and then iteratively merges pairs of tokens that frequently appear next to each other. (The usual description of WordPiece scores a pair by its frequency divided by the product of the frequencies of its two parts, rather than by raw frequency as in BPE.) The process is designed to prioritize creating meaningful subwords that cover the most common word parts in your training data, reducing the number of unknown tokens ([UNK]) for your specific corpus.
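The merge loop can be sketched in a few lines of plain Python. This is a toy, not the library's trainer: `train_merges` is a hypothetical function, it picks pairs by raw frequency (WordPiece as usually described normalizes that count by the frequencies of the two parts), and ties are broken alphabetically purely to keep the output deterministic.

```python
from collections import Counter


def train_merges(words, vocab_size, special_tokens=("[UNK]",)):
    """Toy frequency-based merge training with '##' continuation pieces."""
    word_counts = Counter(words)
    # Every word starts as single characters; non-initial ones get '##'.
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_counts}

    # Special tokens are reserved first, followed by the base alphabet.
    vocab = list(special_tokens)
    for pieces in splits.values():
        for piece in pieces:
            if piece not in vocab:
                vocab.append(piece)

    while len(vocab) < vocab_size:
        # Count adjacent pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for word, pieces in splits.items():
            for pair in zip(pieces, pieces[1:]):
                pair_counts[pair] += word_counts[word]
        if not pair_counts:
            break  # every word is a single token, nothing left to merge

        # Most frequent pair wins; ties are broken alphabetically.
        (left, right), _ = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        merged = left + right[2:]  # drop the '##' of the right-hand piece
        if merged not in vocab:
            vocab.append(merged)

        # Apply the merge everywhere it occurs.
        for word, pieces in splits.items():
            out, i = [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(pieces[i])
                    i += 1
            splits[word] = out
    return vocab


corpus = "low low low lower lowest newer newer".split()
vocab = train_merges(corpus, vocab_size=12)
print(vocab)
```

On this corpus the vocabulary starts with [UNK] and eight single-character pieces, and the three merges that fit under `vocab_size=12` add "##ow", "low", and "##er". Raising `vocab_size` allows further merges until every word is a single token.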

A subtle but powerful aspect is how the unk_token is handled. At encoding time, WordPiece segments each pre-tokenized word greedily, always taking the longest vocabulary entry that matches at the current position. If at some point no entry matches, the whole word is replaced with the unk_token, not just the piece that failed. The vocabulary is built to make this fallback rare: it contains a set of subwords such that most words can be represented as a sequence of known pieces. This is why training on your specific data is so impactful; it generates subwords relevant to your domain, reducing the reliance on the generic [UNK] token.
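The fallback is easiest to see in a minimal sketch of greedy longest-match-first segmentation. The function name `wordpiece_encode` and the tiny vocabulary are made up for illustration; the logic follows the WordPiece encoding scheme as described for BERT-style tokenizers, and the library's own implementation may differ in details such as a maximum word length.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits here: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


vocab = {"token", "##izer", "##s", "un", "##known"}
print(wordpiece_encode("tokenizers", vocab))  # ['token', '##izer', '##s']
print(wordpiece_encode("tokenizerz", vocab))  # ['[UNK]']
```

The second call matches "token" and "##izer" but finds no entry for "##z", so the partial result is discarded and the word collapses to a single [UNK]. A vocabulary trained on text containing that character would have kept the word intact.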

The next logical step after training a custom tokenizer is integrating it with a pre-trained model from Hugging Face’s transformers library.
