Hugging Face’s Optimum library can accelerate CPU inference for transformers by converting models to ONNX format, enabling them to run on the ONNX Runtime.

Let’s see this in action. Imagine you have a simple text classification pipeline.

from transformers import pipeline

# Load a pre-trained model
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

# Run inference
results = classifier("This is a great movie!")
print(results)

This works, but for high-throughput scenarios, especially on CPUs, we can do better. Enter Optimum and ONNX.

Optimum acts as a bridge, allowing us to optimize Hugging Face models for various hardware backends, including ONNX Runtime. ONNX (Open Neural Network Exchange) is an open format built to represent machine learning models, and ONNX Runtime is a high-performance inference engine that can execute ONNX models efficiently on different hardware.

The core idea is to convert your PyTorch or TensorFlow model into an ONNX graph. The export itself only translates the model; optimizations such as operator fusion and quantization are applied afterwards, either by ONNX Runtime when it loads the graph or through separate Optimum steps. ONNX Runtime then runs the graph with its optimized kernels, which is often faster than the original framework, especially on CPUs where it can make effective use of SIMD instructions and multi-threading.

Here’s how you’d set it up. First, install the necessary libraries:

pip install "optimum[onnxruntime]" transformers onnx

Now, let’s convert our DistilBERT model.

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Load the original model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)

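# Export the model to ONNX format
# export=True converts the checkpoint to ONNX while loading it
ort_model = ORTModelForSequenceClassification.from_pretrained(model_name, export=True)

# Save the ONNX model, its configuration, and the tokenizer to one directory
ort_model.save_pretrained("onnx_model_directory")
tokenizer.save_pretrained("onnx_model_directory")

The export=True argument triggers the conversion, and Optimum handles the details of serializing the model into the ONNX format. The conversion alone does not leave anything in your working directory; calling save_pretrained on the returned model writes the ONNX file and the necessary configuration files to the directory you name.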
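Before loading the export, a quick sanity check on the output directory can save a confusing error later. The sketch below is a hypothetical helper, not part of Optimum: check_export and its default file names are assumptions for illustration, based on Optimum normally writing the graph as model.onnx next to config.json.

```python
from pathlib import Path


def check_export(directory, expected=("model.onnx", "config.json")):
    """Return (present, missing) lists of expected file names in a directory."""
    root = Path(directory)
    # A missing directory is treated the same as an empty one
    found = {p.name for p in root.iterdir()} if root.is_dir() else set()
    present = [name for name in expected if name in found]
    missing = [name for name in expected if name not in found]
    return present, missing


if __name__ == "__main__":
    present, missing = check_export("onnx_model_directory")
    print("present:", present)
    print("missing:", missing)
```

If model.onnx shows up under "missing", the model was most likely never saved to that directory, only the tokenizer.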

Once converted, you can load and use the ONNX model with ONNX Runtime via Optimum.

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Load the ONNX model and tokenizer
model_path = "onnx_model_directory"  # Directory where the ONNX model was saved
tokenizer = AutoTokenizer.from_pretrained(model_path)
ort_model = ORTModelForSequenceClassification.from_pretrained(model_path)

# Create a pipeline using the ONNX model
onnx_classifier = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)

# Run inference with the ONNX model
results_onnx = onnx_classifier("This is a great movie!")
print(results_onnx)

On CPUs you will often see lower latency, especially when you run many predictions, although the size of the gain depends on the model, sequence length, batch size, and hardware, so it is worth measuring on your own workload. ONNX Runtime’s execution engine is built for speed, and it often outperforms native PyTorch or TensorFlow inference on CPUs by using highly optimized kernels and parallel execution strategies.
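To check the speedup on your own machine rather than take it on faith, a small timing harness is enough. The following is a minimal sketch using only the standard library; benchmark is a hypothetical helper, and the lambda in the demo is a stand-in workload so the block runs without any model.

```python
import statistics
import time


def benchmark(fn, inputs, warmup=3, runs=20):
    """Time repeated passes of fn over inputs; return per-pass statistics."""
    # Warm-up passes let caches and lazy initialization settle before timing
    for _ in range(warmup):
        for x in inputs:
            fn(x)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        for x in inputs:
            fn(x)
        timings.append(time.perf_counter() - start)

    return {
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "runs": runs,
    }


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a real pipeline callable
    sentences = ["This is a great movie!"] * 8
    print(benchmark(lambda s: s.lower().split(), sentences))
```

In practice you would call benchmark(classifier, sentences) and benchmark(onnx_classifier, sentences) with the two pipelines from this article and compare the medians.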

The key levers you control are the from_pretrained arguments and the export flag during conversion. For more advanced scenarios, Optimum supports various ONNX Runtime execution providers (such as CUDA for GPUs, alongside the default CPU provider), quantization (reducing model precision to speed up computation and reduce memory), and other graph optimizations. You can also control the ONNX operator set (opset) version used during export, which can be critical for compatibility with specific ONNX Runtime versions or hardware.
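As an illustration of those levers, the sketch below loads an exported model on the CPU execution provider with an explicit thread count. load_cpu_model is a hypothetical wrapper and the default of 4 threads is an arbitrary assumption; the provider and session_options arguments and the SessionOptions fields are the parts that come from Optimum and ONNX Runtime.

```python
def load_cpu_model(model_dir, num_threads=4):
    """Load an exported ONNX classifier on the CPU provider with explicit threading."""
    # Imported inside the function so it can be defined without the libraries installed
    import onnxruntime
    from optimum.onnxruntime import ORTModelForSequenceClassification

    options = onnxruntime.SessionOptions()
    # Threads used to parallelize work inside a single operator
    options.intra_op_num_threads = num_threads
    # Ask ONNX Runtime to apply all of its graph optimizations at load time
    options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

    return ORTModelForSequenceClassification.from_pretrained(
        model_dir,
        provider="CPUExecutionProvider",
        session_options=options,
    )
```

Calling load_cpu_model("onnx_model_directory", num_threads=8) would return a model you can pass to pipeline as before; the best thread count depends on your core count and on how many requests run concurrently.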

The performance gains come from ONNX Runtime’s ability to fuse operations (e.g., merging the separate nodes that make up attention, layer normalization, or GELU into single, optimized kernels) and its efficient memory management, which reduces overhead compared to the more general-purpose execution of frameworks like PyTorch.

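When exporting, Optimum uses a default opset defined for the model architecture. If you encounter compatibility issues with specific ONNX Runtime versions or target hardware, explicitly specifying the opset during export is often the solution. Note that opset_version is not a documented argument of from_pretrained(..., export=True); set the opset through the exporter itself, for example optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english --opset 13 onnx_model_directory/. The requested opset must be at least the minimum that the architecture and your installed versions support, otherwise the export fails.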
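One way to pin the opset from Python, rather than from the command line, is Optimum's exporter entry point main_export, which accepts an opset argument. The sketch below wraps it and adds a small reader for the opset recorded in an exported file. Both function names are hypothetical, and the reader assumes it is given an ONNX ModelProto, as returned by onnx.load.

```python
def export_with_opset(model_name, output_dir, opset):
    """Export a Hub checkpoint to ONNX with an explicitly chosen opset."""
    # Imported inside the function so it can be defined without Optimum installed
    from optimum.exporters.onnx import main_export

    main_export(model_name_or_path=model_name, output=output_dir, opset=opset)


def opset_summary(model_proto):
    """Map each operator-set domain in an ONNX model to its version."""
    # The default ONNX domain is stored as an empty string
    return {
        entry.domain or "ai.onnx": entry.version
        for entry in model_proto.opset_import
    }
```

After export_with_opset("distilbert-base-uncased-finetuned-sst-2-english", "onnx_model_opset13", 13), running opset_summary(onnx.load("onnx_model_opset13/model.onnx")) should report the requested version under "ai.onnx", which confirms what the runtime will actually see.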

The next step in optimizing CPU inference might involve exploring quantization techniques to further reduce model size and latency.
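For a first look at what that involves, here is a minimal sketch of dynamic (weights-only, no calibration data) quantization with Optimum's ORTQuantizer. quantize_dynamic is a hypothetical wrapper, and the avx2 preset is an assumption about the target CPU; other presets such as avx512_vnni or arm64 may suit your hardware better.

```python
def quantize_dynamic(onnx_dir, save_dir):
    """Apply dynamic int8 quantization to an exported ONNX model directory."""
    # Imported inside the function so it can be defined without Optimum installed
    from optimum.onnxruntime import ORTQuantizer
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    quantizer = ORTQuantizer.from_pretrained(onnx_dir)
    # is_static=False selects dynamic quantization, which needs no calibration set
    qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
    quantizer.quantize(save_dir=save_dir, quantization_config=qconfig)
    return save_dir
```

The quantized graph is typically written as model_quantized.onnx, so loading it would look like ORTModelForSequenceClassification.from_pretrained(save_dir, file_name="model_quantized.onnx"). Quantization can cost some accuracy, so compare predictions against the original model before relying on it.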
