A Text Embeddings Inference (TEI) server can embed more text per second than you’d think, but only if you launch and call it correctly.

Let’s see TEI in action. Imagine we have a massive dataset of product reviews, and we want to find similar reviews or categorize them. Instead of throwing a huge, general-purpose model at it, we can use TEI to serve a highly optimized, smaller model for this specific task.

First, we need a model. We’ll use sentence-transformers/all-MiniLM-L6-v2 from Hugging Face as a stand-in; if you have fine-tuned it on your own review data, point TEI at your fine-tuned copy instead.

# Download the model (example, actual path might differ)
# The weights are stored with Git LFS, so enable it before cloning
git lfs install
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Now, we’ll set up the TEI server. We’ll use Docker for this.

docker run --gpus all -p 8080:80 \
    -v $(pwd)/all-MiniLM-L6-v2:/data \
    ghcr.io/huggingface/text-embeddings-inference:1.8 \
    --model-id /data \
    --port 80 \
    --max-batch-tokens 16384 \
    --max-client-batch-size 32 \
    --max-concurrent-requests 512

  • --gpus all: Exposes the host GPUs to the container. TEI also ships CPU images, but a GPU is where it reaches its highest throughput.
  • -p 8080:80: We’re mapping the container’s port 80 to our host’s port 8080.
  • -v $(pwd)/all-MiniLM-L6-v2:/data: This mounts our local model directory into the container at /data.
  • ghcr.io/huggingface/text-embeddings-inference:1.8: This is the TEI Docker image. Don’t confuse it with text-generation-inference, which serves generative models. The 1.8 tag is only an example; pin whichever release you have tested, and note that TEI publishes separate tags for CPU and for some GPU architectures.
  • --model-id /data: Tells TEI to load the model from the mounted directory.
  • --port 80: The port TEI listens on inside the container.
  • --max-batch-tokens 16384: The maximum total number of tokens TEI packs into one batch. This is the main knob for trading GPU memory against throughput.
  • --max-client-batch-size 32: The maximum number of inputs a single request may carry. Larger lists have to be split on the client.
  • --max-concurrent-requests 512: The maximum number of in-flight requests before the server starts rejecting new ones, which gives you backpressure under load.
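Because loading the model can take a while, a client may want to wait until the server reports ready before sending real traffic. The sketch below polls TEI’s /health route using only the standard library. The helper names tei_is_healthy and wait_until_ready, the retry counts, and the base URL are illustrative assumptions, not part of TEI.

```python
import time
import urllib.error
import urllib.request


def tei_is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet
        return False


def wait_until_ready(probe, attempts: int = 60, delay: float = 2.0, sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A typical call would be wait_until_ready(lambda: tei_is_healthy("http://127.0.0.1:8080")), which returns False if the server never becomes healthy within the allotted attempts.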

Once the server is up (it might take a minute to load the model), you can send requests. For embeddings, you’ll use the /embed endpoint, which accepts a single string or a list of strings under "inputs" and returns one vector per input.

Here’s a Python example using requests:

import requests

url = "http://127.0.0.1:8080/embed"

# Our sample product reviews
reviews = [
    "This laptop is amazing! The screen is bright and the keyboard is comfortable.",
    "I love this new phone. The camera takes incredible photos, and the battery lasts all day.",
    "The sound quality of these headphones is outstanding. Perfect for my commute.",
    "This is the worst laptop I've ever owned. It's slow and constantly crashes.",
    "The phone's display is dull, and it overheats easily. Very disappointed."
]

# The /embed endpoint takes the texts under "inputs".
# "normalize" asks for unit-length vectors, which is also the server default.
payload = {
    "inputs": reviews,
    "normalize": True
}

try:
    # json= serializes the payload and sets the Content-Type header for us
    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status() # Raise an exception for bad status codes

    # The response body is a JSON list of embeddings, one for each input review
    # Each embedding is a list of floats (384 of them for all-MiniLM-L6-v2)
    embeddings = response.json()

    print(f"Generated {len(embeddings)} embeddings.")
    print(f"Dimension of the first embedding: {len(embeddings[0])}")
    # print("First embedding:", embeddings[0][:10], "...") # Print first 10 dimensions

    # Now you can use these embeddings for similarity search, clustering, etc.
    # For example, to find similar reviews, you'd compute cosine similarity
    # between the embedding of a query review and all other embeddings.

except requests.exceptions.HTTPError as e:
    # The server explains rejected requests (e.g. too many inputs) in the body
    print(f"Server returned an error: {e}. Response content: {e.response.text}")
except requests.exceptions.RequestException as e:
    # Connection problems, timeouts, and similar transport failures
    print(f"Error making request: {e}")
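The similarity search mentioned in the comments above can be sketched in a few lines. This is a minimal, dependency-free illustration that assumes each embedding is a plain list of floats; the function names cosine_similarity and rank_by_similarity are hypothetical, and the toy vectors stand in for real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_index, embeddings):
    """Return (index, score) pairs for every other embedding, most similar first."""
    query = embeddings[query_index]
    scores = [
        (i, cosine_similarity(query, emb))
        for i, emb in enumerate(embeddings)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
ranking = rank_by_similarity(0, toy)
print(ranking[0][0])  # index of the vector closest to toy[0]
```

With real data you would pass the list of embeddings returned by the server. For large collections a vectorized library or a vector index is the more practical choice; the loop here is only meant to show the computation.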

The magic here is that TEI handles tokenization, batching, half-precision inference (where the hardware supports it), and efficient GPU utilization under the hood. No text generation pipeline is involved: the /embed endpoint runs each text through the encoder, pools the token-level hidden states into a single vector (mean pooling for sentence-transformers models like this one), and returns that vector as the embedding.
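For sentence-transformers models such as all-MiniLM-L6-v2, the sentence vector is typically obtained by averaging the token-level hidden states (ignoring padding) and then normalizing the result. The sketch below shows that computation on toy numbers in plain Python. It illustrates the idea only and is not TEI’s implementation; mean_pool and l2_normalize are hypothetical helper names.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding positions."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                summed[j] += vec[j]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in summed]


def l2_normalize(vector):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return vector if norm == 0.0 else [x / norm for x in vector]


# Three token vectors; the last one is padding and must not affect the result
tokens = [[3.0, 0.0], [1.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)  # [2.0, 2.0]
sentence_embedding = l2_normalize(pooled)
```

Unit-length vectors have the convenient property that cosine similarity reduces to a plain dot product.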

The core problem TEI solves for high-throughput embeddings is making specialized, optimized models accessible via a performant API. It abstracts away the complexities of loading models onto GPUs, managing inference requests, and batching them efficiently. It’s not just about serving a model; it’s about serving it fast.

What most people miss is TEI’s token-based dynamic batching. When you send a list of inputs to the /embed endpoint, or when many clients send requests at the same time, TEI doesn’t just run each input through the model one by one. It queues the inputs, groups them into batches bounded by a token budget, runs each batch through the model in a single forward pass, and routes the resulting vectors back to the requests they came from. This batching is the primary driver of its high throughput, especially when dealing with many small inputs. Two launch parameters shape it: --max-batch-tokens caps the total number of tokens in a batch, and --max-client-batch-size caps how many inputs a single request may carry. On the client side, you primarily need to make sure each individual text fits the model’s maximum sequence length (truncating it yourself or setting "truncate": true in the request) and that large datasets are split into requests no bigger than the client batch limit.
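To make the idea of batching under a token budget concrete, here is a simplified greedy sketch: inputs are taken in arrival order and a new batch is started whenever adding the next input would exceed the budget. It is an illustration under that assumption, not TEI’s actual scheduler, and the function name batch_by_token_budget and the sample token counts are made up for the example.

```python
def batch_by_token_budget(token_counts, max_batch_tokens):
    """Group input indices into batches whose summed token counts fit the budget."""
    batches, current, used = [], [], 0
    for idx, n_tokens in enumerate(token_counts):
        if n_tokens > max_batch_tokens:
            raise ValueError(f"input {idx} alone exceeds the token budget")
        if current and used + n_tokens > max_batch_tokens:
            # The next input does not fit: close this batch and start a new one
            batches.append(current)
            current, used = [], 0
        current.append(idx)
        used += n_tokens
    if current:
        batches.append(current)
    return batches


# Token counts for six queued inputs and a deliberately small budget
counts = [120, 30, 45, 200, 10, 90]
batches = batch_by_token_budget(counts, max_batch_tokens=256)
print(batches)  # [[0, 1, 2], [3, 4], [5]]
```

The example shows why many short texts batch well: three short inputs share one forward pass, while a single long input nearly fills a batch on its own.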

The next step is often exploring different model architectures or quantization techniques to push throughput even higher for your specific embedding task.
