BentoML is a framework for packaging your trained machine learning models into a standardized format called a "Bento" and serving them as production-ready APIs.

Here’s a quick look at how it works:

# Legacy (pre-1.0) decorator API, using 0.7-era names; later releases
# renamed several of these, so check them against your installed version.
from bentoml import BentoService, api, env, artifacts
from bentoml.artifact import SklearnModelArtifact
from bentoml.handlers import DataframeHandler
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# ClassifierService is an illustrative name for the service class
@env(pip_dependencies=["scikit-learn"])
@artifacts([SklearnModelArtifact("model")])
class ClassifierService(BentoService):
    @api(DataframeHandler)
    def predict(self, df: pd.DataFrame):
        # The packed model is exposed through self.artifacts
        return self.artifacts.model.predict(df)

# Assume you have a trained model
model = RandomForestClassifier()
# ... train your model ...

# To build the bento, pack the trained model into the service and save it:
service = ClassifierService()
service.pack("model", model)
saved_path = service.save()

# To serve the bento:
# bentoml serve ClassifierService:latest

The core problem BentoML solves is the gap between a trained model artifact and a deployed, scalable, production-ready service. Researchers and data scientists often end up with model files (like .pkl, .h5, or .pt) that are difficult to integrate into existing software stacks. BentoML bridges this by providing a consistent way to bundle the model, its dependencies, and the serving logic.

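Internally, a Bento is a directory. The exact layout depends on the BentoML version; in 1.x releases, for example, a built Bento contains:

  • bento.yaml: The manifest that describes the Bento, including its name, version, and the models it contains. (bentofile.yaml is the build configuration you write; it is the input to the build rather than this manifest.)
  • models/: A directory storing the serialized model artifacts.
  • src/: The Python source files defining the API endpoints and serving logic.
  • env/: Environment definitions, including the pinned Python dependencies and the Dockerfile used to build a container image for the Bento.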

The @api decorator defines an endpoint. DataframeHandler specifies that the endpoint expects a pandas DataFrame as input and handles the JSON parsing and DataFrame conversion for you. The @env decorator declares the Python packages your model requires so that they are installed in the production environment. The @artifacts decorator declares the model artifacts that the service bundles into the Bento.
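To make the input-handling step concrete, here is a minimal sketch of the kind of conversion a DataFrame input handler performs before your prediction code runs. It uses only pandas and the standard library. The helper json_body_to_dataframe and the column names are hypothetical, and this is an illustration of the idea rather than BentoML's actual implementation.

```python
import json

import pandas as pd


def json_body_to_dataframe(body: str, columns=None) -> pd.DataFrame:
    """Parse a JSON request body into a DataFrame (illustrative helper only).

    Accepts either a list of rows, e.g. [[5.1, 3.5], [6.2, 2.9]],
    or a list of records, e.g. [{"a": 1, "b": 2}].
    """
    payload = json.loads(body)
    if not isinstance(payload, list):
        raise ValueError("expected a JSON array of rows or records")
    # pandas builds a frame from either a list of lists or a list of dicts
    return pd.DataFrame(payload, columns=columns)


# Hypothetical request body: two rows of four features each
body = "[[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]]"
df = json_body_to_dataframe(body, columns=["f0", "f1", "f2", "f3"])
print(df.shape)
```

Because this parsing happens before your code is called, the prediction code only ever sees a DataFrame and never deals with raw request bodies.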

Building the Bento packages all these components into a versioned artifact: in the legacy decorator API you call save() on the packed service, while BentoML 1.0 and later use the bentoml build command. bentoml serve then takes this artifact and starts a web server that exposes your API endpoints (Flask-based in pre-1.0 releases, a Starlette-based ASGI application from 1.0 onward). The server can be containerized using the generated Dockerfile for deployment on platforms like Kubernetes, AWS SageMaker, or GCP AI Platform.
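Once the server is running, a client sends rows to the endpoint as JSON over HTTP. The sketch below only builds the request, so it runs without a server. The URL assumes a local server on port 5000 exposing a /predict endpoint, which is an assumption you should check against your own setup, and build_predict_request is a hypothetical helper.

```python
import json
import urllib.request


def build_predict_request(rows, url="http://localhost:5000/predict"):
    """Build (but do not send) a JSON POST request for a prediction endpoint."""
    data = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request([[5.1, 3.5, 1.4, 0.2]])
print(req.get_method(), req.full_url)

# To call a running server, uncomment once one is listening:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```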

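The @artifacts decorator accepts more than one artifact type. Beyond the scikit-learn artifact used above, the legacy API includes PickleArtifact for arbitrary picklable objects, PytorchModelArtifact, KerasModelArtifact, and TransformersModelArtifact for Hugging Face models, among others, so a single service can bundle a multi-model pipeline.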

A useful property of BentoML's artifact system is that it does not just copy your model file: each artifact class defines how its model type (scikit-learn, PyTorch, TensorFlow, and so on) is saved and loaded, which keeps behavior consistent across environments. When your serving code retrieves the "model" artifact, BentoML loads it with the mechanism that matches its type, so your code stays free of framework-specific loading logic and remains portable.
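The pattern behind this can be sketched in a few lines: each artifact type owns its save and load logic, so calling code never touches the serialization format. The class below is a hypothetical stand-in built on pickle, not BentoML's implementation; PickleBackedArtifact, predict, and the threshold "model" are names invented for illustration.

```python
import pickle
import tempfile
from pathlib import Path


class PickleBackedArtifact:
    """Hypothetical artifact that knows how to save and load one named object."""

    def __init__(self, name):
        self.name = name
        self._obj = None

    def pack(self, obj):
        # Attach an in-memory object (for example, a trained model)
        self._obj = obj
        return self

    def save(self, directory):
        path = Path(directory) / f"{self.name}.pkl"
        with open(path, "wb") as f:
            pickle.dump(self._obj, f)
        return path

    def load(self, directory):
        with open(Path(directory) / f"{self.name}.pkl", "rb") as f:
            self._obj = pickle.load(f)
        return self

    def get(self):
        return self._obj


def predict(params, values):
    """Toy 'model': label each value by comparing it to a stored threshold."""
    return [int(v > params["threshold"]) for v in values]


with tempfile.TemporaryDirectory() as tmp:
    # Build time: pack the toy model parameters and write them to disk
    PickleBackedArtifact("model").pack({"threshold": 0.5}).save(tmp)
    # Serve time: a fresh artifact restores the object by name alone
    restored = PickleBackedArtifact("model").load(tmp).get()

predictions = predict(restored, [0.2, 0.9])
print(predictions)
```

A framework-specific artifact would swap the pickle calls for that framework's own save and load routines while keeping the same pack, save, load, and get surface.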

The next step is exploring how to integrate custom pre- and post-processing logic directly into your Bento.
