Deploying ML models on Kubernetes is often complex. KServe, a Kubernetes-native model serving platform, simplifies it by providing a standardized way to package, deploy, and serve machine learning models.

Let’s see KServe in action. Imagine you have a trained scikit-learn model saved as a model.joblib file. You want to deploy this model as a REST API endpoint on Kubernetes.

First, you’ll need to make the model artifact available where KServe can fetch it. A simple scikit-learn model needs no extra packaging or configuration file: the scikit-learn server expects the storage location to contain the serialized model under a name it recognizes, such as model.joblib. The directory looks like this:

my-sklearn-model/
└── model.joblib

The model.joblib file is your serialized scikit-learn model. Upload the directory to a location your cluster can read, for example gs://my-bucket/my-sklearn-model (S3 and other backends work as well).

Next, you create a KServe InferenceService custom resource. This is a Kubernetes object that describes your model deployment.

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    sklearn:
      storageUri: "gs://my-bucket/my-sklearn-model" # Points to the directory containing model.joblib

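When you apply this InferenceService to your Kubernetes cluster (for example with kubectl apply), KServe handles the rest. In raw deployment mode it creates a Kubernetes Deployment and a Service to manage your model’s pods; in Knative-based serverless mode it creates a Knative Service that manages them instead. It also sets up a VirtualService (if you’re using Istio, which is common with KServe) to route traffic to your model.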
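If you deploy many models, it can help to generate the manifest rather than hand-edit YAML. The following is a minimal sketch using only the Python standard library; the helper name build_inference_service, the script name in the comment, and the explicit "default" namespace are assumptions made for illustration, not part of KServe.

```python
import json


def build_inference_service(name: str, storage_uri: str, namespace: str = "default") -> dict:
    """Return an InferenceService manifest equivalent to the YAML shown earlier."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                # The framework key selects the scikit-learn model server.
                "sklearn": {"storageUri": storage_uri},
            }
        },
    }


manifest = build_inference_service("sklearn-iris", "gs://my-bucket/my-sklearn-model")

# kubectl accepts JSON as well as YAML, so the output can be piped straight in:
#   python make_isvc.py | kubectl apply -f -      (make_isvc.py is a hypothetical file name)
print(json.dumps(manifest, indent=2))
```

Emitting JSON avoids a YAML dependency while producing an object the API server treats identically.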

Here’s what’s happening under the hood:

  1. KServe Controller: The KServe controller watches for InferenceService objects. When it sees a new one, it provisions the necessary Kubernetes resources.
  2. Model Server: KServe launches a model server based on the framework you specified, either through a framework field such as sklearn or through the newer model.modelFormat field (e.g., sklearnserver for scikit-learn, tfserving for TensorFlow, torchserve for PyTorch). This model server is responsible for loading your model artifact from the storageUri.
  3. Networking: KServe exposes your model server as a stable endpoint, either through a service mesh such as Istio or, in raw deployment mode, through a standard Kubernetes Ingress. You can then send inference requests to this endpoint.

The problem KServe solves is the boilerplate of setting up inference servers, managing dependencies, and exposing them reliably on Kubernetes. Instead of writing custom Dockerfiles, Kubernetes Deployments, and Services for each model framework, you define a declarative InferenceService.

The storageUri is a critical parameter. It can point to cloud object storage (s3:// or gs://), an HTTP(S) URL, a PersistentVolumeClaim (pvc://, for example one backed by a network file system), or even a local path within the pod if you’re packaging the model directly into the container image. This flexibility allows you to manage your model artifacts independently of your deployment configuration.
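A small pre-flight check can catch a mistyped storageUri before the manifest reaches the cluster. The sketch below is illustrative only: the function name describe_storage_uri and the scheme table are assumptions, and the table is a subset chosen for this example rather than KServe’s authoritative list.

```python
from urllib.parse import urlparse

# Illustrative subset of schemes; not KServe's authoritative list.
KNOWN_SCHEMES = {
    "gs": "Google Cloud Storage",
    "s3": "S3-compatible object storage",
    "pvc": "PersistentVolumeClaim",
    "http": "HTTP download",
    "https": "HTTPS download",
}


def describe_storage_uri(storage_uri: str) -> str:
    """Return a human-readable description of where a storageUri points."""
    parsed = urlparse(storage_uri)
    if parsed.scheme == "":
        # No scheme: treat the value as a path baked into the container image.
        return f"local path in the container: {parsed.path}"
    if parsed.scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unrecognized storage scheme in {storage_uri!r}")
    # netloc is the bucket, claim name, or host; path is the location inside it.
    return f"{KNOWN_SCHEMES[parsed.scheme]}: {parsed.netloc}{parsed.path}"


print(describe_storage_uri("gs://my-bucket/my-sklearn-model"))
```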

The predictor is only one component of the InferenceService spec. Alongside it, you can define an optional transformer for pre/post-processing steps and an optional explainer for model explanations. Pipelines that chain or route between several models are described with a separate InferenceGraph resource. Together, these let you build sophisticated inference pipelines.
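To make the pre/post-processing role concrete, here is a minimal sketch of the data flow in plain Python, without the KServe SDK. It assumes the model is a classifier trained on the Iris dataset with scikit-learn’s standard label order; the input field names and the fake_predict stand-in are hypothetical.

```python
from typing import Callable

# Assumption: the model was trained on Iris with scikit-learn's standard label order.
IRIS_CLASSES = ["setosa", "versicolor", "virginica"]


def preprocess(raw: dict) -> dict:
    """Turn a friendly record into the 'instances' payload the predictor expects."""
    row = [raw["sepal_length"], raw["sepal_width"], raw["petal_length"], raw["petal_width"]]
    return {"instances": [row]}


def postprocess(response: dict) -> dict:
    """Map numeric class indices back to readable labels."""
    return {"species": [IRIS_CLASSES[i] for i in response["predictions"]]}


def run_pipeline(raw: dict, predict: Callable[[dict], dict]) -> dict:
    """Transformer flow: preprocess -> predictor -> postprocess."""
    return postprocess(predict(preprocess(raw)))


def fake_predict(payload: dict) -> dict:
    """Stand-in for the predictor; a real transformer would call it over the network."""
    return {"predictions": [0 for _ in payload["instances"]]}


result = run_pipeline(
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
    fake_predict,
)
print(result)
```

Keeping this logic out of the predictor means the model server image stays generic while request shaping can change independently.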

A common misconception is that KServe is just another model serving tool. In reality, it’s a Kubernetes-native abstraction layer. It doesn’t reinvent serving; it standardizes how existing, robust model servers (like TensorFlow Serving, TorchServe, or custom Python servers) are deployed and managed within the Kubernetes ecosystem. This means you get the benefits of Kubernetes (scalability, self-healing, rolling updates) applied directly to your ML model deployments.

When you configure your InferenceService, the sklearn field under predictor is more than a label; it’s an instruction to KServe to instantiate the sklearnserver component. This component is a specialized container image that knows how to load scikit-learn models using joblib and expose REST and gRPC APIs for predictions. The storageUri then tells this sklearnserver where to find the actual model.joblib file.

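Once deployed, you can send a prediction request to your sklearn-iris service. The hostname below is cluster-local and assumes the default namespace, so run the command from inside the cluster. The request might look like this (using curl):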

curl -v -X POST \
  -H "Content-Type: application/json" \
  --data '{"instances": [[5.1, 3.5, 1.4, 0.2]]}' \
  http://sklearn-iris.default.svc.cluster.local/v1/models/sklearn-iris:predict

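The response will be the model’s prediction for the given input, returned as JSON under a predictions key (for example, {"predictions": [0]} if the model assigns the sample to class 0).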
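The same request can be issued from Python with only the standard library. This is a sketch under the same assumptions as the curl command (cluster-local hostname, default namespace); the helper names are made up for illustration, and the network call is left commented out because it only resolves inside the cluster.

```python
import json
import urllib.request


def predict_url(name: str, namespace: str = "default") -> str:
    """Cluster-local predict URL, following the pattern used in the curl command."""
    return f"http://{name}.{namespace}.svc.cluster.local/v1/models/{name}:predict"


def build_request(name: str, instances: list, namespace: str = "default") -> urllib.request.Request:
    """Build the POST request carrying the 'instances' payload as JSON."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        predict_url(name, namespace),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_predictions(raw_body: bytes) -> list:
    """Extract the 'predictions' list from the JSON response body."""
    return json.loads(raw_body)["predictions"]


request = build_request("sklearn-iris", [[5.1, 3.5, 1.4, 0.2]])

# Only reachable from inside the cluster, so the call itself is commented out:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(parse_predictions(response.read()))
```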

The next concept you’ll encounter is managing multiple models or even complex inference graphs where one model’s output feeds into another, which KServe supports through its multi-model serving and graph capabilities.
