MLOps Seldon Core: Deploy Models at Scale on K8s (2026)

Seldon Core doesn’t just deploy models; it orchestrates entire ML inference services, turning your trained models into robust, scalable, and observable endpoints within Kubernetes.

Imagine you’ve got a fantastic model, trained and ready. Seldon Core is your bridge from that trained artifact to a live, production-ready API. It’s built on Kubernetes, so it inherits all the benefits: automatic scaling, self-healing, rolling updates, and resource management. But Seldon Core adds the ML-specific layer, handling things like model versioning, A/B testing, canary deployments, and advanced request routing for complex inference graphs.

Let’s see it in action. Suppose we have a simple scikit-learn model for Iris flower classification.

First, we need a model.pkl file containing our trained model.

# Example training script (not part of Seldon deployment itself)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import joblib

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
model.fit(X, y)

joblib.dump(model, 'model.pkl')

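Now, we define a kserve.yaml to tell Kubernetes how to serve this model. The manifest uses the KServe InferenceService API (serving.kserve.io/v1beta1); Seldon Core’s own native resource, the SeldonDeployment CRD (machinelearning.seldon.io/v1), plays the equivalent role.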

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "iris-classifier"
spec:
  predictor:
    model:
      modelFormat:
        name: "sklearn"
      storageUri: "memory://model.pkl" # Placeholder; on a real cluster use a supported scheme such as s3://, gs://, or pvc://

This InferenceService manifest tells Seldon Core:

name: iris-classifier: The name of our deployed inference service.
predictor.model: We’re deploying a single model.
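modelFormat: "sklearn": The framework of our model. In the v1beta1 schema, modelFormat is an object whose name field carries the framework (modelFormat: {name: sklearn}). The serving runtime knows how to load and run scikit-learn models. Other formats such as tensorflow, pytorch, xgboost, and triton are also supported.
storageUri: "memory://model.pkl": Where to find the model file. The memory:// value is a placeholder for this walkthrough rather than a scheme to rely on. On a real cluster, this would typically be a path to a cloud storage bucket (e.g., s3://my-bucket/models/iris/v1/model.pkl) or a persistent volume claim (pvc://).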
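The same manifest can be generated programmatically, which helps when one template is reused across many models. The sketch below is a minimal illustration, assuming the KServe v1beta1 schema in which modelFormat is an object with a name field. The helper build_inference_service and the bucket path are hypothetical names chosen for this example. It relies only on the standard library, since kubectl accepts JSON as well as YAML.

```python
import json


def build_inference_service(name: str, model_format: str, storage_uri: str) -> dict:
    """Return an InferenceService manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "model": {
                    # modelFormat is an object; the framework goes in its "name" field.
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                }
            }
        },
    }


manifest = build_inference_service(
    name="iris-classifier",
    model_format="sklearn",
    storage_uri="s3://my-bucket/models/iris/v1/",  # assumed example location
)

if __name__ == "__main__":
    # Emit JSON so the output can be piped straight into kubectl.
    print(json.dumps(manifest, indent=2))
```

If this were saved as build_manifest.py (an assumed filename), it could be applied with: python build_manifest.py | kubectl apply -f -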

To deploy this, we’d apply the manifest:

kubectl apply -f kserve.yaml

Seldon Core, running within our Kubernetes cluster, will:

Reconcile the InferenceService resource into the underlying Kubernetes objects it needs (such as Deployments and Services).
Download the model.pkl from the storageUri.
Start a container (using a pre-built Seldon/KServe image for sklearn) that loads the model.
Expose an HTTP endpoint (usually /v1/models/iris-classifier:predict) for inference requests.

A typical prediction request would look like this:

curl -X POST kserve.your-domain.com/v1/models/iris-classifier:predict \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      [6.8, 2.8, 4.8, 1.4]
    ]
  }'

And the response:

{
  "predictions": [
    1
  ]
}
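The same request can be issued from Python. This is a minimal sketch using only the standard library, assuming the endpoint path and JSON shapes shown above. The function names are hypothetical, and the class-name list assumes the label order of scikit-learn’s load_iris dataset. Building the request is kept separate from sending it, so the payload can be checked without a live cluster.

```python
import json
import urllib.request

# Class index -> species name, in the order used by sklearn's load_iris().
IRIS_CLASSES = ["setosa", "versicolor", "virginica"]


def build_predict_request(base_url: str, model_name: str, instances: list) -> urllib.request.Request:
    """Build a V1-protocol predict request without sending it."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_predictions(response_body: str) -> list:
    """Map the integer class indices in a response body to species names."""
    payload = json.loads(response_body)
    return [IRIS_CLASSES[int(index)] for index in payload["predictions"]]


def predict(base_url: str, model_name: str, instances: list, timeout: float = 10.0) -> list:
    """Send the request and return species names (needs a reachable endpoint)."""
    request = build_predict_request(base_url, model_name, instances)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_predictions(response.read().decode("utf-8"))


# Offline check using the request and response bodies from the walkthrough.
req = build_predict_request(
    "http://kserve.your-domain.com", "iris-classifier", [[6.8, 2.8, 4.8, 1.4]]
)
labels = parse_predictions('{"predictions": [1]}')
```

Here labels holds the species name for class index 1. Calling predict(...) with the same arguments would perform the actual HTTP round trip.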

This simple example is just the tip of the iceberg. Seldon Core shines when you move beyond single models. You can create complex inference graphs:

Ensembles: Combine multiple models with different weighting strategies (e.g., average predictions).
Transformers: Pre-process or post-process data before/after model inference.
Routers: Direct traffic to different model versions for A/B testing or canary releases.
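To make the transformer and router roles concrete, here is a framework-free sketch in plain Python. It is a conceptual illustration rather than Seldon Core’s implementation: model_v1 is a hand-written stand-in rule, and with_transformers and make_router are hypothetical helpers. The 90/10 split is an assumed canary ratio.

```python
import random
from typing import Callable, Dict, List

IRIS_CLASSES = ["setosa", "versicolor", "virginica"]


def model_v1(features: List[float]) -> int:
    """Stand-in for a deployed model: a hand-written rule, not a trained classifier."""
    return 0 if features[2] < 2.5 else 1


def with_transformers(pre: Callable, model: Callable, post: Callable) -> Callable:
    """Wrap a model with a pre-processing and a post-processing step."""
    def pipeline(features):
        return post(model(pre(features)))
    return pipeline


def make_router(weights: Dict[str, float], rng: random.Random) -> Callable[[], str]:
    """Return a function that picks a model name in proportion to its weight."""
    names = list(weights)
    shares = [weights[name] for name in names]

    def choose() -> str:
        return rng.choices(names, weights=shares, k=1)[0]

    return choose


# Transformer: round inputs before the model, map the class index to a name after it.
service_v1 = with_transformers(
    pre=lambda features: [round(value, 1) for value in features],
    model=model_v1,
    post=lambda class_index: IRIS_CLASSES[class_index],
)
label = service_v1([6.8, 2.8, 4.8, 1.4])

# Router: send roughly 10% of requests to the canary version.
choose_version = make_router({"iris-v1": 0.9, "iris-v2": 0.1}, random.Random(42))
picks = [choose_version() for _ in range(10_000)]
canary_share = picks.count("iris-v2") / len(picks)
```

In a real deployment each of these roles would run as its own node in the inference graph; the point here is only the shape of the data flow.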

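Consider a more advanced kserve.yaml for an ensemble. Treat it as a conceptual sketch: the graph and ensemble fields below are illustrative and not part of the v1beta1 InferenceService schema. KServe expresses multi-model graphs with a separate InferenceGraph resource, and Seldon Core’s SeldonDeployment defines them in the graph field of a predictor.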

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "iris-ensemble"
spec:
  predictor:
    graph:
      children:
      - name: "iris-v1"
        model:
          modelFormat: "sklearn"
          storageUri: "memory://iris-v1.pkl"
      - name: "iris-v2"
        model:
          modelFormat: "sklearn"
          storageUri: "memory://iris-v2.pkl"
      - name: "combiner"
        # A custom service or a built-in combiner can be used here
        # For simplicity, imagine a built-in average combiner
        ensemble:
          strategy: "average"
          models:
          - "iris-v1"
          - "iris-v2"

Here, iris-ensemble will route each request to both iris-v1 and iris-v2, then average their predictions in the combiner node. Averaging smooths out the errors of the individual models, which is useful for experimentation and for improving robustness.
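The averaging step itself is simple. The following is a minimal sketch of an average combiner, assuming each model returns a class-probability vector of the same length. The function names and the probability values are made up for illustration, and this is not Seldon Core’s own combiner code.

```python
from typing import List, Sequence


def average_combiner(prob_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of per-model class-probability vectors."""
    if not prob_vectors:
        raise ValueError("need at least one prediction to combine")
    n_classes = len(prob_vectors[0])
    if any(len(probs) != n_classes for probs in prob_vectors):
        raise ValueError("all models must return the same number of classes")
    # zip(*...) groups the probabilities of each class across models.
    return [sum(column) / len(prob_vectors) for column in zip(*prob_vectors)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up outputs: iris-v1 favors class 1, iris-v2 favors class 2.
iris_v1_probs = [0.10, 0.60, 0.30]
iris_v2_probs = [0.05, 0.40, 0.55]

combined = average_combiner([iris_v1_probs, iris_v2_probs])
predicted_class = argmax(combined)  # class 1 wins after averaging
```

Although the two models disagree on the top class, the averaged probabilities favor class 1, because iris-v1 is more confident in its choice than iris-v2 is in its own.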

The core idea is that Seldon Core manages the lifecycle of your ML service. When you update iris-v1.pkl with a new model, you can update its storageUri in the InferenceService manifest. Seldon Core will then perform a rolling update, ensuring no downtime and allowing you to gradually shift traffic if needed.
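A version bump can be scripted as a small transformation of the manifest. The sketch below is one possible approach, assuming a single-model KServe v1beta1 InferenceService. The helper roll_out and the storage paths are hypothetical. canaryTrafficPercent is the KServe predictor field for gradual traffic shifting, and whether it takes effect depends on how the cluster is installed.

```python
import copy
from typing import Optional


def roll_out(manifest: dict, new_storage_uri: str, canary_percent: Optional[int] = None) -> dict:
    """Return a copy of the manifest pointing at a new model artifact.

    If canary_percent is given, only that share of traffic is meant to reach
    the new revision until the value is raised or removed.
    """
    if canary_percent is not None and not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    predictor = updated["spec"]["predictor"]
    predictor["model"]["storageUri"] = new_storage_uri
    if canary_percent is not None:
        predictor["canaryTrafficPercent"] = canary_percent
    return updated


# Current manifest (storage paths are assumed example locations).
current = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "iris-classifier"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "s3://my-bucket/models/iris/v1/",
            }
        }
    },
}

canary = roll_out(current, "s3://my-bucket/models/iris/v2/", canary_percent=10)
```

Applying the canary manifest and later re-applying it with a higher percentage gives a gradual shift. Leaving canary_percent unset produces a plain rolling update to the new artifact.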

The most counterintuitive aspect of Seldon Core is how it abstracts away Kubernetes complexity while leveraging its full power for ML workloads. You don’t need to write Kubernetes YAML for deployments, services, or ingress when defining your model. The InferenceService CRD shown above acts as a higher-level abstraction, but under the hood, the controller generates all the necessary Kubernetes resources from it. This means you get Kubernetes’ robustness and scalability without needing to be a Kubernetes expert for every model deployment.

Beyond basic deployments and ensembles, Seldon Core offers sophisticated features like Outlier Detection and Explanations, allowing you to build more intelligent and trustworthy ML systems directly within your inference graph.

The next step after mastering InferenceService is exploring the Seldon Operator’s ability to manage multiple, complex inference graphs and integrate with MLOps pipelines for automated retraining and redeployment.