A Kubernetes ML workload isn’t just about running a container; it’s about orchestrating a distributed system that can dynamically scale to meet unpredictable inference demands, all while managing the lifecycle of your trained models.
Let’s see what this looks like in practice. Imagine we have a Python Flask app that serves predictions from a pre-trained scikit-learn model.
# app.py
from flask import Flask, request, jsonify
import joblib
import os

app = Flask(__name__)

# Load the model once at startup; MODEL_PATH overrides the default location.
model_path = os.environ.get("MODEL_PATH", "/app/model.joblib")
model = joblib.load(model_path)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = data['features']
    # tolist() converts NumPy scalars (e.g. int64) into plain Python types,
    # which jsonify can serialize.
    prediction = model.predict([features]).tolist()[0]
    return jsonify({'prediction': prediction})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Now, we need to containerize this.
# Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py app.py
COPY model.joblib /app/model.joblib
EXPOSE 5000
CMD ["python", "app.py"]
The requirements.txt would simply contain:
flask
scikit-learn
joblib
With this Dockerfile, we build the image:
docker build -t my-ml-model:v1 .
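And then deploy it to Kubernetes. The cluster has to be able to pull the image, so first push it to a registry your nodes can reach (or load it into your local cluster). A basic deployment might look like this: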
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-service
  template:
    metadata:
      labels:
        app: ml-service
    spec:
      containers:
      - name: ml-service
        image: my-ml-model:v1
        ports:
        - containerPort: 5000
And a service to expose it:
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-service
spec:
  selector:
    app: ml-service
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5000
  type: ClusterIP # Or LoadBalancer for external access
Applying these:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
Now, Kubernetes manages the pods, ensuring they are running and can be accessed via the ml-service ClusterIP.
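To check the deployed service from a workstation, one option is to forward a local port to the Service and send a test request. The sketch below assumes `kubectl port-forward svc/ml-service 8080:80` is running in another terminal. The local port and the helper names are illustrative choices, not part of the setup above.

```python
import json
import urllib.request

# Assumption: `kubectl port-forward svc/ml-service 8080:80` is running locally.
BASE_URL = "http://localhost:8080"


def build_predict_request(features, base_url=BASE_URL):
    """Build a POST request for the /predict endpoint defined in app.py."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        # Flask's request.get_json() expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, base_url=BASE_URL, timeout=5.0):
    """Send the features to the service and return the decoded prediction."""
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["prediction"]
```

Calling `predict([...])` with a feature vector of the shape your model was trained on should return the same value the Flask handler puts under the `prediction` key.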
The mental model here is that Kubernetes acts as an intelligent scheduler and manager for your containerized applications. For ML workloads, this means it can:
- Automate Scaling: Using Horizontal Pod Autoscalers (HPAs) based on CPU or custom metrics (like requests per second), Kubernetes can automatically increase or decrease the number of ml-service pods to handle varying inference loads.
- Ensure High Availability: If a container crashes, Kubernetes restarts it; if a node fails, the Deployment creates replacement pods on a healthy node.
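- Manage Updates: Rolling updates and canary deployments let you update your model or application code without downtime, provided the pods define readiness probes so that traffic only reaches replicas that have finished loading the model.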
- Resource Management: You can specify CPU and memory requests/limits for your containers, ensuring that your ML workloads get the resources they need without starving other applications.
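As a rough illustration of the autoscaling point, the sketch below builds a CPU-based HorizontalPodAutoscaler manifest for the ml-service Deployment and approximates the replica calculation described in the Kubernetes documentation, `ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The 70% target, the replica bounds, and the function names are assumptions chosen for the example. A CPU utilization target also requires the container to declare a CPU request, which the deployment above does not yet do.

```python
import json
import math


def build_hpa_manifest(name="ml-service", min_replicas=2, max_replicas=10,
                       cpu_target_percent=70):
    """Return an autoscaling/v2 HPA manifest targeting the Deployment `name`."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": cpu_target_percent,
                    },
                },
            }],
        },
    }


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling rule, ignoring stabilization windows."""
    ratio = current_metric / target_metric
    # The controller skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


def render_manifest(manifest):
    """Serialize a manifest as JSON, which kubectl accepts alongside YAML."""
    return json.dumps(manifest, indent=2)
```

For instance, two replicas averaging 140% of their CPU request against a 70% target give `desired_replicas(2, 140, 70) == 4`. Writing `render_manifest(build_hpa_manifest())` to a file such as `hpa.json` and running `kubectl apply -f hpa.json` would create the autoscaler.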
The key to managing ML models within this system is treating the model artifact itself as a deployable component. In our example, it’s baked into the container image. For larger models or frequent updates, you’d typically store models in object storage (like S3, GCS) and have your container pull the model down on startup, or use a Kubernetes CSI driver to mount model volumes. The MODEL_PATH environment variable in our Flask app is a simple mechanism to allow overriding the default model location, making the container more flexible.
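The pull-on-startup approach could be sketched as follows. This is a minimal illustration that assumes the model is reachable over plain HTTPS (for example through a pre-signed object-storage URL) and that a hypothetical `MODEL_URL` environment variable carries that address. A real deployment would more likely use the storage provider's SDK and credentials.

```python
import os
import shutil
import tempfile
import urllib.request


def ensure_model_file(model_url=None, model_path=None):
    """Return a local model path, downloading from model_url if it is missing.

    MODEL_URL is a hypothetical variable introduced for this sketch;
    MODEL_PATH is the variable already used by app.py.
    """
    model_url = model_url or os.environ.get("MODEL_URL")
    model_path = model_path or os.environ.get("MODEL_PATH", "/app/model.joblib")

    # Nothing to download: use the cached file or the one baked into the image.
    if not model_url or os.path.exists(model_path):
        return model_path

    target_dir = os.path.dirname(model_path) or "."
    os.makedirs(target_dir, exist_ok=True)

    # Download to a temporary file first so a failed transfer never leaves
    # a truncated model at model_path.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp_file, \
                urllib.request.urlopen(model_url, timeout=60) as response:
            shutil.copyfileobj(response, tmp_file)
        os.replace(tmp_path, model_path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return model_path
```

In app.py, the load step would then become `model = joblib.load(ensure_model_file())`, so the same image can serve whichever model the Deployment's environment variables point at.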
What most people don’t realize is the power of Kubernetes’ built-in network primitives for ML. Services abstract away the IP addresses of your pods, providing a stable endpoint. Ingress controllers can then expose these Services externally, handling TLS termination, load balancing, and routing based on hostnames or request paths (header-based routing depends on the controller you run). This means you can expose multiple model versions or different models, each behind its own Service, through a single external endpoint, with Kubernetes managing the traffic routing.
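To make the path-based routing concrete, here is a small sketch that generates an Ingress manifest sending different URL prefixes to different Services. The Service names `ml-service-v1` and `ml-service-v2` and the Ingress name `ml-models` are hypothetical, and the cluster is assumed to have an Ingress controller installed.

```python
import json


def build_ingress_manifest(name, routes, service_port=80):
    """Build a networking.k8s.io/v1 Ingress that maps URL prefixes to Services.

    `routes` maps a path prefix (e.g. "/v1") to a Service name.
    """
    paths = [
        {
            "path": prefix,
            "pathType": "Prefix",
            "backend": {
                "service": {"name": service, "port": {"number": service_port}},
            },
        }
        for prefix, service in routes.items()
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {"rules": [{"http": {"paths": paths}}]},
    }


# Hypothetical Services, one per model version.
manifest = build_ingress_manifest(
    "ml-models",
    {"/v1": "ml-service-v1", "/v2": "ml-service-v2"},
)
print(json.dumps(manifest, indent=2))
```

Note that a plain Ingress forwards the path unchanged, so a request to `/v1/predict` reaches the backend as `/v1/predict`. The Flask app would need a matching route, or the controller would need its own rewrite configuration, which varies between controllers.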
The next logical step is to integrate with ML-specific platforms like Kubeflow, which provide higher-level abstractions for training, serving, and managing the entire ML lifecycle on Kubernetes.