Serving ML models at scale in Kubernetes isn’t just about packaging your model; it’s about treating the model as a microservice that must be available, observable, and resilient.
Let’s watch a model go live.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model-service
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    containerConcurrency: 2
    # For TensorFlow Serving
    tensorflow:
      storageUri: gs://my-model-bucket/models/my-model/v1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
    # Or for PyTorch TorchServe
    # pytorch:
    #   storageUri: gs://my-model-bucket/models/my-model/v1/my_torch_model.mar
    #   resources:
    #     requests:
    #       cpu: "1"
    #       memory: "2Gi"
    #     limits:
    #       cpu: "2"
    #       memory: "4Gi"
    # Or for custom models with a pre-built container image
    # containers:
    #   - name: kserve-container
    #     image: my-docker-registry/my-custom-model-server:latest
    #     ports:
    #       - containerPort: 8080
    #     resources:
    #       requests:
    #         cpu: "1"
    #         memory: "2Gi"
    #       limits:
    #         cpu: "2"
    #         memory: "4Gi"
This InferenceService custom resource, part of the KServe project (formerly KFServing), is the core abstraction: it defines how your model should be served. The predictor section does most of the work. There you specify the model’s location (e.g., storageUri for cloud storage) and the type of server (TensorFlow Serving, TorchServe, or a custom container). You also define resource requests and limits: requests are what the scheduler reserves for each model pod, and limits cap how much CPU and memory that pod can consume. minReplicas and maxReplicas bound how many instances of your model server the autoscaler may run, and containerConcurrency caps how many requests a single replica handles simultaneously, preventing overload.
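To make the resource numbers concrete, here is a rough sketch of how much capacity the manifest above asks the scheduler to reserve at minReplicas and maxReplicas. The helper names are hypothetical, and the quantity parser only handles the forms used in this post (whole or milli CPU, binary memory suffixes), not every Kubernetes quantity format.

```python
# Binary suffixes used by Kubernetes memory quantities (subset only).
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "1" or "500m" into cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "2Gi" into bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain bytes


def reserved_capacity(replicas: int, cpu_request: str, memory_request: str):
    """Total (cores, bytes) the scheduler must reserve for `replicas` pods."""
    return replicas * parse_cpu(cpu_request), replicas * parse_memory(memory_request)


for replicas in (2, 10):  # minReplicas and maxReplicas from the manifest
    cores, mem_bytes = reserved_capacity(replicas, "1", "2Gi")
    print(f"{replicas} replicas reserve {cores} cores and {mem_bytes / 2**30} GiB")
```

If the cluster cannot fit the maxReplicas figure, the autoscaler will ask for pods that stay Pending, so it is worth running this arithmetic before picking the upper bound.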
KServe breaks the InferenceService down into several Kubernetes-native components. A Deployment manages the actual model server pods, and a Service provides a stable network endpoint for reaching them. KServe often relies on Knative and Istio for advanced traffic management, including canary deployments, A/B testing, and autoscaling. When you apply this YAML, KServe’s controller sees the new InferenceService and provisions the necessary Kubernetes objects. The model artifacts are typically pulled from remote storage (such as S3, GCS, or MinIO) when the pod starts. This lets you update your model by changing the storageUri in the InferenceService and re-applying it, which triggers a rolling update of your pods.
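One way to script the storageUri change is a JSON merge patch applied with kubectl. The sketch below only builds the patch and prints the command line; it does not talk to a cluster. The helper names and the v2 path are assumptions made for illustration.

```python
import json
import shlex


def storage_uri_patch(framework: str, new_storage_uri: str) -> dict:
    """Build a JSON merge patch that changes only the predictor's storageUri."""
    return {"spec": {"predictor": {framework: {"storageUri": new_storage_uri}}}}


def kubectl_patch_command(name: str, patch: dict, namespace: str = "default") -> str:
    """Render a shell-safe `kubectl patch` command for an InferenceService."""
    return shlex.join([
        "kubectl", "patch", "inferenceservice", name,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", json.dumps(patch),
    ])


# Hypothetical new model version; only this field changes in the resource.
patch = storage_uri_patch("tensorflow", "gs://my-model-bucket/models/my-model/v2")
print(kubectl_patch_command("my-model-service", patch))
```

A merge patch leaves replicas, resources, and everything else in the spec untouched, which keeps the rollout limited to the model artifact itself.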
The real power comes from autoscaling. KServe integrates with Kubernetes’ Horizontal Pod Autoscaler (HPA) and can also leverage Knative’s scale-to-zero capability. The HPA monitors metrics such as CPU utilization, memory usage, or custom metrics (like requests per second) and adjusts the number of pods between minReplicas and maxReplicas. For example, if the target is 80% CPU utilization and your model server pods consistently run above it, the HPA adds replicas; when utilization falls back below the target, it scales down, saving resources. Knative, if used, can even scale to zero replicas when there are no incoming requests (this requires minReplicas: 0 rather than the 2 used above), drastically reducing costs for infrequently accessed models at the price of a cold start on the next request.
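The HPA’s scaling decision follows a documented rule: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured bounds. The sketch below reproduces that rule in isolation so you can try numbers against it. It is a simplification: it ignores pod readiness, missing metrics, and stabilization windows, and the function name is hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,  # HPA's default tolerance around the target
) -> int:
    """Simplified version of the HPA replica calculation."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_metric / target_metric)
    # Never go outside the minReplicas/maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 80% CPU against a 50% target, bounds from the manifest.
print(desired_replicas(4, 80, 50, min_replicas=2, max_replicas=10))
```

Note how maxReplicas acts as a hard ceiling: once the computed value exceeds it, extra load shows up as latency or queued requests rather than more pods.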
When you send a request to the Service endpoint, it’s routed to one of the available model server pods, where the model server, which loaded the model at startup, performs the inference. For TensorFlow Serving, the storageUri points to a directory containing your SavedModel, organized in numbered version subdirectories. TensorFlow Serving loads the latest version by default, or a specific version if you configure one. For TorchServe, you package your model into a .mar file, upload it to storage, and provide the URI; TorchServe then loads this archive. If you’re using a custom container, your application code is responsible for loading the model from wherever it’s accessible (e.g., mounted volumes, cloud storage) and exposing an inference endpoint, usually over HTTP.
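As a sketch of what such a request can look like, the snippet below builds a call in the TensorFlow Serving REST style (POST to /v1/models/<name>:predict with an "instances" payload). It assumes the model is exposed under the InferenceService name, and the base URL, Host header value, and input shape are placeholders; the function name is hypothetical. It builds the request without sending it.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, instances, host_header=None):
    """Build (but do not send) a TF Serving style predict request."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if host_header:
        # Needed when calling through an ingress gateway that routes by host.
        headers["Host"] = host_header
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_predict_request(
    "http://localhost:8080",                 # placeholder endpoint
    "my-model-service",                      # assumed to match metadata.name
    [[1.0, 2.0]],                            # placeholder input
    host_header="my-model-service.default.example.com",  # placeholder host
)
print(request.get_method(), request.full_url)

# To actually send it against a running service:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response))
```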
The secret sauce for efficient scaling often lies in how the model server itself handles concurrency and how Kubernetes allocates resources. While containerConcurrency limits requests per pod, the underlying model server might have its own internal threading or batching mechanisms. For instance, a TensorFlow Serving instance might be configured to batch incoming requests to maximize GPU utilization, but it can only batch requests that actually reach the pod, so a very low containerConcurrency also caps the effective batch size. Kubernetes’ resource requests and limits are just as critical. If the memory limit is too low, pods will be OOMKilled (killed for running out of memory); if the CPU limit is too low, they will be throttled. Either way you get instability or latency spikes. If you set requests too high, you might not be able to schedule enough pods on your nodes. The interplay between containerConcurrency, the model server’s internal processing, and Kubernetes resource management is key to achieving high throughput and low latency.
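A back-of-the-envelope way to relate containerConcurrency to replica count is Little’s law: the average number of in-flight requests equals arrival rate times latency. The sketch below uses it to estimate how many replicas a given load needs. The traffic and latency figures are made-up inputs, the headroom factor is an assumption, and the function name is hypothetical; real sizing should be checked against load tests.

```python
import math


def replicas_needed(
    requests_per_second: float,
    avg_latency_seconds: float,
    container_concurrency: int,
    headroom: float = 0.7,  # assumed: plan to use 70% of each pod's slots
) -> int:
    """Estimate replicas from Little's law: in-flight = rate * latency."""
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    in_flight = requests_per_second * avg_latency_seconds
    usable_slots_per_replica = container_concurrency * headroom
    return math.ceil(in_flight / usable_slots_per_replica)


# Hypothetical load: 100 requests/s at 50 ms each, containerConcurrency: 2.
print(replicas_needed(100, 0.05, 2))
```

If the estimate lands above maxReplicas, the options are to raise the ceiling, raise containerConcurrency (provided the model server can really handle more parallel work), or cut per-request latency.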
If your model server container listens on a port other than the default 8080, declare it explicitly with containerPort under ports in the container section of your predictor spec.