Building an end-to-end MLOps pipeline isn’t just about stitching together tools; it’s about creating a living, breathing system that continuously learns and adapts. The surprising part is that the biggest bottleneck usually isn’t model training itself, but the friction in getting a model from a data scientist’s laptop into production reliably and repeatedly.
Let’s see this in action. Imagine a simple pipeline for image classification.
```yaml
# config/training.yaml
model:
  name: resnet50
  pretrained: true
data:
  dataset: cifar10
  batch_size: 64
  image_size: 32
optimizer:
  type: adam
  lr: 0.001
training:
  epochs: 50
  device: cuda:0
```
This config drives a Python script that fetches CIFAR-10, loads a pre-trained ResNet50, trains it for 50 epochs, and saves the model artifacts. But the magic of MLOps is what happens after this script runs.
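One way the training script might consume this config is to validate it into a typed object before any training starts. The sketch below is an assumption, not the script itself: `TrainingConfig` and `parse_config` are hypothetical names, and the nested dict stands in for whatever a YAML parser would return for the file.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class TrainingConfig:
    """Flattened, validated view of config/training.yaml (hypothetical helper)."""
    model_name: str
    pretrained: bool
    dataset: str
    batch_size: int
    image_size: int
    optimizer: str
    lr: float
    epochs: int
    device: str


def parse_config(raw: Mapping[str, Any]) -> TrainingConfig:
    """Validate a nested mapping, e.g. the result of parsing the YAML file."""
    cfg = TrainingConfig(
        model_name=raw["model"]["name"],
        pretrained=bool(raw["model"]["pretrained"]),
        dataset=raw["data"]["dataset"],
        batch_size=int(raw["data"]["batch_size"]),
        image_size=int(raw["data"]["image_size"]),
        optimizer=raw["optimizer"]["type"],
        lr=float(raw["optimizer"]["lr"]),
        epochs=int(raw["training"]["epochs"]),
        device=raw["training"]["device"],
    )
    # Fail fast on values that would only blow up deep inside a training loop.
    if cfg.batch_size <= 0 or cfg.epochs <= 0:
        raise ValueError("batch_size and epochs must be positive")
    if cfg.lr <= 0:
        raise ValueError("lr must be positive")
    return cfg


# Same values as the YAML file, written as the nested dict a parser would yield.
RAW = {
    "model": {"name": "resnet50", "pretrained": True},
    "data": {"dataset": "cifar10", "batch_size": 64, "image_size": 32},
    "optimizer": {"type": "adam", "lr": 0.001},
    "training": {"epochs": 50, "device": "cuda:0"},
}

config = parse_config(RAW)
print(config.model_name, config.epochs)
```

Freezing the dataclass means a run cannot silently mutate its own hyperparameters halfway through, which keeps the logged config honest.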
The core problem MLOps solves is the "it works on my machine" syndrome, scaled up. It transforms ad-hoc, manual model development into a predictable, automated process. This involves several key stages:
- Data Ingestion & Versioning: Raw data lands in a central store (such as an S3 or GCS bucket). Tools like DVC (Data Version Control) or LakeFS version datasets, so you can always reproduce a specific training run from the exact data it saw. For example, `dvc add data/raw/images` followed by a Git commit of the generated `data/raw/images.dvc` file creates a snapshot.
- Feature Engineering: This is where raw data is transformed into model-ready features. This process must be deterministic and versioned. A common pattern is to use a feature store (like Feast or Tecton), which manages feature definitions and their computation, ensuring consistency between training and inference.
- Model Training & Experiment Tracking: This is where the code driven by our `config/training.yaml` runs. Crucially, every training run is logged with tools like MLflow or Weights & Biases, capturing hyperparameters, metrics, code versions, and the trained model artifacts. In an MLflow project, launching a run might look like: `mlflow run . -P model_name=resnet50 -P epochs=50`.
- Model Evaluation & Validation: After training, the model isn’t just thrown into production. It’s evaluated against predefined metrics on a hold-out dataset. This might involve checking accuracy, precision, recall, or domain-specific metrics. If it misses a threshold (e.g., accuracy below 0.85), it’s rejected. This validation step is often automated in CI/CD pipelines.
- Model Registry & Versioning: Validated models are stored in a model registry (like MLflow Model Registry or SageMaker Model Registry). Each model version gets a unique identifier and metadata, which allows for easy rollback and auditing.
- Model Deployment: This is where the model becomes an API endpoint. Options range from simple REST APIs (e.g., using FastAPI with a loaded model) to managed services (SageMaker Endpoints, Vertex AI Endpoints). Deployment strategies like canary releases or A/B testing are common. For a basic FastAPI deployment, you might have a `main.py` serving predictions.
- Monitoring & Alerting: Once deployed, the model’s performance in production is continuously monitored. This includes technical metrics (latency, error rates) and ML-specific metrics (data drift, concept drift, prediction drift). Alerts are triggered if performance degrades, signaling a need for retraining.
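To make the data-drift part of monitoring concrete, here is a minimal sketch using the Population Stability Index (PSI), one common drift score among several. The choice of PSI, the function names, the 10-bin layout, and the 0.2 alert threshold (a widely used rule of thumb) are all assumptions for illustration, not something this pipeline prescribes.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so live values outside the reference range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_detected(expected: Sequence[float], actual: Sequence[float],
                   threshold: float = 0.2) -> bool:
    """Flag drift when PSI exceeds the (assumed) alert threshold."""
    return psi(expected, actual) > threshold


# Synthetic feature values: the live sample is shifted towards the upper half.
reference = [i / 100 for i in range(100)]
live = [0.5 + i / 200 for i in range(100)]

print(drift_detected(reference, reference))  # identical samples
print(drift_detected(reference, live))       # shifted sample
```

In practice a check like this would run per feature on a schedule, with the reference sample taken from the training data version recorded for the deployed model.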
The entire process is orchestrated by a CI/CD system (Jenkins, GitLab CI, GitHub Actions). A code commit to the model repository can trigger a retraining pipeline, followed by evaluation, registration, and potentially an automated deployment.
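Stripped of any particular CI/CD product, that flow is a chain of stages with a gate in the middle. The sketch below is a toy model of it under stated assumptions: `run_pipeline`, `PipelineResult`, and the list-backed registry are hypothetical stand-ins, and the 0.85 threshold is the example value from the evaluation stage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

ACCURACY_THRESHOLD = 0.85  # example gate value from the evaluation stage


@dataclass
class PipelineResult:
    commit: str
    accuracy: float
    registered_version: Optional[int] = None
    deployed: bool = False
    log: List[str] = field(default_factory=list)


def run_pipeline(
    commit: str,
    train_and_evaluate: Callable[[str], float],
    registry: List[Dict[str, object]],
    auto_deploy: bool = False,
) -> PipelineResult:
    """Retrain on a commit, gate on accuracy, then register and optionally deploy."""
    result = PipelineResult(commit=commit, accuracy=train_and_evaluate(commit))
    result.log.append(f"evaluated: accuracy={result.accuracy:.3f}")

    # Gate: a model below the threshold never reaches the registry.
    if result.accuracy < ACCURACY_THRESHOLD:
        result.log.append("rejected: below threshold")
        return result

    registry.append({"commit": commit, "accuracy": result.accuracy})
    result.registered_version = len(registry)  # versions count up from 1
    result.log.append(f"registered as version {result.registered_version}")

    if auto_deploy:
        result.deployed = True  # placeholder for a real deployment step
        result.log.append("deployed")
    return result


registry: List[Dict[str, object]] = []
# The lambdas stand in for real training jobs and return made-up accuracies.
good = run_pipeline("abc123", lambda commit: 0.91, registry, auto_deploy=True)
bad = run_pipeline("def456", lambda commit: 0.80, registry)
print(good.log)
print(bad.log)
```

A real CI/CD system would run each stage as a separate job, but the ordering and the early exit on a failed gate carry over directly.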
The levers you control are primarily configuration, code, and infrastructure.
- Configuration: The `config/training.yaml` example shows how you abstract away training parameters. This makes it easy to experiment with different hyperparameters without touching core training code.
- Code: Version-controlled Python scripts for data processing, training, and inference. Reproducibility hinges on immutable code versions.
- Infrastructure: The compute resources (CPUs, GPUs), storage, and networking that host your data, run your training jobs, and serve your models.
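The configuration lever is easiest to appreciate with overrides: the base config stays in version control while each experiment changes only the keys it needs. A minimal sketch, assuming a hypothetical `apply_overrides` helper and a dotted `section.key=value` syntax of my own choosing:

```python
import copy
from typing import Any, Dict, Iterable


def apply_overrides(config: Dict[str, Any], overrides: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of config with overrides such as 'optimizer.lr=0.01' applied."""
    updated = copy.deepcopy(config)  # never mutate the base config
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = updated
        for key in parents:
            node = node[key]  # raises KeyError for unknown sections
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        current = node[leaf]
        # Cast to the existing value's type; bool first, since bool is a subclass of int.
        if isinstance(current, bool):
            node[leaf] = raw_value.lower() in ("1", "true", "yes")
        else:
            node[leaf] = type(current)(raw_value)
    return updated


BASE = {
    "model": {"name": "resnet50", "pretrained": True},
    "optimizer": {"type": "adam", "lr": 0.001},
    "training": {"epochs": 50, "device": "cuda:0"},
}

experiment = apply_overrides(BASE, ["optimizer.lr=0.01", "training.epochs=100"])
print(experiment["optimizer"], experiment["training"])
```

Rejecting unknown keys is a deliberate choice here: a typo like `optimzer.lr` should fail loudly rather than silently train with the default.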
A key aspect that often gets overlooked is the distinction between model versioning and data versioning. You can have the same model code trained on different versions of the dataset, or the same dataset used to train different versions of the model. Both need to be tracked independently to ensure true reproducibility and to understand the impact of changes. For instance, if model_v2 performs worse than model_v1, you need to know if it was a change in the model architecture, a change in the training data, or a change in the training hyperparameters.
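One way to keep these dimensions separate is to store a small lineage record with every trained model and diff the records when results change. This is a sketch under assumptions: `RunRecord`, `config_hash`, and `changed_inputs` are hypothetical names, and the version strings are placeholders rather than real commits or dataset snapshots.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Dict, List


def config_hash(config: Dict[str, Any]) -> str:
    """Stable short hash of a hyperparameter config, independent of key order."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    model_version: str
    code_version: str  # e.g. a Git commit hash
    data_version: str  # e.g. a dataset snapshot identifier
    config_hash: str   # hash of the hyperparameters used


def changed_inputs(old: RunRecord, new: RunRecord) -> List[str]:
    """List which tracked inputs differ between two training runs."""
    tracked = ("code_version", "data_version", "config_hash")
    return [name for name in tracked if getattr(old, name) != getattr(new, name)]


params = {"optimizer": {"type": "adam", "lr": 0.001}, "training": {"epochs": 50}}

# Placeholder identifiers: only the dataset changes between the two runs.
v1 = RunRecord("model_v1", "commit-a", "data-2024-01", config_hash(params))
v2 = RunRecord("model_v2", "commit-a", "data-2024-02", config_hash(params))

print(changed_inputs(v1, v2))  # ['data_version']
```

With records like these, the question of why `model_v2` regressed starts from a short list of what actually changed instead of guesswork.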
The next concept you’ll grapple with is how to automate the decision-making process for retraining and redeployment based on live monitoring data.