ML CI/CD: Automate Training & Deployment

The most surprising thing about automating model training and deployment is that the "model" itself is often the least important part of the pipeline.

Let’s watch a typical CI/CD pipeline for machine learning in action. Imagine a Git repository holding our code, including data preprocessing scripts, model training scripts (e.g., a Python file using scikit-learn or TensorFlow), and a Dockerfile to containerize our model serving application.

# .github/workflows/ci-cd.yml
name: ML CI/CD Pipeline

on:
  push:
    branches:
      - main

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install mlflow

      - name: Train and log model
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
          MLFLOW_EXPERIMENT_NAME: my-model-experiment
        run: |
          python train.py --data-path data/train.csv --model-output ./model

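      # Pushing requires registry credentials; the secret names here are placeholders
      - name: Log in to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          # Tag with the commit SHA so each build is traceable and triggers a rollout
          tags: your-dockerhub-username/my-ml-model:${{ github.sha }}
          file: Dockerfile

      # Example for AKS, adjust for your platform
      - name: Log in to Azure
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Set AKS cluster context
        uses: azure/aks-set-context@v3
        with:
          resource-group: my-resource-group
          cluster-name: my-aks-cluster

      - name: Deploy to Kubernetes
        uses: azure/k8s-deploy@v4
        with:
          namespace: ml-models
          manifests: k8s/deployment.yaml
          images: your-dockerhub-username/my-ml-model:${{ github.sha }}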

Here’s what’s happening:

Checkout Code: Fetches the latest version of our ML project from Git.
Set Up Python: Pins the interpreter version (3.9 here) so every run uses the same Python environment.
Install Dependencies: Installs all necessary libraries, crucially including mlflow for experiment tracking.
Train and Log Model: train.py trains the model and logs its parameters, metrics (accuracy, loss), and the trained model artifact to an MLflow tracking server, which makes each run auditable and easier to reproduce. MLFLOW_TRACKING_URI tells the MLflow client which tracking server to log to, and MLFLOW_EXPERIMENT_NAME selects the experiment the run is recorded under.
Build and Push Docker Image: The Dockerfile defines how to package our model serving application (e.g., a Flask API wrapping the trained model). This image is then pushed to a container registry (like Docker Hub or a private registry).
Deploy to Kubernetes: The containerized application is deployed to a Kubernetes cluster. This step typically involves updating a Kubernetes Deployment resource to use the newly built Docker image.
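
To make the training step more concrete, here is a minimal sketch of what a train.py compatible with the workflow's two flags might look like. It is not the script from any particular project: the `target` label column, the extra `--target-column` and `--max-iter` flags, the logistic regression model, the 80/20 split, and the `model.joblib` file name are all assumptions chosen for illustration.

```python
"""Sketch of a train.py that matches the --data-path / --model-output flags."""
import argparse
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train a model and log it to MLflow.")
    parser.add_argument("--data-path", required=True, help="CSV file with training data")
    parser.add_argument("--model-output", required=True, help="Directory for the saved model")
    # Hypothetical extras, not part of the workflow above
    parser.add_argument("--target-column", default="target")
    parser.add_argument("--max-iter", type=int, default=200)
    return parser.parse_args(argv)


def train_and_log(args):
    # Imported here so that argument parsing and --help do not load the ML stack.
    import joblib
    import mlflow
    import mlflow.sklearn
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, log_loss
    from sklearn.model_selection import train_test_split

    data = pd.read_csv(args.data_path)
    X = data.drop(columns=[args.target_column])
    y = data[args.target_column]
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # The MLflow client picks up MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_NAME
    # from the environment, so neither needs to be hard-coded here.
    with mlflow.start_run():
        mlflow.log_params({"max_iter": args.max_iter, "test_size": 0.2})
        model = LogisticRegression(max_iter=args.max_iter).fit(X_train, y_train)
        mlflow.log_metric("accuracy", accuracy_score(y_val, model.predict(X_val)))
        mlflow.log_metric(
            "loss",
            log_loss(y_val, model.predict_proba(X_val), labels=model.classes_),
        )
        mlflow.sklearn.log_model(model, "model")

    # Also write the model to disk so the Docker build can copy it into the image.
    out_dir = Path(args.model_output)
    out_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out_dir / "model.joblib")
```

To use this as the script the workflow invokes, end the file with a standard `__main__` guard that calls `train_and_log(parse_args())`. Keeping the tracking URI and experiment name in environment variables means the same script runs unchanged on a laptop and in CI.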

The problem this solves is the "last mile" problem of ML: getting a trained model from a data scientist’s laptop into a production environment where it can serve predictions. Traditionally, this was a manual, error-prone process involving ad-hoc scripts and significant coordination. MLOps CI/CD automates this, treating model training and deployment with the same rigor as software code.

Internally, the pipeline orchestrates several key components:

Version Control System (VCS): Git, acting as the single source of truth for code and configuration.
CI/CD Platform: GitHub Actions, GitLab CI, Jenkins, etc., which trigger and manage the pipeline execution.
Experiment Tracking: MLflow, Weights & Biases, or similar, to log model training runs, parameters, and metrics. This is crucial for comparing different model versions and debugging.
Containerization: Docker, to package the model and its serving code into a portable, reproducible unit.
Container Registry: Docker Hub, AWS ECR, Google Artifact Registry, etc., to store the built Docker images.
Orchestration/Deployment Platform: Kubernetes, AWS SageMaker Endpoints, Azure ML Endpoints, etc., to host and serve the model.

The exact levers you control are primarily in the code that runs within the pipeline: the train.py script, the Dockerfile, and the Kubernetes deployment manifests (k8s/deployment.yaml). You define what gets trained, how it’s packaged, and where it’s deployed. The CI/CD platform then automates the execution of these definitions.

Most people focus on the model performance metrics during training. However, the actual artifact being deployed is the container image. If your Dockerfile has a subtle bug, like installing a different version of a library than what your training script used, your deployed model might fail in production even if your training metrics looked perfect. This is why the build and push step for the Docker image is as critical as the training step itself.
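
One way to catch this kind of drift early is to compare the library versions installed in the image against the versions pinned at training time. The sketch below assumes the training environment was captured as a requirements file with exact `name==version` pins; the function names are hypothetical, and the parser deliberately ignores anything that is not an exact pin.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for lines of the form `name==version`."""
    pins = {}
    for raw in requirements_text.splitlines():
        # Drop trailing comments and environment markers (e.g. `; python_version>="3.9"`)
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # unpinned or empty lines are ignored in this sketch
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(package):
    """Version installed in the current environment, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (expected, actual)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        actual = lookup(name)
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

A check like this could run as a late step in the Docker build or when the serving container starts, reading the training-time requirements file and exiting with a non-zero status if `find_mismatches` returns anything, so a mismatched image fails before it serves traffic.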

The next challenge is often setting up robust model monitoring after deployment.