MLOps teams are often described as the glue holding together data science and operations, but the surprising truth is that a well-structured MLOps practice often reduces the need for a dedicated "MLOps team" altogether.
Here’s how it actually looks in practice. Imagine a company that’s just launched its first production model: a recommendation engine for an e-commerce site.
The Model: A Python-based scikit-learn model, trained on historical user clickstream data. It outputs a list of 10 recommended products per user.
The Infrastructure:
- Data Storage: PostgreSQL for user profiles, S3 for historical clickstream logs.
- Feature Engineering: Python scripts using Pandas, run daily via Airflow.
- Model Training: A dedicated EC2 instance with GPUs, managed by SageMaker.
- Model Registry: MLflow for tracking experiments and storing trained models.
- Deployment: A Flask API containerized with Docker, deployed on AWS ECS.
- Monitoring: Prometheus for system metrics, Grafana for dashboards, and a custom Python script for drift detection.
The "MLOps" Process in Action:
- Data Ingestion & Feature Engineering:
  - An Analytics Engineer (or a Data Engineer with strong SQL skills) is responsible for the daily Airflow DAGs that pull raw clickstream data from S3, join it with user profiles from PostgreSQL, and run the feature transformation scripts. They ensure data quality and availability.
  - Config Snippet (Airflow DAG):

        from airflow import DAG
        from airflow.operators.python import PythonOperator
        from datetime import datetime

        def run_feature_engineering():
            # ... (Python code to load data, transform, save features) ...
            print("Feature engineering complete.")

        with DAG(
            dag_id='feature_engineering_pipeline',
            start_date=datetime(2023, 1, 1),
            schedule_interval='@daily',
            catchup=False
        ) as dag:
            feature_task = PythonOperator(
                task_id='run_feature_engineering_script',
                python_callable=run_feature_engineering
            )
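The body of `run_feature_engineering` is elided above. As a rough illustration of the "join clickstream with user profiles" step, here is a minimal Pandas sketch. The function name `build_user_features`, the column names (`user_id`, `product_id`), and the two count features are assumptions made for the example, not the pipeline's actual schema.

```python
import pandas as pd


def build_user_features(clicks: pd.DataFrame, profiles: pd.DataFrame) -> pd.DataFrame:
    """Aggregate clickstream events per user and join them onto user profiles."""
    # One row per user: total clicks and number of distinct products clicked.
    per_user = (
        clicks.groupby("user_id")
        .agg(
            click_count=("product_id", "size"),
            distinct_products=("product_id", "nunique"),
        )
        .reset_index()
    )
    # A left join keeps users without any clicks; their counts become zero.
    features = profiles.merge(per_user, on="user_id", how="left")
    count_cols = ["click_count", "distinct_products"]
    features[count_cols] = features[count_cols].fillna(0).astype(int)
    return features
```

In a real DAG, the two input frames would be loaded from S3 and PostgreSQL before this function is called, and the result written wherever the training job expects it.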
- Model Training & Experimentation:
  - A Machine Learning Scientist (or a Data Scientist specializing in modeling) uses SageMaker to run training jobs. They leverage MLflow to log hyperparameters, metrics, and model artifacts for each experiment.
  - Command Line Example (MLflow run, launched from a project directory whose MLproject file defines a train entry point):

        mlflow run . -e train --experiment-id 12345 -P learning_rate=0.01 -P epochs=10

  - The scientist selects the best-performing model artifact from MLflow for deployment.
- Model Packaging & Deployment:
  - The Machine Learning Scientist or a Software Engineer with an interest in deployment packages the trained model (e.g., a model.pkl file) and its inference code (e.g., predict.py) into a Docker image. This image is pushed to a container registry like ECR.
  - Dockerfile Snippet:

        FROM python:3.9-slim
        WORKDIR /app
        COPY requirements.txt .
        RUN pip install --no-cache-dir -r requirements.txt
        COPY predict.py .
        COPY model.pkl .
        EXPOSE 8000
        CMD ["python", "predict.py"]

  - An Infrastructure Engineer (or a DevOps Engineer) manages the ECS service, ensuring the Docker image is deployed, scaled, and exposed via a load balancer. They set up auto-scaling rules based on CPU utilization.
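The article does not show `predict.py`, so the following is only a sketch of what such a file might contain, given that the service is a Flask API listening on port 8000. The `/recommend` and `/health` routes, the JSON payload shape, and the `create_app` factory are hypothetical. The sketch also assumes that `model.predict` returns a ranked sequence of product IDs for each input row, which the original setup does not specify.

```python
import pickle

from flask import Flask, jsonify, request


def create_app(model):
    """Build the Flask app around an already-loaded model object."""
    app = Flask(__name__)

    @app.route("/health")
    def health():
        # Lightweight endpoint for load balancer health checks.
        return jsonify(status="ok")

    @app.route("/recommend", methods=["POST"])
    def recommend():
        payload = request.get_json(silent=True) or {}
        features = payload.get("features")
        if features is None:
            return jsonify(error="missing 'features'"), 400
        # Assumption: predict() yields ranked product ids per input row.
        ranked = model.predict([features])[0]
        return jsonify(recommendations=[int(p) for p in ranked][:10])

    return app


def main():
    # Load the pickled model copied into the image next to this file.
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)
    create_app(model).run(host="0.0.0.0", port=8000)
```

For the Dockerfile's `CMD ["python", "predict.py"]` to start the server, the file would end by calling `main()`. Keeping model loading out of `create_app` makes the routes testable with a stub model and Flask's test client.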
- Monitoring & Alerting:
  - The Infrastructure Engineer configures Prometheus to scrape metrics from the ECS service (CPU, memory, request latency, error rates).
  - The Machine Learning Scientist or Data Scientist develops and deploys the custom drift detection script. This script might periodically fetch recent predictions and compare them to ground truth (when available), or analyze input feature distributions against training data. Alert rules are defined in Prometheus and routed through Alertmanager.
  - Configuration Snippet (Prometheus prometheus.yml):

        scrape_configs:
          - job_name: 'recommendation_api'
            ec2_sd_configs:
              - region: us-east-1
                port: 8000
                filters:
                  - name: "tag:ServiceName"
                    values: ["recommendation-api"]
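The drift detection script itself is not shown. One common way to "analyze input feature distributions against training data" is the Population Stability Index (PSI), sketched below with NumPy for a single numeric feature. This is a swapped-in technique chosen for illustration, not necessarily what the custom script uses. The function name and the default of 10 quantile bins are assumptions.

```python
import numpy as np


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a training-time sample (expected) and a recent sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from quantiles of the training sample.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    # Use interior edges only, so out-of-range values fall into the end bins.
    inner = edges[1:-1]
    exp_counts = np.bincount(
        np.searchsorted(inner, expected, side="right"), minlength=bins
    )
    act_counts = np.bincount(
        np.searchsorted(inner, actual, side="right"), minlength=bins
    )
    # Clip to avoid division by zero or log(0) for empty bins.
    exp_frac = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_frac = np.clip(act_counts / act_counts.sum(), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions rather than guarantees, and a team would normally tune them per feature before wiring the result into an alert.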
The Mental Model:
The "MLOps team" isn’t a silo; it’s a distributed responsibility. The core idea is to embed MLOps principles and tooling into the existing workflows of data engineers, data scientists, and software/infrastructure engineers. The "MLOps" function becomes an enablement layer, providing standardized tools, automation frameworks, and best practices that these individuals can adopt.
- Data Engineers/Analytics Engineers: Focus on reliable data pipelines and feature stores. Their MLOps concern is data versioning, quality checks, and efficient feature retrieval.
- Machine Learning Scientists/Data Scientists: Focus on model development, experimentation, and performance. Their MLOps concern is experiment tracking, reproducibility, model versioning, and understanding model drift.
- Software/Infrastructure Engineers: Focus on scalable, reliable, and observable systems. Their MLOps concern is CI/CD for ML, containerization, infrastructure as code, and system monitoring.
The common thread is automation and collaboration. Tools like Airflow, MLflow, Docker, Kubernetes, and cloud-specific ML platforms (SageMaker, Vertex AI, Azure ML) are the enablers. The "MLOps team" might exist as a small central group to define standards, build internal platforms, and provide guidance, but the execution of MLOps is spread across these roles.
The one thing most people don’t know is that the most robust MLOps systems are built not by a dedicated MLOps team, but by empowering existing teams with the right tools and clear ownership boundaries. In this context, MLOps becomes a set of engineering practices and a cultural shift towards treating ML models as production software rather than academic research artifacts. This means rigorous testing, automated deployment, continuous monitoring, and rapid iteration, all integrated into the daily work of the people already building and deploying software.
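As one small illustration of what "rigorous testing" can mean for a model, a CI job might check a candidate model's output against a simple contract before it is promoted. The sketch below is hypothetical: `validate_recommendations` and the `catalog` argument are names invented for this example, and the default of 10 items comes from the recommendation engine described earlier.

```python
def validate_recommendations(recs, catalog, k=10):
    """Return a list of contract violations for one user's recommendations."""
    problems = []
    if len(recs) != k:
        problems.append(f"expected {k} items, got {len(recs)}")
    if len(set(recs)) != len(recs):
        problems.append("duplicate product ids")
    # Every recommended id should exist in the current product catalog.
    unknown = [r for r in recs if r not in catalog]
    if unknown:
        problems.append(f"unknown product ids: {unknown}")
    return problems
```

An empty list means the output passed. Returning all violations at once, rather than failing on the first, tends to make CI logs more useful when a new model breaks several expectations together.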
This distributed model leads to faster iteration cycles and a deeper sense of ownership over the entire ML lifecycle, from data to production and back.
The next concept you’ll run into is how to effectively manage and version the data used for training and inference, especially as datasets grow and evolve.