Data validation in MLOps is less about catching bugs and more about keeping models from silently degrading as their input data drifts.

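Here’s a concrete example of a data validation pipeline in action, using great_expectations with a Pandas DataFrame. Imagine we’re building a model to predict customer churn. The snippet uses the library’s legacy PandasDataset interface; recent Great Expectations releases expose a different API, so it needs an older version to run as written.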

import pandas as pd
from great_expectations.dataset import PandasDataset

# Simulate incoming customer data
data = {
    'customer_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'account_age_months': [12, 24, 6, 30, 18, 5, 15, 28, 10, 22],
    'monthly_charges': [55.50, 80.20, 45.00, 95.75, 70.00, 40.00, 60.30, 88.00, 50.00, 75.50],
    'total_charges': [666.00, 1924.80, 270.00, 2872.50, 1260.00, 200.00, 904.50, 2464.00, 500.00, 1661.00],
    'churn': [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
}
df = pd.DataFrame(data)

# Load data into Great Expectations
expectations_data_context = PandasDataset(df)

# Define expectations (assertions about the data)
expectations_data_context.expect_column_to_exist("customer_id")
expectations_data_context.expect_column_values_to_be_unique("customer_id")
expectations_data_context.expect_column_values_to_be_between("account_age_months", min_value=0, max_value=60)
expectations_data_context.expect_column_values_to_be_between("monthly_charges", min_value=0)
expectations_data_context.expect_column_values_to_be_between("total_charges", min_value=0)
# The column mean should stay within 65.0 ± 10.0
expectations_data_context.expect_column_mean_to_be_between("monthly_charges", min_value=55.0, max_value=75.0)
expectations_data_context.expect_column_values_to_be_in_set("churn", value_set=[0, 1])

# Run the validation
validation_results = expectations_data_context.validate()

# Print results
print(validation_results)

This snippet defines a set of checks and runs them against a dataset. The methods expect_column_to_exist, expect_column_values_to_be_unique, expect_column_values_to_be_between, expect_column_mean_to_be_between, and expect_column_values_to_be_in_set are all "expectations": assertions about the data’s properties. When validate() is called, Great Expectations evaluates the data against every expectation and reports, for each one, whether it passed.

The core problem data validation solves in MLOps is data drift. Over time, the real-world data your model encounters can subtly change. This change, even if it doesn’t break the system outright, can degrade model performance without any obvious errors. Think of a spam filter trained on emails from 2010; it would likely perform poorly today because the nature of spam has evolved. Data validation pipelines act as an automated quality control system, catching these shifts before they impact model predictions.

Tools like Great Expectations are built around a set of "expectations" that you define to describe the desired characteristics of your data. These expectations can cover a wide range of properties:

  • Column Existence and Data Types: Ensuring expected columns are present and have the correct data type (e.g., int, float, string).
  • Value Ranges and Distributions: Verifying that numerical values fall within reasonable bounds (e.g., age cannot be negative) or that the distribution of values (e.g., mean, median, standard deviation) remains consistent with historical patterns.
  • Uniqueness and Cardinality: Checking for duplicate identifiers or ensuring categorical columns have a predictable number of unique values.
  • Row Counts: Monitoring the volume of incoming data.
  • SQL-like Queries: For more complex rules, you can define expectations using SQL queries that must return zero rows (e.g., SELECT * FROM table WHERE invalid_condition).
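To make these categories concrete, here is a minimal library-independent sketch of one check per category, written with only the Python standard library. It is not how Great Expectations implements them; the helper names (columns_and_types_ok, values_between, and so on), the batch table name, and the sample rows are all hypothetical and chosen for illustration.

```python
import sqlite3

# Hypothetical batch: a list of row dicts standing in for a DataFrame.
rows = [
    {"customer_id": 101, "account_age_months": 12, "monthly_charges": 55.5, "churn": 0},
    {"customer_id": 102, "account_age_months": 24, "monthly_charges": 80.2, "churn": 0},
    {"customer_id": 103, "account_age_months": 6, "monthly_charges": 45.0, "churn": 1},
]
schema = {"customer_id": int, "account_age_months": int, "monthly_charges": float, "churn": int}


def columns_and_types_ok(rows, schema):
    """Column existence and data types: each row has each column with the expected type."""
    return all(
        col in row and isinstance(row[col], typ)
        for row in rows
        for col, typ in schema.items()
    )


def values_between(rows, column, min_value, max_value):
    """Value ranges: every value lies within the inclusive bounds."""
    return all(min_value <= row[column] <= max_value for row in rows)


def values_unique(rows, column):
    """Uniqueness: no value in the column appears twice."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def row_count_between(rows, min_rows, max_rows):
    """Row counts: the batch size is within the expected volume."""
    return min_rows <= len(rows) <= max_rows


def query_returns_no_rows(rows, sql):
    """SQL-style rule: load the batch into in-memory SQLite; the query must find nothing."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE batch (customer_id INTEGER, account_age_months INTEGER, "
            "monthly_charges REAL, churn INTEGER)"
        )
        conn.executemany(
            "INSERT INTO batch VALUES "
            "(:customer_id, :account_age_months, :monthly_charges, :churn)",
            rows,
        )
        return conn.execute(sql).fetchone() is None
    finally:
        conn.close()


results = {
    "columns_and_types": columns_and_types_ok(rows, schema),
    "account_age_in_range": values_between(rows, "account_age_months", 0, 60),
    "customer_id_unique": values_unique(rows, "customer_id"),
    "row_count": row_count_between(rows, 1, 1000),
    "no_negative_charges": query_returns_no_rows(
        rows, "SELECT * FROM batch WHERE monthly_charges < 0"
    ),
}
for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Each check returns a plain boolean, which keeps the sketch short; a real tool would also record which rows or values caused a failure.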

When a new batch of data arrives, the validation pipeline runs these expectations against it. If any expectation fails, it signals a potential issue. This failure can then trigger alerts, halt the downstream model training or inference pipeline, or initiate an investigation.
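The "fail, alert, halt" behavior can be sketched as a small gate placed in front of the downstream step. This is an illustrative pattern under stated assumptions, not a Great Expectations feature: DataValidationError, gate, and run_pipeline are hypothetical names, the results dict maps a check name to a pass/fail boolean, and alert defaults to print as a stand-in for a real notification channel.

```python
class DataValidationError(Exception):
    """Raised when a batch fails validation and downstream steps must not run."""


def gate(results, alert=print):
    """Alert and raise if any check failed; return quietly otherwise."""
    failed = sorted(name for name, passed in results.items() if not passed)
    if failed:
        alert("Data validation failed: " + ", ".join(failed))
        raise DataValidationError(failed)


def run_pipeline(results, downstream_step, alert=print):
    """Run the downstream step (training or inference) only if the gate passes."""
    gate(results, alert=alert)
    return downstream_step()


# Hypothetical validation outcomes for two batches.
good_batch = {"customer_id_unique": True, "row_count": True}
bad_batch = {"customer_id_unique": True, "row_count": False}

print(run_pipeline(good_batch, lambda: "model trained"))

try:
    run_pipeline(bad_batch, lambda: "model trained")
except DataValidationError as exc:
    print("pipeline halted, failed checks:", exc.args[0])
```

Raising an exception rather than returning a flag means a caller cannot accidentally ignore a failed batch; the downstream step simply never runs.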

The "levers" you control are primarily the set of expectations you define and the thresholds you set for each. You can tailor these expectations to the specific characteristics of your data and the requirements of your model. For instance, for a financial model, you might have very strict expectations on the range of transaction amounts and the absence of negative values. For a natural language processing model, you might focus on vocabulary coverage and document length distributions.
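One way to picture the threshold lever is a range check that tolerates a configurable fraction of out-of-range values. The sketch below is a plain-Python illustration with a hypothetical helper, passes_between, and invented transaction amounts; Great Expectations exposes a similar knob through the mostly argument on many column-level expectations, but this is not its implementation.

```python
def passes_between(values, min_value, max_value, mostly=1.0):
    """Pass if at least a `mostly` fraction of values lies within the inclusive bounds."""
    if not values:
        return True  # assumption: an empty batch is handled by a separate row-count check
    within = sum(1 for v in values if min_value <= v <= max_value)
    return within / len(values) >= mostly


# Invented transaction amounts: nine valid values and one negative value.
amounts = [12.0, 40.5, 99.9, 15.0, 22.0, 18.5, 75.0, 31.0, 64.0, -5.0]

# Strict setting, as a financial model might want: no negative amounts at all.
strict = passes_between(amounts, 0, 10_000, mostly=1.0)

# Lenient setting: tolerate up to 20% out-of-range values.
lenient = passes_between(amounts, 0, 10_000, mostly=0.8)

print("strict:", strict)
print("lenient:", lenient)
```

The same batch fails the strict check and passes the lenient one, which is the trade-off to tune: tight thresholds catch more problems but also halt the pipeline more often.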

The surprising thing about data validation is how often the most critical failures are not outright data corruption, but subtle shifts in statistical properties that slowly erode model accuracy. For example, a model predicting customer lifetime value might have been trained on data where the average monthly_charges was $60. If, over time, the average monthly_charges in production data creeps up to $90 due to inflation or changes in service offerings, the model’s predictions will become increasingly inaccurate, even though every individual value still has the right type and falls within its allowed range. The expect_column_mean_to_be_between expectation, with bounds chosen carefully around the training-time mean, is designed to catch exactly this kind of drift. It’s a proactive measure against silent degradation.
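The gap between "every value is in range" and "the distribution has moved" is easy to demonstrate with a few lines of standard-library Python. The numbers below are invented to mirror the $60 to $90 example, and mean_within_tolerance is a hypothetical helper rather than a library function.

```python
from statistics import mean


def mean_within_tolerance(values, expected_mean, tolerance):
    """Return (passed, observed_mean) for a mean-drift check."""
    observed = mean(values)
    return abs(observed - expected_mean) <= tolerance, observed


def all_in_range(values, min_value, max_value):
    """Per-value range check of the kind that drift slips past."""
    return all(min_value <= v <= max_value for v in values)


# Invented samples: one resembling training data, one resembling drifted production data.
training_like = [55.0, 62.0, 58.0, 65.0, 60.0]
drifted = [85.0, 92.0, 88.0, 95.0, 90.0]

for label, batch in [("training-like", training_like), ("drifted", drifted)]:
    range_ok = all_in_range(batch, 0, 200)
    mean_ok, observed = mean_within_tolerance(batch, expected_mean=60.0, tolerance=10.0)
    print(f"{label}: range_ok={range_ok} mean_ok={mean_ok} observed_mean={observed}")
```

Both batches pass the per-value range check, but only the training-like batch passes the mean check, so the drift is visible only to the distribution-level expectation.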

The next logical step after setting up basic data validation is to integrate these checks into your CI/CD pipeline for automated model retraining.
