Detecting model drift before it impacts your users is less about magic and more about systematically measuring how your model’s world has changed since it was trained.

Let’s watch a simple model monitor in action. Imagine we have a model that predicts customer churn, trained on data from last year. We’re running it on live customer data.

import pandas as pd
from sklearn.linear_model import LogisticRegression
# Assumes Evidently's TestSuite API (0.4.x releases); newer releases changed these import paths
from evidently.test_suite import TestSuite
from evidently.tests import (
    TestNumberOfColumns,
    TestNumberOfRows,
    TestNumberOfUniqueValues,
    TestNumberOfMissingValues,
    TestNumberOfConstantColumns,
    TestNumberOfEmptyRows,
    TestNumberOfDuplicatedRows,
    TestNumberOfDuplicatedColumns,
    TestColumnDrift,
)

# Simulate training data
data_train = {
    'customer_id': range(1000),
    'age': [30 + i % 40 for i in range(1000)],
    'tenure': [12 + i % 24 for i in range(1000)],
    'monthly_charges': [50 + (i % 50) * 1.5 for i in range(1000)],
    'total_charges': [600 + (i % 50) * 18 for i in range(1000)],
    'churn': [0 if i % 3 == 0 else 1 for i in range(1000)]
}
df_train = pd.DataFrame(data_train)

# Simulate production data (with some drift)
data_prod = {
    'customer_id': range(1000, 2000),
    'age': [35 + i % 45 for i in range(1000)], # Age distribution shifted
    'tenure': [10 + i % 30 for i in range(1000)], # Tenure distribution shifted
    'monthly_charges': [55 + (i % 60) * 1.8 for i in range(1000)], # Monthly charges shifted
    'total_charges': [660 + (i % 60) * 21.6 for i in range(1000)], # Total charges shifted
    'churn': [0 if i % 2 == 0 else 1 for i in range(1000)] # Churn rate shifted
}
df_prod = pd.DataFrame(data_prod)

# Train a simple model
X_train = df_train.drop(['customer_id', 'churn'], axis=1)
y_train = df_train['churn']
model = LogisticRegression(max_iter=1000)  # higher max_iter avoids convergence warnings on unscaled features
model.fit(X_train, y_train)

# --- Monitoring Setup ---

# Define the tests we want to run
# These are basic data quality and drift tests
data_quality_and_drift_tests = [
    # Data quality tests (conditions are derived from the reference data)
    TestNumberOfColumns(),
    TestNumberOfRows(),
    TestNumberOfUniqueValues(column_name='customer_id'),
    TestNumberOfMissingValues(),
    TestNumberOfConstantColumns(),
    TestNumberOfEmptyRows(),
    TestNumberOfDuplicatedRows(),
    TestNumberOfDuplicatedColumns(),
    # Drift tests, one per feature (customer_id is an identifier, so it is skipped)
    TestColumnDrift(column_name='age'),
    TestColumnDrift(column_name='tenure'),
    TestColumnDrift(column_name='monthly_charges'),
    TestColumnDrift(column_name='total_charges'),
    TestColumnDrift(column_name='churn'), # Checks if target distribution has drifted
]

# Create a test suite
data_quality_and_drift_suite = TestSuite(tests=data_quality_and_drift_tests)

# Run the tests comparing production data against training data
data_quality_and_drift_suite.run(reference_data=df_train, current_data=df_prod)

# Display the results (renders inline in a notebook)
data_quality_and_drift_suite.show()

This code simulates training data and then "production" data whose distributions have shifted. The evidently library runs a series of tests: when data_quality_and_drift_suite.run(reference_data=df_train, current_data=df_prod) executes, it compares the distributions and characteristics of df_prod (what’s happening now) against df_train (what the model expects). In a notebook, the show() method then renders which tests passed or failed. You’d see failures on the per-column drift tests for age, tenure, and the other shifted columns, because their distributions in df_prod differ from those in df_train.

The core problem MLOps model monitoring solves is the silent degradation of model performance. A model trained on historical data operates under the assumption that the future will resemble the past. When the underlying data distributions shift (due to changing user behavior, external events, or data pipeline issues), the model’s predictions become less reliable, even if no code has changed. Model monitoring is the systematic process of detecting these shifts.

Internally, model monitoring typically involves comparing metrics or distributions from a "reference" dataset (often the training data or a known good snapshot of production data) with a "current" dataset (live production data). This comparison can happen at several levels:

  1. Data Drift: Changes in the distribution of input features. If your model expects ages between 20 and 60 and suddenly starts seeing mostly 70-year-olds, that’s data drift.
  2. Concept Drift: Changes in the relationship between input features and the target variable. The same customer profile might now lead to a different outcome than it did before. For example, a new competitor might emerge, changing churn behavior even for identical customer profiles.
  3. Prediction Drift: Changes in the distribution of model predictions themselves. If your model suddenly starts predicting churn for 90% of customers when it used to predict it for 10%, even if input features haven’t changed drastically, it signals a problem.
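To make the reference-versus-current comparison concrete, here is a minimal sketch of a data drift check on a single feature, using only the standard library. It computes the two-sample Kolmogorov-Smirnov statistic on the age columns from the walkthrough and compares it with a large-sample critical value. The helper names `ks_statistic` and `ks_critical_value` are illustrative, not part of any library, and the constant 1.36 is the usual approximation for a 5% significance level.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical CDFs."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    largest_gap = 0.0
    for value in set(ref_sorted) | set(cur_sorted):
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, value) / len(cur_sorted)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap


def ks_critical_value(n_ref, n_cur, c_alpha=1.36):
    """Large-sample rejection threshold; c_alpha=1.36 is roughly the 5% significance level."""
    return c_alpha * sqrt((n_ref + n_cur) / (n_ref * n_cur))


# Same synthetic age columns as in the walkthrough
age_train = [30 + i % 40 for i in range(1000)]
age_prod = [35 + i % 45 for i in range(1000)]

statistic = ks_statistic(age_train, age_prod)
threshold = ks_critical_value(len(age_train), len(age_prod))
age_drifted = statistic > threshold
print(f"KS statistic={statistic:.3f}, threshold={threshold:.3f}, drifted={age_drifted}")
```

The same function applies to prediction drift if you pass in model scores from the reference and current periods instead of a feature column. Concept drift cannot be caught this way, since it needs labels rather than a distribution comparison.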

The levers you control are the types of tests you run and the thresholds at which you flag a drift. You can monitor individual feature distributions, the overall feature space, the target variable’s distribution, and the model’s output distribution. You also define what constitutes a "significant" drift: is a 5% shift in average age a problem, or only a 20% shift?
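The effect of the threshold choice can be sketched with a deliberately simple drift measure: the relative change in a feature's mean. This is an illustrative stand-in rather than what any particular monitoring library computes, and `relative_mean_shift` and `flag_drift` are hypothetical helper names. The data reuses the age and tenure columns from the walkthrough.

```python
from statistics import mean


def relative_mean_shift(reference, current):
    """Absolute change in the mean, as a fraction of the reference mean (assumed non-zero)."""
    reference_mean = mean(reference)
    return abs(mean(current) - reference_mean) / abs(reference_mean)


def flag_drift(reference, current, threshold):
    """Return the features whose relative mean shift exceeds the threshold."""
    shifts = {
        name: relative_mean_shift(reference[name], current[name])
        for name in reference
    }
    return {name: round(shift, 3) for name, shift in shifts.items() if shift > threshold}


reference = {
    "age": [30 + i % 40 for i in range(1000)],
    "tenure": [12 + i % 24 for i in range(1000)],
}
current = {
    "age": [35 + i % 45 for i in range(1000)],
    "tenure": [10 + i % 30 for i in range(1000)],
}

# The same data produces an alert or silence depending on the threshold chosen
strict_alerts = flag_drift(reference, current, threshold=0.05)
lenient_alerts = flag_drift(reference, current, threshold=0.20)
print("5% threshold:", strict_alerts)
print("20% threshold:", lenient_alerts)
```

Here the average age moves by roughly 15%, so it trips the 5% threshold but not the 20% one, while tenure stays under both. Which setting is right depends on how sensitive the model is to that feature, which is a judgment the monitoring tool cannot make for you.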

Concept drift is often more insidious than data drift. Your input feature distributions (like age or monthly_charges) can remain relatively stable while the relationship between those features and the target (churn) changes fundamentally. This is harder to spot because basic feature distribution checks will pass even as the model’s accuracy tanks. Tracking model quality against ground-truth labels as they arrive, watching the target distribution, or using more advanced drift detection methods that analyze feature-target relationships is crucial here.
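A toy example shows why feature checks alone miss concept drift. In this sketch the customer profiles are identical in both periods, so any feature drift test would pass, but the rule that generates the labels changes. The one-feature cut-off "model", the helper names, and both labelling rules (churn above 90, then above 70) are hypothetical and exist only for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def fit_cutoff(values, labels):
    """Choose the cut-off that maximizes accuracy for the rule 'churn if value > cut-off'."""
    return max(
        sorted(set(values)),
        key=lambda cut: accuracy([int(v > cut) for v in values], labels),
    )


# Same customer profiles in both periods, so feature drift checks see no change
monthly_charges = [50 + (i % 50) * 1.5 for i in range(1000)]

# Hypothetical labelling rules: customers used to churn above 90, now above 70
labels_then = [int(charge > 90) for charge in monthly_charges]
labels_now = [int(charge > 70) for charge in monthly_charges]

# "Train" on the old relationship, then score against both periods
cutoff = fit_cutoff(monthly_charges, labels_then)
predictions = [int(charge > cutoff) for charge in monthly_charges]

accuracy_then = accuracy(predictions, labels_then)
accuracy_now = accuracy(predictions, labels_now)
print(f"cut-off={cutoff}, accuracy then={accuracy_then:.2f}, accuracy now={accuracy_now:.2f}")
```

The inputs never change, yet accuracy falls from 1.00 to 0.74 once the labels follow the new rule. Only a check that involves the labels, such as accuracy on recently labelled data, surfaces the problem.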

The next step after detecting drift is understanding its root cause and deciding on a remediation strategy.
