MLOps Data Quality: Validate Datasets with Great Expectations (2026)

Great Expectations is best understood as a data validation tool rather than a general data quality tool: its primary value is in documenting, in executable form, the state you expect your data to be in.

Let’s see it in action. Imagine you have a CSV file with user data, and you want to ensure it’s always clean before it gets fed into your machine learning model.

user_id,username,signup_date,last_login,is_active
1,alice,2023-01-15,2024-03-10,true
2,bob,2023-02-20,2024-03-11,true
3,charlie,2023-03-01,,false
4,david,2023-04-05,2024-03-09,true
5,eve,2023-05-10,2024-03-11,
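Before writing any rules, it helps to see where the sample already has gaps. The following is a minimal sketch using only Python’s standard library; the CSV text is inlined so the snippet runs on its own, and the helper name count_missing is purely illustrative.

```python
import csv
import io

# The sample file from above, inlined so the snippet is self-contained.
SAMPLE_CSV = """user_id,username,signup_date,last_login,is_active
1,alice,2023-01-15,2024-03-10,true
2,bob,2023-02-20,2024-03-11,true
3,charlie,2023-03-01,,false
4,david,2023-04-05,2024-03-09,true
5,eve,2023-05-10,2024-03-11,
"""


def count_missing(csv_text: str) -> dict[str, int]:
    """Return the number of empty cells per column (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == "":
                missing[name] += 1
    return missing


missing_by_column = count_missing(SAMPLE_CSV)
print(missing_by_column)
```

Only last_login and is_active contain empty cells, so the rules for those two columns have to tolerate missing values.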

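First, we need to install Great Expectations and initialize a project. The commands in this walkthrough use the legacy great_expectations CLI, which was removed in the 1.0 release, so we pin a pre-1.0 version:

pip install "great_expectations<1.0"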
great_expectations init

This creates a great_expectations directory in your project. Now, let’s create a "Datasource" to connect to our CSV file. We’ll use the CLI for this:

great_expectations datasource new

When prompted, choose files on a filesystem as the kind of data to connect to and Pandas as the processing engine, then enter the directory containing your CSV file (e.g., ./data/). The CLI then opens a notebook in which you name and save the datasource; the exact prompts vary between versions. Let’s name this datasource user_data_csv. Once it is saved, Great Expectations lists the CSV files in that directory as data assets.
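For reference, the datasource that this flow saves can be approximated as a Python dictionary. This is a sketch of the block-style configuration used by pre-1.0 releases; the connector names and the regex are assumptions meant to mirror the CLI defaults, so compare them against your own great_expectations.yml.

```python
import json

# Assumed shape of a pre-1.0 block-style datasource config (check your own file).
datasource_config = {
    "name": "user_data_csv",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "PandasExecutionEngine"},
    "data_connectors": {
        # Discovers files under base_directory and exposes each as a data asset.
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetFilesystemDataConnector",
            "base_directory": "./data/",
            "default_regex": {
                "group_names": ["data_asset_name"],
                "pattern": "(.*)",
            },
        },
        # Accepts data handed over at runtime (e.g. an in-memory DataFrame).
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}

print(json.dumps(datasource_config, indent=2))

# With a pre-1.0 Great Expectations installed, this would typically be
# registered with: context.add_datasource(**datasource_config)
```

The data connector names matter later: a batch request for a file on disk has to name the connector that discovers files, not the runtime one.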

Next, we define "Expectations." These are the rules your data must follow. We’ll create an "Expectation Suite" and add some expectations to it.

great_expectations suite new

Choose the interactive option and select the users.csv asset from the user_data_csv datasource when prompted. This will open a Jupyter Notebook. Here’s how you’d add some common expectations:

# In the generated Jupyter Notebook
import great_expectations as gx
from great_expectations.core.batch import BatchRequest

context = gx.get_context()

# Let's assume your CSV file is named 'users.csv'
batch_request = BatchRequest(
    datasource_name="user_data_csv",
    data_connector_name="default_inferred_data_connector_name",  # Connector that discovers files on disk; the name might vary, check your config
    data_asset_name="users.csv",
)

# Get the validator for your data asset
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="default" # Or whatever you named your suite
)

# Add expectations
validator.expect_column_to_exist("user_id")
validator.expect_column_values_to_be_unique("user_id")
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_of_type("user_id", "int64")

validator.expect_column_to_exist("username")
validator.expect_column_values_to_not_be_null("username")
validator.expect_column_values_to_be_unique("username")
validator.expect_column_values_to_be_of_type("username", "str")

validator.expect_column_to_exist("signup_date")
validator.expect_column_values_to_not_be_null("signup_date")
validator.expect_column_values_to_match_strftime_format("signup_date", "%Y-%m-%d")

validator.expect_column_to_exist("last_login")
# Nulls are skipped by this check, so the empty last_login value still passes
validator.expect_column_values_to_match_strftime_format("last_login", "%Y-%m-%d")

validator.expect_column_to_exist("is_active")
validator.expect_column_values_to_be_in_set("is_active", [True, False])  # Nulls are skipped here too

# Save the expectations, including any that failed against the sample batch
validator.save_expectation_suite(discard_failed_expectations=False)

The expectation on last_login is interesting. It checks that the values in the column match the YYYY-MM-DD format. Column-level expectations like this one skip null values by default, so the missing last_login for charlie does not cause a failure; only the non-null values have to conform. If you also need to tolerate a small share of malformed values, most of these expectations accept a mostly argument, a fraction between 0 and 1 giving the minimum share of non-null values that must pass. If a column must never be null, say so explicitly with expect_column_values_to_not_be_null, as we did for signup_date. This matters for real-world data, where missing values are common.
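The null-handling idea can be sketched in plain Python: nulls are skipped, and the remaining values must match the date format, optionally with a tolerance fraction (Great Expectations exposes such a tolerance as the mostly argument on column-level expectations). The functions below are illustrative stand-ins, not the library’s implementation.

```python
from datetime import datetime


def matches_format(value: str, fmt: str) -> bool:
    """Return True if value parses under the given strftime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def check_dates(values, fmt="%Y-%m-%d", mostly=1.0):
    """Illustrative check: ignore nulls, then require a fraction of matches.

    Mimics the idea behind null handling and a `mostly` threshold;
    this is not Great Expectations' actual implementation.
    """
    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return True  # nothing left to check
    matched = sum(matches_format(v, fmt) for v in non_null)
    return matched / len(non_null) >= mostly


# last_login values from the sample file; charlie's entry is missing.
last_login = ["2024-03-10", "2024-03-11", None, "2024-03-09", "2024-03-11"]

strict_ok = check_dates(last_login)
with_bad_value = check_dates(last_login + ["11/03/2024"])
tolerant = check_dates(last_login + ["11/03/2024"], mostly=0.8)
print(strict_ok, with_bad_value, tolerant)
```

The missing value never counts against the column, while a single malformed date fails the strict check and passes only once the threshold is relaxed to 0.8.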

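Now, to validate a dataset against these expectations, you need a checkpoint, which pairs an expectation suite with a batch of data. Create one with great_expectations checkpoint new <name> (this also opens a notebook), then run it by name: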

great_expectations checkpoint run default # Or the name of your checkpoint

With the default checkpoint configuration, the run also updates the "Data Docs" site, a human-readable HTML report showing which expectations passed and which failed.

The core problem Great Expectations solves is that ML models are brittle. They learn patterns from the data they’re trained on. If the new data entering the system (for inference or retraining) has a different structure or contains unexpected values, the model’s performance can degrade silently or, worse, the system can fail outright. Great Expectations acts as a gatekeeper, ensuring that data entering your ML pipeline conforms to a predefined standard. It builds a verifiable "contract" between your data sources and your ML systems.

The real power of Great Expectations comes from integrating it into your CI/CD pipeline. When new data arrives, you can automatically run a checkpoint. If any expectations fail, the pipeline can halt, alert engineers, and prevent bad data from reaching your model. This proactive validation is far more effective than reactive monitoring of model performance after it has already degraded.
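A pipeline gate of this kind can be sketched as a small function that turns a validation result into a process exit code. The dictionary shape used here is a simplified assumption (an overall success flag plus per-expectation results); in a pre-1.0 project the real result would come from something like context.run_checkpoint(checkpoint_name=...), whose structure is more deeply nested.

```python
import sys


def gate(validation_result: dict) -> int:
    """Return a process exit code: 0 if validation passed, 1 otherwise."""
    if validation_result.get("success", False):
        return 0
    # Collect the names of the failed expectations for the alert message.
    failed = [
        r.get("expectation_config", {}).get("expectation_type", "<unknown>")
        for r in validation_result.get("results", [])
        if not r.get("success", False)
    ]
    print("Data validation failed: " + ", ".join(failed), file=sys.stderr)
    return 1


# Hypothetical, simplified result used only to exercise the gate.
example_result = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_to_exist"}},
        {"success": False,
         "expectation_config": {"expectation_type": "expect_column_values_to_be_unique"}},
    ],
}

exit_code = gate(example_result)
print("exit code:", exit_code)
# In a CI step, finish with sys.exit(exit_code) so a failure halts the pipeline.
```

Returning the code instead of exiting inside the function keeps the gate easy to unit test; the CI step decides when to actually stop.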

What most people don’t realize is that the "Expectations" themselves serve as living documentation. When you look at your great_expectations/expectations/ directory, you’re not just seeing configuration files; you’re seeing a precise, executable specification of what "good" data looks like for your project. This makes debugging data issues much faster because you can immediately see which specific expectation failed and why, rather than digging through logs or trying to infer the problem from model output.
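To make the "living documentation" point concrete, here is a minimal sketch that turns a suite file into a readable list of rules. The JSON below is a hand-written stand-in for a file in great_expectations/expectations/; the key names follow the pre-1.0 suite format as an assumption, and describe_suite is a hypothetical helper.

```python
import json

# Stand-in for a suite file such as expectations/default.json (contents assumed).
SUITE_JSON = """
{
  "expectation_suite_name": "default",
  "expectations": [
    {"expectation_type": "expect_column_values_to_be_unique",
     "kwargs": {"column": "user_id"}},
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "username"}}
  ]
}
"""


def describe_suite(suite: dict) -> list[str]:
    """Render each expectation as 'column: rule {other arguments}'."""
    lines = []
    for exp in suite.get("expectations", []):
        kwargs = dict(exp.get("kwargs", {}))
        column = kwargs.pop("column", "<table>")
        extra = f" {kwargs}" if kwargs else ""
        lines.append(f"{column}: {exp['expectation_type']}{extra}")
    return lines


summary = describe_suite(json.loads(SUITE_JSON))
for line in summary:
    print(line)
```

Because the suite is plain JSON under version control, a change to what counts as "good" data shows up in code review like any other change.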

The next hurdle is often how to automatically trigger these validations as part of a larger data pipeline orchestration.