MLOps platforms like Prefect and Dagster can make your ML pipelines look like they’re running themselves, but the real magic is how they let you treat your data as a first-class citizen, not just a byproduct of your code.

Let’s see Dagster in action with a simple data pipeline. Imagine we want to download some data, process it, and then train a model.

from dagster import job, op, repository, schedule


@op
def download_data(url: str = "http://example.com/data") -> dict:
    """Downloads data from a given URL."""
    print(f"Downloading data from {url}...")
    # In a real scenario, this would fetch data from an API or file storage
    return {"data": f"raw_data_from_{url}"}


@op
def process_data(data: dict) -> dict:
    """Processes the raw data."""
    print(f"Processing data: {data['data']}")
    processed = data["data"].upper()
    return {"processed_data": processed}


@op
def train_model(processed_data: dict):
    """Trains a model on the processed data."""
    print(f"Training model on: {processed_data['processed_data']}")
    # In a real scenario, this would involve ML libraries
    print("Model training complete.")


@job
def ml_pipeline():
    # A job body only wires op outputs to op inputs; literal values cannot be
    # passed here, so the URL comes from the default on download_data's input.
    downloaded = download_data()
    processed = process_data(data=downloaded)
    train_model(processed_data=processed)


# Schedule the job to run every day at midnight
@schedule(cron_schedule="0 0 * * *", job=ml_pipeline)
def daily_ml_pipeline_schedule(context):
    # The returned dict is the run config for each scheduled run (empty here)
    return {}


# Register the job and its schedule so Dagster's tooling can load them.
# The schedule is defined first because it is referenced in the list below.
@repository
def my_repository():
    return [ml_pipeline, daily_ml_pipeline_schedule]

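This code defines three operations (ops) and a job that orchestrates them. When you load it in Dagster’s web UI (called Dagit in older releases), you see a directed acyclic graph (DAG) representing the flow of data. The download_data op produces an output that is passed directly as an input to process_data, whose output in turn feeds train_model. This explicit data flow is key: the execution order is derived from the data dependencies rather than from the order in which you happened to write the calls.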
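To make the link between declared dependencies and execution order concrete, here is a minimal plain-Python sketch. It is not Dagster's internal code: the dependencies mapping and the execution_order helper are hypothetical, hand-written stand-ins for what an orchestrator infers from a job definition. It relies on the standard library's graphlib module, which assumes Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each step lists the steps whose output it consumes.
dependencies = {
    "train_model": {"process_data"},
    "process_data": {"download_data"},
    "download_data": set(),
}


def execution_order(deps: dict) -> list:
    """Return an order in which every step runs after all of its upstream steps."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Even though the mapping lists train_model first, the computed order starts with download_data, because ordering follows the data dependencies and not the order of declaration.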

The core problem these tools solve is the brittle nature of traditional ML scripts. You write a Python script, run it, and if it fails halfway through, you have to figure out where it broke, what state it was in, and rerun it from the beginning, often recomputing intermediate results. Prefect and Dagster introduce structured execution, dependency management, and observability. They allow you to define your ML workflow as a series of distinct tasks with clear inputs and outputs. This makes pipelines more robust, easier to debug, and simpler to version.

Internally, Dagster treats your ops as functions whose return values become outputs, which are then passed to downstream ops. An op is Dagster’s atomic unit of computation (older releases called it a "solid"), and ops are composed into "graphs" and "jobs." Dagster manages the execution of these jobs, tracking their status, handling retries, and logging their results. You can configure how each op runs, including its compute resources, concurrency, and parameters. Because data dependencies are declared explicitly, Dagster can work out which ops need to be re-executed after a failure, as long as the outputs of the ops that already succeeded were persisted.
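The idea of recomputing only what is needed after a failure can be sketched in plain Python. This is an illustration under stated assumptions, not Dagster's implementation: run_pipeline, the store dict, and the simulated failure are hypothetical names, and an in-memory dict stands in for the persistent storage a real orchestrator would use.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_pipeline(steps: List[Step], initial: Any, store: Dict[str, Any]) -> Any:
    """Run steps in order, reusing any outputs already saved in `store`."""
    value = initial
    for name, fn in steps:
        if name in store:
            # Output survived from an earlier run, so skip recomputation.
            value = store[name]
            continue
        value = fn(value)  # may raise; outputs of earlier steps stay in `store`
        store[name] = value
    return value


calls = {"download": 0, "process": 0}


def download(url: str) -> str:
    calls["download"] += 1
    return f"raw_data_from_{url}"


def process(raw: str) -> str:
    calls["process"] += 1
    if calls["process"] == 1:
        raise RuntimeError("simulated failure on the first attempt")
    return raw.upper()


steps = [("download", download), ("process", process)]
store: Dict[str, Any] = {}

try:
    run_pipeline(steps, "http://example.com/data", store)
except RuntimeError:
    pass  # the first run fails inside `process`

# The second run reuses the stored download output and only re-runs `process`.
result = run_pipeline(steps, "http://example.com/data", store)
print(result, calls)
```

The download step runs once across both attempts, which is the behavior the explicit dependency graph makes possible.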

Prefect offers a similar paradigm with "tasks" and "flows." A task is a Python function decorated with @task, and a flow is a Python function decorated with @flow that calls tasks. Prefect’s strength lies in its dynamic nature (a flow is ordinary Python, so it can branch and loop at runtime) and its focus on handling transient failures. When you set a retry count on a task, Prefect automatically re-runs it if it fails due to temporary network glitches or resource unavailability.
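The retry behavior can be illustrated with a small plain-Python decorator. This is a sketch of the general technique, not Prefect's implementation: with_retries and fetch are hypothetical names, and the parameter names simply mirror the retries and retry_delay_seconds options that Prefect's @task decorator accepts.

```python
import functools
import time


def with_retries(retries: int = 3, retry_delay_seconds: float = 0.0):
    """Re-run the wrapped function up to `retries` extra times if it raises."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # out of retries, surface the last error
                    time.sleep(retry_delay_seconds)

        return wrapper

    return decorator


attempts = {"count": 0}


@with_retries(retries=2)
def fetch() -> str:
    """Fails twice with a simulated transient error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient network error")
    return "payload"


result = fetch()
print(result, attempts["count"])
```

A real orchestrator adds what this sketch leaves out, such as recording each attempt, backing off between retries, and distinguishing retryable errors from permanent ones.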

The one thing most people don’t realize is how deeply these systems embrace data lineage. When Dagster or Prefect runs your pipeline, it isn’t just executing code; it is building a history of how your data was transformed. This means that if you discover a bug in your process_data op months later, you can go back to a specific version of your pipeline, rerun it with the exact same input data that caused the problem (provided that data was persisted or versioned), and see precisely how the data changed at each step. This level of auditability and reproducibility is invaluable for debugging and for meeting regulatory requirements in fields like finance or healthcare.
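A minimal sketch of what a lineage record might contain, in plain Python. This is not how Dagster or Prefect store run history: fingerprint, run_with_lineage, and the log format are hypothetical, and the sketch assumes every value passed between steps is JSON-serializable.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


def fingerprint(value: Any) -> str:
    """Short, stable hash of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_with_lineage(
    steps: List[Tuple[str, Callable[[Any], Any]]],
    initial: Any,
    log: List[Dict[str, str]],
) -> Any:
    """Run steps in order and append one input/output record per step to `log`."""
    value = initial
    for name, fn in steps:
        result = fn(value)
        log.append(
            {"step": name, "input": fingerprint(value), "output": fingerprint(result)}
        )
        value = result
    return value


steps = [
    ("download_data", lambda url: {"data": f"raw_data_from_{url}"}),
    ("process_data", lambda d: {"processed_data": d["data"].upper()}),
]

lineage: List[Dict[str, str]] = []
final = run_with_lineage(steps, "http://example.com/data", lineage)

for record in lineage:
    print(record)
```

Because each step's output fingerprint matches the next step's input fingerprint, the records chain together, and rerunning with the same input reproduces the same chain. That is the property that makes an audit trail useful.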

The next hurdle is understanding how to integrate these engines with your actual ML training and deployment infrastructure.
