Dagster is fundamentally a data orchestrator, not just a task orchestrator, and that difference is everything.
Let’s see what that means in practice. Imagine you have a Python function that reads data from a CSV file and another that trains a model. In Airflow, you’d likely define two BashOperator or PythonOperator tasks, each calling your Python scripts.
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime
with DAG(
dag_id='simple_airflow_pipeline',
start_date=datetime(2023, 1, 1),
schedule_interval='@daily',
catchup=False
) as dag:
read_data = BashOperator(
task_id='read_data_from_csv',
bash_command='python /path/to/scripts/read_data.py --output_path /tmp/data.csv'
)
train_model = BashOperator(
task_id='train_model_on_data',
bash_command='python /path/to/scripts/train_model.py --input_path /tmp/data.csv --model_path /tmp/model.pkl'
)
read_data >> train_model
The primary concern here is when tasks run. Airflow manages dependencies and schedules based on time and task completion. If read_data.py produces a file at /tmp/data.csv, train_model.py just has to hope it exists and is in the right format when it runs. There’s no inherent understanding of the data flowing between them.
Now, let’s look at Dagster. In Dagster, you define "ops" (operations) and "graphs" (collections of ops). Crucially, ops have explicit inputs and outputs, and Dagster tracks the lineage and types of these data assets.
from dagster import op, job, In, Out, graph
@op(out=Out(dagster_type=str))
def read_data_op():
"""Reads data and returns the path to the output file."""
output_path = "/tmp/data.csv"
# Simulate writing data
with open(output_path, "w") as f:
f.write("col1,col2\n1,a\n2,b\n")
return output_path
@op(ins={"data_path": In(dagster_type=str)}, out=Out(dagster_type=str))
def train_model_op(data_path: str):
"""Trains a model on data from the given path."""
# Simulate training and saving a model
with open(data_path, "r") as f:
print(f"Reading data from: {data_path}")
model_path = "/tmp/model.pkl"
with open(model_path, "w") as f:
f.write("dummy_model_content")
return model_path
@graph
def training_graph():
data_path = read_data_op()
train_model_op(data_path=data_path)
training_job = training_graph.to_job()
When you execute training_job in Dagster, it doesn’t just schedule tasks. It understands that train_model_op requires the output of read_data_op. It passes the actual value (in this case, the string /tmp/data.csv) from read_data_op directly to train_model_op. This is explicit data passing, not implicit file system reliance. Dagster’s UI, Dagit, visualizes this data flow, showing not just dependencies but the typed assets being produced and consumed.
This leads to the core problem Dagster solves: managing the complexity of data pipelines where the data itself is a first-class citizen. Airflow is excellent for orchestrating workflows where tasks trigger other tasks, often based on time or external events. It’s like a sophisticated cron-job manager with added dependency logic. Dagster, on the other hand, is designed for data-aware orchestration. It understands data schemas, types, lineage, and quality.
The mental model shifts from "when should this run?" to "what data needs to be produced and consumed, and how can I ensure its quality and traceability?" Dagster provides a unified view of your data assets, their transformations, and the code that produces them. This allows for powerful features like:
- Data Versioning: Easily track which code produced which data.
- Data Testing: Write unit and integration tests for your data transformations, just like you test your code.
- Data Lineage: Understand exactly how a specific data asset was created, tracing it back to its source.
- Materializations: Dagster tracks when data assets are "materialized" (created or updated), allowing for intelligent re-runs and monitoring.
- Asset-based Scheduling: Trigger downstream jobs not just when a upstream task finishes, but when a specific data asset is ready or meets certain criteria.
The most surprising thing is how Dagster’s concept of "assets" fundamentally changes how you think about pipeline development. Instead of just defining tasks and their execution order, you define the data entities you care about (like a cleaned dataset, a trained model, or a summary report) and the transformations that produce them. This asset-centric view makes pipelines more modular, testable, and understandable, as you’re reasoning about the data products you’re building.
The next step in mastering Dagster is understanding its abstractions for testing and local development, particularly Definitions and ReplayExecution.