Azure Machine Learning pipelines are the bedrock of MLOps on Azure, turning ad-hoc model training into repeatable, automated workflows. The surprising part? You can often get a production-ready pipeline running with very little code, and most of the complexity you do introduce comes from your specific business needs, not from the platform itself.
Let’s see a pipeline in action. Imagine we’re building a simple sentiment analysis model. Here’s a Python script that defines and runs a pipeline:
from azure.ai.ml import MLClient, Input, Output, command
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline
from azure.identity import DefaultAzureCredential

# Authenticate and get MLClient
credential = DefaultAzureCredential()
ml_client = MLClient(
    credential=credential,
    subscription_id="YOUR_SUBSCRIPTION_ID",
    resource_group_name="YOUR_RESOURCE_GROUP",
    workspace_name="YOUR_WORKSPACE_NAME",
)

# Define components (these are the building blocks of your pipeline).
# code="./src" is an assumed local folder that holds train.py and score.py.
train_model = command(
    name="train_sentiment_model",
    display_name="Train Sentiment Model",
    code="./src",
    environment="azureml://registries/azureml/environments/sklearn-1.0/versions/5",
    inputs={"training_data": Input(type=AssetTypes.URI_FOLDER)},
    outputs={
        "model_output": Output(
            type=AssetTypes.URI_FILE,
            path="azureml://datastores/workspaceblobstore/paths/models/sentiment_model.pkl",
        )
    },
    command="python train.py --data_path ${{inputs.training_data}} --output_model ${{outputs.model_output}}",
)

score_model = command(
    name="score_sentiment_model",
    display_name="Score Sentiment Model",
    code="./src",
    environment="azureml://registries/azureml/environments/sklearn-1.0/versions/5",
    inputs={
        "model_to_score": Input(type=AssetTypes.URI_FILE),
        "scoring_data": Input(type=AssetTypes.URI_FOLDER),
    },
    outputs={
        "scores": Output(
            type=AssetTypes.URI_FILE,
            path="azureml://datastores/workspaceblobstore/paths/scores/sentiment_scores.csv",
        )
    },
    command="python score.py --input_model ${{inputs.model_to_score}} --input_data ${{inputs.scoring_data}} --output_scores ${{outputs.scores}}",
)

# Create a pipeline
@pipeline(
    default_compute="cpu-cluster",  # Your Azure ML compute cluster name
    default_datastore="workspaceblobstore",  # Your default Azure ML datastore
)
def sentiment_pipeline(training_data_input, scoring_data_input):
    train_step = train_model(training_data=training_data_input)
    # Wiring the training output into the scoring input makes scoring wait for training
    score_step = score_model(
        model_to_score=train_step.outputs.model_output,
        scoring_data=scoring_data_input,
    )
    return {"pipeline_output_scores": score_step.outputs.scores}

# Instantiate the pipeline
pipeline_job = sentiment_pipeline(
    training_data_input=Input(type=AssetTypes.URI_FOLDER, path="azureml://datastores/workspaceblobstore/paths/data/training/"),
    scoring_data_input=Input(type=AssetTypes.URI_FOLDER, path="azureml://datastores/workspaceblobstore/paths/data/scoring/"),
)

# Submit the pipeline job
pipeline_job = ml_client.jobs.create_or_update(pipeline_job)
print(f"Pipeline job submitted: {pipeline_job.id}")
print(f"View in Azure ML Studio: {pipeline_job.studio_url}")
In this example, train_model and score_model are defined as command components. These are essentially wrappers around your Python scripts, specifying their inputs, outputs, and the environment (Docker image) they need to run in. The @pipeline decorator then stitches these components together. training_data_input and scoring_data_input are the pipeline’s top-level inputs, which are then passed to the respective components. The output of the train_model step (train_step.outputs.model_output) becomes an input to the score_model step. Finally, we instantiate and submit the pipeline_job using our MLClient.
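To make the "wrapper around your Python script" idea concrete, here is a minimal sketch of what the script side of the training component might look like. The flag names come from the command string above; everything else (the JSON stand-in for a model, the file-counting "training") is a placeholder assumption, not the real train.py.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse the flags that the component's command line passes to the script."""
    parser = argparse.ArgumentParser(description="Train a sentiment model (sketch)")
    parser.add_argument("--data_path", type=Path, required=True,
                        help="Folder that ${{inputs.training_data}} resolves to")
    parser.add_argument("--output_model", type=Path, required=True,
                        help="Location that ${{outputs.model_output}} resolves to")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder "training": count the input files instead of fitting a model.
    n_files = sum(1 for p in args.data_path.rglob("*") if p.is_file())
    # Make sure the parent folder exists before writing the output.
    args.output_model.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical stand-in for serializing a trained model.
    args.output_model.write_text(json.dumps({"n_training_files": n_files}))
    return args.output_model
```

In a real script you would end the file with the usual `if __name__ == "__main__": main()` guard and replace the placeholder with actual model fitting. The point is only that the script sees ordinary paths on its command line and never needs to know about datastores.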
The core problem Azure ML pipelines solve is managing the complexity of the machine learning lifecycle. It’s not just about training a model; it’s about data preparation, feature engineering, model training, validation, deployment, and monitoring, all done in a reproducible way. Each step in the pipeline runs as an independent job on a compute target you’ve configured in Azure ML, such as a CPU or GPU cluster. Steps that don’t depend on each other’s outputs can run in parallel, and each step can use the compute that suits it. You define inputs and outputs for each step, and Azure ML handles passing data between them, often using Azure Blob Storage or Azure Data Lake Storage as the underlying data store.
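The link between "independent jobs" and "parallel execution" is easier to see with a small dependency graph. The sketch below is not how Azure ML schedules work internally; it only illustrates the idea using Python's standard `graphlib` module (Python 3.9+). The two prep steps are hypothetical additions to the sentiment example.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group steps into waves; steps in the same wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its upstream steps finished.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Each step maps to the set of steps whose outputs it consumes.
steps = {
    "prep_training_data": set(),
    "prep_scoring_data": set(),
    "train_model": {"prep_training_data"},
    "score_model": {"train_model", "prep_scoring_data"},
}

waves = execution_waves(steps)
print(waves)
# [['prep_scoring_data', 'prep_training_data'], ['train_model'], ['score_model']]
```

The two prep steps land in the same wave because neither consumes the other's output, which is the situation in which a pipeline can run steps side by side.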
The exact levers you control are primarily within the component definitions and the pipeline structure. For components, you specify the Python script, the environment (which dictates the OS, Python version, and installed packages), the compute target, and the command-line arguments. For the pipeline itself, you define the order of execution through how you connect component outputs to inputs, and you can specify default compute targets and datastores. You can also control versioning of components and environments, ensuring that your pipelines are always using the intended artifacts.
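One of those levers, the command-line arguments, relies on `${{inputs.name}}` and `${{outputs.name}}` placeholders in the command string. As a rough mental model (an illustration only, not the platform's actual implementation), the substitution can be sketched like this. The `/mnt/...` paths are made up for the example.

```python
import re

# Matches ${{inputs.some_name}} or ${{outputs.some_name}}
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template, inputs, outputs):
    """Replace each placeholder with the path bound to that input or output."""
    bindings = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        kind, name = match.group(1), match.group(2)
        if name not in bindings[kind]:
            raise KeyError(f"No binding for {kind}.{name}")
        return bindings[kind][name]

    return PLACEHOLDER.sub(substitute, template)


command_line = render_command(
    "python train.py --data_path ${{inputs.training_data}} --output_model ${{outputs.model_output}}",
    inputs={"training_data": "/mnt/inputs/training"},        # hypothetical local path
    outputs={"model_output": "/mnt/outputs/model.pkl"},      # hypothetical local path
)
print(command_line)
# python train.py --data_path /mnt/inputs/training --output_model /mnt/outputs/model.pkl
```

A placeholder that has no matching input or output raises an error here, which mirrors a common source of pipeline failures: a name in the command string that doesn't match any declared input or output.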
A lesser-known aspect of pipeline composition is how Azure ML resolves inputs and outputs. When you define a component input like training_data and then pass an Input object to it, Azure ML doesn’t just hand your script a string. It records a reference to the data in your datastore and, at run time, makes that data available on the compute target, typically by mounting or downloading it depending on the input mode. Outputs work the same way in reverse: they are written to a datastore location, tracked against the job that produced them, and can be registered as named, versioned data assets. This underlying tracking is crucial for reproducibility and lineage: you can trace back which data and code produced a specific model or set of scores.
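The `azureml://datastores/...` strings used in the example follow a simple shape: a datastore name, then a path inside it. Here is a small sketch that splits such a URI into those two parts, which can help when logging or debugging lineage. The helper and its names are hypothetical and are not part of the Azure ML SDK.

```python
import re
from typing import NamedTuple


class DatastorePath(NamedTuple):
    datastore: str
    relative_path: str


# Shape used in this article: azureml://datastores/<datastore>/paths/<path>
_DATASTORE_URI = re.compile(r"^azureml://datastores/(?P<datastore>[^/]+)/paths/(?P<path>.*)$")


def parse_datastore_uri(uri):
    """Split a datastore URI into the datastore name and the path inside it."""
    match = _DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a datastore URI: {uri!r}")
    return DatastorePath(match["datastore"], match["path"])


ref = parse_datastore_uri("azureml://datastores/workspaceblobstore/paths/data/training/")
print(ref.datastore)      # workspaceblobstore
print(ref.relative_path)  # data/training/
```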
Once your pipeline is running, the next logical step is to automate its execution with triggers. Azure ML’s built-in job schedules cover time-based runs, while event-driven triggers, such as new data arriving, usually lead you into Azure Data Factory or Azure Logic Apps integration.