Kubeflow Pipelines is a component of Kubeflow that lets you build and deploy portable, scalable machine learning workflows directly on Kubernetes.

Here’s a Kubeflow Pipeline in action, processing some data:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: data-processing-pipeline
spec:
  entrypoint: process-data
  templates:
  - name: process-data
    steps:
    - - name: download-data
        template: download-data-template
    - - name: clean-data
        template: clean-data-template
        dependencies: [download-data]
    - - name: feature-engineer
        template: feature-engineer-template
        dependencies: [clean-data]

  - name: download-data-template
    container:
      image: ubuntu:latest
      command: ["bash", "-c"]
      args: ["echo 'Downloading data...' && sleep 5"]

  - name: clean-data-template
    container:
      image: ubuntu:latest
      command: ["bash", "-c"]
      args: ["echo 'Cleaning data...' && sleep 5"]

  - name: feature-engineer-template
    container:
      image: ubuntu:latest
      command: ["bash", "-c"]
      args: ["echo 'Engineering features...' && sleep 5"]

This YAML defines a simple workflow. The entrypoint is process-data, which is a sequence of steps: download-data, clean-data, and feature-engineer. Each step runs a containerized task. The dependencies field ensures that tasks run in the correct order. For example, clean-data only starts after download-data has completed.

Kubeflow Pipelines is built on top of Argo Workflows, a Kubernetes-native workflow engine. This means your ML pipelines are essentially Kubernetes Workflow custom resources. When you define a pipeline, Kubeflow translates it into this format and submits it to your Kubernetes cluster. Each step in your pipeline becomes a Kubernetes pod. This leverages Kubernetes’ strengths:

  • Scalability: Kubernetes can automatically scale the number of pods to handle your workload.
  • Portability: Pipelines defined this way run anywhere Kubernetes runs, from your laptop to cloud providers.
  • Resource Management: You can define CPU, memory, and GPU requirements for each step, and Kubernetes will schedule them accordingly.
  • Reproducibility: By containerizing each step and defining dependencies, you create a reproducible ML workflow.

The core problem Kubeflow Pipelines solves is the transition from a local ML experiment to a production-ready, automated ML system. It bridges the gap between data scientists’ notebooks and the operational demands of deploying and managing ML models. It provides a structured way to orchestrate complex ML tasks, such as data ingestion, preprocessing, model training, evaluation, and deployment, as a single, cohesive unit.

You interact with Kubeflow Pipelines primarily through its UI or its Python SDK. The UI provides a visual representation of your pipelines, their runs, and their outputs. The SDK allows you to programmatically define, compile, and run pipelines.

Here’s a snippet of the Python SDK for defining a pipeline:

from kfp import dsl

@dsl.pipeline(
    name='Data Processing Pipeline',
    description='A pipeline for processing data.'
)
def data_processing_pipeline(dataset_url: str):
    download_task = download_data_op(url=dataset_url)
    clean_task = clean_data_op(downloaded_data=download_task.output)
    engineer_task = feature_engineer_op(cleaned_data=clean_task.output)

@dsl.component
def download_data_op(url: str):
    # This is a placeholder for actual download logic
    print(f"Downloading data from {url}...")
    return f"data_from_{url}"

@dsl.component
def clean_data_op(downloaded_data: str):
    # Placeholder for cleaning logic
    print(f"Cleaning data: {downloaded_data}...")
    return f"cleaned_{downloaded_data}"

@dsl.component
def feature_engineer_op(cleaned_data: str):
    # Placeholder for feature engineering logic
    print(f"Engineering features for: {cleaned_data}...")
    return f"features_for_{cleaned_data}"

# To compile and run:
# pipeline_func.compile()
# experiment.create_run_from_pipeline_func(pipeline_func, arguments={...})

In this Python code, @dsl.pipeline decorates a function that defines the overall workflow. @dsl.component defines individual steps (operations) that can be reused. The dsl.Pipeline object is then compiled into the Argo Workflow YAML described earlier.

The most surprising thing about Kubeflow Pipelines is how it abstracts away the underlying Kubernetes complexity while still giving you full control over resource allocation and scheduling. You define your ML steps logically, and Kubeflow, via Argo, translates these into Kubernetes pods that are scheduled and managed by the cluster, allowing you to specify resource requests and limits for each step. For instance, if your training step requires a GPU, you can specify kfp.dsl.Resource(gpu=1) within the component definition, and Kubernetes will ensure that pod is scheduled on a node with an available GPU.

The next concept you’ll likely explore is managing artifacts—the outputs of your pipeline steps, like trained models or processed datasets—and how to version and reuse them across different pipeline runs.

Want structured learning?

Take the full MLOps & AI DevOps course →