MLOps data versioning is less about tracking changes to files and more about ensuring reproducibility of model training runs.
Let’s see how DVC and LakeFS tackle this.
Imagine you’ve trained a model and it’s performing great. You want to be able to reproduce that exact performance later. This means you need not just the model code, but also the exact version of the data it was trained on. Traditional Git handles code versioning, but it’s terrible for large data files. DVC and LakeFS offer solutions.
DVC (Data Version Control)
DVC works by storing metadata about your data files in Git, while the actual data files are stored in a remote storage backend (like S3, GCS, Azure Blob Storage, or even a simple network share).
Here’s a typical DVC workflow:
- Initialize DVC:
    git init
    dvc init
  DVC expects to run inside a Git repository, so run git init first. dvc init then creates a .dvc directory where DVC will store its metadata.
- Add data to DVC:
    dvc add data/raw/images.csv
  This command creates a data/raw/images.csv.dvc file. This .dvc file is small and contains a hash of the actual data file along with its path and size. The hash is what DVC uses to locate the file in its cache and in your configured remote storage. dvc add also lists the data file in data/raw/.gitignore so Git ignores it.
- Commit to Git:
    git add data/raw/images.csv.dvc data/raw/.gitignore
    git commit -m "Add initial dataset"
  You commit the .dvc file to Git. The actual images.csv file is not committed to Git.
- Push data to remote storage:
    dvc push
  This uploads the actual images.csv file to your configured remote storage. A remote must be configured first, for example with dvc remote add.
- Retrieve data: Later, when you clone the Git repository, you’ll have the .dvc files but not the data. To get the data back:
    dvc pull
  DVC will read the .dvc files, check your remote storage, and download the correct versions of your data files.
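The pointer-file idea behind this workflow can be illustrated with a minimal sketch. It is not DVC's implementation: the function names (`track`, `restore`), the `.ptr.json` pointer format, and the cache layout are assumptions made for illustration, loosely modelled on the way DVC keys cached files by their MD5 hash.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small pointer file."""
    md5 = file_md5(data_file)
    cached = cache_dir / md5[:2] / md5[2:]  # hypothetical layout: <cache>/<2 chars>/<rest>
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    meta = {"md5": md5, "size": data_file.stat().st_size, "path": data_file.name}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer  # this small file is what would be committed to Git


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file next to its pointer, verifying the recorded hash."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["md5"][:2] / meta["md5"][2:]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cached, target)
    if file_md5(target) != meta["md5"]:
        raise ValueError(f"hash mismatch for {target}")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace, cache = Path(tmp) / "workspace", Path(tmp) / "cache"
        workspace.mkdir()
        data = workspace / "images.csv"
        data.write_text("id,label\n1,cat\n")
        pointer = track(data, cache)
        data.unlink()  # simulate a fresh clone: pointer present, data missing
        print(restore(pointer, cache).read_text())
```

Here the cache directory stands in for both the local cache and the remote. The point of the sketch is that Git only ever sees the small pointer, while the hash in it is enough to fetch exactly the bytes that were tracked.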
DVC also supports pipelines. You can define a dvc.yaml file that specifies each stage’s dependencies (data and code), command (the script to run), and outputs (such as trained models). Running dvc repro executes the pipeline and re-runs only the stages whose dependencies have changed since the last run, as recorded in dvc.lock, so changing your data automatically triggers the relevant steps and nothing else.
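The "re-run only what changed" behaviour can be sketched as a hash comparison against a lock file. This is a simplified illustration, not DVC's code: the `repro` function, its signature, and the JSON lock format are hypothetical. It also checks dependencies only, whereas a fuller version would track the command and the outputs as well.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def md5_of(path: Path) -> str:
    """Return the MD5 hex digest of a (small) file."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def repro(deps: List[Path], run: Callable[[], None], lock_file: Path) -> bool:
    """Run a stage only if a dependency hash differs from the lock file.

    Returns True when the stage was executed, False when it was skipped.
    """
    current = {str(p): md5_of(p) for p in deps}
    recorded = json.loads(lock_file.read_text()) if lock_file.exists() else None
    if recorded == current:
        return False  # nothing changed since the last successful run
    run()
    # Record the hashes only after the stage succeeded.
    lock_file.write_text(json.dumps(current, indent=2))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "images.csv"
        lock = Path(tmp) / "stage.lock.json"
        data.write_text("id,label\n1,cat\n")
        train = lambda: print("training...")  # stand-in for a real training script
        print(repro([data], train, lock))  # first run executes the stage
        print(repro([data], train, lock))  # unchanged data: stage is skipped
        data.write_text("id,label\n1,dog\n")
        print(repro([data], train, lock))  # changed data: stage runs again
```

Writing the lock file only after `run()` returns means a failed stage is retried on the next call rather than being recorded as up to date.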
LakeFS
LakeFS takes a different approach. It provides a Git-like interface for your data lake (e.g., S3, GCS). Instead of just pointing to files, LakeFS creates a versioned data layer on top of your object storage.
Here’s a basic LakeFS flow:
- Set up LakeFS: You’ll need a running LakeFS server. You then configure your local environment to point to it (for the lakectl CLI, by running lakectl config).
- Create a repository:
    lakectl repo create lakefs://my-data-repo s3://my-lakefs-bucket/my-data-repo
  This creates a new LakeFS repository whose objects are stored under the given storage namespace.
- Ingest data: You can ingest data from existing object storage or upload new data. For example, to import an existing S3 path into LakeFS:
    lakectl import --from s3://my-source-bucket/initial-data/ --to lakefs://my-data-repo/main/initial-data/ -m "Ingest initial data"
  This creates a commit on the main branch of your my-data-repo in LakeFS, effectively versioning that data. Import commands and flags have changed between lakectl releases, so check lakectl import --help for your version.
- Work on a branch: Just like Git, you branch.
    lakectl branch create lakefs://my-data-repo/my-feature-branch --source lakefs://my-data-repo/main
- Make changes: You can then read and write data through the endpoints LakeFS provides (its S3-compatible gateway, or the LakeFS API/SDK directly). For instance, if you’re using Spark, you’d configure it to read from and write to the LakeFS branch.
    # Example using PySpark (simplified).
    # Assumes Spark is configured with the lakeFS Hadoop FileSystem so that
    # lakefs:// paths resolve; with the S3 gateway you would use s3a:// paths instead.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("lakefs://my-data-repo/my-feature-branch/data.parquet")
    # ... perform transformations ...
    df.write.parquet("lakefs://my-data-repo/my-feature-branch/transformed_data.parquet")
- Commit changes:
    lakectl commit lakefs://my-data-repo/my-feature-branch -m "Add transformed data"
- Merge changes:
    lakectl merge lakefs://my-data-repo/my-feature-branch lakefs://my-data-repo/main
  This merges the changes from your feature branch into the main branch, creating a new, versioned snapshot.
LakeFS provides atomic commits and rollbacks for your data. If a training run fails due to bad data, you can revert the entire dataset to a previous, known-good state.
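One way to picture branching, committing, merging, and rolling back is a toy in-memory model in which a branch is just a pointer to an immutable commit. This is a hedged sketch and not how LakeFS is implemented: the `Repo` and `Commit` classes and their methods are hypothetical, the merge has no conflict detection, and deletions are not handled.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    message: str
    snapshot: Dict[str, str]  # path -> object id (e.g. a content hash)
    parent: Optional[Commit] = None


class Repo:
    """Toy model: branches are pointers to commits; writes are staged until committed."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {"main": Commit("initial", {})}
        self.staged: Dict[str, Dict[str, str]] = {"main": {}}

    def branch(self, name: str, source: str) -> None:
        # Zero-copy: the new branch points at the same commit, no data is duplicated.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def write(self, branch: str, path: str, object_id: str) -> None:
        self.staged[branch][path] = object_id

    def read(self, branch: str, path: str) -> Optional[str]:
        committed = self.branches[branch].snapshot.get(path)
        return self.staged[branch].get(path, committed)

    def commit(self, branch: str, message: str) -> Commit:
        head = self.branches[branch]
        new = Commit(message, {**head.snapshot, **self.staged[branch]}, parent=head)
        # Moving the pointer is the atomic step: readers see all changes or none.
        self.branches[branch] = new
        self.staged[branch] = {}
        return new

    def merge(self, source: str, into: str) -> Commit:
        src, dst = self.branches[source], self.branches[into]
        # Simplification: the source branch wins for every path it contains.
        merged = Commit(f"Merge {source} into {into}", {**dst.snapshot, **src.snapshot}, parent=dst)
        self.branches[into] = merged
        return merged

    def revert_to(self, branch: str, good: Commit) -> Commit:
        # Roll back by adding a new commit with the known-good snapshot, keeping history.
        head = self.branches[branch]
        reverted = Commit(f"Revert to '{good.message}'", dict(good.snapshot), parent=head)
        self.branches[branch] = reverted
        return reverted


if __name__ == "__main__":
    repo = Repo()
    repo.write("main", "data.parquet", "obj-v1")
    good = repo.commit("main", "Ingest initial data")
    repo.branch("my-feature-branch", "main")
    repo.write("my-feature-branch", "transformed_data.parquet", "obj-t1")
    repo.commit("my-feature-branch", "Add transformed data")
    repo.merge("my-feature-branch", "main")
    repo.revert_to("main", good)
    print(sorted(repo.branches["main"].snapshot))
```

Because snapshots only map paths to object ids, branching and reverting never copy the underlying data. That is the property that makes whole-dataset rollback cheap in this model.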
The most surprising thing about data versioning is how much it shifts the focus from file integrity to logical data state. You’re not just tracking a specific blob of bytes; you’re tracking a snapshot of a dataset that represents a particular stage in your data’s lifecycle, enabling reproducible experiments.
Both DVC and LakeFS achieve data versioning, but with different philosophies. DVC is an extension of Git that manages data pointers and leverages external storage. LakeFS is a layer over object storage that provides a Git-like experience for data itself. Choose DVC if you want to integrate data versioning into your existing Git workflow and use standard cloud object storage. Opt for LakeFS if you need a robust, transactional data lake with branching, merging, and atomic commits directly on your data.
The next challenge you’ll likely face is managing model versions alongside your data versions for end-to-end experiment tracking.