Git’s sparse-checkout and filter-branch (or its successor, git filter-repo) are powerful tools for managing large monorepos, but they operate on fundamentally different principles and solve distinct problems. It’s easy to conflate them, but understanding their core differences is key to using them effectively.
git sparse-checkout is about selective checkout of files and directories. When you have a massive monorepo, checking out every single file can take ages and consume vast amounts of disk space. Sparse checkout lets you tell Git, "I only care about this subdirectory," and Git will only populate your working directory with those specified paths. It’s like telling your Git client to ignore everything else for now.
Consider a monorepo with services/frontend/, services/backend/, services/database/, and shared/libs/. If you’re a frontend developer, you might only need services/frontend/ and shared/libs/.
Here’s how you’d set that up:
First, enable sparse checkout:
git sparse-checkout init --cone
Then, define the directories you want:
git sparse-checkout set services/frontend/ shared/libs/
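Now your working directory contains only files from services/frontend/ and shared/libs/ (plus any files that sit directly in the repository root and in services/ and shared/), and git status reports changes only within what is checked out. Nothing is removed from the repository itself: every commit still records the full tree and the complete history is still in your local object database. Only your local filesystem is narrowed.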
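To see the effect end to end, here is a minimal sketch that builds a throwaway repository with the example layout and then narrows it. The temporary path, the placeholder README.md files, and the commit identity are all assumptions made for illustration, and it assumes Git 2.25 or newer.

```shell
#!/bin/sh
# Sketch: build a throwaway monorepo and narrow its working tree with sparse checkout.
# Paths, file names, and the commit identity are illustrative only.
set -eu

repo="$(mktemp -d)/monorepo"
git init -q "$repo"
cd "$repo"

# Create one placeholder file per area of the example layout.
for dir in services/frontend services/backend services/database shared/libs; do
    mkdir -p "$dir"
    echo "placeholder" > "$dir/README.md"
done
git add .
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Initial layout"

# Restrict the working tree to the two directories a frontend developer needs.
git sparse-checkout init --cone
git sparse-checkout set services/frontend shared/libs

# Compare what is on disk with what the commit still records.
echo "On disk:"
find . -name README.md ! -path "./.git/*" | sort
echo "In HEAD:"
git ls-tree -r --name-only HEAD
```

The "In HEAD" listing should still show all four directories, which is the point: the commit is unchanged and only the working directory shrinks.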
The "cone" mode is the most efficient and recommended for large repos. You specify directories, and Git includes everything beneath them recursively, along with the files that sit directly in the repository root and in each parent directory on the way down. If you need more granular control, you can disable cone mode (git sparse-checkout init --no-cone) and supply .gitignore-style patterns that can match individual files, but this is generally less performant.
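As a rough illustration of the non-cone alternative, the sketch below selects a single file and one directory using .gitignore-style patterns. The repository layout and file names are made up for the example; the leading slash anchors each pattern at the repository root.

```shell
#!/bin/sh
# Sketch: non-cone sparse checkout with .gitignore-style patterns.
# The layout and file names are hypothetical.
set -eu

repo="$(mktemp -d)/monorepo"
git init -q "$repo"
cd "$repo"

mkdir -p services/frontend shared/libs
echo "console.log('app')" > services/frontend/app.js
echo "frontend docs"      > services/frontend/README.md
echo "export {}"          > shared/libs/util.js
git add .
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Initial layout"

# Turn on sparse checkout without cone mode, then give patterns:
# one individual file and one whole directory.
git sparse-checkout init --no-cone
git sparse-checkout set '/services/frontend/README.md' '/shared/libs/'

# Show which files remain in the working tree.
find . -type f ! -path "./.git/*" | sort
```

Here services/frontend/app.js should be absent from disk while its sibling README.md stays, which cone mode cannot express because it works on whole directories.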
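Under the hood, sparse checkout is driven by Git’s built-in core.sparseCheckout configuration (and core.sparseCheckoutCone for cone mode). The git sparse-checkout command sets these for the repository you run it in and stores the selected paths in the info/sparse-checkout file inside the Git directory, so you rarely need to edit the configuration by hand.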
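If you want to check what sparse checkout has configured, a sketch along these lines inspects the settings in a throwaway repository. The repository contents are hypothetical; the config keys and the info/sparse-checkout file are the ones Git itself uses.

```shell
#!/bin/sh
# Sketch: inspect the configuration behind a sparse checkout.
# The repository contents are illustrative only.
set -eu

repo="$(mktemp -d)/monorepo"
git init -q "$repo"
cd "$repo"

mkdir -p services/frontend services/backend
echo a > services/frontend/a.txt
echo b > services/backend/b.txt
git add .
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Initial layout"

git sparse-checkout init --cone
git sparse-checkout set services/frontend

# The two settings that the sparse-checkout command manages.
echo "core.sparseCheckout:     $(git config --get core.sparseCheckout)"
echo "core.sparseCheckoutCone: $(git config --get core.sparseCheckoutCone)"

# The directories currently selected.
git sparse-checkout list

# The raw pattern file Git reads when populating the working tree.
cat "$(git rev-parse --git-path info/sparse-checkout)"
```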
On the other hand, git filter-branch (and git filter-repo) is about rewriting history. This is a much more destructive operation. If your monorepo has grown to an unmanageable size because someone accidentally committed large binary files or sensitive data, or if you need to completely remove a directory from all past commits, then filtering is what you need. It’s like performing surgery on your Git history.
Let’s say you want to remove an old, large assets/ directory that was accidentally committed early on and is now bloating your repository.
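Before rewriting anything, it can help to confirm which paths are actually responsible for the bloat. The sketch below is one common way to list the largest blobs anywhere in history. It builds a small demo repository (the file names and the 200 kB size are made up) in which assets/ has already been deleted from the latest commit but still lives in history.

```shell
#!/bin/sh
# Sketch: find the largest blobs across all history.
# The demo repository, file names, and sizes are hypothetical.
set -eu

repo="$(mktemp -d)/bloated"
git init -q "$repo"
cd "$repo"
commit() {
    git -c user.name="Example" -c user.email="example@example.com" \
        commit -q -m "$1"
}

mkdir assets src
dd if=/dev/zero of=assets/big.bin bs=1000 count=200 2>/dev/null
echo "print('hi')" > src/main.py
git add .
commit "Add code and assets"

# Deleting the directory in a new commit does not remove it from history.
git rm -q -r assets
commit "Remove assets from the tip"

# Walk every reachable object, keep blobs, and sort by size (bytes, then path).
largest="$(git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |
    cut -d' ' -f2- |
    sort -rn |
    head -n 5)"
echo "$largest"
```

In a real repository you would run only the pipeline at the end. A path that tops this list but no longer exists in the working tree is a typical candidate for history filtering.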
Using git filter-repo (which is recommended over git filter-branch due to performance and safety):
First, install git-filter-repo:
pip install git-filter-repo
Then, run the command to remove the directory:
git filter-repo --path assets/ --invert-paths
This command tells git filter-repo to keep everything except the assets/ directory. The --invert-paths flag is crucial here; without it, you’d be telling Git to only keep assets/.
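To try the command safely, here is a sketch that runs it against a throwaway repository and then checks that no trace of assets/ remains. The repository contents are hypothetical. The --force flag appears only because this demo repository is created with git init rather than cloned, and git filter-repo normally refuses to run outside a fresh clone. The script skips the demo if git-filter-repo is not installed.

```shell
#!/bin/sh
# Sketch: remove assets/ from all history in a disposable repository.
# Repository contents are illustrative; --force is needed only because
# this repo is not a fresh clone.
set -eu

if ! command -v git-filter-repo >/dev/null 2>&1; then
    echo "git-filter-repo is not installed; skipping the demo"
else
    repo="$(mktemp -d)/bloated"
    git init -q "$repo"
    cd "$repo"
    commit() {
        git -c user.name="Example" -c user.email="example@example.com" \
            commit -q -m "$1"
    }

    mkdir assets src
    dd if=/dev/zero of=assets/big.bin bs=1000 count=200 2>/dev/null
    echo "print('hi')" > src/main.py
    git add .
    commit "Add code and assets"
    git rm -q -r assets
    commit "Remove assets from the tip"

    before="$(git rev-parse HEAD)"
    git filter-repo --force --path assets/ --invert-paths
    after="$(git rev-parse HEAD)"

    # Any object still reachable under assets/ would be listed here.
    leftovers="$(git rev-list --objects --all | grep 'assets/' || true)"
    echo "HEAD before: $before"
    echo "HEAD after:  $after"
    echo "Remaining assets/ objects: ${leftovers:-none}"
fi
```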
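After running filter-repo, your history will be rewritten. Every commit that contained the removed paths, and every commit descended from one, gets a new SHA. filter-repo also removes the origin remote, so you have to add it back deliberately before pushing. If you’ve already pushed the old history, you’ll need to force-push the rewritten branches, which is disruptive for collaborators: their existing clones are based on commits that no longer exist upstream, so they should re-clone or carefully rebase their work.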
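The push behaviour can be illustrated without touching a real remote. The sketch below uses a local bare repository as a stand-in for the server and git commit --amend as a stand-in for a history rewrite (an assumption made to keep the demo free of extra dependencies). The branch name main and all paths are hypothetical.

```shell
#!/bin/sh
# Sketch: a rewritten commit has a new SHA, a plain push is rejected,
# and a forced push replaces the remote branch.
# "commit --amend" stands in for a real history rewrite here.
set -eu

work="$(mktemp -d)"
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
ident="-c user.name=Example -c user.email=example@example.com"

echo v1 > file.txt
git add .
git $ident commit -q -m "First"
git remote add origin "$work/remote.git"
git push -q origin HEAD:refs/heads/main

# Rewrite the commit; its SHA changes even though the content is the same.
old="$(git rev-parse HEAD)"
git $ident commit -q --amend -m "First (rewritten)"
new="$(git rev-parse HEAD)"

# A normal push fails because the remote branch is not an ancestor of HEAD.
if git push -q origin HEAD:refs/heads/main 2>/dev/null; then
    plain="accepted"
else
    plain="rejected"
fi

# Forcing the push overwrites the remote branch with the rewritten history.
git push -q --force origin HEAD:refs/heads/main
remote_sha="$(git --git-dir="$work/remote.git" rev-parse refs/heads/main)"

echo "old SHA:    $old"
echo "new SHA:    $new"
echo "plain push: $plain"
echo "remote now: $remote_sha"
```

On a shared repository, coordinate with collaborators before forcing a push, since anyone who pulled the old commits will be affected.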
filter-repo is designed to be robust, even around submodules and other complex Git features, but it is still a powerful tool that should be used with extreme caution. By default it refuses to run anywhere other than a fresh clone, which nudges you toward keeping the original repository intact. Always back up your repository before attempting a history rewrite.
The key difference boils down to this: sparse checkout changes what you see and work with locally, while history filtering changes the actual history of the repository itself. You can use sparse checkout to work on a subset of a repo without ever touching its entire history. You use history filtering when the entire history is the problem.
The next problem you’ll likely encounter is figuring out how to re-clone a repository after a history rewrite without fetching all the old, massive objects.