Git’s ability to track changes over time is what makes it so powerful, but that history can also become a performance bottleneck if not managed.
Let’s see Git’s maintenance in action. Imagine you’ve been working on a project, frequently committing small changes, maybe rebasing often, and perhaps even deleting and recreating branches. This activity, while normal, can lead to a fragmented repository.
# First, let's look at the current state before any maintenance
git count-objects -vH
# This might show a large number of loose objects ("count") and several packfiles ("packs")
# Now, let's perform a basic garbage collection
git gc
# After gc, let's check again
git count-objects -vH
You’ll likely observe a reduction in loose objects and a more consolidated packfile. This is Git reorganizing its internal storage to be more efficient.
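To make the before/after comparison reproducible, here is a minimal sketch that builds a throwaway repository and reads the "count" and "in-pack" fields from git count-objects around a git gc run. The temporary directory, the file name notes.txt, and the field helper are illustrative assumptions, not part of any real project.

```shell
#!/bin/sh
# Throwaway demo: every path and name below is illustrative.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
git config commit.gpgsign false
git config gc.auto 0   # keep automatic gc from interfering with the demo

# Helper: extract one field (e.g. "count" or "in-pack") from `git count-objects -v`.
field() {
    git count-objects -v | awk -v key="$1:" '$1 == key { print $2 }'
}

# A handful of small commits; each adds a loose blob, tree, and commit object.
for i in 1 2 3 4 5; do
    echo "change $i" >> notes.txt
    git add notes.txt
    git commit -q -m "Small change $i"
done

loose_before=$(field count)
packed_before=$(field in-pack)

git gc -q

loose_after=$(field count)
packed_after=$(field in-pack)

echo "loose objects:  $loose_before -> $loose_after"
echo "packed objects: $packed_before -> $packed_after"
```

On a real repository the numbers will differ, but the direction should be the same: the loose count drops and the packed count rises.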
The core problem Git maintenance solves is repository bloat and fragmentation. Git initially stores every new object (commit, tree, blob) as an individual file. When you delete branches, amend commits, or perform other history-rewriting operations, older versions of these objects linger until they are pruned. This leads to:
- Increased disk space usage: More objects mean more storage.
- Slower operations: Git has to sift through more data to find what it needs for commands like git log, git blame, or even fetching.
- Larger network transfers: When cloning or fetching, Git sends compressed history, and a bloated history means larger downloads.
git gc (garbage collection) is the primary command for this. It performs several crucial tasks:
- Consolidates loose objects into packfiles: Git stores new objects initially as "loose objects." git gc gathers these loose objects and compresses them into efficient "packfiles," which are single files containing many objects. This reduces the number of files Git needs to manage.
- Optimizes packfiles: It can repack existing packfiles, identifying redundant data and applying delta compression. Delta compression stores objects as differences (deltas) from a base object, saving significant space.
- Removes unreachable objects: Objects that are no longer referenced by any branch, tag, reflog entry, or other reachable commit become candidates for deletion. git gc purges them once they are older than the prune grace period (two weeks by default).
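The "unreachable" part is easy to misread, because the reflog also counts as a reference. The sketch below, run in a throwaway repository with an illustrative file name (draft.txt), amends a commit and then checks whether the replaced commit survives git gc before and after the reflog is expired. It is a demonstration under those assumptions, not a recommended routine for a shared repository.

```shell
#!/bin/sh
# Throwaway demo: every path and name below is illustrative.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
git config commit.gpgsign false
git config gc.auto 0

echo "first draft" > draft.txt
git add draft.txt
git commit -q -m "First draft"
old_commit=$(git rev-parse HEAD)

# Amending replaces the commit; the original is now referenced only by the reflog.
echo "second draft" > draft.txt
git add draft.txt
git commit -q --amend -m "Second draft"

# Report whether an object still exists in the object database.
present() {
    if git cat-file -e "$1" 2>/dev/null; then echo yes; else echo no; fi
}

# gc keeps the old commit here because a reflog entry still points at it.
git gc -q --prune=now
kept_by_reflog=$(present "$old_commit")

# Expiring the reflog makes the old commit truly unreachable, so gc can drop it.
git reflog expire --expire=now --all
git gc -q --prune=now
still_present=$(present "$old_commit")

echo "after first gc, old commit present: $kept_by_reflog"
echo "after reflog expiry and gc, old commit present: $still_present"
```

Expiring the reflog discards the safety net that lets you recover amended or reset commits, so it is best reserved for cases where the old objects really must go.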
Common Causes of Repository Bloat and Their Fixes:
- Excessive loose objects: This happens after many commits without a gc run.
  - Diagnosis: git count-objects -vH will show a high number of loose objects (the "count" field).
  - Fix: Run git gc. This command will repack these loose objects into a packfile.
  - Why it works: Packfiles are a more efficient storage format than individual loose objects, reducing file I/O.
- Large binary files committed unintentionally: Even after deletion, Git retains history.
  - Diagnosis: List the 20 largest blobs in history with the following pipeline, or use a tool like git-sizer:
    git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | awk '$1 == "blob"' | sort -k3 -n -r | head -n 20
  - Fix: Use git filter-repo (preferred over BFG or filter-branch) to rewrite history and remove the large files. For example, to remove a file named large_binary.zip:
    git filter-repo --path large_binary.zip --invert-paths
    Then, run git gc --aggressive --prune=now.
  - Why it works: filter-repo rewrites commit history, and gc cleans up the now-unreachable old objects containing the large file.
- Numerous small, frequent commits without consolidation: While good for granular tracking, too many can fragment the object database over time.
  - Diagnosis: git log --oneline --graph --decorate might show a very dense history, and git count-objects -vH might show a high loose-object count. To inspect delta chains, run git verify-pack -v on a pack index file under .git/objects/pack/.
  - Fix: Periodically use git rebase -i to squash commits that have not been shared yet. Follow this with git gc.
  - Why it works: Squashing combines multiple commits into one, reducing the number of objects and simplifying the history graph. The replaced commits are only deleted once they are unreachable and past the prune grace period.
- Stale .git/objects/pack files: Sometimes, older, less efficient packfiles can be left behind.
  - Diagnosis: git count-objects -vH reports the number of packfiles ("packs") and the number of loose objects that also exist in a pack ("prune-packable"). If you suspect issues, examine the contents of .git/objects/pack/.
  - Fix: Run git gc --prune=now. It consolidates existing packs into a new one, deletes the redundant old packs, and removes unreachable objects regardless of their age.
  - Why it works: This ensures Git is only using the most optimized packfiles and removing cruft.
- Accidental commits of large files that were later removed: Similar to large binary files, but might be transient.
  - Diagnosis: Find the 10 largest objects with:
    git rev-list --all --objects | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -nr | head -n 10
  - Fix: Use git filter-repo to strip the large file from every commit that contains it, then run git gc --aggressive --prune=now.
  - Why it works: Rewriting history removes every reference to the large file, and gc cleans up the associated objects.
- Forgetting to run git gc regularly: Many developers don’t realize gc isn’t always run automatically or frequently enough.
  - Diagnosis: git count-objects -vH shows a large number of loose objects or many packfiles.
  - Fix: Schedule git gc to run periodically, perhaps via a Git hook or a cron job (e.g., git gc --auto, which runs only if certain conditions are met, or git gc for a full run).
  - Why it works: Regular garbage collection keeps the repository in an optimized state, preventing significant performance degradation.
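The large-object diagnoses in the list above can be tried safely on a throwaway repository first. The following sketch commits a 1 MiB placeholder named large_binary.zip, deletes it in a later commit, and then shows that the blob is still found in history. The repository, the README.txt file, and the zero-filled stand-in for a real binary are assumptions made for illustration.

```shell
#!/bin/sh
# Throwaway demo: every path and name below is illustrative.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
git config commit.gpgsign false

# A small text file plus 1 MiB of zeros standing in for a real binary.
echo "hello" > README.txt
head -c 1048576 /dev/zero > large_binary.zip
git add README.txt large_binary.zip
git commit -q -m "Add files"

# Deleting the file in a new commit does not remove it from history.
git rm -q large_binary.zip
git commit -q -m "Remove large binary"

# List every object in history, keep only blobs, and sort by size (largest first).
largest=$(git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |
    sort -k3 -n -r |
    head -n 1)

largest_size=$(echo "$largest" | awk '{ print $3 }')
largest_path=$(echo "$largest" | awk '{ print $4 }')

echo "largest blob in history: $largest_path ($largest_size bytes)"
```

The %(rest) placeholder carries through the path that git rev-list --objects prints after each object ID, which is what lets the output name the offending file.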
The git gc command, especially with --aggressive and --prune=now, can be quite resource-intensive. It’s often best run during off-peak hours or on a repository that isn’t actively being worked on by many people simultaneously.
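One way to keep a scheduled run gentle is to skip it when there is little to do and to cap the resources used for repacking. The sketch below defines a hypothetical helper, gc_if_needed, that runs git gc only when the loose-object count exceeds a threshold, and limits packing with the pack.threads and pack.windowMemory settings. The helper name, the threshold values, and the 256m memory cap are illustrative assumptions to tune for your own machine.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: run gc on a repository only when it holds more than
# a given number of loose objects, using capped CPU and memory.
gc_if_needed() {
    repo_path=$1
    threshold=$2
    loose=$(git -C "$repo_path" count-objects -v | awk '$1 == "count:" { print $2 }')
    if [ "$loose" -gt "$threshold" ]; then
        # pack.threads and pack.windowMemory limit resource use while repacking.
        git -C "$repo_path" -c pack.threads=1 -c pack.windowMemory=256m gc --quiet
        echo "ran gc ($loose loose objects)"
    else
        echo "skipped gc ($loose loose objects)"
    fi
}

# Demo on a throwaway repository (illustrative paths and names).
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"
git -C "$repo" config commit.gpgsign false
echo "data" > "$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" commit -q -m "Add file"

first=$(gc_if_needed "$repo" 100)   # below the threshold
second=$(gc_if_needed "$repo" 0)    # above the threshold
echo "$first"
echo "$second"
loose_left=$(git -C "$repo" count-objects -v | awk '$1 == "count:" { print $2 }')
```

git gc --auto performs a similar check internally, based on the gc.auto and gc.autoPackLimit settings, so a wrapper like this is mainly useful when you want the decision and the resource limits under your own control.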
After performing significant history rewriting with git filter-repo or similar tools, you’ll often need to run git gc --aggressive --prune=now to fully clean up the old, unreachable objects that were part of the previous history.
The next challenge you’ll likely face is understanding how to configure the garbage collection behavior to run automatically and efficiently for your team’s workflow.