Git’s internal storage is a simple content-addressable filesystem: every object is stored under a hash of its contents, which lets you reconstruct any committed state of your project exactly.
Let’s see Git in action, not with a conceptual diagram, but with actual files and commands.
First, create a new, empty Git repository:
mkdir my-git-repo
cd my-git-repo
git init
Now, create a file:
echo "Hello, Git!" > hello.txt
Stage and commit this file:
git add hello.txt
git commit -m "Add hello.txt"
If you look inside the .git/objects directory now, you’ll see three new object files (a blob, a tree, and a commit), each in a directory named after the first two characters of its hash. These are the core of Git’s object store. Git doesn’t store files under their names; it stores objects addressed by a hash of their content.
The fundamental object is the blob. A blob is simply the content of a file, with no name or mode attached. When you ran git add hello.txt, Git computed the SHA-1 hash of a short header plus the file’s content and stored the object under that hash. Let’s find that blob. First, we need its hash:
git rev-parse HEAD:hello.txt
This command resolves hello.txt in the tree of the HEAD commit and prints the hash of the blob it refers to. (Passing the same HEAD:hello.txt to git cat-file -p prints the content instead: Hello, Git!.) The hash will look something like b25c213e502e7679671d48a8169256897c50494e. The value shown here is illustrative; yours depends only on the file’s content (assuming the default SHA-1 object format), not on your Git version.
# Replace with your actual hash
ls .git/objects/b2/5c213e502e7679671d48a8169256897c50494e
The first two characters of the hash name the directory, and the remaining 38 name the file. This file holds the zlib-compressed object: a short header followed by the content of hello.txt. You can decompress and view it directly (zlib-flate ships with the qpdf package):
# Replace with your actual hash
zlib-flate -uncompress < .git/objects/b2/5c213e502e7679671d48a8169256897c50494e
This will output blob 12\0Hello, Git!, where 12 is the size of the content in bytes (the 11 visible characters plus the newline that echo appends), followed by a null byte, and then the content itself. The blob prefix indicates the object type.
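The object layout described above can be reproduced outside Git. The following is a minimal Python sketch, not Git's own implementation; the function names (encode_object, object_id, loose_path, decode_object) are hypothetical helpers introduced only for illustration, and it assumes the default SHA-1 object format.

```python
import hashlib
import zlib
from typing import Tuple


def encode_object(obj_type: str, body: bytes) -> bytes:
    """Return the uncompressed object: '<type> <size>', a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return header + body


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over header plus body, as a 40-character hex string."""
    return hashlib.sha1(encode_object(obj_type, body)).hexdigest()


def loose_path(oid: str) -> str:
    """Path of a loose object: two-character directory, 38-character file name."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


def decode_object(compressed: bytes) -> Tuple[str, bytes]:
    """Inflate a loose object and split it into (type, body)."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body


content = b"Hello, Git!\n"  # 11 characters plus the newline that echo appends
oid = object_id("blob", content)
print(loose_path(oid))
print(decode_object(zlib.compress(encode_object("blob", content))))
```

If the sketch is right, the path it prints for your content should match the one git rev-parse HEAD:hello.txt leads you to in your own repository.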
Next up are trees. A tree object represents a directory. It lists the blobs (files) and other trees (subdirectories) contained within it, along with their names, SHA-1 hashes, and file modes.
To find the tree for our commit, start with the commit object itself:
git cat-file -p HEAD
This will show you the commit object, which includes a pointer to its root tree. The output will look something like this:
tree a1b2c3d4e5f678901234567890abcdef12345678
author Your Name <you@example.com> 1678886400 +0000
committer Your Name <you@example.com> 1678886400 +0000

Add hello.txt
Because this is the first commit in the repository, there is no parent line; every later commit will include one.
Now, let’s examine that root tree:
git cat-file -p a1b2c3d4e5f678901234567890abcdef12345678
You’ll see something like this:
100644 blob b25c213e502e7679671d48a8169256897c50494e hello.txt
This tree object tells us that the entry named hello.txt is a regular, non-executable file (mode 100644) whose content is the blob b25c213e502e7679671d48a8169256897c50494e. If you had subdirectories, they would appear as 040000 tree <hash> <directory_name>.
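The listing that git cat-file -p prints is a human-readable rendering; on disk each tree entry is stored as the mode in ASCII, a space, the name, a NUL byte, and the 20 raw bytes of the hash. The Python sketch below shows that encoding under the assumption of SHA-1 object ids; encode_tree, decode_tree, and tree_id are hypothetical names used only here, and the blob hash is the illustrative one from the text.

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, hex object id)


def _sort_key(entry: Entry) -> bytes:
    mode, name, _ = entry
    # Git orders entries by name, comparing directories as if they ended in "/".
    return name.encode("utf-8") + (b"/" if mode == "40000" else b"")


def encode_tree(entries: List[Entry]) -> bytes:
    """Serialize entries as '<mode> <name>' + NUL + 20 raw hash bytes each."""
    body = b""
    for mode, name, oid in sorted(entries, key=_sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00"
        body += bytes.fromhex(oid)
    return body


def decode_tree(body: bytes) -> List[Entry]:
    """Parse a raw tree body back into (mode, name, hex id) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        oid = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, oid))
        pos = nul + 21
    return entries


def tree_id(entries: List[Entry]) -> str:
    """Hash the tree like any other object: header, NUL, body."""
    body = encode_tree(entries)
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()


entries = [("100644", "hello.txt", "b25c213e502e7679671d48a8169256897c50494e")]
print(decode_tree(encode_tree(entries)))
print(tree_id(entries))
```

Note that directory entries are stored with mode 40000 in the raw object; the leading zero in 040000 is added only when Git prints the tree.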
A commit object records a snapshot of your project at a specific point in time. It points to a tree object (the root of the project’s directory structure), lists its parent commits (none for the first commit, one for an ordinary commit, two or more for a merge), and carries author and committer information and a commit message. We already saw how to inspect a commit object with git cat-file -p HEAD. HEAD normally names a branch, and that branch holds the hash of a commit object; in a detached-HEAD state, HEAD holds the commit hash directly.
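Since a commit is just text hashed like any other object, its structure can be sketched in a few lines of Python. This is an illustration under the assumption of SHA-1 object ids, not Git's code; encode_commit and commit_id are hypothetical names, and the tree hash and identity line are the placeholder values used earlier.

```python
import hashlib
from typing import List


def encode_commit(tree: str, parents: List[str], author: str,
                  committer: str, message: str) -> bytes:
    """Build the commit body: header lines, a blank line, then the message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # none for a root commit
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    text = "\n".join(lines) + "\n\n" + message
    if not text.endswith("\n"):
        text += "\n"
    return text.encode("utf-8")


def commit_id(body: bytes) -> str:
    """Hash the commit with the usual '<type> <size>' header and NUL byte."""
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()


tree = "a1b2c3d4e5f678901234567890abcdef12345678"        # placeholder tree hash
ident = "Your Name <you@example.com> 1678886400 +0000"   # placeholder identity

root = encode_commit(tree, [], ident, ident, "Add hello.txt")
child = encode_commit(tree, [commit_id(root)], ident, ident, "Add hello.txt")

print(root.decode("utf-8"))
print(commit_id(root) != commit_id(child))  # the parent line changes the hash
```

Because the parent hash is part of the hashed text, each commit id depends on its entire history, which is why rewriting an old commit changes the ids of all its descendants.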
Finally, tags are names that point at specific objects, almost always commits. They are often used to mark releases (e.g., v1.0). You can create a lightweight tag (just a reference, which is what plain git tag v1.0 produces) or an annotated tag (a full Git object of its own, created with git tag -a).
git tag v1.0
git cat-file -p v1.0
If v1.0 is a lightweight tag, as it is here, the cat-file command will show you the commit object it points to. If it’s an annotated tag, it will show you a tag object, which in turn points to a commit object.
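The difference between the two kinds of tag comes down to what the reference points at. The Python sketch below builds a throwaway, hand-made object store (not a real repository) and classifies two tags by reading the type from each object's header. It assumes loose objects and unpacked refs under refs/tags; the helper names and the second tag name v1.0-annotated are hypothetical.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose(git_dir: str, obj_type: str, body: bytes) -> str:
    """Write a zlib-compressed loose object and return its id."""
    raw = b"%s %d\x00" % (obj_type.encode("ascii"), len(body)) + body
    oid = hashlib.sha1(raw).hexdigest()
    path = os.path.join(git_dir, "objects", oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return oid


def write_tag_ref(git_dir: str, name: str, oid: str) -> None:
    """Create refs/tags/<name> containing the target object id."""
    path = os.path.join(git_dir, "refs", "tags", name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(oid + "\n")


def tag_kind(git_dir: str, name: str) -> str:
    """Return 'annotated' if the ref points at a tag object, else 'lightweight'."""
    with open(os.path.join(git_dir, "refs", "tags", name)) as f:
        oid = f.read().strip()
    with open(os.path.join(git_dir, "objects", oid[:2], oid[2:]), "rb") as f:
        raw = zlib.decompress(f.read())
    obj_type = raw.split(b" ", 1)[0].decode("ascii")
    return "annotated" if obj_type == "tag" else "lightweight"


with tempfile.TemporaryDirectory() as git_dir:
    commit = write_loose(git_dir, "commit", b"placeholder commit body\n")
    tag_body = (f"object {commit}\ntype commit\ntag v1.0-annotated\n"
                "tagger Your Name <you@example.com> 1678886400 +0000\n\n"
                "Release 1.0\n").encode("utf-8")
    tag = write_loose(git_dir, "tag", tag_body)
    write_tag_ref(git_dir, "v1.0", commit)         # lightweight: ref -> commit
    write_tag_ref(git_dir, "v1.0-annotated", tag)  # annotated: ref -> tag object
    kinds = {name: tag_kind(git_dir, name) for name in ("v1.0", "v1.0-annotated")}

print(kinds)
```

In a real repository the same check is a one-liner: git cat-file -t v1.0 prints commit for a lightweight tag and tag for an annotated one.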
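The beauty of this system is that Git only stores unique content. If a file is unchanged between two commits, or two files in the project have identical content, they hash to the same value, so Git reuses the existing blob instead of storing a new one. The same holds for trees: a directory whose contents have not changed is represented by the same tree object in every commit. This keeps the cost of storing history low.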
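The deduplication falls out of content addressing itself, as this toy in-memory store suggests. It is a sketch for illustration only; ObjectStore and put_blob are hypothetical names, and a Python dict stands in for the .git/objects directory.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store keyed by Git-style blob ids."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, content: bytes) -> str:
        data = b"blob %d\x00" % len(content) + content
        oid = hashlib.sha1(data).hexdigest()
        # Identical content always produces the same key, so nothing new is stored.
        self.objects.setdefault(oid, data)
        return oid


store = ObjectStore()
a = store.put_blob(b"Hello, Git!\n")
b = store.put_blob(b"Hello, Git!\n")  # same content: another file, or a later commit
c = store.put_blob(b"Hello, Git?\n")  # one byte differs, so this is a new object

print(a == b, a == c, len(store.objects))  # True False 2
```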
Git’s object database maps each object hash to a physical location on disk. Newly created objects are written as individual “loose” files, which is what we inspected above. As a repository grows, Git packs many objects into a single packfile, stores similar objects as deltas against one another, and writes an index alongside it for lookup by hash. This significantly reduces disk space and improves performance. The git gc command runs this packing process, and Git also triggers it automatically from time to time.
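To see which form a repository's objects are currently in, you can count loose object files and packfiles under .git/objects, which is roughly what git count-objects -v reports. The Python sketch below assumes the standard layout (two-character fan-out directories plus a pack directory); count_objects is a hypothetical helper name.

```python
import os

HEX_DIGITS = set("0123456789abcdef")


def count_objects(git_dir: str):
    """Return (loose_object_count, packfile_count) for a .git directory."""
    objects_dir = os.path.join(git_dir, "objects")
    loose = 0
    for entry in os.listdir(objects_dir):
        # Loose objects live in directories named by the first two hex digits.
        if len(entry) == 2 and set(entry) <= HEX_DIGITS:
            loose += len(os.listdir(os.path.join(objects_dir, entry)))
    pack_dir = os.path.join(objects_dir, "pack")
    packs = 0
    if os.path.isdir(pack_dir):
        packs = sum(1 for name in os.listdir(pack_dir) if name.endswith(".pack"))
    return loose, packs


# Run from the top of a repository to inspect it; does nothing elsewhere.
if os.path.isdir(os.path.join(".git", "objects")):
    print(count_objects(".git"))
```

Running it before and after git gc in a repository with some history should show the loose count dropping and a packfile appearing.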
The next concept to grapple with is how Git uses these objects to track changes over time, which leads directly into understanding the staging area and the diffing algorithms.