File Systems: Beyond the Basics

The most surprising thing about modern Linux filesystems is that they don’t store files the way you probably think they do: a file’s name, its metadata, and its contents each live in a different place.

Let’s see one in action. Imagine we have a simple directory structure and some files.

$ sudo mkdir -p /mnt/myfs
$ sudo mkfs.ext4 /dev/loop0 # (assuming /dev/loop0 is set up with a backing file)
$ sudo mount /dev/loop0 /mnt/myfs
$ sudo chown "$USER" /mnt/myfs # a fresh filesystem's root directory is owned by root
$ echo "hello" > /mnt/myfs/file1.txt
$ echo "world" > /mnt/myfs/file2.txt
$ mkdir /mnt/myfs/subdir
$ echo "nested" > /mnt/myfs/subdir/file3.txt

When you run ls -l /mnt/myfs, you’re not seeing a direct mapping of filenames to disk blocks. Instead, you’re seeing an abstraction. The filesystem maintains internal data structures (inodes and directory entries) that point to the actual data blocks. A directory is essentially a special file holding a list of directory entries, each of which maps a human-readable name to an inode number (ls -i shows these numbers). The inode, in turn, contains metadata about the file (permissions, timestamps, owner, size) and pointers to the data blocks on disk. Notably, the inode does not store the file’s name.
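The name-to-inode mapping is visible from user space. Here is a minimal sketch using Python’s standard os.scandir and os.stat. It works in a temporary directory rather than /mnt/myfs so that it runs without root, and the helper name list_dirents is purely illustrative.

```python
import os
import tempfile


def list_dirents(path):
    """Return {name: inode number} for the entries of a directory."""
    with os.scandir(path) as entries:
        return {entry.name: entry.inode() for entry in entries}


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "file1.txt"), "w") as f:
        f.write("hello\n")
    os.mkdir(os.path.join(root, "subdir"))

    dirents = list_dirents(root)
    for name, ino in sorted(dirents.items()):
        st = os.stat(os.path.join(root, name))
        # Mode and size are read from the inode; the directory entry
        # itself only contributes the name and the inode number.
        print(f"{name!r} -> inode {ino}, mode {oct(st.st_mode)}, size {st.st_size}")
```

The inode numbers you get will differ from machine to machine; the point is only that each name resolves to a number, and everything else is looked up through that number.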

This separation of names (directory entries), metadata (inode), and data is a core concept. It allows for flexibility. For example, hard links are just multiple directory entries pointing to the same inode. Deleting a file doesn’t immediately erase data; it removes one directory entry and decrements the link count in the inode. Only when the link count reaches zero, and no process still has the file open, are the inode and its associated data blocks freed for reuse.
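The link-count behaviour is easy to check for yourself. The sketch below uses os.link, os.unlink, and os.stat from the standard library in a temporary directory, and assumes that directory sits on a filesystem that supports hard links (true of the usual Linux filesystems). The file names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    original = os.path.join(root, "file1.txt")
    alias = os.path.join(root, "alias.txt")

    with open(original, "w") as f:
        f.write("hello\n")

    os.link(original, alias)  # second directory entry for the same inode
    st_original, st_alias = os.stat(original), os.stat(alias)
    same_inode = st_original.st_ino == st_alias.st_ino
    links_after_link = st_original.st_nlink

    os.unlink(original)  # removes one name; the inode and data survive
    links_after_unlink = os.stat(alias).st_nlink
    with open(alias) as f:
        surviving_data = f.read()

print(same_inode, links_after_link, links_after_unlink, repr(surviving_data))
```

Both names report the same inode number, the link count goes up when the second name is created and back down when the first is removed, and the contents remain readable through the remaining name.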

The problem these filesystems solve is managing the complexity of storing and retrieving vast amounts of data efficiently and reliably on block-based storage devices. They provide a hierarchical namespace (directories), handle data allocation, track free space, and protect data integrity. Depending on the filesystem, they also offer features such as journaling, snapshots, and compression.

ext4 is the venerable workhorse, a direct descendant of ext2 and ext3. Compared with ext3, it introduced features like extents (a single record describes a contiguous run of data blocks, which improves performance for large files), persistent pre-allocation, and faster fsck times. Its journaling mechanism, inherited from ext3, is a form of write-ahead logging: metadata changes are recorded in the journal before being applied in place, so after a crash the filesystem can replay the journal instead of checking the whole disk.
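To make the write-ahead idea concrete, here is a toy model in Python. It is not ext4’s actual journal format or code; the class ToyJournal, its methods, and the metadata keys are all invented for illustration. It only shows the ordering rule: log first, commit, and on recovery redo only what was committed.

```python
class ToyJournal:
    """Toy write-ahead log: record updates, mark them committed, replay on recovery."""

    def __init__(self):
        self.log = []       # journal records, in the order they were written
        self.metadata = {}  # stands in for the in-place metadata on disk

    def begin(self, txid, updates):
        # Describe the intended updates in the journal before touching metadata.
        self.log.append(("updates", txid, dict(updates)))

    def commit(self, txid):
        # A commit record marks the transaction as complete.
        self.log.append(("commit", txid, None))

    def replay(self):
        # Crash recovery: redo committed transactions, ignore incomplete ones.
        committed = {txid for kind, txid, _ in self.log if kind == "commit"}
        for kind, txid, updates in self.log:
            if kind == "updates" and txid in committed:
                self.metadata.update(updates)


journal = ToyJournal()
journal.begin(1, {"inode 12 size": 6, "block 40": "in use"})
journal.commit(1)
journal.begin(2, {"inode 13 size": 6})  # simulated crash: no commit record
journal.replay()
print(journal.metadata)
```

Transaction 2 never reached its commit record, so replay skips it and the metadata reflects only transaction 1. That all-or-nothing property is what lets recovery avoid a full consistency scan.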

XFS is known for its high performance, especially with large files and parallel I/O. It divides the filesystem into allocation groups that can be managed largely independently, and it uses B+ trees for almost all of its internal bookkeeping, including free-space tracking, inode allocation, and large directories. This aggressive use of trees allows for efficient allocation and searching, making it a popular choice for servers and high-performance computing. XFS also features delayed allocation, where the filesystem defers the decision of where to place data blocks until the data is about to be flushed to disk, allowing for better allocation strategies and reduced fragmentation.
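Why does waiting help? The following toy comparison in Python is a sketch of the idea only, not of XFS’s allocator. The function names and the "next free block" model are assumptions made for the example. Two files are written in interleaved one-block chunks, once with blocks assigned on arrival and once with assignment deferred until everything is buffered.

```python
def allocate_immediately(writes):
    """Assign the next free block to each write as it arrives."""
    layout, next_free = {}, 0
    for name, nblocks in writes:
        for _ in range(nblocks):
            layout.setdefault(name, []).append(next_free)
            next_free += 1
    return layout


def allocate_delayed(writes):
    """Buffer writes per file, then give each file one contiguous run at flush time."""
    pending = {}
    for name, nblocks in writes:
        pending[name] = pending.get(name, 0) + nblocks
    layout, next_free = {}, 0
    for name, total in pending.items():
        layout[name] = list(range(next_free, next_free + total))
        next_free += total
    return layout


def extent_count(blocks):
    """Number of contiguous runs in a non-empty, ascending list of block numbers."""
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)


writes = [("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1), ("b", 1)]
eager = allocate_immediately(writes)
lazy = allocate_delayed(writes)
print("immediate:", {name: extent_count(blocks) for name, blocks in eager.items()})
print("delayed:  ", {name: extent_count(blocks) for name, blocks in lazy.items()})
```

With immediate allocation each file ends up scattered across three separate runs; with delayed allocation each file gets a single contiguous run, because the allocator knows the full size before it picks a location.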

Btrfs is the modern challenger, aiming to provide advanced features like copy-on-write (CoW), snapshots, built-in RAID, transparent compression, and subvolumes. CoW is a fundamental difference: when data is modified, Btrfs writes the new data to a new location on disk and then updates the metadata pointers to point to the new location, leaving the old data untouched until it’s no longer referenced. This enables efficient snapshots (which are just pointers to a specific CoW tree state) and helps with data integrity, because an interrupted write never leaves a half-overwritten block in place of the old version.

When you modify a file on a CoW filesystem like Btrfs, the affected block is written to a new location rather than overwritten in place, and the tree nodes that point to it are rewritten in the same way, up to the root. The old block remains until all references to it are gone. This is why snapshots are so space-efficient: a snapshot is simply another reference to the filesystem tree as it was at a specific point in time, sharing every unchanged block with the live filesystem. (Btrfs snapshots are writable by default; btrfs subvolume snapshot -r creates a read-only one.) Writes made after the snapshot go to new locations, leaving the snapshot’s data undisturbed.
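The pointer-swapping described above can be sketched with a small in-memory tree. This is a conceptual model in Python, not Btrfs’s on-disk B-tree; the Node class and the cow_write and read_path helpers are hypothetical names chosen for the example. A write copies only the nodes on the path from the root to the changed leaf, and a snapshot is nothing more than a saved reference to an old root.

```python
class Node:
    """Tree node treated as immutable: holds named children and/or a data block."""

    def __init__(self, children=None, data=None):
        self.children = dict(children or {})
        self.data = data


def cow_write(root, path, data):
    """Return a new root with `path` set to `data`; the old tree is left untouched."""
    if not path:
        return Node(data=data)  # the new block, written "somewhere else"
    head, rest = path[0], path[1:]
    child = root.children.get(head, Node())
    new_children = dict(root.children)  # siblings are shared, not copied
    new_children[head] = cow_write(child, rest, data)
    return Node(children=new_children, data=root.data)


def read_path(root, path):
    """Follow `path` from `root` and return the data stored at the end."""
    node = root
    for part in path:
        node = node.children[part]
    return node.data


live = Node()
live = cow_write(live, ["subdir", "file3.txt"], "nested")
live = cow_write(live, ["file1.txt"], "hello")

snapshot = live  # taking a snapshot: keep a reference to the current root
live = cow_write(live, ["subdir", "file3.txt"], "changed")

print(read_path(snapshot, ["subdir", "file3.txt"]))
print(read_path(live, ["subdir", "file3.txt"]))
print(snapshot.children["file1.txt"] is live.children["file1.txt"])
```

The snapshot still reads the old contents of file3.txt, the live tree reads the new contents, and the untouched file1.txt node is the very same object in both trees. That sharing is where the space efficiency comes from.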

The next concept to explore is how these filesystems handle data integrity and recovery in the face of hardware failures.