Fix Linux mdraid Degraded Array Errors (2026)

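A degraded md RAID array is still running but has lost redundancy: at least one member device has failed or gone missing. Until the array is rebuilt, another drive failure can cause data loss, and mdadm monitoring will keep sending degraded-array notifications.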

Common Causes and Fixes

1. Drive Failure Detected by mdadm

Diagnosis:
```
sudo mdadm --detail /dev/md0
```
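Look for a member device whose state is faulty or removed, and an array State of clean, degraded. In /proc/mdstat the same failure appears as (F) after the device name and as an underscore in the status string, for example [UU_U].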
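With several arrays, the member counts in /proc/mdstat can be checked in one pass. The following is a minimal sketch, not part of mdadm: the function name list_degraded_arrays is hypothetical, and it assumes the usual [total/active] field (for example [4/3]) on the line after each array name.

```shell
# Print the name of every array whose active member count is below the
# expected total. Optional argument: an mdstat-format file (defaults to
# /proc/mdstat). list_degraded_arrays is a hypothetical helper name.
list_degraded_arrays() {
    awk '
        /^md[0-9a-z_]* :/ { name = $1 }
        name != "" && match($0, /\[[0-9]+\/[0-9]+\]/) {
            # The matched text looks like [4/3]: expected/active members.
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling list_degraded_arrays with no argument reads the live /proc/mdstat; empty output means no array is reporting fewer members than expected.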
Fix: If a specific drive has physically failed (e.g., SMART errors, system logs showing I/O errors for that device), mark its partition as failed and remove it from the array before pulling the drive. Once the replacement is installed and partitioned, add its partition to the array:
```
sudo mdadm /dev/md0 --manage --fail /dev/sdX1 # Replace sdX1 with the failed partition
sudo mdadm /dev/md0 --manage --remove /dev/sdX1
sudo mdadm /dev/md0 --manage --add /dev/sdY1 # Replace sdY1 with the new drive's partition
```
The array will immediately start rebuilding onto the new drive.
Why it works: mdadm is designed to detect and handle individual drive failures. By explicitly failing and removing the old drive, then adding the new one, you allow mdadm to correctly re-initialize the spare and begin the reconstruction process.

2. Drive Failure Not Detected by mdadm (but by kernel/SMART)

Diagnosis: Check kernel logs for I/O errors related to specific drives:
```
sudo dmesg | grep -iE 'i/o error|medium error'
sudo smartctl -a /dev/sdX # Replace sdX with suspect drive
```
If SMART shows errors but mdadm --detail still shows the drive as active sync, the drive is deteriorating but md has not yet hit an error serious enough to mark it as failed.
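To pick out the SMART attributes that most often precede a failure, the attribute table can be filtered. This is a rough sketch under stated assumptions: smart_warning_attrs is a hypothetical helper, it expects the ATA attribute table printed by smartctl -A (raw value in the tenth column), and it will print nothing for NVMe or SAS drives, which report health differently.

```shell
# Read `smartctl -A` output on stdin and print sector-related attributes
# whose raw value is non-zero. Assumes the ATA attribute table layout.
# smart_warning_attrs is a hypothetical helper name.
smart_warning_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($10 + 0 > 0) print $2 "=" $10
        }
    '
}

# Usage (sdX is a placeholder for the suspect drive):
#   sudo smartctl -A /dev/sdX | smart_warning_attrs
```

Any output here is a reason to treat the drive as suspect even while md still lists it as healthy.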

Fix: Manually mark the problematic drive as failed and remove it, then add a replacement:

```
sudo mdadm /dev/md0 --manage --fail /dev/sdX1
sudo mdadm /dev/md0 --manage --remove /dev/sdX1
sudo mdadm /dev/md0 --manage --add /dev/sdY1
```

Why it works: This forces mdadm to acknowledge the drive’s unreliability, even if its internal state didn’t trigger an automatic failure. The rebuild process then proceeds as if a drive had naturally failed.

3. mdadm Daemon Not Running or Misconfigured

Diagnosis: Check if mdadm is running as a service:
```
sudo systemctl status mdmonitor.service
```
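If not running, check its configuration in /etc/mdadm/mdadm.conf (Debian/Ubuntu) or /etc/mdadm.conf (most other distributions). Make sure it contains a MAILADDR or PROGRAM line: mdadm --monitor --scan typically refuses to start when it has no address or program to deliver alerts to.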
Fix: Start and enable the service:
```
sudo systemctl start mdmonitor.service
sudo systemctl enable mdmonitor.service
```
If configuration is suspect, review /etc/mdadm/mdadm.conf and ensure it correctly lists your arrays and their devices. A common issue is missing HOMEHOST or incorrect ARRAY definitions.
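Why it works: The kernel md driver detects failed members and starts rebuilding onto an available spare on its own. The mdmonitor service (mdadm --monitor) watches the arrays and reports events such as Fail and DegradedArray by email or through a program, and it can move spares between arrays that share a spare-group. Without it, an array can sit degraded unnoticed until a second drive fails.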
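Since the monitor needs somewhere to send alerts, it can help to confirm that the configuration names a destination before restarting the service. A minimal sketch: monitor_alert_configured is a hypothetical helper, the default path assumes the Debian/Ubuntu location, and the pattern allows for mdadm.conf keywords being case-insensitive and abbreviable.

```shell
# Succeed (exit 0) if the config file has an uncommented MAILADDR or
# PROGRAM line with a value. Optional argument: config path.
# monitor_alert_configured is a hypothetical helper name.
monitor_alert_configured() {
    grep -Eiq '^(MAIL[A-Z]*|PROG[A-Z]*)[[:space:]]+[^[:space:]]' \
        "${1:-/etc/mdadm/mdadm.conf}"
}

# Usage:
#   monitor_alert_configured /etc/mdadm.conf || echo "no alert destination set"
```

Alert delivery itself can be checked with sudo mdadm --monitor --scan --oneshot --test, which sends a TestMessage event for each array and exits.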

4. Corrupted mdadm Superblock or Metadata

Diagnosis: Run sudo mdadm --examine /dev/sdX1 for each drive in the array. Look for inconsistencies in UUIDs, event counts, or device roles compared to other drives. A severely corrupted superblock might prevent mdadm --detail from showing a valid array.
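Comparing event counts by eye across several members is error-prone, so the relevant field can be pulled out of the --examine output. This is a sketch only: examine_event_counts is a hypothetical helper, and it assumes each device's output starts with a "/dev/...:" line and contains an "Events : N" line, as current metadata formats print.

```shell
# Read concatenated `mdadm --examine` output on stdin and print one
# "device event-count" pair per member.
# examine_event_counts is a hypothetical helper name.
examine_event_counts() {
    awk '
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }
        $1 == "Events" && $2 == ":" { print dev, $3 }
    '
}

# Usage (device names are placeholders):
#   for d in /dev/sdA1 /dev/sdB1 /dev/sdC1; do
#       sudo mdadm --examine "$d"
#   done | examine_event_counts | sort -k2,2n
```

Members whose count lags the others dropped out of the array earlier and hold older data.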

Fix: This is the most dangerous fix and requires careful planning. If you have a backup, consider recreating the array from scratch and restoring from it. If not, and you are certain which drives hold the most recent metadata (the highest event count), you can try to force assembly:

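```
# WARNING: This can lead to data loss if done incorrectly.
# Ensure you have backups and understand the risks.
# Identify the drive with the most recent metadata (e.g., highest event count from --examine)
# Let's assume /dev/sdA1 is the 'good' drive.
sudo mdadm --stop /dev/md0 # Only needed if the array is partially assembled
sudo mdadm --assemble --force /dev/md0 /dev/sdA1 /dev/sdB1 /dev/sdC1 # Include all partitions
# If a member is still failed or missing afterwards, remove it and add a good spare.
# (--replace only applies to a member that is still readable, and takes the form
# --replace /dev/sdX1 --with /dev/sdY1.)
sudo mdadm /dev/md0 --manage --remove /dev/sdX1 # sdX1 is the failed member
sudo mdadm /dev/md0 --manage --add /dev/sdY1 # sdY1 is the good spare
```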

A more robust approach is to remove all devices, re-assemble the array with only the known good devices, and then add spares.

Why it works: mdadm normally refuses to assemble members whose event counts disagree. With --force it accepts the members with the highest event counts, updates slightly out-of-date superblocks to match, and starts the array. Once the array is active again, any members still missing can be added back and resynced. Data written after the stale members dropped out may be inconsistent, so check the file system afterwards.

5. Partitions Not Aligned Correctly or Missing

Diagnosis: Check that each member drive has a partition at least as large as the existing members and marked as Linux RAID (type 0xfd on MBR, the Linux RAID type on GPT). Use fdisk -l /dev/sdX or parted /dev/sdX print for each drive. Verify that mdadm --detail /dev/md0 lists the devices you expect (e.g., /dev/sda1, /dev/sdb1); if the array was built on partitions, whole disks such as /dev/sda should not appear.
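A replacement partition that is even slightly too small will be rejected when added, so it is worth comparing sizes first. A rough sketch, assuming sizes in bytes as printed by blockdev --getsize64: partition_large_enough is a hypothetical helper, and comparing against the smallest existing member is a conservative heuristic rather than the exact check mdadm performs.

```shell
# Succeed if the candidate size (first argument, bytes) is at least the
# smallest of the existing member sizes (remaining arguments, bytes).
# partition_large_enough is a hypothetical helper name. The function body
# runs in a subshell so its variables do not leak into the caller.
partition_large_enough() (
    candidate=$1
    shift
    smallest=$1
    for size in "$@"; do
        if [ "$size" -lt "$smallest" ]; then
            smallest=$size
        fi
    done
    [ "$candidate" -ge "$smallest" ]
)

# Usage (device names are placeholders):
#   partition_large_enough "$(sudo blockdev --getsize64 /dev/sdY1)" \
#       "$(sudo blockdev --getsize64 /dev/sda1)" \
#       "$(sudo blockdev --getsize64 /dev/sdb1)" || echo "replacement too small"
```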

Fix: If partitions are incorrect or missing, you’ll need to recreate them. This will destroy data on the affected partition.

```
# Example for /dev/sdX
sudo fdisk /dev/sdX
# Inside fdisk: d (delete partition), n (new partition), p (primary), 1 (partition number)
# Accept defaults for start/end to use entire disk, t (change type), fd (Linux RAID autodetect)
# w (write changes)
```

After creating/correcting partitions, add them to the array:

```
sudo mdadm /dev/md0 --manage --add /dev/sdX1
```

Why it works: md can use whole disks, but building arrays on partitions is the usual practice: the partition type marks the space as RAID, and other tools are less likely to treat the disk as empty. When every member has a correctly sized partition, mdadm can find its superblock and use the space for the array.

6. Rebuild Process Interrupted or Hung

Diagnosis: Check the rebuild progress:
```
cat /proc/mdstat
```
If it shows a rebuild percentage that hasn’t changed for a very long time, or if mdadm --detail /dev/md0 shows the array as resyncing or recovering indefinitely, the rebuild may be stuck.
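One way to tell a slow rebuild from a stuck one is to sample the progress figure twice and compare. The following is a minimal sketch: rebuild_progress is a hypothetical helper, and it assumes the progress line format used by /proc/mdstat ("recovery = 12.6% ...", likewise for resync, reshape, and check).

```shell
# Print "array action percent" for every array with a sync action in
# progress. Optional argument: an mdstat-format file (defaults to
# /proc/mdstat). rebuild_progress is a hypothetical helper name.
rebuild_progress() {
    awk '
        /^md[0-9a-z_]* :/ { name = $1 }
        match($0, /(recovery|resync|reshape|check) *= *[0-9.]+%/) {
            s = substr($0, RSTART, RLENGTH)
            action = s; sub(/ .*/, "", action)   # keep the word before the space
            pct = s;    sub(/.*= */, "", pct)    # keep the figure after "="
            print name, action, pct
        }
    ' "${1:-/proc/mdstat}"
}

# Usage: take two samples a few minutes apart and compare them.
#   rebuild_progress; sleep 300; rebuild_progress
```

If the percentage is identical across samples taken several minutes apart, the rebuild is likely stalled rather than slow.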

Fix: Sometimes stopping the array with mdadm --stop /dev/md0 (unmount any file systems on it first) and re-assembling it with mdadm --assemble --scan resets the state. If that fails, you might need to manually fail and re-add the drive that is being rebuilt.

```
# Find the drive that is actively rebuilding (check /proc/mdstat)
sudo mdadm /dev/md0 --manage --fail /dev/sdX1 # Replace sdX1 with the rebuilding drive
sudo mdadm /dev/md0 --manage --remove /dev/sdX1
sudo mdadm /dev/md0 --manage --add /dev/sdY1 # Add the spare or replacement drive
```

Why it works: Stopping and re-assembling the array can clear transient states that prevent the rebuild from completing. Forcing a fail/remove/add cycle on the problematic drive essentially restarts the rebuild process from a clean slate.

After resolving these issues, you may see a message at boot such as mdadm: /dev/md0 has been started with 3 drives (out of 4) and 1 rebuilding. This is expected when the system boots and mdadm reassembles the array before the rebuild has finished.