The mdadm daemon failed to properly monitor and re-sync a degraded RAID array, leading to potential data loss and persistent error notifications.
Common Causes and Fixes
1. Drive Failure Detected by mdadm
- Diagnosis: Inspect the array and the kernel’s summary:

      sudo mdadm --detail /dev/md0
      cat /proc/mdstat

  In the `mdadm --detail` output, look for a member whose state is `faulty` or `removed` and an overall `State` of `clean, degraded`. In `/proc/mdstat`, a failed member is tagged `(F)`, and the status field shows `_` instead of `U` for each missing member (for example `[U_UU]`).
- Fix: If a specific drive has physically failed (e.g., SMART errors, or system logs showing I/O errors for that device), mark it as failed and remove it from the array, then physically replace it. Partition the new drive to match the old one and add it:

      sudo mdadm --manage /dev/md0 --fail /dev/sdX1     # Replace sdX1 with the failed partition
      sudo mdadm --manage /dev/md0 --remove /dev/sdX1
      sudo mdadm --manage /dev/md0 --add /dev/sdY1      # Replace sdY1 with the new drive's partition

  The array starts rebuilding onto the new drive as soon as it is added.
- Why it works: `mdadm` is designed to detect and handle individual drive failures. By explicitly failing and removing the old drive, then adding the new one, you allow `mdadm` to treat the new partition as a spare and begin reconstruction.
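To spot a degraded array without reading the full `--detail` output, the status field in `/proc/mdstat` can be scanned. The following is a minimal sketch, not part of mdadm: `degraded_arrays` is a hypothetical helper name, and it assumes the usual mdstat layout in which each array’s block-count line carries a field such as `[U_UU]`, where `_` marks a missing member.

```shell
#!/bin/sh
# degraded_arrays [FILE]
# Hypothetical helper: print the name of every md array whose member
# status field (e.g. [U_UU]) contains "_", meaning a member is missing.
# Reads FILE, or /proc/mdstat when no file is given.
degraded_arrays() {
    awk '
        /^md[0-9A-Za-z_]* :/ { name = $1 }          # array header line
        name != "" && match($0, /\[[U_]+\]/) {      # status field for that array
            if (index(substr($0, RSTART, RLENGTH), "_") > 0)
                print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling `degraded_arrays` with no argument reads the live `/proc/mdstat`; an empty result means no array is reporting a missing member.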
2. Drive Failure Not Detected by mdadm (but by kernel/SMART)
- Diagnosis: Check kernel logs for I/O errors on specific drives, then query SMART:

      sudo dmesg | grep -i 'I/O error'
      sudo smartctl -a /dev/sdX    # Replace sdX with the suspect drive

  If SMART reports errors but `mdadm --detail` still shows the drive as `active sync`, the kernel is seeing problems that have not yet caused the member to be marked as failed.
- Fix: Manually mark the problematic drive as failed and remove it, then add a replacement:

      sudo mdadm --manage /dev/md0 --fail /dev/sdX1
      sudo mdadm --manage /dev/md0 --remove /dev/sdX1
      sudo mdadm --manage /dev/md0 --add /dev/sdY1
- Why it works: This forces `mdadm` to acknowledge the drive’s unreliability, even though no error has yet triggered an automatic failure. The rebuild then proceeds as if the drive had failed on its own.
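When deciding whether a drive that md still considers healthy should be failed by hand, a few SMART counters are usually the first thing to look at. The sketch below filters `smartctl -A` output for them. `smart_warnings` is a hypothetical helper name, and the parsing assumes the ATA attribute table (attribute name in column 2, raw value in column 10); NVMe and SAS devices print a different layout that this does not handle.

```shell
#!/bin/sh
# smart_warnings
# Hypothetical helper: read `smartctl -A` output on stdin and print
# NAME=RAW for sector-health attributes whose raw value is above zero.
# Assumes the ATA attribute table layout (name in column 2, raw value in column 10).
smart_warnings() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
        $10 + 0 > 0 { print $2 "=" $10 }
    '
}
```

A typical call would be `sudo smartctl -A /dev/sdX | smart_warnings`. Non-zero counts are a reason to look more closely at the drive, not proof on their own that it must be replaced.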
3. mdadm Daemon Not Running or Misconfigured
- Diagnosis: Check whether the `mdadm` monitor is running as a service:

      sudo systemctl status mdmonitor.service

  If it is not running, check its configuration in `/etc/mdadm/mdadm.conf` or `/etc/mdadm.conf`. Ensure the file has an `ARRAY` line for the relevant array and a `MAILADDR` or `PROGRAM` line; `mdadm --monitor --scan` will not monitor anything if it has neither an address nor a program to send alerts to.
- Fix: Start and enable the service:

      sudo systemctl start mdmonitor.service
      sudo systemctl enable mdmonitor.service

  If the configuration is suspect, review `/etc/mdadm/mdadm.conf` and ensure it correctly lists your arrays. A common issue is a missing `HOMEHOST` line or incorrect `ARRAY` definitions.
- Why it works: The `mdmonitor` service runs `mdadm --monitor`, which watches the arrays and reports events such as a failed member or a degraded array; it can also move a spare between arrays that share a spare group. The kernel’s md driver performs the rebuild itself, but without the monitor a failure can go unnoticed until a second drive fails.
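A quick way to check the monitoring prerequisite is to test whether the configuration file names an alert destination at all. This is a minimal sketch: `has_alert_target` is a hypothetical helper name, the default path is an assumption that varies by distribution, and the pattern only recognizes the keywords written out in full.

```shell
#!/bin/sh
# has_alert_target [FILE]
# Hypothetical helper: succeed if FILE (default /etc/mdadm/mdadm.conf)
# contains an uncommented MAILADDR or PROGRAM line with a value.
# Assumption: keywords are written out in full; abbreviated keywords are not matched.
has_alert_target() {
    grep -Eiq '^[[:space:]]*(MAILADDR|PROGRAM)[[:space:]]+[^[:space:]#]' \
        "${1:-/etc/mdadm/mdadm.conf}"
}
```

For example, `has_alert_target || echo "no MAILADDR or PROGRAM configured"`. Once a destination is set, `sudo mdadm --monitor --scan --oneshot --test` sends a test alert for each array, and `sudo mdadm --detail --scan` prints `ARRAY` lines that can be compared with the ones in the file.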
4. Corrupted mdadm Superblock or Metadata
- Diagnosis: Run `sudo mdadm --examine /dev/sdX1` for each drive in the array. Look for inconsistencies in UUIDs, event counts, or device roles compared to the other drives. A severely corrupted superblock might prevent `mdadm --detail` from showing a valid array.
- Fix: This is the most dangerous fix and requires careful planning. If you have a backup, consider recreating the array from scratch and restoring from it. If not, and you’re certain which drives hold the most recent metadata, you can try to force assembly:

      # WARNING: This can lead to data loss if done incorrectly.
      # Ensure you have backups and understand the risks.
      # Identify the members with the most recent metadata (highest event count from --examine).
      sudo mdadm --stop /dev/md0    # The array must be inactive before re-assembly
      sudo mdadm --assemble --force /dev/md0 /dev/sdA1 /dev/sdB1 /dev/sdC1    # Include all member partitions
      # If a member was left out of the assembled array, add it back so it resyncs:
      sudo mdadm --manage /dev/md0 --add /dev/sdX1
      # To swap out a member that still works but is suspect (sdY1 must already be a spare in the array):
      sudo mdadm --manage /dev/md0 --replace /dev/sdX1 --with /dev/sdY1

  A more cautious approach is to stop the array, re-assemble it with only the known good devices, and then add the remaining devices back as spares.
- Why it works: Forcing assembly tells `mdadm` to start the array even though the metadata on some members appears out of date, trusting the members with the most recent metadata over conflicting or corrupted information from the others. This allows the array to become active again, after which you can rectify any remaining inconsistencies.
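Comparing event counts across members is easier when they are listed side by side. The following is a sketch under stated assumptions: `event_count` and `rank_by_events` are hypothetical helper names, and the parsing relies on `mdadm --examine` printing a line of the form `Events : N`, as it does for version 1.x metadata.

```shell
#!/bin/sh
# event_count
# Hypothetical helper: read `mdadm --examine` output on stdin and print
# the value of its "Events" line.
event_count() {
    awk '/^[[:space:]]*Events[[:space:]]*:/ { print $NF; exit }'
}

# rank_by_events DEVICE...
# Hypothetical helper: print "DEVICE EVENTS" for each member, lowest count first.
# Needs root, because it runs `mdadm --examine` on each device.
rank_by_events() {
    for dev in "$@"; do
        printf '%s %s\n' "$dev" "$(mdadm --examine "$dev" 2>/dev/null | event_count)"
    done | sort -k2,2n
}
```

Run as root, `rank_by_events /dev/sdA1 /dev/sdB1 /dev/sdC1` lists the stalest member first; members whose counts match the highest value are the ones to trust when assembling.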
5. Partitions Not Aligned Correctly or Missing
- Diagnosis: Ensure each physical drive has a partition of the same size and type (Linux RAID autodetect, `0xfd`, on MBR disks) covering the intended RAID space; a replacement partition must be at least as large as the existing members. Use `fdisk -l /dev/sdX` or `parted /dev/sdX print` for each drive. If the array was built on partitions, verify that `mdadm --detail /dev/md0` lists the expected partitions (e.g., `/dev/sda1`, `/dev/sdb1`), not whole disks.
- Fix: If partitions are incorrect or missing, you’ll need to recreate them. This will destroy data on the affected partition.

      # Example for /dev/sdX
      sudo fdisk /dev/sdX
      # Inside fdisk:
      #   d   delete the existing partition
      #   n   new partition; then p (primary), 1 (partition number), accept the default start/end
      #   t   change the type; enter fd (Linux RAID autodetect)
      #   w   write changes and exit

  After creating or correcting the partitions, add them to the array:

      sudo mdadm --manage /dev/md0 --add /dev/sdX1
- Why it works: `mdadm` can use whole disks as members, but an array built on partitions expects each member to be a partition large enough to hold its share of the data plus the superblock. Correctly partitioning each drive ensures `mdadm` can identify and use the storage space for the array.
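Size mismatches between member partitions are easy to miss by eye in `fdisk -l` output. The sketch below compares byte sizes instead. `mismatched_sizes` is a hypothetical helper name, and it treats the first partition listed as the reference, which is an assumption about how the helper is called.

```shell
#!/bin/sh
# mismatched_sizes
# Hypothetical helper: read "NAME SIZE" lines on stdin and print every
# line whose SIZE differs from the first line's SIZE.
mismatched_sizes() {
    awk 'NR == 1 { ref = $2; next } $2 != ref { print $1, $2 }'
}
```

One way to feed it is `lsblk -bdno NAME,SIZE /dev/sda1 /dev/sdb1 /dev/sdc1 | mismatched_sizes`, which prints nothing when all three partitions are the same size in bytes.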
6. Rebuild Process Interrupted or Hung
- Diagnosis: Check the rebuild progress:

      cat /proc/mdstat

  If it shows a rebuild percentage that hasn’t changed for a very long time, or if `mdadm --detail /dev/md0` keeps reporting the array as `resyncing` or `recovering` without progress, the rebuild may be stuck.
- Fix: Sometimes unmounting the filesystem, stopping the array with `mdadm --stop /dev/md0`, and re-assembling it with `mdadm --assemble --scan` resets the state. If that fails, you might need to manually fail and re-add the drive that is part of the rebuild.

      # Find the drive that is actively rebuilding (check /proc/mdstat)
      sudo mdadm --manage /dev/md0 --fail /dev/sdX1     # Replace sdX1 with the rebuilding drive
      sudo mdadm --manage /dev/md0 --remove /dev/sdX1
      sudo mdadm --manage /dev/md0 --add /dev/sdY1      # Add the spare or replacement drive
- Why it works: Stopping and re-assembling the array can clear transient states that prevent the rebuild from completing. Forcing a fail/remove/add cycle on the problematic drive restarts the rebuild from a clean slate.
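Whether a rebuild is really stuck is easier to judge from two readings taken some time apart than from a single look at `/proc/mdstat`. The following sketch extracts the progress figure so it can be logged or compared. `rebuild_progress` is a hypothetical helper name, and the pattern assumes the usual progress line format, such as `recovery =  5.1%`.

```shell
#!/bin/sh
# rebuild_progress [FILE]
# Hypothetical helper: print "ARRAY ACTION PERCENT" for every array that
# shows a recovery, resync, reshape, or check progress line.
# Reads FILE, or /proc/mdstat when no file is given.
rebuild_progress() {
    awk '
        /^md[0-9A-Za-z_]* :/ { name = $1 }
        match($0, /(recovery|resync|reshape|check) *= *[0-9.]+%/) {
            s = substr($0, RSTART, RLENGTH)
            gsub(/ /, "", s)            # "recovery =  5.1%" becomes "recovery=5.1%"
            split(s, part, "=")
            print name, part[1], part[2]
        }
    ' "${1:-/proc/mdstat}"
}
```

Running `rebuild_progress` twice, a minute or so apart, shows whether the percentage is moving. If it is not, `/sys/block/md0/md/sync_action` and the throttle in `/proc/sys/dev/raid/speed_limit_min` are worth checking before stopping the array.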
After resolving these issues, you may see a message such as `mdadm: /dev/md0 has been started with 3 drives (out of 4).` when the array is assembled. This is normal when the system boots and `mdadm` reassembles the array before the rebuild is fully complete.