A degraded RAID array was not monitored or re-synced properly, which risks data loss and produces persistent error notifications.

Common Causes and Fixes

1. Drive Failure Detected by mdadm

  • Diagnosis:

    sudo mdadm --detail /dev/md0
    

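    Look for a device whose state is faulty or removed, and an array State of clean, degraded. In /proc/mdstat, a failed member is tagged (F) and a missing slot appears as _ in the [UU_] status map.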

  • Fix: If a specific drive has physically failed (e.g., SMART errors, system logs showing I/O errors for that device), mark it as failed and remove it from the array, then physically replace it. Once the replacement is partitioned, add it to the array:

    sudo mdadm /dev/md0 --manage --fail /dev/sdX1 # Replace sdX1 with the failed partition
    sudo mdadm /dev/md0 --manage --remove /dev/sdX1
    sudo mdadm /dev/md0 --manage --add /dev/sdY1 # Replace sdY1 with the new drive's partition
    

    The array will immediately start rebuilding onto the new drive.

  • Why it works: mdadm is designed to detect and handle individual drive failures. By explicitly failing and removing the old drive, then adding the new one, you allow mdadm to correctly re-initialize the spare and begin the reconstruction process.
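
The diagnosis step can be scripted against /proc/mdstat. The following is a minimal sketch, not part of mdadm: the function names degraded_arrays and failed_members are hypothetical, and the parsing assumes the usual mdstat layout (an "mdN :" header line followed by a line ending in a [UU_] status map).

```shell
# Print the name of every array whose status map contains a hole ("_").
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }                    # remember the current array
        $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }      # e.g. [_UUU] means one slot is missing
    '
}

# Print "array member" for every member tagged (F) on an array header line.
failed_members() {
    awk '/^md[^ ]* :/ {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)                  # sda1[0](F) becomes sda1
                print $1, dev
            }
    }'
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
#   failed_members  < /proc/mdstat
```

Both functions read from standard input, so they can be tried on a saved copy of /proc/mdstat before being pointed at a production machine.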

2. Drive Failure Not Detected by mdadm (but by kernel/SMART)

  • Diagnosis: Check kernel logs for I/O errors related to specific drives:

    sudo dmesg | grep -iE 'i/o error|medium error'
    sudo smartctl -a /dev/sdX # Replace sdX with suspect drive
    

    If SMART shows errors but mdadm --detail still lists the drive as active sync, the drive is degrading without mdadm having marked it as failed.

  • Fix: Manually mark the problematic drive as failed and remove it, then add a replacement:

    sudo mdadm /dev/md0 --manage --fail /dev/sdX1
    sudo mdadm /dev/md0 --manage --remove /dev/sdX1
    sudo mdadm /dev/md0 --manage --add /dev/sdY1
    
  • Why it works: This forces mdadm to acknowledge the drive’s unreliability, even if its internal state didn’t trigger an automatic failure. The rebuild process then proceeds as if a drive had naturally failed.
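
To decide whether a drive is suspect enough to fail manually, the SMART attribute table can be checked for sector problems. This is a rough sketch under assumptions: the function names smart_raw and smart_suspect are hypothetical, the input is the ATA attribute table printed by smartctl -A (NVMe and SCSI drives use a different layout), and the three attributes checked are a common but not exhaustive choice.

```shell
# Print the RAW_VALUE column (field 10) of one attribute from `smartctl -A` output on stdin.
smart_raw() {
    awk -v attr="$1" '$2 == attr { print $10 }'
}

# Read `smartctl -A` output on stdin; print each non-zero sector-health attribute.
# Exit status is 0 if at least one was found, 1 otherwise.
smart_suspect() {
    report=$(cat)
    found=1
    for attr in Reallocated_Sector_Ct Current_Pending_Sector Offline_Uncorrectable; do
        raw=$(printf '%s\n' "$report" | smart_raw "$attr")
        case $raw in
            ''|0) ;;                                  # attribute absent or zero
            *) echo "$attr=$raw"; found=0 ;;
        esac
    done
    return "$found"
}

# Usage on a live system:
#   sudo smartctl -A /dev/sdX | smart_suspect && echo "consider replacing /dev/sdX"
```

A non-zero raw value is a warning sign rather than proof of failure, so it is best read together with the kernel log before failing a member.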

3. mdadm Daemon Not Running or Misconfigured

  • Diagnosis: Check if mdadm is running as a service:

    sudo systemctl status mdmonitor.service
    

    If not running, check its configuration in /etc/mdadm/mdadm.conf or /etc/mdadm.conf. mdadm --monitor needs somewhere to send alerts, so make sure a MAILADDR (or PROGRAM) line is present; without one, the monitor may refuse to start.

  • Fix: Start and enable the service:

    sudo systemctl start mdmonitor.service
    sudo systemctl enable mdmonitor.service
    

    If configuration is suspect, review /etc/mdadm/mdadm.conf and ensure it correctly lists your arrays and their devices. A common issue is missing HOMEHOST or incorrect ARRAY definitions.

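  • Why it works: The mdmonitor service runs mdadm --monitor, which watches array state and sends alerts on events such as a failed member or a degraded array; it can also move spares between arrays that share a spare-group. The rebuild itself is performed by the kernel md driver, so an array with an available spare still rebuilds without the monitor, but a failure can go unnoticed until a second drive dies.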
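
A quick way to confirm the configuration gives the monitor an alert target is to look for a MAILADDR or PROGRAM line. The sketch below makes some assumptions: the function name has_alert_target is hypothetical, and it only recognizes the full keywords, although mdadm.conf also accepts abbreviated ones.

```shell
# Succeeds if the given mdadm.conf contains a MAILADDR or PROGRAM line with a value.
# Keywords are matched case-insensitively; commented-out lines are ignored.
has_alert_target() {
    grep -Eiq '^[[:space:]]*(MAILADDR|PROGRAM)[[:space:]]+[^[:space:]]' "$1"
}

# Usage on a live system (the path differs between distributions):
#   has_alert_target /etc/mdadm/mdadm.conf || echo "no MAILADDR or PROGRAM configured"
```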

4. Corrupted mdadm Superblock or Metadata

  • Diagnosis: Run sudo mdadm --examine /dev/sdX1 for each drive in the array. Look for inconsistencies in UUIDs, event counts, or device roles compared to other drives. A severely corrupted superblock might prevent mdadm --detail from showing a valid array.

  • Fix: This is the most dangerous fix and requires careful planning. If you have a backup, consider recreating the array from scratch. If not, and you’re certain which drive is the source of truth (e.g., has the latest metadata), you can try to force a rebuild:

    # WARNING: This can lead to data loss if done incorrectly.
    # Ensure you have backups and understand the risks.
    # Identify the drive with the most recent metadata (e.g., highest event count from --examine)
    # Let's assume /dev/sdA1 is the 'good' drive.
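    # If /dev/md0 is partially assembled, stop it first with: sudo mdadm --stop /dev/md0
    sudo mdadm --assemble --force /dev/md0 /dev/sdA1 /dev/sdB1 /dev/sdC1 # Include all member partitions
    # To swap out a member that is still readable but suspect, rebuild onto a spare before dropping it:
    sudo mdadm /dev/md0 --manage --replace /dev/sdX1 --with /dev/sdY1 # sdX1 is the suspect member, sdY1 a spare already added to the array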
    

    A more conservative approach is to stop the array, assemble it with only the known-good devices, and then add the remaining drives back so they resync.

  • Why it works: --force tells mdadm to assemble the array even when the metadata on some members appears out of date, and to treat members recorded as failed as working if that is the only way to start the array. This lets the array become active again, after which you can rectify any remaining inconsistencies.
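
Comparing event counts across members can be scripted. This is a minimal sketch with hypothetical function names (examine_events, freshest_member); it assumes mdadm --examine prints a line of the form "Events : N", which is the usual layout for 1.x metadata.

```shell
# Extract the event counter from `mdadm --examine` output read on stdin.
examine_events() {
    awk -F' : ' '/^[ \t]*Events :/ { print $2; exit }'
}

# Read "device events" pairs on stdin; print the device with the highest count.
freshest_member() {
    sort -k2,2n | tail -n 1 | cut -d' ' -f1
}

# Usage on a live system:
#   for d in /dev/sdA1 /dev/sdB1 /dev/sdC1; do
#       printf '%s %s\n' "$d" "$(sudo mdadm --examine "$d" | examine_events)"
#   done | freshest_member
```

Members whose counts differ only slightly from the highest are usually the ones --force can bring back; a member that is far behind is better re-added and resynced.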

5. Partitions Not Aligned Correctly or Missing

  • Diagnosis: Ensure each drive has a partition at least as large as the existing members, with a RAID partition type (fd, Linux raid autodetect, on MBR; Linux RAID on GPT), covering the intended RAID space. Use sudo fdisk -l /dev/sdX or sudo parted /dev/sdX print for each drive. If the array was built on partitions, verify that mdadm --detail /dev/md0 lists partitions (e.g., /dev/sda1, /dev/sdb1) rather than whole disks.

  • Fix: If partitions are incorrect or missing, you’ll need to recreate them. This will destroy data on the affected partition.

    # Example for /dev/sdX
    sudo fdisk /dev/sdX
    # Inside fdisk: d (delete partition), n (new partition), p (primary), 1 (partition number)
    # Accept defaults for start/end to use entire disk, t (change type), fd (Linux RAID autodetect)
    # w (write changes)
    

    After creating/correcting partitions, add them to the array:

    sudo mdadm /dev/md0 --manage --add /dev/sdX1
    
  • Why it works: mdadm accepts both whole disks and partitions as members, but every member must be at least as large as the space the array uses on each device. Partitioning each drive consistently ensures the new member is large enough and is recognized as a RAID component.
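
Before adding a partition, its size can be checked against what the array needs. The sketch below uses a hypothetical helper, too_small; the required size in bytes is something you supply yourself (for example, the size of an existing member as reported by blockdev --getsize64).

```shell
# Read "device size_in_bytes" pairs on stdin and print every device that is
# smaller than the required size given as $1.
# Exit status is 0 if all devices are large enough, 1 otherwise.
too_small() {
    awk -v need="$1" '$2 + 0 < need + 0 { print $1; bad = 1 } END { exit bad }'
}

# Usage on a live system, taking an existing member as the reference size:
#   need=$(sudo blockdev --getsize64 /dev/sda1)
#   for d in /dev/sdb1 /dev/sdc1; do
#       printf '%s %s\n' "$d" "$(sudo blockdev --getsize64 "$d")"
#   done | too_small "$need"
```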

6. Rebuild Process Interrupted or Hung

  • Diagnosis: Check the rebuild progress:

    cat /proc/mdstat
    

    If it shows a rebuild percentage that hasn’t changed for a very long time, or if mdadm --detail /dev/md0 shows the array as resyncing or recovering indefinitely, the rebuild may be stuck.

  • Fix: Sometimes, unmounting any filesystems on the array, stopping it with mdadm --stop /dev/md0, and re-assembling it with mdadm --assemble --scan resets the state. If that fails, you might need to manually fail and re-add the drive that is being rebuilt.

    # Find the drive that is actively rebuilding (check /proc/mdstat)
    sudo mdadm /dev/md0 --manage --fail /dev/sdX1 # Replace sdX1 with the rebuilding drive
    sudo mdadm /dev/md0 --manage --remove /dev/sdX1
    sudo mdadm /dev/md0 --manage --add /dev/sdY1 # Add the spare or replacement drive
    
  • Why it works: Stopping and re-assembling the array can clear transient states that prevent the rebuild from completing. Forcing a fail/remove/add cycle on the problematic drive essentially restarts the rebuild process from a clean slate.
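
Whether a rebuild is really stuck can be checked by sampling /proc/mdstat twice and comparing the block counters. This is a sketch with hypothetical function names (rebuild_position, rebuild_stalled); it assumes the usual progress line format, "recovery = 8.5% (done/total) ...", and only looks at recovery and resync operations.

```shell
# Print "done total" block counters from the first recovery/resync progress line on stdin.
rebuild_position() {
    sed -nE 's/.*(recovery|resync) *= *[0-9.]+% \(([0-9]+)\/([0-9]+)\).*/\2 \3/p' | head -n 1
}

# Succeeds if two non-empty samples are identical, i.e. no blocks were processed in between.
rebuild_stalled() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage on a live system:
#   before=$(rebuild_position < /proc/mdstat); sleep 60
#   after=$(rebuild_position < /proc/mdstat)
#   rebuild_stalled "$before" "$after" && echo "no rebuild progress in 60 seconds"
```

A slow rebuild is not necessarily a hung one: the kernel throttles resync under competing I/O, down to the rate in /proc/sys/dev/raid/speed_limit_min, so it is worth sampling over a longer interval before failing the drive.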

After resolving these issues, you may still see a message at boot such as mdadm: /dev/md0 has been started with 3 drives (out of 4) and 1 rebuilding. This is expected when the system boots and mdadm reassembles the array before the rebuild is fully complete; it should stop appearing once /proc/mdstat shows every member as up.
