A Linux soft lockup means a CPU core has spent too long in kernel code without yielding to the scheduler (more than 20 seconds by default), so other tasks on that core cannot run and the system may become unresponsive.

The most common culprit is a kernel module or driver getting stuck in an infinite loop or a very long-running operation that doesn’t yield control back to the scheduler. This can happen due to bugs in the module, hardware issues that cause unexpected behavior, or even resource exhaustion within the module itself.

Diagnosis:

Check kernel logs for "BUG: soft lockup" messages. Each report names the CPU that locked up, how long it was stuck, and the task that was running, followed by a register dump and stack trace showing the kernel function that was executing.

sudo dmesg | grep "soft lockup"

If you see a repeating pattern of soft lockups on a specific CPU, it strongly suggests a hardware or driver issue related to that CPU or its immediate peripherals.
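
As a rough way to spot such a pattern, the sketch below tallies soft lockup reports per CPU. It assumes the usual message form "BUG: soft lockup - CPU#N stuck for ..." and reads log text from standard input; count_lockups_by_cpu is a hypothetical helper name, not a standard tool.

```shell
# count_lockups_by_cpu: hypothetical helper. Reads kernel log text on stdin
# and prints one "cpuN: count" line per CPU that reported a soft lockup.
count_lockups_by_cpu() {
    grep -o 'soft lockup - CPU#[0-9][0-9]*' |
        sed 's/.*CPU#//' |
        sort -n |
        uniq -c |
        awk '{ print "cpu" $2 ": " $1 }'
}

# Usage (needs permission to read the kernel log):
#   sudo dmesg | count_lockups_by_cpu
```

A count concentrated on one CPU points toward that core or the devices whose interrupts it services; an even spread points more toward software.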

Common Causes and Fixes:

  1. Stuck Kernel Module/Driver: A buggy kernel module is the most frequent cause. This could be a network driver, storage driver, or any other hardware-related module.

    • Diagnosis: Examine dmesg for the specific function name mentioned in the soft lockup message. Search online for known issues with that function or the associated driver.
    • Fix: Identify the module responsible (e.g., ixgbe for Intel NICs, nvme for NVMe storage). Blacklist the module to prevent it from loading:
      echo "blacklist <module_name>" | sudo tee -a /etc/modprobe.d/blacklist.conf
      sudo update-initramfs -u
      sudo reboot
      
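      Replace <module_name> with the actual module name. update-initramfs is the Debian/Ubuntu tool; other distributions rebuild the initramfs differently. Blacklisting also disables the hardware the module drives, so do not blacklist a driver the system needs in order to boot or stay reachable (for example, nvme on a machine that boots from NVMe storage).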
    • Why it works: Blacklisting a module prevents the kernel from loading it during boot. If the soft lockup was caused by a bug within that module, the system will no longer encounter that bug.
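
To find the module to investigate, one option is to read it off the trace itself. The sketch below assumes an x86 trace, where the RIP line ends with the owning module in square brackets when the function is not part of the core kernel; lockup_module is a hypothetical helper name.

```shell
# lockup_module: hypothetical helper. Reads a lockup trace on stdin and prints
# the module named in [brackets] at the end of the first RIP line.
# Prints nothing if that function belongs to the core kernel image.
lockup_module() {
    sed -n '/RIP: /s/.*\[\([A-Za-z0-9_][A-Za-z0-9_]*\)\][[:space:]]*$/\1/p' |
        head -n 1
}

# Usage:
#   sudo dmesg | lockup_module
```

If nothing is printed, the stuck function lives in the kernel image itself, and blacklisting a module is unlikely to help.
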
  2. Faulty CPU Hardware: While less common than software bugs, a faulty CPU core can sometimes enter a state where it gets stuck.

    • Diagnosis: If soft lockups consistently happen on the same CPU core (as indicated in dmesg), and you’ve ruled out software causes, suspect hardware. Run CPU stress tests like stress-ng with specific CPU targets.
    • Fix: If a specific CPU core is identified as faulty, you can disable it at the BIOS/UEFI level or keep the kernel from bringing it up with a boot parameter. maxcpus=<n> limits the kernel to the first <n> CPUs at boot, so it only helps when the faulty core is numbered <n> or higher. Add it to your GRUB configuration (/etc/default/grub):
      sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 maxcpus=3"/' /etc/default/grub
      sudo update-grub
      sudo reboot
      
      Replace 3 with the number of cores to keep; maxcpus=3 brings up CPUs 0 through 2 and leaves the rest offline. To exclude a lower-numbered core, take that core offline individually instead.
    • Why it works: By disabling the core, the operating system and its scheduler will no longer assign tasks to it, preventing it from entering its faulty state.
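
Independently of boot parameters, Linux exposes a per-core hotplug switch in sysfs, which makes it possible to test the suspicion on a running system. The sketch below writes 0 to /sys/devices/system/cpu/cpuN/online. The name offline_cpu and its optional second argument (an alternate sysfs root, useful for dry runs) are hypothetical, and the change does not survive a reboot. On many systems cpu0 has no online file and cannot be taken offline.

```shell
# offline_cpu: hypothetical helper. Takes one CPU offline via sysfs.
#   $1 = CPU number
#   $2 = sysfs CPU root (optional, defaults to the real one)
offline_cpu() {
    cpu="$1"
    root="${2:-/sys/devices/system/cpu}"
    ctl="$root/cpu$cpu/online"
    if [ ! -w "$ctl" ]; then
        echo "cannot offline cpu$cpu: $ctl is missing or not writable" >&2
        return 1
    fi
    echo 0 > "$ctl"
}

# Usage (as root): offline_cpu 3
# Undo (as root):  echo 1 > /sys/devices/system/cpu/cpu3/online
```

If the lockups stop while the core is offline, that supports the hardware diagnosis before you commit to a permanent BIOS or boot-parameter change.
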
  3. I/O Wait or Resource Starvation: A driver might poll indefinitely for an I/O operation that never completes, or spin on a resource that is perpetually unavailable, leading to a long-running, non-yielding operation.

    • Diagnosis: Monitor system I/O using iostat and look for devices with consistently high %util or await times. Check vmstat for high wa (I/O wait) values.
    • Fix: For storage-related issues, ensure disks are healthy (smartctl -a /dev/sdX). If it’s a network issue, check network cable integrity, switch port status, and network driver parameters. For example, tuning network driver ring buffers might help:
      # Example for Intel e1000e, adjust for your driver
      sudo ethtool -G eth0 rx 4096 tx 4096
      
      Replace eth0 with your network interface and keep the sizes within the maximums reported by ethtool -g eth0. Larger rings give the driver more room for packets, reducing the chance of it stalling on full buffers.
    • Why it works: Addressing the underlying I/O bottleneck or resource contention allows the driver to complete its operations in a timely manner, preventing it from holding up the CPU.
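
To put a number on the wa check, the sketch below averages the wa column of vmstat output. It locates the column by its header name rather than by position, since the column layout can vary between vmstat versions; avg_iowait is a hypothetical helper name.

```shell
# avg_iowait: hypothetical helper. Reads vmstat output on stdin and prints
# the mean of the "wa" (I/O wait) column with one decimal place.
# Note: the first sample vmstat prints is an average since boot.
avg_iowait() {
    awk '
        # Until the header row is seen, look for the field named "wa".
        !col { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
        # Afterwards, accumulate numeric values from that column.
        $col ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n) printf "%.1f\n", sum / n }
    '
}

# Usage: vmstat 1 10 | avg_iowait
```
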
  4. Kernel Bug in Scheduler or Core Components: While rare, a bug in the Linux kernel’s scheduler or other core components could lead to a situation where a process or interrupt handler runs for an excessively long time without yielding.

    • Diagnosis: If the lockup function is generic (e.g., schedule, __schedule, irq_enter), and you’ve ruled out specific drivers, consider a kernel bug. Check kernel mailing lists for similar reports.
    • Fix: Update to a newer, stable kernel version. If the issue persists, consider reporting it to the kernel community with detailed logs and system information.
      sudo apt update && sudo apt install --only-upgrade linux-image-generic
      sudo reboot
      
      This works by replacing the potentially buggy kernel code with a corrected version.
    • Why it works: Newer kernel versions often contain bug fixes for scheduler issues and other core components, resolving the underlying cause of the long-running, non-preemptible code.
  5. High Interrupt Load: An interrupt storm or a poorly handled interrupt service routine (ISR) can consume CPU time excessively, potentially leading to a soft lockup if the ISR is too long.

    • Diagnosis: Use top or htop to see if any specific process is consuming high CPU, or use perf top to see kernel functions. cat /proc/interrupts can show interrupt counts per CPU/device.
    • Fix: Identify the device generating excessive interrupts. This might involve disabling specific hardware features or updating firmware/drivers. If a specific hardware device is misbehaving (e.g., a faulty network card constantly generating interrupts), you might need to replace it or disable the offending interrupt source if possible (less common for critical devices).
      # Example: steer an interrupt away from an overloaded CPU (use with caution)
      # Find the interrupt number for the device from /proc/interrupts
      # smp_affinity is a hexadecimal CPU bitmask: 2 allows only CPU 1; a mask of 0 is rejected
      # echo 2 | sudo tee /proc/irq/<interrupt_number>/smp_affinity
      
      This works by limiting where the interrupt can be handled or by identifying and mitigating the source of excessive interrupts.
    • Why it works: By reducing or isolating the source of the interrupt storm, you prevent the CPU from being overwhelmed by ISRs that don’t yield.
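
Reading /proc/interrupts by eye gets tedious on machines with many cores. The sketch below sums each interrupt's per-CPU counts and lists the busiest sources first. It assumes the usual layout (a header row of CPU names, then one row per interrupt); top_irqs is a hypothetical helper name.

```shell
# top_irqs: hypothetical helper. Reads /proc/interrupts-style text on stdin
# and prints "total irq description" for the N busiest interrupts (default 5).
top_irqs() {
    awk '
        NR == 1 { ncpu = NF; next }          # header row: one field per CPU
        {
            total = 0
            for (i = 2; i <= ncpu + 1 && i <= NF; i++) total += $i
            name = $1; sub(/:$/, "", name)   # strip the trailing colon
            desc = ""
            for (i = ncpu + 2; i <= NF; i++) desc = (desc == "" ? $i : desc " " $i)
            print total, name, desc
        }
    ' | sort -rn | head -n "${1:-5}"
}

# Usage: top_irqs 5 < /proc/interrupts
```

The totals are cumulative since boot, so running it twice a few seconds apart and comparing the results gives a better picture of the current interrupt rate.
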
  6. ACPI/Power Management Issues: Sometimes, aggressive power management settings or ACPI (Advanced Configuration and Power Interface) bugs can cause CPUs to get stuck in certain states or during transitions.

    • Diagnosis: Look for ACPI-related errors in dmesg. Try disabling specific power-saving features in BIOS/UEFI.
    • Fix: Use kernel boot parameters to disable certain ACPI features. For example, acpi=off (disables ACPI entirely, usually not recommended for modern systems) or more specific options like acpi_osi=! acpi_osi="Windows 2015", which clears the kernel's built-in OS interface strings and then reports only "Windows 2015" to the firmware.
      sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 acpi_osi=! acpi_osi=\\"Windows 2015\\""/' /etc/default/grub
      sudo update-grub
      sudo reboot
      
      This works by bypassing or modifying how the kernel interacts with the system’s ACPI tables, which can sometimes be buggy.
    • Why it works: Some ACPI implementations can be faulty, leading to unexpected system behavior. Modifying boot parameters can either work around these bugs or disable ACPI features that are causing problems.
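
For the diagnosis step, a simple filter can pull the ACPI complaints out of the kernel log. The sketch below assumes the common ACPICA prefixes ("ACPI Error", "ACPI BIOS Error", "ACPI Warning", "ACPI Exception"); acpi_errors is a hypothetical helper name, and the pattern may miss vendor-specific messages.

```shell
# acpi_errors: hypothetical helper. Reads kernel log text on stdin and prints
# lines that ACPICA tags as errors, warnings, or exceptions.
acpi_errors() {
    grep -E 'ACPI (BIOS )?(Error|Warning|Exception)'
}

# Usage:
#   sudo dmesg | acpi_errors
#   cat /proc/cmdline    # after rebooting, confirm the new parameter is active
```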

A related failure to watch for after fixing a soft lockup is the hard lockup, where a CPU stops servicing interrupts entirely. The system can become completely unresponsive and may need a hard reboot, sometimes without logging any kernel messages.
