A Linux soft lockup means kernel code has kept a CPU core busy for too long (about 20 seconds by default) without letting the scheduler run anything else on that core. Tasks waiting for that core are starved, and the system can become sluggish or unresponsive.
The most common culprit is a kernel module or driver getting stuck in an infinite loop or a very long-running operation that doesn’t yield control back to the scheduler. This can happen due to bugs in the module, hardware issues that cause unexpected behavior, or even resource exhaustion within the module itself.
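How long is "too long"? The soft lockup detector fires after twice the value of the kernel.watchdog_thresh sysctl, which is 10 by default. The sketch below reads that value and prints the resulting threshold. The function name and its optional path argument are made up for illustration, and it assumes a kernel that exposes /proc/sys/kernel/watchdog_thresh.

```shell
# Print the soft lockup threshold in seconds, computed as 2 * watchdog_thresh.
# The optional first argument is an illustrative hook: a file to read the
# value from instead of /proc/sys/kernel/watchdog_thresh.
soft_lockup_threshold_secs() {
    local thresh_file="${1:-/proc/sys/kernel/watchdog_thresh}"
    local thresh
    thresh=$(cat "$thresh_file") || return 1
    echo $(( 2 * thresh ))
}
```

Calling soft_lockup_threshold_secs with no argument reports the threshold of the running system.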
Diagnosis:
Check kernel logs for "BUG: soft lockup" messages. These messages will typically include the CPU number that experienced the lockup and the function name within the kernel that was running when the lockup occurred.
sudo dmesg | grep "soft lockup"
If you see a repeating pattern of soft lockups on a specific CPU, it strongly suggests a hardware or driver issue related to that CPU or its immediate peripherals.
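To check whether the lockups cluster on one core, the messages can be tallied per CPU. This is a minimal sketch that assumes the kernel's usual "BUG: soft lockup - CPU#<n> stuck for <t>s!" wording; the function name is made up for illustration.

```shell
# Count soft lockup reports per CPU from kernel log text on stdin.
# Prints one "CPU<n>: <count>" line per affected CPU, lowest CPU first.
count_soft_lockups_by_cpu() {
    grep -oE 'soft lockup - CPU#[0-9]+' \
        | sed -E 's/.*CPU#//' \
        | sort -n \
        | uniq -c \
        | awk '{ printf "CPU%s: %s\n", $2, $1 }'
}
```

Typical use would be sudo dmesg | count_soft_lockups_by_cpu, or journalctl -k | count_soft_lockups_by_cpu on systems that keep the kernel log in the journal.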
Common Causes and Fixes:
- Stuck Kernel Module/Driver: A buggy kernel module is the most frequent cause. This could be a network driver, storage driver, or any other hardware-related module.
  - Diagnosis: Examine dmesg for the specific function name mentioned in the soft lockup message. Search online for known issues with that function or the associated driver.
  - Fix: Identify the module responsible (e.g., ixgbe for Intel NICs, nvme for NVMe storage). Blacklist the module to prevent it from loading (the commands below are for Debian/Ubuntu):

      echo "blacklist <module_name>" | sudo tee -a /etc/modprobe.d/blacklist.conf
      sudo update-initramfs -u
      sudo reboot

    Replace <module_name> with the actual module name. Do not blacklist a driver the system needs in order to boot or stay reachable, such as nvme on a machine whose root disk is NVMe.
  - Why it works: Blacklisting a module prevents the kernel from loading it automatically during boot, so the problematic code never runs. If the soft lockup was caused by a bug within that module, the system will no longer encounter that bug.
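To find the module behind a lockup, it helps to know that stack trace entries for symbols in a loadable module end with the module name in square brackets, for example "ixgbe_poll+0x1a4/0x6e0 [ixgbe]" (an invented sample line). The sketch below lists the modules that appear in a trace, most frequent first; the function name is hypothetical.

```shell
# List kernel modules named in a stack trace read from stdin, most frequent first.
# Trace entries for module symbols look like "func+0xoff/0xlen [module]";
# built-in kernel symbols have no bracketed suffix and are skipped.
modules_in_trace() {
    grep -oE '\+0x[0-9a-f]+/0x[0-9a-f]+ \[[A-Za-z0-9_]+\]' \
        | sed -E 's/.*\[([A-Za-z0-9_]+)\]/\1/' \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2,2 \
        | awk '{ print $2 }'
}
```

For example, sudo dmesg | modules_in_trace gives a short list of candidates to look up. A module that appears in the trace is a suspect rather than proof, since it may only have been called by the code that is actually stuck.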
- Faulty CPU Hardware: While less common than software bugs, a faulty CPU core can sometimes enter a state where it gets stuck.
  - Diagnosis: If soft lockups consistently happen on the same CPU core (as indicated in dmesg), and you’ve ruled out software causes, suspect hardware. Run CPU stress tests like stress-ng with specific CPU targets.
  - Fix: If a specific CPU core is identified as faulty, you can disable it at the BIOS/UEFI level or via kernel boot parameters. Add maxcpus=<n> to your GRUB configuration (/etc/default/grub):

      sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 maxcpus=3"/' /etc/default/grub
      sudo update-grub
      sudo reboot

    Replace 3 with the number of cores you want enabled. maxcpus limits how many CPUs are brought up at boot, counting from CPU 0, so it excludes the suspect core only if that core’s number is equal to or higher than the limit. To take one specific core offline until the next reboot, write 0 to /sys/devices/system/cpu/cpu<N>/online instead.
  - Why it works: Once the core is disabled, the scheduler no longer assigns tasks to it, so it cannot enter its faulty state.
- I/O Wait or Resource Starvation: A driver might be waiting indefinitely for an I/O operation to complete, or a resource it needs is perpetually unavailable, leading to a long-running, non-yielding operation.
  - Diagnosis: Monitor system I/O using iostat and look for devices with consistently high %util or await times. Check vmstat for high wa (I/O wait) values.
  - Fix: For storage-related issues, ensure disks are healthy (smartctl -a /dev/sdX). If it’s a network issue, check network cable integrity, switch port status, and network driver parameters. For example, tuning network driver ring buffers might help:

      # Example for Intel e1000e, adjust for your driver
      sudo ethtool -G eth0 rx 4096 tx 4096

    Replace eth0 with your network interface and adjust the buffer sizes; ethtool -g eth0 shows the maximum values the hardware supports. Larger rings give the driver more room for packets, reducing the chance that it stalls on full buffers.
  - Why it works: Addressing the underlying I/O bottleneck or resource contention allows the driver to complete its operations in a timely manner, preventing it from holding up the CPU.
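Watching the wa column by eye is tedious during a long capture, so the vmstat check can be scripted. The sketch below prints only the samples whose I/O wait is at or above a threshold. The function name and the default cut-off of 20 are arbitrary choices for illustration, not recommended values.

```shell
# Print every vmstat sample on stdin whose I/O-wait ("wa") value is at or
# above a threshold (first argument, default 20).
# The "wa" column is located by name in the header row, so extra columns
# in other vmstat versions do not break the lookup.
high_iowait_samples() {
    awk -v limit="${1:-20}" '
        col == 0 {                      # still searching for the header row
            for (i = 1; i <= NF; i++) if ($i == "wa") col = i
            next
        }
        $col ~ /^[0-9]+$/ && ($col + 0) >= (limit + 0) {
            printf "line %d: wa=%s\n", NR, $col
        }
    '
}
```

For example, vmstat 1 30 | high_iowait_samples 20 reports which of thirty one-second samples spent at least 20% of CPU time waiting on I/O. The line numbers refer to lines of vmstat output, including its header rows.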
- Kernel Bug in Scheduler or Core Components: While rare, a bug in the Linux kernel’s scheduler or other core components could lead to a situation where a process or interrupt handler runs for an excessively long time without yielding.
  - Diagnosis: If the lockup function is generic (e.g., schedule, __schedule, irq_enter), and you’ve ruled out specific drivers, consider a kernel bug. Check kernel mailing lists for similar reports.
  - Fix: Update to a newer, stable kernel version (the commands below are for Debian/Ubuntu). If the issue persists, consider reporting it to the kernel community with detailed logs and system information.

      sudo apt update && sudo apt upgrade linux-image-generic
      sudo reboot

  - Why it works: Newer kernel versions often contain bug fixes for the scheduler and other core components, replacing the potentially buggy code that ran for too long without being preempted.
- High Interrupt Load: An interrupt storm or a poorly handled interrupt service routine (ISR) can consume CPU time excessively, potentially leading to a soft lockup if the ISR is too long.
  - Diagnosis: Use top or htop to see if any specific process is consuming high CPU, or use perf top to see kernel functions. cat /proc/interrupts can show interrupt counts per CPU/device.
  - Fix: Identify the device generating excessive interrupts. This might involve disabling specific hardware features or updating firmware/drivers. If a specific hardware device is misbehaving (e.g., a faulty network card constantly generating interrupts), you might need to replace it or disable the offending interrupt source if possible (less common for critical devices).

      # Example: restrict an interrupt to CPU 0 so other CPUs stop servicing it (use with caution)
      # Find the interrupt number for the device in /proc/interrupts
      # echo 1 | sudo tee /proc/irq/<interrupt_number>/smp_affinity

    The value written to smp_affinity is a hexadecimal CPU bitmask (1 selects CPU 0 only), and at least one online CPU must stay selected.
  - Why it works: Limiting where the interrupt can be handled, or removing the source of the interrupt storm, keeps the CPU from being overwhelmed by ISRs that don’t yield.
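Two small helpers can support the interrupt diagnosis and fix: one ranks interrupt sources by total count, the other builds an affinity bitmask that leaves out one CPU. Both function names are made up for illustration. The mask helper is a sketch for machines with at most 32 CPUs, since larger systems write smp_affinity as comma-separated 32-bit groups.

```shell
# Rank interrupt sources by total count across all CPUs, busiest first.
# Reads a file in /proc/interrupts format (default: /proc/interrupts).
# Each output line is "<total> <irq label> <last field of the row>".
irq_totals() {
    awk '
        NR == 1 { ncpu = NF; next }     # header row: one field per CPU
        {
            total = 0
            for (i = 2; i <= ncpu + 1 && i <= NF; i++)
                if ($i ~ /^[0-9]+$/) total += $i
            label = $1
            sub(/:$/, "", label)
            printf "%.0f %s %s\n", total, label, $NF
        }
    ' "${1:-/proc/interrupts}" | sort -k1,1nr
}

# Print a hex bitmask covering CPUs 0..(ncpus-1) except the given CPU.
# Usage: affinity_mask_excluding <cpu_to_skip> <ncpus>
affinity_mask_excluding() {
    local cpu="$1" ncpus="$2"
    printf '%x\n' $(( ((1 << ncpus) - 1) & ~(1 << cpu) ))
}
```

The counts are totals since boot, so running irq_totals | head twice a few seconds apart shows which line is growing fastest. On a 4-CPU machine, affinity_mask_excluding 2 4 yields a mask that keeps an interrupt off CPU 2; it can be written to /proc/irq/<interrupt_number>/smp_affinity with sudo tee. If irqbalance is running, it may later change the affinity again.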
- ACPI/Power Management Issues: Sometimes, aggressive power management settings or ACPI (Advanced Configuration and Power Interface) bugs can cause CPUs to get stuck in certain states or during transitions.
  - Diagnosis: Look for ACPI-related errors in dmesg. Try disabling specific power-saving features in BIOS/UEFI.
  - Fix: Use kernel boot parameters to disable certain ACPI features. For example, acpi=off (disables ACPI entirely, usually not recommended for modern systems) or more specific options like acpi_osi=! acpi_osi="Windows 2015", which makes the kernel report only that OS interface string to the firmware.

      sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 acpi_osi=! acpi_osi=\\"Windows 2015\\""/' /etc/default/grub
      sudo update-grub
      sudo reboot

  - Why it works: Some ACPI implementations are faulty, leading to unexpected system behavior. Boot parameters change how the kernel interacts with the system’s ACPI tables, which can either work around these bugs or disable the ACPI features that are causing problems.
A related failure to watch for after fixing a soft lockup is a hard lockup, where a CPU stops responding even to interrupts. The system can become completely unresponsive and may need a hard reboot, often without leaving any kernel messages.