The Linux kernel’s scheduler is what decides which process gets to use the CPU next, and it’s a surprisingly complex piece of engineering that often gets overlooked until it bites you.
Here’s a look at the two main scheduling policies you’ll encounter: Completely Fair Scheduler (CFS) and the Real-Time (RT) schedulers.
Completely Fair Scheduler (CFS)
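CFS is the default scheduler for normal processes in Linux. Its core idea is to give each process a fair share of the CPU. Instead of fixed time slices, it tracks a per-task "virtual runtime" (vruntime): the CPU time the task has actually consumed, scaled by the task’s weight, which is derived from its nice value. A task with a higher weight (a lower nice value) accumulates vruntime more slowly and therefore ends up with a larger share of the CPU.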
When a process runs, its vruntime increases. The scheduler always picks the process with the lowest vruntime to run next. This ensures that processes that have been waiting longer or have been preempted get a chance to catch up.
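The lowest-vruntime-first rule can be sketched as a toy model. This is only an illustration: the task names, the starting values, and the fixed 10 ms run length are assumptions, and a min-heap stands in for the red-black tree the kernel keeps ordered by vruntime.

```python
import heapq

def run_schedule(tasks, run_ms, steps):
    """Return the order in which tasks run when the lowest vruntime goes first.

    tasks maps a (hypothetical) task name to its starting vruntime in ms.
    """
    heap = [(vruntime, name) for name, vruntime in tasks.items()]
    heapq.heapify(heap)
    order = []
    for _ in range(steps):
        vruntime, name = heapq.heappop(heap)              # lowest vruntime runs next
        order.append(name)
        heapq.heappush(heap, (vruntime + run_ms, name))   # running raises its vruntime
    return order

# C has had no CPU time yet, so it runs until it has caught up with A and B.
order = run_schedule({"A": 30, "B": 30, "C": 0}, run_ms=10, steps=6)
print(order)
```

In this model C runs three times in a row, after which all three tasks take turns, which is the "catch up" behaviour described above.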
Let’s see CFS in action. Imagine two processes, A and B.
# Start process A
./my_app --id A &
# Start process B
./my_app --id B &
If A runs for 10ms and B runs for 10ms, their vruntime might look like this:
A: vruntime = 10
B: vruntime = 10
Now, suppose process A is much more demanding and uses 50ms of CPU time, while B is idle for a bit.
A: vruntime = 10 + 50 = 60
B: vruntime = 10 (it didn’t run)
CFS would then schedule B to run because its vruntime (10) is lower than A’s (60). If B runs for 20ms:
A: vruntime = 60
B: vruntime = 10 + 20 = 30
Now, B’s vruntime is still lower, so it might get another turn. The scheduler aims to keep the difference in vruntime between any two processes within a certain limit, ensuring fairness.
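The A/B walkthrough can be replayed in a few lines. This is a sketch under stated assumptions, not the kernel’s code: both tasks use the nice-0 weight of 1024, so one millisecond of CPU time adds one millisecond of vruntime, and the helper names `account` and `pick_next` are made up for this example.

```python
NICE_0_WEIGHT = 1024  # weight of a nice-0 task; other weights are not modelled here

def account(vruntimes, name, ran_ms, weight=NICE_0_WEIGHT):
    """Charge ran_ms of CPU time to a task, scaled by its weight."""
    vruntimes[name] += ran_ms * NICE_0_WEIGHT // weight

def pick_next(vruntimes):
    """Return the name of the task with the lowest vruntime."""
    return min(vruntimes, key=vruntimes.get)

vruntimes = {"A": 0, "B": 0}
account(vruntimes, "A", 10)      # A: 10
account(vruntimes, "B", 10)      # B: 10
account(vruntimes, "A", 50)      # A: 60, B stays at 10
first = pick_next(vruntimes)     # B, because 10 < 60
account(vruntimes, first, 20)    # B: 30
second = pick_next(vruntimes)    # still B, because 30 < 60
print(vruntimes, first, second)
```

Passing a larger `weight` to `account` makes the same CPU time cost less vruntime, which is how a task with a lower nice value ends up with more CPU in this model.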
The main goal of CFS is to provide good throughput and responsiveness for general-purpose applications. It tries to balance giving processes enough CPU time to make progress without letting any single process hog the system.
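You can inspect CFS parameters using sysctl. For example, kernel.sched_min_granularity_ns defines the minimum time a process will run before CFS considers rescheduling. Where this tunable lives depends on the kernel version: from roughly 5.13 onward it moved from sysctl to debugfs (/sys/kernel/debug/sched/min_granularity_ns), and kernels that use the EEVDF scheduler (6.6 and later) no longer expose it.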
sysctl kernel.sched_min_granularity_ns
# Example output: kernel.sched_min_granularity_ns = 2000000 # 2ms
If this value is too high, a process might run for longer than desired, impacting responsiveness. If it’s too low, the overhead of frequent context switching can increase.
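If you want to read the value from a script instead of calling sysctl, something like the following works as a starting point. The list of candidate paths is an assumption: which file exists depends on the kernel version, and the debugfs location usually requires root.

```python
from pathlib import Path

# Candidate locations for the tunable (assumed; not every kernel has either one).
CANDIDATES = [
    "/proc/sys/kernel/sched_min_granularity_ns",
    "/sys/kernel/debug/sched/min_granularity_ns",
]

def read_first_int(paths):
    """Return (path, value) for the first readable integer file, or None."""
    for path in paths:
        try:
            return path, int(Path(path).read_text().strip())
        except (OSError, ValueError):
            continue  # missing, unreadable, or not an integer: try the next one
    return None

found = read_first_int(CANDIDATES)
if found is None:
    print("tunable not exposed on this system")
else:
    path, ns = found
    print(f"{path}: {ns} ns ({ns / 1e6:g} ms)")
```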
Real-Time (RT) Scheduling
CFS is great for general tasks, but sometimes you need guarantees. That’s where RT scheduling comes in. RT policies are designed for applications with strict timing requirements, like audio/video processing, industrial control systems, or network packet handling.
There are two main RT policies: SCHED_FIFO (First-In, First-Out) and SCHED_RR (Round-Robin).
SCHED_FIFO: A SCHED_FIFO process runs until it voluntarily yields the CPU, blocks on I/O, or is preempted by a higher-priority RT task. It is never preempted by a lower-priority process.
SCHED_RR: Similar to SCHED_FIFO, but processes at the same priority level are scheduled in round-robin fashion. If a SCHED_RR process uses up its time slice, it’s moved to the end of the queue for its priority level, and the next process at that level gets to run.
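The difference between the two policies shows up only among tasks of the same priority. Here is a toy simulation of one priority level; the task names, the work amounts, and the 10 ms time slice are hypothetical, and tasks never block in this model.

```python
from collections import deque

def simulate(policy, work_ms, slice_ms=10):
    """Simulate equal-priority tasks and return a list of (name, ran_ms) runs.

    policy is "fifo" or "rr"; work_ms maps a task name to the CPU time it needs.
    """
    if policy not in ("fifo", "rr"):
        raise ValueError(f"unknown policy: {policy}")
    queue = deque(work_ms.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        if policy == "fifo":
            ran = remaining                   # keeps the CPU until it is done
        else:
            ran = min(remaining, slice_ms)    # at most one time slice per turn
        timeline.append((name, ran))
        if remaining > ran:
            queue.append((name, remaining - ran))  # back of the queue for this level
    return timeline

work = {"A": 25, "B": 10}
print(simulate("fifo", work))  # A finishes before B ever runs
print(simulate("rr", work))    # A and B alternate in 10 ms slices
```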
RT tasks are assigned a static priority from 1 to 99, with higher numbers indicating higher priority. A SCHED_FIFO task with priority 60 will always preempt a SCHED_FIFO or SCHED_RR task with priority 50, and any RT task will preempt any CFS task.
You can change a process’s scheduling policy and priority using chrt.
Let’s say we have a critical audio processing task that needs to run with high priority.
# Switch the already running my_audio_app to the FIFO policy with priority 50
sudo chrt -f -p 50 <PID_of_my_audio_app>
If my_audio_app is running under this policy and a normal CFS process (like a web browser tab) suddenly becomes very CPU-intensive, my_audio_app still gets the CPU whenever it is runnable. The browser only runs while the audio task sleeps or blocks, which keeps audio playback smooth.
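The same change can be made from a program. The sketch below uses Python’s `os.sched_setscheduler`, which is available on Linux; the helper names and the hard-coded 1..99 range are assumptions of this example, and the call needs root or CAP_SYS_NICE to succeed.

```python
import os

RT_PRIO_MIN, RT_PRIO_MAX = 1, 99  # usual Linux RT priority range (assumed here)

def validate_rt_priority(priority):
    """Return priority unchanged, or raise ValueError if it is out of range."""
    if not RT_PRIO_MIN <= priority <= RT_PRIO_MAX:
        raise ValueError(
            f"RT priority must be {RT_PRIO_MIN}..{RT_PRIO_MAX}, got {priority}"
        )
    return priority

def set_fifo(pid, priority):
    """Switch pid to SCHED_FIFO; return False if we lack the privilege."""
    validate_rt_priority(priority)
    try:
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))
    except PermissionError:
        return False
    return True
```

Calling `set_fifo(pid, 50)` with the PID of the audio task is roughly what the chrt command above does; a PID of 0 means the calling process itself.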
The RT scheduler is not fair. If you have a high-priority RT task, it can starve lower-priority tasks (including CFS tasks) of CPU time, potentially making them unresponsive or seem "frozen." This is why RT priorities should be used judiciously.
The maximum RT priority is typically 99, and the minimum is 1. The CFS scheduler effectively runs at a priority lower than any RT task.
The concept of "preemption" is key here. RT tasks can preempt CFS tasks, and higher-priority RT tasks can preempt lower-priority RT tasks. CFS tasks can only preempt other CFS tasks based on their vruntime.
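These preemption rules boil down to a simple ordering, sketched here as a toy model. The `Task` type and the task names are hypothetical, and the model assumes the userspace convention in which a larger RT priority value wins.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    policy: str           # "fifo", "rr", or "cfs"
    rt_priority: int = 0  # 1..99 for RT tasks, unused for CFS tasks
    vruntime: int = 0     # used only for CFS tasks

def pick_next(runnable):
    """Pick the task that should hold the CPU, or None if nothing is runnable."""
    if not runnable:
        return None
    rt_tasks = [t for t in runnable if t.policy in ("fifo", "rr")]
    if rt_tasks:
        # Any runnable RT task beats every CFS task; highest priority wins.
        return max(rt_tasks, key=lambda t: t.rt_priority)
    # Only CFS tasks are left: the lowest vruntime runs next.
    return min(runnable, key=lambda t: t.vruntime)

tasks = [
    Task("browser", "cfs", vruntime=5),
    Task("audio", "fifo", rt_priority=50),
    Task("logger", "rr", rt_priority=10),
    Task("editor", "cfs", vruntime=2),
]
print(pick_next(tasks).name)  # audio
```

Once `audio` and `logger` block or sleep, the same function falls through to the CFS tasks and picks `editor`, the one with the lowest vruntime.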
A common mistake is to set RT priorities too high or to use RT scheduling for non-critical tasks, leading to system instability or unresponsiveness for other applications. A related hazard is priority inversion, where a high-priority task waits on a lock held by a lower-priority task. The kernel mitigates this with priority inheritance (for example on PI futexes and rt_mutexes), which temporarily boosts the lock holder to the priority of its highest-priority waiter until it releases the lock.
When configuring RT tasks, you’ll run into two kinds of system limits. The highest RT priority an unprivileged process may request is governed by the RLIMIT_RTPRIO resource limit (ulimit -r). Separately, /proc/sys/kernel/sched_rt_runtime_us and /proc/sys/kernel/sched_rt_period_us together define a CPU-time "budget" for RT tasks within a given period. The usual defaults are 950000 and 1000000 microseconds, which leaves 5% of each second for non-RT tasks. If RT tasks consume their entire budget before the period ends, they are throttled until the next period begins. This is a safety mechanism to prevent RT tasks from completely starving the system.
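The budget arithmetic is easy to check by hand. A small sketch, assuming the commonly documented defaults of 950000 µs runtime per 1000000 µs period and the convention that a runtime of -1 disables throttling; the function names are made up for this example.

```python
def rt_cpu_share(runtime_us, period_us):
    """Fraction of each period that RT tasks may consume (1.0 means no limit)."""
    if period_us <= 0:
        raise ValueError("period_us must be positive")
    if runtime_us < 0:
        return 1.0  # -1 turns RT throttling off
    return min(runtime_us / period_us, 1.0)

def reserved_us(runtime_us, period_us):
    """Microseconds per period left over for non-RT tasks."""
    if runtime_us < 0:
        return 0
    return max(period_us - runtime_us, 0)

share = rt_cpu_share(950_000, 1_000_000)
left = reserved_us(950_000, 1_000_000)
print(f"RT tasks may use {share:.0%} of each period; {left} us stay reserved")
```

With the defaults, a runaway RT task is held back for 50 ms out of every second, which is usually enough to get a shell and kill it.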