
The Linux CFS scheduler: how it works and alternatives in server hosting

The Linux CFS scheduler controls how server cores allocate their time to processes and thus directly influences latency, throughput and fairness in server hosting. In this guide, I explain how it works, which tuning levers exist, and useful alternatives such as ULE, BFS and EEVDF for hosting with web servers.

Key points

  • Fairness and vruntime determine which task gets the CPU.
  • Cgroups regulate quotas and cpu.shares for customer isolation.
  • Kernel tuning works via sched_latency_ns and granularity settings.
  • Alternatives such as BFS, ULE and EEVDF suit special workloads.
  • In practice: combine core affinity, I/O schedulers and tests.

How CFS works in everyday hosting

With the Completely Fair Scheduler, a virtual runtime decides which task runs next, which creates a fair and predictable allocation. Each task gets CPU time proportional to its nice-value weight, so a low nice value yields a larger share. In hosting environments, many small web requests, cronjobs and backups split the CPU between them without one process occupying everything. Interactive workloads such as NGINX requests benefit from frequent, short time slices, while batch tasks receive longer blocks. This means that response times remain reliable for users, even if many sites process requests in parallel.

I use cgroups to limit customers and services, because cpu.shares and cpu.max provide clear share ratios and hard limits. A default value of 1024 shares for "normal" and 512 for "less important" distributes the cores in a comprehensible way. With cpu.max, for example, I set 50 ms in a 100 ms period, which effectively corresponds to a 50% CPU share. This setup offers predictable reserves for hosting workloads with variable loads. Details on the principle can be found in compact form at fair CPU distribution.
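The weights and the 50 ms/100 ms quota described above could be set roughly like this. This is a minimal sketch assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and root privileges; the group names and the PID are placeholders.

```shell
# Hypothetical customer groups under cgroup v2 (requires root).
mkdir -p /sys/fs/cgroup/customer-a /sys/fs/cgroup/customer-b

# Relative weights: cgroup v2 uses cpu.weight with a default of 100,
# which corresponds to 1024 shares in cgroup v1 terms.
echo 100 > /sys/fs/cgroup/customer-a/cpu.weight   # "normal"
echo 50  > /sys/fs/cgroup/customer-b/cpu.weight   # "less important"

# Hard limit: 50 ms of CPU time per 100 ms period, i.e. ~50% of one core.
echo "50000 100000" > /sys/fs/cgroup/customer-a/cpu.max

# Move a process into the group (1234 is a placeholder PID).
echo 1234 > /sys/fs/cgroup/customer-a/cgroup.procs
```

On cgroup v1 hosts, the same intent maps to cpu.shares and cpu.cfs_quota_us/cpu.cfs_period_us instead.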

CFS mechanics explained clearly

At its core, CFS manages all runnable tasks in a red-black tree, sorted by vruntime, which allows efficient selection of the smallest virtual runtime. That task runs next and increases its vruntime in proportion to the CPU time consumed, weighted by the nice value. This creates a fluid balance without hard queues, which delivers clean results, especially with mixed workloads. On multi-core systems, the scheduler moves tasks between run queues but pays attention to cache locality via core affinity. In this way, CFS combines load balancing with as few expensive migrations as possible.

For fine-tuning, parameters such as sched_latency_ns and sched_min_granularity_ns set the course for latency and throughput. Smaller latency values favor short, interactive jobs; larger values strengthen batch jobs. In tests with tools such as stress-ng and fio, I check the effect on response times and CPU utilization. As the number of tasks increases, so does the administrative overhead of the tree, which can manifest itself as latency peaks. However, correctly set quotas and limits keep these effects in check in hosting environments.
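A quick before/after check with the tools named above might look like this. The workload sizes and runtimes are illustrative, not a benchmark recipe.

```shell
# CPU pressure with scheduler metrics (context switches, ops/s):
stress-ng --cpu 4 --timeout 30s --metrics-brief

# I/O-side view with fio: small random reads, as a web-like pattern.
fio --name=webread --rw=randread --bs=4k --size=256M \
    --numjobs=4 --time_based --runtime=30 --group_reporting
```

I run the same commands before and after each parameter change and compare latency percentiles, not just the averages.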

CFS's strengths in server hosting

CFS's greatest strength is fairness: resources are distributed consistently and comprehensibly. For shared environments, this means that no customer permanently displaces others, because quotas and shares clearly define the weightings. Interactive services receive fast response times, while backups are allowed to run smoothly. Prioritization via nice values completes this picture and leaves me room for tuning depending on the role of a service. Load balancing across all cores lets me make good use of the available computing power without giving greedy bursts of individual threads too much room.

In practice, the strength of CFS becomes apparent during web server peaks, when many short requests arrive, because CFS allocates frequent slots to these types of tasks. Clean cgroups help to set hard upper limits per customer or container. Measurements of averages and percentiles show reliable response times, which pays off in day-to-day business. This approach is particularly useful for application stacks with many components. This is precisely where the mix of predictable fairness and sufficient flexibility scores points.

Limits and typical stumbling blocks

With an extremely large number of simultaneous tasks, the overhead of the tree operations grows, which can drive latency into peaks. In hosting setups with many very short requests, frequent context switches sometimes occur. Such thrashing behavior reduces efficiency if granularity values are chosen incorrectly. Fewer but longer time slices can help, as long as interactivity is maintained. CFS reacts sensitively to incorrect quotas, which is why I consistently verify limits with load tests.

Even affinity-friendly workloads suffer if tasks jump between cores too often. A clean affinity concept keeps caches warm and reduces migration costs. I also like to bind noisy batch jobs to their own cores so that web requests run quietly on their cores. For latency-critical services, it is worth setting low nice values and a finely tuned latency. What counts in the end is that measurements confirm the selected parameters.

Comparison of alternatives: ULE, BFS and EEVDF

For special workloads, I look at alternatives that weight latency or scaling differently. ULE uses simpler queues and scores with less administrative overhead, BFS prioritizes responsiveness and shines with few tasks, and EEVDF combines fair distribution with deadlines. EEVDF in particular promises shorter waiting times for interactive loads, because the scheduler pays more attention to the earliest eligible virtual deadline. For very large server fleets, what counts in the end is which mix of efficiency and predictability wins in your own stack. A structured look at strengths, weaknesses and fields of application helps with the selection.

Scheduler | Complexity | Strengths in hosting        | Weaknesses              | Suitable for
CFS       | High       | Fair distribution, cgroups  | Latency peaks           | Shared hosting, mixed loads
ULE       | Low        | Simple queues, low overhead | Less isolation          | VMs, HPC-like patterns
BFS       | Medium     | Interactivity, speed        | Weak scaling            | Desktops, small servers
EEVDF     | Medium     | Low latency, deadlines      | Little field experience | Modern hosting stacks

Kernel tuning: practical steps for CFS

For CFS, I often set sched_autogroup_enabled=0 so that no implicit groups distort the picture and the load distribution remains clear. With sched_latency_ns I like to start at 20 ms, which favors interactive services, and adjust sched_min_granularity_ns to tame context switches. Values depend on the profile: many short web requests need different fine-tuning than backup windows. I test changes one at a time and measure percentiles instead of just looking at averages. This ensures not only that mean values look good, but also that the long tails shrink.
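The steps above can be sketched as the following commands. Note that the 4 ms granularity is an assumed starting value, and that on kernels from about 5.13 onward these knobs moved from sysctl to debugfs (/sys/kernel/debug/sched/), so the sysctl calls may fail on newer systems.

```shell
# Disable implicit per-session autogroups (requires root):
sysctl -w kernel.sched_autogroup_enabled=0

# 20 ms target latency, favoring interactive services:
sysctl -w kernel.sched_latency_ns=20000000

# Assumed 4 ms minimum slice to tame context-switch churn:
sysctl -w kernel.sched_min_granularity_ns=4000000

# On newer kernels, the equivalent debugfs files look like this:
# echo 20000000 > /sys/kernel/debug/sched/latency_ns
# echo 4000000  > /sys/kernel/debug/sched/min_granularity_ns
```

I persist whatever survives testing in a drop-in under /etc/sysctl.d/ and re-measure after every single change.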

If you want to go deeper into the sysctl parameters, you will find a good introduction here: sysctl tuning. I also tune IRQ distribution, the CPU governor and energy profiles so that the CPU does not constantly drop into power-saving states. I use the performance governor for latency-driven stacks, while pure batch boxes live with balanced control. I clearly separate test and production phases so that there are no surprises. After each step, I check logs and metrics before moving on.

Use cgroups and quotas sensibly

With cpu.shares I assign relative weights, while cpu.max sets hard boundaries. A customer with 512 shares gets half as much computing time as a customer with 1024 if both generate load at the same time. I use cpu.max to cap peaks cleanly, for example 50 ms in a 100 ms period. For dedicated jobs, cpuset.cpus is worthwhile so that a service uses fixed cores and the cache stays warm. All in all, this results in a resilient separation between customers and services.
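Pinning a service to fixed cores via cpuset could look like this sketch, assuming cgroup v2, root privileges, NUMA node 0, and a hypothetical group name; $DB_PID is a placeholder.

```shell
# Enable the cpuset controller for child groups (run once on the parent):
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control

# Hypothetical group for a dedicated database job:
mkdir -p /sys/fs/cgroup/db-primary
echo "4-7" > /sys/fs/cgroup/db-primary/cpuset.cpus   # only cores 4..7
echo 0     > /sys/fs/cgroup/db-primary/cpuset.mems   # memory from NUMA node 0

# Attach the service ($DB_PID is a placeholder):
echo "$DB_PID" > /sys/fs/cgroup/db-primary/cgroup.procs
```

Keeping cpuset.cpus and cpuset.mems aligned is what keeps the cache warm and memory accesses local.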

I document every change and compare it with the service levels I want to achieve. Without measured values, shares quickly lead to misinterpretations, which is why I always accompany adjustments with load tests. For containers, I set realistic quotas that can absorb peaks but do not slow down the host. A predictable error budget remains important so that noticeable latency peaks are detected. If you do this consistently, you will avoid surprises at peak times.

Practice: Web server and databases under CFS

Event-driven web servers reduce context switches and harmonize with CFS, which produces noticeably constant response times and better scaling. In tests, I see that NGINX maintains higher request rates with less jitter on the same hardware. Databases react positively to core affinity when background jobs are kept away from the hot cores. Simple rules help: web on cores A-B, batch on C-D and the DB on E-F. This way, the stack keeps the pipeline clean and the caches warm.
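The "web on A-B, batch on C-D, DB on E-F" rule can be sketched with taskset. The core numbers assume an illustrative 8-core host and the PIDs are placeholders.

```shell
# Re-bind running processes to fixed core sets (PIDs are placeholders):
taskset -cp 0,1 "$NGINX_PID"    # web workers on cores 0-1
taskset -cp 2,3 "$BACKUP_PID"   # batch/backup jobs on cores 2-3
taskset -cp 4,5 "$MYSQL_PID"    # database on cores 4-5

# Verify the current affinity mask of a process:
taskset -cp "$NGINX_PID"
```

For services managed by systemd, CPUAffinity= in the unit file achieves the same result persistently.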

Many small PHP-FPM workers cause too many switches with aggressive granularity. I then increase the minimum time slice and check whether response times remain stable. At the same time, I throttle chatty logs so that I/O does not become a brake. CFS provides the basis here, but peak performance comes from fine-tuning the entire stack. In this way, all the cogs interlock without taking the host's breath away.

Memory I/O and CPU scheduling: the interaction

The CPU scheduler and I/O scheduler influence each other, which is why a coherent setup brings noticeable latency advantages. For NVMe I usually use noop (called none on blk-mq kernels) or mq-deadline, while on HDDs mq-deadline serves long queues better. If the CPU allocates time punctually but the I/O path stalls, the overall picture is distorted. I therefore check the I/O scheduler in parallel with CFS parameters. I provide an overview of noop, mq-deadline and BFQ here: I/O schedulers in comparison.
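Inspecting and switching the I/O scheduler works per block device via sysfs. A sketch, assuming root and the device names nvme0n1 and sda as placeholders:

```shell
# Show available schedulers; the active one is in [brackets]:
cat /sys/block/nvme0n1/queue/scheduler

# NVMe: "none" is the blk-mq equivalent of noop:
echo none > /sys/block/nvme0n1/queue/scheduler

# HDD: mq-deadline serves long queues better:
echo mq-deadline > /sys/block/sda/queue/scheduler
```

For persistence, a udev rule per device type is the usual route, since sysfs settings do not survive a reboot.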

For database hosts, I adjust queue depths and read-ahead so that CFS-scheduled slots do not fizzle out due to blocking I/O. Web server boxes with many small files benefit from low latency in the I/O stack. In virtualization scenarios, I rely on consistent schedulers on host and guest to avoid unpredictable patterns. This is how the CPU scheduler interacts with the storage subsystem. In the end, what counts is a consistent chain from the request to the response.

SMP balancing, core affinity and NUMA

I direct threads to fixed cores so that caches stay warm and migration costs remain small. For NUMA hosts, I pin memory and CPU together, because remote memory accesses increase latency. CFS balances load between run queues, but deliberate affinity rules often get more out of it. Services with frequent cache access benefit from stable core groups. Batch jobs are allowed to roam as long as they do not interfere with the hot cores.

In practice, I set cpuset.cpus and numactl options, then test request times and cache miss rates. The fewer unnecessary migrations, the better the response time. I also evaluate the interrupt distribution so that hard IRQ peaks do not clog up a core. In this way, I achieve a smooth clocking of the important threads. This calmness pays off in the overall stack performance.
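The numactl side of this can be sketched as follows; the service binary path and the PID are illustrative.

```shell
# Start a service with CPU and memory pinned to the same NUMA node:
numactl --cpunodebind=0 --membind=0 -- /usr/sbin/mysqld

# Check the node layout of the host:
numactl --hardware

# Verify per-node memory placement of a running process (PID placeholder):
numastat -p "$MYSQL_PID"
```

If numastat shows significant memory on a remote node despite the binding, the allocation happened before the pin and a restart inside the binding is usually needed.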

Group scheduling: nice, weighting and hierarchies

A frequent stumbling block in hosting is the interaction between nice priorities and cgroup weights. CFS first distributes fairly between groups, then within the group between tasks. This means that a process with nice -5 can still get less CPU than another with nice 0 if its group (client/container) has a lower weight. For consistent results, I therefore set the group weights first and use nice only for fine-tuning within a service.

In practice, I work with a few clear levels (e.g. 512/1024/2048 shares for "low/normal/high") and document which services run in which group. This keeps fairness traceable in the hierarchy. Anyone who works a lot with short-lived processes (e.g. CGI/CLI jobs) also benefits from cgroup-based control, because volatile tasks would otherwise bypass the group corset unintentionally. I regularly use runtime metrics to check whether the internal allocation still matches the load profile.
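The three levels above can be sketched in cgroup v2 terms, where cpu.weight replaces v1 shares; the values 50/100/200 mirror the 512/1024/2048 ratio from the text.

```shell
# Hypothetical "low/normal/high" groups under cgroup v2 (requires root):
for g in low normal high; do mkdir -p "/sys/fs/cgroup/$g"; done

echo 50  > /sys/fs/cgroup/low/cpu.weight     # ~512 shares in v1 terms
echo 100 > /sys/fs/cgroup/normal/cpu.weight  # ~1024 shares (the default)
echo 200 > /sys/fs/cgroup/high/cpu.weight    # ~2048 shares

# nice values then only fine-tune *within* one of these groups.
```

Because only the ratios matter, any proportional triple would behave identically; picking round numbers just keeps the documentation readable.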

Containers and orchestration: requests, limits and throttling

In container environments, a "request" typically maps to a relative weight (shares/weight) and a "limit" to the quota (cpu.max). Their interaction determines throttling: if the quota is too tight, the container's CPU is suspended within the period, visible as p95/p99 latency spikes. I therefore set quotas so that normal bursts fit into the period and the services are rarely throttled hard. Where available, I use a burst reserve (e.g. cpu.max.burst) to cushion short peaks without distortions.
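A quota with burst reserve could look like this sketch. It assumes cgroup v2, a kernel new enough to have cpu.max.burst (roughly 5.14+), and a hypothetical group name.

```shell
# 80 ms quota per 100 ms period, i.e. ~80% of one core:
echo "80000 100000" > /sys/fs/cgroup/tenant-x/cpu.max

# Up to 20 ms of accumulated burst credit for short peaks:
echo 20000 > /sys/fs/cgroup/tenant-x/cpu.max.burst

# How often and how long the group was actually throttled:
grep -E 'nr_throttled|throttled_usec' /sys/fs/cgroup/tenant-x/cpu.stat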

It is important not to set requests too low: if the weights are too small, interactive services fall behind batch noise. I calibrate requests based on the measured base load and size limits so that error budgets are maintained during peak times. For multi-tenant nodes, I also plan buffer cores so that load peaks of individual containers do not affect their neighbors.

Measurement methods and troubleshooting in the scheduler context

I never evaluate CFS tuning blindly, but measure it in a targeted manner. I use for the overview:

  • Runqueue length per CPU (load vs. active cores),
  • Context switches per second and number of threads,
  • CPU steal and SoftIRQ shares,
  • Percentiles of response times (p50/p95/p99),
  • Distribution of vruntime or scheduling latencies.
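The metrics listed above map to standard tools roughly as follows; the intervals and counts are illustrative.

```shell
# Runqueue length (r column) and context switches (cs column):
vmstat 1 5

# Per-process voluntary/involuntary context switches:
pidstat -w 1 5

# %steal and %soft per CPU, to spot steal time and SoftIRQ hotspots:
mpstat -P ALL 1 5

# Scheduling latencies per task (record, then summarize):
perf sched record -- sleep 10
perf sched latency
```

pidstat and mpstat come from the sysstat package; for response-time percentiles themselves, I rely on the load-testing tool or application metrics rather than system counters.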

If latency peaks occur, I first look for throttling (quota exhausted), then for migrations (cold caches) and finally for I/O blockages (queue depth, storage saturation). I also look at wakeup patterns: frequent short wakeups of many workers indicate too fine a granularity or chatty I/O. An increased share of ksoftirqd on one core indicates hot IRQ queues; in that case I spread the IRQs and activate RPS/XPS so that network load is distributed more widely.

Real-time classes, pre-emption and tick control

In addition to CFS, the real-time classes SCHED_FIFO and SCHED_RR exist. They override CFS: an incorrectly configured RT thread can literally take the air out of the system. I therefore assign RT priorities only very selectively (e.g. for audio/telemetry) and define clear watchdogs. For hosting, CFS with clean weights is usually sufficient.
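Assigning an RT priority selectively can be sketched with chrt; the priority and PID are placeholders.

```shell
# Put one telemetry thread into SCHED_FIFO with a modest priority:
chrt -f -p 10 "$TELEMETRY_PID"

# Verify the policy and priority:
chrt -p "$TELEMETRY_PID"

# Built-in safety net: by default RT tasks may use at most 950 ms per 1 s,
# leaving 5% for non-RT work. Check the current values:
sysctl kernel.sched_rt_runtime_us kernel.sched_rt_period_us
```

That RT throttling default is exactly the kind of watchdog the text recommends; I would not raise it on shared hosts.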

On preemption: the choice of preemption model (e.g. voluntary vs. full/dynamic preempt) shifts the latency/throughput ratio. For web stacks I prefer more preemption, for pure batch hosts less. Tick optimizations (nohz modes) can reduce jitter but should be used with caution. On isolated cores, I occasionally combine nohz_full and affinity so that hot threads run as undisturbed as possible; it is important that system and IRQ load do not inadvertently migrate to these cores.

Virtualization: KVM, vCPU pinning and steal time

In the hypervisor environment, the host scheduler determines when vCPUs may run. Overbooking creates steal time in the guests, which acts like invisible latency. For latency-critical tenants, I pin vCPUs to physical cores and keep the overcommit moderate. I also separate emulator threads (I/O threads, vhost) from the hot cores of the guests so that they do not interfere with each other.
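With libvirt/KVM, the pinning described above can be sketched as follows; the domain name "guest1" and the core numbers are illustrative.

```shell
# Pin each vCPU to its own physical core:
virsh vcpupin guest1 0 4        # vCPU 0 -> physical core 4
virsh vcpupin guest1 1 5        # vCPU 1 -> physical core 5

# Keep emulator/I/O threads away from the guest's hot cores:
virsh emulatorpin guest1 0-1

# Inside the guest, watch the %steal column for invisible latency:
mpstat 1 5
```

If %steal stays visible despite pinning, the overcommit on the host is the next thing to check.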

I avoid double throttling: if the guest already uses cpu.max, I do not set an additional hard quota on the same workload on the host. Frequency control remains the task of the host; guests benefit indirectly if the host governor scales cleanly with the actual workload. For consistent latencies, I consider stability more important than peak GHz on paper.

AutoNUMA, memory localization and THP

NUMA can be a performance gain or a performance trap. AutoNUMA often helps but can generate additional overhead with heavily roaming threads. In hosting stacks with clear service boundaries, I pin CPU and memory (cpuset.cpus and cpuset.mems) together. This keeps hot data local, and CFS has to compensate for fewer migrations.

Large pages (THP) lower TLB pressure but do not fit every profile. For databases, madvise can make more sense than a blanket always. Blocking page faults hit interactive latency hard; I therefore plan buffers (page cache, shared buffers) so that CFS slots are used productively and do not wait for I/O or MMU events. This can be measured via page fault rates and cache miss curves.
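Switching THP to madvise, so that only applications that opt in via madvise(MADV_HUGEPAGE) receive huge pages, looks like this (requires root):

```shell
# Show the current mode; the active one is in [brackets]:
cat /sys/kernel/mm/transparent_hugepage/enabled

# Only hand out huge pages to regions that explicitly ask for them:
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# Observe the fault/split side mentioned above:
grep thp /proc/vmstat
```

Database vendors often document their preferred THP mode; I follow that recommendation over a blanket host-wide setting.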

Network path: IRQ control, RPS/XPS and busy polling

Many web workloads are NIC-dominated. I distribute the IRQ queues of the network card across multiple cores and keep them affine to the worker threads so that wakeups remain local. RPS/XPS helps resolve soft hotspots when individual RX/TX queues carry too much load. If ksoftirqd becomes visibly hot, this indicates overflowing SoftIRQs; I then spread the flows and, if necessary, raise the budget parameters without losing fairness.
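Steering IRQs and RPS happens via procfs/sysfs. A sketch, assuming root; the interface name eth0 and IRQ number 42 are placeholders you would look up first.

```shell
# Find the NIC's IRQ numbers:
grep eth0 /proc/interrupts

# Bind one RX queue's IRQ to core 2 (bitmask 0x4 = CPU 2):
echo 4 > /proc/irq/42/smp_affinity

# Spread receive processing of queue rx-0 across cores 0-3 (bitmask 0xf):
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

The irqbalance daemon automates the first part; I disable it only where a hand-tuned layout is demonstrably better.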

Optional busy polling can make sense in very special low-latency setups, but it costs CPU time. I use it rarely, and only if measurement proves that p99 drops significantly without stressing the host overall. Normally, clean IRQ affinity, cgroups and CFS granularity provide the better cost-benefit ratio.

Outlook: From CFS to EEVDF and userspace approaches

EEVDF extends fair distribution with deadlines, which promises noticeably shorter and more predictable responses. Especially under interactive latency targets, this can be the deciding factor. I keep a close eye on kernel versions and test EEVDF separately before switching. At the same time, userspace scheduling via eBPF patterns is gaining momentum, which can allow additional control depending on the workload. CFS remains relevant for hosting infrastructures, but EEVDF is likely to establish itself quickly.

A clear migration path remains important: tests, rollout on selected hosts, then expansion. This is the only way to keep percentiles and error rates under control. I keep benchmarks close to reality, including burst phases and slow backends. Only then do I intervene in live environments. In this way, progress can be achieved without nasty surprises.

Briefly summarized

The Linux CFS scheduler delivers fair distribution, solid integration and good control via cgroups. With suitable sysctl parameters, clean affinity and realistic quotas, I keep latencies low and throughput high. For special patterns, ULE, BFS or EEVDF offer additional leverage. I measure, compare and roll out changes in stages to limit risks. This keeps hosting predictable, and performance where it belongs.
