
Linux I/O schedulers: Noop, mq-deadline & BFQ explained for hosting

The Linux I/O scheduler decides how the system sorts and prioritizes read and write accesses to SSDs, NVMe drives, and HDDs and dispatches them to the device. In this guide, I explain in practical terms when Noop, mq-deadline, and BFQ are the best choice for hosting, including tuning, testing, and clear steps to take.

Key points

  • Noop: Minimal overhead on SSD/NVMe and in VMs
  • mq-deadline: Balanced latency and throughput for servers
  • BFQ: Fairness and quick response in multi-user environments
  • blk-mq: Multi-queue design for modern hardware
  • Tuning: Test per workload instead of applying fixed rules

How the I/O scheduler works in Linux hosting

A Linux I/O scheduler sorts I/O requests into queues, merges adjacent requests, and decides when to dispatch them to the device in order to reduce latency and increase throughput. Modern kernels use blk-mq, i.e., multi-queue, so that multiple CPU cores can submit I/O in parallel. This is ideal for NVMe SSDs, which offer many hardware queues and high parallelism and thus keep queueing to a minimum. In hosting, broadly mixed loads often collide: web servers issue many small reads, databases generate sync-heavy writes, and backups produce sequential streams. The right scheduler reduces congestion, keeps response times stable, and protects the server experience under load.

blk-mq in practice: none vs. noop and kernel defaults

Since kernel 5.x, the multi-queue design has been the standard path. This means that none is the "Noop" equivalent for blk-mq, while noop historically comes from the single-queue path. On NVMe devices, usually only none is available; on SATA/SAS you often see mq-deadline, optionally bfq and, depending on the distribution, also kyber. The defaults vary: NVMe usually starts with none, SCSI/SATA often with mq-deadline. I therefore always check the available options via cat /sys/block/<device>/queue/scheduler and decide per device. Where only none is selectable, this is intentional: additional sorting adds practically no value there.
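
A minimal check, assuming the standard sysfs layout: it prints each block device together with its available schedulers, with the active one shown in square brackets.

    # Show the active ([...]) and available schedulers for every block device
    for dev in /sys/block/*; do
        echo "$(basename "$dev"): $(cat "$dev/queue/scheduler")"
    done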

Noop in server use: When minimalism wins

Noop primarily merges adjacent blocks but does not sort them, which keeps CPU overhead low. On SSDs and NVMe, controllers and firmware take care of the clever sequencing, so additional sorting in the kernel adds little. In VMs and containers, I often choose Noop because the hypervisor already schedules I/O comprehensively. I don't use Noop on rotating disks because the lack of sorting increases seek times there. If you want to reliably delimit the hardware context, first look at the storage type; a look at NVMe, SSD, and HDD helps before determining the scheduler.

mq-deadline: Deadlines, sequences, and clear priorities

mq-deadline gives read accesses short deadlines and makes write accesses wait a little longer in order to keep response times low. The scheduler also sorts by block address, reducing seek times, which is particularly helpful for HDDs and RAID arrays. On web and database hosts, mq-deadline provides a good balance between latency and throughput. I like to use it when workloads are mixed and both reads and writes are constantly queued. For fine-tuning, I check request depth, writeback behavior, and controller cache to ensure that the deadline logic takes effect consistently.

BFQ: Fairness and responsiveness for many simultaneous users

BFQ distributes bandwidth proportionally and allocates budgets per process, which makes it behave noticeably fairly when many users generate I/O in parallel. Interactive tasks such as admin shells, editors, or API calls remain fast even while backups run in the background. BFQ often achieves high efficiency on HDDs because it exploits sequential phases and makes clever use of short idle windows. On very fast SSDs there is a little extra overhead, which I weigh against the noticeable responsiveness. Those who use cgroups and ioprio can give clear guarantees with BFQ and thus avoid trouble from noisy neighbors.

QoS in everyday life: ioprio, ionice, and Cgroups v2 with BFQ

For clean prioritization, I combine BFQ with process and cgroup rules. At the process level, I set ionice classes and priorities: ionice -c1 (real-time) for latency-critical reads, ionice -c2 -n7 (best effort, low) for backups or index runs, ionice -c3 (idle) for everything that should only run during quiet periods. In cgroups v2, I use io.weight for relative shares (e.g., 100 vs. 1000) and io.max for hard limits, such as echo "259:0 rbps=50M wbps=20M" > /sys/fs/cgroup/<group>/io.max. With BFQ, weights are converted very precisely into bandwidth shares: ideal for shared hosting and container hosts on which fairness is more important than maximum raw power.
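
A sketch of how this can look in practice, assuming cgroup v2 is mounted at /sys/fs/cgroup; the group names (web, backup), the device numbers, and the backup paths are placeholders.

    # Enable the io controller for child groups (cgroup v2 mounted at /sys/fs/cgroup)
    echo "+io" > /sys/fs/cgroup/cgroup.subtree_control
    # Two groups with a 10:1 weight ratio in favour of the web workload
    mkdir -p /sys/fs/cgroup/web /sys/fs/cgroup/backup
    echo 1000 > /sys/fs/cgroup/web/io.weight
    echo 100 > /sys/fs/cgroup/backup/io.weight
    # Hard cap for the backup group on device 259:0 (major:minor from lsblk), values in bytes/s
    echo "259:0 rbps=52428800 wbps=20971520" > /sys/fs/cgroup/backup/io.max
    # Independently of cgroups: run the backup itself with idle I/O priority (paths are examples)
    ionice -c3 tar -czf /backup/site.tar.gz /var/www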

Practical comparison: Which choice suits the hardware?

The choice depends heavily on the storage type and queue architecture, so I first check the device and controller. SSDs and NVMe usually benefit from Noop/none, while HDDs run more smoothly with mq-deadline or BFQ. In RAID setups, SANs, and all-round hosts, I often prefer mq-deadline because deadline logic and sorting work well together. Multi-user environments with many interactive sessions often benefit from BFQ. The following table summarizes the strengths and sensible areas of application:

Scheduler    | Hardware                            | Strengths                                         | Weaknesses                                | Hosting scenarios
Noop/none    | SSD, NVMe, VMs                      | Minimal overhead, clean merging                   | Disadvantageous on HDDs (no sorting)      | Flash servers, containers, hypervisor-controlled guests
mq-deadline  | HDD, RAID, all-round servers        | Strict read priority, sorting, solid latency      | More logic than Noop                      | Databases, web backends, mixed workloads
BFQ          | HDD, multi-user, desktop-like hosts | Fairness, responsiveness, good sequential phases  | Slightly more overhead on very fast SSDs  | Interactive services, shared hosting, dev servers

Configuration: Check scheduler and set permanently

First, I check which scheduler is active, for example with cat /sys/block/sdX/queue/scheduler, and note the option shown in square brackets. To switch temporarily, I write, for example, echo mq-deadline | sudo tee /sys/block/sdX/queue/scheduler. For persistent settings, I use udev rules or kernel parameters such as scsi_mod.use_blk_mq=1 on the kernel command line. For NVMe devices, I check the paths under /sys/block/nvme0n1/queue/ and set the selection per device. Important: I document changes so that maintenance and rollbacks succeed without guesswork.

Persistence and automation in operation

In day-to-day operation, I rely on repeatability through automation. Three approaches have proven effective:

  • udev rules: Example for all HDDs (rotational=1): echo 'ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"' > /etc/udev/rules.d/60-io-scheduler.rules, then udevadm control --reload-rules && udevadm trigger (a fuller rule set follows after this list).
  • systemd-tmpfiles: For specific devices, I define /etc/tmpfiles.d/blk.conf with lines such as w /sys/block/sdX/queue/scheduler - - - - mq-deadline, which are written at boot.
  • Configuration management: In Ansible/Salt, I create device classes (NVMe, HDD) and distribute consistent defaults along with documentation and rollback.
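
A sketch of a combined rule set for the file mentioned above; the device matches cover typical single-digit SATA and NVMe names and should be adapted to the actual hardware.

    # /etc/udev/rules.d/60-io-scheduler.rules
    # Rotating SATA/SAS disks get mq-deadline
    ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
    # Non-rotating SATA/SAS SSDs get none
    ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
    # NVMe namespaces get none
    ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"

Afterwards, udevadm control --reload-rules && udevadm trigger applies the rules without a reboot.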

Note: elevator= was the kernel parameter for the old single-queue path. In blk-mq, I determine the choice per device. For stacks (dm-crypt, LVM, MD), I set the default on the top device; more on this below.

Workloads in hosting: Recognizing patterns and taking the right action

First, I analyze the load: many small reads indicate web front ends, sync-heavy writes indicate databases and log pipelines, and large sequential streams indicate backups or archiving. Tools such as iostat, vmstat, and blktrace show queues, latencies, and merge effects. If there is noticeable CPU idle time due to I/O, I refer to Understanding I/O Wait to resolve bottlenecks in a structured manner. I then test one or two scheduler candidates in identical time windows. Only measurement results are decisive, not gut feeling or myths.

Deepening measurement practice: reproducible benchmarks

For reliable decisions, I use controlled fio profiles and confirm with real application tests:

  • Random reads (Web/Cache): fio --name=rr --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=120 --time_based --filename=/mnt/testfile --direct=1
  • Random mix (DB): fio --name=randmix --rw=randrw --rwmixread=70 --bs=8k --iodepth=64 --numjobs=8 --runtime=180 --time_based --direct=1
  • Sequential (Backup): fio --name=seqw --rw=write --bs=1m --iodepth=128 --numjobs=2 --runtime=120 --time_based --direct=1

In parallel, I log iostat -x 1 and pidstat -d 1 and note the P95/P99 latencies reported by fio. For in-depth diagnoses, I use blktrace or eBPF tools such as biolatency. Important: I measure at the same times of day, with the same load windows and the same file sizes. I minimize cache effects with direct=1 and clean preconditions (e.g., pre-filling the volume).
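
A minimal wrapper around the random-read profile from above, assuming the log file names and the scratch file under /mnt are placeholders; it records iostat for the duration of the run and keeps the fio result as JSON.

    # Capture device statistics while the fio run is active
    iostat -x 1 > iostat-rr.log 2>&1 &
    IOSTAT_PID=$!
    fio --name=rr --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
        --runtime=120 --time_based --filename=/mnt/testfile --direct=1 \
        --output-format=json --output=fio-rr.json
    kill "$IOSTAT_PID"
    # P95/P99 completion latencies: clat_ns percentiles inside fio-rr.json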

File systems and I/O schedulers: Interaction matters

The file system affects the I/O characteristics, so I check its journal mode, queue depth, and sync behavior very carefully. EXT4 and XFS work efficiently with mq-deadline, while ZFS buffers and aggregates a lot itself. On hosts with ZFS, I often observe a smaller scheduler effect because ZFS already shapes the output. For comparisons, I use identical mount options and workloads. If you are weighing up options, the comparison of EXT4, XFS, and ZFS offers helpful perspectives on storage tuning.

Writeback, cache, and barriers: the often overlooked half

Schedulers can only work as well as the writeback subsystem allows. I therefore always check:

  • Dirty parameters: sysctl vm.dirty_background_bytes, vm.dirty_bytes, and vm.dirty_expire_centisecs control when and how aggressively the kernel writes back. For databases, I often lower these values to cap burst peaks and keep P99 stable.
  • Barriers/flushes: I only relax options such as EXT4 barriers or the XFS default flush behavior if hardware (e.g., a BBWC) takes over; "nobarrier" without power protection is risky.
  • Device write cache: I verify the controller's write cache settings so that fsync actually reaches the medium and not just the cache.

Smoothing writeback reduces the load on the scheduler: deadlines remain reliable, and BFQ has to do less work to absorb sudden flush waves. The sketch below shows a possible starting point.
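
A hedged starting point for a write-heavy host; the absolute values are assumptions to be validated against your own P99 measurements, not fixed recommendations.

    # /etc/sysctl.d/90-writeback.conf : smooth writeback instead of large bursts
    vm.dirty_background_bytes = 67108864     # start background writeback at 64 MiB
    vm.dirty_bytes = 268435456               # hard limit of 256 MiB dirty data
    vm.dirty_expire_centisecs = 1500         # consider dirty pages old after 15 s
    # apply with: sysctl --system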

Virtualization, containers, and the cloud: Who is really planning?

In VMs, the hypervisor controls the physical I/O flow, which is why I often choose Noop/none in the guest to avoid duplicated logic. On the host itself, I use mq-deadline or BFQ depending on the device and task. For cloud volumes (e.g., network block storage), part of the scheduling happens in the backend, so I measure real latencies instead of relying on assumptions. For container hosts with highly mixed loads, BFQ often provides better interactivity. In homogeneous flash-only batch clusters, Noop prevails because every bit of CPU time counts and the controllers work efficiently.
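
Inside a guest, the check looks the same as on bare metal; vda is the typical virtio disk name and an assumption here.

    cat /sys/block/vda/queue/scheduler
    # e.g. "[none] mq-deadline bfq": none is fine, the hypervisor schedules the real device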

RAID, LVM, MD, and multipath: where the scheduler comes into play

In stacked block setups, I set the scheduler on the top device because that is where the relevant queues live:

  • LVM/dm-crypt: Set the scheduler on /dev/dm-* or /dev/mapper/<name>. I usually leave the physical PVs on none so that merging and sorting do not happen twice.
  • MD RAID: Decide on /dev/mdX; the underlying sdX devices stay on none. Hardware RAID is treated as a single block device.
  • multipath: Set the scheduler on the multipath mapper (/dev/mapper/mpatha); set the path devices below it to none (see the sketch after this list).
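
A sketch for the multipath case; the mapper device (dm-0) and the path devices (sdc, sdd) are placeholders and should be resolved via dmsetup ls and multipath -ll first.

    # Scheduler on the multipath mapper (dm-0 here; map names via: dmsetup ls)
    echo mq-deadline > /sys/block/dm-0/queue/scheduler
    # The individual path devices stay on none so sorting is not done twice
    echo none > /sys/block/sdc/queue/scheduler
    echo none > /sys/block/sdd/queue/scheduler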

Important: I run separate tests per pool and redundancy level (RAID1/10 vs. RAID5/6). Parity RAIDs are more sensitive to random writes; here, mq-deadline often wins out thanks to consistent read deadlines and ordered dispatch.

Tuning strategies: Step by step to reliable performance

I start with a baseline measurement: current response times, throughput, 95th/99th percentiles, and CPU load. After that, I change only one factor, typically the scheduler, and repeat the same load. Tools such as fio help to keep this controlled, but I confirm every hypothesis with real application tests. Databases require their own benchmarks that reflect transactions and fsync behavior. Only when the measurement is stable do I finalize my choice and document why.

Queue depth, read ahead, and CPU affinity

In addition to the scheduler, queue parameters also have a significant impact on practice:

  • Queue depth: /sys/block/<device>/queue/nr_requests limits pending requests per hardware queue. NVMe copes well with high depth (high throughput), while HDDs benefit from moderate depth (more stable latency).
  • Readahead: /sys/block/<device>/queue/read_ahead_kb or blockdev --getra/--setra. Slightly higher for sequential workloads, keep low for random workloads.
  • rq_affinity: With /sys/block/<device>/queue/rq_affinity set to 2, I ensure that I/O completions are preferentially handled on the CPU core that submitted the request, which reduces cross-CPU costs.
  • rotational: I verify that SSDs report rotational=0 so that the kernel does not apply HDD heuristics.
  • Merges: /sys/block/<device>/queue/nomerges can reduce merging (2 = off). Useful for NVMe micro-latency, but usually disadvantageous for HDDs.
  • io_poll (NVMe): Polling can reduce latency but costs CPU. I enable it selectively for low-latency requirements. A combined example follows below.
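
A sketch applying some of the knobs above to one HDD (sdb) and one NVMe namespace (nvme0n1); the device names and values are starting points under assumed hardware, not fixed rules.

    # HDD (sdb): moderate queue depth, more readahead for sequential phases
    echo 64 > /sys/block/sdb/queue/nr_requests
    echo 512 > /sys/block/sdb/queue/read_ahead_kb
    # NVMe (nvme0n1): small readahead for random I/O, completions near the submitting CPU
    echo 128 > /sys/block/nvme0n1/queue/read_ahead_kb
    echo 2 > /sys/block/nvme0n1/queue/rq_affinity
    # Sanity check: SSDs should not be flagged as rotational
    cat /sys/block/nvme0n1/queue/rotational    # expected: 0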

Scheduler tunables in detail

Depending on the scheduler, useful fine-tuning options are available:

  • mq-deadline: /sys/block/<device>/queue/iosched/read_expire (ms, typically small), write_expire (larger), fifo_batch (batch size), front_merges (0/1). I keep read_expire short to protect P95 reads and adjust fifo_batch depending on the device.
  • BFQ: slice_idle (idle time to exploit sequential phases), low_latency (0/1) for responsive interactivity. With bfq.weight in cgroups, I control relative shares very precisely.
  • none/noop: Hardly any knobs to turn; instead, the environment (queue depth, readahead) determines the results (see the sketch after this list).
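
A sketch of the tunable paths, assuming mq-deadline is active on sdb and bfq on sdc; the values are illustrative starting points.

    # mq-deadline on sdb: short read deadlines, writes may wait longer
    echo 250 > /sys/block/sdb/queue/iosched/read_expire     # ms
    echo 5000 > /sys/block/sdb/queue/iosched/write_expire   # ms
    echo 16 > /sys/block/sdb/queue/iosched/fifo_batch
    # BFQ on sdc: favour interactivity, keep a short idle window for sequential readers
    echo 1 > /sys/block/sdc/queue/iosched/low_latency
    echo 8 > /sys/block/sdc/queue/iosched/slice_idle        # ms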

I only change one parameter at a time and keep strict track of the changes—that way, it's clear what effect each change had.

Common pitfalls and how I avoid them

Mixed pools of HDD and SSD behind a RAID controller distort tests, so I separate measurements per group. I don't forget that the scheduler applies per block device, so I consider LVM mappers and MD devices separately. Persistence tends to slip through the cracks: without a udev rule or kernel parameter, the default returns after a reboot. Cgroups and I/O priorities often remain unused, even though they significantly improve fairness. And I always check queue depth, writeback, and file system options so that the chosen logic can show its potential.

Troubleshooting: Read symptoms carefully

When the readings change, I interpret patterns and derive concrete steps:

  • High P99 latency with many reads: Check whether writes are displacing reads. Test mq-deadline, lower read_expire, smooth writeback (adjust vm.dirty_*).
  • 100% util on HDD with low throughput: Seeks dominate. Try BFQ or mq-deadline, reduce readahead, use moderate queue depth.
  • Good throughput figures, but the UI stutters: Interactivity suffers. Enable BFQ and prioritize critical services via ionice -c1 or cgroup weights.
  • Significant variation depending on the time of day: Shared resources. Isolate with cgroups, choose the scheduler per pool, move backups to off-peak times.
  • NVMe timeouts in dmesg: Backend or firmware issue. Temporarily disable io_poll, check firmware/driver, verify path redundancy (multipath). A few quick checks follow below.
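
A few quick checks that match the symptoms above; they only read data and can run on any host.

    # Per-device latency and utilisation (await and %util columns)
    iostat -x 1 5
    # Which processes generate the I/O?
    pidstat -d 1 5
    # NVMe errors or timeouts in the kernel log
    dmesg -T | grep -iE 'nvme|timeout'
    # I/O latency histogram from bcc-tools (binary name varies by distro, e.g. biolatency-bpfcc)
    biolatency 10 1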

In summary: Clear decisions for everyday hosting

For flash storage and guests, I often opt for Noop to save overhead and let the controllers do their job. On all-round servers with HDD or RAID, mq-deadline delivers reliable latency and high usability. With many active users and interactive loads, BFQ ensures fair sharing and noticeable responsiveness. Before each commitment, I measure with real workloads and observe the effects on P95/P99. This allows me to make traceable decisions, keep systems running smoothly, and stabilize server performance in day-to-day business.
