The Linux I/O scheduler decides how the system sorts, prioritizes, and dispatches read and write requests to SSDs, NVMe drives, and HDDs. In this guide, I explain in practical terms when Noop, mq-deadline, and BFQ are the best choice for hosting – including tuning, testing, and clear steps for action.
Key points
- Noop: Minimal overhead on SSD/NVMe and in VMs
- mq-deadline: Balanced latency and throughput for servers
- BFQ: Fairness and quick response in multi-user environments
- blk-mq: Multi-queue design for modern hardware
- Tuning: Tests per workload instead of fixed rules
How the I/O scheduler works in Linux hosting
A Linux I/O scheduler sorts I/O requests into queues, merges adjacent requests, and decides when they are dispatched to the device in order to reduce latency and increase throughput. Modern kernels use blk-mq (multi-queue), so multiple CPU cores can issue I/O in parallel. This is ideal for NVMe SSDs, which offer many hardware queues and high parallelism and thus keep queueing short. In hosting, broad mixed loads often collide: web servers issue many small reads, databases generate synchronous writes, and backups produce large sequential streams. The right scheduler reduces congestion, keeps response times stable, and protects the server experience under load.
blk-mq in practice: none vs. noop and kernel defaults
Since kernel 5.x, the multi-queue design has been the standard path. In blk-mq, none is the "Noop" equivalent, while noop historically belongs to the single-queue path. On NVMe devices, usually only none is available; on SATA/SAS you often see mq-deadline, optionally bfq and, depending on the distribution, also kyber. The defaults vary: NVMe usually starts with none, SCSI/SATA often with mq-deadline. I therefore always check the available options via cat /sys/block/<device>/queue/scheduler and decide per device. Where only none is selectable, this is intentional; additional sorting adds practically no value there.
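A quick overview of what every device offers can be scripted; a minimal sketch (the active scheduler is shown in square brackets):

```bash
# Print each block device with its available schedulers; the active one is in
# square brackets. "none" as the only entry on NVMe is expected, not an error.
for f in /sys/block/*/queue/scheduler; do
    dev=${f#/sys/block/}; dev=${dev%%/*}
    printf '%-12s %s\n' "$dev" "$(cat "$f")"
done
```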
Noop in server use: When minimalism wins
Noop primarily merges adjacent blocks but does not sort them, which keeps CPU overhead low. On SSDs and NVMe, controllers and firmware take care of clever sequencing, so additional sorting in the kernel adds little. In VMs and containers, I often choose Noop because the hypervisor already schedules I/O comprehensively. I do not use Noop on rotating disks because the lack of sorting increases seek times there. If you want to reliably narrow down the hardware context, look at the storage type first – NVMe, SSD, or HDD – before settling on the scheduler.
mq-deadline: Deadlines, sequences, and clear priorities
mq-deadline gives read requests short deadlines and lets write requests wait a little longer in order to keep response times low. The scheduler also sorts by block address, which reduces seek times and is particularly helpful for HDDs and RAID arrays. In web and database hosts, mq-deadline provides a good balance between latency and throughput. I like to use it when workloads are mixed and both reads and writes are constantly queued. For fine-tuning, I check request depth, writeback behavior, and controller cache to ensure that the deadline logic takes effect consistently.
BFQ: Fairness and responsiveness for many simultaneous users
BFQ distributes bandwidth proportionally and allocates budgets per process, which pays off noticeably when many users generate I/O in parallel. Interactive tasks such as admin shells, editors, or API calls remain fast even while backups run in the background. BFQ often achieves high efficiency on HDDs because it exploits sequential phases and makes clever use of short idle windows. On very fast SSDs there is a little extra overhead, which I weigh against the noticeable responsiveness. Those who use cgroups and ioprio can give clear guarantees with BFQ and thus avoid trouble with noisy neighbors.
QoS in everyday life: ioprio, ionice, and Cgroups v2 with BFQ
For clean prioritization, I combine BFQ with process and cgroup rules. At the process level, I set ionice classes and priorities: ionice -c1 (real-time) for latency-critical reads, ionice -c2 -n7 (best effort, low) for backups or index runs, ionice -c3 (idle) for everything that should only run during idle times. In cgroups v2, I use io.weight for relative shares (e.g., 100 vs. 1000) and io.max for hard limits, such as echo "259:0 rbps=50M wbps=20M" > /sys/fs/cgroup/<group>/io.max. With BFQ, weights are converted very precisely into bandwidth shares, which is ideal for shared hosting and container hosts where fairness is more important than maximum raw performance.
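A minimal sketch of how these pieces combine in practice; the cgroup name backup, the BACKUP_PID variable, and the 259:0 major:minor number are placeholders (check yours with lsblk), and the io controller must be enabled in the parent cgroup:

```bash
# Process level: demote a running backup, run archive jobs only when the disk is idle.
ionice -c2 -n7 -p "$BACKUP_PID"                     # best effort, lowest priority
ionice -c3 tar czf /backup/www.tar.gz /var/www      # idle class for the archive run

# Cgroups v2: relative weight plus a hard cap for the backup group.
mkdir -p /sys/fs/cgroup/backup
echo "+io" > /sys/fs/cgroup/cgroup.subtree_control  # enable the io controller (once)
echo 100 > /sys/fs/cgroup/backup/io.weight          # interactive groups get e.g. 1000
echo "259:0 rbps=50M wbps=20M" > /sys/fs/cgroup/backup/io.max
echo "$BACKUP_PID" > /sys/fs/cgroup/backup/cgroup.procs
```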
Practical comparison: Which choice suits the hardware?
The choice depends heavily on the storage type and queue architecture, so I first check the device and controller. SSDs and NVMe usually benefit from Noop/none, while HDDs run more smoothly with mq-deadline or BFQ. In RAID setups, SANs, and all-round hosts, I often prefer mq-deadline because deadline logic and sorting work well together. Multi-user environments with many interactive sessions often benefit from BFQ. The following table summarizes the strengths and typical areas of application:
| Scheduler | Hardware | Strengths | Weaknesses | Hosting scenarios |
|---|---|---|---|---|
| Noop/none | SSD, NVMe, VMs | Minimal overhead, clean merging | Disadvantageous without sorting on HDDs | Flash server, container, hypervisor-controlled |
| mq-deadline | HDD, RAID, all-round server | Strict read priority, sorting, solid latency | More logic than Noop | Databases, web backends, mixed workloads |
| BFQ | HDD, multi-user, desktop-like hosts | Fairness, responsiveness, good sequences | Slightly more overhead on very fast SSDs | Interactive services, shared hosting, dev servers |
Configuration: Check scheduler and set permanently
First, I check which scheduler is active, for example with cat /sys/block/sdX/queue/scheduler, and note the option shown in square brackets. To switch temporarily, I write, for example, echo mq-deadline | sudo tee /sys/block/sdX/queue/scheduler. For persistent settings, I use udev rules or kernel parameters such as scsi_mod.use_blk_mq=1 together with an mq-deadline default on the kernel command line. For NVMe devices, I check the paths under /sys/block/nvme0n1/queue/ and set the selection per device. Important: I document changes so that maintenance and rollbacks succeed without guesswork.
Persistence and automation in operation
In day-to-day operation, I value repeatability over manual one-off changes. Three approaches have proven effective:
- udev rules: Example for all HDDs (rotational=1): `echo 'ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"' > /etc/udev/rules.d/60-io-scheduler.rules`, then `udevadm control --reload-rules && udevadm trigger`.
- systemd-tmpfiles: For specific devices, I define `/etc/tmpfiles.d/blk.conf` with lines such as `w /sys/block/sdX/queue/scheduler - - - - mq-deadline`, which are written during boot.
- Configuration management: In Ansible/Salt, I create device classes (NVMe, HDD) and distribute consistent defaults along with documentation and rollback.
Note: elevator= was the kernel parameter for the old single-queue path. In blk-mq, I determine the choice per device. For stacks (dm-crypt, LVM, MD), I set the default on the top device; more on this below.
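For hosts without configuration management, a small boot-time hook can apply the same per-class defaults; a minimal sketch (for example called from a oneshot systemd unit or rc.local), where the device globs and the unit wiring are assumptions to adapt to your distribution:

```bash
#!/usr/bin/env bash
# Rotational devices get mq-deadline, flash devices stay on none where offered.
for dev in /sys/block/sd* /sys/block/vd* /sys/block/nvme*n*; do
    [ -e "$dev/queue/scheduler" ] || continue          # skip unmatched globs
    if [ "$(cat "$dev/queue/rotational")" = "1" ]; then
        echo mq-deadline > "$dev/queue/scheduler"
    elif grep -qw none "$dev/queue/scheduler"; then
        echo none > "$dev/queue/scheduler"
    fi
done
```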
Workloads in hosting: Recognizing patterns and taking the right action
First, I analyze the load: many small reads point to web front ends, sync-heavy writes to databases and log pipelines, large sequential streams to backups or archives. Tools such as iostat, vmstat, and blktrace show queues, latencies, and merge effects. If noticeable CPU idle time is caused by waiting on I/O, I refer to Understanding I/O Wait to resolve bottlenecks in a structured manner. I then test one or two scheduler candidates in identical time windows. Only measurement results are decisive, not gut feeling or myths.
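Two commands I keep running alongside such an analysis; the device names are examples:

```bash
# Extended device statistics every second for one minute: r_await/w_await are the
# average read/write latencies in ms, aqu-sz the queue length, %util the saturation.
iostat -x -d sda nvme0n1 1 60
# System view: "b" (tasks blocked on I/O) and "wa" (CPU time spent waiting on I/O)
# rising together is the classic sign of an I/O bottleneck.
vmstat 1 30
```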
Deepening measurement practice: reproducible benchmarks
For reliable decisions, I use controlled fio profiles and confirm them with real application tests:
- Random reads (web/cache): `fio --name=rr --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=120 --time_based --filename=/mnt/testfile --direct=1`
- Random mix (DB): `fio --name=randmix --rw=randrw --rwmixread=70 --bs=8k --iodepth=64 --numjobs=8 --runtime=180 --time_based --direct=1`
- Sequential (backup): `fio --name=seqw --rw=write --bs=1m --iodepth=128 --numjobs=2 --runtime=120 --time_based --direct=1`
In parallel, I log iostat -x 1 and pidstat -d 1 and note the P95/P99 latencies reported by fio. For in-depth diagnoses, I use blktrace or eBPF tools such as biolatency. Important: I measure at the same times of day, with the same load windows, and the same file sizes. I minimize cache effects with direct=1 and clean preconditions (e.g., pre-filling the volume).
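To compare candidates under identical conditions, I script the runs; a minimal sketch, where the device name sdX and the test file are placeholders:

```bash
#!/usr/bin/env bash
# Run the same fio profile once per scheduler candidate and keep the JSON output
# (which contains the latency percentiles) for later comparison.
DEV=sdX   # placeholder: the device under test
for sched in mq-deadline bfq; do
    echo "$sched" > "/sys/block/$DEV/queue/scheduler"
    sync; echo 3 > /proc/sys/vm/drop_caches          # comparable starting conditions
    fio --name=randmix --rw=randrw --rwmixread=70 --bs=8k --iodepth=64 \
        --numjobs=8 --runtime=180 --time_based --direct=1 \
        --filename=/mnt/testfile --size=10G \
        --output-format=json --output="fio-${sched}.json"
done
```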
File systems and I/O schedulers: Interaction matters
The file system affects I/O characteristics, so I check its journal mode, queue depth, and sync behavior very carefully. EXT4 and XFS work efficiently with mq-deadline, while ZFS buffers and aggregates a lot itself. On hosts with ZFS, I often observe a smaller scheduler effect because ZFS already shapes the output. For comparisons, I use identical mount options and workloads. If you are weighing up options, the comparison of EXT4, XFS, and ZFS offers helpful perspectives on storage tuning.
Writeback, cache, and barriers: the often overlooked half
Schedulers can only work as well as the writeback subsystem allows. I therefore always check:
- Dirty parameters: `vm.dirty_background_bytes`, `vm.dirty_bytes`, and `vm.dirty_expire_centisecs` (via sysctl) control when and how aggressively the kernel writes back. For databases, I often lower them to cut burst peaks and keep P99 stable.
- Barriers/flushes: I only relax options such as EXT4 `barrier` or the XFS default flush behavior if the hardware (e.g., BBWC) takes over that guarantee; "nobarrier" without power protection is risky.
- Device write cache: I verify the controller's write cache settings so that `fsync` actually ends up on the medium and not just in the cache.
Smoothing writeback reduces the load on the scheduler: deadlines remain reliable, and BFQ has less work to do against sudden flush waves.
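A hedged example of such smoothing for a write-heavy host; the values are starting points rather than recommendations and belong in /etc/sysctl.d/ once they have proven themselves in your own P99 measurements:

```bash
sysctl -w vm.dirty_background_bytes=$((256*1024*1024))  # start background writeback at 256 MiB
sysctl -w vm.dirty_bytes=$((1024*1024*1024))            # hard cap of 1 GiB dirty data
sysctl -w vm.dirty_expire_centisecs=1500                # flush data older than 15 seconds
```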
Virtualization, containers, and the cloud: Who is really planning?
In VMs, the hypervisor controls the physical I/O flow, which is why I often choose Noop/none in the guest to avoid duplicated logic. On the host itself, I use mq-deadline or BFQ depending on the device and task. For cloud volumes (e.g., network block storage), part of the scheduling happens in the backend, so I measure real latencies instead of relying on assumptions. For container hosts with highly mixed loads, BFQ often provides better interactivity. In homogeneous flash-only batch clusters, Noop prevails because every bit of CPU time counts and the controllers work efficiently.
RAID, LVM, MD, and multipath: where the scheduler comes into play
In layered block stacks, I set the scheduler on the top device because that is where the relevant queues are located:
- LVM/dm-crypt: Set the scheduler on `/dev/dm-*` or `/dev/mapper/<name>`. I usually leave the physical PVs at none so that merging and sorting do not happen twice.
- MD RAID: Decide on `/dev/mdX`; the underlying sdX devices remain at none. Hardware RAID is treated as a single block device.
- multipath: Set the policy on the multipath mapper (`/dev/mapper/mpatha`); set the path devices below it to none.
Important: I separate tests according to pool and redundancy level (RAID1/10 vs. RAID5/6). Parity RAIDs are more sensitive to random writes; here, mq-deadline often wins out thanks to consistent read deadlines and ordered output.
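A small sketch for checking such a stacked layout; the device names (dm-0, sda, sdb) are placeholders:

```bash
# Show what sits below the mapper, then check which schedulers each level offers;
# set the policy where the kernel presents a choice and keep the members at none.
ls /sys/block/dm-0/slaves/
for d in dm-0 sda sdb; do
    printf '%-6s %s\n' "$d" "$(cat /sys/block/$d/queue/scheduler)"
done
echo none > /sys/block/sda/queue/scheduler   # members: no double sorting
echo none > /sys/block/sdb/queue/scheduler
```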
Tuning strategies: Step by step to reliable performance
I start with a baseline measurement: current response times, throughput, 95th/99th percentiles, and CPU load. After that, I change only one factor, typically the scheduler, and repeat the same load. Tools such as fio help keep this controlled, but I confirm every hypothesis with real application tests. Databases require their own benchmarks that reflect transactions and fsync behavior. Only when the measurement is stable do I finalize my choice and document why.
Queue depth, read ahead, and CPU affinity
In addition to the scheduler, queue parameters also have a significant impact in practice (a short sketch follows the list):
- Queue depth: `/sys/block/<device>/queue/nr_requests` limits the number of pending requests per hardware queue. NVMe can handle high depths (high throughput), while HDDs benefit from moderate depths (more stable latency).
- Readahead: `/sys/block/<device>/queue/read_ahead_kb` or `blockdev --getra`/`--setra`. Set it slightly higher for sequential workloads, keep it low for random workloads.
- rq_affinity: With `/sys/block/<device>/queue/rq_affinity` set to 2, I ensure that I/O completions are preferentially handled on the CPU core that issued them, which reduces cross-CPU costs.
- rotational: I verify that SSDs report `rotational=0` so that the kernel does not apply HDD heuristics.
- Merges: `/sys/block/<device>/queue/nomerges` can reduce merging (2 = off). Useful for NVMe micro-latency, but usually disadvantageous for HDDs.
- io_poll (NVMe): Polling can reduce latency but costs CPU. I enable it selectively for low-latency requirements.
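A minimal sketch with example values per device class; nvme0n1 and sdb are placeholders, and the kernel may clamp or reject values it cannot honor:

```bash
echo 32   > /sys/block/sdb/queue/nr_requests        # moderate depth keeps HDD latency predictable
echo 1024 > /sys/block/sdb/queue/read_ahead_kb      # higher readahead helps sequential HDD streams
echo 128  > /sys/block/nvme0n1/queue/read_ahead_kb  # keep readahead low for random-heavy flash
echo 2    > /sys/block/nvme0n1/queue/rq_affinity    # complete I/O on the issuing CPU core
cat /sys/block/nvme0n1/queue/rotational             # should print 0 for flash
```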
Scheduler tunables in detail
Depending on the scheduler, useful fine-tuning options are available (a short sketch follows the list):
- mq-deadline: `/sys/block/<device>/queue/iosched/read_expire` (ms, typically small), `write_expire` (larger), `fifo_batch` (batch size), `front_merges` (0/1). I keep `read_expire` short to protect the P95 of reads and adjust `fifo_batch` depending on the device.
- BFQ: `slice_idle` (idle time for exploiting sequential phases), `low_latency` (0/1) for responsive interactivity. With `bfq.weight` in cgroups, I control relative shares very precisely.
- none/noop: Hardly any tuning knobs; here the environment (queue depth, readahead) determines the results.
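Two examples, assuming sdb runs mq-deadline and sdc runs bfq (the tunables only exist while the respective scheduler is active on the device):

```bash
# mq-deadline on sdb: short read deadlines protect the P95 of reads.
echo 100 > /sys/block/sdb/queue/iosched/read_expire   # milliseconds
echo 8   > /sys/block/sdb/queue/iosched/fifo_batch    # smaller batches favor latency over throughput
# BFQ on sdc: prioritize interactivity, keep a short idle window for sequential phases.
echo 1 > /sys/block/sdc/queue/iosched/low_latency
echo 4 > /sys/block/sdc/queue/iosched/slice_idle      # milliseconds
```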
I only change one parameter at a time and keep strict track of the changes—that way, it's clear what effect each change had.
Common pitfalls and how I avoid them
Mixed pools of HDDs and SSDs behind a RAID controller distort tests, so I separate measurements per group. I keep in mind that the scheduler applies per block device, so I consider LVM mappers and MD devices separately. Persistence tends to slip through the cracks: without a udev rule or kernel parameter, the default returns after a reboot. Cgroups and I/O priorities often remain unused, even though they significantly improve fairness. And I always check queue depth, writeback, and file system options so that the chosen logic can show its potential.
Troubleshooting: Read symptoms carefully
When the readings change, I interpret patterns and derive concrete steps:
- High P99 latency with many reads: Check whether writes are displacing reads. Test with mq-deadline, lower `read_expire`, smooth writeback (adjust `vm.dirty_*`).
- 100% utilization on an HDD, low throughput: Seeks dominate. Try BFQ or mq-deadline, reduce readahead, keep queue depth moderate.
- Good throughput values, but the UI stutters: Interactivity suffers. Activate BFQ, prioritize critical services via `ionice -c1` or cgroup weights.
- Significant variation depending on the time of day: Shared resources. Isolate with cgroups, choose the scheduler per pool, move backups to off-peak times.
- NVMe timeouts in dmesg: Backend or firmware issue. Deactivate `io_poll` on a trial basis, check firmware/driver, verify path redundancy (multipath).
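Two checks I usually run first when such symptoms appear; the bcc tool path is an assumption and differs between distributions (e.g., biolatency-bpfcc on Debian/Ubuntu):

```bash
/usr/share/bcc/tools/biolatency -D 10 1     # per-disk latency histogram over a 10-second window
dmesg -T | grep -iE 'nvme|timeout|reset'    # controller resets or timeouts point away from the scheduler
```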
In summary: Clear decisions for everyday hosting
For flash storage and VM guests, I often opt for Noop to save overhead and let the controllers do their job. On all-round servers with HDDs or RAID, mq-deadline delivers dependable latency and solid throughput. With many active users and interactive loads, BFQ ensures fair sharing and noticeable responsiveness. Before each commitment, I measure with real workloads and observe the effects on P95/P99. This allows me to make traceable decisions, keep systems running smoothly, and stabilize server performance in day-to-day business.


