With I/O scheduler tuning I specifically optimize the kernel path for storage access and reduce latency in hosting environments. This article shows in practical terms how I adapt Linux disk scheduling to hardware and workload in order to increase hosting performance safely.
Key points
The following key points will give you a quick overview of the content of this article.
- Scheduler selection: none/noop, mq-deadline, BFQ, Kyber depending on hardware and workload
- Measurement strategy: fio, iostat, P95/P99, IOPS and throughput before/after changes
- Fine adjustments: readahead, rq_affinity, cgroups, ionice for QoS
- Persistence: udev rules and GRUB parameters for persistent profiles
- Practice: troubleshooting for latency peaks, fairness and NVMe specifics
How Linux disk scheduling works
I see the I/O scheduler as a control center that sorts requests into queues, merges and prioritizes them. With HDDs, it avoids expensive head movements by sorting requests by block address, which reduces seek times. On SSDs and NVMe, parallelism dominates, which is why the multi-queue subsystem blk-mq widens the path and distributes it across several CPUs. This reduces latencies, smooths peaks and keeps throughput on track, even if many services are writing and reading at the same time. In hosting, web servers, databases and backup jobs come together, so I always align scheduling with the dominant access patterns.
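To see this multi-queue layout on a concrete device, a quick look into sysfs helps; a minimal sketch, assuming an NVMe device named nvme0n1:
# One directory per blk-mq hardware queue
ls /sys/block/nvme0n1/mq/
# Which CPUs feed each hardware queue
grep . /sys/block/nvme0n1/mq/*/cpu_list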
The common schedulers briefly explained
For NVMe and modern SSDs, I often choose none (the blk-mq equivalent of noop), because the controller optimizes internally and any additional overhead costs performance. mq-deadline sets fixed deadlines for reads and writes, prioritizes read operations and delivers consistent response times under mixed server loads. BFQ distributes bandwidth fairly across processes and is suitable for multi-tenant setups in which individual VMs would otherwise monopolize the disk. Kyber aims for low latencies and throttles incoming requests if target times are exceeded. CFQ is legacy (removed along with the single-queue block layer in kernel 5.0) and hardly fits NVMe; I only use CFQ when old kernels require it or tests show clear advantages; I provide a detailed overview here: I/O Scheduler Guide.
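Which of these schedulers a given kernel actually offers per device can be listed in one pass; a small sketch (the active scheduler appears in square brackets):
# Available and active scheduler per block device
for d in /sys/block/*/queue/scheduler; do echo "$d: $(cat "$d")"; done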
I/O Scheduler Tuning step by step
I start with a clear baseline measurement run so that I can demonstrate gains objectively. For this I use fio for synthetic patterns and iostat for device statistics, and I collect P95/P99 latencies for reads and writes. I then check the active scheduler per device and change it at runtime for a quick counter-test. I only make adjustments persistent when stable measurements show that the choice is right. In this way, I avoid wrong decisions that later force expensive rollbacks.
# Check current scheduler
cat /sys/block/<device>/queue/scheduler
# Change on the fly (example: nvme0n1 to mq-deadline)
echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler
# Fast comparison with fio (Random Reads 4k)
fio --name=rr --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --filename=/dev/nvme0n1
I keep an eye on CPU load because an unsuitable scheduler creates additional context switches and thus reduces net performance. As soon as latencies fall and throughput rises, I record the decision and document the test profiles. Each change is followed by a measurement, so that I can clearly separate cause and effect. This discipline pays off when several disk classes are installed in the server and individual devices react differently.
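To correlate a scheduler change with context-switch overhead, I watch switch rates next to the I/O metrics; a minimal sketch using the sysstat tools already in play:
# Per-process voluntary/involuntary context switches, 1 s interval
pidstat -w 1
# System-wide context switches per second (cs column)
vmstat 1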
Fine adjustments: Readahead, RQ affinity, Cgroups
After selecting the scheduler, I adjust the queue parameters to the load. For sequential backups I raise readahead; for random I/O I lower it so that no unnecessary pages are loaded. With rq_affinity, I ensure that completions land on the core that issued the request, which improves latency and cache locality. I use ionice to downgrade processes such as backups and indexing so that web requests do not suffer. In multi-tenant environments, I regulate bandwidth and IOPS via cgroups v2 to set hard limits per client.
# Raise readahead for sequential patterns (value in KB)
echo 4096 | sudo tee /sys/block/<device>/queue/read_ahead_kb
# rq_affinity: 2 = complete on the issuing core
echo 2 | sudo tee /sys/block/<device>/queue/rq_affinity
# Deprioritize a backup process
ionice -c2 -n7 -p <pid>
# Cgroup v2: weight and limit (example major:minor 8:0)
echo 1000 | sudo tee /sys/fs/cgroup/hosting/io.weight
echo "8:0 rbps=50M wbps=25M" | sudo tee /sys/fs/cgroup/hosting/io.max
Which choice is right for hosting profiles?
I decide the scheduler choice according to hardware class, access pattern and target metric (latency vs. throughput vs. fairness). NVMe SSDs in single-tenant VMs usually benefit from none because the controller does extensive optimization and every software layer counts. For mixed read/write loads on SSDs, I often use mq-deadline as it prioritizes read requests and thus protects response times. In shared hosting environments, I choose BFQ to ensure fairness between customers and prevent bandwidth monopolies. I use Kyber when target latencies are critical and I need to maintain hard limits for certain workloads.
| Scheduler | Suitable hardware | Typical workloads | Advantages | Notes |
|---|---|---|---|---|
| Noop/none | NVMe, modern SSD | Many parallel reads/writes, VMs | Minimal overhead, high parallelism | Controller takes over sorting; test in SAN/RAID |
| mq-deadline | SSD, SAS, fast HDD | Mixed web/DB loads | Read latencies prioritized, good throughput | Deadline values conservative; fine-tuning possible |
| BFQ | SSD/HDD in multi-tenant | Many users, cgroups | Clear fairness and bandwidth control | Some administrative effort, clean weighting |
| Kyber | SSD, NVMe | Latency-critical services | Target latencies controllable | Measure precisely to set throttling correctly |
| CFQ | Legacy hardware, old kernels | Legacy workloads | Former default | Removed with the legacy block layer in kernel 5.0; rarely useful on modern NVMe/SSD |
Practical profiles and measured values
For web servers with many small reads, P95 latency counts more than raw IOPS, so I test GET requests with keep-alive and TLS in combination. Databases bring sync writes into play, which is why I simulate flush behavior and fsync costs with fio job files. Backup windows often consist of sequential streams; here I measure throughput in MB/s and make sure that frontend requests do not wait too long. In my tests, I see 20-50 % shorter response times, depending on the starting point, when scheduler and readahead match the workloads. If you need more context on measuring disk throughput, you can find an introduction here: Disk throughput in hosting.
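For a quick look at fsync costs before writing full job files, a one-liner sketch works (against a disposable scratch file, not a raw device):
# Probe 4k sync-write latency against a scratch file
fio --name=fsync-probe --rw=randwrite --bs=4k --fsync=1 --size=256M --filename=/tmp/fio-probe --time_based --runtime=30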
Persistent configuration and automation
I anchor the choice permanently via udev rules so that devices start in the appropriate mode directly after reboots. I often set none for NVMe, mq-deadline for SSDs and BFQ for rotating media if fairness is paramount. Optionally I set a global default via GRUB if I am running a homogeneous setup. I keep the rules short and document them in the configuration repository so that the team can track them. For more in-depth kernel optimization, this article supplements the setup: Kernel performance in hosting.
# /etc/udev/rules.d/60-ioschedulers.rules
# NVMe: none
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
# SSDs: mq-deadline
ACTION=="add|change", KERNEL=="sd[a-z]|vd[a-z]", ATTR{rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# HDDs: BFQ
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{rotational}=="1", ATTR{queue/scheduler}="bfq"
# Reload/test rules
udevadm control --reload
udevadm trigger
# Optional global default via GRUB (note: on modern blk-mq kernels the
# elevator= parameter is ignored; prefer udev rules there)
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="elevator=mq-deadline"
update-grub
QoS with Cgroups v2 and ionice
So that no single job blocks the disk, I rely on QoS rules with cgroups v2 and add priorities via ionice. For premium tenants I raise io.weight, while I set hard limits for noisy neighbors with io.max. I bind systemd units directly to cgroups so that services automatically land in the right class at startup. I temporarily throttle short-term maintenance work so that customer requests continue to run smoothly. This interplay of weighting, limits and process priority creates predictable response times even under load.
# Create cgroup and set limits
mkdir -p /sys/fs/cgroup/hosting
echo 1000 | tee /sys/fs/cgroup/hosting/io.weight
echo "8:0 rbps=100M wbps=60M" | tee /sys/fs/cgroup/hosting/io.max
# Move a process into the cgroup
echo <pid> | tee /sys/fs/cgroup/hosting/cgroup.procs
# Low I/O priority for secondary jobs
ionice -c2 -n7 -p <pid>
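Binding services to these classes via systemd saves the manual cgroup plumbing; a hedged sketch of a drop-in for a hypothetical backup.service, using systemd's resource-control properties:
# /etc/systemd/system/backup.service.d/io.conf (unit name and device are examples)
[Service]
IOWeight=50
IOReadBandwidthMax=/dev/sda 50M
IOWriteBandwidthMax=/dev/sda 25M
# Apply with: systemctl daemon-reload && systemctl restart backup.service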
Monitoring and troubleshooting
I always keep telemetry close to the workloads, otherwise I make decisions blind. I use iostat to read service times and queue depths, blktrace to analyze request flows, and sar/dstat to see system load over time. For latencies, I never look only at averages but always at P95/P99, because noticeable stutters become visible there. If P95 is good but P99 degrades, I adjust the queue depth or rq_affinity and check for competing jobs. After each correction, I compare the same metrics so that the effect remains verifiable.
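In the iostat output I mostly read four columns; a quick sketch with the column names of modern sysstat versions:
# r_await/w_await = avg read/write latency (ms), aqu-sz = queue depth, %util = saturation
iostat -x 1 nvme0n1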
Typical stumbling blocks and remedies
High latency on SSDs often indicates an unsuitable scheduler; I then immediately test mq-deadline and check whether reads become faster. I solve unfair distribution in multi-tenant setups with BFQ and clear cgroup weights so that strong customers do not crowd out weaker ones. NVMe timeouts indicate firmware issues or overly aggressive polling; in such cases I disable io_poll and lower the queue depth until stability returns. Fluctuating throughput in backup windows can often be smoothed out with an adjusted readahead, especially when large files dominate. If several factors are changing at the same time, I proceed step by step: one change, then a measurement, then the next.
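For the NVMe case, the two knobs mentioned above live in sysfs; a hedged sketch, assuming nvme0n1:
# Disable I/O polling
echo 0 | sudo tee /sys/block/nvme0n1/queue/io_poll
# Lower the request queue depth step by step
echo 256 | sudo tee /sys/block/nvme0n1/queue/nr_requests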
Scheduler tunables in detail
Once the basic selection has been made, I turn to the tuning knobs of the respective schedulers. I always start by looking at the available parameters for each device, as they vary by kernel and distribution.
# Display available tunables
ls -1 /sys/block/<device>/queue/iosched
cat /sys/block/<device>/queue/iosched/*
# Example: tighter mq-deadline deadlines for write-heavy jobs (values in ms)
echo 100 | sudo tee /sys/block/<device>/queue/iosched/read_expire
echo 500 | sudo tee /sys/block/<device>/queue/iosched/write_expire
echo 1 | sudo tee /sys/block/<device>/queue/iosched/front_merges
# Example: BFQ for stricter fairness and lower idle times
echo 1 | sudo tee /sys/block/<device>/queue/iosched/low_latency
echo 0 | sudo tee /sys/block/<device>/queue/iosched/slice_idle
For mq-deadline I mainly adjust read_expire/write_expire (in milliseconds) and front_merges for merging pending requests. With BFQ, depending on tenant density, I toggle low_latency and slice_idle to reduce waiting times between flows. I document every change with measurements, because wrong expire values can trigger unwanted latency peaks under burst load.
File system and mount options
Scheduler tuning only really comes into its own when the file system matches. I pay attention to:
- relatime/noatime: avoid unnecessary metadata writes.
- discard vs. fstrim: On SSDs/NVMe I usually use periodic fstrim instead of online discard to avoid latency spikes.
- Journaling: For ext4, data=ordered (the default) and a suitable commit= interval (e.g. 10-30 s depending on data-loss tolerance) have proven themselves.
- Barriers: Write barriers stay active; I do not disable them unless the hardware guarantees power-loss protection (battery/capacitor).
# Example /etc/fstab for ext4
UUID=<uuid> /data ext4 defaults,noatime,commit=20 0 2
# Enable periodic TRIM instead of discard option
systemctl enable fstrim.timer
systemctl start fstrim.timer
For XFS I also set noatime and prefer fstrim.timer. Journal and barrier options are distribution-dependent; I always test the specific kernel/FS combination and measure P95/P99.
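Whether the options actually took effect can be checked without a reboot; a small sketch, assuming the /data mount from above:
# Show effective mount options for the mount point
findmnt -o TARGET,FSTYPE,OPTIONS /data
# Check when fstrim last ran
systemctl status fstrim.timer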
RAID, LVM, DM-crypt and Multipath
In stacked setups (device mapper, LVM, mdraid, multipath) I define the scheduler where the application sees the I/O, i.e. at the top-level device, and avoid double sorting underneath.
# Set scheduler at the top level (e.g. dm-0)
echo mq-deadline | sudo tee /sys/block/dm-0/queue/scheduler
# Underlying NVMe/SAS devices "none" to avoid double scheduling
for d in /sys/block/nvme*n1 /sys/block/sd*; do echo none | sudo tee $d/queue/scheduler; done
# mdraid: Optimize readahead and stripe cache (RAID5/6)
sudo blockdev --setra 4096 /dev/md0
echo 4096 | sudo tee /sys/block/md0/md/stripe_cache_size
With encrypted volumes (dm-crypt/LUKS), I pay attention to CPU offload (AES-NI) and make sure the I/O path does not wander through work queues unnecessarily. I specifically measure sync-write latencies, as these can increase due to the crypto layer. In multipath environments (SAN/iSCSI), I set the scheduler on the multipath device (dm-X) and check that path failover does not generate outliers.
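Before blaming the scheduler for crypto overhead, I verify the offload and, on fast NVMe, test dm-crypt without its work queues; a hedged sketch (the workqueue flags require a reasonably recent cryptsetup and kernel):
# Confirm AES-NI and measure cipher throughput
grep -m1 -o aes /proc/cpuinfo
cryptsetup benchmark
# Optionally bypass the dm-crypt workqueues for lower latency on NVMe
sudo cryptsetup refresh <name> --perf-no_read_workqueue --perf-no_write_workqueue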
Virtualization and containers: host vs. guest
In the KVM stack, I deliberately separate host and guest. In the guest I usually use none for virtio devices, so that the hypervisor takes over the optimization. On the host I then select, for each physical device, the scheduler that matches the hardware (often none/mq-deadline on SSD/NVMe).
# Guest (virtio-blk/virtio-scsi): set scheduler to "none"
echo none | sudo tee /sys/block/vda/queue/scheduler
# Host: QEMU with iothreads and multiqueue for virtio-blk
qemu-system-x86_64 \
-drive if=none,id=vd0,file=/var/lib/libvirt/images/guest.qcow2,cache=none,aio=native \
-object iothread,id=ioth0 \
-device virtio-blk-pci,drive=vd0,num-queues=8,iothread=ioth0
I bind containers directly to cgroups v2 and use systemd properties (IOWeight, IOReadBandwidthMax/IOWriteBandwidthMax) so that services start with the correct I/O budgets automatically. Important: prioritize at only one level, either in the container or in the host service, to avoid conflicting rules.
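For one-off maintenance jobs I use a transient scope instead of editing units; a minimal sketch (paths are examples):
# Run a job in its own scope with reduced I/O weight
systemd-run --scope -p IOWeight=50 tar czf /backup/site.tar.gz /var/www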
NUMA, IRQ and polling optimization
On multi-socket systems, I keep I/O and CPUs NUMA-local. I check the distribution of NVMe interrupts and adjust it if irqbalance is working suboptimally. I also use blk-mq options to keep completions local.
# Check NVMe interrupts and set core masks (example)
grep -i nvme /proc/interrupts
echo <cpumask> | sudo tee /proc/irq/<irq>/smp_affinity
# blk-mq: completions on the issuing core
echo 2 | sudo tee /sys/block/<device>/queue/rq_affinity
# Optional: toggle I/O polling depending on workload (use carefully; 0 disables)
echo 0 | sudo tee /sys/block/<device>/queue/io_poll
For NVMe, I can adjust interrupt coalescing parameters via controller features to balance CPU load against latency. I work forward in small steps and check whether P99 remains stable or whether coalescing causes visible sluggishness.
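Reading and setting this works via nvme-cli; a hedged sketch (feature ID 0x08 is Interrupt Coalescing in the NVMe spec; the value encodes aggregation threshold and time, so check the controller documentation before setting anything):
# Read current interrupt coalescing settings
sudo nvme get-feature /dev/nvme0 -f 0x08
# Example: set aggregation threshold/time (encoding per NVMe spec)
sudo nvme set-feature /dev/nvme0 -f 0x08 -v 0x0101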
Sample fio job profiles and measurement plan
I create reproducible job files and note the kernel, scheduler, queue parameters and file-system mounts. This allows me to compare results over weeks. Caution: the sync-write job below writes directly to the raw device given in filename and destroys its data, so it belongs on a disposable test device only.
# db-sync.fio - DB-like sync writes (ext4/XFS)
[global]
ioengine=libaio
direct=1
filename=/dev/<device>
time_based=1
runtime=90
thread=1
numjobs=8
iodepth=1
[randwrite-sync4k]
rw=randwrite
bs=4k
fsync=1
# web-randread.fio - Web-like reads
[global]
ioengine=libaio
direct=1
filename=/dev/<device>
time_based=1
runtime=90
thread=1
numjobs=8
iodepth=32
[randread-4k]
rw=randread
bs=4k
# Measuring frame
# 1) Warmup 60s, 2) Measurement 90s, 3) Cooldown 30s
# Parallel: run iostat, pidstat and blktrace
iostat -x 1 | tee iostat.log &
pidstat -dl 1 | tee pidstat.log &
blktrace -d /dev/<device> -o - | blkparse -i - -d trace.dump &
# Extract P95/P99 from the fio JSON output
fio --output-format=json --output=fio.json db-sync.fio
jq '.jobs[].write.clat_ns.percentile | {p95: .["95.000000"], p99: .["99.000000"]}' fio.json
I only ever change one variable, e.g. the scheduler or read_ahead_kb, and re-run the identical job files for comparison. Only when the improvements are consistent over several runs do I commit the settings.
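To keep this one-variable discipline honest, I script the sweep so every run is identical except for the scheduler; a minimal sketch, assuming the web-randread.fio file from above (with its filename placeholder filled) and jq installed:
# Sweep schedulers on one device and log P99 read latency per run
DEV=nvme0n1
for s in none mq-deadline kyber; do
  echo "$s" | sudo tee /sys/block/$DEV/queue/scheduler
  fio --output-format=json --output="fio-$s.json" web-randread.fio
  p99=$(jq '.jobs[0].read.clat_ns.percentile["99.000000"]' "fio-$s.json")
  echo "$s p99_ns=$p99" | tee -a sweep.log
done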
Change management: safely introduce and roll back
In productive hosting environments, I roll out I/O changes in stages: first a canary host, then a small AZ/cluster batch, and only then the broad rollout. I version udev rules and attach each change to a ticket with measurements. For the rollback, I keep a script ready that restores the previous values (scheduler, read_ahead_kb, cgroup limits). This keeps interventions reversible if workloads change at short notice.
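The rollback stays trivial if the previous state is captured before each change; a hedged sketch of such a helper (script name hypothetical):
# save-io-state.sh - record the active values before a change
DEV=nvme0n1
sched=$(sed -n 's/.*\[\(.*\)\].*/\1/p' /sys/block/$DEV/queue/scheduler)
echo "scheduler=$sched" > /root/io-state-$DEV
echo "read_ahead_kb=$(cat /sys/block/$DEV/queue/read_ahead_kb)" >> /root/io-state-$DEV
# The rollback script tees these values back into the same sysfs files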
Summary: This is how I proceed
I start with a clear baseline: I measure latencies and throughput and document the setup. I then select a suitable scheduler per device: none for NVMe/virtual SSDs, mq-deadline for mixed server loads, BFQ for shared environments with many users. After that I tweak readahead, rq_affinity and process priorities to favor front-end workloads. Once measurements consistently show that the choice works, I persist it via udev/GRUB and document the parameters. Monitoring stays active because workloads change, and with small corrections I keep performance permanently high.


