Interrupt coalescing bundles multiple incoming packets into a single hardware interrupt, reducing CPU load while sustaining throughput. I show how to tune timings, thresholds and NIC functions such as RSS and RSC to balance latency, jitter and throughput depending on the workload.
Key points
Overview: The following core aspects guide you in a structured way through the technology, tuning and practice.
- CPU load: Fewer interrupts, higher throughput.
- Latency trade-off: microseconds of delay traded against stability and pps.
- NIC tuning: RSS, RSC, MTU and BIOS power profiles.
- OS setup: ethtool, RSC/RSS, driver queues.
- Monitoring: pps, interrupts/s, p99 latency.
Interrupt coalescing briefly explained
Coalescing means that the network card collects incoming packets and only triggers an interrupt once enough work has accumulated or a timer expires. This significantly reduces the number of interrupts and moves part of the packet processing into the NIC, which relieves the CPU. On Windows servers, Receive Segment Coalescing (RSC) helps by combining multiple segments into larger blocks, cutting processing costs. On Linux, I control the aggregation via rx-usecs (time) and rx-frames (packets), depending on flow characteristics and target latency. This approach reduces overhead, keeps cores free and stabilizes throughput under heavy traffic. The deliberate compromise remains important: each aggregation step adds a small wait, which I cap tightly for latency-critical flows.
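On Linux, both knobs can be inspected and set with ethtool; a minimal sketch, assuming an interface called eth0 and purely illustrative values:

```shell
# Show the current coalescing settings of the NIC (interface name assumed)
ethtool -c eth0

# Fire an RX interrupt after at most 50 microseconds or 16 frames,
# whichever threshold is reached first (illustrative starting values)
ethtool -C eth0 rx-usecs 50 rx-frames 16
```

This requires root privileges, and not every driver exposes every field, so I re-read the settings with ethtool -c after each change.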
Mechanics: Timings, FIFO and thresholds
NICs keep incoming frames in a FIFO queue and trigger interrupts according to two criteria: after x received frames or after y microseconds. I set small time windows for low-latency services and increase them for high-throughput streams with large bursts. One interrupt vector per receive queue improves parallelization, while interrupt moderation reduces context switches and makes better use of the cache. However, rx-usecs set too high adds delay; values set too low generate interrupt storms and depress throughput. I therefore balance the timeout and packet limit against the MTU, frame size and the proportion of small packets.
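The interplay of the two thresholds can be estimated with simple arithmetic; the packet rate and thresholds below are assumed figures for illustration only:

```shell
# Which threshold fires first? A back-of-envelope sketch with assumed numbers.
pps=100000        # assumed packet rate (one packet every 10 us)
rx_usecs=100      # timer threshold in microseconds
rx_frames=32      # frame-count threshold
gap_us=$(awk -v p="$pps" 'BEGIN { printf "%.1f", 1000000 / p }')
frame_wait_us=$(awk -v g="$gap_us" -v f="$rx_frames" 'BEGIN { printf "%.1f", g * f }')
# The interrupt fires on whichever threshold is reached first
awk -v t="$rx_usecs" -v w="$frame_wait_us" \
  'BEGIN { printf "worst-case added latency: %.1f us\n", (t < w) ? t : w }'
```

At this assumed rate the 32-frame threshold would take 320 us to fill, so the 100 us timer dominates; at ten times the packet rate the frame threshold would fire first.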
Adaptive moderation and burst detection
Adaptive coalescing dynamically adjusts the time and packet windows to the current load. I use it when load profiles fluctuate strongly: at a low pps rate, the windows remain small (low latency); as the pps rate increases, they widen (relieving the CPU). The benefit depends on the driver: some NICs detect bursts and raise rx-usecs briefly, others work with fixed levels. I check the stability of the p99 latency with adaptation enabled; jittery curves indicate overly aggressive window jumps. For deterministic services, I prefer static, finely chosen thresholds, while I allow adaptive modes in bulk operation as long as there are no drops on the ring.
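Where the driver supports it, adaptive moderation is toggled via ethtool as well; eth0 is again an assumed interface name:

```shell
# Let the driver scale the coalescing windows with the current load
# (driver support varies; unsupported fields are rejected)
ethtool -C eth0 adaptive-rx on adaptive-tx on

# Verify what the driver actually accepted
ethtool -c eth0 | grep -i adaptive
```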
Throughput versus latency: the controllable compromise
Latency decreases when I deactivate coalescing, but the CPU then works significantly harder and scales worse under load. For file transfers, streaming or replication, I accept some delay because it increases stability and net throughput. For VoIP, real-time gaming or HFT, I prefer minimal delay and switch off moderation. I also check the TCP congestion control, because algorithms such as CUBIC or BBR strongly influence the behavior under packet loss, RTT changes and bursts. With finely adjusted timers, RSS and suitable TCP parameters, the trade-off becomes a measurable optimization.
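For the latency-first case, a sketch of near-zero moderation (assumed interface; note that some drivers reject a literal 0 and want 1 instead):

```shell
# Minimal moderation: one interrupt per frame, no delay window.
# CPU load rises sharply; reserve this for latency-critical hosts.
ethtool -C eth0 rx-usecs 0 rx-frames 1 tx-usecs 0 tx-frames 1
```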
Transmit coalescing, TSO/GSO/GRO and LRO
In addition to RX, TX coalescing plays a role: tx-usecs and tx-frames bundle outgoing packets, which saves context switches and stabilizes send throughput. I use moderate tx-usecs to smooth out bulk sends, but keep them small if short responses (e.g. HTTP APIs) need to go out quickly. Offloads like TSO/GSO enlarge segments before transmission and reduce the number of packets, while GRO/LRO merge segments on the RX side. I validate whether GRO/LRO harmonize with my middleboxes; for certain firewalls or capture requirements, I disable LRO to keep packet boundaries visible. All in all, I combine TX coalescing and offloads so that pps drops and the kernel spends less SoftIRQ time, without unnecessarily stretching response times.
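Offloads are toggled per feature with ethtool -K; a sketch of the combination described above, with eth0 assumed:

```shell
# Inspect the current offload state
ethtool -k eth0 | grep -E 'segmentation|receive-offload'

# Keep TSO/GSO/GRO for bulk traffic, disable LRO where packet
# boundaries must remain visible (firewalls, packet capture)
ethtool -K eth0 tso on gso on gro on lro off
```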
NIC tuning for hosting servers
RSS (Receive-Side Scaling) distributes incoming flows over several cores and prevents a single core from becoming a bottleneck. I enable RSS and set enough receive queues so that multi-core CPUs work efficiently. RSC also reduces load by merging smaller segments, which lowers the number of packets in the stack. For hosting workloads, I combine coalescing with clean MTU selection, DSCP/QoS prioritization and CPU power profiles in the BIOS, so that C-states and deep sleep modes do not increase latency. I test the combinations under load peaks and check whether IRQ affinity and queue pinning preserve cache locality. This is how I bring NIC tuning and interrupt coalescing together for hosting networks.
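Queue count and the RSS indirection table are also managed with ethtool; the value of 8 queues below is an assumption matching an 8-core budget:

```shell
# How many queues does the NIC expose, and how many are active?
ethtool -l eth0

# Use 8 combined RX/TX queues (bounded by the NIC maximum and core count)
ethtool -L eth0 combined 8

# Spread the RSS indirection table evenly across those 8 queues
ethtool -X eth0 equal 8
```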
NUMA, MSI-X and flow steering
On multi-socket hosts I pay attention to NUMA locality: I pin receive queues to cores that are close to the PCIe slot and place the associated worker threads on the same NUMA node. MSI-X interrupts provide multiple vectors; I use as many as make sense so that each RX/TX queue has its own interrupt and lock contention is reduced. RPS/RFS/XPS additionally help to steer flows to the "right" cores and control transmit queue assignment. I measure L1/L2 miss rates and watch whether cross-core traffic increases; if it does, I reassign queues or reduce the number of queues to improve locality.
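Affinity and software steering live in procfs/sysfs; the IRQ number and CPU masks below are hypothetical, the real IRQ numbers come from /proc/interrupts:

```shell
# Pin one queue's IRQ to CPU 2 (bitmask 0x4); IRQ 53 is a made-up example,
# the real number is listed per queue in /proc/interrupts
echo 4 > /proc/irq/53/smp_affinity

# RPS: let CPUs 0-3 (mask 0xf) do software-side processing of rx queue 0
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

# RFS: steer flows to the core where the consuming socket runs
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
```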
Parameters and their effects (table)
Parameters such as rx-usecs, rx-frames, RSS queues and RSC determine whether I prioritize minimal latency or stable throughput. I start with conservative values, measure p99 latency and interrupts per second, and then carefully increase the time windows. Small steps make it easier to attribute effects and prevent misinterpretations. If bursts dominate, I increase rx-frames slightly and monitor the jitter distribution. For mixed workloads, I vary the values per VLAN or NIC profile so that flows with different objectives are optimized separately.
| Parameter | Effect | Risk | Suitable for |
|---|---|---|---|
| rx-usecs (time) | Relieves the CPU via a delay window | More latency for short flows | High throughput, backups, replication |
| rx-frames (packets) | Combines small packets into one interrupt | Queue buildup during bursts | Many small packets, web traffic |
| RSS queues | Scales processing across several cores | Incorrect pinning increases cross-core traffic | Multi-core hosts with 10-100 Gbit/s |
| RSC/RSS active | Fewer packets in the stack | Unsuitable for ultra-low latency | Hosting, virtualization, storage |
Interpretation: If short flows dominate, I keep rx-usecs at the minimum; for bulk transfers I set higher values and benefit from a falling interrupt rate. I check p95/p99 latency and pps after each step to avoid misconfigurations. As the load increases, I monitor SoftIRQ times and context switches to ensure that CPU time flows to where it provides real benefit. A clean IRQ affinity layout prevents interrupts from wandering between cores and preserves cache hits.
Practice: Windows Server and Linux
Windows: In Device Manager, I open the NIC properties, select "Advanced" and adjust interrupt moderation, RSS and RSC as needed; for hard low-latency requirements, I set moderation to "Disabled". I set the power profile to High Performance so that C-states do not increase response time.
Linux: I use ethtool to adjust rx-usecs/rx-frames and check the IRQ and error counters with ethtool -S; irqbalance or explicit affinity pinning assigns queues to cores. For very small packets, I experiment with GRO/LRO and check whether the user path or the kernel path is the bottleneck. I provide more depth on this topic in my guide to optimizing CPU interrupts, which describes measurable steps and counter-checks.
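On the Linux side, the counter check described above might look like this; counter names are driver-dependent and eth0 is an assumed interface:

```shell
# Error and drop counters; names differ per driver, so the pattern is broad
ethtool -S eth0 | grep -iE 'drop|no_buffer|overrun|miss'

# Watch the live interrupt distribution of the NIC queues across cores
watch -n1 'grep eth0 /proc/interrupts'
```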
Virtualization and cloud: SR-IOV, vSwitch and vRSS
In virtualized environments, the path a packet takes determines the optimum settings. With SR-IOV, VFs bypass the vSwitch overhead; I configure coalescing directly on the PF/VF and make sure that guest and host follow similar policies. In vSwitch scenarios (Hyper-V, Open vSwitch), additional queues and schedulers are involved; vRSS distributes load within the VM across several vCPUs. I measure whether coalescing takes effect in the host or in the VM and avoid double moderation with windows that are too large. For NFV/DPDK workloads, the work shifts to userspace; there I adjust the polling budgets and keep kernel coalescing conservative so as not to distort measurements.
Performance measurement and telemetry
Measurement validates every optimization, so I track pps, bytes/s, interrupts/s, SoftIRQ times, drops and queue length. I compare p50/p95/p99 latency and pay attention to burst behavior, because mean values mask sharp outliers. For HTTP/2/3, I measure connection density, request rate and CPU time per request to detect side effects of coalescing. Storage nodes benefit when I look at iowait, IRQ load and network latency together, because bottlenecks tend to migrate between stack layers. Dashboards with events and deploy times help to attribute tuning steps clearly and stop regressions immediately.
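A quick sanity check I can apply to such telemetry: with a frame threshold active, the interrupt rate should be bounded by roughly pps divided by rx-frames. The numbers here are illustrative assumptions:

```shell
# Sanity check: with a frame threshold, interrupts/s is bounded by
# pps / rx-frames (both figures below are assumed for illustration)
pps=500000
rx_frames=64
irqs=$(awk -v p="$pps" -v f="$rx_frames" 'BEGIN { printf "%d", p / f }')
echo "expected interrupts/s: about $irqs (vs. $pps without coalescing)"
```

If the measured interrupt rate sits far above this bound, the timer threshold or an adaptive mode is firing earlier than expected.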
Time-critical protocols and hardware timestamps
For protocols with precise time measurement (e.g. PTP), I check whether coalescing affects timestamp accuracy. Some NICs offer hardware timestamps taken before coalescing, which is ideal for measurement accuracy. In such cases, I deactivate LRO/GRO and reduce rx-usecs to a minimum so that latency variance does not interfere with time synchronization. For deterministic networks (TSN), I keep energy-saving modes off, enforce strict QoS and confirm that no queues overflow in a way that jeopardizes clock stability.
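Whether a NIC can timestamp in hardware is visible via ethtool; eth0 is assumed:

```shell
# List supported timestamping modes; look for hardware-receive and
# hardware-transmit capabilities plus the PTP hardware clock index
ethtool -T eth0
```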
Workload profiles: When to activate, when not?
High throughput: Backups, CDN origin, object storage and VM replication benefit greatly from coalescing because the CPU is disturbed less often. Web hosting with many small requests needs moderate values, combined with RSS and good cache locality. Virtual environments win when I set smart defaults per vNIC and isolate noisy neighbors. For VoIP, gaming or real-time telemetry, I deactivate moderation or set very tight timers. Measurements against the actual traffic profile are mandatory, because 10 Gbit/s bulk traffic behaves differently from 1 Gbit/s API traffic.
Ring sizes, buffers and drop behavior
In addition to the timers, ring sizes (RX/TX descriptors) determine reliability during bursts. I increase RX descriptors moderately when short peaks cause drops, paying attention to memory footprint and cache behavior. Rings that are too large conceal problems but extend waiting times in the pipeline. I monitor "rx_no_buffer", "dropped" and "overruns" in the statistics counters and compare thresholds with typical burst lengths. A finely balanced combination of rx-frames, rx-usecs and ring size prevents bursts from causing losses or jitter spikes.
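Ring sizes are read and changed with ethtool -g/-G; the value 2048 is an illustrative step, not a recommendation:

```shell
# Current vs. hardware-maximum descriptor counts
ethtool -g eth0

# Grow the RX ring moderately if burst drops show up in the counters
ethtool -G eth0 rx 2048
```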
Jitter, packet loss and burst handling
Jitter occurs when coalescing windows and burst patterns interact unfavorably; I recognize this by wide latency distributions. Small timer adjustments often smooth out the p99 curve without visibly reducing throughput. If the NIC drops packets under load, I set less aggressive values and check the queue depth and driver counters. For websites, analyzing network jitter helps make render-blocking requests and TLS handshakes predictable. Finally, I check whether QoS policies cleanly separate priority classes and thus give critical flows preference.
Practical tuning checklist
Start with a baseline: I record latency, pps, interrupts/s and the CPU profile before each change. Then I activate RSS/RSC, set moderate coalescing values and measure p50/p95/p99 again. Next I increase rx-usecs in small steps until jitter or p99 latency rises, and roll back to the last good point. I assign queues to fixed cores and monitor cache misses; if cross-core traffic increases, I adjust the affinity. I briefly document each change and compare load peaks so that stability does not silently degrade.
Example start values according to link speed
- 1 Gbit/s: rx-usecs 25-50, rx-frames 8-16, tx-usecs 25-50; few RSS queues (2-4), focus on latency.
- 10 Gbit/s: rx-usecs 50-100, rx-frames 16-32, tx-usecs 50-100; 4-8 RSS queues, GRO on, LRO selective.
- 25/40 Gbit/s: rx-usecs 75-150, rx-frames 32-64, tx-usecs 75-150; 8-16 queues, strict NUMA pinning, RSC/RSS active.
- 100 Gbit/s: rx-usecs 100-200, rx-frames 64-128, tx-usecs 100-200; 16-32 queues, use MSI-X fully, increase ring sizes moderately.
Note: These are conservative entry points. I optimize along the p99 latency and drops and consider packet sizes (MTU 1500 vs. Jumbo), flow mix and CPU topology.
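To keep these entry points consistent across hosts, I might wrap them in a small helper that only prints the suggested command; the function name and interface are my own, and the values take the upper ends of the ranges above:

```shell
# Print (not execute) a suggested ethtool command per link speed in Gbit/s.
# Values mirror the conservative entry points listed above.
coalesce_cmd() {
  dev=$1
  speed=$2
  case $speed in
    1)     echo "ethtool -C $dev rx-usecs 50 rx-frames 16 tx-usecs 50" ;;
    10)    echo "ethtool -C $dev rx-usecs 100 rx-frames 32 tx-usecs 100" ;;
    25|40) echo "ethtool -C $dev rx-usecs 150 rx-frames 64 tx-usecs 150" ;;
    100)   echo "ethtool -C $dev rx-usecs 200 rx-frames 128 tx-usecs 200" ;;
    *)     echo "unknown link speed: $speed" >&2; return 1 ;;
  esac
}

coalesce_cmd eth0 10
```

Printing instead of executing keeps the helper safe to run anywhere and makes the resulting command easy to review before applying it.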
Costs, energy and sustainability
Energy consumption decreases when I push the interrupt rate down, because the CPU executes fewer context switches and wake-ups. In data centers, this adds up over many hosts and noticeably reduces power and cooling costs. An upgrade to modern 10/25/40/100G NICs with good moderation usually costs a few hundred euros, but often pays for itself quickly through lower CPU time per byte. I take into account whether licenses, driver maintenance and monitoring are already in place in order to keep running costs low. For SLA-critical services, a conservative window that limits jitter and secures the user experience is worthwhile.
Troubleshooting and anti-pattern
If metrics show interrupt storms, I reduce RSS queues or increase rx-usecs slightly. For "wobbly" latency curves, I deactivate adaptive moderation as a test. If drops occur despite high CPU reserves, I check ring sizes, the firmware version and PCIe link state power management. A classic: very high coalescing plus active GRO/LRO hides packet losses in p50 while p99 suffers; I then rebalance rx-frames and shorten rx-usecs. On multi-tenant hosts, "noisy neighbors" cause unevenly distributed IRQ load; I use hard affinity masks and QoS classes to protect critical flows. Important: always roll out changes individually and test them against identical load profiles to separate cause and effect cleanly.
Summary: Faster, smoother, more predictable
Core idea: Interrupt coalescing reduces interruptions, distributes work more intelligently and increases net throughput, as long as I set timers and packet limits deliberately. For high-throughput services I choose more generous windows, for real-time services minimal or deactivated moderation. With RSS, RSC, MTU discipline and clean IRQ affinity, I make full use of multi-core CPUs. Measurements with p95/p99, interrupts/s and SoftIRQ times back up every change and prevent misinterpretations. This keeps my network quiet under load, responsive, and delivering predictable latencies for hosting and applications.


