Handling high network load comes down to efficient processing of server IRQ signals: if you distribute interrupts wisely across CPU cores, you reduce latency and prevent drops. In this guide, I'll show you how to combine IRQ balancing, RSS/RPS and CPU affinity in a practical way to keep high-load hosting sustainable and performant to operate.
Key points
- IRQ distribution prevents hotspots on individual CPU cores.
- Multi-queue plus RSS/RPS parallelizes packet processing.
- NUMA awareness reduces cross-node access and latency.
- CPU Governor and thread pinning smooth out response times.
- Monitoring tracks pps, latencies, drops and core utilization.
IRQs briefly explained: Why they control the network load
For every incoming packet, the network card signals via IRQ that work is pending; otherwise the kernel would have to poll actively. If that work stays assigned to one core, its utilization climbs while other cores remain idle. This is exactly when latencies grow, the RX ring buffers fill up and drivers start dropping packets. I distribute interrupts across suitable cores to keep packet processing even and predictable. This relieves bottlenecks, smooths response times and keeps packet loss to a minimum.
IRQ balancing and CPU affinity under Linux
The irqbalance service distributes interrupts dynamically, analyzes load and shifts affinities automatically over time. For extreme load profiles, I define affinities manually via /proc/irq/<IRQ>/smp_affinity and bind queues specifically to cores of the same NUMA node. This combination of automatic balancing and manual fine-tuning lets me handle both base load and peaks cleanly. An in-depth look at interrupt handling and CPU optimization helps me with this planning. One thing remains essential: I consistently tie hardware topology, IRQ distribution and application threads together.
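A minimal sketch of manual pinning, assuming the NIC is eth0 and its RX queue interrupts should land on cores 2 and 3 of the local NUMA node (interface name, IRQ numbers and masks are placeholders):

```bash
# Find the IRQ numbers of the eth0 queues (naming varies by driver)
grep eth0 /proc/interrupts

# Pin IRQ 45 to core 2 via a hex CPU mask (bit 2 = 0x4);
# smp_affinity_list accepts plain core numbers instead of a mask
echo 4 > /proc/irq/45/smp_affinity
echo 3 > /proc/irq/46/smp_affinity_list

# Verify that the affinities were accepted
cat /proc/irq/45/smp_affinity_list
cat /proc/irq/46/smp_affinity_list
```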
Practical use of multi-queue NICs, RSS and RPS
Modern NICs provide several RX/TX queues, each queue triggers its own IRQs, and Receive Side Scaling (RSS) distributes flows across cores. If there are not enough hardware queues, I add Receive Packet Steering (RPS) and Transmit Packet Steering (XPS) in the kernel for additional parallelism. With ethtool -L ethX combined N I match the queue count to the number of cores on the associated NUMA node. With ethtool -S and nstat I check whether drops, busy polls or high pps peaks occur. For finer load smoothing, I also plan in interrupt coalescing so that the NIC does not generate too many individual IRQs.
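A sketch of this combination, assuming eth0 sits on a NUMA node with 8 cores and the driver supports 8 combined queues (all values are illustrative and driver-dependent):

```bash
# Show and set the number of hardware queues
ethtool -l eth0
ethtool -L eth0 combined 8

# RPS: steer RX queue 0 to cores 0-7 (mask ff) with 4096 flow buckets
echo ff   > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

# XPS: bind TX queue 0 to cores 0-3 (mask 0f)
echo 0f > /sys/class/net/eth0/queues/tx-0/xps_cpus

# Moderate coalescing so the NIC batches events instead of firing per packet
ethtool -C eth0 adaptive-rx on rx-usecs 50
```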
The following table shows central components and typical commands that I use for a coherent setup:
| Building block | Goal | Example | Note |
|---|---|---|---|
| irqbalance | Automatic distribution | systemctl enable --now irqbalance | Starting point for mixed workloads |
| Affinity | Fixed pinning | echo mask > /proc/irq/XX/smp_affinity | Observe NUMA assignment |
| Queues | More parallelism | ethtool -L ethX combined N | Match to node cores |
| RSS/RPS | Flow distribution | sysfs: rps_cpus/rps_flow_cnt | Useful for a small number of NIC queues |
| XPS | TX path pinned to cores | sysfs: xps_cpus | Avoids cache thrash |
Making sensible use of automatic IRQ balancing
For mixed hosting servers it is often enough to activate irqbalance, because the daemon continuously detects load shifts. I check its status via systemctl status irqbalance and look at /proc/interrupts to see the distribution per queue and core. If latencies rise during peaks, I define test cores that primarily process interrupts and compare measurements before and after the change. I keep the configuration simple so that later audits and rollbacks are quick. Only when patterns are clear do I go deeper into pinning.
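A quick check, sketched for eth0, to see whether irqbalance is active and how the distribution looks per queue and core:

```bash
# Is the daemon running and healthy?
systemctl status irqbalance --no-pager

# Per-IRQ counters: one row per queue, one column per core;
# strongly skewed columns indicate a hotspot on a single core
grep eth0 /proc/interrupts

# Watch the distribution evolve under load
watch -n1 'grep eth0 /proc/interrupts'
```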
Manual CPU affinity for maximum control
At very high pps rates, I pin RX queues to selected cores of the same NUMA node and deliberately separate application threads from them. I isolate individual cores for interrupts, run workers on neighboring cores and pay strict attention to cache locality. In this way, I reduce cross-node accesses and minimize expensive context switches in the hot path. For reproducible results, I clearly document the IRQ masks, the queue assignment and the thread affinity of the services. This clarity keeps packet transit times constant and reduces outliers.
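A sketch of the NUMA-aware part, assuming eth0 and an application worker with PID 1234 (both placeholders):

```bash
# Which NUMA node does the NIC sit on? (-1 means single-node or unknown)
cat /sys/class/net/eth0/device/numa_node

# Which cores belong to that node?
lscpu | grep -i numa

# Keep interrupts on cores 2-3 (see the pinning above) and
# move the worker process to neighboring cores 4-7 of the same node
taskset -cp 4-7 1234
```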
Clean coordination of CPU optimization and applications
I often set the CPU governor to "performance" because frequency changes add latency jumps. I bind critical processes such as Nginx, HAProxy or databases to cores close to the IRQ cores, or deliberately separate them if the cache profile requires it. It remains important to limit context switches and keep the kernel up to date so that optimizations in the network stack take effect. I measure the effect of each change instead of assuming it and adapt step by step. The result is a setup that reacts predictably under load.
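A sketch, assuming the cpupower tool is installed and an nginx unit should run on cores 4-7 (core numbers are illustrative):

```bash
# Fix the clock: set the performance governor on all cores
cpupower frequency-set -g performance

# Pin the service via a systemd drop-in instead of ad-hoc taskset calls
mkdir -p /etc/systemd/system/nginx.service.d
cat > /etc/systemd/system/nginx.service.d/affinity.conf <<'EOF'
[Service]
CPUAffinity=4 5 6 7
EOF
systemctl daemon-reload && systemctl restart nginx
```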
Set up monitoring and measurement correctly
Without measurements, tuning remains a guessing game, so I start with sar, mpstat, vmstat, nstat, ss and ethtool -S. For structured load tests I use iperf3 and look at throughput, pps, latency, retransmits and core utilization. I record long-term trends with common monitoring systems to identify patterns such as evening peaks, backup windows or campaigns. If you want to understand the data path holistically, it pays to look at the packet-processing pipeline from the NIC IRQ all the way to user space. Only the combination of these signals shows whether IRQ balancing and affinity deliver the desired effect.
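A baseline I capture before and after each change, sketched for eth0 (the port 443 filter is just an example):

```bash
# NIC counters: drops, missed frames, errors
ethtool -S eth0 | egrep -i 'drop|miss|err'

# Kernel-wide protocol counters
nstat -az | egrep 'TcpRetransSegs|UdpInErrors|TcpExtListenDrops'

# Per-core utilization including the softirq share
mpstat -P ALL 1 5

# Socket-level latency and retransmit details for live connections
ss -ti state established '( sport = :443 )'
```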
Understanding NAPI, Softirqs and ksoftirqd
To manage latency peaks under high pps load, I take the NAPI mechanics and the interplay of hard and soft IRQs into account. After the first hardware IRQ, NAPI retrieves several packets from the RX queue in poll mode to avoid IRQ storms. If soft IRQs are not processed promptly, they are deferred to the ksoftirqd/N threads, which run at normal priority only - a classic reason for rising tail latencies. I watch /proc/softirqs and /proc/net/softnet_stat; a high "time_squeeze" value or drops indicate that the budget is too tight. With sysctl -w net.core.netdev_budget_usecs=8000 and sysctl -w net.core.netdev_budget=600 I increase the processing time per NIC poll and the packet budget as a test. Important: I raise values gradually, measure, and check whether CPU jitter or interference with application threads occurs.
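A sketch for reading the relevant columns per core (assuming GNU awk) and applying the test values from above:

```bash
# Column 1: processed packets, column 2: drops, column 3: time_squeeze (all hex)
awk '{printf "cpu%-3d processed=%d dropped=%d time_squeeze=%d\n", \
     NR-1, strtonum("0x"$1), strtonum("0x"$2), strtonum("0x"$3)}' /proc/net/softnet_stat

# Raise poll time and packet budget step by step, then re-measure
sysctl -w net.core.netdev_budget_usecs=8000
sysctl -w net.core.netdev_budget=600
```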
Fine-tune RSS hash and indirection table
RSS distributes flows to queues via the indirection table (RETA). I verify the hash configuration with ethtool -n ethX rx-flow-hash tcp4 and make the distribution symmetric if required. With ethtool -X ethX equal N, or a weighted per-queue spread via ethtool -X ethX weight ... (optionally with hkey ... hfunc toeplitz), I steer assignments toward the preferred cores of a NUMA node. The goal is flow stickiness: a flow stays on the same core so that cache locality stays high and lock contention in the stack stays minimal. For environments with many short UDP flows, I increase rps_flow_cnt per RX queue so that the software steering has enough buckets and does not create hotspots. I keep in mind that symmetric hashes help in ECMP topologies, but in the server context core balance is what matters most.
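A sketch for inspecting and reshaping the indirection table on eth0 (queue counts and flow-bucket sizes are examples):

```bash
# Show the current hash key, hash function and RETA entries
ethtool -x eth0

# Spread the table evenly over the first 8 queues
ethtool -X eth0 equal 8

# Which header fields feed the hash for TCP/IPv4 flows?
ethtool -n eth0 rx-flow-hash tcp4

# More buckets for flow steering: global table plus per-queue count
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 4096  > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
```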
Choose offloads, GRO/LRO and ring sizes sensibly
Hardware offloads reduce CPU load but can change latency profiles. With ethtool -k ethX I check whether TSO/GSO/UDP segmentation are active on TX and GRO/LRO on RX. GRO bundles packets in the kernel and is almost always useful for throughput; LRO can be problematic in routing or filtering setups and is better left off there. For latency-critical APIs I test smaller GRO aggregation (or switch it off temporarily) if p99 latencies dominate. I also adjust ring sizes via ethtool -G ethX rx 1024 tx 1024: larger rings absorb bursts but add latency under congestion; rings that are too small lead to rx_missed_errors. I rely on measurements from ethtool -S (e.g. rx_no_buffer_count, rx_dropped) and reconcile this with BQL (byte queue limits, automatic on the kernel side) so that TX queues are not overfed.
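A sketch for checking offloads and ring sizes on eth0; the sizes are examples to be validated against the drop counters:

```bash
# Which offloads are currently active?
ethtool -k eth0 | egrep 'segmentation|generic-receive|large-receive'

# LRO off in routed/filtered setups, GRO usually stays on
ethtool -K eth0 lro off gro on

# Current and maximum ring sizes, then a moderate increase
ethtool -g eth0
ethtool -G eth0 rx 1024 tx 1024

# Counters that tell whether rings are too small
ethtool -S eth0 | egrep -i 'rx_missed|rx_no_buffer|fifo'
```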
Virtualization: IRQs in VMs and on the hypervisor
In virtualized setups, I control the physical NIC distribution on the host and configure IRQ balancing explicitly there. VMs get enough vCPUs, but I avoid blind overcommitment so that scheduling delays do not add latency. Modern paravirtualized drivers such as virtio-net or vmxnet3 give me better paths for high pps rates. Within the VM, I check affinity and queue count again so that the guest does not become the bottleneck. What matters is an end-to-end view of host and guest so that the entire data path adds up.
Deepening virtualization: SR-IOV, vhost and OVS
For very high pps rates I use SR-IOV on the hypervisor: I bind virtual functions (VFs) of the physical NIC directly to VMs and pin them to cores of the appropriate NUMA node. This bypasses parts of the host stack and reduces latency. Where SR-IOV does not fit, I pay attention to vhost-net and pin the vhost threads, like application workers and IRQ cores, so that no cross-NUMA hops occur. In overlay or switching setups, I evaluate the extra cost of a Linux bridge or OVS; for extreme profiles I only use OVS-DPDK if the operational effort justifies the measurable advantage. The same rule applies here: I measure pps, latency and CPU distribution before making decisions, not after.
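A sketch of the SR-IOV part, assuming the physical function is eth0 and the driver exposes sriov_numvfs; the VF count is a placeholder:

```bash
# How many VFs does the device support, and how many are active?
cat /sys/class/net/eth0/device/sriov_totalvfs
cat /sys/class/net/eth0/device/sriov_numvfs

# Create 4 virtual functions and note the NUMA node they belong to
echo 4 > /sys/class/net/eth0/device/sriov_numvfs
cat /sys/class/net/eth0/device/numa_node

# The VFs then show up as PCI devices to pass through to VMs
lspci | grep -i 'virtual function'
```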
Busy polling and userspace tuning
For latency-critical services, busy polling reduces jitter. As a test I activate sysctl -w net.core.busy_read=50 and net.core.busy_poll=50 (microseconds) and set the SO_BUSY_POLL socket option selectively for the affected sockets. User space then polls briefly before blocking and catches packets before they move deeper into the queues. This costs CPU time but often delivers more stable p99 latencies. I keep the values low, monitor core utilization and only combine busy polling with clear thread affinity and a fixed CPU governor, otherwise the effects cancel each other out.
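The sysctl part as a sketch; 50 µs matches the test values above, while SO_BUSY_POLL has to be set per socket in the application itself:

```bash
# Enable busy polling globally as a test (microseconds per poll attempt)
sysctl -w net.core.busy_read=50
sysctl -w net.core.busy_poll=50

# Verify, then keep an eye on the extra CPU cost
sysctl net.core.busy_read net.core.busy_poll
mpstat -P ALL 1 3
```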
Packet filters, conntrack and eBPF costs at a glance
Firewalling and NAT are part of the data path. I therefore review the nftables/iptables rules and clean up dead rules and deep chains. In busy setups, I adjust the conntrack table size (nf_conntrack_max, hash bucket count) or disable conntrack specifically for stateless flows. If eBPF programs (XDP, tc-BPF) are in use, I measure their runtime cost per hook and prioritize early drop/redirect to relieve expensive paths. Clear responsibility is important: the optimization takes effect either in the NIC offload, in the eBPF program or in the classic stack - duplication only adds latency.
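A sketch for conntrack sizing and a notrack exception, assuming stateless DNS traffic on UDP/53 should bypass connection tracking (limits are examples):

```bash
# Current usage versus limit
cat /proc/sys/net/netfilter/nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max

# Raise the table limit for busy setups
sysctl -w net.netfilter.nf_conntrack_max=1048576

# nftables: exempt stateless flows from conntrack in a raw prerouting chain
nft add table inet raw
nft add chain inet raw prerouting '{ type filter hook prerouting priority -300; }'
nft add rule inet raw prerouting udp dport 53 notrack
```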
CPU isolation and housekeeping cores
For absolutely deterministic latency, I offload background work onto housekeeping CPUs. Kernel parameters such as nohz_full=, rcu_nocbs= and irqaffinity= help keep dedicated cores largely free of tick handling, RCU callbacks and extraneous IRQs. I isolate one set of cores for application workers and another for IRQs and softirqs; system services and timers run on separate cores. This ensures clean cache profiles and reduces preemption effects. Hyper-threading can increase jitter in individual cases; I test whether disabling it per core pair smooths the p99 latencies before making a global decision.
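A sketch of the boot parameters, assuming cores 4-15 are reserved for workers and IRQs while cores 0-3 do the housekeeping (ranges are placeholders, file paths vary by distribution):

```bash
# In /etc/default/grub, extend the kernel command line, for example:
# GRUB_CMDLINE_LINUX="... nohz_full=4-15 rcu_nocbs=4-15 irqaffinity=0-3"

# Regenerate the bootloader config, reboot, then verify which cores run tickless
grub-mkconfig -o /boot/grub/grub.cfg
cat /sys/devices/system/cpu/nohz_full
```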
Diagnostic playbook and typical anti-patterns
When drops or latency peaks occur, I take a structured approach: 1) check /proc/interrupts for uneven distribution. 2) check ethtool -S for RX/TX drops, FIFO errors and rx_no_buffer_count. 3) check /proc/net/softnet_stat for "time_squeeze" or drops. 4) check mpstat -P ALL and top for ksoftirqd activity. 5) check application metrics (number of active connections, retransmits with ss -ti). Anti-patterns I avoid: huge RX rings (hidden congestion), wildly toggling offloads without measurement, mixing fixed affinities with aggressive irqbalance, or running RPS and RSS simultaneously without a clear target architecture. Every change gets a before/after measurement and a short log.
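The playbook condensed into a small read-only script, sketched for eth0 and safe to run under load:

```bash
#!/usr/bin/env bash
# Quick snapshot of points 1)-5) above for one interface
IF=eth0

echo "== 1) IRQ distribution =="          ; grep "$IF" /proc/interrupts
echo "== 2) NIC drop counters =="         ; ethtool -S "$IF" | egrep -i 'drop|fifo|no_buffer'
echo "== 3) softnet stats (hex) =="       ; cat /proc/net/softnet_stat
echo "== 4) per-core load incl. softirq ==" ; mpstat -P ALL 1 1
echo "== 5) sockets reporting retransmits ==" ; ss -ti state established | grep -c retrans
```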
Example concepts for web hosting and APIs
Classic web hosting server
For many small websites I activate irqbalance, set up several queues and select the performance governor. I measure L7 latencies during peaks and watch for pps spikes, which occur mainly with TLS and HTTP/2. If hardware queues are not sufficient, I add RPS for additional distribution at the software level. This adjustment keeps response times constant even if overall utilization appears moderate. Regular checks of /proc/interrupts show me whether individual cores are becoming overloaded.
High-load reverse proxy or API gateway
For frontends with a high number of connections, I pin RX queues precisely to defined cores and place proxy workers on nearby cores. I consciously decide whether irqbalance stays active or whether fixed pinning delivers the clearer result. If there are not enough queues, I selectively enable RPS/XPS and calibrate coalescing to avoid IRQ storms. This lets me achieve low latency at very high pps rates and keep tail latencies under control. Documenting every change simplifies later audits and keeps the behavior predictable.
Provider selection and hardware criteria
I look for NICs with multi-queue support, reliable backbone latency and up-to-date kernel versions on the platform. A balanced CPU topology and clear NUMA separation prevent network interrupts from reaching into remote memory zones. For projects with high pps rates, the choice of infrastructure rewards every hour of tuning because the hardware provides reserves. In practical comparisons, I have had good experiences with providers that disclose performance profiles and ship IRQ-friendly defaults, such as webhoster.de. Such setups allow me to use IRQ balancing, RSS and affinity effectively and keep response times consistently low.
Step-by-step procedure for your own tuning
- Step 1: I determine the current status with iperf3, sar, mpstat, nstat and ethtool -S so that I have clear baseline values.
- Step 2: If irqbalance is not running, I activate the service, wait under load and compare latency, pps and drops.
- Step 3: I match the queue count and RSS configuration to the cores of the associated NUMA node.
- Step 4: I set the CPU governor to "performance" and assign central services to the appropriate cores.
- Step 5: Only then do I tweak manual affinity and NUMA pinning if the measurements still show bottlenecks.
- Step 6: I check trends over the course of days to reliably classify event peaks, backups or marketing spikes.
Briefly summarized
Effective IRQ balancing spreads network work across suitable cores, reduces latencies and prevents drops at high pps rates. Combined with multi-queue NICs, RSS/RPS, a suitable CPU governor and clean thread affinity, I get reliable utilization out of the network stack. Measurements from ethtool -S, nstat, sar and iperf3 lead me step by step to the goal instead of poking around in the dark. If you think about NUMA topology, IRQ pinning and application placement together, you keep response times low - even during peak loads. That way, high-load hosting remains noticeably responsive without burning unnecessary CPU reserves.


