I show how softirq CPU load, together with NAPI, IRQ distribution and queue design, limits or unlocks network throughput in hosting. With clear measuring points, targeted tuning and clean affinities, I reduce latencies and consistently increase pps throughput on production servers.
Key points
These core ideas move network packets efficiently through NIC, kernel and CPU and keep response times consistently low.
- NAPI budget fine-tuning: more packets per poll reduce overhead and smooth out the CPU load.
- IRQ balancing and affinity: avoid hotspots, increase cache hits, reduce latency peaks.
- Multi-queue, RSS/RPS/XPS: parallelize flows, maintain NUMA alignment, raise pps.
- Use offloads deliberately: evaluate GRO/LRO, TSO and coalescing, keep an eye on jitter.
- Isolation and busy polling: predictable response times on dedicated cores.
Basics: What happens in the kernel during network traffic
A packet first triggers a hardware interrupt, after which the kernel continues the work in SoftIRQs and NAPI poll loops. I make sure that the HardIRQ phase stays really short and that the actual logic runs in the right context so that CPU time is not wasted. The ksoftirqd threads only step in when softirq work cannot be completed directly, which under continuous load quickly leads to queues building up. This is exactly where waiting time arises, and it shows up as increased TTFB and fluctuating throughput. If you want to delve deeper, you can find practical knowledge on IRQ processing in this article on interrupt handling and CPU performance, which I use for classification.
NAPI, SoftIRQs and ksoftirqd: controlling latency instead of managing it
NAPI reduces interrupt storms by fetching several packets per run within a defined budget, which lowers interrupt overhead. If the budget is not sufficient, packets pile up, ksoftirqd runs hot and latency increases measurably. In such situations, I systematically check /proc/softirqs and /proc/net/softnet_stat to make drops, time_squeeze or overflowing queues visible. Then I gradually increase net.core.netdev_budget or net.core.netdev_budget_usecs and monitor CPU load, p95/p99 distribution and packet loss in parallel. The trick is to get enough work done per poll without crowding out the interactive execution of userland threads.
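To make the softnet_stat check concrete, here is a minimal Python sketch that reads the first three hexadecimal columns (processed, dropped, time_squeeze). The sample text is made up for illustration, and a row does not necessarily equal a CPU number if CPUs are offline.

```python
from typing import Dict, List

# Illustrative sample in the /proc/net/softnet_stat layout: one line per CPU,
# hexadecimal columns. The numbers are made up for this example.
SAMPLE_SOFTNET = """\
0001a2b3 00000000 00000005 00000000 00000000 00000000 00000000 00000000
00000f00 0000000a 00000000 00000000 00000000 00000000 00000000 00000000
"""


def parse_softnet_stat(text: str) -> List[Dict[str, int]]:
    """Return processed/dropped/time_squeeze per row (first three columns)."""
    rows = []
    for row, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) < 3:
            continue  # skip empty or truncated lines
        processed, dropped, squeezed = (int(f, 16) for f in fields[:3])
        rows.append({"row": row, "processed": processed,
                     "dropped": dropped, "time_squeeze": squeezed})
    return rows


for entry in parse_softnet_stat(SAMPLE_SOFTNET):
    print(entry)
```

On a real host, the same function can be fed with `open("/proc/net/softnet_stat").read()`. Comparing two snapshots taken before and after a budget change is more telling than the absolute counters.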
IRQ balancing and affinity: avoid hotspots, increase cache hits
A single core that handles all NIC IRQs becomes a bottleneck because it has to serve interrupts, SoftIRQs and app threads at the same time; I therefore distribute IRQs deliberately. The irqbalance service helps, but for high pps rates I explicitly map RX/TX queues to suitable cores via affinity. On NUMA systems, I bind queues to cores of the same node to avoid remote memory accesses. Application threads run on adjacent but separate cores, which improves cache locality and schedulability. A good overview of strategic distribution can be found in this guide to IRQ balancing in the data center, which I use as a reference for fine-tuning.
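Manual affinity comes down to writing a CPU bitmask to /proc/irq/<n>/smp_affinity. The following Python sketch only builds that mask from a list of CPU indices; the helper name is my own, and which IRQ number belongs to which queue has to be looked up on the host.

```python
from typing import Iterable


def cpus_to_affinity_mask(cpus: Iterable[int]) -> str:
    """Build the hex bitmask format used by /proc/irq/<n>/smp_affinity.

    Bit i set means CPU i may handle the IRQ. Masks wider than 32 bits are
    written as comma-separated 32-bit groups, most significant group first.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    if mask == 0:
        raise ValueError("at least one CPU is required")
    groups = []
    while mask:
        groups.append(mask & 0xFFFFFFFF)
        mask >>= 32
    groups.reverse()
    return ",".join([f"{groups[0]:x}"] + [f"{g:08x}" for g in groups[1:]])


print(cpus_to_affinity_mask([4, 5, 6, 7]))  # mask for CPUs 4-7
```

If a range is easier to read than a bitmask, /proc/irq/<n>/smp_affinity_list accepts a CPU list such as 4-7 instead.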
Multi-queue, RSS/RPS/XPS: Using parallelization correctly
Modern NICs come with several RX/TX queues, onto which I spread flows via RSS to achieve real parallelism. If the card offers too few queues, I use RPS/XPS to compensate on the software side and distribute packets sensibly across cores. Clean hash distribution is important so that a flow always stays on the same CPU and no expensive cache misses occur. At the same time, I keep TX and RX paths close together to avoid lock contention and unnecessary cross-node accesses. This increases the pps throughput without a single core putting the brakes on.
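The property that matters here is that a flow always maps to the same CPU. The sketch below illustrates it in Python with zlib.crc32 as a stand-in hash; this is explicitly not the hash a NIC or the kernel computes, and the flow tuple and CPU list are hypothetical.

```python
import zlib
from typing import Sequence, Tuple

Flow = Tuple[str, int, str, int]  # src ip, src port, dst ip, dst port


def pick_cpu(flow: Flow, cpus: Sequence[int]) -> int:
    """Map a flow to one CPU; the same 4-tuple always yields the same CPU.

    zlib.crc32 is only a stand-in for the hash a NIC or RPS would compute.
    """
    key = "|".join(str(part) for part in flow).encode()
    return cpus[zlib.crc32(key) % len(cpus)]


flow = ("10.0.0.1", 40000, "10.0.0.2", 443)
print(pick_cpu(flow, [2, 3, 4, 5]))
```

Note the flip side: if the CPU list changes, the modulo changes too and existing flows move to other cores, which is one reason to change queue counts and masks only in a planned window.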
CPU affinity right into user space: end-to-end thinking
I plan the data path from the NIC IRQ via NAPI queues to the worker threads of the app so that packets reach their destination without unnecessary detours and the response time remains constant. To achieve this, I consistently separate cores for interrupts/SoftIRQs from app cores and create clear affinity rules. Web servers, reverse proxies and databases are given fixed CPU sets that are close to the IRQ cores in order to keep the paths short. In addition, I set the CPU governor to performance so that frequency changes do not push jitter into p99. This consistent assignment makes behavior predictable and helps to diagnose bottlenecks cleanly.
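The separation of IRQ cores and app cores can be sketched as a simple split of one NUMA node's CPUs. This is a simplified Python illustration: the function name and the split rule are my assumptions, and it ignores SMT siblings, which a real plan should take into account.

```python
from typing import Dict, List, Sequence


def plan_cpu_sets(node_cpus: Sequence[int], irq_count: int) -> Dict[str, List[int]]:
    """Split the CPUs of one NUMA node into IRQ cores and app cores.

    The first irq_count CPUs serve interrupts/softirqs; the rest of the same
    node go to application workers, so both stay close without overlapping.
    """
    cpus = sorted(set(node_cpus))
    if not 0 < irq_count < len(cpus):
        raise ValueError("need at least one IRQ core and one app core")
    return {"irq": cpus[:irq_count], "app": cpus[irq_count:]}


plan = plan_cpu_sets(range(0, 8), irq_count=2)
print(plan)
```

On Linux, a worker process could then pin itself with `os.sched_setaffinity(0, plan["app"])`; services started by systemd can get the same effect through their unit configuration.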
Offloads, GRO/LRO, firewall and eBPF: save load without flying blind
Checksum offload, TSO and coalescing save CPU time, but they can change packet sizes, burst behavior and jitter, which is why I measure their effects specifically. GRO/LRO bundle frames and relieve the stack, but for real-time requirements I decide case by case whether to deactivate them or use them only to a limited extent. Conntrack tables and deep nftables/iptables chains cost cycles, so I clean up redundant rules and simplify paths. If needed, I turn to eBPF (XDP, tc-BPF) to make early decisions at the NIC and avoid costly paths. A good starting point for fine-tuning practice is this overview of interrupt coalescing, which I take into account for sensitive latency budgets.
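Before changing offloads, it helps to record their current state. The Python sketch below parses text in the format printed by `ethtool -k <dev>`; the sample is shortened and illustrative, and the set of feature names differs per driver.

```python
from typing import Dict, Tuple

# Shortened sample in the format printed by `ethtool -k <dev>`;
# the interface name and states are illustrative.
SAMPLE_FEATURES = """\
Features for eth0:
rx-checksumming: on
tcp-segmentation-offload: on
\ttx-tcp-segmentation: on
generic-receive-offload: on
large-receive-offload: off [fixed]
"""


def parse_features(text: str) -> Dict[str, Tuple[bool, bool]]:
    """Map feature name -> (enabled, fixed)."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        parts = value.split()
        if not sep or not parts or parts[0] not in ("on", "off"):
            continue  # skips the "Features for ..." header line
        features[name] = (parts[0] == "on", "[fixed]" in parts)
    return features


for name, (enabled, fixed) in parse_features(SAMPLE_FEATURES).items():
    print(f"{name}: enabled={enabled} fixed={fixed}")
```

Features marked as fixed cannot be toggled on that device, so they drop out of the test matrix right away.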
Busy polling and CPU isolation: locking in response times
For hard latency targets, I use busy polling so that userspace sockets pick up packets even earlier and waiting times shrink. This increases the load, but gives me very tight p99 distributions for API or trading workloads on dedicated cores. In addition, I isolate cores with isolcpus=, nohz_full= and rcu_nocbs= so that timers, RCU and system services only run on housekeeping CPUs. This separation prevents interference on the latency cores and makes behavior reproducible. The result is a clear roadmap: dedicated cores, early packet pickup, defined budgets.
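The three isolation parameters take the same CPU list, so it is convenient to generate them from one source of truth. A small Python sketch, assuming the latency cores are already known; the helper names are hypothetical and the CPU numbers are examples only.

```python
from typing import Iterable, List


def format_cpu_list(cpus: Iterable[int]) -> str:
    """Collapse CPU indices into the kernel's list syntax, e.g. 2-5,8."""
    ordered = sorted(set(cpus))
    if not ordered:
        raise ValueError("at least one CPU is required")
    ranges: List[str] = []
    start = prev = ordered[0]
    for cpu in ordered[1:]:
        if cpu == prev + 1:
            prev = cpu
            continue
        ranges.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = cpu
    ranges.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(ranges)


def isolation_args(latency_cpus: Iterable[int]) -> str:
    """Return kernel command-line arguments isolating the given CPUs."""
    cpu_list = format_cpu_list(latency_cpus)
    return f"isolcpus={cpu_list} nohz_full={cpu_list} rcu_nocbs={cpu_list}"


print(isolation_args(range(2, 8)))
```

The resulting string belongs on the kernel command line and only takes effect after a reboot; at least one housekeeping CPU has to stay outside the list.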
Monitoring and troubleshooting: from symptom to cause
I start with pps, throughput and core load, then check drops and the activity of the ksoftirqd threads over time to reliably recognize patterns. Tools such as sar, htop, ss, nload and ethtool show me when and where congestion occurs and whether the queues reach their limits. I look at distributions instead of mean values so that evening peaks, cron windows or campaigns do not get lost in the average. I correlate TTFB peaks with IRQ distribution, NAPI budget and offload settings in order to make targeted adjustments. An adjusted IRQ affinity or a retuned NAPI budget is often enough to noticeably reduce timeouts.
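Why distributions beat mean values is easy to show with a few numbers. The Python sketch below uses the nearest-rank definition of a percentile and a made-up latency sample in which two slow requests hide behind an unremarkable mean.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with pct % of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up latencies in ms: 98 fast requests and 2 slow outliers.
latencies = [10.0] * 98 + [500.0] * 2

print("mean:", mean(latencies))
print("p95: ", percentile(latencies, 95))
print("p99: ", percentile(latencies, 99))
```

In this sample the mean and p95 look harmless while p99 exposes the outliers, which is exactly the pattern that short bursts and cron windows produce.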
Tuning parameters at a glance
The following overview helps me to apply changes wisely and attribute effects clearly before I plan permanent rollouts. I test each adjustment iteratively, measure latency distributions and observe side effects on CPU and memory. I only ever change one setting per test window so that cause and effect remain clear. I then document the results and set threshold values for alerts. In this way, I achieve reproducible improvements without risking surprises in production traffic.
| Parameter/Feature | Effect in the data path | When to raise/activate | Risks/side effects |
|---|---|---|---|
| net.core.netdev_budget | More packets per NAPI poll cycle | With drops or time_squeeze in softnet_stat | Longer polls displace user threads |
| net.core.netdev_budget_usecs | Limits the time window per poll cycle | With jitter from large bursts | Too small: more context switches |
| RSS/RPS/XPS | Distributes flows across cores | With hotspots on one core | Incorrect hashing: cache thrashing |
| IRQ affinity | Binds IRQs to nearby cores | With NUMA mismatch | Misallocation creates new hotspots |
| GRO/LRO/TSO | Reduces the number of packets | With a CPU bottleneck | Jitter, larger bursts |
| Busy polling | Early packet pickup | For strict p99 targets | More CPU consumption |
RX/TX rings and queue depth: dimension buffers correctly
Even with properly distributed IRQs and suitable budgets, NIC rings that are too small or too large can hurt performance. I therefore check the RX/TX ring sizes of the card and adapt them to the burst character and latency targets. Rings that are too small lead to drops in the NIC during traffic peaks, visible as rx_missed_errors or fifo_errors in the driver statistics. Rings that are too large disguise congestion, increase latency and create long tails in p95/p99. I'm looking for the middle ground: enough buffer to absorb short bursts, but not so much that packets “age” in queues.
In addition, I look at the host-side tx_queue_len and the Qdisc in use. With sch_fq or fq_codel I can smooth burst behavior and spread large TSO packets out via pacing. This reduces microbursts at the switch port and makes the latency curve smoother, which is important for mixed workloads in which small RPCs run alongside large uploads. I monitor ethtool statistics and correlate them with softnet_stat in order to recognize whether the congestion is occurring in the NIC ring, in the netdev backlog or in the Qdisc.
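To see which driver counters actually move during a traffic peak, two snapshots of `ethtool -S <dev>` can be compared. A minimal Python sketch with made-up values; counter names such as rx_missed_errors depend on the driver, so treat the names in the sample as assumptions.

```python
from typing import Dict

# Made-up snapshots in the "name: value" format printed by `ethtool -S <dev>`.
BEFORE = "NIC statistics:\n     rx_packets: 1000\n     rx_missed_errors: 0\n"
AFTER = "NIC statistics:\n     rx_packets: 5000\n     rx_missed_errors: 12\n"


def parse_ethtool_stats(text: str) -> Dict[str, int]:
    """Parse `ethtool -S` output into a dict, ignoring non-numeric lines."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def counter_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the counters that increased between two snapshots."""
    return {k: after[k] - before[k] for k in after
            if k in before and after[k] > before[k]}


print(counter_deltas(parse_ethtool_stats(BEFORE), parse_ethtool_stats(AFTER)))
```

An increase in a ring-related error counter without matching growth in the softnet_stat drop column points at the NIC ring rather than the kernel backlog.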
MTU, jumbo frames and segmentation
The MTU is a classic lever that is often underestimated. Jumbo frames reduce the number of packets per Gbit/s and reduce the load on the CPU, but only if the path is truly end-to-end jumbo-capable. I therefore systematically validate the endpoints, switches and tunnels. As soon as the path falls back to 1500 somewhere and fragmentation occurs, there is a risk of path MTU problems, retransmits and unnecessary jitter. In data centers with dominant east-west communication, a homogeneous 9k strategy is worthwhile, while 1500 is often the more stable choice for Internet-facing workloads.
I always evaluate the MTU in conjunction with TSO/GSO/GRO. Overly aggressive bundling can lead to large bursts on the TX side that fill upstream buffers and generate latency peaks. The goal is a consistent path: sensible segmentation at the sender, sufficient pacing mechanisms and GRO that saves work on the receiver side without thwarting real-time requirements.
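The effect of the MTU on the packet rate is plain arithmetic. The Python sketch below assumes full-size untagged Ethernet frames and a fully loaded 10 Gbit/s link; both are simplifying assumptions for illustration, not measurements.

```python
# Per-frame overhead on the wire in bytes: Ethernet header (14), FCS (4),
# preamble (8) and inter-frame gap (12). No VLAN tag is assumed.
ETH_OVERHEAD = 14 + 4 + 8 + 12


def frames_per_second(link_bps: float, mtu: int) -> float:
    """Frames per second needed to fill a link with full-size frames."""
    return link_bps / ((mtu + ETH_OVERHEAD) * 8)


for mtu in (1500, 9000):
    print(mtu, round(frames_per_second(10e9, mtu)))
```

Under these assumptions the link carries roughly 0.81 million frames per second at MTU 1500 and roughly 0.14 million at MTU 9000, so the per-packet work drops by almost a factor of six.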
UDP, QUIC and streaming workloads: consider the specifics
Not all traffic is TCP. UDP-heavy profiles (DNS, VoIP, QUIC, telemetry) behave differently with RSS/RPS and GRO. Modern stacks support UDP GRO/GSO, which can reduce the load on the CPU; I use this selectively and measure whether reordering risks or jitter increase. For QUIC/HTTP3 loads, clean flow distribution is crucial: RPS can help if the NIC offers too few RSS queues, but it must not bounce cache-hot flows between cores. On the TX side, I set XPS to bundle transmission paths and reduce lock contention. In practice, a stable, core-affine allocation pays off, especially with many medium-sized UDP flows where every cache hit counts.
Virtualization and containers: clean integration of host, guest and vhost
In virtualized environments, work shifts between host, vhost threads and guest IRQs. I make sure that vhost-net threads receive their own cores and do not collide with app workers. Their affinities must match the physical RX/TX queues, otherwise there will be unnecessary cross-CPU migration. In the guest, I check virtio-net queues, activate multi-queue and set up RSS/RPS analogous to bare metal. Where latency and pps are the priority, SR-IOV reduces overhead further. The prerequisite is a consistent NUMA topology: VF, vCPU and memory belong on the same node.
In the container stack, overlay networks, deep NAT chains and complex CNI topologies cause additional hops. For latency-critical services, I prefer hostNetwork or lean networks (macvlan/ipvlan), flatten NAT paths and keep conntrack as small as possible. A consistent CPU strategy is important: the host's IRQ and NAPI cores should be close to the cores on which vhost/container workers run, because this is the only way to keep the data path short and predictable.
Scheduling, C-states and IRQ threading
Because latency is not only compute time but also wake-up time, I minimize deep C-states on the latency cores. An aggressive powersave setting can cost milliseconds before a SoftIRQ actually runs. I therefore rely on the performance governor, limit deep C-states and keep turbo behavior consistent to make frequency jumps predictable. Equally important is IRQ threading: where drivers allow it, I move work to IRQ threads and prioritize them so that RX starts before downstream work without completely starving userland. The interplay of scheduling policies, affinities and budgets is tricky; I test step by step, log p99 and watch out for interference with ksoftirqd, which otherwise becomes a hidden bottleneck.
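Which idle states to limit can be derived from their exit latencies, which Linux exposes under /sys/devices/system/cpu/cpu<N>/cpuidle/state<M>/latency (in microseconds). The Python sketch below only does the selection; the state names, latency values and budget are made up, and the helper name is my own.

```python
from typing import Dict, List


def states_to_disable(exit_latency_us: Dict[str, int], budget_us: int) -> List[str]:
    """Return idle states whose exit latency exceeds the wake-up budget."""
    return sorted(name for name, latency in exit_latency_us.items()
                  if latency > budget_us)


# Illustrative values; real names and latencies depend on CPU and driver.
sample_states = {"POLL": 0, "C1": 2, "C1E": 10, "C6": 133}

print(states_to_disable(sample_states, budget_us=20))
```

The selected states can then be switched off per CPU through the `disable` file next to `latency`, limited to the latency cores so that housekeeping CPUs can still save power.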
Observation in depth: tracepoints, counters, histograms
If metrics remain vague, I go one step deeper: I use kernel tracepoints around netif_receive_skb, napi_poll and net_dev_queue to view poll durations, packet counts and waiting times as histograms. Such distributions show whether 1 % of the polls are taking too long or whether individual queues are falling behind. In addition, ethtool RX/TX counters, TCP retransmits, busy poll hits and softnet_stat clearly indicate where packets are being lost. With drop analysis, I can see whether the NIC is dropping (ring full), whether the netdev backlog overflows or the poll budget runs out (dropped and time_squeeze in softnet_stat), or whether the Qdisc/firewall is slowing things down. Only when these pieces of the puzzle fit together do I tweak rings, budgets or offloads.
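Tracing tools typically present such durations in power-of-two buckets. The following Python sketch reproduces that bucketing for a list of durations so exported samples can be compared offline; the input values are made up and the function is not tied to any particular tracer.

```python
from collections import Counter
from typing import Dict, Iterable


def log2_histogram(values_us: Iterable[int]) -> Dict[int, int]:
    """Count values per power-of-two bucket, keyed by the bucket's lower bound.

    Bucket 1 holds 1, bucket 2 holds 2-3, bucket 4 holds 4-7, and so on;
    zero gets its own bucket 0.
    """
    buckets: Counter = Counter()
    for value in values_us:
        if value < 0:
            raise ValueError("durations must be non-negative")
        lower = 0 if value == 0 else 1 << (value.bit_length() - 1)
        buckets[lower] += 1
    return dict(sorted(buckets.items()))


# Made-up poll durations in microseconds with one slow outlier.
for lower, count in log2_histogram([1, 2, 3, 4, 5, 1000]).items():
    print(f">= {lower:5d} us: {'#' * count}")
```

A single populated bucket far to the right of the others is the visual signature of the "1 % of polls take too long" case.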
Streamline security and filtering paths
Complex ACLs, deep nftables/iptables chains and large conntrack tables add constant latency per packet. I consolidate rules, work with sets/maps and move generic drops as far forward in the path as possible, ideally right at the NIC (XDP/clsact) if latency is critical. Stateless flows, telemetry or known “safe” ports can be deliberately handled without connection tracking to eliminate costly lookups. At the same time, I keep state tables fresh, adjust hash sizes to load peaks and aggressively clean up orphaned entries. The goal is a clean, traceable policy path that does not show up in the profile as a permanent load.
Typical anti-patterns and how I avoid them
- All IRQs on one core: leads to congestion and a hot ksoftirqd. Antidote: targeted affinities per queue, NUMA-coherent.
- Blindly maximizing rings/budgets: conceals congestion, lengthens latency tails. Antidote: increase incrementally, measure distributions.
- Improper flow hashing configuration: flows jump between cores, caches go cold. Antidote: stable RSS keys, RPS/XPS only with a clear objective.
- App threads on the same cores as SoftIRQs: interference and jitter. Antidote: hard separation, allocation on neighboring cores.
- Overlays/NAT without a latency budget: every hop adds delay. Antidote: streamline paths, host networking for latency workloads.
- Power saving on latency cores: deep C-states slow down reaction. Antidote: performance governor, C-state limitation.
- Offloads without measurement: TSO/GRO can exacerbate bursts and jitter. Antidote: activate per workload, observe p99.
Practical hosting: steps that work
I start with a clean measurement phase, set baselines and keep all changes small and in short time windows so that I can separate causes. I then activate irqbalance, check the automatic distribution and, if necessary, set manual affinities until no hotspots remain visible. I then set up multi-queue, RSS and, if necessary, RPS/XPS, coordinated with NUMA. I bind the app workers to cores in the vicinity of their IRQ cores, but without direct collision. Finally, I clean up firewall paths, check conntrack tables and make conscious decisions about offloads based on latency targets.
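Whether the IRQ distribution still has a hotspot can be read from /proc/interrupts. The Python sketch below sums the per-CPU counts of all IRQ lines whose last column contains a device name and returns each CPU's share; the sample content and the queue naming (eth0-TxRx-N) are illustrative and vary by driver.

```python
from typing import List

# Illustrative excerpt in the /proc/interrupts layout; numbers are made up.
SAMPLE_INTERRUPTS = """\
            CPU0       CPU1       CPU2       CPU3
  24:     900000          0          0          0  IR-PCI-MSI 524288-edge      eth0-TxRx-0
  25:      50000          0          0          0  IR-PCI-MSI 524289-edge      eth0-TxRx-1
  26:          0      50000          0          0  IR-PCI-MSI 524290-edge      eth0-TxRx-2
"""


def irq_share_per_cpu(text: str, device: str) -> List[float]:
    """Return each CPU's share of interrupts for IRQ lines naming the device."""
    lines = text.splitlines()
    if not lines:
        return []
    ncpu = len(lines[0].split())  # header row lists CPU0..CPUn
    totals = [0] * ncpu
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= ncpu or device not in fields[-1]:
            continue
        counts = fields[1:1 + ncpu]
        if not all(c.isdigit() for c in counts):
            continue
        for i, count in enumerate(counts):
            totals[i] += int(count)
    grand_total = sum(totals)
    return [t / grand_total if grand_total else 0.0 for t in totals]


print(irq_share_per_cpu(SAMPLE_INTERRUPTS, "eth0"))
```

Because the counters are cumulative since boot, comparing two readings taken a few seconds apart gives a more honest picture of the current distribution than a single snapshot.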
Example playbook for p99 latencies
First I measure p95/p99 under representative load and save snapshots of /proc/softirqs and /proc/net/softnet_stat so that drops and time_squeeze are clearly visible. Then I increase netdev_budget or netdev_budget_usecs step by step and record p99 after each change so that I can recognize real trends. In parallel, I bind IRQs to cores of one NUMA node and move app workers to suitable neighboring cores. If p99 continues to jump, I test GRO/LRO variants and interrupt coalescing profiles, each with a short measurement run. Only when the distribution remains stable do I transfer the configuration to Ansible roles or systemd drop-ins.
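Once values have proven stable, they can be rendered into a persistent file. As one possible form, the Python sketch below produces content in the `key = value` format of a sysctl.d file; the numbers are placeholders rather than recommendations, and the target file name is up to the deployment.

```python
from typing import Mapping


def render_sysctl_dropin(settings: Mapping[str, int]) -> str:
    """Render settings as sorted 'key = value' lines for a sysctl.d file."""
    return "".join(f"{key} = {value}\n" for key, value in sorted(settings.items()))


# Placeholder values for illustration; use the values validated in your tests.
tested = {
    "net.core.netdev_budget": 600,
    "net.core.netdev_budget_usecs": 4000,
}

print(render_sysctl_dropin(tested), end="")
```

Sorting the keys keeps the generated file stable between runs, which makes diffs in configuration management meaningful.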
Short version for admins
I achieve the greatest leverage by treating SoftIRQs, NAPI budget, IRQ affinities and app threads as one coherent data path. I distribute network work across cores, keep queues NUMA-coherent and place workers sensibly so that paths stay short. I set offloads deliberately and measure jitter instead of blindly optimizing for throughput. For hard latency targets, I rely on busy polling and CPU isolation, while housekeeping CPUs absorb interference. If you implement these steps in a disciplined manner, you get constant throughput, tighter latency distributions and a hosting environment that reacts predictably to load peaks.


