I optimize a server's network paths by setting IRQ affinity and mapping RX/TX queues to cores in order to control latency, throughput and p99 jitter. On multi-core CPUs, orchestrating interrupts, SoftIRQs, NAPI and NUMA consistently keeps flows core-affine, reduces context switches and lets the application respond measurably faster.
Key points
- IRQ distribution determines which cores carry hardware interrupts and prevents hotspots.
- NUMA proximity reduces remote access and lowers latency peaks.
- SoftIRQs & NAPI control batch processing and reduce the load on cores.
- RPS/RFS keeps flows close to the consuming threads.
- Measurement and pinning make performance more deterministic.
Why IRQ affinity matters in server operation
High packet rates quickly overload individual cores if all interrupts land on a few CPUs, so I distribute the load deliberately to avoid hotspots. I assign RX/TX queues to the appropriate cores to keep data paths short and caches warm. This reduces p95/p99 latencies because I avoid unnecessary migrations and keep processing steps on the same cores. I take into account the physical proximity of NIC, memory channels and CPU sockets so that the path from the packet to the application worker remains consistently fast. This core affinity creates measurable stability during traffic peaks without having to upgrade the hardware immediately.
IRQ balancing vs. fixed affinity
The standard service irqbalance distributes interrupts automatically, but it does not know my application logic, NUMA targets and latency budgets. I bind critical network IRQs to selected cores, while noisy or less important interrupts move to other cores. This binding harmonizes with the pinning of the application processes so that the pipeline per flow remains consistent. With heavy traffic, I avoid redistributions that generate additional overhead and weaken the cache effect. If you want to delve deeper, you can find practical background information in this guide: IRQ balancing in the data center.
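A minimal sketch of fixed affinity, assuming the usual Linux interface `/proc/irq/<N>/smp_affinity`, which takes a hexadecimal CPU bitmask in comma-separated 32-bit words. The function names are my own illustration; writing the file needs root, and irqbalance may overwrite the value again unless it is stopped or configured to leave that IRQ alone.

```python
def cpus_to_affinity_mask(cpus):
    """Return the hex bitmask for a set of CPU numbers, formatted as
    comma-separated 32-bit words (most significant word first)."""
    cpus = set(cpus)
    if not cpus or min(cpus) < 0:
        raise ValueError("need at least one non-negative CPU number")
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    words = []
    while bits:
        words.append(f"{bits & 0xFFFFFFFF:08x}")  # lowest 32 bits first
        bits >>= 32
    return ",".join(reversed(words))


def write_irq_affinity(irq, cpus, proc_root="/proc"):
    """Pin one IRQ to the given CPUs (requires root privileges)."""
    with open(f"{proc_root}/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpus_to_affinity_mask(cpus))
```

For example, `cpus_to_affinity_mask([2, 3])` yields `0000000c`, which binds an IRQ to cores 2 and 3.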
CPU affinity, NUMA and the short data path
I prefer to pin application workers and network IRQs to the same NUMA node so that memory accesses remain local. If a NIC is attached to node 0, I also place the associated RX queues there and bind the relevant processes to these cores. In this way, I avoid expensive remote memory accesses, which have a major impact on latency at high packet rates. I also account for hyper-threading siblings so that sibling threads do not interfere with each other. This triangle of process pinning, IRQ affinity and NUMA topology makes the network paths more predictable and increases throughput.
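The following sketch shows one way to find the NUMA-local cores of a NIC from sysfs. It assumes the standard Linux paths `device/numa_node`, `node<N>/cpulist` and `topology/thread_siblings_list`; the helper names are hypothetical.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def nic_local_cpus(iface, sys_root="/sys"):
    """CPUs on the NUMA node the NIC is attached to, or None if unknown."""
    node = int(Path(f"{sys_root}/class/net/{iface}/device/numa_node").read_text())
    if node < 0:  # -1 means the platform reports no NUMA affinity
        return None
    cpulist = Path(f"{sys_root}/devices/system/node/node{node}/cpulist").read_text()
    return parse_cpulist(cpulist)


def one_thread_per_core(cpus, sys_root="/sys"):
    """Keep only CPUs that are the lowest-numbered sibling of their core."""
    keep = []
    for cpu in cpus:
        path = f"{sys_root}/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
        if parse_cpulist(Path(path).read_text())[0] == cpu:
            keep.append(cpu)
    return keep
```

Calling `one_thread_per_core(nic_local_cpus("eth0"))` would give a candidate core list for queues and workers; `eth0` is only a placeholder interface name.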
Understanding SoftIRQs, NAPI and queue design
After the hardware interrupt, the kernel takes over the processing in SoftIRQs, often on the same core that received the IRQ. When the load is high, I deliberately distribute the SoftIRQ load to alleviate bottlenecks without unnecessarily fragmenting the data path. Multi-queue NICs help because I can assign clearly defined cores to each queue and thus achieve true parallelization. I use NAPI to process packets in batches so that no interrupt storms occur and CPU time is used efficiently. This article provides background knowledge on this path: SoftIRQ and network throughput.
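To see where the SoftIRQ load actually lands, two snapshots of `/proc/softirqs` can be compared. This is a small sketch under the assumption that the file has its usual layout (a header row of CPU names, then one row per SoftIRQ type); the function names are illustrative.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs text into {name: [count_cpu0, count_cpu1, ...]}."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        counts = [int(x) for x in rest.split()]
        table[name.strip()] = counts[:ncpus]
    return table


def net_rx_share(before, after):
    """Per-CPU share of NET_RX softirqs raised between two parsed snapshots."""
    delta = [b - a for a, b in zip(before["NET_RX"], after["NET_RX"])]
    total = sum(delta)
    return [d / total if total else 0.0 for d in delta]
```

A share that is heavily skewed toward one core suggests that a queue or its IRQ is mapped too narrowly.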
RPS/RFS and flow locality
I use RPS for a broader distribution of the packets and set RFS so that flows end up on the cores of the consuming threads. This keeps cache accesses efficient and the application benefits from consistent response times. I coordinate the hash strategy of the NIC, the number of queues and the RPS CPU sets so that no kernel queue overflows. The flow affinity is particularly effective for many short requests, such as those generated by APIs and microservices. In this way, I build a pipeline in which each flow touches the same core as often as possible and avoids unnecessary migrations.
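As a sketch, the RPS/RFS settings for one interface can be generated as a list of path/value pairs before anything is written. It assumes the kernel's documented knobs (`rps_cpus`, `rps_flow_cnt`, `net.core.rps_sock_flow_entries`) and follows the common rule of thumb of dividing the global flow table evenly across the RX queues. The default of 32768 entries and the function name are assumptions, and the mask format here only covers CPU numbers below 32.

```python
def rps_rfs_plan(iface, n_rx_queues, rps_cpus, sock_flow_entries=32768):
    """Return (path, value) pairs that would enable RPS and RFS on every RX queue."""
    if any(cpu < 0 or cpu >= 32 for cpu in rps_cpus):
        raise ValueError("this sketch only formats masks for CPUs 0..31")
    mask = 0
    for cpu in rps_cpus:
        mask |= 1 << cpu
    per_queue = sock_flow_entries // n_rx_queues
    plan = [("/proc/sys/net/core/rps_sock_flow_entries", str(sock_flow_entries))]
    for q in range(n_rx_queues):
        base = f"/sys/class/net/{iface}/queues/rx-{q}"
        plan.append((f"{base}/rps_cpus", f"{mask:x}"))
        plan.append((f"{base}/rps_flow_cnt", str(per_queue)))
    return plan
```

Reviewing the plan first and applying it in a separate step keeps the change traceable per interface.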
RSS, indirection table and XPS: targeted control of hashing
To ensure that the distribution starts cleanly at the NIC, I adjust RSS (Receive Side Scaling) and the indirection table so that RX queues are assigned exactly to the cores that will later carry the app threads. I make sure that the number of queues matches the number of cores used and that the hash keys remain stable so that flows do not wander unexpectedly. If the hash algorithm changes or the indirection table is dynamically overwritten, this otherwise tears up the flow locality and promotes cache misses.
On the TX path, I additionally activate XPS (Transmit Packet Steering) so that outgoing packets are sent by the core that is processing the application. This also keeps the TX caches close to the worker, and the path from the socket queue to the NIC queue remains short. I keep RX and TX mappings consistent, document them per interface and define them in startup scripts so that a reboot does not blur the architecture.
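The effect of the indirection table on flow locality can be illustrated with a simplified model: the NIC uses the flow hash to index the table, and the table entry names the RX queue. The sketch below assumes that model (real hardware uses a Toeplitz hash and driver-specific table sizes), and the function names are hypothetical. The XPS helper formats per-queue masks for `/sys/class/net/<dev>/queues/tx-<n>/xps_cpus` and only covers CPU numbers below 32.

```python
def equal_indirection(table_size, n_queues):
    """Spread table slots round-robin over the queues, similar in spirit
    to `ethtool -X <dev> equal N`."""
    return [slot % n_queues for slot in range(table_size)]


def queue_for_hash(rss_hash, table):
    """RX queue chosen for a flow hash (model: hash modulo table size)."""
    return table[rss_hash % len(table)]


def xps_masks(queue_cpus):
    """Map TX queue index -> hex CPU mask, one CPU per queue."""
    return {q: f"{1 << cpu:x}" for q, cpu in enumerate(queue_cpus)}
```

In this model, a flow with hash `0x1234ABCD` lands on queue 1 with a 128-slot table over 4 queues, but on queue 5 once the table is rewritten for 6 queues, which is exactly the kind of wandering that a stable table avoids.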
Interrupt coalescing: weighing up latency against throughput
With coalescing, I combine interrupts to reduce the overhead, but I keep an eye on the latency limits of my application. For streaming and VoIP, I tend to keep the intervals short, while bulk transfers tolerate longer batches well. I test step by step, measure p95/p99 and check drops, retransmissions and CPU utilization per core. Only then do I fix the settings and document them for each host and NIC. This practical article provides a deeper insight into the trade-off: Interrupt coalescing explained.
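The trade-off can be estimated with simple arithmetic before testing. The sketch below assumes a fixed `rx-usecs` value with adaptive coalescing switched off and treats that value as the upper bound for the added interrupt delay, which is a simplification because frame-count thresholds or driver behavior can fire earlier. The function names are my own; the `ethtool -C` parameters are the standard ones, but not every driver supports them.

```python
def coalescing_bounds(rx_usecs, n_queues=1):
    """Rough bounds implied by a fixed rx-usecs setting."""
    if rx_usecs <= 0:
        raise ValueError("rx_usecs must be positive for this estimate")
    return {
        "max_added_latency_us": rx_usecs,
        "max_irqs_per_sec": n_queues * 1_000_000 // rx_usecs,
    }


def ethtool_coalesce_cmd(iface, rx_usecs):
    """argv for setting a fixed RX interrupt delay with ethtool."""
    return ["ethtool", "-C", iface, "adaptive-rx", "off", "rx-usecs", str(rx_usecs)]
```

With 8 queues and `rx-usecs 50`, this model caps the RX interrupt rate at 160,000 per second while adding at most 50 microseconds per packet.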
Tuning offloads and aggregation correctly
I set GRO/LRO to reduce CPU overhead, but check whether my workloads benefit from larger batches. Latency-sensitive APIs often respond better when GRO is moderate and LRO is off, because large super-packets can exacerbate head-of-line blocking effects. For bulk transfers, replication or backups, I use GRO/GSO/TSO more aggressively as long as the receiver side remains stable and CPU utilization drops.
Checksum offloads and TSO/GSO significantly reduce the load on the CPU, but I verify that they still work properly with middleboxes, tunnels and certain encapsulations, where offload incompatibilities can occur. If anomalies occur, I gradually disable individual offloads and measure the effects on throughput, retransmits and CPU time. The goal is a set that remains stable across the board and predictable at peak times.
CPU isolation, scheduler and energy states
For hard latency budgets, I isolate cores for network paths and app workers. With CPU isolation and a lean housekeeping strategy, I prevent system tasks, kernel threads or timer interrupts from landing on the "hot" cores. In addition, I set the CPU governor to "performance" and limit deep C-states if these cause wake-up latencies. I keep an eye on the core temperatures, as thermal throttling can otherwise ruin any fine-tuning.
The choice of scheduling classes influences the predictability. I run network-related threads prioritized, but not aggressively exclusive, so that they don't compete with ksoftirqd for CPU time. I regularly check whether ksoftirqd becomes active on individual cores, which is a clear sign that the SoftIRQ load is too high or incorrectly distributed.
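One way to run the ksoftirqd check is to read the accumulated CPU time of the `ksoftirqd/<cpu>` kernel threads from `/proc/<pid>/stat`. This sketch assumes the documented field order of that file (utime and stime are fields 14 and 15); the function names are illustrative.

```python
import glob


def parse_stat(line):
    """Return (comm, cpu_ticks) from one /proc/<pid>/stat line."""
    head, _, tail = line.rpartition(") ")
    comm = head.split("(", 1)[1]
    fields = tail.split()  # fields[0] is the state (field 3 of the file)
    return comm, int(fields[11]) + int(fields[12])  # utime + stime


def ksoftirqd_ticks(proc_root="/proc"):
    """CPU ticks consumed so far by each ksoftirqd/<cpu> thread."""
    usage = {}
    for path in glob.glob(f"{proc_root}/[0-9]*/stat"):
        try:
            with open(path) as f:
                comm, ticks = parse_stat(f.read())
        except (OSError, IndexError, ValueError):
            continue  # process exited or unexpected format
        if comm.startswith("ksoftirqd/"):
            usage[int(comm.split("/")[1])] = ticks
    return usage
```

Sampling `ksoftirqd_ticks()` twice and comparing the values per core shows which cores fall back to ksoftirqd under load.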
Busy polling and low-latency paths
When microseconds count, I use busy polling in a targeted manner. Applications can define polling windows for selected sockets so that they pull packets directly from the NAPI context without waiting for interrupts. I choose short poll intervals to avoid burning CPU time and limit this technique to hot paths with constant traffic. In parallel, I adjust the netdev budgets moderately so that batches are large enough without starving the rest of the system.
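A per-socket polling window can be requested with the Linux socket option `SO_BUSY_POLL`. The sketch below is Linux-only and makes two assumptions that should be checked on the target system: Python's `socket` module may not export the constant, so the value 46 from the generic Linux headers is used as a fallback (some architectures differ), and raising the window above the `net.core.busy_read` sysctl needs CAP_NET_ADMIN. The function name and the 50 microsecond default are illustrative.

```python
import socket
import sys

# Fallback value from the generic Linux headers; an assumption on other architectures.
SO_BUSY_POLL = getattr(socket, "SO_BUSY_POLL", 46)


def enable_busy_poll(sock, usecs=50):
    """Request busy polling on this socket for up to `usecs` microseconds.
    Returns True on success, False if the platform or kernel refuses."""
    if not sys.platform.startswith("linux"):
        return False
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, usecs)
        return True
    except OSError:  # e.g. missing privilege or no busy-poll support
        return False
```

Returning a boolean instead of raising lets the application fall back to the interrupt-driven path on hosts where busy polling is not available.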
Network queue discipline and pacing
I set up the qdisc per interface to match the workload. I use modern disciplines such as fq/fq_codel to regulate pacing and queue lengths in order to smooth bursts and avoid bufferbloat. In multi-queue setups, I combine this with mqprio so that traffic classes remain consistently assigned to the correct HW queues. Together with BQL (Byte Queue Limits) in the driver, this reduces the latency under full load because the queue does not grow uncontrollably.
The interaction with XPS on the TX path is important: I map the send queues to the cores on which the associated RX flows also land. In this way, both directions of a flow remain close to the CPU and I achieve more stable response times with bidirectional protocols (e.g. HTTP/2, gRPC).
Practice workflow under Linux
I start by recording the load: I check the CPU distribution in top/htop, look at /proc/interrupts and /proc/softirqs and read ethtool statistics to detect bottlenecks and plan the next workflow step. I then determine the IRQ IDs of the relevant NIC queues and set suitable CPU masks that occupy the cores evenly and take NUMA into account. I then pin the application workers via taskset or systemd's CPUAffinity to the same cores that also serve the associated queues. I only activate RPS/RFS where it strengthens the flow locality and keep the configuration consistent per interface. Finally, I measure throughput, latency and jitter again before rolling out changes uniformly across multiple hosts.
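The step of determining the IRQ IDs can be sketched as a small parser for `/proc/interrupts`. It assumes the usual layout (header row of CPUs, then `IRQ: counts... type name`) and that the queue name is the last column. Queue naming such as `eth0-TxRx-0` is driver-specific, so the name fragment is an assumption to adapt per NIC.

```python
def nic_irqs(interrupts_text, name_fragment):
    """Return {irq_number: (name, per_cpu_counts)} for matching lines."""
    lines = interrupts_text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    found = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        if not label.strip().isdigit() or len(fields) <= ncpus:
            continue  # NMI/LOC rows or rows without a device name
        name = fields[-1]
        if name_fragment in name:
            found[int(label)] = (name, [int(c) for c in fields[:ncpus]])
    return found


def spread_round_robin(irqs, cpus):
    """Assign each IRQ one CPU, cycling through the given CPU list."""
    return {irq: cpus[i % len(cpus)] for i, irq in enumerate(sorted(irqs))}
```

The resulting IRQ-to-core mapping can then be written to the affinity files, and a worker can be pinned to the matching core with `os.sched_setaffinity` or the tools named above.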
Measurement, p95/p99 and avoiding regressions
I don't rely on gut feeling, but measure latencies, error rates and core utilization before and after each tuning round so that p99 remains stable. I also track context switches, migration rates and load per SoftIRQ type to identify hidden side effects early on. I keep tests reproducible, use the same data sets and fixed versions so that the results remain comparable. I detect regressions with cross-checks under peak and idle conditions and with long endurance runs. Only when metrics, logs and application traces match do I declare the configuration the new baseline.
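A before/after comparison of tail latency can be as simple as the following sketch. It uses the nearest-rank percentile definition, which is one of several common definitions, and a 5 percent tolerance that is purely an illustrative assumption.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def regressed(before, after, p=99, tolerance=0.05):
    """True if the p-th percentile got worse by more than `tolerance`."""
    return percentile(after, p) > percentile(before, p) * (1 + tolerance)
```

Running `regressed(baseline_latencies, new_latencies)` after each tuning round gives a simple gate before a configuration becomes the new baseline; both list names are placeholders for measured samples.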
Virtualization, containers and SR-IOV
In virtualized environments, I make sure that vCPUs, memory and vNICs of the VM are located on the same NUMA node to which the associated physical NIC is attached. Where possible, I use SR-IOV so that the data path is short and the IRQs can be bound directly to guest cores. I pin vCPUs of the critical VMs to dedicated host cores and make sure that host IRQs and guest IRQs do not overlap. In container setups, I set cpusets and "guaranteed" QoS classes so that worker containers and their network IRQs receive CPU time in a predictable manner.
I check whether irqbalance should take the lead in the guest or on the host, because two layers of automatic balancing otherwise work against each other and blur the mapping. With virtio, I configure several queues and map them cleanly to vCPUs to enable parallel work. If vhost-net saturates individual host cores, I redistribute the backends and keep vhost threads NUMA-close to the physical NIC.
Troubleshooting: recognize patterns quickly
- Cores saturated, ksoftirqd active: Pin RX queues more tightly to their cores, check the number of queues, adjust RPS/RFS or increase coalescing slightly.
- Jumpy p99 jitter: Check NUMA drift, verify C-states/governor, adjust offloads and GRO sizes step by step.
- Many retransmissions/drops: Check RX/TX ring sizes, qdisc and BQL, check indirection table and XPS for consistency.
- Unevenly distributed flows: Balance RSS hash and indirection table, consider hot flow pinning, keep hash seed stable.
- VM-only problem: Place vhost/virtio backends NUMA-close to the NIC, evaluate SR-IOV, separate IRQs between host and guest.
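For the drop-related patterns above, a quick filter over the output of `ethtool -S <dev>` helps to spot suspicious counters. Counter names are driver-specific, so the keyword list in this sketch is only an assumption to adapt per NIC, and the function name is hypothetical.

```python
def nonzero_drop_counters(ethtool_s_output, keywords=("drop", "miss", "discard", "fifo")):
    """Return {counter_name: value} for non-zero counters whose name
    contains one of the keywords."""
    counters = {}
    for line in ethtool_s_output.splitlines():
        name, sep, value = line.rpartition(":")
        name, value = name.strip(), value.strip()
        if not sep or not value.isdigit():
            continue  # header line or non-numeric value
        if int(value) and any(k in name.lower() for k in keywords):
            counters[name] = int(value)
    return counters
```

Comparing the result before and after a load test shows whether drops grow on specific queues, which points to ring sizes or queue-to-core mapping rather than to the application.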
Include applications and databases
A clean network path is of little use if app servers or databases do not work in parallel, which is why I match the number of workers, thread pools and connection limits to the available cores. I pin NGINX or HAProxy workers to the appropriate cores so that they match the RX queues. I scale PHP-FPM, Node.js, Java or Go so that they prefer the local NUMA domain and use multiple instances if required. I place caches such as Redis or Memcached close to the CPU and pay attention to their own network and thread parameters. Only the interplay of IRQ affinity, process pinning and app scaling provides the noticeable boost in latency and throughput.
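For NGINX, worker pinning is expressed with the `worker_cpu_affinity` directive, which takes one binary mask per worker with CPU 0 as the rightmost bit. The sketch below generates the two directives for a given core list; the function name is my own, and the core numbers in the example are placeholders that should match the cores serving the RX queues.

```python
def nginx_cpu_affinity(worker_cpus, n_cpus):
    """Build worker_processes and worker_cpu_affinity directives,
    one worker per listed CPU."""
    masks = []
    for cpu in worker_cpus:
        if not 0 <= cpu < n_cpus:
            raise ValueError(f"CPU {cpu} is outside 0..{n_cpus - 1}")
        masks.append(format(1 << cpu, f"0{n_cpus}b"))
    return (
        f"worker_processes {len(worker_cpus)};\n"
        f"worker_cpu_affinity {' '.join(masks)};"
    )
```

On a 4-core host, `nginx_cpu_affinity([2, 3], 4)` produces two workers with the masks `0100` and `1000`.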
Hosting scenarios with high benefits
I mainly invest in deep tuning when APIs generate a lot of short requests or when real-time communication such as VoIP and chat requires low jitter. E-commerce setups with peak loads benefit because checkout flows are sensitive to latency. Multi-tenant hosts with a high density benefit because dedicated cores per queue reduce noisy-neighbor effects. Streaming services can also achieve more throughput per euro without immediately purchasing new hardware. The costs remain calculable as long as I keep changes measurable and roll them out in a controlled way.
Quick reference table: Cores, queues, tools
I use the following table as a reminder when I set up new hosts or recalibrate existing setups. It shows typical targets, appropriate measures, common Linux tools and the intended effect on latency and throughput. I don't use it dogmatically, but as a starting point for series of measurements with real traffic. If the NIC architecture or the NUMA topology varies, I adapt the core selection. It remains important to maintain the documentation for each host and to keep changes traceable.
| Goal | Measure | Linux tool/location | Expected effect |
|---|---|---|---|
| Distribute IRQ load | Bind queues to cores | /proc/irq/*/smp_affinity | Fewer hotspots, more constant latency |
| Increase flow locality | Set RPS/RFS CPU sets | /sys/class/net/*/queues/*/rps_cpus | Fewer migrations, better cache locality |
| Control batch processing | Fine-tune NAPI/coalescing | ethtool -C / driver defaults | Lower overhead, controlled jitter |
| Pair app and IRQ | Pin workers | taskset, systemd CPUAffinity | Shorter path, lower p99 |
| Avoid remote NUMA access | Co-locate devices and cores | numactl, lscpu, lspci -vv | Less remote access, more throughput |
Best practices that work in the long term
I change only one control lever per test round, document the metrics and store the results and documentation in the host's repo. I keep configurations consistent by clearly describing queue-to-core mappings and using scripts for replication. I monitor logs for drops, retransmissions and timeouts and correlate them with kernel metrics. I include the hypervisor and storage level in the analysis so that no hidden bottlenecks remain. I have rollbacks ready in case tests show negative effects or workloads change.
Briefly summarized
I achieve maximum network performance by coordinating interrupts, queues and workers and thus keeping the data path per flow stable. IRQ affinity distributes the hardware load sensibly, while SoftIRQs, NAPI and RPS/RFS make processing efficient. NUMA proximity protects against avoidable memory detours and reduces jitter. Step-by-step tuning with reproducible measurements prevents misconfigurations and shows real progress. If you treat these building blocks as a whole, you can confidently exploit the capabilities of modern multi-core servers for latency-critical services.


