Server NUMA Locality and CPU-Memory Affinity for maximum hosting performance

NUMA locality and CPU-memory affinity determine how close threads run to their RAM and how stable latencies stay in hosting stacks. I show in practical terms how topology awareness, affinity strategies and node-local I/O paths deliver measurably more throughput and noticeably lower latency.

Key points

For quick orientation, I summarize the key messages before explaining the steps in detail and backing them up with examples, so you can see right away where to start in order to use locality and affinity profitably. I emphasize clear relationships between threads, memory and I/O so that you can derive priorities cleanly and make decisions. I also identify scenarios in which interleaving makes sense without diluting your critical paths, and show how monitoring lets you demonstrate real progress and avoid errors. For virtualized environments, I provide tips on placing vCPUs and vRAM so that guest systems do not spread across multiple nodes and remote accesses do not explode. Finally, I translate the findings into a short roadmap so that you can proceed in a structured way and secure measurable gains.

  • Locality first: keep threads close to their own RAM, avoid remote access.
  • Fix affinity: bind cores and memory together by policy.
  • Read the topology: nodes, cores, PCIe devices per socket.
  • Bundle I/O paths: couple NIC, NVMe and app in the same node.
  • Measure instead of guessing: track P95/P99, remote accesses and throughput.

Understanding the NUMA topology

Before I move workloads, I read the topology of the server: how many NUMA nodes exist, and how many cores and how much RAM are attached to each node. I also check which PCIe devices - such as NICs or NVMe SSDs - are attached to which socket, because this determines interrupt paths and memory accesses and therefore shapes latency. A node provides local memory access over a short distance; anything beyond that costs time and bandwidth. The more sockets a machine has, the more remote accesses affect response times and eat into throughput. For an accessible introduction to the hardware logic, I use a compact overview, NUMA nodes at a glance, to take node boundaries into account deliberately and avoid misplacement.

In practice, I start with a short topology inventory and document it so that I can later derive affinity decisions in a comprehensible way. Useful commands:

# cores and NUMA assignment
lscpu -e=CPU,Core,Socket,Node

# NUMA hardware overview
numactl --hardware

# Assign PCIe devices to their NUMA node (-1 means no node information is exposed)
lspci -nn | grep -E "Ethernet|Non-Volatile"
for d in /sys/bus/pci/devices/*; do printf '%s: ' "$d"; cat "$d/numa_node"; done

The important thing is that you map PCIe root complexes and device slots to the sockets. Two ports of the same NIC can be attached to different nodes; this influences where RX/TX queues and IRQs are best placed. The same applies to NVMe: modern controllers have several queues that you should bind to cores close to the node so that DMA does not trigger node hops.

Using CPU memory affinity correctly

With CPU-memory affinity, I bind processes firmly to core ranges and enforce local memory allocation as far as possible, so that threads do not constantly reach across the node boundary. On Linux, I define CPUs via systemd or cgroups and combine this with memory policies so that RAM is preferably allocated on the same node and remote access stays minimal. Critical services - API front-ends, in-memory caches, databases - benefit immediately because memory controller wait times drop and cache hits become more frequent. However, overly rigid pinning can restrict scheduling, so I back up every adjustment with benchmarks and monitor P95/P99 values for noticeable effects on the user experience. A compact introduction to affinity in hosting helps you get started: Affinity and NUMA awareness provides the necessary tools for clean placement.
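As a minimal sketch of the systemd route, the following shell helper prints a drop-in that combines core pinning with a preferred memory node. It assumes systemd 243 or newer (for NUMAPolicy= and NUMAMask=); the function name, the unit name api-frontend.service and the values 16-31 and node 1 are hypothetical and only serve as illustration.

```shell
# Print a systemd drop-in that pins a service to a CPU range and prefers one NUMA node.
# $1 = CPU list (e.g. 16-31), $2 = NUMA node number
render_numa_dropin() {
    printf '[Service]\n'
    printf '# Restrict the service to these logical CPUs\n'
    printf 'CPUAffinity=%s\n' "$1"
    printf '# Allocate on this node first, fall back to other nodes only when it is short of memory\n'
    printf 'NUMAPolicy=preferred\n'
    printf 'NUMAMask=%s\n' "$2"
}

# Example (hypothetical unit name, drop-in directory must exist):
#   render_numa_dropin 16-31 1 | sudo tee /etc/systemd/system/api-frontend.service.d/numa.conf
#   sudo systemctl daemon-reload
#   sudo systemctl restart api-frontend.service
```

Printing the fragment instead of writing it directly keeps the helper side-effect free, so the result can be reviewed or committed to configuration management before it is applied.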

The decisive factor is the first-touch principle: memory is allocated on the node whose CPU first writes to the page. Therefore, initialize large heaps or buffers on the target cores of the node in which the service will later run - ideally with the CPU and memory policy already set (e.g. via systemd unit or numactl). If you start cold on node 0 and then move threads to node 1, the majority of the pages remain remote. For the heaps of large runtimes, it is worth using "pre-touch" during bootstrap so that pages land locally and then stay warm.
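To check whether first touch actually placed the pages where intended, the per-node page counts in /proc/PID/numa_maps can be summed up. The helper below is a sketch with a hypothetical function name; it assumes the usual numa_maps fields (N0=..., N1=..., kernelpagesize_kB=...) and falls back to 4 kB pages when a line carries no size field.

```shell
# Sum resident memory per NUMA node (in kB) from numa_maps-formatted input on stdin.
numa_kb_per_node() {
    awk '
    {
        kb = 4                                   # assumption: 4 kB pages if no size field is present
        for (i = 1; i <= NF; i++)
            if ($i ~ /^kernelpagesize_kB=/) { split($i, p, "="); kb = p[2] }
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=/) { split($i, f, "="); sum[f[1]] += f[2] * kb }
    }
    END { for (n in sum) printf "%s %d kB\n", n, sum[n] }
    ' | sort
}

# Usage (pid is the process to inspect):
#   numa_kb_per_node < "/proc/$pid/numa_maps"
```

If a service pinned to node 1 shows most of its memory under N0 after warm-up, the heap was probably touched before the affinity took effect.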

NUMA awareness in the hosting stack

A NUMA-aware operating system, a suitable hypervisor and applications with thread pinning only reach their full potential together. The OS prefers local placement when free resources are available in the node, while the hypervisor places VMs so that vCPUs and vRAM do not drift apart and locality is maintained. In the application, I separate worker pools per node and keep queues local instead of running global pools across nodes. I organize database processes, cache daemons and web server instances node by node so that hot paths stay short and jitter decreases. This increases consistency and predictability under load, which makes SLA costs easier to plan and saves expensive overprovisioning.

At the ingress level, I ensure node affinity of sessions, for example through sticky routing or consistent hashing (e.g. on client IP or session token), so that requests end up back at "their" node-local worker and cache. For stateful services, I plan replicas per node and balance read access locally; I decouple write paths via asynchronous replication or batching to avoid inter-node ping-pong.

Schedule services node by node

I group the layers of a stack so that each layer has a clear node reference and paths stay short. A classic split: web/API per node, app workers next to it, plus the local cache; the database also sits close to the node if its RAM footprint fits and the I/O path is not interrupted. I move reporting jobs, backups or batch workers to less critical nodes so that interactive requests remain unaffected. I avoid large monolithic instances because they often cross node boundaries and thus generate remote load that dilutes performance. Smaller, replicated instances per node often deliver better throughput in everyday use, as they respect the NUMA rules and smooth out peaks.

For capacity planning, I calculate headroom separately for each node: CPU buffer for bursts, RAM buffer against OOM and separate margins for the page cache. In this way, I prevent the kernel from unintentionally falling back to remote allocation. I define clear switchover paths for failover: if a node fails, replacement instances can run cross-node, but I limit their concurrency until the original node is restored, which keeps the overall latency stable.

Setting CPU affinity: Methods and pitfalls

For core allocation, I use systemd with CPUAffinity or cgroups with cpuset.cpus so that services get fixed core ranges. When pinning, I pay attention to hyper-threading pairs, because two logical threads of one physical core share resources and can slow each other down and create latency peaks if I combine them badly. Latency paths - TLS termination, API ingress, cache readers - get exclusive cores, while logging, compression or backups move to other pools. Pools that are too narrow and lack buffers cause queuing, so I factor in headroom and check context switches, runqueue length and IRQ distribution. From these observations I decide whether to widen the core set or concentrate it further until the latency distribution tails off cleanly and the P99 peaks calm down.
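One way to respect hyper-threading pairs is to hand a latency pool only one logical CPU per physical core. The sketch below derives such a list from the parseable lscpu output; the function name is hypothetical, and it assumes that sibling threads share the same (socket, core) pair in that output.

```shell
# Print one logical CPU per physical core of a NUMA node as a comma-separated list.
# Reads `lscpu -p=CPU,CORE,SOCKET,NODE` output on stdin; $1 is the NUMA node number.
primary_threads_of_node() {
    awk -F, -v node="$1" '
        /^#/ { next }                            # skip the lscpu header comments
        $4 == node && !seen[$3 "," $2]++ {       # first sibling of each (socket, core) pair
            list = list (list == "" ? "" : ",") $1
        }
        END { print list }
    '
}

# Usage:
#   lscpu -p=CPU,CORE,SOCKET,NODE | primary_threads_of_node 1
```

The resulting list can go into CPUAffinity or cpuset.cpus for the latency pool; the remaining sibling threads stay available for less critical work or remain idle, depending on what the measurements show.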

For further jitter reduction, I selectively set kernel switches such as nohz_full and rcu_nocbs for exclusive latency cores, isolate them from system services and deliberately place IRQs only on CPUs intended for this purpose. I use the "irqbalance" service with caution: either configure it specifically or deactivate it if it counteracts your manual IRQ affinity. I use SCHED_FIFO/SCHED_RR sparingly and only with limits in place to avoid priority inversion or starvation.

Memory policies and NUMA masks

For the memory policy, I differentiate between preferred local allocation, interleave across multiple nodes and fixed NUMA masks via cpuset.mems, so that RAM flows to where threads are actually running. For interactive services, I usually set "preferred", which means that the system allocates locally and only deviates when memory is short, which limits remote accesses. Analytics or streaming jobs sometimes benefit from interleave because bandwidth is distributed across nodes and pressure on a single controller is reduced. Fixed masks offer control, but require discipline in capacity planning so that no unwanted OOM events occur in a node and disrupt services. The following table assigns common policies to typical scenarios and helps you to make a quick decision.

Policy            | Effect                                                             | Typical workloads                                 | Risk
------------------|--------------------------------------------------------------------|---------------------------------------------------|--------------------------------------------------
Preferred (local) | RAM primarily in the local node, fallback option in case of scarcity | Web/API, caches, OLTP databases                   | Slight drift to other nodes at full load
Interleave        | Even distribution across selected nodes                            | Streaming, analytics, large scans                 | Higher latency for individual accesses
Fixed NUMA mask   | Strict binding to defined memory nodes                             | Strictly encapsulated services, deterministic tests | Risk of OOM if the budget is planned incorrectly
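The three policies from the table map onto numactl options. The wrapper below is a sketch with a hypothetical function name that starts a command on the given node(s) under the chosen policy; it assumes numactl is installed and, for "preferred", that a single node is passed.

```shell
# Start a command on the given NUMA node(s) with a chosen memory policy.
# Usage: run_with_policy preferred|interleave|bind NODES COMMAND [ARGS...]
run_with_policy() {
    policy=$1; nodes=$2; shift 2
    case "$policy" in
        preferred)  mem="--preferred=$nodes" ;;   # local first, fallback to other nodes allowed
        interleave) mem="--interleave=$nodes" ;;  # pages spread round-robin across the nodes
        bind)       mem="--membind=$nodes" ;;     # strict: only these nodes, no fallback
        *) echo "unknown policy: $policy" >&2; return 2 ;;
    esac
    numactl --cpunodebind="$nodes" "$mem" "$@"
}

# Examples (hypothetical binaries):
#   run_with_policy preferred 1 ./api-server --port 8080
#   run_with_policy interleave 0,1 ./scan-job
```

Keeping the CPU binding and the memory policy in one call avoids the common mistake of pinning threads to one node while the memory policy still points elsewhere.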

Keep an eye on system-wide switches: zone_reclaim_mode influences whether a node aggressively reclaims its own memory before allocating remotely, which is often undesirable for latency paths. Transparent Huge Pages (THP) can trigger compaction and page migration and thereby cause stalls; for latency-sensitive services, I usually choose "madvise" and use static hugepages where it makes sense, so that TLB hits increase and page fault peaks decrease.
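A quick way to see which THP mode is active is to read the bracketed value from sysfs. The helper name below is hypothetical; the paths are the standard Linux locations, and the commented commands change settings only until the next reboot.

```shell
# Extract the active mode from a THP sysfs line such as "always [madvise] never".
thp_active_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Inspect the current settings (read-only):
#   thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled
#   cat /proc/sys/vm/zone_reclaim_mode     # 0 = allocate from other nodes before reclaiming locally
# Change them at runtime (root required, not persistent):
#   echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
#   sysctl -w vm.zone_reclaim_mode=0
```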

Bind network and I/O paths close to the node

I align NIC queues (RX/TX) so that their IRQs point to cores of the matching node and packet processing takes place where the app computes. The same applies to NVMe SSDs or RAID controllers: I/O threads should run on the node to which the device is attached via PCIe, so that DMA paths remain short and bottlenecks do not arise. On Linux, I adjust IRQ affinity masks and link them to the CPU pools of my services to create a continuous path. With microbursts from the network, such as many TLS handshakes, this proximity pays off directly because copy paths are shorter, CPU caches stay warm and context switches occur less often. This creates a consistent data flow from the packet to the application to the memory, without unnecessary node hops.

Concrete levers in the network stack: RSS for hardware distribution to queues, RPS/RFS for software-based CPU steering and XPS for TX queue selection. I use ethtool to set the queue count and RSS distribution and then steer the queue IRQs to core groups that run in the same node as the workers. For storage I use blk-mq tuning and queue mapping per node; NVMe controllers offer several submission/completion queues, which I scale to at most the number of cores per node and pin accordingly. Regularly check whether interrupts (cat /proc/interrupts) are firing where your app cores are located; you can recognize drift by rising remote accesses despite a stable load.
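The IRQ side of this can be sketched with two small helpers: one finds the IRQ numbers of a device in /proc/interrupts, the other writes a CPU list to each IRQ's smp_affinity_list. The function names are hypothetical, and the match assumes the driver names its interrupts with the device name as prefix (for example eth0-TxRx-0 or nvme0q1), which varies between drivers.

```shell
# List IRQ numbers whose /proc/interrupts entry is named after the given device.
# $1 = device name prefix, $2 = interrupts file (defaults to /proc/interrupts)
irqs_of_device() {
    awk -v dev="$1" '
        index($NF, dev) == 1 {                   # last column starts with the device name
            sub(/:$/, "", $1); print $1
        }
    ' "${2:-/proc/interrupts}"
}

# Pin every IRQ of a device to a CPU list (root required).
# $1 = device name prefix, $2 = CPU list such as 16-23
pin_device_irqs() {
    for irq in $(irqs_of_device "$1"); do
        echo "$2" > "/proc/irq/$irq/smp_affinity_list"
    done
}

# Example: pin_device_irqs eth0 16-23
```

A running irqbalance may overwrite these settings again, so the pinning only holds if the service is stopped or told to leave these IRQs alone.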

Structure application architecture in line with NUMA

At app level, I set up separate worker pools for each NUMA node, keep queues local and avoid global lock hotspots so that threads do not jump back and forth. I set up session and data sharding so that hot partitions stay where the requesting workers are running and no time is lost in inter-node traffic. For caches, I often use replicas instead of a central instance so that readers hit node-local copies. In Netty, Tokio, libuv or DB clients, I pin event loops to fixed cores and pay attention to IRQ proximity so that task switches remain limited and caches hit more often. This layout reduces ping-pong effects and makes response times more consistent over the course of the day.

One underestimated lever is allocator and runtime options: allocators with thread-local arenas such as jemalloc or tcmalloc reduce cross-thread contention and, together with first touch, keep pages closer to the threads' home cores. In JVM stacks, options such as NUMA awareness (-XX:+UseNUMA) and pre-touch (-XX:+AlwaysPreTouch) help to make page fault phases deterministic; in .NET, I align GC threads close to nodes and pay attention to server GC to smooth pause times. In Go, I size GOMAXPROCS per node pool and keep goroutine schedulers away from latency cores that operate close to IRQs.

Sensible control of NUMA autobalancing

The kernel's automatic NUMA balancing can help to smooth out distributed load, but I always check whether it undermines my affinity settings. In latency-critical services, I disable or throttle automatic migration when it pulls threads away from their local memory and generates latency peaks. For analytics jobs or broad batch processing, I tend to leave balancing on because it can increase bandwidth without degrading interactive work. A practical introduction to balancing strategies provides me with additional starting points: Understanding NUMA balancing shows when to rely on the automatic mechanism and when to place manually. In the end, I make a data-based decision for each service class instead of blindly adopting a global default and missing my goals.

When balancing is activated, I monitor migration rates, minor/major fault peaks and CPU steal per node. If pages are moved back and forth cyclically, I counter this with tighter pinning, pre-touch and narrower memory masks. In workloads with long, sequential scans, on the other hand, balancing can harmonize load as long as no interactive latency paths are affected.
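The migration activity can be read from the NUMA counters in /proc/vmstat. The sketch below condenses three of them into one line; the function name is hypothetical, and since the counters are cumulative since boot, comparing two samples taken some time apart is more meaningful than a single reading.

```shell
# Summarise automatic NUMA balancing activity from /proc/vmstat-formatted input.
# $1 = vmstat file (defaults to /proc/vmstat)
numa_balancing_summary() {
    awk '
        $1 == "numa_hint_faults"       { faults = $2 }
        $1 == "numa_hint_faults_local" { local_faults = $2 }
        $1 == "numa_pages_migrated"    { migrated = $2 }
        END {
            pct = (faults > 0) ? 100 * local_faults / faults : 0
            printf "hint_faults=%d local_pct=%.1f pages_migrated=%d\n", faults, pct, migrated
        }
    ' "${1:-/proc/vmstat}"
}

# Current switch:  cat /proc/sys/kernel/numa_balancing   (0 disables it)
# Disable at runtime (root required):  sysctl -w kernel.numa_balancing=0
```

A low local percentage combined with a steadily growing pages_migrated value is a hint that pages are being moved back and forth rather than settling.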

Monitoring: measure, compare, decide

Without measurement, tuning remains a guessing game, so I track CPU load per core and per node, memory usage per node and the proportion of remote accesses. For user experience, P95/P99 latencies count much more than mean values, because outliers shape how SLAs are perceived and drive costs up. I run realistic load profiles with cold and warm caches because the two cases reveal different bottlenecks. After each change, I document the settings, test date and results so that I can safely reverse modifications later and no knowledge is lost. If you also correlate app metrics - queue lengths, retries, garbage collection - with system values, you can recognize cause and effect more quickly.
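For before/after comparisons, P95 and P99 can be computed directly from raw latency samples. The helper below is a sketch with a hypothetical name; it uses the nearest-rank definition and assumes one numeric sample per line and an integer percentile.

```shell
# Nearest-rank percentile of numeric samples (one per line on stdin).
# $1 = integer percentile, e.g. 95 or 99
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)      # ceil(p/100 * NR) for integer p
            if (rank < 1) rank = 1
            print v[rank]
        }
    '
}

# Usage (latencies.txt is a hypothetical file with one latency per line):
#   percentile 99 < latencies.txt
```

Nearest rank always returns a value that was actually measured, which makes the numbers easy to trace back to individual requests.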

Practical help in the analysis:

  • numastat (system-wide and per process) for local vs. remote hits
  • /proc/interrupts and SoftIRQ time per CPU for IRQ drift
  • perf events and scheduler statistics for runqueue depth, context switches, LLC misses, etc.
  • fio/iperf/wrk with node-specific worker pools for reproducible comparisons

The evaluation is done per node: I expect the latency histograms of all nodes to lie close together. If one node drifts upwards, I first look for incorrectly distributed IRQ load, drift in the page cache or heaps that were allocated on the wrong node during warm-up.
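The per-node allocation counters in sysfs give a rough remote indicator without extra tools. The sketch below, with a hypothetical function name, reports for each node the share of its allocations that went to tasks running on another node's CPUs (other_node versus local_node); the counters are cumulative since boot, so deltas between two readings are the more useful signal.

```shell
# Print, per NUMA node, the share of allocations served to tasks running on another node.
# $1 = sysfs node directory (defaults to /sys/devices/system/node)
remote_alloc_share() {
    for f in "${1:-/sys/devices/system/node}"/node*/numastat; do
        [ -r "$f" ] || continue
        node=${f%/numastat}; node=${node##*/}
        awk -v node="$node" '
            $1 == "local_node" { l = $2 }
            $1 == "other_node" { o = $2 }
            END {
                pct = (l + o > 0) ? 100 * o / (l + o) : 0
                printf "%s other_node_pct=%.1f\n", node, pct
            }
        ' "$f"
    done
}

# Usage: remote_alloc_share
```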

NUMA in VMs and containers

In virtualization, placing vCPUs and vRAM on a common node is important so that guest workloads do not fray across nodes and pull latency up. I size RAM so that it fits into the local node, and avoid large, RAM-hungry VMs that span several nodes and trigger drift. For containers, I use cpuset controllers so that pod groups work consistently on one node and memory is allocated locally. I prefer to place I/O-heavy guests on the node with a direct storage connection in order to keep DMA paths short and reduce IRQ noise. This means that even dense virtualization hosts remain predictable and carry more projects on the same hardware.

I pay attention to vNUMA exposure: the guest should see the same node structure that the hypervisor physically provides. vCPU pinning and vRAM binding belong together; I move hot-adds to maintenance windows if possible, because otherwise new pages end up remote. In Kubernetes, I rely on "Guaranteed" QoS, the "static" CPU manager policy and topology-aware placement so that pods do not spread across NUMA nodes. For SR-IOV/VFs, I assign VFs to the appropriate physical node and bind the IRQ queues to the CPU sets of the pods or VMs they serve.

Targeted preparation of first touch, warm-up and heaps

Many performance problems arise at startup: heaps grow in the warm-up phase where the first requests land, often concentrated on a single node. I therefore run controlled warm-ups for each node: start instances with the CPU/memory mask set, execute targeted pre-load queries and initialize caches in parallel for each node. For JVM services, I activate pre-touch of the heap; for databases, I segment buffer pools node by node. This reduces subsequent page migrations and ensures that the first requests do not randomly shape the memory distribution.

Kernel/BIOS tuning for constant latencies

Under the hood, I adjust the power and interrupt policy:

  • Set the CPU governor to "performance", limit deep C-states and use package C-states carefully in order to reduce jitter.
  • Do not throttle memory frequency; balanced energy profiles often reduce throughput under load.
  • Avoid spread spectrum/clock modulation if consistency is more important than minimal energy savings.

At kernel level, I keep housekeeping CPUs separate from latency cores, minimize timer interrupts on hot cores (nohz_full) and park background work (compaction, kswapd) preferably on system cores of a node that does not run latency paths.
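Whether the governor setting actually applies to every core can be verified from sysfs. The helper below is a sketch with a hypothetical name; it assumes the cpufreq interface is exposed (it may be absent in VMs or when the firmware controls frequency scaling).

```shell
# List CPUs whose cpufreq governor differs from the wanted one.
# $1 = wanted governor, $2 = sysfs CPU directory (defaults to /sys/devices/system/cpu)
cpus_not_using_governor() {
    for f in "${2:-/sys/devices/system/cpu}"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" != "$1" ]; then
            cpu=${f%/cpufreq/scaling_governor}
            echo "${cpu##*/}"
        fi
    done
}

# Usage: cpus_not_using_governor performance
# Set the governor (root required; assumes the cpupower tool is installed):
#   cpupower frequency-set -g performance
```

An empty result means every core that exposes cpufreq already runs the wanted governor.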

Troubleshooting and typical anti-patterns

  • Symptom: P99 latency jumps after deploys. Cause: heaps/caches first-touched on the wrong node. Solution: warm up/pre-touch under the target affinity, then open the load balancer.
  • Symptom: high SoftIRQ time on the "wrong" CPUs. Cause: irqbalance distributes IRQs across nodes. Solution: fix IRQ affinity, set RPS/RFS/XPS to match the node.
  • Symptom: OOM in one node although system RAM is free. Cause: strict NUMA mask without buffer. Solution: correct the capacity or use "preferred", establish alerts per node.
  • Symptom: irregular throughput with NVMe. Cause: incorrect queue mapping, queues shared across nodes. Solution: blk-mq/NVMe queues per node, I/O threads pinned.

Practice checklist

  • Record topology: Nodes, cores, RAM, PCIe devices per socket.
  • Classify services: which paths are latency-critical, which are batch?
  • Set CPU/memory affinity for each class; note first touch at start.
  • Bind IRQ/Queues close to the node; check RSS/RPS/XPS and NVMe queues.
  • Monitor P95/P99, remote accesses, run queue and IRQ distribution.
  • Control autobalancing specifically; select THP/zone_reclaim_mode appropriately.
  • Keep vNUMA, vCPU pinning and vRAM binding consistent in VMs/containers.
  • Test iteratively, document, roll back in case of drift and fine-tune.

Summary and tuning schedule

The greatest return comes from bringing threads and memory together, shortening I/O paths and distributing load across nodes only with care. I start with topology analysis, plan services node by node, set CPU and memory affinity, attach network and storage to the matching node and monitor P95/P99 values with a focus on remote accesses. I then tweak pool sizes, IRQ masks and policies until latency peaks subside and throughput increases. I check the placement of VMs and containers separately because the hypervisor has a lot of influence and boundaries work differently there. If you repeat and document this process, you will get measurably more performance out of server NUMA locality and CPU-memory affinity, often more cheaply than by buying additional hardware.
