I show how NUMA balancing on hosting hardware streamlines memory access and reduces latency by binding processes and data to the appropriate NUMA node. The decisive factors are memory access optimization through local access, task placement and targeted page migration on Linux hosts with many cores.
Key points
- NUMA separates CPUs and memory into nodes; local accesses provide low latency.
- Automatic NUMA balancing migrates pages and places tasks close to their memory's node.
- Size VMs to fit within one node, otherwise there is a risk of NUMA thrashing.
- Tools such as numactl, lscpu and numad show topology and usage.
- Tuning: reduce C-states, switch node interleaving off, use huge pages and affinities.
What NUMA is - and why it counts for hosting
NUMA divides a multiprocessor system into nodes, each containing its own CPUs and local memory, so that nearby accesses are faster than remote ones. While UMA sends all cores down a common path, NUMA prevents bottlenecks through local memory channels per node. In hosting environments with many parallel VMs every millisecond of latency adds up, so every request benefits measurably. If you would like more background, you can read up on NUMA architecture. For me, one thing is certain: if you understand and use nodes, you get more bandwidth out of the same hardware.
Automatic NUMA balancing in the Linux kernel - how it works
The kernel periodically scans parts of the address space and unmaps pages so that the next access triggers a NUMA hinting fault, which reveals which node actually touches the data. When the fault occurs, the algorithm evaluates whether it is worth migrating the page or moving the task and avoids unnecessary movements. Migrate-on-fault brings data closer to the executing CPU; task NUMA placement moves processes closer to their memory. The scanner distributes its work piece by piece so that the overhead remains within the noise of the normal load. The result is ongoing fine-tuning that reduces latency without requiring hard pinning rules.
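Whether this mechanism is active and actually doing work can be checked in seconds; the counters below are standard /proc/vmstat entries on kernels with automatic NUMA balancing:
# Is automatic NUMA balancing enabled? (1 = on)
cat /proc/sys/kernel/numa_balancing
# Watch hinting faults and page migrations accumulate under load
grep -E 'numa_hint_faults|numa_pages_migrated' /proc/vmstat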
Memory Access Optimization: local beats remote
Local accesses use the memory controller of their own node and minimize waiting time on the interconnect. Remote accesses cost cycles over QPI/UPI or Infinity Fabric and thus reduce the effective bandwidth. High core counts exacerbate this effect because more and more cores compete for the same links. I therefore plan so that hot code and active data end up on the same node. If you disregard this, you give away percentage points that decide between response and timeout during load peaks.
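A simple A/B run makes the cost of remote access tangible; "./bench" stands for any memory-bound test program of your choice, and the node numbers are examples:
# Same workload, once with local and once with remote memory
numactl --cpunodebind=0 --membind=0 ./bench   # local: CPU and memory on node 0
numactl --cpunodebind=0 --membind=1 ./bench   # remote: CPU on node 0, memory on node 1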
VM sizing, NUMA thrashing and host layout
I dimension VMs so that vCPUs and RAM fit into a single NUMA node to avoid cross-node access. Often 4-8 vCPUs per node provide good performance and hit rates, depending on the platform and cache hierarchy. Huge pages also help because the TLB works more efficiently and page migrations occur less frequently. If required, I set CPU affinity for latency-critical processes to bind threads to suitable cores - for more information see CPU affinity. If you span VMs across nodes, you risk NUMA thrashing, i.e. a ping-pong of data and threads.
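Before cutting VMs, I check what a node actually offers; the commands are standard, the resulting limits depend on the platform:
# CPUs and memory per node as the upper bound for a single-node VM
lscpu | grep -i numa
numactl --hardware | grep -E 'cpus|size|free'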
Tools in practice: numactl, lscpu, numad
With lscpu I read the topology and NUMA nodes, including the assignment of cores. "numactl --hardware" shows memory per node and the node distances, which makes it easier to judge the paths. The numad daemon monitors utilization and dynamically adjusts affinities when load centers shift. For fixed scenarios, I use "numactl --cpunodebind/--membind" to pin processes and memory explicitly. In this way I combine automatic balancing with targeted placement and verify the result via perf, numastat and /proc.
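In practice this combination looks roughly like the sketch below; "./service" is a placeholder for the process to be pinned and the numad interval is just an example:
# Core-to-node assignment at a glance
lscpu -e=CPU,NODE,SOCKET,CORE
# Fixed scenario: pin CPUs and memory explicitly to node 0
numactl --cpunodebind=0 --membind=0 ./service
# Dynamic scenario: let numad re-evaluate affinities periodically (interval in seconds)
numad -i 15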
How I measure impact: Key figures and commands
I always evaluate NUMA tuning with measurement series, not by gut feeling. Three indicators have proven their worth: the ratio of local to remote accesses, the migration rate and the latency distribution (P95/P99).
- System-wide: "numastat" shows local/remote accesses and migrated pages per node.
- Process-related: "/proc/<pid>/numa_maps" reveals where memory is located and how it was distributed.
- Scheduler view: "Cpus_allowed_list" and the raw "Cpus_allowed" mask show whether bindings actually apply.
# System-wide view
numastat
numastat -m
# Process-related distribution and binds
pid=$(pidof myservice)   # placeholder: replace "myservice" with the target process name
numastat -p "$pid"
cat /proc/"$pid"/numa_maps | head
cat /proc/"$pid"/status | grep -E 'Cpus_allowed_list|Mems_allowed_list'
taskset -cp "$pid"
# Kernel counter for NUMA activity
grep -E 'numa|migrate' /proc/vmstat
# Trace events for deep analyses (activate for a short time)
echo 1 > /sys/kernel/debug/tracing/events/migrate/enable
sleep 5
grep -i numa /sys/kernel/debug/tracing/trace
echo 0 > /sys/kernel/debug/tracing/events/migrate/enable
I always compare A/B: unbound vs. bound, automatic balancing on vs. off, and different VM cuts. The goal is a clear reduction in remote accesses and migration noise as well as tighter P95/P99 latencies. Only when the measured values are stably better do I adopt the tuning.
BIOS and firmware settings that really work
I switch off node interleaving in the BIOS so that the NUMA structure remains visible and the kernel can plan locally. Reduced C-states stabilize latency peaks because cores fall into deep sleep states less often, which saves wake-up time. I populate memory channels symmetrically so that each node achieves its maximum memory bandwidth. I test prefetchers and RAS features with workload profiles, as they help or hurt depending on the access pattern. I measure every change against a baseline and only then adopt the setting permanently.
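Whether the BIOS side fits can be verified from the running system; these checks assume the usual sysfs layout with the cpuidle driver loaded:
# More than one node visible on a multi-socket host means node interleaving is off
numactl --hardware | grep available
# Which C-states the cores may enter (deep states cost wake-up time)
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name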
Kernel and sysctl parameters that make the difference
Fine-tuning the kernel helps me match the balancer's overhead and response time to the workload. I start with conservative defaults and work my way forward step by step.
- kernel.numa_balancing: switches automatic balancing on or off. I leave it on for shifting loads; for strictly pinned special services I switch it off as a test.
- kernel.numa_balancing_scan_delay_ms: waiting time before the first scan after process creation. Choose a larger value if many short-lived tasks are running; smaller for long-running services that need fast proximity.
- kernel.numa_balancing_scan_period_min_ms / _max_ms: range of the scan intervals. Tight intervals increase responsiveness, but also CPU load.
- kernel.numa_balancing_scan_size_mb: amount of address space scanned per pass. Too large generates hint-fault storms; too small reacts sluggishly.
- vm.zone_reclaim_mode: when memory is scarce, the kernel prefers local reclaim over remote allocation. For general hosting workloads I usually leave it at 0; for strictly latency-sensitive, memory-local services I carefully test higher values.
- Transparent Huge Pages (THP): under /sys/kernel/mm/transparent_hugepage/{enabled,defrag} I usually set madvise and conservative defragmentation. A hard "always" profile brings TLB advantages but risks stalls due to compaction.
- kernel.sched_migration_cost_ns: cost estimate for task migration. Higher values dampen aggressive task redistribution by the scheduler.
- cgroups cpuset: with cpuset.cpus and cpuset.mems I separate services cleanly by node and ensure that first touch stays within the permitted nodes.
# Example: conservative but responsive balancing
sysctl -w kernel.numa_balancing=1
sysctl -w kernel.numa_balancing_scan_delay_ms=30000
sysctl -w kernel.numa_balancing_scan_period_min_ms=60000
sysctl -w kernel.numa_balancing_scan_period_max_ms=300000
sysctl -w kernel.numa_balancing_scan_size_mb=256
# Use THP carefully
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo defer > /sys/kernel/mm/transparent_hugepage/defrag
It remains important: change only one knob per test round and measure the effect against the same load curve. This is how I disentangle cause and effect.
Position workloads correctly: Databases, caches, containers
Databases benefit when buffer pools remain local per NUMA node and threads are bound close to their heaps. For in-memory caches I shard per node to avoid remote fetches. Container platforms get limits and requests so that pods do not jump across nodes. For memory reservations I use huge pages, so that hot sets fit better into the caches. The following table summarizes strategies and typical effects; a short command sketch follows it.
| Strategy | Use | Expected effect | Note |
|---|---|---|---|
| first touch | Databases, JVM heaps | Local page allocation | Execute initialization on the target node |
| Interleave | Broadly distributed load | Even distribution | Not optimal for hotspots |
| Task Pinning | Latency-critical services | Constant latency | Less flexible during load changes |
| Automatic balancing | Mixed workloads | Dynamic proximity | Weigh overhead against gain |
| Huge Pages | Large heaps, caches | Fewer TLB misses | Plan clean reservations |
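As a minimal sketch, the strategies from the table translate into commands like these; the binaries and node numbers are placeholders:
# First touch plus pinning: initialize and run on node 0 so pages land locally
numactl --cpunodebind=0 --membind=0 ./db-shard0
# Interleave: spread a broadly streaming workload evenly across all nodes
numactl --interleave=all ./analytics-scan
# Task pinning without a memory policy: fixed cores, memory follows first touch
taskset -c 0-7 ./latency-service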
Virtualization: Virtual NUMA, scheduler and guest customization
Virtual NUMA passes the host topology to the guest OS in simplified form so that first touch and the allocator work sensibly. Hypervisor schedulers take node proximity into account when distributing vCPUs and migrating VMs. I rarely span large VMs across multiple nodes unless the workload streams broadly and benefits from interleave. In the guest, I size the heaps of JVMs or databases so that they remain local on the visible NUMA nodes. For memory management in the guest, it is worth taking a look at virtual memory to tame page sizes and swapping.
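Inside the guest the same tools apply; for JVM heaps a NUMA-aware allocator helps, sketched here under the assumption of a HotSpot JVM and an example heap size:
# Check which topology the hypervisor actually exposes to the guest
numactl --hardware
# Let the JVM allocate its heap NUMA-aware
java -XX:+UseNUMA -Xmx24g -jar service.jar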
PCIe proximity: NVMe and NICs at the right nodes
If possible, I place NVMe SSDs and fast NICs on the node on which the workload runs. This prevents I/O requests from crossing the interconnect and adding latency. I bind multi-queue NICs to core sets of one node with RSS/RPS so that IRQs stay local. For storage stacks, it is worth splitting thread pools node by node. If you pay attention to this, you noticeably reduce P99 latencies and create headroom for load peaks.
IRQ and queue affinity in practice
I first check which NUMA node the devices sit on and then pin IRQs and queues accordingly. This keeps the data path local.
# Device-to-node assignment
cat /sys/class/net/eth0/device/numa_node
cat /sys/block/nvme0n1/device/numa_node
# Set IRQ affinity specifically (example: cores 0-7 of a node)
irq=123   # placeholder: pick the IRQ number of the hot queue from /proc/interrupts
echo 0-7 > /proc/irq/$irq/smp_affinity_list
# Bind NIC queues to cores (RPS/RFS)
for q in /sys/class/net/eth0/queues/rx-*; do echo ff > "$q"/rps_cpus; done   # hex mask ff = cores 0-7
sysctl -w net.core.rps_sock_flow_entries=32768
for q in /sys/class/net/eth0/queues/rx-*; do echo 4096 > "$q"/rps_flow_cnt; done
# Improve NVMe queue affinity
echo 2 > /sys/block/nvme0n1/queue/rq_affinity
cat /sys/block/nvme0n1/queue/scheduler # "none" preferred
I run irqbalance with node awareness or set exceptions for hot-path interrupts. The result is more stable latencies, fewer cross-node IRQ hops and a measurably higher share of local I/O hits.
Static binding vs. dynamic balancing - the middle way
With taskset and cgroups I set hard rules where deterministic latency counts. I leave automatic NUMA balancing active when the load shifts and I need adaptive proximity. A mixture often works best: hard pins for hot paths, more open boundaries for ancillary work. I regularly check whether migrations increase noticeably, as this signals poor placement. The aim remains to choose data and thread locations so that migration stays rare but possible.
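The mixture can be expressed in a few lines; the service name and core range are examples:
# Hard pin only the hot path: fixed cores, local memory on node 0
numactl --cpunodebind=0 --membind=0 ./hotpath-service &
# Ancillary work stays unbound; automatic balancing keeps handling it
sysctl kernel.numa_balancing   # should report 1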
NUMA in containers and Kubernetes
I bring containers into line with cpusets and huge pages. I assign pods/containers to a NUMA node by specifying consistent CPU and memory amounts. In orchestrations, I set policies that favor single-node assignments and thus respect first touch.
- Container runtime: "--cpuset-cpus" and "--cpuset-mems" keep tasks and memory together; assign huge pages as resources (see the sketch after this list).
- Topology/CPU Manager: strict or preferred policies ensure that related cores and memory areas are allocated together.
- Guaranteed QoS: fixed requests/limits minimize redistribution by the scheduler.
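A sketch for the runtime level, assuming Docker-compatible flags and cgroup v2; image name, paths and IDs are placeholders:
# Keep container CPUs and memory on node 0
docker run --cpuset-cpus="0-7" --cpuset-mems="0" myimage
# The same idea with a raw cgroup v2 cpuset (the cpuset controller must be enabled in the parent)
mkdir -p /sys/fs/cgroup/hotpath
echo "0-7" > /sys/fs/cgroup/hotpath/cpuset.cpus
echo "0" > /sys/fs/cgroup/hotpath/cpuset.mems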
I deliberately move sidecars and auxiliary processes to other cores within the same node so that the hot path remains undisturbed but does not enter a cross-node race.
Understanding CPU topologies: CCD/CCX, SNC and Cluster-on-Die
Current server CPUs break sockets down into subdomains with their own caches and paths. I take this into account when cutting cores and heaps:
- AMD EPYC: CCD/CCX and "NUMA per socket" (NPS=1/2/4) influence how finely NUMA is cut. More nodes (NPS=4) increase locality but require clean pinning.
- Intel: Sub-NUMA Clustering (SNC-2/4) divides the LLC into clusters. Good for memory-bound loads, provided the OS and workload are node-aware.
- L3 proximity: I bind threads that use the same heaps into the same L3 cluster to save coherence traffic and cross-cluster hops.
These options act like a multiplier: used correctly, they raise locality further; configured incorrectly, they increase fragmentation and remote traffic.
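Whether NPS/SNC settings produce the intended cut can be checked directly; the CACHE column of lscpu shows which cores share caches:
# Cores, nodes and shared caches in one view
lscpu -e=CPU,NODE,SOCKET,CORE,CACHE
# Distance matrix: even with more nodes, local distances should stay short
numactl --hardware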
Step-by-step introduction and rollback plan
I never introduce NUMA tuning as a big bang. A resilient plan avoids surprises:
- Baseline: capture hardware topology, P50/P95/P99 latencies, throughput and numastat rates.
- Hypothesis: formulate a specific target (e.g. remote accesses -30%, P99 -20%).
- One step: change only one knob (e.g. VM cut, cpuset, THP policy, scan intervals).
- Canary: test on 5-10% of the fleet under real load, keep a rollback ready.
- Evaluation: compare measured values, define regression windows, log side effects.
- Rollout: roll out wave by wave, measure again after each wave.
- Maintenance: re-measure quarterly (kernel, firmware and workload updates shift the optimum).
This ensures that improvements are reproducible and can be reversed within minutes in the event of an error.
Common mistakes - and how to avoid them
A typical misstep is activating node interleaving in the BIOS, which hides the NUMA topology and makes balancing harder. Equally unfavorable: VMs with more vCPUs than a node offers, plus sloppily reserved huge pages. Some admins pin everything hard and thus lose all flexibility when workloads shift. Others rely entirely on the kernel, even though hard hotspots require clear rules. I record measurement series, spot outliers early and adjust setup and policies step by step.
- THP "always" without control: unplanned compaction disrupts latency. I prefer "madvise" and reserve huge pages deliberately.
- vm.zone_reclaim_mode too aggressive: local reclaim can do more harm than good at the wrong moment. Measure first, then tighten.
- irqbalance running blind: uncritical IRQs move across nodes. I set exceptions or fixed masks for hot paths.
- Mixing interleave and hard pinning: contradictory policies create ping-pong. I choose one clear line per service.
- Unclean cpusets: containers see one node but map memory to other nodes. Always set "cpuset.mems" consistently with the CPU set.
- Sub-NUMA features activated but not used: more nodes without planning increase fragmentation. Switch on only after tests.
Briefly summarized
NUMA balancing brings processes and data together in a targeted way, so local accesses become more frequent and latencies shorter. With suitable VM sizing, a clean BIOS configuration and tools such as numactl, a clear topology emerges that the kernel exploits. Virtual NUMA, huge pages and affinities supplement automatic balancing rather than replace it. Attaching I/O devices close to their nodes and pinning hot paths eliminates expensive remote accesses. In this way, hosting hardware scales reliably and every CPU second delivers more payload.


