The NUMA architecture determines how quickly modern servers supply threads with memory and how well workloads scale under high loads. I show why local memory accesses dominate latency and bandwidth, how hypervisors use NUMA, and which settings in VMs unlock direct performance gains.
Key points
I will briefly summarize the most important findings and highlight the factors that have the greatest impact in data centers.
- Local memory minimizes latency and increases throughput
- NUMA nodes structure CPUs and RAM efficiently
- Adjust vCPU counts per VM to the node size
- Pass virtual NUMA through to the guest OS
- Define spanning rules for large RAM requirements
I consistently focus on latency and data proximity, because that is where server performance is decided. Large sockets, many cores, and lots of RAM are of little use if threads are constantly waiting for remote memory areas. I dimension VMs so that they fit into a NUMA node and memory allocation remains local. I enable hypervisor features selectively instead of activating everything globally. This is how I ensure scaling without surprises during peak loads.
What NUMA really is
I think in nodes: each NUMA node combines CPU cores and a local RAM area with very short access paths. If a thread hits its data in the L1, L2, or L3 cache, everything runs extremely fast; if the data set sits in local RAM, latency remains low. However, if the thread accesses another node, wait times increase and throughput drops. It is precisely these differences that make memory access non-uniform. I therefore organize workloads so that the majority of accesses remain local.
Why UMA has its limitations
UMA shares a common memory path, which causes congestion as the number of cores increases. Each additional core joins the same queues and competes for bandwidth. In many older setups, this led to latency building up until CPU utilization was high but the application still responded sluggishly. It feels like "CPU at its limit," even though the bottleneck is actually memory access. NUMA resolves precisely these blockages through local paths and node topology.
NUMA vs. UMA: Differences at a glance
I like to summarize the most important differences in a compact table so that decisions can be made more quickly. This overview shows what matters in terms of architecture, latency, and scaling. It helps me with sizing new hosts as well as with troubleshooting in production environments. Those who clearly understand the difference between local and remote access make better decisions about VM sizing and RAM allocation. This is precisely where performance under load is decided.
| Criterion | NUMA | UMA | Practical effect |
|---|---|---|---|
| Memory access | Local or remote | Uniform | Local access is faster; remote access adds latency |
| Scaling | Very good with nodes | Limited early on | More cores scale more reliably with NUMA |
| Topology | Multiple nodes | Uniform pool | Topology-aware planning required |
| Hypervisor | Virtual NUMA available | Less relevant | Guest OS can schedule NUMA-aware |
| Fine-tuning | vCPU/RAM per node | Global tuning | Node-sized VMs deliver stability |
NUMA in virtual environments
I let the hypervisor pass the topology on to the guest OS so that schedulers and memory management can plan locally. Virtual NUMA shows the guest its node boundaries, allowing databases, JVMs, and .NET workers to arrange their heaps and threads more efficiently. This avoids expensive remote access and keeps latency stable. In sensitive setups, I combine this with a consistent pinning strategy and fixed RAM allocation. For extremely short response times, I also consider micro-latency hosting to further reduce jitter.
Best practices for VM sizes and CPU allocation
I dimension vCPUs so that a VM fits into a single NUMA node or just barely fills it. Example: if a host has two nodes with 20 cores each, I prefer to plan VMs with 4 to 16 vCPUs within one node. Going beyond this risks remote access and unnecessary waiting times. I distribute RAM as statically as possible so that the guest OS keeps its pages local. For workloads with a high single-thread component, I factor in the right core strategy and draw on analyses such as single-thread vs. multi-core comparisons.
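A minimal sizing check, assuming a host with two nodes of 20 cores and 256 GB RAM each (illustrative values); the function name fits_single_node is mine, not a tool API:

```python
# Sketch: check whether a planned VM fits into a single NUMA node.
# Host geometry (cores_per_node, ram_per_node_gb) is an assumed example.

def fits_single_node(vcpus: int, ram_gb: int,
                     cores_per_node: int = 20,
                     ram_per_node_gb: int = 256) -> bool:
    """True if the VM can be placed entirely within one NUMA node."""
    return vcpus <= cores_per_node and ram_gb <= ram_per_node_gb

# Example: host with two nodes of 20 cores / 256 GB each.
for vcpus, ram in [(8, 64), (16, 192), (24, 320)]:
    verdict = "fits one node" if fits_single_node(vcpus, ram) else "spans nodes (remote access likely)"
    print(f"{vcpus} vCPUs / {ram} GB -> {verdict}")
```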
Specific advantages for hosting hardware
With clean NUMA planning, I increase the density per host without sacrificing response times. In many data centers, this allows significantly more VMs to be operated per socket while applications still respond reliably. Shorter latency directly contributes to user experience and batch throughput. Costs per workload are reduced because CPU time and RAM are used more efficiently. Those who make informed hardware choices also benefit from modern high-performance web hosting hardware with high memory bandwidth.
Workload tuning: databases, caches, containers
I make sure that databases keep their heaps local and run worker threads on "their" node. For SQL engines, in-memory caches, and JVMs, it is worth pinning CPUs and reserving memory. Container orchestration benefits from node affinities so that pods use the shortest memory paths. For heavy I/O, I rely on NUMA-local NVMe placement to keep data close to the node. This keeps hot paths short and response times friendly.
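Where a service should be bound at launch time, a small wrapper around numactl illustrates the idea; the memcached command, user, and ports are placeholders for whatever engine you run:

```python
# Sketch: start a service bound to one NUMA node via numactl (Linux).
# --cpunodebind and --membind restrict CPUs and memory to that node.
import subprocess

def launch_on_node(cmd: list[str], node: int) -> subprocess.Popen:
    """Start cmd with CPUs and memory restricted to the given NUMA node."""
    return subprocess.Popen(
        ["numactl", f"--cpunodebind={node}", f"--membind={node}", *cmd]
    )

if __name__ == "__main__":
    # Hypothetical layout: one cache shard per node, ports chosen for the example.
    shards = [
        launch_on_node(["memcached", "-u", "memcache", "-p", str(11211 + n)], node=n)
        for n in (0, 1)
    ]
    for p in shards:
        p.wait()
```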
Monitoring and troubleshooting with NUMA
I measure latency and remote accesses in a targeted manner instead of just looking at CPU percentages. Tools show me, per node, how many pages are remote and which threads are creating memory pressure. If remote misses increase, I adjust the vCPU size, affinities, or RAM allocation. If throughput remains weak despite high CPU reserves, memory paths are often the cause. For me, visibility at the node level is the fastest way to causes, not just symptoms.
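A quick way to quantify remote allocations is the kernel's per-node numastat interface; this sketch assumes Linux and only reads the sysfs counters, interpreting them the same way numastat(8) does:

```python
# Sketch: report the share of non-local allocations per NUMA node from
# /sys/devices/system/node/node<N>/numastat (Linux).
import glob

for path in sorted(glob.glob("/sys/devices/system/node/node*/numastat")):
    node = path.split("/")[-2]
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    total = stats["numa_hit"] + stats["numa_miss"]
    remote_share = stats["numa_miss"] / total * 100 if total else 0.0
    print(f"{node}: hit={stats['numa_hit']} miss={stats['numa_miss']} "
          f"other_node={stats['other_node']} remote={remote_share:.2f}%")
```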
NUMA spanning: using it correctly
I activate spanning specifically for VMs with very high RAM requirements or exceptional bandwidth needs. The VM can then obtain memory from multiple nodes, which is what makes single instances with a massive footprint possible in the first place. The price is occasional remote access, which I mitigate with CPU affinities and a larger share of local pages. For mixed loads, I prefer several medium-sized VMs over one very large instance. This way, day-to-day operations remain predictable.
Licensing, density, and real costs
I evaluate costs not at the host level, but per workload and month in euros. When NUMA increases VM density, fixed costs per instance decrease and performance reserves grow. This affects per-core licenses as well as support and energy costs. Reducing remote access shortens computing time and saves energy for the same task. In the end, what counts is the overall balance per result, not just per server.
Reading hardware topology and interconnects correctly
I factor the physical topology actively into my planning. Modern servers use multi-part CPU designs and connect chiplets or dies via interconnects. This means that not every core has the same path to every RAM module, and even within a socket there are preferred paths. The more traffic that runs over the cross-socket links, the greater the increase in latency and coherency overhead. I therefore check how many memory channels are active per node, whether all DIMM slots are populated symmetrically, and how the nodes are connected on the motherboard. Sub-NUMA features that divide nodes into smaller domains can equalize hotspots if workloads are clearly segmented. I also watch the L3 topology: if threads and their data sit in different cache domains, the cache transfers alone have a noticeable impact on performance. A simple bandwidth test and a topology overview quickly show whether the platform delivers the expected locality or whether interconnects become a bottleneck.
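To get the topology overview mentioned above, the kernel already exposes CPU lists, memory sizes, and the distance matrix per node; a short Linux-only sketch:

```python
# Sketch: dump the NUMA topology under /sys/devices/system/node:
# CPU list, memory size, and the relative distance matrix per node
# (10 = local, larger values = farther over the interconnect).
import glob
import os

for n in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    name = os.path.basename(n)
    with open(f"{n}/cpulist") as f:
        cpulist = f.read().strip()
    with open(f"{n}/distance") as f:
        distances = f.read().split()
    with open(f"{n}/meminfo") as f:
        memtotal_kb = int(f.readline().split()[-2])   # "Node N MemTotal: X kB"
    print(f"{name}: cpus={cpulist} mem={memtotal_kb // 1024} MiB distances={distances}")
```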
Firmware and BIOS options with effect
I make sure in the BIOS that node interleaving is disabled so that the NUMA structure remains visible. I use sub-NUMA clustering or comparable modes specifically when a host runs many medium-sized, clearly separated workloads. For consistent latencies, I choose performance-oriented power profiles, limit deeper C-states, and avoid aggressive core parking. I populate memory for full channel bandwidth; asymmetrical DIMM configurations have a direct impact on throughput and latency. I also check prefetcher and RAS options: some protection mechanisms increase latency without benefiting the workload. Important: I test every BIOS adjustment under real load, because micro-effects caused by caches and interconnects often only become apparent under pressure.
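After BIOS changes I verify from the OS side that the topology still looks as intended; this Linux sketch only counts visible nodes and lists the C-states exposed via cpuidle:

```python
# Sketch: sanity checks after firmware changes. With node interleaving
# enabled, the kernel would typically expose only a single NUMA node;
# available C-states show up under cpuidle.
import glob
import os

nodes = glob.glob("/sys/devices/system/node/node[0-9]*")
hint = " -- check whether node interleaving is really disabled!" if len(nodes) < 2 else ""
print(f"NUMA nodes visible: {len(nodes)}{hint}")

for state in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*")):
    with open(os.path.join(state, "name")) as f:
        name = f.read().strip()
    with open(os.path.join(state, "disable")) as f:
        disabled = f.read().strip() == "1"
    print(f"cpu0 {os.path.basename(state)}: {name} ({'disabled' if disabled else 'enabled'})")
```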
Guest OS and runtime tuning: from first touch to huge pages
In the guest, I use first-touch allocation to my advantage: threads initialize "their" memory so that pages are created locally. Under Linux, I enable or disable automatic NUMA balancing depending on the workload; database-heavy systems often benefit from stable binding, while distributed web workers can tolerate some migration. With numactl or task pinning, I bind services to nodes and define membind policies. Huge pages reduce TLB pressure; for latency-critical databases, I prefer static huge pages and warm memory up front (pre-touch) to avoid page fault spikes. Depending on the engine, I run transparent huge pages in "madvise" mode or disable them if they cause defragmentation latencies. I control IRQ affinities and distribute network and NVMe interrupts to the appropriate nodes; RPS/XPS and multiple queues help keep data paths consistent. On Windows, I use processor groups and soft NUMA in the stack, ensure "lock pages in memory" for memory-intensive services, and enable server GC on .NET. For JVMs, I use NUMA-aware heuristics, pre-touch heaps, and control thread affinity so that GC and workers use the same nodes.
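The first-touch idea in a nutshell: pin first, then allocate and touch. The sketch below is Linux-only; the buffer size and the two-node loop are arbitrary example values:

```python
# Sketch: each worker pins itself to the CPUs of one node before it
# allocates and touches its buffer, so the kernel backs the pages with
# local memory (first touch).
import mmap
import os
from multiprocessing import Process

PAGE = 4096
SIZE = 512 * 1024 * 1024   # assumed buffer size per worker

def cpus_of_node(node: int) -> set[int]:
    """Parse /sys/devices/system/node/node<N>/cpulist, e.g. '0-9,20-29'."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus: set[int] = set()
        for part in f.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def worker(node: int) -> None:
    os.sched_setaffinity(0, cpus_of_node(node))   # pin before allocating
    buf = mmap.mmap(-1, SIZE)                     # anonymous mapping, pages faulted in lazily
    for off in range(0, len(buf), PAGE):
        buf[off:off + 1] = b"\x01"                # first touch: page lands on the local node

if __name__ == "__main__":
    procs = [Process(target=worker, args=(n,)) for n in (0, 1)]
    for p in procs: p.start()
    for p in procs: p.join()
```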
Cleanly align hypervisor-specific settings
I align the vNUMA topology with the physical structure. I select the parameters "sockets," "cores per socket," and "threads per core" so that the hypervisor does not split the VM across nodes. For latency-sensitive instances, I reserve RAM so that neither ballooning nor swapping occurs, and I secure pCPU resources via affinity or appropriate scheduler options. Be careful with CPU or memory hot add: many platforms disable vNUMA in the guest, resulting in hidden remote access. I plan live migration so that target hosts have a compatible NUMA topology, and I give VMs time after migration to rebuild page locality (pre-touch, warm-up). In KVM environments, I use the NUMA tuning options and cpuset cgroups; on other hypervisors, esxtop or similar tools help to see vCPU distribution and node hits in real time.
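As a planning aid, here is a hedged sketch that derives a virtual socket layout which does not split a socket across nodes; the 20-core node size is an assumption, and real hypervisors add constraints of their own:

```python
# Sketch: suggest a "sockets x cores per socket" layout for a VM so that
# each virtual socket fits into one physical NUMA node.

def vnuma_layout(vcpus: int, cores_per_node: int = 20) -> tuple[int, int]:
    """Return (virtual_sockets, cores_per_socket) without crossing node boundaries."""
    if vcpus <= cores_per_node:
        return 1, vcpus                    # single virtual socket, all local
    sockets = -(-vcpus // cores_per_node)  # ceiling division: one socket per node
    if vcpus % sockets:
        raise ValueError("choose a vCPU count divisible by the number of virtual sockets")
    return sockets, vcpus // sockets

for vcpus in (8, 16, 40):
    print(vcpus, "vCPUs ->", vnuma_layout(vcpus))
```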
Don't waste PCIe and I/O locality
I assign NVMe drives, HBAs, and NICs to the node on which the computing threads run. I bind SR-IOV or vNIC queues to cores of the same node and steer interrupts accordingly. For high packet rates, I scale receive/transmit queues and distribute them consistently across the local cores. For storage stacks, I make sure that worker threads for I/O submissions and completions run on the same node so that the data path does not cross the interconnect. I also plan multipathing and software RAID on a node-specific basis; a "shorter" path almost always beats a "wider" path with remote accesses. This reduces jitter and keeps CPU time for I/O where it has an effect.
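Device locality can be checked directly in sysfs before queues and IRQs are pinned; this sketch assumes Linux and the common device/numa_node attribute:

```python
# Sketch: show which NUMA node NVMe drives and NICs are attached to.
# A value of -1 means the platform did not report a node.
import glob
import os

def device_node(path: str) -> int:
    try:
        with open(path) as f:
            return int(f.read().strip())
    except OSError:
        return -1

for nvme in sorted(glob.glob("/sys/block/nvme*n1")):
    node = device_node(os.path.join(nvme, "device", "numa_node"))
    print(f"{os.path.basename(nvme)}: node {node}")

for nic in sorted(glob.glob("/sys/class/net/*")):
    numa_path = os.path.join(nic, "device", "numa_node")
    if os.path.exists(numa_path):          # virtual interfaces have no device/ link
        print(f"{os.path.basename(nic)}: node {device_node(numa_path)}")
```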
Capacity planning, overcommit, and memory features
I prefer to run latency-oriented workloads without RAM overcommit and with only moderate vCPU overcommit. Ballooning, compression, and hypervisor swapping generate remote accesses or page fault spikes, exactly what I want to avoid. Transparent page sharing is ineffective in many setups and can obscure the view of true locality. I calibrate the VM mix so that multiple memory-bandwidth-hungry instances do not collide on the same node. For in-memory engines, I plan generous reservations and, where appropriate, huge pages in the guest that the hypervisor can pass through. This keeps TLB hit rates and access times predictable.
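A simple reservation ledger per node keeps memory overcommit visible during planning; VM names, node assignments, and the 256 GB node size are made-up example data:

```python
# Sketch: verify that planned memory reservations per NUMA node do not
# exceed the node's capacity.
from collections import defaultdict

NODE_RAM_GB = 256                                  # assumed capacity per node
planned = [                                        # (vm_name, node, reserved_gb), example data
    ("db01", 0, 192), ("cache01", 0, 48),
    ("db02", 1, 192), ("web01", 1, 32),
]

per_node = defaultdict(int)
for _name, node, gb in planned:
    per_node[node] += gb

for node, used in sorted(per_node.items()):
    headroom = NODE_RAM_GB - used
    status = "OK" if headroom >= 0 else "OVERCOMMITTED"
    print(f"node {node}: reserved {used} of {NODE_RAM_GB} GB ({status}, headroom {headroom} GB)")
```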
Live migration and high availability
I take into account that a migration temporarily destroys the page locality of a VM. After the move, I warm up critical heaps and let background jobs rebuild the hot sets. I plan target hosts with a similar NUMA topology so that vNUMA does not have to be recut. For HA cases with heterogeneous hardware, I define policies: either I accept higher latency for a short time, or I prioritize hosts with compatible node sizes. It is important to monitor after migration: if the share of remote pages increases, I adjust affinities or trigger pre-faulting until locality fits again.
Practical diagnostic patterns
I recognize typical NUMA problems by a few patterns: the CPU runs "hot," but instructions per cycle remain low; latency jumps in waves; individual threads block on memory accesses even though cores are free. In such cases, I look at remote hits, interconnect utilization, TLB misses, and the distribution of active threads per node. I correlate interrupt load with the cores carrying the application and check whether cache lines are constantly being invalidated between nodes. A simple cross-check is to confine the VM to one node: if latencies drop immediately, spanning or scheduling was the cause. Similarly, dedicated tests reveal the RAM bandwidth per node and show whether the DIMM configuration or BIOS options are slowing things down.
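For the "hot CPU, low IPC" pattern, a quick check with perf stat is often enough; this sketch shells out to perf (Linux, sufficient perf_event permissions assumed) and keeps the number parsing deliberately simple:

```python
# Sketch: compute instructions per cycle for a running process using
# "perf stat". Low IPC at high CPU utilization hints at memory-bound code.
import re
import subprocess
import sys

def ipc_of(pid: int, seconds: int = 5) -> float:
    result = subprocess.run(
        ["perf", "stat", "-e", "cycles,instructions", "-p", str(pid),
         "sleep", str(seconds)],
        capture_output=True, text=True, check=True,
    )
    counts = {}
    for line in result.stderr.splitlines():            # perf stat prints to stderr
        m = re.match(r"\s*([\d,.]+)\s+(cycles|instructions)", line)
        if m:
            counts[m.group(2)] = int(m.group(1).replace(",", "").replace(".", ""))
    return counts["instructions"] / counts["cycles"]

if __name__ == "__main__":
    pid = int(sys.argv[1])
    print(f"IPC of PID {pid}: {ipc_of(pid):.2f}")
```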
Practical checklist
- Understanding topology: nodes, memory channels, PCIe mapping, cache domains
- Check BIOS: node interleaving off, performance power profile, limit deep C-states
- Size VMs: vCPUs per VM ≤ node size, vNUMA correct, watch out for hot add
- Secure RAM: Reservations for latency workloads, huge pages where appropriate
- Set affinity: Bind threads, IRQs, and I/O queues to the same node
- Containers/Pods: Utilize node affinity, CPU managers, and topology awareness
- Spanning only where needed: Support large instances with policies and monitoring
- Plan migration: suitable target topology, pre-touch heaps, observe locality
- Sharpen monitoring: remote access, bandwidth per node, interconnect utilization
- Test regularly: Bandwidth/latency checks after firmware or host changes