The NUMA architecture determines how quickly modern servers supply threads with memory and how well workloads scale under high loads. I show why local memory accesses dominate latency and bandwidth, how hypervisors use NUMA, and which settings in VMs unlock direct performance gains.
Key points
I will briefly summarize the most important findings and highlight the factors that have the greatest impact in data centers.
- Local memory minimizes latency and increases throughput
- NUMA nodes structure CPUs and RAM efficiently
- Adjust vCPU counts per VM to the node size
- Pass virtual NUMA through to the guest OS
- Define spanning rules for large RAM requirements
I consistently focus on latency and data proximity, because that is where server performance is decided. Large sockets, many cores, and lots of RAM are of little use if threads are constantly waiting for remote memory areas. I dimension VMs so that they fit into a NUMA node and memory allocation remains local. I enable hypervisor features selectively instead of activating everything globally. This is how I ensure scaling without surprises during peak loads.
What NUMA really is
I think in nodes: each NUMA node combines CPU cores and a local RAM area with very short access paths. If a thread hits its data in the L1, L2, or L3 cache, everything runs extremely fast; if the data set sits in local RAM, latency remains low. However, if the thread accesses another node, wait times increase and throughput drops. It is precisely these differences that make memory access non-uniform. I therefore organize workloads so that the majority of accesses remain local.
Why UMA has its limitations
UMA shares a common memory path, which causes congestion as the number of cores increases. Each additional core joins the same queues and competes for bandwidth. In many older setups, this led to latency building up until CPU utilization was high but the application still responded sluggishly. It feels like "CPU at its limit," even though the bottleneck is actually memory access. NUMA resolves precisely these blockages through local paths and node topology.
NUMA vs. UMA: Differences at a glance
I like to summarize the most important differences in a compact table so that decisions can be made more quickly. This overview shows what matters in terms of architecture, latency, and scaling. It helps me with sizing new hosts as well as with troubleshooting in production environments. Those who clearly understand the difference between local and remote access make better decisions about VM sizing and RAM allocation. This is precisely where performance under load is decided.
| Criterion | NUMA | UMA | Practical effect |
|---|---|---|---|
| Memory access | Local or remote | Uniform | Local access is faster; remote access adds latency |
| Scaling | Very good with nodes | Limited early on | More cores scale more reliably with NUMA |
| Topology | Multiple nodes | Uniform pool | Topology-aware planning required |
| Hypervisor | Virtual NUMA available | Less relevant | Guest OS can schedule NUMA-aware |
| Fine-tuning | vCPU/RAM per node | Global tuning | Node-sized VMs deliver stability |
NUMA in virtual environments
I let the hypervisor pass the topology on to the guest OS so that schedulers and memory management can plan locally. Virtual NUMA shows the guest its node boundaries, allowing databases, JVMs, and .NET workers to arrange their heaps and threads more efficiently. This avoids expensive remote access and keeps latency stable. In sensitive setups, I combine this with a consistent pinning strategy and fixed RAM allocation. For extremely short response times, I also consider micro-latency hosting to further reduce jitter.
Best practices for VM sizes and CPU allocation
I dimension vCPUs so that a VM fits into a single NUMA node or just barely fills it. Example: if a host has two nodes with 20 cores each, I prefer to plan VMs with 4 to 16 vCPUs within one node. Going beyond this risks remote access and unnecessary waiting times. I distribute RAM as statically as possible so that the guest OS keeps its pages local. For workloads with a high single-thread component, I factor in the right core strategy and draw on analyses such as single-thread vs. multi-core comparisons.
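A minimal sizing check, assuming a host with two nodes of 20 cores and 256 GB RAM each (illustrative values); the function name fits_single_node is mine, not a tool API:

```python
# Sketch: check whether a planned VM fits into a single NUMA node.
# Host geometry (cores_per_node, ram_per_node_gb) is an assumed example.

def fits_single_node(vcpus: int, ram_gb: int,
                     cores_per_node: int = 20,
                     ram_per_node_gb: int = 256) -> bool:
    """True if the VM can be placed entirely within one NUMA node."""
    return vcpus <= cores_per_node and ram_gb <= ram_per_node_gb

# Example: host with two nodes of 20 cores / 256 GB each.
for vcpus, ram in [(8, 64), (16, 192), (24, 320)]:
    verdict = "fits one node" if fits_single_node(vcpus, ram) else "spans nodes (remote access likely)"
    print(f"{vcpus} vCPUs / {ram} GB -> {verdict}")
```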
Specific advantages for hosting hardware
With clean NUMA planning, I increase the density per host without sacrificing response times. In many data centers, this allows significantly more VMs to be operated per socket while applications still respond reliably. Shorter latency directly contributes to user experience and batch throughput. Costs per workload are reduced because CPU time and RAM are used more efficiently. Those who make informed hardware choices also benefit from modern high-performance web hosting hardware with high memory bandwidth.
Workload tuning: databases, caches, containers
I make sure that databases keep their heaps local and run worker threads on "their" node. For SQL engines, in-memory caches, and JVMs, it is worth pinning CPUs and reserving memory. Container orchestration benefits from node affinities so that pods use the shortest memory paths. For heavy I/O, I rely on NUMA-local NVMe placement to keep data close to the node. This keeps hot paths short and response times friendly.
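Where a service should be bound at launch time, a small wrapper around numactl illustrates the idea; the memcached command, user, and ports are placeholders for whatever engine you run:

```python
# Sketch: start a service bound to one NUMA node via numactl (Linux).
# --cpunodebind and --membind restrict CPUs and memory to that node.
import subprocess

def launch_on_node(cmd: list[str], node: int) -> subprocess.Popen:
    """Start cmd with CPUs and memory restricted to the given NUMA node."""
    return subprocess.Popen(
        ["numactl", f"--cpunodebind={node}", f"--membind={node}", *cmd]
    )

if __name__ == "__main__":
    # Hypothetical layout: one cache shard per node, ports chosen for the example.
    shards = [
        launch_on_node(["memcached", "-u", "memcache", "-p", str(11211 + n)], node=n)
        for n in (0, 1)
    ]
    for p in shards:
        p.wait()
```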
Monitoring and troubleshooting with NUMA
I measure latency and remote accesses in a targeted manner instead of just looking at CPU percentages. Tools show me, per node, how many pages are remote and which threads are creating memory pressure. If remote misses increase, I adjust the vCPU size, affinities, or RAM allocation. If throughput remains weak despite high CPU reserves, memory paths are often the cause. For me, visibility at the node level is the fastest way to causes, not just symptoms.
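A quick way to quantify remote allocations is the kernel's per-node numastat interface; this sketch assumes Linux and only reads the sysfs counters, interpreting them the same way numastat(8) does:

```python
# Sketch: report the share of non-local allocations per NUMA node from
# /sys/devices/system/node/node<N>/numastat (Linux).
import glob

for path in sorted(glob.glob("/sys/devices/system/node/node*/numastat")):
    node = path.split("/")[-2]
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    total = stats["numa_hit"] + stats["numa_miss"]
    remote_share = stats["numa_miss"] / total * 100 if total else 0.0
    print(f"{node}: hit={stats['numa_hit']} miss={stats['numa_miss']} "
          f"other_node={stats['other_node']} remote={remote_share:.2f}%")
```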
NUMA spanning: using it correctly
I activate spanning specifically for VMs with very high RAM requirements or exceptional bandwidth needs. The VM can then obtain memory from multiple nodes, which is what makes single instances with a massive footprint possible in the first place. The price is occasional remote access, which I mitigate with CPU affinities and a larger share of local pages. For mixed loads, I prefer several medium-sized VMs over one very large instance. This way, day-to-day operations remain predictable.
Licensing, density, and real costs
I evaluate costs not at the host level, but per workload and month in euros. When NUMA increases VM density, fixed costs per instance decrease and performance reserves grow. This affects per-core licenses as well as support and energy costs. Reducing remote access shortens computing time and saves energy for the same task. In the end, what counts is the overall balance per result, not just per server.
Reading hardware topology and interconnects correctly
I factor the physical topology actively into my planning. Modern servers use multi-part CPU designs and connect chiplets or dies via interconnects. This means that not every core has the same path to every RAM module, and even within a socket there are preferred paths. The more traffic that runs over the cross-socket links, the greater the increase in latency and coherency overhead. I therefore check how many memory channels are active per node, whether all DIMM slots are populated symmetrically, and how the nodes are connected on the motherboard. Sub-NUMA features that divide nodes into smaller domains can equalize hotspots if workloads are clearly segmented. I also watch the L3 topology: if threads and their data sit in different cache domains, the cache transfers alone have a noticeable impact on performance. A simple bandwidth test and a topology overview quickly show whether the platform delivers the expected locality or whether interconnects become a bottleneck.
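To get the topology overview mentioned above, the kernel already exposes CPU lists, memory sizes, and the distance matrix per node; a short Linux-only sketch:

```python
# Sketch: dump the NUMA topology under /sys/devices/system/node:
# CPU list, memory size, and the relative distance matrix per node
# (10 = local, larger values = farther over the interconnect).
import glob
import os

for n in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    name = os.path.basename(n)
    with open(f"{n}/cpulist") as f:
        cpulist = f.read().strip()
    with open(f"{n}/distance") as f:
        distances = f.read().split()
    with open(f"{n}/meminfo") as f:
        memtotal_kb = int(f.readline().split()[-2])   # "Node N MemTotal: X kB"
    print(f"{name}: cpus={cpulist} mem={memtotal_kb // 1024} MiB distances={distances}")
```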
Firmware and BIOS options with effect
I make sure in the BIOS that node interleaving is disabled so that the NUMA structure remains visible. I use sub-NUMA clustering or comparable modes specifically when a host runs many medium-sized, clearly separated workloads. For consistent latencies, I choose performance-oriented power profiles, limit deeper C-states, and avoid aggressive core parking. I populate memory for full channel bandwidth; asymmetrical DIMM configurations have a direct impact on throughput and latency. I also check prefetcher and RAS options: some protection mechanisms increase latency without benefiting the workload. Important: I test every BIOS adjustment under real load, because micro-effects caused by caches and interconnects often only become apparent under pressure.
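After BIOS changes I verify from the OS side that the topology still looks as intended; this Linux sketch only counts visible nodes and lists the C-states exposed via cpuidle:

```python
# Sketch: sanity checks after firmware changes. With node interleaving
# enabled, the kernel would typically expose only a single NUMA node;
# available C-states show up under cpuidle.
import glob
import os

nodes = glob.glob("/sys/devices/system/node/node[0-9]*")
hint = " -- check whether node interleaving is really disabled!" if len(nodes) < 2 else ""
print(f"NUMA nodes visible: {len(nodes)}{hint}")

for state in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*")):
    with open(os.path.join(state, "name")) as f:
        name = f.read().strip()
    with open(os.path.join(state, "disable")) as f:
        disabled = f.read().strip() == "1"
    print(f"cpu0 {os.path.basename(state)}: {name} ({'disabled' if disabled else 'enabled'})")
```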
Guest OS and runtime tuning: from first touch to huge pages
In the guest, I use first-touch allocation to my advantage: threads initialize "their" memory so that pages are created locally. Under Linux, I enable or disable automatic NUMA balancing depending on the workload; database-heavy systems often benefit from stable binding, while distributed web workers can tolerate some migration. With numactl or task pinning, I bind services to nodes and define membind policies. Huge pages reduce TLB pressure; for latency-critical databases, I prefer static huge pages and warm memory up front (pre-touch) to avoid page fault spikes. Depending on the engine, I run transparent huge pages in "madvise" mode or disable them if they cause defragmentation latencies. I control IRQ affinities and distribute network and NVMe interrupts to the appropriate nodes; RPS/XPS and multiple queues help keep data paths consistent. On Windows, I use processor groups and soft NUMA in the stack, ensure "lock pages in memory" for memory-intensive services, and enable server GC on .NET. For JVMs, I use NUMA-aware heuristics, pre-touch heaps, and control thread affinity so that GC and workers use the same nodes.
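The first-touch idea in a nutshell: pin first, then allocate and touch. The sketch below is Linux-only; the buffer size and the two-node loop are arbitrary example values:

```python
# Sketch: each worker pins itself to the CPUs of one node before it
# allocates and touches its buffer, so the kernel backs the pages with
# local memory (first touch).
import mmap
import os
from multiprocessing import Process

PAGE = 4096
SIZE = 512 * 1024 * 1024   # assumed buffer size per worker

def cpus_of_node(node: int) -> set[int]:
    """Parse /sys/devices/system/node/node<N>/cpulist, e.g. '0-9,20-29'."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus: set[int] = set()
        for part in f.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def worker(node: int) -> None:
    os.sched_setaffinity(0, cpus_of_node(node))   # pin before allocating
    buf = mmap.mmap(-1, SIZE)                     # anonymous mapping, pages faulted in lazily
    for off in range(0, len(buf), PAGE):
        buf[off:off + 1] = b"\x01"                # first touch: page lands on the local node

if __name__ == "__main__":
    procs = [Process(target=worker, args=(n,)) for n in (0, 1)]
    for p in procs: p.start()
    for p in procs: p.join()
```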
Cleanly align hypervisor-specific settings
I align the vNUMA topology with the physical structure. I select the parameters "sockets," "cores per socket," and "threads per core" so that the hypervisor does not split the VM across nodes. For latency-sensitive instances, I reserve RAM so that neither ballooning nor swapping occurs, and I secure pCPU resources via affinity or appropriate scheduler options. Be careful with CPU or memory hot add: many platforms disable vNUMA in the guest, resulting in hidden remote access. I plan live migration so that target hosts have a compatible NUMA topology, and I give VMs time after migration to rebuild page locality (pre-touch, warm-up). In KVM environments, I use the NUMA tuning options and cpuset cgroups; on other hypervisors, esxtop or similar tools help to see vCPU distribution and node hits in real time.
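As a planning aid, here is a hedged sketch that derives a virtual socket layout which does not split a socket across nodes; the 20-core node size is an assumption, and real hypervisors add constraints of their own:

```python
# Sketch: suggest a "sockets x cores per socket" layout for a VM so that
# each virtual socket fits into one physical NUMA node.

def vnuma_layout(vcpus: int, cores_per_node: int = 20) -> tuple[int, int]:
    """Return (virtual_sockets, cores_per_socket) without crossing node boundaries."""
    if vcpus <= cores_per_node:
        return 1, vcpus                    # single virtual socket, all local
    sockets = -(-vcpus // cores_per_node)  # ceiling division: one socket per node
    if vcpus % sockets:
        raise ValueError("choose a vCPU count divisible by the number of virtual sockets")
    return sockets, vcpus // sockets

for vcpus in (8, 16, 40):
    print(vcpus, "vCPUs ->", vnuma_layout(vcpus))
```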
Don't waste PCIe and I/O locality
I assign NVMe drives, HBAs, and NICs to the node on which the computing threads run. I bind SR-IOV or vNIC queues to cores of the same node and steer interrupts accordingly. For high packet rates, I scale receive/transmit queues and distribute them consistently across the local cores. For storage stacks, I make sure that worker threads for I/O submissions and completions run on the same node so that the data path does not cross the interconnect. I also plan multipathing and software RAID on a node-specific basis; a "shorter" path almost always beats a "wider" path with remote accesses. This reduces jitter and keeps CPU time for I/O where it has an effect.
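Device locality can be checked directly in sysfs before queues and IRQs are pinned; this sketch assumes Linux and the common device/numa_node attribute:

```python
# Sketch: show which NUMA node NVMe drives and NICs are attached to.
# A value of -1 means the platform did not report a node.
import glob
import os

def device_node(path: str) -> int:
    try:
        with open(path) as f:
            return int(f.read().strip())
    except OSError:
        return -1

for nvme in sorted(glob.glob("/sys/block/nvme*n1")):
    node = device_node(os.path.join(nvme, "device", "numa_node"))
    print(f"{os.path.basename(nvme)}: node {node}")

for nic in sorted(glob.glob("/sys/class/net/*")):
    numa_path = os.path.join(nic, "device", "numa_node")
    if os.path.exists(numa_path):          # virtual interfaces have no device/ link
        print(f"{os.path.basename(nic)}: node {device_node(numa_path)}")
```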
Capacity planning, overcommit, and memory features
I prefer to run latency-oriented workloads without RAM overcommit and with only moderate vCPU overcommit. Ballooning, compression, and hypervisor swapping generate remote accesses or page fault spikes, exactly what I want to avoid. Transparent page sharing is ineffective in many setups and can obscure the view of true locality. I calibrate the VM mix so that multiple memory-bandwidth-hungry instances do not collide on the same node. For in-memory engines, I plan generous reservations and, where appropriate, huge pages in the guest that the hypervisor can pass through. This keeps TLB hit rates and access times predictable.
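A simple reservation ledger per node keeps memory overcommit visible during planning; VM names, node assignments, and the 256 GB node size are made-up example data:

```python
# Sketch: verify that planned memory reservations per NUMA node do not
# exceed the node's capacity.
from collections import defaultdict

NODE_RAM_GB = 256                                  # assumed capacity per node
planned = [                                        # (vm_name, node, reserved_gb), example data
    ("db01", 0, 192), ("cache01", 0, 48),
    ("db02", 1, 192), ("web01", 1, 32),
]

per_node = defaultdict(int)
for _name, node, gb in planned:
    per_node[node] += gb

for node, used in sorted(per_node.items()):
    headroom = NODE_RAM_GB - used
    status = "OK" if headroom >= 0 else "OVERCOMMITTED"
    print(f"node {node}: reserved {used} of {NODE_RAM_GB} GB ({status}, headroom {headroom} GB)")
```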
Live migration and high availability
I take into account that a migration temporarily destroys the page locality of a VM. After the move, I warm up critical heaps and let background jobs rebuild the hot sets. I plan target hosts with a similar NUMA topology so that vNUMA does not have to be recut. For HA cases with heterogeneous hardware, I define policies: either I accept higher latency for a short time, or I prioritize hosts with compatible node sizes. It is important to monitor after migration: if the share of remote pages increases, I adjust affinities or trigger pre-faulting until locality fits again.
Practical diagnostic patterns
I recognize typical NUMA problems by a few patterns: the CPU runs "hot," but instructions per cycle remain low; latency jumps in waves; individual threads block on memory accesses even though cores are free. In such cases, I look at remote hits, interconnect utilization, TLB misses, and the distribution of active threads per node. I correlate interrupt load with the cores carrying the application and check whether cache lines are constantly being invalidated between nodes. A simple cross-check is to confine the VM to one node: if latencies drop immediately, spanning or scheduling was the cause. Similarly, dedicated tests reveal the RAM bandwidth per node and show whether the DIMM configuration or BIOS options are slowing things down.
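For the "hot CPU, low IPC" pattern, a quick check with perf stat is often enough; this sketch shells out to perf (Linux, sufficient perf_event permissions assumed) and keeps the number parsing deliberately simple:

```python
# Sketch: compute instructions per cycle for a running process using
# "perf stat". Low IPC at high CPU utilization hints at memory-bound code.
import re
import subprocess
import sys

def ipc_of(pid: int, seconds: int = 5) -> float:
    result = subprocess.run(
        ["perf", "stat", "-e", "cycles,instructions", "-p", str(pid),
         "sleep", str(seconds)],
        capture_output=True, text=True, check=True,
    )
    counts = {}
    for line in result.stderr.splitlines():            # perf stat prints to stderr
        m = re.match(r"\s*([\d,.]+)\s+(cycles|instructions)", line)
        if m:
            counts[m.group(2)] = int(m.group(1).replace(",", "").replace(".", ""))
    return counts["instructions"] / counts["cycles"]

if __name__ == "__main__":
    pid = int(sys.argv[1])
    print(f"IPC of PID {pid}: {ipc_of(pid):.2f}")
```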
Practical checklist
- Understanding topology: nodes, memory channels, PCIe mapping, cache domains
- Check BIOS: node interleaving off, performance power profile, limit deep C-states
- Size VMs: vCPUs per VM ≤ node size, vNUMA correct, watch out for hot add
- Secure RAM: Reservations for latency workloads, huge pages where appropriate
- Set affinity: Bind threads, IRQs, and I/O queues to the same node
- Containers/Pods: Utilize node affinity, CPU managers, and topology awareness
- Spanning only where needed: Support large instances with policies and monitoring
- Plan migration: suitable target topology, pre-touch heaps, observe locality
- Sharpen monitoring: remote access, bandwidth per node, interconnect utilization
- Test regularly: Bandwidth/latency checks after firmware or host changes