
Optimizing server process affinity and NUMA awareness in hosting

I improve server performance by combining process affinity with NUMA awareness, arranging threads, cores and memory so that they sit close to one another. This reduces latency, increases throughput and keeps response times consistent in hosting environments that run many applications.

Key points

Before I make any specific settings, I clarify my goals, workload patterns and the existing hardware topology. I analyze which threads are particularly memory-hungry and which processes need short response times. I consider how many cores are available per NUMA node and how much local RAM is there. I plan to bundle services node by node so that CPU locality is maintained. I measure every change with benchmarks and monitoring to avoid false assumptions.

  • Affinity: Bind processes to core groups
  • NUMA: Keep memory local
  • Topology: Scale node by node
  • Monitoring: Make remote accesses visible
  • Hosting: Control hypervisor placement

What does Process Affinity mean on the server?

With process affinity, I specify which CPU cores a process or thread runs on instead of letting the operating system decide freely. This keeps cache contents warm, which reduces cache misses and core migrations. I pin threads so that they use their L1/L2/L3 caches effectively and don't jump between cores. This makes latencies more predictable under high load and ensures even utilization of the reserved cores. For a practical introduction, I recommend the guide to CPU affinity in hosting, which I use to compare typical pinning variants.
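
A minimal sketch of pinning from inside a Python process on Linux, using the standard library calls os.sched_getaffinity and os.sched_setaffinity. The helper name pin_to_cpus is my own; the code only ever narrows the set of CPUs the process is already allowed to use.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict `pid` (0 = the calling process) to the given CPU ids.

    Returns the affinity set that is in effect afterwards.
    """
    allowed = os.sched_getaffinity(pid)
    # Never request CPUs outside the currently allowed set (cpusets, containers).
    target = set(cpus) & allowed
    if not target:
        raise ValueError(
            f"none of {sorted(cpus)} is in the allowed set {sorted(allowed)}"
        )
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)
```

Calling pin_to_cpus({2, 3}) early in a worker's startup keeps that worker on cores 2 and 3, provided both are in its allowed set. Threads and child processes created afterwards inherit the mask.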

Understanding NUMA: local vs. remote access

NUMA divides main memory into nodes, each of which is closely coupled to a specific CPU socket. A thread accesses local RAM faster than remote memory on other nodes. This asymmetry has a significant impact on real workloads, especially with many cores and a large amount of RAM. I therefore assign threads and their memory accesses to a common node to reduce latencies and increase bandwidth. If you want to delve deeper into the topology, check out the practical tips on NUMA nodes in the server and then measure the effects under everyday load.

NUMA awareness in the operating system and app

I enable NUMA awareness in the operating system, hypervisor and application so that memory is allocated locally. If possible, I keep the threads of an instance on cores of the same NUMA node instead of distributing them across nodes. I prefer to create large heaps or buffers in local RAM so that expensive remote accesses remain rare. If an application has several workers, I structure them node by node in pools to avoid interference. This creates a clear allocation of CPU and memory, which noticeably reduces response times.

Interaction between Affinity and NUMA

Affinity without a NUMA-aware memory policy wastes potential if the memory is located on remote nodes. Likewise, NUMA awareness is of little use if the scheduler moves threads frequently. I therefore bind threads to cores of a specific node and ensure local memory allocation at the same time. If I scale the application, I first fill a node before including further nodes. This coupling of core pinning and memory policy produces constant latency profiles under load.
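
The "fill one node first" rule can be sketched as a small placement function. The function name, the topology dictionary and the reserve_per_node parameter are assumptions for illustration; the reserved cores reflect my habit of leaving buffer cores per node.

```python
def assign_workers(topology, workers, reserve_per_node=1):
    """Map worker index -> (node, cpu), filling one node before the next.

    topology: {node_id: [cpu ids]}.
    reserve_per_node: cores per node that stay free as a buffer.
    """
    placement = {}
    worker = 0
    for node in sorted(topology):
        cpus = sorted(topology[node])
        # Keep the lowest-numbered cores of each node unpinned.
        usable = cpus[reserve_per_node:]
        for cpu in usable:
            if worker == workers:
                return placement
            placement[worker] = (node, cpu)
            worker += 1
    if worker < workers:
        raise ValueError(f"only {worker} pinnable cores for {workers} workers")
    return placement
```

With two nodes of four cores each and four workers, three workers land on node 0 and only the fourth spills over to node 1. The resulting map can feed the actual pinning and the memory policy of each worker.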

Hardware and firmware tuning (UEFI/BIOS)

For affinity and NUMA tuning to take effect, I first create a stable baseline in the firmware. I prefer consistent performance modes over aggressive energy-saving options so that clock and latency fluctuations are minimized. Important points that I check:

  • Power profile: Maximum performance instead of balanced; restrict deep C-states if latency is more critical than efficiency.
  • Turbo/boost strategy: Deterministic boost on demand to avoid fluctuating P-states.
  • SMT/Hyper-Threading: Test per workload; for hard latency SLAs, I often pin critical threads to physical cores and keep their SMT siblings free.
  • Memory interleaving: Disable interleaving across nodes so that the NUMA nodes remain clearly separated.
  • Memory channels: Symmetrical configuration of the DIMM slots per node for maximum bandwidth.

Configuration path: Analysis to pinning

I start by recording the topology, typically with lscpu, numactl --hardware or hwloc. I then define the required number of cores for each service and assign them to a node. I implement the pinning with taskset or via systemd options so that the assignment remains reproducible. During the test, I adjust the size of the core groups until latency and throughput are well balanced. I make sure that CPU-intensive services do not share the same core pool and evict each other's cache contents.
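
As a complement to the command-line tools, here is a minimal sketch that reads the node-to-core mapping from Linux sysfs. The path /sys/devices/system/node and the cpulist files are standard on NUMA-capable Linux kernels; the function names are mine.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_topology(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]}; empty dict if the path does not exist."""
    topology = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return topology
    for node_dir in sorted(root.glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        topology[node_id] = parse_cpulist((node_dir / "cpulist").read_text())
    return topology
```

I store the resulting dictionary together with the service-to-node assignment so that later changes can be compared against the recorded topology.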

On Linux, I like to set affinity and memory policy declaratively via cgroups (v2): I define cpuset.cpus and cpuset.mems per node and start services with systemd parameters such as CPUAffinity=, NUMAPolicy= and NUMAMask=. I keep separate pools for batch or secondary processes so that they do not spill onto the cores of the latency-critical tier. For recurring jobs, I plan exact start windows in which cores are free.
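
A small sketch of how such a declarative setting could be generated. It renders a systemd drop-in with the directives CPUAffinity=, NUMAPolicy= and NUMAMask=; choosing NUMAPolicy=bind is my assumption about the desired policy, and the function name is hypothetical.

```python
def render_dropin(cpus, numa_node):
    """Render a [Service] drop-in that pins CPUs and binds memory to one node."""
    cpu_list = ",".join(str(c) for c in sorted(cpus))
    return "\n".join([
        "[Service]",
        f"CPUAffinity={cpu_list}",
        "NUMAPolicy=bind",          # assumption: strict binding, not 'preferred'
        f"NUMAMask={numa_node}",
        "",                         # trailing newline
    ])
```

The output would go into a drop-in file under the unit's .service.d directory, followed by a daemon reload and a service restart. Keeping the generator in version control makes the assignment reproducible across hosts.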

Interrupt and I/O affinity

It is not only app threads that need locality; interrupts and I/O paths should also stay close to the node:

  • Network: Bind RX/TX queues of a NIC to cores of the same NUMA node (configure RSS/XPS) so that packet processing and app threads share cache and RAM locality.
  • Storage: Pin NVMe queues and IO threads per node; check the queue distribution for blk-mq so that hot volumes do not cross nodes.
  • irqbalance: Either configure specifically or deactivate for critical queues and set manually via smp_affinity.
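
For the manual route, /proc/irq/<n>/smp_affinity expects a hexadecimal CPU bitmask, written in comma-separated 32-bit groups on larger machines. The following sketch converts a core list into that format; the function name is my own, and writing the value requires root and a known IRQ number.

```python
def cpus_to_smp_affinity(cpus):
    """Convert CPU ids into the hex bitmask format used by smp_affinity."""
    cpus = list(cpus)
    if not cpus:
        raise ValueError("at least one CPU is required")
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu              # bit n stands for CPU n
    hex_digits = f"{mask:x}"
    # Pad to a multiple of 8 hex digits (32 bits) and split into groups.
    hex_digits = "0" * (-len(hex_digits) % 8) + hex_digits
    groups = [hex_digits[i:i + 8] for i in range(0, len(hex_digits), 8)]
    return ",".join(groups)
```

For cores 0 to 3 this yields 0000000f, and for core 32 it yields 00000001,00000000. Where available, the sibling file smp_affinity_list accepts a plain core list such as 0-3 instead.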

Targeted use of operating system features

I deliberately use kernel features for strict latency profiles:

  • isolcpus/nohz_full/rcu_nocbs: Decouple cores from general scheduling, minimize tick load and relocate RCU callbacks - ideal for high-perf threads.
  • Scheduler policies: Use SCHED_FIFO/RR sparingly for real-time shares; otherwise use CFS with tight affinity.
  • Auto NUMA Balancing: Often deactivated for strictly pinned workloads so that the kernel does not move memory.
  • Transparent Huge Pages: Mostly set to madvise and use explicit Huge Pages for really large heaps to reduce TLB misses.
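
Before changing these switches, I find it useful to record their current state. This sketch reads the Transparent Huge Pages mode and the automatic NUMA balancing flag from their usual Linux locations; the function names are mine, and missing files are reported as None.

```python
from pathlib import Path


def active_thp_mode(text):
    """Extract the bracketed choice from e.g. 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode marked in {text!r}")


def _read(path):
    """Return the file content, or None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


def read_tuning_state():
    """Return the current THP mode and NUMA balancing flag (None if unknown)."""
    state = {"thp": None, "numa_balancing": None}
    thp = _read("/sys/kernel/mm/transparent_hugepage/enabled")
    if thp is not None:
        state["thp"] = active_thp_mode(thp)
    balancing = _read("/proc/sys/kernel/numa_balancing")
    if balancing is not None:
        state["numa_balancing"] = balancing.strip() != "0"
    return state
```

Logging this state before and after a tuning step shows which kernel settings a benchmark result actually belongs to.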

NUMA-aware memory policy

With numactl, I either bind memory allocation strictly to the local node or use softer policies such as preferred and interleave. Where possible, I keep large in-memory structures such as database buffer pools within a node. If the memory requirement increases, I watch the growth in remote accesses and react by segmenting or sharding. I take practical tuning hints from guidelines on NUMA balancing and then confirm them with load tests. This keeps the memory access time low and predictable.
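
To make the policy choice explicit, here is a sketch that builds the numactl command line for a service. The options --cpunodebind, --membind, --preferred and --interleave are regular numactl options; the function name and the my-service command in the usage note are hypothetical.

```python
def numactl_argv(command, node, policy="bind"):
    """Build an argv list that starts `command` on one NUMA node.

    policy: 'bind' (strict), 'preferred' (soft) or 'interleave' (all nodes).
    """
    if policy == "bind":
        memory = [f"--membind={node}"]
    elif policy == "preferred":
        memory = [f"--preferred={node}"]
    elif policy == "interleave":
        memory = ["--interleave=all"]
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    # '--' separates numactl options from the command and its arguments.
    return ["numactl", f"--cpunodebind={node}", *memory, "--", *command]
```

numactl_argv(["my-service", "--port", "8080"], node=1, policy="preferred") produces a list that can be passed directly to subprocess.run. I start with preferred when I am unsure whether the node has enough RAM, because bind fails allocations instead of falling back.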

Memory techniques: huge pages, heaps and garbage collection

Memory management often determines P99 latencies. I use huge pages where large, long-lived heaps dominate (e.g. DB buffers, JVM heaps). This reduces TLB misses and page walks. For JVM workloads, I pay attention to the heap size per node and enable the JVM's NUMA support (for example the HotSpot flag -XX:+UseNUMA) so that GC threads and heaps remain local. For .NET and Go, I size GC threads and goroutine pools so that they do not spread across cores of several nodes in an uncontrolled manner. In databases, I split large buffer pools into node-local segments or run multiple smaller instances per node.
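
A quick sizing aid for explicit huge pages: the sketch below calculates how many pages a heap needs and reads the page size from the Hugepagesize line of /proc/meminfo. The function names are mine, and the 2048 kB default is an assumption that holds on typical x86-64 systems.

```python
def hugepages_needed(heap_bytes, hugepage_kib=2048):
    """Number of huge pages required to back `heap_bytes` (rounded up)."""
    page_bytes = hugepage_kib * 1024
    return -(-heap_bytes // page_bytes)   # integer ceiling division


def parse_hugepagesize_kib(meminfo_text):
    """Return the value of the 'Hugepagesize:' line in kB, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("Hugepagesize:"):
            return int(line.split()[1])
    return None
```

An 8 GiB heap needs 4096 pages of 2 MiB. I reserve them per node rather than globally, so the heap of a pinned instance does not end up backed by pages on a remote node.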

Hosting practice: typical workloads

Databases, caches and large application servers react sensitively to CPU locality and memory latency. A VM spread across several NUMA nodes lengthens compute and memory paths and slows down queries or API calls. I therefore place VMs so that their vCPUs are assigned to a single physical node and the memory remains there. Container pools are given consistent CPU sets so that workers do not jump across nodes. This care pays off especially for e-commerce and API services with high parallelism.
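
The placement rule can be expressed as a simple check. This sketch picks a node that can hold a VM completely; the data layout, the function name and the best-fit choice (least leftover cores) are my assumptions, not a hypervisor API.

```python
def pick_node_for_vm(vcpus, mem_gib, nodes):
    """Return the id of a node that fits the whole VM, or None.

    nodes: {node_id: {"free_cpus": int, "free_mem_gib": float}}
    """
    candidates = [
        (info["free_cpus"] - vcpus, node)
        for node, info in nodes.items()
        if info["free_cpus"] >= vcpus and info["free_mem_gib"] >= mem_gib
    ]
    if not candidates:
        return None        # the VM would have to span nodes
    # Best fit: the node with the fewest cores left over after placement.
    return min(candidates)[1]
```

A None result is the signal to either shrink the VM or to split the workload into several node-local instances.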

Fine-grained app strategies

At application level, I decouple nodes so that locality is maintained:

  • Worker pools: One pool per NUMA node, each with a local queue to avoid cross-node communication.
  • Sharding: Keep data and sessions node-local; select hashing so that hot shards do not cross multiple nodes.
  • Caches: Replicated instead of centralized; readers prefer node-local copies.
  • Thread pinning in runtimes: For network stacks (e.g. Netty) and DB clients, bind workers to fixed cores, observe IRQ proximity.
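
For the sharding point, a minimal sketch of a stable key-to-node mapping. It uses zlib.crc32 because Python's built-in hash() for strings varies between process runs; the function name is mine, and simple modulo hashing is an assumption that reshuffles keys when the node count changes.

```python
import zlib


def node_for_key(key, node_ids):
    """Map a session or shard key to one NUMA node with a stable hash."""
    nodes = sorted(node_ids)
    if not nodes:
        raise ValueError("at least one node is required")
    digest = zlib.crc32(key.encode("utf-8"))
    return nodes[digest % len(nodes)]
```

Requests for the same session then always reach the worker pool of the same node, so the session data stays in that node's local RAM.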

Monitoring and troubleshooting

Sensible monitoring shows more than overall utilization, because NUMA effects are hidden in per-node values. I monitor CPU load per core and node, memory usage per node and remote access rates. If individual cores are saturated while others remain idle, this indicates a poor affinity setup. If one node's RAM is full while another has reserves, I have to adjust the memory policy or placement. I use these signals to document bottlenecks objectively and derive the next changes.

| Metric | Symptom | Typical cause | Quick action |
| --- | --- | --- | --- |
| CPU per core | Some cores permanently high | Incorrect pinning | Redistribute core groups |
| RAM per node | One node at its limit | Memory not local | Set numactl preferred |
| Remote rate | High remote access | VM/container spans nodes | Bundle vCPUs/CPU set |
| Context switches | Erratic latency | Thread migration | Pin affinity more strictly |
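
One way to get at a remote rate is the per-node numastat file under /sys/devices/system/node/node<N>/ on Linux. The sketch below parses its counters and derives the share of pages that were allocated on a node by processes running elsewhere; the function names are mine, and the ratio is a rough indicator rather than a full access profile.

```python
def parse_numastat(text):
    """Parse 'name value' lines (numa_hit, local_node, other_node, ...)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def remote_share(stats):
    """Share of allocations on this node requested from another node."""
    local = stats.get("local_node", 0)
    other = stats.get("other_node", 0)
    total = local + other
    return other / total if total else 0.0
```

The counters are cumulative since boot, so I sample them at intervals and compare the differences instead of the absolute values.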

Anti-patterns and typical stumbling blocks

I avoid global CPU limits that ignore NUMA, because they allocate cores across nodes. "One big VM" with too many vCPUs also rarely scales linearly; multiple node-local instances are better. Transparent Huge Pages in always mode sometimes causes page fault spikes; madvise plus targeted huge pages is more predictable. An irqbalance that runs uncontrolled dilutes I/O locality. And pinning too strictly without buffer cores can starve maintenance and background tasks, so I always plan a few free cores per node.

Making performance effects measurable

I always measure the effects of affinity and NUMA changes with reproducible benchmarks. Before-and-after comparisons with an identical data set show improvements transparently. I combine synthetic tests with realistic load profiles so that optimizations pay off in everyday use. Key figures such as P95 and P99 latencies are often more meaningful than mean values. This allows me to validate decisions and identify side effects at an early stage.
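
A minimal sketch of such a before-and-after comparison. It uses the nearest-rank definition of a percentile, which is one of several common definitions; the function names are mine and the latency samples are expected as plain numbers, for example milliseconds.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def compare(before, after, points=(50, 95, 99)):
    """Return {percentile: (before_value, after_value)} for both runs."""
    return {p: (percentile(before, p), percentile(after, p)) for p in points}
```

Both runs should use the same data set and load profile, otherwise the difference in P99 says more about the test than about the pinning.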

Virtualization and containers

In hypervisor setups, I use vNUMA so that the guest VM understands the physical topology. I pack the vCPUs of a VM into a physically matching node to minimize remote access. For containers, I define CPU requests and limits so that CPU sets remain consistent and the topology manager respects node locality. I only spread large VMs with many vCPUs across nodes if the application allows internal segmentation. I evaluate each placement based on latency, throughput and utilization per node.

Orchestration: Cgroups, Kubernetes and co.

In containers, I use the Guaranteed or Burstable QoS class with stable CPU sets and mems assignments; exclusive cores are only granted to Guaranteed pods with integer CPU requests. The topology manager with the "single-numa-node" policy helps to keep pods node-local. For long-running latency-critical components, I use the CPU manager with the "static" policy to keep cores exclusive. I schedule HugePages as requests/limits and group pods by workload role so that nodes are not loaded unevenly. Important: maintain node labels cleanly so that placement rules do not break locality unintentionally.
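
With the static CPU manager policy, Kubernetes hands out exclusive cores only to containers in Guaranteed pods whose CPU request is a whole number. The following sketch checks a single container's resources block against that rule. The function names are mine, memory values are compared as plain strings for simplicity, and the pod-level condition (every container must qualify) is left to the caller.

```python
def cpu_millis(quantity):
    """Convert a Kubernetes CPU quantity ('2', '0.5', '1500m') to millicores."""
    q = str(quantity).strip()
    if q.endswith("m"):
        return int(q[:-1])
    return round(float(q) * 1000)


def qualifies_for_exclusive_cpus(resources):
    """True if requests equal limits and the CPU count is a whole number >= 1."""
    limits = resources.get("limits", {})
    # Requests that are not set default to the corresponding limits.
    requests = {**limits, **resources.get("requests", {})}
    for name in ("cpu", "memory"):
        if name not in requests or name not in limits:
            return False
    if cpu_millis(requests["cpu"]) != cpu_millis(limits["cpu"]):
        return False
    if str(requests["memory"]) != str(limits["memory"]):
        return False
    millis = cpu_millis(requests["cpu"])
    return millis >= 1000 and millis % 1000 == 0
```

I run a check like this in CI against the manifests of latency-critical services, so that a change to 1500m does not silently move a pod back into the shared CPU pool.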

Role of the hosting provider

A good provider delivers a transparent NUMA topology, affinity options and insight into node metrics. I make sure that the hypervisor and orchestration take NUMA awareness seriously and that vCPU placement remains controllable. Monitoring that provides CPU, RAM and remote access rates per node is also important. This allows me to decide for myself how strictly I pin and how I set memory policies. This control makes demanding workloads reliable and predictable.

Operating model: introducing changes safely

I introduce pinning and NUMA policies iteratively: first on a node, with clearly defined rollback steps. I document topology, assignments and kernel parameters to ensure reproducibility. For releases, I use canary traffic, monitor P95/P99, context switches and remote rates for at least one full load phase and only then roll out more broadly. This keeps improvements stable and risks manageable.

Best practices, compactly applied

I start every optimization with a thorough topology analysis and document the core and node assignment. I then divide workloads so that the database, cache and app server receive separate node resources. I pin critical processes and set a node-local memory policy before fine-tuning the group size. I accompany every tuning step with benchmarks and node metrics to see the effects clearly. For growth, I plan node by node and keep instances lean instead of inflating one monolithic giant instance.

Summary and next steps

With targeted process affinity and real NUMA awareness, I get noticeably more out of workloads on the same hardware. Clear placement, local memory allocation and consistent measurement of the results are crucial. Bundling VMs and containers close to the node reduces latency and increases throughput. I recommend starting a pilot project on one host, testing affinity and memory policy and adopting the best settings. This way, performance increases step by step without having to buy new servers.
