The server cache hierarchy determines how fast requests reach data across L1/L2/L3, RAM, page cache, object cache and edge layers, and how I choose optimal access patterns to minimize latencies. I show concrete patterns and tuning steps that increase cache hits, reduce misses and measurably lower TTFB.
Key points
The following key aspects frame my practical approach to the server cache hierarchy and the appropriate access patterns.
- Use multiple layers: combine CPU, RAM, page, object and edge caches in a targeted manner
- Master access patterns: read-through, write-through and write-back
- Minimize miss types: reduce compulsory, capacity and conflict misses by design
- Lower TTFB: caching headers, purges and edge nodes close to the user
- Establish monitoring: continuously measure hit rate, evictions and latencies
What a server cache hierarchy does
I always organize caches by proximity to the CPU and by latency. At the top are registers and L1/L2/L3, below that RAM, followed by SSD/HDD and archive storage. The further down I fetch data, the greater the capacity, but the slower the access. That's why I keep frequently used data as close as possible to the computing core and minimize paths. This thinking scales from individual instances to edge nodes in the CDN that cache content close to the user.
CPU to RAM cache: Understanding latencies
I make architectural decisions based on typical sizes and cycle counts because each level has different strengths. L1 delivers data with almost no latency, L2/L3 widen the hit coverage, and RAM absorbs large working sets. Secondary storage moves large data volumes but responds more slowly. If you pay attention to this staggering, you can design algorithms, data structures and server setups that avoid miss chains. This is how the cache hierarchy unfolds its effect during real load peaks.
| Level | Typical size | Latency (cycles) | Typical use |
|---|---|---|---|
| L1 (I/D) | 32–64 KB per core | 1–4 | Hottest instructions/data |
| L2 | 256 KB–1 MB | 10–20 | Working window of the thread |
| L3 (shared) | 2–32 MB | 40–75 | Cross-core buffer |
| RAM | GB to TB | Hundreds | Process and object pools |
| NVMe SSD | Hundreds of GB to TB | Tens to hundreds of thousands | Persistence, hot-set spillover |
I adapt data flows: small, frequented structures target L1, wider sequences benefit from L2/L3, while streams and large files are buffered via RAM. Code layout, prefetching instructions and the working set size determine how well this works. Even a few percentage points higher hit rates are noticeable in every latency measurement. This thinking has a direct impact on TTFB and throughput.
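The locality effect described above can be demonstrated with a small sketch, shown here in Python for brevity (the effect is much stronger in a language with flat memory layouts such as C, since Python lists hold pointers). Both traversals compute the same sum; the row-major loop walks adjacent elements and reuses each fetched cache line, while the column-major loop strides across rows and wastes line fills. The matrix size is illustrative, not tuned to a specific L1/L2.

```python
# Same reduction, two memory-access orders.
N = 512
matrix = [[(i * N + j) % 97 for j in range(N)] for i in range(N)]

def sum_row_major(m):
    # Inner loop over j: consecutive elements, cache-line friendly.
    total = 0
    for i in range(len(m)):
        row = m[i]
        for j in range(len(row)):
            total += row[j]
    return total

def sum_col_major(m):
    # Inner loop over i: one full row stride per step, poor spatial locality.
    total = 0
    for j in range(len(m[0])):
        for i in range(len(m)):
            total += m[i][j]
    return total

assert sum_row_major(matrix) == sum_col_major(matrix)
```

Timing both variants with a profiler makes the miss chains visible; the result stays identical, only the access pattern changes.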
Application caches on the server
I supplement CPU and RAM proximity with application-specific caches because they eliminate bottlenecks directly at the request. OPcache holds precompiled PHP bytecode and saves interpreter time on every call. A page cache delivers finished HTML, completely bypassing PHP and the database on hits. Object caches such as Redis or Memcached park query results and session data in RAM. These layers reduce I/O, lower overhead and significantly increase response speed per request.
I prioritize the page cache for non-personalized routes first, then object caches for expensive queries. Static assets get long TTLs, dynamic views get short ones. In this way, I keep variable areas fresh and save bandwidth at the same time. When performance targets become tighter, I limit PHP startup costs with a persistent OPcache and rely on reuse of data structures. This creates a fast, easily controllable data path to the socket.
Write strategies and access patterns
I choose the pattern to match the workload in order to balance consistency and speed. With read-through, the cache loads from the source on a miss and stores the result, which keeps code clean and deterministic. Write-through writes synchronously to cache and backend, which simplifies read consistency but costs latency. Write-back collects changes in the cache and writes them later in a bundle, which increases throughput but requires care when flushing. I combine these rules depending on the situation: sessions write-through, product lists read-through, metrics write-back.
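The three patterns can be sketched in a few lines; this is a minimal in-memory illustration (the `Backend` class stands in for a database or upstream service, and the counters only exist to make the behavior observable), not a production cache.

```python
class Backend:
    """Stand-in for a database/upstream; counts accesses for illustration."""
    def __init__(self):
        self.data, self.reads, self.writes = {}, 0, 0
    def get(self, key):
        self.reads += 1
        return self.data.get(key)
    def set(self, key, value):
        self.writes += 1
        self.data[key] = value

class Cache:
    def __init__(self, backend):
        self.backend = backend
        self.store = {}
        self.dirty = set()

    def read_through(self, key):
        # Miss: load from the source and keep the result for later hits.
        if key not in self.store:
            self.store[key] = self.backend.get(key)
        return self.store[key]

    def write_through(self, key, value):
        # Synchronous: cache and backend stay consistent, at latency cost.
        self.store[key] = value
        self.backend.set(key, value)

    def write_back(self, key, value):
        # Deferred: mark dirty now, persist later in a bundle.
        self.store[key] = value
        self.dirty.add(key)

    def flush(self):
        # Bundle the deferred writes into one pass over the dirty set.
        for key in self.dirty:
            self.backend.set(key, self.store[key])
        self.dirty.clear()
```

With this shape, a second `read_through` for the same key never touches the backend, and `write_back` keeps the backend write count at zero until `flush()` runs.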
In addition to patterns, I also take cache classes into account. Distributed caches avoid duplicate work across multiple app servers and smooth out load peaks. In the CDN, edge nodes absorb network latency, especially for large assets and recurring routes. With suitable invalidation signals, I ensure freshness without emptying the entire layer. This is how I keep consistency and performance in balance.
Minimize misses: Block sizes, associativity, prefetching
I fight the three C's: compulsory, capacity and conflict misses. Larger cache lines help with sequential scans, while smaller lines pay off for highly scattered accesses. Higher associativity reduces collisions, while targeted prefetching relieves critical paths. Data structures with spatial and temporal locality contribute at all levels. I explain more details about L1-L3 and RAM here: Making sensible use of CPU caches.
I arrange objects in memory so that fields accessed together fall into the same cache line. I dimension hash tables so that collision rates stay low. I avoid heavy pointer chasing or bundle it into batches. I use profiling to see where miss chains occur and remove them specifically. The result is more hits per cycle and fewer wasted cycles.
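The hash-table dimensioning mentioned above comes down to keeping the load factor bounded. A minimal sketch, assuming uniform hashing and a power-of-two bucket count (the 0.7 threshold is a common rule of thumb, not a universal constant):

```python
def load_factor(n_keys, n_buckets):
    # Keys per bucket on average; collisions grow quickly above ~0.7.
    return n_keys / n_buckets

def expected_occupied(n_keys, n_buckets):
    # Expected occupied buckets under uniform hashing: m * (1 - (1 - 1/m)^n).
    return n_buckets * (1 - (1 - 1 / n_buckets) ** n_keys)

def next_size(n_keys, max_load=0.7):
    # Smallest power-of-two bucket count that keeps load below max_load.
    buckets = 1
    while n_keys / buckets > max_load:
        buckets *= 2
    return buckets

# 1000 keys: 1024 buckets would mean load ~0.98, so grow to 2048 (~0.49).
assert next_size(1000) == 2048
```

Sizing this way up front avoids rehash storms on hot paths and keeps probe chains short, which is exactly what the cache lines reward.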
Tuning for web servers: Headers, TTL, Purge
I control cache behavior via headers and clear life cycles. Cache-Control, Expires, ETag and Vary define how intermediaries and browsers handle content. For HTML I set short TTLs plus event-driven purges, for assets long TTLs with a hash in the file name. A clean purge targets only affected routes and protects the rest. I pay explicit attention to the kernel page cache, because the Linux page cache serves many files before the web server even reaches them in userland.
I also check how upstream and downstream caches interact. Vary on Accept-Encoding, Cookie or Authorization prevents incorrect reuse. For personalized content, I work with hole-punching or edge-side includes so that only dynamic sections are freshly calculated. Where sessions are mandatory, I exclude these routes from the page cache. These measures keep responses consistent and still fast.
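The header policy above can be captured in one small decision function. This is a hypothetical helper sketched in Python (the routes, TTL values and parameter names are illustrative, not a specific framework's API); the header values themselves are standard HTTP.

```python
def cache_headers(route, asset_hash=None, authenticated=False):
    """Return response headers for a route under the policy described above."""
    if authenticated:
        # Personalized views: never stored by shared caches.
        return {"Cache-Control": "private, no-store"}
    if asset_hash:
        # Hash-named asset: content-addressed, safe to cache for a year.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    # Public HTML: short TTL plus stale-while-revalidate, Vary kept minimal.
    return {
        "Cache-Control": "public, max-age=60, stale-while-revalidate=120",
        "Vary": "Accept-Encoding",
    }
```

Keeping Vary down to `Accept-Encoding` on public HTML is deliberate: every additional Vary dimension multiplies the number of cached variants and dilutes the hit rate.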
WordPress practice: Redis, OP cache and page cache
I reduce TTFB by enabling OPcache, a page cache and Redis for object caching. Plugins that deliver HTML statically save CPU and database time on each hit. Redis intercepts recurring queries and keeps results in RAM. A reverse proxy such as NGINX or an edge layer shortens the network route to the user. If you want an overview, you can find the most important levels at Caching levels in hosting.
I strictly separate public routes (cacheable) from personalized views (no-cache). Cookies and headers decide what the proxy passes on and what it delivers from memory. For content updates, I initiate targeted purges instead of global flushes. This keeps archive pages long-lived, while fresh articles appear immediately. The result is constant response times even during traffic peaks.
Monitoring and key figures
I make data-driven decisions and measure everything related to caching. Central metrics are hit rate, miss rate, latency distribution, evictions, RAM consumption and network RTT. A hit rate above 95% for pages and above 90% for objects indicates a healthy setup. If the value drops, I check TTLs, set size, key distribution and the eviction strategy. LRU, LFU or ARC fit better or worse depending on the access pattern.
I analyze time windows in which evictions increase and then selectively enlarge the relevant pools. Dashboards with correlations from app logs, proxy stats and CDN metrics show bottlenecks in context. I evaluate error rates and re-validations separately from hard misses. Then I optimize step by step so that I don't inadvertently cold-switch hotpaths. This routine saves me a lot of nightly deployments.
Cleanly solve consistency and invalidation
I define clear rules for when caches lose or renew content. Event-based purges for publications, price changes or stock levels ensure freshness. For regular pages, I use TTLs as a safety net so that old entries disappear automatically. I render personalized components via ESI or Ajax and keep the rest cacheable. Cookies serve as a switch to determine which parts of a route may be served from the cache.
I minimize full cache flushes because they cost performance and cause cold starts. Segmentation by site area, client or language version limits the scope of each invalidation and increases accuracy. I trigger edge invalidations in batches to comply with CDN rate limits. This creates a predictable lifecycle for each piece of content, and consistency is guaranteed without sacrificing performance.
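Segment-wise purging is usually implemented with tags: every entry carries the segments it belongs to, and an event invalidates one tag instead of flushing the store. A minimal in-memory sketch (real object caches like Redis would map tags to key sets the same way; the class and tag names here are illustrative):

```python
class TaggedCache:
    def __init__(self):
        self.entries = {}  # key -> cached value
        self.tags = {}     # tag -> set of keys in that segment

    def set(self, key, value, tags=()):
        self.entries[key] = value
        for tag in tags:
            self.tags.setdefault(tag, set()).add(key)

    def purge_tag(self, tag):
        # Event-based purge: drop only one segment's keys and leave
        # the rest of the store warm. (Stale memberships in other tag
        # sets are tolerated in this sketch; pop() handles them.)
        for key in self.tags.pop(tag, set()):
            self.entries.pop(key, None)
```

A price-change event then purges only `section:shop` while archive pages stay cached, which is exactly the small blast radius the text argues for.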
Practical check: typical TTFB scenarios
I often observe similar patterns in projects with performance problems. Without caching, every request ends up in PHP and the database, which pushes TTFB beyond 500 ms. With OPcache, PHP time is often halved; a page cache eliminates it completely on hits. Redis reduces database load and noticeably accelerates repeated views. An edge layer shortens the network distance and brings TTFB down to double-digit milliseconds.
I start with clean miss analyses and scale layer by layer. NVMe storage reduces backend latencies; sufficient RAM feeds the object and file system caches. Reverse proxies encapsulate heavy upstream services and deliver assets directly. I use regular measurement windows to ensure that optimizations have a lasting effect. In this way, stability and speed grow together.
Key design, TTL and segmentation
I design keys in such a way that they both minimize collision risks and simplify purges. A consistent naming scheme with prefixes for client, environment, language and resource type (e.g. tenant:env:lang:route:vN) allows targeted invalidations and prevents "blind" flushes. Version tags (vN) help me to delete old entries immediately without emptying the entire store.
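The scheme from the text fits in one small builder function; this is a sketch (the route normalization rules are my assumption, chosen so that equivalent URLs do not produce duplicate keys):

```python
def cache_key(tenant, env, lang, route, version):
    # Scheme from the text: tenant:env:lang:route:vN.
    # Normalize the route so "/Produkte/" and "produkte" share one entry.
    route = route.strip("/").lower() or "home"
    return f"{tenant}:{env}:{lang}:{route}:v{version}"

assert cache_key("shop1", "prod", "de", "/Produkte/", 3) == "shop1:prod:de:produkte:v3"
```

Bumping `version` on a schema or template change makes all old entries unreachable at once; they then age out via TTL without a store-wide flush.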
I differentiate between soft and hard lifetimes. A soft TTL defines how long an entry counts as fresh, a hard TTL its absolute expiry. In between, I use grace periods, stale-if-error and stale-while-revalidate to keep responding quickly under load or during upstream errors. For product detail pages, for example, I choose a soft TTL of 60-120 s plus grace; for price and stock data, short TTLs and event-based purges. This keeps user perception fast while maintaining consistency.
I segment large caches along the access behavior: hot sets with short TTL and aggressive revalidation, cold sets with long TTL and sluggish eviction. This segmentation reduces evictions on hot paths and increases the desired hit rate in the important routes.
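The soft/hard split gives each entry three states: fresh (serve directly), stale (serve now, revalidate in the background) and expired (must refetch). A minimal sketch with explicit timestamps so the transitions are visible (`now` is injectable purely to make the logic testable):

```python
import time

class Entry:
    def __init__(self, value, soft_ttl, hard_ttl, now=None):
        now = time.time() if now is None else now
        self.value = value
        self.fresh_until = now + soft_ttl  # soft TTL: still fresh
        self.dead_at = now + hard_ttl      # hard TTL: absolute expiry

    def state(self, now=None):
        now = time.time() if now is None else now
        if now < self.fresh_until:
            return "fresh"    # serve directly from cache
        if now < self.dead_at:
            return "stale"    # serve stale, trigger background revalidation
        return "expired"      # must refetch before serving

e = Entry("page", soft_ttl=60, hard_ttl=300, now=0)
assert e.state(now=30) == "fresh"
assert e.state(now=120) == "stale"
assert e.state(now=400) == "expired"
```

The stale window is what absorbs load peaks: users keep getting sub-millisecond hits while one background request refreshes the entry.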
Cache warming, preload and cold start resistance
I plan for cold starts and preheat critical paths. After deployments or cache flushes, I automatically warm up the top URLs from logs, including typical Vary variants (language, device, encoding). For OPcache, I use preloading so that central classes and functions sit directly in the working set. Careful throttling prevents the warming itself from becoming a load peak.
I work with rolling and canary warmings: first warm a subset of the nodes, check telemetry, then roll out step by step. I combine edge and origin warming: CDN edges preload popular assets, while the origin fills page and object caches in parallel. In this way, I avoid the "cold chain", where a miss hits the entire line through to the database.
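The throttled warming loop can be sketched in a few lines. `fetch` here is a stand-in for whatever HTTP client the warmer uses, and the batch size and pause are the two knobs that keep the warmer from becoming its own load peak (both values are illustrative):

```python
import time

def warm(urls, fetch, batch_size=5, pause=0.0):
    """Fetch the top URLs in small batches, pausing between batches."""
    warmed = []
    for i in range(0, len(urls), batch_size):
        for url in urls[i:i + batch_size]:
            fetch(url)          # hitting the URL populates page/object caches
            warmed.append(url)
        time.sleep(pause)       # throttle so warming stays below real traffic
    return warmed
```

Feeding this the top-N URLs from access logs, expanded by the Vary variants that actually occur, reproduces the hot set instead of guessing it.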
Kernel, network and file system tuning
I regard the Linux page cache as a silent accelerator and adjust kernel parameters to my profile. I set readahead values per block device to match the access pattern: sequential log or asset reads benefit from more readahead, while highly randomized accesses tend to benefit from less. I choose the dirty write-back thresholds (background/total) so that write peaks do not increase read latencies. I keep swap usage low so as not to run into I/O storms.
In the network, I reduce connection overhead by using keep-alive, HTTP/2 or HTTP/3 and compression in a coordinated manner. TLS benefits from session resumption and reuse at edge and origin level. On the socket side, sensible backlog and reuseport settings help me so that workers can take over quickly. These settings reduce the load on upstream services and ensure that cached responses land on the wire without context changes.
NUMA, CPU affinity and process topology
I keep data and compute threads together. On NUMA systems, I pin services so that they use memory local to the node on which they are running. I bind Redis or Memcached to a NUMA node and prefer to serve application workers of the same pool from there. In this way, I reduce expensive cross-node accesses, stabilize L3 hit rates and lower the latency variance.
For proxies and app servers, I define the number of workers according to the number of cores and workload without over-committing. I decouple short, latency-critical paths (e.g. page cache hits) from long backends (DB accesses) so that queues do not block each other. This topology prevents head-of-line blocking and ensures that fast responses are not held up.
Hot keys, sharding and replication
I recognize hot keys early and distribute their load. Instead of reading a single object millions of times, I split it across shards or use replicas for read accesses. In distributed caches, consistent hashing helps limit rebalancing pain. For app-side micro-caches (per process), I use small LRU buffers that hold hot keys in the workers' RAM and save the network RTT to Redis/Memcached.
I use negative caches deliberately: I briefly cache 404s, empty query results or feature-flag lookups so that repeated misses do not hit the entire stack every time. At the same time, I set conservative TTLs to get rid of stale negatives quickly. For large lists, I cache paginations separately and invalidate them separately instead of globally.
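The per-worker micro-cache mentioned above is a tiny bounded LRU. A minimal sketch using the standard library (the capacity is illustrative; in practice it is sized to the worker's hot set, not its full key space):

```python
from collections import OrderedDict

class MicroLRU:
    """Tiny per-worker LRU that keeps hot keys in process memory,
    saving the network round trip to Redis/Memcached for them."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```

Because every worker holds its own copy, entries must stay small and short-lived; the distributed cache remains the source of truth and the micro-cache only shaves off the RTT for the hottest keys.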
Cache security and correctness
I prevent cache poisoning by normalizing inputs: host, scheme, port and query parameters are clearly defined, and unsafe headers are stripped. I use Vary strictly and sparingly: only on what really influences the response. For static assets, I remove irrelevant query strings and set long TTLs with file hashes to avoid confusion.
I make a hard distinction between authenticated and public responses. Authorized routes receive explicit no-store/no-cache rules or hole-punching. I design ETags coherently so that revalidations work correctly. I use stale-if-error and grace as a safety net so that failures in the upstream do not immediately translate into error spikes for users. This keeps performance and correctness in balance.
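The input normalization behind this can be sketched with the standard library. The parameter whitelist here is a hypothetical example (which parameters actually affect rendering is application-specific); the point is that equivalent requests collapse to one key and attacker-controlled parameters cannot fragment or poison the cache:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

ALLOWED_PARAMS = {"page", "q"}  # illustrative whitelist of render-relevant params

def normalized_cache_key(url):
    """Collapse equivalent URLs to one cache key: lowercase host and path,
    drop trailing slash, keep only whitelisted params in sorted order."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/").lower() or "/"
    query = sorted((k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS)
    return f"{host}{path}?{urlencode(query)}"

# Tracking params and case/slash variants all map to the same entry:
assert normalized_cache_key("https://Example.com/Shop/?utm_source=x&page=2") == \
       normalized_cache_key("https://example.com/shop?page=2")
```

Dropping unknown parameters instead of including them is the conservative choice: a poisoned or random query string then hits the same entry as the clean URL instead of creating a fresh, attacker-chosen one.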
Runbook: TTFB under 100 ms - my steps
- Measure baseline: record p50/p95 TTFB, miss rate per layer, RTT and CPU load.
- Set page cache in front: identify public routes, define TTL/Grace, minimize Vary.
- Activate OP cache/preload: Reduce start-up costs, load hot code, reduce autoloader hits.
- Pull in object cache: cache expensive queries and serialized data, key design with versions.
- Sharpen edge layer: long TTLs for assets, short TTLs for HTML, wire purges/events.
- Fine-tune kernel/FS: Page cache, readahead, dirty limits, keep-alive and compression.
- Warming & Grace: Preheat critical routes, Stale-While-Revalidate against load peaks.
- Defuse hot keys: shard, replicate, use micro caches in the workers.
- NUMA/topology: Pin processes, increase L3 locality, avoid blockages between pools.
- Continuously check: Dashboards and alerts, evictions vs. RAM, purge hit rate.
Briefly summarized
I prioritize the server cache levels according to proximity to the CPU, minimize misses and thus reduce access times. I use access patterns such as read-through, write-through and write-back in a targeted manner so that consistency and speed go together. Web server headers, purge strategies and object caches form the backbone of fast responses. Edge caching reduces latency in the network and stabilizes TTFB even during peaks. With monitoring, clear rules and a few effective levers, I reliably bring systems up to speed.


