A logging level set too high slows down web servers through additional I/O, CPU time for parsing and formatting, and memory buffers, while a level set too low weakens diagnostics and security. I'll show you how to set up logging so that latency, IOPS and p99 values remain stable while all necessary events are still recorded.
Key points
- Balance between diagnostics and performance
- Debug logs only for a limited time
- Consistent buffering and rotation
- Asynchronous instead of synchronous writing
- Monitoring of IOPS and p99
What does the correct logging level mean?
A web server logs events at several levels: from error via warn to info and debug. Each level adds detail and therefore more formatting, buffering and writing. In production environments, I use warn or error as the default because these levels make errors visible without turning every request into megabytes of text. During traffic peaks, each additional field in the access log costs I/O bandwidth and measurably increases response time. If you also tune the application, you can shift the log load; a look at PHP error levels shows how closely application and web server logs are linked.
How debug logs drag down performance
Debug entries often generate several kilobytes of text per request, which, at thousands of requests per second, quickly adds up to megabytes per second and ties up IOPS for logging alone. In addition, formatting strings and JSON costs CPU time that I prefer to reserve for TLS, compression or dynamic content. If the log volume increases, the memory required for buffers in Nginx or Apache grows; under load, this leads to additional garbage collection or kernel flushes. In virtualized environments, CPU steal time then appears because the platform has to distribute the many sync writes. I therefore only activate debug for a limited time, log specific endpoints and follow the advice from WP debug logging to strictly limit debug operation.
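As a sketch of how such a debug window can be scoped, Nginx can restrict debug output to individual client connections, provided the binary was built with --with-debug; the IP address below is only a placeholder for a test client.
# Nginx: debug output only for one test client (placeholder IP), everyone else stays at warn
error_log /var/log/nginx/error.log warn;
events {
    debug_connection 203.0.113.10;
}
I remove the debug_connection line again as soon as the analysis is done.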
I/O, CPU and memory: the bottleneck in detail
With high traffic, log writes alone can consume 20-30 percent of the available IOPS. Depending on the file system, mount options and SSD write amplification, write latency rises, which shows up in p95/p99 response times as 50-200 milliseconds of extra delay. On the CPU side, formatting, regex filters and JSON encoding burden every worker thread, leaving fewer free cycles for TLS handshakes and HTTP/2 multiplexing. In memory, large buffers create backpressure if the disk cannot write fast enough. I therefore plan log volumes actively and take write queues and journal parameters into account so that the stack stays clearly prioritized under load.
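To see whether log writes actually eat into the IOPS budget, a quick look with the sysstat tools is usually enough; this is a sketch, not a full monitoring setup.
# Write IOPS (w/s) and write latency (w_await) per device, refreshed every second
iostat -x 1
# Per-process write rates, to attribute the load to nginx/apache workers
pidstat -d 1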
Apache: Configuration for lean logging
In production, I have Apache log as sparingly as possible and stick to warn or error to avoid unnecessary detail. I lower the level in httpd.conf or apache2.conf and trim the access format down to the essentials. Fields such as %u (authentication) or %h (reverse DNS) cause extra work, so I only enable them if I really need to evaluate them. I run rotatelogs through a pipe so that no huge files build up and rotation works without locking. This noticeably reduces overhead and lock contention in busy VirtualHosts.
# Apache: Logging close to production
LogLevel warn
# Slim access log (no %u, no reverse DNS)
LogFormat "%a %t \"%r\" %>s %b %D" minimal
CustomLog "|/usr/bin/rotatelogs /var/log/apache2/access-%Y%m%d.log 86400" minimal
ErrorLog /var/log/apache2/error.log
The combination of a minimal format, rotation via pipe and a moderate LogLevel saves CPU during formatting and reduces I/O per request. I disable mod_status in the public context or protect it tightly so that analysis endpoints do not become a load factor themselves. For short-term analyses, I enable a second, more granular log only for the affected locations and give it its own rotation cycle. Afterwards I consistently remove these additional logs again so as not to risk permanent performance leaks. This keeps Apache responsive without sacrificing error visibility.
Nginx: lean access_log and error_log
Nginx benefits greatly from streamlined access formats and a moderate error_log level. I set the error level to warn because info/debug generates too much I/O in running production systems. For access logs, I define a minimal log_format, optionally disable the access log for static files and only enable it for dynamic paths. In edge scenarios, I route logs via syslog/UDP to a collector to avoid local writes. This decouples application performance from the slowest part of the system: the disk.
# Nginx: Minimal logging
error_log /var/log/nginx/error.log warn;
log_format minimal '$remote_addr [$time_local] "$request" $status $bytes_sent $request_time';
access_log /var/log/nginx/access.log minimal;
# Optional: No access log for static files
location ~* \.(css|js|jpg|png|gif|ico|svg)$ {
access_log off;
expires 7d;
}
With this setup, Nginx logs all relevant key figures such as request_time without inflating the logs. For debugging, I temporarily add a second access log with a more comprehensive format so that the standard log stays lean. After the analysis, I switch it off again. This keeps response times constant while still letting me track specific error sources, which is particularly useful during periods of high traffic.
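A possible sketch of such a temporary second log, assuming a hypothetical path /checkout/ as the endpoint under investigation:
# Nginx: additional, more detailed log only for the path under analysis (/checkout/ is an example)
log_format detailed '$remote_addr [$time_local] "$request" $status '
                    '$bytes_sent $request_time $upstream_response_time';
location /checkout/ {
    access_log /var/log/nginx/access.log minimal;
    access_log /var/log/nginx/checkout-debug.log detailed;
}
Several access_log directives per context are allowed, so the standard log keeps its minimal format while the second file is removed after the analysis.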
Log rotation, sampling and buffering
Large log files slow down file access and grep/parsing and increase backup times. I therefore rotate daily or by file size, compress old logs and limit retention periods according to compliance requirements. Where completeness is not required, I use sampling: only 1-5 percent of access requests are logged, while error logs remain complete. Buffering reduces syscalls and batches entries; in Nginx I use buffered logging or syslog buffers. The aim is always to reduce the write rate and smooth out peaks without losing critical information.
Asynchronous logging and central aggregation
Synchronous writing blocks worker threads and increases latency under pressure. I decouple this with asynchronous pipes, local queues (e.g. journald) and central aggregation via a log collector. The web server only writes to a fast local buffer; an agent then moves the data to the central system at its own pace. If the link fails, the agent keeps buffering locally without slowing down the web server. This preserves evaluability without sacrificing application performance.
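A minimal sketch of this decoupling in Nginx, with a placeholder collector address:
# Nginx: send logs via UDP syslog to a collector (10.0.0.5 is a placeholder) instead of writing locally
access_log syslog:server=10.0.0.5:514,facility=local7,tag=nginx,severity=info minimal;
error_log  syslog:server=10.0.0.5:514 warn;
With UDP the worker never waits for a slow collector; if complete error logs are a must, an additional local error_log can stay in place.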
Monitoring: correlating metrics and logs
Without measurement, every tuning effort is guesswork. I monitor IOPS, write latency, CPU steal, RAM usage and p95/p99 latency in parallel with the log volume. Correlation IDs in the header connect web server logs with application and DB traces so that I can pinpoint hotspots. A central evaluation tool that visualizes peak times, endpoints and error codes helps me in my daily work. If you want to dig deeper, click through the notes at Analyze logs and build your own lean dashboard on top of them.
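One way to sketch the correlation IDs with Nginx on-board means: the built-in $request_id becomes part of the log line and is passed to the backend as a header (the upstream name "backend" and the path /app/ are placeholders).
# Nginx: correlation ID in the access log and as a request header for the backend (names are examples)
log_format correlated '$remote_addr [$time_local] "$request" $status '
                      '$request_time $request_id';
access_log /var/log/nginx/access.log correlated;
location /app/ {
    proxy_set_header X-Request-ID $request_id;
    proxy_pass http://backend;
}
If the application logs the same ID, web server, application and DB traces can be joined on one key.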
Key figures and target values: p95/p99, IOPS, log volume
I define clear target values so that changes to logging remain measurable. For production sites, I aim for an access log volume of less than 5-10 percent of the total write capacity. The p99 latency should never deteriorate by more than 50-100 milliseconds because of logging; otherwise I shorten formats or activate sampling. I leave error logs complete because they show the relevant outliers. The following table serves as a rule of thumb for different levels and their effects.
| Level | Protocol type | Estimated IOPS share | Latency impact (p99) | Typical scenario |
|---|---|---|---|---|
| error | Error log | 1-3 % | < 10 ms | Production with a focus on faults |
| warn | Error log | 2-5 % | 10-30 ms | Production with early warnings |
| minimal | Access log | 5-10 % | 20-60 ms | Production under full load |
| combined | Access log | 10-20 % | 40-120 ms | Standard operation with analysis requirement |
| debug | Error/Access | 20-40 % | 100-250 ms | Short-term troubleshooting |
These orientation values vary with the storage device, file system options and traffic profile. I calibrate them on real data before setting permanent levels. I test new features in staging environments under production-like load to see the logging impact in advance. I then set limits and alarms that kick in if the log volume jumps. This keeps performance reliably plannable.
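A simple sketch of such an alarm, here as a cron-able shell check; the 2 GB budget is only an example value.
# Alert if today's access log exceeds its daily budget (threshold and path are examples)
LOG=/var/log/nginx/access.log
LIMIT_MB=2048
SIZE_MB=$(du -m "$LOG" | cut -f1)
if [ "$SIZE_MB" -gt "$LIMIT_MB" ]; then
    logger -p local0.warning "access log at ${SIZE_MB} MB, budget is ${LIMIT_MB} MB"
fi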
Hosting tuning around logging
Good logging is no substitute for caching; it supports it. I combine lean logs with an opcode cache, Redis/Memcached and tight keep-alive timeouts so that the web server has less work per request. I treat TLS parameters, compression levels and HTTP/2/3 settings separately from logging, but check the overall impact on latency. With strong growth, I spread load with a load balancer and disable access logs on edge nodes, while central gateways log more completely. At system level, I keep an eye on kernel parameters such as swappiness and TCP buffers so that I/O load is properly buffered.
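The kernel-side knobs I mean look roughly like this; the values are starting points, not recommendations, and need to be calibrated per workload.
# /etc/sysctl.d/90-io-tuning.conf (example values only, calibrate per workload)
vm.swappiness = 10
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216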
Security, compliance and storage
Even when performance counts, I never lose sight of compliance. I keep error logs for as long as laws, contracts or internal standards require, and I strictly separate personal data. Where possible, I anonymize or truncate IPs in access logs to comply with data protection regulations. I store old logs compressed so that storage and backup costs remain stable. I only allow personalized, controlled access so that no sensitive details circulate unchecked.
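For the IP shortening, a common Nginx pattern is a map that zeroes the last IPv4 octet before logging; IPv6 needs an extra rule and is omitted in this sketch.
# Nginx: log an anonymized client address instead of $remote_addr (IPv4 only in this sketch)
map $remote_addr $remote_addr_anon {
    ~(?P<ip>\d+\.\d+\.\d+)\.    $ip.0;
    default                     0.0.0.0;
}
log_format anonymized '$remote_addr_anon [$time_local] "$request" $status $bytes_sent $request_time';
access_log /var/log/nginx/access.log anonymized;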
Measurement methodology and controlled experiments
Before I change levels, I measure reproducibly: identical load profiles, fixed data sets and a clean separation of control and test group. I run A/B tests over short, defined test windows (e.g. 2 × 20 minutes) with pre-warmed caches and empty OS page caches so that warm-up effects do not distort the results. For each variant, I record p50/p95/p99, error rates and write rates and keep the infrastructure constant (threads/workers, CPU frequency, limits). Important: I measure end-to-end latency and server time in parallel to rule out network jitter. I then normalize to requests per second and compare variances, not just mean values. Only when the effect is above the measurement uncertainty (rule of thumb: >5-10 % on p99 or IOPS) do I make the change permanent.
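For the load side of such a test window, a tool that reports latency percentiles directly is enough; host, path and duration below are placeholders.
# One variant of the A/B window: 20 minutes, latency percentiles printed at the end (URL is a placeholder)
wrk -t4 -c100 -d20m --latency https://staging.example.com/app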
Structured logs (JSON) vs. plain text
Structured logs make parsing and correlation easier, but cost CPU and bytes. A typical JSON access log with 12-20 fields quickly amounts to 400-800 bytes instead of 200-300 bytes in plain text. On the CPU side, JSON encoding requires additional formatting and escaping. I decide based on the context: For strong central analysis with parsers and correlation IDs, JSON is worthwhile despite the additional costs. For edge or cache nodes, I rely on plain text minimal formats. Mixed operation works well: locally minimal, centrally enriched. If you use JSON, you should consciously select fields (no null fields, short keys) and ensure stable field sequences so that downstream filters remain efficient.
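A compact JSON variant with short keys and a stable field order could look like this in Nginx; escape=json requires version 1.11.8 or newer.
# Nginx: minimal JSON access log with short, fixed keys
log_format json_min escape=json
    '{"t":"$time_iso8601","ip":"$remote_addr","req":"$request",'
    '"st":$status,"bytes":$bytes_sent,"rt":$request_time}';
access_log /var/log/nginx/access.json json_min;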
Selective logging and sampling in practice
I don't log everything everywhere. Static assets are often excluded, dynamic paths are given a lean format, and I only temporarily increase the depth for certain hosts/endpoints. I build sampling deterministically so that analyses remain stable.
# Nginx: Selective logging and 5% sampling
log_format minimal '$remote_addr [$time_local] "$request" $status $bytes_sent $request_time';
# 5%-Sampling per split_clients (stable via key field)
split_clients "${remote_addr}${request_uri}" $log_sample {
5% 1;
* 0;
}
# Only log dynamic paths, exclude static ones
location / {
access_log /var/log/nginx/access.log minimal if=$log_sample;
}
location ~* \.(css|js|jpg|png|gif|ico|svg)$ {
access_log off;
}
# Apache 2.4: Selective and sampled
LogLevel warn
LogFormat "%a %t \"%r\" %>s %b %D" minimum
# 5% sampling with expression (rand() returns 0..1)
SetEnvIfExpr "rand() < 0.05" sampled
# Only log dynamic paths (example /app), assets muted
SetEnvIf Request_URI "\.(css|js|png|jpg|jico|svg)$" static=1
# Access log only if sampled and not static
CustomLog /var/log/apache2/access.log minimal env=sampled env=!static
This is how I keep access data statistically meaningful without constantly placing a full load on memory and CPU. Sampling does not apply to error paths: I log status ≥ 400 completely by setting condition variables accordingly.
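Building on the split_clients sketch above, a map can combine both conditions so that every status ≥ 400 is logged while the rest follows the 5 % sample.
# Nginx: always log 4xx/5xx, otherwise apply the sampling decision ($log_sample from the split_clients block above)
map $status $log_always {
    ~^[45]    1;
    default   $log_sample;
}
location / {
    access_log /var/log/nginx/access.log minimal if=$log_always;
}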
Fine-tune buffer and flush parameters
Buffering smooths out peaks; too much buffering delays visibility. In Nginx, I set moderate buffers and short flush times so that entries are written promptly yet efficiently. At system level, I tune journald and rsyslog to keep queues from bursting.
# Nginx: Buffered access logs with short flush intervals
access_log /var/log/nginx/access.log minimal buffer=64k flush=1s;
open_log_file_cache max=1000 inactive=30s valid=1m;
# error logs remain moderate, but visible
error_log /var/log/nginx/error.log warn;
# systemd-journald: Rate limits and sizes
# /etc/systemd/journald.conf
[Journal]
SystemMaxUse=1G
RuntimeMaxUse=256M
RateLimitIntervalSec=30s
RateLimitBurst=10000
Compress=yes
# rsyslog: Asynchronous queue and batch processing
# /etc/rsyslog.d/10-performance.conf
$MainMsgQueueType LinkedList
$MainMsgQueueDequeueBatchSize 1000
$MainMsgQueueWorkerThreads 2
# Target action with own queue (e.g. remote collector)
*.* action(type="omfwd" target="collector" port="514" protocol="udp"
action.resumeRetryCount="-1"
queue.type="LinkedList" queue.size="200000")
# logrotate: Regular, compressed rotation
/var/log/nginx/*.log {
daily
rotate 7
missingok
compress
delaycompress
notifempty
create 0640 www-data adm
sharedscripts
postrotate
[ -s /run/nginx.pid ] && kill -USR1 "$(cat /run/nginx.pid)"
endscript
}
At file system level, I reduce unnecessary metadata write accesses with mount options such as noatime/relatime and monitor the dirty page share so that flushes do not occur in unfavorable bursts.
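Sketched for an ext4 log volume; device, mount point and the dirty-page limits are example values that depend on RAM size and disk speed.
# /etc/fstab: no atime updates on the log volume (device and mount point are placeholders)
/dev/sdb1  /var/log  ext4  defaults,noatime  0  2
# /etc/sysctl.d/91-dirty-pages.conf: flush earlier and in smaller bursts (example values)
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15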
Container, orchestration and cloud contexts
In containers, I prefer writing to stdout/stderr and have the output collected by a lean log pipeline (sidecar/agent). I constrain local drivers with rotation parameters so that disks do not fill up. In Kubernetes, I use node-local buffers and central collection; persistence is clearly separated from volatile pods. On edge instances in the cloud, I often skip access logs and only collect metrics, while central gateways receive complete logs. Important: set limits and budgets (I/O, network) per pod/VM so that logging does not crowd out the application.
# Docker: Limit rotating JSON logs
# daemon.json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "5"
}
}
This ensures that the pipeline remains robust, even if the target system is temporarily unavailable. Sidecars with dedicated queues (e.g. fluent agents) provide additional decoupling.
Protection against backpressure and emergency strategies
I actively plan for incidents: What happens if the disk is full, the network connection to the collector is slow or the number of errors increases significantly? Emergency brakes such as temporarily switching off the access log, more aggressive rotation, increased sampling rates or switching to UDP syslog prevent logging from disrupting the service. Quotas per file system, dedicated partitions and alerts at 70/85/95 percent utilization provide a head start. Critical: The web server must never block on log write errors; rather discard entries than block users.
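A rough sketch of such an emergency brake as a shell check; the thresholds follow the 70/85/95 rule, while the snippet paths are hypothetical and assume a prepared include that contains "access_log off;".
# Warn at 70/85/95 % usage of the log file system, pull the brake at 95 % (snippet paths are hypothetical)
USAGE=$(df --output=pcent /var/log | tail -1 | tr -dc '0-9')
for LIMIT in 70 85 95; do
    [ "$USAGE" -ge "$LIMIT" ] && logger -p local0.warning "/var/log at ${USAGE}%, threshold ${LIMIT}%"
done
if [ "$USAGE" -ge 95 ]; then
    ln -sf /etc/nginx/snippets/access-off.conf /etc/nginx/snippets/access-log.conf
    nginx -s reload
fi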
Runbooks, feature toggles and governance
Logging is an operational feature. I keep runbooks that describe step by step how to increase sampling, enable debug logs for a limited time and then disable them again. Feature toggles or configuration flags per host/service ensure that I can react without deployments. For governance, I define who may change levels, how long debug windows may stay open (e.g. a maximum of 60 minutes) and which follow-up steps apply (rotation, cleanup, cost check). Compliance aspects (PII reduction, masking of sensitive fields) are part of the same policy.
Capacity planning: quick calculation examples
I do a rough calculation in advance: at 2,000 RPS and 300 bytes per minimal access line, this comes to 600 KB/s, around 52 GB per day uncompressed. In combined format with 800 bytes per line, it is 1.6 MB/s, approx. 138 GB per day. At IOPS level, 600 KB/s with 4 KB blocks corresponds to around 150 IOPS, 1.6 MB/s to around 400 IOPS - not counting metadata and journal overhead. These rule-of-thumb values quickly show how close I am to the device limits. With 5 % sampling, the volume in the example drops to about 3 GB/day or 7 GB/day - often the difference between a stable and a shaky p99 under full load.
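The same rule-of-thumb arithmetic as a reusable snippet; RPS and bytes per line are the only inputs, and the result uses decimal units like the figures above.
# Estimate log throughput, daily volume and 4K write IOPS from RPS and line size (decimal units)
awk -v rps=2000 -v bytes=300 'BEGIN {
    bps = rps * bytes
    printf "%.0f KB/s, %.0f GB/day, ~%.0f IOPS at 4 KB blocks\n",
           bps / 1e3, bps * 86400 / 1e9, bps / 4096
}'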
Step-by-step plan for optimization
I start with an inventory: current levels, log formats, volume per day, IOPS and p95/p99. I then trim access formats to the essentials and set error logs to warn or error where appropriate. At the same time, I activate rotation, compression and, where appropriate, sampling. In the next round, I handle debugging separately via targeted, time-limited logs for specific paths, hosts or services. Finally, I check the metrics and set alarms so that future changes to the system do not generate new log load unnoticed.
Summary: The optimal balance
The right logging level increases performance because it reduces I/O, CPU parsing and buffer pressure without sacrificing diagnostic capability. I use warn/error as the default, streamline access formats and only enable debug temporarily and in a targeted way. Rotation, buffering, asynchronous writing and central aggregation prevent bottlenecks under high load. I keep service times stable with clear target values for IOPS share and p99 latency. If you combine logs and metrics in a targeted way, you resolve errors faster - and keep the server noticeably responsive.


