
File Descriptor Limits on the Server: Optimizing Limits in Hosting

I show how the file descriptor limit on a server constrains connections, files and sockets and thus determines performance. In clear steps I measure demand, raise limits and prevent EMFILE errors before services fail under load.

Key points

So that you can act quickly, I have summarized the most important levers for optimizing FD limits:

  • Cause: Every socket, every file, every DB connection consumes FDs.
  • Symptoms: HTTP 500, EMFILE messages, blocked I/O, service crashes.
  • Measurement: ulimit, /proc/<PID>/limits, file-max and lsof provide clarity.
  • Optimization: Raise limits specifically in limits.conf, systemd and sysctl.
  • Security: Flank high limits with rate limiting and monitoring.

What are file descriptors and why limits count

A file descriptor is a simple integer identifier that the kernel uses to reference open files, sockets, pipes or devices per process. Each process has a soft and a hard limit, and in addition there is a system-wide global maximum that caps all processes together and is meant to prevent resource exhaustion. By default, only 1024 FDs are often available per process, which quickly becomes tight for high-traffic websites, API gateways or chat backends and amplifies load peaks. If a process reaches its limit, new connections fail, workers can no longer open files, and logs fill up with EMFILE, which drives up response times. It becomes particularly critical with setups that hold several handles per request: PHP-FPM, cache backends, log files and reverse proxies accumulate FDs until the limits block further work.

Recognizing and measuring symptoms

You often see the first signs of a limit that is too tight as HTTP 500 errors without a clear cause, slow responses or sporadic restarts of individual services. Typical log entries such as "Too many open files" point to EMFILE and signal an immediate need for action. I first check per-process limits and current consumption to distinguish local bottlenecks from system-wide problems and to pin down the cause more precisely. For a structured introduction, the compact Server-Ulimits Guide is a good starting point if you want a quick overview of the adjusting screws before planning the steps. I then use lsof to measure how many descriptors a process really holds, because measured values beat assumptions as soon as load profiles change.

# Soft and hard limit of the current shell
ulimit -n
ulimit -Hn

# Check limits of a process
cat /proc/<PID>/limits | grep "open files"

# Overall status of the system
cat /proc/sys/fs/file-nr # open | free | maximum
cat /proc/sys/fs/file-max # global maximum

# Rough estimate of consumption per process
lsof -p <PID> | wc -l

Checking limits and interpreting key figures

I make a strict distinction between per-process and global limits so that I can eliminate bottlenecks in a targeted manner instead of just moving them. The hard limit caps how far the soft limit can be raised in a session, while fs.file-max and fs.nr_open define the global frame and thus the capacity of the host. A tried and tested rule of thumb is to allow at least 65535 FDs per process, provided RAM and workload support this and you know the load. At the same time, I make sure that the sum of active workers, child processes and network connections stays within the global frame in realistic high-load scenarios. A clear view of these figures prevents blind increases without a plan and keeps system integrity under control even under pressure.

Command/Path | Function | What to look out for
ulimit -n / -Hn | Soft/hard limit of the current session | Hard limit caps how far the soft limit can be raised
/proc/<PID>/limits | Per-process limits and open files | Critical for daemons such as nginx/php-fpm
/proc/sys/fs/file-max | Global maximum of all FDs | Must fit the sum of process demand and the available RAM
/proc/sys/fs/file-nr | Open, free, maximum in numbers | Check the trend during load tests and peaks
lsof | Shows open handles | Measure FDs consumed per worker/thread
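
To turn these key figures into a single number per service, I sum the open FDs of all worker processes of a daemon. A minimal sketch, assuming php-fpm only as the example service (adjust the pgrep pattern to your daemon):

# Sum open FDs across all processes that match a pattern (php-fpm is only an example)
for pid in $(pgrep -f php-fpm); do
    ls /proc/"$pid"/fd 2>/dev/null | wc -l
done | awk '{ sum += $1 } END { print sum " FDs held by all matched processes" }'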

Adjust FD limits temporarily and permanently

For quick tests, I use ulimit to set a higher soft limit before I define permanent rules and restart services. I then write the appropriate entries into /etc/security/limits.conf, add systemd overrides and verify the change with a targeted load test. Important: the limits must be set for the correct service user, otherwise the increase has no effect and the problem reappears under load. In addition, I adjust the global limit so that many worker processes together do not pile up against the system-wide maximum. Only when the process and system sides fit together does the configuration withstand real high-load scenarios and avoid EMFILE.

# Temporary (until logoff/restart)
ulimit -n 65535

# System-wide (until restart or permanently via sysctl.conf)
sudo sysctl -w fs.file-max=2097152

# Permanent (example for web server users)
echo -e "www-data soft nofile 65535\nwww-data hard nofile 65535\n* soft nofile 65535\n* hard nofile 65535" | sudo tee -a /etc/security/limits.conf

# systemd services (e.g. nginx)
sudo mkdir -p /etc/systemd/system/nginx.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload && sudo systemctl restart nginx

Set kernel parameters correctly

I validate fs.file-max and fs.nr_open together so that the kernel has enough buffer for peak loads. If you only raise the per-process limit, you will otherwise hit the global ceiling and merely shift the bottleneck to system level. It makes sense to leave a gap between the typical peak load and the global values so that reserves remain for maintenance windows or burst traffic and risk peaks are dampened. You can find details on system-wide tuning in the article on kernel tuning, which I like to use as a checklist for in-depth OS customizations. After changes, I reload the parameters, check file-nr again and verify that all services have been restarted and picked up the new limits.

# Permanent kernel parameters
sudo bash -c 'cat >> /etc/sysctl.d/99-ulimits.conf <<EOF
fs.file-max = 2097152
fs.nr_open = 2097152
EOF'
sudo sysctl --system

# Control
cat /proc/sys/fs/file-max
cat /proc/sys/fs/nr_open || sysctl fs.nr_open

Capacity planning and architecture

Realistic capacity planning starts with measured profiles per request: how many FDs do the web server, app layer, database and cache need together? From these figures, I derive the total number of simultaneously open handles per host and plan buffers for peaks on top. Calculate conservatively: log rotation, additional sockets, temporary files and backup jobs increase demand and eat into reserves. I also keep in mind that horizontal scaling with load balancers reduces the FD load per node, which simplifies failure tolerance and change windows. Only with clear limit values per tier can you set specific limits and divide capacity sensibly between services.

Web server and database fine-tuning

For web servers, I stick to the rule that worker threads times four stays below the FD limit so that reserves remain for upstream connections, temporary files and logs. For nginx and Apache, I include keep-alive, access and error logs as well as upstream sockets in the calculation and thus keep a buffer. Databases such as MariaDB or PostgreSQL open many sockets towards applications, replication and monitoring; their limits must match the connection pool and traffic peaks. Caches (Redis, Memcached) reduce DB load, but do not necessarily reduce the number of FDs if many clients hold parallel requests and connections. I therefore plan coordinated limits along the chain: frontend, upstream, DB, cache and message queues, so that no hard barrier takes hold anywhere first.

# Example: nginx systemd limit and worker
LimitNOFILE=65536 # systemd
worker_processes auto;
worker_connections 4096; # 4096 * Worker <= 65536

# Example: PostgreSQL
max_connections = 1000 # FD requirement ~ 1-2 per connection + files/logs

Keeping WordPress and PHP stacks efficient

WordPress instances with many plugins open more files, more network connections and more logs. I reduce unnecessary file opens for includes with OPcache, take load off the database with a Redis object cache and offload static assets to a CDN to cut file accesses and connections. At the same time, I specifically raise the limits for php-fpm and the web server so that peaks during cronjobs, crawler visits or store checkouts do not produce aborts. Clean handling of error logs and rotation prevents log writers from running into dead ends and being unable to open new files. In this way I combine consumption reduction and limit increases so that the stack stays affordable under load and throughput holds.
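
For the PHP side, the relevant knobs sit in the php-fpm pool configuration and php.ini. A minimal sketch, assuming a www pool under /etc/php/8.2/fpm/; paths and values are examples and depend on distribution and load:

# /etc/php/8.2/fpm/pool.d/www.conf (example path)
pm = dynamic
pm.max_children = 150      # keep in sync with the DB pool and the FD budget
rlimit_files = 65535       # per-worker FD limit of this pool

# /etc/php/8.2/fpm/php.ini - OPcache cuts repeated file opens for includes
opcache.enable = 1
opcache.max_accelerated_files = 20000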

Containers and cloud environments

In Docker and Kubernetes, processes often inherit the FD limits of the nodes, which is why I first check the host parameters and then the service definitions. For systemd-nspawn or containerd, analogous principles apply, but the implementation happens in unit files, PodSpecs or daemon configurations with overrides. I document the limits as code (IaC) and keep them consistent via playbooks so that new nodes come with identical boundaries. With Kubernetes, I also check SecurityContexts and set the required capabilities so that system-side limits take effect. In the end, measuring in the cluster remains important, because scheduling, autoscaling and rolling updates shift the distribution of open handles and test your buffers.

# Example: systemd in container hosts
cat <<'EOF' | sudo tee /etc/systemd/system/myapp.service.d/limits.conf
[Service]
LimitNOFILE=65536
EOF

# Kubernetes: podSpec (container image must respect ulimit)
# Note: rlimit settings must be set differently depending on the runtime/OS

Security, rate limiting and monitoring

High limits provide breathing room, but they also increase the attack surface for floods. I therefore limit requests at the edge with rate limiting and set connection limits in the web server. A web application firewall and sensible timeouts prevent idle connections from permanently binding FDs. For recurring tests, I use reproducible load profiles and rely on Prometheus, Netdata or Nagios to consistently monitor the metrics around open files and sockets. Depending on the workload, I correct limit values gradually instead of raising them abruptly so that effects stay measurable and rolling back stays easy. If you want to delve deeper into limits on the connection side, the compact article on connection limits serves as a guide at network boundaries.
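
For the edge limits mentioned above, nginx brings limit_req and limit_conn. A minimal sketch; the zone names (api, perip) and the thresholds are examples and need tuning to your traffic:

# nginx: limit request rate and parallel connections per client IP (values are examples)
limit_req_zone  $binary_remote_addr zone=api:10m rate=20r/s;
limit_conn_zone $binary_remote_addr zone=perip:10m;

server {
    location /api/ {
        limit_req  zone=api burst=40 nodelay;
        limit_conn perip 20;
    }
}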

Troubleshooting with EMFILE: structured procedure

I start with a look into the journal and the service logs to narrow down the time and frequency of the errors. I then use lsof to check the top consumers per process and identify patterns such as leaks, increasing logging or unusual socket types. Next, I compare the configured limits with the real peaks and raise them temporarily at first, so that I can validate cause and effect in a controlled test and derive the permanent settings from the results. If a leak is found, I patch or roll back the component, because higher limits only mask symptoms and postpone the problem. Finally, I document the correction, set alarms and plan a new load test so that the fix holds and creates trust.

Realistically assess the resource costs of high FD limits

Every open file or socket costs kernel memory, so plan the RAM footprint: depending on the kernel version and architecture, each FD requires a few hundred bytes to a few kilobytes (including VFS/socket structures). With hundreds of thousands of FDs, this adds up. I dimension the global file-max so that, in the worst case, sufficient page cache and working memory remain free for applications. A simple countercheck uses vmstat, free and the trend of open FDs via file-nr during a peak-load test. The aim is a configuration that neither tips into swap at peak load nor triggers excessive reclaim or OOM activity.
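
As a quick countercheck, the worst-case kernel memory can be estimated straight from file-max. A minimal sketch, assuming roughly 1 KiB per FD as a conservative planning figure:

# Worst case: every allowed FD is actually open (assumption: ~1 KiB kernel memory per FD)
FILE_MAX=$(cat /proc/sys/fs/file-max)
echo "$FILE_MAX FDs x ~1 KiB = ~$((FILE_MAX / 1024 / 1024)) GiB kernel memory worst case"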

Deployment and start-path pitfalls (PAM, systemd, cron)

Whether limits apply depends on the start path. PAM-based logins (ssh, su, login) read /etc/security/limits.conf, whereas systemd services primarily use their unit parameters (LimitNOFILE) and do not necessarily go through PAM. Cron/at can have their own contexts. I therefore validate per service:

  • How does the process start? (systemctl status, ps -ef)
  • Which limits does it really see? (cat /proc/<PID>/limits)
  • Does PAM work? (Check PAM modules in /etc/pam.d/*)
  • Do system-wide defaults exist? (systemd: DefaultLimitNOFILE in system.conf)

In this way, I prevent applications from receiving different FD limits depending on the start path and behaving inconsistently.
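
To compare the configured limit with what a service actually received on its start path, a quick check like the following helps. A minimal sketch, assuming nginx only as the example unit:

# What systemd has configured for the unit
systemctl show nginx -p LimitNOFILE

# What the running process really got
cat /proc/$(pidof -s nginx)/limits | grep 'open files'

# systemd-wide default for all units
grep DefaultLimitNOFILE /etc/systemd/system.conf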

Practical dimensioning with calculation examples

I work backwards from the workers and connection profiles to the required FD capacity:

  • nginx with 8 workers at 4000 connections each: ~32000 connections. As a rule, nginx needs 1 FD per active connection; upstream (keep-alive) sockets and logs add another ~10-20% buffer. Result: ~38000 FDs for nginx alone.
  • php-fpm with 150 children, typically 20-40 FDs per child (includes, sockets, logs): conservatively 6000 FDs.
  • Redis/DB clients: 200 parallel connections, 1-2 FDs each: ~400 FDs.

Summed per host: ~44k FDs. I set LimitNOFILE for nginx to 65536, php-fpm analogously, and plan the global fs.file-max so that all services plus a reserve (x1.5-x2) fit in. For several heavily utilized instances per host, scale the global values to 1-2 million FDs if RAM and I/O paths allow it.
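
Written down as a small calculation, the budget stays reproducible. A minimal sketch using the assumed figures from the list above:

# FD budget per host from the example figures above (replace with measured values)
NGINX=$((8 * 4000 * 120 / 100))   # 8 workers x 4000 connections + ~20% buffer
PHPFPM=$((150 * 40))              # 150 children x up to 40 FDs each
CLIENTS=$((200 * 2))              # 200 client connections x up to 2 FDs each
echo "Total: $((NGINX + PHPFPM + CLIENTS)) FDs -> LimitNOFILE=65536 leaves headroom"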

Deeper diagnosis: finding leaks and hotspots

If FDs rise continuously, I use targeted tools to find the cause:

# Open handles grouped by type
lsof -p <PID> | awk '{print $5}' | sort | uniq -c | sort -nr

# Only sockets of one process
lsof -Pan -p <PID> -i

# Which files are growing (logs, temp)
lsof +L1 # Deleted but still open files
ls -l /proc/<PID>/fd

# Syscall view: who is constantly opening?
strace -f -p <PID> -e trace=open,openat,close,socket,accept,accept4 -s 0

Particularly treacherous are deleted logs that are still open: they occupy space and FDs but no longer appear in the file system. A restart or an explicit reopen (e.g. with nginx via USR1) solves this cleanly. Misconfigured watchers/exporters can also constantly open new sockets; rate limits and connection pooling help here.
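
Finding and clearing such deleted-but-open log files looks like this in practice. A minimal sketch, assuming nginx and its usual pid file path (adjust to your setup):

# List deleted but still open files (often rotated logs)
sudo lsof +L1 | grep -i log

# Tell nginx to reopen its log files without a restart
sudo nginx -s reopen
# equivalent: sudo kill -USR1 "$(cat /var/run/nginx.pid)"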

Clearly distinguishing inotify, epoll and EMFILE

Not every resource limit error is an FD-limit problem. In development and CI environments, builds often fail with ENOSPC related to inotify (watcher limits). As a supplement, I check and set:

# Inotify limits (user-wide and instances)
sysctl fs.inotify.max_user_watches
sysctl fs.inotify.max_user_instances

# Exemplary increase
sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=1024

While epoll itself works with FDs internally, with massive numbers of long-lived connections the real bottleneck is often the FD limit itself. I therefore correlate epoll/event-loop data (e.g. active handles) with file-nr and per-process consumption.

Language and runtime specifics (Java, Node.js, Go, Python)

Runtimes handle FDs differently:

  • Java/Netty: Many NIO channels per process, and logging frameworks keep file appenders open. I set generous limits and rotate logs with a reopen strategy instead of close/replace.
  • Node.js: EMFILE occurs quickly with file-system-heavy workloads (e.g. watchers, build pipelines). I throttle parallel fs operations, raise limits and add backoff/retry strategies.
  • Go: High parallelism through goroutines can open many sockets. I limit dial and response header timeouts and check whether connections are closed cleanly (IdleConnTimeout).
  • Python/uWSGI/Gunicorn: Worker/thread models consume FDs for logs, sockets and temporary files; I harmonize worker count, thread pools and nofile limits.

They all have one thing in common: without coordinated log rotation and reliable connection management, FD usage creeps upward.

Containers in concrete terms: Docker and Kubernetes settings

To ensure that containers actually see the desired limits, I set them consistently along the chain:

  • Docker run: start with --ulimit nofile=65535:65535 or set a daemon-wide default (see the sketch after this list).
  • Images: Start scripts should not reset a restrictive ulimit.
  • Kubernetes: Depending on the runtime (containerd, cri-o), rlimit settings take effect differently. I test in the pod via cat /proc/self/limits and adjust node/runtime defaults if pod-level settings are not sufficient.
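
A minimal sketch of both Docker variants; the image and the values are placeholders to adapt:

# Per container at start time
docker run --ulimit nofile=65535:65535 nginx:stable

# Daemon-wide default in /etc/docker/daemon.json (restart the Docker daemon afterwards)
{
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Soft": 65535, "Hard": 65535 }
  }
}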

Especially in multi-tenant environments, I check the total demand against fs.file-max and isolate noisy-neighbor effects with separate node sets or pod budgets so that individual deployments do not consume the FDs reserved for the host.

Defining monitoring metrics and alerts

In addition to file-nr and file-max, I also monitor per-process FDs and trend lines:

  • System-wide: Allocated vs. maximum, rate of change, peak-to-maximum ratio.
  • Per process: FDs per worker/thread, top-N processes, anomalies at night (batch jobs).
  • Qualitative: HTTP error rates, queue lengths, accept/handshake errors correlated with the FD trend.

I set alerts in multiple levels: warning at 70-80%, critical from 90% of the configured limit, plus leak detection via rising 7-day trends. This lets me react in good time before hard barriers take effect.
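
A simple host-side countercheck of these thresholds can be scripted against file-nr. A minimal sketch, assuming 80% as the warning threshold:

# Warn when allocated FDs exceed 80% of the global maximum
read -r allocated unused maximum < /proc/sys/fs/file-nr
pct=$(( 100 * allocated / maximum ))
[ "$pct" -ge 80 ] && echo "WARNING: ${pct}% of fs.file-max in use ($allocated/$maximum)"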

Runbook for emergencies

When EMFILE strikes acutely, I act in clear steps:

  1. Identify top consumers (lsof, /proc/<PID>/fd, journal entries).
  2. Temporarily raise the soft limit (ulimit in the session or a LimitNOFILE override) and restart the service; a running process can also be adjusted without a restart, as shown below.
  3. If logs are the reason: stop rotation, trigger a reopen, reduce the log level.
  4. Throttle burst traffic at the edge (tighten or relax rate limits/connection limits, depending on the situation).
  5. Fix the root cause (leak, overly aggressive parallelism, missing timeouts) and tighten the permanent limits.
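
For step 2, the limit of an already running process can also be raised without a restart via prlimit from util-linux. A minimal sketch (replace <PID> with the affected process):

# Raise nofile for a running process without restarting it
sudo prlimit --pid <PID> --nofile=65535:65535

# Verify
cat /proc/<PID>/limits | grep 'open files'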

Follow-up matters: documentation, repeated load tests and sharpened alerts so that the same failure chain is recognized early next time.

Briefly summarized

With raised FD limits, cleanly set kernel parameters and a tested architecture, my customers' services keep their headroom under high load. I measure first, then set appropriate limits, verify with load tests and secure the result with rate limiting, monitoring and clear rules.
