Health checks and failover protect web applications through tightly timed checks and automatic switching to standby systems as soon as a service fails. I show how real-time monitoring, heartbeats, fencing and DNS or load-balancer switching work together to achieve high availability with switchover times measured in seconds.
Key points
- Real-time checks detect failures based on HTTP status, latency and resource metrics.
- Automatic failover moves services to healthy nodes within seconds.
- Fencing and quorum prevent split-brain and ensure data consistency.
- DNS and load-balancer switching redirect traffic quickly to reachable instances.
- Tests and monitoring reduce false alarms and keep uptime high.
What do server health checks do?
I anchor health checks directly in the services so that each instance reports its status unambiguously. A dedicated /health endpoint or a TCP check covers reachability, response time and application state. In addition, I monitor CPU, RAM, disk I/O and network paths so that load spikes or faulty drivers do not go unnoticed. Heartbeat signals between cluster nodes run every second and trigger verification only after multiple consecutive misses. This reduces false alarms and gives me a reliable picture of actual service health.
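As an illustration, here is a minimal sketch of what such a /health payload could look like, combining application state with a resource check. The threshold and field names are assumptions for this example, not a fixed contract:

```python
import json
import shutil
import time

# Hypothetical threshold -- tune to your workload.
DISK_MIN_FREE_BYTES = 1 * 1024**3  # require at least 1 GiB free

def health_payload(started_at: float, disk_path: str = "/") -> dict:
    """Build the JSON body a /health endpoint could return.

    Combines application state (uptime) with a resource check
    (free disk space) so the LB sees more than a bare 200 OK.
    """
    usage = shutil.disk_usage(disk_path)
    disk_ok = usage.free >= DISK_MIN_FREE_BYTES
    status = "ok" if disk_ok else "degraded"
    return {
        "status": status,
        "uptime_s": round(time.time() - started_at, 1),
        "disk_free_bytes": usage.free,
    }

print(json.dumps(health_payload(time.time() - 42.0), indent=2))
```

In a real service this payload would be served by the web framework behind GET /health; the load balancer then evaluates the status field, not just the HTTP code.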
How automatic failover works
A clear failover design consists of detection, verification, restart and traffic switching. If a node fails, the cluster registers the heartbeat loss and starts fencing to safely isolate the faulty server. A healthy node then takes over the service, ideally with shared or replicated storage. Finally, DNS or the load balancer updates the target address so that users continue without manual intervention. This chain stays short because each step uses fixed thresholds and timeouts that I test and document in advance.
| Phase | Duration | Description |
|---|---|---|
| Failure | 0 s | Hardware or software error occurs |
| Detection | 5–30 s | Heartbeat loss or negative health response |
| Verification | 10–30 s | Fencing and quorum check guard against false alarms |
| Restart | 15–60 s | Service starts on a healthy node |
| Network update | 5–10 s | DNS or LB redirects traffic |
| Total | 30–120 s | Web application remains reachable |
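The phases above can be sketched as a small coordinator that walks through them in order. This is a minimal sketch; the fence/promote/switch steps are placeholder log entries standing in for real cluster APIs, and the thresholds are the illustrative values from the text:

```python
# Minimal sketch of the failover chain: detection, verification
# (quorum), fencing, restart, traffic switch. Aborts early if the
# heartbeat threshold is not reached or quorum is lost.

def run_failover(heartbeat_missed: int, quorum_votes: int,
                 cluster_size: int, log: list) -> bool:
    """Walk the phases in order; return True only if traffic was switched."""
    if heartbeat_missed < 3:               # detection: require repeated misses
        log.append("detection: below threshold, no action")
        return False
    if quorum_votes <= cluster_size // 2:  # verification: majority required
        log.append("verification: no quorum, refusing failover (split-brain risk)")
        return False
    log.append("fencing: isolating failed node")   # e.g. IPMI power-off
    log.append("restart: promoting standby node")  # service takeover
    log.append("switch: updating DNS/LB target")   # traffic redirection
    return True

events: list = []
assert run_failover(heartbeat_missed=3, quorum_votes=2, cluster_size=3, log=events)
print(events)
```

The early returns mirror why the chain is robust: no single missed heartbeat and no minority partition can trigger a switchover on its own.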
DNS failover in practice
I use DNS failover when I want to secure several locations or providers and need provider-neutral control. Two A records with priorities, a short TTL and an external health check are enough to redirect resolution to the backup server. Reliable detection remains important: I usually require three consecutive errors so that a brief hiccup does not trigger a switch. I also monitor the failback so that the primary takes over again after it has stabilized. If you are looking for concrete steps, you can find them in my step-by-step guide to DNS failover, which is built up in a practical way.
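The "three consecutive errors" rule can be sketched as a tiny state machine: the checker only flips the served record after N failures in a row, and flips back only after the primary has stabilized again. The IPs are illustrative RFC 5737 addresses:

```python
# Sketch of a DNS failover decision: consecutive-failure counting
# before switching, and a monitored failback with the same hysteresis.

class DnsFailover:
    def __init__(self, primary_ip: str, backup_ip: str, threshold: int = 3):
        self.primary_ip = primary_ip
        self.backup_ip = backup_ip
        self.threshold = threshold
        self.failures = 0
        self.successes = 0
        self.active = primary_ip

    def observe(self, primary_healthy: bool) -> str:
        """Feed one external health-check result; return the record to serve."""
        if primary_healthy:
            self.failures = 0
            self.successes += 1
            if self.successes >= self.threshold:   # monitored failback
                self.active = self.primary_ip
        else:
            self.successes = 0
            self.failures += 1
            if self.failures >= self.threshold:
                self.active = self.backup_ip
        return self.active

fo = DnsFailover("192.0.2.10", "192.0.2.20")
assert fo.observe(False) == "192.0.2.10"   # one hiccup: no switch
assert fo.observe(False) == "192.0.2.10"
assert fo.observe(False) == "192.0.2.20"   # third failure: switch
```

The same threshold applied in both directions is what prevents ping-pong between primary and backup.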
Load balancer and health endpoints
For APIs and web front-ends, I prefer a load balancer with active health checks. It removes faulty instances from the pool via HTTP or TCP checks and distributes requests to healthy nodes. Short intervals of 3–5 seconds with fall/rise thresholds lead to fast but stable switching behavior. An example is HAProxy with option httpchk and fine-tuned intervals per server entry. For deeper selection criteria, I rely on tried-and-tested load-balancing strategies, which I adjust depending on latency and session behavior.
A holistic approach to high availability
I plan redundancy in layers: server, network, storage and DNS/LB. A single bottleneck brings down any system, no matter how many nodes are available. Multi-zone or multi-region designs significantly reduce site risk. Replicated or distributed storage prevents data loss during a switchover. Without automation, reserves remain unused, so I tightly couple checks, orchestration and switching.
Fencing, quorum and split-brain avoidance
Reliable fencing powers off defective nodes hard via IPMI or a switched PDU. This prevents two nodes from writing the same data at the same time. Quorum mechanisms secure a majority in the cluster and prevent contradictory decisions. I deliberately test network partitions to check the behavior of isolated segments. Only when logs and alarms no longer show duplicate writes do I classify the environment as sufficiently fail-safe.
Best practices for health check intervals
I choose intervals and thresholds depending on workload and risk. Thirty seconds with three consecutive failures often strikes a good balance between sensitivity and stability. I check latency-critical APIs more frequently, but set a rise of two to three successful responses to avoid flapping. For state-heavy services, I prefer to evaluate explicit function signals in the response body instead of relying on a 2xx status alone. I accompany every change with metrics and document decisions so they remain traceable.
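A body-aware check along these lines can be sketched as follows: instead of trusting any 2xx, require an explicit status field and a fresh replication signal in the body. The field names (`status`, `replication_lag_s`) are assumptions about a hypothetical /health response:

```python
import json

# Sketch of a function-signal check: a 200 alone is not enough; the
# body must carry an explicit "ok" plus an acceptable replication lag.

def is_functionally_healthy(status_code: int, body: str,
                            max_lag_s: float = 5.0) -> bool:
    if status_code != 200:
        return False
    try:
        doc = json.loads(body)
    except ValueError:
        return False  # non-JSON body carries no usable function signal
    return (doc.get("status") == "ok"
            and doc.get("replication_lag_s", 0.0) <= max_lag_s)

assert is_functionally_healthy(200, '{"status": "ok", "replication_lag_s": 1.2}')
assert not is_functionally_healthy(200, '{"status": "degraded"}')
assert not is_functionally_healthy(200, "OK")  # plain text, no function signal
```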
Monitoring, alerting and testing
I combine metrics, logs and traces so that I can quickly classify the causes of errors. A failed health check triggers a warning, but persistent failures or an actual failover raise a red escalation level. Synthetic checks from multiple regions uncover DNS issues that local agents don't see. Planned failure tests measure switchover times and let me adjust timeouts without surprises in an emergency. A strong stack with Grafana and Prometheus shows me bottlenecks before users notice them.
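The cross-region evaluation can be sketched as a simple majority rule: escalate only when most synthetic probes agree, which separates a regional network issue from a true outage. The region names and labels here are illustrative:

```python
# Sketch of multi-region synthetic-check aggregation: a minority of
# failing regions points at a network/DNS issue, a majority at an outage.

def classify(results: dict) -> str:
    """results maps region name -> check passed (bool)."""
    failed = [r for r, ok in results.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) * 2 > len(results):
        return "outage"          # majority failing: red escalation
    return "regional-issue"      # minority failing: investigate network/DNS

assert classify({"eu": True, "us": True, "ap": True}) == "healthy"
assert classify({"eu": False, "us": True, "ap": True}) == "regional-issue"
assert classify({"eu": False, "us": False, "ap": True}) == "outage"
```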
Common errors and troubleshooting
Overly tight timeouts generate false alarms, so I raise the thresholds and verify stability. Missing fencing risks split-brain and therefore data loss; I therefore prioritize IPMI and hard power-off. High DNS TTLs extend switchover times, which is why I rarely go above 300 seconds in production. In Windows environments, cluster validation and event IDs help narrow problems down quickly. I absorb network failures with redundant links and active path monitoring on all nodes.
Windows and cloud environments
In Windows Server clusters I observe resources, memory and role status via the Health Service. Dependencies must be clearly defined, otherwise startup fails despite free capacity. In the cloud, I use provider health checks that decide based on status codes, latency and body matches. For global latency, I choose anycast LBs or GeoDNS, keeping the TTLs tight. I absorb regional disruptions with a second location whose data path is mirrored synchronously or asynchronously.
Practical configuration: HAProxy checks
For web services I use HTTP checks against /health, clear interval values and fall/rise thresholds. This reduces flapping and reliably keeps faulty nodes out of the pool. I document the semantics of the health endpoint so that teams can interpret it unambiguously. During maintenance, I put servers into DRAIN state to let running sessions end cleanly. This keeps the user experience consistent, even while I rotate nodes.
```
defaults
    mode http
    option httpchk GET /health
    timeout connect 5s

backend api_servers
    balance roundrobin
    server s1 192.0.2.1:80 check inter 3000 fall 3 rise 2
    server s2 192.0.2.2:80 check inter 3000 fall 3 rise 2 backup
```
Multi-location design and data paths
I plan storage according to the latency budget: synchronous for transactional systems, asynchronous for read-intensive applications. Object storage suits static assets, while block storage serves VMs and databases. A clear restart plan defines how new primary roles are assigned. Network routes and firewalls must not hinder the switchover, so I test them early on. A clean switchover is only possible if data paths and security rules work together.
Provider orientation and performance values
I compare failover times, check depth and degree of automation rather than just raw performance. The decisive factor is how quickly a provider detects errors, isolates them and redirects traffic. For many projects, a total time of 30–120 seconds provides a noticeable advantage over manual intervention. Health checks should evaluate status codes, response bodies and latency to measure true function. Consistent evaluation across multiple sites separates network disruptions from true service outages.
| Provider | Failover time | Health checks | High Availability |
|---|---|---|---|
| webhoster.de | 30–120 s | HTTP, TCP, latency, body | Cluster with automatic switching |
| Others | varies | partly limited | standard features |
Using readiness, liveness and startup probes correctly
I differentiate between liveness (is the process alive?), readiness (can it handle traffic?) and startup (is it fully initialized?). Liveness prevents zombie processes, readiness keeps faulty instances out of the pool, and startup protects against premature restarts during long boot phases. In container environments, I encapsulate these checks separately so that a service can be reachable but only appears on the load balancer after successful initialization. For monolithic systems, I map the same semantics in the /health endpoint, for example with partial states such as degraded or maintenance, which the LB can interpret.
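The partial states for a monolith can be sketched as a simple mapping from subsystem states to one overall status plus the HTTP code the LB acts on. Treating "degraded" as 503 for the readiness path is an assumption here; some setups prefer 200 with a body flag:

```python
# Sketch of partial health states: "maintenance" keeps the process
# alive (liveness stays green) but takes it out of the pool, and any
# failed subsystem turns readiness off.

def overall_status(subsystems: dict, maintenance: bool = False) -> tuple:
    """Return (status, http_code) for a readiness-style check."""
    if maintenance:
        return "maintenance", 503
    if not all(subsystems.values()):
        return "degraded", 503      # out of the pool, but process stays alive
    return "ok", 200

assert overall_status({"db": True, "cache": True}) == ("ok", 200)
assert overall_status({"db": True, "cache": False}) == ("degraded", 503)
assert overall_status({"db": True}, maintenance=True) == ("maintenance", 503)
```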
Stateful services and databases
Stateful workloads need special care. I plan leader election cleanly (e.g. via built-in consensus mechanisms), provide fencing actions for former leaders and differentiate synchronous from asynchronous replication according to RPO/RTO. During failover, I evaluate whether a read replica is promoted or shared block storage is remounted. Write-ahead logs, snapshot chains and replication lag feed into the decision. Health checks for databases not only probe TCP ports, but also perform light transactions so that I verify genuine read/write functionality without unnecessarily loading the system.
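Such a light read/write probe can be sketched with SQLite standing in for the real database: write one row into a dedicated heartbeat table and read it back, verifying genuine read/write capability rather than just an open port. The table and node names are illustrative:

```python
import sqlite3
import time

# Sketch of a light transaction probe: one upsert plus one read on a
# dedicated heartbeat table, kept cheap so it does not load the system.

def rw_probe(conn: sqlite3.Connection, node_id: str) -> bool:
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS health_probe"
                     " (node TEXT PRIMARY KEY, ts REAL)")
        now = time.time()
        conn.execute("INSERT INTO health_probe (node, ts) VALUES (?, ?)"
                     " ON CONFLICT(node) DO UPDATE SET ts = excluded.ts",
                     (node_id, now))
        conn.commit()
        row = conn.execute("SELECT ts FROM health_probe WHERE node = ?",
                           (node_id,)).fetchone()
        return row is not None and row[0] == now
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
assert rw_probe(conn, "node-a")
```

Against a real primary, a failing probe here signals a read-only or wedged state that a bare TCP check would miss.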
Sessions, caches and user experience
I decouple session data from the app instances: either stateless tokens or sessions offloaded to Redis/SQL. This way, a switchover remains transparent without forcing sticky sessions. Before a planned switchover, I pre-warm caches, synchronize critical keys or use staged rollouts with throttled traffic so that the new primary does not start cold. Connection draining on the LB as well as timeouts and keep-alive values are coordinated so that users do not experience interruptions.
Graceful degradation and resilience patterns
I build circuit breakers, timeouts and retries with jitter to prevent cascading failures. If a downstream dependency fails, the application degrades gracefully (e.g. cached content, simplified search, asynchronous queues). Idempotency keys prevent double bookings on retries. Health checks must not become a load trap: I limit their frequency per node, cache results briefly and decouple their evaluation from the critical request path.
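A minimal circuit-breaker sketch for this pattern: after N consecutive failures the breaker opens and rejects calls immediately; after a cool-down it half-opens and lets a trial request through. The thresholds are illustrative:

```python
# Minimal circuit breaker: closed -> open after repeated failures,
# half-open after a cool-down, closed again on the next success.

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means closed

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.reset_after_s  # half-open trial

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now

cb = CircuitBreaker()
for _ in range(3):
    cb.record(success=False, now=100.0)
assert not cb.allow(now=110.0)   # open: fail fast, protect the downstream
assert cb.allow(now=131.0)       # cool-down elapsed: half-open trial allowed
```

In production this would wrap the actual downstream call and be combined with timeouts and jittered retries, as described above.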
Auto-scaling, capacity and warm starts
Failover only works if capacity reserves are available. I maintain headroom (e.g. 20–30 %), use warm pools or pre-warmed containers, and set up scaling policies with cooldowns. For deployments, I prevent capacity dips through rolling or blue/green strategies (maxSurge/maxUnavailable) and define pod disruption budgets so that maintenance does not cause unintentional outages. Metrics such as requests/s, P95 latencies and queue lengths trigger scaling instead of CPU values alone.
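A worked example of the headroom rule: given peak load and per-node capacity, size the pool so that a node can fail and the remainder still carries peak plus the reserve. The numbers below are illustrative:

```python
import math

# Sketch of capacity sizing with headroom: required capacity is peak
# plus reserve, rounded up to whole nodes, plus spares for failures.

def nodes_needed(peak_rps: float, node_rps: float, headroom: float = 0.25,
                 tolerate_failures: int = 1) -> int:
    required = peak_rps * (1 + headroom)
    return math.ceil(required / node_rps) + tolerate_failures

# 9,000 req/s peak, 2,000 req/s per node, 25 % headroom, survive 1 failure:
assert nodes_needed(9000, 2000) == 7   # ceil(11250/2000) = 6, plus 1 spare
```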
Network routing: VRRP, BGP and anycast
In addition to DNS, I use VRRP/Keepalived for virtual IPs on layer 3 or BGP/ECMP for faster reroutes. Anycast LBs reduce latency and isolate site failures. For DNS, I account for resolver behavior, negative caches and TTL respect: even with short TTLs, some clients hold on to stale records. That's why I combine DNS failover with LB health checks so that even sluggish resolvers don't become a single point of failure.
Kubernetes and orchestration aspects
In container clusters, I add liveness/readiness/startup probes, pod priorities and affinity rules. Node drains run in coordination with the ingress so that connections end cleanly. For stateful sets, I define pod management policies and secure storage attachments against race conditions. An example of differentiated probes:
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: api
          image: example/api:latest
          startupProbe:
            httpGet: { path: /health/startup, port: 8080 }
            failureThreshold: 30
            periodSeconds: 2
          livenessProbe:
            httpGet: { path: /health/live, port: 8080 }
            periodSeconds: 10
            timeoutSeconds: 2
          readinessProbe:
            httpGet: { path: /health/ready, port: 8080 }
            periodSeconds: 5
            failureThreshold: 3
```
Security of the Health Checks
Health endpoints must not reveal sensitive details. I minimize their output, redact internal paths and distinguish public readiness checks from internal deep checks. Rate limits and separate management networks prevent misuse. For TLS failover, I automate certificate provisioning and key rotation so that no warnings arise. Optionally, I sign checks with a token or restrict them via IP ACLs without hindering the LB checks.
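The optional token signing can be sketched with an HMAC over a timestamp: the checker sends the signature, the endpoint verifies it before exposing the deep check. The shared key is an assumed example, not a real credential scheme:

```python
import hashlib
import hmac

# Sketch of token-signed health checks: HMAC-SHA256 over the request
# timestamp, verified with a constant-time comparison.

SECRET = b"example-shared-key"  # illustrative; in practice load from a secret store

def sign(timestamp: str) -> str:
    return hmac.new(SECRET, timestamp.encode(), hashlib.sha256).hexdigest()

def verify(timestamp: str, signature: str) -> bool:
    return hmac.compare_digest(sign(timestamp), signature)

tok = sign("1700000000")
assert verify("1700000000", tok)
assert not verify("1700000001", tok)
```

Checking the timestamp for freshness on the server side (to reject replays) would complete the scheme.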
Failback and return to primary
After a successful failover, I do not rush into failback. A hold-down timer ensures stability while replication states catch up. Only when logs, latencies and error rates give the green light do I switch back, preferably in a controlled manner outside peak times. The LB only removes the backup status once the primary has proven that it is sustainably healthy. This avoids ping-pong switching and unnecessary impact on customers.
SLOs, error budgets and chaos tests
I tie failover designs to SLIs/SLOs (e.g. 99.9 % over 30 days) and manage error budgets deliberately. Game days and targeted chaos experiments (network partitions, memory failures, full disks) show whether thresholds, timeouts and alerts are realistic. I record metrics such as Mean Time to Detect/Recover (MTTD/MTTR) on a dashboard and compare them with the targeted 30–120 seconds to prioritize optimizations based on data.
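The arithmetic behind "99.9 % over 30 days" is simple: the error budget is the allowed downtime per window, which MTTR figures from game days are measured against:

```python
# Error budget: allowed downtime per window for a given SLO.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    return window_days * 24 * 60 * (1 - slo)

assert round(error_budget_minutes(0.999), 1) == 43.2    # ~43 min per 30 days
assert round(error_budget_minutes(0.9999), 2) == 4.32   # "four nines"
```

At 43 minutes per month, even a handful of 120-second failovers consumes only a small fraction of the budget, which is exactly the advantage over manual intervention.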
Runbooks, ownership and compliance
I document runbooks from detection to switchover, including the backout plan. On-call teams have clear escalation paths and access to diagnostic tools. Backups, restore tests and legal requirements (storage, encryption) are incorporated into the design so that a failover does not violate compliance. After incidents, I create postmortems without assigning blame, update threshold values and add tests - so the system is constantly learning.
Briefly summarized
Consistent health checks and a clean failover design keep services online even in the event of hardware or software errors. I rely on clear thresholds, fencing and short TTLs so that switchovers run reliably and quickly. DNS and load balancers complement each other, each providing better control depending on the scenario. Monitoring, tests and documentation close gaps before users notice them. A smart combination of these components delivers high availability without operational surprises.


