
Database replication in hosting: master-slave vs. multi-master

Database replication in hosting determines how available applications remain when load increases and how quickly they can read and write again after disruptions. Below I clearly show the difference between master-slave and multi-master, including tuning, failover strategies and suitable deployment scenarios.

Key points

The following key aspects help me to choose the right replication strategy.

  • Master-Slave: Simple writes, scalable reads, clear responsibilities.
  • Multi-Master: Distributed writes, higher availability, but conflict management required.
  • GTIDs & Failover: Faster switchovers and cleaner replication paths.
  • Hosting reality: Latency, storage and network influence consistency.
  • Monitoring & Tuning: Metrics, catch-up times and binlog settings at a glance.

What replication does in hosting

I use replication to increase availability, distribute read load and enable maintenance windows without downtime. Master-slave topologies handle writes centrally, while multiple replicas serve the bulk of reads and thus reduce response times. Multi-master variants allow distributed writes, which reduces latency in global setups and makes it easier to cope with the loss of a node. For web stacks such as WordPress, shop engines or APIs, this means more buffering against traffic peaks and faster recovery after incidents. If you are planning horizontal growth beyond pure replication, combine it step by step with sharding and replication to distribute data and load more widely and make scaling plannable.

Master-slave: functionality and strengths

In a master-slave setup I consistently write only to the master, while slaves handle read access and follow the binlogs. The clear allocation of roles avoids write conflicts and keeps the model easy to reason about. This is perfect for many hosting scenarios with a high proportion of reads, such as product catalogs, content portals or reporting dashboards. I add more slaves as required without changing the write path. I plan in buffers for replication delay so that reports or caches still deliver correct results despite short lags.

MySQL Master-Slave step by step

I start on the master with binary logging and a unique server-id so that slaves can follow: in my.cnf I set server-id=1, log_bin=mysql-bin and, optionally, binlog_do_db for filtered replication. I then create a dedicated replication user and limit its rights to the bare minimum. For the initial synchronization I create a dump with --master-data, import it on the slave and note the log file and position. On the slave I set server-id=2, activate relay logs and connect it with CHANGE MASTER TO ..., followed by START SLAVE. With SHOW SLAVE STATUS\G I keep an eye on Seconds_Behind_Master and react if the delay increases.
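The steps above can be sketched as follows; host, user and password are placeholders, and the log file/position shown would come from the dump's --master-data header:

```sql
-- On the master: create a narrowly scoped replication account
CREATE USER 'repl'@'10.0.0.%' IDENTIFIED BY 'change-me';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.%';

-- On the slave: point it at the master using the coordinates
-- recorded by mysqldump --master-data, then start replication
CHANGE MASTER TO
  MASTER_HOST='10.0.0.1',
  MASTER_USER='repl',
  MASTER_PASSWORD='change-me',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE;

-- Verify: Seconds_Behind_Master should trend toward 0
SHOW SLAVE STATUS\G
```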

Optimizations for hosting environments

For clean failover I activate GTIDs and thus simplify switching without having to laboriously readjust log positions. I route reads specifically via proxy layers such as ProxySQL or via application logic in order to avoid hotspots and increase the cache hit rate. With sync_binlog=1 I secure binlogs against crashes, while moderate values for sync_relay_log reduce write overhead without letting the delay get out of hand. I pay attention to I/O capacity, because slow SSDs or shared storage pools drive up the backlog. For audits and compliance, I encrypt replication channels with TLS and keep keys separate from the data path.
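A my.cnf sketch for the master that combines these settings; the values are illustrative starting points, not universal recommendations:

```ini
# my.cnf (master) - GTID-based replication with crash-safe binlogs
[mysqld]
server-id                = 1
log_bin                  = mysql-bin
gtid_mode                = ON
enforce_gtid_consistency = ON
sync_binlog              = 1     # flush the binlog on every commit
require_secure_transport = ON    # force TLS on all connections
```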

Multi-Master: When it makes sense

I use multi-master when I need to distribute writes geographically or when a single node can no longer carry the write load. All nodes accept changes, propagate them to each other and thus compensate for failures more easily. The price is conflict management: simultaneous updates of the same row require rules, such as last-writer-wins, application-side merges or transactional sequencing. In latency-sensitive workloads, such as payment gateways or global SaaS backends, the setup can significantly reduce response times. I assess in advance whether my application can tolerate conflicts and whether I have clear strategies for resolving them.

MySQL Multi-Master in practice

I rely on GTID-based replication because it simplifies channels and failover and lets me localize errors more quickly. Multi-source replication allows me to feed several masters into one node, for example for central evaluations or aggregation. For real peer topologies, I define low-conflict key strategies, check auto-increment offsets and reduce drifting timestamps. I monitor latency peaks, because parallel writes across regions increase coordination effort and can cost throughput. Without clean monitoring and clear operator rules, I would not switch multi-master into production.
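Multi-source replication on the aggregating node can be sketched like this (MySQL 5.7+ channel syntax); the hostnames, credentials and channel names are placeholders:

```sql
-- Feed two independent masters into one replica, one channel each
CHANGE MASTER TO
  MASTER_HOST='master-eu.example.net',
  MASTER_USER='repl', MASTER_PASSWORD='change-me',
  MASTER_AUTO_POSITION=1
  FOR CHANNEL 'eu';

CHANGE MASTER TO
  MASTER_HOST='master-us.example.net',
  MASTER_USER='repl', MASTER_PASSWORD='change-me',
  MASTER_AUTO_POSITION=1
  FOR CHANNEL 'us';

START SLAVE FOR CHANNEL 'eu';
START SLAVE FOR CHANNEL 'us';

-- Inspect lag and errors per channel
SHOW SLAVE STATUS FOR CHANNEL 'eu'\G
```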

Comparison table: Master-slave vs. multi-master

The following table summarizes the most important differences and makes the decision easier in everyday hosting.

| Criterion         | Master-Slave                              | Multi-Master                       |
|-------------------|-------------------------------------------|------------------------------------|
| Writes            | One master processes all write operations | All nodes accept writes            |
| Consistency       | Strict, conflicts unlikely                | Softer, conflicts possible         |
| Scaling           | Reads scale very well                     | Reads and writes scale             |
| Setup effort      | Manageable and easy to control            | More effort and more rules         |
| Typical use cases | Blogs, stores, reporting                  | Global apps, latency-critical APIs |

High availability, RTO/RPO and security

I define clear RTO/RPO targets and align replication with them: how long may recovery take, and how much data can I afford to lose? Synchronous or semi-synchronous replication can reduce losses, but costs latency and throughput. Backups do not replace replication; they supplement it for point-in-time recovery and historical states. I run restore tests regularly, because only a tested backup counts in practice. For proper planning, refer to my guide to RTO/RPO in hosting, so that the key figures match operational reality and the risks.

Scaling path: From single node to cluster

I often start with a single master, add a replica for reads and backups, and then scale step by step. As the read share grows, I add additional slaves and round off the setup with caching. If write capacity is no longer sufficient, I plan multi-master paths, check conflict risks and add idempotency to the application. For larger conversions, I migrate with rolling strategies, blue/green or dual-write phases and keep reserves ready for rollbacks. For conversions without downtime, I use the guide to zero-downtime migrations so that users do not notice interruptions.

Performance tuning: latency, I/O and caching

I monitor network latency, IOPS on the storage and CPU peaks on the node, because all three factors drive the replication delay. A local Redis or Memcached layer takes reads off the stack and keeps slaves unloaded. I split large transactions to avoid binlog floods and reduce commit congestion. For write-heavy workloads, I increase the InnoDB log buffer moderately and adjust flush intervals without undermining durability. I keep query plans clean, because bad indexes cause expensive scans.
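A my.cnf fragment along these lines for write-heavy nodes; the sizes are assumptions to be validated against the actual workload and hardware:

```ini
# my.cnf - write-heavy workload, durability preserved
[mysqld]
innodb_log_buffer_size         = 64M   # fewer log flushes mid-transaction
innodb_flush_log_at_trx_commit = 1     # full durability on every commit
innodb_io_capacity             = 2000  # match to the SSD's sustained IOPS
```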

Conflict avoidance and resolution in Multi-Master

I avoid conflicts by separating write areas logically, for example by client, region or key space. Auto-increment offsets (e.g. 1/2/3 for three nodes) prevent primary-key collisions. Where simultaneous updates are unavoidable, I document clear rules, for example last-writer-wins or application-side merges. Idempotent writes and deduplicating consumers protect against duplicate processing. I also record audit information so that decisions can be traced quickly in the event of a dispute.
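The offset scheme mentioned above, sketched for node 1 of a three-node setup; nodes 2 and 3 would use auto_increment_offset = 2 and 3 respectively:

```ini
# my.cnf (node 1 of 3) - interleave auto-increment values across nodes
[mysqld]
auto_increment_increment = 3   # step size = number of writing nodes
auto_increment_offset    = 1   # node 1 generates 1, 4, 7, ...
```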

Troubleshooting: What I check first

In case of delay I check Seconds_Behind_Master, the I/O and SQL threads, and relay log sizes. I look at binlog sizes and formats, because STATEMENT vs. ROW can change the volume massively. Storage metrics such as flush times and queues show whether SSDs are maxing out or throttling. If GTIDs are active, I compare applied and missing transactions per channel. In an emergency, I stop and start the replication threads specifically to resolve blockages, and only then do I correct the configuration.
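A minimal diagnostic sequence along those lines; the \G terminator produces the vertical output referenced throughout this article:

```sql
-- Lag and thread state (Slave_IO_Running / Slave_SQL_Running,
-- Seconds_Behind_Master, Last_IO_Error, Last_SQL_Error)
SHOW SLAVE STATUS\G

-- Binlog volume on the source and the active format
SHOW BINARY LOGS;
SHOW VARIABLES LIKE 'binlog_format';

-- Last resort: restart only the SQL applier thread
STOP SLAVE SQL_THREAD;
START SLAVE SQL_THREAD;
```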

Consistency models and read-after-write

With asynchronous replication I consciously plan for eventual consistency. For user actions with direct feedback, I ensure read-after-write by binding write sessions to the master for a short time or routing reads in a lag-aware manner. I use application flags (e.g. "stickiness" for 2-5 seconds) and check Seconds_Behind_Master before I allow a replica for critical reads. On replicas I set read_only=ON and super_read_only=ON so that no accidental writes slip through. With properly selected isolation levels (REPEATABLE READ vs. READ COMMITTED), I prevent long transactions from slowing down the apply thread.
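Locking replicas down can be sketched as follows; super_read_only additionally blocks writes from accounts with the SUPER privilege:

```sql
-- On every replica: reject writes, even from privileged accounts
SET GLOBAL read_only = ON;
SET GLOBAL super_read_only = ON;

-- Lag check before routing a critical read to this replica:
-- only route here if Seconds_Behind_Master is low
SHOW SLAVE STATUS\G
```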

Topologies: star, cascade and fan-out

In addition to the classic star (all slaves pull directly from the master), I rely on cascading replication if many replicas are required or WAN links are limited. To do this, I activate log_slave_updates=ON on intermediate nodes so that they can serve as a source for downstream replicas. This relieves the master's I/O and distributes bandwidth better. I pay attention to the additional latency levels: each cascade step potentially increases delay and requires close monitoring. For global setups, I combine regional hubs with short distances and keep at least two replicas per region ready for maintenance and failover.
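An intermediate relay node in such a cascade could be configured like this; the server-id is just an example:

```ini
# my.cnf (intermediate relay node in a cascade)
[mysqld]
server-id         = 10
log_bin           = mysql-bin
log_slave_updates = ON   # re-log applied changes so downstream
                         # replicas can pull from this node
```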

Planned and unplanned failover

I document a clear promotion process: 1) stop writes on the master or switch traffic to read-only, 2) select a candidate replica (lowest lag, complete GTIDs), 3) promote the replica and deactivate read_only, 4) realign the remaining nodes. Against split-brain I protect myself with clear routing (e.g. VIP/DNS switching with short TTLs) and automatic fencing. Orchestration tools help, but I practice the manual paths regularly. I keep runbooks, alarms and drills ready so that nobody has to improvise in an emergency.
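Steps 3 and 4 can be sketched as follows, assuming GTID auto-positioning; the hostname and credentials are placeholders:

```sql
-- Step 3, on the chosen candidate: open it up for writes
STOP SLAVE;
SET GLOBAL super_read_only = OFF;
SET GLOBAL read_only = OFF;

-- Step 4, on every remaining replica: follow the new master
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='new-master.example.net',
  MASTER_USER='repl',
  MASTER_PASSWORD='change-me',
  MASTER_AUTO_POSITION=1;
START SLAVE;
```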

GTIDs in practice: stumbling blocks and healing

For GTIDs I activate enforce_gtid_consistency=ON and gtid_mode=ON step by step. I use auto-positioning to simplify source changes, and I avoid replication filters on GTID routes, as they make debugging more difficult. Errant transactions (transactions that exist on a replica but not on the source) I identify via the difference between the replica's gtid_executed and the source's, and I clean them up in a controlled manner, not blindly with purges. I plan binlog retention so that rebuilds are possible without gaps, and when restoring I check the consistency of gtid_purged.
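The errant-transaction check can be sketched with GTID_SUBTRACT; the GTID set in the second argument is a placeholder for the gtid_executed value read from the source beforehand:

```sql
-- On the replica: which GTIDs exist here but not on the source?
SELECT GTID_SUBTRACT(
  @@GLOBAL.gtid_executed,
  'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1-100'  -- source's gtid_executed
) AS errant_transactions;
-- An empty result means no errant transactions on this replica
```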

Parallelization and throughput on replicas

To reduce apply lag, I increase replica_parallel_workers in line with the number of CPUs and select replica_parallel_type=LOGICAL_CLOCK so that related transactions remain ordered. With binlog_transaction_dependency_tracking=WRITESET on the source I increase parallelism, because independent writes can be applied simultaneously. I monitor deadlocks and lock wait times on replicas: excessive parallelism can generate contending updates. Group commit on the master (with a small flush delay) additionally helps to bundle related transactions more efficiently, without blowing the P95 latency budget.
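A configuration sketch for a replica with parallel apply (MySQL 8.0 variable names); the worker count of 8 is an assumption tied to the CPU count:

```ini
# my.cnf (replica) - parallel apply with ordered commits
[mysqld]
replica_parallel_workers      = 8             # roughly one per core
replica_parallel_type         = LOGICAL_CLOCK
replica_preserve_commit_order = ON            # keep commit order visible

# On the source, dependency info is emitted per writeset:
# binlog_transaction_dependency_tracking = WRITESET
```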

Schema changes without downtime

I prefer online DDL with InnoDB (ALGORITHM=INPLACE/INSTANT, LOCK=NONE) to carry table changes through replication without blocking queries. For very large tables, I choose chunk-based methods, split index builds and keep an eye on the binlog load. For multi-master, I schedule strict DDL windows, as concurrent schema changes are hard to heal. I test DDLs on a replica, measure their impact on lag and only roll them out when the replication path remains stable.
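An online DDL along those lines; the table and index names are hypothetical:

```sql
-- Add an index without blocking concurrent reads and writes;
-- MySQL aborts with an error if INPLACE with LOCK=NONE is not possible
ALTER TABLE orders
  ADD INDEX idx_customer_created (customer_id, created_at),
  ALGORITHM=INPLACE, LOCK=NONE;
```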

Delayed replication as a safety net

Against logical errors (DROP/DELETE) I keep a delayed replica ready, for example with an apply delay of 3600 seconds via CHANGE MASTER TO MASTER_DELAY (SOURCE_DELAY in newer MySQL versions). This allows me to return to a clean state within an hour without immediately running PITR from backups. I never use this replica for reads or failovers; it is purely a safety buffer. I automate copies from this node so that I can quickly bring up a fresh, up-to-date read node in an emergency.
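Setting up such a delayed safety-net replica can be sketched as:

```sql
-- On the safety-net replica: apply events one hour behind the source
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 3600;  -- seconds; SOURCE_DELAY in MySQL 8.0.23+
START SLAVE;

-- The SQL_Delay and SQL_Remaining_Delay fields in
-- SHOW SLAVE STATUS\G confirm the configured lag
```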

Upgrades, compatibility and operation

I keep source and target versions close together and upgrade rolling: first the replicas, then the master. I take a critical view of mixed environments with MySQL/MariaDB, as binlog formats and features can diverge. I use binlog_row_image=MINIMAL where it makes sense to reduce binlog volume, and I check application dependencies on triggers or stored procedures. I use protocol and binlog compression to reduce WAN load, but take care not to exceed CPU budgets.
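A sketch of the volume-reducing settings mentioned above; availability of transaction compression depends on the MySQL version, so treat this as an assumption to verify:

```ini
# my.cnf (source) - smaller row events, compressed transport
[mysqld]
binlog_row_image               = MINIMAL  # log only changed + key columns
binlog_transaction_compression = ON       # MySQL 8.0.20+, costs CPU
```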

Observability and capacity planning

I define SLOs for lag, catch-up times, error rates and throughput. Core metrics include applied transactions per second, relay log fill levels, I/O queues, lock wait times and network latency. I track binlog growth, plan binlog_expire_logs_seconds and check whether rebuilds remain within the retention periods. I set limits on replicas, such as max_connections, and monitor aborts so that read requests do not go unanswered. For costs and sizing, I calculate fan-out levels, storage requirements and peak loads against RPO/RTO targets.
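A retention sketch matching this planning; seven days is an assumption to be weighed against rebuild times and disk budget:

```ini
# my.cnf - keep binlogs long enough for rebuilds and PITR
[mysqld]
binlog_expire_logs_seconds = 604800  # 7 days of binlog retention
max_connections            = 500     # cap per-replica read load
```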

Security and compliance in replication operations

I encrypt connections end-to-end and strictly separate operator, application and replication accounts. Regular rights audits prevent replication users from retaining unnecessary DDL/DML privileges. I protect offsite backups with separate key management and check access paths against lateral movement. For data protection, I adhere to deletion rules and replicate pseudonymized or minimized data records where the purpose allows it. I scope logging and metrics according to least privilege so that telemetry does not carelessly generate a leak.

Briefly summarized

Master-slave provides a reliable basis for hosting scenarios, because reads scale easily and conflicts rarely occur. When global writes, low latency and failure tolerance are priorities, I consider multi-master and plan conflict resolution rules. I combine GTIDs, clean monitoring and well-thought-out backups to safely achieve recovery goals. By tuning binlog, storage and query parameters, I reduce delay and keep throughput high. This allows me to choose the right topology, scale in a controlled manner and keep downtime invisible to users.
