Replication

Sync Replication vs Async Replication

  • Replication can be done synchronously or asynchronously
  • Semi-synchronous: when a DB is configured for synchronous replication, it usually means that one slave is synchronous with the master and the other slaves are asynchronous

[Figure: replication-1]

  • Slave 1 is synchronous with master
  • Slave 2 is asynchronous with master

Synchronous Replication

  • Advantages:
    • The synchronous slave is guaranteed to have an up-to-date copy of the data. If the master fails, the data is still available on that slave
  • Disadvantages:
    • If the synchronous slave does not respond, the master cannot process writes: it has to block all writes until the slave is available again
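
The trade-off above can be illustrated with a small in-memory sketch. The `Leader` and `Follower` classes are hypothetical stand-ins, not a real database API. As a simplification, the leader checks the synchronous acknowledgement before appending to its own log, so a blocked write leaves no trace.

```python
class Follower:
    """In-memory stand-in for a slave (hypothetical, for illustration only)."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.log = []

    def replicate(self, entry):
        """Apply one change and return True as the acknowledgement."""
        if not self.reachable:
            return False
        self.log.append(entry)
        return True


class Leader:
    """Master with one synchronous follower and any number of asynchronous ones."""

    def __init__(self, sync_follower, async_followers):
        self.log = []
        self.sync_follower = sync_follower
        self.async_followers = async_followers

    def write(self, entry):
        """Confirm a write only after the synchronous follower acknowledges it."""
        if not self.sync_follower.replicate(entry):
            # No acknowledgement from the synchronous follower: the write is blocked.
            raise TimeoutError("synchronous follower did not acknowledge")
        self.log.append(entry)
        # Asynchronous followers are sent the change, but their reply is ignored.
        for follower in self.async_followers:
            follower.replicate(entry)
        return True
```

With `Leader(slave_1, [slave_2])`, a write succeeds while `slave_1` is reachable even if `slave_2` is down, and raises `TimeoutError` as soon as `slave_1` stops responding.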

Setup New Replica

  1. Take a snapshot of the primary database (master)
  2. Copy the snapshot to the new replica node
  3. The replica connects to the master and fetches all data changes that occurred after the snapshot. This requires the snapshot to be precisely associated with a position in the master's replication log. This position has different names, such as PostgreSQL’s log sequence number (LSN) or MySQL’s binlog coordinates.
  4. Once the replica processes the backlog of data changes accumulated after the snapshot, it is said to have caught up with the master. It can then continue processing data changes produced by the primary in real-time
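
The four steps above can be sketched with a toy model. It assumes the replication log is a simple list and the "position" is an index into it. `Primary` and `Replica` are hypothetical names, and a real system would use something like an LSN or binlog coordinates instead.

```python
class Primary:
    """Toy primary whose replication log is a list of (key, value) changes."""

    def __init__(self):
        self.data = {}
        self.log = []

    def write(self, key, value):
        self.log.append((key, value))
        self.data[key] = value

    def snapshot(self):
        """Return a copy of the data and the log position it corresponds to."""
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        """Return every change recorded after the given log position."""
        return self.log[position:]


class Replica:
    """Toy replica initialised from a snapshot taken at a known log position."""

    def __init__(self, snapshot_data, position):
        self.data = dict(snapshot_data)
        self.position = position

    def catch_up(self, primary):
        """Apply the backlog of changes made after the snapshot was taken."""
        for key, value in primary.changes_since(self.position):
            self.data[key] = value
            self.position += 1
```

Once `catch_up` has run and `replica.position` equals the length of the primary's log, the replica holds the same data as the primary.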

Handling Node Outages

Follower failure: Catch-up recovery

  • Each follower keeps a log on its local disk of the data changes it has received from the master
  • After a crash or a network interruption, the follower finds in this log the last transaction it processed before the fault occurred
  • The follower can then connect to the leader and request all data changes that occurred while it was disconnected
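
Catch-up recovery can be sketched as a single function, assuming each log entry carries an increasing sequence number. The `recover` function and the tuple layout are illustrative assumptions, not the format of any particular database.

```python
def recover(follower_log, leader_log):
    """Return the follower's log after catch-up recovery.

    Both logs are lists of (sequence_number, change) tuples in ascending order.
    """
    # The last entry in the local log is the last transaction processed
    # before the fault occurred.
    last_seen = follower_log[-1][0] if follower_log else 0
    # Request only the changes the follower missed while disconnected.
    missed = [entry for entry in leader_log if entry[0] > last_seen]
    return follower_log + missed
```

A follower that crashed after sequence number 2 receives only entries 3 and later, so nothing it already applied is sent again.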

Leader failure: Failover

  • One of the followers is promoted to be the new leader
  • Steps in automatic failover
    1. Determining that the leader has failed by timeout
    2. Choosing a new leader: usually the replica with the most up-to-date data changes from the old leader (to minimize data loss)
    3. Reconfiguring the system to use the new leader: clients send their write requests to the new leader, and the old leader, if it comes back, is forced to step down and recognize the new leader
  • Possible issues:
    1. If async replication is used, the new leader may not have received all the writes from the old leader before it failed. If the old leader later rejoins the cluster, the new leader may have received conflicting writes in the meantime. Solution => discard the old leader's unreplicated writes, at the cost of losing data that clients may have considered durable
    2. Split-brain: two nodes both believe that they are the leader. If both leaders accept writes and there is no process for resolving conflicts, data is likely to be lost or corrupted. Solution => shut down one node if two leaders are detected
    3. If the timeout is too short, there could be unnecessary failovers: a temporary load spike could push a node's response time above the timeout. If the timeout is too long, recovery takes longer
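
The first two failover steps can be sketched as follows. Both function names, the heartbeat timestamps, and the `replica_positions` mapping are assumptions made for illustration. Real systems usually rely on a consensus protocol or a controller node rather than a plain `max`.

```python
import time


def leader_has_failed(last_heartbeat, timeout_seconds, now=None):
    """Step 1: treat the leader as dead if no heartbeat arrived within the timeout."""
    now = time.monotonic() if now is None else now
    return now - last_heartbeat > timeout_seconds


def choose_new_leader(replica_positions):
    """Step 2: pick the replica with the highest replicated log position.

    replica_positions maps a replica name to the last log position it applied.
    """
    if not replica_positions:
        raise ValueError("no replicas available for promotion")
    return max(replica_positions, key=replica_positions.get)
```

The `timeout_seconds` value carries the trade-off from the last issue above: a small value reacts quickly but can trigger a failover during a load spike, while a large value delays recovery.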