Replication

Replication can be done synchronously or asynchronously
Semi-synchronous: If the DB is configurated for synchronous replication, it means one slave is synchronous with master. Other slaves are asynchronous with master

replication-1

Synchronous Replication

Advantages:
- Data consistency is guaranteed. If master fails, we can make sure the data is stored in slave
Disadvantages:
- If the asynchronous slave has no response, the master will not be able to write data

Take a snapshot of the primary database (master)
Copy the snapshot to the new replica node
The replica connects to the master and fetches all data changes that occurred after the snapshot. This requires the snapshot to be precisely associated with a position in the master's replication log. This position has different names, such as PostgreSQL’s log sequence number (LSN) or MySQL’s binlog coordinates.
Once the replica processes the backlog of data changes accumulated after the snapshot, it is said to have caught up with the master. It can then continue processing data changes produced by the primary in real-time

On the local disk of each follower, it keeps a log of data changes it has received from master
From the log of last transaction that was processed before the fault occurred
The follower can connect to the leader and request all data changes that occurred during the time when the follower was disconnected

One of the followers promotes to be the new leader
Steps in automatic failover
1. Determining that the leader has failed by timeout
2. Choosing a new lead: usually the replica with the most up-to-date data changes from the old leader (minimize data loss)
3. Reconfiguring the system to use the new leader: Client send their write requests to new leader && Force the old leader to step down and recognize the new leader
Possible Issue:
1. If async replication is used, the new leader may not have received all the writes from the old leader before it failed. The new leader may have received conflicting writes in the meantime. Solution => discarded the unreplicated writes in old leader
2. Split-Brain: Two nodes both believe that they are the leader. Both leader accept writes, and there is no process for resolving conflict. Solution => shut down one node if two leaders are detected
3. Timeout is too short, there could be unnecessary failovers. Sometimes, the temporary load spike could cause a nodes response time to increase above the timeout