Durability & Fault Tolerance
Durability
- Goal: Minimize message loss.
- Solution:
- Replication Mechanism: Each partition is replicated across multiple brokers (one leader, multiple followers).
- Common practice: Confluent Cloud enforces a replication factor of 3 (1 leader + 2 replicas).
- Producer Acks:
- Set
acks=all
for highest durability: Leader waits for all in-sync replicas (ISRs) to acknowledge the message. - Trade-off: Increased latency.
- Set
- Replication Mechanism: Each partition is replicated across multiple brokers (one leader, multiple followers).
Duplication and Ordering
- Retries for Durability:
- Producers retry sending messages on failure to ensure delivery, controlled by
retries
anddelivery.timeout.ms
(timeout for message validity). - Issues:
- Duplication: Transient failures may cause producers to send duplicate messages.
- Ordering: Retries can disrupt message order if a failed message is sent after a newer one succeeds.
- Solution:
- Set
enable.idempotence=true
:- Ensures exactly-once delivery using sequence numbers tracked by brokers.
- Brokers ignore duplicates and preserve message order.
- If idempotence isn’t possible, consumers must handle duplicates.
- Set
- Producers retry sending messages on failure to ensure delivery, controlled by
Consumer Failure
- Offset Management:
- Consumers commit offsets to Kafka after processing messages.
- On failure and restart, consumers resume from the last committed offset.
- Rebalancing:
- In consumer groups, if a consumer fails, Kafka redistributes partitions among remaining consumers to ensure continuous processing.
Handling Retries and Errors
- Producer Retries:
- Handle errors (e.g., network issues, broker unavailability) by retrying sends.
- Consumer Retries:
- If a message fails processing too many times, move it to a Dead Letter Queue (DLQ) for later investigation.
- DLQ: Stores failed messages; Apache Kafka requires custom implementation (unlike Amazon SQS, which has built-in DLQ support).
Example Command
- Create a topic with 6 partitions and replication factor of 2:
kafka-topics --bootstrap-server localhost:9092 --create --topic my-topic --partitions 6 --replication-factor 2