-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[improve][broker] Fix replicated subscriptions race condition with mark delete update and snapshot completion #16651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[improve][broker] Fix replicated subscriptions race condition with mark delete update and snapshot completion #16651
Conversation
|
@lhotari Please provide a correct documentation label for your PR. |
b034bc6 to
a4401df
Compare
|
The pr had no activity for 30 days, mark with Stale label. |
a4401df to
baeba7b
Compare
|
@codelipenghui PTAL |
|
Since we will start the RC version of
So drag this PR to |
baeba7b to
480b2e9
Compare
…ubscription-advanced-markdelete-position
I've validated the assumption and this PR makes sense. Before this PR a race condition is possible where the local cluster's mark delete position is updated before the snapshot is completed. That results in the case, that the most recent snapshot cannot immediately be used for sending the replicated subscription update back to the remote cluster (via writing the marker message to the topic). |
|
There's another related flaw in the replicated subscription snapshots that should be fixed. The commit 50ca143 is already in this direction, but a similar one is needed for the responses. Fixing this would also avoid the current solution of two rounds of snapshot requests when there are more than 2 remote clusters. |
This turned out to be the incorrect direction. However, one change that continues to make sense is cb2d873. In the case of the replication snapshot response, the request position would be sent back instead of the latest topic position. |
This reverts commit cb2d873.
I reverted the change related to this since it doesn't change the behavior when the last position is used. The response will get appended at the end and the subscription update will get sent from the other cluster only after all messages have been acknowledged before that point. For more than 2 cluster configurations, the 2 rounds of snapshots solution is used to avoid message loss (acknowledging messages that haven't been processed). |
…rk delete update and snapshot completion (#16651)
…rk delete update and snapshot completion (#16651)
Fixes #10054
Motivation
There has been a long-standing issue (#10054) where replicated subscriptions were not working correctly across clusters. The test case
testReplicatedSubscriptionAcrossTwoRegionsdemonstrates this problem:pulsar/pulsar-broker/src/test/java/org/apache/pulsar/broker/service/ReplicatedSubscriptionTest.java
Lines 107 to 116 in a1a2b36
Previously, this test would only pass when configured to allow duplicates (
allowDuplicates = true), indicating that subscription replication was not functioning as expected.This PR (#16651) addresses this issue:
Race Condition: A race condition existed between snapshot creation and mark-delete position updates, which could cause synchronization failures. If the mark delete position got updated before the snapshot was completed, it wouldn't be used even when it would have been suitable for the snapshot.
Modifications
Handle
ReplicatedSubscriptionsSnapshoton completion: When a snapshot is completed, it is now processed immediately, passing the current mark-delete position to trigger the update logic. This handles the case where the mark delete position was updated before the snapshot completed.Refactor snapshot handling logic: The previous snapshot handling code has been moved to
notifyTheMarkDeletePositionChanged. Since snapshot handling is idempotent, it's safe to handle snapshots both when they arrive and when the mark-delete position moves forward.Documentation
docdoc-requireddoc-not-neededdoc-complete