| 1 | +# PIP-377: Automatic retry for failed acknowledgements |
| 2 | + |
| 3 | +## Motivation |
| 4 | + |
| 5 | +Apache Pulsar currently has known gaps in acknowledgement (ack) handling, particularly when key-ordered message processing is required (Failover, Exclusive, or Key_Shared subscriptions) and during broker restarts or topic unloads triggered by Pulsar load balancing. In these scenarios, acknowledgements can be lost. Lost acknowledgements leave consumers stuck because of key-ordered delivery rules and cause additional message duplication, both of which hurt the reliability and end-to-end latency of message processing. |
| 6 | +The goal is a solution that does not require enabling Pulsar transactions. |
| 7 | + |
| 8 | +Pulsar's default mode is at-least-once messaging, so duplicates are acceptable, but lost acknowledgements cause unnecessary duplicate messages. With key-ordered message processing on Key_Shared subscriptions, a lost acknowledgement stops delivery of all further messages that share a key with the message whose acknowledgement was lost. |
| 9 | + |
| 10 | +These situations currently cause unnecessary disruptions to applications that use Key_Shared processing: manual intervention or automated monitoring is needed to detect stuck consumers and to recover by restarting individual consumers. |
| 11 | + |
| 12 | +## Detailed Design |
| 13 | + |
| 14 | +This proposal aims to address these issues by enhancing the existing "ack receipt" feature with an automatic retry mechanism for failed acknowledgements. Users do not need to configure the "ack receipt" feature explicitly when `autoRetryAcknowledgement` is enabled. The solution is built upon the existing "ack receipt" feature at the binary protocol level. The gaps in the current "ack receipt" feature, such as [Bug: When ack receipts are enabled, no response is sent to the client if the topic has been unloaded or is being transferred #23261](https://github.com/apache/pulsar/issues/23261), need to be addressed to achieve the desired outcome. |
| 15 | + |
| 16 | +### Public API |
| 17 | + |
| 18 | +The following new methods will be added to the `ConsumerBuilder` interface: |
| 19 | + |
| 20 | +```java |
| 21 | + /** |
| 22 | + * Enable or disable automatic retry for failed acknowledgements. |
| 23 | + * |
| 24 | + * @param autoRetryAcknowledgement whether to automatically retry failed acknowledgements |
| 25 | + * @return the consumer builder instance |
| 26 | + */ |
| 27 | + ConsumerBuilder<T> autoRetryAcknowledgement(boolean autoRetryAcknowledgement); |
| 28 | + |
| 29 | + /** |
| 30 | + * Overrides the default maximum number of retry attempts for a failed acknowledgement |
| 31 | + * when autoRetryAcknowledgement is enabled. |
| 32 | + * |
| 33 | + * @param maxAckRetries the maximum number of retry attempts |
| 34 | + * @return the consumer builder instance |
| 35 | + */ |
| 36 | + ConsumerBuilder<T> maxAcknowledgementRetries(int maxAckRetries); |
| 37 | + |
| 38 | + /** |
| 39 | +     * Overrides the default retry delay backoff for acknowledgement retries. |
| 40 | + * This is used when autoRetryAcknowledgement is enabled. |
| 41 | + * |
| 42 | + * @param ackRetryBackoff the backoff strategy to use for retries |
| 43 | + * @return the consumer builder instance |
| 44 | + */ |
| 45 | + ConsumerBuilder<T> autoRetryAcknowledgementBackoff(RedeliveryBackoff ackRetryBackoff); |
| 46 | +``` |
| 47 | + |
| 48 | +This example applies to the Pulsar Java client. Other clients can implement similar changes for adding the `autoRetryAcknowledgement` mode. |
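
To make the retry-delay option more concrete, the following is a minimal sketch of how a backoff strategy could map a retry count to a delay. It is an illustration only: the `Backoff` interface, the `exponential` helper, and the numeric values (100 ms minimum, 1000 ms cap, multiplier 2.0, 5 retries) are assumptions made here for the example. They are not part of the proposal and are not Pulsar types.

```java
import java.util.ArrayList;
import java.util.List;

public class AckRetryBackoffSketch {

    /** Local stand-in for a "retry count -> delay" strategy; hypothetical, not a Pulsar type. */
    public interface Backoff {
        long nextDelayMs(int retryCount);
    }

    /** Exponential backoff: minDelayMs * multiplier^retryCount, capped at maxDelayMs. */
    public static Backoff exponential(long minDelayMs, long maxDelayMs, double multiplier) {
        return retryCount -> {
            if (retryCount <= 0) {
                return minDelayMs;
            }
            double delay = minDelayMs * Math.pow(multiplier, retryCount);
            return (long) Math.min(delay, (double) maxDelayMs);
        };
    }

    /** Lists the delay that would precede each retry, for retry counts 0..maxRetries-1. */
    public static List<Long> schedule(Backoff backoff, int maxRetries) {
        List<Long> delays = new ArrayList<>();
        for (int i = 0; i < maxRetries; i++) {
            delays.add(backoff.nextDelayMs(i));
        }
        return delays;
    }

    public static void main(String[] args) {
        // Assumed example values; the proposal does not specify defaults.
        System.out.println(schedule(exponential(100, 1000, 2.0), 5));
        // prints [100, 200, 400, 800, 1000]
    }
}
```

In the proposed API, the strategy would instead be supplied as a `RedeliveryBackoff` through `autoRetryAcknowledgementBackoff`, and the retry limit through `maxAcknowledgementRetries`. Capping the delay keeps a long broker outage from pushing retries arbitrarily far apart.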
| 49 | + |
| 50 | +## Proposed Changes |
| 51 | + |
| 52 | +- Implement a new `autoRetryAcknowledgement` mode for Pulsar clients where acknowledgements that fail (due to broker restarts, topic unloads, Pulsar load balancing, or other issues) are automatically retried by the client. |
| 53 | + |
| 54 | +- Modify the `ServerCnx` class to send failure responses for discarded acknowledgements when ack receipts are enabled, which fixes issue #23261. |
| 55 | + |
| 56 | +- Implement a new component in the client library to manage automatic retries of failed acknowledgements. |
| 57 | + |
| 58 | +- When `autoRetryAcknowledgement` is enabled, the "ack receipt" feature is used internally. One difference is that the `.acknowledge` method should remain asynchronous, with retries happening in the background. The existing "ack receipt" feature makes `.acknowledge` synchronous, which is undesirable for many applications because every acknowledgement then waits for a server round-trip. |
| 59 | + |
| 60 | +- When both `autoRetryAcknowledgement` and "ack receipt" are enabled, the existing "ack receipt" behavior of synchronous acks will be used. The `.acknowledge` method will return only after the ack has succeeded or has failed after all retry attempts. Similarly, the future returned by `.acknowledgeAsync` will complete only after the automatic retry process has finished. |
| 61 | + |
| 62 | +- Update the `ConsumerBuilder` interface to include options for configuring automatic ack retries. This applies to the Java client. Other clients could implement similar changes. |
| 63 | + |
| 64 | +- Implement additional client-side metrics to track failed acknowledgements, retry attempts, and success rates. |
| 65 | + |
| 66 | +- Update relevant documentation to reflect the new feature and its proper usage. |
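
The background retry behavior described in the list above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the proposed implementation: `AckRetrySketch`, its fields, and the simulated `flakyAck` sender are hypothetical names introduced here, a fixed retry delay stands in for a configurable backoff, and a plain `Supplier` stands in for sending an ack and awaiting its receipt. Two counters hint at the kind of client-side metrics the proposal mentions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class AckRetrySketch {
    // Simple counters standing in for client-side metrics (hypothetical).
    public final AtomicInteger attempts = new AtomicInteger();
    public final AtomicInteger failedAttempts = new AtomicInteger();

    private final ScheduledExecutorService scheduler;
    private final int maxRetries;
    private final long retryDelayMs;

    public AckRetrySketch(ScheduledExecutorService scheduler, int maxRetries, long retryDelayMs) {
        this.scheduler = scheduler;
        this.maxRetries = maxRetries;
        this.retryDelayMs = retryDelayMs;
    }

    /** Starts the ack and returns immediately; retries run on the scheduler thread. */
    public CompletableFuture<Void> acknowledge(Supplier<CompletableFuture<Void>> sendAck) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        attempt(sendAck, 0, result);
        return result;
    }

    private void attempt(Supplier<CompletableFuture<Void>> sendAck, int retriesSoFar,
                         CompletableFuture<Void> result) {
        attempts.incrementAndGet();
        sendAck.get().whenComplete((ignored, error) -> {
            if (error == null) {
                result.complete(null); // ack receipt arrived
                return;
            }
            failedAttempts.incrementAndGet();
            if (retriesSoFar >= maxRetries) {
                result.completeExceptionally(error); // retries exhausted
            } else {
                scheduler.schedule(() -> attempt(sendAck, retriesSoFar + 1, result),
                        retryDelayMs, TimeUnit.MILLISECONDS);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            AckRetrySketch retrier = new AckRetrySketch(scheduler, 3, 10);
            AtomicInteger calls = new AtomicInteger();
            // Simulated broker: the first two acks are discarded, the third succeeds.
            Supplier<CompletableFuture<Void>> flakyAck = () -> calls.incrementAndGet() < 3
                    ? CompletableFuture.<Void>failedFuture(new IllegalStateException("ack discarded"))
                    : CompletableFuture.<Void>completedFuture(null);
            retrier.acknowledge(flakyAck).get(5, TimeUnit.SECONDS);
            System.out.println("attempts=" + retrier.attempts.get()
                    + " failed=" + retrier.failedAttempts.get());
            // prints attempts=3 failed=2
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

A caller that wants the synchronous "ack receipt" behavior can block on the returned future, while a caller that wants the asynchronous behavior can ignore it or attach a callback. The sketch does not handle a sender that throws synchronously, which a real component would need to cover.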
| 67 | + |
| 68 | +## Compatibility, Deprecation, and Migration Plan |
| 69 | + |
| 70 | +This feature will be opt-in. It doesn't introduce backwards compatibility issues with existing implementations. Clients not utilizing the new automatic retry option will continue to function as before. No deprecation or migration is required for existing users. |
| 71 | + |
| 72 | +## Test Plan |
| 73 | + |
| 74 | +Comprehensive testing will include: |
| 75 | + |
| 76 | +1. Unit tests for the new retry mechanism. |
| 77 | +2. Integration tests simulating various failure scenarios (broker restarts, topic unloads, network issues). |
| 78 | +3. Performance tests to ensure the retry mechanism does not introduce significant overhead. |
| 79 | + |
| 80 | +## Links |
| 81 | + |
| 82 | +* Mailing List discussion thread: |
| 83 | +* Mailing List voting thread: |