[fix][broker][branch-3.0] fix prepareInitPoliciesCacheAsync in SystemTopicBasedTopicPoliciesService #24978

TakaHiR07 · 2025-11-13T12:28:01Z

Motivation

As shown in the issue, this PR fixes two problems:
1. cleanCacheAndCloseReader() is executed twice, which causes a concurrency error and leaves too many orphan readers in SystemTopicBasedTopicPoliciesService.
2. A double update of policyCacheInitMap causes a recursive update error.

Modifications

There is no need to call cleanCacheAndCloseReader() when an exception is thrown, since the exception is caught by the outer code. Incidentally, in the earlier Pulsar 2.9.x versions, cleanCacheAndCloseReader() was also executed only once.
Avoid the double update of policyCacheInitMap by using putIfAbsent instead of computeIfAbsent. It is not appropriate to put so many operations inside the compute function.
Add two tests that simulate an exception thrown in createReader, initPolicyCache, or readMorePolicy during prepareInitPoliciesCacheAsync. Incidentally, SystemTopicBasedTopicPoliciesService seems to lack unit tests.
Add a new method, newReader(), to ensure there is only one readerCreateCompletableFuture. This method is actually added for testing. The whole process of prepareInitPoliciesCacheAsync() is: put future -> put reader -> throw exception -> remove reader -> remove future. So even without newReader(), each namespace is guaranteed to have only one reader in readerCache.
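
To illustrate the recursive update problem, here is a minimal, self-contained sketch. It is not the Pulsar code: the class name, the `initMap` field (a stand-in for policyCacheInitMap), and both methods are hypothetical. It assumes Java 9 or later, where ConcurrentHashMap detects a nested modification of a reserved bin and throws IllegalStateException; on Java 8 the first method could hang instead.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class RecursiveUpdateSketch {
    // Hypothetical stand-in for policyCacheInitMap: namespace -> init future.
    static final ConcurrentHashMap<String, CompletableFuture<Void>> initMap =
            new ConcurrentHashMap<>();

    // Old shape (simplified): initialization and failure cleanup both run
    // inside the mapping function, so the cleanup touches the key being computed.
    static boolean computeIfAbsentFails(String ns) {
        try {
            initMap.computeIfAbsent(ns, k -> {
                // Simulated synchronous failure path: cleanup removes the same key.
                initMap.remove(k);
                return new CompletableFuture<>();
            });
            return false;
        } catch (IllegalStateException e) {
            // Java 9+ reports the nested modification as a recursive update.
            return true;
        }
    }

    // New shape (simplified): register the future atomically, then run
    // initialization and cleanup outside of any map callback.
    static boolean putIfAbsentCleansUp(String ns) {
        CompletableFuture<Void> created = new CompletableFuture<>();
        if (initMap.putIfAbsent(ns, created) != null) {
            return false; // another caller is already initializing this namespace
        }
        created.completeExceptionally(new RuntimeException("simulated reader failure"));
        initMap.remove(ns, created); // safe: no mapping function is in progress
        return !initMap.containsKey(ns);
    }

    public static void main(String[] args) {
        System.out.println(computeIfAbsentFails("ns-a"));
        System.out.println(putIfAbsentCleansUp("ns-b"));
    }
}
```

The trade-off of the second shape is that the future is allocated before the map is consulted, which is the point raised in the next section.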

There is one point to consider in this PR:

With putIfAbsent, if many getTopicPolicy() calls trigger prepareInitPoliciesCacheAsync, many empty CompletableFuture objects are created. Furthermore, we could use a double check in the code to avoid creating these garbage objects (though the code would be ugly).
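
The double check mentioned above could look roughly like the following sketch. It is an illustration only and not part of this PR: the class, the `initMap` field, the `allocations` counter, and `getOrRegister` are all hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class DoubleCheckSketch {
    // Hypothetical stand-in for policyCacheInitMap.
    static final ConcurrentHashMap<String, CompletableFuture<Void>> initMap =
            new ConcurrentHashMap<>();
    // Counts how many futures were actually allocated (for demonstration only).
    static final AtomicInteger allocations = new AtomicInteger();

    static CompletableFuture<Void> getOrRegister(String ns) {
        // First check: a plain read, no allocation on the common path.
        CompletableFuture<Void> existing = initMap.get(ns);
        if (existing != null) {
            return existing;
        }
        // Second check: the atomic insert decides which caller wins the race.
        allocations.incrementAndGet();
        CompletableFuture<Void> created = new CompletableFuture<>();
        existing = initMap.putIfAbsent(ns, created);
        return existing != null ? existing : created;
    }

    public static void main(String[] args) {
        getOrRegister("ns");
        getOrRegister("ns");
        System.out.println(allocations.get());
    }
}
```

Under contention a few callers may still allocate a future that loses the putIfAbsent race, but repeated calls after registration allocate nothing.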

Besides, this case still exists: if closing the reader fails in cleanCacheAndCloseReader(), the closing reader may get a chance to reconnect and become an orphan reader. That is outside the scope of this PR.

Verifying this change

[] Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

Documentation

doc
doc-required
doc-not-needed
doc-complete

lhotari · 2025-11-13T17:39:35Z

@TakaHiR07 Thanks for the great analysis and fix. Would it be possible to make a fix to master branch too? Does the problem appear there too? Usually we target master branch first and then backport to maintenance branches.

TakaHiR07 · 2025-11-14T10:01:46Z

@TakaHiR07 Thanks for the great analysis and fix. Would it be possible to make a fix to master branch too? Does the problem appear there too? Usually we target master branch first and then backport to maintenance branches.

@lhotari The problem of catching the exception and cleaning policyCacheInitMap twice appears on master too. I have pushed a PR, #24980. It differs slightly from branch-3.0 because of the modifications made in pr-24658.

lhotari · 2025-11-14T10:37:35Z

@TakaHiR07 Please check this test failure:

  Error:  Tests run: 155, Failures: 1, Errors: 0, Skipped: 53, Time elapsed: 853.185 s <<< FAILURE! - in org.apache.pulsar.broker.admin.TopicPoliciesTest
  Error:  org.apache.pulsar.broker.admin.TopicPoliciesTest.testTopicPoliciesAfterCompaction[Clean_Cache](4)  Time elapsed: 0.367 s  <<< FAILURE!
  java.lang.AssertionError: expected [true] but found [false]
  	at org.testng.Assert.fail(Assert.java:110)
  	at org.testng.Assert.failNotEquals(Assert.java:1577)
  	at org.testng.Assert.assertTrue(Assert.java:56)
  	at org.testng.Assert.assertTrue(Assert.java:66)
  	at org.apache.pulsar.broker.admin.TopicPoliciesTest.testTopicPoliciesAfterCompaction(TopicPoliciesTest.java:3432)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
  	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
  	at org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at java.base/java.lang.Thread.run(Thread.java:840)

Copilot

Pull Request Overview

This PR fixes two critical issues in the SystemTopicBasedTopicPoliciesService that caused orphan readers and recursive update errors: (1) duplicate execution of cleanCacheAndCloseReader() leading to concurrency problems, and (2) improper use of computeIfAbsent() causing recursive updates in policyCacheInitMap.

Key changes include:

Refactored prepareInitPoliciesCacheAsync() to use putIfAbsent() instead of computeIfAbsent() to avoid recursive updates
Removed redundant cleanCacheAndCloseReader() calls from exception handlers in initPolicesCache() since cleanup is now handled once in the outer exception handler
Added a new newReader() method to ensure only one reader future per namespace
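
As an illustration of the single-cleanup idea in the list above, here is a minimal sketch with hypothetical names (the class, `cleanCacheAndCloseReaderStub`, `initPolicyCache`, and `prepareInit` are stand-ins, not the Pulsar implementation). The inner stage only propagates its failure, and the outer handler is the one place where cleanup runs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

class SingleCleanupSketch {
    static final AtomicInteger cleanupCalls = new AtomicInteger();

    // Hypothetical stand-in for cleanCacheAndCloseReader(); only counts invocations.
    static void cleanCacheAndCloseReaderStub() {
        cleanupCalls.incrementAndGet();
    }

    // Inner stage: reports failure through the future and performs no cleanup itself.
    static CompletableFuture<Void> initPolicyCache(boolean fail) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        if (fail) {
            result.completeExceptionally(new IllegalStateException("simulated read failure"));
        } else {
            result.complete(null);
        }
        return result;
    }

    // Outer stage: the only place that reacts to a failure with cleanup.
    static CompletableFuture<Void> prepareInit(boolean fail) {
        return initPolicyCache(fail).whenComplete((ignored, ex) -> {
            if (ex != null) {
                cleanCacheAndCloseReaderStub();
            }
        });
    }

    public static void main(String[] args) {
        prepareInit(true);
        System.out.println(cleanupCalls.get());
    }
}
```

If the inner stage also called the cleanup before failing, the counter would reach two for one failure, which is the duplicate execution this PR removes.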

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/SystemTopicBasedTopicPoliciesService.java	Refactored `prepareInitPoliciesCacheAsync()` to prevent double cleanup and recursive updates; added `newReader()` method for proper reader management; removed redundant cleanup calls in `initPolicesCache()`
pulsar-broker/src/test/java/org/apache/pulsar/broker/service/SystemTopicBasedTopicPoliciesServiceTest.java	Added two comprehensive test cases to verify proper behavior when exceptions occur during reader creation and policy cache initialization, ensuring cleanup happens exactly once

Copilot · 2025-11-14T13:35:05Z

...src/test/java/org/apache/pulsar/broker/service/SystemTopicBasedTopicPoliciesServiceTest.java

+        // sleep 500ms to make sure clean operation finish.
+        Thread.sleep(500);
+        Assert.assertTrue(prepareFuture.isCompletedExceptionally());
+        future = spyService.getPoliciesCacheInit(NamespaceName.get(NAMESPACE5));
+        Assert.assertNull(future);
+        CompletableFuture<SystemTopicClient.Reader<PulsarEvent>> readerCompletableFuture1 =
+                spyReaderCaches.get(NamespaceName.get(NAMESPACE5));
+        Assert.assertNull(readerCompletableFuture1);


Using Thread.sleep(500) in tests is brittle and can lead to flaky test behavior. The test already uses Awaitility in other places. Consider using Awaitility.await() with appropriate conditions to verify that the cleanup operations have completed instead of relying on a fixed sleep duration.

Suggested change

-        // sleep 500ms to make sure clean operation finish.
-        Thread.sleep(500);
-        Assert.assertTrue(prepareFuture.isCompletedExceptionally());
-        future = spyService.getPoliciesCacheInit(NamespaceName.get(NAMESPACE5));
-        Assert.assertNull(future);
-        CompletableFuture<SystemTopicClient.Reader<PulsarEvent>> readerCompletableFuture1 =
-                spyReaderCaches.get(NamespaceName.get(NAMESPACE5));
-        Assert.assertNull(readerCompletableFuture1);
+        // Await cleanup operation to finish.
+        Awaitility.await()
+                .atMost(Duration.ofSeconds(2))
+                .untilAsserted(() -> {
+                    Assert.assertTrue(prepareFuture.isCompletedExceptionally());
+                    Assert.assertNull(spyService.getPoliciesCacheInit(NamespaceName.get(NAMESPACE5)));
+                    Assert.assertNull(spyReaderCaches.get(NamespaceName.get(NAMESPACE5)));
+                });

Copilot · 2025-11-14T13:35:06Z

...src/test/java/org/apache/pulsar/broker/service/SystemTopicBasedTopicPoliciesServiceTest.java

+            }
+        };
+        appender.start();
+        logger.get().addAppender(appender, null, null);


The call to logger.get().addAppender(appender, null, null) appears redundant with the following line logger.addAppender(appender). Based on the Log4j2 API, Logger.get() returns the underlying LoggerConfig, and the second call to logger.addAppender(appender) should be sufficient. The first call with three parameters (the third being null) is typically used on LoggerConfig objects. Consider removing the first call to simplify the code and avoid potential confusion.

Suggested change

-        logger.get().addAppender(appender, null, null);

Copilot · 2025-11-14T13:35:06Z

...src/test/java/org/apache/pulsar/broker/service/SystemTopicBasedTopicPoliciesServiceTest.java

+        // sleep 500ms to make sure clean operation finish.
+        Thread.sleep(500);
+        Assert.assertTrue(prepareFuture.isCompletedExceptionally());


Using Thread.sleep(500) in tests is brittle and can lead to flaky test behavior. The test already uses Awaitility in other places (line 595). Consider using Awaitility.await() with appropriate conditions to verify that the cleanup operations have completed instead of relying on a fixed sleep duration.

Suggested change

-        // sleep 500ms to make sure clean operation finish.
-        Thread.sleep(500);
-        Assert.assertTrue(prepareFuture.isCompletedExceptionally());
+        // Wait until cleanup operation finishes.
+        Awaitility.await()
+                .atMost(Duration.ofSeconds(2))
+                .until(() -> prepareFuture.isCompletedExceptionally()
+                        && spyService.getPoliciesCacheInit(NamespaceName.get(NAMESPACE5)) == null
+                        && spyReaderCaches.get(NamespaceName.get(NAMESPACE5)) == null);

Copilot · 2025-11-14T13:35:06Z

...src/test/java/org/apache/pulsar/broker/service/SystemTopicBasedTopicPoliciesServiceTest.java

+            }
+        };
+        appender.start();
+        logger.get().addAppender(appender, null, null);


The call to logger.get().addAppender(appender, null, null) appears redundant with the following line logger.addAppender(appender). Based on the Log4j2 API, Logger.get() returns the underlying LoggerConfig, and the second call to logger.addAppender(appender) should be sufficient. The first call with three parameters (the third being null) is typically used on LoggerConfig objects. Consider removing the first call to simplify the code and avoid potential confusion.

Suggested change

-        logger.get().addAppender(appender, null, null);

TakaHiR07 · 2025-11-14T14:19:44Z

@TakaHiR07 Please check this test failure:

  Error:  Tests run: 155, Failures: 1, Errors: 0, Skipped: 53, Time elapsed: 853.185 s <<< FAILURE! - in org.apache.pulsar.broker.admin.TopicPoliciesTest
  Error:  org.apache.pulsar.broker.admin.TopicPoliciesTest.testTopicPoliciesAfterCompaction[Clean_Cache](4)  Time elapsed: 0.367 s  <<< FAILURE!

@lhotari Fixed. clearTopicPoliciesCache in testTopicPoliciesAfterCompaction should also clear readerCaches, because elements are always put into and removed from readerCaches and policyCacheInitMap together.

lhotari

LGTM

github-actions bot added the doc-not-needed Your PR changes do not impact docs label Nov 13, 2025

TakaHiR07 force-pushed the branch-3.0-fix_prepareInitPoliciesCacheAsync branch from cdff28e to 8941c47 Compare November 13, 2025 12:37

TakaHiR07 changed the title ~~[fix][broker] fix prepareInitPoliciesCacheAsync in SystemTopicBasedTopicPoliciesService~~ [fix][broker][branch-3.0] fix prepareInitPoliciesCacheAsync in SystemTopicBasedTopicPoliciesService Nov 13, 2025

TakaHiR07 force-pushed the branch-3.0-fix_prepareInitPoliciesCacheAsync branch from 8941c47 to 1466a6f Compare November 13, 2025 13:11

fix prepareInitPoliciesCacheAsync() in SystemTopicBasedTopicPoliciesService

b56256f

TakaHiR07 force-pushed the branch-3.0-fix_prepareInitPoliciesCacheAsync branch from 1466a6f to b56256f Compare November 14, 2025 09:51

lhotari added the ready-to-test label Nov 14, 2025

lhotari added the release/3.0.16 label Nov 14, 2025

lhotari requested a review from Copilot November 14, 2025 13:24

Copilot started reviewing on behalf of lhotari November 14, 2025 13:25

Copilot finished reviewing on behalf of lhotari November 14, 2025 13:27

Copilot AI reviewed Nov 14, 2025

fanjianye added 2 commits November 14, 2025 21:49

improve code in test

205a983

fix error in testTopicPoliciesAfterCompaction

d7846f4

Technoboy- assigned TakaHiR07 Nov 19, 2025

lhotari approved these changes Dec 10, 2025

lhotari merged commit 22e0a97 into apache:branch-3.0 Dec 10, 2025
87 of 90 checks passed
