KAFKA-16637: AsyncKafkaConsumer removes offset fetch responses from cache too aggressively #15844

Closed

Conversation

@kirktrue (Collaborator) commented May 2, 2024

This issue is related to an optimization for offset fetch logic.

When a user calls Consumer.poll(), among other things, the consumer performs a network request to fetch any previously-committed offsets so it can determine where to start fetching new records. When the user passes in a timeout of zero, the offset fetch network request will almost never complete within 0 milliseconds. However, the consumer still sends the request and handles the response when it arrives, usually a few milliseconds later. On this first attempt, the lookup fails and poll() loops back around. Because a zero timeout is the common case, the consumer caches the offset fetch response/result from the first attempt (even though it timed out), knowing that the next call to poll() will attempt the exact same operation. When the operation is attempted a second time, the response from the first attempt is already there, so the consumer doesn't need to perform another network request.
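
For context, this is a minimal sketch of the zero-timeout calling pattern described above, using the public KafkaConsumer API; the broker address, group id, and topic name are placeholders rather than anything from this PR:

  import java.time.Duration;
  import java.util.Collections;
  import java.util.Properties;

  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class ZeroTimeoutPollExample {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
          props.put("group.id", "zero-timeout-example");       // placeholder group id
          props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(Collections.singletonList("example-topic"));

              while (true) {
                  // The first zero-timeout calls typically return nothing: the offset fetch request
                  // has been sent, but its response rarely arrives within 0 ms. The cached result
                  // from that first attempt is what lets a later call make progress.
                  ConsumerRecords<String, String> records = consumer.poll(Duration.ZERO);
                  records.forEach(r -> System.out.println(r.key() + " => " + r.value()));
              }
          }
      }
  }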

The existing consumer implements this caching in PendingCommittedOffsetRequest; the new consumer implements it in CommitRequestManager. The core issue is that the new consumer's implementation clears out the first attempt's cached result too aggressively. The effect is that the second (and subsequent) attempts fail to find the previous attempt's cached result, so each submits a new network request, and each of those fails in turn. Thus the consumer never makes any headway.
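
As a rough illustration of the caching pattern being described (hypothetical names throughout, not the actual PendingCommittedOffsetRequest or CommitRequestManager code), the intent is that a second attempt covering the same partitions reuses the first attempt's in-flight request rather than sending a new one:

  import java.util.Objects;
  import java.util.Set;
  import java.util.concurrent.CompletableFuture;

  class OffsetFetchReuseSketch<R> {

      // Remembers the first attempt's request and the partitions it covers.
      private Set<String> inflightPartitions;
      private CompletableFuture<R> inflightResult;

      // Called on each attempt: reuse the in-flight request when it covers the same partitions,
      // even if the caller's previous timeout already elapsed, so its eventual response is not lost.
      CompletableFuture<R> fetchOffsets(Set<String> partitions) {
          if (inflightResult != null && Objects.equals(inflightPartitions, partitions)) {
              return inflightResult;  // second and later attempts: no new network request
          }
          inflightPartitions = partitions;
          inflightResult = sendOffsetFetchRequest(partitions);
          // The bug described above: clearing this cached entry as soon as the caller's timeout
          // expires means the next attempt never finds it and sends yet another request.
          return inflightResult;
      }

      private CompletableFuture<R> sendOffsetFetchRequest(Set<String> partitions) {
          return new CompletableFuture<>();  // stand-in for the real network call
      }
  }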

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kirktrue changed the title from "KAFKA-16637: KIP-848 does not work well" to "KAFKA-16637: AsyncKafkaConsumer removes offset fetch responses from cache too aggressively" on May 2, 2024
@kirktrue (Collaborator, Author) commented May 2, 2024

> @kirktrue Should we rename the title of the PR/Jira now that we understand the root cause?

Done 👍

@kirktrue marked this pull request as draft May 2, 2024 19:27
@AndrewJSchofield (Member) left a comment


Please can we have a test which validates this new behaviour? It's quite subtle and I worry that it might be inadvertently broken by an adjacent code change in the future.

     * almost immediately.
     */
    private void maybeRemoveInflightOffsetFetch(OffsetFetchRequestState fetchRequest, Throwable error) {
        if (error == null && !fetchRequest.isExpired) {
A reviewer (Member) commented on this hunk:

This line implies a big change in the current logic, one that I wonder whether we're taking too far. I agree with not removing the expired requests (that's the root cause of the problem we have), but why put all errors (not only timeouts) in the same bucket? With this new check, how are we ensuring that fetch requests that fail fatally are removed from the in-flight queue?
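
One possible reading of this suggestion, sketched with purely illustrative types and names rather than the code under review: keep an in-flight fetch only when it merely timed out without receiving a response, and remove it on any error so fatally-failed requests don't linger.

  class InflightRemovalSketch {

      // Illustrative stand-in for the request state; not the real OffsetFetchRequestState.
      static class FetchRequestState {
          boolean expired;      // the caller's deadline passed before a response arrived
          boolean hasResponse;  // a response has been received and processed
          Throwable error;      // non-null if the request failed
      }

      // Keep the in-flight entry only when it merely timed out while still awaiting a response;
      // completed requests and failed ones (including fatal errors) are removed.
      static boolean shouldKeepInflight(FetchRequestState request) {
          return request.error == null && !request.hasResponse && request.expired;
      }
  }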

@lianetm (Member) commented May 6, 2024

One concern is in the comment above, about how we identify this situation (in-flight fetch requests that we shouldn't delete too soon).

Another is about where to consider the situation. In-flight requests are removed in two places: the direct call to fetch (handled in this PR), but also from the commit manager's poll. The commit manager (like the other managers) has logic for removing all expired requests in its poll loop, when it calls failAndRemoveExpiredFetchRequests. Shouldn't we consider that too?

@lianetm (Member) commented May 6, 2024

Did we consider the approach of simply decoupling the request timeout from the application event timeout? We could issue the fetch request without a time boundary (max value probably), and get the application event result with the time boundary (here).

Expressing the intention when creating the request and the event seems clearer and gives us what we want: fetch requests would remain in the background thread until they get a response or time out, so they could be reused by a following fetch application event (for the same partitions). Then we could keep the manager logic simple and consistent around how in-flight requests are maintained (removed when they get a response or expire, as it is now). I may be missing something; thoughts?
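
A rough sketch of that decoupling, with hypothetical names and no claim to match the actual request manager APIs: the network-level fetch carries an effectively unbounded deadline, while the application event returned to the caller is bounded by the caller's own timeout.

  import java.time.Duration;
  import java.util.Set;
  import java.util.concurrent.CompletableFuture;
  import java.util.concurrent.ExecutionException;
  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.TimeoutException;

  class DecoupledTimeoutsSketch<R> {

      // Background request: lives until it gets a response, independent of the caller's timeout.
      CompletableFuture<R> submitOffsetFetch(Set<String> partitions) {
          long requestDeadlineMs = Long.MAX_VALUE;  // effectively "no caller-imposed expiry"
          return sendWithDeadline(partitions, requestDeadlineMs);
      }

      // Foreground application event: the caller waits only as long as its own timeout allows;
      // the underlying request stays in the background and can satisfy the next attempt.
      R awaitResult(CompletableFuture<R> pending, Duration callerTimeout)
              throws InterruptedException, ExecutionException {
          try {
              return pending.get(callerTimeout.toMillis(), TimeUnit.MILLISECONDS);
          } catch (TimeoutException e) {
              return null;  // the caller ran out of time; the request itself is still in flight
          }
      }

      private CompletableFuture<R> sendWithDeadline(Set<String> partitions, long deadlineMs) {
          return new CompletableFuture<>();  // stand-in for the real network call
      }
  }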

@lianetm (Member) commented May 16, 2024

FYI, I found another issue that looks to me like it's related to this same situation: https://issues.apache.org/jira/browse/KAFKA-16777

@kirktrue (Collaborator, Author):

Thanks for the review, @lianetm. Agreed on all your points. This PR is a draft because it's a PoC and more thought and tests are needed.

@kirktrue closed this May 31, 2024
@kirktrue deleted the KAFKA-16637-keep-cached-offset-fetch-result branch May 31, 2024 13:32
@lianetm (Member) commented Jun 3, 2024

Hey @kirktrue, this is the simple integration test I was suggesting on the other timeout PR. It should be an easy way to ensure that we're able to fetch offsets and make progress with continuous poll(ZERO), which is what I expect this PR should unblock (just as guidance; it needs tweaks to ensure TestUtils polls with 0). Leaving it here in case it helps validate the changes:

  // Ensure TestUtils polls with ZERO. This fails for the new consumer only.
  @ParameterizedTest(name = TestInfoUtils.TestWithParameterizedQuorumAndGroupProtocolNames)
  @MethodSource(Array("getTestQuorumAndGroupProtocolParametersAll"))
  def testPollEventuallyReturnsRecordsWithZeroTimeout(quorum: String, groupProtocol: String): Unit = {
    val numMessages = 100
    val producer = createProducer()
    sendRecords(producer, numMessages, tp)

    val consumer = createConsumer()
    consumer.subscribe(Set(topic).asJava)
    val records = awaitNonEmptyRecords(consumer, tp)
    assertEquals(numMessages, records.count())
  }
