Timeout issue with async commit, we want to commit every message after consuming #4827

chunaiarun · 2024-08-23T13:44:07Z

Describe the bug
We started 32 consumers to process 7 million messages from 32 partitions (from the Kafka queue).
We are doing manual async commits.

To Reproduce

Have 7 million messages produced and distributed equally across 32 partitions.
Configure the consumer code to do "manual async commit".
Then start 32 consumers at the same time.
We saw the timeout errors below in almost all consumer logs.
Expected behavior
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #0)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #1)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #2)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #3)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #4)")"
"(4i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out 66792 in-flight, 0 retry-queued, 95763 out-queue, 0 partially-sent requests")"
"(3i;"FAIL";"[thrd:GroupCoordinator]: GroupCoordinator: :443: 162555 request(s) timed out

FYI
(fetch.wait.max.ms;10);
(statistics.interval.ms;10000);
(enable.auto.commit;false);
(enable.auto.offset.store;false);
(message.max.bytes;1000000000) );


We would like to commit every message here. Is there a way to get rid of this timeout? I tried increasing socket.timeout.ms from the default 60 seconds to 120 seconds. It helped somewhat.

Any suggestions?
We are using https://github.com/KxSystems/kafka (kfk is a thin wrapper for kdb+ around librdkafka C API for Kafka. It is part of the Fusion for kdb+ interface collection)

anchitj · 2024-08-27T12:24:32Z

Can you provide debug logs? Committing for each message isn't recommended.
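
To put the per-message commit volume in perspective, here is a rough back-of-the-envelope sketch. It assumes the 7 million messages are spread evenly over the 32 partitions (as the report states) and that each consumer ends up with one partition; the batch size of 500 is a hypothetical value chosen only for illustration.

```python
import math

TOTAL_MESSAGES = 7_000_000   # from the report
PARTITIONS = 32              # assumed: one consumer per partition
BATCH_SIZE = 500             # hypothetical batch size, for illustration only

# Messages each consumer has to work through.
per_consumer = TOTAL_MESSAGES // PARTITIONS

# One OffsetCommitRequest per message versus one per batch.
commits_per_message = per_consumer
commits_batched = math.ceil(per_consumer / BATCH_SIZE)

print(per_consumer, commits_per_message, commits_batched)
```

Under these assumptions, committing per message issues hundreds of times more commit requests per consumer than committing per batch.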

chunaiarun · 2024-08-28T14:50:24Z

Sure, I will provide them.


chunaiarun · 2024-08-29T10:51:41Z

@anchitj Please find the logs below.

We got this on all 32 consumers during startup.
As we are a regulatory reporting application, we don't want to miss any message, so we have a requirement to commit each message.

,topicpartitionoffsetmetadata!(`xxxx_orders_data;21i;29071312;"")
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/0: Timed out OffsetCommitRequest in flight (after 115414ms, timeout #0)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/0: Timed out OffsetCommitRequest in flight (after 115414ms, timeout #1)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/0: Timed out OffsetCommitRequest in flight (after 115414ms, timeout #2)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/0: Timed out OffsetCommitRequest in flight (after 115414ms, timeout #3)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/0: Timed out OffsetCommitRequest in flight (after 115414ms, timeout #4)")"
"(4i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/0: Timed out 323 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests")"
"(3i;"FAIL";"[thrd:GroupCoordinator]: GroupCoordinator: xxxx-qa.aws-nonprod.xxxcom:443: 323 request(s) timed out: disconnect (average rtt 111018.109ms) (after 138236ms in state UP)")"

hgeraldino · 2024-09-24T21:18:31Z

Isn't this covered in https://github.com/confluentinc/librdkafka/wiki/FAQ#why-committing-each-message-is-slow ?

Chances are you're experiencing exactly the behavior described in that FAQ entry, and the recommendation is (and has always been) to avoid manual commits (even async ones) and use the auto-commit + store_offsets() API.

I honestly don't think this is a bug
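
For readers unfamiliar with the auto-commit plus store_offsets() pattern mentioned above, here is a minimal sketch. It is written against the Python client (confluent_kafka) rather than the kfk wrapper used in this report, which is an assumption made for illustration; the broker address and group id are placeholders, and `consume_and_store` and `handle` are hypothetical names. The consumer object is passed in, so any object exposing `poll()` and `store_offsets()` will do.

```python
# Settings intended for a confluent_kafka.Consumer (assumed client, not kfk).
# bootstrap.servers and group.id are placeholders.
CONF = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "example-group",
    "enable.auto.commit": True,         # stored offsets are committed in the background
    "enable.auto.offset.store": False,  # the application decides which offsets are stored
    "auto.commit.interval.ms": 5000,
}


def consume_and_store(consumer, handle, max_messages, poll_timeout=1.0):
    """Process up to max_messages, storing each offset only after handle() returns."""
    processed = 0
    while processed < max_messages:
        msg = consumer.poll(poll_timeout)
        if msg is None:
            break  # nothing arrived within the timeout; stop this sketch
        if msg.error():
            continue  # a real application would log or raise here
        handle(msg)
        # Mark the message as done locally; no request is sent per message.
        consumer.store_offsets(message=msg)
        processed += 1
    return processed
```

With this setup an offset only becomes eligible for commit after its message has been processed, so a crash leads to redelivery rather than a skipped message, while the number of commit requests is bounded by the auto-commit interval instead of the message rate.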

emasab · 2024-10-15T17:14:34Z

> we don't want to miss any message, so we have a requirement to commit each message

In this case you should use sync commits. To reduce the number of commits, process a small batch of messages and then commit them all together instead of committing after each message: if you don't wait for the response, the commit requests queue up on the broker and ultimately time out. Thanks @hgeraldino, as you pointed out the FAQ is clear about that.
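
The batch-then-sync-commit suggestion above could look roughly like the following. This is a sketch, not the kfk API: it assumes the Python client (confluent_kafka), with `enable.auto.commit` set to false and `enable.auto.offset.store` left at its default of true so that a plain `commit()` covers everything consumed so far. `consume_in_batches`, `handle`, and the batch size are hypothetical.

```python
def consume_in_batches(consumer, handle, batch_size=100, timeout=1.0, max_batches=None):
    """Process messages in small batches, committing synchronously after each batch."""
    batches = 0
    while max_batches is None or batches < max_batches:
        # Fetch up to batch_size messages, waiting at most `timeout` seconds.
        msgs = consumer.consume(num_messages=batch_size, timeout=timeout)
        if not msgs:
            break  # nothing to do; stop this sketch
        good = [m for m in msgs if not m.error()]
        for m in good:
            handle(m)
        if good:
            # Blocks until the broker answers, so commit requests never pile up.
            consumer.commit(asynchronous=False)
        batches += 1
    return batches
```

If `handle` raises partway through a batch, no commit is sent for that batch, so its messages are delivered again after a restart. That gives at-least-once processing, which is why the handler should tolerate duplicates.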

emasab closed this as completed Oct 15, 2024

emasab closed this as not planned Oct 15, 2024
