Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

~100% CPU usage for rdk:kafka #3107

Closed
5 of 7 tasks
tasneem-hyder opened this issue Oct 13, 2020 · 2 comments
Closed
5 of 7 tasks

~100% CPU usage for rdk:kafka #3107

tasneem-hyder opened this issue Oct 13, 2020 · 2 comments

Comments

@tasneem-hyder
Copy link

tasneem-hyder commented Oct 13, 2020

Description

I have two python process running separately with same CGROUP1 which has two different kind of kafka implementation:
2 members (created by two different kafka modules listening to two different topics)
#1st member created with aiokafka (non-rdkafka) with 1 subscription topic TOPIC1-RDKAFKA with 3 partitions
#2nd member created with rdkafka with 1 subscription topic TOPIC1-AIOKAFKA with 3 partitions

We recently noticed the below issue with confluent-kafka : 1.5.0 running process:

python: rdkafka_roundrobin_assignor.c:97: rd_kafka_roundrobin_assignor_assign_cb: Assertion `next < rd_list_cnt(&eligible_topic-> members)' failed.

And, found that the fix is updated in v1.5.2-RC1, so we downloaded and built manually to work with latest rdkafka version.
As a result the above issue is gone however it is behaving weirdly as below:

  • The moment topics getting the data to consume, the rdkafka process gets stuck
  • The consumer debug log ends with below and suddenly it shows ~100% CPU usage for rdk:kafka and the processing just stuck there
%7|1602577461.966|CLOSE|rdkafka#consumer-2| [thrd:app]: Closing consumer
%7|1602577461.966|CLOSE|rdkafka#consumer-2| [thrd:app]: Waiting for close events

Please suggest us some pointer to look further.

Log snippet:

%7|1602577440.705|HEARTBEAT|rdkafka#consumer-2| [thrd:main]: GroupCoordinator/71: Heartbeat for group "CGROUP1" generation id 37318
%7|1602577443.028|OFFSET|rdkafka#consumer-2| [thrd:main]: Topic TOPIC1-RDKAFKA [0]: stored offset -1001, committed offset 490752: not including in commit
%7|1602577443.028|OFFSET|rdkafka#consumer-2| [thrd:main]: Topic TOPIC1-RDKAFKA [1]: stored offset -1001, committed offset 419349: not including in commit
%7|1602577443.028|OFFSET|rdkafka#consumer-2| [thrd:main]: Topic TOPIC1-RDKAFKA [2]: stored offset -1001, committed offset 384877: not including in commit
%7|1602577443.028|COMMIT|rdkafka#consumer-2| [thrd:main]: OffsetCommit for 3 partition(s): cgrp auto commit timer: returned: Local: No offset stored
%7|1602577443.706|HEARTBEAT|rdkafka#consumer-2| [thrd:main]: GroupCoordinator/71: Heartbeat for group "CGROUP1" generation id 37318
%7|1602577446.707|HEARTBEAT|rdkafka#consumer-2| [thrd:main]: GroupCoordinator/71: Heartbeat for group "CGROUP1" generation id 37318
%7|1602577446.968|HEARTBEAT|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" heartbeat error response in state up (join state started, 3 partition(s) assigned): Broker: Group rebalance in progress
%7|1602577446.968|REBALANCE|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" is rebalancing in state up (join-state started) with assignment: group is rebalancing
%7|1602577446.968|PAUSE|rdkafka#consumer-2| [thrd:main]: Library pausing 3 partition(s)
%7|1602577446.968|BARRIER|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [0]: rd_kafka_toppar_op_pause_resume:2397: new version barrier v3
%7|1602577446.968|PAUSE|rdkafka#consumer-2| [thrd:main]: Pause TOPIC1-RDKAFKA [0] (v3)
%7|1602577446.968|BARRIER|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [1]: rd_kafka_toppar_op_pause_resume:2397: new version barrier v3
%7|1602577446.968|PAUSE|rdkafka#consumer-2| [thrd:main]: Pause TOPIC1-RDKAFKA [1] (v3)
%7|1602577446.968|BARRIER|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [2]: rd_kafka_toppar_op_pause_resume:2397: new version barrier v3
%7|1602577446.968|PAUSE|rdkafka#consumer-2| [thrd:main]: Pause TOPIC1-RDKAFKA [2] (v3)
%7|1602577446.968|ASSIGN|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": delegating revoke of 3 partition(s) to application rebalance callback on queue rd_kafka_cgrp_new: group is rebalancing
%7|1602577446.968|CGRPJOINSTATE|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" changed join state started -> wait-revoke-rebalance_cb (v4, state up)
%7|1602577446.968|OP|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [0] received op PAUSE (v3) in fetch-state active (opv2)
%7|1602577446.969|PAUSE|rdkafka#consumer-2| [thrd:main]: Pause TOPIC1-RDKAFKA [0]: at offset 490799 (state active, v3)
%7|1602577446.969|OP|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [1] received op PAUSE (v3) in fetch-state active (opv2)
%7|1602577446.969|PAUSE|rdkafka#consumer-2| [thrd:main]: Pause TOPIC1-RDKAFKA [1]: at offset 419410 (state active, v3)
%7|1602577446.969|OP|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [2] received op PAUSE (v3) in fetch-state active (opv2)
%7|1602577446.969|PAUSE|rdkafka#consumer-2| [thrd:main]: Pause TOPIC1-RDKAFKA [2]: at offset 384934 (state active, v3)
%7|1602577446.969|CGRPOP|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" received op ASSIGN (v0) in state up (join state wait-revoke-rebalance_cb, v4 vs 0)
%7|1602577446.969|ASSIGN|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": new assignment of 0 partition(s) in join state wait-revoke-rebalance_cb
%7|1602577446.969|BARRIER|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": rd_kafka_cgrp_assign:2641: new version barrier v5
%7|1602577446.969|CGRPJOINSTATE|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" changed join state wait-revoke-rebalance_cb -> wait-unassign (v5, state up)
%7|1602577446.969|BARRIER|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": rd_kafka_cgrp_unassign:2547: new version barrier v6
%7|1602577446.969|UNASSIGN|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": unassigning 3 partition(s) (v6)
%7|1602577446.969|OFFSET|rdkafka#consumer-2| [thrd:main]: Topic TOPIC1-RDKAFKA [0]: stored offset 490799, committed offset 490752: setting stored offset 490799 for commit
%7|1602577446.969|OFFSET|rdkafka#consumer-2| [thrd:main]: Topic TOPIC1-RDKAFKA [1]: stored offset 419410, committed offset 419349: setting stored offset 419410 for commit
%7|1602577446.969|OFFSET|rdkafka#consumer-2| [thrd:main]: Topic TOPIC1-RDKAFKA [2]: stored offset 384934, committed offset 384877: setting stored offset 384934 for commit
%7|1602577446.969|COMMIT|rdkafka#consumer-2| [thrd:main]: GroupCoordinator/71: Committing offsets for 3 partition(s): unassign
%7|1602577446.970|OFFSET|rdkafka#consumer-2| [thrd:main]: GroupCoordinator/71: Enqueue OffsetCommitRequest(v7, 3/3 partition(s))): unassign
%7|1602577446.970|BARRIER|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [0]: rd_kafka_toppar_op_fetch_stop:2338: new version barrier v4
%7|1602577446.970|CONSUMER|rdkafka#consumer-2| [thrd:main]: Stop consuming TOPIC1-RDKAFKA [0] (v4)
%7|1602577446.970|DESP|rdkafka#consumer-2| [thrd:main]: Removing (un)desired topic TOPIC1-RDKAFKA [0]
%7|1602577446.970|BARRIER|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [1]: rd_kafka_toppar_op_fetch_stop:2338: new version barrier v4
%7|1602577446.970|CONSUMER|rdkafka#consumer-2| [thrd:main]: Stop consuming TOPIC1-RDKAFKA [1] (v4)
%7|1602577446.970|DESP|rdkafka#consumer-2| [thrd:main]: Removing (un)desired topic TOPIC1-RDKAFKA [1]
%7|1602577446.970|BARRIER|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [2]: rd_kafka_toppar_op_fetch_stop:2338: new version barrier v4
%7|1602577446.970|CONSUMER|rdkafka#consumer-2| [thrd:main]: Stop consuming TOPIC1-RDKAFKA [2] (v4)
%7|1602577446.970|DESP|rdkafka#consumer-2| [thrd:main]: Removing (un)desired topic TOPIC1-RDKAFKA [2]
%7|1602577446.970|RESUME|rdkafka#consumer-2| [thrd:main]: Library resuming 3 partition(s)
%7|1602577446.970|BARRIER|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [0]: rd_kafka_toppar_op_pause_resume:2397: new version barrier v5
%7|1602577446.970|RESUME|rdkafka#consumer-2| [thrd:main]: Resume TOPIC1-RDKAFKA [0] (v5)
%7|1602577446.970|BARRIER|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [1]: rd_kafka_toppar_op_pause_resume:2397: new version barrier v5
%7|1602577446.970|RESUME|rdkafka#consumer-2| [thrd:main]: Resume TOPIC1-RDKAFKA [1] (v5)
%7|1602577446.970|BARRIER|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [2]: rd_kafka_toppar_op_pause_resume:2397: new version barrier v5
%7|1602577446.970|RESUME|rdkafka#consumer-2| [thrd:main]: Resume TOPIC1-RDKAFKA [2] (v5)
%7|1602577446.970|UNASSIGN|rdkafka#consumer-2| [thrd:main]: Unassign not done yet (3 wait_unassign, 3 assigned, 1 wait commit, join state wait-unassign): unassign
%7|1602577446.970|ASSIGN|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": assigning 0 partition(s) in join state wait-unassign
%7|1602577446.970|OP|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [0] received op FETCH_STOP (v4) in fetch-state active (opv3)
%7|1602577446.970|FETCH|rdkafka#consumer-2| [thrd:main]: Stopping fetch for TOPIC1-RDKAFKA [0] in state active (v4)
%7|1602577446.970|PARTSTATE|rdkafka#consumer-2| [thrd:main]: Partition TOPIC1-RDKAFKA [0] changed fetch state active -> stopping
%7|1602577446.970|STORETERM|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [0]: offset store terminating
%7|1602577446.970|PARTSTATE|rdkafka#consumer-2| [thrd:main]: Partition TOPIC1-RDKAFKA [0] changed fetch state stopping -> stopped
%7|1602577446.970|OP|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [1] received op FETCH_STOP (v4) in fetch-state active (opv3)
%7|1602577446.970|FETCH|rdkafka#consumer-2| [thrd:main]: Stopping fetch for TOPIC1-RDKAFKA [1] in state active (v4)
%7|1602577446.970|PARTSTATE|rdkafka#consumer-2| [thrd:main]: Partition TOPIC1-RDKAFKA [1] changed fetch state active -> stopping
%7|1602577446.970|STORETERM|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [1]: offset store terminating
%7|1602577446.970|PARTSTATE|rdkafka#consumer-2| [thrd:main]: Partition TOPIC1-RDKAFKA [1] changed fetch state stopping -> stopped
%7|1602577446.970|OP|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [2] received op FETCH_STOP (v4) in fetch-state active (opv3)
%7|1602577446.971|FETCH|rdkafka#consumer-2| [thrd:main]: Stopping fetch for TOPIC1-RDKAFKA [2] in state active (v4)
%7|1602577446.971|PARTSTATE|rdkafka#consumer-2| [thrd:main]: Partition TOPIC1-RDKAFKA [2] changed fetch state active -> stopping
%7|1602577446.971|STORETERM|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [2]: offset store terminating
%7|1602577446.971|PARTSTATE|rdkafka#consumer-2| [thrd:main]: Partition TOPIC1-RDKAFKA [2] changed fetch state stopping -> stopped
%7|1602577446.971|OP|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [0] received op PAUSE (v5) in fetch-state stopped (opv4)
%7|1602577446.971|RESUME|rdkafka#consumer-2| [thrd:main]: Not resuming stopped TOPIC1-RDKAFKA [0]: at offset 490799 (state stopped, v5)
%7|1602577446.971|OP|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [1] received op PAUSE (v5) in fetch-state stopped (opv4)
%7|1602577446.971|RESUME|rdkafka#consumer-2| [thrd:main]: Not resuming stopped TOPIC1-RDKAFKA [1]: at offset 419410 (state stopped, v5)
%7|1602577446.971|OP|rdkafka#consumer-2| [thrd:main]: TOPIC1-RDKAFKA [2] received op PAUSE (v5) in fetch-state stopped (opv4)
%7|1602577446.971|RESUME|rdkafka#consumer-2| [thrd:main]: Not resuming stopped TOPIC1-RDKAFKA [2]: at offset 384934 (state stopped, v5)
%7|1602577446.971|CGRPOP|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" received op PARTITION_LEAVE in state up (join state wait-unassign, v6) for TOPIC1-RDKAFKA [0]
%7|1602577446.971|PARTDEL|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": delete TOPIC1-RDKAFKA [0]
%7|1602577446.971|CGRPOP|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v6) for TOPIC1-RDKAFKA [0]
%7|1602577446.971|UNASSIGN|rdkafka#consumer-2| [thrd:main]: Unassign not done yet (2 wait_unassign, 2 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done
%7|1602577446.971|CGRPOP|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" received op PARTITION_LEAVE in state up (join state wait-unassign, v6) for TOPIC1-RDKAFKA [1]
%7|1602577446.971|PARTDEL|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": delete TOPIC1-RDKAFKA [1]
%7|1602577446.971|CGRPOP|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v6) for TOPIC1-RDKAFKA [1]
%7|1602577446.971|UNASSIGN|rdkafka#consumer-2| [thrd:main]: Unassign not done yet (1 wait_unassign, 1 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done
%7|1602577446.971|CGRPOP|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" received op PARTITION_LEAVE in state up (join state wait-unassign, v6) for TOPIC1-RDKAFKA [2]
%7|1602577446.971|PARTDEL|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": delete TOPIC1-RDKAFKA [2]
%7|1602577446.971|CGRPOP|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" received op REPLY:FETCH_STOP in state up (join state wait-unassign, v6) for TOPIC1-RDKAFKA [2]
%7|1602577446.971|UNASSIGN|rdkafka#consumer-2| [thrd:main]: Unassign not done yet (0 wait_unassign, 0 assigned, 1 wait commit, join state wait-unassign): FETCH_STOP done
%7|1602577446.976|FETCHADD|rdkafka#consumer-2| [thrd:ssl://*****.com:9093/69]: ssl://*****.com:9093/69: Removed TOPIC1-RDKAFKA [0] from fetch list (0 entries, opv 2): not in active fetch state
%7|1602577447.135|FETCHADD|rdkafka#consumer-2| [thrd:ssl://*****.com:9093/72]: ssl://*****.com:9093/72: Removed TOPIC1-RDKAFKA [1] from fetch list (0 entries, opv 2): not in active fetch state
%7|1602577447.211|FETCHADD|rdkafka#consumer-2| [thrd:ssl://*****.com:9093/71]: ssl://*****.com:9093/71: Removed TOPIC1-RDKAFKA [2] from fetch list (0 entries, opv 2): not in active fetch state
%7|1602577447.287|COMMIT|rdkafka#consumer-2| [thrd:main]: GroupCoordinator/71: OffsetCommit for 3 partition(s): unassign: returned: Success
%7|1602577447.287|UNASSIGN|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": unassign done in state up (join state wait-unassign): without new assignment: OffsetCommit done
%7|1602577447.287|CGRPJOINSTATE|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" changed join state wait-unassign -> init (v6, state up)
%7|1602577447.287|JOIN|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": join with 1 (1) subscribed topic(s)
%7|1602577447.287|CGRPMETADATA|rdkafka#consumer-2| [thrd:main]: consumer join: metadata for subscription is up to date (38996ms old)
%7|1602577447.287|JOIN|rdkafka#consumer-2| [thrd:main]: ssl://*****.com:9093/71: Joining group "CGROUP1" with 1 subscribed topic(s)
%7|1602577447.287|CGRPJOINSTATE|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" changed join state init -> wait-join (v6, state up)
%7|1602577447.548|JOINGROUP|rdkafka#consumer-2| [thrd:main]: JoinGroup response: GenerationId 37319, Protocol roundrobin, LeaderId rdkafka-d2249342-f723-4ff8-b2b8-51912df99049 (me), my MemberId rdkafka-d2249342-f723-4ff8-b2b8-51912df99049, 2 members in group: (no error)
%7|1602577447.548|JOINGROUP|rdkafka#consumer-2| [thrd:main]: Elected leader for group "CGROUP1" with 2 member(s)
%7|1602577447.549|GRPLEADER|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1": resetting group leader info: JoinGroup response clean-up
%7|1602577447.549|CGRPJOINSTATE|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" changed join state wait-join -> wait-metadata (v6, state up)
%7|1602577447.812|METADATA|rdkafka#consumer-2| [thrd:main]:   Topic TOPIC1-RDKAFKA partition 0 Leader 69
%7|1602577447.812|METADATA|rdkafka#consumer-2| [thrd:main]:   Topic TOPIC1-RDKAFKA partition 1 Leader 72
%7|1602577447.812|METADATA|rdkafka#consumer-2| [thrd:main]:   Topic TOPIC1-RDKAFKA partition 2 Leader 71
%7|1602577447.812|METADATA|rdkafka#consumer-2| [thrd:main]: GroupCoordinator/71: 2/2 requested topic(s) seen in metadata
%7|1602577447.812|ASSIGN|rdkafka#consumer-2| [thrd:main]: Group "CGROUP1" running roundrobin assignment for 2 member(s):
%7|1602577447.812|ASSIGN|rdkafka#consumer-2| [thrd:main]:  Member "aiokafka-0.6.0-65adf0cf-1885-4898-83f6-6d2f98ecb02a" with 1 subscription(s):
%7|1602577447.812|ASSIGN|rdkafka#consumer-2| [thrd:main]:   TOPIC1-AIOKAFKA [-1]
%7|1602577447.812|ASSIGN|rdkafka#consumer-2| [thrd:main]:  Member "rdkafka-d2249342-f723-4ff8-b2b8-51912df99049" (me) with 1 subscription(s):
%7|1602577447.812|ASSIGN|rdkafka#consumer-2| [thrd:main]:   TOPIC1-RDKAFKA [-1]
%7|1602577447.813|ASSIGN|rdkafka#consumer-2| [thrd:main]: roundrobin: Member "aiokafka-0.6.0-65adf0cf-1885-4898-83f6-6d2f98ecb02a": assigned topic TOPIC1-AIOKAFKA partition 0
%7|1602577447.813|ASSIGN|rdkafka#consumer-2| [thrd:main]: roundrobin: Member "aiokafka-0.6.0-65adf0cf-1885-4898-83f6-6d2f98ecb02a": assigned topic TOPIC1-AIOKAFKA partition 1
%7|1602577447.813|ASSIGN|rdkafka#consumer-2| [thrd:main]: roundrobin: Member "aiokafka-0.6.0-65adf0cf-1885-4898-83f6-6d2f98ecb02a": assigned topic TOPIC1-AIOKAFKA partition 2
%7|1602577461.966|CLOSE|rdkafka#consumer-2| [thrd:app]: Closing consumer
%7|1602577461.966|CLOSE|rdkafka#consumer-2| [thrd:app]: Waiting for close events

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): v1.5.0 and v1.5.2-RC1
  • Apache Kafka version: 5.5.0 Confluent Platform
  • librdkafka client configuration: 'debug': 'consumer,cgrp,topic'
  • Operating system: 4.19.76-linuxkit #1 SMP Tue May 26 11:42:35 UTC 2020 x86_64 GNU/Linux
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue

Environment info

python : 3.8 , confluent-kafka : 1.5.0 , rdkafka : v1.5.0 and v1.5.2-RC1
Docker OS: 4.19.76-linuxkit #1 SMP Tue May 26 11:42:35 UTC 2020 x86_64 GNU/Linux
Running the consumer application in Docker container with 'debug': 'consumer,cgrp,topic'
5.5.0 Confluent Platform

@edenhill
Copy link
Contributor

edenhill commented Nov 4, 2020

Can you try to reproduce this with 'debug': 'all' and provide the logs? Thanks

@plachor
Copy link

plachor commented Dec 1, 2020

might be related to #3159 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants