Why not release the GIL on rd_kafka_assign and rd_kafka_new ? #1023
Comments
What librdkafka version are you on? I'm guessing the new state synchronization in assign() (which was added to avoid race conditions in the app) may be to blame for this stall, in which case it makes sense to unlock the GIL for assign(), but I would really like to get a reproducer with full logging enabled to understand what is going on. As for rd_kafka_new(), it will not block on any IO; it just waits for its threads to start, which should be instant and not require releasing the GIL. |
@edenhill I've reproduced this on 1.5 and 1.6. Sorry, but do you know how I can fully enable the logs on the consumer side? I do not have easy access to the broker logs unfortunately. Thanks! |
Add |
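The exact setting suggested in the reply above did not survive in this copy of the thread. As a rough sketch only, and assuming the standard librdkafka `debug` configuration property is what was meant, client-side debug logging can be switched on through the consumer configuration. The broker address, group id, the helper name `with_debug`, and the chosen debug contexts below are all hypothetical placeholders.

```python
def with_debug(conf, contexts=("consumer", "cgrp", "security")):
    """Return a copy of conf with librdkafka debug contexts enabled.

    `debug` takes a comma-separated list of contexts; the default chosen
    here is an illustrative assumption, not the value from the thread.
    """
    debug_conf = dict(conf)  # copy so the caller's dict is left untouched
    debug_conf["debug"] = ",".join(contexts)
    return debug_conf


base_conf = {
    "bootstrap.servers": "broker.example.com:9093",  # hypothetical address
    "group.id": "example-group",                     # hypothetical group id
}

debug_conf = with_debug(base_conf)
print(debug_conf["debug"])
```

The resulting dictionary would then be passed to `Consumer(...)` in place of the original configuration; the log lines are emitted by the client, so no broker access is needed.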
Hi @edenhill I added debugging, and immediately noticed one issue: a call is made to refresh the kerberos ticket, and this call takes a very long time, 40-120 seconds. For example:
I also executed the same command directly in my Unix environment, and it also takes about 40 seconds each time. Would you happen to know what might cause this to be so slow? In addition, it seems that for a given consumer, the Kerberos ticket can be refreshed multiple times. In a 5-minute span I counted 4-5 refreshes for each consumer. Do you know why it is necessary to refresh the Kerberos ticket so often? Here are the logs just for
|
Kerberos and me are like this: 💀 I'll move the initial kinit refresh out from rd_kafka_new() to avoid this blocking behaviour. Good find 👍 ! |
Closing. Fix included in the next version. |
Hi @edenhill thanks! I will test it out today. I would have thought that calling Would you please clarify how it works? |
It is not smart like that. librdkafka itself does not know what is going on with Kerberos; it simply calls into libsasl2/sasl-cyrus and kinit shell commands. |
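Since the ticket refresh is just a shell command, its cost can be measured outside of librdkafka, as was done by hand earlier in this thread. The following is a minimal sketch of such a measurement; the helper name `time_command` is hypothetical, and because the actual kinit command line is not shown in the thread, a short Python sleep is used as a stand-in.

```python
import subprocess
import sys
import time


def time_command(argv, timeout=300):
    """Run argv as a subprocess and return (elapsed_seconds, returncode)."""
    start = time.monotonic()
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout
    )
    elapsed = time.monotonic() - start
    return elapsed, completed.returncode


# Stand-in for the real refresh command: replace this argv with the kinit
# invocation that appears in your own debug logs.
stand_in = [sys.executable, "-c", "import time; time.sleep(0.2)"]

elapsed, returncode = time_command(stand_in)
print(f"exit={returncode} elapsed={elapsed:.2f}s")
```

If the command is equally slow when run this way, the delay comes from the Kerberos environment rather than from the Kafka client.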
In Consumer.c's Consumer_assign function, a call is made to rd_kafka_assign that does not drop the GIL. This function only uses C arguments, no PyObject* values. My assumption is that this function results in IO to the broker. I think we can drop the GIL here:
Likewise, in the Consumer_init function, a call is made to rd_kafka_new. I'm not too sure, but I think this results in IO to the broker, and the GIL could also be dropped here:
I would assume that there are more invocations of rd_kafka_* functions that result in IO and where we could drop the GIL.
Is there any reason in particular why the GIL isn't being dropped in these situations?
The reason I ask is because I recently started using a new broker with SSL. If I launch four consumer threads, each thread takes about 30 seconds to issue Consumer(conf), and it blocks all other threads (Kafka consumers, REST API threads, etc.) during this time.
In addition, once each consumer thread's on_assign callback is invoked, the successive call to c.assign(partitions) takes another 30 seconds and blocks all other threads in the application.
I hypothesize that these calls are not releasing the GIL, which is resulting in blocking of all the other threads in the application.
(Of course, it shouldn't be taking this long to call Consumer(conf) and c.assign(partitions); there is probably some network issue in my new broker. But regardless, if the GIL was released, the other threads wouldn't be blocked.)
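The hypothesis above can be illustrated without Kafka at all. The sketch below is not confluent-kafka code: it uses libc's `usleep` as a stand-in for a slow C call such as rd_kafka_new or rd_kafka_assign, and relies on the documented difference between `ctypes.PyDLL` (keeps the GIL during the call) and `ctypes.CDLL` (releases it). It assumes a POSIX system where `ctypes.CDLL(None)` exposes libc and a regular GIL-enabled CPython build; the helper name `count_ticks_during` is hypothetical.

```python
import ctypes
import threading
import time

# Two handles to the same C library (POSIX assumption: None loads the
# process's own symbols, which include libc's usleep).
libc_holding_gil = ctypes.PyDLL(None)    # GIL is held during the C call
libc_releasing_gil = ctypes.CDLL(None)   # GIL is released during the C call


def count_ticks_during(blocking_call, interval=0.01):
    """Run blocking_call here and count background-thread iterations meanwhile."""
    ticks = 0
    started = threading.Event()
    stop = threading.Event()

    def ticker():
        nonlocal ticks
        started.set()
        while not stop.is_set():
            time.sleep(interval)  # needs the GIL again to resume afterwards
            ticks += 1

    worker = threading.Thread(target=ticker)
    worker.start()
    started.wait()
    before = ticks
    blocking_call()
    during = ticks - before
    stop.set()
    worker.join()
    return during


# A 0.3 second C-level sleep stands in for the slow client call.
ticks_held = count_ticks_during(lambda: libc_holding_gil.usleep(300_000))
ticks_released = count_ticks_during(lambda: libc_releasing_gil.usleep(300_000))

print(f"background ticks while GIL held:     {ticks_held}")
print(f"background ticks while GIL released: {ticks_released}")
```

When the GIL is held, the background thread typically makes little or no progress for the duration of the call, which matches the stalls described above; when it is released, the background thread keeps running even though the calling thread is still blocked.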