-
Notifications
You must be signed in to change notification settings - Fork 907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix segfault with describe_topics and flaky connection #1692
Fix segfault with describe_topics and flaky connection #1692
Conversation
|
f1d8c82
to
c1c65e0
Compare
Thanks for the PR. Even though the changes seems correct, I think we should not change the base implementation which is being used in alot of APIs. Some of the values of other APIs must not be NULL. I think its better to handle the scenario in this way - https://github.com/confluentinc/confluent-kafka-python/blob/master/src/confluent_kafka/src/Admin.c#L3198-L3201. I am also checking if it should be possible to get the host information as NULL or should we handle it as an error instead. But in the meantime you can rebase the branch and do the change mentioned above if you have time. Otherwise, you can wait for my revert. |
c1c65e0
to
2627ecc
Compare
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Provided minor comments.
Please update the CHANGELOG.md as well with the entry for the fix.
src/confluent_kafka/src/Admin.c
Outdated
if (rd_kafka_AclBinding_host(c_acl_binding)) { | ||
cfl_PyDict_SetString(kwargs, "host", | ||
rd_kafka_AclBinding_host(c_acl_binding)); | ||
} else { | ||
cfl_PyObject_SetItemString(kwargs, "host", Py_None); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this change:
- This change is not related to
describe_topic
- We can't reproduce this
- It shouldn't be null as per the protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not nullable too in DescribeAclsResponse and DeleteAclsResponse
https://github.com/apache/kafka/blob/d65d8867983e54a36178c3a1227b68c8a468d415/clients/src/main/resources/common/message/DescribeAclsResponse.json#L45
https://github.com/apache/kafka/blob/d65d8867983e54a36178c3a1227b68c8a468d415/clients/src/main/resources/common/message/DeleteAclsResponse.json#L49
src/confluent_kafka/src/Admin.c
Outdated
if (rd_kafka_MemberDescription_host(c_member)) { | ||
cfl_PyDict_SetString(kwargs, | ||
"host", | ||
rd_kafka_MemberDescription_host(c_member)); | ||
} else { | ||
cfl_PyObject_SetItemString(kwargs, "host", Py_None); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this change:
- This change is not related to describe_topic
- We can't reproduce this
- It shouldn't be null as per the protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 : In this case it's the member host and is not nullable by the protocol
src/confluent_kafka/src/Metadata.c
Outdated
if (c_broker.host) { | ||
if (cfl_PyObject_SetString(broker, "host", | ||
c_broker.host) == -1) { | ||
Py_DECREF(broker); | ||
return NULL; | ||
} | ||
} else { | ||
if (cfl_PyObject_SetItemString(broker, "host", | ||
Py_None) == -1) { | ||
Py_DECREF(broker); | ||
return NULL; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this change:
- This function is used in deprecated API and will be removed in future versions
- This change is not related to describe_topic
- We can't reproduce this
- It shouldn't be null as per the protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure about that? Ultimately this function is called by list_topics
in Metadata.c which sounds like it may be related to the segfaults I reported.
Were you able to isolate those segfaults to the one bit of code that you do want to change, in confluent_kafka.c?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The host comes from the Metadata response in list_topics
and is not nullable there
https://github.com/apache/kafka/blob/e8cd661becb08279702f1670bb8f1d816537adea/clients/src/main/resources/common/message/MetadataResponse.json#L54
otherwise it can come from list_group (deprecated) that, in librdkafka only, does a list + describe and in that case it's taken from the host of the coordinator that returned the DescribeGroup response here:
https://github.com/confluentinc/librdkafka/blob/c95d2ee950bda779c63843d376aab1d90ed8dfa9/src/rdkafka.c#L4739
/sem-approve |
@@ -1620,7 +1620,10 @@ PyObject *c_Node_to_py(const rd_kafka_Node_t *c_node) { | |||
|
|||
cfl_PyDict_SetInt(kwargs, "id", rd_kafka_Node_id(c_node)); | |||
cfl_PyDict_SetInt(kwargs, "port", rd_kafka_Node_port(c_node)); | |||
cfl_PyDict_SetString(kwargs, "host", rd_kafka_Node_host(c_node)); | |||
if (rd_kafka_Node_host(c_node)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I have a doubt that the fix should be here in CKPy, because this uses the Metadata response too for the replicas and isrs and it uses the Metadata host here (not nullable)
https://github.com/apache/kafka/blob/e8cd661becb08279702f1670bb8f1d816537adea/clients/src/main/resources/common/message/MetadataResponse.json#L54
And broker metadata must be present for all the indexed brokers. Checking librdkafka code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, here it can be None because it's possible not all the brokers are returned from the Metadata call. I'm not sure if this expected from the broker because the documentation says "Each broker in the response". And in that case LK just sets the broker id here
https://github.com/confluentinc/librdkafka/blob/c95d2ee950bda779c63843d376aab1d90ed8dfa9/src/rdkafka_aux.c#L281
In the list_topics
API, instead, only the broker id list is returned in the replicas
and isrs
fields so the broker list in ClusterMetadata
only contains complete BrokerMetadata
objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it's the expected behavior. I've created a PR for changing the documentation here.
An update: in Go and .NET it doesn't happen because it either uses C.GoString
or PtrToStringUTF8
that returns nil
or null
.
@emasab, what else do you want me to do with this? |
@lpsinger after reviewing all the cases only in Could you leave only that change? |
Done. Would you like me to squash the changes? |
/sem-approve |
e2d6f3e
to
cfd227c
Compare
/sem-approve |
If you call describe_topics on a flaky connection, sometimes the admin client reply has the host set to a null pointer. When this occurs, instead of segfaulting, report the host as None.
This reverts commit 2627ecc.
Co-authored-by: Pranav Rathi <[email protected]>
cfd227c
to
6e48741
Compare
/sem-approve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution. LGTM!
You're very welcome! |
If you call describe_topics on a flaky connection, sometimes the admin client reply has the host set to a null pointer. When this occurs, instead of segfaulting, report the host as None.