Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix segfault with describe_topics and flaky connection #1692

Conversation

lpsinger
Copy link
Contributor

If you call describe_topics on a flaky connection, sometimes the admin client reply has the host set to a null pointer. When this occurs, instead of segfaulting, report the host as None.

@lpsinger lpsinger requested a review from a team as a code owner December 15, 2023 22:46
Copy link

cla-assistant bot commented Dec 15, 2023

CLA assistant check
All committers have signed the CLA.

Copy link

cla-assistant bot commented Dec 15, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@lpsinger lpsinger force-pushed the fix-segfault-on-admin-client-flaky-connection branch 3 times, most recently from f1d8c82 to c1c65e0 Compare December 15, 2023 23:05
@pranavrth
Copy link
Member

Thanks for the PR.

Even though the changes seems correct, I think we should not change the base implementation which is being used in alot of APIs. Some of the values of other APIs must not be NULL. I think its better to handle the scenario in this way - https://github.com/confluentinc/confluent-kafka-python/blob/master/src/confluent_kafka/src/Admin.c#L3198-L3201.

I am also checking if it should be possible to get the host information as NULL or should we handle it as an error instead. But in the meantime you can rebase the branch and do the change mentioned above if you have time. Otherwise, you can wait for my revert.

@lpsinger lpsinger force-pushed the fix-segfault-on-admin-client-flaky-connection branch from c1c65e0 to 2627ecc Compare March 20, 2024 17:22
@lpsinger
Copy link
Contributor Author

But in the meantime you can rebase the branch and do the change mentioned above if you have time.

Done.

Copy link
Member

@pranavrth pranavrth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided minor comments.

Please update the CHANGELOG.md as well with the entry for the fix.

src/confluent_kafka/src/confluent_kafka.c Outdated Show resolved Hide resolved
Comment on lines 3354 to 3359
if (rd_kafka_AclBinding_host(c_acl_binding)) {
cfl_PyDict_SetString(kwargs, "host",
rd_kafka_AclBinding_host(c_acl_binding));
} else {
cfl_PyObject_SetItemString(kwargs, "host", Py_None);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change:

  1. This change is not related to describe_topic
  2. We can't reproduce this
  3. It shouldn't be null as per the protocol

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 3631 to 3637
if (rd_kafka_MemberDescription_host(c_member)) {
cfl_PyDict_SetString(kwargs,
"host",
rd_kafka_MemberDescription_host(c_member));
} else {
cfl_PyObject_SetItemString(kwargs, "host", Py_None);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change:

  1. This change is not related to describe_topic
  2. We can't reproduce this
  3. It shouldn't be null as per the protocol

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 211 to 222
if (c_broker.host) {
if (cfl_PyObject_SetString(broker, "host",
c_broker.host) == -1) {
Py_DECREF(broker);
return NULL;
}
} else {
if (cfl_PyObject_SetItemString(broker, "host",
Py_None) == -1) {
Py_DECREF(broker);
return NULL;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change:

  1. This function is used in deprecated API and will be removed in future versions
  2. This change is not related to describe_topic
  3. We can't reproduce this
  4. It shouldn't be null as per the protocol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about that? Ultimately this function is called by list_topics in Metadata.c which sounds like it may be related to the segfaults I reported.

Were you able to isolate those segfaults to the one bit of code that you do want to change, in confluent_kafka.c?

Copy link
Contributor

@emasab emasab Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The host comes from the Metadata response in list_topics and is not nullable there
https://github.com/apache/kafka/blob/e8cd661becb08279702f1670bb8f1d816537adea/clients/src/main/resources/common/message/MetadataResponse.json#L54

otherwise it can come from list_group (deprecated) that, in librdkafka only, does a list + describe and in that case it's taken from the host of the coordinator that returned the DescribeGroup response here:
https://github.com/confluentinc/librdkafka/blob/c95d2ee950bda779c63843d376aab1d90ed8dfa9/src/rdkafka.c#L4739

@pranavrth
Copy link
Member

/sem-approve

@@ -1620,7 +1620,10 @@ PyObject *c_Node_to_py(const rd_kafka_Node_t *c_node) {

cfl_PyDict_SetInt(kwargs, "id", rd_kafka_Node_id(c_node));
cfl_PyDict_SetInt(kwargs, "port", rd_kafka_Node_port(c_node));
cfl_PyDict_SetString(kwargs, "host", rd_kafka_Node_host(c_node));
if (rd_kafka_Node_host(c_node))
Copy link
Contributor

@emasab emasab Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I have a doubt that the fix should be here in CKPy, because this uses the Metadata response too for the replicas and isrs and it uses the Metadata host here (not nullable)
https://github.com/apache/kafka/blob/e8cd661becb08279702f1670bb8f1d816537adea/clients/src/main/resources/common/message/MetadataResponse.json#L54
And broker metadata must be present for all the indexed brokers. Checking librdkafka code

Copy link
Contributor

@emasab emasab Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, here it can be None because it's possible not all the brokers are returned from the Metadata call. I'm not sure if this expected from the broker because the documentation says "Each broker in the response". And in that case LK just sets the broker id here
https://github.com/confluentinc/librdkafka/blob/c95d2ee950bda779c63843d376aab1d90ed8dfa9/src/rdkafka_aux.c#L281

In the list_topics API, instead, only the broker id list is returned in the replicas and isrs fields so the broker list in ClusterMetadata only contains complete BrokerMetadata objects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it's the expected behavior. I've created a PR for changing the documentation here.
An update: in Go and .NET it doesn't happen because it either uses C.GoString or PtrToStringUTF8 that returns nil or null.

@lpsinger
Copy link
Contributor Author

lpsinger commented May 2, 2024

@emasab, what else do you want me to do with this?

@emasab
Copy link
Contributor

emasab commented May 3, 2024

@lpsinger after reviewing all the cases only in c_Node_to_py it's needed to check for NULL values because it takes the nodes from the Metadata request and some of then can be absent, so without host, when the broker is down and not listed there.

Could you leave only that change?

@lpsinger
Copy link
Contributor Author

lpsinger commented May 3, 2024

Could you leave only that change?

Done. Would you like me to squash the changes?

@lpsinger lpsinger requested a review from emasab May 3, 2024 20:32
@pranavrth
Copy link
Member

/sem-approve

@lpsinger lpsinger force-pushed the fix-segfault-on-admin-client-flaky-connection branch from e2d6f3e to cfd227c Compare May 6, 2024 13:57
@lpsinger lpsinger requested a review from pranavrth May 6, 2024 13:57
@pranavrth
Copy link
Member

/sem-approve

lpsinger and others added 5 commits May 7, 2024 16:31
If you call describe_topics on a flaky connection, sometimes the
admin client reply has the host set to a null pointer. When this
occurs, instead of segfaulting, report the host as None.
@pranavrth pranavrth force-pushed the fix-segfault-on-admin-client-flaky-connection branch from cfd227c to 6e48741 Compare May 7, 2024 11:01
@pranavrth
Copy link
Member

/sem-approve

Copy link
Member

@pranavrth pranavrth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. LGTM!

@pranavrth pranavrth merged commit 5c4a652 into confluentinc:master May 7, 2024
1 of 2 checks passed
@lpsinger lpsinger deleted the fix-segfault-on-admin-client-flaky-connection branch May 7, 2024 12:15
@lpsinger
Copy link
Contributor Author

lpsinger commented May 7, 2024

You're very welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants