-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mitigate discovery race condition between clents and services #132
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments to help with reviewing.
I will now double-check the changes to RMW_Connext_Client::take_response()
, then run CI.
@@ -46,6 +46,10 @@ enum RMW_Connext_MessageType | |||
RMW_CONNEXT_MESSAGE_REPLY | |||
}; | |||
|
|||
#ifndef DDS_GUID_INITIALIZER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definition moved to this common header from rmw_impl.cpp
@@ -281,4 +285,10 @@ rmw_connextdds_get_cft_filter_expression( | |||
rcutils_allocator_t * const allocator, | |||
rmw_subscription_content_filter_options_t * const options); | |||
|
|||
rmw_ret_t | |||
rmw_connextdds_is_subscription_matched( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New internal API function to check if a remote reader is currently matched by a local writer.
Only implemented for Connext Pro, and only used with the "extended" RPC mapping. The function is never called by the Micro implementation (because it doesn't support the "exteded" RPC mapping).
@@ -55,4 +55,13 @@ | |||
#define RMW_CONNEXT_LIMIT_KEEP_ALL_SAMPLES 1000 | |||
#endif /* RMW_CONNEXT_LIMIT_SAMPLES_MAX */ | |||
|
|||
#ifndef RMW_CONNEXT_LIMIT_DEFAULT_BLOCKING_TIME_DEFAULT | |||
#define RMW_CONNEXT_LIMIT_DEFAULT_BLOCKING_TIME_DEFAULT 100000 /* 100ms */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static limit used by RMW_Connext_Publisher::max_blocking_time
when the value of DataWriterQos::reliability::max_blocking_time
is "auto".
#endif /* RMW_CONNEXT_LIMIT_DEFAULT_BLOCKING_TIME_DEFAULT */ | ||
|
||
#ifndef RMW_CONNEXT_LIMIT_DEFAULT_BLOCKING_TIME_INFINITE | ||
#define RMW_CONNEXT_LIMIT_DEFAULT_BLOCKING_TIME_INFINITE (3600000000LU * 24LU * 365LU) /* 1 year */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static limit used by RMW_Connext_Publisher::max_blocking_time
when the value of DataWriterQos::reliability::max_blocking_time
is "infinite".
@@ -45,6 +45,7 @@ struct RMW_Connext_RequestReplyMessage | |||
{ | |||
bool request; | |||
rmw_gid_t gid; | |||
rmw_gid_t writer_gid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enhance this data structure with an extra field so that we can store both reader and writer GUIDs on the service.
} | ||
|
||
rmw_ret_t | ||
RMW_Connext_Publisher::wait_for_subscription( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Block and wait up to RMW_Connext_Publisher::max_blocking_time
for a "known" reader to be matched.
|
||
if (rr_msg->request) { | ||
src_guid = &sample_identity->writer_guid; | ||
src_guid = &related_sample_identity->writer_guid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the service side, src_guid
will contain the client's reply reader's GUID. writer_guid
now carries the client's request writer's GUID (which the service will use as the related_sample_identity::writer_guid
of its reply).
} else { | ||
src_guid = &related_sample_identity->writer_guid; | ||
src_sn = &related_sample_identity->sequence_number; | ||
writer_guid = &sample_identity->writer_guid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value is not used for anything by the client, at the moment.
@@ -2515,7 +2604,7 @@ RMW_Connext_Client::create( | |||
char * cft_name = nullptr, | |||
*cft_filter = nullptr; | |||
auto scope_exit_cft_name = rcpputils::make_scope_exit( | |||
[cft_name, cft_filter]() | |||
[&cft_name, &cft_filter]() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed memory leak
memcpy( | ||
request_header->request_id.writer_guid, | ||
rr_msg.gid.data, | ||
this->request_pub->gid()->data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm second guessing this change, and I'm thinking the value should be copied from rr_msg.writer_gid.data
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed rmw_take_response()
, and the writer_guid
is supposed to match the guid returned by rmw_send_request()
, so the value returned here is the correct one.
I'm trying to debug and resolve the Windows build failure in branch asorbini/service-discovery-race-windbg |
Resolved the Windows build issue (it was caused by an "out of order" |
CI passed with one unrelated test failure on Windows. @clalancette is there someone that could take a look and approve the PR? |
Re-running CI after resolving a deadlock caused by trying to query subscription on the DataWriter from within "matched_mutex": |
@clalancette Can this PR be reviewed as this has been open for a few months? We are running Connext rmw and need this fix in the release. Currently have to build this from source but would like to stop doing that. |
I understand the Jazzy rmw freeze is coming up soon? Can we get this PR reviewed and merged before then? It's been sitting here for 6 months. |
@bijoua29 i think this is really nice enhancement. as you noticed, RMW API freeze will be on Mon. April 8, 2024, https://docs.ros.org/en/rolling/Releases/Release-Jazzy-Jalisco.html#release-timeline i will take a look sometime in this week, but this needs to be rebased before that. CC: @asorbini |
@fujitatomoya Thanks for reviewing this. |
Signed-off-by: Andrea Sorbini <[email protected]>
Signed-off-by: Andrea Sorbini <[email protected]>
…ndard suffix for constants Signed-off-by: Andrea Sorbini <[email protected]>
Signed-off-by: Andrea Sorbini <[email protected]>
247506e
to
41f5230
Compare
I've rebased it onto the latest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't say I love that we can arbitrarily sleep during an rmw_send_response
; that seems to me to be very jittery, and who knows if 100 milliseconds is long enough for this discovery to happen. But it does improve the situation here, and also matches what rmw_fastrtps
does, so I'm OK with it for now.
@clalancette i was going to have a review, thanks! |
OK, yeah. It looks like this PR is the cause of the regressions. I'm going to open a PR to revert it. |
* Mitigate discovery race condition between clients and services. Signed-off-by: Andrea Sorbini <[email protected]>
This PR introduces changes meant to reduce the impact of a race condition between clients and services.
The race condition is described in detail by this comment, but to summarize:
The only way to fully resolve this race condition is described by the DDS-RPC 1.0 specification, section 7.6.2.2:
The proposed solution follows the approach implemented by rmw_fastrtps_cpp:
These changes, paired with the implementation of
rmw_service_server_is_available()
, provide a partial implementation of the solution prescribed by DDS-RPC, with the following limitations:RMW_Connext_Client::::is_service_available()
only verifies that for every remote subscription matched by the local request DataWriter, there is a matched remote publication with the same "GUID prefix" (i.e. belonging to the same remote DomainParticipant) on the local reply DataReader. This test may lead to false positives if a remote DomainParticipant has created two or more "services" (i.e. pairs of request DR/reply DW) with the same (topic) name, e.g. in the case of a remote process with two ROS 2 Nodes that create a service with the same name. This is a very likely case for "1-to-many" services (where a client makes a request on many services at once).rmw_service_server_is_available()
(e.g. by calling ofrclcpp::wait_for_service()
) before sending a request to make sure it'll (most likely) receive at least one reply from a service, but because of the false positives inRMW_Connext_Client::is_service_available()
, it is still possible for the client to end up with "inconsistent" matches and no reply.This information is currently encoded by the ROS Graph, by means of an endpoint's node and topic name, but it is not easily queried and correlated with the client's matches. It also relies on the assumption that a node can't create two services with the same name (not sure if that is the case...).