Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigate discovery race condition between clents and services #132

Merged
merged 4 commits into from
Apr 3, 2024

Conversation

asorbini
Copy link
Collaborator

This PR introduces changes meant to reduce the impact of a race condition between clients and services.

The race condition is described in detail by this comment, but to summarize:

It is possible for the service's request DataReader to receive a request from the client, handle it, and try to write the response before the service's reply DataWriter has actually matched the client's reply DataReader. If this is the case, the response will be lost.
The data race and loss of reply message may also occur in the case of an "asymmetric discovery" between the client's reply DataReader and the service's reply DataWriter, where the latter has discovered and matched the former, but not viceversa.

The only way to fully resolve this race condition is described by the DDS-RPC 1.0 specification, section 7.6.2.2:

[..] To ensure robustness, service discovery must ensure that no replies are lost due to race conditions in the discovery process of the underlying DDS entities [..]:

  • The client-side DataWriter (request DataWriter) shall not send out any request to a service DataReader unless the client-side DataReader (reply DataReader) has discovered and matched the corresponding service DataWriter (reply DataWriter).
  • The service-side DataReader (request DataReader) shall reject requests from a client DataWriter (request
    DataWriter) as long as the service-side DataWriter (reply DataWriter) has not discovered and matched the
    corresponding client-side DataReader (reply DataReader).

The proposed solution follows the approach implemented by rmw_fastrtps_cpp:

These changes, paired with the implementation of rmw_service_server_is_available(), provide a partial implementation of the solution prescribed by DDS-RPC, with the following limitations:

  • RMW_Connext_Client::::is_service_available() only verifies that for every remote subscription matched by the local request DataWriter, there is a matched remote publication with the same "GUID prefix" (i.e. belonging to the same remote DomainParticipant) on the local reply DataReader. This test may lead to false positives if a remote DomainParticipant has created two or more "services" (i.e. pairs of request DR/reply DW) with the same (topic) name, e.g. in the case of a remote process with two ROS 2 Nodes that create a service with the same name. This is a very likely case for "1-to-many" services (where a client makes a request on many services at once).
  • The client must use rmw_service_server_is_available() (e.g. by calling of rclcpp::wait_for_service()) before sending a request to make sure it'll (most likely) receive at least one reply from a service, but because of the false positives in RMW_Connext_Client::is_service_available(), it is still possible for the client to end up with "inconsistent" matches and no reply.
  • A more robust solution requires each service to communicate their endpoints' GUIDs through discovery so that clients may correctly associate each pair of discovered endpoints and then make sure they are both matched before sending a request.
    This information is currently encoded by the ROS Graph, by means of an endpoint's node and topic name, but it is not easily queried and correlated with the client's matches. It also relies on the assumption that a node can't create two services with the same name (not sure if that is the case...).

Copy link
Collaborator Author

@asorbini asorbini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to help with reviewing.

I will now double-check the changes to RMW_Connext_Client::take_response(), then run CI.

@@ -46,6 +46,10 @@ enum RMW_Connext_MessageType
RMW_CONNEXT_MESSAGE_REPLY
};

#ifndef DDS_GUID_INITIALIZER
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definition moved to this common header from rmw_impl.cpp

@@ -281,4 +285,10 @@ rmw_connextdds_get_cft_filter_expression(
rcutils_allocator_t * const allocator,
rmw_subscription_content_filter_options_t * const options);

rmw_ret_t
rmw_connextdds_is_subscription_matched(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New internal API function to check if a remote reader is currently matched by a local writer.

Only implemented for Connext Pro, and only used with the "extended" RPC mapping. The function is never called by the Micro implementation (because it doesn't support the "exteded" RPC mapping).

@@ -55,4 +55,13 @@
#define RMW_CONNEXT_LIMIT_KEEP_ALL_SAMPLES 1000
#endif /* RMW_CONNEXT_LIMIT_SAMPLES_MAX */

#ifndef RMW_CONNEXT_LIMIT_DEFAULT_BLOCKING_TIME_DEFAULT
#define RMW_CONNEXT_LIMIT_DEFAULT_BLOCKING_TIME_DEFAULT 100000 /* 100ms */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static limit used by RMW_Connext_Publisher::max_blocking_time when the value of DataWriterQos::reliability::max_blocking_time is "auto".

#endif /* RMW_CONNEXT_LIMIT_DEFAULT_BLOCKING_TIME_DEFAULT */

#ifndef RMW_CONNEXT_LIMIT_DEFAULT_BLOCKING_TIME_INFINITE
#define RMW_CONNEXT_LIMIT_DEFAULT_BLOCKING_TIME_INFINITE (3600000000LU * 24LU * 365LU) /* 1 year */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static limit used by RMW_Connext_Publisher::max_blocking_time when the value of DataWriterQos::reliability::max_blocking_time is "infinite".

@@ -45,6 +45,7 @@ struct RMW_Connext_RequestReplyMessage
{
bool request;
rmw_gid_t gid;
rmw_gid_t writer_gid;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enhance this data structure with an extra field so that we can store both reader and writer GUIDs on the service.

}

rmw_ret_t
RMW_Connext_Publisher::wait_for_subscription(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block and wait up to RMW_Connext_Publisher::max_blocking_time for a "known" reader to be matched.


if (rr_msg->request) {
src_guid = &sample_identity->writer_guid;
src_guid = &related_sample_identity->writer_guid;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the service side, src_guid will contain the client's reply reader's GUID. writer_guid now carries the client's request writer's GUID (which the service will use as the related_sample_identity::writer_guid of its reply).

} else {
src_guid = &related_sample_identity->writer_guid;
src_sn = &related_sample_identity->sequence_number;
writer_guid = &sample_identity->writer_guid;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is not used for anything by the client, at the moment.

@@ -2515,7 +2604,7 @@ RMW_Connext_Client::create(
char * cft_name = nullptr,
*cft_filter = nullptr;
auto scope_exit_cft_name = rcpputils::make_scope_exit(
[cft_name, cft_filter]()
[&cft_name, &cft_filter]()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed memory leak

memcpy(
request_header->request_id.writer_guid,
rr_msg.gid.data,
this->request_pub->gid()->data,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm second guessing this change, and I'm thinking the value should be copied from rr_msg.writer_gid.data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed rmw_take_response(), and the writer_guid is supposed to match the guid returned by rmw_send_request(), so the value returned here is the correct one.

@asorbini
Copy link
Collaborator Author

Full CI:

  • Linux: Build Status
  • Linux aarch64: Build Status (interrupted because rmw_connextdds is disabled on this platform)
  • Windows: Build Status

@asorbini
Copy link
Collaborator Author

I'm trying to debug and resolve the Windows build failure in branch asorbini/service-discovery-race-windbg

@asorbini
Copy link
Collaborator Author

Resolved the Windows build issue (it was caused by an "out of order" #include).

Running CI again:

  • Linux: Build Status
  • Linux aarch64: Build Status (interrupted)
  • Windows: Build Status

@asorbini
Copy link
Collaborator Author

CI passed with one unrelated test failure on Windows.

@clalancette is there someone that could take a look and approve the PR?

@asorbini asorbini requested a review from clalancette October 2, 2023 16:56
@asorbini
Copy link
Collaborator Author

asorbini commented Oct 6, 2023

Re-running CI after resolving a deadlock caused by trying to query subscription on the DataWriter from within "matched_mutex":

  • Linux Build Status
  • Linux aarch64 Build Status (interrupted)
  • Windows Build Status

@bijoua29
Copy link

bijoua29 commented Mar 9, 2024

@clalancette Can this PR be reviewed as this has been open for a few months? We are running Connext rmw and need this fix in the release. Currently have to build this from source but would like to stop doing that.

@bijoua29
Copy link

I understand the Jazzy rmw freeze is coming up soon? Can we get this PR reviewed and merged before then? It's been sitting here for 6 months.

@fujitatomoya
Copy link
Collaborator

@bijoua29 i think this is really nice enhancement. as you noticed, RMW API freeze will be on Mon. April 8, 2024, https://docs.ros.org/en/rolling/Releases/Release-Jazzy-Jalisco.html#release-timeline

i will take a look sometime in this week, but this needs to be rebased before that.

CC: @asorbini

@fujitatomoya fujitatomoya self-assigned this Mar 27, 2024
@bijoua29
Copy link

bijoua29 commented Apr 2, 2024

@bijoua29 i think this is really nice enhancement. as you noticed, RMW API freeze will be on Mon. April 8, 2024, https://docs.ros.org/en/rolling/Releases/Release-Jazzy-Jalisco.html#release-timeline

i will take a look sometime in this week, but this needs to be rebased before that.

CC: @asorbini

@fujitatomoya Thanks for reviewing this.
@asorbini Can you rebase these changes?

@clalancette clalancette force-pushed the asorbini/service-discovery-race branch from 247506e to 41f5230 Compare April 2, 2024 13:50
@clalancette
Copy link
Contributor

I've rebased it onto the latest.

@clalancette
Copy link
Contributor

Here is full CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

Copy link
Contributor

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I love that we can arbitrarily sleep during an rmw_send_response; that seems to me to be very jittery, and who knows if 100 milliseconds is long enough for this discovery to happen. But it does improve the situation here, and also matches what rmw_fastrtps does, so I'm OK with it for now.

@clalancette clalancette merged commit 7c95abb into rolling Apr 3, 2024
1 check passed
@delete-merged-branch delete-merged-branch bot deleted the asorbini/service-discovery-race branch April 3, 2024 19:14
@fujitatomoya
Copy link
Collaborator

@clalancette i was going to have a review, thanks!

clalancette added a commit that referenced this pull request Apr 4, 2024
@clalancette
Copy link
Contributor

I'm not entirely sure, but I think this may be causing regressions on Windows. Here's some CI before and after this change:

  • Windows Debug with this change: Build Status
  • Windows Debug without this change: Build Status

@clalancette
Copy link
Contributor

OK, yeah. It looks like this PR is the cause of the regressions. I'm going to open a PR to revert it.

trubio-rti pushed a commit that referenced this pull request Jun 5, 2024
* Mitigate discovery race condition between clients and services.

Signed-off-by: Andrea Sorbini <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants