Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publisher GID in message info of taken message does not match GID of source publisher #377

Open
christophebedard opened this issue Mar 5, 2022 · 3 comments

Comments

@christophebedard
Copy link
Member

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • source
  • Version or commit hash:
    • ros2.repos file at master
  • DDS implementation:
    • Cyclone DDS
  • Client library (if applicable):
    • N/A

Steps to reproduce issue

Have a publisher that publishes a message, and have a subscription that receives that message.

Normal inter-process setup, no shared memory or anything.

Expected behavior

These two are equal:

  1. Publisher GID in the rmw message info struct when the subscription takes a message
  2. GID of the rmw publisher (and underlying DDS writer GID/GUID; the conversion is trivial) that sent the message that gets taken by the subscription

when it's the same publisher.

Actual behavior

Publisher GIDs do not match.

Additional information

The GID of a publisher is recorded here in rmw_create_publisher():

TRACEPOINT(rmw_publisher_init, static_cast<const void *>(pub), cddspub->gid.data);

I'm recording the source publisher GID here in rmw_take():

TRACEPOINT(
rmw_take,
static_cast<const void *>(subscription),
static_cast<const void *>(ros_message),
(message_info ? message_info->source_timestamp : 0LL),
*taken);
using this addition:

diff --git a/rmw_cyclonedds_cpp/src/rmw_node.cpp b/rmw_cyclonedds_cpp/src/rmw_node.cpp
index 681bb03..c81b08c 100644
--- a/rmw_cyclonedds_cpp/src/rmw_node.cpp
+++ b/rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -3134,6 +3134,7 @@ take_done:
     static_cast<const void *>(subscription),
     static_cast<const void *>(ros_message),
     (message_info ? message_info->source_timestamp : 0LL),
+    (message_info ? message_info->publisher_gid.data : nullptr),
     *taken);
   return RMW_RET_OK;
 }

Less than 10 lines above that, it's copying publication_handle from the dds_sample_info_t struct into into message_info->publisher_gid. I'm not sure whether that's actually a GID:

message_info->source_timestamp = info.source_timestamp;

I tried to do get_entity_gid(info.publication_handle, message_info->publisher_gid); instead of simply copying the values, but the GIDs don't seem to match either.

Note that I don't think this behaviour is tested anywhere in ROS 2. However, publisher GIDs seem to match correctly with rmw_fastrtps_cpp.

@christophebedard
Copy link
Member Author

christophebedard commented Mar 6, 2022

Note that I don't think this behaviour is tested anywhere in ROS 2.

I added a test to test_rmw_implementation to cover this: christophebedard/rmw_implementation@7605484. However, the test passed on all RMW implementations in my workspace (except rmw_email_cpp 😆). To properly test and compare interprocess vs intraprocess, I've modified/created simple ping-pong test cases and was able to reproduce the issue:

  1. Simple node that publishes a message which is received by another node in another process.
  2. Same thing but the publisher and subscription are in the same node in the same process.

The publisher's GID is printed after it's created. The publisher GID from the message info data is printed in the subscription callback. These two should match. We run each test-case with rmw_fastrtps_cpp and rmw_cyclonedds_cpp.

Steps:

  1. Use the test-tracetools-pingpong-print-publisher-gid branch from ros2_tracing: https://gitlab.com/ros-tracing/ros2_tracing/-/commits/test-tracetools-pingpong-print-publisher-gid
  2. Build up-to test_tracetools
  3. Open two terminals and source workspace
  4. Test-case 1 (interprocess)
    1. rmw_fastrtps_cpp: GIDs match
      # Terminal 1
      $ RMW_IMPLEMENTATION=rmw_fastrtps_cpp ./build/test_tracetools/test_ping
      original: pub gid: 1 15 138 60 134 102 14 96 1 0 0 0 0 0 19 3 0 0 0 0 0 0 0 0
      # Terminal 2
      $ RMW_IMPLEMENTATION=rmw_fastrtps_cpp ./build/test_tracetools/test_pong
      sub msg : pub gid: 1 15 138 60 134 102 14 96 1 0 0 0 0 0 19 3 0 0 0 0 0 0 0 0
    2. rmw_cyclonedds_cpp: GIDs do not match
      # Terminal 1
      $ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/test_tracetools/test_ping
      original: pub gid: 255 131 30 82 99 164 26 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      # Terminal 2
      $ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/test_tracetools/test_pong
      sub msg : pub gid: 239 202 37 164 203 102 85 109 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  5. Test-case 2 (same process)
    1. rmw_fastrtps_cpp: GIDs match
      # Terminal 1
      $ RMW_IMPLEMENTATION=rmw_fastrtps_cpp ./build/test_tracetools/test_pingpong 
      original: pub gid: 1 15 138 60 54 105 11 201 1 0 0 0 0 0 19 3 0 0 0 0 0 0 0 0
      sub msg : pub gid: 1 15 138 60 54 105 11 201 1 0 0 0 0 0 19 3 0 0 0 0 0 0 0 0
    2. rmw_cyclonedds_cpp: GIDs match
      # Terminal 1
      $ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/test_tracetools/test_pingpong 
      original: pub gid: 146 105 131 154 211 166 10 147 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      sub msg : pub gid: 146 105 131 154 211 166 10 147 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Conclusion: it's fine when pub/sub is within the same process (or node), but the publisher GIDs do not match in normal interprocess communications.

@eboasson
Copy link
Collaborator

eboasson commented Mar 7, 2022

I agree it is fair to call this discrepancy a bug, though prior to Feb 10 2020 you wouldn't have been able to observe this because originally there was nothing in the ROS 2 interface that required a globally unique identifier and everything in the Cyclone DDS RMW layer used the same identifier that you see here.

That identifier is what is called the "instance handle" for the DDS DataWriter as it is visible in the DDS discovery topics by taking a subscription to to the DCPSPublication topic, and what is made available to the data reader in the "publication_handle" of the SampleInfo (2.2.2.5.5 "SampleInfo Class" in the DDS spec).

The GUID is not in the standardised fields of this SampleInfo, and it is also not in Cyclone's. Adding it would make the sample info even more obese (IMHO) and assuming I find the time to restructure a few things in the innards of Cyclone, it won't be easily available internally for inclusion in the sample info either. The standards-conforming way to get it is to convert the "publication_handle" into the publisher's GUID and to that efficiently you'll need to maintain a mapping. Either in the Cyclone DDS RMW layer itself, or, more elegantly, in the dds_common package.

@christophebedard
Copy link
Member Author

Thanks for the explanation!

That identifier is what is called the "instance handle" for the DDS DataWriter as it is visible in the DDS discovery topics by taking a subscription to to the DCPSPublication topic, and what is made available to the data reader in the "publication_handle" of the SampleInfo (2.2.2.5.5 "SampleInfo Class" in the DDS spec).

The standards-conforming way to get it is to convert the "publication_handle" into the publisher's GUID and to that efficiently you'll need to maintain a mapping. Either in the Cyclone DDS RMW layer itself, or, more elegantly, in the dds_common package.

Looks like most of this exists already:

static void handle_builtintopic_endpoint(

Implementation-wise, is the mapping from dds_builtintopic_endpoint_t's participant_instance_handle to the GID obtained with convert_guid_to_gid() using dds_builtintopic_endpoint_t's participant_key? Like this here:

convert_guid_to_gid(s->participant_key, ppgid);

christophebedard pushed a commit to ros2/ros2_tracing that referenced this issue Oct 7, 2024
The size changed from 24 to 16 bytes in ros2/rmw#345. The same size is
defined in `tracetools`, and it was never updated.

In practice, reading 24-16=8 random extra bytes didn't change much,
since nothing is currently relying on the GID values, at least not
relying on getting the same GID for the same object from two different
systems, because it doesn't work with `rmw_cyclonedds`, see
ros2/rmw_cyclonedds#377.

Signed-off-by: Christophe Bedard <[email protected]>
christophebedard pushed a commit to ros2/ros2_tracing that referenced this issue Oct 8, 2024
The size changed from 24 to 16 bytes in ros2/rmw#345. The same size is
defined in `tracetools`, and it was never updated.

In practice, reading 24-16=8 random extra bytes didn't change much,
since nothing is currently relying on the GID values, at least not
relying on getting the same GID for the same object from two different
systems, because it doesn't work with `rmw_cyclonedds`, see
ros2/rmw_cyclonedds#377.

Signed-off-by: Christophe Bedard <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants