FastRTPS 1.8.0 causes hangs in Navigation2 #280

mkhansenbot · 2019-05-21T15:46:01Z

Bug report

Required Info:

Operating System:
- Ubuntu 18.04
Installation type:
- Source
Version or commit hash:
- 0.7.2
DDS implementation:
- Fast-RTPS master branch (1.8.0)
Client library (if applicable):
- rclcpp

Steps to reproduce issue

Currently I have to run our Navigation2 system test to reproduce this, I'm trying to find a simpler example. However what I see is that when I run our system test with the latest versions (master branches) of rmw_fastrtps (0.7.2) and Fast-RTPS (1.8.0) our test hangs and times out. When I run with the previous versions (0.7.1) and (1.7.2) respectively, things work fine. Also if I run with RMW_IMPLEMENTATION=rmw_opensplice_cpp, things work fine then too.

I haven't been able to isolate the problem. I can provide instructions for how to reproduce using the Nav2 system test if desired.

I see AMCL is stuck waiting for data on the /scan topic, but when I do a ros2 topic hz /scan I can see that the scan topic is being published correctly by gazebo. So it's like the callback to AMCL is not being executed. I'm not sure how to debug that, but I'm pretty sure it's in this rmw layer.

If anyone can offer some help or suggestions as to what to look at to debug this besides the fact that the topics are being executed, I'd appreciate the help.

This is high priority as it is blocking our CI. We won't be able to release for Dashing in this state.

The text was updated successfully, but these errors were encountered:

mkhansenbot · 2019-05-21T16:01:31Z

besides the fact that the topics are being executed

I mean besides the fact that the topics are being published

dirk-thomas · 2019-05-21T16:08:38Z

I can provide instructions for how to reproduce using the Nav2 system test if desired.

Please do so in order for others to be able to help.

mkhansenbot · 2019-05-21T16:48:11Z

OK, here are instructions:

Assuming you've already cloned and built the ros2 repos following the 'Building from source' directions, and you're using the 'master' version of the ros2 repos file.
clone our repo

mkdir -p ~/nav2_ws/src
cd ~/nav2_ws/src
git clone https://github.com/ros-planning/navigation2.git

Download and build source dependencies:

mkdir -p ~/nav2_dependencies_ws/src
cd ~/nav2_dependencies_ws
cp ~/nav2_ws/src/navigation2/tools/ros2_dependencies.repos .
vcs import src < ros2_dependencies.repos
colcon build --symlink-install

Build Navigation2 with system tests

cd ~/nav2_ws
rm src/navigation2/nav2_system_tests/COLCON_IGNORE
colcon build --symlink install

Run the system tests

cd build/nav2_system_tests
ctest -V

At this point, you'll see that the test_localization test will timeout, as will the bt_navigator test.

Switching to RMW_IMPLEMENTATION=rmw_opensplice_cpp and re-running the test again, it should pass.

I hope this helps, I appreciate any help I can get on this

mkhansenbot · 2019-05-21T17:03:24Z

I forgot, of course you have to source install/setup.bash after step 2 and step 3.

nburek · 2019-05-21T17:58:06Z

@mkhansen-intel Do these tests use service servers and clients? There was an issue with some of the test related to using the parameter client that were caused by a race condition of the server and client not being fully subscribed to each other before the function to wait for the service to be ready returns true on the client side. In that case we were seeing that the client call was hanging because it was waiting for a response to a message the server never received. Does it seem like this might be a similar situation to your failing test?

My understanding was that the tests we saw this on were problematic before the upgrade to 1.8.0, but that it was exacerbated after the update. ros2/build_farmer#166

mkhansenbot · 2019-05-21T18:31:54Z

@nburek - I don't think that's the same issue, but I'll look into this more deeply and see. I believe the only wait for service would be for setting use_sim_time=True in this case, but I don't think that's what's hanging. I'll verify that and post again.

dirk-thomas · 2019-05-21T22:56:48Z

I just tried the following and can confirm the hang with FastRTPS:

Shell A:
- install the Dashing prerelease Debian packages (ros-dashing-*)
- source /opt/ros/dashing/setup.bash
- clone ros-planning/navigation2, ros-simulation/gazebo_ros_pkgs@ros2 (will be released soon) and BehaviorTree/BehaviorTree.CPP@ros2 (since its release failed to build) into a workspace ws
- call colcon build in that workspace
Shell B:
- source ws/install/setup.bash
- ctest -V -R test_localization hangs...

@richiprosima You insight might be helpful on this ticket. This is using the latest commit from the master branch of FastRTPS.

mkhansenbot · 2019-05-21T23:03:43Z

@dirk-thomas - glad you are able to reproduce

When you back up to the previous released version of FastRTPS do you see the test pass? That's what I see.

I believe our test is hanging when we try to set use_sim_time on our nodes. At least that's what I'm seeing when I run manually. I need to test again with the old version to confirm.

dirk-thomas · 2019-05-21T23:42:00Z

From reading the sources a bit more it does look like that you are currently making the assumption that you can publish a message right after the publisher has been created without waiting for potential subscribers to be matched.

For the initialpose topic I added that kind of logic on both sides (see https://github.com/dirk-thomas/navigation2/commit/5fbeb2b1cd070fdf4b72c501270bdea9e3b0f450). That logic itself obviously changes the timing and therefore makes the test pass for me sometimes.

But in the cases where is still hangs everything points to a discovery problem in FastRTPS (@richiprosima). A common pattern I am seeing:

publisher and subscriber are being created in separate processes
both wait until the matched count of the opposite side is at least 1
the publisher side continue and publishes a message after the matched subscription count is 1
but the subscriber never "sees" a matching publisher and therefore hangs in the loop I added forever (without the loop the subscriber would simply not get the published message and the system hangs as before).

mkhansenbot · 2019-05-22T00:18:00Z

From reading the sources a bit more it does look like that you are currently making the assumption that you can publish a message right after the publisher has been created without waiting for potential subscribers to be matched.

Is that invalid? In ROS it was OK, and this hasn't been a problem in the past for ROS2 that I know of. We can make the change, but if this is something we should be doing everywhere, we're not, and it would take us a bit of work to implement in every node. I'm OK with doing that, if that's the best practice for ROS2.

dirk-thomas · 2019-05-22T05:02:03Z

Is that invalid?

Yes, at least in terms of that the code is subject to a race.

this hasn't been a problem in the past for ROS2 that I know of

This has always been the case - the time race just shows on some system / platforms but not on others.

We can make the change, but if this is something we should be doing everywhere, we're not, and it would take us a bit of work to implement in every node.

For my snippet above the check of the matched endpoint count is fine sine the test knows exactly how many endpoints to expect. In many other cases the expected count is unknown. Then the software needs to be designed in a way that it handles later matched endpoints gracefully.

E.g. in the communication tests a publisher publishes a message with known values and the subscriber compares the first received message with expected values. To make this robust against this kind of race we don't rely on the matched endpoint count but repeatedly publish the message in the publisher (until the test finished when the subscriber gets the first message).

I just want to be clear that independent of this race there still seems to be a regression in the latest FastRTPS version which leads to one of the endpoints not being matched at all in some cases (even after waiting e.g. 30s).

MiguelCompany · 2019-05-22T06:02:51Z

I just want to be clear that independent of this race there still seems to be a regression in the latest FastRTPS version which leads to one of the endpoints not being matched at all in some cases (even after waiting e.g. 30s).

We are trying to reproduce this, will keep you informed

MiguelCompany · 2019-05-22T10:17:12Z

One thing to note. The requirement of the publisher waiting for the subscriber to match is only necessary when durability is set to RMW_QOS_POLICY_DURABILITY_VOLATILE (which is the value on rmw_qos_profile_default). Changing it to RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL should make the subscriber receive the sample even if it was published before its creation.

dirk-thomas · 2019-05-22T17:09:06Z

The requirement of the publisher waiting for the subscriber to match is only necessary when durability is set to RMW_QOS_POLICY_DURABILITY_VOLATILE (which is the value on rmw_qos_profile_default). Changing it to RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL should make the subscriber receive the sample even if it was published before its creation.

👍 That is certainly possible.

But in some cases transient local isn't a good choice and volatile should still reliably discover / match endpoints given enough time.

dirk-thomas · 2019-05-23T00:13:16Z

@richiprosima FYI this one seems to be a very similar report of the same problem: ros2/ros2#651

mkhansenbot · 2019-05-23T14:06:51Z

@dirk-thomas - given that we're days from Dashing release can we either:

Roll back to 1.7.2 with rmw_fastrtps 0.7.1
Include Opensplice in the binaries so that we can set RMW_IMPLEMENTATION=rmw_opensplice_cpp without having to build ROS2 from source repos

clalancette · 2019-05-23T14:09:54Z

@mkhansen-intel Number 2 is already available; if you apt-get install ros-dashing-rmw-opensplice-cpp, you should be able to build and run against opensplice.

mkhansenbot · 2019-05-23T14:14:54Z

@clalancette - thanks I'll try that!

dirk-thomas · 2019-05-23T14:18:28Z

Just for the record: the same is also true for Connext - just install ros-dashing-rmw-connext-cpp. That is for the Debian packages.

For the big archives you just need to run the OpenSplice / Connext installed and the ROS 2 binaries can use those by selecting then with the env var.

mkhansenbot · 2019-05-23T15:35:44Z

I'm trying to get this to work in our Dockerfile here:
ros-navigation/navigation2#745

It doesn't seem to be running with opensplice, or else opensplice is also failing in the Dockerfile, I can't tell exactly.

Our Dockerfile starts with pulling the OSRF ros2:nightly dockerfile then building on top of that. I'm installing the ros-dashing-rmw-opensplice-cpp but it doesn't seem to run with it at runtime.

alsora · 2019-05-23T17:13:10Z

May be a discovery problem related to #281?

MiguelCompany · 2019-05-27T13:25:14Z

We found the issue. It was related with a change necessary for the implementation of the lifespan QoS. A fix is on the way in eProsima/Fast-DDS#541, a new blackbox test is being added in eProsima/Fast-DDS#542, and a new unit test is under development.

dirk-thomas · 2019-05-27T16:54:08Z

@MiguelCompany great news.

@mkhansen-intel can you please retest with the latest code including the mentioned patch.

mkhansenbot · 2019-05-28T04:14:58Z

I pulled the latest Fast-RTPS master branch and rmw_fastrtps master, and I don't see our test pass. It looks like a new error message from Gazebo:

9: [gzserver-1] xcb_connection_has_error() returned true
9: [gzserver-1] xcb_connection_has_error() returned true
9: [amcl-5] [WARN] [amcl]: Laser scan has not been received (and thus no pose updates have been published). Verify that data is being published on the scan topic.
9: [amcl-5] [WARN] [amcl]: Laser scan has not been received (and thus no pose updates have been published). Verify that data is being published on the scan topic.
9: [amcl-5] [WARN] [amcl]: Laser scan has not been received (and thus no pose updates have been published). Verify that data is being published on the scan topic.
1/1 Test #9: test_localization ................***Timeout  60.01 sec

I haven't seen that message 'xcb_connection_has_error() returned true' before.

MiguelCompany · 2019-05-28T08:41:54Z

I haven't seen that message 'xcb_connection_has_error() returned true' before.

That is surely not related to fastrtps, as xcb_connection_has_error is part of XCB core API

I am setting up a workspace to reproduce the problem with the localization test, starting from the instructions on this comment. I also had to clone angles, image_common and vision_opencv for a sucessful build.

MiguelCompany · 2019-05-28T11:02:15Z

I've found an issue and it is not in fastrtps. It puzzles me that you said the tests pass with v1.7.2 and with rmw_opensplice, because I've found that the publisher of the "scan" topic is best effort, while the subscriber is reliable, and that configuration should make them incompatible for communication.

On my side the tests didn't pass with 1.7.2

The following patch taking the qos profile from rmw_qos_profile_sensor_data makes all tests pass:

diff --git a/nav2_amcl/src/amcl_node.cpp b/nav2_amcl/src/amcl_node.cpp
index e773fde..2c27ec2 100644
--- a/nav2_amcl/src/amcl_node.cpp
+++ b/nav2_amcl/src/amcl_node.cpp
@@ -134,6 +134,7 @@ AmclNode::AmclNode()
   set_map_srv_ = create_service<nav_msgs::srv::SetMap>("set_map", handle_set_map_callback);
 
 
+  custom_qos_profile = rmw_qos_profile_sensor_data;
   custom_qos_profile.depth = 1;
   laser_scan_sub_ = new message_filters::Subscriber<sensor_msgs::msg::LaserScan>(this,
       scan_topic_, custom_qos_profile);

mkhansenbot · 2019-05-28T14:08:07Z

@MiguelCompany - I tested and that change you suggested above definitely helps. I submitted a PR for it. Thanks for helping with that. I think the changes you made above helped also (eProsima/Fast-DDS#541,).

Let me see if between those changes and this PR we get our CI to pass again and I'll close this ticket.

dirk-thomas · 2019-05-29T16:24:45Z

@mkhansen-intel Any update?

dirk-thomas · 2019-05-30T16:08:03Z

I will go ahead and close this ticket assuming the problem has been resolved. If that is not the case please feel free to comment with more information and the ticket can be reopened.

dirk-thomas added the bug Something isn't working label May 21, 2019

mkhansenbot mentioned this issue May 21, 2019

AMCL fails to localize in simulation environment ros-navigation/navigation2#735

Closed

mkhansenbot mentioned this issue May 22, 2019

[WIP] Use Opensplice when running system tests ros-navigation/navigation2#745

Closed

MiguelCompany mentioned this issue May 27, 2019

Added discovery regression test [5479] eProsima/Fast-DDS#542

Merged

richiware mentioned this issue May 27, 2019

Fix on get_change with new History ordering [5477] eProsima/Fast-DDS#541

Merged

mkhansenbot mentioned this issue May 28, 2019

AMCL to use sensor data qos for scan topic ros-navigation/navigation2#760

Merged

dirk-thomas closed this as completed May 30, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FastRTPS 1.8.0 causes hangs in Navigation2 #280

FastRTPS 1.8.0 causes hangs in Navigation2 #280

mkhansenbot commented May 21, 2019 •

edited

Loading

mkhansenbot commented May 21, 2019 •

edited

Loading

dirk-thomas commented May 21, 2019

mkhansenbot commented May 21, 2019

mkhansenbot commented May 21, 2019

nburek commented May 21, 2019 •

edited

Loading

mkhansenbot commented May 21, 2019

dirk-thomas commented May 21, 2019

mkhansenbot commented May 21, 2019

dirk-thomas commented May 21, 2019

mkhansenbot commented May 22, 2019

dirk-thomas commented May 22, 2019

MiguelCompany commented May 22, 2019

MiguelCompany commented May 22, 2019 •

edited

Loading

dirk-thomas commented May 22, 2019 •

edited

Loading

dirk-thomas commented May 23, 2019

mkhansenbot commented May 23, 2019

clalancette commented May 23, 2019

mkhansenbot commented May 23, 2019

dirk-thomas commented May 23, 2019 •

edited

Loading

mkhansenbot commented May 23, 2019

alsora commented May 23, 2019

MiguelCompany commented May 27, 2019

dirk-thomas commented May 27, 2019

mkhansenbot commented May 28, 2019

MiguelCompany commented May 28, 2019

MiguelCompany commented May 28, 2019

mkhansenbot commented May 28, 2019

dirk-thomas commented May 29, 2019

dirk-thomas commented May 30, 2019

FastRTPS 1.8.0 causes hangs in Navigation2 #280

FastRTPS 1.8.0 causes hangs in Navigation2 #280

Comments

mkhansenbot commented May 21, 2019 • edited Loading

Bug report

Steps to reproduce issue

mkhansenbot commented May 21, 2019 • edited Loading

dirk-thomas commented May 21, 2019

mkhansenbot commented May 21, 2019

mkhansenbot commented May 21, 2019

nburek commented May 21, 2019 • edited Loading

mkhansenbot commented May 21, 2019

dirk-thomas commented May 21, 2019

mkhansenbot commented May 21, 2019

dirk-thomas commented May 21, 2019

mkhansenbot commented May 22, 2019

dirk-thomas commented May 22, 2019

MiguelCompany commented May 22, 2019

MiguelCompany commented May 22, 2019 • edited Loading

dirk-thomas commented May 22, 2019 • edited Loading

dirk-thomas commented May 23, 2019

mkhansenbot commented May 23, 2019

clalancette commented May 23, 2019

mkhansenbot commented May 23, 2019

dirk-thomas commented May 23, 2019 • edited Loading

mkhansenbot commented May 23, 2019

alsora commented May 23, 2019

MiguelCompany commented May 27, 2019

dirk-thomas commented May 27, 2019

mkhansenbot commented May 28, 2019

MiguelCompany commented May 28, 2019

MiguelCompany commented May 28, 2019

mkhansenbot commented May 28, 2019

dirk-thomas commented May 29, 2019

dirk-thomas commented May 30, 2019

mkhansenbot commented May 21, 2019 •

edited

Loading

mkhansenbot commented May 21, 2019 •

edited

Loading

nburek commented May 21, 2019 •

edited

Loading

MiguelCompany commented May 22, 2019 •

edited

Loading

dirk-thomas commented May 22, 2019 •

edited

Loading

dirk-thomas commented May 23, 2019 •

edited

Loading