Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FastRTPS 1.8.0 causes hangs in Navigation2 #280

Closed
mkhansenbot opened this issue May 21, 2019 · 29 comments · Fixed by eProsima/Fast-DDS#541
Closed

FastRTPS 1.8.0 causes hangs in Navigation2 #280

mkhansenbot opened this issue May 21, 2019 · 29 comments · Fixed by eProsima/Fast-DDS#541
Labels
bug Something isn't working

Comments

@mkhansenbot
Copy link

mkhansenbot commented May 21, 2019

Bug report

Required Info:

  • Operating System:
    • Ubuntu 18.04
  • Installation type:
    • Source
  • Version or commit hash:
    • 0.7.2
  • DDS implementation:
    • Fast-RTPS master branch (1.8.0)
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Currently I have to run our Navigation2 system test to reproduce this, I'm trying to find a simpler example. However what I see is that when I run our system test with the latest versions (master branches) of rmw_fastrtps (0.7.2) and Fast-RTPS (1.8.0) our test hangs and times out. When I run with the previous versions (0.7.1) and (1.7.2) respectively, things work fine. Also if I run with RMW_IMPLEMENTATION=rmw_opensplice_cpp, things work fine then too.

I haven't been able to isolate the problem. I can provide instructions for how to reproduce using the Nav2 system test if desired.

I see AMCL is stuck waiting for data on the /scan topic, but when I do a ros2 topic hz /scan I can see that the scan topic is being published correctly by gazebo. So it's like the callback to AMCL is not being executed. I'm not sure how to debug that, but I'm pretty sure it's in this rmw layer.

If anyone can offer some help or suggestions as to what to look at to debug this besides the fact that the topics are being executed, I'd appreciate the help.

This is high priority as it is blocking our CI. We won't be able to release for Dashing in this state.

@mkhansenbot
Copy link
Author

mkhansenbot commented May 21, 2019

besides the fact that the topics are being executed

I mean besides the fact that the topics are being published

@dirk-thomas
Copy link
Member

I can provide instructions for how to reproduce using the Nav2 system test if desired.

Please do so in order for others to be able to help.

@mkhansenbot
Copy link
Author

OK, here are instructions:

  1. Assuming you've already cloned and built the ros2 repos following the 'Building from source' directions, and you're using the 'master' version of the ros2 repos file.

  2. clone our repo

mkdir -p ~/nav2_ws/src
cd ~/nav2_ws/src
git clone https://github.com/ros-planning/navigation2.git
  1. Download and build source dependencies:
mkdir -p ~/nav2_dependencies_ws/src
cd ~/nav2_dependencies_ws
cp ~/nav2_ws/src/navigation2/tools/ros2_dependencies.repos .
vcs import src < ros2_dependencies.repos
colcon build --symlink-install
  1. Build Navigation2 with system tests
cd ~/nav2_ws
rm src/navigation2/nav2_system_tests/COLCON_IGNORE
colcon build --symlink install
  1. Run the system tests
cd build/nav2_system_tests
ctest -V

At this point, you'll see that the test_localization test will timeout, as will the bt_navigator test.

Switching to RMW_IMPLEMENTATION=rmw_opensplice_cpp and re-running the test again, it should pass.

I hope this helps, I appreciate any help I can get on this

@mkhansenbot
Copy link
Author

I forgot, of course you have to source install/setup.bash after step 2 and step 3.

@nburek
Copy link
Contributor

nburek commented May 21, 2019

@mkhansen-intel Do these tests use service servers and clients? There was an issue with some of the test related to using the parameter client that were caused by a race condition of the server and client not being fully subscribed to each other before the function to wait for the service to be ready returns true on the client side. In that case we were seeing that the client call was hanging because it was waiting for a response to a message the server never received. Does it seem like this might be a similar situation to your failing test?

My understanding was that the tests we saw this on were problematic before the upgrade to 1.8.0, but that it was exacerbated after the update. ros2/build_farmer#166

@mkhansenbot
Copy link
Author

@nburek - I don't think that's the same issue, but I'll look into this more deeply and see. I believe the only wait for service would be for setting use_sim_time=True in this case, but I don't think that's what's hanging. I'll verify that and post again.

@dirk-thomas
Copy link
Member

I just tried the following and can confirm the hang with FastRTPS:

  • Shell A:
    • install the Dashing prerelease Debian packages (ros-dashing-*)
    • source /opt/ros/dashing/setup.bash
    • clone ros-planning/navigation2, ros-simulation/gazebo_ros_pkgs@ros2 (will be released soon) and BehaviorTree/BehaviorTree.CPP@ros2 (since its release failed to build) into a workspace ws
    • call colcon build in that workspace
  • Shell B:
    • source ws/install/setup.bash
    • ctest -V -R test_localization hangs...

@richiprosima You insight might be helpful on this ticket. This is using the latest commit from the master branch of FastRTPS.

@mkhansenbot
Copy link
Author

@dirk-thomas - glad you are able to reproduce

When you back up to the previous released version of FastRTPS do you see the test pass? That's what I see.

I believe our test is hanging when we try to set use_sim_time on our nodes. At least that's what I'm seeing when I run manually. I need to test again with the old version to confirm.

@dirk-thomas
Copy link
Member

From reading the sources a bit more it does look like that you are currently making the assumption that you can publish a message right after the publisher has been created without waiting for potential subscribers to be matched.

For the initialpose topic I added that kind of logic on both sides (see https://github.com/dirk-thomas/navigation2/commit/5fbeb2b1cd070fdf4b72c501270bdea9e3b0f450). That logic itself obviously changes the timing and therefore makes the test pass for me sometimes.

But in the cases where is still hangs everything points to a discovery problem in FastRTPS (@richiprosima). A common pattern I am seeing:

  • publisher and subscriber are being created in separate processes
  • both wait until the matched count of the opposite side is at least 1
  • the publisher side continue and publishes a message after the matched subscription count is 1
  • but the subscriber never "sees" a matching publisher and therefore hangs in the loop I added forever (without the loop the subscriber would simply not get the published message and the system hangs as before).

@mkhansenbot
Copy link
Author

From reading the sources a bit more it does look like that you are currently making the assumption that you can publish a message right after the publisher has been created without waiting for potential subscribers to be matched.

Is that invalid? In ROS it was OK, and this hasn't been a problem in the past for ROS2 that I know of. We can make the change, but if this is something we should be doing everywhere, we're not, and it would take us a bit of work to implement in every node. I'm OK with doing that, if that's the best practice for ROS2.

@dirk-thomas
Copy link
Member

Is that invalid?

Yes, at least in terms of that the code is subject to a race.

this hasn't been a problem in the past for ROS2 that I know of

This has always been the case - the time race just shows on some system / platforms but not on others.

We can make the change, but if this is something we should be doing everywhere, we're not, and it would take us a bit of work to implement in every node.

For my snippet above the check of the matched endpoint count is fine sine the test knows exactly how many endpoints to expect. In many other cases the expected count is unknown. Then the software needs to be designed in a way that it handles later matched endpoints gracefully.

E.g. in the communication tests a publisher publishes a message with known values and the subscriber compares the first received message with expected values. To make this robust against this kind of race we don't rely on the matched endpoint count but repeatedly publish the message in the publisher (until the test finished when the subscriber gets the first message).


I just want to be clear that independent of this race there still seems to be a regression in the latest FastRTPS version which leads to one of the endpoints not being matched at all in some cases (even after waiting e.g. 30s).

@MiguelCompany
Copy link
Collaborator

I just want to be clear that independent of this race there still seems to be a regression in the latest FastRTPS version which leads to one of the endpoints not being matched at all in some cases (even after waiting e.g. 30s).

We are trying to reproduce this, will keep you informed

@MiguelCompany
Copy link
Collaborator

MiguelCompany commented May 22, 2019

One thing to note. The requirement of the publisher waiting for the subscriber to match is only necessary when durability is set to RMW_QOS_POLICY_DURABILITY_VOLATILE (which is the value on rmw_qos_profile_default). Changing it to RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL should make the subscriber receive the sample even if it was published before its creation.

@dirk-thomas
Copy link
Member

dirk-thomas commented May 22, 2019

The requirement of the publisher waiting for the subscriber to match is only necessary when durability is set to RMW_QOS_POLICY_DURABILITY_VOLATILE (which is the value on rmw_qos_profile_default). Changing it to RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL should make the subscriber receive the sample even if it was published before its creation.

👍 That is certainly possible.

But in some cases transient local isn't a good choice and volatile should still reliably discover / match endpoints given enough time.

@dirk-thomas
Copy link
Member

@richiprosima FYI this one seems to be a very similar report of the same problem: ros2/ros2#651

@mkhansenbot
Copy link
Author

@dirk-thomas - given that we're days from Dashing release can we either:

  1. Roll back to 1.7.2 with rmw_fastrtps 0.7.1
  2. Include Opensplice in the binaries so that we can set RMW_IMPLEMENTATION=rmw_opensplice_cpp without having to build ROS2 from source repos

@clalancette
Copy link
Contributor

@mkhansen-intel Number 2 is already available; if you apt-get install ros-dashing-rmw-opensplice-cpp, you should be able to build and run against opensplice.

@mkhansenbot
Copy link
Author

@clalancette - thanks I'll try that!

@dirk-thomas
Copy link
Member

dirk-thomas commented May 23, 2019

Just for the record: the same is also true for Connext - just install ros-dashing-rmw-connext-cpp. That is for the Debian packages.

For the big archives you just need to run the OpenSplice / Connext installed and the ROS 2 binaries can use those by selecting then with the env var.

@mkhansenbot
Copy link
Author

I'm trying to get this to work in our Dockerfile here:
ros-navigation/navigation2#745

It doesn't seem to be running with opensplice, or else opensplice is also failing in the Dockerfile, I can't tell exactly.

Our Dockerfile starts with pulling the OSRF ros2:nightly dockerfile then building on top of that. I'm installing the ros-dashing-rmw-opensplice-cpp but it doesn't seem to run with it at runtime.

@alsora
Copy link
Contributor

alsora commented May 23, 2019

May be a discovery problem related to #281?

@MiguelCompany
Copy link
Collaborator

We found the issue. It was related with a change necessary for the implementation of the lifespan QoS. A fix is on the way in eProsima/Fast-DDS#541, a new blackbox test is being added in eProsima/Fast-DDS#542, and a new unit test is under development.

@dirk-thomas
Copy link
Member

@MiguelCompany great news.

@mkhansen-intel can you please retest with the latest code including the mentioned patch.

@mkhansenbot
Copy link
Author

I pulled the latest Fast-RTPS master branch and rmw_fastrtps master, and I don't see our test pass. It looks like a new error message from Gazebo:

9: [gzserver-1] xcb_connection_has_error() returned true
9: [gzserver-1] xcb_connection_has_error() returned true
9: [amcl-5] [WARN] [amcl]: Laser scan has not been received (and thus no pose updates have been published). Verify that data is being published on the scan topic.
9: [amcl-5] [WARN] [amcl]: Laser scan has not been received (and thus no pose updates have been published). Verify that data is being published on the scan topic.
9: [amcl-5] [WARN] [amcl]: Laser scan has not been received (and thus no pose updates have been published). Verify that data is being published on the scan topic.
1/1 Test #9: test_localization ................***Timeout  60.01 sec

I haven't seen that message 'xcb_connection_has_error() returned true' before.

@MiguelCompany
Copy link
Collaborator

I haven't seen that message 'xcb_connection_has_error() returned true' before.

That is surely not related to fastrtps, as xcb_connection_has_error is part of XCB core API

I am setting up a workspace to reproduce the problem with the localization test, starting from the instructions on this comment. I also had to clone angles, image_common and vision_opencv for a sucessful build.

@MiguelCompany
Copy link
Collaborator

I've found an issue and it is not in fastrtps. It puzzles me that you said the tests pass with v1.7.2 and with rmw_opensplice, because I've found that the publisher of the "scan" topic is best effort, while the subscriber is reliable, and that configuration should make them incompatible for communication.

On my side the tests didn't pass with 1.7.2

The following patch taking the qos profile from rmw_qos_profile_sensor_data makes all tests pass:

diff --git a/nav2_amcl/src/amcl_node.cpp b/nav2_amcl/src/amcl_node.cpp
index e773fde..2c27ec2 100644
--- a/nav2_amcl/src/amcl_node.cpp
+++ b/nav2_amcl/src/amcl_node.cpp
@@ -134,6 +134,7 @@ AmclNode::AmclNode()
   set_map_srv_ = create_service<nav_msgs::srv::SetMap>("set_map", handle_set_map_callback);
 
 
+  custom_qos_profile = rmw_qos_profile_sensor_data;
   custom_qos_profile.depth = 1;
   laser_scan_sub_ = new message_filters::Subscriber<sensor_msgs::msg::LaserScan>(this,
       scan_topic_, custom_qos_profile);

@mkhansenbot
Copy link
Author

@MiguelCompany - I tested and that change you suggested above definitely helps. I submitted a PR for it. Thanks for helping with that. I think the changes you made above helped also (eProsima/Fast-DDS#541,).

Let me see if between those changes and this PR we get our CI to pass again and I'll close this ticket.

@dirk-thomas
Copy link
Member

@mkhansen-intel Any update?

@dirk-thomas
Copy link
Member

I will go ahead and close this ticket assuming the problem has been resolved. If that is not the case please feel free to comment with more information and the ticket can be reopened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants