FastRTPS 1.8.0 causes hangs in Navigation2 #280
Comments
I mean besides the fact that the topics are being published |
Please do so in order for others to be able to help. |
OK, here are instructions:
At this point, you'll see that the test_localization test will time out, as will the bt_navigator test. If you switch to RMW_IMPLEMENTATION=rmw_opensplice_cpp and re-run the test, it should pass. I hope this helps. I appreciate any help I can get on this |
I forgot, of course you have to |
@mkhansen-intel Do these tests use service servers and clients? There was an issue with some of the tests that use the parameter client, caused by a race condition: the server and client were not fully subscribed to each other before the function that waits for the service to be ready returned true on the client side. In that case the client call hung because it was waiting for a response to a message the server never received. Does it seem like this might be a similar situation to your failing test? My understanding was that the tests we saw this on were problematic before the upgrade to 1.8.0, but that it was exacerbated after the update. ros2/build_farmer#166 |
@nburek - I don't think that's the same issue, but I'll look into this more deeply and see. I believe the only wait for service would be for setting use_sim_time=True in this case, but I don't think that's what's hanging. I'll verify that and post again. |
I just tried the following and can confirm the hang with FastRTPS:
@richiprosima Your insight might be helpful on this ticket. This is using the latest commit from the |
@dirk-thomas - glad you are able to reproduce. When you back up to the previous released version of FastRTPS do you see the test pass? That's what I see. I believe our test is hanging when we try to set use_sim_time on our nodes. At least that's what I'm seeing when I run manually. I need to test again with the old version to confirm. |
From reading the sources a bit more it does look like you are currently making the assumption that you can publish a message right after the publisher has been created without waiting for potential subscribers to be matched. For the But in the cases where it still hangs everything points to a discovery problem in FastRTPS (@richiprosima). A common pattern I am seeing:
|
Is that invalid? In ROS it was OK, and this hasn't been a problem in the past for ROS2 that I know of. We can make the change, but if this is something we should be doing everywhere, we're not, and it would take us a bit of work to implement in every node. I'm OK with doing that, if that's the best practice for ROS2. |
Yes, at least in the sense that the code is subject to a race.
This has always been the case - the race just shows up on some systems / platforms but not on others.
For my snippet above the check of the matched endpoint count is fine since the test knows exactly how many endpoints to expect. In many other cases the expected count is unknown. Then the software needs to be designed in a way that it handles later matched endpoints gracefully. E.g. in the communication tests a publisher publishes a message with known values and the subscriber compares the first received message with expected values. To make this robust against this kind of race we don't rely on the matched endpoint count but repeatedly publish the message in the publisher (until the test finishes when the subscriber gets the first message). I just want to be clear that independent of this race there still seems to be a regression in the latest FastRTPS version which leads to one of the endpoints not being matched at all in some cases (even after waiting e.g. 30s). |
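The "publish repeatedly until the subscriber has the first message" pattern described above can be sketched roughly as follows. This is a toy, in-process model written in plain Python, not ROS 2 or FastRTPS code: `ToyTopic`, `publish_until_received`, and `demo` are hypothetical names invented for illustration, and discovery is simulated by matching the subscriber after a fixed number of attempts.

```python
class ToyTopic:
    """Hypothetical stand-in for a volatile topic.

    Messages published before any subscriber is matched are simply dropped,
    which is the race discussed in this thread.
    """

    def __init__(self):
        self.callbacks = []

    def match(self, callback):
        # Simulates discovery completing for one subscriber.
        self.callbacks.append(callback)

    def publish(self, msg):
        for callback in self.callbacks:
            callback(msg)
        return len(self.callbacks)  # number of subscribers that got the message


def publish_until_received(topic, msg, received, on_tick, max_attempts=100):
    """Republish msg until received() is true; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        topic.publish(msg)
        if received():
            return attempt
        on_tick(attempt)  # stands in for sleeping / spinning between attempts
    raise TimeoutError("no subscriber received the message")


def demo(match_after=3):
    topic = ToyTopic()
    inbox = []

    def on_tick(attempt):
        # Assumption for the demo: the subscriber is matched only after
        # `match_after` publish attempts have already been lost.
        if attempt == match_after:
            topic.match(inbox.append)

    attempts = publish_until_received(
        topic, "known-value", lambda: bool(inbox), on_tick)
    return attempts, inbox


if __name__ == "__main__":
    print(demo())
```

In this sketch a single publish right after creating the topic reaches nobody, while the retry loop delivers the message on the first attempt after the subscriber is matched. A real node would bound the loop with a timeout rather than an attempt count.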
We are trying to reproduce this, will keep you informed |
One thing to note. The requirement of the publisher waiting for the subscriber to match is only necessary when durability is set to |
👍 That is certainly possible. But in some cases transient local isn't a good choice and volatile should still reliably discover / match endpoints given enough time. |
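To illustrate the difference between volatile and transient local durability for a late-joining subscriber, here is a small toy model in plain Python. It is a sketch under simplifying assumptions, not the DDS or rmw API: `ToyPublisher` is a hypothetical class, only the publisher side carries the durability setting, and the history depth is a made-up parameter.

```python
class ToyPublisher:
    """Hypothetical model of a publisher with a durability setting."""

    def __init__(self, transient_local=False, depth=1):
        self.transient_local = transient_local
        self.depth = depth        # how many past messages are kept
        self.history = []         # only used when transient_local is True
        self.subscribers = []     # each subscriber is modelled as a list inbox

    def publish(self, msg):
        if self.transient_local:
            # Keep the last `depth` messages for subscribers that match later.
            self.history = (self.history + [msg])[-self.depth:]
        for inbox in self.subscribers:
            inbox.append(msg)

    def match(self, inbox):
        # A late joiner receives the stored history only with transient local.
        if self.transient_local:
            inbox.extend(self.history)
        self.subscribers.append(inbox)


def late_joiner_inbox(transient_local):
    pub = ToyPublisher(transient_local=transient_local)
    pub.publish("first")          # published before any subscriber is matched
    inbox = []
    pub.match(inbox)              # subscriber is discovered afterwards
    return inbox


if __name__ == "__main__":
    print(late_joiner_inbox(False), late_joiner_inbox(True))
```

With the volatile setting the early message is lost to the late joiner, so the publisher has to wait for a match or republish. With transient local the stored message is delivered on match, at the cost of keeping history around, which is why it is not always a good fit.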
@richiprosima FYI this one seems to be a very similar report of the same problem: ros2/ros2#651 |
@dirk-thomas - given that we're days from Dashing release can we either:
|
@mkhansen-intel Number 2 is already available; if you |
@clalancette - thanks I'll try that! |
Just for the record: the same is also true for Connext - just install For the big archives you just need to run the OpenSplice / Connext installed and the ROS 2 binaries can use those by selecting them with the env var. |
I'm trying to get this to work in our Dockerfile here: It doesn't seem to be running with opensplice, or else opensplice is also failing in the Dockerfile, I can't tell exactly. Our Dockerfile starts with pulling the OSRF ros2:nightly dockerfile then building on top of that. I'm installing the |
May be a discovery problem related to #281? |
We found the issue. It was related with a change necessary for the implementation of the lifespan QoS. A fix is on the way in eProsima/Fast-DDS#541, a new blackbox test is being added in eProsima/Fast-DDS#542, and a new unit test is under development. |
@MiguelCompany great news. @mkhansen-intel can you please retest with the latest code including the mentioned patch. |
I pulled the latest Fast-RTPS master branch and rmw_fastrtps master, and I don't see our test pass. It looks like a new error message from Gazebo:
I haven't seen that message 'xcb_connection_has_error() returned true' before. |
That is surely not related to fastrtps, as xcb_connection_has_error is part of the XCB core API. I am setting up a workspace to reproduce the problem with the localization test, starting from the instructions on this comment. I also had to clone angles, image_common and vision_opencv for a successful build. |
I've found an issue and it is not in fastrtps. It puzzles me that you said the tests pass with v1.7.2 and with rmw_opensplice, because I've found that the publisher of the "scan" topic is best effort, while the subscriber is reliable, and that configuration should make them incompatible for communication. On my side the tests didn't pass with 1.7.2. The following patch taking the qos profile from
|
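The incompatibility described above follows the DDS "requested versus offered" rule for reliability: a subscriber only matches a publisher that offers at least the reliability level it requests. The following is a minimal sketch of that rule in plain Python. The names `Reliability` and `reliability_compatible` are hypothetical and are not part of rmw_fastrtps or rclcpp.

```python
from enum import IntEnum


class Reliability(IntEnum):
    # Ordered so that a higher value means a stronger guarantee.
    BEST_EFFORT = 1
    RELIABLE = 2


def reliability_compatible(offered, requested):
    """Return True if a publisher offering `offered` can match a
    subscriber requesting `requested`."""
    return offered >= requested


if __name__ == "__main__":
    for offered in Reliability:
        for requested in Reliability:
            print(f"publisher={offered.name:<11} subscriber={requested.name:<11} "
                  f"compatible={reliability_compatible(offered, requested)}")
```

Under this rule the "scan" case from the comment above (best-effort publisher, reliable subscriber) is the only one of the four combinations that does not match, so the subscriber callback would never fire even though the topic is being published.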
@MiguelCompany - I tested and that change you suggested above definitely helps. I submitted a PR for it. Thanks for helping with that. I think the changes you made above helped also (eProsima/Fast-DDS#541). Let me see if between those changes and this PR we get our CI to pass again and I'll close this ticket. |
@mkhansen-intel Any update? |
I will go ahead and close this ticket assuming the problem has been resolved. If that is not the case please feel free to comment with more information and the ticket can be reopened. |
Bug report
Required Info:
Steps to reproduce issue
Currently I have to run our Navigation2 system test to reproduce this; I'm trying to find a simpler example. However, what I see is that when I run our system test with the latest versions (master branches) of rmw_fastrtps (0.7.2) and Fast-RTPS (1.8.0) our test hangs and times out. When I run with the previous versions (0.7.1) and (1.7.2) respectively, things work fine. Also if I run with RMW_IMPLEMENTATION=rmw_opensplice_cpp, things work fine then too.
I haven't been able to isolate the problem. I can provide instructions for how to reproduce using the Nav2 system test if desired.
I see AMCL is stuck waiting for data on the /scan topic, but when I do a
ros2 topic hz /scan
I can see that the scan topic is being published correctly by gazebo. So it's like the callback to AMCL is not being executed. I'm not sure how to debug that, but I'm pretty sure it's in this rmw layer. If anyone can offer some help or suggestions as to what to look at to debug this besides the fact that the topics are being executed, I'd appreciate the help.
This is high priority as it is blocking our CI. We won't be able to release for Dashing in this state.