Possible race condition in controller manager on humble #979
Comments
I can verify I am facing the same issue on Humble 22.04 with the latest packages of this month, and no custom controller. Below is the stack trace after it crashes. It happens very often (almost 50% of the time) but randomly.
Unfortunately, I asked in ros2/rclcpp#1756 and there won't be a backport; it seems it is an ABI-breaking fix.
I just encountered the same issue when soak testing my robot (running the robot non-stop for days).
Is there any way to fix this? Using Humble 22.04.
Quick follow-up question. In my case, respawning the
@pepeRossRobotics The only way we found to prevent the crash from happening is forking ros2_control and surrounding the place where the problematic method is called with a while/try/catch, but that is not a good way to do it; a rough sketch of what that looks like is below. Concerning the launch file, I believe you can use event handlers to start the "spawner" when the
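A minimal sketch of that retry workaround, assuming a controller-like object exposing the problematic `get_state()` (all names here are illustrative, not the actual fork):

```cpp
// Hedged sketch of the fork workaround described above, not the actual patch:
// retry the state read until it stops throwing the transient
// std::runtime_error raised when another thread is mid-transition.
#include <stdexcept>

#include "rclcpp_lifecycle/state.hpp"

// ControllerT stands in for whatever object exposes the problematic
// get_state() call; the function name is illustrative only.
template <typename ControllerT>
rclcpp_lifecycle::State get_state_with_retry(ControllerT & controller)
{
  while (true) {
    try {
      return controller.get_state();  // not thread-safe on Humble
    } catch (const std::runtime_error &) {
      // Raced with a lifecycle transition; loop and read again.
    }
  }
}
```

As noted above, busy-retrying around a data race is a stopgap, not a fix; it merely papers over the throw.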
Many thanks for the help. I assumed something like this was possible, but didn't know how to implement it. I will give it a go now.
There are some of those in our example code. Maybe you will find this useful: https://github.com/StoglRobotics/ros_team_workspace/blob/master/templates/ros2_control/robot_ros2_control.launch.py#L186
This is really helpful, many thanks!
Just wanted to add another stack trace to the pile. This is also on Humble on Ubuntu 22.04 LTS.
Two updates on this:
I think in the near term #922 should be reverted. While diagnostic info is nice to have, it's not acceptable for the robot control nodes to segfault like this.
@schornakj Thanks for the work and the PR to rclcpp.
We guessed that there is some mechanism that prevents the |
Describe the bug
There is a race condition on the call to `rclcpp_lifecycle::State::get_state()` in the controller manager. The `get_state()` method is not thread-safe, as explained in this issue. This was fixed in Rolling, but not on Humble, so any call to the `get_state()` method can potentially lead to a `std::runtime_error` raising the following error:

To Reproduce
It is hard to reproduce, because it happens randomly.
One of the controllers that we implemented used the same `get_state()` method, which triggered the error and crashed the controller_manager. But basically any call to `get_state()` can lead to that issue. There are three "paths" of calls that can lead to this method being called; the `update` method is one of them.

We have crafted a node that is a copy of the `ros2_control_node`, where we purposely call the `list_controllers_service_cb` in a separate thread, many times, to trigger the crash; a standalone sketch of the same idea follows the attached diff. Here is a git diff of the changes that you can apply to this repo to reproduce the crash.
reproductible_bug.diff.txt
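For illustration, a minimal standalone sketch of the same race (this is not the attached diff; the node name and loop count are arbitrary, and it may take many iterations before the crash appears):

```cpp
// Sketch of the reproduction idea: hammer the lifecycle state from a second
// thread while the main thread keeps triggering transitions. On Humble this
// eventually throws std::runtime_error from inside the state read.
#include <atomic>
#include <memory>
#include <thread>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp_lifecycle::LifecycleNode>("race_demo");
  std::atomic<bool> running{true};

  // Reader thread: plays the role of list_controllers_service_cb, which
  // queries lifecycle states while the rest of the system keeps running.
  std::thread reader([&]() {
    while (running) {
      (void)node->get_current_state().id();  // not thread-safe on Humble
    }
  });

  // Writer side: each transition mutates the very state object the reader
  // is concurrently copying, which is the race described above.
  for (int i = 0; i < 100000; ++i) {
    node->configure();  // unconfigured -> inactive
    node->cleanup();    // inactive -> unconfigured
  }

  running = false;
  reader.join();
  rclcpp::shutdown();
  return 0;
}
```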
Expected behavior
No crash. We might need to protect the call to `get_state()` with yet another mutex, roughly as sketched below, but the best solution would be for rclcpp_lifecycle to backport their fix to Humble.
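A minimal sketch of that mutex idea, for illustration only (the wrapper name and shape are hypothetical, not a proposed patch):

```cpp
// Funnel every read of a controller's lifecycle state through one lock so
// the service callbacks and the update loop cannot race on it.
#include <mutex>

#include "rclcpp_lifecycle/state.hpp"

class GuardedStateReader
{
public:
  // ControllerT stands in for ros2_control's controller interface type.
  template <typename ControllerT>
  rclcpp_lifecycle::State read(ControllerT & controller)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return controller.get_state();
  }

private:
  std::mutex mutex_;
};
```

Note that a manager-side lock like this can only serialize the manager's own reads; it cannot cover the state mutations inside rclcpp_lifecycle itself, which is why the backport remains the proper fix.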
Environment (please complete the following information):