Possible race condition in controller manager on humble #979
Comments
I can verify I am facing the same issue on Humble 22.04 with the latest packages of this month, and no custom controller. Below is the stack trace after it crashes. It happens very often (almost 50% of the time) but randomly.
Unfortunately, I asked in ros2/rclcpp#1756 and there won't be a backport; it seems it is an ABI-breaking fix.
I just encountered the same issue when soak testing my robot (running the robot non-stop for days).
Is there any way to fix this? Using Humble 22.04.
Quick follow-up question. In my case, respawning the
@pepeRossRobotics The only way we found to prevent the crash from happening is forking ros2_control and surrounding the place where the problematic method is called with a while/try/catch, but that is not a good way to do it; a rough sketch of what that looks like is below. Concerning the launch file, I believe you can use event handlers to start the "spawner" when the
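A minimal sketch of that retry workaround, assuming a controller-like object exposing the problematic `get_state()` (all names here are illustrative, not the actual fork):

```cpp
// Hedged sketch of the fork workaround described above, not the actual patch:
// retry the state read until it stops throwing the transient
// std::runtime_error raised when another thread is mid-transition.
#include <stdexcept>

#include "rclcpp_lifecycle/state.hpp"

// ControllerT stands in for whatever object exposes the problematic
// get_state() call; the function name is illustrative only.
template <typename ControllerT>
rclcpp_lifecycle::State get_state_with_retry(ControllerT & controller)
{
  while (true) {
    try {
      return controller.get_state();  // not thread-safe on Humble
    } catch (const std::runtime_error &) {
      // Raced with a lifecycle transition; loop and read again.
    }
  }
}
```

As noted above, busy-retrying around a data race is a stopgap, not a fix; it merely papers over the throw.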
Many thanks for the help. I assumed something like this was possible, but didn't know how to implement it. I will give it a go now.
There are some of those in our example code. Maybe you will find this useful: https://github.com/StoglRobotics/ros_team_workspace/blob/master/templates/ros2_control/robot_ros2_control.launch.py#L186
This is really helpful, many thanks!
Just wanted to add another stack trace to the pile. This is also on Humble on Ubuntu 22.04 LTS.
Two updates on this:
I think in the near term #922 should be reverted. While diagnostic info is nice to have, it's not acceptable for the robot control nodes to segfault like this.
@schornakj Thanks for the work and the PR to rclcpp.
We guessed that there is some mechanism that prevents the |
Describe the bug
There is a race condition on the call to `rclcpp_lifecycle::State::get_state()` in the controller manager. The `get_state()` method is not thread-safe, as explained in this issue. This was fixed in Rolling, but not on Humble, so any call to the `get_state()` method can potentially lead to a `std::runtime_error` raising the following error:

To Reproduce
It is hard to reproduce, because it happens randomly.
One of the controllers that we implemented used the same `get_state()` method, which triggered the error and crashed the controller_manager. But basically any call to `get_state()` can lead to that issue. There are three "paths" of calls that can lead to this method being called; the `update` method is one of them.

We have crafted a node that is a copy of the `ros2_control_node`, where we purposely call the `list_controllers_service_cb` in a separate thread, many times, to trigger the crash; a standalone sketch of the same idea follows the attached diff. Here is a git diff of the changes that you can apply to this repo to reproduce the crash.
reproductible_bug.diff.txt
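For illustration, a minimal standalone sketch of the same race (this is not the attached diff; the node name and loop count are arbitrary, and it may take many iterations before the crash appears):

```cpp
// Sketch of the reproduction idea: hammer the lifecycle state from a second
// thread while the main thread keeps triggering transitions. On Humble this
// eventually throws std::runtime_error from inside the state read.
#include <atomic>
#include <memory>
#include <thread>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp_lifecycle::LifecycleNode>("race_demo");
  std::atomic<bool> running{true};

  // Reader thread: plays the role of list_controllers_service_cb, which
  // queries lifecycle states while the rest of the system keeps running.
  std::thread reader([&]() {
    while (running) {
      (void)node->get_current_state().id();  // not thread-safe on Humble
    }
  });

  // Writer side: each transition mutates the very state object the reader
  // is concurrently copying, which is the race described above.
  for (int i = 0; i < 100000; ++i) {
    node->configure();  // unconfigured -> inactive
    node->cleanup();    // inactive -> unconfigured
  }

  running = false;
  reader.join();
  rclcpp::shutdown();
  return 0;
}
```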
Expected behavior
No crash. We might need to protect the call to `get_state()` with yet another mutex, roughly as sketched below, but the best solution would be for rclcpp_lifecycle to backport their fix to Humble.
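A minimal sketch of that mutex idea, for illustration only (the wrapper name and shape are hypothetical, not a proposed patch):

```cpp
// Funnel every read of a controller's lifecycle state through one lock so
// the service callbacks and the update loop cannot race on it.
#include <mutex>

#include "rclcpp_lifecycle/state.hpp"

class GuardedStateReader
{
public:
  // ControllerT stands in for ros2_control's controller interface type.
  template <typename ControllerT>
  rclcpp_lifecycle::State read(ControllerT & controller)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return controller.get_state();
  }

private:
  std::mutex mutex_;
};
```

Note that a manager-side lock like this can only serialize the manager's own reads; it cannot cover the state mutations inside rclcpp_lifecycle itself, which is why the backport remains the proper fix.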
Environment (please complete the following information):