Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TF2 Buffer crash due to timer not destructed #460

Closed
BriceRenaudeau opened this issue Sep 28, 2021 · 10 comments
Closed

TF2 Buffer crash due to timer not destructed #460

BriceRenaudeau opened this issue Sep 28, 2021 · 10 comments
Assignees

Comments

@BriceRenaudeau
Copy link

Bug report

  • Operating System:
    • Ubuntu 20.04
  • ROS2 Version:
    • Galactic
  • Version or commit hash:
    • from source: galactic branch

Steps to reproduce the issue

Create a costmap_2D with some plugins.
Then repeatedly configure - activate - deactivate - cleanup.

costmap_ros_->deactivate();
costmap_ros_->cleanup();
costmap_ros_->configure();
costmap_ros_->activate();

Expected behavior

The costmap is repeatedly activated and deactivated without a crash.

Actual behavior

After some time, the node crash due to a segfault.

Additional information

Using backward_ros I got this detailed failure.

[my_process-30] [INFO] [$1632495493.220125523] [my_process.local_costmap]: Activating
[my_process-30] [INFO] [$1632495493.220909263] [my_process.local_costmap]: Checking transform
[my_process-30] [INFO] [$1632495493.220943918] [my_process.local_costmap]: Timed out waiting for transform from base_footprint to odom to become available, tf error: Could not find a connection between 'odom' and 'base_footprint' because they are not part of the same tree.Tf has two or more unconnected trees.
[my_process-30] Stack trace (most recent call last) in thread 393537:
[my_process-30] #11 Object "", at 0xffffffffffffffff, in
[my_process-30] #10 Source "../sysdeps/unix/sysv/linux/x86_64/clone.S", line 95, in __clone [0x7fa9098f7292]
[my_process-30] #9 Source "/build/glibc-YbNSs7/glibc-2.31/nptl/pthread_create.c", line 477, in start_thread [0x7fa9097b5608]
[my_process-30] #8 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28", at 0x7fa909c09de3, in
[my_process-30] #7 Object "/home/brice/elodie2_ws/build/nav2_util/src/libnav2_util_core.so", at 0x7fa90a0991fc, in std::thread::_State_impl<std::thread::Invoker<std::tuple<nav2_util::NodeThread::NodeThread(std::shared_ptrrclcpp::node_interfaces::NodeBaseInterface)::{lambda()#1}> > >::M_run()
[my_process-30] #6 Object "/home/brice/elodie2_ws/install/rclcpp/lib/librclcpp.so", at 0x7fa909eab60b, in rclcpp::executors::SingleThreadedExecutor::spin()
[my_process-30] #5 Object "/home/brice/elodie2_ws/install/rclcpp/lib/librclcpp.so", at 0x7fa909ea3adc, in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&)
[my_process-30] #4 Object "/opt/ros/galactic/lib/libtf2_ros.so", at 0x7fa9095c9c39, in rclcpp::GenericTimer<std::function<void ()>, (void*)0>::execute_callback()
[my_process-30] #3 Object "/opt/ros/galactic/lib/libtf2_ros.so", at 0x7fa9095ca3e7, in std::Function_handler<void (), std::Bind<void (tf2_ros::CreateTimerROS::(tf2_ros::CreateTimerROS, unsigned long, std::function<void (unsigned long const&)>))(unsigned long const&, std::function<void (unsigned long const&)>)> >::M_invoke(std::Any_data const&)
[my_process-30] #2 Object "/opt/ros/galactic/lib/libtf2_ros.so", at 0x7fa9095c52a7, in std::Function_handler<void (unsigned long const&), std::Bind<void (tf2_ros::Buffer::(tf2_ros::Buffer, std::Placeholder<1>, std::shared_ptr<std::promise<geometry_msgs::msg::TransformStamped<std::allocator > > >, std::shared_future<geometry_msgs::msg::TransformStamped<std::allocator > >, std::function<void (std::shared_future<geometry_msgs::msg::TransformStamped<std::allocator > > const&)>))(unsigned long const&, std::shared_ptr<std::promise<geometry_msgs::msg::TransformStamped<std::allocator > > >, std::shared_future<geometry_msgs::msg::TransformStamped<std::allocator > >, std::function<void (std::shared_future<geometry_msgs::msg::TransformStamped<std::allocator > > const&)>)> >::M_invoke(std::Any_data const&, unsigned long const&)
[my_process-30] #1 Object "/opt/ros/galactic/lib/libtf2_ros.so", at 0x7fa9095bfcdf, in tf2_ros::Buffer::timerCallback(unsigned long const&, std::shared_ptr<std::promise<geometry_msgs::msg::TransformStamped
<std::allocator > > >, std::shared_future<geometry_msgs::msg::TransformStamped
<std::allocator > >, std::function<void (std::shared_future<geometry_msgs::msg::TransformStamped<std::allocator > > const&)>)
[my_process-30] #0 Object "/home/brice/elodie2_ws/build/nav2_costmap_2d/liblayers.so", at 0x7fa900988d99, in tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan<std::allocator >, tf2_ros::Buffer>::transformReadyCallback(std::shared_future<geometry_msgs::msg::TransformStamped<std::allocator > > const&, unsigned long)
[my_process-30] Segmentation fault (Address not mapped to object [0x78])
[ERROR] [my_process-30]: process has died [pid 392956, exit code -11, cmd

This crash might be due to the tf2_ros waitForTransform function that creates a timer to call a callback:

auto timer_handle = timer_interface_->createTimer(
clock_,
timeout,
std::bind(&Buffer::timerCallback, this, std::placeholders::_1, promise, future, callback));

When cleaning the costmap, the buffer is destructed but the Timer still exists and the callback function doesn't exist anymore hence the timer creates a segfault.

The crash happens after the call of tf_buffer_->canTransform:
https://github.com/ros-planning/navigation2/blob/e5a4c844c748233dc4d3039b1ea5809c81dac45f/nav2_costmap_2d/src/costmap_2d_ros.cpp#L219-L221

My false fix

To avoid this crash I added a delay(transform_tolerance_) between deactivate() and cleanup().
To let the timer trigger before .reset()

@clalancette
Copy link
Contributor

Yep, this is definitely a bug, and is also most likely the reason for ros2/rviz#703 . I've attempted a few fixes of this, but I have not yet been able to get something that fixes it completely. I'm still trying.

@aaronchongth
Copy link

Hello @BriceRenaudeau! Thanks for bringing this up. I am currently helping to look into this too, but can't seem to recreate the problem reliably other than repeatedly toggling a costmap in rviz2 randomly. Could you provide a sample of how you're recreating it?

@BriceRenaudeau
Copy link
Author

Hi, I cannot give you the full code, but I can provide you with the critical parts.
My goal was to change several parameters of the costmap dynamically.
But the dynamic parameter wasn't working yet.

So I create a node with the costmap and aParameterEventHandler

MyNode(): Node("my_node")
{
  // Init costmap
  costmap_ros_ = std::make_shared<nav2_costmap_2d::Costmap2DROS>("local_costmap", "", "my_node");
  costmap_thread_ = std::make_unique<nav2_util::NodeThread>(costmap_ros_);
  costmap_active_ = false;

  // init costmap
  costmap_ros_->configure();
  costmap_ros_->activate();
  costmap_active_ = true;

  // Init param handler
  param_handler_ = std::make_shared<rclcpp::ParameterEventHandler>(this);
  static_layer_cb_handler_ = param_handler_->add_parameter_callback(
        "static_layer.enabled",
        std::bind(&MyNode::static_layer_cb, this, std::placeholders::_1),
        "/my_node/local_costmap"
        );
}

Then in the ParamEventHandler I catch the config and restart the costmap.

void MyNode::static_layer_cb(const rclcpp::Parameter param) {
  if (costmap_ros_->get_current_state().label() == "active") {

    if (param.as_bool() != use_static_layer) {
      use_static_layer = param.as_bool();
      // reload costmap
      RCLCPP_INFO(get_logger(), "Costmap param has changed : Reset costmap");

      costmap_ros_->deactivate();
      //rclcpp::Rate(2).sleep(); // my durty fix
      costmap_ros_->cleanup();
      costmap_ros_->configure();
      costmap_ros_->activate();
    }
  }
  return;
}

Hence every time you change a parameter it will restart the costmap

auto parameter_client = std::make_shared<rclcpp::AsyncParametersClient>(shared_from_this(), "/my_node/local_costmap", rmw_qos_profile_parameters);
auto set_results = parameter_client->set_parameters({rclcpp::Parameter("static_layer.enabled", false)});

@aaronchongth
Copy link

Hello @BriceRenaudeau! I just created a PR targeting ros2 which I believe can fix the issue. Unfortunately I have still been unable to reliable test it using downstream packages, but I did include a test which I believe replicates the issue.

I have this branch applying the same fix to galactic. Could you try it out in your application to see if the issue is still persisting?

@BriceRenaudeau
Copy link
Author

@aaronchongth, many thanks for the PR.
Unfortunately, we advanced to another topic keeping the delay as a fix.
I will keep it in mind but I may not be able to try It soon.

@BriceRenaudeau
Copy link
Author

Hi,
Did #490 fix this issue?

@clalancette
Copy link
Contributor

I believe it did, but confirmation would be much appreciated.

@BriceRenaudeau
Copy link
Author

Is it possible to have it backported on galactic?

@clalancette
Copy link
Contributor

Is it possible to have it backported on galactic?

As it was put into Rolling, unfortunately not. That fix was a change to the API and ABI of tf2 and rviz, which isn't allowed in released distributions. There may be a way to backport it while preserving API and ABI, but that would take a bunch of time. If you are interested in attempting a backport, I can give you some pointers on what you would need to change to keep it API and ABI compatible.

@clalancette
Copy link
Contributor

Given that this was fixed long ago, I'm going to close this issue. Please feel free to reopen if you are still seeing the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants