
Executor quick destruction after spin hangs #1454

Open
v-lopez opened this issue Nov 13, 2020 · 18 comments
Labels
backlog enhancement New feature or request

Comments

@v-lopez

v-lopez commented Nov 13, 2020

Bug report

Required Info:

  • Operating System: Ubuntu 20.04
  • Installation type: Binaries
  • Version or commit hash: 5.0.0-1focal.20201007.211227
  • DDS implementation: Fast DDS
  • Client library (if applicable): rclcpp

Steps to reproduce issue

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor->spin();
    });
  //  std::this_thread::sleep_for(50ms);
  executor->cancel();
}

Expected behavior

That the test ends.

Actual behavior

Very frequently, the test hangs on the future's destruction, waiting for the thread to join(), because the thread is stuck inside executor->spin() (in wait_for_work, specifically).

Additional information

Uncommenting the sleep_for fixes the issue on my machine, so there seems to be a race condition.
In our use case, we have tests that end faster than that and trigger this issue. We can add the sleep as a workaround, but it's ugly: https://github.com/ros-controls/ros2_control/pull/234/files/833b1661209cc12a9d6f9ef45c87f4072e641aa7#diff-28c9aa0318c48cffb66c294bbf17394b6f9295bb8b9fb83d2cac539699f7e354R130
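The race can be modeled without rclcpp at all. In the following hedged sketch, `spinning`, `cancel_model`, and `spin_entry_model` are illustrative names, not rclcpp internals: spin() sets a spinning flag on entry and loops on it, while cancel() clears it, so a cancel() that lands before spin() has started is simply overwritten.

```cpp
#include <atomic>

// Illustrative model only (not rclcpp code): spin() loops on a flag that
// cancel() clears. A cancel() that runs before spin() starts is lost,
// because spin() overwrites the flag with true on entry.
std::atomic<bool> spinning{false};

void cancel_model() {
  spinning = false;  // no lasting effect if spin() hasn't started yet
}

void spin_entry_model() {
  spinning = true;   // spin() sets the flag on entry, erasing a prior cancel
}
```

Calling cancel_model() first and spin_entry_model() afterwards leaves the flag set, which is exactly the hang observed in the test above.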

@ivanpauno
Member

The semantics of cancel is strange:

  • If spin is already running, it will return. But you can spin again afterwards, and those calls will block again.
  • A cancel that happens while the executor is not spinning doesn't have any effect (the first wait of the executor will be awakened immediately because a guard condition was triggered by cancel, but spin won't return and will continue doing work).
  • rclpy Executor doesn't have a cancel method, but rather a shutdown one. spin() calls after shutdown happened will not do work.

I'm not sure what is the intended semantics of cancel, but its current behavior doesn't sound quite useful (it will always be racy).

Moreover, a cancel that happens while the executor is not spinning doesn't have any effect.

Maybe instead of that, the first spin after a cancel should not do work.
The other option is to make it behave like rclpy.Executor.shutdown.
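Both options can be sketched with std primitives. This is a hypothetical model, not a proposed rclcpp API; `ExecutorModel` and its methods are illustrative names. The first option latches a flag that the current/next spin() consumes; the second latches a state that every future spin() observes.

```cpp
#include <atomic>

// Hypothetical sketch of the two proposed semantics; not rclcpp code.
class ExecutorModel {
public:
  // Option 1: cancel the current/next spin(); the flag is consumed.
  void cancel() { cancel_requested_.store(true); }
  // Option 2: shut down; all future spins return immediately.
  void shutdown() { shut_down_.store(true); }

  // Returns true if the spin did work, false if it returned immediately.
  bool spin() {
    if (shut_down_.load()) {
      return false;                        // latched: never spins again
    }
    if (cancel_requested_.exchange(false)) {
      return false;                        // consumed: only this spin is cancelled
    }
    // ... wait for and execute work here ...
    return true;
  }

private:
  std::atomic<bool> cancel_requested_{false};
  std::atomic<bool> shut_down_{false};
};
```

With option 1 a pre-spin cancel() is no longer lost, but it only suppresses one spin; with option 2 the executor stays cancelled forever.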

@wjwwood any ideas?

@fujitatomoya
Collaborator

What about providing the spinning status to the user application, so that the user can do the following?

  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      std::this_thread::sleep_for(std::chrono::microseconds(100));
      if (!executor->spinning()) {
        executor->spin();
      }
    });
  if (executor->spinning()) {
    executor->cancel();
  }
  (void) executor_spin_future.get();

@ivanpauno
Member

What about providing the spinning status to the user application, so that the user can do the following?

That doesn't seem to solve the problem; the code has the same race as before.

@ivanpauno ivanpauno added help wanted Extra attention is needed and removed help wanted Extra attention is needed labels Dec 18, 2020
@ivanpauno
Member

My opinion is that instead of setting the spinning flag to false we could:

  • Set a cancel=true flag, the current/next spin will reset the flag cancel=false and return immediately.
    In this case the behavior of cancel is: "cancel the current spin() loop or the following call to spin()"
  • Set a cancel=true flag, the current and all the following spin() calls return immediately.
    In this case the behavior of cancel is: "cancel the executor, all future spin() calls will do nothing"
    (I would actually call this shutdown() and not cancel()).

IMO, we should have both: apply the first fix to cancel() and add a new shutdown() method to the executor.
It would be great to get other maintainers feedback before applying the change, maybe @mabelzhang @wjwwood ?


I'm marking this as help-wanted. Though I agree that the current cancel() behavior is a bit broken, this doesn't seem to be an issue affecting many users, and implementing the fix is not hard (it only requires prior agreement on what the "correct behavior" is).

@ivanpauno ivanpauno added help wanted Extra attention is needed good first issue Good for newcomers enhancement New feature or request and removed question Further information is requested labels Dec 18, 2020
@ivanpauno ivanpauno removed their assignment Dec 18, 2020
@wjwwood
Member

wjwwood commented Dec 18, 2020

I'm not sure what is the intended semantics of cancel, but its current behavior doesn't sound quite useful (it will always be racy).

cancel() is only meant to interrupt spin, not prevent future spins.

Changing it to be "cancel the current spin() loop or the following call to spin()" just moves the problem around, I think: what if you want to spin(), then either let spin finish itself or cancel it asynchronously, then spin again? If you do that, it's a race to see whether cancel stops the first spin or fires just after spin finishes normally. In that case the second spin would sometimes do something and other times return immediately.

Instead maybe we could leave cancel as-is and provide a wait_for_spin(), so the test becomes:

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor->spin();
    });
  ASSERT_TRUE(executor->wait_for_spin(100ms));
  executor->cancel();
}

I personally don't see any need for a shutdown since you can just destroy the executor. In python this is more complicated because reliably invoking the destructor is more difficult and so maybe a shutdown makes sense there. But I'm not opposed to it either.

@ivanpauno
Member

because what if you want to spin(), then either let spin finish itself or cancel it asynchronously, then spin again. If you do that then it's a race to see if cancel will stop the first spin or if it fires off just after spin finishes normally.

Don't you still have the same problem with wait_for_spin()?
Note: I guess "spin finishes normally" means, for example, using spin_some(), which can "finish normally" as well as be cancelled, because IIRC spin() never finishes (except if the rclcpp::Context is shut down or the executor is cancelled).

I personally don't see any need for a shutdown since you can just destroy the executor. In python this is more complicated because reliably invoking the destructor is more difficult and so maybe a shutdown makes sense there. But I'm not opposed to it either.

The usefulness of a shutdown() method would be pretty similar to the usefulness of rclcpp::shutdown(), where you could use:

while (executor.is_ok()) {
  executor.spin();
}

and in another thread asynchronously call executor.shutdown().

@wjwwood
Member

wjwwood commented Dec 18, 2020

Note: I guess "spin finishes normally" means, for example, using spin_some(), which can "finish normally" as well as be cancelled, because IIRC spin() never finishes (except if the rclcpp::Context is shut down or the executor is cancelled).

Right, or spin_until_future_complete().

Don't you still have the same problem with wait_for_spin()?

I don't think so? Because the reason for the first spin*() call exiting no longer matters, and if the cancel() happens after the spin*() finishes then it will be harmless (a no-op), so in any result of the race between spin*() finishing and cancel() being called, the second spin*() will do work and will never return immediately due to the previous cancel().

But maybe I'm looking at it wrong?

The usefulness of a shutdown() method would be pretty similar to the usefulness of rclcpp::shutdown(), where you could use:

Why not just executor.spin() and then asynchronously executor.cancel()?

@ivanpauno
Member

I don't think so? Because the reason for the first spin*() call exiting no longer matters, and if the cancel() happens after the spin*() finishes then it will be harmless (a no-op), so in any result of the race between spin*() finishing and cancel() being called, the second spin*() will do work and will never return immediately due to the previous cancel().

mmm, I will provide more detailed examples to see if I understand:

 TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor->spin();
      executor->spin();
    });
  ASSERT_TRUE(executor->wait_for_spin(100ms));
  executor->cancel();
}

will always cancel the first spin() (if wait_for_spin returned true).

 TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor->spin_some();
      executor->spin();
    });
  ASSERT_TRUE(executor->wait_for_spin(100ms));
  executor->cancel();
}

might cancel either the first spin_some() or the second spin(), because the following can happen: wait_for_spin() returns true, context switch, spin_some() finishes, spin() starts, cancel() is executed cancelling the second spin().

with the proposed "cancel() cancels the current/next spin()" behavior:

Case 1 also works fine (and doesn't require a timeout).
Case 2 has the same issue.

Moreover, in loops like the following:

 TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&]() -> void {
      while (!executor_cancelled_condition) {
        executor->spin_some();
        // do more work here
      }
    });
  ASSERT_TRUE(executor->wait_for_spin(100ms));
  executor->cancel();
}

The wait_for_spin() also has a race (if it's called while spin_some() isn't running), whereas the proposed change to the cancel() behavior just works (well, not really, if "do more work" has another spin*() call ...).

There's another alternative: a wait_for_spin_and_cancel() method, but it's not my preferred alternative.

Why not just executor.spin() and then asynchronously executor.cancel()?

The issue is how do you get the "executor.is_ok()" (or executor.is_cancelled()) state.
I mean, currently cancel() is just a notification to an asynchronously running spin*(), and cancelled is not a state of the executor.
If cancelled were a state of the executor that isn't automatically cleared (i.e. not cleared by a following spin*() call; we could have an uncancel() method), the issue mentioned above:

not really if "do more work" has another spin*() call

is fixed.


Maybe the question is: what is the use case of cancel()? That's not clear to me, I think we always use rclcpp::Context::shutdown() in situations like this.

@wjwwood
Member

wjwwood commented Dec 22, 2020

For case 2, I think that's just not going to work as intended; as you said, your proposed change to how cancel works would still not work for case 2. Furthermore, I don't see the point in case 2, I think you'd want to test something in between, e.g.:

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor.spin_some();
    });
  ASSERT_TRUE(executor.wait_for_spin(100ms));
  executor.cancel();

  // ASSERT that spin_some did something you expected
  executor_spin_future.get();

  executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor.spin();
    });
  ASSERT_TRUE(executor.wait_for_spin(100ms));
  // ASSERT spin did something you expected, perhaps in a loop or wait on a condition variable from a callback
  executor.cancel();
}

If you're testing some interaction with spin_some followed by spin, then I'd recommend a condition variable:

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  std::mutex m;
  std::condition_variable cv;
  bool finished = false;
  auto executor_spin_future = std::async(
    std::launch::async, [&]() -> void {
      executor.spin_some();
      executor.spin();
      {
        std::lock_guard<std::mutex> lk(m);
        finished = true;
      }
      cv.notify_one();
    });
  do {
    std::unique_lock<std::mutex> lk(m);
    cv.wait_for(lk, 100ms, [&]() {return finished;});
    if (!finished) {
      executor.cancel();
    }
  } while (!finished);
  // ASSERT some state after the spin_some/spin sequence
}

For case 3, you can cover most cases with this:

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  bool stop = false;
  auto executor_spin_future = std::async(
    std::launch::async, [&]() -> void {
      while (!stop && !executor_cancelled_condition) {
        // potential delay
        executor.spin_some();
        // do more work here
      }
    });
  ASSERT_TRUE(executor.wait_for_spin(100ms));
  stop = true;
  executor.cancel();
}

There's still a race if the cancel comes during the "potential delay" area, but it would only have to wait for one more spin_some(). Which shouldn't be too much worse than actually canceling the spin_some(). In which case having it be "this spin or the next" would be better.

The issue is how do you get the "executor.is_ok()" (or executor.is_cancelled()) state.

I don't see why the executor needs to provide that state. I mean the executors != the init/shutdown/ok system we have globally. That's more like the context. The user could provide this state themselves I think.

I mean, currently cancel() is just a notification to an asynchronous spin*() running, and cancelled is not an state of the executor.

This was intentional, I think it was called interrupt first, but was changed to cancel early on. I don't think it was ever meant to solve the problems we're describing here. It was meant to "cancel any ongoing spin if there is one". That's all.

Maybe the question is: what is the use case of cancel()? That's not clear to me, I think we always use rclcpp::Context::shutdown() in situations like this.

I mean, we're using it all over the place to spin indefinitely and then stop that spin once some scenario has been met, e.g.:

I think this is a perfectly normal use case. When you include tests that race spin() or loop over spin_some() (like your third example), then it isn't sufficient.

I suppose I'm fine with changing how cancel works, but maybe we should keep a method that works like the current one too, i.e. it just interrupts a spin but only if there is one at that moment.

@suab321321
Contributor

@wjwwood I would like to start contributing to this; can you help me get started? I have built rolling from source, but I won't be able to do version control from there. So first I should fork this repo, clone the fork, and then work on it, right? But if I do that, how do I test whether my changes compile successfully?

@wjwwood
Member

wjwwood commented Feb 16, 2021

I have built rolling from source, but I wont be able to do version control from there.

You should use this file for development:

https://github.com/ros2/ros2/blob/master/ros2.repos

You can use git from there.

@ivanpauno ivanpauno removed good first issue Good for newcomers help wanted Extra attention is needed labels Feb 17, 2021
@suab321321
Contributor

@ivanpauno is this issue not beginner friendly?

@ivanpauno
Member

@ivanpauno is this issue not beginner friendly?

I'm not quite sure what we want to do here, so resolving the issue involves first discussing what's desired.
After that is resolved, implementing it shouldn't be a problem, but the good first issue label doesn't feel accurate at the moment.
If you have any opinion of how it should work feel free to chime in.

I will try to reread the comments again in a few days and post something.

I also feel the discussion somewhat overlaps with how rclcpp::shutdown() should work (#1139), though it's not exactly the same issue.

@rpaaron
Contributor

rpaaron commented Mar 15, 2021

It seems much of the problem is the need to perform an action before (or after) some other action, which requires additional machinery to coordinate in a multithreaded environment. And there's not a lot of guidance about how best to go about that.
I understand the desire not to change the existing behaviour of the API, the conflicting desire to have functions behave consistently, and the question as to whether this is exceptional behaviour.
Perhaps we could leave everything existing intact and add an additional API, something like:

auto execution_token = executor.create_token();
std::thread t([&](){
    // will spin until token is cancelled
    // spinning already cancelled token just returns
    executor.spin(execution_token);
});
execution_token.cancel();
t.join(); 

where the behaviour is the same independent of execution order. Perhaps spin could even return some interesting information about how it exited, e.g. CANCELLED_TOKEN or TOKEN_ALREADY_CANCELLED, if useful.

@ivanpauno
Member

ivanpauno commented Mar 15, 2021

@rpaaron that's a great example, you can already do that with spin_until_future_complete():

auto promise = std::promise<void>{};
auto future = promise.get_future();
std::thread t([&](){
    // will spin until the future is complete
    // spinning an already completed future just returns
    executor.spin_until_future_complete(future);
});
promise.set_value();
t.join(); 
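The reason this pattern has no race is that the stop condition is persistent state (the future becoming ready) rather than a one-shot notification, so the order of set_value() versus the start of the spin doesn't matter. That can be checked with plain std primitives, no rclcpp needed (`future_ready_after_early_set` is just an illustrative helper name):

```cpp
#include <chrono>
#include <future>

// Plain-std demonstration: completing the promise *before* anyone waits
// still leaves the future ready, unlike a lost cancel() notification.
bool future_ready_after_early_set() {
  std::promise<void> promise;
  std::future<void> future = promise.get_future();
  promise.set_value();  // the "cancel" happens before the "spin" would start
  // spin_until_future_complete() would observe the ready future and return:
  return future.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```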

@rpaaron
Contributor

rpaaron commented Mar 15, 2021

Thanks @ivanpauno. Is this approach the current advice for avoiding the race condition here?
I've seen other pull requests, and perhaps existing code, that use this kind of construction:

void Someclass::start() {
    executor_thread_ = std::thread([this]() {
        executor_.spin();
    });
}

Someclass::~Someclass() {
    if (executor_thread_.joinable()) {
        executor_.cancel();
        executor_thread_.join();
    }
}
which seems like it would be vulnerable to a race, i.e. if the object is destructed before executor_.spin() is called in the new thread.
What should the advice be in this case? Use a safer construction like your example? Set a flag only after spin() has finished, continuously cancel() until the flag is set, and only then join()?

@ivanpauno
Member

which seems like it would be vulnerable to a race, i.e. if the object is destructed before executor_.spin() is called in the new thread.

Yeah, that code looks racy.

What should the advice be in this case? Use a safer construction like your example? Set a flag only after spin() has finished, continuously cancel() until the flag is set, and only then join()?

I think that something like the example above would work.
The downside of using a flag is that you have to do some busy-waiting until the executor is actually cancelled, which is not great, but it also works.
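The flag-based approach can be sketched with std primitives. This is a hedged model: `spin_model`, `cancel_model`, and `stop_and_join` stand in for executor_.spin()/cancel() and are not rclcpp API. Repeating cancel() works even if an early cancel is lost, because some later iteration is guaranteed to land while the loop is actually running:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Model of the "flag + repeated cancel" workaround; illustrative names only.
std::atomic<bool> running{false};
std::atomic<bool> finished{false};

void spin_model() {
  running = true;                 // overwrites an earlier cancel (the race)
  while (running) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  finished = true;                // set only after the loop has exited
}

void cancel_model() {
  running = false;                // lost if spin_model() hasn't started yet
}

void stop_and_join(std::thread & spin_thread) {
  // Busy-wait: keep cancelling until the spin loop has definitely exited.
  while (!finished) {
    cancel_model();
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  spin_thread.join();
}
```

Even if the first cancel_model() fires before spin_model() has set `running`, the loop in stop_and_join() retries until `finished` confirms the spin loop is gone, so the join() can never hang.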

@paulsammut

paulsammut commented May 18, 2023

@rpaaron that's a great example, you can already do that with spin_until_future_complete():

auto promise = std::promise<void>{};
auto future = promise.get_future();
std::thread t([&](){
    // will spin until the future is complete
    // spinning an already completed future just returns
    executor.spin_until_future_complete(future);
});
promise.set_value();
t.join(); 

@ivanpauno I'm trying to use this pattern to run multiple nodes, each in its own executor and thread, in a test. One node works fine, but when I have multiple nodes, executor.spin_until_future_complete(future); hangs. If I add an executor.cancel() along with the promise signalling, it works. See the code below:

template <typename T>
class SingleThreadedNode {
 public:
  explicit SingleThreadedNode(const std::string name) {
    node = std::make_shared<T>(name);
    exec_.add_node(node);

    thread_ = std::make_unique<std::thread>([&]() {
      RCLCPP_INFO(node->get_logger(), "Starting up %s", node->get_name());
      exec_.spin_until_future_complete(future_);
    });
  }

  ~SingleThreadedNode() {
        RCLCPP_INFO(node->get_logger(), "Trying to shutdown %s", node->get_name());
         promise_.set_value();
         // I don't know why i need this, but without it multiple nodes don't close
        exec_.cancel();
        thread_->join();
        RCLCPP_INFO(node->get_logger(), "Shutdown %s", node->get_name());
  }

  rclcpp::Node::SharedPtr node;

 private:
  std::unique_ptr<std::thread> thread_;
  rclcpp::executors::SingleThreadedExecutor exec_;
  std::promise<void> promise_{};
  std::future<void> future_ = promise_.get_future();
};


TEST(UtilsTest, SingleThreadedNodeRaces) {
  // Test create and rapidly destroying multiple nodes. This tests for nasty races
  auto quickNode1 = std::make_shared<SingleThreadedNode<rclcpp::Node>>("test_node1");
  auto quickNode2 = std::make_shared<SingleThreadedNode<rclcpp::Node>>("test_node2");
  auto quickNode3 = std::make_shared<SingleThreadedNode<rclcpp::Node>>("test_node3");
  auto quickNode4 = std::make_shared<SingleThreadedNode<rclcpp::Node>>("test_node4");
}
