
Executor quick destruction after spin hangs #1454

Open
v-lopez opened this issue Nov 13, 2020 · 18 comments
Labels
backlog enhancement New feature or request

Comments

@v-lopez

v-lopez commented Nov 13, 2020

Bug report

Required Info:

  • Operating System: Ubuntu 20.04
  • Installation type: Binaries
  • Version or commit hash: 5.0.0-1focal.20201007.211227
  • DDS implementation: Fast DDS
  • Client library (if applicable): rclcpp

Steps to reproduce issue

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor->spin();
    });
  //  std::this_thread::sleep_for(50ms);
  executor->cancel();
}

Expected behavior

That the test ends.

Actual behavior

Very frequently, the test hangs on the future's destruction, waiting for the thread to join(), because the thread is stuck inside executor->spin() (in wait_for_work, specifically).

Additional information

Uncommenting the sleep_for fixes the issue on my machine, so there seems to be a race condition.
In our use case, we have tests that end faster than that and trigger this issue. We can add the sleep as a workaround, but it's ugly: https://github.com/ros-controls/ros2_control/pull/234/files/833b1661209cc12a9d6f9ef45c87f4072e641aa7#diff-28c9aa0318c48cffb66c294bbf17394b6f9295bb8b9fb83d2cac539699f7e354R130
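The race can be modeled without rclcpp at all. In the following hedged sketch, `spinning`, `cancel_model`, and `spin_entry_model` are illustrative names, not rclcpp internals: spin() sets a spinning flag on entry and loops on it, while cancel() clears it, so a cancel() that lands before spin() has started is simply overwritten.

```cpp
#include <atomic>

// Illustrative model only (not rclcpp code): spin() loops on a flag that
// cancel() clears. A cancel() that runs before spin() starts is lost,
// because spin() overwrites the flag with true on entry.
std::atomic<bool> spinning{false};

void cancel_model() {
  spinning = false;  // no lasting effect if spin() hasn't started yet
}

void spin_entry_model() {
  spinning = true;   // spin() sets the flag on entry, erasing a prior cancel
}
```

Calling cancel_model() first and spin_entry_model() afterwards leaves the flag set, which is exactly the hang observed in the test above.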

@ivanpauno
Member

The semantics of cancel is strange:

  • If spin is already running, it will return. But you can spin again afterwards, and those calls will block again.
  • A cancel that happens while the executor is not spinning doesn't have any effect (the first wait of the executor will be awakened immediately because a guard condition was triggered by cancel, but spin won't return and will continue doing work).
  • rclpy Executor doesn't have a cancel method, but rather a shutdown one. spin() calls after shutdown happened will not do work.

I'm not sure what is the intended semantics of cancel, but its current behavior doesn't sound quite useful (it will always be racy).

Moreover, a cancel that happens while the executor is not spinning doesn't have any effect.

Maybe instead of that, the first spin after a cancel should not do work.
The other option is to make it behave like rclpy.Executor.shutdown.
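Both options can be sketched with std primitives. This is a hypothetical model, not a proposed rclcpp API; `ExecutorModel` and its methods are illustrative names. The first option latches a flag that the current/next spin() consumes; the second latches a state that every future spin() observes.

```cpp
#include <atomic>

// Hypothetical sketch of the two proposed semantics; not rclcpp code.
class ExecutorModel {
public:
  // Option 1: cancel the current/next spin(); the flag is consumed.
  void cancel() { cancel_requested_.store(true); }
  // Option 2: shut down; all future spins return immediately.
  void shutdown() { shut_down_.store(true); }

  // Returns true if the spin did work, false if it returned immediately.
  bool spin() {
    if (shut_down_.load()) {
      return false;                        // latched: never spins again
    }
    if (cancel_requested_.exchange(false)) {
      return false;                        // consumed: only this spin is cancelled
    }
    // ... wait for and execute work here ...
    return true;
  }

private:
  std::atomic<bool> cancel_requested_{false};
  std::atomic<bool> shut_down_{false};
};
```

With option 1 a pre-spin cancel() is no longer lost, but it only suppresses one spin; with option 2 the executor stays cancelled forever.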

@wjwwood any ideas?

@fujitatomoya
Collaborator

What about providing the spinning status to the user application, so that the user can do the following?

  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      std::this_thread::sleep_for(std::chrono::microseconds(100));
      if (!executor->spinning()) {
        executor->spin();
      }
    });
  if (executor->spinning()) {
    executor->cancel();
  }
  (void) executor_spin_future.get();

@ivanpauno
Member

What about providing the spinning status to the user application, so that the user can do the following?

That doesn't seem to solve the problem; the code has the same race as before.

@ivanpauno ivanpauno added help wanted Extra attention is needed and removed help wanted Extra attention is needed labels Dec 18, 2020
@ivanpauno
Member

My opinion is that instead of setting the spinning flag to false we could:

  • Set a cancel=true flag, the current/next spin will reset the flag cancel=false and return immediately.
    In this case the behavior of cancel is: "cancel the current spin() loop or the following call to spin()"
  • Set a cancel=true flag, the current and all the following spin() calls return immediately.
    In this case the behavior of cancel is: "cancel the executor, all future spin() calls will do nothing"
    (I would actually call this shutdown() and not cancel()).

IMO, we should have both: apply the first fix to cancel() and add a new shutdown() method to the executor.
It would be great to get other maintainers feedback before applying the change, maybe @mabelzhang @wjwwood ?


I'm marking this as help-wanted. Though I agree that the current cancel() behavior is a bit broken, this doesn't seem to be an issue affecting many users, and implementing the fix is not hard (it only requires prior agreement on what the "correct behavior" is).

@ivanpauno ivanpauno added help wanted Extra attention is needed good first issue Good for newcomers enhancement New feature or request and removed question Further information is requested labels Dec 18, 2020
@ivanpauno ivanpauno removed their assignment Dec 18, 2020
@wjwwood
Member

wjwwood commented Dec 18, 2020

I'm not sure what is the intended semantics of cancel, but its current behavior doesn't sound quite useful (it will always be racy).

cancel() is only meant to interrupt spin, not prevent future spins.

Changing it to be "cancel the current spin() loop or the following call to spin()" just moves the problem around, I think: what if you want to spin(), then either let spin finish itself or cancel it asynchronously, then spin again? If you do that, it's a race to see whether cancel stops the first spin or fires just after spin finishes normally. In that case the second spin would sometimes do something and other times return immediately.

Instead maybe we could leave cancel as-is and provide a wait_for_spin(), so the test becomes:

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor->spin();
    });
  ASSERT_TRUE(executor->wait_for_spin(100ms));
  executor->cancel();
}

I personally don't see any need for a shutdown since you can just destroy the executor. In python this is more complicated because reliably invoking the destructor is more difficult and so maybe a shutdown makes sense there. But I'm not opposed to it either.

@ivanpauno
Member

because what if you want to spin(), then either let spin finish itself or cancel it asynchronously, then spin again. If you do that then it's a race to see if cancel will stop the first spin or if it fires off just after spin finishes normally.

Don't you still have the same problem with wait_for_spin()?
Note: I guess "spin finishes normally" means, for example, using spin_some(), which can "finish normally" as well as be cancelled, because IIRC spin() never finishes (except if the rclcpp::Context is shut down or the executor is cancelled).

I personally don't see any need for a shutdown since you can just destroy the executor. In python this is more complicated because reliably invoking the destructor is more difficult and so maybe a shutdown makes sense there. But I'm not opposed to it either.

The usefulness of a shutdown() method would be pretty similar to the usefulness of rclcpp::shutdown(), where you could use:

while (executor.is_ok()) {
  executor.spin();
}

and in another thread asynchronously call executor.shutdown().

@wjwwood
Member

wjwwood commented Dec 18, 2020

Note: I guess "spin finishes normally" means, for example, using spin_some(), which can "finish normally" as well as be cancelled, because IIRC spin() never finishes (except if the rclcpp::Context is shut down or the executor is cancelled).

Right, or spin_until_future_complete().

Don't you still have the same problem with wait_for_spin()?

I don't think so? Because the reason for the first spin*() call exiting no longer matters, and if the cancel() happens after the spin*() finishes then it will be harmless (a no-op), so in any result of the race between spin*() finishing and cancel() being called, the second spin*() will do work and will never return immediately due to the previous cancel().

But maybe I'm looking at it wrong?

The usefulness of a shutdown() method would be pretty similar to the usefulness of rclcpp::shutdown(), where you could use:

Why not just executor.spin() and then asynchronously executor.cancel()?

@ivanpauno
Member

I don't think so? Because the reason for the first spin*() call exiting no longer matters, and if the cancel() happens after the spin*() finishes then it will be harmless (a no-op), so in any result of the race between spin*() finishing and cancel() being called, the second spin*() will do work and will never return immediately due to the previous cancel().

mmm, I will provide more detailed examples to see if I understand:

 TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor->spin();
      executor->spin();
    });
  ASSERT_TRUE(executor->wait_for_spin(100ms));
  executor->cancel();
}

will always cancel the first spin() (if wait_for_spin returned true).

 TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor->spin_some();
      executor->spin();
    });
  ASSERT_TRUE(executor->wait_for_spin(100ms));
  executor->cancel();
}

might cancel either the first spin_some() or the second spin(), because the following can happen: wait_for_spin() returns true, context switch, spin_some() finishes, spin() starts, cancel() is executed cancelling the second spin().

with the proposed "cancel() cancels the current/next spin()" behavior:

Case 1 also works fine (and doesn't require a timeout).
Case 2 has the same issue.

Moreover, in loops like the following:

 TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  auto executor = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
  executor->add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&]() -> void {
      while (!executor_cancelled_condition) {
        executor->spin_some();
        // do more work here
      }
    });
  ASSERT_TRUE(executor->wait_for_spin(100ms));
  executor->cancel();
}

The wait_for_spin() also has a race (if it's called while spin_some() isn't running), whereas the proposed change to the cancel() behavior just works (well, not really, if "do more work" has another spin*() call ...).

There's another alternative: a wait_for_spin_and_cancel() method, but it's not my preferred alternative.

Why not just executor.spin() and then asynchronously executor.cancel()?

The issue is how do you get the "executor.is_ok()" (or executor.is_cancelled()) state.
I mean, currently cancel() is just a notification to an asynchronously running spin*(), and cancelled is not a state of the executor.
If cancelled were a state of the executor that isn't automatically cleared (i.e. not cleared by a following spin*() call; we could have an uncancel() method), the issue mentioned above:

not really if "do more work" has another spin*() call

is fixed.


Maybe the question is: what is the use case of cancel()? That's not clear to me, I think we always use rclcpp::Context::shutdown() in situations like this.

@wjwwood
Member

wjwwood commented Dec 22, 2020

For case 2, I think that's just not going to work as intended; as you said, your proposed change to how cancel works would still not work for case 2. Furthermore, I don't see the point in case 2, I think you'd want to test something in between, e.g.:

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  auto executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor.spin_some();
    });
  ASSERT_TRUE(executor.wait_for_spin(100ms));
  executor.cancel();

  // ASSERT that spin_some did something you expected
  executor_spin_future.get();

  executor_spin_future = std::async(
    std::launch::async, [&executor]() -> void {
      executor.spin();
    });
  ASSERT_TRUE(executor.wait_for_spin(100ms));
  // ASSERT spin did something you expected, perhaps in a loop or wait on a condition variable from a callback
  executor.cancel();
}

If you're testing some interaction with spin_some followed by spin, then I'd recommend a condition variable:

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  std::mutex m;
  std::condition_variable cv;
  bool finished = false;
  auto executor_spin_future = std::async(
    std::launch::async, [&]() -> void {
      executor.spin_some();
      executor.spin();
      {
        std::lock_guard<std::mutex> lk(m);
        finished = true;
      }
      cv.notify_one();
    });
  do {
    std::unique_lock<std::mutex> lk(m);
    cv.wait_for(lk, 100ms, [&]() {return finished;});
    if (!finished) {
      executor.cancel();
    }
  } while (!finished);
  // ASSERT some state after the spin_some/spin sequence
}

For case 3, you can cover most cases with this:

TEST(Foo, bar) {
  auto node = std::make_shared<rclcpp::Node>("test");
  rclcpp::executors::SingleThreadedExecutor executor;
  executor.add_node(node);
  bool stop = false;
  auto executor_spin_future = std::async(
    std::launch::async, [&]() -> void {
      while (!stop && !executor_cancelled_condition) {
        // potential delay
        executor.spin_some();
        // do more work here
      }
    });
  ASSERT_TRUE(executor.wait_for_spin(100ms));
  stop = true;
  executor.cancel();
}

There's still a race if the cancel comes during the "potential delay" area, but it would only have to wait for one more spin_some(). Which shouldn't be too much worse than actually canceling the spin_some(). In which case having it be "this spin or the next" would be better.

The issue is how do you get the "executor.is_ok()" (or executor.is_cancelled()) state.

I don't see why the executor needs to provide that state. I mean the executors != the init/shutdown/ok system we have globally. That's more like the context. The user could provide this state themselves I think.

I mean, currently cancel() is just a notification to an asynchronous spin*() running, and cancelled is not an state of the executor.

This was intentional, I think it was called interrupt first, but was changed to cancel early on. I don't think it was ever meant to solve the problems we're describing here. It was meant to "cancel any ongoing spin if there is one". That's all.

Maybe the question is: what is the use case of cancel()? That's not clear to me, I think we always use rclcpp::Context::shutdown() in situations like this.

I mean, we're using it all over the place to spin indefinitely and then stop that spin once some scenario has been met, e.g.:

I think this is a perfectly normal use case. When you include tests that race spin() or loop over spin_some() (like your third example), then it isn't sufficient.

I suppose I'm fine with changing how cancel works, but maybe we should keep a method that works like the current one too, i.e. it just interrupts a spin but only if there is one at that moment.

@suab321321
Contributor

@wjwwood I would like to start contributing to this; can you help me get started? I have built rolling from source, but I won't be able to do version control from there. So first I should fork this repo, clone the fork, and then work on it, right? But if I do that, how do I test whether my changes compile successfully?

@wjwwood
Member

wjwwood commented Feb 16, 2021

I have built rolling from source, but I wont be able to do version control from there.

You should use this file for development:

https://github.com/ros2/ros2/blob/master/ros2.repos

You can use git from there.

@ivanpauno ivanpauno removed good first issue Good for newcomers help wanted Extra attention is needed labels Feb 17, 2021
@suab321321
Contributor

@ivanpauno is this issue not beginner friendly?

@ivanpauno
Member

@ivanpauno is this issue not beginner friendly?

I'm not quite sure what we want to do here, so resolving the issue involves first discussing what's desired.
After that is resolved, implementing it shouldn't be a problem, but the good first issue label doesn't feel accurate at the moment.
If you have any opinion of how it should work feel free to chime in.

I will try to reread the comments again in a few days and post something.

I also feel the discussion somewhat overlaps with how rclcpp::shutdown() should work (#1139), though it's not exactly the same issue.

@rpaaron
Contributor

rpaaron commented Mar 15, 2021

It seems much of the problem is the need to perform an action before (or after) some other action, which requires additional machinery to coordinate in a multithreaded environment. And there's not a lot of guidance about how best to go about that.
I understand the desire not to change the existing behaviour of the API, the conflicting desire to have functions behave consistently, and the question as to whether this is exceptional behaviour.
Perhaps we could leave everything existing intact and add an additional API, something like:

auto execution_token = executor.create_token();
std::thread t([&](){
    // will spin until token is cancelled
    // spinning already cancelled token just returns
    executor.spin(execution_token);
});
execution_token.cancel();
t.join(); 

where the behaviour is the same independent of execution order. Perhaps spin could even return some interesting information about how it exited, e.g. CANCELLED_TOKEN or TOKEN_ALREADY_CANCELLED, if useful.

@ivanpauno
Member

ivanpauno commented Mar 15, 2021

@rpaaron that's a great example, you can already do that with spin_until_future_complete():

auto promise = std::promise<void>{};
auto future = promise.get_future();
std::thread t([&](){
    // will spin until the future is complete
    // spinning an already completed future just returns
    executor.spin_until_future_complete(future);
});
promise.set_value();
t.join(); 
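The reason this pattern has no race is that the stop condition is persistent state (the future becoming ready) rather than a one-shot notification, so the order of set_value() versus the start of the spin doesn't matter. That can be checked with plain std primitives, no rclcpp needed (`future_ready_after_early_set` is just an illustrative helper name):

```cpp
#include <chrono>
#include <future>

// Plain-std demonstration: completing the promise *before* anyone waits
// still leaves the future ready, unlike a lost cancel() notification.
bool future_ready_after_early_set() {
  std::promise<void> promise;
  std::future<void> future = promise.get_future();
  promise.set_value();  // the "cancel" happens before the "spin" would start
  // spin_until_future_complete() would observe the ready future and return:
  return future.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```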

@rpaaron
Contributor

rpaaron commented Mar 15, 2021

Thanks @ivanpauno. Is this approach the current advice for avoiding the race condition here?
I've seen other pull requests, and perhaps existing code, that use this kind of construction:

void Someclass::start() {
    executor_thread_ = std::thread([this]() {
        executor_.spin();
    });
}

Someclass::~Someclass() {
    if (executor_thread_.joinable()) {
        executor_.cancel();
        executor_thread_.join();
    }
}
which seems like it would be vulnerable to a race, i.e. if the object is destructed before executor_.spin() is called in the new thread.
What should the advice be in this case? Use a safer construction like your example? Set a flag only after spin() has finished, continuously cancel() until the flag is set, and only then join()?

@ivanpauno
Member

which seems like it would be vulnerable to a race, i.e. if the object is destructed before executor_.spin() is called in the new thread.

Yeah, that code looks racy.

What should the advice be in this case? Use a safer construction like your example? Set a flag only after spin() has finished, continuously cancel() until the flag is set, and only then join()?

I think that something like the example above would work.
The downside of using a flag is that you have to do some busy-waiting until the executor is actually cancelled, which is not great, but it also works.
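The flag-based approach can be sketched with std primitives. This is a hedged model: `spin_model`, `cancel_model`, and `stop_and_join` stand in for executor_.spin()/cancel() and are not rclcpp API. Repeating cancel() works even if an early cancel is lost, because some later iteration is guaranteed to land while the loop is actually running:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Model of the "flag + repeated cancel" workaround; illustrative names only.
std::atomic<bool> running{false};
std::atomic<bool> finished{false};

void spin_model() {
  running = true;                 // overwrites an earlier cancel (the race)
  while (running) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  finished = true;                // set only after the loop has exited
}

void cancel_model() {
  running = false;                // lost if spin_model() hasn't started yet
}

void stop_and_join(std::thread & spin_thread) {
  // Busy-wait: keep cancelling until the spin loop has definitely exited.
  while (!finished) {
    cancel_model();
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  spin_thread.join();
}
```

Even if the first cancel_model() fires before spin_model() has set `running`, the loop in stop_and_join() retries until `finished` confirms the spin loop is gone, so the join() can never hang.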

@paulsammut

paulsammut commented May 18, 2023

@rpaaron that's a great example, you can already do that with spin_until_future_complete():

auto promise = std::promise<void>{};
auto future = promise.get_future();
std::thread t([&](){
    // will spin until the future is complete
    // spinning an already completed future just returns
    executor.spin_until_future_complete(future);
});
promise.set_value();
t.join(); 

@ivanpauno I'm trying to use this pattern to run multiple nodes, each in its own executor and thread, in a test. One node works fine, but when I have multiple nodes, executor.spin_until_future_complete(future); hangs. If I add an executor.cancel() along with the promise signalling, it works. See the code below:

template <typename T>
class SingleThreadedNode {
 public:
  explicit SingleThreadedNode(const std::string name) {
    node = std::make_shared<T>(name);
    exec_.add_node(node);

    thread_ = std::make_unique<std::thread>([&]() {
      RCLCPP_INFO(node->get_logger(), "Starting up %s", node->get_name());
      exec_.spin_until_future_complete(future_);
    });
  }

  ~SingleThreadedNode() {
        RCLCPP_INFO(node->get_logger(), "Trying to shutdown %s", node->get_name());
         promise_.set_value();
         // I don't know why i need this, but without it multiple nodes don't close
        exec_.cancel();
        thread_->join();
        RCLCPP_INFO(node->get_logger(), "Shutdown %s", node->get_name());
  }

  rclcpp::Node::SharedPtr node;

 private:
  std::unique_ptr<std::thread> thread_;
  rclcpp::executors::SingleThreadedExecutor exec_;
  std::promise<void> promise_{};
  std::future<void> future_ = promise_.get_future();
};


TEST(UtilsTest, SingleThreadedNodeRaces) {
  // Test create and rapidly destroying multiple nodes. This tests for nasty races
  auto quickNode1 = std::make_shared<SingleThreadedNode<rclcpp::Node>>("test_node1");
  auto quickNode2 = std::make_shared<SingleThreadedNode<rclcpp::Node>>("test_node2");
  auto quickNode3 = std::make_shared<SingleThreadedNode<rclcpp::Node>>("test_node3");
  auto quickNode4 = std::make_shared<SingleThreadedNode<rclcpp::Node>>("test_node4");
}
