Fix the race condition while calling rcl_shutdown #1353

Barry-Xu-2018 · 2024-09-03T03:14:24Z

Address #1352

Barry-Xu-2018 · 2024-09-03T06:47:13Z

@sloretz

Regarding the fix for issue #1352, I'd like to hear your thoughts.

About issue #1352, there is an issue

One thread is calling rclpy::shutdown()

rclpy/rclpy/src/rclpy/context.cpp

Lines 44 to 53 in 8a2e2d3

    
           void shutdown_contexts() 
        
           { 
        
             // graceful shutdown all contexts 
        
             std::lock_guard<std::mutex> guard{g_contexts_mutex}; 
        
             for (auto * c : g_contexts) { 
        
               rcl_ret_t ret = rcl_shutdown(c); 
        
               (void)ret; 
        
             } 
        
             g_contexts.clear(); 
        
           }

Another thread is calling Context::shutdown()

rclpy/rclpy/src/rclpy/context.cpp

Lines 151 to 165 in 8a2e2d3

    
           Context::shutdown() 
        
           { 
        
             { 
        
               std::lock_guard<std::mutex> guard{g_contexts_mutex}; 
        
               auto iter = std::find(g_contexts.begin(), g_contexts.end(), rcl_context_.get()); 
        
               if (iter != g_contexts.end()) { 
        
                 g_contexts.erase(iter); 
        
               } 
        
             } 
        
             rcl_ret_t ret = rcl_shutdown(rcl_context_.get()); 
        
             if (RCL_RET_OK != ret) { 
        
               throw RCLError("failed to shutdown"); 
        
             } 
        
           }

rcl_shutdown() will be called twice. And rcl_shutdown() may be called at the same time (Calling rcl_shutdown() isn't protected by g_contexts_mutex in Context::shutdown()).
I changed the codes in Context::shutdown().
But this brings a problem: multiple calls to Context::shutdown() will no longer trigger an exception. This will cause the below existing test to fail.

rclpy/rclpy/test/test_init_shutdown.py

Lines 77 to 83 in 8a2e2d3

    
           def test_double_shutdown(): 
        
               context = rclpy.context.Context() 
        
               rclpy.init(context=context) 
        
               assert context.ok() 
        
               rclpy.shutdown(context=context) 
        
               with pytest.raises(RuntimeError): 
        
                   rclpy.shutdown(context=context)

My question is whether we still need to ensure that multiple calls to Context::shutdown() throw an exception ? Maybe It is more appropriate to display a warning In this scenario. What do you think ?

Barry-Xu-2018 · 2024-09-06T01:16:48Z

Friendly ping @sloretz ?

fujitatomoya

IMO this is correct fix.

fujitatomoya · 2024-09-06T23:50:38Z

rclpy/test/test_init_shutdown.py

-    rclpy.init(context=context)
-    assert context.ok()
-    rclpy.shutdown(context=context)
-    with pytest.raises(RuntimeError):


Do we need to remove this test?

I think we can keep this test, but expects RCLError exception here to catch from 2nd rcl_shutdown (internally RCL_RET_ALREADY_SHUTDOWN returns in rcl) ?

Thanks for your comments.

I think we can keep this test, but expects RCLError exception here to catch from 2nd rcl_shutdown (internally RCL_RET_ALREADY_SHUTDOWN returns in rcl) ?

With the current fix, if the context is not in g_contexts, rcl_shutdown will not be called.

According to the issue described in #1352, it is indeed possible for two threads to call rcl_shutdown on the same context. However, one call is from rclpy::shutdown() and the other from Context::shutdown(). In this situation, we do not want to see an exception being thrown.

The test calls Context::shutdown() twice, and it is reasonable to expect an exception on the second call. Therefore, I am considering setting a flag variable in the Context class. If Context::shutdown() has been successfully called, this variable will be set. If Context::shutdown() is called again, it will check if the flag has been set and, if so, will throw an exception CONTEXT_ALREADY_SHUTDOWN (not from RCL). What do you think?

Therefore, I am considering setting a flag variable in the Context class. If Context::shutdown() has been successfully called, this variable will be set. If Context::shutdown() is called again, it will check if the flag has been set and, if so, will throw an exception CONTEXT_ALREADY_SHUTDOWN (not from RCL). What do you think?

Please review 0eb7ee5

fujitatomoya · 2024-09-09T18:20:27Z

rclpy/src/rclpy/context.cpp

-  rcl_ret_t ret = rcl_shutdown(rcl_context_.get());
-  if (RCL_RET_OK != ret) {
-    throw RCLError("failed to shutdown");
+  auto iter = std::find(g_contexts.begin(), g_contexts.end(), rcl_context_.get());


so the g_contexts is the collection of valid context, and this means if it cannot find the context in the g_contexts during this shutdown call, that is the invalid operation? so we can generate the exception without introducing already_shutdown_?

so the g_contexts is the collection of valid context, and this means if it cannot find the context in the g_contexts during this shutdown call, that is the invalid operation?

Yes.
rc_context_ is added to g_contexts in constructor of Context. Before calling rcl_shutdown(), rc_context_ is removed from g_contexts.

so we can generate the exception without introducing already_shutdown_?

already_shutdown_ is only used for multiple calls to Context::shutdown() to throw the exception.

About issue reported by #1352, it describes a possible scenario. There are two threads to call rcl_shutdown for the same context. One call is from rclpy::shutdown() and the other from Context::shutdown(). In this situation, we do not want to get an exception in Context::shutdown().

that is exactly why we have std::lock_guard<std::mutex> guard{g_contexts_mutex}; lock escalated to the this function scope? so that when rclpy::shutdown_contexts is under process, we will not find the iterator with this context, so generate the exception. i really do not understand to have already_shutdown_ flag internally...

@fujitatomoya

i really do not understand to have already_shutdown_ flag internally...

Let me first summarize the two scenarios we currently need to address:

The scenario mentioned in rcl_shutdown is not thread-safe when used in both signal_handler and Context.__exit__. #1352. There are two threads to call rcl_shutdown for the same context. One call is from rclpy::shutdown() and the other from Context::shutdown(). For this scenario, we don't want to get exception.

For one context instance, Context::shutdown() is being called multiple times. At this scenario, we want to throw an exception. (There is an existed test to check if an exception is thrown.) The fix for 1 would prevent 2 from throwing an exception. That's why already_shutdown_ ``` was added to indicate that Context::shutdown() has already been called once on the context instance. Do you have any better suggestions?

@fujitatomoya Friendly ping.

thanks for the explanation, i understand what you mean. if we take already_shutdown_, then we need to reinitialize already_shutdown_ during context initialization? unless, already_shutdown_ will never be False?

IMO, Context::shutdown() should generate the exception only when rcl_shutdown fails with valid context. this actually aligns with rclcpp behavior. of course, if we change like this, we need to adjust the unit tests accordingly.

maybe we can keep this open for more feedback.

thanks for the explanation, i understand what you mean. if we take already_shutdown_, then we need to reinitialize already_shutdown_ during context initialization? unless, already_shutdown_ will never be False?

Yes.

IMO, Context::shutdown() should generate the exception only when rcl_shutdown fails with valid context. this actually aligns with rclcpp behavior. of course, if we change like this, we need to adjust the unit tests accordingly.

Yes. The addition of already_shutdown_ is just to pass the existing tests (Ensure consistency with previous behavior.)

maybe we can keep this open for more feedback.

Yes.

Yadunund · 2024-09-26T18:33:59Z

@fujitatomoya do you mind doing another round of review here whenever you get a chance?

fujitatomoya · 2024-09-27T01:14:32Z

@Barry-Xu-2018 is this still draft? are we waiting for someone or something to make if official?

Barry-Xu-2018 · 2024-09-27T09:20:18Z

@Barry-Xu-2018 is this still draft? are we waiting for someone or something to make if official?

Oh, I forgot to change the status.

clalancette

I'm generally fine with this logic (and it seems to fix some of the errors we've been having in Zenoh). I've left one naming thing to fix, and this needs to be rebased. Once those things are done, I'm happy to approve and run CI on this.

clalancette · 2024-11-07T19:09:03Z

rclpy/src/rclpy/exceptions.hpp

@@ -80,6 +80,11 @@ class InvalidHandle : public std::runtime_error
  using std::runtime_error::runtime_error;
 };

+class CONTEXT_ALREADY_SHUTDOWN : public std::runtime_error


Suggested change

class CONTEXT_ALREADY_SHUTDOWN : public std::runtime_error

class ContextAlreadyShutdown : public std::runtime_error

Thanks. I will update codes.

Signed-off-by: Barry Xu <[email protected]>

fujitatomoya · 2024-11-08T02:16:16Z

@Barry-Xu-2018

if we take already_shutdown_, then we need to reinitialize already_shutdown_ during context initialization? unless, already_shutdown_ will never be False?

i think this still needs to be addressed. with that, i am okay with this PR!

Barry-Xu-2018 · 2024-11-08T02:21:54Z

Run CI:

Linux
Linux-aarch64
Linux-rhel
Windows

Barry-Xu-2018 · 2024-11-08T02:46:41Z

@fujitatomoya

if we take already_shutdown_, then we need to reinitialize already_shutdown_ during context initialization? unless, already_shutdown_ will never be False?

One context has one already_shutdown_. Based on my understanding, once context has been shut down, it should not be used again. That is, user need to create another context.
So already_shutdown_ need not change from 'True' to 'False'.

ahcorde · 2024-11-08T11:00:27Z

Windows

clalancette · 2024-11-08T16:04:54Z

One context has one already_shutdown_. Based on my understanding, once context has been shut down, it should not be used again. That is, user need to create another context.

As far as I can tell, this is correct. There is no method on the Context class to initialize it; initialization happens during construction, which already properly sets already_shutdown_ to false (I don't love this design, as initialization and shutdown aren't symmetric, but that has nothing to do with this PR).

So with that, I'm happy with this. I'm going to approve. I'll wait a few hours before merging just to give @fujitatomoya a chance to respond.

fujitatomoya

@Barry-Xu-2018 thanks for the explanation, as you say there is no reinitialize way to do for user application.

fujitatomoya · 2024-11-08T17:32:47Z

@Barry-Xu-2018 @clalancette btw, we can backport this to Jazzy? i think we can, in the cpp code it breaks the ABI, but this is rclpy python client library, so nobody should rely on the cpp code in here.

Barry-Xu-2018 · 2024-11-10T06:36:34Z

we can backport this to Jazzy?

I think yes. What do you think ? @clalancette

Barry-Xu-2018 mentioned this pull request Sep 3, 2024

rcl_shutdown is not thread-safe when used in both signal_handler and Context.__exit__. #1352

Closed

YuanYuYuan mentioned this pull request Sep 3, 2024

Sporadic failure on system_tests/test_communication ros2/rmw_zenoh#275

Closed

fujitatomoya reviewed Sep 6, 2024

View reviewed changes

fujitatomoya reviewed Sep 9, 2024

View reviewed changes

Barry-Xu-2018 marked this pull request as ready for review September 27, 2024 09:18

Barry-Xu-2018 requested review from sloretz and adityapande-1995 as code owners September 27, 2024 09:18

YuanYuYuan mentioned this pull request Nov 6, 2024

Test failures with rmw_zenoh ros2/rmw_zenoh#286

Closed

clalancette requested changes Nov 7, 2024

View reviewed changes

clalancette mentioned this pull request Nov 7, 2024

Revamp of rmw_context_impl_s ros2/rmw_zenoh#307

Merged

Barry-Xu-2018 added 4 commits November 8, 2024 09:53

Fix the race condition while calling rcl_shutdown

840d550

Signed-off-by: Barry Xu <[email protected]>

Avoid calling rcl_shutdown() multiple times on the same context

7c91d52

Signed-off-by: Barry Xu <[email protected]>

Multiple calls to Context::shutdown will throw an exception

e63899b

Signed-off-by: Barry Xu <[email protected]>

Update the name of an exception

b439d70

Signed-off-by: Barry Xu <[email protected]>

Barry-Xu-2018 force-pushed the review/topic-fix-race-condition-rcl_shutdown branch from cfcf8e3 to b439d70 Compare November 8, 2024 02:09

ahcorde requested review from fujitatomoya and clalancette November 8, 2024 15:38

clalancette approved these changes Nov 8, 2024

View reviewed changes

fujitatomoya approved these changes Nov 8, 2024

View reviewed changes

fujitatomoya merged commit 34f9e13 into ros2:rolling Nov 8, 2024
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix the race condition while calling rcl_shutdown #1353

Fix the race condition while calling rcl_shutdown #1353

Barry-Xu-2018 commented Sep 3, 2024

Barry-Xu-2018 commented Sep 3, 2024 •

edited

Loading

Barry-Xu-2018 commented Sep 6, 2024

fujitatomoya left a comment

fujitatomoya Sep 6, 2024

Barry-Xu-2018 Sep 9, 2024

Barry-Xu-2018 Sep 9, 2024

fujitatomoya Sep 9, 2024

Barry-Xu-2018 Sep 10, 2024

fujitatomoya Sep 30, 2024

Barry-Xu-2018 Oct 9, 2024

Barry-Xu-2018 Oct 16, 2024

fujitatomoya Nov 7, 2024

Barry-Xu-2018 Nov 7, 2024

Yadunund commented Sep 26, 2024

fujitatomoya commented Sep 27, 2024

Barry-Xu-2018 commented Sep 27, 2024

clalancette left a comment

clalancette Nov 7, 2024

Barry-Xu-2018 Nov 8, 2024

fujitatomoya commented Nov 8, 2024

Barry-Xu-2018 commented Nov 8, 2024

Barry-Xu-2018 commented Nov 8, 2024

ahcorde commented Nov 8, 2024

clalancette commented Nov 8, 2024

fujitatomoya left a comment

fujitatomoya commented Nov 8, 2024

Barry-Xu-2018 commented Nov 10, 2024

	class CONTEXT_ALREADY_SHUTDOWN : public std::runtime_error
	class ContextAlreadyShutdown : public std::runtime_error

Fix the race condition while calling rcl_shutdown #1353

Fix the race condition while calling rcl_shutdown #1353

Conversation

Barry-Xu-2018 commented Sep 3, 2024

Barry-Xu-2018 commented Sep 3, 2024 • edited Loading

Barry-Xu-2018 commented Sep 6, 2024

fujitatomoya left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Yadunund commented Sep 26, 2024

fujitatomoya commented Sep 27, 2024

Barry-Xu-2018 commented Sep 27, 2024

clalancette left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fujitatomoya commented Nov 8, 2024

Barry-Xu-2018 commented Nov 8, 2024

Barry-Xu-2018 commented Nov 8, 2024

ahcorde commented Nov 8, 2024

clalancette commented Nov 8, 2024

fujitatomoya left a comment

Choose a reason for hiding this comment

fujitatomoya commented Nov 8, 2024

Barry-Xu-2018 commented Nov 10, 2024

Barry-Xu-2018 commented Sep 3, 2024 •

edited

Loading