
always end up coredump with rolling #659

Closed
fujitatomoya opened this issue Jan 17, 2023 · 16 comments
Labels: bug

fujitatomoya commented Jan 17, 2023

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04 docker container
  • Installation type:
    • source
  • Version or commit hash:
  • DDS implementation:
    • rmw_fastrtps_cpp only
  • Client library (if applicable):
    • rclcpp, rclpy

Steps to reproduce issue

root@tomoyafujita:~/ros2_ws/colcon_ws# ros2 topic list
Segmentation fault (core dumped)

Or any other command line is issued, that leads to core dump.

Expected behavior

No coredump observed.

@fujitatomoya

@clalancette do you see this issue?

@fujitatomoya fujitatomoya self-assigned this Jan 17, 2023
@fujitatomoya fujitatomoya added question Further information is requested more-information-needed Further information is required labels Jan 17, 2023
@fujitatomoya

I just want to know whether this is happening only in my environment or not.

@fujitatomoya

The other Tier 1 RMW implementations do not have this problem.

root@tomoyafujita:~/ros2_ws/colcon_ws# RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 topic list
/parameter_events
/rosout

root@tomoyafujita:~/ros2_ws/colcon_ws# RMW_IMPLEMENTATION=rmw_connextdds ros2 topic list
/parameter_events
/rosout

@clalancette

@clalancette do you see this issue?

Yes, this is happening on all jobs on CI right now. It just started today, as the overnight CI jobs were OK.

@clalancette

I'm guessing it is either #568 or #658. I don't have time right this second to check, but it would be worthwhile to back those out and see what happens.

@fujitatomoya

No, I rolled those back, but it does not change the result. I suspect something in Fast-DDS; working on the core's stack trace.

@fujitatomoya

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./build/demo_nodes_cpp/talker'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  eprosima::fastdds::dds::detail::ReadTakeCommand::generate_info (info=..., instance=..., item=@0x55a0569247d0: 0x0)
    at /root/ros2_ws/colcon_ws/src/eProsima/Fast-DDS/src/cpp/fastdds/subscriber/DataReaderImpl/ReadTakeCommand.hpp:229
229	        info.sample_state = item->isRead ? READ_SAMPLE_STATE : NOT_READ_SAMPLE_STATE;
[Current thread is 1 (Thread 0x7f3971f5a640 (LWP 127542))]
(gdb) bt
#0  eprosima::fastdds::dds::detail::ReadTakeCommand::generate_info (info=..., instance=..., item=@0x55a0569247d0: 0x0)
    at /root/ros2_ws/colcon_ws/src/eProsima/Fast-DDS/src/cpp/fastdds/subscriber/DataReaderImpl/ReadTakeCommand.hpp:229
#1  0x00007f39788b8a73 in eprosima::fastdds::dds::detail::DataReaderHistory::get_first_untaken_info (this=0x55a056923d20, info=...)
    at /root/ros2_ws/colcon_ws/src/eProsima/Fast-DDS/src/cpp/fastdds/subscriber/history/DataReaderHistory.cpp:361
#2  0x00007f397889c360 in eprosima::fastdds::dds::DataReaderImpl::get_first_untaken_info (this=0x55a056923860, info=0x7f3971f593f0)
    at /root/ros2_ws/colcon_ws/src/eProsima/Fast-DDS/src/cpp/fastdds/subscriber/DataReaderImpl.cpp:762
#3  0x00007f3978896993 in eprosima::fastdds::dds::DataReader::get_first_untaken_info (this=0x55a0569223d0, info=0x7f3971f593f0)
    at /root/ros2_ws/colcon_ws/src/eProsima/Fast-DDS/src/cpp/fastdds/subscriber/DataReader.cpp:278
#4  0x00007f397932b648 in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=0x7f3979400246 "rmw_fastrtps_cpp", subscriptions=0x7f3971f594e0, 
    guard_conditions=0x7f3971f594f0, services=0x0, clients=0x0, events=0x0, wait_set=0x7f395c000b70, wait_timeout=0x0)
    at /root/ros2_ws/colcon_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:68
#5  0x00007f39792ce1b7 in node_listener (context=0x55a0567c7260)
    at /root/ros2_ws/colcon_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/listener_thread.cpp:142
#6  0x00007f39792cf0ec in std::__invoke_impl<void, void (*)(rmw_context_s*), rmw_context_s*> (
    __f=@0x55a056939cd0: 0x7f39792cdf7b <node_listener(rmw_context_t*)>) at /usr/include/c++/11/bits/invoke.h:61
#7  0x00007f39792cf04d in std::__invoke<void (*)(rmw_context_s*), rmw_context_s*> (
    __fn=@0x55a056939cd0: 0x7f39792cdf7b <node_listener(rmw_context_t*)>) at /usr/include/c++/11/bits/invoke.h:96
#8  0x00007f39792cefad in std::thread::_Invoker<std::tuple<void (*)(rmw_context_s*), rmw_context_s*> >::_M_invoke<0ul, 1ul> (this=0x55a056939cc8)
    at /usr/include/c++/11/bits/std_thread.h:253
#9  0x00007f39792cef62 in std::thread::_Invoker<std::tuple<void (*)(rmw_context_s*), rmw_context_s*> >::operator() (this=0x55a056939cc8)
    at /usr/include/c++/11/bits/std_thread.h:260
#10 0x00007f39792cef42 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(rmw_context_s*), rmw_context_s*> > >::_M_run (
    this=0x55a056939cc0) at /usr/include/c++/11/bits/std_thread.h:211
#11 0x00007f3979a692b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007f39797d7b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#13 0x00007f3979869a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

@fujitatomoya

The segmentation fault happens inside Fast-DDS, reached via rmw_wait as shown above. Checking the Fast-DDS commit history.

@fujitatomoya

This seems to be related to eProsima/Fast-DDS@de5cd9c; checking by rolling back that Fast-DDS master branch commit.

There are 3 possible commits from yesterday: https://github.com/eProsima/Fast-DDS/commits/master
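With only three candidate commits, checking them one by one works; for a longer range, `git bisect run` automates the search. A self-contained sketch of that workflow in a throwaway repository, where a `grep` stands in for "run `ros2 topic list` and check for the segfault" (all names and commits here are illustrative, not the real Fast-DDS history):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev

# Build a tiny history: two good commits, then a regression.
echo working > state.txt
git add state.txt
git commit -qm "good: initial"
git commit -q --allow-empty -m "good: refactor"
echo broken > state.txt
git commit -qam "bad: regression"
git commit -q --allow-empty -m "bad: follow-up"

# Bisect between known-bad HEAD and known-good HEAD~3; the test command
# exits 0 for a good commit (in the real case: "no coredump observed").
git bisect start HEAD HEAD~3 >/dev/null
git bisect run sh -c 'grep -q working state.txt' >/dev/null 2>&1

# refs/bisect/bad now points at the first bad commit.
git log -1 --format=%s refs/bisect/bad
```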

@clalancette

seems to be related to eProsima/Fast-DDS@de5cd9c, checking with rolling back the Fast-DDS master branch commit.

Great, thanks. If we can't get eProsima to fix this today, I'll propose that we update https://github.com/ros2/ros2/blob/rolling/ros2.repos#L37 to pin the version to the last known good hash as a workaround.
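For reference, a sketch of what that pin could look like in ros2.repos, assuming the standard vcstool entry layout; the abbreviated hash is the last known good commit mentioned later in this thread (a full hash would normally be used):

```yaml
repositories:
  eProsima/Fast-DDS:
    type: git
    url: https://github.com/eProsima/Fast-DDS.git
    version: f6670e3  # pinned to last known good, instead of "master"
```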

@fujitatomoya

Okay, confirmed that eProsima/Fast-DDS@de5cd9c is the commit that leads to the coredump. (After rolling back to eProsima/Fast-DDS@f6670e3, the segmentation fault does not occur.)

@MiguelCompany

@fujitatomoya would you please open a PR in Fast DDS that reverts the offending commit?

@fujitatomoya

The revert PR is eProsima/Fast-DDS#3221.

@fujitatomoya

Stack Trace Information:

(gdb) frame
#0  eprosima::fastdds::dds::detail::ReadTakeCommand::generate_info (info=..., instance=..., item=@0x55db75811ad0: 0x0)
    at /root/ros2_ws/colcon_ws/src/eProsima/Fast-DDS/src/cpp/fastdds/subscriber/DataReaderImpl/ReadTakeCommand.hpp:229
229	        info.sample_state = item->isRead ? READ_SAMPLE_STATE : NOT_READ_SAMPLE_STATE;
(gdb) print info
$1 = (eprosima::fastdds::dds::SampleInfo &) @0x7fe8893f43f0: {sample_state = 0, view_state = 0, instance_state = 0, disposed_generation_count = 0, 
  no_writers_generation_count = 0, sample_rank = 0, generation_rank = 0, absolute_generation_rank = 0, source_timestamp = {seconds_ = 0, 
    fraction_ = 0, nanosec_ = 0}, reception_timestamp = {seconds_ = 0, fraction_ = 0, nanosec_ = 0}, instance_handle = {value = {value_ = {
        _M_elems = '\000' <repeats 15 times>}, has_been_set_ = false}}, publication_handle = {value = {value_ = {
        _M_elems = '\000' <repeats 15 times>}, has_been_set_ = false}}, valid_data = false, sample_identity = {writer_guid_ = {guidPrefix = {
        static size = 12, value = '\000' <repeats 11 times>}, entityId = {static size = 4, value = "\000\000\000"}}, sequence_number_ = {high = -1, 
      low = 0}}, related_sample_identity = {writer_guid_ = {guidPrefix = {static size = 12, value = '\000' <repeats 11 times>}, entityId = {
        static size = 4, value = "\000\000\000"}}, sequence_number_ = {high = -1, low = 0}}}
(gdb) print item
$2 = (const eprosima::fastdds::dds::detail::DataReaderCacheChange &) @0x55db75811ad0: 0x0
(gdb) whatis item
type = const eprosima::fastdds::dds::detail::DataReaderCacheChange &


(gdb) frame
#1  0x00007fe88fd53a73 in eprosima::fastdds::dds::detail::DataReaderHistory::get_first_untaken_info (this=0x55db75811020, info=...)
    at /root/ros2_ws/colcon_ws/src/eProsima/Fast-DDS/src/cpp/fastdds/subscriber/history/DataReaderHistory.cpp:361
361	        ReadTakeCommand::generate_info(info, *(it->second), *item);
(gdb) print item
$13 = 0x0
(gdb) print instance_changes
$14 = (eprosima::fastrtps::ResourceLimitedVector<eprosima::fastrtps::rtps::CacheChange_t*, std::integral_constant<bool, true>, eprosima::fastrtps::ResourceLimitedContainerConfig, std::allocator<eprosima::fastrtps::rtps::CacheChange_t*>, std::vector<eprosima::fastrtps::rtps::CacheChange_t*, std::allocator<eprosima::fastrtps::rtps::CacheChange_t*> > > &) @0x55db75811a20: {
  _vptr.ResourceLimitedVector = 0x7fe8905cd318 <vtable for eprosima::fastrtps::ResourceLimitedVector<eprosima::fastrtps::rtps::CacheChange_t*, std::integral_constant<bool, true>, eprosima::fastrtps::ResourceLimitedContainerConfig, std::allocator<eprosima::fastrtps::rtps::CacheChange_t*>, std::vector<eprosima::fastrtps::rtps::CacheChange_t*, std::allocator<eprosima::fastrtps::rtps::CacheChange_t*> > >+16>, configuration_ = {initial = 100, 
    maximum = 5000, increment = 1}, collection_ = std::vector of length 0, capacity 100}
(gdb) print data_available_instances_
$15 = std::map with 1 element = {[{value = {value_ = {_M_elems = '\000' <repeats 15 times>}, 
      has_been_set_ = false}}] = std::shared_ptr<eprosima::fastdds::dds::detail::DataReaderInstance> (use count 2, weak count 0) = {
    get() = 0x55db75811a20}}

The following patch can fix the problem, but re-landing the original change with this fix folded in as a complete patch would be best.

diff --git a/src/cpp/fastdds/subscriber/history/DataReaderHistory.cpp b/src/cpp/fastdds/subscriber/history/DataReaderHistory.cpp
index e9d612669..2d258a433 100644
--- a/src/cpp/fastdds/subscriber/history/DataReaderHistory.cpp
+++ b/src/cpp/fastdds/subscriber/history/DataReaderHistory.cpp
@@ -357,9 +357,12 @@ bool DataReaderHistory::get_first_untaken_info(
     {
         auto it = data_available_instances_.begin();
         auto& instance_changes = it->second->cache_changes;
-        auto item = instance_changes.cbegin();
-        ReadTakeCommand::generate_info(info, *(it->second), *item);
-        return true;
+        if (!instance_changes.empty())
+        {
+            auto item = instance_changes.cbegin();
+            ReadTakeCommand::generate_info(info, *(it->second), *item);
+            return true;
+        }
     }
 
     return false;

@fujitatomoya

Full CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@fujitatomoya fujitatomoya added bug Something isn't working and removed question Further information is requested more-information-needed Further information is required labels Jan 18, 2023
@fujitatomoya

@clalancette @MiguelCompany now it is all green on the rmw_fastrtps full CI, so I will go ahead and close this one.
