
Raft deadlock #3147

Closed · kikimo opened this issue Oct 19, 2021 · 3 comments
Labels: type/bug (Type: something is unexpected)

kikimo (Contributor) commented Oct 19, 2021

Describe the bug (must be provided)

Raft deadlocks after a stress test.

Your Environments (must be provided)

  • OS: uname -a
  • Compiler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id 650776b (v2.6.0)

How To Reproduce (must be provided)

Steps to reproduce the behavior:

  1. 3 storaged + 1 metad + 1 graphd
  2. keep triggering leader changes and start 1024 concurrent clients sending add-edge requests
  3. run for about two minutes

After stopping the stresser, we see one storage instance go offline:

(root@nebula) [(none)]> show hosts;
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| Host            | Port  | Status    | Leader count | Leader distribution  | Partition distribution |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 59007 | "ONLINE"  | 1            | "ttos_3p3r:1"        | "ttos_3p3r:1"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 37549 | "ONLINE"  | 0            | "No valid partition" | "ttos_3p3r:1"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "192.168.15.11" | 56039 | "OFFLINE" | 0            | "No valid partition" | "ttos_3p3r:1"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
| "Total"         |       |           | 1            | "ttos_3p3r:1"        | "ttos_3p3r:3"          |
+-----------------+-------+-----------+--------------+----------------------+------------------------+
Got 4 rows (time spent 1249/11228 us)

Tue, 19 Oct 2021 16:23:00 CST

pstack shows that the raft thread of that instance is blocked on raftLock_:

Thread 61 (Thread 0x7f4d317ff700 (LWP 3413758) "IOThreadPool11"):
#0  0x00007f4d531c354d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4d531bee9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f4d531bed68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000002a3863a in __gthread_mutex_lock (__mutex=0x7f4d384be150) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/x86_64-vesoft-linux/bits/gthr-default.h:748
#4  0x0000000002a3893c in std::mutex::lock (this=0x7f4d384be150) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/std_mutex.h:103
#5  0x0000000002a3e41b in std::lock_guard<std::mutex>::lock_guard (this=0x7f4d317fbbb8, __m=...) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/std_mutex.h:162
#6  0x0000000003d557a0 in nebula::raftex::RaftPart::processHeartbeatRequest (this=0x7f4d384be010, req=..., resp=...) at /root/src/wwl/nebula/src/kvstore/raftex/RaftPart.cpp:1656
#7  0x0000000003d955dc in nebula::raftex::RaftexService::async_eb_heartbeat (this=0x7f4d528323b0, callback=..., req=...) at /root/src/wwl/nebula/src/kvstore/raftex/RaftexService.cpp:220
#8  0x0000000003e2b971 in nebula::raftex::cpp2::RaftexServiceAsyncProcessor::process_heartbeat<apache::thrift::CompactProtocolReader, apache::thrift::CompactProtocolWriter> (this=0x7f4d0491bae0, req=..., serializedRequest=..., ctx=0x7f4d30ad0510, eb=0x7f4d30a04000, tm=0x7f4d528141d0) at /root/src/wwl/nebula/build/src/interface/gen-cpp2/RaftexService.tcc:220
#9  0x0000000003e2742a in nebula::raftex::cpp2::RaftexServiceAsyncProcessor::setUpAndProcess_heartbeat<apache::thrift::CompactProtocolReader, apache::thrift::CompactProtocolWriter> (this=0x7f4d0491bae0, req=..., serializedRequest=..., ctx=0x7f4d30ad0510, eb=0x7f4d30a04000, tm=0x7f4d528141d0) at /root/src/wwl/nebula/build/src/interface/gen-cpp2/RaftexService.tcc:198
#10 0x0000000003e294eb in apache::thrift::detail::ap::process_pmap<apache::thrift::CompactProtocolReader, nebula::raftex::cpp2::RaftexServiceAsyncProcessor> (proc=0x7f4d0491bae0, pmap=..., req=..., serializedRequest=..., ctx=0x7f4d30ad0510, eb=0x7f4d30a04000, tm=0x7f4d528141d0) at /root/src/wwl/nebula/build/third-party/install/include/thrift/lib/cpp2/GeneratedCodeHelper.h:891
#11 0x0000000003e266f3 in apache::thrift::detail::ap::process<nebula::raftex::cpp2::RaftexServiceAsyncProcessor> (processor=0x7f4d0491bae0, req=..., serializedRequest=..., protType=apache::thrift::protocol::T_COMPACT_PROTOCOL, ctx=0x7f4d30ad0510, eb=0x7f4d30a04000, tm=0x7f4d528141d0) at /root/src/wwl/nebula/build/third-party/install/include/thrift/lib/cpp2/GeneratedCodeHelper.h:919
#12 0x0000000003e2539e in nebula::raftex::cpp2::RaftexServiceAsyncProcessor::processSerializedRequest (this=0x7f4d0491bae0, req=..., serializedRequest=..., protType=apache::thrift::protocol::T_COMPACT_PROTOCOL, context=0x7f4d30ad0510, eb=0x7f4d30a04000, tm=0x7f4d528141d0) at /root/src/wwl/nebula/build/src/interface/gen-cpp2/RaftexService.cpp:102
#13 0x00000000042f8644 in apache::thrift::Cpp2Connection::requestReceived(std::unique_ptr<apache::thrift::HeaderServerChannel::HeaderRequest, std::default_delete<apache::thrift::HeaderServerChannel::HeaderRequest> >&&) ()
#14 0x00000000043042cc in apache::thrift::HeaderServerChannel::messageReceived(std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >&&, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> >&&) ()
#15 0x00000000042c54f6 in apache::thrift::Cpp2Channel::read(wangle::HandlerContext<int, std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, apache::thrift::transport::THeader*> >*, std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >) ()
#16 0x00000000042d2c73 in wangle::ContextImpl<apache::thrift::Cpp2Channel>::read(std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >) ()
#17 0x00000000042d2b65 in wangle::ContextImpl<apache::thrift::FramingHandler>::fireRead(std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >) ()
#18 0x00000000042d37e5 in apache::thrift::FramingHandler::read(wangle::HandlerContext<std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >, std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > >*, folly::IOBufQueue&) ()
#19 0x00000000042cb56d in non-virtual thunk to wangle::ContextImpl<apache::thrift::FramingHandler>::read(folly::IOBufQueue&) ()
#20 0x00000000042cb450 in wangle::ContextImpl<apache::thrift::TAsyncTransportHandler>::fireRead(folly::IOBufQueue&) ()
#21 0x00000000045db8ff in folly::AsyncSocket::handleRead() ()
#22 0x00000000045d2a00 in folly::AsyncSocket::ioReady(unsigned short) ()
#23 0x00000000046c284f in ?? ()
#24 0x00000000046c316f in event_base_loop ()
#25 0x00000000045ec7fe in folly::EventBase::loopBody(int, bool) ()
#26 0x00000000045ecca6 in folly::EventBase::loop() ()
#27 0x00000000045ee756 in folly::EventBase::loopForever() ()
#28 0x0000000004578389 in folly::IOThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>) ()
#29 0x0000000004585f59 in void folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) ()
#30 0x0000000002a3e85c in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0x7f4d33410390) at /root/src/wwl/nebula/build/third-party/install/include/folly/Function.h:400
#31 0x0000000002a4ea06 in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}::operator()() (__closure=0x7f4d33410390) at /root/src/wwl/nebula/build/third-party/install/include/folly/executors/thread_factory/NamedThreadFactory.h:40
#32 0x0000000002a60400 in std::__invoke_impl<void, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__f=...) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/invoke.h:60
#33 0x0000000002a5994a in std::__invoke<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__fn=...) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/invoke.h:95
#34 0x0000000002aa52c2 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x7f4d33410390) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:234
#35 0x0000000002aa47b7 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::operator()() (this=0x7f4d33410390) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:243
#36 0x0000000002aa1a54 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> > >::_M_run() (this=0x7f4d33410380) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:186
#37 0x0000000004b4043f in execute_native_thread_routine ()
#38 0x00007f4d531bcea5 in start_thread () from /lib64/libpthread.so.0
#39 0x00007f4d52ee59fd in clone () from /lib64/libc.so.6
Thread 60 (Thread 0x7f4d325ff700 (LWP 3413757) "IOThreadPool10"):

strace shows that thread 3413758 (the thread where the raft task resides) is blocked on a futex forever:

# strace -p 3413758
strace: Process 3413758 attached
futex(0x7f4d384be150, FUTEX_WAIT_PRIVATE, 2, NULL

The full pstack output is attached below:

3413698.txt
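
To illustrate the pattern suggested by the trace, here is a minimal, self-contained C++ sketch of the kind of circular wait that produces a stack like the one above. Everything in it is hypothetical (it is not the actual RaftPart/Host code): one thread holds a mutex standing in for raftLock_ while it synchronously waits for a result that only an I/O thread can deliver, and that I/O thread first has to take the same mutex to serve an incoming heartbeat (cf. processHeartbeatRequest in frame #6). Running it hangs by design, with the second thread parked in pthread_mutex_lock / FUTEX_WAIT, just like the strace output.

#include <chrono>
#include <future>
#include <mutex>
#include <thread>

std::mutex raftLock;  // stands in for RaftPart::raftLock_ (hypothetical name)

// "Raft" thread: grabs the lock, then blocks on the RPC response while still
// holding it.
void raftThread(std::future<void> resp) {
  std::lock_guard<std::mutex> g(raftLock);
  resp.get();  // never fulfilled below, so it blocks forever holding the lock
}

// "I/O" thread: before it can deliver the response, it handles a heartbeat,
// which also needs the lock -> circular wait.
void ioThread(std::promise<void> resp) {
  // give the raft thread time to take the lock first
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::lock_guard<std::mutex> g(raftLock);  // blocks forever on the futex
  resp.set_value();                         // never reached
}

int main() {
  std::promise<void> p;
  std::thread t1(raftThread, p.get_future());
  std::thread t2(ioThread, std::move(p));
  t1.join();  // never returns
  t2.join();
}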

Expected behavior

All cluster peers should still be running and healthy.

Additional context

Provide logs and configs, or any other context to trace the problem.

kikimo (Contributor, Author) commented Oct 19, 2021

@critical27
This problem disappeared after we applied patch #3141. I strongly suspect this is a problem similar to #3018; I didn't keep the stack trace for #3018, but I remember that both of them blocked on raftLock_, though at different places.

critical27 (Contributor) commented:

mark, maybe another problem, it is stuck in Host

kikimo (Contributor, Author) commented Oct 19, 2021

> mark, maybe another problem, it is stuck in Host

Yes, quite another similar problem with a cond var :-) .
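
For reference, a hedged sketch of the usual remedy for this class of hang (not the actual change in #3141): whatever needs raftLock_ is done under the lock, but the lock is released before blocking on anything the I/O threads have to fulfil, whether that is an RPC future or a condition variable. This is the earlier sketch with the wait moved outside the lock; it runs to completion.

#include <future>
#include <mutex>
#include <thread>

std::mutex raftLock;  // stands in for RaftPart::raftLock_ (hypothetical name)

void raftThread(std::future<void> resp) {
  std::unique_lock<std::mutex> g(raftLock);
  // ... mutate raft state while holding the lock ...
  g.unlock();  // drop the lock before waiting on the response
  resp.get();  // I/O threads can now take raftLock to serve heartbeats
}

void ioThread(std::promise<void> resp) {
  {
    std::lock_guard<std::mutex> g(raftLock);  // heartbeat path no longer blocks
  }
  resp.set_value();
}

int main() {
  std::promise<void> p;
  std::thread t1(raftThread, p.get_future());
  std::thread t2(ioThread, std::move(p));
  t1.join();
  t2.join();
}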
