
fix map concurrency issue #4344

Merged: 4 commits merged into vesoft-inc:master from panda-sheep:fix_mutex_map on Jun 24, 2022

Conversation

@panda-sheep (Contributor) commented Jun 23, 2022

What type of PR is this?

  • bug
  • feature
  • enhancement

What problem(s) does this PR solve?

Issue(s) number:

Description:

The metad process crashes while running the TCK test suite:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/root/src/nebula-ent/build/bin/nebula-metad --flagfile /root/nebula-chaos-clust'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000005f7f5da in std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) ()
[Current thread is 1 (Thread 0x7f94a23e3700 (LWP 245))]
(gdb) bt
#0  0x0000000005f7f5da in std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) ()
#1  0x00000000037638c8 in std::_Rb_tree<int, std::pair<int const, std::mutex>, std::_Select1st<std::pair<int const, std::mutex> >, std::less<int>, std::allocator<std::pair<int const, std::mutex> > >::_M_insert_node (this=0x6f696a8 <nebula::meta::JobManager::getInstance()::inst+6312>, __x=0x0, __p=0x7f93ac1bb980,
    __z=0x7f93b02bb060) at /usr/include/c++/9/bits/stl_tree.h:2366
#2  0x0000000003759720 in std::_Rb_tree<int, std::pair<int const, std::mutex>, std::_Select1st<std::pair<int const, std::mutex> >, std::less<int>, std::allocator<std::pair<int const, std::mutex> > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<int const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<int const, std::mutex> >, std::piecewise_construct_t const&, std::tuple<int const&>&&, std::tuple<>&&) (
    this=0x6f696a8 <nebula::meta::JobManager::getInstance()::inst+6312>, __pos=...) at /usr/include/c++/9/bits/stl_tree.h:2467
#3  0x00000000037522d2 in std::map<int, std::mutex, std::less<int>, std::allocator<std::pair<int const, std::mutex> > >::operator[] (
    this=0x6f696a8 <nebula::meta::JobManager::getInstance()::inst+6312>, __k=@0x7f94a23da58c: 174) at /usr/include/c++/9/bits/stl_map.h:499
#4  0x0000000003747fa1 in nebula::meta::JobManager::reportTaskFinish (this=0x6f67e00 <nebula::meta::JobManager::getInstance()::inst>, req=...)
    at /data/src/nebula-ent/src/meta/processors/job/JobManager.cpp:367
#5  0x00000000037a14be in nebula::meta::ReportTaskProcessor::process (this=0x7f93b0273790, req=...)
    at /data/src/nebula-ent/src/meta/processors/job/ReportTaskProcessor.cpp:17
#6  0x00000000035a11d9 in nebula::meta::MetaServiceHandler::future_reportTaskFinish (this=0x76fab00, req=...)
    at /data/src/nebula-ent/src/meta/MetaServiceHandler.cpp:129
#7  0x000000000458f454 in nebula::meta::cpp2::MetaServiceSvIf::async_tm_reportTaskFinish (this=0x76fab00, callback=..., p_req=...)
    at /data/src/nebula-ent/build/src/interface/gen-cpp2/MetaService.cpp:5425
#8  0x0000000004612478 in nebula::meta::cpp2::MetaServiceAsyncProcessor::process_reportTaskFinish<apache::thrift::CompactProtocolReader, apache::thrift::CompactProtocolWriter> (this=0x7f91cc0038f0, req=..., serializedRequest=..., ctx=0x7f91cc010618, eb=0x7f91cc000ff0, tm=0x76eddd0)
    at /data/src/nebula-ent/build/src/interface/gen-cpp2/MetaService.tcc:5351
#9  0x00000000047542ba in apache::thrift::RequestTask<nebula::meta::cpp2::MetaServiceAsyncProcessor>::run (this=0x7f930c016410)
    at /opt/vesoft/third-party/3.0/include/thrift/lib/cpp2/async/AsyncProcessor.h:471
#10 0x00000000045db19d in apache::thrift::GeneratedAsyncProcessor::processInThread<nebula::meta::cpp2::MetaServiceAsyncProcessor>(std::unique_ptr<apache::thrift::ResponseChannelRequest, apache::thrift::RequestsRegistry::Deleter>, apache::thrift::SerializedCompressedRequest&&, apache::thrift::Cpp2RequestContext*, folly::EventBase*, apache::thrift::concurrency::ThreadManager*, apache::thrift::RpcKind, void (nebula::meta::cpp2::MetaServiceAsyncProcessor::*)(std::unique_ptr<apache::thrift::ResponseChannelRequest, apache::thrift::RequestsRegistry::Deleter>, apache::thrift::SerializedCompressedRequest&&, apache::thrift::Cpp2RequestContext*, folly::EventBase*, apache::thrift::concurrency::ThreadManager*), nebula::meta::cpp2::MetaServiceAsyncProcessor*)::{lambda()#1}::operator()() const
    (this=0x7f91cc004540) at /opt/vesoft/third-party/3.0/include/thrift/lib/cpp2/async/AsyncProcessor.h:1114
#11 0x00000000046d4250 in folly::detail::function::FunctionTraits<void ()>::callSmall<apache::thrift::GeneratedAsyncProcessor::processInThread<nebula::meta::cpp2::MetaServiceAsyncProcessor>(std::unique_ptr<apache::thrift::ResponseChannelRequest, apache::thrift::RequestsRegistry::Deleter>, apache::thrift::SerializedCompressedRequest&&, apache::thrift::Cpp2RequestContext*, folly::EventBase*, apache::thrift::concurrency::ThreadManager*, apache::thrift::RpcKind, void (nebula::meta::cpp2::MetaServiceAsyncProcessor::*)(std::unique_ptr<apache::thrift::ResponseChannelRequest, apache::thrift::RequestsRegistry::Deleter>, apache::thrift::SerializedCompressedRequest&&, apache::thrift::Cpp2RequestContext*, folly::EventBase*, apache::thrift::concurrency::ThreadManager*), nebula::meta::cpp2::MetaServiceAsyncProcessor*)::{lambda()#1}>(folly::detail::function::Data&) (p=...) at /opt/vesoft/third-party/3.0/include/folly/Function.h:371
#12 0x0000000005641cd7 in virtual thunk to apache::thrift::concurrency::FunctionRunner::run() ()
#13 0x000000000579e7a8 in apache::thrift::concurrency::ThreadManager::Impl::Worker::run() ()
#14 0x00000000057a08ae in apache::thrift::concurrency::PthreadThread::threadMain(void*) ()
#15 0x00007f972c1b7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f972c0de293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
354 nebula::cpp2::ErrorCode JobManager::reportTaskFinish(const cpp2::ReportTaskReq& req) {
355   auto spaceId = req.get_space_id();
356   auto jobId = req.get_job_id();
357   auto taskId = req.get_task_id();
358   // only an active job manager will accept task finish reports
359   if (status_.load(std::memory_order_acquire) == JbmgrStatus::STOPPED ||
360       status_.load(std::memory_order_acquire) == JbmgrStatus::NOT_START) {
361     LOG(INFO) << folly::sformat(
362         "report to an in-active job manager, spaceId={}, job={}, task={}", spaceId, jobId, taskId);
363     return nebula::cpp2::ErrorCode::E_UNKNOWN;
364   }
365   // because the last task will update the job's status,
366   // tasks should report one at a time
367   std::lock_guard<std::mutex> lk(muReportFinish_[spaceId]);
368   auto tasksRet = getAllTasks(spaceId, jobId);
369   if (!nebula::ok(tasksRet)) {
370     return nebula::error(tasksRet);
371   }
372   auto tasks = nebula::value(tasksRet);
373   auto task = std::find_if(tasks.begin(), tasks.end(), [&](auto& it) {
374     return it.getJobId() == jobId && it.getTaskId() == taskId;
375   });
376   if (task == tasks.end()) {
377     LOG(INFO) << folly::sformat(
378         "Report an invalid or outdate task, will ignore this report, job={}, "
379         "task={}",
380         jobId,
381         taskId);
382     return nebula::cpp2::ErrorCode::SUCCEEDED;
383   }
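
The backtrace explains the crash: frames #0-#3 show std::map<int, std::mutex>::operator[] running from JobManager::reportTaskFinish (JobManager.cpp:367, shown above). muReportFinish_ is a std::map keyed by space id, and operator[] default-constructs and inserts a new mutex the first time a given space reports, which rebalances the underlying red-black tree. std::map gives no thread-safety guarantee for concurrent modification, so two report RPCs arriving on different threads can both insert and corrupt the tree, crashing in _Rb_tree_insert_and_rebalance. A minimal standalone sketch of the race (hypothetical names, not the project's code):

#include <map>
#include <mutex>
#include <thread>

std::map<int, std::mutex> perSpaceMu;  // shared map with no lock of its own

void report(int spaceId) {
  // operator[] is a write on first access: it inserts a node and may
  // rebalance the tree, which is undefined behavior if another thread
  // is inserting at the same time.
  std::lock_guard<std::mutex> lk(perSpaceMu[spaceId]);
  // ... per-space critical section ...
}

int main() {
  std::thread t1(report, 1);
  std::thread t2(report, 2);  // races with t1 on the tree structure
  t1.join();
  t2.join();
}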

How do you solve it?
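
The lookup/insertion of the per-space mutex must itself be made thread-safe. One way to do that, as a sketch only (the PR's actual change may differ in detail; mapMu_ is a hypothetical guard added for illustration), is to protect the map with a dedicated mutex and hand out a reference to the per-space mutex. The reference stays valid because std::map never invalidates references to existing elements on insertion, and entries are never erased here.

std::mutex mapMu_;                          // protects muReportFinish_ itself
std::map<int, std::mutex> muReportFinish_;  // per-space report mutexes

std::mutex& reportMutex(int spaceId) {
  std::lock_guard<std::mutex> g(mapMu_);
  return muReportFinish_[spaceId];          // insertion is now serialized
}

// reportTaskFinish would then take the per-space lock via the helper:
//   std::lock_guard<std::mutex> lk(reportMutex(spaceId));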

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

  • Unit test(positive and negative cases)
  • Function test
  • Performance test
  • N/A

Affects:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Please confirm whether to be reflected in release notes and how to describe:

ex. Fixed the bug .....

@panda-sheep added the ready-for-testing (PR: ready for the CI test) and ready for review labels on Jun 23, 2022
@panda-sheep requested a review from kikimo on Jun 23, 2022 09:30
@Sophie-Xie added the cherry-pick-v3.2 (PR: need cherry-pick to this version) label on Jun 23, 2022
@panda-sheep changed the title from "fix mutex in map" to "fix map concurrency issue" on Jun 24, 2022
@codecov-commenter

Codecov Report

Merging #4344 (0c46eda) into master (fcbab77) will increase coverage by 0.01%.
The diff coverage is 78.80%.

@@            Coverage Diff             @@
##           master    #4344      +/-   ##
==========================================
+ Coverage   84.88%   84.89%   +0.01%     
==========================================
  Files        1343     1343              
  Lines      133363   133555     +192     
==========================================
+ Hits       113199   113379     +180     
- Misses      20164    20176      +12     
Impacted Files Coverage Δ
src/common/context/ExpressionContext.h 100.00% <ø> (ø)
src/common/expression/PropertyExpression.h 100.00% <ø> (ø)
src/common/utils/DefaultValueContext.h 0.00% <0.00%> (ø)
src/graph/context/Iterator.h 69.94% <0.00%> (-0.82%) ⬇️
src/graph/context/QueryExpressionContext.h 100.00% <ø> (ø)
...rc/graph/executor/algo/ProduceAllPathsExecutor.cpp 97.67% <ø> (ø)
src/graph/executor/algo/ShortestPathBase.h 50.00% <ø> (ø)
src/graph/executor/query/InnerJoinExecutor.h 100.00% <ø> (ø)
src/graph/executor/query/JoinExecutor.h 100.00% <ø> (ø)
src/graph/executor/query/LeftJoinExecutor.h 100.00% <ø> (ø)
... and 106 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@liuyu85cn (Contributor)

[manual dog head]

@liwenhui-soul (Contributor)

> [manual dog head]

hello, how are you.

@liuyu85cn (Contributor)

how about isolating graph/space in 5.0 [manual dog head]

@liuyu85cn (Contributor)

> [manual dog head]
>
> hello, how are you.

miss you so much

@Sophie-Xie merged commit 1644523 into vesoft-inc:master on Jun 24, 2022
@panda-sheep deleted the fix_mutex_map branch on June 24, 2022 05:54
@panda-sheep (Contributor, Author) commented Jun 24, 2022

> how about isolating graph/space in 5.0 [manual dog head]

Great idea, are you interested? 👏🏻

Sophie-Xie added a commit that referenced this pull request Jun 27, 2022
* fix mutex in map

* add test

* move the order

Co-authored-by: Sophie <[email protected]>
Sophie-Xie added a commit that referenced this pull request Jun 27, 2022
* force cache the docker layer (#4331)

* check god role exist when meta init (#4330)

* check god role exist when meta init

* return error if kv fail

Co-authored-by: Doodle <[email protected]>

* Fix object pool mtsafe. (#4332)

* Fix object pool mtsafe.

* Fix lock.

* Fixed web service crash (#4334)

Co-authored-by: Sophie <[email protected]>

* Fix get edges transform rule. (#4328)

1. Input/Output variables.
2. Keep column names of Limit same with input plan node.

Co-authored-by: Sophie <[email protected]>

* fix rc docker (#4336)

* add lock (#4352)

* fix map concurrency issue (#4344)

* fix mutex in map

* add test

* move the order

Co-authored-by: Sophie <[email protected]>

* add stats under index conditions (#4353)

Co-authored-by: Harris.Chu <[email protected]>
Co-authored-by: jimingquan <[email protected]>
Co-authored-by: Doodle <[email protected]>
Co-authored-by: shylock <[email protected]>
Co-authored-by: dutor <[email protected]>
Co-authored-by: panda-sheep <[email protected]>
Labels: cherry-pick-v3.2 (PR: need cherry-pick to this version), ready for review, ready-for-testing (PR: ready for the CI test)
7 participants