[#2978] Various TSAN fixes for QLTransactionTest.RemoteBootstrap
Summary:
Started digging into some TSAN failures in QLTransactionTest.RemoteBootstrap. Found and fixed a number of issues:

- [#3002] Data race on `yb::log::Log::active_segment_sequence_number()`
Seems this field was protected by a read lock for reads, but was not protected on writes. Turned it into an atomic.

- [#3007] Race condition between TabletPeer `Init` and `Shutdown`
> std::__1::shared_ptr<yb::tablet::enterprise::Tablet>::get()
> std::__1::shared_ptr<yb::consensus::RaftConsensus>::get()
> src/yb/tablet/tablet_peer.cc:385:17 in yb::tablet::TabletPeer::StartShutdown()
Seems like `Shutdown` did not take the appropriate locks to access either `tablet_` or `consensus_`.

- [#3008] Race condition in thread pool `Worker` Shutdown path:
> #12 yb::rpc::ThreadPool::Impl::Shutdown() /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/rpc/thread_pool.cc:224 (libyrpc.so+0x20c73f)
> #3 yb::rpc::(anonymous namespace)::Worker::Notify() /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/rpc/thread_pool.cc:75 (libyrpc.so+0x20ad6e)
Essentially, we're destroying the vector of workers, but it's possible we still end up trying to notify them afterwards. Moved some of the code around and exposed an explicit `Join`. Logic should stay basically the same. Also moved to shared_ptr instead of raw pointers.

- [#3009] Race condition in Master async RPC task vs CatalogManager reading the task description
> #0 yb::master::PickLeaderReplica::PickReplica(yb::master::TSDescriptor**) /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/master/async_rpc_tasks.cc:95:12 (libmaster.so+0x2450b8)
> #2 yb::master::CatalogManager::SendAddServerRequest(scoped_refptr<yb::master::TabletInfo> const&, yb::consensus::RaftPeerPB_MemberType, yb::consensus::ConsensusStatePB const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) /n/users/bogdan/code/yugabyte-db/build/tsan-clang-dynamic-ninja/../../src/yb/master/catalog_manager.cc:5046:54
Just removed the log line.

There are a couple more issues I am still seeing:

- [#3010] Another `Long wait for safe op id`, but this one seems like a bootstrap bug
Currently, doing a remote bootstrap triggers an inline OpenTablet in TSTabletManager, unlike the normal ones, which are scheduled through a thread pool. That causes issues, because on shutdown, we wait for the thread pool tasks to finish or get aborted. When the open is done inline, however, nothing waits for it, which exposes race conditions between the Init and Shutdown paths for TabletPeer, RaftConsensus, Log, etc.

- [#3011] SEGV during `DisableFailureDetector`, during raft shutdown
Caused by the same race between Start (which creates the timer) and Shutdown (which aborts it).

- [#3012] Log Close failures not flipping the state to closed
> F20191106 04:56:48 ../../src/yb/consensus/log_util.cc:874] Check failed: !IsFooterWritten()
Caused by the same race above. If TSTabletManager starts a remote bootstrap, we open a log. If we shut down the tablet manager before finishing the bootstrap, we wipe the data, but then when we close the log, we error out because the files do not exist anymore.

- [#3013] Race condition in Master async RPC tasks state transitions
> F20191106 05:10:05 ../../src/yb/master/async_rpc_tasks.cc:126] Check failed: task_state == MonitoredTaskState::kWaiting State: kScheduling
Seems like there is a race here: the task is scheduled to run on the reactor thread, and only AFTER that is the state flipped from kScheduling, so the task can run while the state is still kScheduling. This can be a standalone investigation.

Test Plan: `ybd tsan --cxx-test client_ql-transaction-test --gtest_filter QLTransactionTest.RemoteBootstrap -n 100 --tp 4`

Reviewers: mikhail, sergei

Reviewed By: sergei

Subscribers: hector, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7529
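
For reference, a minimal sketch of the kind of change described in [#3002]: a counter that readers used to access under a lock while the writer did not is turned into a `std::atomic`. This is not the actual `yb::log::Log` code. `SketchLog`, `RollOver` and `RunSketch` are hypothetical names made up for illustration, and the memory orderings are an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the log class; NOT the real yb::log::Log.
class SketchLog {
 public:
  // Readers observe the sequence number without taking any lock.
  int64_t active_segment_sequence_number() const {
    return active_segment_sequence_number_.load(std::memory_order_acquire);
  }

  // Called by the single writer when it rolls over to a new segment.
  void RollOver() {
    active_segment_sequence_number_.fetch_add(1, std::memory_order_acq_rel);
  }

 private:
  // An atomic makes the unlocked write and the concurrent reads race-free.
  std::atomic<int64_t> active_segment_sequence_number_{0};
};

// Runs one writer and several concurrent readers.
// Returns the final sequence number, or -1 if any reader saw it go backwards.
int64_t RunSketch(int rollovers, int num_readers) {
  SketchLog log;
  std::atomic<bool> monotonic{true};
  std::vector<std::thread> readers;
  for (int i = 0; i < num_readers; ++i) {
    readers.emplace_back([&log, &monotonic, rollovers] {
      int64_t last = 0;
      // Spin until this reader has seen the final value.
      while (last < rollovers) {
        int64_t now = log.active_segment_sequence_number();
        if (now < last) monotonic.store(false);
        last = now;
      }
    });
  }
  std::thread writer([&log, rollovers] {
    for (int i = 0; i < rollovers; ++i) log.RollOver();
  });
  writer.join();
  for (auto& t : readers) t.join();
  return monotonic.load() ? log.active_segment_sequence_number() : -1;
}
```

Calling `RunSketch(1000, 4)` exercises the same access pattern (one writer, several readers). With a plain `int64_t` member instead of the atomic, TSAN would be expected to flag the unsynchronized write.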