[DocDB] Segmentation Fault in MvccManager::SafeTimeForFollower #21877

Closed
1 task done
shamanthchandra-yb opened this issue Apr 8, 2024 · 7 comments

Comments

@shamanthchandra-yb

shamanthchandra-yb commented Apr 8, 2024

Jira Link: DB-10780

Description

Observed on build 2.23.0.0-b132 in a CDC scenario. This looks like a regression.


Here is the stack trace:

* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x00007f4d2b6f3a84 libpthread.so.0`__pthread_mutex_lock + 4
    frame #1: 0x000055d6d1e1190b yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:v170002](this=0x00007f4ccb6feaa0, __m=0x0000000000000110) at unique_lock.h:41:11
    frame #2: 0x000055d6d1e118f5 yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(this=0x00000000000000f0, min_allowed=<unavailable>, deadline=yb::CoarseTimePoint @ 0x00007f4ccb6feb08) const at mvcc.cc:500:32
    frame #3: 0x000055d6d1ef58e3 yb-tserver`yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked(this=0x000037e27d26fb00, min_running_notifier=0x00007f4ccb6fef28) at transaction_participant.cc:1537:45
    frame #4: 0x000055d6d1efc11a yb-tserver`yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked(this=0x000037e27d26fb00, id=<unavailable>, reason=<unavailable>, min_running_notifier=0x00007f4ccb6fef28, expected_deadlock_status=<unavailable>) at transaction_participant.cc:1516:5
    frame #5: 0x000055d6d1e3afbe yb-tserver`yb::tablet::RunningTransaction::DoStatusReceived(this=0x000037e2679b5218, status_tablet="d5922c26c9704f298d6812aff8f615f6", status=<unavailable>, response=<unavailable>, serial_no=56986, shared_self=std::__1::shared_ptr<yb::tablet::RunningTransaction>::element_type @ 0x000037e2679b5218) at running_transaction.cc:424:16
    frame #6: 0x000055d6d0d7db5f yb-tserver`yb::client::(anonymous namespace)::TransactionRpcBase::Finished(this=0x000037e29c80b420, status=<unavailable>) at transaction_rpc.cc:67:7
    frame #7: 0x000055d6d0d7df40 yb-tserver`std::__1::__function::__func<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::allocator<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>>, void ()>::operator()() [inlined] decltype(__f=<unavailable>, __a0=<unavailable>, __args=<unavailable>)::TransactionRpcBase*&>().*std::declval<void (yb::client::(anonymous namespace)::TransactionRpcBase::*&)(yb::Status const&)>()(std::declval<yb::Status::OK&>())) std::__1::__invoke[abi:v170002]<void (yb::client::(anonymous namespace)::TransactionRpcBase::*&)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*&, yb::Status::OK&, void>(void (yb::client::(anonymous namespace)::TransactionRpcBase::*&)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*&, yb::Status::OK&) at invoke.h:308:25
    frame #8: 0x000055d6d0d7df21 yb-tserver`std::__1::__function::__func<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::allocator<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>>, void ()>::operator()() [inlined] std::__1::__bind_return<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), std::__1::tuple<yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::tuple<>, __is_valid_bind_return<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), std::__1::tuple<yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::tuple<>>::value>::type std::__1::__apply_functor[abi:v170002]<void (__f=<unavailable>, __bound_args=<unavailable>, (null)=<unavailable>, __args=<unavailable>)::TransactionRpcBase::*)(yb::Status const&), std::__1::tuple<yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, 0ul, 1ul, std::__1::tuple<>>(void (yb::client::(anonymous namespace)::TransactionRpcBase::*&)(yb::Status const&), std::__1::tuple<yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>&, std::__1::__tuple_indices<0ul, 1ul>, std::__1::tuple<>&&) at bind.h:260:12
    frame #9: 0x000055d6d0d7df21 yb-tserver`std::__1::__function::__func<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::allocator<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>>, void ()>::operator()() [inlined] std::__1::__bind_return<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), std::__1::tuple<yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::tuple<>, __is_valid_bind_return<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), std::__1::tuple<yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::tuple<>>::value>::type std::__1::__bind<void (this=<unavailable>)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>::operator()[abi:v170002]<>() at bind.h:292:20
    frame #10: 0x000055d6d0d7df19 yb-tserver`std::__1::__function::__func<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::allocator<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>>, void ()>::operator()() [inlined] decltype(__f=<unavailable>)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>&>()()) std::__1::__invoke[abi:v170002]<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>&>(std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>&) at invoke.h:340:25
    frame #11: 0x000055d6d0d7df19 yb-tserver`std::__1::__function::__func<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::allocator<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>>, void ()>::operator()() [inlined] void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:v170002]<std::__1::__bind<void (__args=<unavailable>)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>&>(std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>&) at invoke.h:415:5
    frame #12: 0x000055d6d0d7df19 yb-tserver`std::__1::__function::__func<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::allocator<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>>, void ()>::operator()() [inlined] std::__1::__function::__alloc_func<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::allocator<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>>, void ()>::operator(this=<unavailable>)[abi:v170002]() at function.h:192:16
    frame #13: 0x000055d6d0d7df19 yb-tserver`std::__1::__function::__func<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>, std::__1::allocator<std::__1::__bind<void (yb::client::(anonymous namespace)::TransactionRpcBase::*)(yb::Status const&), yb::client::(anonymous namespace)::TransactionRpcBase*, yb::Status::OK>>, void ()>::operator(this=<unavailable>)() at function.h:363:12
    frame #14: 0x000055d6d1ca6eba yb-tserver`yb::rpc::OutboundCall::InvokeCallbackSync(std::__1::optional<std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>>) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x000037e29b19bcb0)[abi:v170002]() const at function.h:517:16
    frame #15: 0x000055d6d1ca6eb4 yb-tserver`yb::rpc::OutboundCall::InvokeCallbackSync(std::__1::optional<std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>>) [inlined] std::__1::function<void ()>::operator(this=0x000037e29b19bcb0)() const at function.h:1168:12
    frame #16: 0x000055d6d1ca6eb4 yb-tserver`yb::rpc::OutboundCall::InvokeCallbackSync(this=0x000037e29b19bc20, now_optional=<unavailable>) at outbound_call.cc:468:3
    frame #17: 0x000055d6d1d841a3 yb-tserver`yb::rpc::(anonymous namespace)::Worker::Execute(this=0x000037e27f07e310) at thread_pool.cc:115:15
    frame #18: 0x000055d6d25a3f03 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x000037e26c108060)[abi:v170002]() const at function.h:517:16
    frame #19: 0x000055d6d25a3eed yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator(this=0x000037e26c108060)() const at function.h:1168:12
    frame #20: 0x000055d6d25a3eed yb-tserver`yb::Thread::SuperviseThread(arg=0x000037e26c108000) at thread.cc:866:3
    frame #21: 0x00007f4d2b6f11ca libpthread.so.0`start_thread + 234
    frame #22: 0x00007f4d2b942e73 libc.so.6`__clone + 67

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@rthallamko3
Contributor

Per @shamanthchandra-yb, the suspect commit is fb7c86c. The crash happens after we stop/start a node on builds b121 and later; it does not reproduce on builds b120 and below on master.

@shishir2001-yb
Member

Encountered the same failure in a cross-DB DDL test. GHI: #21908

1. Create a universe with DDL atomicity and Per DB Catalog mode
2. Run cross DB DDL sample app

@Arjun-yb
Contributor

Noticed a similar issue in one of the stress tests (test_multi_volume_data_dir).

@yusong-yan
Contributor

yusong-yan commented Apr 12, 2024

After adding more logging and rerunning the test, here is the probable root cause:
The crash occurs because the ProcessRemoveQueueUnlocked function in TransactionParticipant calls participant_context_.SafeTimeForTransactionParticipant().

HybridTime TabletPeer::SafeTimeForTransactionParticipant() {
  return tablet_->mvcc_manager()->SafeTimeForFollower(
      /* min_allowed= */ HybridTime::kMin, /* deadline= */ CoarseTimePoint::min());
}

However, the log file shows that the tablet is still in the local bootstrap WAL replay process, so the object behind participant_context_ (a reference to a TabletPeer) has not been fully initialized yet and the TabletPeer's tablet_ field is still nullptr. When SafeTimeForTransactionParticipant then evaluates tablet_->mvcc_manager(), it dereferences a null pointer, which leads to the segmentation fault.
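
The addresses in the original stack trace are consistent with this. Frame #2 shows `this=0x00000000000000f0` and frame #1 shows `__m=0x0000000000000110`: both are small offsets from a null base rather than real heap addresses. The sketch below reproduces that arithmetic. It assumes the MVCC manager is a member embedded in the tablet object; the struct names and padding sizes are hypothetical and chosen only to match the trace, not taken from the real `yb::tablet` headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins, NOT the real yb::tablet layouts. The padding sizes
// are picked only so that the offsets match the addresses in the stack trace.
struct MvccManagerLayoutSketch {
  unsigned char fields_before_mutex[0x20];
  std::uint64_t mutex_placeholder;  // stands in for the mutex member
};

struct TabletLayoutSketch {
  unsigned char fields_before_mvcc[0xf0];
  MvccManagerLayoutSketch mvcc;  // assumed to be embedded by value
};

// Address the callee sees as `this` when the tablet lives at `tablet_base`.
constexpr std::uintptr_t MvccManagerAddress(std::uintptr_t tablet_base) {
  return tablet_base + offsetof(TabletLayoutSketch, mvcc);
}

// Address handed to the lock constructor for the same tablet.
constexpr std::uintptr_t MutexAddress(std::uintptr_t tablet_base) {
  return MvccManagerAddress(tablet_base) +
         offsetof(MvccManagerLayoutSketch, mutex_placeholder);
}

// With a null tablet (base 0), the offsets alone give the traced values.
inline void CheckAgainstTrace() {
  assert(MvccManagerAddress(0) == 0xf0);  // frame #2: this=0x...00f0
  assert(MutexAddress(0) == 0x110);       // frame #1: __m=0x...0110
}
```

The 0x20 gap between the two traced addresses is the only part read directly from the trace; everything else in the layout is illustrative.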

My hypothesis on how fb7c86c introduced the segmentation fault:
This commit introduced calls to TransactionParticipant::MinRunningHybridTime() in various functions, such as Tablet::ApplyIntents, Tablet::RemoveIntentsImpl, and ReadConflicts. At least one of them is likely executed during the local bootstrap process.

  • MinRunningHybridTime sends a status request with a callback function RunningTransaction::StatusReceived.
  • This callback (RunningTransaction::StatusReceived) is the starting point of the call stack that leads to the segmentation fault.
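
To make the hazard concrete, here is a minimal single-threaded sketch of a callback that fires before its owner has finished initializing. All class and function names are hypothetical stand-ins, and the null guard is shown only for illustration; it is not a change made in the repository.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <optional>

// Hypothetical stand-in for the tablet; the value 100 is arbitrary.
struct TabletSketch {
  std::uint64_t safe_time = 100;
};

// Hypothetical stand-in for the peer that owns the tablet pointer.
class TabletPeerSketch {
 public:
  // Models the end of local bootstrap, when the tablet pointer gets set.
  void FinishBootstrap() { tablet_ = std::make_unique<TabletSketch>(); }

  // Guarded accessor: reports "not ready" instead of dereferencing null.
  std::optional<std::uint64_t> SafeTimeForTransactionParticipant() const {
    if (!tablet_) return std::nullopt;  // still bootstrapping
    return tablet_->safe_time;
  }

 private:
  std::unique_ptr<TabletSketch> tablet_;  // null until bootstrap finishes
};

// Simulates a status response arriving before and after bootstrap.
inline void DemoCallbackDuringBootstrap() {
  TabletPeerSketch peer;
  std::function<std::optional<std::uint64_t>()> status_received =
      [&peer] { return peer.SafeTimeForTransactionParticipant(); };

  // Callback fires during bootstrap: without the guard this would crash.
  assert(!status_received().has_value());

  peer.FinishBootstrap();
  assert(status_received() == std::optional<std::uint64_t>(100));
}
```

A guard like this only hides the symptom at one call site. It does not address why the callback can run during bootstrap in the first place.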

@yusong-yan
Contributor

The log message shows that TabletPeer::SafeTimeForTransactionParticipant() attempted to dereference a nullptr while this tablet was still in the process of WAL replay:

F0414 18:34:04.846114 22336 tablet_peer.cc:737] T b139e1312a774c7e9ea515b5c442e591 P 27213c803889415a8557c0c0dc7dfada [state=BOOTSTRAPPING]: table_ is null
    @     0x7f8ef0134b93  yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked()
    @     0x7f8ef0132e12  yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked()
    @     0x7f8ef0073c5b  yb::tablet::RunningTransaction::DoStatusReceived()
    @     0x7f8eee209372  yb::client::(anonymous namespace)::TransactionRpcBase::Finished()
    @     0x7f8eee209b50  std::__1::__function::__func<>::operator()()
    @     0x7f8eecd06cad  yb::rpc::OutboundCall::InvokeCallbackSync()
    @     0x7f8eecd65f2c  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x7f8eec600a6a  yb::Thread::SuperviseThread()
    @     0x7f8eeae901ca  start_thread
    @     0x7f8eeaafce73  __GI___clone

@agsh-yb
Contributor

agsh-yb commented Apr 17, 2024

Noticed a similar issue in one of the stress tests (test_ysql_bank_operations_with_materialized).

yusong-yan added a commit that referenced this issue Apr 18, 2024
…retained for CDC"

Summary:
D33131 introduced a segmentation fault that was identified in multiple tests.
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x00007f4d2b6f3a84 libpthread.so.0`__pthread_mutex_lock + 4
    frame #1: 0x000055d6d1e1190b yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:v170002](this=0x00007f4ccb6feaa0, __m=0x0000000000000110) at unique_lock.h:41:11
    frame #2: 0x000055d6d1e118f5 yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(this=0x00000000000000f0, min_allowed=<unavailable>, deadline=yb::CoarseTimePoint @ 0x00007f4ccb6feb08) const at mvcc.cc:500:32
    frame #3: 0x000055d6d1ef58e3 yb-tserver`yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked(this=0x000037e27d26fb00, min_running_notifier=0x00007f4ccb6fef28) at transaction_participant.cc:1537:45
    frame #4: 0x000055d6d1efc11a yb-tserver`yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked(this=0x000037e27d26fb00, id=<unavailable>, reason=<unavailable>, min_running_notifier=0x00007f4ccb6fef28, expected_deadlock_status=<unavailable>) at transaction_participant.cc:1516:5
    frame #5: 0x000055d6d1e3afbe yb-tserver`yb::tablet::RunningTransaction::DoStatusReceived(this=0x000037e2679b5218, status_tablet="d5922c26c9704f298d6812aff8f615f6", status=<unavailable>, response=<unavailable>, serial_no=56986, shared_self=std::__1::shared_ptr<yb::tablet::RunningTransaction>::element_type @ 0x000037e2679b5218) at running_transaction.cc:424:16
    frame #6: 0x000055d6d0d7db5f yb-tserver`yb::client::(anonymous namespace)::TransactionRpcBase::Finished(this=0x000037e29c80b420, status=<unavailable>) at transaction_rpc.cc:67:7
```
This diff reverts the change to unblock the tests.

The proper fix for this problem is WIP
Jira: DB-10780, DB-10466

Test Plan: Jenkins: urgent

Reviewers: rthallam

Reviewed By: rthallam

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D34245
yusong-yan added a commit that referenced this issue May 1, 2024
…ialization

Summary:
**Issue:**

An initialized `min_running_ht` in `TransactionParticipant` indicates that `TransactionLoader` has finished loading transactions.
Transaction loading starts during tablet bootstrap. However, transaction loading can finish before tablet bootstrap completes, which means `min_running_ht` could be initialized during bootstrap. This could lead to unexpected behavior when `TransactionParticipant::MinRunningHybridTime()` is called during bootstrap.
A recent code change, D33131, exposed this by introducing calls to `TransactionParticipant::MinRunningHybridTime()` in various functions that are likely to be executed during the bootstrap WAL replay process. Early `min_running_ht` initialization then triggers unexpected code execution and causes a segmentation fault. (See [[ #21877 | #21877 ]] for more details.)

**Fix:**

To address this, `min_running_ht` initialization now happens within the `LoadFinished` function, which guarantees:
* Successful Transaction Loading: At this point, all transactions for the tablet have been loaded successfully.
* Local Bootstrap Completion: Once `start_latch_.Wait()` completes, `TransactionParticipant::Start()` has been called, which ensures that the local bootstrap process has finished successfully.

This ensures that `min_running_ht` is initialized at a safer point in the startup process.
Jira: DB-11029

Test Plan:
Unit test WIP
To validate the effectiveness of the fix, we re-ran the stress tests that originally exposed issue [[ #21877 | #21877 ]]. The tests completed successfully with no segmentation faults observed.

Reviewers: esheng, sergei

Reviewed By: sergei

Subscribers: slingam, rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34389
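
The ordering described in the commit message above can be sketched as follows. This is a simplified model under stated assumptions, not the code from D34389: the class name, the use of a mutex plus condition variable in place of `start_latch_`, the plain integer standing in for a hybrid time, and the value 0 meaning "uninitialized" are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical model of the startup ordering; not the real TransactionParticipant.
class ParticipantStartupSketch {
 public:
  // Called once local bootstrap has finished (models TransactionParticipant::Start()).
  void Start() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      started_ = true;
    }
    cv_.notify_all();
  }

  // Called by the loader when all transactions are loaded. Blocks until
  // Start() has been called, then publishes min_running_ht.
  void LoadFinished(std::uint64_t loaded_min_running_ht) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return started_; });  // models start_latch_.Wait()
    min_running_ht_.store(loaded_min_running_ht);
  }

  // Returns 0 (assumed "uninitialized" marker) until LoadFinished publishes.
  std::uint64_t MinRunningHybridTime() const { return min_running_ht_.load(); }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool started_ = false;
  std::atomic<std::uint64_t> min_running_ht_{0};
};

// Loading finishes "early"; returns the value observed before Start().
inline std::uint64_t DemoEarlyLoad() {
  ParticipantStartupSketch participant;
  std::thread loader([&participant] { participant.LoadFinished(7); });

  // Bootstrap is still running here, so the value cannot be published yet.
  std::uint64_t during_bootstrap = participant.MinRunningHybridTime();

  participant.Start();
  loader.join();
  assert(participant.MinRunningHybridTime() == 7);
  return during_bootstrap;
}
```

Because the store happens only after the wait returns, a caller that reads `MinRunningHybridTime()` during bootstrap always sees the uninitialized value, however quickly loading completes.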
yusong-yan added a commit that referenced this issue May 3, 2024
…n_running_ht Initialization

Summary:
Original commit: 138b81a / D34389
**Issue:**

An initialized `min_running_ht` in `TransactionParticipant` indicates that `TransactionLoader` has finished loading transactions.
Transaction loading starts during tablet bootstrap. However, transaction loading can finish before tablet bootstrap completes, which means `min_running_ht` could be initialized during bootstrap. This could lead to unexpected behavior when `TransactionParticipant::MinRunningHybridTime()` is called during bootstrap.
A recent code change, D33131, exposed this by introducing calls to `TransactionParticipant::MinRunningHybridTime()` in various functions that are likely to be executed during the bootstrap WAL replay process. Early `min_running_ht` initialization then triggers unexpected code execution and causes a segmentation fault. (See [[ #21877 | #21877 ]] for more details.)

**Fix:**

To address this, `min_running_ht` initialization now happens within the `LoadFinished` function, which guarantees:
* Successful Transaction Loading: At this point, all transactions for the tablet have been loaded successfully.
* Local Bootstrap Completion: Once `start_latch_.Wait()` completes, `TransactionParticipant::Start()` has been called, which ensures that the local bootstrap process has finished successfully.

This ensures that `min_running_ht` is initialized at a safer point in the startup process.
Jira: DB-11029

Test Plan:
QLTransactionTest.TransactionsEarlyLoadedTest

To validate the effectiveness of the fix, we re-ran the stress tests that originally exposed issue [[ #21877 | #21877 ]]. The tests completed successfully with no segmentation faults observed.

Reviewers: esheng, sergei, rthallam

Reviewed By: esheng, rthallam

Subscribers: ybase, rthallam, slingam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34674
es1024 added a commit that referenced this issue May 6, 2024
…er intent SST files only retained for CDC""

Summary:
This reverts commit D34245 / 89316bd, which reverted
D33131 / fb7c86c due to a segmentation fault introduced due to
`min_running_ht` being initialized too early; this issue is now fixed with
D34389 / 138b81a.
Jira: DB-10466, DB-10780

Test Plan: Jenkins

Reviewers: yyan, sergei

Reviewed By: yyan

Subscribers: rthallam, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D34745
svarnau pushed a commit that referenced this issue May 25, 2024
…retained for CDC"

svarnau pushed a commit that referenced this issue May 25, 2024
…ialization

svarnau pushed a commit that referenced this issue May 25, 2024
…er intent SST files only retained for CDC""

ZhenYongFan pushed a commit to ZhenYongFan/yugabyte-db that referenced this issue Jun 15, 2024
…ant's min_running_ht Initialization

@rthallamko3 rthallamko3 reopened this Oct 8, 2024
@rthallamko3
Contributor

Reopening for 2.20 backport, as it resolves the situation in #24285

yusong-yan added a commit that referenced this issue Oct 24, 2024
…running_ht Initialization

Summary:
**Issue:**

Original commit: e414e3f / D34389
An initialized `min_running_ht` in `TransactionParticipant` indicates that `TransactionLoader` has finished loading transactions.
Transaction loading starts during tablet bootstrap. However, transaction loading can finish before tablet bootstrap completes, which means `min_running_ht` could be initialized during bootstrap. This could lead to unexpected behavior when `TransactionParticipant::MinRunningHybridTime()` is called during bootstrap.
A recent code change, D33131, exposed this by introducing calls to `TransactionParticipant::MinRunningHybridTime()` in various functions that are likely to be executed during the bootstrap WAL replay process. Early `min_running_ht` initialization then triggers unexpected code execution and causes a segmentation fault. (See [[ #21877 | #21877 ]] for more details.)

**Fix:**

To address this, `min_running_ht` initialization now happens within the `LoadFinished` function, which guarantees:
* Successful Transaction Loading: At this point, all transactions for the tablet have been loaded successfully.
* Local Bootstrap Completion: Once `start_latch_.Wait()` completes, `TransactionParticipant::Start()` has been called, which ensures that the local bootstrap process has finished successfully.

This ensures that `min_running_ht` is initialized at a safer point in the startup process.
Jira: DB-11029

Test Plan: To validate the effectiveness of the fix, we re-ran the stress tests that originally exposed issue [[ #21877 | #21877 ]]. The tests completed successfully with no segmentation faults observed.

Reviewers: esheng, sergei, rthallam

Reviewed By: rthallam

Subscribers: ybase, rthallam, slingam

Differential Revision: https://phorge.dev.yugabyte.com/D38931