[DocDB] Segmentation Fault in MvccManager::SafeTimeForFollower #21877
Comments
Per @shamanthchandra-yb, the suspect commit is fb7c86c. The crash happens after we stop and start a node on builds starting from b121, but the issue doesn't reproduce on builds b120 and below on master.
Encountered the same failure in a cross-DB DDLs test. GHI #21908
Noticed a similar issue in one of the stress tests (test_multi_volume_data_dir).
After adding more logging and rerunning the test, here is the probable root cause:
However, the log file shows that the tablet is in the local bootstrap WAL replay process, so the
My hypothesis on how fb7c86c introduced the segmentation fault:
The log message shows that TabletPeer
Noticed a similar issue in one of the stress tests (test_ysql_bank_operations_with_materialized).
…retained for CDC"
Summary: D33131 introduced a segmentation fault which was identified in multiple tests.
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x00007f4d2b6f3a84 libpthread.so.0`__pthread_mutex_lock + 4
    frame #1: 0x000055d6d1e1190b yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:v170002](this=0x00007f4ccb6feaa0, __m=0x0000000000000110) at unique_lock.h:41:11
    frame #2: 0x000055d6d1e118f5 yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(this=0x00000000000000f0, min_allowed=<unavailable>, deadline=yb::CoarseTimePoint @ 0x00007f4ccb6feb08) const at mvcc.cc:500:32
    frame #3: 0x000055d6d1ef58e3 yb-tserver`yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked(this=0x000037e27d26fb00, min_running_notifier=0x00007f4ccb6fef28) at transaction_participant.cc:1537:45
    frame #4: 0x000055d6d1efc11a yb-tserver`yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked(this=0x000037e27d26fb00, id=<unavailable>, reason=<unavailable>, min_running_notifier=0x00007f4ccb6fef28, expected_deadlock_status=<unavailable>) at transaction_participant.cc:1516:5
    frame #5: 0x000055d6d1e3afbe yb-tserver`yb::tablet::RunningTransaction::DoStatusReceived(this=0x000037e2679b5218, status_tablet="d5922c26c9704f298d6812aff8f615f6", status=<unavailable>, response=<unavailable>, serial_no=56986, shared_self=std::__1::shared_ptr<yb::tablet::RunningTransaction>::element_type @ 0x000037e2679b5218) at running_transaction.cc:424:16
    frame #6: 0x000055d6d0d7db5f yb-tserver`yb::client::(anonymous namespace)::TransactionRpcBase::Finished(this=0x000037e29c80b420, status=<unavailable>) at transaction_rpc.cc:67:7
```
This diff reverts the change to unblock the tests. The proper fix for this problem is WIP.
Jira: DB-10780, DB-10466
Test Plan: Jenkins: urgent
Reviewers: rthallam
Reviewed By: rthallam
Subscribers: ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D34245
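A quick reading of the two addresses in the trace, written out as arithmetic. The constants are copied from frames #1 and #2; the 4 KiB page size and the interpretation (that `MvccManager` was reached through a null or not-yet-initialized owning object) are assumptions, not something the trace proves.

```cpp
#include <cstdint>

// Addresses copied from the stack trace above.
constexpr std::uint64_t kMvccManagerThis = 0x00000000000000f0;  // frame #2, this=
constexpr std::uint64_t kMutexAddress = 0x0000000000000110;     // frame #1, __m=

// Assumption: 4 KiB pages, with the first page left unmapped by the OS.
constexpr std::uint64_t kAssumedPageSize = 4096;

// Distance from the MvccManager `this` pointer to the mutex it tried to lock.
constexpr std::uint64_t MutexOffsetInMvccManager() {
  return kMutexAddress - kMvccManagerThis;
}

// True when an address falls inside the assumed unmapped first page.
constexpr bool InFirstPage(std::uint64_t address) {
  return address < kAssumedPageSize;
}

static_assert(MutexOffsetInMvccManager() == 0x20, "mutex sits 32 bytes into the object");
static_assert(InFirstPage(kMvccManagerThis), "`this` is a near-null address");
static_assert(InFirstPage(kMutexAddress), "locking this mutex touches the first page");
```

Both addresses are far too small to be valid heap or stack pointers, which is consistent with `__pthread_mutex_lock` faulting on its first access to the mutex.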
…ialization
Summary:
**Issue:** An initialized `min_running_ht` in `TransactionParticipant` indicates that `TransactionLoader` has completed transaction loading. Transaction loading starts during tablet bootstrap. However, transaction loading can finish before tablet bootstrap completes, which means `min_running_ht` could be initialized during the bootstrap. This could lead to unexpected behavior when `TransactionParticipant::MinRunningHybridTime()` is called during bootstrap. A recent code change, D33131, exposed this by introducing calls to `TransactionParticipant::MinRunningHybridTime()` in various functions that are likely to be executed during the bootstrap WAL replay process. The early `min_running_ht` initialization triggered unexpected code execution and caused a segmentation fault. (See [[ #21877 | #21877 ]] for more details.)

**Fix:** To address this, `min_running_ht` initialization now happens within the `LoadFinished` function, which guarantees:
* Successful Transaction Loading: At this point, all transactions for the tablet have been loaded successfully.
* Local Bootstrap Completion: Once `start_latch_.Wait()` completes, `TransactionParticipant::Start()` has been called. This ensures the local bootstrap process has finished successfully.

This ensures `min_running_ht` is initialized at a safer point in the startup process.
Jira: DB-11029
Test Plan: Unit test WIP. To validate the effectiveness of the fix, we re-ran the stress tests that originally exposed issue [[ #21877 | #21877 ]]. The tests completed successfully with no segmentation faults observed.
Reviewers: esheng, sergei
Reviewed By: sergei
Subscribers: slingam, rthallam, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D34389
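The ordering described in this fix can be sketched as follows. This is a minimal illustration, not the actual YugabyteDB code: `OneShotLatch`, `ParticipantSketch`, `RunDemo`, and the use of a plain integer for the hybrid time are all hypothetical. Only the names `LoadFinished`, `Start`, `start_latch_`, `min_running_ht`, and `MinRunningHybridTime` come from the commit message.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <optional>
#include <thread>

// Hypothetical one-shot latch standing in for start_latch_.
class OneShotLatch {
 public:
  void CountDown() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      released_ = true;
    }
    cv_.notify_all();
  }

  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return released_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool released_ = false;
};

// Hypothetical, heavily simplified stand-in for TransactionParticipant.
class ParticipantSketch {
 public:
  // Called once local bootstrap has finished.
  void Start() { start_latch_.CountDown(); }

  // Called by the loader thread after all transactions are loaded.
  // The value is published only after Start() has been called.
  void LoadFinished(std::uint64_t loaded_min_ht) {
    start_latch_.Wait();
    std::lock_guard<std::mutex> lock(mutex_);
    assert(!min_running_ht_.has_value());  // initialized exactly once
    min_running_ht_ = loaded_min_ht;
  }

  // Empty while the participant is still bootstrapping or loading.
  std::optional<std::uint64_t> MinRunningHybridTime() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return min_running_ht_;
  }

 private:
  OneShotLatch start_latch_;
  mutable std::mutex mutex_;
  std::optional<std::uint64_t> min_running_ht_;
};

struct DemoResult {
  bool initialized_before_start;
  std::optional<std::uint64_t> value_after_start;
};

// Simulates loading that finishes before bootstrap completes.
DemoResult RunDemo() {
  ParticipantSketch participant;
  std::thread loader([&participant] { participant.LoadFinished(42); });
  // Start() has not been called yet, so the loader cannot have published.
  const bool before = participant.MinRunningHybridTime().has_value();
  participant.Start();
  loader.join();
  return {before, participant.MinRunningHybridTime()};
}
```

In this sketch, a caller that runs during bootstrap sees an empty `MinRunningHybridTime()` even if loading has already finished, because the loader thread is parked on the latch until `Start()` is called.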
…n_running_ht Initialization
Summary:
Original commit: 138b81a / D34389
**Issue:** An initialized `min_running_ht` in `TransactionParticipant` indicates that `TransactionLoader` has completed transaction loading. Transaction loading starts during tablet bootstrap. However, transaction loading can finish before tablet bootstrap completes, which means `min_running_ht` could be initialized during the bootstrap. This could lead to unexpected behavior when `TransactionParticipant::MinRunningHybridTime()` is called during bootstrap. A recent code change, D33131, exposed this by introducing calls to `TransactionParticipant::MinRunningHybridTime()` in various functions that are likely to be executed during the bootstrap WAL replay process. The early `min_running_ht` initialization triggered unexpected code execution and caused a segmentation fault. (See [[ #21877 | #21877 ]] for more details.)

**Fix:** To address this, `min_running_ht` initialization now happens within the `LoadFinished` function, which guarantees:
* Successful Transaction Loading: At this point, all transactions for the tablet have been loaded successfully.
* Local Bootstrap Completion: Once `start_latch_.Wait()` completes, `TransactionParticipant::Start()` has been called. This ensures the local bootstrap process has finished successfully.

This ensures `min_running_ht` is initialized at a safer point in the startup process.
Jira: DB-11029
Test Plan: QLTransactionTest.TransactionsEarlyLoadedTest. To validate the effectiveness of the fix, we re-ran the stress tests that originally exposed issue [[ #21877 | #21877 ]]. The tests completed successfully with no segmentation faults observed.
Reviewers: esheng, sergei, rthallam
Reviewed By: esheng, rthallam
Subscribers: ybase, rthallam, slingam
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D34674
…er intent SST files only retained for CDC""
Summary: This reverts commit D34245 / 89316bd, which reverted D33131 / fb7c86c because of a segmentation fault caused by `min_running_ht` being initialized too early; this issue is now fixed by D34389 / 138b81a.
Jira: DB-10466, DB-10780
Test Plan: Jenkins
Reviewers: yyan, sergei
Reviewed By: yyan
Subscribers: rthallam, ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D34745
Reopening for 2.20 backport, as it resolves the situation in #24285.
…running_ht Initialization
Summary:
Original commit: e414e3f / D34389
**Issue:** An initialized `min_running_ht` in `TransactionParticipant` indicates that `TransactionLoader` has completed transaction loading. Transaction loading starts during tablet bootstrap. However, transaction loading can finish before tablet bootstrap completes, which means `min_running_ht` could be initialized during the bootstrap. This could lead to unexpected behavior when `TransactionParticipant::MinRunningHybridTime()` is called during bootstrap. A recent code change, D33131, exposed this by introducing calls to `TransactionParticipant::MinRunningHybridTime()` in various functions that are likely to be executed during the bootstrap WAL replay process. The early `min_running_ht` initialization triggered unexpected code execution and caused a segmentation fault. (See [[ #21877 | #21877 ]] for more details.)

**Fix:** To address this, `min_running_ht` initialization now happens within the `LoadFinished` function, which guarantees:
* Successful Transaction Loading: At this point, all transactions for the tablet have been loaded successfully.
* Local Bootstrap Completion: Once `start_latch_.Wait()` completes, `TransactionParticipant::Start()` has been called. This ensures the local bootstrap process has finished successfully.

This ensures `min_running_ht` is initialized at a safer point in the startup process.
Jira: DB-11029
Test Plan: To validate the effectiveness of the fix, we re-ran the stress tests that originally exposed issue [[ #21877 | #21877 ]]. The tests completed successfully with no segmentation faults observed.
Reviewers: esheng, sergei, rthallam
Reviewed By: rthallam
Subscribers: ybase, rthallam, slingam
Differential Revision: https://phorge.dev.yugabyte.com/D38931
Jira Link: DB-10780
Description
Tried on 2.23.0.0-b132. This is a CDC case, and it looks like a regression. Here is the stack trace:
Issue Type
kind/bug