[DocDB] GeoTransactionsPromotionTest.TestPromotionReturningToAbortedState flaky test #24317

Closed
1 task done
es1024 opened this issue Oct 8, 2024 · 0 comments

es1024 commented Oct 8, 2024

Jira Link: DB-13207

Description

GeoTransactionsPromotionTest.TestPromotionReturningToAbortedState rarely fails with the following stack:

F20241007 19:23:50 ../../src/yb/rpc/rpc.cc:339] Check failed: calls_.empty() Calls: [0x000035bd35b49d60 -> AbortTransaction: tablet_id: "36df9b80658448848075dd10894f489c" transaction_id: "`\2051\'S&K\333\267[<\276\373P\302\207" propagated_hybrid_time: 7079338864676978688, retrier: { task_id: -1 state: kFinished deadline: 314126.220s }]
*** Check failure stack trace: ***
    @     0x7fa6097965c0  google::LogMessage::SendToLog()
    @     0x7fa609796c00  google::LogMessage::Flush()
    @     0x7fa609799979  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fa60990b545  yb::rpc::Rpcs::Shutdown()
    @     0x7fa60b40216d  yb::client::TransactionManager::Impl::~Impl()
    @     0x7fa60b3fdfad  yb::client::TransactionManager::~TransactionManager()
    @     0x7fa60cfe6f6e  yb::tserver::DbServerBase::~DbServerBase()
    @     0x7fa60d0b702e  yb::tserver::TabletServer::~TabletServer()
    @     0x7fa60e3e3393  yb::tserver::MiniTabletServer::Shutdown()
    @     0x7fa60e561a05  yb::MiniCluster::Shutdown()
    @     0x7fa60e595aca  yb::YBMiniClusterTestBase<>::DoTearDown()
    @     0x7fa60de00c1d  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x7fa60dde78c7  testing::TestInfo::Run()
    @     0x7fa60dde8575  testing::TestSuite::Run()
    @     0x7fa60ddf7e4e  testing::internal::UnitTestImpl::RunAllTests()
    @     0x7fa60de0190d  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x7fa60ddf795f  testing::UnitTest::Run()
    @     0x7fa60de81177  main
    @     0x7fa607a29d90  (unknown)
    @     0x7fa607a29e40  __libc_start_main
    @     0x55f94fc7e325  _start

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
es1024 added the area/docdb (YugabyteDB core features) and status/awaiting-triage (Issue awaiting triage) labels Oct 8, 2024
es1024 self-assigned this Oct 8, 2024
yugabyte-ci added the kind/bug (This issue is a bug) and priority/medium (Medium priority issue) labels Oct 8, 2024
yugabyte-ci added the priority/high (High Priority) label and removed the status/awaiting-triage and priority/medium labels Oct 8, 2024
es1024 added a commit that referenced this issue Oct 9, 2024
…anup

Summary:
There is a race condition between the commit/abort path and old transaction cleanup for
promoted transactions: the commit/abort path observes that old transaction cleanup is still
ongoing and decides to set `cleanup_waiter_`, but the cleanup can finish after that check and
before `cleanup_waiter_` is actually set, so the waiter is never called.
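
A minimal sketch of this kind of lost-notification race and one common way to close it, for illustration only. It is not the change made in D38789: the class `PromotedTxnState`, the methods `WaitForCleanup` and `CleanupFinished`, and the `cleanup_done_` flag are hypothetical names, and only `cleanup_waiter_` is taken from the description above. The sketch assumes the check for "cleanup still ongoing" and the store into `cleanup_waiter_` can be placed under one mutex, so cleanup cannot complete between them.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for per-transaction state; not YugabyteDB code.
class PromotedTxnState {
 public:
  using Waiter = std::function<void()>;

  // Commit/abort path: run `waiter` once old transaction cleanup is done.
  // The "is cleanup done?" check and the store into cleanup_waiter_ happen
  // under the same lock, so cleanup cannot finish in between them.
  void WaitForCleanup(Waiter waiter) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!cleanup_done_) {
        assert(!cleanup_waiter_);  // This sketch supports a single waiter.
        cleanup_waiter_ = std::move(waiter);
        return;
      }
    }
    // Cleanup already finished: invoke immediately, outside the lock.
    waiter();
  }

  // Cleanup path: mark cleanup as done and take the waiter (if any) under
  // the same lock, then invoke it outside the lock.
  void CleanupFinished() {
    Waiter waiter;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      cleanup_done_ = true;
      waiter.swap(cleanup_waiter_);
    }
    if (waiter) {
      waiter();
    }
  }

 private:
  std::mutex mutex_;
  bool cleanup_done_ = false;  // Hypothetical flag guarded by mutex_.
  Waiter cleanup_waiter_;      // Guarded by mutex_.
};
```

With this shape, whichever path takes the lock second sees the other's update: either cleanup finds the stored waiter and calls it, or the commit/abort path sees that cleanup is done and calls the waiter itself.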

This is the cause of occasional failures of
GeoTransactionsPromotionTest.TestPromotionReturningToAbortedState and
GeoTransactionsPromotionRF1Test.TestTwoTabletPromotionFailure with the following stack:

```
F20241007 19:23:50 ../../src/yb/rpc/rpc.cc:339] Check failed: calls_.empty() Calls: [0x000035bd35b49d60 -> AbortTransaction: tablet_id: "36df9b80658448848075dd10894f489c" transaction_id: "`\2051\'S&K\333\267[<\276\373P\302\207" propagated_hybrid_time: 7079338864676978688, retrier: { task_id: -1 state: kFinished deadline: 314126.220s }]
*** Check failure stack trace: ***
    @     0x7fa6097965c0  google::LogMessage::SendToLog()
    @     0x7fa609796c00  google::LogMessage::Flush()
    @     0x7fa609799979  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fa60990b545  yb::rpc::Rpcs::Shutdown()
    @     0x7fa60b40216d  yb::client::TransactionManager::Impl::~Impl()
    @     0x7fa60b3fdfad  yb::client::TransactionManager::~TransactionManager()
    @     0x7fa60cfe6f6e  yb::tserver::DbServerBase::~DbServerBase()
    @     0x7fa60d0b702e  yb::tserver::TabletServer::~TabletServer()
    @     0x7fa60e3e3393  yb::tserver::MiniTabletServer::Shutdown()
    @     0x7fa60e561a05  yb::MiniCluster::Shutdown()
    @     0x7fa60e595aca  yb::YBMiniClusterTestBase<>::DoTearDown()
    @     0x7fa60de00c1d  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x7fa60dde78c7  testing::TestInfo::Run()
    @     0x7fa60dde8575  testing::TestSuite::Run()
    @     0x7fa60ddf7e4e  testing::internal::UnitTestImpl::RunAllTests()
    @     0x7fa60de0190d  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x7fa60ddf795f  testing::UnitTest::Run()
    @     0x7fa60de81177  main
    @     0x7fa607a29d90  (unknown)
    @     0x7fa607a29e40  __libc_start_main
    @     0x55f94fc7e325  _start
```
Jira: DB-13207

Test Plan:
Jenkins. Also ran GeoTransactionsPromotionTest.TestPromotionReturningToAbortedState 500x
and ensured above stack did not appear.

Reviewers: sergei

Reviewed By: sergei

Subscribers: rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38789
es1024 added a commit that referenced this issue Oct 11, 2024
…d transaction cleanup

Summary:
Original commit: f51e54d / D38789
Jira: DB-13207

Test Plan:
Jenkins. Also ran GeoTransactionsPromotionTest.TestPromotionReturningToAbortedState 500x
and ensured above stack did not appear.

Reviewers: sergei, rthallam

Reviewed By: rthallam

Subscribers: ybase, rthallam

Differential Revision: https://phorge.dev.yugabyte.com/D38892
es1024 added a commit that referenced this issue Oct 11, 2024
…d transaction cleanup

Summary:
Original commit: f51e54d / D38789
Jira: DB-13207

Test Plan:
Jenkins. Also ran GeoTransactionsPromotionTest.TestPromotionReturningToAbortedState 500x
and ensured above stack did not appear.

Reviewers: sergei, rthallam

Reviewed By: rthallam

Subscribers: rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38891
es1024 added a commit that referenced this issue Oct 12, 2024
…transaction cleanup

Summary:
Original commit: f51e54d / D38789
Jira: DB-13207

Test Plan:
Jenkins. Also ran GeoTransactionsPromotionTest.TestPromotionReturningToAbortedState 500x
and ensured above stack did not appear.

Reviewers: sergei, rthallam

Reviewed By: rthallam

Subscribers: ybase, rthallam

Differential Revision: https://phorge.dev.yugabyte.com/D38890
es1024 reopened this Nov 8, 2024
es1024 added a commit that referenced this issue Nov 8, 2024
…old transaction cleanup

Summary:
Jira: DB-13207

Original commit: f51e54d / D38789

Test Plan:
Jenkins. Also ran GeoTransactionsPromotionTest.TestPromotionReturningToAbortedState 500x
and ensured above stack did not appear.

Reviewers: sergei

Reviewed By: sergei

Subscribers: ybase, rthallam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D39830
es1024 closed this as completed Nov 8, 2024