-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
spanconfigreconciler: use fixed timestamp when reading descriptors #140400
Conversation
tagging @arulajmani who built this system and who may have opinions on this change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @msbutler, and @yuzefovich)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this patch is papering over the contention issue smells. The reconciler is running a read-only transaction, so it forms the read half of a read-write conflict. Typically, the write transaction should be able to commit by refreshing to a higher timestamp. If it isn't able to, it's because the read-refresh of its read set is failing because a (third) transaction has performed an overlapping write. So this patch seems like a brittle bandaid to me.
I call it a brittle bandaid because the patch is removing one source of schema change transactions getting their write timestamp bumped. However, there's other ways this can happen -- e.g. the closed timestamp interval, another read of these tables, etc. When such contention issues come up during customer escalations, our recommendation is to generally dig into why the read-refresh is failing. That's because this battle of "ensure write timestamps are never bumped" is fraught. To that end, we should:
- Figure out if the schema change transaction has a needlessly large read set.
- Figure out which third transaction is causing the schema change transaction's read-refresh to fail.
- Figure out whether the schema change transaction can acquire locks to prevent its read-refreshes from failing.
Reviewed 1 of 2 files at r1, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @msbutler, @rafiss, and @yuzefovich)
-- commits
line 7 at r1:
The spanconfig reconciler is not writing to any of these tables, is it? What locks is it acquiring?
-- commits
line 17 at r1:
Calling this "more correct" implies that the previous behaviour was wrong -- that seems like a mischaracterization. The only guarantee we need here that reconciliation should happen at a consistent read snapshot, which was the case before.
pkg/spanconfig/spanconfigreconciler/reconciler.go
line 530 at r1 (raw file):
// system.span_count. The spanconfig reconciler needs to read these // objects also. A fixed timestamp will avoid contention, and is also // more correct since this job should read descriptors using the same
We should drop this "more correct" verbiage -- see my comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @msbutler, and @yuzefovich)
Previously, arulajmani (Arul Ajmani) wrote…
The spanconfig reconciler is not writing to any of these tables, is it? What locks is it acquiring?
We can take it into this thread: https://cockroachlabs.slack.com/archives/C04N0AS14CT/p1738595046864469
Or let's sync up in a huddle so I can show you in person.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @msbutler, and @yuzefovich)
Previously, arulajmani (Arul Ajmani) wrote…
Calling this "more correct" implies that the previous behaviour was wrong -- that seems like a mischaracterization. The only guarantee we need here that reconciliation should happen at a consistent read snapshot, which was the case before.
Why is not incorrect to read the descriptors at a larger timestamp than the checkpoint we receive from the rangefeed. Does using a larger timestamp not mean that the descriptors could look different from how they did at the time they were checkpointed in the rangefeed?
I'll post my investigation here for greater visibility. To try to debug why the transaction retry errors were occurring, I incrementally added logs throughout spanconfig/KV code. I ended up with this diff:
I also removed the override of autocommit_before_ddl in logic tests:
I can observe a failure reliably by running this test. I used these instructions to run the test on EngFlow.
Here's what I found. The test fails with:
I looked in the CRDB logs that I added for
This shows us that the logic test statement (running in transaction cockroach/pkg/spanconfig/spanconfiglimiter/limiter.go Lines 66 to 77 in c68c559
The key referenced in the error is cockroach/pkg/kv/kvserver/replica_tscache.go Lines 654 to 659 in 780cc19
The other transaction in this case is
And most importantly, this shows the creator of the transaction is the AUTO SPAN CONFIG RECONCILIATION job.
cockroach/pkg/spanconfig/spanconfigreconciler/reconciler.go Lines 520 to 522 in 264f443
It doesn't seem like there's any evidence of a third transaction above.
How can we make it do that? Should we use |
Recently when we changed the default value of autocommit_before_ddl to true, we found that the chance of hitting a retry error while running schema changes dramatically increased. The reason was because the backgound span reconciler would need locks for the same keys that were being modified by the schema change job itself -- most notably, the descriptor table and descriptor ID sequence. This patch addresses the issue by making the spanconfig reconciler use the checkpoint timestamp from the rangefeed as the fixed timestamp for the transaction that reads the descriptors whose spans are being reconciled. This defensive measure helps us avoid any possibility of contention caused by this background job's transaction. This allows us to re-enable the autocommit setting for logictests that run in multitenancy. Release note: None
264f443
to
5c5ffc5
Compare
I met with Arul and we concluded that the most likely explanation for the behavior exhibited by these logs is overload. I tried bumping up the CPU allocation to this test up to 3. Right now it's 2. cockroach/pkg/ccl/logictestccl/tests/3node-tenant/BUILD.bazel Lines 13 to 15 in f3459f3
Unfortunately the same flakes still occurred. I'd like to go forward with this change. I removed the language around it being more correct, and clarified that this change is purely a defensive programming measure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into this and posting the detailed analysis above. It was a breeze to engage with!
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @msbutler, @rafiss, and @yuzefovich)
TFTRs! bors r+ |
Recently when we changed the default value of autocommit_before_ddl to
true, we found that the chance of hitting a retry error while running
schema changes dramatically increased. The reason was because the
backgound span reconciler would need locks for the same keys that were
being modified by the schema change job itself -- most notably, the
descriptor table and descriptor ID sequence.
This patch addresses the issue by making the spanconfig reconciler use
the checkpoint timestamp from the rangefeed as the fixed timestamp for
the transaction that reads the descriptors whose spans are being
reconciled. This defensive measure helps us avoid any possibility of
contention caused by this background job's transaction. This allows
us to re-enable the autocommit setting for logictests that run in
multitenancy.
fixes #140172
Release note: None