ddl: Fix issue with concurrent update getting reverted by BackfillData #58229

mjonss · 2024-12-13T00:20:49Z

What problem does this PR solve?

Issue Number: close #58226, close #58692

Problem Summary:
A concurrency test showed that when REORGANIZE PARTITION is copying non-clustered table rows in batches, an update that happens during a batch to a row included in that batch is overwritten by the batch with the values the batch originally read.

What changed and how does it work?

Reverted the use of table.AddRecord() for non-clustered tables and added the old row to the batch transaction, so the transaction fails if the old row has been touched. If a row has already been copied or double-written and has not been touched since, the copy skips it.
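
A minimal sketch of the idea, using a toy in-memory store rather than TiDB's actual kv.Transaction API: the batch records the version of each old row key it reads, and the commit is rejected if any of those keys changed in the meantime. The store, batchTxn, and errConflict names and the old/1 and new/1 keys are hypothetical and exist only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a toy versioned key-value store; it is NOT TiDB's kv layer.
type store struct {
	data    map[string]string
	version map[string]int // bumped on every committed write to a key
}

var errConflict = errors.New("write conflict: key changed since it was read")

// batchTxn is a hypothetical optimistic transaction for one backfill batch.
type batchTxn struct {
	s        *store
	readVers map[string]int    // version of each old key when the batch read it
	writes   map[string]string // buffered writes to new-partition keys
}

func (s *store) begin() *batchTxn {
	return &batchTxn{s: s, readVers: map[string]int{}, writes: map[string]string{}}
}

// read returns the value and remembers the key so commit can detect changes.
func (t *batchTxn) read(key string) string {
	t.readVers[key] = t.s.version[key]
	return t.s.data[key]
}

func (t *batchTxn) set(key, val string) { t.writes[key] = val }

// commit fails if any key the batch read was modified after the read.
func (t *batchTxn) commit() error {
	for k, v := range t.readVers {
		if t.s.version[k] != v {
			return errConflict
		}
	}
	for k, v := range t.writes {
		t.s.data[k] = v
		t.s.version[k]++
	}
	return nil
}

// put models a committed DML write.
func (s *store) put(key, val string) {
	s.data[key] = val
	s.version[key]++
}

func main() {
	s := &store{data: map[string]string{"old/1": "a"}, version: map[string]int{}}

	// The backfill batch reads the old row and buffers the copy.
	txn := s.begin()
	txn.set("new/1", txn.read("old/1"))

	// A concurrent UPDATE double-writes the row before the batch commits.
	s.put("old/1", "b")
	s.put("new/1", "b")

	// The batch is rejected, so the concurrent update is not overwritten.
	fmt.Println(txn.commit(), s.data["new/1"])
}
```

Without the old key in the transaction's read set, the commit in this toy model would succeed and write the stale value "a" over "b", which is the failure mode the problem summary describes.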

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

For non-clustered tables during REORGANIZE PARTITION data copying/backfill, if a row is updated at the same time the reorg is copying that row in a batch, the copy could overwrite the update with the row's state from before the update.

tiprow · 2024-12-13T00:21:04Z

Hi @mjonss. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no suggestions.

codecov · 2024-12-13T00:41:58Z

Codecov Report

Attention: Patch coverage is 88.52459% with 7 lines in your changes missing coverage. Please review.

Project coverage is 77.5686%. Comparing base (0be1983) to head (fc27c67).
Report is 149 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #58229        +/-   ##
================================================
+ Coverage   73.1936%   77.5686%   +4.3750%     
================================================
  Files          1681       1730        +49     
  Lines        463050     503461     +40411     
================================================
+ Hits         338923     390528     +51605     
+ Misses       103344      90685     -12659     
- Partials      20783      22248      +1465

Flag	Coverage Δ
integration	`51.9920% <85.2459%> (?)`
unit	`74.8839% <88.5245%> (+2.5641%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`53.0100% <ø> (+0.3190%)`	⬆️
parser	`∅ <ø> (∅)`
br	`64.5621% <ø> (+18.5421%)`	⬆️

mjonss · 2024-12-13T01:19:20Z

/retest

tiprow · 2024-12-13T01:19:41Z

@mjonss: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Defined2014

Is it true that the clustered index will not be duplicated during reorg, because the primary key check already ensures this?

Defined2014 · 2024-12-17T01:59:10Z

pkg/ddl/partition.go

+					return errors.Trace(err)
+				}
+				// Also don't actually write it :)
+				err = txn.Delete(w.oldKeys[i])


Does this duplicate L3878? Maybe keeping the assertion is enough.

I'm not sure how to force the transaction to fail only if a key already exists, regardless of settings like tidb_txn_assertion_level. By involving the key in the transaction, it will fail if another concurrent transaction has modified the key (for example, inserted it because of an UPDATE). SetAssertion() alone is not enough. We could use only txn.Delete() or only txn.Set(); I'm not sure whether Set+Assert+Delete is better. What do you think?

I don't quite understand this code. Why does SetAssertNotExist succeed when txn.Set has already been called on the key?

I would expect that only lock calls are directly checked/forwarded to the KV store, while SetAssertion/Set/Delete is only applied during Commit.

If the goal is simply to prevent the row_key from being modified, using txn.LockKeys to apply a pessimistic lock on the corresponding row is the appropriate approach.

The purpose of Assertion is fundamentally different:

It is designed to validate invariants or constraints to ensure correctness is not violated.

It is not intended for concurrency control or preventing concurrent modifications.

Using LockKeys in a pessimistic transaction explicitly handles concurrency by preventing conflicting writes, while assertions act as safeguards to check assumptions about the state after operations are performed.

For the use of internal transactions during the execution of DDL tasks and their concurrency control with DML, it is recommended to consult DDL-related colleagues to confirm whether the logic complies with DDL constraint requirements.

There should be similar code references for regular DDL backfill.
/cc @wjhuang2016 @tangenta

The difference from regular DDL is that it uses the version/lock on one key to control another, unrelated key. That looks strange here.

After changing to txn.LockKeys(), it is now the same as what (*addIndexTxnWorker) BackfillData() does here.
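
To illustrate the distinction drawn in this thread, here is a toy sketch (not TiDB's txn.LockKeys or SetAssertion implementation): a lock table provides concurrency control by refusing a second owner for a key, while an assertion only validates an expectation about the data. The lockTable, lockKeys, unlockAll, and assertNotExist names are hypothetical, and the sketch returns an error immediately where a real pessimistic lock would typically wait.

```go
package main

import (
	"errors"
	"fmt"
)

// lockTable is a toy pessimistic lock manager: key -> ID of the owning txn.
type lockTable struct {
	owner map[string]int
}

var errLocked = errors.New("key is locked by another transaction")

// lockKeys acquires every key for txnID, or none of them.
func (lt *lockTable) lockKeys(txnID int, keys ...string) error {
	for _, k := range keys {
		if o, held := lt.owner[k]; held && o != txnID {
			return errLocked
		}
	}
	for _, k := range keys {
		lt.owner[k] = txnID
	}
	return nil
}

// unlockAll releases every lock held by txnID (commit or rollback).
func (lt *lockTable) unlockAll(txnID int) {
	for k, o := range lt.owner {
		if o == txnID {
			delete(lt.owner, k)
		}
	}
}

// assertNotExist checks an invariant; it does not stop concurrent writers.
func assertNotExist(data map[string]string, key string) error {
	if _, ok := data[key]; ok {
		return fmt.Errorf("assertion failed: %q already exists", key)
	}
	return nil
}

func main() {
	lt := &lockTable{owner: map[string]int{}}
	data := map[string]string{}

	// The backfill worker (txn 1) locks the old row key before copying it.
	fmt.Println(lt.lockKeys(1, "old/1"))
	// A concurrent DML (txn 2) cannot take the same key until txn 1 finishes.
	fmt.Println(lt.lockKeys(2, "old/1"))
	// The assertion only validates an expectation about the new key.
	fmt.Println(assertNotExist(data, "new/1"))

	lt.unlockAll(1)
	fmt.Println(lt.lockKeys(2, "old/1"))
}
```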

Defined2014 · 2024-12-17T02:02:44Z

pkg/ddl/partition.go

 				}
+
+				// tablecodec.prefixLen is not exported, but is just TableSplitKeyLen + 2


I don't get the point of this comment. Why +2 here?

TableSplitKeyLen is t_<encoded tableID> only, and we want to include the r_ as well, so that is where the +2 comes from.

Use len(recordPrefixSep) instead of 2? Or prefixLen in tablecodec.go

Hmm, all names that start with a lower-case letter are not exported :(
I can change those names, but then the PR grows a bit with unrelated changes...
Is it OK if I create a follow-up issue+PR for exporting PrefixLen/RecordPrefixSepLength later?
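
A small sketch of the arithmetic discussed in this thread, assuming a record key is laid out as a table prefix, an 8-byte encoded table ID, and a two-byte record separator. The constants below are local stand-ins that mirror the unexported names mentioned above, not imports from tablecodec, and the table ID is encoded as plain big-endian here for simplicity. recordPrefix and moveToPartition are hypothetical helpers.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	tablePrefix      = "t"  // assumed one-byte table prefix
	recordPrefixSep  = "_r" // assumed two-byte record separator
	idLen            = 8    // encoded table ID length
	tableSplitKeyLen = len(tablePrefix) + idLen
	// recordPrefixLen is the "TableSplitKeyLen + 2" from the code comment.
	recordPrefixLen = tableSplitKeyLen + len(recordPrefixSep)
)

// recordPrefix builds the prefix shared by all record keys of one table
// (or one partition, which has its own physical table ID).
func recordPrefix(tableID int64) []byte {
	var id [idLen]byte
	binary.BigEndian.PutUint64(id[:], uint64(tableID))
	buf := append([]byte(tablePrefix), id[:]...)
	return append(buf, recordPrefixSep...)
}

// moveToPartition keeps the handle part of oldKey and swaps in the
// record prefix of the destination partition.
func moveToPartition(oldKey []byte, newPartID int64) []byte {
	return append(recordPrefix(newPartID), oldKey[recordPrefixLen:]...)
}

func main() {
	oldKey := append(recordPrefix(100), []byte("handle-7")...)
	newKey := moveToPartition(oldKey, 200)
	fmt.Println(recordPrefixLen, len(oldKey) == len(newKey))
}
```

Deriving the length from len(recordPrefixSep), as suggested in the review, avoids the bare 2 while keeping the change local.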

mjonss · 2024-12-17T11:16:11Z

/retest

tiprow · 2024-12-17T11:16:34Z

@mjonss: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

bb7133

LGTM

ti-chi-bot · 2024-12-18T02:37:10Z

[LGTM Timeline notifier]

Timeline:

2024-12-17 12:18:24.801197026 +0000 UTC m=+959294.889999566: ☑️ agreed by Defined2014.
2024-12-18 02:37:08.93876772 +0000 UTC m=+1010819.027570262: ☑️ agreed by bb7133.

bb7133 · 2024-12-18T03:44:05Z

/retest

tiprow · 2024-12-18T03:50:25Z

@mjonss: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
fast_test_tiprow	`1f87051`	link	true	`/test fast_test_tiprow`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

mjonss · 2024-12-18T17:24:27Z

/retest

tiprow · 2024-12-18T17:24:50Z

@mjonss: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

ti-chi-bot · 2025-01-09T08:12:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bb7133, Defined2014, tangenta

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/ddl/OWNERS~~ [tangenta]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Defined2014 · 2025-01-09T09:30:45Z

/retest

tiprow · 2025-01-09T09:31:08Z

@Defined2014: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

ti-chi-bot · 2025-01-09T10:13:34Z

In response to a cherrypick label: new pull request created to branch release-8.5: #58834.

Change from table.AddRecord, to txn.Set, with check for concurrent tr…

afeae7f

…ansactions

ti-chi-bot bot added the do-not-merge/needs-triage-completed and release-note (a PR that will be considered when generating release notes) labels Dec 13, 2024

mjonss requested a review from Copilot December 13, 2024 00:20

ti-chi-bot bot added the size/XL label (a PR that changes 500-999 lines, ignoring generated files) Dec 13, 2024

Copilot AI reviewed Dec 13, 2024

ti-chi-bot bot removed the do-not-merge/needs-triage-completed label Dec 16, 2024

mjonss requested review from Defined2014 and tangenta December 16, 2024 16:25

Defined2014 reviewed Dec 17, 2024

tangenta reviewed Dec 17, 2024

pkg/ddl/partition.go (outdated, resolved)

Moved the AssertNotExists before txn.Set

73d75b2

Defined2014 reviewed Dec 17, 2024

pkg/ddl/partition.go (resolved)

ti-chi-bot bot requested review from tangenta and wjhuang2016 December 17, 2024 11:48

Using LockKeys instead.

1f87051

Defined2014 approved these changes Dec 17, 2024

ti-chi-bot bot added the needs-1-more-lgtm label Dec 17, 2024

bb7133 approved these changes Dec 18, 2024

ti-chi-bot bot added the lgtm label and removed the needs-1-more-lgtm label Dec 18, 2024

mjonss added 2 commits December 18, 2024 09:31

Added a comment.

75fc1b4

Merge remote-tracking branch 'pingcap/master' into non-cluster-reorg-…

fc27c67

…part-backfill-dml-58226

ti-chi-bot bot added the needs-cherry-pick-release-8.5 label Jan 8, 2025

tangenta approved these changes Jan 9, 2025

ti-chi-bot bot added approved do-not-merge/needs-triage-completed and removed do-not-merge/needs-triage-completed labels Jan 9, 2025

ti-chi-bot bot merged commit b22555b into pingcap:master Jan 9, 2025
25 checks passed

ti-chi-bot mentioned this pull request Jan 9, 2025

ddl: Fix issue with concurrent update getting reverted by BackfillData (#58229) #58834

Merged

13 tasks

mjonss deleted the non-cluster-reorg-part-backfill-dml-58226 branch January 9, 2025 16:23

ti-chi-bot bot pushed a commit that referenced this pull request Jan 10, 2025

ddl: Fix issue with concurrent update getting reverted by BackfillData (

a6baaad

#58229) (#58834) close #58226, close #58692

Review comments on pkg/ddl/partition.go were left by Defined2014, mjonss, and cfzjywxk between Dec 17 and Dec 18, 2024.
