ddl: Fix issue with concurrent update getting reverted by BackfillData #58229

Merged

Conversation

mjonss
Contributor

@mjonss mjonss commented Dec 13, 2024

What problem does this PR solve?

Issue Number: close #58226, close #58692

Problem Summary:
A concurrency test showed that when REORGANIZE PARTITION copies non-clustered table rows in batches, an update to a row that is included in an in-flight batch gets overwritten with the values the batch originally read.

What changed and how does it work?

Reverted the use of table.AddRecord() for non-clustered tables and added the old row into the batch transaction, so that the transaction fails if the old row has been touched. (If a row has already been copied/double-written and has not been touched, it is skipped instead of being copied again.)
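
A minimal sketch of the race described above, using a toy in-memory store rather than TiDB's kv API. All names (store, batch, copyRow, run) are hypothetical, and the unguarded batch performs no conflict check at all, which simplifies the real behavior. It only illustrates how making the old row key part of the batch's commit-time check turns a silent overwrite into a failed batch.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is returned when a guarded key changed after the batch started.
var errConflict = errors.New("write conflict on guarded key")

// store is a toy versioned key-value store, not TiKV or TiDB's kv API.
type store struct {
	data    map[string]string
	version map[string]int // logical time of the last committed write per key
	clock   int
}

func newStore() *store {
	return &store{data: map[string]string{}, version: map[string]int{}}
}

// put commits one write immediately, like a concurrent autocommit UPDATE.
func (s *store) put(key, val string) {
	s.clock++
	s.data[key] = val
	s.version[key] = s.clock
}

// batch models one backfill batch: it starts at startTS and buffers its writes.
type batch struct {
	s       *store
	startTS int
	writes  map[string]string
	guarded []string // keys that must be unchanged since startTS
}

func (s *store) begin() *batch {
	return &batch{s: s, startTS: s.clock, writes: map[string]string{}}
}

// copyRow buffers a copy of oldKey's current value under newKey.
// With guard set, oldKey also takes part in the commit-time conflict check.
func (b *batch) copyRow(oldKey, newKey string, guard bool) {
	b.writes[newKey] = b.s.data[oldKey]
	if guard {
		b.guarded = append(b.guarded, oldKey)
	}
}

// commit applies the buffered writes unless a guarded key changed after startTS.
func (b *batch) commit() error {
	for _, k := range b.guarded {
		if b.s.version[k] > b.startTS {
			return errConflict
		}
	}
	for k, v := range b.writes {
		b.s.put(k, v)
	}
	return nil
}

// run interleaves one batch with a concurrent UPDATE that double-writes the row.
// It returns the final value in the new partition and the batch's commit error.
func run(guard bool) (string, error) {
	s := newStore()
	s.put("old/1", "v1")
	b := s.begin()
	b.copyRow("old/1", "new/1", guard) // the batch reads v1
	s.put("old/1", "v2")               // concurrent UPDATE on the old partition
	s.put("new/1", "v2")               // the same UPDATE, double-written to the new partition
	err := b.commit()
	return s.data["new/1"], err
}

func main() {
	v, err := run(false)
	fmt.Println(v, err) // unguarded: the stale value wins and the update is lost
	v, err = run(true)
	fmt.Println(v, err) // guarded: the batch fails and the update survives
}
```

With guard set to false the stale v1 ends up in the new partition; with guard set to true the commit returns errConflict and v2 is kept, so the batch can be retried.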

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

For non-clustered tables, during the data copying/backfill phase of REORGANIZE PARTITION, a row that was updated while the reorg was copying it in a batch could be overwritten with its state from before the update.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-triage-completed and release-note labels Dec 13, 2024
@mjonss mjonss requested a review from Copilot December 13, 2024 00:20
@ti-chi-bot ti-chi-bot bot added the size/XL label Dec 13, 2024
tiprow bot commented Dec 13, 2024

Hi @mjonss. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copilot reviewed 2 out of 2 changed files in this pull request and generated no suggestions.

codecov bot commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 88.52459% with 7 lines in your changes missing coverage. Please review.

Project coverage is 77.5686%. Comparing base (0be1983) to head (fc27c67).
Report is 149 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #58229        +/-   ##
================================================
+ Coverage   73.1936%   77.5686%   +4.3750%     
================================================
  Files          1681       1730        +49     
  Lines        463050     503461     +40411     
================================================
+ Hits         338923     390528     +51605     
+ Misses       103344      90685     -12659     
- Partials      20783      22248      +1465     
Flag Coverage Δ
integration 51.9920% <85.2459%> (?)
unit 74.8839% <88.5245%> (+2.5641%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 53.0100% <ø> (+0.3190%) ⬆️
parser ∅ <ø> (∅)
br 64.5621% <ø> (+18.5421%) ⬆️

@mjonss
Contributor Author

mjonss commented Dec 13, 2024

/retest

tiprow bot commented Dec 13, 2024

@mjonss: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Contributor

@Defined2014 Defined2014 left a comment

Is it true that rows in clustered-index tables cannot be duplicated during reorg, because the primary key check already ensures this?

return errors.Trace(err)
}
// Also don't actually write it :)
err = txn.Delete(w.oldKeys[i])

Contributor

Does this duplicate L3878? Maybe keeping only the assertion is enough.

Contributor Author

I'm not sure how to force the transaction to fail only if a key already exists, regardless of settings like tidb_txn_assertion_level. By involving the key in the transaction, the transaction fails if another concurrent transaction has modified the key (for example, inserted it because of an UPDATE). SetAssertion() alone is not enough. We could use only txn.Delete() or only txn.Set(); I'm not sure whether Set+Assert+Delete is better. What do you think?

Contributor

I don't quite understand this code. Why does SetAssertNotExist succeed when txn.Set has already been called on the key?

Contributor Author

I would expect that only lock calls are checked against/forwarded to the KV store directly, while SetAssertion/Set/Delete are only applied during Commit.

Contributor

If the goal is simply to prevent the row_key from being modified, using txn.LockKeys to apply a pessimistic lock on the corresponding row is the appropriate approach.

The purpose of Assertion is fundamentally different:

  • It is designed to validate invariants or constraints to ensure correctness is not violated.
  • It is not intended for concurrency control or preventing concurrent modifications.

Using LockKeys in a pessimistic transaction explicitly handles concurrency by preventing conflicting writes, while assertions act as safeguards to check assumptions about the state after operations are performed.
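
The distinction can be sketched with a toy store that has a lock table. This is an illustration under stated assumptions, not TiDB's kv.Transaction: the kv type and the lockKey, write, assertNotExist, assertionOnly and lockFirst helpers are all hypothetical, and a conflicting writer fails immediately here, whereas a real pessimistic transaction would usually wait for the lock.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errLocked = errors.New("key is locked by another transaction")
	errAssert = errors.New("assertion failed: key already exists")
)

// kv is a toy store with a lock table. It is not TiDB's kv.Transaction.
type kv struct {
	data  map[string]string
	locks map[string]string // key -> owner holding the lock
}

func newKV() *kv {
	return &kv{data: map[string]string{}, locks: map[string]string{}}
}

// lockKey takes an exclusive lock, failing if another owner already holds it.
func (k *kv) lockKey(owner, key string) error {
	if cur, ok := k.locks[key]; ok && cur != owner {
		return errLocked
	}
	k.locks[key] = owner
	return nil
}

// write stores a value unless another owner holds the lock on the key.
// A real pessimistic transaction would usually wait here instead of failing.
func (k *kv) write(owner, key, val string) error {
	if cur, ok := k.locks[key]; ok && cur != owner {
		return errLocked
	}
	k.data[key] = val
	return nil
}

// assertNotExist only inspects the current state; it reserves nothing.
func (k *kv) assertNotExist(key string) error {
	if _, ok := k.data[key]; ok {
		return errAssert
	}
	return nil
}

// assertionOnly checks the invariant for the backfill, then lets a DML write in.
// It returns the result of the concurrent DML write.
func assertionOnly() error {
	k := newKV()
	if err := k.assertNotExist("row"); err != nil {
		return err
	}
	return k.write("dml", "row", "v2") // nothing stops this write
}

// lockFirst locks the row for the backfill before the DML write arrives.
// It returns the result of the concurrent DML write.
func lockFirst() error {
	k := newKV()
	if err := k.lockKey("backfill", "row"); err != nil {
		return err
	}
	return k.write("dml", "row", "v2") // rejected while the lock is held
}

func main() {
	fmt.Println("assertion only:", assertionOnly())
	fmt.Println("lock first:", lockFirst())
}
```

In this sketch the assertion passes and the concurrent write still succeeds, while the lock makes the same write return errLocked. That is the sense in which an assertion validates state but does not control concurrency.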

Contributor

Because this uses internal transactions during DDL execution and relies on their concurrency control against DML, I recommend consulting colleagues who work on DDL to confirm that the logic complies with the DDL constraints.

Regular DDL backfill should contain similar code that can serve as a reference.
/cc @wjhuang2016 @tangenta

Contributor

The difference from regular DDL is that this uses the version/lock on one key to control another, unrelated key. That looks strange here.

Contributor Author

After changing to txn.LockKeys(), it is now the same as what (*addIndexTxnWorker) BackfillData() does here.

}

// tablecodec.prefixLen is not exported, but is just TableSplitKeyLen + 2

Contributor

I don't get the point of this comment. Why +2 here?

Contributor Author

TableSplitKeyLen covers only t<encoded tableID>, and we want to include the _r record prefix separator as well, so that is where the +2 comes from.
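
The arithmetic can be sketched as follows, assuming a record-key layout of `t`, then an 8-byte encoded table ID, then the two-byte separator `_r`. The lowercase constants and the encodeID and recordPrefix helpers are local stand-ins written for this illustration, not the tablecodec package's exported API.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	idLen            = 8         // bytes of an encoded int64 ID (assumed)
	tableSplitKeyLen = 1 + idLen // 't' followed by the table ID
	recordPrefixSep  = "_r"      // separator that precedes the row handle (assumed)
	recordPrefixLen  = tableSplitKeyLen + len(recordPrefixSep)
)

// encodeID writes id big-endian with the sign bit flipped, so that the byte
// order of encoded IDs matches their numeric order.
func encodeID(id int64) []byte {
	buf := make([]byte, idLen)
	binary.BigEndian.PutUint64(buf, uint64(id)^(1<<63))
	return buf
}

// recordPrefix builds the key prefix shared by all record keys of one table.
func recordPrefix(tableID int64) []byte {
	key := make([]byte, 0, recordPrefixLen)
	key = append(key, 't')
	key = append(key, encodeID(tableID)...)
	key = append(key, recordPrefixSep...)
	return key
}

func main() {
	p := recordPrefix(42)
	fmt.Println(len(p), tableSplitKeyLen, recordPrefixLen)
	fmt.Printf("%q\n", p[tableSplitKeyLen:]) // the two extra bytes
}
```

Deriving the length from len(recordPrefixSep) rather than a literal 2 keeps the relationship visible, which is the concern raised in this thread.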

Contributor

@Defined2014 Defined2014 Dec 17, 2024

Could you use len(recordPrefixSep) instead of 2, or prefixLen from tablecodec.go?

Contributor Author

@mjonss mjonss Dec 17, 2024

Hmm, names that start with a lowercase letter are not exported :(
I can rename them, but then the PR grows a bit with unrelated changes...
Is it OK if I create a follow-up issue and PR for exporting PrefixLen/RecordPrefixSepLength later?

@mjonss
Contributor Author

mjonss commented Dec 17, 2024

/retest

tiprow bot commented Dec 17, 2024

@mjonss: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

@ti-chi-bot ti-chi-bot bot requested review from tangenta and wjhuang2016 December 17, 2024 11:48
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm label Dec 17, 2024
Member

@bb7133 bb7133 left a comment

LGTM

@ti-chi-bot ti-chi-bot bot added the lgtm label and removed the needs-1-more-lgtm label Dec 18, 2024
ti-chi-bot bot commented Dec 18, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-12-17 12:18:24 UTC: ☑️ agreed by Defined2014.
  • 2024-12-18 02:37:08 UTC: ☑️ agreed by bb7133.

@bb7133
Member

bb7133 commented Dec 18, 2024

/retest

tiprow bot commented Dec 18, 2024

@mjonss: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name         Commit   Details  Required  Rerun command
fast_test_tiprow  1f87051  link     true      /test fast_test_tiprow

@mjonss
Contributor Author

mjonss commented Dec 18, 2024

/retest

tiprow bot commented Dec 18, 2024

@mjonss: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 label Jan 8, 2025
ti-chi-bot bot commented Jan 9, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bb7133, Defined2014, tangenta

@Defined2014
Contributor

/retest

tiprow bot commented Jan 9, 2025

@Defined2014: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

@ti-chi-bot ti-chi-bot bot merged commit b22555b into pingcap:master Jan 9, 2025
25 checks passed
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #58834.

@mjonss mjonss deleted the non-cluster-reorg-part-backfill-dml-58226 branch January 9, 2025 16:23
Labels
approved, lgtm, needs-cherry-pick-release-8.5, release-note, size/XL
6 participants