
tikv: fix TxnSize to be the number of involved keys in the region #11725

Merged
merged 11 commits into pingcap:master from the correct-txn-size branch
Aug 19, 2019

Conversation

sticnarf
Contributor

@sticnarf sticnarf commented Aug 13, 2019

What problem does this PR solve?

In prewrite, the TxnSize is currently the number of keys in a single batch. Because there is a 16KB batch size limit, we may send several batches to the same region, so TxnSize is not actually the number of keys the transaction involves in that region.
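To illustrate, here is a minimal, self-contained sketch of the batching behavior (the types and the appendBatchBySize helper are simplified stand-ins for the real committer code):

```go
package main

import "fmt"

// txnCommitBatchSize mirrors the 16KB batch limit described above.
const txnCommitBatchSize = 16 * 1024

type regionVerID struct{ id, confVer, ver uint64 }

type batchKeys struct {
	region regionVerID
	keys   [][]byte
}

// appendBatchBySize splits one region's keys into size-bounded batches,
// loosely modeled on the committer's batching helper.
func appendBatchBySize(b []batchKeys, region regionVerID, keys [][]byte,
	sizeFn func([]byte) int, limit int) []batchKeys {
	for start, end := 0, 0; start < len(keys); start = end {
		var size int
		for end = start; end < len(keys) && size < limit; end++ {
			size += sizeFn(keys[end])
		}
		b = append(b, batchKeys{region: region, keys: keys[start:end]})
	}
	return b
}

func main() {
	region := regionVerID{id: 1, confVer: 1, ver: 1}
	// 100 keys in one region, each row carrying ~1KB of value: the 16KB
	// limit splits them into ~7 batches, so a per-batch TxnSize (~16)
	// badly undercounts the 100 keys this transaction holds in the region.
	keys := make([][]byte, 100)
	for i := range keys {
		keys[i] = []byte(fmt.Sprintf("k%03d", i))
	}
	rowSize := func(k []byte) int { return len(k) + 1024 }
	batches := appendBatchBySize(nil, region, keys, rowSize, txnCommitBatchSize)
	fmt.Printf("%d keys split into %d batches\n", len(keys), len(batches))
}
```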

What is changed and how it works?

I add a map to the twoPhaseCommitter that stores the number of involved keys in each region, which minimizes the changes to the commit process. The map is consulted in buildPrewriteRequest to fill the TxnSize field of the PrewriteRequest.
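A minimal sketch of the fix, with simplified stand-in types (prewriteRequest abbreviates kvproto's PrewriteRequest, and groupAndRecord abbreviates the bookkeeping done in doActionOnKeys):

```go
package main

import "fmt"

type regionVerID struct{ id, confVer, ver uint64 }

type batchKeys struct {
	region regionVerID
	keys   [][]byte
}

// prewriteRequest stands in for kvproto's PrewriteRequest; only the
// TxnSize field matters here.
type prewriteRequest struct{ TxnSize uint64 }

type twoPhaseCommitter struct {
	// regionTxnSize maps each region to the number of keys this
	// transaction writes there, recorded once when the keys are grouped
	// by region and before they are split into 16KB batches.
	regionTxnSize map[regionVerID]int
}

// groupAndRecord abbreviates the bookkeeping added to doActionOnKeys.
func (c *twoPhaseCommitter) groupAndRecord(groups map[regionVerID][][]byte) {
	if c.regionTxnSize == nil {
		c.regionTxnSize = make(map[regionVerID]int)
	}
	for region, keys := range groups {
		c.regionTxnSize[region] = len(keys)
	}
}

// buildPrewriteRequest reads the per-region count instead of the batch
// length, so every batch sent to the same region reports the same TxnSize.
func (c *twoPhaseCommitter) buildPrewriteRequest(batch batchKeys) *prewriteRequest {
	return &prewriteRequest{TxnSize: uint64(c.regionTxnSize[batch.region])}
}

func main() {
	r := regionVerID{id: 1, confVer: 1, ver: 1}
	c := &twoPhaseCommitter{}
	c.groupAndRecord(map[regionVerID][][]byte{r: make([][]byte, 100)})
	oneBatch := batchKeys{region: r, keys: make([][]byte, 16)}
	fmt.Println(c.buildPrewriteRequest(oneBatch).TxnSize) // 100, not 16
}
```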

Check List

Tests

  • Unit test

Code changes

  • Has exported function/method change
  • Has exported variable/fields change

Side effects

  • Possible performance regression
    • I think it is quite minor, though

Related changes

  • Need to update the documentation
    • I think we should clarify the meaning of TxnSize in kvproto

@CLAassistant

CLAassistant commented Aug 13, 2019

CLA assistant check
All committers have signed the CLA.

@sticnarf sticnarf marked this pull request as ready for review August 13, 2019 06:55
@codecov

codecov bot commented Aug 13, 2019

Codecov Report

Merging #11725 into master will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##            master    #11725   +/-   ##
=========================================
  Coverage   81.541%   81.541%           
=========================================
  Files          434       434           
  Lines        93857     93857           
=========================================
  Hits         76532     76532           
  Misses       11855     11855           
  Partials      5470      5470

@sticnarf
Contributor Author

PTAL @disksing

@@ -323,6 +325,12 @@ func (c *twoPhaseCommitter) doActionOnKeys(bo *Backoffer, action twoPhaseCommitAction
 	var batches []batchKeys
 	var sizeFunc = c.keySize
 	if action == actionPrewrite {
+		if c.regionTxnSize == nil {
+			c.regionTxnSize = make(map[RegionVerID]int)
Contributor

I prefer to create it in the initialization phase.

Contributor Author

OK. I will do it.

+			c.regionTxnSize = make(map[RegionVerID]int)
+		}
+		for region, keys := range groups {
+			c.regionTxnSize[region] = len(keys)
Contributor

I noticed that the map is not recalculated when we encounter a region miss (which may be caused by a region split or rebalance), so during retry all batch sizes will become 0.
A quick fix is to use region.ID as the map key instead; it's not 100% accurate, but it will be better.
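A small sketch of the concern, assuming a RegionVerID-like struct key whose version is bumped by a split:

```go
package main

import "fmt"

type regionVerID struct{ id, confVer, ver uint64 }

func main() {
	// Recorded when the keys were first grouped, before the split.
	regionTxnSize := map[regionVerID]int{
		{id: 1, confVer: 1, ver: 1}: 100,
	}
	// The retried batch sees the same region ID but a bumped version,
	// so the lookup misses and TxnSize silently falls back to 0.
	retried := regionVerID{id: 1, confVer: 1, ver: 2}
	fmt.Println(regionTxnSize[retried]) // prints 0
}
```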

Contributor Author
@sticnarf sticnarf Aug 13, 2019

On region errors, the txn size will now become the batch size instead of 0, because the failed batch goes through the whole prewrite process again. Then it is as bad as the original code when the values are large :(

Contributor Author

There will be some other problems if we use region.ID as the map key. When several batches in a new region are retrying concurrently, how to update the txn size of that region becomes a problem...

Contributor

oh good point 🤣

@sticnarf sticnarf force-pushed the correct-txn-size branch 2 times, most recently from adf8a49 to df8fb83 Compare August 13, 2019 12:12
@sticnarf
Contributor Author

sticnarf commented Aug 13, 2019

Unexpected "resolve lock lite" is worse than doing more normal resolve lock. As we don't know what the transaction size when doing a retry, I propose we just set it to MaxUint64 to prevent resolve lock lite completely.
(Say we have a transaction who wrote hundreds of keys and died. The value size of each row is over 1KB. Then, the batch size is always under the threshold of doing a normal resolve lock. Now, a coprocessor request consistently encounters locks and only does resolve lock lite. It is disastrous because there can be tens of retries and the coprocessor progress is zeroed on each retry.)
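A sketch of this fallback (txnSizeFor is a hypothetical helper name; the real change lives in buildPrewriteRequest):

```go
package main

import (
	"fmt"
	"math"
)

type regionVerID struct{ id, confVer, ver uint64 }

// txnSizeFor returns the recorded per-region key count, or MaxUint64 when
// the region is missing from the map (i.e. a retried batch after a region
// error), so TiKV never picks resolve-lock-lite based on a misleadingly
// small TxnSize.
func txnSizeFor(regionTxnSize map[regionVerID]int, region regionVerID) uint64 {
	if n, ok := regionTxnSize[region]; ok {
		return uint64(n)
	}
	return math.MaxUint64
}

func main() {
	sizes := map[regionVerID]int{{id: 1, confVer: 1, ver: 1}: 100}
	fmt.Println(txnSizeFor(sizes, regionVerID{id: 1, confVer: 1, ver: 1})) // 100
	fmt.Println(txnSizeFor(sizes, regionVerID{id: 1, confVer: 1, ver: 2})) // 18446744073709551615
}
```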

@disksing PTAL again. Thanks!

@disksing
Contributor

LGTM.
But both the lite resolve lock and preventing the lite resolve lock seem more like workarounds rather than solutions...
Just a rough idea, but maybe we can choose a strategy based on runtime statistics. For example: A) after retrying a certain number of times, the coprocessor scans and returns all locks of the region; or B) the tikv client uses resolve lock lite by default, and if it encounters many locks from the same transaction, it switches to the batch mode.
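Purely to illustrate option B, a hypothetical switch from lite to batch mode once a transaction's locks keep showing up (all names and the threshold here are invented):

```go
package main

import "fmt"

// lockResolver sketches option B: default to resolve-lock-lite, but once
// many locks from the same transaction have been encountered, switch that
// transaction to the batch (whole-region) resolve mode.
type lockResolver struct {
	liteThreshold int
	locksSeen     map[uint64]int // txn startTS -> locks encountered so far
}

// useLite reports whether the next lock from this transaction should be
// resolved with the lite path.
func (r *lockResolver) useLite(txnStartTS uint64) bool {
	r.locksSeen[txnStartTS]++
	return r.locksSeen[txnStartTS] <= r.liteThreshold
}

func main() {
	r := &lockResolver{liteThreshold: 16, locksSeen: make(map[uint64]int)}
	for i := 0; i < 20; i++ {
		if !r.useLite(42) {
			fmt.Printf("lock %d from txn 42: switching to batch resolve\n", i+1)
			break
		}
	}
}
```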

@sticnarf
Contributor Author

sticnarf commented Aug 14, 2019

But both the lite resolve lock and preventing the lite resolve lock seem more like workarounds rather than solutions...
Just a rough idea, but maybe we can choose a strategy based on runtime statistics. For example: A) after retrying a certain number of times, the coprocessor scans and returns all locks of the region; or B) the tikv client uses resolve lock lite by default, and if it encounters many locks from the same transaction, it switches to the batch mode.

Yes, all of them are just workarounds. But I think the better solution is to make tikv able to resolve locks itself (maybe by embedding a client in tikv) instead of trying to reduce the resolve-lock round trips between the client and tikv. Then there won't be aborts in cop processing, and it is more controllable. For example (just an immature idea), tikv could maintain an LRU cache of unfinished locks and actively resolve the expired ones when the load is low.
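For illustration only, a toy Go version of that idea (TiKV itself is written in Rust, and every name here is hypothetical): an LRU of observed locks plus a low-load pass over the expired ones.

```go
package main

import (
	"container/list"
	"fmt"
)

// lockInfo and lockCache are hypothetical: an LRU of unfinished locks the
// server has observed, with a low-load pass that resolves expired ones.
type lockInfo struct {
	key         string
	startTS     uint64
	expiresAtMs uint64
}

type lockCache struct {
	capacity int
	order    *list.List               // front = most recently observed
	entries  map[string]*list.Element // lock key -> element holding lockInfo
}

func newLockCache(capacity int) *lockCache {
	return &lockCache{
		capacity: capacity,
		order:    list.New(),
		entries:  make(map[string]*list.Element),
	}
}

// observe records a lock encountered while serving requests, evicting the
// least recently seen entry when the cache is full.
func (c *lockCache) observe(l lockInfo) {
	if e, ok := c.entries[l.key]; ok {
		e.Value = l
		c.order.MoveToFront(e)
		return
	}
	c.entries[l.key] = c.order.PushFront(l)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.entries, oldest.Value.(lockInfo).key)
	}
}

// resolveExpired would run when load is low; here it only reports the
// locks whose TTL has passed instead of actually resolving them.
func (c *lockCache) resolveExpired(nowMs uint64) {
	for e := c.order.Back(); e != nil; e = e.Prev() {
		if l := e.Value.(lockInfo); l.expiresAtMs < nowMs {
			fmt.Printf("resolve expired lock on %q (txn %d)\n", l.key, l.startTS)
		}
	}
}

func main() {
	c := newLockCache(1024)
	c.observe(lockInfo{key: "k1", startTS: 42, expiresAtMs: 3000})
	c.resolveExpired(10000) // txn 42's lock expired long ago
}
```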

@disksing
Contributor

I guess this scheme violates our design philosophy, which is that tikv is not allowed to access tikv as a client...

@lysu lysu requested review from lysu and tiancaiamao August 15, 2019 06:06
Contributor

@lysu lysu left a comment


LGTM

@lysu
Contributor

lysu commented Aug 16, 2019

/run-all-tests

@lysu lysu added the type/bugfix label Aug 16, 2019
@sticnarf
Contributor Author

/run-all-tests

@sticnarf
Contributor Author

Can we merge it?

Contributor

@lysu lysu left a comment


LGTM

@lysu
Contributor

lysu commented Aug 19, 2019

@sticnarf do we need to cherry-pick it to 3.0?

@lysu lysu added the status/all tests passed and status/LGT2 labels Aug 19, 2019
@tiancaiamao
Contributor

LGTM

@tiancaiamao tiancaiamao added the status/can-merge label Aug 19, 2019
@sre-bot
Contributor

sre-bot commented Aug 19, 2019

/run-all-tests

@sticnarf
Contributor Author

@sticnarf do we need to cherry-pick it to 3.0?

I'm inclined to. There is a small chance (a transaction with many keys and big values dies, and a coprocessor request needs to scan them) that an inaccurate txn_size harms performance a lot.

@sre-bot sre-bot merged commit 5580a01 into pingcap:master Aug 19, 2019
@sre-bot
Contributor

sre-bot commented Aug 19, 2019

cherry pick to release-3.0 failed

@sre-bot
Contributor

sre-bot commented Apr 7, 2020

It seems, though we are not sure, that we failed to cherry-pick this commit to release-3.0. Please comment '/run-cherry-picker' to trigger the cherry-picker again if the cherry-pick did fail. @sticnarf PTAL.

Labels: component/tikv, status/can-merge, status/LGT2, type/bugfix