Skip to content
This repository has been archived by the owner on Dec 8, 2021. It is now read-only.

Rewrite data file parser; Insert _tidb_rowid when needed; Update checkpoint structure #82

Merged
merged 15 commits into from
Nov 20, 2018

Conversation

kennytm
Copy link
Collaborator

@kennytm kennytm commented Nov 1, 2018

What problem does this PR solve?

Completely fix TOOL-462, by recording the _tidb_rowid on non-PkIsHandle tables to ensure idempotence when importing the same chunk twice.

What is changed and how it works?

  1. Assign a Row ID to every row of a table before import starts, so importing two chunks from the same table is no longer order-dependent.
  2. To properly assign a Row ID, we need to know exactly how many rows each chunk has. So we replaced splitFuzzyRegion back to an exact version.
  3. Since we need to read the whole file before importing, we want to make this step as fast as possible. Therefore, I replaced the MDDataReader by a ragel-based parser, which is about 8x faster on my machine.
  4. We also need to record the RowIDs into the checkpoints. The checkpoint tables are modified to accommodate this change. Additionally, the checksums are stored as properties of a chunk instead of the whole table.
  5. To ensure the only global property, the allocator, won't interfere with the data output in future updates, I've created a custom allocator which will panic on any unsupported operation.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility (only if you update Lightning after it saved a checkpoint)

Related changes

  • Need to cherry-pick to the release branch (2.1)
  • Need to update the tidb-ansible repository
  • Need to update the documentation
  • Need to be included in the release note

@sre-bot
Copy link

sre-bot commented Nov 1, 2018

Hi contributor, thanks for your PR.

This patch needs to be approved by someone of admins. They should reply with "/ok-to-test" to accept this PR for running test automatically.

@kennytm
Copy link
Collaborator Author

kennytm commented Nov 1, 2018

/run-all-tests

@kennytm

This comment has been minimized.

@kennytm kennytm force-pushed the kennytm/inject-tidb-rowid branch from 3d9dde9 to fd3d23d Compare November 6, 2018 03:17
@kennytm kennytm force-pushed the kennytm/inject-tidb-rowid branch from fd3d23d to ea5e94c Compare November 6, 2018 03:21
@kennytm kennytm force-pushed the kennytm/inject-tidb-rowid branch from ea5e94c to 47d5df6 Compare November 6, 2018 04:07
@kennytm
Copy link
Collaborator Author

kennytm commented Nov 6, 2018

/run-all-tests

@kennytm
Copy link
Collaborator Author

kennytm commented Nov 6, 2018

PTAL @GregoryIan (... throw dice ...) @csuzhangxc @july2993

}

func (merger *ChunkCheckpointMerger) MergeInto(cpd *TableCheckpointDiff) {
cpd.hasChecksum = true
cpd.hasChunks = true
cpd.allocBase = merger.AllocBase
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always want to ask must it be it increasing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be. Adding a MaxInt64 call to ensure.

@kennytm
Copy link
Collaborator Author

kennytm commented Nov 7, 2018

/run-all-tests

back_quoted = '`' ^'`'* '`';
unquoted = ^([,;()'"`] | space)+;

row = '(' (^[)'"`] | single_quoted | double_quoted | back_quoted)* ')';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about it, maybe we need more test or descirptions

@IANTHEREAL
Copy link
Collaborator

@july2993 and I would review it again after we handle _row_id rebase

On first read, we will reset the allocator base it is the maximum of

1. the AUTO_INCREMENT option of the CREATE TABLE statement, or
2. the total number of rows

This ensures future writes after importing will not clobber existing rows
due to overlapping _tidb_rowid.
@kennytm
Copy link
Collaborator Author

kennytm commented Nov 9, 2018

I believe TiDB isn't going to accept any PR which doesn't handle the _tidb_rowid rebase in the distributed case (which is irrelevant to Lightning), so I explicitly rebase on the number of rows here instead.

@kennytm
Copy link
Collaborator Author

kennytm commented Nov 9, 2018

/ok-to-test

@kennytm
Copy link
Collaborator Author

kennytm commented Nov 9, 2018

/run-all-tests

@IANTHEREAL
Copy link
Collaborator

@july2993 PTAL

@IANTHEREAL
Copy link
Collaborator

@csuzhangxc @amyangfei PTAL

@kennytm kennytm added the status/DNM Do not merge, test is failing or blocked by another PR label Nov 14, 2018
@kennytm kennytm force-pushed the kennytm/inject-tidb-rowid branch from ca1eab3 to ed37dbe Compare November 14, 2018 12:15
@kennytm
Copy link
Collaborator Author

kennytm commented Nov 14, 2018

/run-all-tests

@kennytm kennytm removed the status/DNM Do not merge, test is failing or blocked by another PR label Nov 14, 2018
start = time.Now()
kvs, _, err := kvEncoder.SQL2KV(sqls.String())
metrics.MarkTiming(encodeMark, start)
common.AppLogger.Debugf("len(kvs) %d, len(sql) %d", len(kvs), sqls.Len())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is sqls.Len() (number of bytes) for len(sql) significant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it's significant. This is just migration of existing code.

kvs, affectedRows, err := kvEncoder.SQL2KV(stmt)
metrics.MarkTiming(encodeMark, start)
common.AppLogger.Debugf("len(kvs) %d, len(sql) %d", len(kvs), len(stmt))

@kennytm
Copy link
Collaborator Author

kennytm commented Nov 20, 2018

/run-integration-test

Copy link
Contributor

@amyangfei amyangfei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@csuzhangxc csuzhangxc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kennytm
Copy link
Collaborator Author

kennytm commented Nov 20, 2018

/run-unit-test

@kennytm kennytm merged commit a7122c4 into master Nov 20, 2018
@kennytm kennytm deleted the kennytm/inject-tidb-rowid branch November 20, 2018 15:07
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants