
import multiple tables at same time - 1 #2191

Merged: 18 commits merged into main from aneesh/import-multiple-tables-at-same-time-1 on Jan 22, 2025

Conversation

@makalaaneesh (Collaborator) commented Jan 16, 2025

Describe the changes in this pull request

  • Refactor batch producing + submitting to allow producing one batch at a time. This will help us import multiple tables at the same time.
  • Refactor the batch-producing logic into a FileBatchProducer (a minimal sketch follows below).
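
For illustration, a minimal sketch of the producer's shape, reconstructed from the review snippets quoted below; the Batch stand-in, the NextBatch name, and the elided splitting logic are assumptions, not the merged code:

// Batch is a minimal stand-in for one split of the data file.
type Batch struct {
	RecordCount int64
	filePath    string
}

func (b *Batch) GetFilePath() string { return b.filePath }

// FileBatchProducer produces one batch at a time from a data file,
// returning batches recovered from a previous run before fresh ones.
type FileBatchProducer struct {
	pendingBatches        []*Batch // recovered, not-yet-imported batches
	fileFullySplit        bool     // true once the whole file has been split
	completed             bool
	lineFromPreviousBatch string // record carried over into the next batch
}

// NextBatch hands out pending batches first, then splits fresh ones.
// (NextBatch is an assumed name.)
func (p *FileBatchProducer) NextBatch() (*Batch, error) {
	if len(p.pendingBatches) > 0 {
		batch := p.pendingBatches[0]
		p.pendingBatches = p.pendingBatches[1:]
		// file is fully split and this is the last pending batch,
		// so mark the producer as completed
		if len(p.pendingBatches) == 0 && p.fileFullySplit {
			p.completed = true
		}
		return batch, nil
	}
	// ... otherwise read the data file and split the next batch (elided) ...
	return nil, nil
}

func (p *FileBatchProducer) Done() bool {
	return p.completed
}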

Describe if there are any user-facing changes

How was this pull request tested?

Wrote unit tests.
Integration tests to run:

  • resumption tests
  • long running tests

Does your PR have changes that can cause upgrade issues?

Component                          Breaking changes?
MetaDB                             Yes/No
Name registry json                 Yes/No
Data File Descriptor Json          Yes/No
Export Snapshot Status Json        Yes/No
Import Data State                  Yes/No
Export Status Json                 Yes/No
Data .sql files of tables          Yes/No
Export and import data queue       Yes/No
Schema Dump                        Yes/No
AssessmentDB                       Yes/No
Sizing DB                          Yes/No
Migration Assessment Report Json   Yes/No
Callhome Json                      Yes/No
YugabyteD Tables                   Yes/No
TargetDB Metadata Tables           Yes/No

@makalaaneesh makalaaneesh marked this pull request as ready for review January 20, 2025 05:45
@priyanshi-yb (Contributor) left a comment:

Few comments

Comment on lines 1009 to 1011
if err != nil {
utils.ErrExit("preparing for file import: %s", err)
}
Contributor:

Do we need to do this PrepareForFileImport here? We are already doing it in NewFileBatchProducer.

lastBatchNumber: lastBatchNumber,
lastOffset: lastOffset,
fileFullySplit: fileFullySplit,
completed: completed,
Contributor:

nit: completed: len(pendingBatches) == 0 && fileFullySplit

return nil, err
}
if p.lineFromPreviousBatch != "" {
err = batchWriter.WriteRecord(p.lineFromPreviousBatch)
Contributor:

Add a comment explaining this lineFromPreviousBatch.
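
For instance, assuming the field carries a record that was read while producing the previous batch but did not fit within that batch's size limit (an assumption from reading this diff, not confirmed):

// lineFromPreviousBatch holds a record that was read while producing the
// previous batch but did not fit within its size limit; it is written as
// the first record of the next batch so that no record is lost.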

}

// 3 batches should be produced
// while calculating for the first batch, the header is also considered
Contributor:

Oh right, while preparing the first batch we add the bytes of the header to the batch's total bytes, but for further batches we don't, since we already have the header and don't include it in the batch's bytes.
I think it's worth testing whether, in cases where the number of columns is huge, the header's bytes can also contribute to the batches' bytes and should be included.
Can you please add a TODO where we add the header to the batch file, to fix this if required?

Collaborator (Author):

Yeah, did not want to change the implementation as part of this PR. Will add a TODO.
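
A possible wording for that TODO (my sketch, not necessarily the merged text):

// TODO: the header's bytes are currently counted towards the first batch's
// size but not towards subsequent batches'. Make this uniform (count the
// header for all batches or for none), since a very wide header could
// meaningfully affect batch sizing.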

Collaborator:

@makalaaneesh @priyanshi-yb I think we should be uniform across all the batches: either consider the header in all of them or not at all. We can discuss.

assert.NotNil(t, batch1)
assert.Equal(t, int64(2), batch1.RecordCount)

// simulate a crash and recover
Contributor:

Nice test for the recovery situation!

@priyanshi-yb (Contributor) left a comment:

LGTM!

@sanyamsinghal (Collaborator) left a comment:

LGTM. Added mostly minor comments only.
Thanks for adding the unit tests; keep adding these kinds of tests, as they will help make the codebase more robust.

Comment on lines +78 to +83
batch := p.pendingBatches[0]
p.pendingBatches = p.pendingBatches[1:]
// file is fully split and returning the last batch, so mark the producer as completed
if len(p.pendingBatches) == 0 && p.fileFullySplit {
p.completed = true
}
Collaborator:

Here we are marking the producer as completed before the last batch is processed.
Should we set this only when there is actually no batch available further?

Suggesting something like this:

if len(p.pendingBatches) > 0 {
	batch := p.pendingBatches[0]
	p.pendingBatches = p.pendingBatches[1:]
	return batch, nil
} else if len(p.pendingBatches) == 0 && p.fileFullySplit {
	p.completed = true
}

Collaborator:

@makalaaneesh this one might be important.

Collaborator (Author):

@sanyamsinghal Since that is the last batch we are returning (the file is fully split and we are picking the last pending batch), no further batches are available, so it made sense to mark it as done.
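
For context, a hypothetical shape of the consuming loop (submitBatch is a placeholder name): Done() is checked before each NextBatch call, so marking the producer completed while handing out the last batch is safe.

for !producer.Done() {
	batch, err := producer.NextBatch()
	if err != nil {
		utils.ErrExit("producing batch: %s", err)
	}
	// submit the batch for import (placeholder)
	submitBatch(batch)
}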

}, nil
}

func (p *FileBatchProducer) Done() bool {
Collaborator:

nit: p --> producer ?

return d.maxSizeBytes
}

func createTempFile(dir string, fileContents string) (string, error) {
Collaborator:

consider moving helper functions like this to the testutils package (test/utils/testutils.go)
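
For reference, a minimal sketch of such a shared helper in a testutils package (the name and signature here are assumptions):

// CreateTempFile writes fileContents to a fresh temp file under dir
// and returns the file's path.
func CreateTempFile(dir string, fileContents string) (string, error) {
	file, err := os.CreateTemp(dir, "temp-*.txt")
	if err != nil {
		return "", err
	}
	defer file.Close()
	if _, err := file.WriteString(fileContents); err != nil {
		return "", err
	}
	return file.Name(), nil
}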

return ldataDir, lexportDir, state, nil
}

func setupFileForTest(lexportDir string, fileContents string, dir string, tableName string) (string, *ImportFileTask, error) {
Collaborator:

more explicit function name?

assert.Equal(t, int64(2), batches[0].RecordCount)
batchContents, err := os.ReadFile(batches[0].GetFilePath())
assert.NoError(t, err)
assert.Equal(t, "id,val\n1, \"hello\"\n2, \"world\"", string(batchContents))
Collaborator:

nit: define a var for expected values
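
For example, using the values from the snippet above:

expectedBatchContents := "id,val\n1, \"hello\"\n2, \"world\""
assert.Equal(t, expectedBatchContents, string(batchContents))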

// 3 batches should be produced
// while calculating for the first batch, the header is also considered
assert.Equal(t, 3, len(batches))
// each of length 2
Collaborator:

remove comment?

}

// 3 batches should be produced
// while calculating for the first batch, the header is also considered
Collaborator:

@makalaaneesh @priyanshi-yb I think we should be uniform across all the batches: either consider the header in all of them or not at all. We can discuss.

Collaborator:

Other cases which can be tested here:

  1. Errors due to the data file, e.g. a syntax error.
  2. Resumability, after fixing that error in the main data file.
  3. I think we support two types of data files, CSV and text, so having coverage from that perspective is also good.
  4. Any variation in the content of the data file, especially CSV? Although that is something to be tested in the data file package, if it is not there we can add it here also.

Collaborator (Author):

Good ideas.
1, 2: They are more about the full import, so not applicable to a FileBatchProducer, which just produces the batches. Can be taken up when I add the FileTaskImporter 👍
3, 4: Agreed, but as you said, better suited for the dataFile package.

@makalaaneesh makalaaneesh merged commit 6fef1dc into main Jan 22, 2025
66 of 67 checks passed
@makalaaneesh makalaaneesh deleted the aneesh/import-multiple-tables-at-same-time-1 branch January 22, 2025 17:28