Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: remove duplication in create_writer #7229

Merged
merged 6 commits into from
Aug 9, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Aug 8, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

I noticed some code duplication while reviewing #7212 from @devinjdangelo and I think this code will be used more when adding write support for avro, parquet and arrow files. Thus it makes sense to put it in its own function

What changes are included in this PR?

Refactor CsvSink::create_writer and JsonSink::create_writer

Are these changes tested?

covered by existing tests

Are there any user-facing changes?

no

@github-actions github-actions bot added the core Core DataFusion crate label Aug 8, 2023
@@ -494,56 +490,6 @@ impl CsvSink {
file_compression_type,
}
}

// Create a write for Csv files
async fn create_writer(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whole point of this PR is to move this into a free function that is shared rather than have two copies

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@@ -235,7 +239,7 @@ pub(crate) enum AbortMode {
}

/// A wrapper struct with abort method and writer
struct AbortableWrite<W: AsyncWrite + Unpin + Send> {
pub(crate) struct AbortableWrite<W: AsyncWrite + Unpin + Send> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this module is getting pretty large -- I think it would be easier to navigate it if were split into some smaller modules (e.g. put the write related code into datafusion/core/src/datasource/file_format/write.rs, for example). Any interest in doing so @devinjdangelo or @metesynnada ? If not, I can give it a try

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good idea. The code is likely to become more complex / long with additional helpers as we get into parallelization and dealing with parquet. I'm working on a PR to add support for inserting parquet now, and I can consolidate the shared write code into a new module as part of that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened a draft here #7244

Copy link
Contributor

@devinjdangelo devinjdangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for catching and improving on this.

@@ -494,56 +490,6 @@ impl CsvSink {
file_compression_type,
}
}

// Create a write for Csv files
async fn create_writer(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@@ -235,7 +239,7 @@ pub(crate) enum AbortMode {
}

/// A wrapper struct with abort method and writer
struct AbortableWrite<W: AsyncWrite + Unpin + Send> {
pub(crate) struct AbortableWrite<W: AsyncWrite + Unpin + Send> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good idea. The code is likely to become more complex / long with additional helpers as we get into parallelization and dealing with parquet. I'm working on a PR to add support for inserting parquet now, and I can consolidate the shared write code into a new module as part of that.

@@ -62,7 +62,7 @@ pub trait GetExt {
}

/// Readable file compression type
#[derive(Debug, Clone, PartialEq, Eq)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Contributor

@metesynnada metesynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reducing the duplication is logical, I just suggest an comment update.

@alamb alamb merged commit 026c1d0 into apache:main Aug 9, 2023
@alamb alamb deleted the alamb/refactor_write branch August 9, 2023 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants