Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MINOR] Make the sink input aware of its plan #7610

Merged
merged 2 commits into from
Sep 22, 2023
Merged

[MINOR] Make the sink input aware of its plan #7610

merged 2 commits into from
Sep 22, 2023

Conversation

metesynnada
Copy link
Contributor

Which issue does this PR close?

Improves the #7452

Rationale for this change

FileSink can be optimized further even with unbounded sources, so we are making the sink operation aware of its plan's unboundedness.

What changes are included in this PR?

To ensure the accuracy of the plan output, the "is_plan_streaming" method is utilized. This method is then incorporated into the FileSinkConfig to facilitate the process of checking the plan output.

Are these changes tested?

With existing tests.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the core Core DataFusion crate label Sep 20, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it isn't clear to me how this code is covered by existing tests

Specifically, I am thinking "if we removed this code no test would fail" 🤔

@metesynnada
Copy link
Contributor Author

For usual finite source cases, there will be no difference at all. On the other hand, there will also be no difference for the existing unbounded tests.

I did not fix a bug, just prevented spamming the yield_now() for bounded output cases even if we have FIFO-like sinks since the data will eventually be written when the execution finishes.

Do you have any suggestions on how to test this addition for better clarity?

@alamb
Copy link
Contributor

alamb commented Sep 21, 2023

I did not fix a bug, just prevented spamming the yield_now() for bounded output cases even if we have FIFO-like sinks since the data will eventually be written when the execution finishes.

I am sorry -- I missed this point.

Do you have any suggestions on how to test this addition for better clarity?

Given it is a performance optimization, perhaps we could just add a comment to the code explaining the rationale (so someone is less likely to remove it / break it accidentally)

# Conflicts:
#	datafusion/core/src/datasource/listing/table.rs
Copy link
Contributor

@mustafasrepo mustafasrepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is LGTM!. As @alamb suggested I have updated the comment to explain "why being precise about unboundedness at the input helps"

@mustafasrepo mustafasrepo merged commit d193508 into apache:main Sep 22, 2023
Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants