refactor: parallelize parquet_exec test case single_file #4735
Conversation
Signed-off-by: Ruihang Xia <[email protected]>
Makes sense to me -- thank you @waynexia. I think the return code from the task needs to be verified, but otherwise this looks good to me. Thank you.
I like how it will use more threads if available and degrade gracefully if not.
On my machine, master takes 20s
cargo test -p datafusion --test parquet_exec
...
test result: ok. 52 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 20.93s
With this PR, it takes 5.6s 👨‍🍳 👌
cargo test -p datafusion --test parquet_exec
...
test result: ok. 52 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 5.68s
cc @tustvold as he previously had some concerns with parallelizing this test, but I think his original concerns were related to writing the test file multiple times, which this PR does not do.
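For reference, here is a minimal sketch (not the PR's actual code) of the pattern being discussed: each case runs as its own tokio task and the outer test awaits every JoinHandle, so a panic in any case still fails the test, which is the "return code" check mentioned above. The case names and bodies are hypothetical placeholders.

```rust
use tokio::task::JoinHandle;

#[tokio::test(flavor = "multi_thread")]
async fn single_file_parallel_sketch() {
    // Hypothetical cases; in the real test each entry would run one
    // filter-pushdown scenario against the shared parquet file.
    let cases = vec!["case_a", "case_b", "case_c"];

    let handles: Vec<JoinHandle<()>> = cases
        .into_iter()
        .map(|name| {
            tokio::task::spawn(async move {
                // run one scenario here; a failing assertion panics the task
                println!("running {name}");
            })
        })
        .collect();

    for handle in handles {
        // `await` returns Err if the spawned task panicked, so unwrapping
        // here propagates any case failure to the outer test.
        handle.await.expect("test case panicked");
    }
}
```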
This looks good to me, other than the panic handling @alamb has already highlighted. I think we could also look at reducing the page and row group size of the writer, so that we can potentially get the same coverage from a significantly smaller parquet file.
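As a rough illustration of that suggestion (not code from this PR), the parquet writer's row group and page sizes can be lowered via WriterProperties; the values below are made up, and the builder method names are those in recent versions of the parquet crate.

```rust
use parquet::file::properties::WriterProperties;

fn small_file_writer_props() -> WriterProperties {
    WriterProperties::builder()
        // rows per row group: a small value produces many row groups
        // from relatively few rows
        .set_max_row_group_size(1024)
        // target data page size in bytes (the writer treats this as a
        // best-effort limit)
        .set_data_page_size_limit(1024)
        .build()
}
```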
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Thanks for your review!
I tried to set the generator's max batch size:
Signed-off-by: Ruihang Xia <[email protected]>
- let generator = AccessLogGenerator::new().with_row_limit(NUM_ROWS);
+ let generator = AccessLogGenerator::new()
+     .with_row_limit(NUM_ROWS)
+     .with_max_batch_size(ROW_LIMIT);
This wasn't quite what I meant; this just alters the internal batch size that the file is generated in. I'll add it to my list to put up a PR that adjusts the generator so it doesn't generate such unnecessarily large files.
Note that I carefully chose the values of the predicates in this test to cover various cases based on how the values are distributed by the data generator. I am sure it will be possible to reduce the size of the data file, but it will require some care to ensure the coverage is retained.
PR in #4743
Benchmark runs are scheduled for baseline = 734c211 and contender = 38a24c0. 38a24c0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Signed-off-by: Ruihang Xia <[email protected]>
Which issue does this PR close?
Closes #.
Rationale for this change
I found this test to be very slow (the last one, datafusion::parquet_exec parquet::filter_pushdown::single_file), and it always holds up the whole test process for several seconds.
What changes are included in this PR?
This test is composed of several little cases. I changed it to run them in parallel, making the unit test faster by ~10s.
Are these changes tested?
Covered by the test itself.
Are there any user-facing changes?
no