-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ClickBench queries to DataFusion benchmark runner #7060
Conversation
d7bbae5
to
23c0348
Compare
05086a3
to
10d104b
Compare
|
||
// Common benchmark options (don't use doc comments otherwise this doc | ||
// shows up in help files) | ||
#[derive(Debug, StructOpt, Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactored common options into CommonOpt
pub use run::{BenchQuery, BenchmarkRun}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was dead code I accidentally introduced in #7054
The actual entrypoint for the dfbench binary is in https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/bin/dfbench.rs
/// Run the tpch benchmark | ||
/// Run the tpch benchmark. | ||
/// | ||
/// This benchmarks is derived from the [TPC-H][1] version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the details of what the benchmark was doing into the binary from the README, which I think is better as it is closer to the code, but I don't feel strongly about this and would welcome other opinions on the matter
@@ -53,17 +62,9 @@ pub struct RunOpt { | |||
#[structopt(short, long)] | |||
debug: bool, | |||
|
|||
/// Number of iterations of each test run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first part of a payment to reduce duplication in the benchmark runners (e.g. part of #7052)
@tustvold or @Dandandan do you have time to review this PR (I hope to use this benchmark runner to drive / test further groupby peformance improvements) |
benchmarks/src/clickbench.rs
Outdated
|
||
/// Returns the text of query `query_id` | ||
fn get_query(&self, query_id: usize) -> Result<String> { | ||
if query_id == 0 || query_id > 43 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClickBench numbers the queries 0-42:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch -- fixed ae4ce7d
One comment about the clickbench numbers, the rest looks good to me! Thanks for driving this forward |
Draft as it builds on #7054Which issue does this PR close?
closes #6994
closes #6128
Rationale for this change
see #6994 -- tldr is to optimize clickbench queries it needs to be easier to run them
What changes are included in this PR?
dfbench clickbench
command to run ClickBench queriesbench.sh
to run clickbench queriesbenchmarks/README.md
-- see rendered version https://github.com/alamb/arrow-datafusion/tree/alamb/clickbench_runner/benchmarksAre these changes tested?
I tested them manually
Run clickbench q1 directly (e.g. for profiling):
run with hits_partitioned (100 parquet files):
Run with bench.sh:
See help
Are there any user-facing changes?