Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow configuring collection of statistics during TPC-H benchmarks #3889

Merged
merged 2 commits into from
Oct 19, 2022

Conversation

isidentical
Copy link
Contributor

@isidentical isidentical commented Oct 19, 2022

Which issue does this PR close?

Closes #3888 .

Rationale for this change

As described in #3888, it is useful to see a benchmark with/without statistics to determine what sort of effect it will have.

What changes are included in this PR?

A new flag, --disable-statistics to turn off statistic collection.

Are there any user-facing changes?

No.

@isidentical isidentical changed the title Allow enabling collection of statistics during TPC-H benchmarks Allow configuring collection of statistics during TPC-H benchmarks Oct 19, 2022
@isidentical
Copy link
Contributor Author

There are currently only two physical plans that differ when we pass --disable-stats, which is I guess the most reliable indication that they are the only plans that we actually use statistics. Here are the results from them in regards to on/off execution (which looks amazing, thanks to all the previous work on hash build side switching by @Dandandan and others)

image

I'll try to inspect other plans to see what else we can do (or if there is anything obvious we are missing, except filters which are in progress in #3868). But I'd expect the majority of the speed-ups in here to come from global join ordering (unlike only locals we have) with #3843 so no high hopes to see a magic bullet just yet 😇

@isidentical isidentical marked this pull request as ready for review October 19, 2022 01:56
@@ -98,6 +98,10 @@ struct DataFusionBenchmarkOpt {
/// Path to output directory where JSON summary file should be written to
#[structopt(parse(from_os_str), short = "o", long = "output")]
output_path: Option<PathBuf>,

/// Whether to disable collection of statistics (and cost based optimizations) or not.
#[structopt(short = "S", long = "disable-statistics")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to keep defaults the same as datafusion-cli (statistics disabled)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also thinking about it but then decided against it since the default is already true. Do you know whether we have any places where we continuously run benchmarks where making the default false would change results? If it wouldn't cause a regression on places where we compare benchmarks across datafusion commits/versions, I think it should be an easy to change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no current "continuous benchmarking" in place for this. If that would be the case, it would make sense to run the benchmarks both with and without collecting statistics.

Anyway, good that we can enable / disable them 👍

@Dandandan Dandandan merged commit c87964c into apache:master Oct 19, 2022
@Dandandan
Copy link
Contributor

Thanks @isidentical

@isidentical
Copy link
Contributor Author

Thanks for resolving conflicts @Dandandan 🙏🏻 By the way, just as a question (or maybe something we might be interested to discuss in general), are there any plans/efforts to do continuous benchmarking to ensure that no regression goes unnoticed? (I was very interested in @andygrove 's benchmarking of ballista/spark, so maybe even doing it both for ballista/datafusion).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow configuring statistics on TPC-H benchmarks
2 participants