Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow configuring collection of statistics during TPC-H benchmarks #3889

Merged
merged 2 commits into from
Oct 19, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 10 additions & 5 deletions benchmarks/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ implementations as well as other query engines.

## Benchmark derived from TPC-H

These benchmarks are derived from the [TPC-H][1] benchmark. And we use this repo as the source of tpch-gen and answers:
These benchmarks are derived from the [TPC-H][1] benchmark. And we use this repo as the source of tpch-gen and answers:
https://github.com/databricks/tpch-dbgen.git, based on [2.17.1](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf) version of TPC-H.

## Generating Test Data
Expand Down Expand Up @@ -55,6 +55,11 @@ You can enable the features `simd` (to use SIMD instructions, `cargo nightly` is
cargo run --release --features "simd mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
```

If you want to disable collection of statistics (and thus cost based optimizers), you can pass `--disable-statistics` flag.
``bash
cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path /mnt/tpch-parquet --format parquet --query 17 --disable-statistics
```

The benchmark program also supports CSV and Parquet input file formats and a utility is provided to convert from `tbl`
(generated by the `dbgen` utility) to CSV and Parquet.

Expand Down Expand Up @@ -130,16 +135,16 @@ h2o groupby query 1 took 1669 ms
## Parquet filter pushdown benchmarks

This is a set of benchmarks for testing and verifying performance of parquet filter pushdown. The queries are executed on
a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.

```base
cargo run --release --bin parquet_filter_pushdown -- --path ./data --scale-factor 1.0
```

This will generate the synthetic dataset at `./data/logs.parquet`. The size of the dataset can be controlled through the `size_factor`
(with the default value of `1.0` generating a ~1GB parquet file).
(with the default value of `1.0` generating a ~1GB parquet file).

For each filter we will run the query using different `ParquetScanOption` settings.
For each filter we will run the query using different `ParquetScanOption` settings.

Example run:
```
Expand All @@ -159,4 +164,4 @@ Iteration 0 returned 1781686 rows in 1940 ms
Iteration 1 returned 1781686 rows in 1986 ms
Iteration 2 returned 1781686 rows in 1947 ms
...
```
```
10 changes: 8 additions & 2 deletions benchmarks/src/bin/tpch.rs
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,10 @@ struct DataFusionBenchmarkOpt {
/// Path to output directory where JSON summary file should be written to
#[structopt(parse(from_os_str), short = "o", long = "output")]
output_path: Option<PathBuf>,

/// Whether to disable collection of statistics (and cost based optimizations) or not.
#[structopt(short = "S", long = "disable-statistics")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to keep defaults the same as datafusion-cli (statistics disabled)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also thinking about it but then decided against it since the default is already true. Do you know whether we have any places where we continuously run benchmarks where making the default false would change results? If it wouldn't cause a regression on places where we compare benchmarks across datafusion commits/versions, I think it should be an easy to change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no current "continuous benchmarking" in place for this. If that would be the case, it would make sense to run the benchmarks both with and without collecting statistics.

Anyway, good that we can enable / disable them 👍

disable_statistics: bool,
}

#[derive(Debug, StructOpt)]
Expand Down Expand Up @@ -164,7 +168,8 @@ async fn benchmark_datafusion(opt: DataFusionBenchmarkOpt) -> Result<Vec<RecordB
let mut benchmark_run = BenchmarkRun::new(opt.query);
let config = SessionConfig::new()
.with_target_partitions(opt.partitions)
.with_batch_size(opt.batch_size);
.with_batch_size(opt.batch_size)
.with_collect_statistics(!opt.disable_statistics);
let ctx = SessionContext::with_config(config);

// register tables
Expand Down Expand Up @@ -440,7 +445,7 @@ async fn get_table(
format,
file_extension: extension.to_owned(),
target_partitions,
collect_stat: true,
collect_stat: ctx.config.collect_statistics,
table_partition_cols: vec![],
};

Expand Down Expand Up @@ -1374,6 +1379,7 @@ mod tests {
file_format: "tbl".to_string(),
mem_table: false,
output_path: None,
disable_statistics: false,
};
let actual = benchmark_datafusion(opt).await?;

Expand Down