Add ClickBench queries to DataFusion benchmark runner #7060

alamb · 2023-07-23T11:53:13Z

~~Draft as it builds on #7054~~

Which issue does this PR close?

closes #6994
closes #6128

Rationale for this change

See #6994. TL;DR: to optimize ClickBench queries, it needs to be easier to run them.

What changes are included in this PR?

Add new dfbench clickbench command to run ClickBench queries
Update bench.sh to run clickbench queries
Update benchmarks/README.md -- see rendered version https://github.com/alamb/arrow-datafusion/tree/alamb/clickbench_runner/benchmarks

Are these changes tested?

I tested them manually

Run clickbench q1 directly (e.g. for profiling):

cargo run  --bin dfbench -- clickbench --query 1
Running benchmarks with the following options: RunOpt { query: Some(1), common: CommonOpt { iterations: 3, partitions: 2, batch_size: 8192 }, path: "benchmarks/data/hits.parquet", queries_path: "benchmarks/queries/clickbench/queries.sql", output_path: None }
Q1: SELECT COUNT(*) FROM hits;
Query 1 iteration 0 took 305.7 ms and returned 1 rows
Query 1 iteration 1 took 13.6 ms and returned 1 rows
Query 1 iteration 2 took 13.6 ms and returned 1 rows
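
For readers unfamiliar with the output above, here is a minimal sketch of the kind of per-iteration timing loop that would produce lines in that format. It is not the runner's actual code: `run_iterations` and its closure parameter are hypothetical names, and the closure stands in for executing the query and counting the returned rows.

```rust
use std::time::Instant;

/// Runs `run` `iterations` times, printing one line per iteration in the
/// same format as the output above, and returns the elapsed times in ms.
/// `run` is a hypothetical stand-in that executes the query and returns
/// the number of rows produced.
fn run_iterations<F: FnMut() -> usize>(
    query_id: usize,
    iterations: usize,
    mut run: F,
) -> Vec<f64> {
    let mut elapsed_ms = Vec::with_capacity(iterations);
    for i in 0..iterations {
        let start = Instant::now();
        let row_count = run();
        let ms = start.elapsed().as_secs_f64() * 1000.0;
        println!("Query {query_id} iteration {i} took {ms:.1} ms and returned {row_count} rows");
        elapsed_ms.push(ms);
    }
    elapsed_ms
}
```

Calling `run_iterations(1, 3, || 1)` prints three lines, one per iteration, and returns three timings.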

Run with hits_partitioned (100 Parquet files):

cargo run  --bin dfbench -- clickbench --query 1 --path=benchmarks/data/hits_partitioned

Run with bench.sh:

./bench.sh run clickbench_1

See help

cargo run  --bin dfbench  -- clickbench --help

dfbench-clickbench 27.0.0
Run the clickbench benchmark

The ClickBench[1] benchmarks are widely cited in the industry and
focus on grouping / aggregation / filtering. This runner uses the
scripts and queries from [2].

[1]: https://github.com/ClickHouse/ClickBench
[2]: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

USAGE:
    dfbench clickbench [OPTIONS]

FLAGS:
    -h, --help       
            Prints help information

    -V, --version    
            Prints version information


OPTIONS:
    -s, --batch-size <batch-size>        
            Batch size when reading CSV or Parquet files [default: 8192]

    -i, --iterations <iterations>        
            Number of iterations of each test run [default: 3]

    -o, --output <output-path>           
            If present, write results json here

    -n, --partitions <partitions>        
            Number of partitions to process in parallel [default: 2]

    -p, --path <path>                    
            Path to hits.parquet (single file) or `hits_partitioned` (partitioned, 100 files) [default:
            benchmarks/data/hits.parquet]
    -r, --queries_path <queries-path>    
            Path to queries.sql (single file) [default: benchmarks/queries/clickbench/queries.sql]

    -q, --query <query>                  
            Query number. If not specified, runs all queries

Are there any user-facing changes?

alamb · 2023-07-25T12:17:50Z

benchmarks/src/options.rs

+
+// Common benchmark options (don't use doc comments otherwise this doc
+// shows up in help files)
+#[derive(Debug, StructOpt, Clone)]


refactored common options into CommonOpt
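
As a rough illustration of the shape of this refactoring, the sketch below mirrors the field names and defaults visible in the `RunOpt { ... }` debug output and the `--help` text in the PR description. It is a simplified assumption-laden version: the real structs derive `StructOpt`, while plain structs with `Default` are used here so the example compiles with only the standard library.

```rust
use std::path::PathBuf;

/// Options shared by every benchmark runner (sketch; the real struct
/// derives StructOpt, which is omitted here).
#[derive(Debug, Clone, PartialEq)]
pub struct CommonOpt {
    pub iterations: usize,
    pub partitions: usize,
    pub batch_size: usize,
}

impl Default for CommonOpt {
    fn default() -> Self {
        // Defaults as listed in the `--help` output.
        Self { iterations: 3, partitions: 2, batch_size: 8192 }
    }
}

/// ClickBench-specific options that embed the shared ones instead of
/// repeating them in each runner.
#[derive(Debug, Clone, PartialEq)]
pub struct RunOpt {
    pub query: Option<usize>,
    pub common: CommonOpt,
    pub path: PathBuf,
    pub queries_path: PathBuf,
    pub output_path: Option<PathBuf>,
}

impl Default for RunOpt {
    fn default() -> Self {
        Self {
            query: None, // None means "run all queries"
            common: CommonOpt::default(),
            path: PathBuf::from("benchmarks/data/hits.parquet"),
            queries_path: PathBuf::from("benchmarks/queries/clickbench/queries.sql"),
            output_path: None,
        }
    }
}
```

With this layout, `RunOpt { query: Some(1), ..RunOpt::default() }` picks up the shared defaults without each runner redeclaring them.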

alamb · 2023-07-25T12:18:19Z

benchmarks/src/lib.rs

 pub use run::{BenchQuery, BenchmarkRun};
-


this was dead code I accidentally introduced in #7054

The actual entrypoint for the dfbench binary is in https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/bin/dfbench.rs

alamb · 2023-07-25T12:20:13Z

benchmarks/src/tpch/run.rs

-/// Run the tpch benchmark
+/// Run the tpch benchmark.
+///
+/// This benchmarks is derived from the [TPC-H][1] version


I moved the details of what the benchmark does from the README into the binary, which I think is better because it is closer to the code. I don't feel strongly about this and would welcome other opinions on the matter.

alamb · 2023-07-25T12:20:55Z

benchmarks/src/tpch/run.rs

@@ -53,17 +62,9 @@ pub struct RunOpt {
    #[structopt(short, long)]
    debug: bool,

-    /// Number of iterations of each test run


This is a first installment toward reducing duplication in the benchmark runners (e.g. part of #7052).

alamb · 2023-07-26T19:24:11Z

@tustvold or @Dandandan do you have time to review this PR? (I hope to use this benchmark runner to drive / test further groupby performance improvements.)

Dandandan · 2023-07-26T20:41:01Z

benchmarks/src/clickbench.rs

+
+    /// Returns the text of query `query_id`
+    fn get_query(&self, query_id: usize) -> Result<String> {
+        if query_id == 0 || query_id > 43 {


ClickBench numbers the queries 0-42:

https://benchmark.clickhouse.com/

alamb · 2023-07-27

Good catch, fixed in ae4ce7d
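
To make the numbering point concrete, here is a minimal sketch of a bounds check for ids 0 through 42. It is not the code from ae4ce7d: the signature is simplified to take the file contents as a string, the error type is a plain `String`, and it assumes `queries.sql` holds one query per line.

```rust
/// Inclusive upper bound on ClickBench query ids (queries are numbered 0-42).
const MAX_QUERY_ID: usize = 42;

/// Returns the text of query `query_id` from the contents of `queries.sql`.
/// Assumption of this sketch: the file contains one query per line.
fn get_query(all_queries: &str, query_id: usize) -> Result<String, String> {
    // 0 is a valid id; only ids above 42 are rejected.
    if query_id > MAX_QUERY_ID {
        return Err(format!(
            "Invalid query id {query_id}. Must be between 0 and {MAX_QUERY_ID}"
        ));
    }
    all_queries
        .lines()
        .nth(query_id)
        .map(|q| q.trim().to_string())
        .ok_or_else(|| format!("queries file has no query {query_id}"))
}
```

Compared with the `query_id == 0 || query_id > 43` check quoted above, id 0 is accepted and id 43 is rejected.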

Dandandan · 2023-07-26T20:42:08Z

One comment about the clickbench numbers, the rest looks good to me! Thanks for driving this forward

alamb force-pushed the alamb/clickbench_runner branch from d7bbae5 to 23c0348 Compare July 23, 2023 11:58

This was referenced Jul 24, 2023

add a readme for clickbench intro #3186

Closed

Improve aggregate performance with specialized groups accumulator for single string group by #7064

Closed

Add clickbench query runner to benchmarks, update docs

10d104b

alamb force-pushed the alamb/clickbench_runner branch from 05086a3 to 10d104b Compare July 25, 2023 12:15

alamb marked this pull request as ready for review July 25, 2023 12:15

alamb marked this pull request as draft July 25, 2023 12:15

alamb commented Jul 25, 2023


alamb marked this pull request as ready for review July 25, 2023 12:21

Dandandan changed the title ~~Add ClickBench queries to DataFusion benchmark runer~~ Add ClickBench queries to DataFusion benchmark runner Jul 26, 2023

Dandandan reviewed Jul 26, 2023


alamb added 2 commits July 27, 2023 06:51

Merge remote-tracking branch 'apache/main' into alamb/clickbench_runner

9798e7c

Fix numbering so it goes from 0 to 42

ae4ce7d

Dandandan approved these changes Jul 27, 2023


alamb merged commit 11b7b5c into apache:main Jul 27, 2023

alamb deleted the alamb/clickbench_runner branch July 27, 2023 11:30
