Add benchmarks for testing row filtering #3769
Conversation
```rust
];

let filter_matrix = vec![
    // Selective-ish filter
```
well-defined test case and test data! 👍
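For context, a minimal sketch of what a filter matrix of this kind can look like, assuming DataFusion's `Expr` builder API; the column names and predicates below are illustrative, not the exact ones in this PR:

```rust
use datafusion::prelude::{col, lit, Expr};

// A matrix of filter expressions with different selectivities, each of which
// the benchmark can run against the same generated Parquet file.
fn filter_matrix() -> Vec<Expr> {
    vec![
        // Selective-ish filter: equality on a low-cardinality column
        col("request_method").eq(lit("GET")),
        // Not very selective: matches most rows
        col("response_status").not_eq(lit(500_u16)),
    ]
}
```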
```rust
path: PathBuf,

/// Batch size when reading Parquet files
#[structopt(short = "s", long = "batch-size", default_value = "8192")]
```
I think there are two short options 's'
In fact, when you run the example in debug mode it asserts on exactly this problem:

```text
alamb@aal-dev:~/arrow-datafusion$ cargo run --bin parquet_filter_pushdown -- --path ./data --scale-factor 1.0
...
     Running `target/debug/parquet_filter_pushdown --path ./data --scale-factor 1.0`
thread 'main' panicked at 'Argument short must be unique

	-s is already in use', /home/alamb/.cargo/registry/src/github.jparrowsec.cn-1ecc6299db9ec823/clap-2.34.0/src/app/parser.rs:190:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

```text
error[E0433]: failed to resolve: use of undeclared type `WriterProperties`
   --> benchmarks/src/bin/parquet_filter_pushdown.rs:235:17
    |
235 |     let props = WriterProperties::builder()
    |                 ^^^^^^^^^^^^^^^^ not found in this scope
```
Yeah, I just removed one of them. I don't think batch size needs to be a CLI option in this benchmark.
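For illustration only, a sketch of how the structopt options could be declared so that only one of them claims the `-s` short flag. The field names and defaults here are assumptions based on the command line shown above, not the exact code in this PR:

```rust
use std::path::PathBuf;
use structopt::StructOpt;

#[derive(Debug, StructOpt)]
#[structopt(name = "parquet_filter_pushdown")]
struct Opt {
    /// Path to the directory holding the generated Parquet data
    #[structopt(long = "path", parse(from_os_str))]
    path: PathBuf,

    /// Scale factor for the generated data set (keeps the `-s` short flag)
    #[structopt(short = "s", long = "scale-factor", default_value = "1.0")]
    scale_factor: f64,

    /// Batch size when reading Parquet files (long-only, so it no longer clashes with `-s`)
    #[structopt(long = "batch-size", default_value = "8192")]
    batch_size: usize,
}
```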
@thinkharderdev thanks for your great benchmark.
There are at most two pages in one column. I think that if we adjust the data to get more pages per column (for example by reducing the page size), the pushdown-enabled configuration will show a bigger improvement. FYI, I see Impala chooses a fixed number of rows per page when benchmarking to get good performance.
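As a hedged illustration of that suggestion: the parquet crate's `WriterProperties` builder exposes a byte-based data page size limit that can be lowered to produce more, smaller pages per column chunk (a row-count-based page limit needs the newer arrow-rs change referenced later in this thread). The method name and the 64 KiB value below are assumptions for the sketch:

```rust
use parquet::file::properties::WriterProperties;

fn small_page_writer_properties() -> WriterProperties {
    // Lower the byte-based page size limit so each column chunk is split into
    // more, smaller pages (the value here is only an example).
    WriterProperties::builder()
        .set_data_pagesize_limit(64 * 1024)
        .build()
}
```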
Thank you @thinkharderdev -- I plan to review this PR in detail later today
This looks great -- thank you @thinkharderdev
I also verified the Parquet file that was created:

```text
$ du -s -h /tmp/data/logs.parquet
988M	/tmp/data/logs.parquet
```

It looks good to me (using the neat pqrs tool from @manojkarthick):
```text
alamb@aal-dev:~/2022-10-05-slow-query-high-cardinality$ pqrs cat --csv /tmp/data/logs.parquet | head
############################
File: /tmp/data/logs.parquet
############################
service,host,pod,container,image,time,client_addr,request_duration_ns,request_user_agent,request_method,request_host,request_bytes,response_bytes,response_status
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000000000+00:00,127.216.178.64,-1261239112,rkxttrfiiietlsaygzphhwlqcgngnumuphliejmxfdznuurswhdcicrlprbnocibvsbukiohjjbjdygwbfhxqvurm,PUT,https://backend.mydomain.com,-312099516,1448834362,200
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000001024+00:00,187.49.24.179,1374800428,sdxkctvmvuqxhwigrhjaouwdzvasqlqphymcgqvfmsbjswswnzgvanmalnmvsvruakcudmqvzateabhlya,PATCH,https://backend.mydomain.com,-1363067408,111176598,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000002048+00:00,14.29.229.168,-1795280692,bhlvymbbtgcqrwzujukyotusnsoidygnklhx,GET,https://backend.mydomain.com,-1323615082,-705662117,400
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000003072+00:00,180.188.29.17,-717290117,hjaynltdswdekcguqmrkucsepzqjhasklmimkibabijihitimmsglgettywifdzmraipvyvekczuwxettayslrffyz,HEAD,https://backend.mydomain.com,-1847395296,1206750179,200
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000004096+00:00,68.92.115.208,759902764,yupopowlaqbwskdwvtlitugpzzxoajhvnmndhca,DELETE,https://backend.mydomain.com,-50170254,-415949533,403
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000005120+00:00,230.160.203.201,-1271567754,pwbruedgdgtsavjuksxwkecxulbnjbsaltuvcjxcmblhnraawouvrunwwsmvjbq,GET,https://backend.mydomain.com,-1193079450,1281912293,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000006144+00:00,249.254.50.191,-971196614,amtuqookzibtvrtqfnyzuyesikbrafhcfnjhoaoedvmlwpkypfsedtbbwlbnzigwgjpzcwdxtwhrykcibmhlxnkckynvgli,PATCH,https://backend.mydomain.com,-262774709,-1695212300,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000007168+00:00,77.183.81.164,-547300163,ogkufdxssjqzjphxwvegwvofchpsgntbyslgarcyqcawokzfoppdftoctmtlwcvikazwrujlgrzrlqueaaceibxvdicfhp,HEAD,https://backend.mydomain.com,-1349820595,-327759246,204
backend,i-1ec3ca3151468928.ec2.internal,aqcathnxqsphdhgjtgvxsfyiwbmhlmg,backend_container_0,backend_container_0@sha256:30375999bf03beec2187843017b10c9e88d8b1a91615df4eb6350fb39472edd9,1970-01-01T00:00:00.000008192+00:00,63.17.88.115,-88404773,ogardohhoorttptpnkxmvyenqfzvvkjabcrfwapoywttjdunvmlgwgstmsjbefxqta,HEAD,https://backend.mydomain.com,1830978558,,200
Error: ArrowReadWriteError(CsvError("Broken pipe (os error 32)"))
```
```rust
let generator = Generator::new();

let file = File::create(&path).unwrap();
let mut writer = ArrowWriter::try_new(file, generator.schema.clone(), None).unwrap();
```
I wonder if we should make the properties used here explicit?
Like maybe explicitly setting what type of statistics are created, as well as potentially setting ZSTD compression:
https://docs.rs/parquet/24.0.0/parquet/file/properties/struct.WriterPropertiesBuilder.html
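To make that suggestion concrete, a hedged sketch of what explicit properties might look like, assuming the parquet crate's `WriterProperties` builder; the exact statistics level and compression choice are up for discussion, and the helper name is hypothetical:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::datatypes::Schema;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

fn create_writer(path: &str, schema: Arc<Schema>) -> ArrowWriter<File> {
    let file = File::create(path).expect("create output file");
    let props = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Page) // write page-level statistics for pruning
        .set_compression(Compression::ZSTD)              // compress data pages with ZSTD
        .build();
    ArrowWriter::try_new(file, schema, Some(props)).expect("create ArrowWriter")
}
```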
Yeah, I'll need to revisit this again once apache/arrow-rs#2854 is released and pulled in so we can generate the files with proper page sizes (which should make a significant difference).
Co-authored-by: Andrew Lamb <[email protected]>
I stole the gen code from @tustvold so you know it works :)
CI check is unrelated: #3798
```rust
combine_filters(&[
    col("request_method").not_eq(lit("GET")),
    col("response_status").eq(lit(400_u16)),
    // TODO this fails in the FilterExec with Error: Internal("The type of Dictionary(Int32, Utf8) = Utf8 of binary physical should be same")
```
Coercion!
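Until the planner coerces this automatically, one hedged workaround (assuming a DataFusion version that exports a `cast` expression helper; the function path here is an assumption) is to cast the dictionary-encoded column explicitly so both sides of the comparison share a type:

```rust
use datafusion::arrow::datatypes::DataType;
use datafusion::logical_expr::{cast, col, lit, Expr};

fn request_method_filter() -> Expr {
    // Cast the Dictionary(Int32, Utf8) column to plain Utf8 before comparing,
    // sidestepping the "binary physical should be same" error from FilterExec.
    cast(col("request_method"), DataType::Utf8).not_eq(lit("GET"))
}
```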
Benchmark runs are scheduled for baseline = ae5b23e and contender = fb39d5d. fb39d5d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #3457
Rationale for this change
We need a set of benchmarks for evaluating the performance implications of Parquet predicate pushdown. This PR sets up some very basic benchmarks which can be used for that purpose. Thanks to @tustvold for cooking up a script to generate synthetic datasets for this purpose.
What changes are included in this PR?
Adds a new benchmark script, `parquet_filter_pushdown`, which executes a series of `ParquetExec` plans with different filter predicates. For each predicate in the suite, the plan is executed with all three different `ParquetScanOptions` configurations.
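As a rough, hedged sketch of those three configurations (the struct shape and field names below are assumptions, not necessarily what the benchmark defines):

```rust
// Hypothetical shape of the per-run scan configuration.
#[derive(Debug, Clone, Copy)]
struct ParquetScanOptions {
    pushdown_filters: bool,
    reorder_filters: bool,
    enable_page_index: bool,
}

fn scan_options_matrix() -> Vec<ParquetScanOptions> {
    vec![
        // Baseline: no pushdown, filters evaluated in a FilterExec above the scan
        ParquetScanOptions { pushdown_filters: false, reorder_filters: false, enable_page_index: false },
        // Row-level filter pushdown inside the Parquet reader
        ParquetScanOptions { pushdown_filters: true, reorder_filters: true, enable_page_index: false },
        // Pushdown plus page-index based pruning
        ParquetScanOptions { pushdown_filters: true, reorder_filters: true, enable_page_index: true },
    ]
}
```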
Are there any user-facing changes?
No