Fix limited statistic collection accross files with no stats #4521

isidentical · 2022-12-05T21:25:56Z

Which issue does this PR close?

Closes #4323.

Rationale for this change

Even when none of the files included stats or byte-size information, get_statistics_with_limit still generated statistics claiming zero rows / zero bytes. This PR fixes it so that the number of rows is reported only when it is known.

What changes are included in this PR?

To preserve the old relaxed behavior, this PR changes the collection to include row/byte count information when it is present in any of the files. Even a single file that carries the information is enough for the listing; every file without it is counted as contributing zero rows, which is the existing behavior.
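A minimal sketch of the accumulation rule described above, for illustration only. `FileStats`, `accumulate`, and `merge_stats` are hypothetical stand-ins and not DataFusion's actual types or functions; the sketch only assumes that each file exposes an optional row count and an optional byte size.

```rust
/// Hypothetical stand-in for per-file statistics (not DataFusion's real type).
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct FileStats {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Fold one file's optional count into the running total.
/// The total stays `None` only while no file has reported a value;
/// once any file reports one, missing values count as zero.
fn accumulate(total: Option<usize>, file: Option<usize>) -> Option<usize> {
    match (total, file) {
        (None, None) => None,
        (t, f) => Some(t.unwrap_or(0) + f.unwrap_or(0)),
    }
}

/// Merge the statistics of all files into a single summary.
fn merge_stats(files: &[FileStats]) -> FileStats {
    files.iter().fold(FileStats::default(), |acc, f| FileStats {
        num_rows: accumulate(acc.num_rows, f.num_rows),
        total_byte_size: accumulate(acc.total_byte_size, f.total_byte_size),
    })
}
```

With this rule, a listing in which no file has statistics yields `None` for both fields instead of a misleading zero, while a listing in which only one file has statistics reports that file's counts.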

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

alamb

LGTM -- thank you @isidentical

alamb · 2022-12-15T20:18:25Z

datafusion/core/src/datasource/listing/table.rs

+        let table = ListingTable::try_new(config)?;
+
+        let exec = table.scan(&state, None, &[], None).await?;
+        assert_eq!(exec.statistics().num_rows, None);


alamb · 2022-12-15T20:24:03Z

datafusion/core/src/datasource/mod.rs

+        num_rows = if let Some(num_rows) = num_rows {
+            Some(num_rows + file_stats.num_rows.unwrap_or(0))
+        } else {
+            file_stats.num_rows
+        };
+        total_byte_size = if let Some(total_byte_size) = total_byte_size {
+            Some(total_byte_size + file_stats.total_byte_size.unwrap_or(0))
+        } else {
+            file_stats.total_byte_size
+        };


If you are into a more functional style of coding, you can do something like the following as well

Suggested change

-        num_rows = if let Some(num_rows) = num_rows {
-            Some(num_rows + file_stats.num_rows.unwrap_or(0))
-        } else {
-            file_stats.num_rows
-        };
-        total_byte_size = if let Some(total_byte_size) = total_byte_size {
-            Some(total_byte_size + file_stats.total_byte_size.unwrap_or(0))
-        } else {
-            file_stats.total_byte_size
-        };
+        num_rows = num_rows
+            .map(|num_rows| num_rows + file_stats.num_rows.unwrap_or(0))
+            .or(file_stats.num_rows);
+        total_byte_size = total_byte_size
+            .map(|total_byte_size| total_byte_size + file_stats.total_byte_size.unwrap_or(0))
+            .or(file_stats.total_byte_size);
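As a rough check that the two styles in this suggestion behave the same, the sketch below isolates each form as a free function over plain `Option<usize>` values and compares them on a few inputs. The function names are hypothetical and exist only for this illustration; they are not part of the PR.

```rust
/// The `if let` form from the PR, reduced to plain `Option<usize>` values.
fn merge_if_let(total: Option<usize>, file: Option<usize>) -> Option<usize> {
    if let Some(total) = total {
        Some(total + file.unwrap_or(0))
    } else {
        file
    }
}

/// The `map` / `or` form from the suggestion.
fn merge_functional(total: Option<usize>, file: Option<usize>) -> Option<usize> {
    total.map(|total| total + file.unwrap_or(0)).or(file)
}

/// Compare both forms on every combination of a few sample values.
fn forms_agree() -> bool {
    let cases: [Option<usize>; 3] = [None, Some(0), Some(7)];
    cases.iter().all(|&total| {
        cases
            .iter()
            .all(|&file| merge_if_let(total, file) == merge_functional(total, file))
    })
}
```

When the running total is `Some`, `map` always produces `Some`, so the `or` fallback is used only when the total is still `None`, which mirrors the `else` branch.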

alamb · 2022-12-15T20:24:33Z

cc @mingmwang

ursabot · 2022-12-17T11:11:51Z

Benchmark runs are scheduled for baseline = 8d36529 and contender = 067d044. 067d044 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the core (Core DataFusion crate) label Dec 5, 2022

Fix limited statistic collection accross files with no stats

096afb5

isidentical force-pushed the gh-4323 branch from ab0ce07 to 096afb5 Compare December 9, 2022 20:27

isidentical marked this pull request as ready for review December 14, 2022 20:30

alamb approved these changes Dec 15, 2022


alamb merged commit 067d044 into apache:master Dec 17, 2022
