-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix limited statistic collection accross files with no stats #4521
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thank you @isidentical
let table = ListingTable::try_new(config)?; | ||
|
||
let exec = table.scan(&state, None, &[], None).await?; | ||
assert_eq!(exec.statistics().num_rows, None); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
num_rows = if let Some(num_rows) = num_rows { | ||
Some(num_rows + file_stats.num_rows.unwrap_or(0)) | ||
} else { | ||
file_stats.num_rows | ||
}; | ||
total_byte_size = if let Some(total_byte_size) = total_byte_size { | ||
Some(total_byte_size + file_stats.total_byte_size.unwrap_or(0)) | ||
} else { | ||
file_stats.total_byte_size | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are into a more functional style of coding, you can do something like the following as well
num_rows = if let Some(num_rows) = num_rows { | |
Some(num_rows + file_stats.num_rows.unwrap_or(0)) | |
} else { | |
file_stats.num_rows | |
}; | |
total_byte_size = if let Some(total_byte_size) = total_byte_size { | |
Some(total_byte_size + file_stats.total_byte_size.unwrap_or(0)) | |
} else { | |
file_stats.total_byte_size | |
}; | |
num_rows = num_rows | |
.map(|num_rows| num_rows + file_stats.num_rows.unwrap_or(0)) | |
.or(file_stats.num_rows); | |
total_byte_size = total_byte_size | |
.map(|total_byte_size| total_byte_size + file_stats.total_byte_size.unwrap_or(0)) | |
.or(file_stats.total_byte_size); |
cc @mingmwang |
Benchmark runs are scheduled for baseline = 8d36529 and contender = 067d044. 067d044 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #4323.
Rationale for this change
Even in the cases where none of the files included stats/byte size information,
get_statistics_with_limit
was generating statistics with the claims of zero rows / zero bytes. This PR fixes it to only include the number of rows when it is known.What changes are included in this PR?
For preserving the old relaxed behavior, this PR changes the collection to include row/byte count information when any of them are present in any of the files. If it is only present in even one file, that is enough for the listing (all the others are counted towards zero rows, which is the existing behavior).
Are these changes tested?
Yes.
Are there any user-facing changes?
No.