sql: always include rows with minimum/maximum histogram values in statistics samples #83730
Labels
A-sql-table-stats
Table statistics (and their automatic refresh).
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-sql-queries
SQL Queries Team
To collect histograms for columns in a table, we sample up to 10k rows at random. For large tables, this means that there is a significant chance that histograms will not cover a range of minimum or maximum values of a column. Queries for values outside of the minimum and maximum histogram bounds often have poor query plans as a result (see #64570 and #83431).
Ideally, we could retain any rows in the sample that contain a minimum or maximum value for histogram. This would require the sample to keep track of the minimum and maximum values seen of each column and assign a low rank when added to the sampler so that the row is sampled and not dropped. However, the ranking mechanism doesn't seem appropriate because we'd likely end up with a sample containing only rows with values near the minimum and maximum. So, when finding a row with a new minimum or maximum value, we need to evict the previous row with a minimum or maximum from the sample (or assign it a random rank so it can get randomly evicted). Rows containing minimums and maximums of multiple columns complicate this further.
Epic: CRDB-16930
Jira issue: CRDB-17227
The text was updated successfully, but these errors were encountered: