opt: avoid choosing index with unreliable stats #64570
Given that statistics are always out of date, and they are an estimate, would it be reasonable to put a limit on how low statistics can push an estimated row count? For example, what if statistics could lower the row count estimate to no less than 1? Only contradictions are guaranteed to make a row count 0. In that case it might have to be a lower limit of 1 plus some epsilon, so that the cost doesn't match the primary index lookup.
It would be nice to get to this in the 21.2 release. cc @kevin-v-ngo @awoods187 for visibility.
I'm pro doing this in 21.2, provided the opportunity cost isn't too high. Let's discuss the level of effort during the milestone planning meeting.
Saw another instance of this in the wild. The use case involved a "status" column and a partial index that restricts the status to "pending". The vast majority of rows have "done" status, with occasionally a few thousand rows showing as "pending". Automatic stats can happen to see zero "pending" rows or a few thousand of them, depending on timing. When stats show zero rows, the partial index is estimated to be empty and becomes an enticing index for the optimizer. In this case, the reliable plan was to use the PK, which guaranteed that we scan at most one row.
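A minimal SQL sketch of that scenario (the table, column values, and index names are mine, not from the report):

```sql
CREATE TABLE tasks (
  id INT PRIMARY KEY,
  status STRING NOT NULL
);

-- Partial index covering only the transient "pending" rows.
CREATE INDEX pending_idx ON tasks (id) WHERE status = 'pending';

-- If stats happened to be collected while zero rows were 'pending', the
-- partial index is estimated to be empty and looks enticing to the
-- optimizer, even though the primary key lookup is guaranteed to scan
-- at most one row.
SELECT * FROM tasks WHERE id = 42 AND status = 'pending';
```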
This commit adds a new cost function, `largeCardinalityRowCountPenalty`, which calculates a penalty that should be added to the row count of scans. It is non-zero for expressions with unbounded maximum cardinality or with maximum cardinality exceeding the row count estimate. Adding a few rows' worth of cost helps prevent surprising plans for very small tables or for when stats are stale. Fixes cockroachdb#64570. Release note (performance improvement): When choosing between index scans that are estimated to have the same number of rows, the optimizer now prefers indexes for which it has higher certainty about the maximum number of rows over indexes for which there is more uncertainty in the estimated row count. This helps to avoid choosing suboptimal plans for small tables or when the statistics are stale.
66973: ui: surface the transaction restarts chart r=matthewtodd a=matthewtodd

Resolves #65856

Release note (ui change): The KV transaction restarts chart was moved from the "distributed" metrics page to the "sql" metrics page so as to be close to the SQL transactions chart, for more prominent visibility.

66979: opt: add cost penalty for scans with large cardinality r=rytaft a=rytaft

**opt: ensure we prefer a reverse scan to sorting a forward scan**

This commit fixes an issue where, in some edge cases, the optimizer would prefer sorting the output of a forward scan over performing a reverse scan (when there is no need to sort the output of the reverse scan).

Release note (performance improvement): The optimizer now prefers performing a reverse scan over a forward scan + sort if the reverse scan eliminates the need for a sort and the plans are otherwise equivalent. This was already the case in most situations, but some edge cases with a small number of rows have been fixed.

**opt: add cost penalty for scans with large cardinality**

This commit adds a new cost function, `largeCardinalityRowCountPenalty`, which calculates a penalty that should be added to the row count of scans. It is non-zero for expressions with unbounded maximum cardinality or with maximum cardinality exceeding the row count estimate. Adding a few rows' worth of cost helps prevent surprising plans for very small tables or for when stats are stale.

Fixes #64570

Release note (performance improvement): When choosing between index scans that are estimated to have the same number of rows, the optimizer now prefers indexes for which it has higher certainty about the maximum number of rows over indexes for which there is more uncertainty in the estimated row count. This helps to avoid choosing suboptimal plans for small tables or when the statistics are stale.

Co-authored-by: Matthew Todd <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Unfortunately, it seems the cost-penalty fix is not robust enough to handle some variations of this issue. For example, if the alternative plan has cardinality 10, we may not choose it. This will require some more thought, so I'll reopen this issue for now.
Just saw another variation on this. It's similar to the example from Radu above, but slightly different: there is an index whose histograms suggest that the relevant value matches no rows, so the optimizer estimates the scan as empty. The better plan would be to scan a different index, and unfortunately a cardinality bound cannot help us here.

I've run into problems like this running Postgres in the past. Our solution was to make automated stats collection much more aggressive. If a table gets very large, automatic stats collection is very unlikely to run if it is triggered by some percentage of rows being mutated. Imagine a table that gets 100k inserts per day. It's been around for 1,000 days, so it now has 100m rows; with a trigger threshold of 20% of rows being stale, stats would only be refreshed roughly every 200 days. To make automatic stats much more aggressive, you can lower `sql.stats.automatic_collection.fraction_stale_rows` so that collection is triggered by far fewer changed rows.

Postgres has the ability to adjust these stats knobs at the table level. I don't believe we have that ability yet, but it would be useful for this; a user needs the ability to tune these knobs per table based on the workload. Here's how to tune auto-stats collection in Postgres for a specific table:
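Something along these lines (the table name is a placeholder; `autovacuum_analyze_scale_factor` and `autovacuum_analyze_threshold` are Postgres's standard per-table autovacuum storage parameters):

```sql
-- Re-analyze after a fixed number of changed rows rather than a fraction
-- of the table, so stats stay fresh even as the table grows.
ALTER TABLE my_big_table SET (
  autovacuum_analyze_scale_factor = 0.0,  -- ignore the table's size
  autovacuum_analyze_threshold = 50000    -- analyze after ~50k changed rows
);
```

With these settings, autoanalyze fires once roughly 50k rows have changed, no matter how large the table has grown.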
Ran into this on the DRT cluster.
Hit it on another cluster as well.
Informs cockroachdb#64570 Informs cockroachdb#130201 Release note (sql change): The `optimizer_min_row_count` session setting has been added, which sets a lower bound on row count estimates for relational expressions during query planning. A value of zero, which is the default, indicates no lower bound. Note that if this is set to a value greater than zero, a row count of zero can still be estimated for expressions with a cardinality of zero, e.g., for a contradictory filter. Setting this to a value higher than 0, such as 1, may yield better query plans in some cases, such as when statistics are frequently stale and inaccurate.
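For reference, opting in is a one-line session change (a usage sketch based on the release note above):

```sql
-- Never let a stale histogram push an estimate below one row; only
-- contradictions (cardinality zero) can still produce a zero estimate.
SET optimizer_min_row_count = 1;
```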
139256: sql/rowenc: reduce index key prefix calls r=annrpom a=annrpom

This patch removes redundant calls to `MakeIndexKeyPrefix` during the construction of `IndexEntry`s by saving each first-time call in a map that we can later look up. Previously, we would make this call for each row; however, as the prefix (table id + index id) for a particular index remains the same, we do not need to do any recomputation. Epic: CRDB-42901. Fixes: #137798. Release note: None.

**sql/rowexec: run BenchmarkIndexBackfill on-disk**

Epic: none. Release note: None.

139955: crosscluster/logical: use large test pool r=rickystewart a=msbutler

This package spins up several TestServers. We've also seen many tests time out, likely due to resource exhaustion. Informs #138277. Informs #139673. Release note: none.

140057: sqlsmith: randomly generate auto partial stats in sqlsmith tests r=rytaft a=rytaft

Epic: None. Release note: None.

140065: opt: add `optimizer_min_row_count` session setting r=mgartner a=mgartner

Informs #64570. Informs #130201.

Release note (sql change): The `optimizer_min_row_count` session setting has been added, which sets a lower bound on row count estimates for relational expressions during query planning. A value of zero, which is the default, indicates no lower bound. Note that if this is set to a value greater than zero, a row count of zero can still be estimated for expressions with a cardinality of zero, e.g., for a contradictory filter. Setting this to a value higher than 0, such as 1, may yield better query plans in some cases, such as when statistics are frequently stale and inaccurate.

140203: server: split join_list to its own file r=RaduBerinde a=andrewbaptist

Previously, StoreSpec and JoinListType were in the same file. This commit moves JoinListType and its test to a separate file. Epic: CRDB-41111. Release note: None.

Co-authored-by: Annie Pompa <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
A customer ran into a case where they were doing a single-row UPSERT. Instead of choosing the primary index (which would scan at most 1 row), the optimizer chose a secondary index. That index was chosen because, according to the histogram, the relevant value would have no rows. But the stats were stale, and the query actually read through 100k+ rows from the index.
We have discussed augmenting the cost value with an "uncertainty range", which would address this problem (the primary index has <=1 row with 100% certainty; the secondary index has an expected 0 rows but no upper bound). That would be a big change, but I believe we can also consider a more targeted fix: e.g. we could give a heavy cost "discount" to scan operators that have a known cardinality bound (or a penalty to scans with no cardinality bound).
Below is an illustration. The `t_y_v_idx` index could in principle return any number of rows, whereas the correct plan, a constrained scan of the primary index, would touch at most one row.
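A minimal sketch of the shape of the problem, reconstructed from the index name above (the schema and query are my own, not the original reproduction):

```sql
CREATE TABLE t (
  k INT PRIMARY KEY,
  y INT,
  v INT,
  INDEX t_y_v_idx (y, v)
);

-- Both indexes can serve this query: the primary index is constrained to
-- k = 1 and returns at most one row, while t_y_v_idx is constrained to
-- (y, v) = (2, 3) with no bound on how many rows it may return. If a stale
-- histogram claims no rows match y = 2, the t_y_v_idx scan is costed as
-- nearly free and wins, even though the primary index plan is reliably cheap.
SELECT * FROM t WHERE k = 1 AND y = 2 AND v = 3;
```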
gz#9150
Jira issue: CRDB-7134
Jira issue: CRDB-13889
gz#16142
gz#18109