
opt: unbounded max cardinality penalty makes number of index columns dominate scan cost #68556

Closed
mgartner opened this issue Aug 6, 2021 · 1 comment · Fixed by #68676
Labels
A-sql-optimizer SQL logical planning and optimizations. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team

Comments

@mgartner
Collaborator

mgartner commented Aug 6, 2021

In #66979 we introduced an unboundedMaxCardinalityScanRowCountPenalty, applied to constrained scans whose maximum cardinality is unbounded. This increased the likelihood that, with stale table statistics, the optimizer would choose a scan on a primary index or unique secondary index, where the number of potential rows scanned is bounded, over a scan on a non-unique secondary index, where it is unbounded.
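
For intuition, here is a simplified sketch of how such a row-count penalty might be applied (the constant's value, the function signature, and the helper names are illustrative assumptions, not the actual coster.go code):

// Illustrative sketch only; the constant's value and these names are assumptions.
const unboundedMaxCardinalityScanRowCountPenalty = 10.0

// penalizedScanRowCount bumps the estimated row count of a constrained scan
// whose maximum cardinality is unbounded, so that with stale statistics it
// costs more than a scan with a bounded number of potential rows.
func penalizedScanRowCount(estimatedRows float64, constrained, unboundedMaxCardinality bool) float64 {
	rowCount := estimatedRows
	if constrained && unboundedMaxCardinality {
		rowCount += unboundedMaxCardinalityScanRowCountPenalty
	}
	return rowCount
}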

This change had a negative side effect. With stale table statistics that make a query look extremely selective (e.g., filtering for a value greater than the maximum histogram upper bound), the number of columns in an index dominates the cost of a scan. This causes the optimizer to favor a scan of an index with fewer columns over a scan of an index with more columns, even if the former requires an additional filter and/or index join after the scan.
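
Roughly speaking (a simplified approximation, not the exact coster.go formula), a scan's cost is (rowCount + penalty) × perRowCost, and perRowCost grows with the number of columns the scan reads. When stale statistics drive rowCount toward zero (e.g., the rows=1e-09 estimate in the plan below), the cost degenerates to roughly penalty × perRowCost, so the index with fewer columns always looks cheaper, no matter what extra select or index join is required on top of it.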

The test below shows that the index with fewer columns is chosen even though an extra select and index join are needed. See the optstepsweb output too.

exec-ddl
CREATE TABLE t (
  k INT PRIMARY KEY,
  a INT,
  b INT,
  c INT,
  INDEX abc (a, b) STORING (c),
  INDEX bc (b) STORING (c)
)
----

# Inject stats for 100k rows where the upper bound of both a and b is 10.
exec-ddl
ALTER TABLE t INJECT STATISTICS '[
    {
        "columns": [
            "k"
        ],
        "created_at": "2021-06-28 14:20:18.273949",
        "distinct_count": 3,
        "histo_col_type": "INT8",
        "null_count": 0,
        "row_count": 100000
    },
    {
        "columns": [
            "a"
        ],
        "created_at": "2021-06-28 14:20:18.273949",
        "distinct_count": 100,
        "histo_buckets": [
            {"distinct_range": 0, "num_eq": 0, "num_range": 0, "upper_bound": "0"},
            {"distinct_range": 100, "num_eq": 0, "num_range": 100000, "upper_bound": "10"}
        ],
        "histo_col_type": "INT8",
        "null_count": 0,
        "row_count": 100000
    },
    {
        "columns": [
            "b"
        ],
        "created_at": "2021-06-28 14:20:18.273949",
        "distinct_count": 100,
        "histo_buckets": [
            {"distinct_range": 0, "num_eq": 0, "num_range": 0, "upper_bound": "0"},
            {"distinct_range": 100, "num_eq": 0, "num_range": 100000, "upper_bound": "10"}
        ],
        "histo_col_type": "INT8",
        "null_count": 0,
        "row_count": 100000
    }
]'
----

# Query for values of a and b that are outside the histogram buckets.
# A scan on the bc index with an additional filter and index join is chosen over
# a scan on the abc index with no additional filter and no index join.
opt format=(show-cost,show-stats)
SELECT * FROM t WHERE a = 11 AND b = 11
----
select
 ├── columns: k:1!null a:2!null b:3!null c:4
 ├── stats: [rows=1e-09, distinct(2)=1e-09, null(2)=0, distinct(3)=1e-09, null(3)=0, distinct(2,3)=1e-09, null(2,3)=0]
 ├── cost: 14.6400007
 ├── index-join t
 │    ├── columns: k:1!null a:2 b:3 c:4
 │    ├── stats: [rows=1e-07]
 │    ├── cost: 14.6200007
 │    └── scan t@bc
 │         ├── columns: k:1!null b:3!null c:4
 │         ├── constraint: /3/1: [/11 - /11]
 │         ├── stats: [rows=1e-07, distinct(3)=1e-07, null(3)=0]
 │         └── cost: 14.6100001
 └── filters
      └── a:2 = 11
@mgartner mgartner added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-optimizer SQL logical planning and optimizations. labels Aug 6, 2021
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Aug 6, 2021
@mgartner mgartner self-assigned this Aug 6, 2021
@mgartner mgartner changed the title opt: unbounded max cardinality makes number of index column dominate scan cost opt: unbounded max cardinality penalty makes number of index column dominate scan cost Aug 6, 2021
@mgartner mgartner changed the title opt: unbounded max cardinality penalty makes number of index column dominate scan cost opt: unbounded max cardinality penalty makes number of index columns dominate scan cost Aug 6, 2021
@mgartner
Collaborator Author

mgartner commented Aug 7, 2021

I think the solution might be something like normalizing the large cardinality row count penalty by the perRowCost:

diff --git a/pkg/sql/opt/xform/coster.go b/pkg/sql/opt/xform/coster.go
index 084d02337e..3ae0340e89 100644
--- a/pkg/sql/opt/xform/coster.go
+++ b/pkg/sql/opt/xform/coster.go
@@ -656,7 +656,8 @@ func (c *coster) computeScanCost(scan *memo.ScanExpr, required *physical.Require
        // Add a penalty if the cardinality exceeds the row count estimate. Adding a
        // few rows worth of cost helps prevent surprising plans for very small tables
        // or for when stats are stale.
-       rowCount += c.largeCardinalityRowCountPenalty(scan.Relational().Cardinality, rowCount)
+       rowCount += c.largeCardinalityRowCountPenalty(scan.Relational().Cardinality, rowCount) /
+               float64(perRowCost)

        if required.LimitHint != 0 {
                rowCount = math.Min(rowCount, required.LimitHint)

This prevents the number of columns in the index from dominating the cost, since dividing the penalty by perRowCost roughly cancels the later multiplication by the per-row cost, leaving the penalty as a near-constant contribution to the final cost. However, when I tried this, quite a few plans in the tests changed, so there's likely more to this.

rytaft added a commit to rytaft/cockroach that referenced this issue Aug 10, 2021
This commit tweaks the application of the unbounded cardinality penalty
in the coster to add it directly to the cost of scans and zigzag joins
rather than to the row count. This helps prevent an issue in which the
number of index columns could dominate the scan cost and result in
suboptimal plans.

Fixes cockroachdb#68556

Release note (bug fix): Fixed a regression in the optimizer's cost model
that could cause it to choose suboptimal plans when choosing between two
non-unique index scans with different numbers of columns per index.
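
As a rough sketch of the difference between the two approaches (illustrative only; these function and constant names are assumptions, not the actual coster.go code), the adopted fix adds the penalty directly to the cost instead of to the row count, so it is no longer multiplied by the per-row cost:

// Illustrative sketch only; names and values are assumptions.
const unboundedCardinalityCostPenalty = 10.0

// Before: the penalty inflates the row count, so it is later multiplied by the
// per-row cost, which grows with the number of columns the scan reads.
func scanCostWithRowCountPenalty(rowCount, perRowCost float64, unboundedMaxCardinality bool) float64 {
	if unboundedMaxCardinality {
		rowCount += unboundedCardinalityCostPenalty
	}
	return rowCount * perRowCost
}

// After: the penalty is added directly to the cost, contributing the same flat
// amount regardless of how many columns the index stores.
func scanCostWithCostPenalty(rowCount, perRowCost float64, unboundedMaxCardinality bool) float64 {
	cost := rowCount * perRowCost
	if unboundedMaxCardinality {
		cost += unboundedCardinalityCostPenalty
	}
	return cost
}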
craig bot pushed a commit that referenced this issue Aug 11, 2021
68676: opt: adjust cost of scan with unbounded cardinality to avoid bad plans r=rytaft a=rytaft

Co-authored-by: Rebecca Taft <[email protected]>
@craig craig bot closed this as completed in 9a3d1ab Aug 11, 2021
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023