
opt: add cost penalty for scans with large cardinality #66979

Merged · 2 commits · Jun 28, 2021

Conversation

@rytaft (Collaborator) commented Jun 28, 2021

opt: ensure we prefer a reverse scan to sorting a forward scan

This commit fixes an issue where in some edge cases the optimizer would
prefer sorting the output of a forward scan over performing a reverse scan
(when there is no need to sort the output of the reverse scan).

Release note (performance improvement): The optimizer now prefers a
reverse scan over a forward scan + sort when the reverse scan eliminates
the need for a sort and the plans are otherwise equivalent. This was
already the behavior in most cases, but some edge cases with a small
number of rows have been fixed.
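The edge case is easy to reproduce with a toy cost model. Everything below is an illustrative assumption, not CockroachDB's actual coster (which lives in pkg/sql/opt/xform/coster.go): with a tiny row count the sort term vanishes, so a naive model can rank forward scan + sort ahead of a reverse scan.

```go
package main

import (
	"fmt"
	"math"
)

// Toy cost model; the constants and formulas are assumptions chosen only
// to exhibit the small-row-count edge case this commit fixes.
const (
	seqIOCostFactor  = 1.0 // per-row forward scan cost (assumed)
	revScanSurcharge = 0.1 // per-row reverse scan surcharge (assumed)
)

func forwardScanPlusSort(rows float64) float64 {
	sortCost := 0.0
	if rows > 1 {
		sortCost = rows * math.Log2(rows) // comparison sort ~ n log n
	}
	return rows*seqIOCostFactor + sortCost
}

func reverseScan(rows float64) float64 {
	return rows * (seqIOCostFactor + revScanSurcharge)
}

func main() {
	// Many rows: the reverse scan wins comfortably because it avoids the sort.
	fmt.Println(reverseScan(1000) < forwardScanPlusSort(1000))
	// One row: the sort is free, so the naive model prefers forward + sort.
	// That is the surprising plan choice the commit removes.
	fmt.Println(reverseScan(1) < forwardScanPlusSort(1))
}
```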

opt: add cost penalty for scans with large cardinality

This commit adds a new cost function, largeCardinalityRowCountPenalty,
which calculates a penalty that should be added to the row count of scans.
It is non-zero for expressions with unbounded maximum cardinality or with
maximum cardinality exceeding the row count estimate. Adding a few rows'
worth of cost helps prevent surprising plans for very small tables or when
stats are stale.

Fixes #64570

Release note (performance improvement): When choosing between index
scans that are estimated to have the same number of rows, the optimizer
now prefers indexes for which it has higher certainty about the maximum
number of rows over indexes for which there is more uncertainty in the
estimated row count. This helps to avoid choosing suboptimal plans for
small tables or if the statistics are stale.
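A minimal sketch of how such a penalty can work. The function shape, the penalty value, and the unbounded-cardinality encoding below are all assumptions for illustration (the real largeCardinalityRowCountPenalty in pkg/sql/opt/xform/coster.go may compute the value differently); the point is that the penalty is zero only when the maximum cardinality is a reliable, tight bound.

```go
package main

import "fmt"

// Illustrative sketch only: the real largeCardinalityRowCountPenalty in
// pkg/sql/opt/xform/coster.go may differ in both shape and constants.
const (
	largeCardinalityPenalty = 10.0       // "a few rows' worth of cost" (assumed value)
	unboundedMax            = ^uint32(0) // stand-in for unbounded maximum cardinality
)

type cardinality struct {
	min, max uint32
}

// largeCardinalityRowCountPenalty returns extra rows of cost for scans whose
// maximum cardinality is unbounded or exceeds the estimated row count.
func largeCardinalityRowCountPenalty(card cardinality, rowCount float64) float64 {
	if card.max == unboundedMax {
		return largeCardinalityPenalty
	}
	if float64(card.max) > rowCount {
		// Scale by how far the bound exceeds the estimate, capped at the
		// full penalty, so a nearly tight bound is barely penalized.
		penalty := float64(card.max) - rowCount
		if penalty > largeCardinalityPenalty {
			penalty = largeCardinalityPenalty
		}
		return penalty
	}
	// The cardinality bound is at or below the estimate: no uncertainty.
	return 0
}

func main() {
	fmt.Println(largeCardinalityRowCountPenalty(cardinality{0, unboundedMax}, 1)) // full penalty
	fmt.Println(largeCardinalityRowCountPenalty(cardinality{0, 1}, 1))            // tight bound, no penalty
}
```

With this shape, two index scans that both estimate 1 row no longer tie: the scan whose constraint proves a maximum cardinality of 1 costs less than an unconstrained scan that merely estimates 1 row.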

@rytaft rytaft requested review from RaduBerinde, cucaroach and a team June 28, 2021 17:10

@RaduBerinde (Member) left a comment

:lgtm:

Thanks for this! Surprising (in a good way) that none of the "xform/external" plans are affected!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @rytaft)


pkg/sql/opt/xform/coster.go, line 146 at r2 (raw file):

	// unbounded maximum cardinality. This helps prevent surprising plans for very
	// small tables or for when stats are stale.
	unboundedMaxCardinalityScanRowCountPenalty = fullScanRowCountPenalty

For full table scans, this is added on top of the fullScanRowPenalty? That should be clarified here.


pkg/sql/opt/xform/testdata/coster/scan, line 300 at r2 (raw file):

----

exec-ddl

[nit] add a comment explaining what these stats are trying to achieve. We're trying to make the (y,v) index more appetizing for finding (10,10,10), yes? Surprised why the histogram on x doesn't contain 10, doesn't this make the index on (x,y) just as good (in that we expect 0 rows)? (maybe the multi-col stats override that? though the test would be less effective if we improved stats to consider the histogram on x too)


pkg/sql/opt/xform/testdata/coster/scan, line 423 at r2 (raw file):


opt
UPSERT INTO t64570 VALUES (10, 10, 10)

[nit] add a comment explaining what we're looking for - a scan of the PK with cardinality 1.

@cucaroach (Contributor) left a comment

Reviewed 5 of 5 files at r1, 12 of 41 files at r2, 3 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @rytaft)

@cucaroach (Contributor) left a comment

Reviewed 29 of 41 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)

@rytaft (Collaborator, Author) left a comment

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @RaduBerinde)


pkg/sql/opt/xform/coster.go, line 146 at r2 (raw file):

Previously, RaduBerinde wrote…

For full table scans, this is added on top of the fullScanRowPenalty? That should be clarified here.

Done.


pkg/sql/opt/xform/testdata/coster/scan, line 300 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] add a comment explaining what these stats are trying to achieve. We're trying to make the (y,v) index more appetizing for finding (10,10,10), yes? Surprised why the histogram on x doesn't contain 10, doesn't this make the index on (x,y) just as good (in that we expect 0 rows)? (maybe the multi-col stats override that? though the test would be less effective if we improved stats to consider the histogram on x too)

These were the stats created from the example in the issue. The multi-column stats aren't important here, so I've removed them. The reason this test case works is that we don't use histograms when the cardinality is less than 100, so we estimate 1 row for the primary index scan based on the distinct count and cardinality. But we do use the histograms when the cardinality is unbounded, so we estimate 0 rows for the secondary index scan. Here's the comment from statistics_builder.go (I completely forgot that we did this...):

	// This is the minimum cardinality an expression should have in order to make
	// it worth adding the overhead of using a histogram.
	minCardinalityForHistogram = 100
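The estimation rule rytaft describes can be sketched as a simple predicate. minCardinalityForHistogram is the real constant from statistics_builder.go; shouldUseHistogram and the encoding of "unbounded" as a negative value are hypothetical simplifications for illustration.

```go
package main

import "fmt"

// Real constant name from statistics_builder.go; the helper below is a
// hypothetical simplification of the surrounding logic.
const minCardinalityForHistogram = 100

// shouldUseHistogram reports whether the row-count estimate for an
// expression should consult a histogram. Unbounded maximum cardinality is
// modeled here as maxCardinality < 0.
func shouldUseHistogram(maxCardinality int) bool {
	if maxCardinality < 0 {
		// Unbounded cardinality: the histogram overhead is worth it.
		return true
	}
	return maxCardinality >= minCardinalityForHistogram
}

func main() {
	// A constrained PK scan with cardinality 1 skips the histogram, so its
	// estimate falls back to distinct counts (1 row in the test case); an
	// unbounded secondary-index scan uses the histogram (0 rows).
	fmt.Println(shouldUseHistogram(1))
	fmt.Println(shouldUseHistogram(-1))
}
```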

pkg/sql/opt/xform/testdata/coster/scan, line 423 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] add a comment explaining what we're looking for - a scan of the PK with cardinality 1.

Done.

@RaduBerinde (Member) left a comment

Reviewed 1 of 5 files at r1, 39 of 41 files at r2, 3 of 3 files at r3, 2 of 2 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)

@rytaft (Collaborator, Author) commented Jun 28, 2021

I'll let this bake for a week or so and then backport.

bors r+

@craig (bot) commented Jun 28, 2021

Build succeeded:

@craig craig bot merged commit 045d42e into cockroachdb:master Jun 28, 2021
@rytaft (Collaborator, Author) commented Jul 8, 2021

The backport to 20.2 is not clean, so I think I will only backport this to 21.1.

Successfully merging this pull request may close these issues.

opt: avoid choosing index with unreliable stats