
rowexec: increase the batch size for join reader ordering strategy #84433

Merged
merged 1 commit into cockroachdb:master from the bump-jr-ordering-batch-size branch on Jul 15, 2022

Conversation

yuzefovich
Member

This commit increases the default value of the
sql.distsql.join_reader_ordering_strategy.batch_size cluster setting
(which determines the input row batch size for lookup joins when
ordering needs to be maintained) from 10KiB to 100KiB, since the previous
number was likely too conservative. We have seen support
cases (https://github.com/cockroachlabs/support/issues/1627) where
bumping up this setting was needed to get reasonable performance, and we
also set it to 100KiB in our TPC-E setup
(https://github.com/cockroachlabs/tpc-e/blob/d47d3ea5ce71ecb1be5e0e466a8aa7c94af63d17/tier-a/src/schema.rs#L404).

I did some historical digging, and I believe that the number 10KiB was
chosen somewhat arbitrarily with no real justification in #48058. That
PR changed how we measure the input row batches from the number of rows
to the memory footprint of the input rows. Prior to that change we had
a 100-row limit, so my guess is that 10KiB was somewhat derived from that
number.

The reason we don't want this batch size to be too large is that we
buffer all looked up rows in a disk-backed container, so if too many
responses come back (because we included too many input rows in the
batch), that container has to spill to disk. To make sure we don't
regress in such scenarios, this commit teaches the join reader to lower
the batch size bytes limit if the container does spill to disk, down to
a minimum of 10KiB.

Release note: None
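
As an illustration of the shrink-on-spill behavior described above, here is a minimal Go sketch; the type and field names (lookupBatcher, onSpill, and so on) are hypothetical, and the halving factor is an assumption made for this sketch rather than the actual join reader code:

```go
package main

import "fmt"

const (
	minBatchSizeBytes     = 10 << 10  // 10KiB floor, per the commit message
	defaultBatchSizeBytes = 100 << 10 // new 100KiB default
)

// lookupBatcher tracks the current input-row batch size limit used by the
// ordering strategy.
type lookupBatcher struct {
	batchSizeBytes int64
}

// onSpill lowers the batch size limit after the disk-backed container holding
// the looked up rows spills to disk. Halving is an assumption for this sketch;
// the commit only states that the limit is lowered toward a 10KiB minimum.
func (b *lookupBatcher) onSpill() {
	b.batchSizeBytes /= 2
	if b.batchSizeBytes < minBatchSizeBytes {
		b.batchSizeBytes = minBatchSizeBytes
	}
}

func main() {
	b := &lookupBatcher{batchSizeBytes: defaultBatchSizeBytes}
	for i := 0; i < 5; i++ {
		b.onSpill() // pretend every batch of lookups spilled
		fmt.Println("batch size limit (bytes):", b.batchSizeBytes)
	}
}
```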

@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the bump-jr-ordering-batch-size branch from 269bbe2 to 4ea7ead Compare July 14, 2022 17:11
@yuzefovich yuzefovich marked this pull request as ready for review July 14, 2022 17:22
@yuzefovich yuzefovich requested a review from a team as a code owner July 14, 2022 17:22
@yuzefovich
Member Author

The micro-benchmarks are here. They show that the only noticeable slowdown is when each input row results in many looked up rows (matchratio=onetosixtyfour), as well as when there are many input rows and the memory is limited. This slowdown is expected, but it seems acceptable to me given that this is a somewhat unrealistic setup (we spill to disk, but never get a chance to lower the batch size dynamically since the input rows run out first).

Collaborator

@DrewKimball DrewKimball left a comment


Should we be increasing the batch size in the case when we don't use the streamer? My understanding was that part of the reason we hadn't increased the batch size before is that it might increase the likelihood of OOMs, since the old path isn't as careful about handling memory as the streamer.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2)

@yuzefovich yuzefovich force-pushed the bump-jr-ordering-batch-size branch from 4ea7ead to c3c1480 Compare July 14, 2022 20:49
Member Author

@yuzefovich yuzefovich left a comment


That's a good question.

I think it is safe to increase the batch size here regardless of whether the streamer is used. We need to consider two cases:

  1. The KV requests are parallelized and no memory limits are set on BatchRequests - this is the case when the lookup columns form a key, meaning that each lookup can return at most one result. Given this at-most-one-row behavior, I think it would be safe to increase the batch size limit even further than the currently proposed 100KiB (e.g. we use 2MiB when ordering doesn't have to be maintained).
  2. The KV requests are not parallelized, so the memory limits are set - this is the case when each input row can result in many lookup rows. Here we do get the protection of the TargetBytes limit, so we should be safe to increase the batch size.

I believe there is no difference between "lookup join with ordering" and "lookup join with no ordering" from the OOM stability perspective - both strategies have the same KV request parallelization behavior, so we could use exactly the same batch size for both (i.e. 2MiB). However, the difference between the two strategies is that the former has to buffer the looked up rows in a container, possibly spilling to disk, so if we use too large a batch size, performance will drop. Thus, we chose a much smaller batch size for the ordering case. I believe the value of 10KiB was too conservative and we should bump it up, especially given the very simple change to lower it when we do spill to disk, and I don't anticipate an increase in OOMs because of it.
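
To make the numbers above concrete, here is a hypothetical Go sketch of how the two strategies might pick their initial batch size limits; the function and constant names are illustrative and not taken from the actual code:

```go
package main

import "fmt"

const (
	noOrderingBatchSizeBytes = 2 << 20   // 2MiB when ordering does not have to be maintained
	orderingBatchSizeBytes   = 100 << 10 // 100KiB default for the ordering strategy
)

// initialBatchSizeBytes picks the starting input-row batch size limit for a
// lookup join. The ordering strategy starts much smaller because it has to
// buffer all looked up rows in a disk-backed container before emitting them.
func initialBatchSizeBytes(maintainOrdering bool) int64 {
	if maintainOrdering {
		return orderingBatchSizeBytes
	}
	return noOrderingBatchSizeBytes
}

func main() {
	fmt.Println(initialBatchSizeBytes(false)) // 2097152 (2MiB)
	fmt.Println(initialBatchSizeBytes(true))  // 102400 (100KiB)
}
```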

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2)

Collaborator

@DrewKimball DrewKimball left a comment


:lgtm:

Thanks for the explanation! Do you think it would be worth allowing the batch size to grow dynamically as well, up to a larger limit? Maybe using some strategy like bumping up the batch size if the row container isn't exceeding half of its capacity.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2)
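
For illustration, a rough Go sketch of the grow-and-shrink heuristic suggested above (grow while the in-memory row container stays under half of its budget, shrink after a spill); the PR itself only adds a TODO for this idea, and all names, factors, and the upper bound below are hypothetical:

```go
package main

import "fmt"

const (
	minBatchBytes = 10 << 10 // 10KiB floor from the commit
	maxBatchBytes = 2 << 20  // hypothetical cap, e.g. the 2MiB no-ordering limit
)

// adjustBatchSize grows the limit when the previous batch left the in-memory
// row container less than half full and shrinks it after a spill to disk.
// The doubling/halving factors and the cap are assumptions for this sketch.
func adjustBatchSize(cur, containerUsed, containerBudget int64, spilled bool) int64 {
	switch {
	case spilled:
		cur /= 2
	case containerUsed < containerBudget/2:
		cur *= 2
	}
	if cur < minBatchBytes {
		cur = minBatchBytes
	}
	if cur > maxBatchBytes {
		cur = maxBatchBytes
	}
	return cur
}

func main() {
	cur := int64(100 << 10)                           // start at the 100KiB default
	cur = adjustBatchSize(cur, 10<<10, 64<<20, false) // container mostly empty: grow
	fmt.Println(cur)                                  // 204800
	cur = adjustBatchSize(cur, 64<<20, 64<<20, true)  // spilled: shrink
	fmt.Println(cur)                                  // 102400
}
```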

@yuzefovich yuzefovich force-pushed the bump-jr-ordering-batch-size branch from c3c1480 to e9f1c05 Compare July 14, 2022 22:59
Member Author

@yuzefovich yuzefovich left a comment


Yes, this is a good point. Growing and shrinking the limit dynamically seems more difficult, so I added a TODO for that idea.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)

@craig
Contributor

craig bot commented Jul 15, 2022

Build succeeded:

@craig craig bot merged commit e8818d5 into cockroachdb:master Jul 15, 2022
@yuzefovich yuzefovich deleted the bump-jr-ordering-batch-size branch July 15, 2022 01:30