rowexec: increase the batch size for join reader ordering strategy #84433
Conversation
Force-pushed from 269bbe2 to 4ea7ead.
The micro-benchmarks are here. They show that the only noticeable slowdown is when each input row results in many looked-up rows.
Should we be increasing the batch size in the case when we don't use the streamer? My understanding was that part of the reason we hadn't increased the batch size before is that it might have increased the likelihood of OOMs, since the old path isn't as careful about handling memory as the streamer.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @michae2)
Force-pushed from 4ea7ead to c3c1480.
That's a good question.
I think we should be safe to increase the batch size here regardless of whether the streamer is used. We need to consider two cases:
- the KV requests are parallelized and no memory limits are set on BatchRequests - this is the case when the lookup columns form a key, meaning that each lookup can return at most one result. Given this at-most-one-lookup-row behavior, I think it would be safe to increase the batch size limit even further than the currently proposed 100KiB (e.g. we use 2MiB when ordering doesn't have to be maintained).
- the KV requests are not parallelized, so memory limits are set - this is the case when each input row can result in many lookup rows. Here we do get the protection of the TargetBytes limit, so we should be safe to increase the batch size.
I believe there is no difference between "lookup join with ordering" and "lookup join with no ordering" from the OOM-stability perspective - both strategies have the same KV request parallelization behavior, so we could use exactly the same batch size for both (i.e. 2MiB). However, the difference between the two strategies is that the former has to buffer the looked-up rows in a container, possibly spilling to disk, so if we use too large a batch size, performance will drop. Thus, we chose a much smaller batch size for the ordering case. I believe the value of 10KiB was too conservative and we should bump it up, especially since it is a very simple change to lower it if we do spill to disk (a rough sketch of that lowering follows after this comment), and I don't anticipate an increase in OOMs because of it.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @michae2)
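To make the spill-triggered lowering mentioned above concrete, here is a minimal Go sketch under assumed names: `batchSizer`, `onSpill`, and the halving policy are hypothetical illustrations rather than the actual joinReaderOrderingStrategy code, which only specifies that the limit is lowered toward a 10KiB minimum.

```go
// Hypothetical sketch of lowering the input batch size limit whenever the
// disk-backed container that buffers looked-up rows spills to disk.
package sketch

const (
	minBatchSizeBytes     = 10 << 10  // 10KiB, treated as the floor
	defaultBatchSizeBytes = 100 << 10 // 100KiB, the proposed new default
)

// batchSizer tracks the current input row batch size limit used by the
// ordering strategy.
type batchSizer struct {
	limitBytes int64
}

func newBatchSizer() *batchSizer {
	return &batchSizer{limitBytes: defaultBatchSizeBytes}
}

// onSpill lowers the limit after an observed spill. Halving is an assumed
// policy for illustration; the commit only says the limit is lowered, with
// 10KiB as the minimum.
func (b *batchSizer) onSpill() {
	b.limitBytes /= 2
	if b.limitBytes < minBatchSizeBytes {
		b.limitBytes = minBatchSizeBytes
	}
}
```

Reacting only to observed spills keeps the common case at the larger limit while bounding regressions for inputs that produce many looked-up rows.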
Thanks for the explanation! Do you think it would be worth it to allow the batch size to grow dynamically as well, up to a larger limit? Maybe using some strategy like bumping up the batch size if the row container isn't exceeding half capacity.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @michae2)
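The dynamic-growth idea suggested above is not implemented in this PR (a TODO is added instead, as noted in the next comment), but a hypothetical sketch of such a heuristic could look like the following; the half-capacity threshold, the 2MiB ceiling, and every name here are assumptions for illustration only.

```go
// Hypothetical sketch of growing/shrinking the batch size limit based on
// how the row container behaved while processing the previous batch.
package sketch

const (
	minLimitBytes = 10 << 10 // 10KiB floor
	maxLimitBytes = 2 << 20  // 2MiB ceiling (the no-ordering strategy's size)
)

// containerStats summarizes the row container's behavior for one batch.
type containerStats struct {
	memUsedBytes   int64
	memBudgetBytes int64
	spilledToDisk  bool
}

// nextLimit doubles the limit while the container stayed under half of its
// memory budget, halves it after a spill, and clamps the result.
func nextLimit(cur int64, stats containerStats) int64 {
	switch {
	case stats.spilledToDisk:
		cur /= 2
	case stats.memUsedBytes < stats.memBudgetBytes/2:
		cur *= 2
	}
	if cur < minLimitBytes {
		cur = minLimitBytes
	}
	if cur > maxLimitBytes {
		cur = maxLimitBytes
	}
	return cur
}
```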
Force-pushed from c3c1480 to e9f1c05.
Yes, this is a good point. Growing and shrinking the limit dynamically seems more difficult, so I added a TODO for that idea.
TFTR!
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)
Build succeeded.
This commit increases the default value of the
sql.distsql.join_reader_ordering_strategy.batch_size
cluster setting (which determines the input row batch size for lookup
joins when ordering needs to be maintained) from 10KiB to 100KiB, since
the previous number is likely to have been too conservative. We have
seen support cases (https://github.com/cockroachlabs/support/issues/1627)
where bumping up this setting was needed to get reasonable performance,
and we also change this setting to 100KiB in our TPC-E setup
(https://github.com/cockroachlabs/tpc-e/blob/d47d3ea5ce71ecb1be5e0e466a8aa7c94af63d17/tier-a/src/schema.rs#L404).
I did some historical digging, and I believe that the number 10KiB was
chosen somewhat arbitrarily, with no real justification, in #48058. That
PR changed how we measure the input row batches from the number of rows
to the memory footprint of the input rows. Prior to that change we had
a 100-row limit, so my guess is that 10KiB was somewhat dependent on
that number.
The reason we don't want this batch size to be too large is that we
buffer all looked-up rows in a disk-backed container, so if too many
responses come back (because we included too many input rows in the
batch), that container has to spill to disk. To make sure we don't
regress in such scenarios, this commit teaches the join reader to lower
the batch size bytes limit if the container does spill to disk, down to
a minimum of 10KiB.
Release note: None
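For reference, a hedged sketch of roughly how the setting and its bumped default could be registered in pkg/sql/rowexec follows; the settings class, the description string, and the exact RegisterByteSizeSetting arguments are assumptions here, not a quote of the actual code.

```go
package rowexec

import "github.com/cockroachdb/cockroach/pkg/settings"

// JoinReaderOrderingStrategyBatchSize caps the memory footprint of the
// input rows included in a single lookup batch when the join reader must
// maintain ordering. Exact registration arguments are assumed.
var JoinReaderOrderingStrategyBatchSize = settings.RegisterByteSizeSetting(
	settings.TenantWritable,
	"sql.distsql.join_reader_ordering_strategy.batch_size",
	"size limit on the input rows for lookup joins that maintain ordering",
	100<<10, /* 100KiB, previously 10KiB */
)
```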