
colbuilder: fall back to row-by-row processor wrapping for many renders #85822

Merged
merged 2 commits into cockroachdb:master on Aug 9, 2022

Conversation

@yuzefovich (Member) commented Aug 9, 2022

colbuilder: add a microbenchmark for running many render expressions

This commit adds a microbenchmark of queries with many render
expressions. It'll be used in the following commit to tune when we fall
back to wrapping a row-by-row processor to handle those renders.

Release note: None
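
For illustration, a minimal sketch of what such a benchmark could look like. This is not the actual RenderPlanning benchmark from the PR; the table name, connection string, and query shape are all assumptions, but the subtest names match the rows/renders grid in the results below.

```go
package bench

import (
	"database/sql"
	"fmt"
	"strings"
	"testing"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func BenchmarkRenderPlanning(b *testing.B) {
	// Assumes a local single-node cluster on the default port with a
	// table t(k INT) already created and populated.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		b.Fatal(err)
	}
	defer db.Close()

	for _, rows := range []int{1, 4, 16, 64, 128, 256} {
		for _, renders := range []int{1, 8, 32, 64, 128, 512, 4096} {
			b.Run(fmt.Sprintf("rows=%d/renders=%d", rows, renders), func(b *testing.B) {
				// Build "SELECT k+1, k+2, ..., k+renders FROM t LIMIT rows".
				exprs := make([]string, renders)
				for i := range exprs {
					exprs[i] = fmt.Sprintf("k+%d", i+1)
				}
				query := fmt.Sprintf("SELECT %s FROM t LIMIT %d",
					strings.Join(exprs, ", "), rows)
				b.ResetTimer()
				for i := 0; i < b.N; i++ {
					if _, err := db.Exec(query); err != nil {
						b.Fatal(err)
					}
				}
			})
		}
	}
}
```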

colbuilder: fall back to row-by-row processor wrapping for many renders

This commit introduces a mechanism to handle render expressions by
wrapping a row-by-row processor into the vectorized flow when

  1. the estimated number of rows to go through the renders is relatively
     small, and
  2. the number of renders is relatively high.

The idea is that the vectorized projection operators have higher
overhead when many of them are planned AND there is not enough data to
amortize that overhead, so in those cases we fall back to the row-by-row
noop processor. Both thresholds are controlled by cluster settings, and
the defaults were chosen based on a representative microbenchmark.

It's worth pointing out that we only have the estimated row count for
the scan operators, so the change has limited applicability.

```
RenderPlanning/rows=1/renders=1-24           407µs ± 2%     408µs ± 2%     ~     (p=0.684 n=10+10)
RenderPlanning/rows=1/renders=8-24           516µs ± 1%     537µs ± 1%   +4.05%  (p=0.000 n=10+10)
RenderPlanning/rows=1/renders=32-24          832µs ± 1%     811µs ± 1%   -2.59%  (p=0.000 n=10+10)
RenderPlanning/rows=1/renders=64-24         1.22ms ± 0%    1.14ms ± 1%   -6.62%  (p=0.000 n=9+10)
RenderPlanning/rows=1/renders=128-24        2.02ms ± 0%    1.80ms ± 1%  -11.18%  (p=0.000 n=8+9)
RenderPlanning/rows=1/renders=512-24        7.75ms ± 1%    5.75ms ± 1%  -25.77%  (p=0.000 n=10+9)
RenderPlanning/rows=1/renders=4096-24        160ms ± 1%      62ms ± 1%  -61.51%  (p=0.000 n=10+9)
RenderPlanning/rows=4/renders=1-24           438µs ± 2%     438µs ± 1%     ~     (p=0.853 n=10+10)
RenderPlanning/rows=4/renders=8-24           603µs ± 1%     633µs ± 2%   +5.06%  (p=0.000 n=10+10)
RenderPlanning/rows=4/renders=32-24         1.08ms ± 1%    1.08ms ± 1%     ~     (p=0.105 n=10+10)
RenderPlanning/rows=4/renders=64-24         1.72ms ± 0%    1.62ms ± 0%   -5.83%  (p=0.000 n=9+9)
RenderPlanning/rows=4/renders=128-24        3.01ms ± 1%    2.75ms ± 1%   -8.78%  (p=0.000 n=10+10)
RenderPlanning/rows=4/renders=512-24        11.6ms ± 1%     9.6ms ± 2%  -17.58%  (p=0.000 n=10+10)
RenderPlanning/rows=4/renders=4096-24        192ms ± 2%      91ms ± 2%  -52.58%  (p=0.000 n=10+10)
RenderPlanning/rows=16/renders=1-24          494µs ± 1%     499µs ± 1%   +1.03%  (p=0.006 n=10+8)
RenderPlanning/rows=16/renders=8-24          855µs ± 1%     901µs ± 1%   +5.37%  (p=0.000 n=10+10)
RenderPlanning/rows=16/renders=32-24        2.03ms ± 1%    2.04ms ± 1%     ~     (p=0.190 n=10+10)
RenderPlanning/rows=16/renders=64-24        3.58ms ± 1%    3.42ms ± 1%   -4.56%  (p=0.000 n=10+9)
RenderPlanning/rows=16/renders=128-24       6.74ms ± 1%    6.31ms ± 1%   -6.37%  (p=0.000 n=10+10)
RenderPlanning/rows=16/renders=512-24       26.9ms ± 1%    24.7ms ± 1%   -8.24%  (p=0.000 n=9+10)
RenderPlanning/rows=16/renders=4096-24       329ms ± 2%     218ms ± 2%  -33.66%  (p=0.000 n=10+10)
RenderPlanning/rows=64/renders=1-24          666µs ± 1%     659µs ± 2%   -1.07%  (p=0.007 n=10+10)
RenderPlanning/rows=64/renders=8-24         1.79ms ± 1%    1.84ms ± 1%   +3.01%  (p=0.000 n=10+10)
RenderPlanning/rows=64/renders=32-24        5.53ms ± 1%    5.79ms ± 2%   +4.74%  (p=0.000 n=10+10)
RenderPlanning/rows=64/renders=64-24        10.8ms ± 1%    11.0ms ± 1%   +1.87%  (p=0.000 n=10+9)
RenderPlanning/rows=64/renders=128-24       21.2ms ± 1%    21.7ms ± 1%   +2.71%  (p=0.000 n=10+10)
RenderPlanning/rows=64/renders=512-24       83.6ms ± 0%    84.9ms ± 0%   +1.47%  (p=0.000 n=10+7)
RenderPlanning/rows=64/renders=4096-24       824ms ± 1%     751ms ± 2%   -8.88%  (p=0.000 n=10+10)
RenderPlanning/rows=128/renders=1-24         853µs ± 1%     851µs ± 1%     ~     (p=0.481 n=10+10)
RenderPlanning/rows=128/renders=8-24        2.98ms ± 1%    3.11ms ± 1%   +4.32%  (p=0.000 n=10+10)
RenderPlanning/rows=128/renders=32-24       10.4ms ± 1%    10.9ms ± 1%   +5.44%  (p=0.000 n=10+10)
RenderPlanning/rows=128/renders=64-24       20.1ms ± 1%    21.3ms ± 1%   +5.99%  (p=0.000 n=10+10)
RenderPlanning/rows=128/renders=128-24      39.7ms ± 1%    42.1ms ± 2%   +5.98%  (p=0.000 n=10+10)
RenderPlanning/rows=128/renders=512-24       160ms ± 1%     168ms ± 2%   +5.13%  (p=0.000 n=9+10)
RenderPlanning/rows=128/renders=4096-24      1.44s ± 1%     1.48s ± 2%   +3.15%  (p=0.000 n=9+10)
RenderPlanning/rows=256/renders=1-24        1.22ms ± 1%    1.21ms ± 1%   -1.01%  (p=0.000 n=10+10)
RenderPlanning/rows=256/renders=8-24        5.22ms ± 0%    5.19ms ± 1%   -0.54%  (p=0.011 n=8+9)
RenderPlanning/rows=256/renders=32-24       19.9ms ± 1%    20.0ms ± 1%     ~     (p=0.182 n=9+10)
RenderPlanning/rows=256/renders=64-24       39.0ms ± 0%    38.9ms ± 0%   -0.33%  (p=0.023 n=10+10)
RenderPlanning/rows=256/renders=128-24      76.8ms ± 1%    76.7ms ± 1%     ~     (p=0.739 n=10+10)
RenderPlanning/rows=256/renders=512-24       316ms ± 1%     319ms ± 1%   +1.15%  (p=0.001 n=9+10)
RenderPlanning/rows=256/renders=4096-24      2.75s ± 1%     2.73s ± 1%   -0.64%  (p=0.002 n=8+9)
```

Fixes: #85632.

Release note: None

@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich yuzefovich marked this pull request as ready for review August 9, 2022 16:42
@yuzefovich yuzefovich requested a review from a team as a code owner August 9, 2022 16:42
@yuzefovich (Member, Author) commented

It's worth mentioning that it's not the overhead of creating the projection operators that drives the increase in latency (as I initially thought), but the overhead of allocating new vectors during execution: we allocate a vector of, say, capacity 64 and never reuse it, whereas in the row-by-row engine we allocate one row (roughly speaking). The overhead of the former is not amortized when we have a small number of rows, and that's why I added a cluster setting to control that number too.

@DrewKimball (Collaborator) commented

> It's worth mentioning that it's not the overhead of creating the projection operators that drives the increase in latency (as I initially thought), but the overhead of allocating new vectors during execution: we allocate a vector of, say, capacity 64 and never reuse it, whereas in the row-by-row engine we allocate one row (roughly speaking). The overhead of the former is not amortized when we have a small number of rows, and that's why I added a cluster setting to control that number too.

Where exactly are the allocations being made? Is it in colexecop.OneInputHelper to create the output vectors, or is it part of the logic for specific projection operators (scratch vecs for example)? I'm wondering if the vectorized engine would work well enough in this instance if colexecop.OneInputHelper started off by returning smaller batches, then dynamically increased the size over time.

@yuzefovich (Member, Author) left a comment

The allocations are made in vectorTypeEnforcer which appends a new output vector for each projection operator. The size of the batch is determined by the cFetcher, and there we do want to put everything into a single batch if we have an estimated row count. vectorTypeEnforcer then uses the capacity of the batch as chosen by the cFetcher to allocate its vector.

When we don't have an estimate, we do exactly as you describe, but in the presence of an estimate, dynamically growing batches/vectors would probably be even worse: say we have an estimate of 100, then we'd allocate batches/vectors with capacities 1, 2, 4, 8, 16, 32, 64 without any reuse, whereas if we use the estimate, we have only a single allocation of capacity 100.
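
To make the arithmetic concrete, a toy sketch (assumed names, not actual engine code) of the geometric-growth allocation pattern described above:

```go
package sketch

// dynamicGrowthAllocations counts the allocations made when batch
// capacity doubles from 1 until the estimated row count is covered. For
// estimate=100 it returns 7 allocations totaling 127 cells, none of them
// reused, versus a single allocation of capacity 100 when the estimate
// is used up front.
func dynamicGrowthAllocations(estimate int) (allocs, cells int) {
	for capacity := 1; cells < estimate; capacity *= 2 {
		allocs++
		cells += capacity // capacities 1, 2, 4, 8, 16, 32, 64, ...
	}
	return allocs, cells
}
```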

It's possible that the never-reused memory allocations aren't actually to blame for the slowdown - I didn't look too deeply since this seems like an edge case, generally speaking.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @rytaft)

@DrewKimball (Collaborator) left a comment

> vectorTypeEnforcer then uses the capacity of the batch as chosen by the cFetcher to allocate its vector.

Sorry, I wasn't being clear - I was thinking vectorTypeEnforcer could read a batch as normal, then return smaller windows into the batch with a smaller output vector appended until the end of the batch is reached. If another batch is then read, the window size could be increased.

That being said, this change seems simpler for now. :lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)

@yuzefovich (Member, Author) commented

> I was thinking vectorTypeEnforcer could read a batch as normal, then return smaller windows into the batch with a smaller output vector appended until the end of the batch is reached. If another batch is then read, the window size could be increased.

I see - it's an interesting idea, although it seems a bit dangerous to implement. Currently I think the contract is that coldata.Batch.Capacity is at most the capacity of each vector inside of the batch, and implementing the suggestion would break that contract. We could introduce a light wrapper on coldata.Batch that overrides the Capacity implementation, but then we'd probably run into complexities with the selection vector referencing positions in the vectors that exceed the capacity.
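
A rough sketch of the wrapper idea being set aside here (hypothetical, not part of the PR), showing exactly where the contract would break:

```go
package sketch

import "github.com/cockroachdb/cockroach/pkg/col/coldata"

// windowedBatch embeds a full-capacity coldata.Batch but reports a
// smaller window as its capacity.
type windowedBatch struct {
	coldata.Batch
	windowCap int
}

// Capacity overrides the embedded implementation, so it would no longer
// hold that Capacity is at most the capacity of each vector in the
// batch - the contract violation discussed above.
func (w *windowedBatch) Capacity() int { return w.windowCap }
```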

Given that this seems like an edge case, it might not be worth all that extra complexity - if we saw perf regressions with a small number of renders or with a large number of rows, I'd be more worried.

TFTR!

bors r+


craig bot commented Aug 9, 2022

Build succeeded.

@craig craig bot merged commit 8bcd0cd into cockroachdb:master Aug 9, 2022
blathers-crl bot commented Aug 9, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from f5a28fb to blathers/backport-release-21.2-85822: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


error creating merge commit from f5a28fb to blathers/backport-release-22.1-85822: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Successfully merging this pull request may close these issues:

colexec: improve planning and execution of many projection operators in a single query