-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
colexec: optimize the external sort for top K case #66303
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding tests for this new code is important.
Also, any benchmark changes of note?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)
pkg/sql/colexec/external_sort.go, line 421 at r1 (raw file):
// the new partition). s.outputUnlimitedAllocator.ReleaseMemory(s.outputUnlimitedAllocator.Used()) // Make sure to close out all partitions we have just read from.
Was this being missing before a bug?
pkg/sql/colexec/external_sort.go, line 468 at r1 (raw file):
// If the current partition reaches the desired topK number of tuples, a zero // batch is enqueued and true is returned indicating that the current partition // is done.
I don't understand how this works - why is it okay to stop creating a partition once we reach k values inside the partition? Unlike the top k sort, which uses the heap on every input, we don't do that hear, right? Or do we? Also, when does top k sort spill in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have unit tests for the external sort which runs "top K" test cases too, so we do get the correctness test coverage for free (in fact, the tests caught a bug in my WIP).
What missing is the test that verifies that each partition is at most K tuples (which is actually not true for merging phase - there we're counting with the "batch granularity"), but testing that is annoying and is not worth it IMO. Let me know if you think otherwise.
I've just added another commit to extend the benchmark and will share the results shortly.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @michae2)
pkg/sql/colexec/external_sort.go, line 421 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Was this being missing before a bug?
No, this is only needed if we do not exhaust partitions we're reading from, but previously we would always fully exhaust each old partition. When zero-batch is dequeued from the partition, the partition is closed for reading automatically. So this change has an effect only in the case of top K sort because we might stop reading from the partitions once we have at least K tuples.
Extended the comment.
pkg/sql/colexec/external_sort.go, line 468 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I don't understand how this works - why is it okay to stop creating a partition once we reach k values inside the partition? Unlike the top k sort, which uses the heap on every input, we don't do that hear, right? Or do we? Also, when does top k sort spill in the first place?
The crucial observation here is that b
is not coming from the input to the sort operation as a whole - b
is coming to us either from the in-memory top K sorter (which has already performed the sort over a subset of tuples, with input partitioner
defining the boundaries of that subset) or from the merger (which performs the merge of N already sorted partitions while preserving the order of tuples). So we are implicitly using the heap of the in-memory top K sorter here. Expanded the comment.
Top K sort will spill to disk whenever current "top K" rows (those stored in the heap and the AppendOnlyBufferedBatch) exceed the workmem limit. That limit can be exceeded when either K is very large or rows are very large in size. In the former case the current PR doesn't improve things (and probably makes things slightly worse than before), but in the latter case the current PR should be an improvement.
However, I think the former case should be addressed in a different manner - we should decide during planning that K is too large and use the general sort, we actually have a TODO for this:
cockroach/pkg/sql/colexec/colbuilder/execplan.go
Lines 393 to 394 in 549022d
// TODO(radu): we should not choose this processor when K is very large | |
// - it is slower unless we get significantly more rows than the limit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code (assuming no benchmark surprises), thanks for the improved comments and explanation
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmarks look good:
name old time/op new time/op delta
ExternalSort/rows=2048/cols=1/topK/spilled=false-24 67.2µs ± 5% 66.9µs ± 2% ~ (p=0.921 n=10+9)
ExternalSort/rows=2048/cols=1/topK/spilled=true-24 3.75ms ± 5% 3.47ms ±37% ~ (p=0.220 n=6+10)
ExternalSort/rows=2048/cols=2/topK/spilled=false-24 266µs ± 5% 263µs ± 3% ~ (p=0.353 n=10+10)
ExternalSort/rows=2048/cols=2/topK/spilled=true-24 4.07ms ± 2% 3.95ms ± 7% -2.76% (p=0.015 n=8+8)
ExternalSort/rows=2048/cols=4/topK/spilled=false-24 291µs ± 8% 283µs ± 3% -2.98% (p=0.019 n=10+10)
ExternalSort/rows=2048/cols=4/topK/spilled=true-24 4.40ms ± 4% 4.07ms ± 4% -7.42% (p=0.000 n=8+8)
ExternalSort/rows=16384/cols=1/topK/spilled=false-24 316µs ± 1% 309µs ± 2% -2.17% (p=0.001 n=9+10)
ExternalSort/rows=16384/cols=1/topK/spilled=true-24 4.09ms ± 4% 4.00ms ± 4% -2.24% (p=0.050 n=8+8)
ExternalSort/rows=16384/cols=2/topK/spilled=false-24 737µs ± 2% 751µs ± 3% +1.82% (p=0.023 n=10+10)
ExternalSort/rows=16384/cols=2/topK/spilled=true-24 5.78ms ±17% 4.28ms ±25% -26.03% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=false-24 796µs ± 3% 807µs ± 5% ~ (p=0.393 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=true-24 6.95ms ±14% 4.43ms ±23% -36.24% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=1/topK/spilled=false-24 4.52ms ± 3% 4.54ms ± 2% ~ (p=0.905 n=10+9)
ExternalSort/rows=262144/cols=1/topK/spilled=true-24 10.5ms ± 8% 8.1ms ±12% -22.91% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=false-24 7.17ms ± 4% 7.35ms ± 2% +2.49% (p=0.004 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=true-24 42.6ms ± 5% 10.1ms ± 1% -76.19% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=false-24 7.40ms ± 2% 7.48ms ± 3% ~ (p=0.190 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=true-24 70.6ms ± 3% 10.4ms ± 3% -85.22% (p=0.000 n=10+9)
name old speed new speed delta
ExternalSort/rows=2048/cols=1/topK/spilled=false-24 244MB/s ± 5% 245MB/s ± 2% ~ (p=0.921 n=10+9)
ExternalSort/rows=2048/cols=1/topK/spilled=true-24 4.16MB/s ±16% 4.90MB/s ±36% ~ (p=0.080 n=8+10)
ExternalSort/rows=2048/cols=2/topK/spilled=false-24 123MB/s ± 5% 124MB/s ± 3% ~ (p=0.353 n=10+10)
ExternalSort/rows=2048/cols=2/topK/spilled=true-24 8.06MB/s ± 2% 8.29MB/s ± 6% +2.89% (p=0.014 n=8+8)
ExternalSort/rows=2048/cols=4/topK/spilled=false-24 225MB/s ± 7% 232MB/s ± 3% +3.01% (p=0.019 n=10+10)
ExternalSort/rows=2048/cols=4/topK/spilled=true-24 14.9MB/s ± 5% 16.1MB/s ± 4% +8.02% (p=0.000 n=8+8)
ExternalSort/rows=16384/cols=1/topK/spilled=false-24 415MB/s ± 1% 425MB/s ± 2% +2.23% (p=0.001 n=9+10)
ExternalSort/rows=16384/cols=1/topK/spilled=true-24 32.1MB/s ± 4% 32.8MB/s ± 3% +2.30% (p=0.050 n=8+8)
ExternalSort/rows=16384/cols=2/topK/spilled=false-24 356MB/s ± 2% 349MB/s ± 3% -1.76% (p=0.023 n=10+10)
ExternalSort/rows=16384/cols=2/topK/spilled=true-24 45.9MB/s ±20% 62.4MB/s ±31% +36.00% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=false-24 658MB/s ± 3% 650MB/s ± 4% ~ (p=0.393 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=true-24 75.9MB/s ±16% 112.1MB/s ±11% +47.72% (p=0.000 n=10+8)
ExternalSort/rows=262144/cols=1/topK/spilled=false-24 464MB/s ± 3% 462MB/s ± 2% ~ (p=0.905 n=10+9)
ExternalSort/rows=262144/cols=1/topK/spilled=true-24 200MB/s ± 8% 260MB/s ±13% +30.20% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=false-24 585MB/s ± 3% 571MB/s ± 2% -2.45% (p=0.004 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=true-24 98.5MB/s ± 4% 413.3MB/s ± 1% +319.73% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=false-24 1.13GB/s ± 2% 1.12GB/s ± 3% ~ (p=0.190 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=true-24 119MB/s ± 3% 799MB/s ± 6% +572.36% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
ExternalSort/rows=2048/cols=1/topK/spilled=false-24 37.8kB ± 2% 37.8kB ± 2% ~ (p=0.380 n=10+10)
ExternalSort/rows=2048/cols=1/topK/spilled=true-24 219kB ± 0% 99kB ± 0% -54.92% (p=0.000 n=9+10)
ExternalSort/rows=2048/cols=2/topK/spilled=false-24 48.5kB ± 0% 48.5kB ± 0% ~ (all equal)
ExternalSort/rows=2048/cols=2/topK/spilled=true-24 390kB ± 0% 129kB ± 0% -66.97% (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=4/topK/spilled=false-24 70.5kB ± 0% 70.5kB ± 0% -0.09% (p=0.000 n=9+9)
ExternalSort/rows=2048/cols=4/topK/spilled=true-24 524kB ± 0% 179kB ± 0% -65.73% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=1/topK/spilled=false-24 37.6kB ± 0% 37.6kB ± 0% +0.09% (p=0.000 n=9+10)
ExternalSort/rows=16384/cols=1/topK/spilled=true-24 1.03MB ± 0% 0.10MB ± 1% -90.42% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=2/topK/spilled=false-24 48.5kB ± 0% 48.5kB ± 0% ~ (all equal)
ExternalSort/rows=16384/cols=2/topK/spilled=true-24 1.69MB ± 0% 0.13MB ± 0% -92.40% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=false-24 70.5kB ± 0% 70.5kB ± 0% -0.09% (p=0.000 n=9+9)
ExternalSort/rows=16384/cols=4/topK/spilled=true-24 3.00MB ± 0% 0.18MB ± 0% -94.03% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=1/topK/spilled=false-24 37.6kB ± 0% 37.6kB ± 0% +0.08% (p=0.000 n=9+8)
ExternalSort/rows=262144/cols=1/topK/spilled=true-24 14.2MB ± 0% 0.1MB ± 1% -99.31% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=false-24 48.6kB ± 0% 48.6kB ± 0% +0.01% (p=0.000 n=8+10)
ExternalSort/rows=262144/cols=2/topK/spilled=true-24 26.3MB ± 0% 0.1MB ± 0% -99.51% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=false-24 70.6kB ± 0% 70.5kB ± 0% -0.09% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=true-24 50.1MB ± 0% 0.2MB ± 0% -99.64% (p=0.000 n=9+9)
name old allocs/op new allocs/op delta
ExternalSort/rows=2048/cols=1/topK/spilled=false-24 113 ± 0% 112 ± 0% -0.88% (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=1/topK/spilled=true-24 327 ± 0% 317 ± 0% -3.06% (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=2/topK/spilled=false-24 130 ± 0% 128 ± 0% -1.54% (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=2/topK/spilled=true-24 392 ± 0% 364 ± 0% -7.14% (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=4/topK/spilled=false-24 162 ± 0% 160 ± 0% -1.23% (p=0.000 n=10+10)
ExternalSort/rows=2048/cols=4/topK/spilled=true-24 509 ± 0% 453 ± 0% -11.00% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=1/topK/spilled=false-24 113 ± 0% 112 ± 0% -0.88% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=1/topK/spilled=true-24 418 ± 0% 317 ± 0% -24.16% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=2/topK/spilled=false-24 130 ± 0% 128 ± 0% -1.54% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=2/topK/spilled=true-24 570 ± 0% 364 ± 0% -36.14% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=false-24 162 ± 0% 160 ± 0% -1.23% (p=0.000 n=10+10)
ExternalSort/rows=16384/cols=4/topK/spilled=true-24 866 ± 0% 453 ± 0% -47.69% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=1/topK/spilled=false-24 113 ± 0% 112 ± 0% -0.88% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=1/topK/spilled=true-24 1.70k ± 0% 0.32k ± 0% -81.40% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=false-24 130 ± 0% 128 ± 0% -1.54% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=2/topK/spilled=true-24 3.26k ± 0% 0.36k ± 0% -88.84% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=false-24 162 ± 0% 160 ± 0% -1.23% (p=0.000 n=10+10)
ExternalSort/rows=262144/cols=4/topK/spilled=true-24 6.24k ± 0% 0.45k ± 0% -92.74% (p=0.000 n=10+10)
(note that this for K == 128
, so imagining that each row is like 1MiB in size, we would see roughly the same improvement)
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2)
Build failed: |
Release note: None
Previously, if the top K sort spilled to disk, we used the general external sort. However, we could easily optimize that case with the knowledge that only K tuples are needed by the output. Namely, we can use the in-memory top K sort (in order to create each new partition of the desired size) and also limit the size of each merged partition by K tuples. This commit adds these optimizations. Release note: None
Needed to fix a linter. bors r+ |
Build succeeded: |
colexec: extend external sort benchmark for top K case
Release note: None
colexec: optimize the external sort for top K case
Previously, if the top K sort spilled to disk, we used the general
external sort. However, we could easily optimize that case with the
knowledge that only K tuples are needed by the output.
Namely, we can use the in-memory top K sort (in order to create each new
partition of the desired size) and also limit the size of each merged
partition by K tuples. This commit adds these optimizations.
Addresses: #45192.
Release note: None