Multi-stage PREWHERE enabled by default #46365
@davenger in theory |
@azat yes, |
I was worried about extra seeks, but apparently the problem is in the code. Consider the following example:
CREATE TABLE data
(
`a` UInt64,
`b` UInt64,
`c` UInt64,
`d` UInt64
)
ENGINE = MergeTree
ORDER BY tuple();
INSERT INTO data SELECT
number % 1,
number % 1,
number % 1,
number
FROM numbers(100000000.)
Query id: 02488797-7172-45c9-88ba-243955669efe
Ok.
0 rows in set. Elapsed: 1.573 sec. Processed 100.65 million rows, 805.21 MB (63.97 million rows/s., 511.79 MB/s.)
p620.local :) select count() from data where a = 1 and b = 0 and c = 0 settings optimize_move_to_prewhere=0
SELECT count()
FROM data
WHERE (a = 1) AND (b = 0) AND (c = 0)
SETTINGS optimize_move_to_prewhere = 0
Query id: 18acb99f-3636-4e17-8778-d345ec81f341
┌─count()─┐
│ 0 │
└─────────┘
1 row in set. Elapsed: 0.091 sec. Processed 100.00 million rows, 2.40 GB (1.10 billion rows/s., 26.36 GB/s.)
p620.local :) select count() from data prewhere a = 1 and b = 0 and c = 0
SELECT count()
FROM data
PREWHERE (a = 1) AND (b = 0) AND (c = 0)
Query id: fb7950c0-92ca-4a6b-98c1-277c478ed9cb
┌─count()─┐
│ 0 │
└─────────┘
1 row in set. Elapsed: 0.200 sec. Processed 100.00 million rows, 2.40 GB (501.22 million rows/s., 12.03 GB/s.)
p620.local :) select count() from data prewhere a = 1 and b = 0 and c = 0 settings merge_tree_min_rows_for_seek=10e6, enable_multiple_prewhere_read_steps=1
SELECT count()
FROM data
PREWHERE (a = 1) AND (b = 0) AND (c = 0)
SETTINGS merge_tree_min_rows_for_seek = 10000000., enable_multiple_prewhere_read_steps = 1
Query id: 444e9360-360f-4669-8881-06de4387c742
┌─count()─┐
│ 0 │
└─────────┘
1 row in set. Elapsed: 0.252 sec. Processed 100.00 million rows, 800.00 MB (396.17 million rows/s., 3.17 GB/s.)
It looks like something is suboptimal, because with one thread the difference is not that big:
p620.local :) select count() from data prewhere a = 1 and b = 0 and c = 0 settings enable_multiple_prewhere_read_steps=1, max_threads=1
SELECT count()
FROM data
PREWHERE (a = 1) AND (b = 0) AND (c = 0)
SETTINGS enable_multiple_prewhere_read_steps = 1, max_threads = 1
Query id: 6a6d1ff0-dee5-4444-871e-b16212dcdef7
┌─count()─┐
│ 0 │
└─────────┘
1 row in set. Elapsed: 0.249 sec. Processed 100.00 million rows, 800.00 MB (402.10 million rows/s., 3.22 GB/s.)
p620.local :) select count() from data where a = 1 and b = 0 and c = 0 settings optimize_move_to_prewhere=0, max_threads=1
SELECT count()
FROM data
WHERE (a = 1) AND (b = 0) AND (c = 0)
SETTINGS optimize_move_to_prewhere = 0, max_threads = 1
Query id: 2e414046-c987-49dc-9203-3c1e507b7a71
┌─count()─┐
│ 0 │
└─────────┘
1 row in set. Elapsed: 0.282 sec. Processed 100.00 million rows, 2.40 GB (354.57 million rows/s., 8.51 GB/s.)
I guess this query filters out a lot of data, hence it gives this benefit.
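To make the "filters out a lot of data" point concrete, here is a minimal sketch in Python of why the multi-step runs above report 800.00 MB processed while the single-step runs report 2.40 GB. It is a toy model, not ClickHouse's reader: the function names are hypothetical, and it assumes filtering happens per row, whereas the real reader works on larger units.

```python
# Toy model of PREWHERE read volume. This is NOT ClickHouse's reader; it only
# counts how many column values each strategy would have to look at.

UINT64_BYTES = 8  # every column in the example table is UInt64


def single_step_values_read(n_rows, predicates):
    """Single-step: read every filter column for every row, then filter once."""
    return n_rows * len(predicates)


def multi_step_values_read(columns, predicates):
    """Multi-step: read one column per step, only for rows that passed so far."""
    n_rows = len(columns[predicates[0][0]])
    surviving = list(range(n_rows))
    values_read = 0
    for name, predicate in predicates:
        column = columns[name]
        values_read += len(surviving)  # this step reads only surviving rows
        surviving = [i for i in surviving if predicate(column[i])]
    return values_read, len(surviving)


# Same shape as the table in the thread, scaled down: a, b, c are all zeros.
n_rows = 1_000
columns = {name: [i % 1 for i in range(n_rows)] for name in ("a", "b", "c")}
predicates = [
    ("a", lambda v: v == 1),  # never true, so no row survives step 1
    ("b", lambda v: v == 0),
    ("c", lambda v: v == 0),
]

single = single_step_values_read(n_rows, predicates)
multi, matched = multi_step_values_read(columns, predicates)

# Scale the per-row cost up to the 100 million rows used in the benchmark.
benchmark_rows = 100_000_000
single_bytes = single // n_rows * benchmark_rows * UINT64_BYTES
multi_bytes = multi // n_rows * benchmark_rows * UINT64_BYTES

print(f"single-step: {single_bytes / 1e9:.2f} GB")
print(f"multi-step:  {multi_bytes / 1e6:.2f} MB, rows matched: {matched}")
```

Under these assumptions the model gives three UInt64 columns times 100 million rows for the single-step case and one column for the multi-step case, which lines up with the 2.40 GB and 800.00 MB figures in the query output. A less selective first condition would shrink that gap.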
@azat is the data in your tests on HDD?
davenger-xps :) select count() from data where a = 1 and b = 0 and c = 0 settings optimize_move_to_prewhere=0
SELECT count()
FROM data
WHERE (a = 1) AND (b = 0) AND (c = 0)
SETTINGS optimize_move_to_prewhere = 0
Query id: 53725fbe-5075-47b3-867a-3ab3b5ee28be
┌─count()─┐
│ 0 │
└─────────┘
1 row in set. Elapsed: 0.108 sec. Processed 100.00 million rows, 2.40 GB (929.14 million rows/s., 22.30 GB/s.)
davenger-xps :) select count() from data where a = 1 and b = 0 and c = 0 settings optimize_move_to_prewhere=1, enable_multiple_prewhere_read_steps=0, move_all_conditions_to_prewhere=0
SELECT count()
FROM data
WHERE (a = 1) AND (b = 0) AND (c = 0)
SETTINGS optimize_move_to_prewhere = 1, enable_multiple_prewhere_read_steps = 0, move_all_conditions_to_prewhere = 0
Query id: 08f80037-d8df-47c2-8f32-59744eb0ea54
┌─count()─┐
│ 0 │
└─────────┘
1 row in set. Elapsed: 0.033 sec. Processed 100.00 million rows, 800.00 MB (3.05 billion rows/s., 24.43 GB/s.)
davenger-xps :) select count() from data where a = 1 and b = 0 and c = 0 settings optimize_move_to_prewhere=1, enable_multiple_prewhere_read_steps=1, move_all_conditions_to_prewhere=1
SELECT count()
FROM data
WHERE (a = 1) AND (b = 0) AND (c = 0)
SETTINGS optimize_move_to_prewhere = 1, enable_multiple_prewhere_read_steps = 1, move_all_conditions_to_prewhere = 1
Query id: 01b2b94d-1aa7-4316-945d-f66e20955306
┌─count()─┐
│ 0 │
└─────────┘
1 row in set. Elapsed: 0.025 sec. Processed 100.00 million rows, 800.00 MB (3.93 billion rows/s., 31.40 GB/s.)
davenger-xps :) select count() from data where a = 1 and b = 0 and c = 0 settings optimize_move_to_prewhere=1, enable_multiple_prewhere_read_steps=1, move_all_conditions_to_prewhere=1, merge_tree_min_rows_for_seek = 10000000.
SELECT count()
FROM data
WHERE (a = 1) AND (b = 0) AND (c = 0)
SETTINGS optimize_move_to_prewhere = 1, enable_multiple_prewhere_read_steps = 1, move_all_conditions_to_prewhere = 1, merge_tree_min_rows_for_seek = 10000000.
Query id: 5d11a8f0-8815-4531-adc6-fa6b82aadf8d
┌─count()─┐
│ 0 │
└─────────┘
1 row in set. Elapsed: 0.026 sec. Processed 100.00 million rows, 800.00 MB (3.85 billion rows/s., 30.79 GB/s.)
SSD, though I would say it was effectively memory, because the data was in the page cache.
Funny enough, indeed it is! And I guess my initial concerns about random IO were not well-founded, since you will not read more data anyway, plus there are still compressed blocks that will be read at once and will not be read twice, thanks to caching in the code.
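The "you will not read more data anyway" argument can be illustrated with a small sketch. It assumes a simplified layout in which a column is stored in fixed-size blocks and a later PREWHERE step has to read every block that holds at least one surviving row. The block size and helper name are hypothetical, chosen for illustration only, and real compressed blocks vary in size.

```python
# Toy model: how many blocks must a later read step touch, given which rows
# survived the earlier steps? Assumes fixed-size blocks, read whole and once.


def blocks_touched(surviving_rows, rows_per_block):
    """Count distinct blocks that contain at least one surviving row."""
    return len({row // rows_per_block for row in surviving_rows})


n_rows = 1_000_000
rows_per_block = 8192  # illustrative value, not taken from the thread
total_blocks = -(-n_rows // rows_per_block)  # ceiling division

# 100 survivors spread evenly over the table (0.01% of all rows).
scattered = range(0, n_rows, 10_000)
# The same number of survivors packed together at the start of the table.
clustered = range(len(scattered))

scattered_blocks = blocks_touched(scattered, rows_per_block)
clustered_blocks = blocks_touched(clustered, rows_per_block)

print(f"total blocks:         {total_blocks}")
print(f"scattered survivors:  {scattered_blocks} blocks")
print(f"clustered survivors:  {clustered_blocks} blocks")
```

In this model the number of blocks touched can never exceed the number a full scan would read, and each block is counted once no matter how many surviving rows it holds. Scattered survivors reduce the saving, but they do not make the later step more expensive than reading the whole column.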
The results of ClickBench with settings off and on: https://pastila.nl/?000b1ba6/95bfe2f2029917388977bc956a6ddf6c.html
Wow! 1.5x improvement on one of the queries!
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Enable `move_all_conditions_to_prewhere` and `enable_multiple_prewhere_read_steps` settings by default.