
Support late materialization #5829

Status: Open · 2 of 6 tasks complete
Lloyd-Pottiger opened this issue on Sep 8, 2022 · 1 comment · Fixed by #6803
Assignees
Labels
component/storage type/feature-request Categorizes issue or PR as related to a new feature. type/performance

Comments

Lloyd-Pottiger (Contributor) commented on Sep 8, 2022

Feature Request

Is your feature request related to a problem? Please describe:

For a query like `select * from t where a + b <> 0;`, we currently scan all columns and then apply the filter. This is heavy, and we can do better.

Describe the feature you'd like:

Support late materialization. We can scan the filter column(s) first to skip data segments where no rows match, and then scan only the remaining columns for the rows that survive the filter.
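The two-phase scan described above can be sketched as follows. This is a minimal Python illustration, not TiFlash code: the segment layout, function name, and predicate form are all made up for the sketch.

```python
# Hypothetical sketch of late materialization: scan the filter column(s)
# first, then fetch the remaining (possibly wide) columns only for rows
# that pass the predicate. Segments that match nothing are skipped
# entirely, so their other columns are never read.

def late_materialized_scan(segments, filter_cols, rest_cols, predicate):
    """Phase 1 reads only `filter_cols` per segment and evaluates
    `predicate`; phase 2 reads `rest_cols` only for surviving rows."""
    result = []
    for seg in segments:
        # Phase 1: read just the cheap filter columns.
        light = {c: seg[c] for c in filter_cols}
        n = len(next(iter(light.values())))
        passing = [i for i in range(n)
                   if predicate({c: light[c][i] for c in filter_cols})]
        if not passing:
            continue  # whole segment pruned: rest columns never touched
        # Phase 2: read the remaining columns only for surviving rows.
        for i in passing:
            row = {c: light[c][i] for c in filter_cols}
            row.update({c: seg[c][i] for c in rest_cols})
            result.append(row)
    return result

# Using the predicate from the issue, a + b <> 0:
segments = [
    {"a": [1, -1], "b": [-1, 1], "title": ["x", "y"]},  # no row matches
    {"a": [2, 0], "b": [3, 0], "title": ["keep", "drop"]},
]
rows = late_materialized_scan(segments, ["a", "b"], ["title"],
                              lambda r: r["a"] + r["b"] != 0)
# rows == [{"a": 2, "b": 3, "title": "keep"}]
```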

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

Subtask:

  • support late materialization in Selection
  • support late materialization in fast scan
  • support late materialization in disaggregated TiFlash
  • better scan context after some filter conditions pushed down
  • (optional) support push down TopN
  • (optional) support late materialization in Join
Lloyd-Pottiger (Contributor, Author) commented:

Given this query:

SELECT id, view_count, like_count, dislike_count, title FROM youtube ORDER BY view_count DESC LIMIT 100

The column title is very large, and the query takes a long time:

100 rows in set. Elapsed: 26.280 sec. Processed 4.56 billion rows, 500.52 GB (173.38 million rows/s., 19.05 GB/s.)

It would be much faster to read the view_count column first, keeping track of which data part and row number each record comes from; apply ORDER BY and LIMIT to keep only the required number of records; and then, using the same table snapshot, read the remaining columns for just those records.

For example, this query:

SELECT view_count FROM youtube ORDER BY view_count DESC LIMIT 100

Only takes:

100 rows in set. Elapsed: 6.920 sec. Processed 4.56 billion rows, 41.01 GB (658.44 million rows/s., 5.93 GB/s.)
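The ORDER BY ... LIMIT strategy above can be sketched in Python. This is an illustrative two-pass algorithm under assumed in-memory data structures, not the engine's actual reader: pass 1 scans only the sort column while tracking (part, row) locators for the top-N values, and pass 2 fetches the remaining columns for just those N rows.

```python
import heapq

def top_n_late(parts, sort_col, other_cols, n):
    """Top-n by `sort_col` DESC with late materialization of `other_cols`."""
    # Pass 1: min-heap of the n largest (value, part_idx, row_idx) tuples,
    # reading only the sort column.
    heap = []
    for p, part in enumerate(parts):
        for r, v in enumerate(part[sort_col]):
            item = (v, p, r)
            if len(heap) < n:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    top = sorted(heap, reverse=True)  # descending by value
    # Pass 2: materialize the other columns only for the n selected rows.
    out = []
    for v, p, r in top:
        row = {sort_col: v}
        row.update({c: parts[p][c][r] for c in other_cols})
        out.append(row)
    return out

parts = [
    {"view_count": [10, 500], "title": ["a", "b"]},
    {"view_count": [999, 3], "title": ["c", "d"]},
]
top2 = top_n_late(parts, "view_count", ["title"], 2)
# top2 == [{"view_count": 999, "title": "c"}, {"view_count": 500, "title": "b"}]
```

The key point is that pass 1 touches every row but only the small view_count column, while the large title column is read for exactly n rows.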

Implementation proposal

Introduce a new column type, ColumnLazy, that stores a reference to the table snapshot to use and the list of data parts, and whose main content is (part num, row num) pairs. Multiple ColumnLazy instances can share the same content while representing different delayed columns.
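A minimal sketch of the proposed structure, in Python rather than the engine's C++: the class name ColumnLazy comes from the proposal, but the fields, `materialize` method, and snapshot shape here are illustrative assumptions, not an actual API.

```python
from dataclasses import dataclass

@dataclass
class ColumnLazy:
    """Illustrative stand-in for the proposed lazy column: holds a
    snapshot reference plus shared (part_num, row_num) locator pairs,
    and resolves real values only when materialized."""
    snapshot: dict      # stands in for the table snapshot
    locators: list      # shared list of (part_num, row_num) pairs
    column_name: str    # which delayed column this instance represents

    def materialize(self):
        # Resolve each locator against the snapshot to fetch real values.
        parts = self.snapshot["parts"]
        return [parts[p][self.column_name][r] for p, r in self.locators]

# Two lazy columns sharing one locator list but representing
# different delayed columns, as the proposal describes.
snapshot = {"parts": [{"title": ["t0", "t1"], "likes": [5, 7]}]}
locs = [(0, 1), (0, 0)]
titles = ColumnLazy(snapshot, locs, "title")
likes = ColumnLazy(snapshot, locs, "likes")
# titles.materialize() == ["t1", "t0"]; likes.materialize() == [7, 5]
```

Sharing the locator list means the (part, row) bookkeeping is paid once per query, no matter how many columns are delayed.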

ywqzzy pushed a commit to ywqzzy/tiflash_1 that referenced this issue Feb 13, 2023
ti-chi-bot pushed a commit that referenced this issue Mar 16, 2023
ti-chi-bot pushed a commit that referenced this issue Apr 12, 2023