[Discussion] Efficient Row Selection for Multi-Engine Support #14816

Open
Arpit-Bandejiya opened this issue Feb 21, 2025 · 6 comments

Arpit-Bandejiya commented Feb 21, 2025

Background

We have a use case where data is stored in multiple engines/formats, with Parquet as the primary format containing all the data. Text queries are handled by an inverted index format, while numeric queries and aggregations are processed via Parquet files. Although the file formats differ, the data is sorted and stored in the same order across them.

We are using DataFusion to query Parquet files and are wondering whether the result of a query can be represented as a bit set of document positions (example below). Bit sets from the different engines can be intersected to identify the documents that meet the criteria. The resulting bit set can then be used to fetch the relevant documents from Parquet.

Example:

Assume we have the following data stored in a Parquet file:

colA  colB
----  ---------------
200   Autumn leaves
200   Salty breeze
100   Misty mountains
100   Misty mountains
200   Velvet curtains

Now assume we have a query like SELECT colB WHERE colA = 100.

The matching documents can be represented as the bit set 00110 (row numbers start from the left). We want to use the matching-document information collected from any underlying engine to fetch the relevant documents from the Parquet file using DataFusion.
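The bit set idea can be sketched in a few lines of plain Rust. This is only an illustration of the example above, not DataFusion code: the helper names (`match_bits`, `intersect`, `render`) are made up for this sketch, and a `Vec<bool>` stands in for a real packed bit set.

```rust
/// Build a bit set (one bool per row, row 0 first) marking the rows
/// for which `pred` holds. Hypothetical helper, not a DataFusion API.
fn match_bits<T>(column: &[T], pred: impl Fn(&T) -> bool) -> Vec<bool> {
    column.iter().map(|v| pred(v)).collect()
}

/// Intersect two bit sets that cover the same rows in the same order.
fn intersect(a: &[bool], b: &[bool]) -> Vec<bool> {
    assert_eq!(a.len(), b.len(), "bit sets must cover the same rows");
    a.iter().zip(b).map(|(x, y)| *x && *y).collect()
}

/// Render a bit set as a string of 0s and 1s, leftmost character = row 0.
fn render(bits: &[bool]) -> String {
    bits.iter().map(|&b| if b { '1' } else { '0' }).collect()
}
```

With colA = [200, 200, 100, 100, 200], `render(&match_bits(&col_a, |v| *v == 100))` gives "00110". Intersecting that with a bit set from another engine (say a text match on rows 1, 2 and 4, which is an invented input) leaves only row 2 set.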

What we explored

One way to fetch specific rows in DataFusion is to create an access plan and pass it to ParquetExec. However, the complete plan must exist before execution starts, so we cannot begin reading from Parquet in parallel while matching rows are still being produced. This reduces overall query performance and is also memory-inefficient, because we have to iterate over the complete stream of matching rows and convert it to the AccessPlan up front.

Possible Solution

If there is a way to:
  1. Pass the iterator directly to DataFusion, or
  2. Process the matching rows in batches,
then the matching-rows iterator could be converted to a RowSelection on demand inside DataFusion, improving efficiency by reducing memory overhead.
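To make the batched option concrete, here is a minimal sketch of converting an ascending stream of matching row ids into skip/select runs one batch at a time, so the full selection never has to be materialized. The `Run` enum is a local stand-in loosely modeled on the skip/select idea behind RowSelection. It is not the parquet crate's type, and `run_batches` is a hypothetical helper, not an existing DataFusion hook.

```rust
use std::iter::Peekable;

/// Local stand-in for a skip/select run (assumption: not the parquet crate's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// Lazily turns ascending row ids into batches of runs.
struct RunBatches<I: Iterator<Item = usize>> {
    ids: Peekable<I>,
    next_row: usize, // first row not yet covered by an emitted run
    max_ids: usize,  // number of row ids consumed per batch
}

fn run_batches<I: Iterator<Item = usize>>(ids: I, max_ids: usize) -> RunBatches<I> {
    assert!(max_ids > 0, "a batch must consume at least one row id");
    RunBatches { ids: ids.peekable(), next_row: 0, max_ids }
}

impl<I: Iterator<Item = usize>> Iterator for RunBatches<I> {
    type Item = Vec<Run>;

    fn next(&mut self) -> Option<Vec<Run>> {
        self.ids.peek()?; // stop once the source iterator is exhausted
        let mut runs: Vec<Run> = Vec::new();
        for _ in 0..self.max_ids {
            let id = match self.ids.next() {
                Some(id) => id,
                None => break,
            };
            assert!(id >= self.next_row, "row ids must be strictly ascending");
            if id > self.next_row {
                // Gap since the last selected row becomes a skip run.
                runs.push(Run::Skip(id - self.next_row));
            }
            if let Some(Run::Select(n)) = runs.last_mut() {
                *n += 1; // extend the current select run
            } else {
                runs.push(Run::Select(1));
            }
            self.next_row = id + 1;
        }
        Some(runs)
    }
}
```

Each batch continues from where the previous one stopped, so concatenating the batches yields one selection for the whole file. Rows after the last match are left implicit in this sketch. For the bit set 00110, the ids 2 and 3 produce a single batch of Skip(2), Select(2).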

Questions

  1. Are there existing mechanisms in DataFusion to handle external iterators or row sources?
  2. What are the best practices for integrating DataFusion with external data sources in a streaming or batched manner?
  3. Are there any plans or ongoing work in the DataFusion project that might address this use case?
  4. Any alternative approaches or design patterns that might help us achieve efficient row selection in our multi-engine implementation?

@Arpit-Bandejiya
Author

@alamb @andygrove please share your opinion on this use case!

@alamb
Contributor

alamb commented Feb 22, 2025

> Are there existing mechanisms in DataFusion to handle external iterators or row sources?

There is a PR we are currently working on related to metadata columns (which could provide row ids perhaps)

> What are the best practices for integrating DataFusion with external data sources in a streaming or batched manner?
>
> Are there any plans or ongoing work in the DataFusion project that might address this use case?
>
> Any alternative approaches or design patterns that might help us achieve efficient row selection in our multi-engine implementation?

I think you should check out https://github.com/datafusion-contrib/datafusion-federation which has a variety of items that are used for building a federated query engine

@philippemnoel may also have ideas / suggestions for this

@alamb
Contributor

alamb commented Feb 22, 2025

> We are using DataFusion to query Parquet files and are wondering whether the result of a query can be represented as a bit set of document positions (example below). Bit sets from the different engines can be intersected to identify the documents that meet the criteria. The resulting bit set can then be used to fetch the relevant documents from Parquet.

I think there are two parts to your question:

  1. Representing the results as a bitset: I think you would have to implement a custom "pivot" type operation that takes row ids somehow and creates a bitset from them

  2. Fetching only relevant documents from Parquet: the current reader is set up to efficiently fetch large contiguous blocks of values (RowSelection). @XiangpengHao has been thinking about a bitset representation for selected rows recently, so perhaps you can help contribute to making that happen in the Parquet reader
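As a rough illustration of the "pivot" in point 1, the sketch below packs row ids into 64-bit words and counts how many contiguous runs the result contains. All names here are hypothetical. The run count is only meant to hint at why a run-based selection suits large contiguous blocks better than scattered matches, which is an interpretation of point 2 rather than a statement about the reader's internals.

```rust
/// Pivot row ids into a packed bit set covering `num_rows` rows.
/// Hypothetical helper, not an existing DataFusion or parquet API.
fn pivot_to_bitset(row_ids: impl IntoIterator<Item = usize>, num_rows: usize) -> Vec<u64> {
    let mut words = vec![0u64; (num_rows + 63) / 64];
    for id in row_ids {
        assert!(id < num_rows, "row id out of range");
        words[id / 64] |= 1u64 << (id % 64);
    }
    words
}

/// Returns true if `row` is selected; rows past the end count as unselected.
fn is_set(words: &[u64], row: usize) -> bool {
    words
        .get(row / 64)
        .map_or(false, |&w| ((w >> (row % 64)) & 1) == 1)
}

/// Number of selected rows.
fn count_set(words: &[u64]) -> u32 {
    words.iter().map(|w| w.count_ones()).sum()
}

/// Number of maximal runs of consecutive selected rows.
fn count_runs(words: &[u64], num_rows: usize) -> usize {
    let mut runs = 0;
    let mut prev = false;
    for row in 0..num_rows {
        let cur = is_set(words, row);
        if cur && !prev {
            runs += 1; // a new run starts here
        }
        prev = cur;
    }
    runs
}
```

For the earlier example, pivoting ids 2 and 3 over 5 rows gives one run of two rows. Pivoting ids 0, 2 and 4 gives three runs of one row each, the fragmented case where a bit set is the more compact representation.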

@Arpit-Bandejiya
Author

Thanks for the response @alamb! A couple of follow-up questions:

> There is a PR we are currently working on related to metadata columns (which could provide row ids perhaps)

#14057

Is there any way to get the row_id data for Parquet? Any suggestions on how to build it? @alamb @chenkovsky

> Fetching only relevant documents from Parquet: the current reader is set up to efficiently fetch large contiguous blocks of values (RowSelection). @XiangpengHao has been thinking about a bitset representation for selected rows recently, so perhaps you can help contribute to making that happen in the Parquet reader

I would be happy to collaborate on it. @XiangpengHao, do you have an initial plan or POC for it?

@chenkovsky

@Arpit-Bandejiya

I created an example of getting row_id for Parquet, based on PR #14057: https://github.com/chenkovsky/datafusion/pull/3/files

@bharath-techie

Hi @chenkovsky,
Thanks a ton for the quick POC on this. :)

The row ids seem to be specific to each batch rather than spanning the entire Parquet file. Is my understanding correct?

I ask because our use case will mainly benefit from Parquet file-level row ids.
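If the ids do turn out to be per batch, one possible workaround is to add a running row offset while consuming the batches. The sketch below assumes batches arrive in file order and that no rows were pruned before the ids were assigned. Neither assumption is confirmed in this thread, and `BatchMatches` and `file_level_ids` are hypothetical names.

```rust
/// Matches reported for one batch: the batch's row count plus the
/// batch-local ids of matching rows. Hypothetical type for illustration.
struct BatchMatches {
    num_rows: usize,
    local_ids: Vec<usize>,
}

/// Convert batch-local row ids to file-level row ids by adding the
/// number of rows in all preceding batches.
/// Assumes batches are given in file order with no rows dropped.
fn file_level_ids(batches: &[BatchMatches]) -> Vec<usize> {
    let mut offset = 0;
    let mut out = Vec::new();
    for batch in batches {
        for &local in &batch.local_ids {
            assert!(local < batch.num_rows, "local id exceeds batch size");
            out.push(offset + local);
        }
        offset += batch.num_rows;
    }
    out
}
```

For the five-row example split into batches of 3 and 2 rows, the local matches [2] and [0] map to file-level ids 2 and 3, which is the bit set 00110 again.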
