Add Delta Lake TableProvider #525

Closed
Dandandan opened this issue Jun 8, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@Dandandan
Contributor

Dandandan commented Jun 8, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Delta is used more and more as a storage format, as it has some nice features like ACID transactions, collection of table statistics and storage optimization.

Describe the solution you'd like
Use delta-rs to add support for reading delta datasets. The library already has a TableProvider (which might be used for inspiration) and some other features like bin packing.

Describe alternatives you've considered

Additional context
Mailing List Thread: https://lists.apache.org/thread.html/r334e90fb7c53930272f264b66aaf2911ba778e55ef4e41f6a938f514%40%3Cdev.arrow.apache.org%3E

@Dandandan Dandandan added the enhancement New feature or request label Jun 8, 2021
@Dandandan Dandandan changed the title Add delta tableprovider Add delta lake tableprovider Jun 8, 2021
@Dandandan Dandandan changed the title Add delta lake tableprovider Add Delta Lake TableProvider Jun 8, 2021
@Dandandan
Contributor Author

Dandandan commented Jun 8, 2021

FYI @houqp what do you think of integrating this into DataFusion?

@jorgecarleitao
Member

fwiw, imo this should be discussed over the mailing list.

@Dandandan
Contributor Author

> fwiw, imo this should be discussed over the mailing list.

I agree, if we have some positive reactions I will send something over the mailing list.

@houqp
Member

houqp commented Jun 8, 2021

I am all for this. I think this is a good move, especially for Ballista. I am happy to help maintain the Delta Lake support in DataFusion going forward as well. If we go this route, I would like to drop the table provider implementation in delta-rs so we can all focus on one official DataFusion provider implementation in arrow-datafusion.

I am also planning to promote datafusion as the default query engine for executing native delta lake queries in delta-rs. This will make it easier for us to provide deltalake query access to other languages and runtimes.

@nevi-me
Contributor

nevi-me commented Jun 9, 2021

> I am also planning to promote datafusion as the default query engine for executing native delta lake queries in delta-rs. This will make it easier for us to provide deltalake query access to other languages and runtimes.

I like this approach, and I think there might be other approaches to adding IO support to datafusion.

How about separate crates implementing functionality through traits, then having a contrib section in the README listing them?
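
For illustration only, here is a minimal sketch of that trait-based layout in plain Rust. Every name in it (`TableSource`, `DeltaSource`, `Registry`) is hypothetical and is not DataFusion's or delta-rs's actual API. The assumption is that the trait and registry live in the core crate, while the Delta implementation ships from a separate contrib crate that depends only on the trait.

```rust
use std::collections::HashMap;

// Hypothetical core-crate trait; a stand-in, not DataFusion's real `TableProvider`.
trait TableSource {
    // Short name of the storage format behind this source.
    fn format(&self) -> &str;
    // Returns all rows; each row is a list of string cells to keep the sketch small.
    fn scan(&self) -> Vec<Vec<String>>;
}

// Hypothetical contrib-crate type: would live in its own, separately versioned crate.
struct DeltaSource {
    files: Vec<String>,
}

impl TableSource for DeltaSource {
    fn format(&self) -> &str {
        "delta"
    }

    // Emits one single-column row per data file instead of reading real Parquet.
    fn scan(&self) -> Vec<Vec<String>> {
        self.files.iter().map(|f| vec![f.clone()]).collect()
    }
}

// The engine side only knows about the trait, never about concrete formats.
struct Registry {
    tables: HashMap<String, Box<dyn TableSource>>,
}

impl Registry {
    fn new() -> Self {
        Registry { tables: HashMap::new() }
    }

    fn register(&mut self, table: &str, source: Box<dyn TableSource>) {
        self.tables.insert(table.to_string(), source);
    }

    // Looks up a table's format name, or None if it was never registered.
    fn format_of(&self, table: &str) -> Option<&str> {
        self.tables.get(table).map(|s| s.format())
    }

    // Counts rows by scanning, or None if the table was never registered.
    fn row_count(&self, table: &str) -> Option<usize> {
        self.tables.get(table).map(|s| s.scan().len())
    }
}

fn main() {
    let mut registry = Registry::new();
    let source = DeltaSource {
        files: vec!["part-0.parquet".to_string(), "part-1.parquet".to_string()],
    };
    registry.register("events", Box::new(source));
    println!("{:?} {:?}", registry.format_of("events"), registry.row_count("events"));
}
```

With this split, adding another format would mean publishing another crate that implements the trait, without touching the core crate.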

@jacobmarble

Delta Sharing may be a pragmatic alternative. It appears to be nothing more than a small REST API for Parquet catalogs (the client fetches the data directly from S3, etc). The propaganda is that this is intended to be a data exchange protocol, so not tied directly to any particular product.

@houqp
Member

houqp commented Dec 30, 2022

I think we can close this now since the table provider has already been implemented in https://github.com/delta-io/delta-rs/blob/8c67d78c8c67fdc9dd16c7e0d1fa9867ae7c1a5d/rust/src/delta_datafusion.rs#L322

@alamb alamb closed this as completed Jan 2, 2023