[Epic] Split datasources out from datafusion crate (datafusion/core) #14444
Comments
FYI @logan-keede this is the idea I was mentioning to you regarding more refactoring fun |
take |
Whoa, this is awesome! I'm still ramping up on learning DataFusion internals as I add my own extensions, and one thing that's been nagging at me is that it almost feels like the built-in data source providers are "cheating" because they get to live in core. Moving them all to separate crates that have to use the same interfaces as external datasources is something I've been contemplating suggesting for a while, so it's great to see this already happening. I'm definitely going to be keeping up and helping with this work. |
@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects. |
@logan-keede Thanks for the tip! |
(and to be clear, that was @waynexia who made that chart for #14256; I just copy-pasted it here)
Awesome -- indeed this is the case (and a similar thing used to happen with functions before we pulled them out). BTW there is a major refactor for datasources done by @mertak-synnada, @ozankabak and others that (just) merged: Among other things, it makes it easier to re-use common datasource features like pushdown, limit, etc. Now that that is in, we will be able to push this project along like a 🚀 |
Update here is that @logan-keede is cranking right along: After some discussion with @jayzhan211 I think we have a good plan going forward. I feel like we are on the cusp of finally getting these things split from the core 🙏 |
There has been an update to the plan, specifically the addition of
Originally posted by @alamb and suggested by @jayzhan211 in #14616 (comment) Please refer to the above mentioned issue for more context. |
@logan-keede as part of moving avro functionality into |
I do not have much context on why it has been kept this way, but here is what I think It seems like you can currently make a new TL;DR: I think we can do that, but it seems like an API change so deprecation first. @alamb WDYT? |
I think putting the entire AvroExec behind a feature flag makes sense even if it is an API change. I don't think we have to go through deprecation first as the change will be "add the flag" |
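For readers following along, gating a type behind a Cargo feature might look roughly like the sketch below. It is a minimal illustration only: the feature name `avro` is assumed, and the module, struct, and helper functions are hypothetical stand-ins rather than DataFusion's actual definitions.

```rust
// Hypothetical sketch: `avro` is an assumed feature name, and the items
// below are stand-ins, not DataFusion's real types.

// The module (and the exec type inside it) only exists when the feature
// is enabled, so downstream code cannot name it otherwise.
#[cfg(feature = "avro")]
pub mod avro {
    #[derive(Debug, Default)]
    pub struct AvroExec;

    impl AvroExec {
        pub fn name(&self) -> &'static str {
            "AvroExec"
        }
    }
}

/// Reports whether Avro support was compiled in.
pub fn avro_enabled() -> bool {
    cfg!(feature = "avro")
}

/// Variant used when the feature is on: it may reference the gated type.
#[cfg(feature = "avro")]
pub fn describe_avro_support() -> String {
    format!("{} available", avro::AvroExec::default().name())
}

/// Variant used when the feature is off: the gated type does not exist here.
#[cfg(not(feature = "avro"))]
pub fn describe_avro_support() -> String {
    String::from("avro support not compiled in")
}
```

With this shape, code that names the gated type stops compiling unless the feature is enabled, which is the API change discussed above; callers that need to build either way gate their own usage the same way.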
I might not be able to contribute much till 7th March (due to institute commitments).
I would be happy to continue working on this as soon as I find some time, though that may only be after March 7. |
I'll try and take a stab at it, @alamb do you have a preference as to how many PRs I should break it into? There are no logical changes, but I expect a very large number of small changes, and I would love to do it in a way that you and others will be happy to review. edit: Started moving things around and it seems like there are a bunch of dependencies between them. I'll keep going until I get something that makes sense; hopefully the resulting PR won't be too big. |
Thank you @AdamGS -- that is super helpful. In general, fewer smaller PRs is far easier to review. Also, PRs that are mostly mechanical are easy / fast to review. PRs that make changes that might have larger downstream implications are harder / take longer (as they require finding more focused time to review them). |
Ok I got a first draft that just moves |
#14838 turned out to be easier than I thought, mostly because @logan-keede did much of the work to move test infrastructure over. |
I'll keep going later this week, my current plan is to split the rest of the work into three PRs:
|
Is your feature request related to a problem or challenge?
Historically DataFusion was one (very) large crate, datafusion, and as it grew bigger we extracted various functionality into separate crates. This leads to both faster compile times (as the crates can be compiled in parallel) as well as easier to navigate code (as the crates force a cleaner dependency separation).
As described by @waynexia the build time of DataFusion has been growing,
Some of this is due to the fact there is more code / more features to test. However, a non-trivial part of the long compile time is the time taken to compile the datafusion / core crate in https://github.com/apache/datafusion/tree/main/datafusion/core
While we are pursuing additional ways to reduce compile time, I think we should also move more code out of datafusion/core into its own crates.
We have successfully done this in the past with other projects such as
Describe the solution you'd like
I would like to split out the https://github.com/apache/datafusion/tree/main/datafusion/core/src/datasource from DataFusion core
Describe alternatives you've considered
I think we will end up with several new crates:
- datafusion-catalog-listing: ListingTable and associated types like PartitionedFile
- datafusion-datasource-parquet: ParquetExec and file format
- datafusion-datasource-avro: AvroExec and file formats
- datafusion-datasource-arrow
- datafusion-datasource-json
- datafusion-datasource-csv
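To make the intended dependency shape concrete, here is a rough single-file sketch in which modules stand in for crates. Every name in it (`FileSource`, `ArrowSource`, `CsvSource`, `facade`) is hypothetical and chosen for illustration; it is not DataFusion's actual API. The point is only that each format depends on a shared interface, and the top-level code sees formats solely through that interface.

```rust
// Hypothetical sketch: modules stand in for separate crates, and all
// names here are illustrative only.

/// Stand-in for a shared interface crate that every format depends on.
pub mod datasource {
    pub trait FileSource {
        fn file_type(&self) -> &'static str;
    }
}

/// Stand-in for a per-format crate: it depends only on the interface.
pub mod datasource_arrow {
    use super::datasource::FileSource;

    pub struct ArrowSource;

    impl FileSource for ArrowSource {
        fn file_type(&self) -> &'static str {
            "arrow"
        }
    }
}

/// A second per-format crate, independent of the first.
pub mod datasource_csv {
    use super::datasource::FileSource;

    pub struct CsvSource;

    impl FileSource for CsvSource {
        fn file_type(&self) -> &'static str {
            "csv"
        }
    }
}

/// Stand-in for the top-level crate: it only knows the trait, so built-in
/// and external formats go through the same interface.
pub mod facade {
    use super::datasource::FileSource;

    pub fn registered_types(sources: &[Box<dyn FileSource>]) -> Vec<&'static str> {
        sources.iter().map(|s| s.file_type()).collect()
    }
}
```

Because the two format modules do not reference each other, the corresponding crates could be compiled in parallel, which is where the compile-time benefit described in this issue would come from.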
I think we could start by creating datafusion-catalog-listing and trying to pull some of the listing table implementation into there, and then trying to move one of the simpler datasources out (datafusion-datasource-arrow perhaps).
Additional context
No response