Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Split datasources out from datafusion crate (datafusion/core) #14444

Open
alamb opened this issue Feb 3, 2025 · 18 comments
Open

[Epic] Split datasources out from datafusion crate (datafusion/core) #14444

alamb opened this issue Feb 3, 2025 · 18 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 3, 2025

Is your feature request related to a problem or challenge?

Historically DataFusion was one (very) large crate datafusion, and as it grew bigger we extracted various functionality into separate crates. This leads to both faster compile times (as the crates can be compiled in parallel) as well easier to navigate code (as the crates force a cleaner dependency separation)

As described by @waynexia the build time of DataFusion has been growing,

Some of this is due to the fact there is more code / more features to test. However a non trivial part of the long compile time is the time taken to compile the datafusion / core crate in https://github.com/apache/datafusion/tree/main/datafusion/core

Image

While we are pursuing additional ways to reduce compile time, I think we should also move more code out of datafusion/core into their own crates.

We have successfully done this in the past with other projects such as

Describe the solution you'd like

I would like to split out the https://github.com/apache/datafusion/tree/main/datafusion/core/src/datasource from DataFusion core

Describe alternatives you've considered

I think we will end up with several new crates

  • datafusion-catalog-listing: ListingTable and associated types like PartitionedFile
  • datafusion-datasource-parquet: ParquetExec and file firmat
  • datafusion-datasource-avro AvroExec and file formats
  • datafusion-datasource-arrow
  • datafusion-datasource-json
  • datafusion-datasource-csv

I think we could start by creating datafusion-catalog-listing and trying to pull some of the listing table implementation into there and then trying to move one of the simpler datasources out (datafusion-datasource-arrow perhaps)

Additional context

No response

@alamb alamb added the enhancement New feature or request label Feb 3, 2025
@alamb
Copy link
Contributor Author

alamb commented Feb 3, 2025

FYI @logan-keede this is the idea I was mentioning to you regarding more refactoring fun

@logan-keede
Copy link
Contributor

take

@davisp
Copy link
Member

davisp commented Feb 5, 2025

Whoa, this is awesome!

I'm still ramping up on learning DataFusion internals as I add my own extensions and one thing that's been nagging at me is that it almost feels like the built-in data source providers are "cheating" because they get to live in core. Moving them all to separate crates that have to use the same interfaces as external datasources is something I've been contemplating suggesting for awhile so its great to see this already happening.

I'm definitely going to be keeping up and helping with this work.

@davisp
Copy link
Member

davisp commented Feb 5, 2025

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

@logan-keede
Copy link
Contributor

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

@davisp try this

rm -rf target 
cargo build --release --timings --lib --quiet

@davisp
Copy link
Member

davisp commented Feb 6, 2025

@logan-keede Thanks for the tip!

@alamb
Copy link
Contributor Author

alamb commented Feb 6, 2025

@alamb Can you point me at whatever tool you used to generate those compile timing graphs? Those look like something I'd absolutely adopt in a bunch of projects.

(and to be clear, that was @waynexia who made that chart for #14256 I just copy pasted it here)

I'm still ramping up on learning DataFusion internals as I add my own extensions and one thing that's been nagging at me is that it almost feels like the built-in data source providers are "cheating" because they get to live in core. Moving them all to separate crates that have to use the same interfaces as external datasources is something I've been contemplating suggesting for awhile so its great to see this already happening.

Awesome -- indeed this is the case (and a similar thing used to happen with functions before we pulled them out)

BTW there is a major refactor for datasources done by @mertak-synnada @ozankabak and others that (just) merged:

Among other things it makes it easier to re-use common datasource features like pushdown, limit, etc

Now that that is in, we will be able to push this project along like a 🚀

@alamb
Copy link
Contributor Author

alamb commented Feb 12, 2025

Update here is that @logan-keede is cranking right along:

After some discussion with @jayzhan211 I think we have a good plan going forward

I feel like we are on the cusp of finally getting these things split from the core 🙏

@logan-keede
Copy link
Contributor

logan-keede commented Feb 12, 2025

This has been an update in plan, specifically addition of datafusion-datasource crate

  • datafusion-catalog-listing: ListingTable and original datasource::listing
  • datafusion-datasource: NEW that holds PartitonedFile,and DataSource and other File related components
  • datafusion-datasource-parquet: ParquetExec and file formats
  • datafusion-datasource-avro AvroExec and file formats
  • datafusion-datasource-arrow
  • datafusion-datasource-json
  • datafusion-datasource-csv

Originally posted by @alamb and suggested by @jayzhan211 in #14616 (comment)

Please refer to the above mentioned issue for more context.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 22, 2025

@logan-keede as part of moving avro functionality into datafusion-datasource-avro, WDYT about putting all of it behind a feature flag, the same way parquet is now? Seems like its the only format/source that is visible with default-features = false but panics at runtime (for example in AvroSource::create_file_opener)

@logan-keede
Copy link
Contributor

@logan-keede as part of moving avro functionality into datafusion-datasource-avro, WDYT about putting all of it behind a feature flag, the same way parquet is now? Seems like its the only format/source that is visible with default-features = false but panics at runtime (for example in AvroSource::create_file_opener)

I do not have much context on why it has been kept this way, but here is what I think

It seems like you can currently make a new AvroSource but not read an existing file. There might be some limited use case where user only needed to simulate AvroSource for some reason. (even that might crash as some other functionalities also panic).
I think this use case is very limited and unintuitive and it is better to have it behind a feature flag(as you suggested), even if it is an API-Change.

TL;DR: I think we can do that, but it seems like an API change so deprecation first.

@alamb WDYT?

@alamb
Copy link
Contributor Author

alamb commented Feb 22, 2025

@logan-keede as part of moving avro functionality into datafusion-datasource-avro, WDYT about putting all of it behind a feature flag, the same way parquet is now? Seems like its the only format/source that is visible with default-features = false but panics at runtime (for example in AvroSource::create_file_opener)

I do not have much context on why it has been kept this way, but here is what I think

It seems like you can currently make a new AvroSource but not read an existing file. There might be some limited use case where user only needed to simulate AvroSource for some reason. (even that might crash as some other functionalities also panic). I think this use case is very limited and unintuitive and it is better to have it behind a feature flag(as you suggested), even if it is an API-Change.

TL;DR: I think we can do that, but it seems like an API change so deprecation first.

@alamb WDYT?

I think putting the entire AvroExec behind a feature flag makes sense even if it is an API change. I don't think we have to go through deprecation first as the change will be "add the flag"

@logan-keede
Copy link
Contributor

logan-keede commented Feb 23, 2025

I might not be able to contribute much till 7th March (due to institute commitments).
If anybody wants to continue on this issue, please feel free to do so. Here are possible next steps that I had in my mind:-

  1. Move remaining part of FileScanConfig and FileStream to datasource .
  2. Move FileSource to datasource.
  3. Move FileFormat to datasource.
  4. Move listing and file format specific implementation to their own crates.

I would be happy to continue working on this as soon as I find some time, though that may only be after March 7.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 23, 2025

I'll try and take a stab at it, @alamb do you have a preference as to how many PRs I should break it into? There are no logical changes but I expect a very large numbers of small changes and would I love to do it in a way that you and others will be happy to review.

edit: Started moving things around and seems like there a bunch of dependencies between them. I'll keep going until I get something that makes sense, hopefully the resulting PR won't be too big.

@alamb
Copy link
Contributor Author

alamb commented Feb 23, 2025

I'll try and take a stab at it, @alamb do you have a preference as to how many PRs I should break it into? There are no logical changes but I expect a very large numbers of small changes and would I love to do it in a way that you and others will be happy to review.

edit: Started moving things around and seems like there a bunch of dependencies between them. I'll keep going until I get something that makes sense, hopefully the resulting PR won't be too big.

Thank you @AdamGS -- that is super helpful

In general, fewer smaller PRs is far easier to review. Also PRs that are mostly mechanical are also easy / fast to review

PRs that make changes that might have larger downstream implications are harder / take longer (as they require finding more focused time to review them)

@AdamGS
Copy link
Contributor

AdamGS commented Feb 23, 2025

Ok I got a first draft that just moves FileStream and FileScanConfig and everything that comes with them. There are still some small issues to solve, mostly around test functionality that they pull from the core crate but expect that to be solvable.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 24, 2025

#14838 turned out to be easier than I thought, mostly because @logan-keede did much of the work to move test infrastructure over.

@AdamGS
Copy link
Contributor

AdamGS commented Feb 25, 2025

I'll keep going later this week, my current plan is to split the rest of the work into three PRs:

  1. FileFormat - Move FileFormat and related pieces to datafusion-datasource #14873
  2. ListingTable and friends.
  3. Split out each individual format implementation into its own crate (csv, json, parquet, avro).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants