Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: move DataSource to datafusion-datasource #14671

Merged
merged 19 commits into from
Feb 21, 2025

Conversation

logan-keede
Copy link
Contributor

@logan-keede logan-keede commented Feb 14, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

refactor of DataSource

Are these changes tested?

not completely, only cargo check work for now

Are there any user-facing changes?

yes, but only if they are building from source.

@logan-keede logan-keede marked this pull request as draft February 14, 2025 18:27
@github-actions github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate substrait proto Related to proto crate labels Feb 14, 2025
@alamb
Copy link
Contributor

alamb commented Feb 15, 2025

So exciting...

@logan-keede
Copy link
Contributor Author

logan-keede commented Feb 15, 2025

@alamb any Idea on what to do with unit tests in physical-plan? I cant think of anything except pulling functions that depend on the component that I moved, to core_integration tests or maybe to datasource crate. is there a better method?

@alamb
Copy link
Contributor

alamb commented Feb 15, 2025

@alamb any Idea on what to do with unit tests in physical-plan? I cant think of anything except pulling functions that depend on the component that I moved, to core_integration tests or maybe to datasource crate. is there a better method?

I looked at the errors, and most of them appear to be related to the tests in physical_plan using MemoryExec (well now DataSource with a MemoryExec).

Some other ideas:

  1. Add a dev-dependency between datafusion-physical-plananddatafusion-datasource`
  2. (better) Add a mock ExecutionPlan only for testing in physical_plan -- this would end up with some amount of duplicate code with DataSourceExec but I think that would be ok for testing.
  3. Move the tests, as you say, into tests in core_integration or something

@alamb
Copy link
Contributor

alamb commented Feb 15, 2025

BTW thank you for working on this -- I suggest we try the "make a mock ExecutionPlan" approach -- that would be the cleanest I think and would be a good way for you to learn more about how ExecutionPlans worked

I can try and find time to sketch it out if needed

@logan-keede
Copy link
Contributor Author

BTW thank you for working on this -- I suggest we try the "make a mock ExecutionPlan" approach -- that would be the cleanest I think and would be a good way for you to learn more about how ExecutionPlans worked

Thanks for the suggestions. I will try to make MockExecutionPlan.

@logan-keede
Copy link
Contributor Author

@alamb, I have implemented a MockMemorySourceConfig and MockDataSource which is essentially same as what was previously in physical_plan but now for tests only. You can find the implementation here. The tests are passing on my PC, although there are some warnings, so cleaning up that part is still pending.

I think this is different from what you were suggesting. It would be pretty helpful if you can point me in the right direction.

Thanks,
Logan

@alamb
Copy link
Contributor

alamb commented Feb 16, 2025

@alamb, I have implemented a MockMemorySourceConfig and MockDataSource which is essentially same as what was previously in physical_plan but now for tests only. You can find the implementation here. The tests are passing on my PC, although there are some warnings, so cleaning up that part is still pending.

I think this is different from what you were suggesting. It would be pretty helpful if you can point me in the right direction.

Thanks, Logan

I think this sounds reasonable -- I was just saying that the split between MockMemorySourceConfig and MockDataSource is likely not necessary. There is a split in the current DataSourceConfig and DataSourceExec to permit sharing between difefrent data sources

@logan-keede
Copy link
Contributor Author

I think this sounds reasonable -- I was just saying that the split between MockMemorySourceConfig and MockDataSource is likely not necessary.

It does look that way, I will try to make it one Mock struct.

@logan-keede logan-keede changed the title WIP: move DataSource to datafusion-datasource refactor: move DataSource to datafusion-datasource Feb 18, 2025
@logan-keede logan-keede marked this pull request as ready for review February 18, 2025 19:11
@logan-keede
Copy link
Contributor Author

@alamb What do we want to do with benches? (they are causing circular dependency)
except that I think this is ready for review.

@alamb
Copy link
Contributor

alamb commented Feb 19, 2025

@alamb What do we want to do with benches? (they are causing circular dependency) except that I think this is ready for review.

I think we should move the benches into datafusion/core/benches

@alamb alamb mentioned this pull request Feb 19, 2025
@alamb
Copy link
Contributor

alamb commented Feb 19, 2025

Thank you very much for this PR @logan-keede -- I will try and review it later today (I am out of the office this week so I have only limited time to review PRs)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you SO much @logan-keede -- this is a really nice PR -- its structure is 💯 and the fact you got everything working is very impressive

I had a few small comments (unecessary dependency and a rename) but otherwise this PR is ready to go. I also think if necessary we could do those refactors as follow on PRs

Again thank you so much

FYI @mertak-synnada

@@ -28,10 +28,10 @@ use arrow::record_batch::RecordBatch;
use arrow::util::pretty::pretty_format_batches;
use datafusion::catalog::{Session, TableFunctionImpl};
use datafusion::common::{plan_err, Column};
use datafusion::datasource::memory::MemorySourceConfig;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

giphy

@@ -222,3 +222,7 @@ required-features = ["nested_expressions"]
[[bench]]
harness = false
name = "dataframe"

[[bench]]
harness = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


/// Data source configuration for reading in-memory batches of data
#[derive(Clone)]
pub struct MemorySourceConfig {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just confirming that the plan is that datasource will have MemorySourceConfig / MemorySource and that we will move the other format specific things (like parquet) to their own crates like datafusion-datasource-parquet ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, That's what I had in my mind so far. However, now that I think about it, I don't think format-specific crate(except parquet) will have more than 2(fileformat, Source+mod) files, If we are going to keep a separate crate for them, maybe we can keep a separate crate for MemorySourceConfig(1 file, Source + mod.rs ) too. What do you think?

use crate::memory::MemorySourceConfig;
use crate::source::DataSourceExec;
use crate::test::MockMemorySourceConfig;
// use crate::test::MockMemorySourceConfig;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left over?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, at a few more places too. I think I have cleaned this out with latest commit.

use crate::repartition::RepartitionExec;
use crate::source::DataSourceExec;
use crate::test::MockMemorySourceConfig;
// use crate::test::MockMemorySourceConfig;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// use crate::test::MockMemorySourceConfig;


pub mod exec;

#[derive(Clone, Debug)]
pub struct MockMemorySourceConfig {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions:

  1. Document that this implements an in memory DataSource for testing purposes (and maybe we can add a link to the real memory source) and that it duplicates much of the MemorySourceConfig to avoid a dependency between physical-plan and datasource
  2. Rename it to something like TestMemoryExec as this is not a config but rather something that impl ExecutionPlan and is used as a source of data in the tests

Copy link
Contributor

@mertak-synnada mertak-synnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for all the hard work. I’ve added a few optional minor comments

Comment on lines 283 to 301
/// A memory table can be ordered by multiple expressions simultaneously.
/// [`EquivalenceProperties`] keeps track of expressions that describe the
/// global ordering of the schema. These columns are not necessarily same; e.g.
/// ```text
/// ┌-------┐
/// | a | b |
/// |---|---|
/// | 1 | 9 |
/// | 2 | 8 |
/// | 3 | 7 |
/// | 5 | 5 |
/// └---┴---┘
/// ```
/// where both `a ASC` and `b DESC` can describe the table ordering. With
/// [`EquivalenceProperties`], we can keep track of these equivalences
/// and treat `a ASC` and `b DESC` as the same ordering requirement.
///
/// Note that if there is an internal projection, that projection will be
/// also applied to the given `sort_information`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can remove this documentation and add a link to the original MemorySourceConfig

}

#[cfg(test)]
mod memory_exec_tests {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This looks forgotten. While moving the tests, I think we can rename the module as memory_source_tests since there's no MemoryExec

@logan-keede
Copy link
Contributor Author

@alamb @mertak-synnada TYSM for the review!! I think I have addressed your suggestions in the latest commit. Let me know if they look good to you (especially the documentation part) or if you have anything else.

@@ -49,8 +49,15 @@ use futures::{Future, FutureExt};

pub mod exec;

/// `TestMemoryExec` is a mock equivalent to [`MemorySourceConfig`] with [`ExecutionPlan`] implemented for testing.
/// i.e. It has some but not all the functionality of [`MemorySourceConfig`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Feb 21, 2025

Thanks @logan-keede and @mertak-synnada -- this looks great and thanks (again) for driving the plan forward

@alamb alamb merged commit b66beb8 into apache:main Feb 21, 2025
26 checks passed
ozankabak pushed a commit to synnada-ai/datafusion-upstream that referenced this pull request Feb 25, 2025
* build moves only tests+benches pending

* unstable

* some tests fixed

* Mock MemorySourceConfig and DataSource

* some cleanup

* test pass but Mock is not efficient

* temporary stable

* one struct, test pass, cleaning pending

* cleaning

* more cleaning

* clippy

* 🧹🧹*cleaning*🧹🧹

* adding re-export

* fix:cargo fmt

* fix: doctest

* fix: leftout doctest

* fix: circular dependency

* clean, rename, document, improve
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules physical-expr Physical Expressions proto Related to proto crate substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants