-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: move DataSource
to datafusion-datasource
#14671
Conversation
So exciting... |
@alamb any Idea on what to do with unit tests in |
I looked at the errors, and most of them appear to be related to the tests in physical_plan using Some other ideas:
|
BTW thank you for working on this -- I suggest we try the "make a mock ExecutionPlan" approach -- that would be the cleanest I think and would be a good way for you to learn more about how ExecutionPlans worked I can try and find time to sketch it out if needed |
Thanks for the suggestions. I will try to make |
@alamb, I have implemented a MockMemorySourceConfig and MockDataSource which is essentially same as what was previously in physical_plan but now for tests only. You can find the implementation here. The tests are passing on my PC, although there are some warnings, so cleaning up that part is still pending. I think this is different from what you were suggesting. It would be pretty helpful if you can point me in the right direction. Thanks, |
I think this sounds reasonable -- I was just saying that the split between |
It does look that way, I will try to make it one Mock struct. |
DataSource
to datafusion-datasource
DataSource
to datafusion-datasource
@alamb What do we want to do with benches? (they are causing circular dependency) |
I think we should move the benches into |
Thank you very much for this PR @logan-keede -- I will try and review it later today (I am out of the office this week so I have only limited time to review PRs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you SO much @logan-keede -- this is a really nice PR -- its structure is 💯 and the fact you got everything working is very impressive
I had a few small comments (unecessary dependency and a rename) but otherwise this PR is ready to go. I also think if necessary we could do those refactors as follow on PRs
Again thank you so much
FYI @mertak-synnada
@@ -28,10 +28,10 @@ use arrow::record_batch::RecordBatch; | |||
use arrow::util::pretty::pretty_format_batches; | |||
use datafusion::catalog::{Session, TableFunctionImpl}; | |||
use datafusion::common::{plan_err, Column}; | |||
use datafusion::datasource::memory::MemorySourceConfig; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -222,3 +222,7 @@ required-features = ["nested_expressions"] | |||
[[bench]] | |||
harness = false | |||
name = "dataframe" | |||
|
|||
[[bench]] | |||
harness = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
/// Data source configuration for reading in-memory batches of data | ||
#[derive(Clone)] | ||
pub struct MemorySourceConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just confirming that the plan is that datasource will have MemorySourceConfig / MemorySource and that we will move the other format specific things (like parquet) to their own crates like datafusion-datasource-parquet
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, That's what I had in my mind so far. However, now that I think about it, I don't think format-specific crate(except parquet) will have more than 2(fileformat, Source+mod) files, If we are going to keep a separate crate for them, maybe we can keep a separate crate for MemorySourceConfig
(1 file, Source + mod.rs ) too. What do you think?
use crate::memory::MemorySourceConfig; | ||
use crate::source::DataSourceExec; | ||
use crate::test::MockMemorySourceConfig; | ||
// use crate::test::MockMemorySourceConfig; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, at a few more places too. I think I have cleaned this out with latest commit.
use crate::repartition::RepartitionExec; | ||
use crate::source::DataSourceExec; | ||
use crate::test::MockMemorySourceConfig; | ||
// use crate::test::MockMemorySourceConfig; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// use crate::test::MockMemorySourceConfig; |
datafusion/physical-plan/src/test.rs
Outdated
|
||
pub mod exec; | ||
|
||
#[derive(Clone, Debug)] | ||
pub struct MockMemorySourceConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions:
- Document that this implements an in memory
DataSource
for testing purposes (and maybe we can add a link to the real memory source) and that it duplicates much of theMemorySourceConfig
to avoid a dependency between physical-plan and datasource - Rename it to something like
TestMemoryExec
as this is not a config but rather something thatimpl ExecutionPlan
and is used as a source of data in the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you for all the hard work. I’ve added a few optional minor comments
datafusion/physical-plan/src/test.rs
Outdated
/// A memory table can be ordered by multiple expressions simultaneously. | ||
/// [`EquivalenceProperties`] keeps track of expressions that describe the | ||
/// global ordering of the schema. These columns are not necessarily same; e.g. | ||
/// ```text | ||
/// ┌-------┐ | ||
/// | a | b | | ||
/// |---|---| | ||
/// | 1 | 9 | | ||
/// | 2 | 8 | | ||
/// | 3 | 7 | | ||
/// | 5 | 5 | | ||
/// └---┴---┘ | ||
/// ``` | ||
/// where both `a ASC` and `b DESC` can describe the table ordering. With | ||
/// [`EquivalenceProperties`], we can keep track of these equivalences | ||
/// and treat `a ASC` and `b DESC` as the same ordering requirement. | ||
/// | ||
/// Note that if there is an internal projection, that projection will be | ||
/// also applied to the given `sort_information`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can remove this documentation and add a link to the original MemorySourceConfig
datafusion/datasource/src/memory.rs
Outdated
} | ||
|
||
#[cfg(test)] | ||
mod memory_exec_tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This looks forgotten. While moving the tests, I think we can rename the module as memory_source_tests
since there's no MemoryExec
@alamb @mertak-synnada TYSM for the review!! I think I have addressed your suggestions in the latest commit. Let me know if they look good to you (especially the documentation part) or if you have anything else. |
@@ -49,8 +49,15 @@ use futures::{Future, FutureExt}; | |||
|
|||
pub mod exec; | |||
|
|||
/// `TestMemoryExec` is a mock equivalent to [`MemorySourceConfig`] with [`ExecutionPlan`] implemented for testing. | |||
/// i.e. It has some but not all the functionality of [`MemorySourceConfig`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks @logan-keede and @mertak-synnada -- this looks great and thanks (again) for driving the plan forward |
* build moves only tests+benches pending * unstable * some tests fixed * Mock MemorySourceConfig and DataSource * some cleanup * test pass but Mock is not efficient * temporary stable * one struct, test pass, cleaning pending * cleaning * more cleaning * clippy * 🧹🧹*cleaning*🧹🧹 * adding re-export * fix:cargo fmt * fix: doctest * fix: leftout doctest * fix: circular dependency * clean, rename, document, improve
Which issue does this PR close?
datafusion
crate (datafusion/core
) #14444.Rationale for this change
What changes are included in this PR?
refactor of
DataSource
Are these changes tested?
not completely, only
cargo check
work for nowAre there any user-facing changes?
yes, but only if they are building from source.