Support for executing infinite files and boundedness-aware join reordering rule #4694
Conversation
As we all agreed when discussing the streaming roadmap a few weeks ago, our initial focus was to identify the small-but-powerful infrastructural improvements and hooks we can introduce in Datafusion so that more complex streaming use cases can be supported either outside or inside Datafusion. IMO, this PR makes a huge step towards this goal. It enables Datafusion to process infinite files like FIFOs, presents an API for factoring in boundedness during planning and optimization, and even gives Datafusion the power to deduce whether it can run a given query with the given finite/infinite inputs. As @metesynnada mentions, the PR looks big, but the LOC comes mostly from tests. Other than those, changes are mostly localized to the file defining the new rule. Looking forward to your comments and feedback!
Thank you for this PR -- I hope to find time to review it in the next day or two.
I am sorry for the delay in review. It is in my queue; I just haven't had the time yet.
No worries, thank you!
I went through this code and tests carefully and I found it well written, well documented, and straightforward to follow 🏆. Very impressive work.
Overall Feedback
Thank you for the wonderful PR description.
I left some minor comments but I don't think they are necessary to merge this PR. Two I think are worth addressing (with comments, not necessarily code) are:
- The new `nix` dependency
- Running `BasicEnforcement` twice
End to end tests
I also recommend creating an even higher level test in `datafusion/core/tests/fifo.rs` that:
- Creates a FIFO file
- Plans a query with infinite input
- Starts running the query
- Feeds data into the FIFO
- Ensures the query produces results

to test this feature end-to-end, as sketched below.
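A rough sketch of the FIFO plumbing such a test needs (this assumes the `nix` crate mentioned above plus `tempfile`; the DataFusion half is reduced to comments, and all names here are illustrative, not the PR's actual test code):

```rust
use std::{
    fs::{File, OpenOptions},
    io::{BufRead, BufReader, Write},
    thread,
};

use nix::{sys::stat::Mode, unistd::mkfifo};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dir = tempfile::tempdir()?;
    let fifo = dir.path().join("source.csv");
    // Create the FIFO; opening it blocks until both ends attach.
    mkfifo(&fifo, Mode::S_IRWXU)?;

    // Writer thread: feeds CSV rows into the FIFO while the query runs.
    let path = fifo.clone();
    let writer = thread::spawn(move || {
        let mut f = OpenOptions::new().write(true).open(path).unwrap();
        for i in 0..10 {
            writeln!(f, "{},{}", i, i * 2).unwrap();
        }
        // Dropping `f` closes the write end, signalling EOF to the reader.
    });

    // Reader side: in the real test this would be DataFusion running a SQL
    // query over the FIFO registered as an infinite CSV source, asserting
    // that record batches are produced. Plain line reads stand in here.
    let reader = BufReader::new(File::open(&fifo)?);
    for line in reader.lines() {
        println!("got row: {}", line?);
    }
    writer.join().unwrap();
    Ok(())
}
```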
Documentation
It would be great to document the capability somewhere in the docs to help it become more discoverable -- somewhere in
https://github.com/apache/arrow-datafusion/tree/master/docs
I added a note to the "documentation epic" #3058.
I recommend eventually moving the end to end (SQL) tests with FIFOs into their own test file, perhaps something like `datafusion/core/tests/fifo.rs`. Doing so would help their discoverability.
```rust
        _,
        JoinType::Right | JoinType::RightSemi | JoinType::RightAnti | JoinType::Full,
    ) => Err(DataFusionError::Internal(format!(
        "{} join cannot be swapped.",
```
"{} join cannot be swapped.", | |
"{} join cannot be swapped for unbounded input.", |
I recommend making these messages more specific to aid `grep`ing for them if they are hit in end to end tests.
```rust
let physical_optimizer_subrules: Vec<Box<PipelineCheckerSubrule>> =
    vec![Box::new(hash_join_swap_subrule)];
let state = pipeline.transform_up(&|p| {
```
I don't understand the need for a Vec of subrules. Does this mean you intend to add more such rules in a follow on PR?
Yes, we are planning to expand this set of subrules. Separating the execution plan changes (there will eventually be more than one 😀) into multiple `PhysicalOptimizerRule`s would add redundant plan traversals (and possibly a fixed-point iteration), so we plan to make all the necessary executor changes during a single traversal.
cc @hntd187
Thanks for the detailed review. We are also excited about adding such functionality to Datafusion. We will work on documentation as well 😀. I'd like to ask about the "end-to-end" test: I provide the
I think what I was imagining was something that showed this feature working end to end -- with SQL and a FIFO file as input and record batches as output.
@metesynnada please let me know when you are ready to merge this PR and I will do so. We can then continue to iterate on the API in subsequent PRs I think.
@alamb, feel free to merge after CI passes. The last commit I just sent only has some minor code simplifications and comment improvements. We agree to address the API stuff in a follow-on. Thank you!
Thanks again @ozankabak
Benchmark runs are scheduled for baseline = 760f108 and contender = cd4fd80. cd4fd80 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Note: Approximately 1100 lines of changes come from test and test utils.
Which issue does this PR close?
Makes progress on #4285.
Closes #4692
Rationale for this change
This PR adds support for FIFO files and relevant basic infrastructure to Datafusion so that others can build on top of Datafusion to develop more complex streaming systems.
Key features and benefits of this addition include:
- Processing infinite files such as FIFOs
- An API for factoring in boundedness during planning and optimization
- The ability to deduce whether a given query can run with the given finite/infinite inputs
What changes are included in this PR?
We will discuss the changes in multiple categories:
Data source changes
The current situation is that Datafusion provides support for working with regular CSV, JSON, Avro and Parquet files. Within this PR, we worked on the first three file types so that unbounded sources involving such formats will be supported.
Changes in `CsvReadOptions`, `AvroReadOptions`, and `NdJsonReadOptions`
For each file type, we added a new attribute to its options so that a user can "mark" their data source as infinite. This information is thereafter propagated appropriately: `ListingOptions` propagates the necessary information into `FileScanConfig`, and this in turn affects each file format's execution plan (`CsvExec`, `NdJsonExec`, and `AvroExec`). We did not include `ParquetReadOptions` in this PR.
ExecutionPlan changes
The following addition to the `ExecutionPlan` API is powerful enough to provide a foundation for custom streaming operators:
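A minimal sketch of the addition (the signature follows the `unbounded_output(children: &[bool])` API referenced later in this description, with the conservative `Ok(false)` default):

```rust
/// Returns whether this operator produces an unbounded (infinite) output
/// stream, given whether each of its children is unbounded. Operators are
/// conservatively assumed to produce bounded output by default.
fn unbounded_output(&self, _children: &[bool]) -> Result<bool> {
    Ok(false)
}
```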
With this change, each `ExecutionPlan` can now know whether its input is potentially infinite or not. `FilterExec` does not collect its whole input for calculating results, so it simply propagates its input's boundedness:
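(A sketch of that override; illustrative, as `FilterExec` has a single child whose boundedness passes straight through.)

```rust
/// A filter streams batches as they arrive, so its output is unbounded
/// exactly when its single input is unbounded.
fn unbounded_output(&self, children: &[bool]) -> Result<bool> {
    Ok(children[0])
}
```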
However, `SortExec` does not override the default `Ok(false)` setting, since it has to collect everything to generate any result at all. Thus, it is a pipeline breaker.
Let's define a simple `INNER JOIN` query:
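This is the query analyzed in the optimizer discussion below, with `infinite` backed by an unbounded source and `finite` by a bounded one:

```sql
SELECT t2.c1
FROM infinite AS t1
JOIN finite AS t2 ON t1.c1 = t2.c1
```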
As we know, a `HashJoinExec` collects the left side and streams the right side. Thus, the left side has to be a finite source. Given this new simple API, we now have the ability to write a rule that swaps join inputs to transform this query into a runnable query.
Physical optimization changes
Before this PR, we could not execute the query
`SELECT t2.c1 FROM infinite as t1 JOIN finite as t2 ON t1.c1 = t2.c1`
since the left side comes from an infinite data source. However, we could save this query and make it executable by simply swapping the join sides. Now, we introduce a new physical optimizer rule named `PipelineChecker` that coordinates the "execution saver" subrules.
Swapping join sides depending on statistical properties is not new, but we add an additional swap rule that depends on the boundedness properties of the inputs (`hash_join_swap_subrule`). It basically transforms the above physical plan into this:
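Schematically (an illustrative rendering, not DataFusion's exact `EXPLAIN` output):

```
Before (not runnable):              After hash_join_swap_subrule:
HashJoinExec                        HashJoinExec
├── build: scan of `infinite`       ├── build: scan of `finite`
└── probe: scan of `finite`         └── probe: scan of `infinite`
```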
Obviously, not all queries can be "saved". Therefore, we introduce the checker that leverages the `unbounded_output(children: &[bool])` API to output a useful error indicating exactly why the query can not run (i.e. it shows where the pipeline breaks). The rule simply applies the following logic via the `transform_up` API:
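A sketch of that logic (`PipelineStatePropagator` fields and the `with_unbounded` helper are illustrative stand-ins for the PR's plan-plus-boundedness wrapper):

```rust
// Walk the plan bottom-up. At each node, gather the boundedness of the
// children (computed on earlier visits) and ask the operator whether it
// supports that configuration.
let state = pipeline.transform_up(&|p| {
    let children_unbounded: Vec<bool> =
        p.children.iter().map(|child| child.unbounded).collect();
    // An Err here means the operator cannot run with this configuration,
    // pinpointing exactly where the pipeline breaks.
    let unbounded = p.plan.unbounded_output(&children_unbounded)?;
    Ok(p.with_unbounded(unbounded))
})?;
```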
As seen above, the checker retrieves the boundedness properties of an operator's children and checks whether the operator supports the configuration in question.
Are these changes tested?
- `PipelineChecker` tests
- `hash_join_swap_subrule` tests
- `CsvReadOptions`, `NdJsonReadOptions`, and `AvroReadOptions` creation tests
- `CsvReadOptions`, `NdJsonReadOptions`, and `AvroReadOptions` to `ListingTable` conversion tests

Are there any user-facing changes?
There is one new optional flag users can supply when reading files to mark them as infinite; e.g.
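A sketch of the usage, assuming the option surfaces as a `mark_infinite` builder method on `CsvReadOptions` (the table name and FIFO path are placeholders):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Mark the CSV source as unbounded so planning and optimization treat
    // it as an infinite stream rather than a finite file.
    ctx.register_csv(
        "infinite_table",
        "path/to/fifo.csv",
        CsvReadOptions::new().mark_infinite(true),
    )
    .await?;
    Ok(())
}
```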
We also unified the APIs so that all of them now support schema injection from options, which was missing for the JSON and Avro formats. There is no breaking API change.