Implement streaming versions of Dataframe.collect methods #789
Conversation
Co-authored-by: Andrew Lamb <[email protected]>
I think the idea looks very nice 👍
datafusion/src/dataframe.rs
/// # Ok(())
/// # }
/// ```
async fn collect_stream(&self) -> Result<SendableRecordBatchStream>;
What if we called this something like `execute` rather than `collect_stream`?

async fn execute_stream(&self) -> Result<SendableRecordBatchStream>;

This would mirror the naming of `ExecutionPlan::execute` and might make it clearer that `collect` means collect into a Vec and `execute` means get a stream.
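To illustrate the distinction (a minimal sketch, not code from this PR — it assumes a `DataFrame` trait object obtained elsewhere and that the stream's error type converts into `DataFusionError`):

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrame;
use datafusion::error::Result;
use futures::StreamExt;

// Hypothetical caller, illustrating the naming distinction discussed above.
async fn sketch(df: &dyn DataFrame) -> Result<()> {
    // collect: execute the plan and buffer every batch into a Vec.
    let batches: Vec<RecordBatch> = df.collect().await?;
    println!("collected {} batches", batches.len());

    // execute_stream: execute the plan and pull batches one at a time.
    let mut stream = df.execute_stream().await?;
    while let Some(batch) = stream.next().await {
        println!("streamed a batch with {} rows", batch?.num_rows());
    }
    Ok(())
}
```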
Good idea. I renamed these to `execute_stream` and `execute_stream_partitioned`.
/// Convert the logical plan represented by this DataFrame into a physical plan and
/// execute it, collecting all resulting batches into memory while maintaining
/// partitioning
async fn collect_partitioned(&self) -> Result<Vec<Vec<RecordBatch>>> {
    let state = self.ctx_state.lock().unwrap().clone();
You could probably rewrite `collect_partitioned` to be in terms of `collect_stream_partitioned`:

collect(self.collect_stream_partitioned().await?)

or something like that
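A rough sketch of that refactor idea, written here against the method name the PR eventually settled on (`execute_stream_partitioned`); the free-function form, imports, and error conversions are assumptions rather than the code that was merged:

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrame;
use datafusion::error::Result;
use futures::TryStreamExt;

// Hypothetical: build the buffered, partition-preserving result by draining
// each partition's stream, instead of duplicating the execution logic.
async fn collect_partitioned_via_streams(df: &dyn DataFrame) -> Result<Vec<Vec<RecordBatch>>> {
    let mut partitions = Vec::new();
    for stream in df.execute_stream_partitioned().await? {
        // try_collect drains one partition's stream into a Vec<RecordBatch>.
        partitions.push(stream.try_collect().await?);
    }
    Ok(partitions)
}
```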
I've cleaned the code up and removed a fair bit of duplication now.
Looks like a really nice change to me
Which issue does this PR close?

Closes #47.

Rationale for this change

In addition to the current `collect*` methods that load results into memory in a `Vec<RecordBatch>`, this PR adds alternate `execute_stream*` methods that return streams instead, so that results don't have to be loaded into memory before being processed.

What changes are included in this PR?

New `execute_stream` and `execute_stream_partitioned` methods on `DataFrame`.

Are there any user-facing changes?

Yes, new `DataFrame` methods.
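For illustration only, a hedged sketch of how the new partitioned streaming method might be consumed — the per-partition row counting, the `tokio::spawn` usage, and the error plumbing are assumptions, not part of this PR:

```rust
use datafusion::dataframe::DataFrame;
use datafusion::error::{DataFusionError, Result};
use futures::StreamExt;

// Hypothetical consumer: process each partition's stream on its own task,
// counting rows without buffering whole partitions in memory.
async fn rows_per_partition(df: &dyn DataFrame) -> Result<Vec<usize>> {
    let mut handles = Vec::new();
    for mut stream in df.execute_stream_partitioned().await? {
        handles.push(tokio::spawn(async move {
            let mut rows = 0usize;
            while let Some(batch) = stream.next().await {
                rows += batch?.num_rows();
            }
            Ok::<usize, DataFusionError>(rows)
        }));
    }
    let mut counts = Vec::new();
    for handle in handles {
        // Surface task join failures as execution errors, then unwrap the
        // task's own Result.
        let joined = handle
            .await
            .map_err(|e| DataFusionError::Execution(e.to_string()))?;
        counts.push(joined?);
    }
    Ok(counts)
}
```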