Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate produces batch of data #31155

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

antiguru
Copy link
Member

Instead of flattening the outputs of consolidate, provide the output as a
whole batch, encoded as an Rc<Vec<C>>. This allows downstream operators
to act on a whole batch at once, without relying on undocumented properties
of how Timely channels behave. For operators that need to flatten the
output need slightly different logic.

An alternative approach would be to wrap the chain in a Box, but at this
time, Timely doesn't recognize Box as a valid container type.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@antiguru antiguru requested a review from a team as a code owner January 23, 2025 08:15
@antiguru antiguru requested review from petrosagg and teskje January 23, 2025 08:16
Copy link
Contributor

@teskje teskje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The looks reasonable to me, but I'll leave the approval to someone who knows the copy-to implementation.

@@ -777,7 +794,7 @@ pub fn consolidate_pact<Ba, P, G>(
stream: &StreamCore<G, Ba::Input>,
pact: P,
name: &str,
) -> StreamCore<G, Ba::Output>
) -> StreamCore<G, Rc<Vec<Ba::Output>>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of the Rc isn't clear to me. Is it a safe-guard to avoid accidental expensive cloning of Vecs when the output stream is later forked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is so that a chain of updates is inseparable, no matter what channel it traverses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern being that Timely might otherwise try to pull the Vec apart while passing it through channels? Would be great to have a comment explaining this!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the PR such that consolidate_pact produces a Stream<G, Vec<Ba::Output>>, which means that the elements we're transferring are Vec<_>. This gives us the certainty that Timely will not look into the chains.

Instead of flattening the outputs of consolidate, provide the output as a
whole batch, encoded as an `Vec<C>`. This allows downstream operators
to act on a whole batch at once, without relying on undocumented properties
of how Timely channels behave. For operators that need to flatten the
output need slightly different logic.

The copy-to-s3 operator sees some changes, which are roughly:
* Accept the chain formed by `consolidate_pact`, and rely on it
  partitioning the data according to the exchange function.
* Skip encoding the batch, and recompute it when writing the data.

Signed-off-by: Moritz Hoffmann <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants