Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ballista: Finish implementing shuffle mechanism [DRAFT] #709

Closed
wants to merge 9 commits into from

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Jul 11, 2021

Which issue does this PR close?

Closes #707 but is actually much larger than one issue because it turns out that the shuffle mechanism wasn't fully implemented and wasn't really being used.

With this PR, I now see the executors using hash partitioning in the shuffle writes.

=== Physical plan with metrics ===
ShuffleWriterExec: Some(Hash([Column { name: "l_orderkey", index: 0 }], 2)), metrics=[writeTime=21538883]
  CoalesceBatchesExec: target_batch_size=4096, metrics=[]

Other changes:

  • Fixed bug where all shuffle tasks within the same query stage were writing to the same output files, causing corruption
  • TaskStatus now includes task meta-data so that that the scheduler can pass the correct partition meta-data to shuffle readers (and skip empty partitions)

Remaining work:

  • Get it all working

Rationale for this change

Shuffles were broken. The executor always ran the shuffle writes with partioning of None instead of the hash partitioning they were supposed to use. This information was never sent as part of the protobuf. Queries still worked correctly but this was not scaling since there was always a single partition.

What changes are included in this PR?

Are there any user-facing changes?

No

@andygrove andygrove self-assigned this Jul 11, 2021
@andygrove
Copy link
Member Author

@edrevo fyi

@andygrove andygrove changed the title Ballista: Fix shuffle writes Ballista: Fix shuffle writes [DRAFT] Jul 11, 2021
@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jul 15, 2021
@andygrove andygrove changed the title Ballista: Fix shuffle writes [DRAFT] Ballista: Finish implementing shuffle mechanism [DRAFT] Jul 15, 2021
@andygrove andygrove closed this Jul 17, 2021
@andygrove andygrove deleted the fix-shuffle-write branch February 6, 2022 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ballista: Finish implementing shuffle mechanism
1 participant