Improve parallelism of repartition operator with multiple cores #6310
Conversation
cc @crepererum
// If the input stream is endless, we may spin forever and never yield back to tokio. Hence let us yield.
// See https://github.com/apache/arrow-datafusion/issues/5278.
tokio::task::yield_now().await;
// If the input stream is endless, we may spin forever and
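For reference, here is a minimal, self-contained sketch of the counting heuristic this PR moves to. The function and type names (drain_input, the Vec<u8> stand-in for a RecordBatch) are made up for illustration and are not DataFusion's actual internals; the point is just that the loop yields only once every partition_count batches instead of after every batch:

```rust
// Hypothetical sketch, not the real RepartitionExec code.
async fn drain_input<I>(batches: I, partition_count: usize)
where
    I: IntoIterator<Item = Vec<u8>>, // stand-in for a stream of RecordBatches
{
    let mut batches_until_yield = partition_count;
    for batch in batches {
        // ... hash or round-robin `batch` into the per-output channels ...
        drop(batch);

        if batches_until_yield == 0 {
            // Even if the input never pends on its own (e.g. a MemoryExec),
            // periodically hand control back to the scheduler so consumers
            // and other tasks can make progress.
            tokio::task::yield_now().await;
            batches_until_yield = partition_count;
        } else {
            batches_until_yield -= 1;
        }
    }
}
```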
I would love some thoughts from reviewers about better heuristics here -- as the comments say, I am happy with this heuristic for round-robin partitioning, but there may be a better way when hash partitioning (like ensuring that all channels have at least one batch 🤔)
You could use https://docs.rs/tokio/latest/tokio/task/fn.consume_budget.html but I'm honestly a little confused by this. "we may spin forever" would imply an issue with unbounded receivers, not an issue with the repartition operator?
TLDR I'd vote to not yield at all; I don't agree that this fixes #5278, it rather just papers over it with a dubious fix
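For context, a sketch of what the consume_budget alternative might look like in such a loop; the names are illustrative (not the real repartition code), and as noted later in this thread the API was gated behind tokio's unstable features at the time:

```rust
// Hypothetical sketch: rely on tokio's cooperative budget instead of an
// unconditional yield. consume_budget() is usually a no-op and only yields
// once the task's budget for this scheduling slice is exhausted.
async fn drain_input_with_budget<I>(batches: I)
where
    I: IntoIterator<Item = Vec<u8>>, // stand-in for a stream of RecordBatches
{
    for batch in batches {
        // ... partition and forward `batch` ...
        drop(batch);

        tokio::task::consume_budget().await;
    }
}
```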
I think if the tokio executor has only a single thread and the input stream can provide data infinitely, then without a yield it will buffer the entire input, which seems non-ideal
I agree #5278 as described seems somewhat more like "when we used blocking IO with a single tokio thread it blocked everything" -- as described on #5278 (comment)
it will buffer the entire input which seems non ideal
That seems like a bug in whatever is using unbounded buffers, which I thought we had removed the last of? Basically we shouldn't be relying on yield_now to return control, but on the buffer filling up
You can call it a bug or a design issue of DF / tokio. But if you run two spawned tasks and one never returns control to Tokio, the other will never run. Unbounded buffers are NOT avoidable in the current DF design, because you cannot predict tokio scheduling and hash outputs. So the fix here is adequate. consume_budget would be the better solution, but it's an unstable tokio feature, so that's not usable.
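To make the starvation point concrete, here is a small standalone example (my own illustration, not code from this PR) showing that on a single-threaded tokio runtime a task with no .await points keeps every other task from running until it finishes:

```rust
use std::time::{Duration, Instant};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let start = Instant::now();

    // Task A: busy-loops for ~2 seconds with no .await point, so it never
    // hands control back to the scheduler while it runs.
    tokio::spawn(async move {
        let t = Instant::now();
        while t.elapsed() < Duration::from_secs(2) {
            std::hint::spin_loop();
        }
        println!("task A finished after {:?}", start.elapsed());
    });

    // Task B: ready immediately, but on a current_thread runtime it typically
    // only runs once task A stops hogging the thread, so it reports an
    // elapsed time of roughly two seconds.
    tokio::spawn(async move {
        println!("task B ran after {:?}", start.elapsed());
    })
    .await
    .unwrap();
}
```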
FYI @metesynnada I don't know if you have any thoughts on this approach (or the yield in general, as you reported #5278 initially)
Hi @alamb, @metesynnada is unavailable for this month, he will be back at the end of the month. I checked this PR out and it looks better than the status quo to me.
I plan to leave this open for another day or two for comments and then will merge it in
I merged up from main -- and once CI passes I plan to merge this PR
Side note: when I added the yield statement I was wondering if this would be too much overhead, but my assumption was that the batches would hopefully be big enough that it wouldn't matter. Seems that I was wrong 😅
I think the fact that the source in this case is a MemoryExec (where all the data is already in memory and can be provided almost instantaneously) hurts us
Which issue does this PR close?
Closes #6290
Rationale for this change
I was testing query performance for #6278 and noticed that only a single core was being used on a query run entirely in memory. When I spent some time looking into it, the plan looked correct with repartitioning, but for some reason it wasn't properly repartitioning
What changes are included in this PR?
Don't yield on every batch -- yield only after we have made some decent progress (in this case at least partition_count batches).
Are these changes tested?
I manually tested this -- I will add a benchmark for it shortly
My manual test results are:
On main (Keeps only 1 core busy for most of the time)
With this PR (keeps the cores much busier)
Are there any user-facing changes?