Introduce (default) number of partitions option, use it in DataFusion/Ballista #683
FYI @andygrove
ballista/rust/core/src/utils.rs
Outdated
let config = ExecutionConfig::new().with_concurrency(2); // TODO: this is hack to enable partitioned joins
let config = ExecutionConfig::new()
.with_concurrency(1)
.with_partitions(4); // TODO: make it easier to configure from Ballista
I filed https://github.com/apache/arrow-datafusion/issues/682 for this
datafusion/src/execution/context.rs
Outdated
@@ -681,6 +684,14 @@ impl ExecutionConfig {
self
}

/// Customize default number of partitions being used in repartioning
pub fn with_partitions(mut self, n: usize) -> Self {
Perhaps with_default_partitions would be better?
datafusion/src/execution/context.rs
Outdated
@@ -639,6 +641,7 @@ impl Default for ExecutionConfig {
fn default() -> Self {
Self {
concurrency: num_cpus::get(),
partitions: num_cpus::get(),
For backwards compatibility, what would you think about defaulting partitions to None and, if it is not set, using the value for concurrency instead? Otherwise users may have to tune two separate knobs when they may just want a big hammer: "how many cores should DataFusion try and keep busy?"
I spent some more time reviewing the codebase this morning and it appears that we no longer use the |
Please see the comment I added. I don't think we want two configs for concurrency vs min partitions.
Please take a look at my proposal in #706
Closing as this seems to be superseded by #706 -- please reopen if that is not correct
Which issue does this PR close?
Closes apache/datafusion-ballista#20
Rationale for this change
Currently we have concurrency=partitions, which makes sense for good use of parallelism when you can load the results of a single partition in memory, but is not what you want in a distributed / big data setting.
What changes are included in this PR?
This PR adds a new option to DataFusion, partitions (set via with_partitions), that is used for the default number of partitions in a Hash or RoundRobin repartitioning, or a shuffle in Ballista.
I experimented with setting the number of partitions in Ballista a bit higher, but it seems the writing / reading of small partitions has quite a bit of overhead, and maybe picking up the tasks in the executors does too (?); for a small dataset this is very slow.
I used a couple of executors too.
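To illustrate what the partition count controls, here is a rough sketch of how rows could be assigned to n output partitions under round-robin and hash repartitioning. The function names are hypothetical and the hashing uses the Rust standard library's DefaultHasher purely for illustration; this is not DataFusion's or Ballista's actual implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Round-robin: the i-th row goes to partition i % n, regardless of content.
fn round_robin_partition(row_index: usize, n: usize) -> usize {
    assert!(n > 0, "need at least one partition");
    row_index % n
}

/// Hash: rows with equal keys always land in the same partition,
/// which is what partitioned joins and aggregations rely on.
fn hash_partition<K: Hash>(key: &K, n: usize) -> usize {
    assert!(n > 0, "need at least one partition");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % n as u64) as usize
}

/// Split a slice of keys into n buckets by hash.
fn repartition_by_key<K: Hash + Clone>(rows: &[K], n: usize) -> Vec<Vec<K>> {
    let mut out = vec![Vec::new(); n];
    for row in rows {
        out[hash_partition(row, n)].push(row.clone());
    }
    out
}
```

A larger n yields more, smaller partitions. That can help when a single partition does not fit in memory, but, as noted above, each extra partition carries its own write / read and scheduling overhead.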
Are there any user-facing changes?
Added members to ExecutionConfig.
As ExecutionConfig is public and a field, partitions, has been added, it is technically an API change.