[Bugfix] Using None for mpu when PP > 1 #34

Merged: 2 commits into awslabs:main on Feb 1, 2023

Conversation

@zarzen (Contributor) commented on Feb 1, 2023

Description

  • Check the pipeline parallel size before creating the mpu grid, to avoid an initialization error.
  • Avoid creating duplicate communication groups when pipeline parallelism is used (see the sketch below).
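A minimal sketch of the idea, assuming the DeepSpeed classes referenced in the diff further down (PipeModelDataParallelTopology, PipelineParallelGrid) and that torch.distributed is already initialized; the helper name resolve_mpu is hypothetical and not part of this PR:

    from deepspeed.runtime.pipe.topology import (
        PipeModelDataParallelTopology,
        PipelineParallelGrid,
    )

    def resolve_mpu(topology):
        # Only build an explicit grid when the caller actually passed a topology.
        if not isinstance(topology, PipeModelDataParallelTopology):
            return None
        if topology.get_dim("pipe") <= 1:
            # No pipeline parallelism: the grid can be handed to
            # deepspeed.initialize(..., mpu=grid).
            return PipelineParallelGrid(topology=topology)
        # PP > 1: return None so the pipeline engine builds its own
        # communication groups instead of duplicating them.
        return None

Usage would then look like deepspeed.initialize(model=model, mpu=resolve_mpu(kwargs.get("topology", None)), ...), i.e. mpu ends up as None whenever PP > 1.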

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [Model], [Tutorial], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

@comaniac (Contributor) left a comment:
LGTM. cc @szhengac

    @@ -22,7 +22,11 @@ def init_ds_engine(model, **kwargs):
            raise ValueError("DeepSpeed config not provided.")
        mpu = kwargs.get("topology", None)
        if mpu is not None and isinstance(mpu, PipeModelDataParallelTopology):
            mpu = PipelineParallelGrid(topology=mpu)
            if mpu.get_dim("pipe") <= 1:
Contributor (review comment on the line above):

it could be 0?

Contributor Author (@zarzen):
Not really, just for separating the conditions for pipeline and no pipeline in a binary form.

@szhengac szhengac merged commit dc795d6 into awslabs:main Feb 1, 2023