
fix uneven issue & add balance autotp #4697

Merged
merged 8 commits into deepspeedai:master on Jan 13, 2024

Conversation

Yejing-Lai
Contributor

This PR aims to balance the shard size of each worker as evenly as possible.

  1. We refactor the tp_shard logic so that AutoTP works when split_shape % num_kv_heads != 0.
  2. When num_kv_heads is defined, the attention module relies on it for sharding, but the mlp and lm_head modules can use near-even division to get more balanced shards, which gives better performance (see the sketch below).
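
A minimal sketch of the two strategies, with illustrative names and shapes (not the actual tp_shard code):

```python
# Illustrative sketch only; function names and shapes are not the actual tp_shard code.

def attn_shard_sizes(num_kv_heads, head_dim, num_ranks):
    # Attention: each rank must own whole KV heads, so distribute heads first;
    # the first (num_kv_heads % num_ranks) ranks receive one extra head.
    base, rem = divmod(num_kv_heads, num_ranks)
    return [(base + (1 if r < rem else 0)) * head_dim for r in range(num_ranks)]

def near_even_shard_sizes(dim, num_ranks):
    # mlp / lm_head: no head constraint, so split the dimension as evenly as possible.
    base, rem = divmod(dim, num_ranks)
    return [base + (1 if r < rem else 0) for r in range(num_ranks)]

print(attn_shard_sizes(8, 128, 3))      # [384, 384, 256]  (3, 3, 2 heads)
print(near_even_shard_sizes(11008, 3))  # [3670, 3669, 3669]
```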

@Yejing-Lai
Contributor Author

Hi @RezaYazdaniAminabadi @delock. Could you please help review this PR? Thanks~

@delock
Collaborator

delock commented Nov 17, 2023

Hi @Yejing-Lai, can you give some explanation on the need to have a granularity of 64 elements?
https://github.com/microsoft/DeepSpeed/pull/4697/files#diff-214e32993d5440123080193836e988f024771aa4f6931c614ef9ad42a493f398R31

@Yejing-Lai
Contributor Author

Hi @Yejing-Lai, can you give some explanation on the need to have a granularity of 64 elements? https://github.com/microsoft/DeepSpeed/pull/4697/files#diff-214e32993d5440123080193836e988f024771aa4f6931c614ef9ad42a493f398R31

DNN libraries favor tensor sizes with power-of-two granularity, so we pick 64 as a common granularity size.
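
A hedged sketch of how such a granularity could be applied to a near-even split; only the constant 64 comes from the comment above, the rest is illustrative:

```python
# Sketch of a 64-element granularity; only the constant comes from the comment above.
GRANULARITY = 64

def granular_shard_sizes(dim, num_ranks, granularity=GRANULARITY):
    # Split `dim` into near-even chunks measured in whole 64-element blocks,
    # appending any remainder to the last rank so the total stays unchanged.
    blocks, tail = divmod(dim, granularity)
    base, extra = divmod(blocks, num_ranks)
    sizes = [(base + (1 if r < extra else 0)) * granularity for r in range(num_ranks)]
    sizes[-1] += tail
    return sizes

print(granular_shard_sizes(11008, 3))  # [3712, 3648, 3648], each a multiple of 64
```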

@delock
Collaborator

delock commented Nov 21, 2023

Hi @RezaYazdaniAminabadi, FYI: this PR improves AutoTP sharding when the number of heads cannot be evenly divided by the number of ranks. MLP layers will get better load balance when running AutoTP on 3 devices or 3 CPU sub-NUMA clusters (see the worked example below).
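
A worked example for the 3-rank case, assuming LLaMA-7B-like shapes (32 heads, mlp intermediate size 11008) and assuming, purely for illustration, that the mlp shards would otherwise follow the attention head ratio:

```python
# Assumed shapes for illustration only; the pre-PR behavior is not taken from this thread.
num_ranks, num_heads, mlp_dim = 3, 32, 11008

# Heads can only be split as whole units: 11 / 11 / 10 per rank.
heads = [num_heads // num_ranks + (1 if r < num_heads % num_ranks else 0)
         for r in range(num_ranks)]

# If the mlp dimension followed the same 11:11:10 ratio, the largest shard would be
# 10% bigger than the smallest; a near-even split keeps the gap to a single element.
ratio_split = [mlp_dim * h // num_heads for h in heads]
near_even = [mlp_dim // num_ranks + (1 if r < mlp_dim % num_ranks else 0)
             for r in range(num_ranks)]

print(heads)        # [11, 11, 10]
print(ratio_split)  # [3784, 3784, 3440]
print(near_even)    # [3670, 3669, 3669]
```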

@tjruwase
Contributor

@Yejing-Lai, please help resolve the conflict.

@Yejing-Lai
Contributor Author

@Yejing-Lai, please help resolve the conflict.

Hi @tjruwase. I resolved the conflict. Can you approve the workflows? Thanks~

@Yejing-Lai
Contributor Author

Hi @tjruwase. The conflict has been resolved. Could you please help review this PR? Thanks~

@delock
Collaborator

delock commented Jan 12, 2024

Hi @tjruwase, is this PR still under review, or is it ready to merge? We are working on an Intel Extension for PyTorch release and want to know whether this PR will be included in the next DeepSpeed release. Thanks!

@tjruwase
Contributor

Hi @tjruwase, is this PR still under review, or is it ready to merge? We are working on an Intel Extension for PyTorch release and want to know whether this PR will be included in the next DeepSpeed release. Thanks!

@delock, sorry for the delay. This should be reviewed soon and will be included in the next release.

@awan-10 awan-10 added this pull request to the merge queue Jan 12, 2024
Merged via the queue into deepspeedai:master with commit 29417ab Jan 13, 2024
12 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Lev Kurilenko <[email protected]>