[Disco] Add loader for presharded params. #15957
This PR was developed in collaboration with @csullivan, and is based on #15676.
Rebased onto main to re-run CI, as 2-week-old CI results are a bit stale for my preferences. @junrushao Could I get a review on this PR?
Prior to this commit, sharding of model weights was always performed when initializing the model. This could cause slow initialization, especially for larger numbers of GPUs, as all model weights are initially transferred to GPU-0, before being scattered to all workers. This commit updates the `tvm::runtime::ShardLoaderObj` to also allow loading of pre-sharded model weights. With pre-sharded model weights, the tensors are sharded while the model is being built, and each worker independently loads the specific model weights that it requires.
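To make the difference between the two loading flows concrete, here is a minimal sketch in plain Python. It does not use any TVM or Disco API: the function names, the `shard_{rank}` keys, and the dict standing in for on-disk shard files are all hypothetical, and the row-wise split is only an assumed sharding strategy.

```python
# Illustrative only: plain Python, no TVM APIs. All names are hypothetical.

def split_rows(weight, num_workers):
    """Split a list of rows into num_workers equal, contiguous shards."""
    assert len(weight) % num_workers == 0, "rows must divide evenly"
    step = len(weight) // num_workers
    return [weight[i * step:(i + 1) * step] for i in range(num_workers)]


def load_then_shard(weight, num_workers):
    """Previous flow: one worker stages the full tensor, then scatters it."""
    staged_on_worker0 = list(weight)  # the full tensor lives on a single worker
    return split_rows(staged_on_worker0, num_workers)


def preshard_at_build(weight, num_workers):
    """Build-time step: store one shard per worker (dict stands in for files)."""
    shards = split_rows(weight, num_workers)
    return {f"shard_{rank}": shard for rank, shard in enumerate(shards)}


def load_presharded(storage, rank):
    """Pre-sharded flow: each worker reads only the shard it needs."""
    return storage[f"shard_{rank}"]


# A 4x2 "weight" split across 2 workers.
weight = [[r * 10 + c for c in range(2)] for r in range(4)]

old = load_then_shard(weight, 2)
storage = preshard_at_build(weight, 2)
new = [load_presharded(storage, rank) for rank in range(2)]

# Both flows give each worker the same data; only the loading path differs.
assert old == new
```

In the second flow no single worker ever holds the full tensor, which is the property the commit message relies on to avoid the GPU-0 bottleneck.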
I confirmed that it works. @junrushao Any concerns in merging this?