Upgrade deepspeed version #17748

Merged: awaelchli merged 11 commits into master from ci/update-deepspeed on Jun 5, 2023

@awaelchli awaelchli commented Jun 4, 2023

What does this PR do?

Bumps the DeepSpeed version to 0.9.3. This release introduced a change that breaks our existing tests: as of 0.9.3, submodules defined outside configure_sharded_model are no longer moved to the GPU automatically. This is the error we get on master with 0.9.3:

=================================== FAILURES ===================================
_______________ test_deepspeed_multigpu_stage_3_resume_training ________________

args = (tensor([[0.3834, 0.2902, 0.3263],
        [0.3819, 0.2893, 0.3288],
        [0.3817, 0.2918, 0.3265],
        [0.3827... 0.3266],
        [0.3834, 0.2892, 0.3275]], device='cuda:0'), tensor([2, 0, 2, 0, 1, 2, 1, 2, 0, 0], device='cuda:0'))
kwargs = {}

    @functools.wraps(update)
    def wrapped_func(*args: Any, **kwargs: Any) -> None:
        self._computed = None
        self._update_count += 1
        with torch.set_grad_enabled(self._enable_grad):
            try:
>               update(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py:390: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = MulticlassAccuracy()
preds = tensor([[0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0]], device='cuda:0')
target = tensor([[2],
        [0],
        [2],
        [0],
        [1],
        [2],
        [1],
        [2],
        [0],
        [0]], device='cuda:0')

    def update(self, preds: Tensor, target: Tensor) -> None:  # type: ignore
        """Update state with predictions and targets."""
        if self.validate_args:
            _multiclass_stat_scores_tensor_validation(
                preds, target, self.num_classes, self.multidim_average, self.ignore_index
            )
        preds, target = _multiclass_stat_scores_format(preds, target, self.top_k)
        tp, fp, tn, fn = _multiclass_stat_scores_update(
            preds, target, self.num_classes, self.top_k, self.average, self.multidim_average, self.ignore_index
        )
>       self._update_state(tp, fp, tn, fn)

/usr/local/lib/python3.10/dist-packages/torchmetrics/classification/stat_scores.py:322: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = MulticlassAccuracy(), tp = tensor(4, device='cuda:0')
fp = tensor(6, device='cuda:0'), tn = tensor(14, device='cuda:0')
fn = tensor(6, device='cuda:0')

    def _update_state(self, tp: Tensor, fp: Tensor, tn: Tensor, fn: Tensor) -> None:
        """Update states depending on multidim_average argument."""
        if self.multidim_average == "samplewise":
            self.tp.append(tp)
            self.fp.append(fp)
            self.tn.append(tn)
            self.fn.append(fn)
        else:
>           self.tp += tp
E           RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

/usr/local/lib/python3.10/dist-packages/torchmetrics/classification/stat_scores.py:70: RuntimeError


...

E                   RuntimeError: Encountered different devices in metric calculation (see stacktrace for details). This could be due to the metric class not being on the same device as input. Instead of `metric=MulticlassAccuracy(...)` try to do `metric=MulticlassAccuracy(...).to(device)` where device corresponds to the device of the input.

The change was introduced upstream in microsoft/DeepSpeed#3611. This PR makes our tests pass with 0.9.3.
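
For readers hitting the same failure, the error message above suggests moving the metric to the input's device explicitly. The snippet below is only a minimal sketch of that suggestion, not necessarily the change made in this PR. `CountingMetric` is a hypothetical stand-in for a torchmetrics metric (it is not part of any library), and the device selection falls back to CPU so the example runs without a GPU.

```python
import torch
from torch import nn


class CountingMetric(nn.Module):
    """Hypothetical stand-in for a metric: keeps a running count in a buffer."""

    def __init__(self) -> None:
        super().__init__()
        # Buffers follow the module when `.to(device)` is called on it.
        self.register_buffer("tp", torch.zeros((), dtype=torch.long))

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # Same pattern as `self.tp += tp` in the traceback: the state and the
        # inputs are expected to live on the same device.
        self.tp += (preds == target).sum()


# Assumption: use the first GPU when available, otherwise stay on CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Move the metric explicitly instead of relying on the strategy to do it.
metric = CountingMetric().to(device)

preds = torch.tensor([0, 0, 0, 0], device=device)
target = torch.tensor([2, 0, 2, 0], device=device)
metric.update(preds, target)
```

After the explicit `.to(device)`, the metric state and the inputs share a device, so the in-place update no longer mixes `cuda:0` and `cpu` tensors.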

cc @carmocca @Borda @awaelchli @justusschock

@awaelchli added the ci, strategy: deepspeed, fabric, and pl labels on Jun 4, 2023
@awaelchli added this to the 2.1 milestone on Jun 4, 2023
@awaelchli changed the title from "Upgrade deepspeed version" to "WIP Upgrade deepspeed version" on Jun 4, 2023
@awaelchli requested a review from justusschock as a code owner on June 4, 2023 21:58

github-actions bot commented Jun 4, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.11) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
pl-cpu (windows-2022, lightning, 3.8, 1.11) success
pl-cpu (windows-2022, lightning, 3.9, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success

These checks are required after the changes to requirements/fabric/strategies.txt, requirements/pytorch/strategies.txt, tests/tests_pytorch/strategies/test_deepspeed_strategy.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) success

These checks are required after the changes to requirements/pytorch/strategies.txt, tests/tests_pytorch/strategies/test_deepspeed_strategy.py, requirements/fabric/strategies.txt.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to requirements/fabric/strategies.txt, requirements/pytorch/strategies.txt.

🟢 fabric: Docs
Check ID Status
make-doctest (fabric) success
make-html (fabric) success

These checks are required after the changes to requirements/fabric/strategies.txt.

🟢 pytorch_lightning: Docs
Check ID Status
make-doctest (pytorch) success
make-html (pytorch) success

These checks are required after the changes to requirements/pytorch/strategies.txt.

🟢 pytorch_lightning: Docker
Check ID Status
build-cuda (3.9, 1.11, 11.3.1) success
build-cuda (3.9, 1.12, 11.6.1) success
build-cuda (3.9, 1.13, 11.7.1) success
build-cuda (3.10, 2.0, 11.7.1) success
build-pl (3.9, 1.11, 11.3.1) success
build-pl (3.9, 1.12, 11.6.1) success
build-pl (3.9, 1.13, 11.7.1) success
build-pl (3.10, 2.0, 11.7.1) success

These checks are required after the changes to requirements/pytorch/strategies.txt, requirements/fabric/strategies.txt.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, lightning, 3.8, 1.11) success
fabric-cpu (macOS-11, lightning, 3.9, 1.12) success
fabric-cpu (macOS-11, lightning, 3.10, 1.13) success
fabric-cpu (macOS-11, lightning, 3.10, 2.0) success
fabric-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
fabric-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11) success
fabric-cpu (windows-2022, lightning, 3.9, 1.12) success
fabric-cpu (windows-2022, lightning, 3.10, 1.13) success
fabric-cpu (windows-2022, lightning, 3.10, 2.0) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
fabric-cpu (macOS-11, fabric, 3.8, 1.13) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.13) success
fabric-cpu (windows-2022, fabric, 3.8, 1.13) success

These checks are required after the changes to requirements/fabric/strategies.txt.

🟢 lightning_fabric: Azure GPU
Check ID Status
lightning-fabric (GPUs) success

These checks are required after the changes to requirements/fabric/strategies.txt.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to requirements/fabric/strategies.txt, requirements/pytorch/strategies.txt.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.10) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.10) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.10) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.10) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.10) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.10) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.10) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.10) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.10) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.10) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.10) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.10) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.10) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.10) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.10) success

These checks are required after the changes to requirements/fabric/strategies.txt, requirements/pytorch/strategies.txt.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@awaelchli added the fun label on Jun 4, 2023
@awaelchli changed the title from "WIP Upgrade deepspeed version" to "Upgrade deepspeed version" on Jun 4, 2023
@github-actions bot removed the ci label on Jun 5, 2023
@mergify bot added the ready label on Jun 5, 2023
@awaelchli enabled auto-merge (squash) on June 5, 2023 10:06
@awaelchli merged commit 0eb8fdc into master on Jun 5, 2023
@awaelchli deleted the ci/update-deepspeed branch on June 5, 2023 10:28
Labels
fabric lightning.fabric.Fabric fun Staff contributions outside working hours - to differentiate from the "community" label pl Generic label for PyTorch Lightning package ready PRs ready to be merged strategy: deepspeed