Upgrade deepspeed version #17748

awaelchli · 2023-06-04T14:45:16Z

What does this PR do?

Bumps the DeepSpeed version to 0.9.3. This release introduced a change that breaks existing tests: as of 0.9.3, submodules defined outside `configure_sharded_model` are no longer moved to the GPU automatically. This is the error we get on master with 0.9.3:

=================================== FAILURES ===================================
_______________ test_deepspeed_multigpu_stage_3_resume_training ________________

args = (tensor([[0.3834, 0.2902, 0.3263],
        [0.3819, 0.2893, 0.3288],
        [0.3817, 0.2918, 0.3265],
        [0.3827... 0.3266],
        [0.3834, 0.2892, 0.3275]], device='cuda:0'), tensor([2, 0, 2, 0, 1, 2, 1, 2, 0, 0], device='cuda:0'))
kwargs = {}

    @functools.wraps(update)
    def wrapped_func(*args: Any, **kwargs: Any) -> None:
        self._computed = None
        self._update_count += 1
        with torch.set_grad_enabled(self._enable_grad):
            try:
>               update(*args, **kwargs)

/usr/local/lib/python3.10/dist-packages/torchmetrics/metric.py:390: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = MulticlassAccuracy()
preds = tensor([[0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0]], device='cuda:0')
target = tensor([[2],
        [0],
        [2],
        [0],
        [1],
        [2],
        [1],
        [2],
        [0],
        [0]], device='cuda:0')

    def update(self, preds: Tensor, target: Tensor) -> None:  # type: ignore
        """Update state with predictions and targets."""
        if self.validate_args:
            _multiclass_stat_scores_tensor_validation(
                preds, target, self.num_classes, self.multidim_average, self.ignore_index
            )
        preds, target = _multiclass_stat_scores_format(preds, target, self.top_k)
        tp, fp, tn, fn = _multiclass_stat_scores_update(
            preds, target, self.num_classes, self.top_k, self.average, self.multidim_average, self.ignore_index
        )
>       self._update_state(tp, fp, tn, fn)

/usr/local/lib/python3.10/dist-packages/torchmetrics/classification/stat_scores.py:322: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = MulticlassAccuracy(), tp = tensor(4, device='cuda:0')
fp = tensor(6, device='cuda:0'), tn = tensor(14, device='cuda:0')
fn = tensor(6, device='cuda:0')

    def _update_state(self, tp: Tensor, fp: Tensor, tn: Tensor, fn: Tensor) -> None:
        """Update states depending on multidim_average argument."""
        if self.multidim_average == "samplewise":
            self.tp.append(tp)
            self.fp.append(fp)
            self.tn.append(tn)
            self.fn.append(fn)
        else:
>           self.tp += tp
E           RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

/usr/local/lib/python3.10/dist-packages/torchmetrics/classification/stat_scores.py:70: RuntimeError


...

E                   RuntimeError: Encountered different devices in metric calculation (see stacktrace for details). This could be due to the metric class not being on the same device as input. Instead of `metric=MulticlassAccuracy(...)` try to do `metric=MulticlassAccuracy(...).to(device)` where device corresponds to the device of the input.

The change was introduced in this PR: microsoft/DeepSpeed#3611. In this PR, I'm making our tests pass with 0.9.3.
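The device mismatch in the traceback can be checked for directly before training starts. The following is a minimal sketch that uses only `torch`; `TinyModel` and `tensors_off_device` are hypothetical names introduced for illustration and are not part of Lightning, DeepSpeed, or torchmetrics. The registered buffer `tp` stands in for a metric state, and the explicit `.to(...)` call mirrors the suggestion in the error message rather than the exact change made to the tests in this PR.

```python
from typing import List

import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for a module that owns a metric-like state."""

    def __init__(self) -> None:
        super().__init__()
        self.layer = nn.Linear(3, 3)
        # Registered buffer, analogous to a metric state such as `tp`.
        self.register_buffer("tp", torch.zeros(()))


def tensors_off_device(module: nn.Module, device: torch.device) -> List[str]:
    """Return the names of parameters and buffers whose device type differs from `device`.

    Only the device type (e.g. "cpu" vs. "cuda") is compared, not the index.
    """
    named = list(module.named_parameters()) + list(module.named_buffers())
    return [name for name, tensor in named if tensor.device.type != device.type]


if __name__ == "__main__":
    model = TinyModel()
    target = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")

    # Anything listed here would trigger the "Expected all tensors to be on
    # the same device" error once inputs arrive on `target`.
    print("off device before move:", tensors_off_device(model, target))

    # Explicit move, as the error message suggests, instead of relying on the
    # strategy to do it implicitly.
    model.to(target)
    print("off device after move:", tensors_off_device(model, target))
```

Running a check like this at the start of a test makes the failure show up as a list of offending tensor names instead of a deep traceback inside the metric update.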

cc @carmocca @Borda @awaelchli @justusschock

github-actions · 2023-06-04T21:59:29Z

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow

Check ID	Status
pl-cpu (macOS-11, lightning, 3.8, 1.11)	success	✅
pl-cpu (macOS-11, lightning, 3.9, 1.12)	success	✅
pl-cpu (macOS-11, lightning, 3.10, 1.13)	success	✅
pl-cpu (macOS-11, lightning, 3.10, 2.0)	success	✅
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest)	success	✅
pl-cpu (windows-2022, lightning, 3.8, 1.11)	success	✅
pl-cpu (windows-2022, lightning, 3.9, 1.12)	success	✅
pl-cpu (windows-2022, lightning, 3.10, 1.13)	success	✅
pl-cpu (windows-2022, lightning, 3.10, 2.0)	success	✅
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest)	success	✅
pl-cpu (macOS-11, pytorch, 3.8, 1.13)	success	✅
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13)	success	✅
pl-cpu (windows-2022, pytorch, 3.8, 1.13)	success	✅

These checks are required after the changes to requirements/fabric/strategies.txt, requirements/pytorch/strategies.txt, tests/tests_pytorch/strategies/test_deepspeed_strategy.py.

🟢 pytorch_lightning: Azure GPU

Check ID	Status
pytorch-lightning (GPUs)	success	✅

These checks are required after the changes to requirements/pytorch/strategies.txt, tests/tests_pytorch/strategies/test_deepspeed_strategy.py, requirements/fabric/strategies.txt.

🟢 pytorch_lightning: Benchmarks

Check ID	Status
lightning.Benchmarks	success	✅

These checks are required after the changes to requirements/fabric/strategies.txt, requirements/pytorch/strategies.txt.

🟢 fabric: Docs

Check ID	Status
make-doctest (fabric)	success	✅
make-html (fabric)	success	✅

These checks are required after the changes to requirements/fabric/strategies.txt.

🟢 pytorch_lightning: Docs

Check ID	Status
make-doctest (pytorch)	success	✅
make-html (pytorch)	success	✅

These checks are required after the changes to requirements/pytorch/strategies.txt.

🟢 pytorch_lightning: Docker

Check ID	Status
build-cuda (3.9, 1.11, 11.3.1)	success	✅
build-cuda (3.9, 1.12, 11.6.1)	success	✅
build-cuda (3.9, 1.13, 11.7.1)	success	✅
build-cuda (3.10, 2.0, 11.7.1)	success	✅
build-pl (3.9, 1.11, 11.3.1)	success	✅
build-pl (3.9, 1.12, 11.6.1)	success	✅
build-pl (3.9, 1.13, 11.7.1)	success	✅
build-pl (3.10, 2.0, 11.7.1)	success	✅

These checks are required after the changes to requirements/pytorch/strategies.txt, requirements/fabric/strategies.txt.

🟢 lightning_fabric: CPU workflow

Check ID	Status
fabric-cpu (macOS-11, lightning, 3.8, 1.11)	success	✅
fabric-cpu (macOS-11, lightning, 3.9, 1.12)	success	✅
fabric-cpu (macOS-11, lightning, 3.10, 1.13)	success	✅
fabric-cpu (macOS-11, lightning, 3.10, 2.0)	success	✅
fabric-cpu (macOS-11, lightning, 3.8, 1.11, oldest)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.9, 1.12)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.13)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.10, 2.0)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest)	success	✅
fabric-cpu (windows-2022, lightning, 3.8, 1.11)	success	✅
fabric-cpu (windows-2022, lightning, 3.9, 1.12)	success	✅
fabric-cpu (windows-2022, lightning, 3.10, 1.13)	success	✅
fabric-cpu (windows-2022, lightning, 3.10, 2.0)	success	✅
fabric-cpu (windows-2022, lightning, 3.8, 1.11, oldest)	success	✅
fabric-cpu (macOS-11, fabric, 3.8, 1.13)	success	✅
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.13)	success	✅
fabric-cpu (windows-2022, fabric, 3.8, 1.13)	success	✅

These checks are required after the changes to requirements/fabric/strategies.txt.

🟢 lightning_fabric: Azure GPU

Check ID	Status
lightning-fabric (GPUs)	success	✅

These checks are required after the changes to requirements/fabric/strategies.txt.

🟢 mypy

Check ID	Status
mypy	success	✅

These checks are required after the changes to requirements/fabric/strategies.txt, requirements/pytorch/strategies.txt.

🟢 install

Check ID	Status
install-pkg (ubuntu-22.04, app, 3.8)	success	✅
install-pkg (ubuntu-22.04, app, 3.10)	success	✅
install-pkg (ubuntu-22.04, fabric, 3.8)	success	✅
install-pkg (ubuntu-22.04, fabric, 3.10)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.8)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.10)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.8)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.10)	success	✅
install-pkg (ubuntu-22.04, notset, 3.8)	success	✅
install-pkg (ubuntu-22.04, notset, 3.10)	success	✅
install-pkg (macOS-12, app, 3.8)	success	✅
install-pkg (macOS-12, app, 3.10)	success	✅
install-pkg (macOS-12, fabric, 3.8)	success	✅
install-pkg (macOS-12, fabric, 3.10)	success	✅
install-pkg (macOS-12, pytorch, 3.8)	success	✅
install-pkg (macOS-12, pytorch, 3.10)	success	✅
install-pkg (macOS-12, lightning, 3.8)	success	✅
install-pkg (macOS-12, lightning, 3.10)	success	✅
install-pkg (macOS-12, notset, 3.8)	success	✅
install-pkg (macOS-12, notset, 3.10)	success	✅
install-pkg (windows-2022, app, 3.8)	success	✅
install-pkg (windows-2022, app, 3.10)	success	✅
install-pkg (windows-2022, fabric, 3.8)	success	✅
install-pkg (windows-2022, fabric, 3.10)	success	✅
install-pkg (windows-2022, pytorch, 3.8)	success	✅
install-pkg (windows-2022, pytorch, 3.10)	success	✅
install-pkg (windows-2022, lightning, 3.8)	success	✅
install-pkg (windows-2022, lightning, 3.10)	success	✅
install-pkg (windows-2022, notset, 3.8)	success	✅
install-pkg (windows-2022, notset, 3.10)	success	✅

These checks are required after the changes to requirements/fabric/strategies.txt, requirements/pytorch/strategies.txt.

Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.


upgrade deepspeed dependency

a3e017b

awaelchli requested review from Borda, carmocca, ethanwharris, lantiga and tchaton as code owners June 4, 2023 14:45

awaelchli added ci Continuous Integration strategy: deepspeed fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package labels Jun 4, 2023

awaelchli added this to the 2.1 milestone Jun 4, 2023

awaelchli changed the title ~~Upgrade deepspeed version~~ WIP Upgrade deepspeed version Jun 4, 2023

awaelchli added 2 commits June 4, 2023 11:18

wiip

48ae45d

update test

043a2b1

awaelchli requested a review from justusschock as a code owner June 4, 2023 21:58

pre-commit-ci bot and others added 6 commits June 4, 2023 21:59

[pre-commit.ci] auto fixes from pre-commit.com hooks

20c9a2c

for more information, see https://pre-commit.ci

add comment

9d23cae

reset

f6d9f45

Merge branch 'ci/update-deepspeed' of github.com:Lightning-AI/lightni…

457b635

…ng into ci/update-deepspeed

reset

751eb2c

reset

a55e423

awaelchli added the fun Staff contributions outside working hours - to differentiate from the "community" label label Jun 4, 2023

awaelchli changed the title ~~WIP Upgrade deepspeed version~~ Upgrade deepspeed version Jun 4, 2023

awaelchli commented Jun 4, 2023

.azure/gpu-tests-fabric.yml (outdated, resolved)

Borda approved these changes Jun 5, 2023

.azure/gpu-tests-fabric.yml (outdated, resolved)

.azure/gpu-tests-pytorch.yml (outdated, resolved)

requirements/fabric/strategies.txt (resolved)

Apply suggestions from code review

9dbee08

github-actions bot removed the ci Continuous Integration label Jun 5, 2023

justusschock approved these changes Jun 5, 2023


mergify bot added the ready PRs ready to be merged label Jun 5, 2023

Merge branch 'master' into ci/update-deepspeed

36b854d

awaelchli enabled auto-merge (squash) June 5, 2023 10:06

awaelchli merged commit 0eb8fdc into master Jun 5, 2023

awaelchli deleted the ci/update-deepspeed branch June 5, 2023 10:28

awaelchli mentioned this pull request Jun 11, 2023

DeepSpeed doesn't move tensors to GPU in deepspeed 0.9.3 and above #17806

Closed
