DeepSpeed Stage 2 Tensors on Different Devices #9521

Closed
kelvins64 opened this issue Sep 14, 2021 · 8 comments · Fixed by #9847
@kelvins64
kelvins64 commented Sep 14, 2021

🐛 Bug

Attempting to run Trainer.fit on a GPU other than cuda:0 with the DeepSpeed ZeRO Stage 2 plugin results in RuntimeError: Expected all tensors to be on the same device, but found at least two devices.

To Reproduce

import os
from typing import Union

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
import argparse

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

# Start new code
def run(str_args: Union[str, None] = None):
    parser = argparse.ArgumentParser()
    parser = Trainer.add_argparse_args(parser)

    args = parser.parse_args() if str_args is None else parser.parse_args(str_args.split())
# End new code

    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
# Start new code
    trainer = Trainer.from_argparse_args(
        args,
        plugins='deepspeed_stage_2',
# End new code
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)

if __name__ == "__main__":
    run('--gpus 1,') # New code

The error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking arugment for argument mat1 in method wrapper_addmm)
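
For reference, the device mismatch described in that traceback can be reproduced with plain PyTorch, independent of Lightning or DeepSpeed; this minimal sketch (not part of the original report) triggers the same error when the input batch and the layer's weights live on different GPUs:

import torch
import torch.nn as nn

if torch.cuda.device_count() >= 2:
    layer = nn.Linear(32, 2).to("cuda:0")          # parameters stay on cuda:0
    batch = torch.randn(2, 32, device="cuda:1")    # input placed on cuda:1
    layer(batch)  # RuntimeError: Expected all tensors to be on the same device ...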

Environment

  • PyTorch Lightning Version (e.g., 1.3.0): 1.4.6
  • PyTorch Version (e.g., 1.8): 1.9.0
  • Python version: 3.9.6
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: NVIDIA Tesla V100
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

@kelvins64 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Sep 14, 2021
@tchaton
Contributor

tchaton commented Sep 14, 2021

Hey @kelvins64,

Thanks for sharing a script; I confirm I can reproduce this bug on master.

Best,
T.C

@SeanNaren
Contributor

Looking into the DeepSpeed engine, I noticed there is an assumption regarding the local rank: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L596-L604

The assumption is that the GPU index is the same as the process's local rank on the machine (i.e. on a 4-GPU machine, each process's local rank, 0 to 3, matches the GPU index). This doesn't hold if you select specific GPU IDs, as in this script.

A solution is to introduce a gpu_rank argument into the DeepSpeed args that decides which device ID to set the device to, defaulting to the LOCAL_RANK when not specified. I'll set up a PR in DeepSpeed now to see what the authors think of this solution. I've verified locally that it works!
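
To make the mismatch concrete, here is a minimal sketch (not DeepSpeed's actual code) of the two behaviours; requested_gpu_ids stands in for whatever device IDs the user selected (e.g. --gpus 1,), and local_rank for the value DeepSpeed reads from the LOCAL_RANK environment variable:

import os
import torch


def device_assuming_local_rank(local_rank):
    # Current assumption: the CUDA device index equals the process's local rank.
    return torch.device("cuda", local_rank)


def device_with_explicit_mapping(local_rank, requested_gpu_ids):
    # Proposed behaviour: map the local rank onto the GPU ID the user actually
    # asked for, falling back to the local rank when no explicit IDs are given.
    if requested_gpu_ids:
        return torch.device("cuda", requested_gpu_ids[local_rank])
    return torch.device("cuda", local_rank)


if __name__ == "__main__":
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    # With --gpus 1, there is a single process (local rank 0) but the model
    # lives on cuda:1, so the two strategies disagree:
    print(device_assuming_local_rank(local_rank))            # cuda:0
    print(device_with_explicit_mapping(local_rank, [1]))     # cuda:1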

@SeanNaren
Contributor

The associated DeepSpeed PR has been merged; once a release is out, we can include this fix in Lightning!

@Hecim1984

Good

@gurvindersingh

@SeanNaren any update on this?

@SeanNaren
Contributor

Still waiting on DeepSpeed to make a release; I'll ping them to see if we can get this done sooner! cc @jeffra

@jeffra

jeffra commented Oct 6, 2021

@SeanNaren v0.5.4 is now released to PyPI: https://pypi.org/project/deepspeed/0.5.4/. This should include the PR in question :)

@SeanNaren
Contributor

Thanks, everyone! This should now be fixed on Lightning master with the latest DeepSpeed version (pip install deepspeed -U).
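
If you want to guard against silently running an older DeepSpeed, a small optional check (not something Lightning requires) could look like this, assuming the fix landed in the 0.5.4 release mentioned above:

from packaging.version import Version

import deepspeed

if Version(deepspeed.__version__) < Version("0.5.4"):
    raise RuntimeError(
        f"deepspeed {deepspeed.__version__} found; run `pip install -U deepspeed` "
        "to pick up the device-ID fix."
    )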
