[bug] Resuming From Checkpoint for FP16 failure (Single GPU) #7535
Comments
Thanks @tchaton!
Hi, I find that if I comment out the manual setup call, your code example runs without error.
Thanks @awaelchli for the response. If I understand correctly, setup will be called after the distributed connection is initialized, and loading states from the resume checkpoint path happens afterwards. For context on this use case, cc @hudeven.
@awaelchli @tchaton thanks for looking into it! We have to init the model in setup("fit") because:
during loading from checkpoint, the model object must already exist before the weights can be loaded, so we call setup("fit") to create the model in on_load_checkpoint().
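As a minimal sketch of that constraint (the module and layer sizes here are made up, not taken from the issue): if the network is only built in setup("fit"), then restoring checkpoint weights before setup() has run finds no submodules to load into, which is why the manual setup("fit") call in on_load_checkpoint() was added.

import torch
from torch import nn
import pytorch_lightning as pl


class DeferredBuildModule(pl.LightningModule):
    """Builds its network in setup("fit") instead of __init__."""

    def __init__(self, lr: float = 1e-3):
        super().__init__()
        self.lr = lr
        self.model = None  # created later, in setup("fit")

    def setup(self, stage=None):
        if stage in (None, "fit") and self.model is None:
            self.model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)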
@hudeven, I think instantiating the model in
Sorry, just edited my comment: "During loading from checkpoint, it requires the model object to be created before loading weights. So we call setup("fit") to create the model in on_load_checkpoint()."
In our trainer when calling fit, the sequence is the following:
If you rebuild your model in setup("fit"), the restored weights get erased. One solution from our side could be to move restore_weights() to directly after step 2. I'm hesitating, however, because I'm not sure yet whether this could have undesired side effects. Any thoughts?
I see, I missed the part where step 5 is called (model rebuilt): steps 3 and 4 need to be done again, so basically the optimizers are now not bound to the rebuilt model. This is different from the test/validate/predict scenario of loading from a checkpoint, since there we only need to load the module, not the other training states, and the order is: load/restore LightningModule -> init DDP process -> setup -> .... In those cases, to avoid a rebuilt model erasing the restored weights in setup, the user has to explicitly skip the setup for test/validate/predict mode.
For this, I guess we could move only the logic that restores the LightningModule; restoring the other states has to be done after step 4. It sounds like different types of state (module state, optimizer state, and other trainer state) might need to be loaded at different phases. It might not be ideal, but it is possible to run setup again after the LightningModule has restored its weights; we could call setup again before loading the rest of the pieces.
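As a rough sketch of the "different states at different phases" idea (not Lightning's actual checkpoint connector; the helper names are made up): a Lightning checkpoint stores the module weights under "state_dict" alongside "optimizer_states" and "lr_schedulers", so the module restore and the trainer-state restore could in principle read the same file at different points in the flow.

from typing import Any, Dict, List

import torch


def restore_module_weights(module: torch.nn.Module, checkpoint: Dict[str, Any]) -> None:
    # phase 1: only the LightningModule weights; possible right after setup()
    module.load_state_dict(checkpoint["state_dict"])


def restore_training_state(optimizers: List[torch.optim.Optimizer],
                           lr_schedulers: List[Any],
                           checkpoint: Dict[str, Any]) -> None:
    # phase 2: optimizer / scheduler state; only possible once those objects exist
    for optimizer, opt_state in zip(optimizers, checkpoint["optimizer_states"]):
        optimizer.load_state_dict(opt_state)
    for scheduler, sched_state in zip(lr_schedulers, checkpoint["lr_schedulers"]):
        scheduler.load_state_dict(sched_state)

# checkpoint = torch.load("path/to/last.ckpt", map_location="cpu")  # hypothetical path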
To simulate that, I tried the following workaround and it seems to work:

def on_load_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
    self.setup("fit")
    self.trainer.accelerator.setup(self.trainer, self)

Can you check if that works for you too?
Thanks @awaelchli, will check.
It works for me as well, but I think we also need to add
As we are thinking of not exposing the trainer to the LightningModule, I wonder if this is a good approach. cc @ananthsub. Thinking about whether some fix/modification on the trainer side makes more sense.
Thanks for testing this! No, by all standards this is not a good approach for sure. This workaround will hopefully unblock you, but it means we need to discuss if and how we want to move the restore call to an earlier point in the fit call. This is complicated, because apparently some changes were recently introduced that now let the training type plugin call the model hooks for restoring ... so care needs to be taken when splitting this up.
What Lightning version are you using here, master?
Yes, master.
Hi again. So let me go back to @hudeven's comments once again:
I think there can be two reasons why you get the pickle error:
But besides that, building the model and optimizer in
Is this the reason why you call
Thanks @awaelchli for the reply. For
we actually use DDP internally; the example code I showed uses the single-device training type.
Not 100% certain about the reason for DDP (I could ask @hudeven for more details), but this raises the question of where we should restore from checkpoint. Currently we do it after everything is set up, but there is a use case where we would like to restore model weights directly after setup and then call configure_sharded_model; this is more like something controlled by the specific plugin. For example, for the ongoing FSDP work we would like to load model states into the unwrapped model, which is only later wrapped in FSDP at the configure-sharded-model stage. The current flow of restoring does not allow us to do this. The solution we ended up with is to override
actually does not work for FSDP. The fix we ended up with is overriding this for the LightningModule
I just wonder two things:
cc @ananthsub
Confirmed with @hudeven that another use case needs this setup("fit") for
and
I'm working on #7652 to enable that.
No, I don't think so. The plugins already have many responsibilities. I believe we should aim to find a way to restore model and trainer state that fits all plugins well.
Wanted to give an update. In #7652 I'm loading the model weights in this order:

model.setup("fit")  # trainer calls setup hook
# model weights get restored as soon as the model is set up
restore_model()  # also calls model.on_load_checkpoint()
call_configure_sharded_model(model)
accelerator.setup(model)
restore_training_state()  # restore optimizer, precision, loop progress etc.

So the weights are restored after the setup hook is called, but before the accelerator setup. Does that make sense? Q: should
Thanks @awaelchli for the update. Loading model states before accelerator setup makes sense.
For this, I think it might need to be postponed to after pre-dispatch (in the case where the optimizer is set up in the pre-dispatch stage).
I think this is dependent on the TrainingTypePlugin (whether the model is instantiated before configure_sharded_model or in configure_sharded_model). Though all current use cases instantiate it before, and we would like to restore states before, I do see a need to instantiate the model in configure_sharded_model. I am thinking of the following flow:
wdyt?
Thanks!!!
Good catch! Yes, then it should restore after pre_dispatch!
Yes, in theory we can have the training plugin decide to restore before or after (nice idea!). We would however sacrifice a consistent hook call order (#7740), so it depends on whether we are OK with making an exception here. Well, there would be no way around it if we want to allow shifting the layer instantiation.
Yeah, that is the tricky part.
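To make the pre_dispatch point concrete with plain PyTorch (a sketch only, not Lightning internals): the native AMP scaler state can only be restored once the scaler exists, i.e. after the precision/accelerator setup that creates it; restoring earlier and then recreating the scaler would silently discard the loaded scale.

import torch

# pretend this entry came out of a Lightning checkpoint
checkpoint = {"native_amp_scaling_state": torch.cuda.amp.GradScaler().state_dict()}

# the scaler is created during precision/accelerator setup ...
scaler = torch.cuda.amp.GradScaler()
# ... so its state has to be restored after that point
scaler.load_state_dict(checkpoint["native_amp_scaling_state"])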
Hi @awaelchli, we are unblocked by the workaround below. It supports resuming from checkpoint for both DDP and FSDP. However, it's too hacky; we hope there will be an official solution in Lightning. cc: @ananthsub

from typing import Any, Dict

import torch
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities import AMPType, DeviceType
# from apex import amp  # only needed for the APEX AMP branches below


class MyTask(CheckpointMixin, LightningModule):
    def setup(self, stage: str):
        if stage == "test":
            return
        # reset the call_configure_sharded_model_hook attribute so that we can configure the model
        self.call_configure_sharded_model_hook = False


class CheckpointMixin(object):
    """Mixin to enable resuming from checkpoint.

    Currently, resuming from a checkpoint requires a hack in `on_pretrain_routine_end`.
    TODO: @stevenliu remove this class after the official fix has landed in Lightning

    Usage:
        MyTask(CheckpointMixin, LightningModule):
            ...

    Note: CheckpointMixin must be added ahead of LightningModule. For FSDP, it's
    required to add the attribute `enable_configure_sharded_model` to the Task and set it to True.
    """

    def on_pretrain_routine_end(self):
        if self.trainer is None:
            return
        if self.trainer.resume_from_checkpoint is None:
            return
        # As the optimizer states, lr schedulers, and amp states were already restored,
        # store them temporarily before reconnecting and load them again afterwards.
        restored_checkpoint = self._get_restored_trainer_states()
        # Reconnect the model, configure it, and pre-dispatch.
        self.trainer.accelerator.connect(self)
        self.trainer.accelerator.setup_environment()
        if getattr(self, "enable_configure_sharded_model", False):
            self.trainer._call_configure_sharded_model(self)
        self.trainer.optimizers = []
        self.trainer.accelerator.setup(self.trainer, self)
        self.trainer.accelerator.pre_dispatch(self.trainer)
        # Restore the optimizers, lr schedulers, and amp states for the re-connected and configured model
        self._restore_trainer_states(restored_checkpoint)

    def _get_restored_trainer_states(self) -> Dict[str, Any]:
        restored_checkpoint = {}
        # dump optimizer states
        optimizer_states = []
        for optimizer in self.trainer.optimizers:
            # rely on the accelerator to dump the optimizer state
            optimizer_state = self.trainer.accelerator.optimizer_state(optimizer)
            optimizer_states.append(optimizer_state)
        restored_checkpoint["optimizer_states"] = optimizer_states
        # dump lr schedulers
        lr_schedulers = []
        for scheduler in self.trainer.lr_schedulers:
            lr_schedulers.append(scheduler["scheduler"].state_dict())
        restored_checkpoint["lr_schedulers"] = lr_schedulers
        # dump amp scaling state
        if (
            self.trainer.amp_backend == AMPType.NATIVE
            and self.trainer._device_type != DeviceType.TPU
            and self.trainer.scaler is not None
        ):
            restored_checkpoint["native_amp_scaling_state"] = self.trainer.scaler.state_dict()
        elif self.trainer.amp_backend == AMPType.APEX:
            restored_checkpoint["amp_scaling_state"] = amp.state_dict()
        return restored_checkpoint

    def _restore_trainer_states(self, checkpoint: Dict[str, Any]) -> None:
        # restore the optimizers
        optimizer_states = checkpoint["optimizer_states"]
        for optimizer, opt_state in zip(self.trainer.optimizers, optimizer_states):
            optimizer.load_state_dict(opt_state)
            # move the optimizer state to GPU one tensor at a time to avoid OOM
            if self.trainer.root_gpu is not None:
                for state in optimizer.state.values():
                    for k, v in state.items():
                        if isinstance(v, torch.Tensor):
                            state[k] = v.cuda(self.trainer.root_gpu)
        # restore the lr schedulers
        lr_schedulers = checkpoint["lr_schedulers"]
        for scheduler, lrs_state in zip(self.trainer.lr_schedulers, lr_schedulers):
            scheduler["scheduler"].load_state_dict(lrs_state)
        # restore amp scaling state
        if (
            self.trainer.amp_backend == AMPType.NATIVE
            and "native_amp_scaling_state" in checkpoint
        ):
            self.trainer.scaler.load_state_dict(checkpoint["native_amp_scaling_state"])
        elif (
            self.trainer.amp_backend == AMPType.APEX
            and "amp_scaling_state" in checkpoint
        ):
            amp.load_state_dict(checkpoint["amp_scaling_state"])
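For reference, a hedged usage sketch for the workaround above (the task, data module, and checkpoint path are placeholders; resume_from_checkpoint is the Trainer argument the mixin checks in on_pretrain_routine_end):

import pytorch_lightning as pl

task = MyTask()  # LightningModule with CheckpointMixin mixed in, as defined above
datamodule = MyDataModule()  # placeholder LightningDataModule
trainer = pl.Trainer(
    gpus=1,
    precision=16,
    resume_from_checkpoint="path/to/last.ckpt",  # placeholder checkpoint path
)
trainer.fit(task, datamodule=datamodule)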
@hudeven I implemented the changes in #7652. In summary, the restoring will happen like this in Trainer.fit:

model.setup("fit")
# restore model weights
checkpoint_connector.restore_model()
model.configure_sharded_model()
...
accelerator.setup()
...
pre_dispatch()
# restore optimizers, loop, etc.
checkpoint_connector.restore_trainer_state()
Thanks @awaelchli |
I am running into an issue with the model weights being restored before the call to configure_sharded_model. In my case, I don't set up the modules in __init__ and only do the setup in configure_sharded_model, so when the code tries to load the state dict, it loads it into a non-existent model. What is the best way to bypass this? Basically, I need the model restore to be called after configure_sharded_model is called. Is it better/okay to define the modules in __init__ and only wrap them with checkpoint_wrapper / auto_wrap in configure_sharded_model?
@dave-epstein the model can be built in the setup hook. The order is the following:
This way the weights can be loaded before the model gets wrapped. Does this help?
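A sketch of that pattern (assuming the fairscale-backed fully sharded plugin; auto_wrap is fairscale's wrapper and only takes effect when configure_sharded_model runs inside the plugin's wrap context, so treat the exact imports and sizes as assumptions):

import torch
from torch import nn
import pytorch_lightning as pl
from fairscale.nn import auto_wrap  # assumes fairscale is installed


class ShardedTask(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.block = None

    def setup(self, stage=None):
        if self.block is None:
            # plain, unwrapped modules: restored checkpoint weights load into these
            self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))

    def configure_sharded_model(self):
        # wrapping happens only after the weights were restored
        self.block = auto_wrap(self.block)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.block(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)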
Yeah, I just saw that the documentation shows this use case as well. It seems to work.
🐛 Bug
Please reproduce using the BoringModel
Set up training
Resume from checkpoint
Breaks at the first training step:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/plugins/precision/native_amp.py#L96
complains about
Expected behavior
Expected to resume training.
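For reference, a minimal repro sketch along the lines of the report above (BoringModel-style; the model, paths, and hyperparameters are placeholders, and a CUDA device is assumed for precision=16):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = None  # built in setup("fit"), mirroring the use case in this thread

    def setup(self, stage=None):
        if self.layer is None:
            self.layer = nn.Linear(32, 2)

    def on_load_checkpoint(self, checkpoint):
        # the manual setup call discussed above, so the weights have a module to load into
        self.setup("fit")

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


loader = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)

# set up training with native AMP on a single GPU and save a checkpoint
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1, default_root_dir="ckpts")
trainer.fit(TinyModel(), loader)
trainer.save_checkpoint("ckpts/last.ckpt")

# resume from the checkpoint; the failure was reported at the first training step
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=2,
                     resume_from_checkpoint="ckpts/last.ckpt")
trainer.fit(TinyModel(), loader)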
Environment
Additional context