
Failure when running on cluster #1509

Open
nicolas-dufour opened this issue Apr 11, 2023 · 5 comments

@nicolas-dufour

When running on a cluster with 2 nodes of 4 GPUs each, I randomly run into this bug:

Traceback (most recent call last):
  File "<string>", line 21, in _softmax_backward
KeyError: ('2-.-0-.-0-460d6c1309cde60a1044fec0550efc5a-d6252949da17ceb5f3a278a70250af13-1af5134066c618146d2cd009138944a0-39a47c39a781214e791a745d670003ed-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, False, False), (True, True, True, (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 92, in launch
    return function(*args, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 559, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 935, in _run
    results = self._run_stage()
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 978, in _run_stage
    self.fit_loop.run()
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 201, in run
    self.advance()
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 218, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 185, in run
    self._optimizer_step(kwargs.get("batch_idx", 0), closure)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 261, in _optimizer_step
    call._call_lightning_module_hook(
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 142, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/core/module.py", line 1265, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/core/optimizer.py", line 158, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 257, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/strategies/strategy.py", line 224, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 114, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/gpfsdswork/projects/rech/ipk/uey53ph/diffusion/utils/optimizers.py", line 48, in step
    loss = closure()
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 101, in _wrap_closure
    closure_result = closure()
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__
    self._result = self.closure(*args, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 135, in closure
    self._backward_fn(step_output.closure_loss)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 233, in backward_fn
    call._call_strategy_hook(self.trainer, "backward", loss, optimizer)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 288, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/strategies/strategy.py", line 199, in backward
    self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 67, in backward
    model.backward(tensor, *args, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/core/module.p
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/lightning/pytorch/core/module.p
y", line 1054, in backward
y", line 1054, in backward
    loss.backward(*args, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
    torch.autograd.backward(
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
ne 274, in apply
    return user_fn(self, *args)
    return user_fn(self, *args)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 123, in decorate_bwd
    return bwd(*args, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/xformers/triton/softmax.py", li  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/xformers/triton/softmax.py", li
ne 111, in backward
ne 111, in backward
    _softmax_backward[grid_2d](
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/runtime/autotuner.py", l  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/runtime/autotuner.py", l
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/runtime/autotuner.py", l
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/runtime/autotuner.py", l
ine 77, in run
ine 77, in run
ine 77, in run
ine 77, in run
ine 77, in run
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/runtime/autotuner.py", l
ine 77, in run
ne 111, in backward
", line 123, in decorate_bwd
    return bwd(*args, **kwargs)
    return bwd(*args, **kwargs)
    return bwd(*args, **kwargs)
    return bwd(*args, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/xformers/triton/softmax.py", li
ne 111, in backward
    _softmax_backward[grid_2d](
    _softmax_backward[grid_2d](
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/runtime/autotuner.py", l
ine 77, in run
ine 77, in run
    timings = {config: self._bench(*args, config=config, **kwargs)
    timings = {config: self._bench(*args, config=config, **kwargs)
ine 77, in run
    _softmax_backward[grid_2d](
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py
", line 123, in decorate_bwd
    return bwd(*args, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/xformers/triton/softmax.py", li
ne 111, in backward
    _softmax_backward[grid_2d](
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/runtime/autotuner.py", l
ine 77, in run
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 77, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 65, in _bench
    return do_bench(kernel_call)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/testing.py", line 143, in do_bench
    fn()
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 63, in kernel_call
    self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
  File "<string>", line 41, in _softmax_backward
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/compiler.py", line 1631, in compile
    fn_cache_manager.put(json.dumps(metadata), f"{name}.json", binary=False)
  File "~/miniconda3/envs/diffusion/lib/python3.9/site-packages/triton/compiler.py", line 1344, in put
    os.rename(filepath + ".tmp", filepath)
FileNotFoundError: [Errno 2] No such file or directory: '~/.triton/cache/8399787016451e9f19dfb601de4344a6/_softmax_backward.json.tmp' -> '~/.triton/cache/8399787016451e9f19dfb601de4344a6/_softmax_backward.json'

The bug occurs somewhat randomly: with the exact same configuration, it sometimes appears and sometimes does not.

I'm using Python 3.9, PyTorch 2.0, and xformers 0.16.

@mitchellnw

Also running into something similar -- will document it in another issue.

@mitchellnw

Were you able to fix this issue? I'm still having a similar problem, linked above.

@ptillet
Collaborator

ptillet commented Apr 19, 2023

You can set TRITON_CACHE_DIR to control where kernels are cached, so that the cache lives in a directory the machine running the code actually has access to.
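
For example, something along these lines in the training script (a minimal sketch, not an official recipe; SLURM_PROCID and /tmp are assumptions about the launcher and a node-local path, so adapt them to your scheduler and filesystem):

```python
import os

# Hypothetical setup: give each rank its own node-local cache directory so that
# concurrent processes don't race on a shared (e.g. NFS/GPFS) ~/.triton/cache.
rank = os.environ.get("SLURM_PROCID", "0")
cache_dir = f"/tmp/triton_cache_rank_{rank}"
os.makedirs(cache_dir, exist_ok=True)
os.environ["TRITON_CACHE_DIR"] = cache_dir

# Import the model / xformers code only after this variable is set, so Triton
# picks up the per-rank cache when it first compiles a kernel.
```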

@devadigapratham

You can try clearing the cache by deleting the ~/.triton/cache directory, or just the specific cache directory mentioned in the error message (~/.triton/cache/8399787016451e9f19dfb601de4344a6, the one in the message you provided).
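
A quick sketch of that cleanup in Python (the path below is the default cache location from the traceback; if you set TRITON_CACHE_DIR, delete that directory instead):

```python
import os
import shutil

# Remove the stale Triton kernel cache so kernels are recompiled from scratch.
cache_dir = os.path.expanduser("~/.triton/cache")
shutil.rmtree(cache_dir, ignore_errors=True)
```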

@youkaichao
Contributor

I ran into a similar issue, too. I fixed it by creating a symlink to libcuda.so. Could you please try PR #1981 and see if it solves your problem? @nicolas-dufour
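
A rough sketch of that kind of symlink workaround (all paths are assumptions; the actual location of libcuda.so.1 depends on your driver installation, and PR #1981 may handle this differently):

```python
import os

# Hypothetical workaround: expose the driver's libcuda.so.1 under the name
# libcuda.so in a directory that is on the runtime library search path.
driver_lib = "/usr/lib64/libcuda.so.1"                # assumed driver library path
link_dir = os.path.expanduser("~/.local/lib")         # assumed writable location
os.makedirs(link_dir, exist_ok=True)
link_path = os.path.join(link_dir, "libcuda.so")
if not os.path.islink(link_path):
    os.symlink(driver_lib, link_path)
# Then add ~/.local/lib to LD_LIBRARY_PATH before launching the job.
```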
