
[BUG] False positive error using Conv1d on MPS: "Output channels > 65536 not supported at the MPS device." #505

Closed
sdatkinson opened this issue Nov 24, 2024 · 3 comments · Fixed by #506
Labels: bug (Something isn't working), priority:low (Low-priority issues)

sdatkinson (Owner) commented:

Describe the bug
Training locally on my MBP (macOS 15.1), I see the following error:

  | Name | Type    | Params | Mode 
-----------------------------------------
0 | _net | WaveNet | 13.8 K | train
-----------------------------------------
13.8 K    Trainable params
0         Non-trainable params
13.8 K    Total params
0.055     Total estimated model params size (MB)
111       Modules in train mode
0         Modules in eval mode
Sanity Checking DataLoader 0:   0%|                       | 0/1 [00:00<?, ?it/s]Exception in Tkinter callback
Traceback (most recent call last):
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/tkinter/__init__.py", line 1921, in __call__
    return self.func(*args)
  File "/Users/steve/src/neural-amp-modeler/nam/train/gui/__init__.py", line 684, in _train
    self._train2()
  File "/Users/steve/src/neural-amp-modeler/nam/train/gui/__init__.py", line 704, in _train2
    train_output = core.train(
  File "/Users/steve/src/neural-amp-modeler/nam/train/core.py", line 1447, in train
    trainer.fit(model, train_dataloader, val_dataloader)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1023, in _run_stage
    self._run_sanity_check()
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1052, in _run_sanity_check
    val_loop.run()
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 178, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 319, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 411, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/Users/steve/src/neural-amp-modeler/nam/models/base.py", line 311, in validation_step
    preds, targets, loss_dict = self._shared_step(batch)
  File "/Users/steve/src/neural-amp-modeler/nam/models/base.py", line 254, in _shared_step
    preds = self(*args, pad_start=False)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/steve/src/neural-amp-modeler/nam/models/base.py", line 234, in forward
    return self.net(*args, **kwargs)  # TODO deprecate--use self.net() instead.
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/steve/src/neural-amp-modeler/nam/models/_base.py", line 182, in forward
    y = self._forward(x, **kwargs)
  File "/Users/steve/src/neural-amp-modeler/nam/models/wavenet.py", line 434, in _forward
    y = self._net(x)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/steve/src/neural-amp-modeler/nam/models/wavenet.py", line 336, in forward
    head_input, y = layer(y, x, head_input=head_input)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/steve/src/neural-amp-modeler/nam/models/wavenet.py", line 220, in forward
    x = self._rechannel(x)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 375, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/Users/steve/opt/anaconda3/envs/nam/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 370, in _conv_forward
    return F.conv1d(
NotImplementedError: Output channels > 65536 not supported at the MPS device. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

Tracing it down, I'm not using more than 65536 output channels, so while there is a known upstream bug (pytorch/pytorch#129207), this NotImplementedError seems to be reaching too far.
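The error message itself names a temporary workaround. A minimal sketch of applying it (the variable name comes straight from the error text; it must be set before `import torch` for PyTorch to honor it):

```python
import os

# Enable the CPU fallback for ops the MPS backend does not support.
# Per the error message, this must be in the environment before torch
# is imported; setting it here, before any `import torch`, is safe.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
```

As the warning says, the fallback runs the op on the CPU, so expect it to be slower than running natively on MPS.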

To Reproduce
Steps to reproduce the behavior:

  1. Install via environment_cpu.yml.
  2. Verify that PyTorch v2.5.0 or v2.5.1 is installed.
  3. Run nam, pick files, and start training.
  4. The error above appears.

Screenshots
N/A

Desktop (please complete the following information):

  • OS: macOS 15.1. This may be an issue introduced in 15.1, but I'm not sure.
  • Local training; Colab likely not affected because this is MPS-related.
  • Package version: commit 0670778

Additional context
Rolling PyTorch back to v2.4.1 (<2.5.0) appears to resolve the issue. It has also been reported in the Facebook group.
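Since the rollback is the current mitigation, a launcher could warn users before training on an affected install. A hypothetical helper (the function name and the version cutoff are my assumptions from this thread, where 2.4.1 works and 2.5.0/2.5.1 do not):

```python
def mps_conv1d_length_bug_suspected(torch_version: str) -> bool:
    """Heuristic, not an official API: flag torch releases in the 2.5
    series, which this thread reports as affected on MPS. Version
    strings may carry a local suffix like '2.5.1+cpu', so strip it."""
    base = torch_version.split("+")[0]
    major, minor = (int(part) for part in base.split(".")[:2])
    return (major, minor) == (2, 5)
```

In practice you would call this with `torch.__version__`. Flagging only the 2.5 series assumes the upstream fix ships in the following release, which this thread does not confirm.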

sdatkinson added the bug and priority:low labels and self-assigned this issue on Nov 24, 2024.
sdatkinson (Owner, Author) commented:

It seems that the error message is phrased incorrectly: it actually refers to the sequence length, not the number of output channels.

So another workaround would be to process the input in chunks, but that sounds dreadful, and I haven't checked the speed yet. Possibly a try/catch? I can do that.
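For the chunking idea, here is a rough sketch of what it could look like for a stride-1, unpadded Conv1d (the `chunked_conv1d` helper is hypothetical, not NAM code): process the input in windows that each carry enough extra context samples that the per-window outputs tile into exactly the full-input result.

```python
import torch
from torch import nn


def chunked_conv1d(conv: nn.Conv1d, x: torch.Tensor, chunk: int = 65536) -> torch.Tensor:
    """Hypothetical helper: apply a stride-1, padding-0 Conv1d to `x` in
    windows along the time axis, matching conv(x) exactly. Each window
    includes `overlap` extra input samples (the receptive field minus
    one) so consecutive outputs line up with no seams."""
    overlap = conv.dilation[0] * (conv.kernel_size[0] - 1)
    out_len = x.shape[-1] - overlap  # length conv(x) would produce
    outs = []
    for start in range(0, out_len, chunk):
        end = min(start + chunk, out_len)
        outs.append(conv(x[..., start : end + overlap]))
    return torch.cat(outs, dim=-1)
```

Correctness depends on stride 1 and padding 0 (how causal convs are typically applied after manual padding). A try/except that retries the op on the CPU would be simpler to write, but pays a device round trip on every failing call.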

hvaara commented on Dec 10, 2024:

but this NotImplementedError seems to be reaching too far.

I agree. The guard is checking all sizes, not just out_channels. On macOS >= 15.1 it's been fixed via pytorch/pytorch#140726, but on <= 15.0 it's still a live bug. I can open an issue upstream and propose a fix. I'd love it if you'd be willing to test the PR to confirm it resolves the issue for you.
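In other words, the upstream guard rejects any dimension over 65536 rather than just out_channels. A toy illustration of the distinction (pure Python for clarity; the real check lives in PyTorch's MPS backend, not here):

```python
MPS_LIMIT = 65536  # limit quoted in the error message


def overbroad_guard(out_channels: int, seq_len: int) -> bool:
    """Mimics the buggy behavior: trips on *any* large size,
    including the sequence length of a long audio buffer."""
    return out_channels > MPS_LIMIT or seq_len > MPS_LIMIT


def intended_guard(out_channels: int, seq_len: int) -> bool:
    """What the error message claims to check: out_channels only."""
    return out_channels > MPS_LIMIT
```

A NAM-sized conv has few channels but sees hundreds of thousands of samples per clip, so the overbroad check fires while the intended one would not.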

By the way, you're more than welcome to open issues upstream when you detect weirdness or bugs. I only came across a reference to this issue by chance.

sdatkinson (Owner, Author) commented:

@hvaara thanks for your work upstream on this. I'd love to help and appreciate the invitation, but unfortunately I can't guarantee I'll have the bandwidth.
