
Getting the code in sync.. #17

Open
1 of 6 tasks
teang1995 opened this issue Mar 19, 2022 · 8 comments

Comments

@teang1995
Owner

teang1995 commented Mar 19, 2022

Go through a syncing pass so that training actually works.
Many permissions are locked down on the server, which makes testing difficult, so this will be done in a Docker environment.

TODO's

  • Finish the code so that training runs
  • Update the README
    • Update the file structure
    • Document the Dockerfile
    • Write, upload, and document a shell script that runs training in one step
  • Add anything else as it comes to mind
@teang1995
Owner Author

teang1995 commented Mar 19, 2022

```
root@cd74c2b719cf:/pytorch-autorec# python -m autorec.train
`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Error executing job with overrides: []
Traceback (most recent call last):
  File "/pytorch-autorec/autorec/train.py", line 77, in main
    trainer.fit(model=train_module,
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 1125, in _run
    self._callback_connector._attach_model_callbacks()
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/connectors/callback_connector.py", line 266, in _attach_model_callbacks
    model_callbacks = self.trainer.call_hook("configure_callbacks")
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 1486, in call_hook
    prev_fx_name = pl_module._current_fx_name
AttributeError: 'AutoRecModule' object has no attribute '_current_fx_name'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```

An error like the above occurs.. what could be the problem?

@teang1995
Owner Author

teang1995 commented Mar 19, 2022

Searching for `AttributeError: 'AutoRecModule' object has no attribute '_current_fx_name'` turned up an issue like link..
Silly me.. `AutoRecModule` was inheriting from `LightningDataModule`. Fixed by making it inherit from `LightningModule` instead.
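The mismatch can be illustrated with a minimal pure-Python sketch (the two Lightning base classes below are simplified stand-ins, not the real API): `Trainer.call_hook` reads `pl_module._current_fx_name`, an attribute that `LightningModule.__init__` sets but `LightningDataModule` does not, so a model inheriting from the wrong base fails exactly as in the traceback.

```python
# Simplified stand-ins: only the attribute relevant to this bug is modeled.
class LightningDataModule:
    pass                                  # no _current_fx_name here

class LightningModule:
    def __init__(self):
        self._current_fx_name = None      # the real base class sets this too

# Before the fix: class AutoRecModule(LightningDataModule) -> AttributeError
class AutoRecModule(LightningModule):     # after the fix
    def __init__(self):
        super().__init__()

module = AutoRecModule()
print(hasattr(module, "_current_fx_name"))  # True
```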

@teang1995
Owner Author

An error that occurred on a 3090:
`CUDA error: no kernel image is available for execution on the device`
I tried to fix it by following link, but then got
`ImportError: /usr/local/lib/python3.9/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZNK3c104Type14isSubtypeOfExtERKSt10shared_ptrIS0_EPSo`..

Resolved by matching the torchtext version with `pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchtext==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html`

@teang1995
Owner Author

Got the error `TypeError: setup() got an unexpected keyword argument 'stage'`.

@teang1995
Owner Author

Got the error `TypeError: setup() got an unexpected keyword argument 'stage'`.

Added the `stage` argument to the `setup` function!
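A minimal sketch of the fix (class name and method body are assumptions for illustration): Lightning invokes `setup(stage=...)` during `fit`, so the override must accept that keyword.

```python
from typing import Optional

class AutoRecDataModule:  # stand-in for the project's LightningDataModule subclass
    # Before the fix: def setup(self): ...  -> TypeError on the 'stage' kwarg.
    def setup(self, stage: Optional[str] = None):
        self.stage = stage  # e.g. "fit", "validate", "test"

dm = AutoRecDataModule()
dm.setup(stage="fit")   # effectively what the Trainer calls
print(dm.stage)         # fit
```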

@teang1995
Owner Author

```
Traceback (most recent call last):
  File "/pytorch-autorec/autorec/train.py", line 77, in main
    trainer.fit(model=train_module,
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 1145, in _run
    self.accelerator.setup(self)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/accelerators/gpu.py", line 46, in setup
    return super().setup(trainer)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 93, in setup
    self.setup_optimizers(trainer)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 354, in setup_optimizers
    optimizers, lr_schedulers, optimizer_frequencies = self.training_type_plugin.init_optimizers(
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 245, in init_optimizers
    return trainer.init_optimizers(model)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/optimizers.py", line 35, in init_optimizers
    optim_conf = self.call_hook("configure_optimizers", pl_module=pl_module)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
    output = model_fx(*args, **kwargs)
  File "/pytorch-autorec/autorec/model/autorec_module.py", line 24, in configure_optimizers
    raise torch.optim.Rprop(self.net.parameters(), lr=self.init_lr)
TypeError: exceptions must derive from BaseException
```

The error above occurred.

@teang1995
Owner Author

Fixed by changing
`raise torch.optim.Rprop(self.net.parameters(), lr=self.init_lr)` to
`return torch.optim.Rprop(self.net.parameters(), lr=self.init_lr)`.. why did I write `raise`?
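The `TypeError` is plain Python behavior, nothing Lightning-specific: `raise` requires an exception instance, and an optimizer is not one. A hedged sketch with a stand-in optimizer class (the real `torch.optim.Rprop` is not imported here):

```python
class Rprop:  # stand-in for torch.optim.Rprop, just to show the raise/return issue
    def __init__(self, params, lr):
        self.lr = lr

def configure_optimizers_buggy():
    # Raising a non-exception object triggers the TypeError from the traceback.
    raise Rprop([], lr=0.01)

def configure_optimizers_fixed():
    return Rprop([], lr=0.01)

try:
    configure_optimizers_buggy()
except TypeError as err:
    print(err)                      # exceptions must derive from BaseException

opt = configure_optimizers_fixed()  # the optimizer is returned, as Lightning expects
```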

@teang1995
Owner Author

teang1995 commented Mar 19, 2022

Anyway, it runs now. But convergence is far too slow.
While working through the TODO list in the issue description, let's look for ways to speed up convergence.
Two things are suspect: Rprop itself converging too slowly, and the loss being computed incorrectly.
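For the second suspicion, one thing worth checking is whether the loss is averaged only over observed ratings: AutoRec's objective masks out unobserved entries of the rating vector. A pure-Python sketch of such a masked loss (treating 0 as the placeholder for a missing rating is an assumption; adjust to the actual encoding):

```python
def masked_mse(pred, target):
    # Keep only pairs where the target rating is observed (non-zero placeholder).
    observed = [(p, t) for p, t in zip(pred, target) if t != 0]
    return sum((p - t) ** 2 for p, t in observed) / len(observed)

pred   = [4.1, 2.0, 3.5, 0.7]
target = [4.0, 0.0, 3.0, 0.0]    # two observed ratings, two missing
print(masked_mse(pred, target))  # averages over the 2 observed entries only
```

If the current loss averages over every entry of the reconstructed vector, the gradient signal gets diluted by the mostly-missing entries, which could also look like slow convergence.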
