
Getting the code in sync.. #17

Open
1 of 6 tasks
teang1995 opened this issue Mar 19, 2022 · 8 comments

Comments

@teang1995
Owner

teang1995 commented Mar 19, 2022

Go through a syncing pass so that training actually works.
Many permissions are locked down on the server, which makes testing difficult, so this will be done in a Docker environment.

TODO's

  • Finish the code so that training runs
  • Update the README
    • Update the file structure
    • Document the Dockerfile
    • Write, upload, and document a shell script that runs training in one step
  • Add anything else as it comes to mind
@teang1995
Owner Author

teang1995 commented Mar 19, 2022

```
root@cd74c2b719cf:/pytorch-autorec# python -m autorec.train
`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
Error executing job with overrides: []
Traceback (most recent call last):
  File "/pytorch-autorec/autorec/train.py", line 77, in main
    trainer.fit(model=train_module,
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 1125, in _run
    self._callback_connector._attach_model_callbacks()
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/connectors/callback_connector.py", line 266, in _attach_model_callbacks
    model_callbacks = self.trainer.call_hook("configure_callbacks")
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 1486, in call_hook
    prev_fx_name = pl_module._current_fx_name
AttributeError: 'AutoRecModule' object has no attribute '_current_fx_name'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```

An error like the above occurs.. what could be the problem?

@teang1995
Owner Author

teang1995 commented Mar 19, 2022

Searching for `AttributeError: 'AutoRecModule' object has no attribute '_current_fx_name'` turned up an issue like link..
Silly me.. `AutoRecModule` was inheriting from `LightningDataModule`. Fixed by making it inherit from `LightningModule` instead.
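The mismatch can be illustrated with a minimal pure-Python sketch (the two Lightning base classes below are simplified stand-ins, not the real API): `Trainer.call_hook` reads `pl_module._current_fx_name`, an attribute that `LightningModule.__init__` sets but `LightningDataModule` does not, so a model inheriting from the wrong base fails exactly as in the traceback.

```python
# Simplified stand-ins: only the attribute relevant to this bug is modeled.
class LightningDataModule:
    pass                                  # no _current_fx_name here

class LightningModule:
    def __init__(self):
        self._current_fx_name = None      # the real base class sets this too

# Before the fix: class AutoRecModule(LightningDataModule) -> AttributeError
class AutoRecModule(LightningModule):     # after the fix
    def __init__(self):
        super().__init__()

module = AutoRecModule()
print(hasattr(module, "_current_fx_name"))  # True
```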

@teang1995
Owner Author

An error that occurred on a 3090:
`CUDA error: no kernel image is available for execution on the device`
I tried to fix it by following link, but then got
`ImportError: /usr/local/lib/python3.9/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZNK3c104Type14isSubtypeOfExtERKSt10shared_ptrIS0_EPSo`..

Resolved by matching the torchtext version with `pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchtext==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html`

@teang1995
Owner Author

Got the error `TypeError: setup() got an unexpected keyword argument 'stage'`.

@teang1995
Owner Author

Got the error `TypeError: setup() got an unexpected keyword argument 'stage'`.

Added the `stage` argument to the `setup` function!
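A minimal sketch of the fix (class name and method body are assumptions for illustration): Lightning invokes `setup(stage=...)` during `fit`, so the override must accept that keyword.

```python
from typing import Optional

class AutoRecDataModule:  # stand-in for the project's LightningDataModule subclass
    # Before the fix: def setup(self): ...  -> TypeError on the 'stage' kwarg.
    def setup(self, stage: Optional[str] = None):
        self.stage = stage  # e.g. "fit", "validate", "test"

dm = AutoRecDataModule()
dm.setup(stage="fit")   # effectively what the Trainer calls
print(dm.stage)         # fit
```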

@teang1995
Owner Author

```
Traceback (most recent call last):
  File "/pytorch-autorec/autorec/train.py", line 77, in main
    trainer.fit(model=train_module,
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 1145, in _run
    self.accelerator.setup(self)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/accelerators/gpu.py", line 46, in setup
    return super().setup(trainer)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 93, in setup
    self.setup_optimizers(trainer)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 354, in setup_optimizers
    optimizers, lr_schedulers, optimizer_frequencies = self.training_type_plugin.init_optimizers(
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 245, in init_optimizers
    return trainer.init_optimizers(model)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/optimizers.py", line 35, in init_optimizers
    optim_conf = self.call_hook("configure_optimizers", pl_module=pl_module)
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
    output = model_fx(*args, **kwargs)
  File "/pytorch-autorec/autorec/model/autorec_module.py", line 24, in configure_optimizers
    raise torch.optim.Rprop(self.net.parameters(), lr=self.init_lr)
TypeError: exceptions must derive from BaseException
```

The error above occurred.

@teang1995
Owner Author

Fixed by changing
`raise torch.optim.Rprop(self.net.parameters(), lr=self.init_lr)` to
`return torch.optim.Rprop(self.net.parameters(), lr=self.init_lr)`.. why did I write `raise`?
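The `TypeError` is plain Python behavior, nothing Lightning-specific: `raise` requires an exception instance, and an optimizer is not one. A hedged sketch with a stand-in optimizer class (the real `torch.optim.Rprop` is not imported here):

```python
class Rprop:  # stand-in for torch.optim.Rprop, just to show the raise/return issue
    def __init__(self, params, lr):
        self.lr = lr

def configure_optimizers_buggy():
    # Raising a non-exception object triggers the TypeError from the traceback.
    raise Rprop([], lr=0.01)

def configure_optimizers_fixed():
    return Rprop([], lr=0.01)

try:
    configure_optimizers_buggy()
except TypeError as err:
    print(err)                      # exceptions must derive from BaseException

opt = configure_optimizers_fixed()  # the optimizer is returned, as Lightning expects
```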

@teang1995
Owner Author

teang1995 commented Mar 19, 2022

Anyway, it runs now. But convergence is far too slow.
While working through the TODO list in the issue description, let's look for ways to speed up convergence.
Two things are suspect: Rprop itself converging too slowly, and the loss being computed incorrectly.
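For the second suspicion, one thing worth checking is whether the loss is averaged only over observed ratings: AutoRec's objective masks out unobserved entries of the rating vector. A pure-Python sketch of such a masked loss (treating 0 as the placeholder for a missing rating is an assumption; adjust to the actual encoding):

```python
def masked_mse(pred, target):
    # Keep only pairs where the target rating is observed (non-zero placeholder).
    observed = [(p, t) for p, t in zip(pred, target) if t != 0]
    return sum((p - t) ** 2 for p, t in observed) / len(observed)

pred   = [4.1, 2.0, 3.5, 0.7]
target = [4.0, 0.0, 3.0, 0.0]    # two observed ratings, two missing
print(masked_mse(pred, target))  # averages over the 2 observed entries only
```

If the current loss averages over every entry of the reconstructed vector, the gradient signal gets diluted by the mostly-missing entries, which could also look like slow convergence.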
