Prerequisite
Environment
Reproduces the problem - code sample
Specify a timeout in the default_runtime.py of the MMOCR project.
Reproduces the problem - command or script
Run the training via train.py on the 1.x branch of the MMOCR project as described in the documentation.
Reproduces the problem - error message
Upon hitting
runner = Runner.from_cfg(cfg)
the application will crash with
Traceback (most recent call last):
  File "tools/train.py", line 117, in <module>
    main()
  File "tools/train.py", line 106, in main
    runner = Runner.from_cfg(cfg)
  File "/usr/local/lib/python3.8/site-packages/mmengine/runner/runner.py", line 431, in from_cfg
    runner = cls(
  File "/usr/local/lib/python3.8/site-packages/mmengine/runner/runner.py", line 345, in __init__
    self.setup_env(env_cfg)
  File "/usr/local/lib/python3.8/site-packages/mmengine/runner/runner.py", line 644, in setup_env
    init_dist(self.launcher, **dist_cfg)
  File "/usr/local/lib/python3.8/site-packages/mmengine/dist/utils.py", line 56, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mmengine/dist/utils.py", line 94, in _init_dist_pytorch
    torch_dist.init_process_group(backend=backend, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 539, in init_process_group
    raise RuntimeError(
RuntimeError: Expected timeout argument to be of typedatetime.timedelta
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 197) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Additional information
What I expect to happen?
I expect the distributed training to be initialized with the specified timeout (in seconds).
Why do you need to set the timeout?
In our training, we have a large validation dataset whose evaluation exceeds the default 30-minute timeout, which then causes the entire training to crash.
What is the reason for the problem?
PyTorch expects a timedelta object to be provided during initialization. However, we can't just replace the value in the config with a timedelta object, because YAPF validation will crash when encountering a timedelta object instead of a primitive like a string or integer.
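The mismatch can be illustrated with a minimal sketch. The function `init_process_group_stub` is a hypothetical stand-in that only mimics the type check behind the error in the traceback (it is not PyTorch's function), and the `dist_cfg` values are assumed for illustration:

```python
from datetime import timedelta


def init_process_group_stub(backend, timeout=timedelta(minutes=30)):
    """Hypothetical stand-in for the timeout type check (not PyTorch code)."""
    if not isinstance(timeout, timedelta):
        raise RuntimeError(
            "Expected timeout argument to be of type datetime.timedelta")
    return backend, timeout


# Assumed config values: the timeout is a plain integer number of seconds,
# because the config may only hold primitives.
dist_cfg = dict(backend="nccl", timeout=7200)

try:
    init_process_group_stub(**dist_cfg)
    failure = None
except RuntimeError as err:
    failure = str(err)  # the integer is rejected by the type check

# Passing a timedelta with the same duration satisfies the check.
ok = init_process_group_stub(backend="nccl", timeout=timedelta(seconds=7200))
```

The integer is convenient in the config but is rejected at initialization, so the conversion has to happen somewhere between the two.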
What is the solution?
Support a timeout in mmengine by automatically converting an integer into the required timedelta, e.g., in mmengine/mmengine/runner/runner.py (lines 617 to 665 in 79067e4), like this:
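One way this conversion could look, as a minimal sketch: it assumes the distributed settings arrive as a plain dict with an optional numeric `timeout` given in seconds. The helper name `convert_timeout` is hypothetical, and this is not the actual mmengine code:

```python
from copy import deepcopy
from datetime import timedelta


def convert_timeout(dist_cfg):
    """Return a copy of dist_cfg whose numeric timeout is a timedelta.

    Hypothetical helper: a numeric `timeout` is interpreted as seconds.
    A missing timeout or an existing timedelta is left unchanged.
    """
    cfg = deepcopy(dist_cfg)
    timeout = cfg.get("timeout")
    if isinstance(timeout, (int, float)) and not isinstance(timeout, bool):
        cfg["timeout"] = timedelta(seconds=timeout)
    return cfg


# Assumed example values: a two-hour timeout written as an integer.
original = dict(backend="nccl", timeout=7200)
converted = convert_timeout(original)
```

The config file keeps a primitive value, and the converted dict can be forwarded to the distributed initialization. Working on a copy leaves the dumped config unchanged.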