
wandb breaks tests - importlib.util.find_spec-related under forked process #9623

Closed
stas00 opened this issue Jan 15, 2021 · 13 comments

@stas00
Contributor

stas00 commented Jan 15, 2021

This has to do with a forked process environment:

I was running:

pytest -sv examples/seq2seq/test_finetune_trainer.py -k deepspeed

and was getting:

stderr: Traceback (most recent call last):
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/examples/seq2seq/finetune_trainer.py", line 367, in <module>
stderr:     main()
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/examples/seq2seq/finetune_trainer.py", line 297, in main
stderr:     train_result = trainer.train(
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer.py", line 998, in train
stderr:     self.control = self.callback_handler.on_train_end(self.args, self.state, self.control)
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer_callback.py", line 342, in on_train_end
stderr:     return self.call_event("on_train_end", args, state, control)
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer_callback.py", line 377, in call_event
stderr:     result = getattr(callback, event)(
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/integrations.py", line 565, in on_train_end
stderr:     self._wandb.log({})
stderr:   File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/wandb/sdk/lib/preinit.py", line 37, in preinit_wrapper
stderr:     raise wandb.Error("You must call wandb.init() before {}()".format(name))
stderr: wandb.errors.error.Error: You must call wandb.init() before wandb.log()
stderr: 2021-01-15 09:38:11 | INFO | wandb.sdk.internal.internal | Internal process exited

I tried to remove wandb, and while `pip uninstall wandb` worked, it left code behind that I had to remove manually:

rm -r /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/wandb

But the problem continued without having any wandb installed:

stderr: Traceback (most recent call last):
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/examples/seq2seq/finetune_trainer.py", line 367, in <module>
stderr:     main()
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/examples/seq2seq/finetune_trainer.py", line 282, in main
stderr:     trainer = Seq2SeqTrainer(
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer.py", line 304, in __init__
stderr:     self.callback_handler = CallbackHandler(
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer_callback.py", line 282, in __init__
stderr:     self.add_callback(cb)
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer_callback.py", line 299, in add_callback
stderr:     cb = callback() if isinstance(callback, type) else callback
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/integrations.py", line 488, in __init__
stderr:     wandb.ensure_configured()
stderr: AttributeError: module 'wandb' has no attribute 'ensure_configured'

The strange stderr prefix comes from our multiprocess testing setup, which requires special handling since pytest can't handle DDP and the like on its own.

The only way I was able to overcome this is with:

export WANDB_DISABLED=true

I'm on transformers master.

@stas00
Contributor Author

stas00 commented Jan 15, 2021

@sgugger, I think the culprit for the 2nd error (after I uninstalled wandb) is:

def is_wandb_available():
    if os.getenv("WANDB_DISABLED"):
        return False
    return importlib.util.find_spec("wandb") is not None

as it returns True when it shouldn't, since:

ls -l /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/wandb
ls: cannot access '/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/wandb': No such file or directory
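
My guess (just an assumption, I haven't verified what exactly find_spec resolves here): something importable may still be left on sys.path, e.g. an empty wandb/ directory or a stale path entry, and an empty directory is importable as a namespace package, so find_spec still returns a spec. A quick way to check what it is actually finding:

import importlib.util

spec = importlib.util.find_spec("wandb")
print(spec)
# a real install has spec.origin pointing at wandb/__init__.py;
# a leftover namespace package shows origin=None with only
# submodule_search_locations set
print(spec.origin if spec else None)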

You can see it with any ddp test, so you don't need to install deepspeed or fairscale to see it, e.g. this fails too:

pytest -sv examples/seq2seq/test_finetune_trainer.py::TestFinetuneTrainer::test_finetune_trainer_ddp

But a single unforked process test works just fine:

pytest -sv examples/seq2seq/test_finetune_trainer.py::TestFinetuneTrainer::test_finetune_trainer_dp

And then there is another problem, which occurs with wandb installed; see the first error in the OP.

stas00 changed the title from "wandb breaks tests" to "wandb breaks tests - importlib.util.find_spec-related under forked process" on Jan 15, 2021
@stas00
Contributor Author

stas00 commented Jan 17, 2021

But with wandb installed, I get the 1st error with DDP too, w/o needing to fork a process in tests:

python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500
[...]
[INFO|integrations.py:521] 2021-01-16 20:47:40,853 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: stason (use `wandb login --relogin` to force relogin)
2021-01-16 20:47:42.440849: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: Tracking run with wandb version 0.10.14
wandb: Syncing run output_dir
wandb: ⭐️ View project at https://wandb.ai/stason/huggingface
wandb: 🚀 View run at https://wandb.ai/stason/huggingface/runs/82q4zxt2
wandb: Run data is saved locally in /mnt/nvme1/code/huggingface/transformers-master/examples/seq2seq/wandb/run-20210116_204741-82q4zxt2
wandb: Run `wandb offline` to turn off syncing.
  0%|          | 0/63 [00:00<?, ?it/s]
[...]
Training completed. Do not forget to share your model on huggingface.co/models =)

Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 297, in main
    train_result = trainer.train(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 998, in train
    self.control = self.callback_handler.on_train_end(self.args, self.state, self.control)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer_callback.py", line 342, in on_train_end
    return self.call_event("on_train_end", args, state, control)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer_callback.py", line 377, in call_event
    result = getattr(callback, event)(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/integrations.py", line 565, in on_train_end
    self._wandb.log({})
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
    raise wandb.Error("You must call wandb.init() before {}()".format(name))
wandb.errors.error.Error: You must call wandb.init() before wandb.log()
2021-01-16 20:47:46 | INFO | wandb.sdk.internal.internal | Internal process exited

@sgugger
Collaborator

sgugger commented Jan 19, 2021

I'm not sure I understand your first error. Could you give us more details? Are you saying that importlib.util.find_spec finds some weird "wandb" module, but only in a distributed setting? I don't have wandb installed, so I can't reproduce this at all.

For the last error, pinging @borisdayma

@borisdayma
Contributor

borisdayma commented Jan 19, 2021

I had a similar issue recently with python 3.8 but it worked with 3.7. It was due to a function from `importlib` whose name changed. Is it the same?

@stas00
Contributor Author

stas00 commented Jan 20, 2021

@borisdayma, I have just installed python-3.7.9 and have the same issue there. Perhaps you had it working with python < 3.7.9?
The issue occurs with python-3.6.12 too.

@sgugger yes, the problem occurs only when there is DDP. If I drop -m torch.distributed.launch the problem goes away, so it has to do with forking/multiple processes. If you remember, there was an issue where someone couldn't use some transformers models because apex was imported at load time and then crashed under torch.mp. That is definitely a totally different issue, but it's related in that it also involves multiple processes.

To reproduce:

pip install wandb
cd examples/seq2seq
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 50 --n_train 50

which results in:

wandb.errors.error.Error: You must call wandb.init() before wandb.log()

If you then remove wandb:

pip uninstall wandb -y

The 2nd error happens:

AttributeError: module 'wandb' has no attribute 'ensure_configured'

The full traces are in the OP.

Please let me know if you need any other info.

@tristandeleu
Contributor

I am running into the same issue with DDP that @stas00 has (#9623 (comment)).
I believe this might be due to the call to `on_train_end`, which calls `wandb.log({})` on all processes, and not just on world process 0, while `wandb.init` was called only on world process 0:

self._wandb.log({})

@borisdayma
Contributor

Interesting, can you check whether it solves the issue on your side, @tristandeleu?
If so, I'll be happy to make a PR.

@tristandeleu
Contributor

tristandeleu commented Jan 26, 2021

It does work for me when I replace it with

if state.is_world_process_zero:
    self._wandb.log({})

There is also another thing I ran into at the same time: `_log_model` was not initialized on processes other than world 0, making the following check fail because it didn't know about `self._log_model`. Adding `self._log_model = False` to `__init__` solved the issue.

EDIT: This solves the issue with DDP, though I don't know if it also solves the original issue (#9623 (comment)).
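
For reference, a minimal sketch of the two changes combined (just an illustration of what I mean, not the exact transformers.integrations source; `PatchedWandbCallback` is a placeholder name):

from transformers import TrainerCallback


class PatchedWandbCallback(TrainerCallback):
    def __init__(self):
        import wandb

        self._wandb = wandb
        # initialize on every rank so non-zero processes don't hit an
        # AttributeError when the flag is checked later
        self._log_model = False

    def on_train_end(self, args, state, control, **kwargs):
        if self._wandb is None:
            return
        # wandb.init() only ran on world process zero, so only that
        # process should call wandb.log()
        if state.is_world_process_zero:
            self._wandb.log({})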

@sgugger
Collaborator

sgugger commented Jan 26, 2021

Don't hesitate to suggest a PR with your fix, @tristandeleu.

@lkk12014402

lkk12014402 commented Jan 27, 2021

It does work for me when I replace it with

if state.is_world_process_zero:
    self._wandb.log({})

There is also another thing I ran into at the same time: `_log_model` was not initialized on processes other than world 0, making the following check fail because it didn't know `self._log_model`. Adding `self._log_model = False` to `__init__` solved the issue.

EDIT: This solves the issue with DDP though, I don't know if it also solves the original issue [#9623 (comment)](https://github.com/huggingface/transformers/issues/9623#issue-787077821)

I had the same problem, and I just used `if state.is_world_process_zero: self._wandb.log({})` and forgot `self._log_model = False`. Thanks!!!

@lkk12014402


Even with these code changes applied, the program (running on TPU) doesn't seem to stop at the end.

stas00 added a commit that referenced this issue Jan 30, 2021
This PR solves part of #9623.

It tries to actually do what #9699 requested/discussed: any value of `WANDB_DISABLED` should disable wandb.

The current behavior is that it has to be one of `ENV_VARS_TRUE_VALUES = {"1", "ON", "YES"}`.

I have been using `WANDB_DISABLED=true` everywhere in scripts, as it was originally advertised. I have no idea why this was changed to a subset of possible values, and it's not documented anywhere.
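
To illustrate the difference (a sketch of the two behaviors, not the actual diff):

import os

ENV_VARS_TRUE_VALUES = {"1", "ON", "YES"}

# current behavior: only these exact values disable wandb
disabled_before = os.getenv("WANDB_DISABLED", "").upper() in ENV_VARS_TRUE_VALUES

# behavior after this change: any non-empty value disables wandb,
# so WANDB_DISABLED=true keeps working as originally advertised
disabled_after = bool(os.getenv("WANDB_DISABLED"))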

@sgugger
LysandreJik pushed a commit that referenced this issue Feb 1, 2021
* [t5 doc] typos

a few runaway backticks

@sgugger

* style

* [trainer] put fp16 args together

this PR proposes a purely cosmetic change that puts all the fp16 args together, so they are easier to manage/read

@sgugger

* style

* [wandb] make WANDB_DISABLED disable wandb with any value


@sgugger

* WANDB_DISABLED=true to disable; make tf trainer consistent

* style
@borisdayma
Contributor

@lkk12014402 can you confirm this still happens with the latest HF master branch?
If so, do you have a reproducible example you could share?

@github-actions

github-actions bot commented Mar 6, 2021

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.
