
wandb breaks tests - importlib.util.find_spec-related under forked process #9623

Closed
stas00 opened this issue Jan 15, 2021 · 13 comments

@stas00
Contributor

stas00 commented Jan 15, 2021

This has to do with a forked process environment:

I was running:

pytest -sv examples/seq2seq/test_finetune_trainer.py -k deepspeed

and was getting:

stderr: Traceback (most recent call last):
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/examples/seq2seq/finetune_trainer.py", line 367, in <module>
stderr:     main()
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/examples/seq2seq/finetune_trainer.py", line 297, in main
stderr:     train_result = trainer.train(
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer.py", line 998, in train
stderr:     self.control = self.callback_handler.on_train_end(self.args, self.state, self.control)
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer_callback.py", line 342, in on_train_end
stderr:     return self.call_event("on_train_end", args, state, control)
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer_callback.py", line 377, in call_event
stderr:     result = getattr(callback, event)(
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/integrations.py", line 565, in on_train_end
stderr:     self._wandb.log({})
stderr:   File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/wandb/sdk/lib/preinit.py", line 37, in preinit_wrapper
stderr:     raise wandb.Error("You must call wandb.init() before {}()".format(name))
stderr: wandb.errors.error.Error: You must call wandb.init() before wandb.log()
stderr: 2021-01-15 09:38:11 | INFO | wandb.sdk.internal.internal | Internal process exited

I tried to remove wandb, and while `pip uninstall wandb` worked, it left code behind that I had to remove manually:

rm -r /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/wandb

But the problem continued without having any wandb installed:

stderr: Traceback (most recent call last):
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/examples/seq2seq/finetune_trainer.py", line 367, in <module>
stderr:     main()
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/examples/seq2seq/finetune_trainer.py", line 282, in main
stderr:     trainer = Seq2SeqTrainer(
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer.py", line 304, in __init__
stderr:     self.callback_handler = CallbackHandler(
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer_callback.py", line 282, in __init__
stderr:     self.add_callback(cb)
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/trainer_callback.py", line 299, in add_callback
stderr:     cb = callback() if isinstance(callback, type) else callback
stderr:   File "/mnt/nvme1/code/huggingface/transformers-ds-optim-fix/src/transformers/integrations.py", line 488, in __init__
stderr:     wandb.ensure_configured()
stderr: AttributeError: module 'wandb' has no attribute 'ensure_configured'

The strange stderr prefix comes from our multiprocess testing setup, which requires special handling since pytest can't handle DDP and the like on its own.

The only way I was able to overcome this is with:

export WANDB_DISABLED=true

I'm on transformers master.

@stas00
Contributor Author

stas00 commented Jan 15, 2021

@sgugger, I think the culprit for the 2nd error (after I uninstalled wandb) is:

def is_wandb_available():
    if os.getenv("WANDB_DISABLED"):
        return False
    return importlib.util.find_spec("wandb") is not None

as it returns True when it shouldn't, since:

ls -l /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/wandb
ls: cannot access '/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/wandb': No such file or directory
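
My guess (just an assumption, I haven't verified what exactly find_spec resolves here): something importable may still be left on sys.path, e.g. an empty wandb/ directory or a stale path entry, and an empty directory is importable as a namespace package, so find_spec still returns a spec. A quick way to check what it is actually finding:

import importlib.util

spec = importlib.util.find_spec("wandb")
print(spec)
# a real install has spec.origin pointing at wandb/__init__.py;
# a leftover namespace package shows origin=None with only
# submodule_search_locations set
print(spec.origin if spec else None)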

You can see it with any ddp test, so you don't need to install deepspeed or fairscale to see it, e.g. this fails too:

pytest -sv examples/seq2seq/test_finetune_trainer.py::TestFinetuneTrainer::test_finetune_trainer_ddp

But a single unforked process test works just fine:

pytest -sv examples/seq2seq/test_finetune_trainer.py::TestFinetuneTrainer::test_finetune_trainer_dp

And then there is another problem, which occurs with wandb installed; see the first error in the OP.

stas00 changed the title from "wandb breaks tests" to "wandb breaks tests - importlib.util.find_spec-related under forked process" on Jan 15, 2021
@stas00
Contributor Author

stas00 commented Jan 17, 2021

But with wandb installed, I get the 1st error with DDP too, w/o needing to fork a process in tests:

python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500
[...]
[INFO|integrations.py:521] 2021-01-16 20:47:40,853 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: stason (use `wandb login --relogin` to force relogin)
2021-01-16 20:47:42.440849: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: Tracking run with wandb version 0.10.14
wandb: Syncing run output_dir
wandb: ⭐️ View project at https://wandb.ai/stason/huggingface
wandb: 🚀 View run at https://wandb.ai/stason/huggingface/runs/82q4zxt2
wandb: Run data is saved locally in /mnt/nvme1/code/huggingface/transformers-master/examples/seq2seq/wandb/run-20210116_204741-82q4zxt2
wandb: Run `wandb offline` to turn off syncing.
  0%|          | 0/63 [00:00<?, ?it/s]
[...]
Training completed. Do not forget to share your model on huggingface.co/models =)

Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 297, in main
    train_result = trainer.train(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 998, in train
    self.control = self.callback_handler.on_train_end(self.args, self.state, self.control)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer_callback.py", line 342, in on_train_end
    return self.call_event("on_train_end", args, state, control)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer_callback.py", line 377, in call_event
    result = getattr(callback, event)(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/integrations.py", line 565, in on_train_end
    self._wandb.log({})
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
    raise wandb.Error("You must call wandb.init() before {}()".format(name))
wandb.errors.error.Error: You must call wandb.init() before wandb.log()
2021-01-16 20:47:46 | INFO | wandb.sdk.internal.internal | Internal process exited

@sgugger
Collaborator

sgugger commented Jan 19, 2021

I'm not sure I understand your first error. Could you give us more details? Are you saying that importlib.util.find_spec finds some weird "wandb" module, but only in a distributed setting? I don't have wandb installed, so I can't reproduce this at all.

For the last error, pinging @borisdayma

@borisdayma
Contributor

borisdayma commented Jan 19, 2021

I had a similar issue recently with python 3.8 but it worked with 3.7. It was due to a function from `importlib` whose name changed. Is it the same?

@stas00
Contributor Author

stas00 commented Jan 20, 2021

@borisdayma, I have just installed python-3.7.9 and have the same issue there. Perhaps you had it working with python < 3.7.9?
The issue occurs with python-3.6.12 too.

@sgugger yes, the problem occurs only when there is DDP. If I drop -m torch.distributed.launch the problem goes away, so it has to do with forking/multiple processes. If you remember, there was an issue where someone couldn't use some transformers models because apex was imported at load time and then crashed under torch.mp. That is definitely a totally different issue, but it's related in that it also involves multiple processes.

To reproduce:

pip install wandb
cd examples/seq2seq
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 4 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 50 --n_train 50

which results in:

wandb.errors.error.Error: You must call wandb.init() before wandb.log()

If you then remove wandb:

pip uninstall wandb -y

The 2nd error happens:

AttributeError: module 'wandb' has no attribute 'ensure_configured'

The full traces are in the OP.

Please let me know if you need any other info.

@tristandeleu
Contributor

I am running into the same issue with DDP that @stas00 has (#9623 (comment)).
I believe this might be due to the call to `on_train_end`, which calls `wandb.log({})` on all processes, and not just on world process 0, while `wandb.init` was called only on world process 0:

self._wandb.log({})

@borisdayma
Contributor

Interesting, can you check whether it solves the issue on your side, @tristandeleu?
If so, I'll be happy to make a PR.

@tristandeleu
Contributor

tristandeleu commented Jan 26, 2021

It does work for me when I replace it with

if state.is_world_process_zero:
    self._wandb.log({})

There is also another thing I ran into at the same time: `_log_model` was not initialized on processes other than world 0, making the following check fail because it didn't know about `self._log_model`. Adding `self._log_model = False` to `__init__` solved the issue.

EDIT: This solves the issue with DDP, though I don't know if it also solves the original issue (#9623 (comment)).
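
For reference, a minimal sketch of the two changes combined (just an illustration of what I mean, not the exact transformers.integrations source; `PatchedWandbCallback` is a placeholder name):

from transformers import TrainerCallback


class PatchedWandbCallback(TrainerCallback):
    def __init__(self):
        import wandb

        self._wandb = wandb
        # initialize on every rank so non-zero processes don't hit an
        # AttributeError when the flag is checked later
        self._log_model = False

    def on_train_end(self, args, state, control, **kwargs):
        if self._wandb is None:
            return
        # wandb.init() only ran on world process zero, so only that
        # process should call wandb.log()
        if state.is_world_process_zero:
            self._wandb.log({})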

@sgugger
Collaborator

sgugger commented Jan 26, 2021

Don't hesitate to suggest a PR with your fix, @tristandeleu.

@lkk12014402

lkk12014402 commented Jan 27, 2021

It does work for me when I replace it with

if state.is_world_process_zero:
    self._wandb.log({})

There is also another thing I ran into at the same time: `_log_model` was not initialized on processes other than world 0, making the following check fail because it didn't know `self._log_model`. Adding `self._log_model = False` to `__init__` solved the issue.

EDIT: This solves the issue with DDP though, I don't know if it also solves the original issue [#9623 (comment)](https://github.com/huggingface/transformers/issues/9623#issue-787077821)

I had the same problem, and I just used `if state.is_world_process_zero: self._wandb.log({})` and forgot `self._log_model = False`. Thanks!!!

@lkk12014402


Even with these code changes applied, the program (running on TPU) doesn't seem to stop at the end.

stas00 added a commit that referenced this issue Jan 30, 2021
This PR solves part of #9623.

It tries to actually do what #9699 requested/discussed: any value of `WANDB_DISABLED` should disable wandb.

The current behavior is that it has to be one of `ENV_VARS_TRUE_VALUES = {"1", "ON", "YES"}`.

I have been using `WANDB_DISABLED=true` everywhere in scripts, as it was originally advertised. I have no idea why this was changed to a subset of possible values, and it's not documented anywhere.
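
To illustrate the difference (a sketch of the two behaviors, not the actual diff):

import os

ENV_VARS_TRUE_VALUES = {"1", "ON", "YES"}

# current behavior: only these exact values disable wandb
disabled_before = os.getenv("WANDB_DISABLED", "").upper() in ENV_VARS_TRUE_VALUES

# behavior after this change: any non-empty value disables wandb,
# so WANDB_DISABLED=true keeps working as originally advertised
disabled_after = bool(os.getenv("WANDB_DISABLED"))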

@sgugger
LysandreJik pushed a commit that referenced this issue Feb 1, 2021
* [t5 doc] typos

a few runaway backticks

@sgugger

* style

* [trainer] put fp16 args together

this PR proposes a purely cosmetic change that puts all the fp16 args together, so they are easier to manage/read

@sgugger

* style

* [wandb] make WANDB_DISABLED disable wandb with any value


@sgugger

* WANDB_DISABLED=true to disable; make tf trainer consistent

* style
@borisdayma
Contributor

@lkk12014402 can you confirm this still happens with the latest HF master branch?
If so, do you have a reproducible example you could share?

@github-actions

github-actions bot commented Mar 6, 2021

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.
