Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix 8-bit multi-gpu training bug #1353

Merged
merged 4 commits into from
Feb 23, 2024
Merged

Conversation

fancyerii
Copy link
Contributor

see #1348

Copy link
Contributor

@younesbelkada younesbelkada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much @fancyerii ! I left one comment about making gradient_checkpointing_kwargs configurable, I think after that we can ship your PR ! 🚀

Comment on lines 78 to 81
device_map_local_process_idx: Optional[bool] = field(
default=True, metadata={"help": "whether to device map model to local process index, see "
"https://github.com/huggingface/trl/issues/1348"}
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
device_map_local_process_idx: Optional[bool] = field(
default=True, metadata={"help": "whether to device map model to local process index, see "
"https://github.com/huggingface/trl/issues/1348"}
)

I think we can remove that and always create device_map = "device_map": {"": Accelerator().local_process_index}

Can you just make gradient_checkpointing_kwargs configurable here? 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

make gradient_checkpointing_kwargs configurable.
remote unnecessary config of device_map
@younesbelkada
Copy link
Contributor

Thanks @fancyerii !
Could you run the styling checks? make precommit

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@fancyerii
Copy link
Contributor Author

Thanks @fancyerii ! Could you run the styling checks? make precommit

done.
btw, when I run make test, there are many warnings about "wait for wandb.init"

Copy link
Contributor

@younesbelkada younesbelkada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again !

@younesbelkada younesbelkada merged commit ca90cba into huggingface:main Feb 23, 2024
9 checks passed
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
* fix 8-bit multi-gpu training bug see huggingface#1348

* Update dpo_llama2.py

make gradient_checkpointing_kwargs configurable.

* Update dpo_llama2.py

remote unnecessary config of device_map

* format with make precommit

---------

Co-authored-by: ubuntu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants