
M1 GPU mps device integration #596

Merged
pacman100 merged 14 commits into huggingface:main from smangrul/mps-support on Aug 3, 2022

Conversation

@pacman100 (Contributor) commented Aug 2, 2022

What does this PR do?

  1. Enables users to leverage Apple M1 GPUs via the mps device type in PyTorch, for faster training and inference than on CPU.
  2. Sample config after running the accelerate config command:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MPS
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
use_cpu: false
  3. Run cv_example.py with and without MPS to gauge the speedup; the speedup is ~7.5x over CPU. This experiment was done on a MacBook Pro with an M1 Pro chip, which has 8 CPU performance cores (+2 efficiency cores), 14 GPU cores, and 16 GB of unified memory.
#mps
time accelerate launch --config_file ~/mps_config.yaml ~/Code/accelerate/examples/cv_example.py --data_dir images

#cpu because of `--cpu` arg
time accelerate launch --config_file ~/mps_config.yaml ~/Code/accelerate/examples/cv_example.py --cpu --data_dir images
| Device | Training + Evaluation Time (mm:ss) | Accuracy after training (%) |
|--------|------------------------------------|-----------------------------|
| CPU    | 38:19.86                           | 89.38                       |
| MPS    | 5:03.98                            | 89.38                       |
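As a sanity check on the reported ~7.5x figure, the speedup can be recomputed from the two wall-clock times in the table (a small standalone Python snippet; the helper name `to_seconds` is illustrative, not from the PR):

```python
# Verify the ~7.5x speedup claim from the timings above.
def to_seconds(mmss: str) -> float:
    """Convert a 'MM:SS.ss' wall-clock string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + float(seconds)

cpu_time = to_seconds("38:19.86")  # CPU training + evaluation
mps_time = to_seconds("5:03.98")   # MPS training + evaluation
speedup = cpu_time / mps_time
print(f"speedup: {speedup:.2f}x")  # → speedup: 7.57x
```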

Note: Prerequisite: install a torch build with mps support.

# Installing torch with M1 support on macOS
# 1. Install Python 3.10.5 (the arm64 build).
# 2. Check the platform:
>>> import platform
>>> platform.platform()
'macOS-12.5-arm64-arm-64bit'
# This is compatible: the macOS version is above 12.3 and it is the arm64 build.
# 3. Install torch 1.12:
#    pip3 install torch torchvision torchaudio
# 4. Test `mps` device support:
>>> import torch
>>> torch.has_mps
True
>>> a = torch.Tensor([10, 11])
>>> a.to("mps")
/Users/mac/ml/lib/python3.10/site-packages/torch/_tensor_str.py:103: UserWarning: The operator 'aten::bitwise_and.Tensor_out' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:11.)
  nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
tensor([10.0000, 11.0000], device='mps:0')
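In user scripts, the usual portable pattern (a sketch, not code from this PR) is to probe the backend and fall back to CPU when mps is unavailable; `torch.backends.mps.is_available()` is the check newer PyTorch versions recommend over `torch.has_mps`:

```python
import torch

def pick_device() -> torch.device:
    # Prefer the Apple-GPU `mps` backend when it is built and available;
    # otherwise fall back to CPU so the same script runs anywhere.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
a = torch.tensor([10.0, 11.0], device=device)
print(a.device.type)  # 'mps' on a supported Mac, 'cpu' elsewhere
```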

Plots showing GPU usage and CPU usage for the corresponding runs:

GPU M1 mps enabled:
Screenshot 2022-08-02 at 4 29 58 PM

Only CPU training:
Screenshot 2022-08-02 at 6 58 39 PM

Note: For nlp_example.py the time saving is ~30% over CPU, but the evaluation metrics are much worse than with CPU-only training. This suggests certain operations in the BERT model produce incorrect results on the mps device, which needs to be fixed in PyTorch.
Screenshot 2022-08-02 at 4 13 15 PM

@pacman100 pacman100 requested review from muellerzr and sgugger August 2, 2022 13:34
@HuggingFaceDocBuilderDev commented Aug 2, 2022

The documentation is not available anymore as the PR was closed or merged.

@muellerzr (Collaborator) left a comment:

Thanks for this! Left a suggestion to make sure that we get GPU tests actually running and passing, as I assume that's the right move here :)

@muellerzr muellerzr linked an issue Aug 2, 2022 that may be closed by this pull request
@sgugger (Collaborator) left a comment:

Nice addition! I left some comments, and we should also have some documentation around this integration (flagging, for instance, that BERT loses performance).

Tests can be added in other PRs once we have better access to a machine with M1.

Review threads (resolved): src/accelerate/accelerator.py, src/accelerate/commands/launch.py, src/accelerate/state.py
@muellerzr (Collaborator) left a comment:

A few comments for now until the spacing nits are fixed and I can view it better on the website :)

Review threads (resolved): docs/source/usage_guides/mps.mdx, src/accelerate/accelerator.py
@muellerzr (Collaborator) left a comment:

Great work! I left some final doc nits for you 😄

@pacman100 pacman100 merged commit afa7490 into huggingface:main Aug 3, 2022
@pacman100 pacman100 deleted the smangrul/mps-support branch August 4, 2022 05:47
@faraday commented Sep 9, 2022

notebook_launcher doesn't seem to have been modified to utilize this change that adds MPS device support.

Successfully merging this pull request may close these issues.

Add Apple's M1 metal integration
6 participants