
Let's support naive Pipeline Parallelism #210

Merged
merged 11 commits into main from temp-3
Mar 15, 2023

Conversation

younesbelkada
Contributor

@younesbelkada younesbelkada commented Mar 10, 2023

What does this PR do?

Loading a model on a single device is cool, but what if we could split the model across multiple devices?
Users only have to pass a custom device_map when loading the model, and it should work out of the box.

This PR adds support for "sequential parallelism", here termed naive Pipeline Parallelism, since real Pipeline Parallelism involves multi-processing and gradient synchronisation that cannot be handled easily.
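
As an illustration of the intended usage (a minimal sketch, not the exact code in this PR; it assumes a multi-GPU machine and the gpt2 checkpoint):

from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

# "balanced" asks accelerate to spread the layers across the visible GPUs;
# a hand-written dict mapping module names to device ids works as well.
pretrained_model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="balanced")

# wrap the sharded model with the value head used by the PPO trainer
model = AutoModelForCausalLMWithValueHead.from_pretrained(pretrained_model)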

This PR depends on the following PRs:

TODOs:

  • users should NOT apply DP and/or DeepSpeed with this approach as it remains untested
  • should we introduce multi-gpu tests that we optionally run?
  • test with 8bit models
  • update docs

cc @lvwerra @edbeeching

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 10, 2023

The documentation is not available anymore as the PR was closed or merged.

@younesbelkada
Contributor Author

Experiments of gpt-neo-1b int8 + peft multi-GPU : https://wandb.ai/distill-bloom/trl/runs/x3d6fig6?workspace=user-younesbelkada
Single GPU baseline with peft and int8: https://wandb.ai/distill-bloom/trl/runs/rgcqxtfd?workspace=user-younesbelkada

@younesbelkada
Contributor Author

Ran a DP script with accelerate launch gpt2-sentiment.py to make sure nothing is broken in DP, and it seems to work like a charm!
@lvwerra @edbeeching this is ready for review

Member

@lvwerra lvwerra left a comment

Looks good overall. One main thing that I think we need to fix soon is the way different approaches are loaded (peft, PP, int8). This would also allow us to test the compatibility of different methods at loading time. Loading a model twice is not very intuitive, but we can fix this in a dedicated PR.

"The model is offloaded on CPU or disk - CPU & disk offloading is not supported for ValueHead models."
)

first_device = list(set(self.pretrained_model.hf_device_map.values()))[0]
Member

sets do not necessarily preserve order, this is an issue here, no?

Contributor Author

fixed in b9f75eb


first_device = list(set(self.pretrained_model.hf_device_map.values()))[0]

self.v_head = self.v_head.to(first_device)
Member

why is the head on the first device? naively i would have put it on the last device because it's called last, no?

Contributor Author

Because the lm_head is usually on the first device. I modified it a bit to use the lm_head device instead.
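
For context, a rough sketch of that placement (illustrative only, not the exact commit), assuming hf_device_map maps module names to device ids:

# pick the device that hosts lm_head and place the value head there
lm_head_device = None
for name, device in self.pretrained_model.hf_device_map.items():
    if "lm_head" in name:
        lm_head_device = device
        break

if lm_head_device is None:
    # fallback: take the first entry of the device map
    lm_head_device = list(self.pretrained_model.hf_device_map.values())[0]

self.v_head = self.v_head.to(lm_head_device)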

Comment on lines 136 to 138
pretrained_model = AutoModelForCausalLM.from_pretrained(
    config.model_name, load_in_8bit=True, device_map="balanced", max_memory={0: "800MB", 1: "800MB"}
)
Member

I am thinking mid-term we should integrate that into the model classes as well. It's not very intuitive to load AutoModelForCausalLM and later AutoModelForCausalLMWithValueHead.

Member

Same with peft. We could just pass the configs as kwargs, right?

Contributor Author

Hmm, for now we can't, as we need to do it in two stages:
1. load the transformers model
2. pass it to get_peft_model
We can open a follow-up PR to make it simpler (a rough sketch of the current flow is below).
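
A minimal sketch of the two-stage flow described above (the checkpoint and LoRA hyperparameters are placeholders, not values from this PR):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

# stage 1: load the transformers model (optionally with a custom device_map)
base_model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="balanced")

# stage 2: wrap it with peft, then hand it to the value-head class
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
peft_model = get_peft_model(base_model, lora_config)

model = AutoModelForCausalLMWithValueHead.from_pretrained(peft_model)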

Comment on lines 324 to 334
def set_device_hook(module, input, outputs):
    # After a forward pass of a model split across devices, move every tensor
    # in the output back to the first device so downstream modules (e.g. the
    # value head) receive inputs on the device they live on.
    new_output = ()
    for output in outputs:
        if isinstance(output, torch.Tensor):
            new_output += (output.to(first_device),)
        else:
            new_output += (output,)
    return new_output

self.register_forward_hook(set_device_hook)
self.is_sequential_parallel = True
Member

an explanation of what this does would be useful. maybe some comments :)

Contributor Author

Done!

@@ -99,6 +101,7 @@ def __init__(
accelerator_kwargs: Optional[dict] = {},
tracker_project_name: Optional[str] = "trl",
max_grad_norm: Optional[float] = None,
optimize_cuda_cache: Optional[bool] = False,
Member

are there drawbacks to setting it to true?

Member

also the order in the docstring and the kwargs is different, i think it's better to be consistent :)

Contributor Author

@younesbelkada younesbelkada Mar 14, 2023

Fixed the order!
The drawback might be the computational time of the step function; I didn't benchmark that, though.
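
For context, a rough sketch of what a flag like optimize_cuda_cache presumably toggles (the helper name is hypothetical and this is an assumption, not TRL's actual code):

import gc
import torch

def maybe_empty_cuda_cache(optimize_cuda_cache: bool) -> None:
    # Releasing cached blocks lowers peak CUDA memory between PPO steps, but the
    # allocator then has to re-reserve memory, which can make each step slower.
    if optimize_cuda_cache and torch.cuda.is_available():
        gc.collect()
        torch.cuda.empty_cache()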

@younesbelkada younesbelkada requested a review from lvwerra March 14, 2023 11:56
@younesbelkada younesbelkada merged commit 03d9844 into main Mar 15, 2023
@younesbelkada younesbelkada deleted the temp-3 branch March 15, 2023 07:28