Let's support naive Pipeline Parallelism #210
Conversation
The documentation is not available anymore as the PR was closed or merged.
Experiments of gpt-neo-1b int8 + peft multi-GPU: https://wandb.ai/distill-bloom/trl/runs/x3d6fig6?workspace=user-younesbelkada
Ran a DP script with
Looks good overall. One main thing that I think we need to fix soon is the way the different approaches are loaded (peft, PP, int8). This would also allow us to test the compatibility of different methods at loading time. Loading a model twice is not very intuitive, but we can fix this in a dedicated PR.
"The model is offloaded on CPU or disk - CPU & disk offloading is not supported for ValueHead models." | ||
) | ||
|
||
first_device = list(set(self.pretrained_model.hf_device_map.values()))[0] |
sets do not necessarily preserve order, this is an issue here, no?
fixed in b9f75eb
first_device = list(set(self.pretrained_model.hf_device_map.values()))[0]

self.v_head = self.v_head.to(first_device)
why is the head on the first device? naively i would have put it on the last device because it's called last, no?
Because the lm_head is usually on the first device, I modified it a bit to use the lm_head device instead.
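For context, here is a minimal sketch of the "use the lm_head device" idea (the helper name and the fallback are illustrative assumptions, not necessarily the exact code that landed):

import torch.nn as nn


def move_v_head_to_lm_head_device(pretrained_model: nn.Module, v_head: nn.Module) -> nn.Module:
    # `hf_device_map` maps submodule names to devices; picking the lm_head
    # entry is deterministic, unlike `list(set(...))[0]`, where the set order
    # is arbitrary.
    device_map = pretrained_model.hf_device_map
    lm_head_device = None
    for name, device in device_map.items():
        if "lm_head" in name:
            lm_head_device = device
            break
    if lm_head_device is None:
        # Fall back to the first entry of the (insertion-ordered) device map.
        lm_head_device = list(device_map.values())[0]
    return v_head.to(lm_head_device)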
examples/sentiment/scripts/gpt-neo-1b-multi-gpu/gpt-neo-1b_peft.py
Outdated
pretrained_model = AutoModelForCausalLM.from_pretrained(
    config.model_name, load_in_8bit=True, device_map="balanced", max_memory={0: "800MB", 1: "800MB"}
)
I am thinking mid-term we should integrate that into the model classes as well. It's not very intuitive to load AutoModelForCausalLM and later AutoModelForCausalLMWithValueHead.
Same with peft. We could just pass the configs as kwargs, right?
Hmm, for now we can't, as we need to do it in 2 stages:
1- load the transformers model
2- pass it to get_peft_model
We can open a follow-up PR for that to make it simpler.
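For reference, a minimal sketch of those two stages with a LoRA config (the checkpoint name and hyperparameters below are placeholders, not taken from this PR):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stage 1: load the transformers model (8-bit, split across the available GPUs).
base_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",  # placeholder checkpoint
    load_in_8bit=True,
    device_map="balanced",
)

# Stage 2: wrap it with peft.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)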
trl/models/modeling_value_head.py
Outdated
# Forward hook: move every tensor in the model's outputs back to `first_device`,
# so callers always receive tensors on a single, known device even when the
# model itself is split across several GPUs.
def set_device_hook(module, input, outputs):
    new_output = ()
    for output in outputs:
        if isinstance(output, torch.Tensor):
            new_output += (output.to(first_device),)
        else:
            new_output += (output,)
    return new_output

self.register_forward_hook(set_device_hook)
# Mark the model as being split sequentially across multiple devices.
self.is_sequential_parallel = True
an explanation of what this does would be useful. maybe some comments :)
Done!
trl/trainer/ppo_config.py
Outdated
@@ -99,6 +101,7 @@ def __init__(
    accelerator_kwargs: Optional[dict] = {},
    tracker_project_name: Optional[str] = "trl",
    max_grad_norm: Optional[float] = None,
    optimize_cuda_cache: Optional[bool] = False,
are there drawbacks to setting it to true?
also the order in the docstring and the kwargs is different, i think it's better to be consistent :)
Fixed the order!
The drawback is maybe the computational time of the step function; I didn't benchmark that, though.
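For context, the kind of thing such a flag typically gates is emptying the CUDA cache after each optimization step; a rough sketch (the helper below is hypothetical, not the actual trl implementation):

import gc

import torch


def maybe_empty_cuda_cache(optimize_cuda_cache: bool) -> None:
    # Hypothetical helper: when the flag is on, release cached CUDA memory
    # after a PPO step. This lowers peak memory usage, but the allocator has
    # to re-grow its cache on the next step, which can make `step` slower.
    if optimize_cuda_cache and torch.cuda.is_available():
        gc.collect()
        torch.cuda.empty_cache()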
What does this PR do?
Trying to load a model on a single device is cool, but what if we could split the model across multiple devices?
Users will just have to pass a custom device_map when loading the model, and it should work out of the box.
This PR adds support for "Sequential Parallelism" - termed naive Pipeline Parallelism here, since real Pipeline Parallelism involves dealing with multi-processing and gradient synchronisation, which cannot be handled easily.
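As an illustration, assuming AutoModelForCausalLMWithValueHead.from_pretrained forwards the loading kwargs (device_map, max_memory, ...) to transformers, usage could look like this:

from trl import AutoModelForCausalLMWithValueHead

# Split the model across the visible GPUs; a hand-written dict mapping
# module names to device ids can be passed instead of "balanced".
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",  # placeholder checkpoint
    device_map="balanced",
    max_memory={0: "800MB", 1: "800MB"},
)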
This PR depends on the following PRs:
- accelerate: [Accelerator] We should not call to on modules that wraps accelerate loaded models (accelerate#1172)
- peft: [core] Fix peft multi-gpu issue (peft#145)

TODOs:
cc @lvwerra @edbeeching