Let's support naive Pipeline Parallelism #210
Conversation
The documentation is not available anymore as the PR was closed or merged.
Experiments of gpt-neo-1b int8 + peft multi-GPU: https://wandb.ai/distill-bloom/trl/runs/x3d6fig6?workspace=user-younesbelkada
Ran a DP script with
Looks good overall. One main thing that I think we need to fix soon is the way the different approaches are loaded (peft, PP, int8). This would also allow us to test the compatibility of different methods at loading time. Loading a model twice is not very intuitive, but we can fix this in a dedicated PR.
"The model is offloaded on CPU or disk - CPU & disk offloading is not supported for ValueHead models." | ||
) | ||
|
||
first_device = list(set(self.pretrained_model.hf_device_map.values()))[0] |
sets do not necessarily preserve order, this is an issue here, no?
fixed in b9f75eb
first_device = list(set(self.pretrained_model.hf_device_map.values()))[0]

self.v_head = self.v_head.to(first_device)
why is the head on the first device? naively i would have put it on the last device because it's called last, no?
Because the lm_head is usually on the first device, I modified it a bit to use the lm_head device instead.
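For context, here is a minimal sketch of the "use the lm_head device" idea (the helper name and the fallback are illustrative assumptions, not necessarily the exact code that landed):

import torch.nn as nn


def move_v_head_to_lm_head_device(pretrained_model: nn.Module, v_head: nn.Module) -> nn.Module:
    # `hf_device_map` maps submodule names to devices; picking the lm_head
    # entry is deterministic, unlike `list(set(...))[0]`, where the set order
    # is arbitrary.
    device_map = pretrained_model.hf_device_map
    lm_head_device = None
    for name, device in device_map.items():
        if "lm_head" in name:
            lm_head_device = device
            break
    if lm_head_device is None:
        # Fall back to the first entry of the (insertion-ordered) device map.
        lm_head_device = list(device_map.values())[0]
    return v_head.to(lm_head_device)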
examples/sentiment/scripts/gpt-neo-1b-multi-gpu/gpt-neo-1b_peft.py
Outdated
pretrained_model = AutoModelForCausalLM.from_pretrained(
    config.model_name, load_in_8bit=True, device_map="balanced", max_memory={0: "800MB", 1: "800MB"}
)
I am thinking mid-term we should integrate that into the model classes as well. It's not very intuitive to load AutoModelForCausalLM and later AutoModelForCausalLMWithValueHead.
Same with peft. We could just pass the configs as kwargs, right?
Hmm, for now we can't, as we need to do it in 2 stages:
1- load the transformers model
2- pass it to get_peft_model
We can open a follow-up PR for that to make it simpler.
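For reference, a minimal sketch of those two stages with a LoRA config (the checkpoint name and hyperparameters below are placeholders, not taken from this PR):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stage 1: load the transformers model (8-bit, split across the available GPUs).
base_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",  # placeholder checkpoint
    load_in_8bit=True,
    device_map="balanced",
)

# Stage 2: wrap it with peft.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)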
trl/models/modeling_value_head.py
Outdated
# Forward hook: move every tensor in the model's outputs back to `first_device`,
# so callers always receive tensors on a single, known device even when the
# model itself is split across several GPUs.
def set_device_hook(module, input, outputs):
    new_output = ()
    for output in outputs:
        if isinstance(output, torch.Tensor):
            new_output += (output.to(first_device),)
        else:
            new_output += (output,)
    return new_output

self.register_forward_hook(set_device_hook)
# Mark the model as being split sequentially across multiple devices.
self.is_sequential_parallel = True
an explanation of what this does would be useful. maybe some comments :)
Done!
trl/trainer/ppo_config.py
Outdated
@@ -99,6 +101,7 @@ def __init__(
    accelerator_kwargs: Optional[dict] = {},
    tracker_project_name: Optional[str] = "trl",
    max_grad_norm: Optional[float] = None,
    optimize_cuda_cache: Optional[bool] = False,
are there drawbacks to setting it to true?
also the order in the docstring and the kwargs is different, i think it's better to be consistent :)
Fixed the order!
The drawback is maybe the computational time of the step function; I didn't benchmark that, though.
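For context, the kind of thing such a flag typically gates is emptying the CUDA cache after each optimization step; a rough sketch (the helper below is hypothetical, not the actual trl implementation):

import gc

import torch


def maybe_empty_cuda_cache(optimize_cuda_cache: bool) -> None:
    # Hypothetical helper: when the flag is on, release cached CUDA memory
    # after a PPO step. This lowers peak memory usage, but the allocator has
    # to re-grow its cache on the next step, which can make `step` slower.
    if optimize_cuda_cache and torch.cuda.is_available():
        gc.collect()
        torch.cuda.empty_cache()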
What does this PR do?
Trying to load a model on a single device is cool, but what if we could split the model across multiple devices?
Users will just have to pass a custom device_map when loading the model, and it should work out of the box.
This PR adds support for "Sequential Parallelism" - termed naive Pipeline Parallelism here, since real Pipeline Parallelism involves dealing with multi-processing and gradient synchronisation, which cannot be handled easily.
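As an illustration, assuming AutoModelForCausalLMWithValueHead.from_pretrained forwards the loading kwargs (device_map, max_memory, ...) to transformers, usage could look like this:

from trl import AutoModelForCausalLMWithValueHead

# Split the model across the visible GPUs; a hand-written dict mapping
# module names to device ids can be passed instead of "balanced".
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",  # placeholder checkpoint
    device_map="balanced",
    max_memory={0: "800MB", 1: "800MB"},
)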
This PR depends on the following PRs:
- accelerate: [Accelerator] We should not call to on modules that wraps accelerate loaded models (accelerate#1172)
- peft: [core] Fix peft multi-gpu issue (peft#145)

TODOs:
cc @lvwerra @edbeeching