
[misc] feat: support offload parameter and optimizer during rollout #284

Merged (4 commits into main on Feb 17, 2025)

Conversation

@PeterSH6 (Collaborator) commented Feb 15, 2025

  • Fixed FSDP1 model offload
  • With actor_rollout_ref.actor.fsdp_config.param_offload=True and actor_rollout_ref.actor.fsdp_config.optimizer_offload=True, GPU memory utilization can be raised to 0.9.
  • With offload enabled for the actor, critic, and reference model, only one model copy resides in GPU memory at a time. Therefore, we can further increase micro_batch_size_per_gpu or max_token_per_gpu.

Specifically:

  • During rollout, only the rollout model and its KVCache are in GPU memory.
  • During critic value computation, only the critic model stays in GPU memory, while its optimizer and other model states are kept in CPU main memory.
  • During actor update, the actor model and its optimizer are on GPU, while the reference model, critic model, and critic optimizer are offloaded to CPU.
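The one-model-at-a-time choreography above can be sketched with plain PyTorch modules. The helper names (offload_params, load_params) are illustrative, not verl's actual API, and the three Linear layers stand in for the actor, critic, and reference models:

```python
import torch
import torch.nn as nn

def offload_params(module: nn.Module) -> None:
    """Move a module's parameters (and any grads) to CPU main memory."""
    for p in module.parameters():
        p.data = p.data.to("cpu", non_blocking=True)
        if p.grad is not None:
            p.grad = p.grad.to("cpu", non_blocking=True)

def load_params(module: nn.Module, device: torch.device) -> None:
    """Bring a module's parameters back onto the compute device."""
    for p in module.parameters():
        p.data = p.data.to(device, non_blocking=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
actor, critic, ref = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)

# Critic phase: only the critic occupies the accelerator; the actor and
# reference policy are parked in CPU main memory.
offload_params(actor)
offload_params(ref)
load_params(critic, device)
assert all(p.device.type == "cpu" for p in actor.parameters())
assert all(p.device.type == device.type for p in critic.parameters())
```

On a single-GPU box this pattern is what lets the KVCache or a larger micro-batch use the memory the idle models would otherwise hold.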

@torch.no_grad()
def offload_fsdp_model_to_cpu(model: FSDP, empty_cache: bool = True):
    # Reconstructed excerpt; the parameter loop is inferred from context.
    for param in model.parameters():
        param.data = param.data.to("cpu", non_blocking=True)
        if param.grad is not None:
            param.grad = param.grad.to("cpu", non_blocking=True)
    if empty_cache:
        torch.cuda.empty_cache()
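A matching sketch for the optimizer side (an assumption, not the PR's exact helper): every tensor held in the optimizer state, e.g. Adam's exp_avg and exp_avg_sq, is moved to CPU so that only the weights needed for the current phase stay on the GPU.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def offload_optimizer_to_cpu(optimizer: torch.optim.Optimizer) -> None:
    # Move every tensor in the optimizer state to CPU; for Adam this covers
    # exp_avg and exp_avg_sq. Sketch only, not verl's exact implementation.
    for state in optimizer.state.values():
        for key, value in state.items():
            if isinstance(value, torch.Tensor):
                state[key] = value.to("cpu", non_blocking=True)

model = nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters())
model(torch.randn(2, 4)).sum().backward()
opt.step()  # materialize the Adam state tensors
offload_optimizer_to_cpu(opt)

state_tensors = [v for s in opt.state.values() for v in s.values()
                 if isinstance(v, torch.Tensor)]
assert all(t.device.type == "cpu" for t in state_tensors)
```

A load_optimizer_to_gpu counterpart would do the reverse move before the actor update step.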
Collaborator:

Shall we write a unit test for these two functions?

@PeterSH6 (Collaborator, Author):

Will add a CI in the future

@vermouth1992 vermouth1992 merged commit 9db5232 into main Feb 17, 2025
15 checks passed
@vermouth1992 vermouth1992 deleted the gm/fix_offload branch February 17, 2025 06:07