
[misc] feat: support offload parameter and optimizer during rollout #284

Merged (4 commits into main on Feb 17, 2025)

Conversation

@PeterSH6 (Collaborator) commented Feb 15, 2025

  • Fixed FSDP1 model offload
  • With actor_rollout_ref.actor.fsdp_config.param_offload=True and actor_rollout_ref.actor.fsdp_config.optimizer_offload=True, GPU memory utilization can be raised to 0.9.
  • With offload enabled for the actor, critic, and reference model, only one model copy resides in GPU memory at a time. Therefore, we can further increase micro_batch_size_per_gpu or max_token_per_gpu.

Specifically:

  • During rollout, only the rollout model and its KVCache are in GPU memory.
  • During critic value computation, only the critic model stays in GPU memory, while its optimizer and other model states are kept in CPU main memory.
  • During actor update, the actor model and its optimizer are on GPU, while the reference model, critic model, and critic optimizer are offloaded to CPU.
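The one-model-at-a-time choreography above can be sketched with plain PyTorch modules. The helper names (offload_params, load_params) are illustrative, not verl's actual API, and the three Linear layers stand in for the actor, critic, and reference models:

```python
import torch
import torch.nn as nn

def offload_params(module: nn.Module) -> None:
    """Move a module's parameters (and any grads) to CPU main memory."""
    for p in module.parameters():
        p.data = p.data.to("cpu", non_blocking=True)
        if p.grad is not None:
            p.grad = p.grad.to("cpu", non_blocking=True)

def load_params(module: nn.Module, device: torch.device) -> None:
    """Bring a module's parameters back onto the compute device."""
    for p in module.parameters():
        p.data = p.data.to(device, non_blocking=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
actor, critic, ref = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)

# Critic phase: only the critic occupies the accelerator; the actor and
# reference policy are parked in CPU main memory.
offload_params(actor)
offload_params(ref)
load_params(critic, device)
assert all(p.device.type == "cpu" for p in actor.parameters())
assert all(p.device.type == device.type for p in critic.parameters())
```

On a single-GPU box this pattern is what lets the KVCache or a larger micro-batch use the memory the idle models would otherwise hold.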

@torch.no_grad()
def offload_fsdp_model_to_cpu(model: FSDP, empty_cache: bool = True):
    # Reconstructed excerpt; the parameter loop is inferred from context.
    for param in model.parameters():
        param.data = param.data.to("cpu", non_blocking=True)
        if param.grad is not None:
            param.grad = param.grad.to("cpu", non_blocking=True)
    if empty_cache:
        torch.cuda.empty_cache()
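A matching sketch for the optimizer side (an assumption, not the PR's exact helper): every tensor held in the optimizer state, e.g. Adam's exp_avg and exp_avg_sq, is moved to CPU so that only the weights needed for the current phase stay on the GPU.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def offload_optimizer_to_cpu(optimizer: torch.optim.Optimizer) -> None:
    # Move every tensor in the optimizer state to CPU; for Adam this covers
    # exp_avg and exp_avg_sq. Sketch only, not verl's exact implementation.
    for state in optimizer.state.values():
        for key, value in state.items():
            if isinstance(value, torch.Tensor):
                state[key] = value.to("cpu", non_blocking=True)

model = nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters())
model(torch.randn(2, 4)).sum().backward()
opt.step()  # materialize the Adam state tensors
offload_optimizer_to_cpu(opt)

state_tensors = [v for s in opt.state.values() for v in s.values()
                 if isinstance(v, torch.Tensor)]
assert all(t.device.type == "cpu" for t in state_tensors)
```

A load_optimizer_to_gpu counterpart would do the reverse move before the actor update step.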
Collaborator:

Shall we write a unit test for these two functions?

@PeterSH6 (Collaborator, Author):

Will add a CI in the future

@vermouth1992 vermouth1992 merged commit 9db5232 into main Feb 17, 2025
15 checks passed
@vermouth1992 vermouth1992 deleted the gm/fix_offload branch February 17, 2025 06:07