[Feature] several features for veRL integration #2736
Comments
Quick update: a POC that can run with TP=4 on 4 GPU cards. The code is super hacky - I will rigorously refactor it in SGLang later. Code: https://github.com/fzyzcjy/sglang/tree/feat/add_verl, more specifically https://github.com/fzyzcjy/sglang/blob/feat/add_verl/examples/runtime/engine/offline_batch_inference_torchrun.py Experiment: run llama 70B on 4 GPUs (if the code were buggy such that TP was not actually enabled, we would see OOM). Output:
Quick update: refactor in progress: #2747. Question: does the following API look good? As we know, users originally create one `Engine`. An alternative API would be to expose different output types or do hacky conversions, for example by directly exposing the Scheduler class with some kind of thin wrapping. That would be faster to implement, but in my humble opinion a bit uglier, so I personally prefer the one proposed above.
Quick update: a PR to SGLang that seems to support TP. Code: https://github.com/fzyzcjy/sglang/tree/feat/process_coordinator, more specifically https://github.com/fzyzcjy/sglang/blob/feat/process_coordinator/examples/runtime/engine/offline_batch_inference_torchrun.py Draft PR: #2749. It seems to work now, but I will need to do more checks later. llama 70B on 4xH100 outputs something that looks reasonable:
Checklist
Motivation
TL;DR: introducing several features that would be beneficial for integrating SGLang into veRL, and that may also help other post-training frameworks.
Provide an inference script that is started by torchrun (support SPMD)
Currently, the offline inference script is launched by `sgl.Engine`. Internally, it spawns multiple `Scheduler` processes.
With `torchrun`, the `Scheduler` is launched by `torchrun` and the tp_rank can be obtained from the environment.
In veRL, the Data Parallel dimension is managed by our `WorkerGroup`, and the dp_rank of each `Scheduler` should be None.
More specifically, if the current `WorkerGroup` has 8 GPUs and we set the rollout TP size to 2, all the GPUs in this `WorkerGroup` will build the distributed world, and the generation engine and training engine will each construct their own TP/PP groups. veRL's `data_protocol` will partition and dispatch the prompts to each TP/PP group without the generation engine being aware of the DP dimension.
A general picture of a torchrun script that can simulate the HybridEngine behavior is sketched below.
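For illustration only, such a torchrun-launched SPMD script could look roughly like the following. The `VerlEngine` wrapper, its constructor arguments, and the adapter module are assumptions, not an existing SGLang API; only the torchrun environment variables are standard.

```python
# Hedged sketch of a torchrun-launched (SPMD) rollout script.
import os

import torch.distributed as dist

from sglang_verl_adapter import VerlEngine  # hypothetical adapter, not part of SGLang today


def main():
    # torchrun sets these environment variables for every process it spawns.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    dist.init_process_group(backend="nccl")

    # veRL's WorkerGroup owns the DP dimension, so every rank here is a TP rank
    # and the dp_rank is effectively None.
    engine = VerlEngine(                          # hypothetical engine wrapper
        model_path="meta-llama/Llama-2-70b-hf",   # placeholder model path
        tp_size=world_size,
        tp_rank=rank,                             # tp_rank comes from the torchrun env
        gpu_id=local_rank,
    )

    # Every TP rank receives the same prompts; one rank gathers and prints the outputs.
    outputs = engine.generate(["The capital of France is"])
    if rank == 0:
        print(outputs)


if __name__ == "__main__":
    main()
```

This would be launched with something like `torchrun --nproc_per_node=4 offline_batch_inference_torchrun.py`, matching the 4-GPU experiment above.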
Expose an API that can load weights in TP/PP format
See `inference_engine.sync_model_weights(actor_weights=state_dict, load_format='dtensor')` in the above code.
We may need two different load formats with different weight loaders.
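As a sketch of how the training side might drive this API, consider the snippet below. Only `sync_model_weights()` itself is the call proposed here; the surrounding veRL-side helper is made up for illustration.

```python
# Illustrative veRL-side call site for the proposed weight-sync API.
def update_rollout_weights(actor_module, inference_engine):
    # Gather the (possibly sharded) state dict from the training engine;
    # with FSDP this would typically be a DTensor-style state dict.
    state_dict = actor_module.state_dict()

    # Hand the shards to the rollout engine, whose weight loader is responsible
    # for re-sharding them into its own TP/PP layout.
    inference_engine.sync_model_weights(
        actor_weights=state_dict,
        load_format="dtensor",  # a second load format would plug in a different weight loader
    )
```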
Expose an API that can free/re-init kv cache, and offload/load model weights
`inference_engine.free_kvcache()` and `inference_engine.init_kvcache()`;
`inference_engine.offload_model_weights()` and `inference_engine.load_model_weights()`.
It would be better if CUDA Graph were still supported even though we offload the KV cache and model weights. Reference: #2542
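A rough sketch of how veRL might use these four calls around a training step is below. Only the four free/init/offload/load methods are the APIs proposed above; the trainer object and its `update_policy()` method are placeholders.

```python
# Hedged sketch of a HybridEngine-style step that alternates training and rollout.
def train_then_generate(inference_engine, trainer, batch):
    # 1) Release rollout memory so the training engine can use the GPUs.
    inference_engine.free_kvcache()
    inference_engine.offload_model_weights()

    # 2) Run the training step (e.g. a PPO update) in the freed memory.
    trainer.update_policy(batch)

    # 3) Bring the rollout engine back for the next generation round.
    inference_engine.load_model_weights()
    inference_engine.init_kvcache()

    # 4) Generate the next batch of rollouts.
    return inference_engine.generate(batch.prompts)
```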
Disable detokenization during generation.
In RL training, we only need token_ids in most scenarios, and we can perform batch detokenization when we really need the text. We don't care about the ITL (inter-token latency) metric.
After detokenization is disabled, we can check whether there are any further opportunities to improve throughput.
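For illustration, veRL could keep everything as token ids during rollout and only decode when text is actually needed. The `skip_detokenize` option and the `output_ids` field below are hypothetical (this issue is precisely the request to add such a token-ids-only mode); only the Hugging Face `batch_decode` call is an existing API.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")  # placeholder

# Hypothetical: ask the engine for token ids only, skipping per-step detokenization.
outputs = inference_engine.generate(prompts, skip_detokenize=True)
response_ids = [out["output_ids"] for out in outputs]

# Detokenize once, in a single batched call, only when text is actually needed
# (e.g. for logging or a text-based reward model).
texts = tokenizer.batch_decode(response_ids, skip_special_tokens=True)
```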
3D-HybridEngine parallel state construction (TP/PP group generation logic should be different from Megatron-LM when using 3D-HybridEngine)
With our 3D-HybridEngine design in the paper and code, the TP/PP grouping strategy in SGLang would have to be aware of the TP/PP sizes in the training framework.
However, we consider that SGLang does not necessarily need to be aware of the training framework's TP/PP sizes.
So we can build the TP/PP groups for SGLang before SGLang initialization and then pass these TP/PP groups to the SGLang Engine. See [Optional] in the above code; a rough sketch follows.
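A rough sketch of externally building the rollout TP group before engine initialization is below. `torch.distributed.new_group` is the real PyTorch API; the contiguous grouping is only a placeholder (the actual 3D-HybridEngine layout may assign ranks differently), and the `tp_group=` engine argument is an assumption.

```python
import torch.distributed as dist


def build_rollout_tp_group(rollout_tp_size):
    """Build the rollout TP group containing this rank, outside of SGLang.

    Every rank must call new_group() for every group, so loop over all groups
    and keep the one this rank belongs to.
    """
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    assert world_size % rollout_tp_size == 0

    my_group = None
    for start in range(0, world_size, rollout_tp_size):
        ranks = list(range(start, start + rollout_tp_size))
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group = group
    return my_group


# Hypothetical: hand the pre-built group to the engine instead of letting it
# derive the grouping itself (such a tp_group argument does not exist today).
# tp_group = build_rollout_tp_group(rollout_tp_size=2)
# engine = VerlEngine(model_path=..., tp_group=tp_group)
```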
Output post-process to torch.Tensor (token_ids).
A small feature; if it is not supported natively, we can implement the post-processing in veRL instead. No worries.
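For reference, the veRL-side fallback is simple padding of the per-request token id lists into a single tensor; the `output_ids` field name below is an assumption about the engine's output format.

```python
import torch


def outputs_to_tensor(outputs, pad_token_id):
    # Each request may produce a different number of tokens, so right-pad
    # every sequence to the longest one before stacking into a 2-D tensor.
    ids = [out["output_ids"] for out in outputs]  # output field name is an assumption
    max_len = max(len(x) for x in ids)
    padded = [x + [pad_token_id] * (max_len - len(x)) for x in ids]
    return torch.tensor(padded, dtype=torch.long)
```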
Related resources
No response