Merge commit 'df8939818d2b3694d14120d8fb07eea96e5b99a8' into feat/unlsoth_fast_grpo

* commit 'df8939818d2b3694d14120d8fb07eea96e5b99a8': (24 commits)
  GRPO+LMDeploy 0.7 (modelscope#3277)
  fix lmdeploy (modelscope#3274)
  compat lmdeploy 0.7 (modelscope#3256)
  Fix typos (modelscope#3266)
  Support the base64 format of generated images for JanusPro (modelscope#3265)
  grpo_countdown & fix format reward (modelscope#3269)
  fix grpo compat transformers==4.47.* (modelscope#3252)
  save val_dataset (modelscope#3248)
  fix  grpo single gpu(modelscope#3246)
  fix grpo npu vllm (modelscope#3242)
  update docs (modelscope#3243)
  support muon optimizer (modelscope#3234)
  support moonlight (modelscope#3232)
  fix deepseek_vl2 (modelscope#3233)
  fix docs zh (modelscope#3231)
  Speed up GRPO (modelscope#3229)
  update docs (modelscope#3230)
  fix load args (modelscope#3226)
  Update the JanusPro-generation (modelscope#3221)
  Support the generation of JanusPro models (modelscope#3218)
  ...
tastelikefeet committed Feb 26, 2025
2 parents 1a06843 + df89398 commit ddedb66
Showing 85 changed files with 2,182 additions and 278 deletions.
18 changes: 10 additions & 8 deletions README.md
@@ -63,7 +63,7 @@ You can contact us and communicate with us by adding our group:

- 🍎 **Model Types**: Supports 450+ pure text large models, **150+ multi-modal large models**, as well as All-to-All multi-modal models, sequence classification models, and embedding models, **covering the entire process from training to deployment**.
- **Dataset Types**: Comes with 150+ pre-training, fine-tuning, human alignment, and multi-modal datasets, and supports custom datasets.
- **Hardware Support**: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, etc.
- **Hardware Support**: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS, etc.
- 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
- **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
@@ -78,6 +78,8 @@ You can contact us and communicate with us by adding our group:


## 🎉 News
- 🎁 2025.02.21: We benchmarked the training speed of GRPO and used a few tricks to [speed it up to 300%](examples/train/grpo/full_lmdeploy.sh). WandB charts can be found [here](https://wandb.ai/tastelikefeet/grpo_perf_test?nw=nwuseryuzezyz)
- 🎁 2025.02.21: Support distillation sampling from LLM APIs. Please check [this example](examples/sampler/distill/distill.sh)
- 🎁 2025.02.17: Support SwanLab; just add [a few arguments](docs/source_en/Instruction/Command-line-parameters.md#swanlab) and you can use SwanLab to analyze your training results
- 🎁 2025.02.16: Support LMDeploy in GRPO with `--use_lmdeploy true`. Please check [this script](examples/train/grpo/full_lmdeploy.sh); a hedged sketch follows this list
- 🔥 2025.02.12: Support the GRPO (Group Relative Policy Optimization) algorithm for LLMs and MLLMs; documentation can be found [here](docs/source_en/Instruction/GRPO.md)
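To make the two GRPO news items above concrete, here is a minimal sketch of a GRPO run with LMDeploy rollouts. Only `--use_lmdeploy true` comes from the news entries; the subcommand, model, dataset, and remaining flags are assumptions modeled on `examples/train/grpo/full_lmdeploy.sh` and may differ from the actual script.

```bash
# Hedged sketch of GRPO training with LMDeploy-accelerated rollouts.
# Only --use_lmdeploy is confirmed above; the other flags, the model,
# and the dataset are assumptions -- compare examples/train/grpo/full_lmdeploy.sh.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-MO/NuminaMath-TIR \
    --reward_funcs accuracy format \
    --use_lmdeploy true
```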
@@ -113,13 +115,13 @@ Running Environment:
| python | >=3.9 | 3.10 | |
| cuda | | cuda12 | No need to install if using CPU, NPU, MPS |
| torch | >=2.0 | | |
| transformers | >=4.33 | 4.48.3 | |
| transformers | >=4.33 | 4.49 | |
| modelscope | >=1.19 | | |
| peft | >=0.11.0,<0.15.0 | | |
| trl | >=0.13,<0.16 | 0.15 | RLHF |
| deepspeed | >=0.14 | | Training |
| vllm | >=0.5.1 | 0.7.2 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 | Inference/Deployment/Evaluation |
| peft | >=0.11,<0.15 | | |
| trl | >=0.13,<0.17 | 0.15 | RLHF |
| deepspeed | >=0.14 | 0.14.5 | Training |
| vllm | >=0.5.1 | 0.7.3 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5 | 0.7.0.post3 | Inference/Deployment/Evaluation |
| evalscope | | >=0.11 | Evaluation |

For more optional dependencies, you can refer to [here](https://github.com/modelscope/ms-swift/blob/main/requirements/install_all.sh).
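As a hedged convenience, the "Recommended" column of the table above could be pinned in one command; the pins below simply mirror that column and are an illustrative sketch, not an official install line.

```bash
# Hedged sketch: pin the "Recommended" versions from the table above.
# vllm/lmdeploy/evalscope are optional (inference/deployment/evaluation only).
pip install "transformers==4.49.*" "peft>=0.11,<0.15" "trl==0.15.*" \
    "deepspeed==0.14.5" "vllm==0.7.3" "lmdeploy==0.7.0.post3" \
    "evalscope>=0.11" "modelscope>=1.19" ms-swift
```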
@@ -164,7 +166,7 @@ swift sft \

Tips:

- If you want to train with a custom dataset, you can refer to [this guide](../Customization/Custom-dataset.md) to organize your dataset format and specify `--dataset <dataset_path>`.
- If you want to train with a custom dataset, you can refer to [this guide](https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html) to organize your dataset format and specify `--dataset <dataset_path>` (a minimal sketch follows this list).
- The `--model_author` and `--model_name` parameters are only effective when the dataset includes `swift/self-cognition`.
- To train with a different model, simply modify `--model <model_id/model_path>`.
- By default, ModelScope is used for downloading models and datasets. If you want to use HuggingFace, simply specify `--use_hf true`.
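To make the custom-dataset tip concrete, here is a minimal sketch: a two-line JSONL file plus a matching `swift sft` call. The `messages`/`role`/`content` schema and the model ID are assumptions for illustration; verify the exact format against the custom-dataset guide linked above.

```bash
# Hedged sketch: create a tiny JSONL dataset and train on it.
# The messages/role/content schema and the model ID are assumptions --
# check the custom-dataset guide linked in the tips above.
cat > my_dataset.jsonl <<'EOF'
{"messages": [{"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}]}
{"messages": [{"role": "user", "content": "Name a prime number."}, {"role": "assistant", "content": "7"}]}
EOF

swift sft \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --dataset my_dataset.jsonl
```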
16 changes: 9 additions & 7 deletions README_CN.md
@@ -60,7 +60,7 @@
**Why choose ms-swift?**
- 🍎 **Model Types**: Supports 450+ pure text large models and **150+ multi-modal large models**, as well as All-to-All multi-modal models, sequence classification models, and embedding models, **covering the entire process from training to deployment**.
- **Dataset Types**: Comes with 150+ pre-training, fine-tuning, human alignment, and multi-modal datasets, and supports custom datasets.
- **Hardware Support**: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, etc.
- **Hardware Support**: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS, etc.
- 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, and Liger-Kernel.
- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
- **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
@@ -74,6 +74,8 @@
- **Model Quantization**: Supports quantized export with AWQ, GPTQ, and BNB; exported models support accelerated inference with vLLM/LMDeploy and can continue to be trained.

## 🎉 News
- 🎁 2025.02.21: We benchmarked the performance of the GRPO algorithm and used a few tricks to [raise the training speed to 300%](examples/train/grpo/full_lmdeploy.sh). WandB charts can be found [here](https://wandb.ai/tastelikefeet/grpo_perf_test?nw=nwuseryuzezyz)
- 🎁 2025.02.21: Support distillation sampling from LLM APIs; see [this example](examples/sampler/distill/distill.sh)
- 🎁 2025.02.17: Support SwanLab; just add [a few new arguments](docs/source/Instruction/命令行参数.md#swanlab) to track your training results on SwanLab
- 🎁 2025.02.16: Support LMDeploy in the GRPO algorithm with `--use_lmdeploy true`; see [this script](examples/train/grpo/full_lmdeploy.sh)
- 🔥 2025.02.12: Support the GRPO (Group Relative Policy Optimization) training algorithm; the training docs can be found [here](docs/source/Instruction/GRPO.md)
@@ -108,13 +110,13 @@ pip install -e .
| python | >=3.9 | 3.10 | |
| cuda | | cuda12 | No need to install if using CPU, NPU, MPS |
| torch | >=2.0 | | |
| transformers | >=4.33 | 4.48.3 | |
| transformers | >=4.33 | 4.49 | |
| modelscope | >=1.19 | | |
| peft | >=0.11.0,<0.15.0 | | |
| trl | >=0.13,<0.16 | 0.15 | RLHF |
| deepspeed | >=0.14 | | Training |
| vllm | >=0.5.1 | 0.7.2 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 | Inference/Deployment/Evaluation |
| peft | >=0.11,<0.15 | | |
| trl | >=0.13,<0.17 | 0.15 | RLHF |
| deepspeed | >=0.14 | 0.14.5 | Training |
| vllm | >=0.5.1 | 0.7.3 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5 | 0.7.0.post3 | Inference/Deployment/Evaluation |
| evalscope | | >=0.11 | Evaluation |

For more optional dependencies, refer to [here](https://github.com/modelscope/ms-swift/blob/main/requirements/install_all.sh)
Binary file added docs/resources/grpo_countdown.png
Binary file added docs/resources/grpo_countdown_1.png
2 changes: 1 addition & 1 deletion docs/source/.readthedocs.yaml
@@ -9,7 +9,7 @@ version: 2
build:
os: ubuntu-22.04
tools:
python: "3.12"
python: "3.10"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
12 changes: 6 additions & 6 deletions docs/source/GetStarted/SWIFT安装.md
@@ -57,13 +57,13 @@ pip install ms-swift==2.*
| python | >=3.9 | 3.10 | |
| cuda | | cuda12 | No need to install if using CPU, NPU, MPS |
| torch | >=2.0 | | |
| transformers | >=4.33 | 4.48.3 | |
| transformers | >=4.33 | 4.49 | |
| modelscope | >=1.19 | | |
| peft | >=0.11.0,<0.15.0 | | |
| trl | >=0.13,<0.16 | 0.15 | RLHF |
| deepspeed | >=0.14 | | Training |
| vllm | >=0.5.1 | 0.7.2 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5,<0.6.5 | 0.6.4 | Inference/Deployment/Evaluation |
| peft | >=0.11,<0.15 | | |
| trl | >=0.13,<0.17 | 0.15 | RLHF |
| deepspeed | >=0.14 | 0.14.5 | Training |
| vllm | >=0.5.1 | 0.7.3 | Inference/Deployment/Evaluation |
| lmdeploy | lmdeploy>=0.5 | 0.7.0.post3 | Inference/Deployment/Evaluation |
| evalscope | | >=0.11 | Evaluation |

For more optional dependencies, refer to [here](https://github.com/modelscope/ms-swift/blob/main/requirements/install_all.sh)
2 changes: 1 addition & 1 deletion docs/source/GetStarted/快速开始.md
@@ -4,7 +4,7 @@ ms-swift is the large model and multi-modal large model training and deployment framework provided by the ModelScope community

- 🍎 Model Types: Supports 450+ pure text large models and 150+ multi-modal large models, as well as All-to-All multi-modal models, sequence classification models, and embedding models, covering the entire process from training to deployment.
- Dataset Types: Comes with 150+ pre-training, fine-tuning, human alignment, and multi-modal datasets, and supports custom datasets.
- Hardware Support: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, etc.
- Hardware Support: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS, etc.
- 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, and Liger-Kernel.
- Distributed Training: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
- Quantization Training: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
4 changes: 3 additions & 1 deletion docs/source/Instruction/GRPO.md
@@ -7,7 +7,7 @@
Environment setup
```bash
pip install math_verify # reward function
pip install "trl>=0.15"
pip install git+https://github.com/huggingface/trl.git
```
**Note**: It is normal for the loss to be close to 0 during training; see this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851)
@@ -95,6 +95,8 @@ A conversation between User and Assistant. The user asks a question, and the Ass
- vllm_gpu_memory_utilization: pass-through parameter for vLLM
- vllm_max_model_len: pass-through parameter for vLLM
- reward_model: same format as model; uses a reward model as the reward function. At least one of reward_model and reward_funcs must be specified
- num_iterations: number of update iterations per batch; defaults to 1
- epsilon: clipping coefficient
For reward-function hyperparameters, see [built-in reward functions](#内置奖励函数); a hedged usage sketch of the parameters above follows.
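A hedged sketch combining the parameters above into a single invocation; the flag spellings follow this parameter list, while the subcommand, model, dataset, and concrete values are placeholder assumptions rather than recommendations.

```bash
# Hedged sketch: GRPO with vLLM pass-through plus the new hyperparameters.
# Flag names follow the list above; model, dataset, and values are placeholders.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-MO/NuminaMath-TIR \
    --reward_funcs accuracy format \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 4096 \
    --num_iterations 1 \
    --epsilon 0.2
```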