RAGEN is the first reproduction of the DeepSeek-R1(-Zero) methods for training agentic models.
We strongly believe in the future of RL + LLM + Agents. This release is a minimal but viable step forward.
Figure: Rollout and Update Pipeline
During rollout, we have two types of tokens:

- Environment tokens (shown in blue): generated by the simulator/environment, including states $s$ and rewards $r$.
- LLM-generated tokens (shown in red): including both thinking tokens $t$ and action tokens $a$.

The input consists of the interleaved sequence of these tokens. The process flow is as follows:

- Given $s_0, A_0, r_0, s_1 \ldots s_t$, the LLM tries to generate $A_t, s_{t+1} \ldots s_k$.
- A forced truncation is performed to get $A_t$, which contains reasoning (`<think>...</think>`) and an answer (`<ans>...</ans>`).
- $a_t$ is extracted from $A_t$ and fed into the simulator to obtain $r_t$ and $s_{t+1}$.
- $A_t$, $r_t$, and $s_{t+1}$ are appended to the existing trajectory to form the new input.
- After $k$ rounds of rollout, we obtain the sequence $s_0, A_0, r_0, s_1 \ldots s_k$ to train the model.
- Rollouts are generated in batch.
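To make the flow above concrete, here is a minimal Python sketch of the rollout loop, assuming hypothetical `env` and `llm_generate` stand-ins (this is not the actual RAGEN implementation):

```python
import re

def rollout(env, llm_generate, k):
    """Minimal sketch of the multi-round rollout loop; `env` and `llm_generate`
    are hypothetical stand-ins, not the actual RAGEN API."""
    s_0 = env.reset()                            # initial state s_0 from the simulator
    trajectory = [str(s_0)]                      # running sequence s_0, A_0, r_0, s_1, ...
    for t in range(k):
        # The LLM may keep generating imagined states/rewards beyond A_t,
        # so we force-truncate at the end of the first <ans>...</ans> block.
        generation = llm_generate("\n".join(trajectory))
        match = re.search(r"<think>.*?</think>\s*<ans>(.*?)</ans>", generation, re.DOTALL)
        if match is None:
            break                                # malformed output: end this rollout early
        A_t = generation[: match.end()]          # forced truncation keeps only A_t
        a_t = match.group(1).strip()             # action extracted from <ans>...</ans>
        s_next, r_t, done = env.step(a_t)        # simulator returns r_t and s_{t+1}
        trajectory += [A_t, f"reward: {r_t}", str(s_next)]
        if done:
            break
    return "\n".join(trajectory)                 # s_0, A_0, r_0, s_1, ..., s_k for training
```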
During the update phase:

- Compute and backpropagate the loss for the tokens in orange.
- Reward calculation: parse $r_0, \ldots, r_{k-1}$ from the trajectory tokens using regex-based rules.
- Final reward computation: $r = \mathrm{sum}(r_0, \ldots, r_{k-1})$ for each generated rollout.
- Unified multi-round processing: maintains consistency by avoiding new instance creation that could destabilize batch sizes.
- World modeling: potentially enables world modeling (state and reward prediction), which helps the LLM agent plan.
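As a rough illustration of the reward step, per-turn rewards can be pulled back out of the trajectory text with a regex and summed; the `reward: <number>` pattern below is an assumption for illustration, not the exact format RAGEN uses:

```python
import re

def trajectory_reward(trajectory_text: str) -> float:
    """Sum the per-turn rewards r_0, ..., r_{k-1} parsed out of the trajectory text.
    The `reward: <number>` pattern is an assumption for illustration."""
    rewards = [float(x) for x in re.findall(r"reward:\s*(-?\d+(?:\.\d+)?)", trajectory_text)]
    return sum(rewards)

# Example: two turns with rewards 0.9 and 10.0 give r = 10.9
print(trajectory_reward("... reward: 0.9 ... reward: 10.0 ..."))
```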
We run RAGEN on Qwen-2.5-{0.5B, 3B}-{Instruct, None} and DeepSeek-R1-Distill-Qwen-1.5B, on the Gym-Sokoban task.
About the sokoban task (from the official repo): Sokoban is Japanese for warehouse keeper and a traditional video game. The game is a transportation puzzle, where the player has to push all boxes in the room on the storage locations/ targets. The possibility of making irreversible mistakes makes these puzzles so challenging especially for Reinforcement Learning algorithms, which mostly lack the ability to think ahead.
NOTE: See the Visualization section for details. The maximum reward of this environment is 10.9. The action space is 0-4 (0: Stand, 1: Up, 2: Down, 3: Left, 4: Right).
The loss curves have not converged yet (since our compute is currently limited), but we already see some trends:
- Instruct-finetuned models do not hold a significant advantage over pretrained-only models, although they are better at the start.
- 3B models perform better than 0.5B models, but the advantage is also not obvious at around 40 steps.
- Interestingly, the R1-distilled 1.5B model currently does worse than the 0.5B models.
We plan to release complete wandb plots for these experiment runs; you can also try it yourself, and your run may even be faster than ours (for the reasons above).
To set up the environment and download the data (7 MB), run:
bash scripts/setup_ragen.sh
python scripts/download_data.py
If it fails, you can try running the lines in scripts/setup_ragen.md manually.
For the Gym-Sokoban and FrozenLake tasks, we create 10k first-round observations each for training.
Click here to see how to synthesize data manually.
You can choose to generate basic data, or holistic data for research purposes.
# basic data creation
bash scripts/create_data.sh
# holistic data creation for research purposes
bash scripts/create_data_full.sh
If you want to upload the data to the Hugging Face Hub:
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='ZihanWang314/ragen-datasets', repo_type='dataset')
api.upload_folder(
    folder_path='data/',
    repo_id='ZihanWang314/ragen-datasets',
    repo_type='dataset',
)
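Conversely, to pull the uploaded dataset back down later, here is a usage sketch with `snapshot_download` (the repo id mirrors the upload example above):

```python
from huggingface_hub import snapshot_download

# Fetch the dataset repo into a local directory and print its path.
local_dir = snapshot_download(
    repo_id='ZihanWang314/ragen-datasets',
    repo_type='dataset',
)
print(local_dir)
```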
Click here to see the full data summary.
The following table shows the different configurations available for the Sokoban environment:
Dataset Name | Grid Size (DIM_X × DIM_Y) | Number of Boxes | Search Depth | Description |
---|---|---|---|---|
sokoban | 6 × 6 | 1 | 30 | Standard settings |
sokoban_hard | 6 × 6 | 1 | 100 | Harder puzzles |
sokoban_xhard | 6 × 6 | 1 | 500 | Very challenging puzzles |
sokoban_large | 8 × 8 | 1 | 30 | Increased spatial complexity |
sokoban_xlarge | 10 × 10 | 1 | 30 | Very challenging spatial complexity |
sokoban_multi | 6 × 6 | 2 | 30 | Strategic complexity |
Common settings across all Sokoban variants:
- MAX_STEPS: 10
- Train size: 10,000 examples
- Test size: 10 examples
- Seed: 10000
FrozenLake environment maintains a single configuration:
- Grid Size: 6 × 6
- Frozen tile percentage (P): 0.8
- Train size: 10,000 examples
- Test size: 10 examples
- Seed: 100000
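For quick reference, the variants above could be expressed as a parameter table in code; the dictionary below is illustrative only, and the key names are assumptions rather than the actual config keys:

```python
# Illustrative only: the actual parameter names live in the data-creation scripts/configs.
SOKOBAN_VARIANTS = {
    "sokoban":        dict(dim_x=6,  dim_y=6,  num_boxes=1, search_depth=30),
    "sokoban_hard":   dict(dim_x=6,  dim_y=6,  num_boxes=1, search_depth=100),
    "sokoban_xhard":  dict(dim_x=6,  dim_y=6,  num_boxes=1, search_depth=500),
    "sokoban_large":  dict(dim_x=8,  dim_y=8,  num_boxes=1, search_depth=30),
    "sokoban_xlarge": dict(dim_x=10, dim_y=10, num_boxes=1, search_depth=30),
    "sokoban_multi":  dict(dim_x=6,  dim_y=6,  num_boxes=2, search_depth=30),
}
SOKOBAN_COMMON = dict(max_steps=10, train_size=10_000, test_size=10, seed=10_000)
FROZENLAKE = dict(dim_x=6, dim_y=6, p_frozen=0.8, train_size=10_000, test_size=10, seed=100_000)
```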
We provide a default config file in verl/trainer/config/ppo_trainer.yaml. You can change the parameters in the file. The scripts below train two agents on these two tasks, respectively.
To understand and reproduce our experiments, please check out cmd.md for the commands we use for each experiment.
NOTE: All possible arguments are in config/base.yaml and other yaml files.
bash train.sh sokoban \
model.experiment_name=new_test
# override config
bash train.sh sokoban \
model.experiment_name=new_test_debug \
training.train_batch_size=128 \
training.ppo_batch_size=64
# For developers, if you want to add your own config keys, please check [ base.yaml | train.sh | ragen/train.py | verl/trainer/config/ppo_trainer.yaml | and the main_ppo.py in verl/trainer/ppo ] to make sure the changes are reflected coherently.
NOTE: Only tested with 1 GPU
- Create supervised finetuning data; parquet files will be saved in sft/data/<env_type>/
- BFS is used to generate the shortest action path for a given Sokoban environment (see the sketch after the command below)
- The data is then formulated as a chat dataset.
bash sft/generate_data.sh <env_type>
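A minimal sketch of the BFS idea mentioned above; the `step` transition and `is_goal` check are hypothetical placeholders, and the real implementation lives in the SFT data-generation code:

```python
from collections import deque

def bfs_shortest_actions(start_state, step, is_goal, actions=(1, 2, 3, 4)):
    """Breadth-first search over (hashable) environment states; returns the shortest
    action sequence reaching a goal state. `step(state, action)` and `is_goal(state)`
    are hypothetical placeholders for the Sokoban transition and success check."""
    queue = deque([(start_state, [])])
    visited = {start_state}
    while queue:
        state, path = queue.popleft()
        if is_goal(state):
            return path                        # BFS expands by depth, so this path is shortest
        for action in actions:
            nxt = step(state, action)
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [action]))
    return None                                # no solution found in the explored space
```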
- Finetune the model (with LoRA)
- Set the arguments in sft/finetune_lora.sh
- Setting model.lora_rank=0 turns off LoRA finetuning
bash sft/finetune_lora.sh <env_type> <num_gpus> <save_path>
- Merge the LoRA weights with the base model
- Currently, verl's main_ppo.py does not seem to support loading LoRA weights, so we need to merge them into the base model.
python sft/utils/merge_lora.py \
--base_model_name <base_model_name> \
--lora_model_path <lora_model_path> \
--output_path <output_path>
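Under the hood, merging LoRA adapters into a base model typically looks like the PEFT-based sketch below; this is a simplified illustration of what sft/utils/merge_lora.py does, and the actual script may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def merge_lora(base_model_name: str, lora_model_path: str, output_path: str):
    # Load the base model, attach the LoRA adapter, fold the adapter weights
    # into the base weights, and save a plain checkpoint that verl can load.
    base = AutoModelForCausalLM.from_pretrained(base_model_name)
    merged = PeftModel.from_pretrained(base, lora_model_path).merge_and_unload()
    merged.save_pretrained(output_path)
    AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_path)
```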
- Use SFT model for RL training
- By setting BASE_MODEL to the merged model path, we can use the SFT model for RL training.
- By setting the arguments below in train.sh, you can visualize the trajectory:
logging.log_images=True # set to True to log images
logging.log_image_dir=log/trajectory # set to the directory to save images
logging.log_image_step_size=4 # save image every _ steps
logging.log_n_image_per_batch=32 # save _ images per batch
You may use this command to visualize the trajectory:
cd log/trajectory
python -m http.server 8000
# check http://localhost:8000/[EXP_NAME]/step_[STEP_NUM]/trajectory_data_[ID].html
- You may also need to install fonts for the figures to display correctly:
sudo apt-get install fonts-noto-cjk
- Example image for one trajectory:
- Download visualization data from wandb:
from ragen.utils.wandb import download_wandb
download_wandb("RUN_ID") # e.g., 9o465jqj
Please see the cases/ directory. There are only a few cases for now, including reward hacking and the "suck moment"; we will add more soon.
We welcome all kinds of feedback! Please raise an issue for any bug you find or any question or suggestion about the project, so our team members don't have to answer similar questions multiple times and community building stays productive and efficient. Cheers!
*: Project Lead; †: Advising. The remaining authors are in alphabetical order.
We thank DeepSeek for providing the DeepSeek-R1 model and ideas. We thank the veRL team for their infrastructure. We thank the TinyZero team for their discoveries that inspired our early exploration. We thank Yiping Lu, Runxin Xu, and Kyunghyun Cho for insightful discussions.
@misc{RAGEN,
author = {Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Manling Li},
title = {RAGEN: A General-Purpose Reasoning Agent Training Framework},
year = {2025},
organization = {GitHub},
url = {https://github.com/ZihanWang314/ragen},
}