RAGEN is the first reproduction of the DeepSeek-R1(-Zero) methods for training agentic models.
We strongly believe in the future of RL + LLM + Agents. This release is a minimal but viable step forward.
Figure: Rollout and Update Pipeline
During rollout, we have two types of tokens:

- Environment tokens (shown in blue): generated by the simulator/environment, including states $s$ and rewards $r$.
- LLM-generated tokens (shown in red): including both thinking tokens $t$ and action tokens $a$.

The input consists of the interleaved sequence of these tokens. The process flow is as follows:

- Given $s_0, A_0, r_0, s_1 \ldots s_t$, the LLM tries to generate $A_t, s_{t+1} \ldots s_k$.
- A forced truncation is performed to get $A_t$, which contains reasoning (`<think>...</think>`) and an answer (`<ans>...</ans>`).
- $a_t$ is extracted from $A_t$ and fed into the simulator to obtain $r_t$ and $s_{t+1}$.
- $A_t$, $r_t$, and $s_{t+1}$ are appended to the existing trajectory to form the new input.
- After $k$ rounds of rollout, we obtain the sequence $s_0, A_0, r_0, s_1 \ldots s_k$ to train the model.
- Rollouts are generated in batch.
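To make the flow above concrete, here is a minimal Python sketch of the rollout loop, assuming hypothetical `env` and `llm_generate` stand-ins (this is not the actual RAGEN implementation):

```python
import re

def rollout(env, llm_generate, k):
    """Minimal sketch of the multi-round rollout loop; `env` and `llm_generate`
    are hypothetical stand-ins, not the actual RAGEN API."""
    s_0 = env.reset()                            # initial state s_0 from the simulator
    trajectory = [str(s_0)]                      # running sequence s_0, A_0, r_0, s_1, ...
    for t in range(k):
        # The LLM may keep generating imagined states/rewards beyond A_t,
        # so we force-truncate at the end of the first <ans>...</ans> block.
        generation = llm_generate("\n".join(trajectory))
        match = re.search(r"<think>.*?</think>\s*<ans>(.*?)</ans>", generation, re.DOTALL)
        if match is None:
            break                                # malformed output: end this rollout early
        A_t = generation[: match.end()]          # forced truncation keeps only A_t
        a_t = match.group(1).strip()             # action extracted from <ans>...</ans>
        s_next, r_t, done = env.step(a_t)        # simulator returns r_t and s_{t+1}
        trajectory += [A_t, f"reward: {r_t}", str(s_next)]
        if done:
            break
    return "\n".join(trajectory)                 # s_0, A_0, r_0, s_1, ..., s_k for training
```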
During the update phase:

- Compute and backpropagate the loss for the tokens in orange.
- Reward calculation: parse $r_0, \ldots, r_{k-1}$ from the trajectory tokens using regex-based rules.
- Final reward computation: $r = \mathrm{sum}(r_0, \ldots, r_{k-1})$ for each generated rollout.
- Unified multi-round processing: maintains consistency by avoiding new instance creation that could destabilize batch sizes.
- World modeling: potentially enables world modeling (state and reward prediction), which helps the LLM agent plan.
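As a rough illustration of the reward step, per-turn rewards can be pulled back out of the trajectory text with a regex and summed; the `reward: <number>` pattern below is an assumption for illustration, not the exact format RAGEN uses:

```python
import re

def trajectory_reward(trajectory_text: str) -> float:
    """Sum the per-turn rewards r_0, ..., r_{k-1} parsed out of the trajectory text.
    The `reward: <number>` pattern is an assumption for illustration."""
    rewards = [float(x) for x in re.findall(r"reward:\s*(-?\d+(?:\.\d+)?)", trajectory_text)]
    return sum(rewards)

# Example: two turns with rewards 0.9 and 10.0 give r = 10.9
print(trajectory_reward("... reward: 0.9 ... reward: 10.0 ..."))
```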
We run RAGEN on Qwen-2.5-{0.5B, 3B}-{Instruct, None} and DeepSeek-R1-Distill-Qwen-1.5B, on the Gym-Sokoban task.
About the sokoban task (from the official repo): Sokoban is Japanese for warehouse keeper and a traditional video game. The game is a transportation puzzle, where the player has to push all boxes in the room on the storage locations/ targets. The possibility of making irreversible mistakes makes these puzzles so challenging especially for Reinforcement Learning algorithms, which mostly lack the ability to think ahead.
NOTE: See the Visualization section for details. The maximum reward of this environment is 10.9. The action space is 0-4 (0: Stand, 1: Up, 2: Down, 3: Left, 4: Right).
The loss curves have not converged yet (since our compute is currently limited), but we already see some trends:
- Instruct-finetuned models do not hold a significant advantage over pretrained-only models, although they are better at the start.
- 3B models perform better than 0.5B models, but the advantage is also not obvious at around 40 steps.
- Interestingly, the R1-distilled 1.5B model currently does worse than the 0.5B models.
We plan to release complete wandb plots for these experiment runs; you can also try it yourself, and your run may even be faster than ours (for the reasons above).
To set up the environment and download the data (7 MB), run:
bash scripts/setup_ragen.sh
python scripts/download_data.py
If it fails, you can try running the lines in scripts/setup_ragen.md manually.
For the Gym-Sokoban and FrozenLake tasks, we create 10k first-round observations each for training.
Click here to see how to synthesize data manually.
You can choose to generate basic data, or holistic data for research purposes.
# basic data creation
bash scripts/create_data.sh
# holistic data creation for research purposes
bash scripts/create_data_full.sh
If you want to upload the data to the Hugging Face Hub:
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='ZihanWang314/ragen-datasets', repo_type='dataset')
api.upload_folder(
    folder_path='data/',
    repo_id='ZihanWang314/ragen-datasets',
    repo_type='dataset',
)
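Conversely, to pull the uploaded dataset back down later, here is a usage sketch with `snapshot_download` (the repo id mirrors the upload example above):

```python
from huggingface_hub import snapshot_download

# Fetch the dataset repo into a local directory and print its path.
local_dir = snapshot_download(
    repo_id='ZihanWang314/ragen-datasets',
    repo_type='dataset',
)
print(local_dir)
```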
Click here to see the full data summary.
The following table shows the different configurations available for the Sokoban environment:
Dataset Name | Grid Size (DIM_X × DIM_Y) | Number of Boxes | Search Depth | Description |
---|---|---|---|---|
sokoban | 6 × 6 | 1 | 30 | Standard settings |
sokoban_hard | 6 × 6 | 1 | 100 | Harder puzzles |
sokoban_xhard | 6 × 6 | 1 | 500 | Very challenging puzzles |
sokoban_large | 8 × 8 | 1 | 30 | Increased spatial complexity |
sokoban_xlarge | 10 × 10 | 1 | 30 | Very challenging spatial complexity |
sokoban_multi | 6 × 6 | 2 | 30 | Strategic complexity |
Common settings across all Sokoban variants:
- MAX_STEPS: 10
- Train size: 10,000 examples
- Test size: 10 examples
- Seed: 10000
FrozenLake environment maintains a single configuration:
- Grid Size: 6 × 6
- Frozen tile percentage (P): 0.8
- Train size: 10,000 examples
- Test size: 10 examples
- Seed: 100000
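For quick reference, the variants above could be expressed as a parameter table in code; the dictionary below is illustrative only, and the key names are assumptions rather than the actual config keys:

```python
# Illustrative only: the actual parameter names live in the data-creation scripts/configs.
SOKOBAN_VARIANTS = {
    "sokoban":        dict(dim_x=6,  dim_y=6,  num_boxes=1, search_depth=30),
    "sokoban_hard":   dict(dim_x=6,  dim_y=6,  num_boxes=1, search_depth=100),
    "sokoban_xhard":  dict(dim_x=6,  dim_y=6,  num_boxes=1, search_depth=500),
    "sokoban_large":  dict(dim_x=8,  dim_y=8,  num_boxes=1, search_depth=30),
    "sokoban_xlarge": dict(dim_x=10, dim_y=10, num_boxes=1, search_depth=30),
    "sokoban_multi":  dict(dim_x=6,  dim_y=6,  num_boxes=2, search_depth=30),
}
SOKOBAN_COMMON = dict(max_steps=10, train_size=10_000, test_size=10, seed=10_000)
FROZENLAKE = dict(dim_x=6, dim_y=6, p_frozen=0.8, train_size=10_000, test_size=10, seed=100_000)
```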
We provide a default config file in verl/trainer/config/ppo_trainer.yaml. You can change the parameters in the file. The scripts below train two agents on these two tasks, respectively.
To understand and reproduce our experiments, please check out cmd.md for the commands we use for each experiment.
NOTE: All possible arguments are in config/base.yaml and other yaml files.
bash train.sh sokoban \
model.experiment_name=new_test
# override config
bash train.sh sokoban \
model.experiment_name=new_test_debug \
training.train_batch_size=128 \
training.ppo_batch_size=64
# For developers, if you want to add your own config keys, please check [ base.yaml | train.sh | ragen/train.py | verl/trainer/config/ppo_trainer.yaml | and the main_ppo.py in verl/trainer/ppo ] to make sure the changes are reflected coherently.
NOTE: Only tested with 1 GPU
- Create supervised finetuning data; parquet files will be saved in sft/data/<env_type>/
- BFS is used to generate the shortest action path for a given Sokoban environment (see the sketch after the command below)
- The data is then formulated as a chat dataset.
bash sft/generate_data.sh <env_type>
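A minimal sketch of the BFS idea mentioned above; the `step` transition and `is_goal` check are hypothetical placeholders, and the real implementation lives in the SFT data-generation code:

```python
from collections import deque

def bfs_shortest_actions(start_state, step, is_goal, actions=(1, 2, 3, 4)):
    """Breadth-first search over (hashable) environment states; returns the shortest
    action sequence reaching a goal state. `step(state, action)` and `is_goal(state)`
    are hypothetical placeholders for the Sokoban transition and success check."""
    queue = deque([(start_state, [])])
    visited = {start_state}
    while queue:
        state, path = queue.popleft()
        if is_goal(state):
            return path                        # BFS expands by depth, so this path is shortest
        for action in actions:
            nxt = step(state, action)
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [action]))
    return None                                # no solution found in the explored space
```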
- Finetune the model (with LoRA)
- Set the arguments in sft/finetune_lora.sh
- Setting model.lora_rank=0 turns off LoRA finetuning
bash sft/finetune_lora.sh <env_type> <num_gpus> <save_path>
- Merge the LoRA weights with the base model
- Currently, verl's main_ppo.py does not seem to support loading LoRA weights, so we need to merge them into the base model.
python sft/utils/merge_lora.py \
--base_model_name <base_model_name> \
--lora_model_path <lora_model_path> \
--output_path <output_path>
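Under the hood, merging LoRA adapters into a base model typically looks like the PEFT-based sketch below; this is a simplified illustration of what sft/utils/merge_lora.py does, and the actual script may differ:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def merge_lora(base_model_name: str, lora_model_path: str, output_path: str):
    # Load the base model, attach the LoRA adapter, fold the adapter weights
    # into the base weights, and save a plain checkpoint that verl can load.
    base = AutoModelForCausalLM.from_pretrained(base_model_name)
    merged = PeftModel.from_pretrained(base, lora_model_path).merge_and_unload()
    merged.save_pretrained(output_path)
    AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_path)
```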
- Use SFT model for RL training
- By setting BASE_MODEL to the merged model path, we can use the SFT model for RL training.
- By setting the arguments below in train.sh, you can visualize the trajectory:
logging.log_images=True # set to True to log images
logging.log_image_dir=log/trajectory # set to the directory to save images
logging.log_image_step_size=4 # save image every _ steps
logging.log_n_image_per_batch=32 # save _ images per batch
You may use this command to visualize the trajectory:
cd log/trajectory
python -m http.server 8000
# check http://localhost:8000/[EXP_NAME]/step_[STEP_NUM]/trajectory_data_[ID].html
- You may also need to install fonts for the figures to display correctly:
sudo apt-get install fonts-noto-cjk
- Example image for one trajectory:
- Download visualization data from wandb:
from ragen.utils.wandb import download_wandb
download_wandb("RUN_ID") # e.g., 9o465jqj
Please see the cases/ directory. There are only a few cases for now, including reward hacking and the "suck moment"; we will add more soon.
We welcome all kinds of feedback! Please raise an issue for any bug you find or any question or suggestion about the project, so our team members don't have to answer similar questions multiple times and community building stays productive and efficient. Cheers!
*: Project Lead; †: Advising. The remaining authors are in alphabetical order.
We thank DeepSeek for providing the DeepSeek-R1 model and ideas. We thank the veRL team for their infrastructure. We thank the TinyZero team for their discoveries that inspired our early exploration. We thank Yiping Lu, Runxin Xu, and Kyunghyun Cho for insightful discussions.
@misc{RAGEN,
author = {Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Manling Li},
title = {RAGEN: A General-Purpose Reasoning Agent Training Framework},
year = {2025},
organization = {GitHub},
url = {https://github.com/ZihanWang314/ragen},
}