The 24 Game is a classic mathematical puzzle: combine four numbers with the basic arithmetic operations (addition, subtraction, multiplication, division) so that the result is 24. This project aims to enhance the reasoning and self-verification capabilities of Large Language Models (LLMs) on the 24 Game through different training methods (Zero-RL, SFT, SFT+RL).
- Given a deck of cards, the goal is to use the numbers from four cards and arithmetic operations to reach a final result of 24
- Each card must be used exactly once
- You can use addition (+), subtraction (-), multiplication (×), and division (÷)
- You can use parentheses () to change the order of operations
- No other operators or numbers can be used
- Intermediate division results may be non-integers (fractions or repeating decimals)
For example: if the four cards are 3, 3, 8, 8, one solution is 8÷(3-8÷3) = 24.
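To make these rules concrete, the short brute-force search below (an illustrative sketch, not code from this repository) uses exact fractions to check whether four cards can reach 24 and returns one valid expression:

```python
from fractions import Fraction
from itertools import combinations

def solve_24(cards, target=24):
    """Return one expression reaching `target`, or None if the cards cannot."""
    nums = [(Fraction(c), str(c)) for c in cards]
    return _search(nums, Fraction(target))

def _search(nums, target):
    if len(nums) == 1:
        value, expr = nums[0]
        return expr if value == target else None
    # Pick any two remaining values, combine them with every operator,
    # and recurse on the shorter list; this covers all parenthesizations.
    for i, j in combinations(range(len(nums)), 2):
        (a, ea), (b, eb) = nums[i], nums[j]
        rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea}+{eb})"),
            (a * b, f"({ea}*{eb})"),
            (a - b, f"({ea}-{eb})"),
            (b - a, f"({eb}-{ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea}/{eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb}/{ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None

print(solve_24([3, 3, 8, 8]))  # prints one valid expression, e.g. (8/(3-(8/3)))
```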
# Create conda environment
conda create -n 24game python=3.10
conda activate 24game
# Install dependencies
pip install -r requirements.txt
# If you need VLLM for accelerated inference
pip install vllm
24-Game-Reasoning/
├── data/ # Dataset directory
│ ├── 24game_grpo/ # RL dataset
│ └── 24game_sft/ # SFT dataset
├── docs/ # Documentation
├── images/ # Images directory
│ ├── examples/ # Example images
│ └── results/ # Result visualization images
├── results/ # Evaluation results
├── scripts/ # Scripts directory
│ ├── data_processing/ # Data processing scripts
│ ├── evaluation/ # Evaluation scripts
│ └── training/ # Training scripts
├── templates/ # Prompt templates
├── utils/ # Utility functions
├── verl/ # RL training framework
├── .gitignore # Git ignore file
├── README.md # English README
├── README_ZH.md # Chinese README
└── requirements.txt # Python dependencies
First, you need to generate the 24 Game dataset:
# Generate 24 Game data
python scripts/data_processing/data_preparation.py
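Once generation finishes, you can sanity-check the output with a few lines of pandas. The parquet path below is the one referenced by the evaluation command later in this README; the exact column layout depends on `data_preparation.py`, so adjust the field names as needed:

```python
# Quick sanity check on the generated dataset (illustrative only).
import pandas as pd

val = pd.read_parquet("data/24game_sft/val.parquet")
print(val.shape)               # number of examples and columns
print(val.columns.tolist())    # column names written by data_preparation.py
print(val.iloc[0])             # one full example
```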
This project implements three training methods: Zero-RL, SFT, and SFT+RL.
The Zero-RL method directly uses RL to train the base model without prior SFT:
cd verl
bash scripts/run_qwen25_math_grpo.sh
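Because the 24 Game is automatically verifiable, GRPO can rely on a rule-based reward rather than a learned reward model. The sketch below only illustrates the idea; it is not the reward implemented in the verl scripts, and the answer-extraction regex and scoring values are assumptions:

```python
import re

def game24_reward(completion: str, cards: list[int]) -> float:
    """Illustrative rule-based reward: 1.0 for a verified solution, else 0.0."""
    # Assume the completion ends with something like "... 8/(3-8/3) = 24";
    # this extraction pattern is a placeholder, not the project's parser.
    match = re.search(r"([-+*/() 0-9]+)\s*=\s*24\s*$", completion.strip())
    if not match:
        return 0.0
    expr = match.group(1)
    # Each of the four cards must be used exactly once.
    used = sorted(int(tok) for tok in re.findall(r"\d+", expr))
    if used != sorted(cards):
        return 0.0
    try:
        value = eval(expr, {"__builtins__": {}}, {})  # arithmetic only
    except Exception:
        return 0.0
    return 1.0 if abs(value - 24) < 1e-6 else 0.0

print(game24_reward("So the answer is 8/(3-8/3) = 24", [3, 3, 8, 8]))  # 1.0
```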
The SFT (Supervised Fine-Tuning) method uses human-annotated data for supervised fine-tuning:
cd verl
bash scripts/run_qwen25_math_sft.sh 4 None # 4 indicates using 4 GPUs
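For SFT, each training record pairs a prompt describing the four cards with a worked solution. The example below shows the general shape only; the actual prompt wording comes from the templates/ directory and the field names from `data_preparation.py`:

```python
# Illustrative shape of one SFT record; field names and wording are placeholders.
sft_example = {
    "prompt": (
        "Use the numbers 3, 3, 8, 8 with +, -, *, / and parentheses to make 24. "
        "Each number must be used exactly once. Show your reasoning step by step."
    ),
    "response": (
        "First compute 8/3. Then 3 - 8/3 = 1/3. Finally 8 / (1/3) = 24, "
        "so one solution is 8/(3-8/3) = 24."
    ),
}
print(sft_example["prompt"])
```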
The SFT+RL method first performs SFT training, then follows with RL training:
cd verl
bash scripts/run_qwen25_math_grpo_sft_rl.sh
Use the evaluation script to assess trained models:
python scripts/evaluation/eval.py --base_model_path /path/to/model --val_data_path data/24game_sft/val.parquet
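Conceptually, the evaluation loads the validation parquet, generates a completion per prompt, and checks whether the final expression equals 24. The sketch below is a hedged approximation (transformers-based greedy decoding, a placeholder `prompt` column, and a simplified correctness check), not the logic in `eval.py`:

```python
import re
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer

def evaluate(model_path: str, val_path: str) -> float:
    """Greedy-decode every validation prompt and score the final expression."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    df = pd.read_parquet(val_path)
    correct = 0
    for _, row in df.iterrows():
        # "prompt" is a placeholder column name for whatever data_preparation.py writes.
        inputs = tokenizer(row["prompt"], return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
        text = tokenizer.decode(output[0], skip_special_tokens=True)
        match = re.search(r"([-+*/() 0-9]+)\s*=\s*24\s*$", text.strip())
        if match:
            try:
                # A full checker would also confirm each card is used exactly once.
                correct += abs(eval(match.group(1), {"__builtins__": {}}, {}) - 24) < 1e-6
            except Exception:
                pass
    return correct / len(df)

# Example: evaluate("/path/to/model", "data/24game_sft/val.parquet")
```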
We compared three methods (Zero-RL, SFT, SFT+RL) on the 24 Game:
- Zero-RL: Directly uses RL to train the base model without prior SFT
- SFT: Uses human-annotated data for supervised fine-tuning
- SFT+RL: First performs SFT training, then follows with RL training
The experimental results show that SFT+RL achieved the best accuracy and reasoning ability, while Zero-RL also performed well, especially when producing longer chains of thought.
Through the experiments in this project, we found that:
- RL training can effectively enhance the model's reasoning and self-verification capabilities in the 24 Game
- Chain-of-thought length is positively correlated with accuracy, but excessively long chains waste computational resources
- The SFT+RL combination method achieves the best results, but the Zero-RL method is also an effective training strategy
These findings have significant implications for enhancing the mathematical reasoning and self-verification capabilities of large language models and can be applied to a wider range of mathematical problem-solving and logical reasoning tasks.
If you use this project in your research, please cite it using the following format:
@misc{24GameReasoning2025,
  author       = {Wei, Shaohang},
  title        = {24-Game-Reasoning: Enhancing LLM's Reasoning and Self-Verification Capabilities},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub Repository},
  howpublished = {\url{https://github.com/sylvain-wei/24-Game-Reasoning}}
}
This project is licensed under the MIT License. See the LICENSE file for details.