
Sys2Bench is a benchmarking suite designed to evaluate reasoning and planning capabilities of large language models across algorithmic, logical, arithmetic, and common-sense reasoning tasks.


Sys2Bench

A curated benchmark evaluating the reasoning and planning abilities of Large Language Models.

Announcements

Table of Contents
  1. About The Project
  2. Tasks in Sys2Bench
  3. Getting Started
  4. Usage
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments
  9. Citation

About The Project

We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference.

Key Contributions:

  • We explore how scaling inference-time techniques can improve reasoning and planning, focusing on understanding the tradeoff between computational cost and performance.
  • To this end, we construct a comprehensive benchmark, known as Sys2Bench, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks across five categories, including arithmetic reasoning, logical reasoning, common sense reasoning, algorithmic reasoning, and planning.

(back to top)

Tasks in Sys2Bench

The table below provides an overview of the 11 datasets included in Sys2Bench, categorized into Algorithmic Reasoning, Planning, Arithmetic Reasoning, Logical Reasoning, and Common Sense Reasoning, along with their respective tasks, inputs, and outputs.

**Algorithmic Reasoning and Planning**

| Dataset | Category | Task | Input | Output |
| --- | --- | --- | --- | --- |
| Game of 24 | Algorithmic Reasoning | Propose an arithmetic expression that reaches 24. | A list of 4 numbers. | An arithmetic expression. |
| Bin Packing | Algorithmic Reasoning | Pack items into the fewest bins. | A list of item weights and the bin capacity. | A final list with items arranged in bins. |
| Blocksworld | Planning | Plan actions to transform blocks from the initial state to the goal state. | The initial and goal states of the blocks. | A sequence of actions as the plan. |
| Trip Plan | Planning | Plan a trip across cities for a set number of days. | Cities, days per city, total days, and possible flights. | A trip itinerary. |
| Calendar Plan | Planning | Schedule a meeting that respects everyone's time constraints. | Calendars with meetings and time constraints. | A meeting time fitting all schedules. |
| Rubik's Cube | Planning | Unscramble a scrambled 2×2 Rubik's Cube. | A scrambled 2×2 Rubik's Cube. | A sequence of rotations that unscrambles the cube. |

**Arithmetic, Logical, and Common Sense Reasoning**

| Dataset | Category | Task | Input | Output |
| --- | --- | --- | --- | --- |
| GSM8K | Arithmetic Reasoning | Solve grade school arithmetic problems. | An arithmetic problem description. | A numerical value. |
| AQuA | Arithmetic Reasoning | Solve algebraic problems. | An algebraic problem description. | A multiple-choice option. |
| ProntoQA | Logical Reasoning | Draw a logical conclusion from a set of predicates. | A clause to verify as true or false using logical predicates. | True or False, with reasoning. |
| StrategyQA | Common Sense Reasoning | Answer general knowledge questions. | A yes/no question. | Yes or No. |
| HotPotQA | Common Sense Reasoning | Answer general knowledge questions using provided facts. | A general knowledge question with supporting facts. | A short answer of 1 or 2 words. |
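For concreteness, a Game of 24 answer can be verified mechanically: the proposed expression must use each of the four input numbers exactly once and evaluate to 24. Below is a minimal, hypothetical checker written for this README (it is not part of the repo's evaluation code):

```python
import ast
from collections import Counter

def check_game24(numbers, expression):
    """Return True if `expression` uses each input number exactly once and equals 24."""
    tree = ast.parse(expression, mode="eval")
    # Collect every numeric literal appearing in the expression.
    used = [node.value for node in ast.walk(tree) if isinstance(node, ast.Constant)]
    if Counter(used) != Counter(numbers):
        return False  # must use exactly the given numbers, no more, no fewer
    try:
        # Evaluate with no builtins available; the expression is pure arithmetic.
        value = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
    except ZeroDivisionError:
        return False
    return abs(value - 24) < 1e-6

print(check_game24([4, 9, 10, 13], "(10 - 4) * (13 - 9)"))  # True
print(check_game24([4, 9, 10, 13], "4 + 9 + 10"))           # False
```

The same validate-the-final-answer pattern applies to the other tasks (e.g., replaying a Blocksworld plan or a cube rotation sequence), which is what makes these benchmarks automatically checkable.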

Getting Started

Follow the steps below to set up the repository and start using it.

Prerequisites

Before using this repository, ensure you have Conda installed. You can install Miniconda with Python 3.10 or later on Linux using:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init
source ~/.bashrc 

API Keys

To use OpenAI's models, you need an API key. We also use DeepInfra to run LLaMA models, but this is optional since those models can also be run locally.

  • OpenAI API Key (required)
    • Sign up or log in at OpenAI.
    • Navigate to API keys and create a new key.
  • Export the API key (Linux):
    export OPENAI_API_KEY="your-api-key-here"
    echo 'export OPENAI_API_KEY="your-api-key-here"' >> ~/.bashrc
    source ~/.bashrc
  • DeepInfra Token (optional): if you use DeepInfra, export DEEPINFRA_TOKEN="your-token-here" in the same way.
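At runtime, scripts typically read these keys from the environment rather than from files. A minimal sketch of that pattern, assuming the variable names exported above (the helper function itself is illustrative, not part of the repo):

```python
import os

def get_api_key(name="OPENAI_API_KEY"):
    """Fetch an API key from the environment, failing fast with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running experiments.")
    return key
```

Failing fast here is deliberate: a missing key surfaces immediately with an actionable message, instead of as an opaque authentication error midway through a long experiment run.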

Installation

To make installation easy, we provide step-by-step instructions for setting up the repo and getting started.

  1. Clone the repo
    git clone https://github.com/divelab/sys2bench.git
  2. Set Python Path
    export PYTHONPATH=/path/to/repo/sys2bench:$PYTHONPATH
  3. Run setup.sh, which takes care of the conda environment installation and environment variables:
    bash setup.sh

(back to top)

Usage

This section explains how to run all experiments at once or execute specific methods individually, either via shell scripts or Python.

Running the Complete Sys2Bench Suite

To run all experiments in one go, execute the sys2bench.sh script:

bash sys2bench.sh

Running Specific Methods via Shell Scripts

If you prefer to run specific methods, each method has a corresponding shell script that outlines the necessary arguments. You can simply execute these scripts from the terminal for a quick setup.

Example: Running Chain of Thought (CoT) on the GSM8K dataset:

bash methods/CoT/gsm8k/cot.sh
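Conceptually, CoT replaces a direct question-to-answer prompt with one that elicits intermediate reasoning steps, usually seeded by a few worked exemplars. A toy illustration of that prompt shape (hypothetical; this is not the repo's actual GSM8K prompt):

```python
def build_cot_prompt(question, exemplar):
    """Build a one-shot CoT prompt: a worked exemplar with its rationale, then the new question."""
    return (
        f"Q: {exemplar['question']}\n"
        f"A: {exemplar['rationale']} The answer is {exemplar['answer']}.\n\n"
        f"Q: {question}\n"
        f"A: Let's think step by step."
    )

exemplar = {
    "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "rationale": "He starts with 3 and adds 2, so 3 + 2 = 5.",
    "answer": "5",
}
print(build_cot_prompt("A class has 12 boys and 9 girls. How many students are there?", exemplar))
```

The exemplar's rationale is what nudges the model to emit its own intermediate steps before the final answer, which is the core of the CoT technique the script above evaluates.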

Running Specific Methods via Python

Alternatively, you can run each method by passing the required arguments to the inference.py script associated with the dataset and method. This approach allows customization of parameters for experimentation.

Example: Running Tree of Thoughts (ToT) on the Game of 24 dataset:

python methods/ToT/game24/inference.py --base_lm openai --n_beam 5 --depth_limit 4 --openai_model gpt-4o-mini

Feel free to tweak the arguments to experiment with different configurations! 🚀
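The --n_beam and --depth_limit flags above map onto the generic shape of tree search over reasoning steps: expand each partial solution, keep the n_beam highest-scoring candidates at every level, and stop at depth_limit. Below is a toy sketch of that loop on a stand-in scoring problem (purely illustrative; the repo's actual ToT implementation and its LLM-based scoring differ):

```python
def tree_of_thoughts(expand, score, root, n_beam=5, depth_limit=4):
    """Generic beam search over 'thoughts': expand each candidate, keep the top n_beam."""
    beam = [root]
    for _ in range(depth_limit):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:n_beam]  # prune to the n_beam most promising states
    return max(beam, key=score)

# Stand-in problem: build a digit string whose digit sum is as large as possible.
expand = lambda s: [s + d for d in "0123456789"]
score = lambda s: sum(int(c) for c in s)
best = tree_of_thoughts(expand, score, root="", n_beam=3, depth_limit=4)
print(best)  # "9999": keeping high-scoring prefixes wins on this toy problem
```

Widening n_beam explores more alternatives per step and raising depth_limit allows longer reasoning chains; both increase inference-time compute, which is exactly the cost/performance tradeoff Sys2Bench measures.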

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the Apache-2.0 License. See LICENSE for more information.

(back to top)

Contact

(back to top)

Acknowledgments

This work was supported in part by National Institutes of Health under grant U01AG070112 and National Science Foundation under grant CNS-2328395.

We also acknowledge previous contributions from maitrix-org/llm-reasoners and karthikv792/LLMs-Planning. Additionally, we appreciate the Awesome ReadMe Template for providing a clean and structured README design.

Citation

If you found our work useful, please consider citing our preprint, "Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights":

@article{parashar2025inference,
  title={Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights},
  author={Parashar, Shubham and Olson, Blake and Khurana, Sambhav and Li, Eric and Ling, Hongyi and Caverlee, James and Ji, Shuiwang},
  journal={arXiv preprint arXiv:2502.12521},
  year={2025}
}

If you found AutoHD (our automated heuristics discovery method) useful, please consider citing our paper, "Complex LLM Planning via Automated Heuristics Discovery":

@article{ling2025complex,
  title={Complex LLM Planning via Automated Heuristics Discovery},
  author={Ling, Hongyi and Parashar, Shubham and Khurana, Sambhav and Olson, Blake and Basu, Anwesha and Sinha, Gaurangi and Tu, Zhengzhong and Caverlee, James and Ji, Shuiwang},
  journal={arXiv preprint arXiv:2502.19295},
  year={2025}
}
