
Sys2Bench is a benchmarking suite designed to evaluate reasoning and planning capabilities of large language models across algorithmic, logical, arithmetic, and common-sense reasoning tasks.


Sys2Bench

A curated benchmark evaluating the reasoning and planning abilities of Large Language Models.

Announcements

Table of Contents
  1. About The Project
  2. Tasks in Sys2Bench
  3. Getting Started
  4. Usage
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments
  9. Citation

About The Project

We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference.

Key Contributions:

  • We explore how scaling inference-time techniques can improve reasoning and planning, focusing on understanding the tradeoff between computational cost and performance.
  • To this end, we construct a comprehensive benchmark, known as Sys2Bench, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks across five categories, including arithmetic reasoning, logical reasoning, common sense reasoning, algorithmic reasoning, and planning.

(back to top)

Tasks in Sys2Bench

The table below provides an overview of the 11 datasets included in Sys2Bench, categorized into Algorithmic Reasoning, Planning, Arithmetic Reasoning, Logical Reasoning, and Common Sense Reasoning, along with their respective tasks, inputs, and outputs.

**Algorithmic Reasoning and Planning**

| Dataset | Category | Task | Input | Output |
| --- | --- | --- | --- | --- |
| Game of 24 | Algorithmic Reasoning | Propose an arithmetic expression that reaches 24. | A list of 4 numbers. | An arithmetic expression. |
| Bin Packing | Algorithmic Reasoning | Pack items into the fewest bins. | A list of item weights and the bin capacity. | A final list with items arranged in bins. |
| Blocksworld | Planning | Plan actions to transform blocks from the initial state to the goal state. | The initial and goal states of the blocks. | A sequence of actions as the plan. |
| Trip Plan | Planning | Plan a trip across cities for a set number of days. | Cities, days per city, total days, and possible flights. | A trip itinerary. |
| Calendar Plan | Planning | Schedule a meeting that respects everyone's time constraints. | Calendars with meetings and time constraints. | A meeting time fitting all schedules. |
| Rubik's Cube | Planning | Unscramble a scrambled 2×2 Rubik's Cube. | A scrambled 2×2 Rubik's Cube. | A sequence of rotations that unscrambles the cube. |

**Arithmetic, Logical, and Common Sense Reasoning**

| Dataset | Category | Task | Input | Output |
| --- | --- | --- | --- | --- |
| GSM8K | Arithmetic Reasoning | Solve grade school arithmetic problems. | An arithmetic problem description. | A numerical value. |
| AQuA | Arithmetic Reasoning | Solve algebraic problems. | An algebraic problem description. | A multiple-choice option. |
| ProntoQA | Logical Reasoning | Draw a logical conclusion from a set of predicates. | A clause to verify as true or false using logical predicates. | True or False, with reasoning. |
| StrategyQA | Common Sense Reasoning | Answer general knowledge questions. | A yes/no question. | Yes or No. |
| HotPotQA | Common Sense Reasoning | Answer general knowledge questions using provided facts. | A general knowledge question with supporting facts. | A short answer of 1 or 2 words. |
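For concreteness, a Game of 24 answer can be verified mechanically: the proposed expression must use each of the four input numbers exactly once and evaluate to 24. Below is a minimal, hypothetical checker written for this README (it is not part of the repo's evaluation code):

```python
import ast
from collections import Counter

def check_game24(numbers, expression):
    """Return True if `expression` uses each input number exactly once and equals 24."""
    tree = ast.parse(expression, mode="eval")
    # Collect every numeric literal appearing in the expression.
    used = [node.value for node in ast.walk(tree) if isinstance(node, ast.Constant)]
    if Counter(used) != Counter(numbers):
        return False  # must use exactly the given numbers, no more, no fewer
    try:
        # Evaluate with no builtins available; the expression is pure arithmetic.
        value = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
    except ZeroDivisionError:
        return False
    return abs(value - 24) < 1e-6

print(check_game24([4, 9, 10, 13], "(10 - 4) * (13 - 9)"))  # True
print(check_game24([4, 9, 10, 13], "4 + 9 + 10"))           # False
```

The same validate-the-final-answer pattern applies to the other tasks (e.g., replaying a Blocksworld plan or a cube rotation sequence), which is what makes these benchmarks automatically checkable.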

Getting Started

Follow the steps below to set up the repository and start using it.

Prerequisites

Before using this repository, ensure you have Conda installed. You can install Miniconda with Python 3.10 or later on Linux using:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init
source ~/.bashrc 

API Keys

To use OpenAI's models, you need an API key. We also use DeepInfra to run LLaMA models, but this is optional since those models can also be run locally.

  • OpenAI API Key (required)
    • Sign up or log in at OpenAI.
    • Navigate to API keys and create a new key.
  • Export the API key (Linux):
    export OPENAI_API_KEY="your-api-key-here"
    echo 'export OPENAI_API_KEY="your-api-key-here"' >> ~/.bashrc
    source ~/.bashrc
  • DeepInfra Token (optional): if you use DeepInfra, export DEEPINFRA_TOKEN="your-token-here" in the same way.
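At runtime, scripts typically read these keys from the environment rather than from files. A minimal sketch of that pattern, assuming the variable names exported above (the helper function itself is illustrative, not part of the repo):

```python
import os

def get_api_key(name="OPENAI_API_KEY"):
    """Fetch an API key from the environment, failing fast with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running experiments.")
    return key
```

Failing fast here is deliberate: a missing key surfaces immediately with an actionable message, instead of as an opaque authentication error midway through a long experiment run.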

Installation

To make installation easy, we provide step-by-step instructions for setting up the repo and getting started.

  1. Clone the repo
    git clone https://github.com/divelab/sys2bench.git
  2. Set Python Path
    export PYTHONPATH=/path/to/repo/sys2bench:$PYTHONPATH
  3. Run setup.sh, which takes care of the conda environment installation and environment variables:
    bash setup.sh

(back to top)

Usage

This section explains how to run all experiments at once or execute specific methods individually, either via shell scripts or Python.

Running the Complete Sys2Bench Suite

To run all experiments in one go, execute the sys2bench.sh script:

bash sys2bench.sh

Running Specific Methods via Shell Scripts

If you prefer to run specific methods, each method has a corresponding shell script that outlines the necessary arguments. You can simply execute these scripts from the terminal for a quick setup.

Example: Running Chain of Thought (CoT) on the GSM8K dataset:

bash methods/CoT/gsm8k/cot.sh
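Conceptually, CoT replaces a direct question-to-answer prompt with one that elicits intermediate reasoning steps, usually seeded by a few worked exemplars. A toy illustration of that prompt shape (hypothetical; this is not the repo's actual GSM8K prompt):

```python
def build_cot_prompt(question, exemplar):
    """Build a one-shot CoT prompt: a worked exemplar with its rationale, then the new question."""
    return (
        f"Q: {exemplar['question']}\n"
        f"A: {exemplar['rationale']} The answer is {exemplar['answer']}.\n\n"
        f"Q: {question}\n"
        f"A: Let's think step by step."
    )

exemplar = {
    "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "rationale": "He starts with 3 and adds 2, so 3 + 2 = 5.",
    "answer": "5",
}
print(build_cot_prompt("A class has 12 boys and 9 girls. How many students are there?", exemplar))
```

The exemplar's rationale is what nudges the model to emit its own intermediate steps before the final answer, which is the core of the CoT technique the script above evaluates.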

Running Specific Methods via Python

Alternatively, you can run each method by passing the required arguments to the inference.py script associated with the dataset and method. This approach allows customization of parameters for experimentation.

Example: Running Tree of Thoughts (ToT) on the Game of 24 dataset:

python methods/ToT/game24/inference.py --base_lm openai --n_beam 5 --depth_limit 4 --openai_model gpt-4o-mini

Feel free to tweak the arguments to experiment with different configurations! 🚀
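The --n_beam and --depth_limit flags above map onto the generic shape of tree search over reasoning steps: expand each partial solution, keep the n_beam highest-scoring candidates at every level, and stop at depth_limit. Below is a toy sketch of that loop on a stand-in scoring problem (purely illustrative; the repo's actual ToT implementation and its LLM-based scoring differ):

```python
def tree_of_thoughts(expand, score, root, n_beam=5, depth_limit=4):
    """Generic beam search over 'thoughts': expand each candidate, keep the top n_beam."""
    beam = [root]
    for _ in range(depth_limit):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:n_beam]  # prune to the n_beam most promising states
    return max(beam, key=score)

# Stand-in problem: build a digit string whose digit sum is as large as possible.
expand = lambda s: [s + d for d in "0123456789"]
score = lambda s: sum(int(c) for c in s)
best = tree_of_thoughts(expand, score, root="", n_beam=3, depth_limit=4)
print(best)  # "9999": keeping high-scoring prefixes wins on this toy problem
```

Widening n_beam explores more alternatives per step and raising depth_limit allows longer reasoning chains; both increase inference-time compute, which is exactly the cost/performance tradeoff Sys2Bench measures.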

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the Apache-2.0 License. See LICENSE for more information.

(back to top)

Contact

(back to top)

Acknowledgments

This work was supported in part by National Institutes of Health under grant U01AG070112 and National Science Foundation under grant CNS-2328395.

We also acknowledge previous contributions from maitrix-org/llm-reasoners and karthikv792/LLMs-Planning. Additionally, we appreciate the Awesome ReadMe Template for providing a clean and structured README design.

Citation

If you found our work useful, please consider citing our preprint, "Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights":

@article{parashar2025inference,
  title={Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights},
  author={Parashar, Shubham and Olson, Blake and Khurana, Sambhav and Li, Eric and Ling, Hongyi and Caverlee, James and Ji, Shuiwang},
  journal={arXiv preprint arXiv:2502.12521},
  year={2025}
}

If you found AutoHD (our automated heuristics discovery method) useful, please consider citing our paper, "Complex LLM Planning via Automated Heuristics Discovery":

@article{ling2025complex,
  title={Complex LLM Planning via Automated Heuristics Discovery},
  author={Ling, Hongyi and Parashar, Shubham and Khurana, Sambhav and Olson, Blake and Basu, Anwesha and Sinha, Gaurangi and Tu, Zhengzhong and Caverlee, James and Ji, Shuiwang},
  journal={arXiv preprint arXiv:2502.19295},
  year={2025}
}
