- [Feb 27, 2025] We release the code for our new inference-time technique, AutoHD, at `methods/AutoHD`. The paper is available at https://arxiv.org/abs/2502.19295.
We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference.
Key Contributions:
- We explore how scaling inference-time techniques can improve reasoning and planning, focusing on understanding the tradeoff between computational cost and performance.
- To this end, we construct a comprehensive benchmark, Sys2Bench, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks spanning five categories: arithmetic reasoning, logical reasoning, common sense reasoning, algorithmic reasoning, and planning.
The table below provides an overview of the 11 datasets included in Sys2Bench, categorized into Algorithmic Reasoning, Planning, Arithmetic Reasoning, Logical Reasoning, and Common Sense Reasoning, along with their respective tasks, inputs, and outputs.
| Dataset | Game of 24 | Binpacking | Blocksworld | Trip Plan | Calendar Plan | Rubik's Cube |
|---|---|---|---|---|---|---|
| Category | Algorithmic Reasoning | Algorithmic Reasoning | Planning | Planning | Planning | Planning |
| Task | Propose an arithmetic expression to reach 24. | Pack items into the fewest bins. | Plan actions to transform blocks from initial to goal state. | Plan a trip across cities for a set number of days. | Schedule a meeting considering time constraints of people. | Unscramble a scrambled 2×2 Rubik's Cube. |
| Input | A list of 4 numbers. | List of item weights and bin capacity. | Initial state of blocks and goal state. | Cities, days per city, total days, and possible flights. | Calendars with meetings and time constraints. | A scrambled 2×2 Rubik's Cube. |
| Output | An arithmetic expression. | Final list with items arranged in bins. | A sequence of actions as the plan. | A trip itinerary. | A meeting time fitting all schedules. | A sequence of rotations that unscramble the cube. |
| Dataset | GSM8K | AQuA | ProntoQA | StrategyQA | HotPotQA |
|---|---|---|---|---|---|
| Category | Arithmetic Reasoning | Arithmetic Reasoning | Logical Reasoning | Common Sense Reasoning | Common Sense Reasoning |
| Task | Solve high school arithmetic problems. | Solve algebraic problems. | Draw a logical conclusion from a set of predicates. | Answer general knowledge questions. | Answer general knowledge questions using provided facts. |
| Input | Arithmetic problem description. | Algebraic problem description. | A clause to verify as true or false using logical predicates. | A yes/no question. | General knowledge question with supporting facts. |
| Output | A numerical value. | A multiple-choice option. | True or False, and reasoning. | Yes or No. | Short answer of 1 or 2 words. |
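To make the expected input/output format concrete, here is a minimal answer checker for the Game of 24 task. This is an illustrative sketch, not the repo's evaluation code; the function name and answer format are assumptions.

```python
import re

def check_game24(numbers, expression):
    """Illustrative checker: `expression` must use exactly the given
    numbers (each once) and evaluate to 24."""
    # Extract the integer literals actually used in the expression.
    used = sorted(int(tok) for tok in re.findall(r"\d+", expression))
    if used != sorted(numbers):
        return False
    # eval is acceptable here because this sketch only sees arithmetic strings.
    return abs(eval(expression) - 24) < 1e-6

print(check_game24([4, 9, 10, 13], "(10 - 4) * (13 - 9)"))  # True
print(check_game24([4, 9, 10, 13], "4 + 9 + 10"))           # False
```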
Follow the steps below to set up the repository and start using it.
Before using this repository, ensure you have the following installed:
- Conda or Miniconda - Installation Guide
- Python >= 3.10
- CUDA >= 12.0
You can install Miniconda with Python 3.10 or later on Linux using:
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda3
$HOME/miniconda3/bin/conda init
source ~/.bashrc
```
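After installation, a quick sanity check of the version requirements can help. This is a minimal sketch: finding `nvcc` on the PATH is only a rough proxy for a CUDA install, so run `nvcc --version` to confirm the release is >= 12.0.

```python
import shutil
import sys

# Quick sanity check for the prerequisites listed above.
py_ok = sys.version_info >= (3, 10)  # Python >= 3.10
nvcc = shutil.which("nvcc")          # rough proxy for a CUDA toolkit install
print(f"Python >= 3.10: {py_ok}")
print(f"nvcc on PATH: {nvcc is not None}")
```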
API Keys
To use OpenAI's models, you need an API key. We also use DeepInfra to run LLaMA models, but this is optional since these models can also be run locally.
- OpenAI API Key (Required)
- Sign up or login at OpenAI
- Navigate to API keys and create a new key.
- Export the API key (Linux)
```bash
export OPENAI_API_KEY="your-api-key"
echo 'export OPENAI_API_KEY="your-api-key"' >> ~/.bashrc
source ~/.bashrc
```
- The same can be done for DeepInfra. If you use it, make sure to export the token the same way as the OpenAI key:
```bash
export DEEPINFRA_TOKEN='your token here'
```
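Before launching experiments, you can confirm the keys are visible to Python. This is a quick sanity check using the environment variable names above; the helper function is our own illustration, not part of the repo.

```python
import os

def key_status(names=("OPENAI_API_KEY", "DEEPINFRA_TOKEN")):
    """Report whether each credential is visible to the current process."""
    return {name: bool(os.environ.get(name)) for name in names}

print(key_status())
```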
To make installation easy, we provide step-by-step instructions for setting up our repo and getting started.
- Clone the repo:
  ```bash
  git clone https://github.com/divelab/sys2bench.git
  ```
- Set the Python path:
  ```bash
  export PYTHONPATH=/path/to/repo/sys2bench:$PYTHONPATH
  ```
- Run `setup.sh`, which takes care of the conda environment installation and environment variables.
This section explains how to run all experiments at once or execute specific methods individually, either via shell scripts or Python.
To run all experiments in one go, execute the `sys2bench.sh` script:

```bash
bash sys2bench.sh
```
If you prefer to run specific methods, each method has a corresponding shell script that outlines the necessary arguments. You can simply execute these scripts from the terminal for a quick setup.
Example: Running Chain of Thought (CoT) on the GSM8K dataset:
```bash
bash methods/CoT/gsm8k/cot.sh
```
Alternatively, you can run each method by passing the required arguments to the `inference.py` script associated with the dataset and method. This approach lets you customize parameters for experimentation.
Example: Running Tree of Thoughts (ToT) on the Game of 24 dataset:

```bash
python methods/ToT/game24/inference.py --base_lm openai --n_beam 5 --depth_limit 4 --openai_model gpt-4o-mini
```
Feel free to tweak the arguments to experiment with different configurations! 🚀
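For instance, to sweep several configurations of the ToT example above, you could generate the command lines programmatically. This is a sketch: the flags mirror the example above, and other methods may take different arguments.

```python
import itertools

# Base command from the ToT example above; flags may differ for other methods.
BASE = ("python methods/ToT/game24/inference.py "
        "--base_lm openai --openai_model gpt-4o-mini")

def sweep_commands(beams=(3, 5), depths=(3, 4)):
    """Build one command line per (n_beam, depth_limit) combination."""
    return [f"{BASE} --n_beam {b} --depth_limit {d}"
            for b, d in itertools.product(beams, depths)]

for cmd in sweep_commands():
    print(cmd)  # run each with, e.g., subprocess.run(shlex.split(cmd))
```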
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the Apache-2.0 License. See `LICENSE` for more information.
- Shubham Parashar - [email protected]
- Blake Olson - [email protected]
- Eric Li - [email protected]
- Hongyi Ling - [email protected]
This work was supported in part by National Institutes of Health under grant U01AG070112 and National Science Foundation under grant CNS-2328395.
We also acknowledge previous contributions from maitrix-org/llm-reasoners and karthikv792/LLMs-Planning. Additionally, we appreciate the Awesome ReadMe Template for providing a clean and structured README design.
If you found our work useful, please consider citing our preprint - "Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights"
@article{parashar2025inference,
title={Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights},
author={Parashar, Shubham and Olson, Blake and Khurana, Sambhav and Li, Eric and Ling, Hongyi and Caverlee, James and Ji, Shuiwang},
journal={arXiv preprint arXiv:2502.12521},
year={2025}
}
If you found AutoHD useful, please consider citing our paper - "Complex LLM Planning via Automated Heuristics Discovery"
@article{ling2025complex,
title={Complex LLM Planning via Automated Heuristics Discovery},
author={Ling, Hongyi and Parashar, Shubham and Khurana, Sambhav and Olson, Blake and Basu, Anwesha and Sinha, Gaurangi and Tu, Zhengzhong and Caverlee, James and Ji, Shuiwang},
journal={arXiv preprint arXiv:2502.19295},
year={2025}
}