This repository has been archived by the owner on Jun 9, 2024. It is now read-only.

Commit

Merge pull request #40 from Significant-Gravitas/feat/basics
addition of basic challenges, easier challenge creation, --mock flag, adding mini-agi
waynehamadi authored Jun 28, 2023
2 parents a7972ad + 76ee994 commit 11303e2
Showing 26 changed files with 569 additions and 218 deletions.
3 changes: 3 additions & 0 deletions .env.example
@@ -0,0 +1,3 @@
AGENT_NAME=mini-agi
AGENT_TIMEOUT=60
MOCK_TEST=False
124 changes: 62 additions & 62 deletions README.md
@@ -2,73 +2,94 @@

A repo built for the purpose of benchmarking the performance of agents far and wide, regardless of how they are set up and how they work

## As a user

1. `pip install auto-gpt-benchmarks`
2. Add boilerplate code to run and kill your agent (a rough sketch of what this could look like is shown below)
3. `agbenchmark start`
- `--category challenge_category` to run tests in a specific category
   - `--mock` to only run mock tests if they exist for each test
   - `--noreg` to skip any tests that have passed in the past. When you run without this flag and a previously passing challenge fails, it will no longer be counted as a regression test
4. We call boilerplate code for your agent
5. Show pass rate of tests, logs, and any other metrics
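
For orientation, here is a rough sketch of what that run-and-kill boilerplate could look like. Everything in it is illustrative - the agent entry point, the argument handling, and the wrapper itself are assumptions, not agbenchmark's actual interface; only `AGENT_TIMEOUT` comes from the `.env.example` added in this commit.

```python
import os
import subprocess


def run_agent(task: str) -> None:
    """Start the agent on a task, then kill it once the timeout elapses (sketch only)."""
    timeout = int(os.getenv("AGENT_TIMEOUT", "60"))  # AGENT_TIMEOUT from .env.example
    # "miniagi.py" is a placeholder entry point for whichever agent you benchmark
    proc = subprocess.Popen(["python", "miniagi.py", task])
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # stop the agent if it runs past the timeout
```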

## Contributing

##### Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x

### To run the basic existing mock (June 21)
### To run the existing mocks

1. clone the repo `auto-gpt-benchmarks`
2. `pip install poetry`
3. `poetry shell`
4. `poetry install`
5. `agbenchmark start`
5. `cp .env_example .env`
6. `agbenchmark start --mock`
Keep config the same and watch the logs :)

### To run with mini-agi

1. Navigate to `auto-gpt-benchmarks/agent/mini-agi`
2. `pip install -r requirements.txt`
3. `cp .env_example .env`, set `PROMPT_USER=false` and add your `OPENAI_API_KEY=`. Set `MODEL="gpt-3.5-turbo"` if you don't have access to `gpt-4` yet. Also make sure you have Python 3.10+ installed
4. Make sure to follow the commands above, and remove the mock flag: `agbenchmark start`

- To add requirements `poetry add requirement`.

Feel free to create PRs to merge with `main` at will (but also feel free to ask for review) - if you can't, send a message in the R&D chat for access.

If you push at any point and break things - it'll happen to everyone - fix it ASAP. Step 1 is to revert `main` to the last working commit
If you push at any point and break things - it'll happen to everyone - fix it ASAP. Step 1 is to revert `master` to the last working commit

Let people know what beautiful code you write does, document everything well

Share your progress :)

## How this works

1. `pip install auto-gpt-benchmarks`
2. Add boilerplate code to start a webserver to your agent (run loop and stop condition)
3. `agbenchmark start --category challenge_category` remove the category flag to run all tests. Specify config of hostname, port, and workspace directory
4. We call the server to run the agent for each test
5. Show pass rate of tests, logs, and any other metrics

### To run the basic existing mock (June 21)

1. clone the repo `auto-gpt-benchmarks`
2. `pip install poetry`
3. `poetry shell`
4. `poetry install`
5. `agbenchmark start`
   Keep config the same and watch the logs :)

#### Bonuses

- You can add tests by git cloning auto-gpt-benchmarks to your repo
- The agent is abstracted from the benchmark, so you don't need to do any extra setup other than starting the server
- Simple, easy to use
- Don't have to deal with cloud or parallelization yet

### Pytest

An example of a test is below. Use it as a template and change the class name, the .json name, what the test depends on and its name, and the scoring logic

```python
import pytest
from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge
import os


class TestWriteFile(BasicChallenge):
    """Testing if LLM can write to a file"""

    def get_file_path(self) -> str:  # all tests must implement this method
        return os.path.join(os.path.dirname(__file__), "w_file_data.json")

    @pytest.mark.depends(on=[], name="basic_write_file")
    def test_method(self, workspace):
        # implement scoring logic by looking at workspace
```
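
As a hedged sketch (not part of this commit), the placeholder scoring logic above could be filled in with the `open_files` and `scoring` helpers this PR adds to the `Challenge` class:

```python
import os

import pytest

from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge


class TestWriteFile(BasicChallenge):
    """Testing if LLM can write to a file"""

    def get_file_path(self) -> str:
        return os.path.join(os.path.dirname(__file__), "w_file_data.json")

    @pytest.mark.depends(on=[], name="basic_write_file")
    def test_method(self, workspace):
        # read every file the ground truth points at, then score each one
        files_contents = self.open_files(workspace, self.data.ground.files)
        scores = [self.scoring(content, self.data.ground) for content in files_contents]
        assert 1.0 in scores
```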

to create a test:

```python
@pytest.mark.parametrize(
    "server_response",
    ["VARIABLE"],  # VARIABLE = the query/goal you provide to the model
    indirect=True,
)
@pytest.mark.(VARIABLE)  # VARIABLE = category of the test
def test_file_in_workspace(workspace):  # VARIABLE = the actual test that asserts
    assert os.path.exists(os.path.join(workspace, "file_to_check.txt"))
```

All challenges will inherit from a parent class which has the mark and any specific methods for their category

```python
@pytest.mark.basic
class BasicChallenge(Challenge):
    pass
```
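
Following that pattern, a hypothetical parent class for another category might look like this (a sketch, not something added in this commit):

```python
import pytest

from agbenchmark.Challenge import Challenge


@pytest.mark.retrieval  # hypothetical category mark, mirroring @pytest.mark.basic above
class RetrievalChallenge(Challenge):
    """Parent class for retrieval-category challenges; holds any helpers they share."""
```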

To create a file to test a challenge, add this to the challenge file which will create a file before running the server

```python
@pytest.fixture(
    scope="module", autouse=True
)  # this is specific to setting up a file for the test, not all tests have this
def setup_module(self, workspace):
    Challenge.write_to_file(
        workspace, self.data.ground.files[0], "this is how we're doing"
    )
```

#### The main Challenge class has all the parametrization and loading logic so that all tests can inherit from it. It lives within [this file](https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/blob/master/agbenchmark/Challenge.py)

## Api

FastAPI with REST, import requests to call in auto-gpt-benchmarks. Boilerplate code given to agent project to start server

## Workspace

Defined by the user on config
If the `--mock` flag is used it is at `agbenchmark/mocks/workspace`. Otherwise for mini-agi it is at `C:/Users/<name>/miniagi` - it will be automatically set in the config

#### Dataset

@@ -80,9 +101,9 @@ Manually created, existing challenges within Auto-Gpt, https://osu-nlp-group.git
|-- auto-gpt-benchmarks/ **main project directory**
| |-- metrics.py **combining scores, metrics, final evaluation**
| |-- start_benchmark.py **entry point from cli**
| |-- conftest.py **shared fixtures across all tests**
| |-- Challenge.py **easy challenge creation class?**
| |-- config.json **hostname, port, workspace folder**
| |-- conftest.py **config, workspace creation + teardown, regression test markers, parameterization**
| |-- Challenge.py **easy challenge creation class**
| |-- config.json **workspace folder**
| |-- challenges/ **challenges across different domains**
| | |-- adaptability/
| | |-- basic_abilities/
@@ -91,28 +112,7 @@ Manually created, existing challenges within Auto-Gpt, https://osu-nlp-group.git
| | |-- retrieval/
| | |-- web_navigation/
| | |-- writing/
| |-- tests/ **challenges across different metrics**
| | |-- basic_abilities/
| | |-- interface/
| |-- workspace/ **workspace related func**
| | |-- __init__.py
| | |-- workspace_manager.py **creation, deletion**
| |-- tests/
| | |-- basic_abilities/ **every llm should pass these challenges**
| | |-- regression/ **challenges that already passed**
```

### Easy Challenge Creation

TBD, but potentially a shared Challenge class that challenges instantiate, as challenges need different utils/metrics for eval

#### Written Challenges

For code and writing challenges we can create a reference text and use metrics like METEOR, BERTScore, and BARTScore

#### Validators

Designed to handle specific types of output (e.g., text, code, structured data)

#### Logging

Log different requests coming in - write file, change file, etc. Maybe a db in the future for metrics, logs, etc

Later: GitHub Actions integration, OpenAPI?, good versioning and backward compatibility
97 changes: 95 additions & 2 deletions agbenchmark/Challenge.py
@@ -1,18 +1,90 @@
import os
from typing import Optional
import glob
import pytest
from abc import ABC, abstractmethod
from agbenchmark.challenges.define_task_types import Ground
from agbenchmark.challenges.define_task_types import ChallengeData
from dotenv import load_dotenv, set_key

load_dotenv()

class Challenge:
mock_test_str = os.getenv("MOCK_TEST")
MOCK_TEST = mock_test_str.lower() == "true" if mock_test_str else False


class Challenge(ABC):
    """The parent class to all specific challenges classes.
    Defines helper methods for running a challenge"""

    @abstractmethod
    def get_file_path(self) -> str:
        """This should be implemented by any class which inherits from BasicChallenge"""
        pass

    @property
    def data(self) -> ChallengeData:
        return ChallengeData.deserialize(self.get_file_path())

    @property
    def mock(self):
        return self.data.mock.mock_func if self.data.mock else None

    @property
    def task(self):
        return (
            self.data.mock.mock_task if self.data.mock and MOCK_TEST else self.data.task
        )

    @property
    def dependencies(self) -> list:
        print("self.data.dependencies", self.data.dependencies)
        return self.data.dependencies

    @property
    def name(self) -> str:
        print("self.data.name", self.data.name)
        return self.data.name

    @pytest.mark.parametrize(
        "run_agent",
        [(task, mock)],
        indirect=True,
    )
    @pytest.mark.parametrize(
        "challenge_data",
        [data],
        indirect=True,
    )
    def test_method(self, workspace):
        raise NotImplementedError

    @staticmethod
    def open_file(workspace: str, filename: str):
        script_dir = os.path.abspath(workspace)
        workspace_dir = os.path.join(script_dir, filename)
        with open(workspace_dir, "r") as f:
            return f.read()

    @staticmethod
    def open_files(workspace: str, file_patterns: list):
        script_dir = os.path.abspath(workspace)
        files_contents = []

        for file_pattern in file_patterns:
            # Check if it is a file extension
            if file_pattern.startswith("."):
                # Find all files with the given extension in the workspace
                matching_files = glob.glob(os.path.join(script_dir, "*" + file_pattern))
            else:
                # Otherwise, it is a specific file
                matching_files = [os.path.join(script_dir, file_pattern)]

            for file_path in matching_files:
                with open(file_path, "r") as f:
                    files_contents.append(f.read())

        return files_contents

    @staticmethod
    def write_to_file(workspace: str, filename: str, content: str):
        script_dir = os.path.abspath(workspace)
@@ -30,3 +102,24 @@ def get_filenames_in_workspace(self, workspace: str):
            for filename in os.listdir(workspace)
            if os.path.isfile(os.path.join(workspace, filename))
        ]

    def scoring(self, content: str, ground: Ground):
        if ground.should_contain:
            for should_contain_word in ground.should_contain:
                if should_contain_word not in content:
                    return 0.0
                else:
                    print(
                        f"Word that should exist: {should_contain_word} exists in the content"
                    )

        if ground.should_not_contain:
            for should_not_contain_word in ground.should_not_contain:
                if should_not_contain_word in content:
                    return 0.0
                else:
                    print(
                        f"Word that should not exist: {should_not_contain_word} does not exist in the content"
                    )

        return 1.0
51 changes: 30 additions & 21 deletions agbenchmark/challenges/README.md
@@ -4,40 +4,49 @@

Input:

- **category** (str): information-retrieval
- **difficulty** (str): the difficulty of this query. choices from

## Information-retrieval challenges

Input:

- **category** (str): information-retrieval
- **task** (str): the question the agent needs to solve.
- **name** (str): Name of the challenge.
- **category** (str[]): Category of the challenge such as 'basic', 'retrieval', 'comprehension', etc. _this is not currently used; it may be needed in the future_
- **task** (str): The task that the agent needs to solve.
- **dependencies** (str[]): The dependencies that the challenge needs to run. Needs to be the full node to the test function.
- **ground** (dict): The ground truth.
- **answer** (str): The raw text of ground truth answer
- **should_contain** (list): the exact strings that are required in the final answer
- **should_not_contain** (list): the exact strings that should not be in the final answer
- **files**: files that are used for retrieval. Can specify a file here or an extension **TODO:** like .txt
- **difficulty** (str): the difficulty of this query. choices from
- **mock_func**: function to mock the agent's response. This is used for testing purposes
- **answer** (str): The raw text of the ground truth answer.
- **should_contain** (list): The exact strings that are required in the final answer.
- **should_not_contain** (list): The exact strings that should not be in the final answer.
- **files** (list): Files that are used for retrieval. Can specify file here or an extension.
- **mock** (dict): Mock response for testing.
- **mock_func** (str): Function to mock the agent's response. This is used for testing purposes.
- **mock_task** (str): Task to provide for the mock function.
- **info** (dict): Additional info about the challenge.
- **difficulty** (str): The difficulty of this query.
- **description** (str): Description of the challenge.
- **side_effects** (str[]): Describes the effects of the challenge.

Example:

```python
{
    "category": "retrieval",
    "task": "What is the capital of America?",
    "name": "basic_write_file",
    "category": ["basic"],
    "task": "Print the capital of America to a .txt file",
    "dependencies": [],
    "ground": {
        "answer": "Washington",
        "should_contain": ["Washington"],
        "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
        "files": ["file_to_check.txt"]
        "files": [".txt"]
    },
    "mock": {
        "mock_func": "basic_write_file_mock",
        "mock_task": "What is the capital of America?"
    },
    "difficulty": "easy"
    "info": {
        "difficulty": "basic",
        "description": "Tests the writing to file",
        "side_effects": ["tests if there is in fact an LLM attached"]
    }
}
```
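
For orientation, here is a rough sketch of the kind of function `mock_func` could point to. The function name comes from the example above, but the signature and the idea of writing straight into the workspace via `Challenge.write_to_file` are assumptions about how a mock might work, not the commit's actual mock code:

```python
from agbenchmark.Challenge import Challenge


def basic_write_file_mock(task: str, workspace: str) -> None:
    """Stand in for the agent: write a plausible answer file directly into the workspace."""
    # "file_to_check.txt" and the content are illustrative; a real mock would satisfy
    # the challenge's ground truth (should_contain "Washington", files ending in .txt)
    Challenge.write_to_file(workspace, "file_to_check.txt", "Washington")
```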

Output:
Current Output:

- **score** (float): scores range from [0, 1]