This repository has been archived by the owner on Jun 9, 2024. It is now read-only.

Commit

Merge pull request #40 from Significant-Gravitas/feat/basics
addition of basic challenges, easier challenge creation, --mock flag, adding mini-agi
waynehamadi authored Jun 28, 2023
2 parents a7972ad + 76ee994 commit 11303e2
Showing 26 changed files with 569 additions and 218 deletions.
3 changes: 3 additions & 0 deletions .env.example
@@ -0,0 +1,3 @@
AGENT_NAME=mini-agi
AGENT_TIMEOUT=60
MOCK_TEST=False
124 changes: 62 additions & 62 deletions README.md
@@ -2,73 +2,94 @@

A repo built for the purpose of benchmarking the performance of agents far and wide, regardless of how they are set up and how they work

## As a user

1. `pip install auto-gpt-benchmarks`
2. Add boilerplate code to run and kill your agent (a rough sketch of what this could look like is shown below)
3. `agbenchmark start`
- `--category challenge_category` to run tests in a specific category
   - `--mock` to only run mock tests if they exist for each test
   - `--noreg` to skip any tests that have passed in the past. When you run without this flag and a previously passing challenge fails, it will no longer be counted as a regression test
4. We call boilerplate code for your agent
5. Show pass rate of tests, logs, and any other metrics
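
For orientation, here is a rough sketch of what that run-and-kill boilerplate could look like. Everything in it is illustrative - the agent entry point, the argument handling, and the wrapper itself are assumptions, not agbenchmark's actual interface; only `AGENT_TIMEOUT` comes from the `.env.example` added in this commit.

```python
import os
import subprocess


def run_agent(task: str) -> None:
    """Start the agent on a task, then kill it once the timeout elapses (sketch only)."""
    timeout = int(os.getenv("AGENT_TIMEOUT", "60"))  # AGENT_TIMEOUT from .env.example
    # "miniagi.py" is a placeholder entry point for whichever agent you benchmark
    proc = subprocess.Popen(["python", "miniagi.py", task])
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # stop the agent if it runs past the timeout
```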

## Contributing

##### Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x

### To run the basic existing mock (June 21)
### To run the existing mocks

1. clone the repo `auto-gpt-benchmarks`
2. `pip install poetry`
3. `poetry shell`
4. `poetry install`
5. `agbenchmark start`
5. `cp .env_example .env`
6. `agbenchmark start --mock`
Keep config the same and watch the logs :)

### To run with mini-agi

1. Navigate to `auto-gpt-benchmarks/agent/mini-agi`
2. `pip install -r requirements.txt`
3. `cp .env_example .env`, set `PROMPT_USER=false` and add your `OPENAI_API_KEY=`. Set `MODEL="gpt-3.5-turbo"` if you don't have access to `gpt-4` yet. Also make sure you have Python 3.10+ installed
4. Make sure to follow the commands above, and remove the mock flag: `agbenchmark start`

- To add requirements `poetry add requirement`.

Feel free to create PRs to merge with `main` at will (but also feel free to ask for review) - if you can't, send a message in the R&D chat for access.

If you push at any point and break things - it'll happen to everyone - fix it ASAP. Step 1 is to revert `main` to the last working commit
If you push at any point and break things - it'll happen to everyone - fix it ASAP. Step 1 is to revert `master` to the last working commit

Let people know what beautiful code you write does, document everything well

Share your progress :)

## How this works

1. `pip install auto-gpt-benchmarks`
2. Add boilerplate code to start a webserver to your agent (run loop and stop condition)
3. `agbenchmark start --category challenge_category` remove the category flag to run all tests. Specify config of hostname, port, and workspace directory
4. We call the server to run the agent for each test
5. Show pass rate of tests, logs, and any other metrics

### To run the basic existing mock (June 21)

1. clone the repo `auto-gpt-benchmarks`
2. `pip install poetry`
3. `poetry shell`
4. `poetry install`
5. `agbenchmark start`
   Keep config the same and watch the logs :)

#### Bonuses

- You can add tests by git cloning auto-gpt-benchmarks to your repo
- The agent is abstracted from the benchmark, so you don't need to do any extra setup other than starting the server
- Simple, easy to use
- Don't have to deal with cloud or parallelization yet

### Pytest

An example of a test is below. Use it as a template and change the class name, the .json name, what the test depends on and its name, and the scoring logic

```python
import pytest
from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge
import os


class TestWriteFile(BasicChallenge):
    """Testing if LLM can write to a file"""

    def get_file_path(self) -> str:  # all tests must implement this method
        return os.path.join(os.path.dirname(__file__), "w_file_data.json")

    @pytest.mark.depends(on=[], name="basic_write_file")
    def test_method(self, workspace):
        # implement scoring logic by looking at workspace
```
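
As a hedged sketch (not part of this commit), the placeholder scoring logic above could be filled in with the `open_files` and `scoring` helpers this PR adds to the `Challenge` class:

```python
import os

import pytest

from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge


class TestWriteFile(BasicChallenge):
    """Testing if LLM can write to a file"""

    def get_file_path(self) -> str:
        return os.path.join(os.path.dirname(__file__), "w_file_data.json")

    @pytest.mark.depends(on=[], name="basic_write_file")
    def test_method(self, workspace):
        # read every file the ground truth points at, then score each one
        files_contents = self.open_files(workspace, self.data.ground.files)
        scores = [self.scoring(content, self.data.ground) for content in files_contents]
        assert 1.0 in scores
```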

to create a test:

```python
@pytest.mark.parametrize(
    "server_response",
    ["VARIABLE"],  # VARIABLE = the query/goal you provide to the model
    indirect=True,
)
@pytest.mark.(VARIABLE)  # VARIABLE = category of the test
def test_file_in_workspace(workspace):  # VARIABLE = the actual test that asserts
    assert os.path.exists(os.path.join(workspace, "file_to_check.txt"))
```

All challenges will inherit from a parent class which has the mark and any specific methods for their category

```python
@pytest.mark.basic
class BasicChallenge(Challenge):
    pass
```
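
Following that pattern, a hypothetical parent class for another category might look like this (a sketch, not something added in this commit):

```python
import pytest

from agbenchmark.Challenge import Challenge


@pytest.mark.retrieval  # hypothetical category mark, mirroring @pytest.mark.basic above
class RetrievalChallenge(Challenge):
    """Parent class for retrieval-category challenges; holds any helpers they share."""
```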

To create a file to test a challenge, add this to the challenge file which will create a file before running the server

```python
@pytest.fixture(
    scope="module", autouse=True
)  # this is specific to setting up a file for the test, not all tests have this
def setup_module(self, workspace):
    Challenge.write_to_file(
        workspace, self.data.ground.files[0], "this is how we're doing"
    )
```

#### The main Challenge class has all the parametrization and loading logic so that all tests can inherit from it. It lives within [this file](https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/blob/master/agbenchmark/Challenge.py)

## Api

FastAPI with REST, import requests to call in auto-gpt-benchmarks. Boilerplate code given to agent project to start server

## Workspace

Defined by the user on config
If the `--mock` flag is used it is at `agbenchmark/mocks/workspace`. Otherwise for mini-agi it is at `C:/Users/<name>/miniagi` - it will be automatically set in the config

#### Dataset

@@ -80,9 +101,9 @@ Manually created, existing challenges within Auto-Gpt, https://osu-nlp-group.git
|-- auto-gpt-benchmarks/ **main project directory**
| |-- metrics.py **combining scores, metrics, final evaluation**
| |-- start_benchmark.py **entry point from cli**
| |-- conftest.py **shared fixtures across all tests**
| |-- Challenge.py **easy challenge creation class?**
| |-- config.json **hostname, port, workspace folder**
| |-- conftest.py **config, workspace creation + teardown, regression test markers, parameterization**
| |-- Challenge.py **easy challenge creation class**
| |-- config.json **workspace folder**
| |-- challenges/ **challenges across different domains**
| | |-- adaptability/
| | |-- basic_abilities/
@@ -91,28 +112,7 @@ Manually created, existing challenges within Auto-Gpt, https://osu-nlp-group.git
| | |-- retrieval/
| | |-- web_navigation/
| | |-- writing/
| |-- tests/ **challenges across different metrics**
| | |-- basic_abilities/
| | |-- interface/
| |-- workspace/ **workspace related func**
| | |-- __init__.py
| | |-- workspace_manager.py **creation, deletion**
| |-- tests/
| | |-- basic_abilities/ **every llm should pass these challenges**
| | |-- regression/ **challenges that already passed**
```

### Easy Challenge Creation

TBD, but potentially a shared Challenge class that challenges instantiate, as challenges need different utils/metrics for eval

#### Written Challenges

For code and writing challenges we can create a reference text and use metrics like METEOR, BERTScore, and BARTScore

#### Validators

Designed to handle specific types of output (e.g., text, code, structured data)

#### Logging

Log different requests coming in - write file, change file, etc. Maybe a db in the future for metrics, logs, etc

Later: GitHub Actions integration, OpenAPI?, good versioning and backward compatibility
97 changes: 95 additions & 2 deletions agbenchmark/Challenge.py
@@ -1,18 +1,90 @@
import os
from typing import Optional
import glob
import pytest
from abc import ABC, abstractmethod
from agbenchmark.challenges.define_task_types import Ground
from agbenchmark.challenges.define_task_types import ChallengeData
from dotenv import load_dotenv, set_key

load_dotenv()

class Challenge:
mock_test_str = os.getenv("MOCK_TEST")
MOCK_TEST = mock_test_str.lower() == "true" if mock_test_str else False


class Challenge(ABC):
    """The parent class to all specific challenges classes.
    Defines helper methods for running a challenge"""

    @abstractmethod
    def get_file_path(self) -> str:
        """This should be implemented by any class which inherits from BasicChallenge"""
        pass

    @property
    def data(self) -> ChallengeData:
        return ChallengeData.deserialize(self.get_file_path())

    @property
    def mock(self):
        return self.data.mock.mock_func if self.data.mock else None

    @property
    def task(self):
        return (
            self.data.mock.mock_task if self.data.mock and MOCK_TEST else self.data.task
        )

    @property
    def dependencies(self) -> list:
        print("self.data.dependencies", self.data.dependencies)
        return self.data.dependencies

    @property
    def name(self) -> str:
        print("self.data.name", self.data.name)
        return self.data.name

    @pytest.mark.parametrize(
        "run_agent",
        [(task, mock)],
        indirect=True,
    )
    @pytest.mark.parametrize(
        "challenge_data",
        [data],
        indirect=True,
    )
    def test_method(self, workspace):
        raise NotImplementedError

    @staticmethod
    def open_file(workspace: str, filename: str):
        script_dir = os.path.abspath(workspace)
        workspace_dir = os.path.join(script_dir, filename)
        with open(workspace_dir, "r") as f:
            return f.read()

    @staticmethod
    def open_files(workspace: str, file_patterns: list):
        script_dir = os.path.abspath(workspace)
        files_contents = []

        for file_pattern in file_patterns:
            # Check if it is a file extension
            if file_pattern.startswith("."):
                # Find all files with the given extension in the workspace
                matching_files = glob.glob(os.path.join(script_dir, "*" + file_pattern))
            else:
                # Otherwise, it is a specific file
                matching_files = [os.path.join(script_dir, file_pattern)]

            for file_path in matching_files:
                with open(file_path, "r") as f:
                    files_contents.append(f.read())

        return files_contents

    @staticmethod
    def write_to_file(workspace: str, filename: str, content: str):
        script_dir = os.path.abspath(workspace)
@@ -30,3 +102,24 @@ def get_filenames_in_workspace(self, workspace: str):
            for filename in os.listdir(workspace)
            if os.path.isfile(os.path.join(workspace, filename))
        ]

    def scoring(self, content: str, ground: Ground):
        if ground.should_contain:
            for should_contain_word in ground.should_contain:
                if should_contain_word not in content:
                    return 0.0
                else:
                    print(
                        f"Word that should exist: {should_contain_word} exists in the content"
                    )

        if ground.should_not_contain:
            for should_not_contain_word in ground.should_not_contain:
                if should_not_contain_word in content:
                    return 0.0
                else:
                    print(
                        f"Word that should not exist: {should_not_contain_word} does not exist in the content"
                    )

        return 1.0
51 changes: 30 additions & 21 deletions agbenchmark/challenges/README.md
@@ -4,40 +4,49 @@

Input:

- **category** (str): information-retrieval
- **difficulty** (str): the difficulty of this query. choices from

## Information-retrieval challenges

Input:

- **category** (str): information-retrieval
- **task** (str): the question the agent needs to solve.
- **name** (str): Name of the challenge.
- **category** (str[]): Category of the challenge such as 'basic', 'retrieval', 'comprehension', etc. _this is not currently used; it may be needed in the future_
- **task** (str): The task that the agent needs to solve.
- **dependencies** (str[]): The dependencies that the challenge needs to run. Needs to be the full node to the test function.
- **ground** (dict): The ground truth.
- **answer** (str): The raw text of ground truth answer
- **should_contain** (list): the exact strings that are required in the final answer
- **should_not_contain** (list): the exact strings that should not be in the final answer
- **files**: files that are used for retrieval. Can specify a file here or an extension **TODO:** like .txt
- **difficulty** (str): the difficulty of this query. choices from
- **mock_func**: function to mock the agent's response. This is used for testing purposes
- **answer** (str): The raw text of the ground truth answer.
- **should_contain** (list): The exact strings that are required in the final answer.
- **should_not_contain** (list): The exact strings that should not be in the final answer.
- **files** (list): Files that are used for retrieval. Can specify file here or an extension.
- **mock** (dict): Mock response for testing.
- **mock_func** (str): Function to mock the agent's response. This is used for testing purposes.
- **mock_task** (str): Task to provide for the mock function.
- **info** (dict): Additional info about the challenge.
- **difficulty** (str): The difficulty of this query.
- **description** (str): Description of the challenge.
- **side_effects** (str[]): Describes the effects of the challenge.

Example:

```python
{
    "category": "retrieval",
    "task": "What is the capital of America?",
    "name": "basic_write_file",
    "category": ["basic"],
    "task": "Print the capital of America to a .txt file",
    "dependencies": [],
    "ground": {
        "answer": "Washington",
        "should_contain": ["Washington"],
        "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
        "files": ["file_to_check.txt"]
        "files": [".txt"]
    },
    "mock": {
        "mock_func": "basic_write_file_mock",
        "mock_task": "What is the capital of America?"
    },
    "difficulty": "easy"
    "info": {
        "difficulty": "basic",
        "description": "Tests the writing to file",
        "side_effects": ["tests if there is in fact an LLM attached"]
    }
}
```
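
For orientation, here is a rough sketch of the kind of function `mock_func` could point to. The function name comes from the example above, but the signature and the idea of writing straight into the workspace via `Challenge.write_to_file` are assumptions about how a mock might work, not the commit's actual mock code:

```python
from agbenchmark.Challenge import Challenge


def basic_write_file_mock(task: str, workspace: str) -> None:
    """Stand in for the agent: write a plausible answer file directly into the workspace."""
    # "file_to_check.txt" and the content are illustrative; a real mock would satisfy
    # the challenge's ground truth (should_contain "Washington", files ending in .txt)
    Challenge.write_to_file(workspace, "file_to_check.txt", "Washington")
```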

Output:
Current Output:

- **score** (float): scores range from [0, 1]