
safety challenges, adaptability challenges, suite same_task #177

Merged
merged 33 commits into master from feat/smooth-challenges on Jul 24, 2023

Conversation

@SilenNaihin (Contributor) commented on Jul 21, 2023

Background

Smooth challenge scaling was lacking before this point: the only way to scale a challenge smoothly was to copy and paste it and slightly tweak the task description. This is the first PR that sets up the backbone for suites, which allow a suite.json file to be defined (see below).

Changes

  • Added suite.json
  • artifacts_in, artifacts_out, and custom_python are defined at the suite level for same_task suites
  • Suites can be run through --suite with the prefix from suite.json (see the example invocation after this list). This generates a single test with 3 regression infos and internal infos. It should work with all other flags and with plain agbenchmark start (it has been tested, but there may be edge cases).
  • Added no_dep to run without dependencies, which can be useful for testing
  • Added 3 adaptability tests, 3 safety tests, and 4 code tests
  • Lots of bug fixes and quality-of-life improvements

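For reference, invoking a suite looks roughly like the commands below. The --suite flag taking the suite's prefix comes straight from this description and the suite.json example further down; the exact no_dep spelling is an assumption based on the change list above, so treat this as an illustrative sketch rather than canonical CLI usage.

# run the whole benchmark as before
agbenchmark start

# run only the suite whose suite.json defines "prefix": "TestRevenueRetrieval"
agbenchmark start --suite TestRevenueRetrieval

# assumed flag spelling: the same run without dependency ordering, useful for testing
agbenchmark start --suite TestRevenueRetrieval --no_dep
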
All tests within a suite folder must start with the prefix defined in suite.json. There are two types of suites.

same_task

If same_task is set to true, all of the data.json files are combined into one test. A single test runs, but multiple regression entries, internal infos, dependencies, and reports are created. The artifacts_in/artifacts_out and custom Python should live in the suite folder, since they are shared between tests. An example of this can be found in agbenchmark/challenges/retrieval/r2_search_suite_1; a sketch of that folder layout follows the suite.json example below.

{
  "same_task": true,
  "prefix": "TestRevenueRetrieval",
  "dependencies": ["TestBasicRetrieval"],
  "cutoff": 60,
  "task": "Write tesla's exact revenue in 2022 into a .txt file. Use the US notation, with a precision rounded to the nearest million dollars (for instance, $31,578 billion).",
  "shared_category": ["retrieval"]
}

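For orientation, a same_task suite folder matching the config above might be laid out roughly as follows. The two numbered subfolders are taken from the data_path values in the report below; the artifacts_in, artifacts_out, and custom_python names are assumptions based on the suite-level artifact description above, so the exact folder names may differ.

agbenchmark/challenges/retrieval/r2_search_suite_1/
├── suite.json            # shared task, prefix, cutoff, dependencies
├── artifacts_in/         # assumed name: input artifacts shared by all sub-challenges
├── artifacts_out/        # assumed name: expected output artifacts
├── custom_python/        # assumed name: shared custom evaluation code
├── 1_tesla_revenue/
│   └── data.json         # per-challenge data, combined into the single suite test
└── 2_specific/
    └── data.json
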
The structure for a same_task report looks like this:

"TestRevenueRetrieval": {
            "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1",
            "task": "Write tesla's exact revenue in 2022 into a .txt file. Use the US notation, with a precision rounded to the nearest million dollars (for instance, $31,578 billion).",
            "category": [
                "retrieval"
            ],
            "metrics": {
                "percentage": 100.0,
                "highest_difficulty": "intermediate",
                "run_time": "0.016 seconds"
            },
            "tests": {
                "TestRevenueRetrieval_1.0": {
                    "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1/1_tesla_revenue/data.json",
                    "is_regression": false,
                    "answer": "It was $81.462 billion in 2022.",
                    "description": "A no guardrails search for info",
                    "metrics": {
                        "difficulty": "novice",
                        "success": true,
                        "non_mock_success_%": 100.0
                    }
                },
                "TestRevenueRetrieval_1.1": {
                    "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1/2_specific/data.json",
                    "is_regression": false,
                    "answer": "It was $81.462 billion in 2022.",
                    "description": "This one checks the accuracy of the information over r2",
                    "metrics": {
                        "difficulty": "novice",
                        "success": true,
                        "non_mock_success_%": 0
                    }
                }
            },
            "reached_cutoff": false
        },

non same_task

If same_task is set to false, the main functionality added is being able to run the suite via the --suite flag, plus the ability to run the tests in reverse order (which can't work otherwise). This should also generate a single report, similar to the one above, again with a % pass rate.

{
  "same_task": false,
  "reverse_order": true,
  "prefix": "TestReturnCode"
}

The structure for a non same_task report looks like this:

"TestReturnCode": {
            "data_path": "agbenchmark/challenges/code/c1_writing_suite_1",
            "metrics": {
                "percentage": 0.0,
                "highest_difficulty": "No successful tests",
                "run_time": "15.972 seconds"
            },
            "tests": {
                "TestReturnCode_Simple": {
                    "data_path": "agbenchmark/challenges/code/c1_writing_suite_1/1_return/data.json",
                    "is_regression": false,
                    "category": [
                        "code",
                        "iterate"
                    ],
                    "task": "Return the multiplied number in the function multiply_int in code.py. You can make sure you have correctly done this by running test.py",
                    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
                    "description": "Simple test if a simple code instruction can be executed",
                    "metrics": {
                        "difficulty": "basic",
                        "success": false,
                        "fail_reason": "assert 1 in [0.0]",
                        "success_%": 0.0,
                        "run_time": "15.96 seconds"
                    },
                    "reached_cutoff": false
                },
                "TestReturnCode_Write": {
                    "data_path": "agbenchmark/challenges/code/c1_writing_suite_1/2_write/data.json",
                    "is_regression": false,
                    "category": [
                        "code",
                        "iterate"
                    ],
                    "task": "Add a function called multiply_int in code.py that multiplies numbers by 2. You can make sure you have correctly done this by running test.py",
                    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
                    "description": "Small step up, just writing the function with a name as well as the return statement.",
                    "metrics": {
                        "difficulty": "novice",
                        "success": false,
                        "fail_reason": "agbenchmark/challenges/test_all.py::TestReturnCode_Write::test_method[challenge_data0] depends on agbenchmark/challenges/test_all.py::TestReturnCode_Simple::test_method[challenge_data0]",
                        "success_%": 0.0,
                        "run_time": "0.004 seconds"
                    },
                    "reached_cutoff": false
                }
            }
        }

TODO:

  • Add suites to all challenges, including coding and retrieval. The adaptability challenges should just live within each respective suite with a safety category marker

PR Quality Checklist

  • I have run the following commands against my code to ensure it passes our linters:
    black . --exclude test.py
    isort .
    mypy .
    autoflake --remove-all-unused-imports --recursive --ignore-init-module-imports --ignore-pass-after-docstring --in-place agbenchmark

@waynehamadi merged commit d9b3d7d into master on Jul 24, 2023
@waynehamadi deleted the feat/smooth-challenges branch on July 24, 2023 at 20:57