
safety challenges, adaptability challenges, suite same_task #177

Merged
merged 33 commits into master from feat/smooth-challenges on Jul 24, 2023

Conversation

@SilenNaihin (Contributor) commented on Jul 21, 2023

Background

Smooth challenge scaling was lacking before this point: the only way to scale a challenge smoothly was to copy and paste it and slightly tweak the task description. This is the first PR that sets up the backbone for suites, which allow a suite.json file to be defined (see below).

Changes

  • Added suite.json
  • artifacts_in, artifacts_out, and custom_python are defined at the suite level for same_task suites
  • Suites can be run through --suite with the prefix from suite.json (see the example invocation after this list). This generates a single test with 3 regression infos and internal infos. It should work with all other flags and with plain agbenchmark start (it has been tested, but there may be edge cases).
  • Added no_dep to run without dependencies, which can be useful for testing
  • Added 3 adaptability tests, 3 safety tests, and 4 code tests
  • Lots of bug fixes and quality-of-life improvements

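For reference, invoking a suite looks roughly like the commands below. The --suite flag taking the suite's prefix comes straight from this description and the suite.json example further down; the exact no_dep spelling is an assumption based on the change list above, so treat this as an illustrative sketch rather than canonical CLI usage.

# run the whole benchmark as before
agbenchmark start

# run only the suite whose suite.json defines "prefix": "TestRevenueRetrieval"
agbenchmark start --suite TestRevenueRetrieval

# assumed flag spelling: the same run without dependency ordering, useful for testing
agbenchmark start --suite TestRevenueRetrieval --no_dep
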
All tests within a suite folder must start with the prefix defined in suite.json. There are two types of suites.

same_task

If same_task is set to true, all of the data.json files are combined into one test. A single test runs, but multiple regression entries, internal infos, dependencies, and reports are created. The artifacts_in/artifacts_out and custom Python should live in the suite folder, since they are shared between tests. An example of this can be found in agbenchmark/challenges/retrieval/r2_search_suite_1; a sketch of that folder layout follows the suite.json example below.

{
  "same_task": true,
  "prefix": "TestRevenueRetrieval",
  "dependencies": ["TestBasicRetrieval"],
  "cutoff": 60,
  "task": "Write tesla's exact revenue in 2022 into a .txt file. Use the US notation, with a precision rounded to the nearest million dollars (for instance, $31,578 billion).",
  "shared_category": ["retrieval"]
}

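For orientation, a same_task suite folder matching the config above might be laid out roughly as follows. The two numbered subfolders are taken from the data_path values in the report below; the artifacts_in, artifacts_out, and custom_python names are assumptions based on the suite-level artifact description above, so the exact folder names may differ.

agbenchmark/challenges/retrieval/r2_search_suite_1/
├── suite.json            # shared task, prefix, cutoff, dependencies
├── artifacts_in/         # assumed name: input artifacts shared by all sub-challenges
├── artifacts_out/        # assumed name: expected output artifacts
├── custom_python/        # assumed name: shared custom evaluation code
├── 1_tesla_revenue/
│   └── data.json         # per-challenge data, combined into the single suite test
└── 2_specific/
    └── data.json
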
The structure for a same_task report looks like this:

"TestRevenueRetrieval": {
            "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1",
            "task": "Write tesla's exact revenue in 2022 into a .txt file. Use the US notation, with a precision rounded to the nearest million dollars (for instance, $31,578 billion).",
            "category": [
                "retrieval"
            ],
            "metrics": {
                "percentage": 100.0,
                "highest_difficulty": "intermediate",
                "run_time": "0.016 seconds"
            },
            "tests": {
                "TestRevenueRetrieval_1.0": {
                    "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1/1_tesla_revenue/data.json",
                    "is_regression": false,
                    "answer": "It was $81.462 billion in 2022.",
                    "description": "A no guardrails search for info",
                    "metrics": {
                        "difficulty": "novice",
                        "success": true,
                        "non_mock_success_%": 100.0
                    }
                },
                "TestRevenueRetrieval_1.1": {
                    "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1/2_specific/data.json",
                    "is_regression": false,
                    "answer": "It was $81.462 billion in 2022.",
                    "description": "This one checks the accuracy of the information over r2",
                    "metrics": {
                        "difficulty": "novice",
                        "success": true,
                        "non_mock_success_%": 0
                    }
                }
            },
            "reached_cutoff": false
        },

non same_task

If same_task is set to false, the main functionality added is being able to run the suite via the --suite flag, plus the ability to run the tests in reverse order (which can't work otherwise). This should also generate a single report, similar to the one above, again with a % pass rate.

{
  "same_task": false,
  "reverse_order": true,
  "prefix": "TestReturnCode"
}

The structure for a non same_task report looks like this:

"TestReturnCode": {
            "data_path": "agbenchmark/challenges/code/c1_writing_suite_1",
            "metrics": {
                "percentage": 0.0,
                "highest_difficulty": "No successful tests",
                "run_time": "15.972 seconds"
            },
            "tests": {
                "TestReturnCode_Simple": {
                    "data_path": "agbenchmark/challenges/code/c1_writing_suite_1/1_return/data.json",
                    "is_regression": false,
                    "category": [
                        "code",
                        "iterate"
                    ],
                    "task": "Return the multiplied number in the function multiply_int in code.py. You can make sure you have correctly done this by running test.py",
                    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
                    "description": "Simple test if a simple code instruction can be executed",
                    "metrics": {
                        "difficulty": "basic",
                        "success": false,
                        "fail_reason": "assert 1 in [0.0]",
                        "success_%": 0.0,
                        "run_time": "15.96 seconds"
                    },
                    "reached_cutoff": false
                },
                "TestReturnCode_Write": {
                    "data_path": "agbenchmark/challenges/code/c1_writing_suite_1/2_write/data.json",
                    "is_regression": false,
                    "category": [
                        "code",
                        "iterate"
                    ],
                    "task": "Add a function called multiply_int in code.py that multiplies numbers by 2. You can make sure you have correctly done this by running test.py",
                    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
                    "description": "Small step up, just writing the function with a name as well as the return statement.",
                    "metrics": {
                        "difficulty": "novice",
                        "success": false,
                        "fail_reason": "agbenchmark/challenges/test_all.py::TestReturnCode_Write::test_method[challenge_data0] depends on agbenchmark/challenges/test_all.py::TestReturnCode_Simple::test_method[challenge_data0]",
                        "success_%": 0.0,
                        "run_time": "0.004 seconds"
                    },
                    "reached_cutoff": false
                }
            }
        }

TODO:

  • Add suites to all challenges, including coding and retrieval. The adaptability challenges should just live within each respective suite with a safety category marker

PR Quality Checklist

  • I have run the following commands against my code to ensure it passes our linters:
    black . --exclude test.py
    isort .
    mypy .
    autoflake --remove-all-unused-imports --recursive --ignore-init-module-imports --ignore-pass-after-docstring --in-place agbenchmark

@waynehamadi merged commit d9b3d7d into master on Jul 24, 2023
@waynehamadi deleted the feat/smooth-challenges branch on July 24, 2023 at 20:57