This repository has been archived by the owner on Jun 9, 2024. It is now read-only.
safety challenges, adaptability challenges, suite same_task #177
Merged
…nt-Gravitas/Auto-GPT-Benchmarks into feat/smooth-challenges
waynehamadi approved these changes on Jul 24, 2023
Background
Before this PR there was no good way to scale challenges smoothly. The only way to make a challenge smooth was to copy and paste it and slightly tweak the task description. This is the first PR to set up the backbone for suites, which allows a suite.json file to be defined (below).
Changes
--suite suite_config.prefix: Generates a single test with 3 regression infos and internal infos. Should work with all other flags and with plain agbenchmark start (this has been tested, but there may be edge cases).
no_dep: Runs without dependencies, which can be useful for testing.
All tests within a suite folder must start with the prefix defined in suite.json. There are two types of suites.
same_task
If same_task is set to true, all of the data.json files are combined into one test. A single test runs, but multiple regression tests, internal_infos, dependencies, and reports are created. The artifacts_in/out folders and custom Python should live in the suite folder because they are shared between tests. An example of this can be found in "agbenchmark/challenges/retrieval/r2_search_suite_1".
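To make the prefix rule and the same_task behavior described above concrete, here is a minimal sketch of how the data.json files in a suite folder could be combined into one test description. It is not the code from this PR: the function name `load_same_task_suite`, the "prefix" key in suite.json, the "name" key in data.json, and the one-subfolder-per-test layout are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_same_task_suite(suite_dir: Path) -> Dict[str, Any]:
    """Combine every data.json under a suite folder into one test description.

    Hypothetical helper: the key names and folder layout are assumptions,
    not the benchmark's actual schema.
    """
    suite_config = json.loads((suite_dir / "suite.json").read_text())
    prefix = suite_config["prefix"]  # assumed key name in suite.json

    combined: Dict[str, Any] = {"prefix": prefix, "tests": {}}
    # Assumed layout: one subfolder per test, each holding a data.json.
    for data_file in sorted(suite_dir.glob("*/data.json")):
        data = json.loads(data_file.read_text())
        name = data["name"]  # assumed key name in data.json
        # Enforce the rule that every test in the suite starts with the prefix.
        if not name.startswith(prefix):
            raise ValueError(
                f"{name!r} does not start with suite prefix {prefix!r}"
            )
        combined["tests"][name] = data
    return combined
```

Under these assumptions, a single run could iterate over `combined["tests"]` to produce one report entry per data.json while executing the shared task only once.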
The structure for a same_task report looks like this:
non same_task
If same_task is set to false, the main additions are the ability to run the suite via the --suite flag and the ability to run the tests in reverse order (can't work). This should also generate a single report similar to the one above, also with a %.
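As a rough illustration of what selecting a suite by prefix could look like, the sketch below filters a list of test names and optionally reverses the order. The function `select_suite_tests` and its parameters are hypothetical and are not taken from the benchmark's code. It only shows the idea of prefix-based selection.

```python
from typing import List


def select_suite_tests(
    test_names: List[str], prefix: str, reverse: bool = False
) -> List[str]:
    """Return the test names that belong to a suite, optionally reversed.

    Hypothetical helper: membership is decided purely by the name prefix.
    """
    selected = [name for name in test_names if name.startswith(prefix)]
    return selected[::-1] if reverse else selected
```

Tests whose names do not start with the prefix are simply left out, so each test in a non same_task suite still runs as its own test.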
The structure for a non same_task report looks like this:
TODO:
PR Quality Checklist