Roadmap for v0.6.0 #195

zimmski · 2024-06-17T07:21:51Z

Tasks/Goals:

Development & Management 🛠️
- Demo script to run models sequentially in separate evaluations on the "light" repository by @ahumenberger Script for sequentially evaluating common models with "light" repository #189
Documentation 📚
- Document roadmaps and release schedule by @bauersimon Roadmap issue template and README section about releases #196
Evaluation ⏱️
- Isolated Execution Isolation of evaluations #198, Sandbox execution #17
  - Docker Support
    - Build Docker image for every release by @Munsio Build a docker image to run the evaluation in an isolated environment #199
    - Docker evaluation runtime by @Munsio Docker runtime #211, Early merger for "docker runtime" follow up #238, Docker runtime follow up #234, Revert "Docker runtime follow up" #252
    - Parallel execution of containerized evaluations by @Munsio Parallel execution of containerized evaluations #221
    - Run docker image generation on each push by @Munsio Run docker image generation on each Push #247
    - fix, Use main revision docker tag by default by @Munsio Use "main" image if no image was specified #249, Docker runtime is using the wrong container image #242
    - fix, Add commit revision to docker and reports by @Munsio Add the current commit revision to the binary, Docker image and reports #207, Add the commit revision to binary/reports/docker #255
    - fix, I/O error when multiple containers use the same result path by @Munsio unable to create temporary repository path: exec: WaitDelay expired before I/O complete #219, Docker containers may use the same result-path #273, Always numerize the result path of containerized runs to avoid I/O sync problems #274
    - Test docker in GitHub Actions by @Munsio Follow up - Isolated evaluations #224, Testing docker runtime #260
    - fix, Ignore CLI argument model, provider and testdata checks on host when using containerization by @Munsio Only run model/provider as well as testdata checks on the host if the runtime is not containerized #290
    - fix, Pass environment tokens into container by @Munsio fix, Pass provider token env to docker container, and bump Symflower version for more symflower fix rules #250
    - fix, Use a pinned Java 11 version by @Munsio Use a pinned version for Java 11 dependency #279
    - Make paths absolute when copying Docker results because Docker gets confused by paths containing colons by @Munsio Docker runtime broken on main #302, Copying docker results did not work without setting the result-path parameter #308
  - Kubernetes Support
    - Kubernetes evaluation runtime by @Munsio Kubernetes runtime #231
    - Copy back results from the cluster to the initial host by @Munsio Copy cluster data and documentation update #272
    - fix, Only use valid characters in Kubernetes job names by @Munsio Satisfy Kubernetes convention to only have alphanumeric characters and "-" in job names #292
- Timeouts for test execution and symflower test generation by @ruiAzevedo19 Fixed timeouts for symflower unit-tests and symflower test #167, Add timeout to symflower test #185, Follow-up: Apply "symflower fix" to a "write-test" result of a model when it errors, so model responses can possibly be fixed #232, https://github.com/symflower/eval-dev-quality/issues/, fix, Handle inconsistent timout error on Windows #277, fix, Define a timeout for the "symflower unit-tests" command, so ensure the execution does not take too much time #267, fix, Define a timeout for the "symflower test" command, so ensure the execution does not take too much time #188
- Clarify prompt that code responses must be in code fences by @ruiAzevedo19 Infer if a model actually returned source code #43, Change all prompts to enforce code fences #257, Change the "code-repair" and "write-tests" task prompt to enforce the generated code to be inside code fences, to ensure we can extract the code from the LLMs response #259
- fix, Use backoff for retrying LLMs because some LLMs need more time to recover by @zimmski fix, Use a backoff for retrying LLM queries because it seems that some LLMs need longer to recover #172
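
The backoff item above can be illustrated with a minimal sketch. This is not the project's implementation: `retryWithBackoff`, `errTemporary`, the doubling delay, and the attempt count are all assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTemporary stands in for a transient LLM provider error (hypothetical).
var errTemporary = errors.New("temporarily unavailable")

// retryWithBackoff is a hypothetical helper: it calls query up to "attempts"
// times and doubles the waiting time after every failed attempt.
func retryWithBackoff(attempts int, initialDelay time.Duration, query func() error) error {
	if attempts <= 0 {
		return errors.New("attempts must be positive")
	}

	var err error
	delay := initialDelay
	for i := 0; i < attempts; i++ {
		if err = query(); err == nil {
			return nil
		}
		// Only wait if there is another attempt left.
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}

	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTemporary // Simulate two failing queries.
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

Growing the delay gives a struggling model or provider progressively more time to recover, instead of hitting it again at a fixed interval.
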
Models 🤖
- Pull Ollama models if they are selected for evaluation by @Munsio Pull ollama models #283, Pull ollama models #284
- Model Selection
  - Exclude certain models (e.g. "openrouter/auto") because they just forward to another model automatically by @bauersimon Exclude openrouter/auto since it is just a random model #126, Write openrouter models to CSV and reject models that we want to ignore automatically #288
  - Exclude the Perplexity online models because they have a "per request" cost Write openrouter models to CSV and reject models that we want to ignore automatically #288 (automatically excluded as online models)
  - Additional Models
    - Snowflake
    - DeepSeek V2
    - CodeQwen 7B
    - Gemma 2
    - Cohere Aya
    - Yi 1.5
    - Phi 3
    - Falcon
    - Mistral 7B 0.3
    - Codegemma
- fix, Retry the openrouter models query because it sometimes just errors by @bauersimon Openrouter returns 524 when querying models #186, fix, Retry querying Openrouter models cause that sometimes fails #191
- fix, Default to all repositories if none are explicitly selected by @bauersimon Automatic selection of repositories is broken #163, fix, Default to all repositories if none are selected in CLI #182
- fix, Do not start Ollama server if no Ollama model is selected by @ruiAzevedo19 Do not start the ollama server if not needed #225, Shut down the Ollama service if there are no Ollama models to evaluate, to avoid unnecessary processes being opened #269
- fix, Always use forward slashes in prompts so the prompt is the same on every OS by @ruiAzevedo19 The prompt uses different paths depending on the OS #152, Use Linux file paths on prompts even if the evaluation is running on a Windows machine, so the prompt is always the same #268
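
The forward-slash item above can be sketched as follows. `promptPath` and the example path are hypothetical. Go's `filepath.ToSlash` only converts the separator of the host OS, so this sketch uses `strings.ReplaceAll` to behave identically on every host.

```go
package main

import (
	"fmt"
	"strings"
)

// promptPath is a hypothetical helper that rewrites a file path so it uses
// forward slashes, no matter which OS produced the path.
func promptPath(path string) string {
	return strings.ReplaceAll(path, `\`, "/")
}

func main() {
	// A Windows-style and a Linux-style path yield the same prompt text.
	fmt.Println(promptPath(`golang\plain\plain.go`))
	fmt.Println(promptPath("golang/plain/plain.go"))
}
```
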
Reports & Metrics 🗒️
- Logging
  - refactor, Structural logging by @ahumenberger Structural logging #245
  - Store model responses in separate files for easier lookup by @ahumenberger Log model responses directly to file and reuse them for debugging #181, Log model responses as artifact in separate file #278
  - Store coverage objects by @ruiAzevedo19 Log the coverage objects, so this information is available and not lost #223
- Write out results right away so we don't lose anything if the evaluation crashes by @ruiAzevedo19 Dump the assessments in the CSV files once they happen and not in the end of all executions #237, Early merger for "Dump the assessments to the evaluation CSV right after running a task" #243
- refactor, Abstract the storage of assessments by @ahumenberger Improve maintainability of assessments #169, Improve maintainability of assessments by abstracting away details of how assessments are stored #178
- fix, Do not overwrite results but create a separate result directory by @bauersimon If results folder already exists, add suffix but don't overwrite or error #176, fix, Rename result directory if it already exists #179
- New report subcommand for postprocessing report data
  - report subcommand to compare multiple evaluations into one by @ruiAzevedo19 Tool/command to combine multiple evaluations into one #205, Introduce the "report" command to combine multiple evaluations into a single file #271
  - Let report command also combine markdown reports by @ruiAzevedo19 Let the "report" command also generate a markdown report for the combined evaluations #258
- Report evaluation configuration (used models + repositories) as a JSON artifact for reproducibility Use a JSON configuration file to set up an evaluation run #282
  - Store models for the evaluation in JSON configuration report by @bauersimon Write available and selected models into a configuration file for documentation/reproducibility #285
  - Store repositories for the evaluation in JSON configuration report by @bauersimon Store repositories in JSON #287
  - Load models and repositories that were used from JSON configuration by @ruiAzevedo19 Load selected models and repositories from the JSON configuration file, to set up an evaluation run #291
- Report maximum of executable files by @ruiAzevedo19 Report the maximum theoretically reachable #files-executed #215, New assessment to hold the maximum reachable executable files #261
- Experiment with human-readable model names and costs to prepare for data visualization
  - Generate the summed model files from the evaluation.csv by @ruiAzevedo19 Generate the "models-summed.csv" and "language-summed.csv" files based on the "evaluation.csv" file #241
  - Extract human-readable names of models by @ruiAzevedo19 Extract human-readable names for models #206, Extract model names, to obtain a human-readable name for each model #217
  - Extract model costs by @ruiAzevedo19 Extract model costs into log and CSVs #210, Extract model costs into log and CSVs, so the pricing information is always available #216
  - Remove summed CSVs, human-readable names to handle them later during visualization by @ruiAzevedo19 Remove the summed CSVs and model's cost and human-readable name, since they will be handled afterwards with tooling #256
- Use new Symflower version which reduces error output of the "fix" command by @bauersimon Use new Symflower version which reduces error output of the "fix" command #323
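
The "do not overwrite results" item in the list above can be sketched like this. `uniqueResultPath`, `createResultDirectories`, and the `-1`, `-2` suffix format are assumptions for illustration; the actual naming scheme may differ.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// uniqueResultPath is a hypothetical helper: it returns base if nothing
// exists at that path, otherwise the first free "base-N" path.
func uniqueResultPath(base string) (string, error) {
	candidate := base
	for i := 1; ; i++ {
		_, err := os.Stat(candidate)
		if errors.Is(err, fs.ErrNotExist) {
			return candidate, nil
		}
		if err != nil {
			return "", err
		}
		candidate = fmt.Sprintf("%s-%d", base, i)
	}
}

// createResultDirectories requests the same result directory "count" times
// inside a fresh temporary directory and returns the names that were used.
func createResultDirectories(count int) ([]string, error) {
	root, err := os.MkdirTemp("", "results")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	base := filepath.Join(root, "evaluation")
	var names []string
	for i := 0; i < count; i++ {
		path, err := uniqueResultPath(base)
		if err != nil {
			return nil, err
		}
		if err := os.Mkdir(path, 0o755); err != nil {
			return nil, err
		}
		names = append(names, filepath.Base(path))
	}

	return names, nil
}

func main() {
	names, err := createResultDirectories(3)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(names) // [evaluation evaluation-1 evaluation-2]
}
```
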
Operating Systems 🖥️
- More tests for Windows
  - Explicitly test Java test path logic on Windows by @bauersimon https://github.com/symflower/eval-dev-quality/pull/155/files missing a test #159, Explicit windows test case for Java test path logic #184
  - Extend temporary repository tests to Windows by @bauersimon Follow-Up from using Git to reset the temporary directory #141
Tools 🧰
- symflower fix auto-repair of common LLM mistakes
  - Integrate symflower fix into evaluation by @ruiAzevedo19, @bauersimon Apply symflower fix to a "write-test" result of a model #213, Apply "symflower fix" to a "write-test" result of a model when it errors, so model responses can possibly be fixed #229
  - Do not run symflower fix when there is a timeout of the LLM by @ruiAzevedo19 Follow-up: Apply "symflower fix" to a "write-test" result of a model when it errors, so model responses can possibly be fixed #232, Do not run "symflower fix" if the original response failed with a timeout, so the model and the fix assessments are consistent #236
  - Update symflower to latest version to benefit from improved Go test package repairs by @bauersimon, @Munsio Update to latest Symflower version for improved static code repair #294, Bump symflower version to stay on the latest version possible #303
Tasks 🔢
- Infrastructure for different Task types
  - Introduce the interface for doing "evaluation tasks" so we can easily add them by @ahumenberger Task interface to accommodate different types of tasks #197, Support multiple evaluation tasks #165, Introduce the concept of "tasks" to prepare for different evaluation tasks like "write tests" and "repair code" #166
  - fix, CSV header missing the task identifier by @bauersimon CSV report header is missing the task identifier #187, fix, Missing CSV header for task #190
  - Compile Go and Java so compilation errors can be used for code repair task by @ruiAzevedo19 New task to check for Go and Java compilation errors #160, Check for Java and Go compilation errors when building a project, for further compile error code repairing task #162
  - refactor, Share logging setup between multiple tasks by @bauersimon Follow-up "Code repairing task to enable models to fix code with compilation errors" #200, Share logging setup between tasks #202
  - fix, Missing return statements when checking model capabilities by @bauersimon Missing return statements when checking model capabilities #239
  - Validate task repositories before evaluation by @ruiAzevedo19 Check if all testdata repositories are well-formed just once, and not in every task run #263, Check if the testdata repository is valid before running the evaluation, so it is checked just once #265, Check if the repository for the transpile task is valid before running the evaluation, so it is checked just once #306
- New task types
  - Evaluation task for code repair by @ruiAzevedo19 Evaluation task: Code repair #168, Code repairing task to enable models to fix code with compilation errors #170, Early merger for code repair task #192
    - fix, Ignore git and Maven directories when validating code-repair repositories by @ahumenberger, @ruiAzevedo19 fix, Ignore git and Maven directories when validating the code repair repository, since they do not need any validation #281
    - fix, Correct test value for "variable unknown" code repair task by @ruiAzevedo19 Correct the tests of the "variable unknown" mistakes case #212
    - fix, Score with passing tests in code-repair task because coverage can be cheated by @bauersimon Code repair should only consider #(passing tests) and never coverage #320, fix, Score with passing tests in code-repair task cause coverage can be cheated #321
  - Evaluation task for transpilation (Go->Java and Java->Go) by @ruiAzevedo19 Evaluation task: Transpile #201, Testdata for transpiling Go into Java and Java into Go #246, Task for code transpilation, so models can transpile Go code to Java and back #226
    - Early merger for transpilation task by @ruiAzevedo19 Early merger for the transpilation task and update README #264
- fix, Make Java Knapsack easier to solve by reducing Java specifics by @ruiAzevedo19 Make the Knapsack.java case easier to solve for models #230, Make the Java Knapsack inner class static, so it is easier for LLMs to solve #262
- Internal management of Testdata repositories as temporary Git repositories
  - fix, Create temporary repositories just once by @bauersimon Logic for "Create temporary repositories for each language so the repository is copied only once per language." copies more than needed #157, fix, Create temporary repositories once #180
  - fix, Fail tests immediately if outdated tools are installed by @bauersimon Running Ollama tests with the wrong Ollama binary should fail hard #156, fix, Fail tests immediately in case tool is outdated or unusable #171
- fix, Clarify Java build files to use proper version as required by Maven by @ruiAzevedo19 Malformed Maven version #270, fix, Use the correct Maven snapshot format in the Java test data, to have a cleaner output without warnings #275

Release version of this roadmap issue:

❓ When should a release happen? Check the README!

Leftover TODOs were moved to #301.

zimmski added the enhancement New feature or request label Jun 17, 2024

zimmski added this to the v0.6.0 milestone Jun 17, 2024

zimmski self-assigned this Jun 17, 2024

zimmski changed the title ~~Roadmap for v0.5.0~~ Roadmap for v0.6.0 Jun 17, 2024

zimmski added roadmap Collection of issues for a release and removed enhancement New feature or request labels Jun 17, 2024

bauersimon mentioned this issue Aug 19, 2024

fix, Score with passing tests in code-repair task cause coverage can be cheated #321

Merged

bauersimon closed this as completed Oct 11, 2024
