add support for run-bug-run (#39 WIP) #166

Merged: 25 commits merged into ASSERT-KTH:master on Mar 19, 2025

Conversation

@cadddr (Contributor) commented Oct 30, 2024

#39 WIP

@andre15silva (Member)

Hi @cadddr !

There is currently a failure in the RunBugRun tests.

See https://github.com/ASSERT-KTH/repairbench-framework/actions/runs/12126690735/job/33810455158?pr=166#step:13:337

Seems to be an error in loading the dataframe.

@cadddr (Contributor, Author) commented Dec 3, 2024 via email

@andre15silva (Member)

> What version is being installed, so I can reproduce?

When you run poetry install it will use the version that is defined in the poetry.lock file.

Right now that is 2.2.3

@cadddr (Contributor, Author) commented Dec 16, 2024

I also have pandas==2.2.3. In the log there is a deprecation warning for using a path string as the argument to read_json, so I wrapped it in a file stream. Hopefully this passes; otherwise, I'm not sure how to debug this.
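A minimal sketch of the workaround described above, assuming the standard pandas API (the loader actually committed in this PR may differ):

# Instead of pd.read_json("path.jsonl", lines=True), which reportedly
# triggered the deprecation warning, pass an open file object; the path
# below is the one used by this benchmark.
import pandas as pd

with open("benchmarks/run_bug_run/python_valid0.jsonl") as fh:
    df = pd.read_json(fh, lines=True)  # one JSON object per line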

@andre15silva (Member)

Now the problem seems to be related to a FileNotFoundError.

@cadddr (Contributor, Author) commented Dec 18, 2024

> Now the problem seems to be related to a FileNotFoundError.

My path: /workspaces/elle-elle-aime/benchmarks/run_bug_run/python_valid0.jsonl

I double-checked that the commands in setup.sh download and unpack the file correctly:

mkdir benchmarks/run_bug_run
cd benchmarks/run_bug_run
wget https://github.com/giganticode/run_bug_run_data/releases/download/v0.0.1/python_valid0.jsonl.gz
wget https://github.com/giganticode/run_bug_run_data/releases/download/v0.0.1/tests_all.jsonl.gz

gzip -d python_valid0.jsonl.gz

Why is the working dir in the log (repairbench-framework/repairbench-framework) different from elle-elle-aime? Could this be the problem?

@andre15silva (Member)

Trying to fix the path; also rebased onto the latest master. Let's see if we can fix this.

@andre15silva (Member)

Fixed the file-not-found problem by changing the benchmark directory to a submodule.

We now get another error, during the execution of a RunBugRun bug.

@cadddr (Contributor, Author) commented Dec 27, 2024

Thanks for fixing the paths. Bug-related errors do not reproduce consistently, since we're taking 3 bugs from an unordered dict. After fixing the order (and running 20 bugs instead of 3), I get the first failure on p02273_118997. The reason is that the fixed solution isn't passing due to the lack of an exact string match. (As I mentioned earlier, inputs/outputs in RunBugRun are passed via standard I/O as strings.) Example:

print (result)
0 0
11.111111111111112 0.0
16.666666666666668 9.622504486493764
22.222222222222225 0.0
33.333333333333336 0.0
38.88888888888889 9.622504486493762
33.333333333333336 19.24500897298752
44.44444444444444 19.245008972987524
50.0 28.867513459481287
55.55555555555556 19.245008972987527
66.66666666666667 19.245008972987527
61.111111111111114 9.622504486493764
66.66666666666667 0.0
77.77777777777779 0.0
83.33333333333334 9.622504486493753
88.88888888888889 0.0
100 0

print (test_output)
0.00000000 0.00000000
11.11111111 0.00000000
16.66666667 9.62250449
22.22222222 0.00000000
33.33333333 0.00000000
38.88888889 9.62250449
33.33333333 19.24500897
44.44444444 19.24500897
50.00000000 28.86751346
55.55555556 19.24500897
66.66666667 19.24500897
61.11111111 9.62250449
66.66666667 0.00000000
77.77777778 0.00000000
83.33333333 9.62250449
88.88888889 0.00000000
100.00000000 0.00000000

@andre15silva (Member)

Thanks for noticing the randomness bug! I fixed the issue for all benchmarks and rebased this PR with the latest commits from master.

For the problem of comparing outputs, the straightforward solution would be to eval the strings and compare the values.
Are the outputs always floats, or can they be of other types?
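Not the solution that was eventually merged (the PR later imports RunBugRun's own post-processing), but a minimal sketch of the value-based comparison idea, using float parsing with a tolerance instead of eval; all names here are illustrative:

import math

def outputs_match(actual: str, expected: str) -> bool:
    # Compare whitespace-separated tokens; numeric tokens are compared
    # with a tolerance so "0 0" matches "0.00000000 0.00000000".
    a_tokens, e_tokens = actual.split(), expected.split()
    if len(a_tokens) != len(e_tokens):
        return False
    for a, e in zip(a_tokens, e_tokens):
        try:
            if not math.isclose(float(a), float(e), rel_tol=1e-6, abs_tol=1e-6):
                return False
        except ValueError:
            if a != e:  # non-numeric tokens fall back to exact match
                return False
    return True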

@cadddr (Contributor, Author) commented Mar 11, 2025

@andre15silva PR back from the dead. I imported the output post-processing logic from runbugrun, and now the tests pass. Please check if this is mergeable.

@andre15silva (Member)

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the RunBugRun benchmark to the elle-elle-aime project, along with necessary components like an output matcher and a bug representation. The changes also include integration of this benchmark into the existing framework and a new prompting strategy. Overall, the implementation seems well-structured and addresses issue #39. However, there are a few areas that could benefit from further attention.

Summary of Findings

  • Potential caching issue in get_failing_tests: The get_failing_tests function in runbugrun.py has a section that appears to handle caching of test results. However, the logic for determining whether to execute tests or use cached results could be improved to ensure consistent behavior. Specifically, the condition if not already_cached only applies when there isn't a runtime exception, potentially leading to missed caching opportunities or unnecessary test executions (a hypothetical sketch follows this list).
  • Inconsistent file handling in RunBugRunBug.checkout: The RunBugRunBug.checkout method in runbugrunbug.py uses a hardcoded path 'buggy' when copying files, which might not be flexible enough if the benchmark structure changes in the future. Consider making this more configurable or dynamically determined.
  • Missing error handling in RunBugRunBug.execute_test_case: The RunBugRunBug.execute_test_case function in runbugrunbug.py catches OSError and subprocess.TimeoutExpired, but it might be beneficial to log these exceptions or provide more context in the returned error message to aid in debugging.
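To make the first finding concrete, a purely hypothetical sketch of the pattern the review describes; none of these names are taken from the actual runbugrun.py:

def execute_tests(bug_id: str) -> dict:
    # stand-in for the real test runner, which can raise at runtime
    if bug_id.endswith("_bad"):
        raise RuntimeError("runtime failure while executing tests")
    return {"failing_tests": []}

def get_failing_tests(bug_id: str, cache: dict) -> dict:
    if bug_id in cache:          # reuse a cached result when present
        return cache[bug_id]
    try:
        result = execute_tests(bug_id)
    except RuntimeError as e:
        result = {"error": str(e)}
        # the review's concern: if the cache write is guarded by
        # `if not already_cached` only on the success path, failing
        # runs are never cached and get re-executed on every call
    cache[bug_id] = result       # cache both outcomes consistently
    return result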

Merge Readiness

The pull request introduces significant new functionality and requires careful review. While the code appears well-structured, the potential caching issue and file handling inconsistencies should be addressed before merging. I am unable to approve this pull request, and recommend that others review and approve this code before merging. At a minimum, the high severity issues should be addressed before merging.

@ASSERT-KTH deleted a comment from gemini-code-assist bot, Mar 11, 2025
@andre15silva (Member)

Thanks @cadddr !

I went over the code again and also ran the Gemini review.
Linting is failing; could you run the linter (there is a pre-commit hook for that now)?

Left some comments and waiting for the tests to run.

@gemini-code-assist (bot)

> Thanks @cadddr!

I'm just responding here to note that I don't have the ability to run the linter or take any actions on the pull request. I can only respond to comments and provide code suggestions. It looks like you're asking the author to run the linter locally, which is the correct approach.

@cadddr (Contributor, Author) commented Mar 11, 2025

> Thanks @cadddr !
>
> I went over the code again and also ran the Gemini review. Linting is failing; could you run the linter (there is a pre-commit hook for that now)?
>
> Left some comments and waiting for the tests to run.

@andre15silva sure, I can run linting and remove the FIXME comments. Tests are failing because the RunBugRun JSON file is not found. I noticed the download commands for it have been removed from setup.sh -- why is that? Should I simply roll back the file to the version that worked?

@gemini-code-assist (bot)

Hi @andre15silva,

Regarding the missing RunBugRun JSON file, it appears the download commands were removed from setup.sh. Restoring those commands should resolve the issue. Additionally, extracting the buggy_test_results.tgz archive in setup.sh will ensure the test results are available.

I noticed that the benchmark directory was changed to a submodule, but then commented out in .gitmodules. Is this intentional? If the intention is to use a submodule, the commented lines in .gitmodules should be uncommented, and the submodule should be properly initialized.

Here's a suggestion to add the extraction of the test results in setup.sh:

--- a/setup.sh
+++ b/setup.sh
@@ -22,3 +22,7 @@
 if [ -z "$CI" ]; then
  poetry run ./gitbug-java setup;
 fi
+
+cd benchmarks/run_bug_run;
+tar -xvzf buggy_test_results.tgz;
+cd ../..;

@cadddr (Contributor, Author) commented Mar 11, 2025

@andre15silva Data download commands for runbugrun were removed in this commit: cadddr@b83f855

Initially setting up benchmarks/run_bug_run as a submodule was a misunderstanding, as it linked to the original RunBugRun Ruby repo (rbugr), whereas the intention for this directory is simply to house the data files. Let me know how best to resolve this.

@cadddr (Contributor, Author) commented Mar 11, 2025

@andre15silva also, the submodule entry was already taken out here: cadddr@1622d83
Not sure why it was brought back. Could all that have been overwritten by the merge from ASSERT-KTH:master?

@andre15silva (Member) commented Mar 11, 2025

Could have been problems with the merge; not sure what happened.

We can fix setup.sh; the version in main doesn't have this issue.
As for the submodule question, if the files are not the same as the repo, then we can simply add them in a directory inside benchmarks.

WDYT?

@cadddr (Contributor, Author) commented Mar 12, 2025

@andre15silva made all changes, let's see those tests pass.

@cadddr (Contributor, Author) commented Mar 13, 2025

@andre15silva
Tests are failing due to file path issues. File not found here: '/home/runner/work/repairbench-framework/repairbench-framework/benchmarks/run_bug_run/python_valid0.jsonl'

When running setup, this file gets downloaded, but not to the right directory, because the preceding cd command fails:
./setup.sh: line 27: cd: benchmarks/run_bug_run: No such file or directory

This directory exists in the repo e.g., https://github.com/cadddr/elle-elle-aime/blob/master/benchmarks/run_bug_run/buggy_test_results.tgz

Is that because the setup for GitBug-Java forgets to exit from its directory? (cd benchmarks/gitbug-java;)

For me, pwd shows /workspaces/elle-elle-aime. I recall we had this issue before, e.g., #166 (comment).
How did you fix it?
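One way to keep setup.sh steps from leaking their working directory, assuming the GitBug-Java block is what leaves the script in benchmarks/gitbug-java (a sketch, not the fix that was actually applied):

# run each benchmark's setup in a subshell so its `cd` cannot affect
# the steps that follow
(
  cd benchmarks/gitbug-java
  poetry run ./gitbug-java setup
)
(
  cd benchmarks/run_bug_run
  tar -xvzf buggy_test_results.tgz
)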

@andre15silva self-requested a review, March 14, 2025 08:44
@cadddr (Contributor, Author) commented Mar 14, 2025

@andre15silva looks like all tests pass! Are we all set to merge?

@andre15silva (Member)

LGTM!

Let's wait for @t-sorger to take a look and give his opinion and then we can merge. He's also been working on integrating another Python benchmark.

@t-sorger (Collaborator)

The code generally looks good to me.

I do have some concerns about the Python utilities, as I ran this code previously with slight modifications while developing my own version. It might not be fully compatible with the BugsInPy benchmark, so it’s something to keep in mind when integrating BugsInPy to ensure the utilities don’t break for run-bug-run (and vice versa). That said, it should be fine for now.

@cadddr (Contributor, Author) commented Mar 18, 2025

Admittedly, I've put in the bare minimum needed just to make RunBugRun work. Agreed, the utilities will have to be generalized as we onboard more Python datasets, perhaps in the next PR. With this second dataset, BugsInPy, we should get a better idea of which parts are general vs. dataset-specific.

@andre15silva (Member)

Sounds good, merging it.

Good work @cadddr, thanks for the PR :))

@andre15silva merged commit 6e0e212 into ASSERT-KTH:master on Mar 19, 2025
3 checks passed