add support for run-bug-run (#39 WIP) #166

Merged: 25 commits merged into ASSERT-KTH:master on Mar 19, 2025

Conversation

@cadddr (Contributor) commented Oct 30, 2024

#39 WIP

@andre15silva (Member)

Hi @cadddr !

There is currently a failure in the RunBugRun tests.

See https://github.com/ASSERT-KTH/repairbench-framework/actions/runs/12126690735/job/33810455158?pr=166#step:13:337

Seems to be an error in loading the dataframe.

@cadddr (Contributor, Author) commented Dec 3, 2024 via email

@andre15silva (Member)

> What version is being installed, so I can reproduce?

When you run poetry install it will use the version that is defined in the poetry.lock file.

Right now that is 2.2.3

@cadddr (Contributor, Author) commented Dec 16, 2024

I also have pandas==2.2.3. In the log there is a deprecation warning for using a path string as the argument to read_json, so I wrapped it in a file stream. Hopefully this passes; otherwise, I'm not sure how to debug this.
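A minimal sketch of the workaround described above, assuming the standard pandas API (the loader actually committed in this PR may differ):

# Instead of pd.read_json("path.jsonl", lines=True), which reportedly
# triggered the deprecation warning, pass an open file object; the path
# below is the one used by this benchmark.
import pandas as pd

with open("benchmarks/run_bug_run/python_valid0.jsonl") as fh:
    df = pd.read_json(fh, lines=True)  # one JSON object per line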

@andre15silva (Member)

Now the problem seems to be related to a FileNotFoundError.

@cadddr (Contributor, Author) commented Dec 18, 2024

> Now the problem seems to be related to a FileNotFoundError.

My path: /workspaces/elle-elle-aime/benchmarks/run_bug_run/python_valid0.jsonl

I double-checked that the commands in setup.sh download and unpack the file correctly:

mkdir benchmarks/run_bug_run
cd benchmarks/run_bug_run
wget https://github.com/giganticode/run_bug_run_data/releases/download/v0.0.1/python_valid0.jsonl.gz
wget https://github.com/giganticode/run_bug_run_data/releases/download/v0.0.1/tests_all.jsonl.gz

gzip -d python_valid0.jsonl.gz

Why is the working dir in the log (repairbench-framework/repairbench-framework) different from elle-elle-aime? Could this be the problem?

@andre15silva (Member)

Trying to fix the path; also rebased onto the latest master. Let's see if we can fix this.

@andre15silva (Member)

Fixed the file-not-found problem by changing the benchmark directory to a submodule.

We now get another error, during the execution of a RunBugRun bug.

@cadddr (Contributor, Author) commented Dec 27, 2024

Thanks for fixing the paths. Bug-related errors do not reproduce consistently, since we're taking 3 bugs from an unordered dict. After fixing the order (and running 20 bugs instead of 3), I get the first failure on p02273_118997. The reason is that the fixed solution isn't passing due to the lack of an exact string match. (As I mentioned earlier, inputs/outputs in RunBugRun are passed via standard I/O as strings.) Example:

print (result)
0 0
11.111111111111112 0.0
16.666666666666668 9.622504486493764
22.222222222222225 0.0
33.333333333333336 0.0
38.88888888888889 9.622504486493762
33.333333333333336 19.24500897298752
44.44444444444444 19.245008972987524
50.0 28.867513459481287
55.55555555555556 19.245008972987527
66.66666666666667 19.245008972987527
61.111111111111114 9.622504486493764
66.66666666666667 0.0
77.77777777777779 0.0
83.33333333333334 9.622504486493753
88.88888888888889 0.0
100 0

print (test_output)
0.00000000 0.00000000
11.11111111 0.00000000
16.66666667 9.62250449
22.22222222 0.00000000
33.33333333 0.00000000
38.88888889 9.62250449
33.33333333 19.24500897
44.44444444 19.24500897
50.00000000 28.86751346
55.55555556 19.24500897
66.66666667 19.24500897
61.11111111 9.62250449
66.66666667 0.00000000
77.77777778 0.00000000
83.33333333 9.62250449
88.88888889 0.00000000
100.00000000 0.00000000

@andre15silva (Member)

Thanks for noticing the randomness bug! I fixed the issue for all benchmarks and rebased this PR with the latest commits from master.

For the problem of comparing outputs, the straightforward solution would be to eval the strings and compare the values.
Are the outputs always floats, or can they be of other types?
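Not the solution that was eventually merged (the PR later imports RunBugRun's own post-processing), but a minimal sketch of the value-based comparison idea, using float parsing with a tolerance instead of eval; all names here are illustrative:

import math

def outputs_match(actual: str, expected: str) -> bool:
    # Compare whitespace-separated tokens; numeric tokens are compared
    # with a tolerance so "0 0" matches "0.00000000 0.00000000".
    a_tokens, e_tokens = actual.split(), expected.split()
    if len(a_tokens) != len(e_tokens):
        return False
    for a, e in zip(a_tokens, e_tokens):
        try:
            if not math.isclose(float(a), float(e), rel_tol=1e-6, abs_tol=1e-6):
                return False
        except ValueError:
            if a != e:  # non-numeric tokens fall back to exact match
                return False
    return True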

@cadddr (Contributor, Author) commented Mar 11, 2025

@andre15silva PR back from the dead. I imported the output post-processing logic from runbugrun, and now the tests pass. Please check if this is mergeable.

@andre15silva (Member)

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the RunBugRun benchmark to the elle-elle-aime project, along with necessary components like an output matcher and a bug representation. The changes also include integration of this benchmark into the existing framework and a new prompting strategy. Overall, the implementation seems well-structured and addresses issue #39. However, there are a few areas that could benefit from further attention.

Summary of Findings

  • Potential caching issue in get_failing_tests: The get_failing_tests function in runbugrun.py has a section that appears to handle caching of test results. However, the logic for determining whether to execute tests or use cached results could be improved to ensure consistent behavior. Specifically, the condition if not already_cached only applies when there isn't a runtime exception, potentially leading to missed caching opportunities or unnecessary test executions (a hypothetical sketch follows this list).
  • Inconsistent file handling in RunBugRunBug.checkout: The RunBugRunBug.checkout method in runbugrunbug.py uses a hardcoded path 'buggy' when copying files, which might not be flexible enough if the benchmark structure changes in the future. Consider making this more configurable or dynamically determined.
  • Missing error handling in RunBugRunBug.execute_test_case: The RunBugRunBug.execute_test_case function in runbugrunbug.py catches OSError and subprocess.TimeoutExpired, but it might be beneficial to log these exceptions or provide more context in the returned error message to aid in debugging.
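To make the first finding concrete, a purely hypothetical sketch of the pattern the review describes; none of these names are taken from the actual runbugrun.py:

def execute_tests(bug_id: str) -> dict:
    # stand-in for the real test runner, which can raise at runtime
    if bug_id.endswith("_bad"):
        raise RuntimeError("runtime failure while executing tests")
    return {"failing_tests": []}

def get_failing_tests(bug_id: str, cache: dict) -> dict:
    if bug_id in cache:          # reuse a cached result when present
        return cache[bug_id]
    try:
        result = execute_tests(bug_id)
    except RuntimeError as e:
        result = {"error": str(e)}
        # the review's concern: if the cache write is guarded by
        # `if not already_cached` only on the success path, failing
        # runs are never cached and get re-executed on every call
    cache[bug_id] = result       # cache both outcomes consistently
    return result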

Merge Readiness

The pull request introduces significant new functionality and requires careful review. While the code appears well-structured, the potential caching issue and file handling inconsistencies should be addressed before merging. I am unable to approve this pull request, and recommend that others review and approve this code before merging. At a minimum, the high severity issues should be addressed before merging.

@ASSERT-KTH deleted a comment from gemini-code-assist bot, Mar 11, 2025
@andre15silva (Member)

Thanks @cadddr !

I went over the code again and also ran the Gemini review.
Linting is failing; could you run the linter (there is a pre-commit hook for that now)?

Left some comments and waiting for the tests to run.

@gemini-code-assist (bot)

> Thanks @cadddr!

I'm just responding here to note that I don't have the ability to run the linter or take any actions on the pull request. I can only respond to comments and provide code suggestions. It looks like you're asking the author to run the linter locally, which is the correct approach.

@cadddr (Contributor, Author) commented Mar 11, 2025

> Thanks @cadddr !
>
> I went over the code again and also ran the Gemini review. Linting is failing; could you run the linter (there is a pre-commit hook for that now)?
>
> Left some comments and waiting for the tests to run.

@andre15silva sure, I can run linting and remove the FIXME comments. Tests are failing because the RunBugRun JSON file is not found. I noticed the download commands for it have been removed from setup.sh -- why is that? Should I simply roll back the file to the version that worked?

@gemini-code-assist (bot)

Hi @andre15silva,

Regarding the missing RunBugRun JSON file, it appears the download commands were removed from setup.sh. Restoring those commands should resolve the issue. Additionally, extracting the buggy_test_results.tgz archive in setup.sh will ensure the test results are available.

I noticed that the benchmark directory was changed to a submodule, but then commented out in .gitmodules. Is this intentional? If the intention is to use a submodule, the commented lines in .gitmodules should be uncommented, and the submodule should be properly initialized.

Here's a suggestion to add the extraction of the test results in setup.sh:

--- a/setup.sh
+++ b/setup.sh
@@ -22,3 +22,7 @@
 if [ -z "$CI" ]; then
  poetry run ./gitbug-java setup;
 fi
+
+cd benchmarks/run_bug_run;
+tar -xvzf buggy_test_results.tgz;
+cd ../..;

@cadddr (Contributor, Author) commented Mar 11, 2025

@andre15silva Data download commands for runbugrun were removed in this commit: cadddr@b83f855

Initially setting up benchmarks/run_bug_run as a submodule was a misunderstanding, as it linked to the original RunBugRun Ruby repo (rbugr), whereas the intention for this directory is simply to house the data files. Let me know how best to resolve this.

@cadddr (Contributor, Author) commented Mar 11, 2025

@andre15silva also, the submodule entry was already taken out here: cadddr@1622d83
Not sure why it was brought back. Could all that have been overwritten by the merge from ASSERT-KTH:master?

@andre15silva (Member) commented Mar 11, 2025

Could have been problems with the merge; not sure what happened.

We can fix setup.sh; the version in main doesn't have this issue.
As for the submodule question, if the files are not the same as the repo, then we can simply add them in a directory inside benchmarks.

WDYT?

@cadddr (Contributor, Author) commented Mar 12, 2025

@andre15silva made all changes, let's see those tests pass.

@cadddr (Contributor, Author) commented Mar 13, 2025

@andre15silva
Tests are failing due to file path issues. File not found here: '/home/runner/work/repairbench-framework/repairbench-framework/benchmarks/run_bug_run/python_valid0.jsonl'

When running setup, this file gets downloaded, but not to the right directory, because the preceding cd command fails:
./setup.sh: line 27: cd: benchmarks/run_bug_run: No such file or directory

This directory exists in the repo e.g., https://github.com/cadddr/elle-elle-aime/blob/master/benchmarks/run_bug_run/buggy_test_results.tgz

Is that because the setup for GitBug-Java forgets to exit from its directory? (cd benchmarks/gitbug-java;)

For me, pwd shows /workspaces/elle-elle-aime. I recall we had this issue before, e.g., #166 (comment).
How did you fix it?
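One way to keep setup.sh steps from leaking their working directory, assuming the GitBug-Java block is what leaves the script in benchmarks/gitbug-java (a sketch, not the fix that was actually applied):

# run each benchmark's setup in a subshell so its `cd` cannot affect
# the steps that follow
(
  cd benchmarks/gitbug-java
  poetry run ./gitbug-java setup
)
(
  cd benchmarks/run_bug_run
  tar -xvzf buggy_test_results.tgz
)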

@andre15silva self-requested a review, March 14, 2025 08:44
@cadddr (Contributor, Author) commented Mar 14, 2025

@andre15silva looks like all tests pass! Are we all set to merge?

@andre15silva (Member)

LGTM!

Let's wait for @t-sorger to take a look and give his opinion and then we can merge. He's also been working on integrating another Python benchmark.

@t-sorger (Collaborator)

The code generally looks good to me.

I do have some concerns about the Python utilities, as I ran this code previously with slight modifications while developing my own version. It might not be fully compatible with the BugsInPy benchmark, so it’s something to keep in mind when integrating BugsInPy to ensure the utilities don’t break for run-bug-run (and vice versa). That said, it should be fine for now.

@cadddr (Contributor, Author) commented Mar 18, 2025

Admittedly, I've put in the bare minimum needed just to make RunBugRun work. Agreed, the utilities will have to be generalized as we onboard more Python datasets, perhaps in the next PR. With this second dataset, BugsInPy, we should get a better idea of which parts are general vs. dataset-specific.

@andre15silva (Member)

Sounds good, merging it.

Good work @cadddr, thanks for the PR :))

@andre15silva merged commit 6e0e212 into ASSERT-KTH:master on Mar 19, 2025
3 checks passed