Adding Arena Hard Auto #65

asuvarna31 · 2025-01-28T04:07:17Z

Sample command:

python -m eval.eval \
    --model hf \
    --tasks arena_hard_auto \
    --model_args "pretrained=Qwen/Qwen1.5-1.8B-Chat" \
    --batch_size 2 \
    --output_path logs \
    --annotator_model gpt-4o-mini-2024-07-18

Sample output:

"results": {
    "score": 5.56,
    "avg_tokens": 635.0
}
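
For scripted runs, the same invocation can be assembled as an argument list, which avoids shell line-continuation mistakes. This is only a sketch: the helper name `build_eval_command` is hypothetical and not part of this PR, and it uses only the flags that appear in the sample command.

```python
import shlex


def build_eval_command(
    pretrained: str,
    annotator_model: str,
    batch_size: int = 2,
    output_path: str = "logs",
) -> list[str]:
    """Assemble the argv list for the sample command (hypothetical helper)."""
    return [
        "python", "-m", "eval.eval",
        "--model", "hf",
        "--tasks", "arena_hard_auto",
        "--model_args", f"pretrained={pretrained}",
        "--batch_size", str(batch_size),
        "--output_path", output_path,
        "--annotator_model", annotator_model,
    ]


cmd = build_eval_command("Qwen/Qwen1.5-1.8B-Chat", "gpt-4o-mini-2024-07-18")
# Print a shell-safe rendering of the command for logging or copy-paste.
print(shlex.join(cmd))
# To actually launch the evaluation: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string means each flag and value stays a separate argument, so no manual quoting of the `--model_args` value is needed.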


asuvarna31 · 2025-01-28T06:27:13Z

@neginraoof @RyanMarten feel free to review and test this.

neginraoof

Thanks a lot! Can you also update the reproduced_benchmarks.md with results?

neginraoof · 2025-01-29T05:18:08Z

eval/chat_benchmarks/arena_hard_auto/eval_instruct.py

+
+        ## save a leaderboard in leaderboard dir
+        subprocess.run(['python', '-m', 'eval.chat_benchmarks.arena_hard_auto.show_result', '--judge-name', f'{self.annotator_model}', '--output'])
+        df = pd.read_csv(f'eval/chat_benchmarks/arena_hard_auto/leaderboard/arena_hard_leaderboard_{self.annotator_model}.csv')


Thanks a lot @asuvarna31!
Can we avoid the read/write to file? It seems we can just use the input model_results here.
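
One way to read this suggestion, as a rough sketch only: aggregate the results already held in memory and return the summary dict directly, with no CSV round trip. The structure of `model_results` is not shown in this thread, so the record fields (`score`, `token_len`) and the function name below are hypothetical, and the plain mean is a placeholder aggregation, not the Arena Hard scoring procedure.

```python
from statistics import mean


def summarize_in_memory(model_results: list[dict]) -> dict:
    """Aggregate per-example records without writing or re-reading a file.

    Hypothetical schema: each record has a numeric "score" and "token_len".
    """
    if not model_results:
        raise ValueError("model_results is empty")
    return {
        # Placeholder aggregation: simple mean of per-example scores.
        "score": round(mean(r["score"] for r in model_results), 2),
        "avg_tokens": float(mean(r["token_len"] for r in model_results)),
    }


# Illustrative records only; these are not results from this PR.
example_results = [
    {"score": 4.0, "token_len": 600},
    {"score": 6.0, "token_len": 700},
]
print(summarize_in_memory(example_results))
```

Keeping the aggregation in memory also removes the dependency on a fixed leaderboard path, so concurrent evaluations cannot overwrite each other's intermediate files.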

Addressed your comments @neginraoof

asuvarna31 and others added 6 commits January 27, 2025 18:01

Squashed 'eval/chat_benchmarks/arena-hard-auto/' content from commit …

11b7a13

…8fd0c83 git-subtree-dir: eval/chat_benchmarks/arena-hard-auto git-subtree-split: 8fd0c83047b2ccebb4c5ea993a2beb3ced2ab003

Merge commit '11b7a138f2ced170235f76af7d832623d84148c0' as 'eval/chat…

4650291

…_benchmarks/arena-hard-auto'

arena hard auto working

e5ddfd6

remove old arena hard dir

b5989a6

updated code for results dict

8120ac1

Update README.md

3426dbe

neginraoof reviewed Jan 29, 2025

asuvarna31 added 3 commits February 3, 2025 15:32

removed leaderboard csv saving

37abfa9

Merge branch 'asuvarna31/arena-hard' of https://github.com/asuvarna31…

494bc6f

…/evalchemy into asuvarna31/arena-hard

lint

3ab2886
