More Details on SubEval

Terminology in SubEval Context

Options

We formulate the evaluation as a multiple-choice problem and support three settings.

  • 2 options ->
    Please choose from the following 2 options based on the scoring criteria:
    A. Response 1 is better than Response 2.
    B. Response 2 is better than Response 1.

  • 3 options ->
    Please choose from the following 3 options based on the scoring criteria:
    A. Response 1 is better than Response 2.
    B. Response 2 is better than Response 1.
    C. Both Response 1 and Response 2 are good.

  • 4 options ->
    Please choose from the following 4 options based on the scoring criteria:
    A. Response 1 is good; Response 2 is not good.
    B. Response 2 is good; Response 1 is not good.
    C. Both Response 1 and Response 2 are good.
    D. Neither Response 1 nor Response 2 is good.

Consistency

A previous study has shown that the order in which the responses are presented to the LLM judge significantly affects the judge's preference, a phenomenon known as position bias (https://doi.org/10.48550/arXiv.2306.05685). To account for this, we test whether the evaluation results remain consistent when Response 1 and Response 2 are swapped. A consistent preference must then be one of the four combinations:

  1. (A,B)
  2. (B,A)
  3. (C,C)
  4. (D,D)

where the first element refers to the choice under the original order and the second element refers to the choice after the order is swapped. Both responses are presented to the LLM judge blindly, meaning the judge does not know which models generated them.

If all preferences from the judge are consistent, the output won't include this information; on the other hand, if some preferences are inconsistent, a consistency rate will be shown in the output log.txt.
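To make the rule concrete, the check boils down to a small lookup over the four combinations above. The sketch below is illustrative only; the function name is not taken from the repository:

```python
# Minimal sketch of the consistency rule described above (illustrative only).
# `original` is the judge's choice with the original response order;
# `swapped` is the choice after Response 1 and Response 2 are exchanged.

CONSISTENT_PAIRS = {("A", "B"), ("B", "A"), ("C", "C"), ("D", "D")}

def is_consistent(original: str, swapped: str) -> bool:
    """Return True if the two judgments agree once the swap is accounted for."""
    return (original, swapped) in CONSISTENT_PAIRS

# The judge prefers Response 1 in both orderings -> consistent.
assert is_consistent("A", "B")
# The judge always prefers whichever response is shown first -> position bias.
assert not is_consistent("A", "A")
```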

Evaluation on the Software Design

Data preparation

Prepare the data (DataFrame) that includes the following columns:

  1. questions:

    • The questions posed to the LLMs to generate answers.
    • Should be in a SINGLE string for each cell.
  2. index:

    • Unique for each row.
    • In a format of {project_name}_{task}.
    • e.g., idcenter_uml_class-general.
  3. evaluating_guidance:

    • The detailed content of the corresponding {task}.
    • See evaluating_guidance for more of our predefined metrics.
    • Should be in a SINGLE string for each cell.
  4. task:

    • In a format of {software_design_file}-{metric}.
    • {software_design_file} in DevBench's context includes uml_class, UML_sequence, and architecture_design.
    • {metric} in DevBench's context includes general and faithfulness, but feel free to add your own evaluation metrics.
  5. reference_answer(optional):

    • The annotated/reference ground truth answers (usually manually annotated) to the questions. You may also choose a SOTA model's response as the reference_answer.
    • This column may be left empty, but it must be included for correct parsing.
    • For DevBench, we leave it blank, trusting the LLM's abilities.
  6. answer-{model}:

    • The answer/response generated by a certain {model} for the given questions.
    • Should be in a SINGLE string for each cell.
  7. Store the dataframe to examples/{df_name}.xlsx

Note: If some columns contain paths rather than content, we provide the load_file_content helper function (in smp.py) to load the content from the given paths. To use it, simply include --fill-contents {col1} {col2} ... in the script.
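For illustration, a minimal input file could be assembled with pandas as follows. All cell contents, the model column names (answer-gpt-4, answer-claude), and the file name are placeholders, not values required by SubEval:

```python
import pandas as pd  # writing .xlsx additionally requires openpyxl

# Placeholder DataFrame with the columns described above.
df = pd.DataFrame({
    "index": ["idcenter_uml_class-general"],          # {project_name}_{task}
    "questions": ["<question posed to the LLMs, as a single string>"],
    "evaluating_guidance": ["<full text of the 'general' scoring criteria>"],
    "task": ["uml_class-general"],                     # {software_design_file}-{metric}
    "reference_answer": [""],                          # left blank for DevBench
    # One column per model to be compared (column names here are examples).
    "answer-gpt-4": ["<response generated by gpt-4>"],
    "answer-claude": ["<response generated by claude>"],
})

# Store the DataFrame where the evaluation script expects it.
df.to_excel("examples/my_design_eval.xlsx", index=False)
```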

Output

Old version win rate

We use the win rate for only consistent evaluations as the main metric for evaluating the quality of model-generated responses in the old version.

  • 2 options

    • win_rate = win / (win + lose)
  • 3 options

    • win_both_good_rate = (win + both_good) / (win + both_good + lose)
  • 4 options

    • win_both_good_rate = (win + both_good) / (win + both_good + lose)
    • win_half_tie_rate = (win + (both_good + both_fail) / 2) / (win + both_good + both_fail + lose)
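As a concrete illustration, the formulas above amount to the following (the counts are made up and the snippet is not taken from the repository):

```python
# Hypothetical counts of consistent judgments for one model pair.
win, lose, both_good, both_fail = 30, 10, 15, 5

# 2 options
win_rate = win / (win + lose)

# 3 and 4 options
win_both_good_rate = (win + both_good) / (win + both_good + lose)

# 4 options only: ties (both good or both fail) count as half a win.
win_half_tie_rate = (win + (both_good + both_fail) / 2) / (win + both_good + both_fail + lose)

print(f"win_rate={win_rate:.2f}, "
      f"win_both_good_rate={win_both_good_rate:.2f}, "
      f"win_half_tie_rate={win_half_tie_rate:.2f}")
```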

Old version output files

The output files generated by running subeval/subjective/sub_eval.py are specified as below:

  • log.txt is the integrated output file that contains the results of:
    • length_stats
    • Total number of comparisons
    • Failed and non-exact (responses that are not exactly the same) comparisons
    • Extraction number and rate (the answer extracted from the {judge}'s choice); ideally this should be 100% (all extracted successfully).
    • Consistency Rate if there are inconsistent pairs of evaluation
    • win+bothgood results (win_both_good_rate)
    • win_halfdraw results (win_half_tie_rate)
  • record_{judge}_{nopt}.tsv is the detailed output of the evaluation for each pair of responses. It includes the following columns:
    • Cmp_index: In a format of {index};{model A};{model B}.
    • Question: Same as input.
    • Answer1: Response generated by {model A}.
    • Answer2: Response generated by {model B}.
    • A: Model name of {model A}.
    • B: Model name of {model B}.
    • Reference_answer: Same as input.
    • Evaluating_guidance: Same as input.
    • Task: Same as input.
    • {judge}: The evaluation choice made by the judge, in the following format (see the extraction sketch after this list):
      Choice: A (or B or C or D)
      Reason:
      1. xxxxxx
      2. xxxxxx
      ......
      
  • tmp.pkl keeps track of temporary progress during evaluation. If the evaluation process is interrupted, the script reads tmp.pkl and reruns only the experiments that haven't been evaluated yet, avoiding redundant work.
  • length_stats.csv stores the mean and standard deviation of all response lengths of the input models.
  • win+bothgood.xlsx and win+halfdraw.xlsx record the win rates of all models against refm. If you limit the number of options nopt to 2 (only A is better or B is better), the two files will contain the same results.
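For reference, pulling the letter out of a verdict in the Choice/Reason format shown above could look roughly like the sketch below. The function is illustrative; the actual extraction logic in sub_eval.py may differ:

```python
import re
from typing import Optional

def extract_choice(judge_output: str) -> Optional[str]:
    """Extract the single-letter choice from a 'Choice: X' line, if present.

    Illustrative only: sub_eval.py's real parsing may be stricter or more lenient.
    """
    match = re.search(r"Choice:\s*([ABCD])\b", judge_output)
    return match.group(1) if match else None

verdict = "Choice: B\nReason:\n1. Response 2 covers all required classes.\n2. ..."
print(extract_choice(verdict))  # -> B
```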

Note:

  • Win rates in this output are calculated only for consistent pairs of responses. If all evaluations are consistent, you can stop here (e.g., the results of running ./scripts/run_example.sh - output/DevBench_projects_example_infer_input_2680_record0_gpt-4-1106-preview_2).
  • However, there will usually be some inconsistent evaluations, so we also provide a new version of the win-rate calculation (see below) that treats inconsistent evaluations as ties. The DevBench SubEval statistics are based on the new version.
  • Important: As you may have discovered, the name of the default output directory does NOT record the response-generating models or the reference model you chose. Hence, running experiments on the same data, judge, and nopt but with different models or refm will overwrite the previous experiment results. We therefore highly recommend renaming the output directories to suit your needs after the evaluation finishes, to avoid conflicts and the loss of costly results.

New version win rate

As mentioned above, we provide a new version of the win-rate calculation that regards inconsistent evaluations (AA or BB, if nopt is 2) as a "tie" for both responses. From our observations, position bias usually occurs when the qualities of the two given responses are nearly identical, and this phenomenon is not limited to LLM judges - it was actually first observed in human evaluators! In this sense, regarding inconsistent evaluations as a "tie" is reasonable and even more comprehensive than calculating win rates only on the consistent pairs. We highly recommend this approach for the nopt 2 case.

In this new version, we provide the win-rate calculation both with and without the "tie".
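To make the difference concrete, here is a minimal sketch of the two variants, with made-up tallies. Counting a tie as half a win mirrors the old version's win_half_tie_rate; the exact formula used in calculate_winrate_new.py may differ:

```python
# Hypothetical tallies for Model A vs. Model B over all software design files.
win, lose = 40, 20   # consistent pairs where Model A wins / loses
tie = 15             # inconsistent pairs, treated as ties in the new version

# Without tie: only consistent pairs enter the calculation.
win_rate_without_tie = win / (win + lose)

# With tie: each tie contributes half a win (one common convention).
win_rate_with_tie = (win + tie / 2) / (win + lose + tie)

print(f"without tie: {win_rate_without_tie:.2f}, with tie: {win_rate_with_tie:.2f}")
```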

Note: unlike the old version, which calculates win rates for each task and metric, the new version calculates win rates over all software design files (i.e., uml_class, UML_sequence, architecture_design). Hence, even the "without-tie" calculation differs from that of the old version.

To calculate the new version win rate, please revise the parameters of the main function in subeval/subjective/calculate_winrate_new.py to fit your needs.

Then run the following script from the top-level directory:

python3 ./subeval/subjective/calculate_winrate_new.py

If you specify a directory to save the calculation results, two CSV files will be saved:

  • win_rate_with_tie.csv
  • win_rate_without_tie.csv

with columns:

  • Metric;Model: Metric refers to either general or faithfulness. Model A and Model B are separated by ;.
  • Winrate: the win rate of Model A against Model B.