This repository is the replication package of the research work "Using Pre-trained Models to Boost Code Review Automation".
In our work we trained several T5 models to automate three code-review tasks, each using a specific dataset. Here we provide everything needed to replicate our experiments, as well as all the raw data generated while running them (e.g., the predictions generated by the models).
There are two ways of replicating our results:
- Use our fine-tuned models to generate new predictions;
- Train your own models from scratch.
For the second option (train your own models) you will need a Google Colab Pro account and a Google Cloud Storage account (details follow).
In the `code` folder we provide:

- the Google Colab notebooks we used to run our experiments:
  - `Preprocessing.ipynb`: preprocess the pre-training dataset and train the SentencePiece tokenizer;
  - `PreTraining.ipynb`: pre-train the T5 model;
  - `FineTuning.ipynb`: fine-tune the T5 models on the different tasks.
- `Analyzer.py`, `Cleaner.py`: the two main Python classes we used to preprocess the fine-tuning dataset. In particular, the function `isCommentRelevant(...)` contained in the `Cleaner` class (line 1129) encloses the updated heuristic to detect irrelevant comments.
- `utils`: a folder containing some useful resources used during the fine-tuning data preprocessing.
- `manual analysis.xlsx`: contains the results of the manual analysis we performed on a set of non-perfect predictions (see the paper for details).
- `perfect_predictions.zip`: for convenience, we stored the perfect predictions generated by our models at k=1 (i.e., the model is allowed to generate a single prediction) in HTML format. Use these files if you want to have a quick look at the correct predictions generated by the models. All generated predictions are instead available in `results.zip`.
Here we stored the extra materials you need in order to replicate our experiments:
- `automating_code_review.zip`: contains all the material needed to successfully run our Google Colab notebooks (see the section "Train your T5 models" for more details).
- `datasets.zip`: contains all the processed and split datasets we used:
  - pre-training
    - `pre-training.tsv`
  - fine-tuning
    - new_large
      - code-to-code
        - `test.tsv`, `train.tsv`, `val.tsv`
      - code-to-comment
        - `test.tsv`, `train.tsv`, `val.tsv`
      - code&comment-to-code
        - `test.tsv`, `train.tsv`, `val.tsv`
    - Tufano_etal_ICSE21
      - code-to-code
        - `test.tsv`, `train.tsv`, `val.tsv`
      - code&comment-to-code
        - `test.tsv`, `train.tsv`, `val.tsv`
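For a quick look at the data, the fine-tuning splits can be inspected with pandas. This is a minimal sketch assuming each `.tsv` row holds a tab-separated input/target pair with no header row; check the actual files to confirm the layout:

```python
# Quick inspection of a fine-tuning split. The two-column, header-less
# layout and the column names are assumptions -- verify against the files.
import pandas as pd

df = pd.read_csv(
    "datasets/fine-tuning/new_large/code-to-code/train.tsv",
    sep="\t",
    header=None,
    names=["source", "target"],  # hypothetical column names
)
print(df.shape)
print(df.head())
```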
- `generate_predictions.zip`: contains the scripts needed to successfully generate predictions using a T5 model checkpoint (see the section "Use our fine-tuned T5 models" for more details).
- `models.zip`: contains the (best) checkpoints of our T5 models (pre-trained or not), for all the tasks (code-to-code, code-to-comment, code&comment-to-code) and both the datasets (new_large_dataset, Tufano_etal_dataset) we used. We also stored the checkpoint of the pre-trained model without any fine-tuning. The content of the `models` folder is:
  - T5_non_pre-trained_new_large_dataset_code-to-code
  - T5_non_pre-trained_new_large_dataset_code-to-comment
  - T5_non_pre-trained_new_large_dataset_code&comment-to-code
  - T5_non_pre-trained_Tufano_etal_dataset_code-to-code
  - T5_non_pre-trained_Tufano_etal_dataset_code&comment-to-code
  - T5_pre-trained
  - T5_pre-trained_new_large_dataset_code-to-code
  - T5_pre-trained_new_large_dataset_code-to-comment
  - T5_pre-trained_new_large_dataset_code&comment-to-code
  - T5_pre-trained_Tufano_etal_dataset_code-to-code
  - T5_pre-trained_Tufano_etal_dataset_code&comment-to-code
- `tokenizer.zip`: contains the SentencePiece tokenizer and the extracted vocabulary obtained by training on our pre-training dataset: `TokenizerModel.model`, `TokenizerModel.vocab`.
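Once extracted, the tokenizer can be loaded with the `sentencepiece` Python package; for example:

```python
# Load the trained SentencePiece tokenizer and encode a small example.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="TokenizerModel.model")
print(sp.encode("public void foo ( ) { }", out_type=str))  # subword pieces
```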
- `results.zip`: contains, for each dataset (new_large_dataset, Tufano_etal_dataset), the results obtained by each model (pre-trained or not) fine-tuned on each task (code-to-code, code-to-comment, code&comment-to-code). In particular, for each combination of dataset and model we share the following files:
  - `source.txt`: input file for the model;
  - `target.txt`: target file (expected output);
  - `predictions_<k>.txt`: generated predictions with BEAM_SIZE = k (k = 1, 3, 5, 10);
  - `code_bleu_<k>.txt` or `bleu_<k>.txt`: codeBLEU or BLEU scores (depending on the task) with BEAM_SIZE = k (k = 1, 3, 5, 10);
  - `confidence_<k>.txt`: confidence scores with BEAM_SIZE = k (k = 1, 3, 5, 10).
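As an example of how these files can be consumed, here is a minimal sketch that computes the perfect-prediction rate at k=1 by exact match. It assumes `predictions_1.txt` holds one prediction per line, aligned with `target.txt`; the results path is a placeholder:

```python
# Minimal sketch: perfect-prediction rate at k=1, by exact string match.
# Assumes one prediction per line, aligned with target.txt (placeholder path).
results_dir = "results/new_large_dataset/T5_pre-trained/code-to-code"

with open(f"{results_dir}/target.txt") as t, open(f"{results_dir}/predictions_1.txt") as p:
    targets = [line.strip() for line in t]
    predictions = [line.strip() for line in p]

perfect = sum(t == p for t, p in zip(targets, predictions))
print(f"Perfect predictions: {perfect}/{len(targets)} ({perfect / len(targets):.2%})")
```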
## Use our fine-tuned T5 models

In order to generate predictions with our models you need:

- the model checkpoints stored in `models.zip`;
- the content of the archive `generate_predictions.zip`;
- the datasets stored in `datasets.zip`.
The folder `generate_predictions` stores all the code needed to generate the predictions of the T5 models with different beam sizes and to evaluate them in terms of perfect predictions and codeBLEU (code-to-code and code&comment-to-code tasks) or BLEU (code-to-comment task) score.
First, you need to convert the TensorFlow checkpoint to PyTorch. To do that, run the following command from the `generate_predictions` folder:

```
python3 ./tf_2_pytorch_T5.py --tf_checkpoint_path <model_path> --config_file ./config.json --pytorch_dump_path ./dumps
```
where `<model_path>` is the path of the checkpoint you want to use. For example, to generate the predictions for the code-to-code task on the new_large_dataset using the pre-trained T5 model, run:

```
python3 ./tf_2_pytorch_T5.py --tf_checkpoint_path ../models/T5_pre-trained_new_large_dataset_code-to-code/model.ckpt-best --config_file ./config.json --pytorch_dump_path ./dumps
```
In the Python script `generate_predictions/generate_predictions.py`, set the beam size (line 45), the task of interest (line 47), and the path to the right dataset (line 48). For example:

```python
beam_size = 1
batch_size = 64
task = 'code2code: '  # possible options: 'code2code: ', 'code&comment2code: ', 'code2comment: '
data_dir = "../dataset/fine-tuning/new_large/code-to-code/"
```
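Once these variables are set, the script can be run from the `generate_predictions` folder; we assume it takes no further arguments, since the configuration above is hard-coded:

```
python3 ./generate_predictions.py
```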
The output is a text file, `predictions_k.txt` (where k = beam_size), stored in the same dataset folder and containing all the generated predictions.
To evaluate the generated predictions in terms of perfect predictions and codeBLEU or BLEU score, run one of the Python scripts `generate_predictions/for_codeBLEU.py` or `generate_predictions/for_BLEU.py`, after setting the right paths to the target file, the predictions file, and the folder where to store the results (lines 69-71 or 17-19, respectively).
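If you only want a rough sanity check of the code-to-comment predictions without running the repo scripts, a corpus BLEU can be computed with NLTK. Note that this is not the exact metric implementation used by `for_BLEU.py`, and whitespace tokenization is an assumption:

```python
# Rough sanity check, NOT the repository's metric script: corpus BLEU over
# whitespace-tokenized targets and predictions.
from nltk.translate.bleu_score import corpus_bleu

with open("target.txt") as t, open("predictions_1.txt") as p:
    references = [[line.split()] for line in t]  # one reference per instance
    hypotheses = [line.split() for line in p]

print(f"BLEU: {corpus_bleu(references, hypotheses):.4f}")
```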
## Train your T5 models

To train the T5 models we used the Google Colab service. To replicate our training you will need a Google Colab Pro account and a Google Cloud Storage (GCS) account. Once you have your GCS account, you need to set up a new bucket; please follow the guide provided by Google.
In your GCS bucket, upload the content of the archive `automating_code_review.zip`. In it we stored our datasets, our pre-trained model, our SentencePiece tokenizer, and some other utilities needed to replicate our work. Moreover, we kept the same structure as our bucket to facilitate the use of the Colab notebooks.
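With the Google Cloud SDK installed, the upload can be done with `gsutil`; here we assume the archive was extracted into a local `automating_code_review` folder, and the bucket name is a placeholder:

```
gsutil -m cp -r automating_code_review/* gs://<your-bucket>/
```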
Once everything is set up you can:

- pre-train a T5 model from scratch on our pre-training dataset, following the `PreTraining.ipynb` notebook;
- fine-tune a T5 model (with or without pre-training) on one of the downstream tasks (code-to-code, code-to-comment, code&comment-to-code), using our datasets and following the `FineTuning.ipynb` notebook.
We also provide a notebook (`Preprocessing.ipynb`) with all the preprocessing steps we followed to prepare our pre-training dataset and to train the SentencePiece tokenizer on it.
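The notebooks include their own setup cells, but for reference, authenticating a Colab runtime against a GCS bucket typically looks like this (the bucket path is a placeholder):

```python
# Typical Colab authentication for GCS access; the notebooks contain
# equivalent setup cells. The bucket path below is a placeholder.
from google.colab import auth

auth.authenticate_user()  # interactive prompt to grant access

BUCKET = "gs://<your-bucket>/automating_code_review"  # placeholder
```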