Commit

ISPY first submission
jzySaber1996 committed Jul 10, 2021
1 parent c3bc8b4 commit 8ddd6f1
Showing 118 changed files with 144,017 additions and 1 deletion.
163 changes: 162 additions & 1 deletion README.md

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions data/README.md
@@ -0,0 +1,5 @@
# Dataset

Please download our dataset at: https://drive.google.com/drive/folders/1YlN6sszyo9vmbjWSQiMN1vr-hHpqSu4O?usp=sharing

The whole dataset includes the Gitter-based train/test set and open-sourced issue-solution pairs.
6,420 changes: 6,420 additions & 0 deletions data/proposed_data/materialize_ispair.txt

Large diffs are not rendered by default.

10,890 changes: 10,890 additions & 0 deletions data/proposed_data/springboot_ispair.txt

Large diffs are not rendered by default.

8,469 changes: 8,469 additions & 0 deletions data/proposed_data/webpack_ispair.txt

Large diffs are not rendered by default.

19,986 changes: 19,986 additions & 0 deletions data/result_data/angular_ispair.txt

Large diffs are not rendered by default.

4,218 changes: 4,218 additions & 0 deletions data/result_data/appium_ispair.txt

Large diffs are not rendered by default.

29,457 changes: 29,457 additions & 0 deletions data/result_data/deeplearning4j_ispair.txt

Large diffs are not rendered by default.

6,765 changes: 6,765 additions & 0 deletions data/result_data/docker_ispair.txt

Large diffs are not rendered by default.

10,197 changes: 10,197 additions & 0 deletions data/result_data/ethereum_ispair.txt

Large diffs are not rendered by default.

2,778 changes: 2,778 additions & 0 deletions data/result_data/gitter_ispair.txt

Large diffs are not rendered by default.

13,161 changes: 13,161 additions & 0 deletions data/result_data/nodejs_ispair.txt

Large diffs are not rendered by default.

17,001 changes: 17,001 additions & 0 deletions data/result_data/typescript_ispair.txt

Large diffs are not rendered by default.

Binary file added diagrams/answer_classification_only.png
Binary file added diagrams/application.png
Binary file added diagrams/application2.png
Binary file added diagrams/bad-case.png
Binary file added diagrams/baseline1.png
Binary file added diagrams/baseline2.png
Binary file added diagrams/cnn_attention_pa&-pa.png
Binary file added diagrams/component.png
Binary file added diagrams/component_result.png
Binary file added diagrams/dataset.png
Binary file added diagrams/example-conversation.png
Binary file added diagrams/issue-solution.png
Binary file added diagrams/issue_answer_classification.png
Binary file added diagrams/issue_answer_classification_multi.png
Binary file added diagrams/issue_classification.png
Binary file added diagrams/issue_classification_only.png
Binary file added diagrams/model-v5_00.png
213 changes: 213 additions & 0 deletions diagrams/plotablation.py
@@ -0,0 +1,213 @@
import xlrd
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np


def draw_ablation():
    # Note: xlrd >= 2.0 reads only .xls files, so opening this .xlsx workbook
    # requires xlrd < 2.0 (or switching to a reader such as openpyxl).
    workbook = xlrd.open_workbook('../data/result_data_new.xlsx')
    sheet = workbook.sheet_by_name('Ablation_study')
    # Columns 2-5 hold one model variant each; the first row is skipped.
    local_attn_data = sheet.col_values(2, 1, sheet.nrows)
    heu_data = sheet.col_values(3, 1, sheet.nrows)
    cnn_data = sheet.col_values(4, 1, sheet.nrows)
    richa_data = sheet.col_values(5, 1, sheet.nrows)

    # Rows cycle through six metrics: issue P/R/F1, then solution P/R/F1.
    pre_issue = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
                 in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 1]
    rec_issue = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
                 in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 2]
    f1_issue = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
                in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 3]
    pre_solution = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
                    in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 4]
    rec_solution = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
                    in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 5]
    f1_solution = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
                   in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 0]
    df_pre_issue = pd.DataFrame({'richa_localattn': [data[0] for data in pre_issue],
                                 'richa_heu': [data[1] for data in pre_issue],
                                 'richa_cnn': [data[2] for data in pre_issue],
                                 'richa': [data[3] for data in pre_issue]})
    df_rec_issue = pd.DataFrame({'richa_localattn': [data[0] for data in rec_issue],
                                 'richa_heu': [data[1] for data in rec_issue],
                                 'richa_cnn': [data[2] for data in rec_issue],
                                 'richa': [data[3] for data in rec_issue]})
    df_f1_issue = pd.DataFrame({'richa_localattn': [data[0] for data in f1_issue],
                                'richa_heu': [data[1] for data in f1_issue],
                                'richa_cnn': [data[2] for data in f1_issue],
                                'richa': [data[3] for data in f1_issue]})
    df_pre_solution = pd.DataFrame({'richa_localattn': [data[0] for data in pre_solution],
                                    'richa_heu': [data[1] for data in pre_solution],
                                    'richa_cnn': [data[2] for data in pre_solution],
                                    'richa': [data[3] for data in pre_solution]})
    df_rec_solution = pd.DataFrame({'richa_localattn': [data[0] for data in rec_solution],
                                    'richa_heu': [data[1] for data in rec_solution],
                                    'richa_cnn': [data[2] for data in rec_solution],
                                    'richa': [data[3] for data in rec_solution]})
    df_f1_solution = pd.DataFrame({'richa_localattn': [data[0] for data in f1_solution],
                                   'richa_heu': [data[1] for data in f1_solution],
                                   'richa_cnn': [data[2] for data in f1_solution],
                                   'richa': [data[3] for data in f1_solution]})
    x_data = ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8']
    # plt.plot(x_data, df_pre_issue.richa)
    # plt.plot(x_data, df_pre_issue.richa_localattn)
    fig = plt.figure()

    # Panel 1: issue precision (the only panel that carries legend labels).
    plt.subplot(231)
    plt.plot(x_data, list(df_pre_issue.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
             mfcalt='b', label='ISPY')
    plt.xticks([])
    plt.plot(x_data, list(df_pre_issue.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
             mfcalt='b', label='ISPY-LocalAttn')
    plt.plot(x_data, list(df_pre_issue.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
             mfcalt='b', label='ISPY-Heu')
    plt.plot(x_data, list(df_pre_issue.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
             markersize=4, mfcalt='b', label='ISPY-CNN')
    # plt.grid(axis='y', linestyle='-.')
    # plt.grid(axis='x', linestyle='-.')

    plt.ylabel('Issue-P', fontdict={'family': 'Times New Roman', 'size': 16})
    plt.ylim([0, 1])
    plt.yticks(fontproperties='Times New Roman', size=13)
    plt.xticks(fontproperties='Times New Roman', size=13)
    # print(stats.ttest_ind(df_pre_issue.richa_heu, df_pre_issue.richa_cnn))

    # Panel 2: issue recall.
    plt.subplot(232)
    plt.plot(x_data, list(df_rec_issue.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_rec_issue.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_rec_issue.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_rec_issue.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
             markersize=4, mfcalt='b')
    plt.xticks([])
    plt.yticks([])

    # plt.grid(axis='y', linestyle='-.')
    # plt.grid(axis='x', linestyle='-.')

    plt.ylabel('Issue-R', fontdict={'family': 'Times New Roman', 'size': 16})
    plt.ylim([0, 1])
    plt.yticks(fontproperties='Times New Roman', size=13)
    plt.xticks(fontproperties='Times New Roman', size=13)
    # print(stats.ttest_ind(df_rec_issue.richa, df_rec_issue.richa_cnn))
    # print(stats.ttest_ind(df_rec_issue.richa, df_rec_issue.richa_localattn))
    # print(stats.ttest_ind(df_rec_issue.richa_heu, df_rec_issue.richa_cnn))

    # Panel 3: issue F1, with t-tests of the full model against two variants.
    plt.subplot(233)
    plt.plot(x_data, list(df_f1_issue.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_f1_issue.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_f1_issue.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_f1_issue.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
             markersize=4, mfcalt='b')
    plt.xticks([])
    plt.yticks([])

    # plt.grid(axis='y', linestyle='-.')
    # plt.grid(axis='x', linestyle='-.')

    plt.ylabel('Issue-F1', fontdict={'family': 'Times New Roman', 'size': 16})
    plt.ylim([0, 1])
    plt.yticks(fontproperties='Times New Roman', size=13)
    plt.xticks(fontproperties='Times New Roman', size=13)
    print(stats.ttest_ind(df_f1_issue.richa, df_f1_issue.richa_cnn))
    print(stats.ttest_ind(df_f1_issue.richa, df_f1_issue.richa_heu))

    # Panel 4: solution precision.
    plt.subplot(234)
    plt.plot(x_data, list(df_pre_solution.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_pre_solution.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_pre_solution.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_pre_solution.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
             markersize=4, mfcalt='b')
    # plt.grid(axis='y', linestyle='-.')
    # plt.grid(axis='x', linestyle='-.')

    plt.ylabel('Solution-P', fontdict={'family': 'Times New Roman', 'size': 16})
    plt.ylim([0, 1])
    plt.yticks(fontproperties='Times New Roman', size=13)
    plt.xticks(fontproperties='Times New Roman', size=13)
    # print(stats.ttest_ind(df_pre_solution.richa_heu, df_pre_solution.richa_cnn))

    # Panel 5: solution recall.
    plt.subplot(235)
    plt.plot(x_data, list(df_rec_solution.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_rec_solution.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_rec_solution.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_rec_solution.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
             markersize=4, mfcalt='b')
    # plt.grid(axis='y', linestyle='-.')
    # plt.grid(axis='x', linestyle='-.')
    plt.yticks([])

    plt.ylabel('Solution-R', fontdict={'family': 'Times New Roman', 'size': 16})
    plt.ylim([0, 1])
    plt.yticks(fontproperties='Times New Roman', size=13)
    plt.xticks(fontproperties='Times New Roman', size=13)
    # print(stats.ttest_ind(df_rec_solution.richa_heu, df_rec_solution.richa_cnn))

    # Panel 6: solution F1, with t-tests of the full model against two variants.
    plt.subplot(236)
    plt.plot(x_data, list(df_f1_solution.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_f1_solution.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_f1_solution.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
             mfcalt='b')
    plt.plot(x_data, list(df_f1_solution.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
             markersize=4, mfcalt='b')
    # plt.grid(axis='y', linestyle='-.')
    # plt.grid(axis='x', linestyle='-.')
    plt.yticks([])

    plt.ylabel('Solution-F1', fontdict={'family': 'Times New Roman', 'size': 16})
    plt.ylim([0, 1])
    plt.yticks(fontproperties='Times New Roman', size=13)
    plt.xticks(fontproperties='Times New Roman', size=13)
    print(stats.ttest_ind(df_f1_solution.richa, df_f1_solution.richa_cnn))
    print(stats.ttest_ind(df_f1_solution.richa, df_f1_solution.richa_heu))
    fig.legend(loc='upper center', ncol=4, prop={'size': 13, 'family': 'Times New Roman'})
    plt.show()
    # print(df_pre)


def t_return():
    richa = [0.76, 0.77, 0.76, 0.75, 0.68, 0.71, 0.84, 0.74, 0.79, 0.77, 0.68, 0.72, 0.82, 0.73, 0.77, 0.80, 0.69, 0.74, 0.79, 0.70, 0.74, 0.86, 0.78, 0.82]
    nb = [0.36, 0.40, 0.38, 0.41, 0.30, 0.35, 0.47, 0.36, 0.41, 0.70, 0.56, 0.62, 0.08, 0.25, 0.13, 0.22, 0.42, 0.29, 0.30, 0.50, 0.37, 0.15, 0.40, 0.22]
    rf = [0.56, 0.25, 0.34, 0.69, 0.30, 0.42, 0.75, 0.23, 0.35, 0.84, 0.44, 0.58, 1.00, 0.17, 0.29, 0.50, 0.25, 0.33, 0.33, 0.13, 0.18, 0.23, 0.30, 0.26]
    gdbt = [0.27, 0.75, 0.40, 0.40, 0.70, 0.51, 0.50, 0.79, 0.61, 0.73, 0.44, 0.55, 0.21, 0.76, 0.33, 0.19, 0.67, 0.29, 0.30, 0.88, 0.44, 0.18, 0.90, 0.30]
    casper = [0.39, 0.35, 0.37, 0.08, 0.03, 0.05, 0.59, 0.26, 0.36, 0.46, 0.40, 0.43, 0.19, 0.42, 0.26, 0.14, 0.17, 0.15, 0.05, 0.06, 0.06, 0.15, 0.40, 0.22]
    cnc = [0.20, 0.55, 0.29, 0.23, 0.50, 0.32, 0.23, 0.36, 0.28, 0.12, 0.32, 0.17, 0.24, 0.42, 0.30, 0.12, 0.42, 0.19, 0.10, 0.50, 0.17, 0.05, 0.40, 0.10]
    deca = [0.33, 0.50, 0.40, 0.28, 0.37, 0.31, 0.33, 0.36, 0.34, 0.64, 0.28, 0.39, 0.42, 0.42, 0.42, 0.44, 0.67, 0.53, 0.32, 0.50, 0.39, 0.04, 0.10, 0.06]

    baselines = {'nb': nb, 'rf': rf, 'gdbt': gdbt, 'casper': casper, 'cnc': cnc, 'deca': deca}
    for baseline in baselines.keys():
        data_temp = baselines[baseline]
        richa_pre = [ric_value for i, ric_value in enumerate(richa) if (i + 1) % 3 == 1]
        richa_rec = [ric_value for i, ric_value in enumerate(richa) if (i + 1) % 3 == 2]
        richa_f1 = [ric_value for i, ric_value in enumerate(richa) if (i + 1) % 3 == 0]

        base_pre = [base_value for i, base_value in enumerate(data_temp) if (i + 1) % 3 == 1]
        base_rec = [base_value for i, base_value in enumerate(data_temp) if (i + 1) % 3 == 2]
        base_f1 = [base_value for i, base_value in enumerate(data_temp) if (i + 1) % 3 == 0]
        data_t = stats.ttest_ind(richa_f1, base_f1)
        print(data_t)


if __name__ == '__main__':
    # t_return()
    draw_ablation()
14 changes: 14 additions & 0 deletions disentanglement/src/LICENSE-src.txt
@@ -0,0 +1,14 @@
Copyright (c) 2018, Jonathan K Kummerfeld <[email protected]>

Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.

139 changes: 139 additions & 0 deletions disentanglement/src/README.md
@@ -0,0 +1,139 @@
# System

This folder contains code for reproducing our disentanglement experiments.

## Requirements

The only dependency is the [DyNet library](http://dynet.readthedocs.io), which can usually be installed with:

```
pip3 install dynet
```

## Running

To see all options, run:

```
python3 disentangle.py --help
```

### Train

To train, provide the `--train` argument followed by a series of filenames.

The example command below will train a model with the same parameters as used in the ACL paper.
The model is a feedforward neural network with 2 layers, 512 dimensional hidden vectors, and softsign non-linearities.

```
python3 disentangle.py \
example-train \
--train ../data/train/*annotation.txt \
--dev ../data/dev/*annotation.txt \
--hidden 512 \
--layers 2 \
--nonlin softsign \
--word-vectors ../data/glove-ubuntu.txt \
--epochs 20 \
--dynet-autobatch \
--drop 0 \
--learning-rate 0.018804 \
--learning-decay-rate 0.103 \
--seed 10 \
--clip 3.740 \
--weight-decay 1e-07 \
--opt sgd \
> example-train.out 2>example-train.err
```
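
As a rough illustration of the architecture named above (two hidden layers with softsign non-linearities), here is a minimal pure-Python sketch of a forward pass. It is not the DyNet code in `disentangle.py`: the helper names (`softsign`, `dense`, `init_layer`, `forward`), the weight initialisation, and the small layer width are assumptions chosen to keep the example short.

```python
import random


def softsign(x: float) -> float:
    """Softsign non-linearity: x / (1 + |x|), bounded in (-1, 1)."""
    return x / (1.0 + abs(x))


def dense(inputs, weights, biases):
    """One fully connected layer followed by softsign."""
    return [
        softsign(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialisation is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, layers):
    """Pass a feature vector through every hidden layer in turn."""
    hidden = features
    for weights, biases in layers:
        hidden = dense(hidden, weights, biases)
    return hidden


rng = random.Random(10)
HIDDEN = 8  # the command above uses --hidden 512; kept small here
layers = [init_layer(4, HIDDEN, rng), init_layer(HIDDEN, HIDDEN, rng)]  # --layers 2
output = forward([0.5, -1.0, 2.0, 0.0], layers)
```

Because softsign saturates polynomially rather than exponentially, every hidden activation stays strictly inside (-1, 1) regardless of the input scale.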

### Infer

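This command runs the model trained above on the files matched by `--test`, here a local copy of Gitter chat logs, with the range set by `--test-start` and `--test-end`: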

```
python3 disentangle.py \
angual_angular.1 \
--model example-train.dy.model \
--test /home/yuminz/gitter_chatmessage/angual_angular/*ascii* \
--test-start 0 \
--test-end 5000 \
--hidden 512 \
--layers 2 \
--nonlin softsign \
--word-vectors ../data/glove-ubuntu.txt \
> angual_angular.1.out 2>angual_angular.1.err
```

Note - the arguments defining the network (`--hidden`, `--layers`, `--nonlin`) must match those given in training.

### Evaluate

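This command runs inference output through the evaluation script. It reads `example-run.1.out` and compares it against the gold annotations in `../data/dev/`, so it assumes an inference run on the development set saved under that name, not the `angual_angular.1.out` file written above: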

```
python3 ../tools/evaluation/graph-eval.py --gold ../data/dev/*annotation* --auto example-run.1.out
```

The output should be something like:

```
g/a/m: 2607 2500 1855
p/r/f: 74.2 71.2 72.6
```

The first line is a count of the gold links, auto (predicted) links, and matching links.
The second line is the precision, recall, and F-score, as percentages.
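
The relationship between the two lines can be sketched as follows. This is not the code of `graph-eval.py`; the function name `precision_recall_f` is hypothetical, and the F-score is assumed to be the balanced F1, which reproduces the sample values shown above.

```python
def precision_recall_f(gold: int, auto: int, match: int):
    """Derive precision, recall, and balanced F-score from link counts."""
    precision = match / auto if auto else 0.0   # matching / predicted links
    recall = match / gold if gold else 0.0      # matching / gold links
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Counts taken from the sample "g/a/m" line above.
p, r, f = precision_recall_f(gold=2607, auto=2500, match=1855)
print(f"p/r/f: {100 * p:.1f} {100 * r:.1f} {100 * f:.1f}")
# prints "p/r/f: 74.2 71.2 72.6"
```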

Note - the values in the paper are an average over 10 runs, so they will differ slightly from what you get here.

### Running on a file

If you want to apply a model to a file, see this script for an example of how to do it: `example-running.sh`.
The script is set up so someone could call it like so (once the necessary placeholders in the script are set):

    ./disentangle-file.sh < sample.ascii.txt > sample.links.txt

## Ensemble

For the best results, we used a simple ensemble of multiple models.
We trained 10 models as described above, but with different random seeds (1 through to 10).
We combined their output using the `majority_vote.py` script in this directory.
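
One way to produce the ten training commands is to generate them from the example command in the Train section, changing only `--seed`. The sketch below only builds and prints shell command strings; it does not run them. The per-seed run name `example-train.<seed>` and the helper `training_command` are assumptions, not names used by the repository.

```python
# Flags copied from the training example in this README; only --seed varies.
FIXED_FLAGS = (
    "--train ../data/train/*annotation.txt "
    "--dev ../data/dev/*annotation.txt "
    "--hidden 512 --layers 2 --nonlin softsign "
    "--word-vectors ../data/glove-ubuntu.txt "
    "--epochs 20 --dynet-autobatch --drop 0 "
    "--learning-rate 0.018804 --learning-decay-rate 0.103 "
    "--clip 3.740 --weight-decay 1e-07 --opt sgd"
)


def training_command(seed: int) -> str:
    """Build the shell command for one ensemble member."""
    run_name = f"example-train.{seed}"  # hypothetical naming scheme
    return (
        f"python3 disentangle.py {run_name} {FIXED_FLAGS} "
        f"--seed {seed} > {run_name}.out 2>{run_name}.err"
    )


# Seeds 1 through 10, as described above.
commands = [training_command(seed) for seed in range(1, 11)]
for command in commands:
    print(command)
```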

The same script is used for all three ensemble methods, with slightly different input and arguments:

Union
```
./majority_vote.py example-run*graphs --method union > example-run.combined.union
```

Vote
```
./majority_vote.py example-run*graphs --method vote > example-run.combined.vote
```

Intersect
```
./majority_vote.py example-run*clusters --method intersect > example-run.combined.intersect
```

All of these assume the output files have been converted into our graph format.
Assuming you run `disentangle.py` as above and save the output of each run as `example-run.1.out`, `example-run.2.out`, `example-run.3.out`, etc., this command will use one of our tools to convert them to the graph format:
```
for name in example-run*out ; do ../tools/format-conversion/output-from-py-to-graph.py < $name > $name.graphs ; done
```

The intersect method also assumes they have been made into clusters, like this:
```
for name in example-run*out ; do ../tools/format-conversion/graph-to-cluster.py < $name.graphs > $name.clusters ; done
```
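
To give an intuition for how the three methods differ, the sketch below combines sets of predicted links from several runs. It is a simplified illustration and not the logic of `majority_vote.py`: in particular, the real intersect method operates on clusters rather than links, and the function name `combine_links`, the link representation, and the strict-majority threshold are all assumptions.

```python
from collections import Counter


def combine_links(runs, method):
    """Combine per-run sets of (message, antecedent) links.

    union:     keep a link if any run predicts it
    vote:      keep a link if more than half of the runs predict it
    intersect: keep a link only if every run predicts it
    """
    runs = [set(run) for run in runs]
    if not runs:
        return set()
    if method == "union":
        return set().union(*runs)
    if method == "intersect":
        return set.intersection(*runs)
    if method == "vote":
        counts = Counter(link for run in runs for link in run)
        return {link for link, n in counts.items() if n > len(runs) / 2}
    raise ValueError(f"unknown method: {method}")


# Three toy runs; each link is (message index, index of the message it replies to).
runs = [
    {(3, 1), (4, 3)},
    {(3, 1), (4, 2)},
    {(3, 1), (4, 3), (5, 4)},
]
union_links = combine_links(runs, "union")
vote_links = combine_links(runs, "vote")
intersect_links = combine_links(runs, "intersect")
```

In this toy example the union keeps four links, the vote keeps the two links predicted by at least two of the three runs, and the intersection keeps only the single link all runs agree on.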

Note: An earlier version of the steps above didn't account for a change in the output of the main system. Apologies for the broken output this would have caused.

## C++ Model

As well as the main Python code, we also wrote a model in C++ that was used for DSTC 7 and the results in the 2018 arXiv version of the paper (the Python version was used for DSTC 8 and the 2019 ACL paper).
The Python model has additional input features and a different text representation method.
The C++ model has support for a range of additional variations in both inference and modeling, which did not appear to improve performance.
For details on how to build and run the C++ code, see [this page](./old-cpp-version/).

[Go back](./../) to the main webpage.