ISPY first submission

jzySaber1996 · Jul 10, 2021 · 8ddd6f1
1 parent c3bc8b4
commit 8ddd6f1

Showing 118 changed files with 144,017 additions and 1 deletion.
diff --git a/README.md b/README.md
diff --git a/data/README.md b/data/README.md
@@ -0,0 +1,5 @@
+# Dataset
+
+Please download our dataset at: https://drive.google.com/drive/folders/1YlN6sszyo9vmbjWSQiMN1vr-hHpqSu4O?usp=sharing
+
+The whole dataset includes the Gitter-based train/test set and open-sourced issue-solution pairs.
diff --git a/data/proposed_data/materialize_ispair.txt b/data/proposed_data/materialize_ispair.txt
diff --git a/data/proposed_data/springboot_ispair.txt b/data/proposed_data/springboot_ispair.txt
diff --git a/data/proposed_data/webpack_ispair.txt b/data/proposed_data/webpack_ispair.txt
diff --git a/data/result_data/angular_ispair.txt b/data/result_data/angular_ispair.txt
diff --git a/data/result_data/appium_ispair.txt b/data/result_data/appium_ispair.txt
diff --git a/data/result_data/deeplearning4j_ispair.txt b/data/result_data/deeplearning4j_ispair.txt
diff --git a/data/result_data/docker_ispair.txt b/data/result_data/docker_ispair.txt
diff --git a/data/result_data/ethereum_ispair.txt b/data/result_data/ethereum_ispair.txt
diff --git a/data/result_data/gitter_ispair.txt b/data/result_data/gitter_ispair.txt
diff --git a/data/result_data/nodejs_ispair.txt b/data/result_data/nodejs_ispair.txt
diff --git a/data/result_data/typescript_ispair.txt b/data/result_data/typescript_ispair.txt
diff --git a/diagrams/answer_classification_only.png b/diagrams/answer_classification_only.png
diff --git a/diagrams/application.png b/diagrams/application.png
diff --git a/diagrams/application2.png b/diagrams/application2.png
diff --git a/diagrams/bad-case.png b/diagrams/bad-case.png
diff --git a/diagrams/baseline1.png b/diagrams/baseline1.png
diff --git a/diagrams/baseline2.png b/diagrams/baseline2.png
diff --git a/diagrams/cnn_attention_pa&-pa.png b/diagrams/cnn_attention_pa&-pa.png
diff --git a/diagrams/component.png b/diagrams/component.png
diff --git a/diagrams/component_result.png b/diagrams/component_result.png
diff --git a/diagrams/dataset.png b/diagrams/dataset.png
diff --git a/diagrams/example-conversation.png b/diagrams/example-conversation.png
diff --git a/diagrams/issue-solution.png b/diagrams/issue-solution.png
diff --git a/diagrams/issue_answer_classification.png b/diagrams/issue_answer_classification.png
diff --git a/diagrams/issue_answer_classification_multi.png b/diagrams/issue_answer_classification_multi.png
diff --git a/diagrams/issue_classification.png b/diagrams/issue_classification.png
diff --git a/diagrams/issue_classification_only.png b/diagrams/issue_classification_only.png
diff --git a/diagrams/model-v5_00.png b/diagrams/model-v5_00.png
diff --git a/diagrams/plotablation.py b/diagrams/plotablation.py
@@ -0,0 +1,213 @@
+import xlrd
+import pandas as pd
+import matplotlib.pyplot as plt
+from scipy import stats
+import numpy as np
+
+
+def draw_ablation():
+    workbook = xlrd.open_workbook('../data/result_data_new.xlsx')  # NOTE: xlrd >= 2.0 only reads .xls files; use xlrd < 2.0 (or openpyxl) for this .xlsx workbook
+    sheet = workbook.sheet_by_name('Ablation_study')
+    local_attn_data = sheet.col_values(2, 1, sheet.nrows)
+    heu_data = sheet.col_values(3, 1, sheet.nrows)
+    cnn_data = sheet.col_values(4, 1, sheet.nrows)
+    richa_data = sheet.col_values(5, 1, sheet.nrows)
+
+    pre_issue = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
+                 in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 1]
+    rec_issue = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
+                 in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 2]
+    f1_issue = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
+                 in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 3]
+    pre_solution = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
+                 in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 4]
+    rec_solution = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
+                 in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 5]
+    f1_solution = [[attn, heu, cnn, richa] for i, (attn, heu, cnn, richa)
+                 in enumerate(zip(local_attn_data, heu_data, cnn_data, richa_data)) if (i + 1) % 6 == 0]
+    df_pre_issue = pd.DataFrame({'richa_localattn': [data[0] for data in pre_issue],
+                           'richa_heu': [data[1] for data in pre_issue],
+                           'richa_cnn': [data[2] for data in pre_issue],
+                           'richa': [data[3] for data in pre_issue]})
+    df_rec_issue = pd.DataFrame({'richa_localattn': [data[0] for data in rec_issue],
+                           'richa_heu': [data[1] for data in rec_issue],
+                           'richa_cnn': [data[2] for data in rec_issue],
+                           'richa': [data[3] for data in rec_issue]})
+    df_f1_issue = pd.DataFrame({'richa_localattn': [data[0] for data in f1_issue],
+                           'richa_heu': [data[1] for data in f1_issue],
+                           'richa_cnn': [data[2] for data in f1_issue],
+                           'richa': [data[3] for data in f1_issue]})
+    df_pre_solution = pd.DataFrame({'richa_localattn': [data[0] for data in pre_solution],
+                           'richa_heu': [data[1] for data in pre_solution],
+                           'richa_cnn': [data[2] for data in pre_solution],
+                           'richa': [data[3] for data in pre_solution]})
+    df_rec_solution = pd.DataFrame({'richa_localattn': [data[0] for data in rec_solution],
+                           'richa_heu': [data[1] for data in rec_solution],
+                           'richa_cnn': [data[2] for data in rec_solution],
+                           'richa': [data[3] for data in rec_solution]})
+    df_f1_solution = pd.DataFrame({'richa_localattn': [data[0] for data in f1_solution],
+                           'richa_heu': [data[1] for data in f1_solution],
+                           'richa_cnn': [data[2] for data in f1_solution],
+                           'richa': [data[3] for data in f1_solution]})
+    x_data = ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8']
+    # plt.plot(x_data, df_pre_issue.richa)
+    # plt.plot(x_data, df_pre_issue.richa_localattn)
+    fig = plt.figure()
+    plt.subplot(231)
+    plt.plot(x_data, list(df_pre_issue.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
+             mfcalt='b', label='ISPY')
+    plt.xticks([])
+    plt.plot(x_data, list(df_pre_issue.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
+             mfcalt='b', label='ISPY-LocalAttn')
+    plt.plot(x_data, list(df_pre_issue.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
+             mfcalt='b', label='ISPY-Heu')
+    plt.plot(x_data, list(df_pre_issue.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
+             markersize=4, mfcalt='b', label='ISPY-CNN')
+    # plt.grid(axis='y', linestyle='-.')
+    # plt.grid(axis='x', linestyle='-.')
+
+    plt.ylabel('Issue-P', fontdict={'family': 'Times New Roman', 'size': 16})
+    plt.ylim([0, 1])
+    plt.yticks(fontproperties='Times New Roman', size=13)
+    plt.xticks(fontproperties='Times New Roman', size=13)
+    # print(stats.ttest_ind(df_pre_issue.richa_heu, df_pre_issue.richa_cnn))
+
+    plt.subplot(232)
+    plt.plot(x_data, list(df_rec_issue.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_rec_issue.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_rec_issue.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_rec_issue.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
+             markersize=4, mfcalt='b')
+    plt.xticks([])
+    plt.yticks([])
+
+
+    # plt.grid(axis='y', linestyle='-.')
+    # plt.grid(axis='x', linestyle='-.')
+
+    plt.ylabel('Issue-R', fontdict={'family': 'Times New Roman', 'size': 16})
+    plt.ylim([0, 1])
+    plt.yticks(fontproperties='Times New Roman', size=13)
+    plt.xticks(fontproperties='Times New Roman', size=13)
+    # print(stats.ttest_ind(df_rec_issue.richa, df_rec_issue.richa_cnn))
+    # print(stats.ttest_ind(df_rec_issue.richa, df_rec_issue.richa_localattn))
+    # print(stats.ttest_ind(df_rec_issue.richa_heu, df_rec_issue.richa_cnn))
+
+
+    plt.subplot(233)
+    plt.plot(x_data, list(df_f1_issue.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_f1_issue.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_f1_issue.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_f1_issue.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
+             markersize=4, mfcalt='b')
+    plt.xticks([])
+    plt.yticks([])
+
+
+    # plt.grid(axis='y', linestyle='-.')
+    # plt.grid(axis='x', linestyle='-.')
+
+    plt.ylabel('Issue-F1', fontdict={'family': 'Times New Roman', 'size': 16})
+    plt.ylim([0, 1])
+    plt.yticks(fontproperties='Times New Roman', size=13)
+    plt.xticks(fontproperties='Times New Roman', size=13)
+    print(stats.ttest_ind(df_f1_issue.richa, df_f1_issue.richa_cnn))
+    print(stats.ttest_ind(df_f1_issue.richa, df_f1_issue.richa_heu))
+
+
+    plt.subplot(234)
+    plt.plot(x_data, list(df_pre_solution.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_pre_solution.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_pre_solution.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_pre_solution.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
+             markersize=4, mfcalt='b')
+    # plt.grid(axis='y', linestyle='-.')
+    # plt.grid(axis='x', linestyle='-.')
+
+    plt.ylabel('Solution-P', fontdict={'family': 'Times New Roman', 'size': 16})
+    plt.ylim([0, 1])
+    plt.yticks(fontproperties='Times New Roman', size=13)
+    plt.xticks(fontproperties='Times New Roman', size=13)
+    # print(stats.ttest_ind(df_pre_solution.richa_heu, df_pre_solution.richa_cnn))
+
+
+    plt.subplot(235)
+    plt.plot(x_data, list(df_rec_solution.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_rec_solution.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_rec_solution.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_rec_solution.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
+             markersize=4, mfcalt='b')
+    # plt.grid(axis='y', linestyle='-.')
+    # plt.grid(axis='x', linestyle='-.')
+    plt.yticks([])
+
+
+    plt.ylabel('Solution-R', fontdict={'family': 'Times New Roman', 'size': 16})
+    plt.ylim([0, 1])
+    plt.yticks(fontproperties='Times New Roman', size=13)
+    plt.xticks(fontproperties='Times New Roman', size=13)
+    # print(stats.ttest_ind(df_rec_solution.richa_heu, df_rec_solution.richa_cnn))
+
+    plt.subplot(236)
+    plt.plot(x_data, list(df_f1_solution.richa), color='limegreen', linestyle='-', marker='s', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_f1_solution.richa_localattn), color='darksalmon', linestyle='-', marker='x', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_f1_solution.richa_heu), color='orangered', linestyle='-', marker='^', markersize=4,
+             mfcalt='b')
+    plt.plot(x_data, list(df_f1_solution.richa_cnn), color='deepskyblue', linestyle='-', marker='.', mfc='w',
+             markersize=4, mfcalt='b')
+    # plt.grid(axis='y', linestyle='-.')
+    # plt.grid(axis='x', linestyle='-.')
+    plt.yticks([])
+
+
+    plt.ylabel('Solution-F1', fontdict={'family': 'Times New Roman', 'size': 16})
+    plt.ylim([0, 1])
+    plt.yticks(fontproperties='Times New Roman', size=13)
+    plt.xticks(fontproperties='Times New Roman', size=13)
+    print(stats.ttest_ind(df_f1_solution.richa, df_f1_solution.richa_cnn))
+    print(stats.ttest_ind(df_f1_solution.richa, df_f1_solution.richa_heu))
+    fig.legend(loc='upper center', ncol=4, prop={'size': 13, 'family': 'Times New Roman'})
+    plt.show()
+    # print(df_pre)
+
+
+def t_return():
+    richa = [0.76, 0.77, 0.76, 0.75, 0.68, 0.71, 0.84, 0.74, 0.79, 0.77, 0.68, 0.72, 0.82, 0.73, 0.77, 0.80, 0.69, 0.74, 0.79, 0.70, 0.74, 0.86, 0.78, 0.82]
+    nb = [0.36, 0.40, 0.38, 0.41, 0.30, 0.35, 0.47, 0.36, 0.41, 0.70, 0.56, 0.62, 0.08, 0.25, 0.13, 0.22, 0.42, 0.29, 0.30, 0.50, 0.37, 0.15, 0.40, 0.22]
+    rf = [0.56, 0.25, 0.34, 0.69, 0.30, 0.42, 0.75, 0.23, 0.35, 0.84, 0.44, 0.58, 1.00, 0.17, 0.29, 0.50, 0.25, 0.33, 0.33, 0.13, 0.18, 0.23, 0.30, 0.26]
+    gdbt = [0.27, 0.75, 0.40, 0.40, 0.70, 0.51, 0.50, 0.79, 0.61, 0.73, 0.44, 0.55, 0.21, 0.76, 0.33, 0.19, 0.67, 0.29, 0.30, 0.88, 0.44, 0.18, 0.90, 0.30]
+    casper = [0.39, 0.35, 0.37, 0.08, 0.03, 0.05, 0.59, 0.26, 0.36, 0.46, 0.40, 0.43, 0.19, 0.42, 0.26, 0.14, 0.17, 0.15, 0.05, 0.06, 0.06, 0.15, 0.40, 0.22]
+    cnc = [0.20, 0.55, 0.29, 0.23, 0.50, 0.32, 0.23, 0.36, 0.28, 0.12, 0.32, 0.17, 0.24, 0.42, 0.30, 0.12, 0.42, 0.19, 0.10, 0.50, 0.17, 0.05, 0.40, 0.10]
+    deca = [0.33, 0.50, 0.40, 0.28, 0.37, 0.31, 0.33, 0.36, 0.34, 0.64, 0.28, 0.39, 0.42, 0.42, 0.42, 0.44, 0.67, 0.53, 0.32, 0.50, 0.39, 0.04, 0.10, 0.06]
+
+    baselines = {'nb': nb, 'rf': rf, 'gdbt': gdbt, 'casper': casper, 'cnc': cnc, 'deca': deca}
+    # Each list stores repeating (precision, recall, F1) triples, in that order.
+    richa_pre = richa[0::3]
+    richa_rec = richa[1::3]
+    richa_f1 = richa[2::3]
+    for baseline, data_temp in baselines.items():
+        base_pre = data_temp[0::3]
+        base_rec = data_temp[1::3]
+        base_f1 = data_temp[2::3]
+        # Independent two-sample t-test on the F1 series only; P and R are split out but not tested.
+        data_t = stats.ttest_ind(richa_f1, base_f1)
+        print(data_t)
+
+
+if __name__ == '__main__':
+    # t_return()
+    draw_ablation()
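Note on `diagrams/plotablation.py`: the six list comprehensions in `draw_ablation()` pick rows with `(i + 1) % 6 == k`, which suggests that the spreadsheet rows cycle through six metrics per project. The sketch below shows the same selection written as a stride-6 slice. It is only an illustration: the `METRICS` names are taken from the variable names in the script, `split_by_metric` is a hypothetical helper, and the demo values are made up.

```python
from typing import Dict, List, Sequence

# Metric order assumed from the variable names in draw_ablation():
# rows cycle through these six metrics, one block of six per project.
METRICS = ["pre_issue", "rec_issue", "f1_issue",
           "pre_solution", "rec_solution", "f1_solution"]


def split_by_metric(column: Sequence[float]) -> Dict[str, List[float]]:
    """Split one spreadsheet column into six per-metric series.

    Row i (0-based) belongs to METRICS[i % 6], which selects the same
    rows as the `(i + 1) % 6 == k` tests in the original comprehensions.
    """
    if len(column) % len(METRICS) != 0:
        raise ValueError("column length must be a multiple of 6")
    return {name: list(column[offset::len(METRICS)])
            for offset, name in enumerate(METRICS)}


# Hypothetical column covering two projects (illustrative values only).
demo = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6,
        0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
series = split_by_metric(demo)
print(series["pre_issue"])    # first row of each block of six
print(series["f1_solution"])  # sixth row of each block of six
```

Applying such a helper once per column would yield the same per-metric series that the script currently builds with six separate comprehensions and six `DataFrame` constructors.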
diff --git a/disentanglement/src/LICENSE-src.txt b/disentanglement/src/LICENSE-src.txt
@@ -0,0 +1,14 @@
+Copyright (c) 2018, Jonathan K Kummerfeld <[email protected]>
+
+Permission to use, copy, modify, and/or distribute this software for any
+purpose with or without fee is hereby granted, provided that the above
+copyright notice and this permission notice appear in all copies.
+
+THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
+REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
+FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
+INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
+LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
+OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
+PERFORMANCE OF THIS SOFTWARE.
+
diff --git a/disentanglement/src/README.md b/disentanglement/src/README.md
@@ -0,0 +1,139 @@
+# System
+
+This folder contains code for reproducing our disentanglement experiments.
+
+## Requirements
+
+The only dependency is the [DyNet library](http://dynet.readthedocs.io), which can usually be installed with:
+
+```
+pip3 install dynet
+```
+
+## Running
+
+To see all options, run:
+
+```
+python3 disentangle.py --help
+```
+
+### Train
+
+To train, provide the `--train` argument followed by a series of filenames.
+
+The example command below trains a model with the same parameters as used in the ACL paper.
+The model is a feedforward neural network with 2 layers, 512-dimensional hidden vectors, and softsign non-linearities.
+
+```
+python3 disentangle.py \
+  example-train \
+  --train ../data/train/*annotation.txt \
+  --dev ../data/dev/*annotation.txt \
+  --hidden 512 \
+  --layers 2 \
+  --nonlin softsign \
+  --word-vectors ../data/glove-ubuntu.txt \
+  --epochs 20 \
+  --dynet-autobatch \
+  --drop 0 \
+  --learning-rate 0.018804 \
+  --learning-decay-rate 0.103 \
+  --seed 10 \
+  --clip 3.740 \
+  --weight-decay 1e-07 \
+  --opt sgd \
+  > example-train.out 2>example-train.err
+```
+
+### Infer
+
+This command will run the model trained above on the development set:
+
+```
+python3 disentangle.py \
+  angual_angular.1 \
+  --model example-train.dy.model \
+  --test /home/yuminz/gitter_chatmessage/angual_angular/*ascii* \
+  --test-start 0 \
+  --test-end 5000 \
+  --hidden 512 \
+  --layers 2 \
+  --nonlin softsign \
+  --word-vectors ../data/glove-ubuntu.txt \
+  > angual_angular.1.out 2>angual_angular.1.err
+```
+
+Note - the arguments defining the network (`--hidden`, `--layers`, `--nonlin`) must match those given in training.
+
+### Evaluate
+
+This command will run the output produced by the command above through the evaluation script:
+
+```
+python3 ../tools/evaluation/graph-eval.py --gold ../data/dev/*annotation* --auto example-run.1.out
+```
+
+The output should be something like:
+
+```
+g/a/m: 2607 2500 1855
+p/r/f: 74.2 71.2 72.6
+```
+
+The first line is a count of the gold links, auto links, and matching links.
+The second line is the precision, recall, and F-score.
+
+Note - the values in the paper are an average over 10 runs, so they will differ slightly from what you get here.
+
+### Running on a file
+
+If you want to apply a model to a file, see this script for an example of how to do it: `example-running.sh`.
+The script is set up so someone could call it like so (once the necessary placeholders in the script are set):
+
+./disentangle-file.sh < sample.ascii.txt > sample.links.txt
+
+## Ensemble
+
+For the best results, we used a simple ensemble of multiple models.
+We trained 10 models as described above, but with different random seeds (1 through 10).
+We combined their output using the `majority_vote.py` script in this directory.
+
+The same script is used for all three ensemble methods, with slightly different input and arguments:
+
+Union
+```
+./majority_vote.py example-run*graphs --method union > example-run.combined.union
+```
+
+Vote
+```
+./majority_vote.py example-run*graphs --method vote > example-run.combined.vote
+```
+
+Intersect
+```
+./majority_vote.py example-run*clusters --method intersect > example-run.combined.intersect
+```
+
+All of these assume the output files have been converted into our graph format.
+Assuming you run `disentangle.py` as above and save the output of each run as `example-run.1.out`, `example-run.2.out`, `example-run.3.out`, etc., this command will use one of our tools to convert them to the graph format:
+```
+for name in example-run*out ; do ../tools/format-conversion/output-from-py-to-graph.py < $name > $name.graphs ; done
+```
+
+The intersect method also assumes they have been made into clusters, like this:
+```
+for name in example-run*out ; do ../tools/format-conversion/graph-to-cluster.py < $name.graphs > $name.clusters ; done
+```
+
+Note: An earlier version of the steps above didn't account for a change in the output of the main system. Apologies for the broken output this would have caused.
+
+## C++ Model
+
+As well as the main Python code, we also wrote a model in C++ that was used for DSTC 7 and the results in the 2018 arXiv version of the paper (the Python version was used for DSTC 8 and the 2019 ACL paper).
+The Python model has additional input features and a different text representation method.
+The C++ model has support for a range of additional variations in both inference and modeling, which did not appear to improve performance.
+For details on how to build and run the C++ code, see [this page](./old-cpp-version/).
+
+[Go back](./../) to the main webpage.
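Note on the Evaluate section of `disentanglement/src/README.md`: the sample `p/r/f` values can be reproduced from the `g/a/m` counts if one assumes the standard definitions (precision = matching / auto, recall = matching / gold, F-score = harmonic mean of the two). The sketch below is not taken from `graph-eval.py`, which is not shown in this diff; the `prf` function is a hypothetical helper used only to check the arithmetic.

```python
from typing import Tuple


def prf(gold: int, auto: int, match: int) -> Tuple[float, float, float]:
    """Return precision, recall, and F-score as percentages.

    Assumes the standard link-level definitions; zero denominators
    yield 0.0 instead of raising ZeroDivisionError.
    """
    precision = 100.0 * match / auto if auto else 0.0
    recall = 100.0 * match / gold if gold else 0.0
    total = precision + recall
    f_score = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Counts copied from the README's sample output ("g/a/m: 2607 2500 1855").
p, r, f = prf(gold=2607, auto=2500, match=1855)
print(f"p/r/f: {p:.1f} {r:.1f} {f:.1f}")  # prints "p/r/f: 74.2 71.2 72.6"
```

With the README's counts this gives 1855 / 2500 = 74.2% precision and 1855 / 2607 ≈ 71.2% recall, which agrees with the sample output and supports the assumed definitions.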