This is the code for Spoiler Detection as Semantic Text Matching. The dataset along with a detailed description is available on Kaggle and Hugging Face.
Start by downloading the dataset from Kaggle or Hugging Face.
```
mkdir data
```
and extract the dataset into `data/`.
Please ensure that you have Anaconda or Miniconda installed, then:
```
conda env create -f environment.yml
conda activate spoiler
```
We use Comet.ml to store and read our logs. By default, `train.py` will run in offline mode, but you may enter your API key at the top of `train.py` to log your experiments on Comet.ml.
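The offline/online switch in `train.py` can be thought of as follows; this is only a sketch, and the `COMET_API_KEY` variable name is an assumption, not the actual code:

```python
# Hypothetical sketch of the logging toggle: use Comet.ml online logging
# only when an API key has been provided, otherwise fall back to offline mode.
COMET_API_KEY = ""  # paste your Comet.ml API key here to log online

def comet_mode(api_key: str) -> str:
    """Return "online" when an API key is set, "offline" otherwise."""
    return "online" if api_key else "offline"

print(comet_mode(COMET_API_KEY))  # offline by default
```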
```
python train.py --config config/longformer.yml
```
PyTorch Lightning model checkpoints are automatically saved in the `checkpoints` directory under the experiment name, and the two models with the best validation MRR are kept.
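This pruning is what a PyTorch Lightning `ModelCheckpoint(monitor=..., mode="max", save_top_k=2)` callback enforces; as an illustration (checkpoint names and scores below are made up), the selection policy amounts to:

```python
# Illustrative sketch of the "keep top 2 by validation MRR" policy.
def top_k_checkpoints(scored, k=2):
    """Keep the k checkpoint paths with the highest validation MRR."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [path for path, _ in ranked[:k]]

scored = [("epoch=1.ckpt", 0.41), ("epoch=2.ckpt", 0.47), ("epoch=3.ckpt", 0.44)]
# top_k_checkpoints(scored) keeps epoch=2.ckpt and epoch=3.ckpt
```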
Alternatively, you can skip training and download the models from the paper.
Point the `resume_from` field in your config file (e.g. `models/checkpoints/longformer/longformer.yml`) to your desired model checkpoint (e.g. `models/checkpoints/longformer/best.ckpt`), then:
```
python test.py --config models/checkpoints/longformer/longformer.yml --mode test
```
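For reference, the relevant part of the config might look like this hypothetical excerpt (only `resume_from` is confirmed by the steps above; any other fields in the real file are not shown):

```yaml
# Hypothetical excerpt of models/checkpoints/longformer/longformer.yml
resume_from: models/checkpoints/longformer/best.ckpt
```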
The MRR for each of the four shows in the test set will be printed first, followed by the overall test set MRR.
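As a reminder of what is being reported, Mean Reciprocal Rank averages the reciprocal of the rank at which the correct match appears for each query; a minimal sketch:

```python
# Minimal sketch of Mean Reciprocal Rank (MRR): for each query, take the
# reciprocal of the (1-indexed) rank of its correct match, then average.
def mean_reciprocal_rank(ranks):
    """ranks: 1-indexed rank of the correct match for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Correct matches ranked 1st, 2nd, and 4th:
# mean_reciprocal_rank([1, 2, 4]) == (1 + 0.5 + 0.25) / 3
```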
We provide a medium-sized autolabeled training set ready for training a spoiler matching model. If you'd like to create your own training set, we also make available the raw unlabeled comments, as well as the irrelevant/relevant dataset we used to train the autolabeler.
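Conceptually, the autolabeling step scores each raw comment with the trained relevance classifier and keeps the confident ones; the sketch below is hypothetical, and `score_fn` and the threshold are stand-ins rather than the paper's actual autolabeler or settings:

```python
# Hypothetical sketch of autolabeling: score raw, unlabeled comments with
# a relevance classifier and keep only those above a confidence threshold.
# score_fn stands in for the trained autolabeler; 0.5 is an arbitrary cutoff.
def autolabel(comments, score_fn, threshold=0.5):
    return [c for c in comments if score_fn(c) >= threshold]
```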