This is my code for the Toxic Comment Classification competition hosted on Kaggle. It is heavily modified from the base code here
To download the datasets, run get_data.sh
The dataset consists of comments from Wikipedia's talk page edits. It is a large collection of Wikipedia comments that human raters have labeled for toxic behavior. The types of toxicity are:
toxic
severe_toxic
obscene
threat
insult
identity_hate
The goal is to create an ensemble model that predicts the probability of each type of toxicity for each comment. A full explanation of my approach is documented here
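As a rough illustration of what "ensemble" can mean here, the sketch below blends the per-label probabilities of several models with a weighted average. This is only an assumption for illustration: the blending method actually used in this repository is described in the linked write-up, and the names `average_ensemble`, `model_a`, and `model_b` are hypothetical.

```python
# The six labels from the competition, in the order listed above.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def average_ensemble(model_predictions, weights=None):
    """Blend per-model probabilities into one probability per comment and label.

    model_predictions: one entry per model; each entry is a list of comments,
    and each comment is a list of len(LABELS) probabilities.
    weights: optional per-model weights; defaults to a plain mean.
    """
    n_models = len(model_predictions)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    total = sum(weights)

    n_comments = len(model_predictions[0])
    blended = []
    for i in range(n_comments):
        row = []
        for j in range(len(LABELS)):
            # Weighted mean of this comment/label cell across all models.
            value = sum(w * preds[i][j] for w, preds in zip(weights, model_predictions))
            row.append(value / total)
        blended.append(row)
    return blended


# Two hypothetical models scoring the same two comments (made-up numbers).
model_a = [[0.9, 0.1, 0.8, 0.0, 0.7, 0.1], [0.1, 0.0, 0.0, 0.0, 0.1, 0.0]]
model_b = [[0.7, 0.3, 0.6, 0.2, 0.5, 0.3], [0.3, 0.0, 0.2, 0.0, 0.1, 0.0]]

blended = average_ensemble([model_a, model_b])
for label, prob in zip(LABELS, blended[0]):
    print(f"{label}: {prob:.2f}")
```

Because each label is scored independently, a comment can receive a high probability for several labels at once.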
Run install.sh, then run pip install -r requirements.txt
- Make sure each embedding's original preprocessing is used, so that the highest possible percentage of embedding vectors can be imported
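One way to check this is to measure how much of the corpus vocabulary is found in the embedding vocabulary under a given preprocessing function. The sketch below is a minimal, self-contained example under assumed inputs: `embedding_coverage`, the two tokenizers, and the tiny vocabulary are hypothetical and not taken from this repository.

```python
from collections import Counter


def embedding_coverage(texts, embedding_vocab, preprocess):
    """Return (vocabulary coverage, token coverage) for a preprocessing function.

    Vocabulary coverage: share of distinct tokens found in the embedding vocab.
    Token coverage: share of all token occurrences found in the embedding vocab.
    """
    counts = Counter()
    for text in texts:
        counts.update(preprocess(text))
    if not counts:
        return 0.0, 0.0

    known = [token for token in counts if token in embedding_vocab]
    vocab_coverage = len(known) / len(counts)
    token_coverage = sum(counts[token] for token in known) / sum(counts.values())
    return vocab_coverage, token_coverage


def lowercase_split(text):
    # Matches an embedding that was trained on lowercased text.
    return text.lower().split()


def cased_split(text):
    # Keeps the original casing, which mismatches a lowercase embedding.
    return text.split()


# Hypothetical vocabulary of an embedding trained on lowercased text.
embedding_vocab = {"you", "are", "great", "wikipedia"}
texts = ["You are great", "Wikipedia you"]

coverage_matched = embedding_coverage(texts, embedding_vocab, lowercase_split)
coverage_mismatched = embedding_coverage(texts, embedding_vocab, cased_split)
print("matched preprocessing:   ", coverage_matched)
print("mismatched preprocessing:", coverage_mismatched)
```

In this toy case the mismatched tokenizer loses every capitalized word, which is the kind of drop in coverage the note above warns about.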