The Dataset for Hate Speech Detection in Indonesian

(Dataset untuk Deteksi Ujaran Kebencian dalam Bahasa Indonesia)

Dataset
The dataset has two columns, label and tweet, and consists of 713 tweets in Indonesian. The label is either Non_HS or HS: Non_HS marks a "non-hate-speech" tweet and HS marks a "hate-speech" tweet.
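
A minimal sketch of reading the two-column file into (label, tweet) pairs. The column separator is not stated in this README, so the tab default below is an assumption; `parse_rows`, `load_dataset`, and the sample lines are hypothetical names and data used only for illustration. Lines whose first field is not Non_HS or HS (for example a header row, if the file has one) are skipped.

```python
from collections import Counter

VALID_LABELS = {"Non_HS", "HS"}


def parse_rows(lines, sep="\t"):
    """Split each line into (label, tweet); skip lines without a valid label."""
    rows = []
    for line in lines:
        label, found, tweet = line.rstrip("\n").partition(sep)
        if found and label.strip() in VALID_LABELS:
            rows.append((label.strip(), tweet.strip()))
    return rows


def load_dataset(path, sep="\t", encoding="utf-8"):
    """Read the dataset file; separator and encoding are assumptions."""
    with open(path, encoding=encoding) as handle:
        return parse_rows(handle, sep=sep)


# Made-up sample lines, not taken from the dataset.
sample = [
    "Label\tTweet\n",
    "Non_HS\tselamat pagi semua\n",
    "HS\tcontoh tweet kasar\n",
]
rows = parse_rows(sample)
counts = Counter(label for label, _ in rows)
```

With the real file, `load_dataset("IDHSD_RIO_unbalanced_713_2017.txt")` should yield 713 rows if the separator assumption holds; a different count suggests the separator or encoding needs adjusting.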

Number of Non_HS tweets: 453
Number of HS tweets: 260

Since this dataset is unbalanced, you might have to over-sample or down-sample it in order to create a balanced dataset.
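
One possible way to balance the classes is random over-sampling (duplicating minority-class rows) or random down-sampling (dropping majority-class rows). The sketch below uses only the standard library; `oversample`, `undersample`, and the placeholder rows are hypothetical, and this is not necessarily the approach used in the notebook.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(rows):
    groups = defaultdict(list)
    for label, tweet in rows:
        groups[label].append((label, tweet))
    return groups


def oversample(rows, seed=0):
    """Duplicate random minority rows until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_label(rows)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Sampling with replacement; k is 0 for the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


def undersample(rows, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Placeholder rows with the class sizes stated above (453 Non_HS, 260 HS).
rows = [("Non_HS", f"tweet {i}") for i in range(453)]
rows += [("HS", f"tweet {i}") for i in range(260)]

over = Counter(label for label, _ in oversample(rows))
under = Counter(label for label, _ in undersample(rows))
```

If the data is split into training and test sets, resampling is usually applied to the training portion only, so that duplicated tweets do not leak into the evaluation set.
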
The dataset may be used freely, but if you publish a paper or other publication that uses the dataset, please cite this publication:

Preprocessing

Case Folding (lowercase; remove numbers, punctuation, and extra whitespace)
Tokenization
Stopword Removal
Stemming
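
The four steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the notebook's implementation: the stopword set is a tiny hypothetical subset, and `naive_stem` is a placeholder suffix stripper standing in for a real Indonesian stemmer (a dedicated stemming library would normally be used here).

```python
import re
import string

# Assumption: tiny illustrative subset, not a complete Indonesian stopword list.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}
# Assumption: placeholder suffixes; real Indonesian stemming is far more involved.
SUFFIXES = ("kan", "nya", "an")

_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def case_fold(text):
    """Lowercase, drop digits and ASCII punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOPWORDS]


def naive_stem(token):
    """Strip one known suffix if at least four characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = remove_stopwords(tokenize(case_fold(text)))
    return [naive_stem(token) for token in tokens]


tokens = preprocess("Bukunya ada di meja nomor 12!")
```

Keeping each step as its own function makes it easy to swap the placeholder stopword list or stemmer for a proper Indonesian resource without touching the rest of the pipeline.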


abduhsalam/Hate-Speech-Detection
