In recent times, the pervasive influence of social media has been harnessed by various factions representing diverse religions and backgrounds, unfortunately resulting in a disturbing weaponization of these platforms. Tragically, this has led to the incitement of hatred, which, in turn, has manifested as physical communal violence causing significant loss of life and widespread destruction. The prevalence of communal violence is a longstanding issue that continues to escalate, not only in the Bengal Region, encompassing Bangladesh and the West Bengal Province of India, but also worldwide.
Against this backdrop, it becomes imperative to address and understand the different manifestations of communal violence. To this end, this shared task aims to categorize and discern various forms of communal violence, shedding light on this complex phenomenon and contributing to its mitigation.
This shared task presents a challenge to NLP enthusiasts who wish to participate in a violence inciting text classification task. The dataset comprises YouTube comments related to the top 9 violent incidents that have occurred in the Bengal region (Bangladesh and West Bengal) within the past 10 years. The dataset encompasses content in Bangla, with comment lengths of up to 600 words. The primary objective of this task is to identify and classify threats associated with violence, which can potentially lead to further incitement of violent acts.
The task categories are defined as follows:
- Direct Violence: This category encompasses explicit threats directed towards individuals or communities, including actions such as killing, rape, vandalism, deportation, desocialization (threats urging individuals or communities to abandon their religion, culture, or traditions), and resocialization (threats of forceful conversion). The detection of direct violence is crucial due to its potential to yield severe consequences in the future.
- Passive Violence: In this category, instances of violence are represented by the use of derogatory language, abusive remarks, or slang targeting individuals or communities. Additionally, any form of justification for violence is also classified under this category.
- Non-Violence: The contents falling under this category pertain to non-violent subjects, such as discussions about social rights or general conversational topics that do not involve any form of violence.
Participants are welcome to build systems for the classification task. The successful completion of this task will contribute significantly to the field of natural language processing and violence detection, facilitating the identification and prevention of potential threats leading to violent incidents.
We will provide the dataset in CSV format. The dataset contains two columns: "text" and "label". The "text" column contains textual data collected from social media, and the values in the "label" column are 0, 1, or 2, representing the different categories of violence.
Category | Label |
---|---|
Direct Violence | 2 |
Passive Violence | 1 |
Non-Violence | 0 |
Data is currently provided as CSV in the train.csv and dev.csv files. A row within the CSV adheres to the following structure:
text,label
Where:
- text: the comment text collected from social media
- label: 2 (Direct Violence), 1 (Passive Violence), or 0 (Non-Violence)
"ঢাকা কলেজে আগুন লাগিয়ে এই কুলাঙ্গার ছাত্রদের পুরিয়ে মারা উচিৎ,, এরাই এখন গলার কাটা",2
শয়তান মেরে হাসবে না তো কাঁদবে!!,1
যে মারা গেল তার ক্ষতিপূরনের ব্যবস্হা করে দেওয়া হোক।,0
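For illustration, here is a minimal sketch of loading the two splits with pandas (an assumed dependency; the file and column names follow the description above):

```python
# Minimal sketch: loading the provided splits with pandas (assumed dependency;
# file and column names are those described above).
import pandas as pd

ID2CLASS = {0: "Non-Violence", 1: "Passive Violence", 2: "Direct Violence"}

train_df = pd.read_csv("train.csv")  # columns: text, label
dev_df = pd.read_csv("dev.csv")

print(train_df.shape, dev_df.shape)
print(train_df.assign(category=train_df["label"].map(ID2CLASS)).head())
```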
Two sets are provided, for training and development purposes. The training set comprises 2700 samples used to train a machine learning model or algorithm, and the development set consists of 1330 samples.
Dataset | Number of Samples |
---|---|
Train | 2700 |
Dev | 1330 |
The training set, which includes 2700 samples, is composed of a diverse range of data. Within this set, approximately 15% of the samples depict direct violence, 34% portray passive violence, and the remaining 51% represent non-violent instances. The following figure shows the percentage distribution across categories in the training set.
The development set comprises 1330 samples, likewise covering a diverse range of data. Among these samples, around 15% illustrate direct violence, 31% depict passive violence, and the remaining 54% represent non-violent instances. The following figure shows the percentage distribution across categories in the development set.
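As a quick check of these percentages, here is a minimal sketch using pandas (an assumed dependency):

```python
# Minimal sketch: recomputing the class percentages quoted above from the
# provided splits (pandas is an assumed dependency).
import pandas as pd

for name, path in [("Train", "train.csv"), ("Dev", "dev.csv")]:
    df = pd.read_csv(path)
    pct = (df["label"].value_counts(normalize=True).sort_index() * 100).round(1)
    print(name, pct.to_dict())  # keys 0/1/2 = Non-Violence / Passive / Direct
```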
We provide baseline systems that fine-tune BanglaBERT, the XLM-RoBERTa base model, and the BERT multilingual base (cased) model on the dataset.
The baseline results for the task on the development set are provided in the table below.
Model | macro-F1 |
---|---|
BanglaBERT | 0.7879 |
XLM-RoBERTa base | 0.7292 |
BERT multilingual base (cased) | 0.6819 |
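As a starting point, here is a minimal sketch of fine-tuning a comparable baseline with the Hugging Face transformers and datasets libraries; the model choice, sequence length, and hyperparameters are illustrative assumptions, not the official baseline configuration:

```python
# Minimal sketch: fine-tuning xlm-roberta-base for 3-way classification with
# Hugging Face transformers/datasets. Hyperparameters are illustrative, not the
# organizers' official baseline configuration.
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

def to_dataset(path):
    # Load a CSV split and tokenize the "text" column.
    df = pd.read_csv(path)
    ds = Dataset.from_pandas(df[["text", "label"]])
    return ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=256),
                  batched=True)

train_ds, dev_ds = to_dataset("train.csv"), to_dataset("dev.csv")

def compute_metrics(eval_pred):
    # Report macro-F1, the task's official metric.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="vitd-xlmr",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=dev_ds, tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
```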
- 16 July 2023: Registration on CodaLab and beginning of the development cycle
- 15 August 2023: Beginning of the evaluation cycle (test sets release and run submission)
- 18 August 2023: End of the evaluation cycle
- 20 August 2023: Publish rank list and share paper submission details
- 10 September 2023: Deadline for the submission of working notes
- 10 October 2023: Notification of acceptance
- 16 October 2023: Camera-ready due
- 8 December 2023: Workshop co-located with EMNLP (Singapore)
- Dataset of Trial Phase (Dataset provided after successful registration)
- To participate in this competition, you must have an account on CodaLab
- Contest Link: https://codalab.lisn.upsaclay.fr/competitions/14620
- This competition consists of two phases:
- Trial Phase: This phase involves working on the dev set.
- Evaluation Phase: This phase involves working on the test set, which will be released during the evaluation cycle.
- Participants may compete as a team of multiple people or as a single-person team.
- We request each team to create and use a single account for all submissions. Submissions made from multiple accounts by the same team may lead to the team being excluded from the final ranking in the overview paper.
- Your best score will automatically be added to the leaderboard and it will serve as your final submission.
- Each team is allowed a maximum of 30 submissions per day for the given task.
- Team Registration Form - Hit the link and fill out this form (Test set for evaluation phase provided after successful registration)
Your submissions will be evaluated by comparing predictions against the ground truth using the macro F1 score. The leaderboard is ranked by macro-F1 in descending order. To check more detailed evaluation scores after a submission, use "View scoring output log".
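For local validation before submitting, here is a minimal sketch of computing the same metric with scikit-learn (file names are illustrative assumptions):

```python
# Minimal sketch: computing the macro F1 of a prediction file against the
# reference labels with scikit-learn (file names are illustrative).
import pandas as pd
from sklearn.metrics import f1_score

gold = pd.read_csv("dev.csv")             # reference split with a "label" column
pred = pd.read_csv("my_predictions.csv")  # must keep the provided row order

print("macro-F1:", f1_score(gold["label"], pred["label"], average="macro"))
```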
The prediction file should follow the CSV format and contain exactly two columns, named "text" and "label". The "label" column will be used for evaluation; its values should range from 0 to 2, as there are three classes in our task. The rows of the prediction file must follow the same order as the data we provide. We have provided sample prediction files below for your convenience.
Follow the instructions below to submit your prediction file. CodaLab requires all submissions to be in zip format.
- Use your trained model to generate a prediction file.
- Name the prediction file in the format of `<file_name>.csv`, where `<file_name>` represents a descriptive name of your choice.
- Compress the `<file_name>.csv` file into a zip file.
- Submit the zip file on CodaLab (see the sketch below).
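Here is a minimal sketch of these steps, assuming an illustrative file name and a placeholder inference function standing in for your trained model:

```python
# Minimal sketch: writing the two-column prediction CSV in the provided row
# order and compressing it for CodaLab. File names and the predict_label stub
# are illustrative placeholders for your own trained model.
import zipfile
import pandas as pd

def predict_label(text: str) -> int:
    # Placeholder: replace with your model's inference; must return 0, 1, or 2.
    return 0

test_df = pd.read_csv("test.csv")  # released during the evaluation phase
test_df["label"] = [predict_label(t) for t in test_df["text"]]

test_df[["text", "label"]].to_csv("vitd_predictions.csv", index=False)
with zipfile.ZipFile("vitd_predictions.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("vitd_predictions.csv")
```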
Rank | Team Name | Affiliation | Username | Submission Id | F1 Score (Macro) |
---|---|---|---|---|---|
1 | DeepBlueAI | DeepBlue Technology (Shanghai) Co., Ltd | DeepBlueAI | 520420 | 76.044 |
2 | Aambela | CCDS Lab, IUB, Dhaka | MoFa_Aambela | 520141 | 76.040 |
3 | NLP_CUET | Chittagong University of Engineering and Technology | NLP_TEAM | 520009 | 74.587 |
4 | Team Embeddings | NA | towhidul_tonmoy | 517629 | 74.418 |
5 | Semantics Squad | University of New Brunswick | KrishnoDey | 518447 | 74.413 |
6 | nlpBDpatriots | GEORGE MASON UNIVERSITY | Raihan008 | 516557 | 74.313 |
7 | the_linguists | Islamic University of Technology | tariquzzaman | 516636 | 73.978 |
8 | Panda | Researcher | yangst | 516283 | 73.808 |
9 | EmptyMind | Chittagong University of Engineering and Technology | empty_box | 518295 | 73.797 |
10 | Mavericks | Pune Institute of Computer Technology | kshitij | 520353 | 73.699 |
11 | LowResourceNLU | University of California Los Angeles (UCLA), Virginia Tech, James Cook University | Hari_vm | 520336 | 73.468 |
12 | VacLM | IIT Kanpur | pramitb | 516939 | 72.656 |
13 | LexicalMinds | Chittagong University of Engineering and Technology | saha17 | 519925 | 72.551 |
14 | Score_IsAll_You_Need | Chittagong University of Engineering and Technology | Ka05aR | 520135 | 72.376 |
15 | winging_it | Islamic University of Technology | shihab16 | 517184 | 71.207 |
16 | Semantic_Savants | Chittagong University of Engineering and Technology | Semantic_Savants | 515469 | 71.179 |
17 | BpHigh | Independent | bp-high | 520550 | 70.978 |
18 | SUST_Black Box | Shahjalal University of Science & Technology | hrithik4064 | 520481 | 70.680 |
19 | Team_Syrax | Shahjalal University of Science & Technology | riyadomf | 520099 | 70.450 |
20 | Blue | North South University | ShadmanRohan | 520323 | 70.012 |
21 | Team CentreBack | Independent | refaat1731 | 515685 | 69.390 |
22 | Souro | Charles University | souro | 520101 | 69.009 |
23 | BanglaNLP | Self | Ssaha | 517168 | 68.110 |
24 | KUET_NLP | Khulna University of Engineering & Technology | shakib034 | 520056 | 60.332 |
25 | Shibli_CL | Ahsanullah University of Science and Technology | Shibli_CL | 515841 | 38.427 |
26 | Ushoshi2023 | Daffodil International University, Florida Institute of Technology, University of North Carolina at Charlotte, United International University | nnur594 | 520100 | 31.913 |
26 | Team Error Point | Daffodil International University | rajeshdiu | 520024 | 31.913 |
27 | lixn | LMU | lixn | 520468 | 31.426 |
- Sourav Saha, Shahjalal University of Science and Technology
- Jahedul Alam Junaed, Shahjalal University of Science and Technology
- Dr. Nabeel Mohammed, Associate Professor, North South University
- Dr. Ruhul Amin, Assistant Professor, Fordham University
Please use the following BibTeX to cite the dataset and the task.
@inproceedings{SahaAndJunaed,
title = "Vio-Lens: A Novel Dataset of Annotated Social Network Posts Leading to Different Forms of Communal Violence and its Evaluation",
author = "Saha, Sourav and Junaed, Jahedul Alam and Saleki, Maryam and Sharma, Arnab Sen and Rifat, Mohammad Rashidujjaman and Rahouti, Mohamed and Ahmed, Syed Ishtiaque and Mohammed, Nabeel and Amin, Mohammad Ruhul",
booktitle = "Proceedings of the 1st International Workshop on Bangla Language Processing (BLP-2023)",
month = "Dec",
year = "2023",
publisher = "Association for Computational Linguistics",
address = "Singapore",
}
@inproceedings{blp2023-overview-task1,
title = "BLP-2023 Task 1: Violence Inciting Text Detection (VITD)",
author= "Saha, Sourav and Junaed, Jahedul Alam and Saleki, Maryam and Rahouti, Mohamed and Mohammed, Nabeel and Amin, Mohammad Ruhul",
booktitle = "Proceedings of the 1st International Workshop on Bangla Language Processing (BLP-2023)",
month = "Dec",
year = "2023",
publisher = "Association for Computational Linguistics",
address = "Singapore",
}
- Join us in Slack
- Contact the organizers
EMNLP adheres to the ACL Anti-Harassment Policy. Any participant who experiences harassment or hostile behavior may contact any current member of the ACL Professional Conduct Committee. Please be assured that if you approach us, your concerns will be kept in strict confidence, and we will consult with you on any actions taken.