'This repository' refers to all the resources accessible through this URL: https://github.com/Toluwase/Word-Level-Language-Identification-for-Resource-Scarce-/
The word-level language identification in this study refers to identifying the language of words in texts.
This study uses the term "main language" for the language of interest. The scope of the study is two languages: one language of interest and one foreign language. In essence, a text will be in two languages, the main language and the foreign language. For instance, if the main language is Yoruba, the foreign language could be English. In that case the text will be in Yoruba with some words in English.
This README describes the datasets, research results and Python program contained on this page; the datasets were used for the word-level language identification research for resource-scarce languages. The word-level language identification strategy proposed in the research is based on pattern analysis of the character trigrams of the featured languages. The languages featured in this research are English, Hausa, Igbo and Yoruba. The strategy does not require the large corpora needed by previous word-level language identification and other natural language processing research. In addition, the strategy potentially works for all languages whose writing systems are alphabet-based.
There are three types of files:
- Python Program: The Python program is contained in a Python file named trigramanalyzer_wordlanguageid.py. Kindly refer to the Operating Instructions for details of the Python program.
- Corpora: One of the contributions of this research is the set of corpora of the three Nigerian languages (Hausa, Igbo and Yoruba), which were used for this research and are being made available to the public through this repository. There are two types of corpora in this repository: three training corpora and three test corpora. The training corpora are monolingual: the Yoruba training corpus contains only Yoruba text, the Hausa training corpus contains only Hausa text, and the Igbo training corpus contains only Igbo text. The three test corpora, on the other hand, are bilingual. The English-Hausa test corpus contains texts in English and Hausa, the English-Yoruba test corpus contains texts in English and Yoruba, and the Igbo-Yoruba test corpus contains texts in Igbo and Yoruba.
Hausa, Igbo and Yoruba electronic texts are provided in this repository; all the training corpora have been largely cleaned of foreign words. Although the corpora have been cleaned, the sentence structures are maintained. The Hausa corpus (hausa_training_corpus.txt) has a text size of 190,833 and a vocabulary size of 9,247. The Hausa corpus is not diacritically marked; that is, the diacritically marked characters have been replaced with their unmarked equivalents. The Hausa corpus was harvested online from the websites of news and Christian religious organizations. The Igbo corpus (igbo_training_corpus.txt) is fully diacritically marked; it has a text size of 125,690 and a vocabulary size of 6,737. The Igbo corpus was also harvested online, largely from the websites of news and Christian religious organizations. The Yoruba corpus (Yoruba_training_corpus(part).txt) is the part of the larger corpus used for this research that could be made available to the public; the other part cannot be made available because of copyright issues. The Yoruba corpus has a text size of 116,652 and a vocabulary size of 6,954.
- Research Result: The research results include the research analysis, the character trigram frequencies and the word frequencies of the English, Hausa, Igbo and Yoruba texts used for the research. The trigram frequencies are presented in Trigams.xls, and the word frequencies and research analysis in WORDList&Results.xls.
This section of the README describes the Python program named trigramanalyzer_wordlanguageid.py.
This program finds the character trigrams in two languages, grouped by their position in a word (single, pre, mid and post). It calculates the probability of occurrence of the character trigrams in the two languages by reading text in those languages from specified text files, which must be saved in UTF-8. The probability of occurrence of a trigram in each language is used as its coefficient of discrimination/identification.
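As a rough illustration of this step, the Python sketch below extracts positional trigrams from words and converts their counts into relative frequencies. It is not the code in trigramanalyzer_wordlanguageid.py. The function names, the meaning given to the four positions, the skipping of words shorter than three characters, and the normalisation by the total trigram count are all assumptions made for illustration.

```python
from collections import Counter


def positional_trigrams(word):
    """Return (position, trigram) pairs for one word.

    Assumed convention (not confirmed by the source): a three-letter word
    yields one 'single' trigram; in longer words the first trigram is 'pre',
    the last is 'post' and the rest are 'mid'. Shorter words yield nothing.
    """
    n = len(word)
    if n < 3:
        return []
    if n == 3:
        return [("single", word)]
    last = n - 3  # start index of the final trigram
    pairs = []
    for i in range(last + 1):
        if i == 0:
            position = "pre"
        elif i == last:
            position = "post"
        else:
            position = "mid"
        pairs.append((position, word[i:i + 3]))
    return pairs


def trigram_probabilities(text):
    """Relative frequency of each (position, trigram) pair in a text."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(positional_trigrams(word))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: count / total for key, count in counts.items()}
```

To apply the sketch to a corpus, the text could be read with `open(path, encoding="utf-8")` and passed to `trigram_probabilities`, once for each of the two languages.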
In addition, it obtains the trigrams that are common to the two languages. For these common trigrams, the coefficient of discrimination/identification is set to zero (in place of the probability of occurrence).
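The treatment of common trigrams can be sketched as follows. The function name and the dictionary layout are hypothetical. The negative sign given to the other language's trigrams is also an assumption, made so that a sum of coefficients can fall below zero for words of the other language; the source does not state how the sign is assigned.

```python
def discrimination_coefficients(main_probs, other_probs):
    """Combine two probability tables into one table of signed coefficients.

    Trigrams found in both languages get 0. Trigrams unique to the main
    language keep their probability; trigrams unique to the other language
    get the negated probability (an assumed sign convention).
    """
    common = main_probs.keys() & other_probs.keys()
    coefficients = {}
    for key, probability in main_probs.items():
        coefficients[key] = 0.0 if key in common else probability
    for key, probability in other_probs.items():
        coefficients[key] = 0.0 if key in common else -probability
    return coefficients
```

Each input maps a trigram key, for example a (position, trigram) pair, to its probability of occurrence in one language.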
The sum of the coefficients of discrimination/identification of all the character trigrams in a word gives the LID of the word.
If the LID of a word > 0, the word belongs to the main language.
If the LID of a word < 0, the word belongs to the other language.
If the LID of a word = 0, the word belongs to the two languages or the word cannot be identified.
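The decision rule above can be sketched in Python as follows. The function names, the label strings and the coefficient values are hypothetical, and treating trigrams that are missing from the table as contributing zero is an assumption. Plain trigram strings are used as keys here for brevity; (position, trigram) pairs would work the same way.

```python
def word_lid(trigram_keys, coefficients):
    """Sum the coefficients of a word's trigrams.

    Trigrams missing from the table add 0 (an assumption).
    """
    return sum(coefficients.get(key, 0.0) for key in trigram_keys)


def classify(lid):
    """Map an LID score to a label using the three rules above."""
    if lid > 0:
        return "main"
    if lid < 0:
        return "other"
    return "both or unidentified"


# Made-up coefficients: positive for the main language, negative for the
# other language, zero for a trigram common to both.
coefficients = {"gbo": 0.03, "the": -0.02, "ing": -0.01, "ile": 0.0}

label = classify(word_lid(["gbo", "ile"], coefficients))
```

With real coefficients, which are floating-point probabilities, an exact comparison with zero may be too strict, so a small tolerance around zero could be considered.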
This work by Asubiaro, T., Adegbola, T., Mercer, R. and Ajiferuke, I. (2018). A Word-Level Language Identification Strategy for Resource-Scarce Languages. In 2018 Conference of The Association for Information Science and Technology, Vancouver, BC, Canada - Nov. 10 - 14, 2018 is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.