English, Hausa, Igbo and Yoruba corpora and results (presented in excel files) of word-level language identification research using the character trigram of the featured languages

Word-Level Language Identification for Resource-Scarce Languages

Definition of Terms

'This repository' refers to all the resources that can be accessed through this URL: https://github.com/Toluwase/Word-Level-Language-Identification-for-Resource-Scarce-/

Word-level language identification in this study refers to identifying the language of each word in a text.

This study uses the term 'main language' to refer to the language of interest. The scope of the study is two languages: one language of interest and one foreign language. In essence, a text is written in two languages, the main language and a foreign language. For instance, if the main language is Yoruba, the foreign language could be English. In that case the text is in Yoruba and contains some English words.

Introduction

This README describes the datasets, research results and Python program contained in this repository; the datasets were used for research on word-level language identification for resource-scarce languages. The strategy proposed in the research identifies the language of a word through pattern analysis of the character trigrams of the featured languages, which are English, Hausa, Igbo and Yoruba. The strategy does not require the large corpora needed by previous word-level language identification and other natural language processing research. In addition, it potentially works for all languages whose writing systems are alphabet-based.

File Manifest

There are three types of files:

  1. Python Program: The Python program is contained in a Python file named trigramanalyzer_wordlanguageid.py. Kindly refer to the Operating Instructions for details of the program.
  2. Corpora: One of the contributions of this research is the set of corpora of the three Nigerian languages (Hausa, Igbo and Yoruba) used for the research, which we are making available to the public through this repository. There are two types of corpora in this repository: three training corpora and three test corpora. The training corpora are monolingual: the Yoruba training corpus contains only Yoruba texts, the Hausa training corpus contains only Hausa texts, and the Igbo training corpus contains only Igbo texts. The three test corpora, on the other hand, are bilingual. The English-Hausa test corpus contains texts in English and Hausa, the English-Yoruba test corpus contains texts in English and Yoruba, and the Igbo-Yoruba test corpus contains texts in Igbo and Yoruba.

    Hausa, Igbo and Yoruba electronic texts are provided in this repository; all the training corpora have been largely cleaned of foreign words. Although the corpora have been cleaned, the sentence structures are maintained. The Hausa corpus (hausa_training_corpus.txt) has a text size of 190,833 and a vocabulary size of 9,247. The Hausa corpus is not diacritically marked; that is, the diacritically marked characters have been replaced with their unmarked equivalents. It was harvested online from the websites of news and Christian religious organizations. The Igbo corpus (igbo_training_corpus.txt) is fully diacritically marked; it has a text size of 125,690 and a vocabulary size of 6,737. It was also harvested online, largely from the websites of news and Christian religious organizations. The Yoruba corpus (Yoruba_training_corpus(part).txt) is the part of the larger corpus used for this research that could be made available to the public; the other part cannot be released because of copyright issues. The Yoruba corpus has a text size of 116,652 and a vocabulary size of 6,954.

  3. Research Result: The research results include the research analysis, the character trigram frequencies and the word frequencies of the English, Hausa, Igbo and Yoruba texts used for the research. The trigram frequencies are presented in Trigams.xls, and the word frequencies and research analysis are presented in WORDList&Results.xls.

Operating instructions

This section of the README describes the Python program named trigramanalyzer_wordlanguageid.py.

Stage one: Obtaining the Character Trigrams for the Two Languages

The program finds the character trigrams of two languages according to their positions (single, pre, mid and post). It reads text in each language from specified text files, which must be saved in UTF-8, and calculates the probability of occurrence of each character trigram in that language. The probability of occurrence of a trigram in a language serves as its co-efficient of discrimination/identification.
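The stage above can be illustrated with a minimal sketch. It is not the code in trigramanalyzer_wordlanguageid.py; the function names are hypothetical, and the meaning of the position labels is an assumption made for illustration: 'single' is taken to mean that the trigram is the whole three-character word, 'pre' the first trigram of a longer word, 'post' the last, and 'mid' any trigram in between.

```python
from collections import Counter


def positional_trigrams(word):
    """Return (position, trigram) pairs for one word.

    ASSUMPTION (not stated in this README): 'single' = the word is exactly
    three characters; 'pre' = first trigram; 'post' = last trigram;
    'mid' = every other trigram. Shorter words yield no trigrams.
    """
    n = len(word)
    if n < 3:
        return []
    if n == 3:
        return [("single", word)]
    pairs = []
    for i in range(n - 2):
        if i == 0:
            position = "pre"
        elif i == n - 3:
            position = "post"
        else:
            position = "mid"
        pairs.append((position, word[i:i + 3]))
    return pairs


def trigram_probabilities(text):
    """Map each (position, trigram) pair to its relative frequency in text."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(positional_trigrams(word))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: count / total for key, count in counts.items()}
```

A corpus file can be read with `open(path, encoding="utf-8")` and its contents passed to `trigram_probabilities`. Splitting on whitespace is a simplification; punctuation handling is left out of this sketch.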

Stage two: Obtaining the Overlapping Trigrams between the Two Languages

The program also obtains the trigrams that are common to the two languages. For these overlapping trigrams, the co-efficient of discrimination/identification is set to zero in place of the probability of occurrence.
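A minimal sketch of this stage follows, again with hypothetical names. It assumes each language's trigram probabilities are held in a dictionary. It also assumes that co-efficients of the other language are stored as negative values so that the sign rule of stage three applies; this README does not state how the sign is assigned.

```python
def discrimination_coefficients(main_probs, other_probs):
    """Combine two probability tables into one table of co-efficients.

    Trigrams common to both languages get 0.0. ASSUMPTION: trigrams unique
    to the main language keep +probability, and trigrams unique to the
    other language get -probability.
    """
    common = main_probs.keys() & other_probs.keys()
    coefficients = {}
    for trigram, probability in main_probs.items():
        coefficients[trigram] = 0.0 if trigram in common else probability
    for trigram, probability in other_probs.items():
        coefficients[trigram] = 0.0 if trigram in common else -probability
    return coefficients
```

For example, given `{"omo": 0.5, "ing": 0.5}` for the main language and `{"ing": 0.75, "the": 0.25}` for the other, the shared trigram `"ing"` receives a co-efficient of zero.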

Stage three: The Language Identification of a word

Summing the co-efficients of discrimination/identification of all the character trigrams in a word gives the LID of the word.

If the LID of a word > 0, the word belongs to the main language.

If the LID of a word < 0, the word belongs to the other language.

If the LID of a word = 0, the word belongs to both languages or cannot be identified.
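The decision rule can be sketched as follows. The names are hypothetical, the position labels follow the same assumed interpretation of single, pre, mid and post as above, and the table of co-efficients is assumed to be keyed by (position, trigram) with negative values for the other language. Trigrams missing from the table are assumed to contribute zero.

```python
def positional_trigrams(word):
    """(position, trigram) pairs; the label meanings are an assumption."""
    n = len(word)
    if n == 3:
        return [("single", word)]
    labels = {0: "pre", n - 3: "post"}
    return [(labels.get(i, "mid"), word[i:i + 3]) for i in range(n - 2)]


def word_lid(word, coefficients):
    """Sum the co-efficients of every positional trigram in the word."""
    return sum(
        coefficients.get(key, 0.0) for key in positional_trigrams(word.lower())
    )


def classify(word, coefficients):
    """Apply the three LID rules: > 0, < 0 and = 0."""
    lid = word_lid(word, coefficients)
    if lid > 0:
        return "main"
    if lid < 0:
        return "other"
    return "both or unidentified"


# Toy co-efficients, invented for illustration only.
toy = {("pre", "omo"): 0.5, ("post", "ode"): 0.25, ("single", "the"): -0.5}
print(classify("omode", toy))
print(classify("the", toy))
```

With the toy table, "omode" sums to a positive LID and "the" to a negative one; a word whose trigrams are all absent from the table falls into the zero case.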

Copyright and licensing information

Creative Commons License
This work, Asubiaro, T., Adegbola, T., Mercer, R. and Ajiferuke, I. (2018). A Word-Level Language Identification Strategy for Resource-Scarce Languages. In 2018 Conference of the Association for Information Science and Technology, Vancouver, BC, Canada, Nov. 10-14, 2018, is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
