This page tries to assemble all the research on Natural Language Processing (NLP) for native and indigenous languages of the American continent. Our languages are in danger, especially if they are left out of the new digital boom, which is reaching even the most remote communities. Although scientific and engineering work has been done in the field, much more is needed to achieve usable tools that can compete with the products of the big companies (such as Google Translate, Alexa, etc.). To push this effort forward, this work aims to compile a list that is as complete as possible.
Our main aim is to encourage native speakers, researchers, and engineers to participate in this effort. Hopefully, we can do it with this survey.
If you want more information, please read our paper: "Challenges of language technologies for the indigenous languages of the Americas". We also invite you to have a look at our presentation.
Last Update: 22/November/2020
- Machine Translation
- Automatic Lexical extraction
- Morphological analysis and segmentation
- Corpus and digital resources
- Speech Recognition
- POS Tagging
- Parsing
- Spell checking
- WordNet
- Language ID
- Code-Switching and Multilingual NLP
- Tools, documentation and education
- Computational Linguistic Analysis and Surveys
Online Corpus Resources
- BriBri Annotated speech + morphology corpus
- BriBri bilingual dictionary
- Inuktitut Morphological Database
- JW300 Multilingual corpus that also includes many indigenous languages of the American continent. (Soon available at OPUS)
- Cherokee-English Parallel Corpus
- English-Inuktitut Parallel Corpus
- Nahuatl-Spanish: Axolotl parallel corpus
- Gran diccionario Nahuatl
- Wixarika-Spanish parallel corpus
- Shipibo-Konibo Spanish Parallel corpus.
- Shipibo-Konibo Wordnet
- Shipibo-Konibo Lemma corpus
- Shipibo-Konibo POS-tag corpus
- Mapudungun Speech and parallel corpus
- Mexican Languages Parallel Corpus
- Morphological reinflection (Navajo, Haida and Quechua)
- Morphological inflection SIGMORPHON 2020 (Tlatepuzco Chinantec, San Pedro Amuzgo Amuzgos, Yoloxóchitl Mixtec, Chichicapan Zapotec, Yaitepec Chatino, Zenzontepec Chatino, Eastern Highland Chatino, Eastern Highland Otomi, Mezquital Otomi and Chichimec)
- Siminchikkunarayku A Speech Corpus for Preservation of Southern Quechua
- Tsunkua Spanish-Hñahñu (Otomi) parallel corpus.
- Universal Dependencies: Mbya Guarani, Shipibo Konibo, Cusco Quechua
- FastText: Nahuatl
Scientific papers
Online demos and software
- CHANA A software platform for automatic translation between Peruvian native languages
- Mainumby is an experimental translation app for the Guarani-Spanish language pair.
- Microsoft Translator includes Yucatec Maya and Queretaro Otomí.
- Wayuu-Spanish Machine Translation Author: José Cirilo González Hernández
- Wixarika-Spanish Machine Translation Author: Jesús Manuel Mager Hois
- Zapotec-Spanish Translation App. Author: Gonzalo Santiago Martínez.
- Inuktitut Morphological Analyzer
- Wixarika Morphological Segmenter
- Morphological Analyzer for the Bribri language of the Chibchense family
- Guaraní
- Cusco Quechua
- Eastern Apurímac Quechua
- K'iche'
- Tseltal
- Morfo is an application that analyzes words in several languages (Guarani, K'iche', Qhichwua, etc).
- Plains Cree morphological analyzer/generator
Online available software
Mercado-Gonzales, R., Pereira-Noriega, J., Cabezudo, M. A. S., & Oncevay-Marcos, A. (2018). ChAnot: An intelligent annotation tool for indigenous and highly agglutinative languages in Peru. In LREC.
Flores Solórzano, S. (2012). Teclado chibcha: un software lingüístico para los sistemas de escritura de las lenguas bribri y cabécar. Revista de Filología y Lingüística de la Universidad de Costa Rica Vol. 36 Núm. 2.
Kuhn, J. (2004). Applying computational linguistic techniques in a documentary project for Q’anjob’al (Mayan, Guatemala). In Proceedings of LREC 2004.
Lessard, G., Brinklow, N., & Levison, M. (2018). Natural Language Generation for Polysynthetic Languages: Language Teaching and Learning Software for Kanyen’kéha (Mohawk). In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages (pp. 41-52).
Sofía Flores Solórzano. (2010). Teclado chibcha: Un software lingüístico para los sistemas de escritura de las lenguas bribri y cabécar. Revista de Filología y Lingüística de la Universidad de Costa Rica, 36(2):155–161. DOI 10.15517/RFL.V36I2.1110.
Manning, C. D., Jansz, K., & Indurkhya, N. (2001). Kirrkirr: Software for browsing and visual exploration of a structured Warlpiri dictionary. Literary and Linguistic Computing, 16(2), 135-151.
Jansz, K., Manning, C., & Indurkhya, N. (1999). Kirrkirr: Interactive visualisation and multimedia from a structured Warlpiri dictionary. Proceedings of AusWeb99, the Fifth Australian World Wide Web Conference, pp. 302-316.
This effort can be completed only with the cooperation of all visitors. If you know about some work in this field, please let me know: push to this repository, send an email to mmager [at], or visit my personal web page.
If you found this information useful for your academic research, please acknowledge its use with a citation:
Mager, M., Gutierrez, X., Sierra, G., and Meza, I. (2018, August). Challenges of language technologies for the indigenous languages of the Americas. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics.
