This repository contains fully diacritized Yorùbá text converted to Unicode Normalization Form C (NFC, canonical composition), in which a base letter and its diacritics are composed into a single character wherever a precomposed form exists. The conversion uses the following code:
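
    import unicodedata

    def convert_to_NFC(filename, outfilename):
        # Read the input as UTF-8 and compose base letters with their diacritics (NFC).
        with open(filename, encoding='utf-8') as f:
            text = unicodedata.normalize('NFC', f.read())
        # Write the normalized text back out as UTF-8.
        with open(outfilename, 'w', encoding='utf-8') as f:
            f.write(text)
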
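
To illustrate what the conversion does, here is a minimal sketch on a single word rather than a file. The variable names are only for illustration and are not part of the repository. It also shows that a letter carrying two marks, such as ọ̀, is only partly composed, since Unicode has no single code point for that combination.

```python
import unicodedata

# "Yorùbá" typed as base letters followed by combining marks (decomposed form).
decomposed = "Yoru\u0300ba\u0301"

# NFC merges each base letter with its combining mark where a precomposed form exists.
composed = unicodedata.normalize("NFC", decomposed)

# The two strings look identical on screen but differ in code point count.
print(len(decomposed), len(composed))

# o + dot below + grave: only the dot below is merged into the base letter,
# and the grave accent stays as a separate combining mark.
two_marks = unicodedata.normalize("NFC", "o\u0323\u0300")
print([hex(ord(c)) for c in two_marks])
```

Because visually identical strings can differ at the code point level, normalizing before tokenizing or comparing words avoids treating the same word as two different ones.
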
- Lagos-NWU conversational corpus
- Bíbélì Mímọ́ ní Èdè Yorùbá Òde-Òní
- The Yorùbá blog
- BBC Yorùbá
- Yorùbá for Academic Purpose
- Yobá mọ oduá
- Àwa Ẹlẹ́rìí Jèhófà
- Orí Kìíní
- Iwé ti Nicé
- Alákọ̀wé
- lds.org
- Èdè Yorùbá Rẹwà
- Ìmọ̀_Ẹ̀rọ
- ọ̀rọ̀yorùbá
- Wikipedia
- Asubiaro, T., Adegbola, T., Mercer, R. and Ajiferuke, I. (2018). A Word-Level Language Identification Strategy for Resource-Scarce Languages
- https://twitter.com/yobamoodua
- https://twitter.com/yoruba_proverbs
- https://www.facebook.com/oweyoruba
Text has been gathered with permission from online sources and lightly preprocessed for use in NLP, TTS, and ASR applications. Note that some of the sentences may contain errors; please submit a pull request if you have corrections!
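
Before submitting corrections, it may help to confirm that the edited text is still in NFC. The following is a rough sketch of such a check, not a tool shipped with this repository; the helper name `find_non_nfc_lines` is hypothetical.

```python
import unicodedata

def find_non_nfc_lines(lines):
    """Return the 1-based numbers of lines that are not already in NFC."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if unicodedata.normalize("NFC", line) != line
    ]

# First entry is already composed; second uses separate combining marks.
sample = ["Yor\u00f9b\u00e1", "Yoru\u0300ba\u0301"]
bad_lines = find_non_nfc_lines(sample)
print(bad_lines)
```

The function accepts any iterable of strings, so an open file object (opened with `encoding="utf-8"`) can be passed in directly to check a whole file line by line.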