Yorùbá text

This repository contains fully diacritized Yorùbá text converted to Unicode Normalization Form C (NFC, canonical composition). In NFC, each base letter and its combining diacritics are composed into a single code point wherever Unicode defines a precomposed form. The conversion uses the following code:

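import unicodedata

def convert_to_NFC(filename, outfilename):
    # Read the input as UTF-8 and compose base letters with their diacritics (NFC).
    with open(filename, encoding='utf-8') as f:
        text = unicodedata.normalize('NFC', f.read())
    with open(outfilename, 'w', encoding='utf-8') as f:
        f.write(text)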
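
To illustrate what NFC changes, here is a minimal sketch using only Python's standard `unicodedata` module. The helper `is_nfc` and the sample strings are assumptions made for illustration and are not part of this repository. The sketch also shows that some Yorùbá letters, such as ọ̀, have no single precomposed code point, so they remain two code points even after NFC.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """Hypothetical helper: True if NFC normalization leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text

# "Yorùbá" written in decomposed form: u + U+0300 (grave), a + U+0301 (acute).
decomposed = "Yoru\u0300ba\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# "ọ̀" = o + dot below (U+0323) + grave (U+0300). Unicode has no single
# precomposed code point for it, so NFC yields U+1ECD followed by U+0300.
o_dot_grave = unicodedata.normalize("NFC", "o\u0323\u0300")

print(len(decomposed), len(composed))          # 8 6
print(is_nfc(decomposed), is_nfc(composed))    # False True
print([hex(ord(c)) for c in o_dot_grave])      # ['0x1ecd', '0x300']
```

A check like `is_nfc` could be run over a corpus file before submitting corrections, so that new text matches the normalization used by the existing files.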

Web sources:

Social Media sources:

Text has been gathered with permission from online sources and lightly preprocessed for use in natural language processing (NLP), text-to-speech (TTS), and automatic speech recognition (ASR) applications. Some of the sentences may contain errors. If you have corrections, please submit a pull request!

Repository contents:

BibeliYoruba_corpus
LagosNWUspeech_corpus
Ogboju_Ode_ninu_igbo_Irunmale
TheYorubaBlog_corpus
YorubaForAcademicPurpose_corpus
LICENSE
README.md
test_yoruba_diacritic_removal.py

Repository: Timilehin/yoruba-text
