magarw/kwakwala


Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak'wala Legacy Texts

Paper

A link to the paper will be made available here once the Comput-EL proceedings are made public.

Cite this project

Please consider citing our work if you use the data or build on any of its ideas. The entry below is provisional; the URL, DOI, and page numbers will be filled in on publication.

@inproceedings{agarwal-etal-2023-limit,
    title = "Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak'wala Legacy Texts",
    author = "Agarwal, Milind  and
      Rosenblum, Daisy and
      Anastasopoulos, Antonios",
    publisher = "The Use of Computational Methods in the Study of Endangered Languages (Comput-EL 8)",
    url = " ",
    doi = " ",
    pages = " ",
    abstract = "It is by now common knowledge in the NLP community that low-resource languages need large-scale data creation efforts and novel contributions in the form of robust algorithms that work in data-scarce settings. Amongst these languages, however, many have a large amount of data, ripe for NLP applications, except that this data exists in image-based formats. This includes scanned copies of extremely valuable dictionaries, linguistic field notes, children's stories, plays, and other textual material. To extract the text data from these non-machine-readable images, Optical Character Recognition (OCR) is the most popular technique, but it has proven to be challenging for low-resource languages because of their unique properties (uncommon diacritics, rare words etc.) and due to a general lack of preserved page-structure in the OCR output. So, to contribute to the reduction of these two big bottlenecks (lack of text data and layout quality), we release the first textual and structural OCR dataset for 8 indigenous languages of Latin America. We hope that our dataset will encourage researchers within the NLP and Computational Linguistics communities to work with these languages.",
}
