A link to the paper will be made available here once the Comput-EL proceedings are made public.
If you use the data or build on any ideas from our work, please consider citing it. A preliminary BibTeX entry follows; its URL, DOI, and page fields will be filled in on publication.
@inproceedings{agarwal-etal-2023-limit,
title = "Developing a Mixed-Methods Pipeline for Community-Oriented Digitization of Kwak'wala Legacy Texts",
author = "Agarwal, Milind and
Rosenblum, Daisy and
Anastasopoulos, Antonios",
booktitle = "The Use of Computational Methods in the Study of Endangered Languages (Comput-EL 8)",
url = " ",
doi = " ",
pages = " ",
abstract = "It is by now common knowledge in the NLP community that low-resource languages need large-scale data creation efforts and novel contributions in the form of robust algorithms that work in data-scarce settings. Amongst these languages, however, many have a large amount of data, ripe for NLP applications, except that this data exists in image-based formats. This includes scanned copies of extremely valuable dictionaries, linguistic field notes, children's stories, plays, and other textual material. To extract the text data from these non-machine-readable images, Optical Character Recognition (OCR) is the most popular technique, but it has proven to be challenging for low-resource languages because of their unique properties (uncommon diacritics, rare words, etc.) and due to a general lack of preserved page-structure in the OCR output. So, to contribute to the reduction of these two big bottlenecks (lack of text data and layout quality), we release the first textual and structural OCR dataset for 8 indigenous languages of Latin America. We hope that our dataset will encourage researchers within the NLP and Computational Linguistics communities to work with these languages.",
}