LUTEST

Language Understanding Test Sets

The objective of LUTEST is to create test sets and an evaluation methodology that provide significant evidence about the linguistic generalization capabilities of deep learning methods applied to natural language processing. In recent years, there has been a range of work on building test sets and evaluation methods for assessing the language understanding capabilities of deep neural models and what information they select and encode. However, much work remains to be done, in particular from a linguistically motivated perspective and for languages other than English.

LUTEST has delivered two datasets: EsCoLA (Spanish Corpus of Linguistic Acceptability) and CatCoLA (Catalan Corpus of Linguistic Acceptability). You can find them (a single partition, with no test data) in the EsCoLA and CatCoLA directories, respectively. The datasets are documented in the following publications:

Bel, N.; Punsola, M.; Ruiz-Fernández, V. (2024). EsCoLA: Spanish Corpus of Linguistic Acceptability. In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy.

Bel, N.; Punsola, M.; Ruiz-Fernández, V. (2024). CatCoLA: Catalan Corpus of Linguistic Acceptability. Procesamiento del Lenguaje Natural, 73.
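For readers who want to experiment with the data, acceptability corpora in the CoLA family are typically distributed as tab-separated files of labeled sentences. Below is a minimal loader sketch in Python, assuming the four-column layout of the English CoLA corpus (source, binary label, original annotation, sentence); the exact column layout of the EsCoLA and CatCoLA files should be checked against the corpus documentation, and the sample rows here are invented for illustration only.

```python
import csv
import io

def load_acceptability_tsv(text):
    """Parse CoLA-style TSV text into (sentence, label) pairs.

    Assumed row layout (as in the English CoLA corpus; EsCoLA/CatCoLA
    may differ): source, label (1 = acceptable, 0 = unacceptable),
    original annotator mark, sentence.
    """
    pairs = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        source, label, _orig_mark, sentence = row
        pairs.append((sentence, int(label)))
    return pairs

# Invented sample rows, not real corpus data
sample = "src1\t1\t\tElla canta bien.\nsrc1\t0\t*\tElla cantan bien.\n"
data = load_acceptability_tsv(sample)
print(data[0])  # ('Ella canta bien.', 1)
```

Reading the whole file into memory first keeps the sketch simple; for the actual corpus files, pass the file contents (or adapt the function to take a file handle) and verify the delimiter and column order against the dataset's README.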


LUTEST is Project PID2019-104512GB-I00. Funded by Ministerio de Ciencia e Innovación (Spain). BSC participation has been promoted and financed by the Generalitat de Catalunya through the Aina project and by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - NextGenerationEU within the framework of the project ILENIA (2022/TL22/00215337-00215334).

Other publications reporting outcomes of the project are:

Zevallos, R.; Farrús, M.; Bel, N. (2023). Frequency Balanced Datasets Lead to Better Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7859–7872, Singapore. Association for Computational Linguistics.

Zevallos, R.; Bel, N. (2023). Hints on the data for language modeling of synthetic languages with transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), Vol. 1: Long Papers, pages 12508–12522.

Zevallos, R.; Bel, N.; Cámbara, G.; Farrús, M.; Luque, J. (2022). Data augmentation for low-resource Quechua ASR improvement. In Proc. Interspeech 2022, September 18–22, Incheon, South Korea.
