LUTEST

Language Understanding Test Sets

The objective of LUTEST is to create test sets and an evaluation methodology that provide significant evidence about the linguistic generalization capabilities of deep learning methods applied to natural language processing. In recent years, there has been a range of work on building test sets and evaluation methods for assessing the language understanding capabilities of deep neural models and what information they select and encode. However, much work remains to be done, in particular from a linguistically motivated perspective and for languages other than English.

LUTEST has delivered two datasets: EsCoLA (Spanish Corpus of Linguistic Acceptability) and CatCoLA (Catalan Corpus of Linguistic Acceptability). You can find them (a single partition, with no test data) in the EsCoLA and CatCoLA directories, respectively. The datasets are documented in the following publications:

Bel, N.; Punsola, M.; Ruiz-Fernández, V. (2024). EsCoLA: Spanish Corpus of Linguistic Acceptability. In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy.

Bel, N.; Punsola, M.; Ruiz-Fernández, V. (2024). CatCoLA: Catalan Corpus of Linguistic Acceptability. Procesamiento del Lenguaje Natural, 73.
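For readers who want to experiment with the data, acceptability corpora in the CoLA family are typically distributed as tab-separated files of labeled sentences. Below is a minimal loader sketch in Python, assuming the four-column layout of the English CoLA corpus (source, binary label, original annotation, sentence); the exact column layout of the EsCoLA and CatCoLA files should be checked against the corpus documentation, and the sample rows here are invented for illustration only.

```python
import csv
import io

def load_acceptability_tsv(text):
    """Parse CoLA-style TSV text into (sentence, label) pairs.

    Assumed row layout (as in the English CoLA corpus; EsCoLA/CatCoLA
    may differ): source, label (1 = acceptable, 0 = unacceptable),
    original annotator mark, sentence.
    """
    pairs = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        source, label, _orig_mark, sentence = row
        pairs.append((sentence, int(label)))
    return pairs

# Invented sample rows, not real corpus data
sample = "src1\t1\t\tElla canta bien.\nsrc1\t0\t*\tElla cantan bien.\n"
data = load_acceptability_tsv(sample)
print(data[0])  # ('Ella canta bien.', 1)
```

Reading the whole file into memory first keeps the sketch simple; for the actual corpus files, pass the file contents (or adapt the function to take a file handle) and verify the delimiter and column order against the dataset's README.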


LUTEST is Project PID2019-104512GB-I00. Funded by Ministerio de Ciencia e Innovación (Spain). BSC participation has been promoted and financed by the Generalitat de Catalunya through the Aina project and by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - NextGenerationEU within the framework of the project ILENIA (2022/TL22/00215337-00215334).

Other publications reporting outcomes of the project are:

Zevallos, R.; Farrús, M.; Bel, N. (2023). Frequency Balanced Datasets Lead to Better Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7859–7872, Singapore. Association for Computational Linguistics.

Zevallos, R.; Bel, N. (2023). Hints on the data for language modeling of synthetic languages with transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), Vol. 1: Long Papers, pages 12508–12522.

Zevallos, R.; Bel, N.; Cámbara, G.; Farrús, M.; Luque, J. (2022). Data augmentation for low-resource Quechua ASR improvement. In Proc. Interspeech 2022, September 18–22, Incheon, South Korea.
