[Feature]: Latin NLP Model #3391

ch-sander · 2024-01-10T08:42:30Z

Problem statement

Classical languages such as Latin mostly take a back seat when it comes to NLP (for obvious reasons, though).

Solution

spaCy's LatinCy model has shown how well a Latin NLP model can perform. Is any work towards a Latin model planned within this project, or would there be support if a third party aimed to build one?

Additional Context

No response

stefan-it · 2024-01-18T01:13:39Z

Hi @ch-sander ,

I think this is a very useful feature request! I had a look at the spaCy model for Latin on the Model Hub. For PoS tagging, it uses the following repos from Universal Dependencies:

As far as I can see, only UD_Latin-LLCT is directly supported in Flair:

flair/flair/datasets/treebanks.py

Lines 542 to 562 in ddf3bb3

    
    class UD_LATIN(UniversalDependenciesCorpus):
        def __init__(
            self,
            base_path: Optional[Union[str, Path]] = None,
            in_memory: bool = True,
            split_multiwords: bool = True,
        ) -> None:
            base_path = Path(flair.cache_root) / "datasets" if not base_path else Path(base_path)

            # this dataset name
            dataset_name = self.__class__.__name__.lower()

            data_folder = base_path / dataset_name

            # download data if necessary
            web_path = "https://raw.githubusercontent.com/UniversalDependencies/UD_Latin-LLCT/master/"
            cached_path(f"{web_path}/la_llct-ud-dev.conllu", Path("datasets") / dataset_name)
            cached_path(f"{web_path}/la_llct-ud-test.conllu", Path("datasets") / dataset_name)
            cached_path(f"{web_path}/la_llct-ud-train.conllu", Path("datasets") / dataset_name)

            super().__init__(data_folder, in_memory=in_memory, split_multiwords=split_multiwords)

The other datasets can easily be added to Flair (I have assigned this issue to myself).
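To illustrate why adding further treebanks is mostly mechanical: the class above differs per treebank only in the repository name and the file prefix. A minimal sketch of that pattern follows. The helper name `ud_conllu_urls` and its parameters are hypothetical and not part of Flair, and the prefixes of other Latin treebanks, as well as whether each one ships all three splits, would need to be checked against the respective repository.

```python
from typing import Dict

# Root of the raw file view for Universal Dependencies repositories on GitHub.
UD_RAW_ROOT = "https://raw.githubusercontent.com/UniversalDependencies"


def ud_conllu_urls(repo: str, prefix: str, branch: str = "master") -> Dict[str, str]:
    """Map each split name to the URL of its CoNLL-U file.

    Hypothetical helper for illustration only; it is not a Flair function.
    `repo` is the UD repository name, `prefix` is the file name prefix.
    """
    base = f"{UD_RAW_ROOT}/{repo}/{branch}"
    return {
        split: f"{base}/{prefix}-ud-{split}.conllu"
        for split in ("train", "dev", "test")
    }


# The only treebank confirmed in this thread is UD_Latin-LLCT.
urls = ud_conllu_urls("UD_Latin-LLCT", "la_llct")
for split, url in urls.items():
    print(split, url)
```

A new corpus class could then pass each of these URLs to `cached_path`, in the same way `UD_LATIN` does.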

For NER, I was unfortunately unable to find the training dataset that was used for LatinCy. It should be located here, but it is currently not available. So I am pinging @diyclassics for help on NER :)

Once these resources are available and integrated into Flair, it should be very easy to train models on them. For example, PoS tagging and NER models can be trained with LMs like Latin BERT as the backbone.
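As a rough idea of what such a training run could look like, here is a minimal sketch that fine-tunes a PoS tagger on the existing `UD_LATIN` corpus. Several things are assumptions: the backbone defaults to `bert-base-multilingual-cased` purely as a stand-in (it is not Latin BERT, whose model ID or local path would have to be supplied), and the output directory and hyperparameters are placeholders rather than tuned values.

```python
from dataclasses import dataclass


@dataclass
class TaggerConfig:
    # Stand-in backbone, NOT Latin BERT: replace with the Latin LM's ID or path.
    backbone: str = "bert-base-multilingual-cased"
    label_type: str = "upos"
    # Placeholder output directory and untuned hyperparameters.
    output_dir: str = "resources/taggers/latin-upos"
    learning_rate: float = 5e-5
    mini_batch_size: int = 4
    max_epochs: int = 10


def train_latin_tagger(config: TaggerConfig) -> None:
    """Fine-tune a transformer-based tagger on UD_Latin-LLCT with Flair."""
    # Imported here so the config can be inspected without Flair installed.
    from flair.datasets import UD_LATIN
    from flair.embeddings import TransformerWordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    # Downloads the LLCT train/dev/test files on first use.
    corpus = UD_LATIN()
    label_dict = corpus.make_label_dictionary(label_type=config.label_type)

    embeddings = TransformerWordEmbeddings(
        model=config.backbone,
        layers="-1",
        subtoken_pooling="first",
        fine_tune=True,
    )

    # Plain linear layer on top of the transformer (no RNN, no CRF).
    tagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=label_dict,
        tag_type=config.label_type,
        use_crf=False,
        use_rnn=False,
        reproject_embeddings=False,
    )

    trainer = ModelTrainer(tagger, corpus)
    trainer.fine_tune(
        config.output_dir,
        learning_rate=config.learning_rate,
        mini_batch_size=config.mini_batch_size,
        max_epochs=config.max_epochs,
    )
```

Calling `train_latin_tagger(TaggerConfig())` would start the run. An NER model would follow the same shape once a Latin NER corpus is available in Flair, with `label_type` set to `"ner"`.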

ch-sander · 2024-01-18T09:37:42Z

This sounds awesome! Thanks!

It would be promising to also involve https://github.com/CIRCSE and their many efforts related to the LiLa project @passarom. If I'm right, they also included more Medieval Latin than @diyclassics's model.

ch-sander added the "feature" label (A new feature) Jan 10, 2024

stefan-it self-assigned this Jan 18, 2024
