[Feature]: Latin NLP Model #3391

Open
ch-sander opened this issue Jan 10, 2024 · 2 comments
Labels
feature A new feature


@ch-sander

Problem statement

Classical languages such as Latin mostly take a back seat when it comes to NLP (for obvious reasons, though).

Solution

The spaCy model LatinCy has shown how well a Latin NLP model can perform. Is any work towards a Latin model planned within this project, or would there be support if a third party set out to build one?

Additional Context

No response

@ch-sander ch-sander added the feature A new feature label Jan 10, 2024
@stefan-it
Member

stefan-it commented Jan 18, 2024

Hi @ch-sander ,

I think this is a very useful feature request! Having looked at the spaCy model for Latin on the Model Hub, I see that the following Universal Dependencies repos are used for PoS tagging:

As far as I can see, only UD_Latin-LLCT is directly supported in Flair:

```python
class UD_LATIN(UniversalDependenciesCorpus):
    def __init__(
        self,
        base_path: Optional[Union[str, Path]] = None,
        in_memory: bool = True,
        split_multiwords: bool = True,
    ) -> None:
        base_path = Path(flair.cache_root) / "datasets" if not base_path else Path(base_path)

        # this dataset name
        dataset_name = self.__class__.__name__.lower()

        data_folder = base_path / dataset_name

        # download data if necessary
        web_path = "https://raw.githubusercontent.com/UniversalDependencies/UD_Latin-LLCT/master/"
        cached_path(f"{web_path}/la_llct-ud-dev.conllu", Path("datasets") / dataset_name)
        cached_path(f"{web_path}/la_llct-ud-test.conllu", Path("datasets") / dataset_name)
        cached_path(f"{web_path}/la_llct-ud-train.conllu", Path("datasets") / dataset_name)

        super().__init__(data_folder, in_memory=in_memory, split_multiwords=split_multiwords)
```

The other datasets can easily be added to Flair (I have assigned the issue to myself).
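Other UD treebanks appear to use the same file layout as the LLCT files in the class above, so the download URLs can be derived from the repository name and the file prefix. A minimal sketch, assuming the `master` branch and the `<prefix>-ud-<split>.conllu` naming seen above; `ud_split_urls` is a hypothetical helper and not part of Flair, and `UD_Latin-ITTB` / `la_ittb` is only an example that may or may not be among the treebanks meant here:

```python
from typing import Dict

# Root of the raw file view for Universal Dependencies repositories on GitHub.
UD_RAW_ROOT = "https://raw.githubusercontent.com/UniversalDependencies"


def ud_split_urls(repo: str, file_prefix: str, branch: str = "master") -> Dict[str, str]:
    """Build the train/dev/test CoNLL-U URLs for one UD treebank.

    Hypothetical helper: it only assembles strings and does not check
    that the files actually exist in the repository.
    """
    base = f"{UD_RAW_ROOT}/{repo}/{branch}"
    return {
        split: f"{base}/{file_prefix}-ud-{split}.conllu"
        for split in ("train", "dev", "test")
    }


# The treebank already supported by UD_LATIN.
llct_urls = ud_split_urls("UD_Latin-LLCT", "la_llct")

# Example of another Latin treebank (an assumption, see the note above).
ittb_urls = ud_split_urls("UD_Latin-ITTB", "la_ittb")
```

Some treebanks may not ship all three splits, so it is worth checking the repository before passing these URLs to `cached_path`.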

For NER, I was unfortunately not able to find the training dataset that was used for LatinCy. It should be located here, but it is currently not available. So I am pinging @diyclassics for help on NER :)

Once these resources are available and integrated into Flair, it should be very easy to train models on them. For example, PoS tagging and NER models can be trained with LMs like Latin BERT as the backbone.
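As a rough illustration, fine-tuning a PoS tagger on the existing `UD_LATIN` corpus with a transformer backbone could look like the sketch below. It follows Flair's usual fine-tuning pattern; the checkpoint name `path/to/latin-bert` is a placeholder (an assumption, not a real model identifier), the hyperparameters are arbitrary defaults, and `train_latin_pos_tagger` is a hypothetical function name:

```python
from pathlib import Path

# Placeholder (assumption): replace with the identifier or local path
# of a Latin BERT checkpoint that the transformers library can load.
LATIN_BERT = "path/to/latin-bert"


def train_latin_pos_tagger(
    transformer_name: str = LATIN_BERT,
    output_dir: str = "resources/taggers/latin-upos",
    max_epochs: int = 10,
) -> Path:
    """Fine-tune a UPOS tagger on UD_Latin-LLCT and return the model path."""
    # Flair is imported here so the module can be loaded without it installed.
    from flair.datasets import UD_LATIN
    from flair.embeddings import TransformerWordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    # Downloads the LLCT treebank on first use.
    corpus = UD_LATIN()
    label_type = "upos"
    label_dict = corpus.make_label_dictionary(label_type=label_type)

    # Transformer backbone, fine-tuned together with the tagging head.
    embeddings = TransformerWordEmbeddings(
        model=transformer_name,
        layers="-1",
        subtoken_pooling="first",
        fine_tune=True,
        use_context=False,
    )

    # Plain linear layer on top of the transformer (no RNN, no CRF).
    tagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=label_dict,
        tag_type=label_type,
        use_crf=False,
        use_rnn=False,
        reproject_embeddings=False,
    )

    trainer = ModelTrainer(tagger, corpus)
    trainer.fine_tune(
        output_dir,
        learning_rate=5e-5,
        mini_batch_size=16,
        max_epochs=max_epochs,
    )
    return Path(output_dir) / "final-model.pt"
```

Calling `train_latin_pos_tagger()` starts the download and training, so nothing runs at import time. An NER model would follow the same pattern with a different corpus and `tag_type`, once the training data is available.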

@stefan-it stefan-it self-assigned this Jan 18, 2024
@ch-sander
Author

ch-sander commented Jan 18, 2024

This sounds awesome! Thanks!

It would be promising to also involve https://github.com/CIRCSE and their many efforts related to the LiLa project @passarom. If I'm right, they also included more Medieval Latin than @diyclassics's model.
