Corpus and Span label refactoring #2607

Merged: 55 commits, Jan 26, 2022

@alanakbik (Collaborator) commented Jan 25, 2022

First part of a larger refactoring of corpora and span logic in Flair. The refactoring is motivated by difficulties in scaling the EntityLinker up, which required looking at Span representations and the corpora.

1 Corpora
The PR unifies several corpus classes into a single object. Before, we had ColumnCorpus, UniversalDependenciesCorpus, CoNLLUCorpus, and EntityLinkingCorpus, which resulted in too much redundancy. Now, there is only the ColumnCorpus for all such datasets.

The unification brings a number of changes:

  • You can now specify a min_count when computing the label dictionary. Labels that occur fewer than min_count times will be UNK'ed. (e.g. tag_dictionary = corpus.make_label_dictionary("ner", min_count=10))
  • The Dictionary will now compute count statistics for labels in a corpus.
  • The ColumnCorpus can now handle relation annotation, dependency tree information, and the UD feats and misc columns.
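To illustrate the min_count behavior described above, here is a minimal pure-Python sketch. It is not Flair's implementation: `build_label_vocab`, `encode_label`, and the `<unk>` string are hypothetical names chosen for this example, and the sketch assumes that a label is kept when its count is at least min_count.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for rare labels


def build_label_vocab(labels, min_count=1):
    """Count labels and keep only those seen at least min_count times."""
    counts = Counter(labels)
    kept = sorted(label for label, n in counts.items() if n >= min_count)
    return [UNK] + kept, counts


def encode_label(label, vocab):
    """Map a label to itself if it is in the vocabulary, else to UNK."""
    return label if label in vocab else UNK


labels = ["PER"] * 12 + ["LOC"] * 10 + ["MISC"] * 3
vocab, counts = build_label_vocab(labels, min_count=10)
```

Here MISC occurs only 3 times, so it is left out of the vocabulary and encodes to the UNK entry, while the count statistics remain available for all three labels.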

2 Span Labels
We now make a special distinction between token-level and span-level labels. Instead of being stored as BIOES tags, entities are now stored as span-level annotations. The SequenceTagger internally converts these to BIOES/BIO tags during training. At prediction time, the predicted BIOES tags are interpreted and added back as span labels.
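The round trip between span annotations and BIOES tags can be sketched in plain Python. This is an illustration only, not the SequenceTagger's internal code: the function names are hypothetical, spans are assumed to be (start, end, label) tuples over token indices with an exclusive end, and malformed tag sequences are simply dropped.

```python
def spans_to_bioes(num_tokens, spans):
    """Convert (start, end, label) spans (end exclusive) to one BIOES tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"  # single-token span
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{label}"
        tags[end - 1] = f"E-{label}"
    return tags


def bioes_to_spans(tags):
    """Read well-formed BIOES tags back into (start, end, label) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, tag_label))
            start = None
        elif prefix == "B":
            start, label = i, tag_label
        elif prefix == "E" and start is not None and tag_label == label:
            spans.append((start, i + 1, label))
            start = None
        elif not (prefix == "I" and start is not None and tag_label == label):
            start = None  # "O" or an inconsistent tag ends any open span
    return spans


# Tokens: George Washington went to Washington
example_spans = [(0, 2, "PER"), (4, 5, "LOC")]
example_tags = spans_to_bioes(5, example_spans)
```

Storing spans and deriving tags on demand means the same annotation can be rendered as either BIOES or BIO without changing the corpus.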

The span-level representation brings a number of changes:

  • You now choose the labeling format when instantiating the SequenceTagger, i.e.
    tagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type="ner",
        tag_format="BIOES",
    )
  • The get_spans() method of the Sentence is removed for now. A similar method will be added back once the span refactoring is complete.
  • The ColumnCorpus will automatically identify which columns are span labels and treat them accordingly
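The PR does not say how span columns are identified. One plausible heuristic, shown here purely as an assumption, is to treat a column as span-level when any of its values carries a B-, I-, E-, or S- prefix. The function name and the sample data are hypothetical; the `{index: name}` mapping mirrors the column-format convention used with ColumnCorpus.

```python
SPAN_PREFIXES = ("B-", "I-", "E-", "S-")


def find_span_columns(rows, label_columns):
    """Return names of label columns whose values look like BIO/BIOES tags.

    rows: list of token rows, each a list of column values.
    label_columns: {column index: column name}, excluding the text column.
    """
    span_columns = set()
    for row in rows:
        for index, name in label_columns.items():
            if index < len(row) and row[index].startswith(SPAN_PREFIXES):
                span_columns.add(name)
    return span_columns


sample = """George NNP B-PER
Washington NNP I-PER
went VBD O
"""
rows = [line.split() for line in sample.splitlines()]
detected = find_span_columns(rows, {1: "pos", 2: "ner"})
```

The text column is left out of `label_columns` on purpose, since a token such as "B-52" would otherwise trigger a false positive.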

The PR also makes a number of other changes:

  • The EntityLinker class is refactored for speed.
  • Performance improvements in the standard evaluate() method, especially for large datasets.
  • A new WordTagger class is created for simple word-level predictions.
  • ColumnCorpus no longer does disk reads when in_memory=False. It simply stores the raw data in memory, leading to significant speed-ups on large datasets.
  • The deprecated data_fetcher is finally removed.
  • The Multiconer corpus object was renamed to NER_MULTI_CONER to match the names of other corpora.
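The in_memory=False change can be pictured as keeping the raw text of each sentence in memory and parsing it only when it is requested. The sketch below is an assumption about the general idea, not the ColumnCorpus code; `RawSentenceStore` is a hypothetical class, and it assumes sentences are separated by blank lines with whitespace-separated columns.

```python
class RawSentenceStore:
    """Keep raw sentence blocks in memory and parse them lazily on access."""

    def __init__(self, text):
        # Split once into raw blocks; no per-token objects are created here.
        self.raw_sentences = [b for b in text.strip().split("\n\n") if b.strip()]

    def __len__(self):
        return len(self.raw_sentences)

    def __getitem__(self, index):
        # Parsing happens per request, from memory rather than from disk.
        return [tuple(line.split()) for line in self.raw_sentences[index].splitlines()]


raw = "George B-PER\nWashington E-PER\n\nBerlin S-LOC\n"
store = RawSentenceStore(raw)
```

Raw strings are much smaller than fully parsed sentence objects, so this trades a little parsing work per access for the removal of repeated file reads.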

@alanakbik marked this pull request as ready for review January 26, 2022 13:50
@alanakbik changed the title from "WIP: Corpus and Span label refactoring" to "Corpus and Span label refactoring" Jan 26, 2022
@alanakbik merged commit bbee0d3 into master Jan 26, 2022
@alanakbik deleted the corpus-refactor branch January 26, 2022 14:22