Corpus and Span label refactoring #2607
Merged
First part of a larger refactoring of corpora and span logic in Flair. The refactoring is motivated by difficulties in scaling the EntityLinker up, which required looking at Span representations and the corpora.
1 Corpora
The PR unifies several corpora into a single object. Before, we had `ColumnCorpus`, `UniversalDependenciesCorpus`, `CoNLLUCorpus`, and `EntityLinkingCorpus`, which resulted in too much redundancy. Now, there is only the `ColumnCorpus` for all such datasets.
This makes a number of changes:
- A `min_count` can now be specified when computing the label dictionary. Labels below that count will be UNK'ed (e.g. `tag_dictionary = corpus.make_label_dictionary("ner", min_count=10)`).
- `Dictionary` will now compute count statistics for labels in a corpus
- `ColumnCorpus` can now handle relation annotation, dependency tree information and UD feats and misc
2 Span Labels
We now make a special distinction between token-level and span-level labels. Instead of storing entities as BIOES tags, they are now stored as span-level annotations. The `SequenceTagger` internally converts these to BIOES/BIO tags during training. At prediction time, BIOES tags are interpreted and span labels are added.
This makes a number of changes:
- [...] `SequenceTagger`, i.e. [...]
- [...] `get_spans()` method of the Sentence is removed for now; a similar method will be added back in when the span refactoring is complete
- `ColumnCorpus` will automatically identify which columns are span labels and treat them accordingly
The PR also makes a number of other changes:
- `EntityLinker` class refactored for speed
- [...] `evaluate()` method, especially for large datasets
- `WordTagger` class is created for simple word-level predictions
- `ColumnCorpus` no longer does disk reads when `in_memory=False`; it simply stores the raw data in memory, leading to significant speed-ups on large datasets
- [...] `NER_MULTI_CONER` to match names of other corpora
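To make the span-label idea from section 2 concrete, here is a minimal sketch of converting between span-level annotations and per-token BIOES tags. It is not Flair's implementation: the function names `spans_to_bioes` and `bioes_to_spans` and the `(start, end, label)` tuple layout (with an exclusive end index) are assumptions chosen for illustration only.

```python
from typing import List, Tuple

# Hypothetical span representation: (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def spans_to_bioes(spans: List[Span], num_tokens: int) -> List[str]:
    """Expand span annotations into one BIOES tag per token (training direction)."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"          # single-token span
        else:
            tags[start] = f"B-{label}"          # first token of the span
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"          # inner tokens
            tags[end - 1] = f"E-{label}"        # last token of the span
    return tags


def bioes_to_spans(tags: List[str]) -> List[Span]:
    """Interpret per-token BIOES tags as spans (prediction direction)."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, value = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, value))
            start, label = None, None
        elif prefix == "B":
            start, label = i, value
        elif prefix == "I" and start is not None and value == label:
            continue                            # still inside the open span
        elif prefix == "E" and start is not None and value == label:
            spans.append((start, i + 1, value))
            start, label = None, None
        else:
            start, label = None, None           # "O" or an inconsistent sequence
    return spans


# Tokens: George Washington went to Washington
example_spans = [(0, 2, "PER"), (4, 5, "LOC")]
example_tags = spans_to_bioes(example_spans, num_tokens=5)
```

In this sketch, inconsistent tag sequences (for example an `I-` tag without a preceding `B-`) are simply dropped; a real tagger may choose a more lenient repair strategy.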