Corpus and Span label refactoring #2607

alanakbik · 2022-01-25T16:26:58Z

First part of a larger refactoring of corpora and span logic in Flair. The refactoring is motivated by difficulties in scaling the EntityLinker up, which required looking at Span representations and the corpora.

1 Corpora
The PR unifies several corpora into a single object. Before, we had ColumnCorpus, UniversalDependenciesCorpus, CoNNLuCorpus, and EntityLinkingCorpus, which resulted in too much redundancy. Now, there is only the ColumnCorpus for all such datasets.

This makes a number of changes:

You can now specify a min_count when computing the label dictionary. Labels that occur fewer than min_count times will be UNK'ed (e.g. tag_dictionary = corpus.make_label_dictionary("ner", min_count=10)).
The Dictionary now computes count statistics for the labels in a corpus.
The ColumnCorpus can now handle relation annotation, dependency tree information, and the UD feats and misc fields.
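To illustrate the min_count behaviour described above, here is a minimal sketch in plain Python. It is not Flair's implementation: the function names build_label_dictionary and encode and the "<unk>" placeholder are assumptions made for this example only.

```python
from collections import Counter

UNK = "<unk>"  # placeholder symbol chosen for this sketch


def build_label_dictionary(labels, min_count=1):
    """Count all labels and keep only those seen at least min_count times.

    Returns the vocabulary (UNK first, then kept labels in sorted order)
    together with the full count statistics.
    """
    counts = Counter(labels)
    kept = sorted(label for label, n in counts.items() if n >= min_count)
    return [UNK] + kept, counts


def encode(label, vocabulary):
    """Map a label to itself if it is in the vocabulary, otherwise to UNK."""
    return label if label in vocabulary else UNK


# toy data: MISC occurs only 3 times and falls below min_count=10
labels = ["PER"] * 12 + ["LOC"] * 10 + ["MISC"] * 3
vocabulary, counts = build_label_dictionary(labels, min_count=10)
```

In this toy example, MISC stays in the count statistics but is left out of the vocabulary, so encode("MISC", vocabulary) falls back to the UNK symbol.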

2 Span Labels
We now make a special distinction between token-level and span-level labels. Instead of storing entities as BIOES tags, they are now stored as span-level annotations. The SequenceTagger internally converts these to BIOES/BIO tags during training. At prediction time, the predicted BIOES tags are interpreted and the corresponding span labels are added.

This makes a number of changes:

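You now choose the labeling format when instantiating the SequenceTagger, for example:

    from flair.models import SequenceTagger

    # embeddings and tag_dictionary are assumed to be defined beforehand
    tagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type="ner",
        tag_format="BIOES",  # labeling format used internally
    )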

The get_spans() method of the Sentence is removed for now. A similar method will be added back in when the span refactoring is complete.
The ColumnCorpus will automatically identify which columns are span labels and treat them accordingly.
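The conversion between span-level annotations and BIOES tags described in this section can be sketched as follows. This is an illustrative stand-alone example, not the code from this PR: the functions spans_to_bioes and bioes_to_spans and the (start, end, label) tuple layout with an exclusive end index are assumptions made for the sketch.

```python
def spans_to_bioes(num_tokens, spans):
    """Turn non-overlapping (start, end, label) spans into one BIOES tag per token.

    The end index is exclusive.
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"  # single-token span
        else:
            tags[start] = f"B-{label}"  # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # inner tokens
            tags[end - 1] = f"E-{label}"  # last token
    return tags


def bioes_to_spans(tags):
    """Read spans back from BIOES tags, dropping malformed sequences."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, tag_label))
            start, label = None, None
        elif prefix == "B":
            start, label = i, tag_label
        elif prefix == "I" and start is not None and tag_label == label:
            continue  # still inside the open span
        elif prefix == "E" and start is not None and tag_label == label:
            spans.append((start, i + 1, tag_label))
            start, label = None, None
        else:
            # "O" or an inconsistent tag closes any open span without emitting it
            start, label = None, None
    return spans


# tokens: George Washington went to Washington
gold_spans = [(0, 2, "PER"), (4, 5, "LOC")]
tags = spans_to_bioes(5, gold_spans)
decoded = bioes_to_spans(tags)
```

The decoder here is deliberately strict: a sequence such as B-PER followed by O yields no span. A real tagger may prefer a more lenient reading of malformed predictions.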

The PR also makes a number of other changes:

The EntityLinker class was refactored for speed.
Performance improvements in the standard evaluate() method, especially for large datasets.
A new WordTagger class was created for simple word-level predictions.
The ColumnCorpus no longer does disk reads when in_memory=False. Instead, it stores the raw data in memory, which leads to significant speed-ups on large datasets.
The deprecated data_fetcher is finally removed.
The Multiconer corpus object was renamed to NER_MULTI_CONER to match the names of other corpora.
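As a rough illustration of the in_memory=False change listed above, the sketch below keeps the raw column lines of each sentence in memory and parses them only on access. The PR text only states that raw data is stored in memory; parsing on access, the class name LazyColumnDataset, and its layout are assumptions made for this example, not Flair's actual code.

```python
class LazyColumnDataset:
    """Toy dataset over column-formatted text with blank lines between sentences."""

    def __init__(self, text, in_memory=True):
        self.in_memory = in_memory
        # raw lines per sentence are always held in memory (no disk reads later)
        self._raw = [block.splitlines() for block in text.strip().split("\n\n")]
        # with in_memory=True, parse everything up front
        self._parsed = [self._parse(b) for b in self._raw] if in_memory else None

    @staticmethod
    def _parse(lines):
        # each line holds whitespace-separated columns, e.g. "token tag"
        return [tuple(line.split()) for line in lines]

    def __len__(self):
        return len(self._raw)

    def __getitem__(self, index):
        if self.in_memory:
            return self._parsed[index]
        # with in_memory=False, parse the stored raw lines on demand
        return self._parse(self._raw[index])


data = "George B-PER\nWashington E-PER\n\nBerlin S-LOC\n"
lazy = LazyColumnDataset(data, in_memory=False)
eager = LazyColumnDataset(data, in_memory=True)
```

Both variants return the same parsed sentences. The lazy one trades repeated parsing for a smaller footprint, since raw strings are typically much lighter than fully constructed sentence objects.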

alanakbik added 30 commits December 23, 2021 15:45

Change in_memory logic

ce6b51d

Begin refactor span annotation logic

9260165

Refactor out EntityLinkingCorpus

07feef2

Remove NP from CONLL_03 presets

d146705

Use Span logic

149d278

Fix heuristic for detecting BIOES tags

407965d

Extra output in dictionary creation

b4ba51b

Serialize loss weights

ac43b16

Make evaluation of single-label problems memory efficient

90c3af7

Fix micro avg key error in classification problems with no out label

f82d0f1

Re-enable non-BIOES in SequenceTagger

f1af960

Remove workers from evaluate function

c5285b0

Predict method checks if data loader is necessary

efc0817

Fix data loader

de95244

Merge branch 'master' into corpus-refactor

33aa8bf

Change data structure to set to increase speed on large datasets

dd12388

Change num_workers presets

83e5e48

Merge branch 'master' into corpus-refactor

967adbe

Merge branch 'master' into corpus-refactor

8408238

Formatting

0825d89

More changes to get_label logic

c53eb68

Merge branch 'master' into corpus-refactor

719079e

Change BIOES label creation

d0159ba

Changes in how BIOES tags are read

b41d9a4

Utility module for BIOES tags

36ef0c6

Add Tqdm to filtering

8b1d7fc

Updates to BIOES detection heuristic

83444cc

Refactor out get_spans

42f6212

UNK-skip probability in EntityLinker

d1e7846

Add dropout to entity linker

decd3a6

alanakbik added 24 commits January 2, 2022 10:29

Reduce cat operations in forward pass

4a05335

Reduce cat operations in forward pass

b3c8df5

Refactor simple sequence tagger into word tagger for new span logic

11c0742

Nicer printout

e368a92

Rename sense key

57d73a9

Make sequence TARS compatible with new logic

9d3d0d3

Fix TARS formatting

9d618f8

Remove UNK from dic

6752e2e

Modified tests for missing get_span

904d9f4

Filter lines

ad4a563

reformat

16a23ff

Begin refactoring out CoNNLU corpus

093e334

Start fixing unit tests

e20bb5a

Make method abstract

91d5c9a

modify span logic

dc1d584

Fix unit tests

f190843

Merge branch 'master' into corpus-refactor

4c73f1f

merge errors

e1d60c2

Fix unit test

1087539

Fix unit tests

d0330bb

Fix unit tests

3ace285

Fix unit tests

151dd20

Remove comment

63ad9c6

Remove comment

7dd91ea

alanakbik marked this pull request as ready for review January 26, 2022 13:50

alanakbik changed the title ~~WIP: Corpus and Span label refactoring~~ Corpus and Span label refactoring Jan 26, 2022

Remove comment

2cfa40b

alanakbik merged commit bbee0d3 into master Jan 26, 2022

alanakbik deleted the corpus-refactor branch January 26, 2022 14:22
