
Inconsistent tokenizer use causes bad predictions ... #1768

Closed
sankaran45 opened this issue Jul 18, 2020 · 2 comments · Fixed by #1806
Labels
bug Something isn't working

Comments

@sankaran45

Describe the bug
I have CSV training/test files that I load with CSVClassificationCorpus and then use for training, etc. The evaluation that runs after training works fine. Then I manually load the CSV file and, for each line, call Sentence(...) and pass it to the predict function. This time the results are arbitrary and poor.

I looked at it a bit, and it turned out that by default Sentence uses SpaceTokenizer (if no use_tokenizer parameter is passed).

CSVClassificationCorpus, on the other hand, uses SegtokTokenizer by default, leading to completely different tokenization (and thus predictions) in the default case of not specifying these parameters.

So I fixed it by passing use_tokenizer=SegtokTokenizer() to my Sentence call before invoking predict.

Quite counter-intuitive ... not necessarily a bug, but posting in case someone else runs into the same issue.
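To make the mismatch concrete, here is a minimal standalone illustration (not Flair's actual code) of how a whitespace-only tokenizer and a segtok-style tokenizer split the same input differently; the `segtok_like_tokenize` helper is a hypothetical simplification of what SegtokTokenizer does:

```python
import re

def space_tokenize(text):
    # Whitespace-only splitting, as in Flair's SpaceTokenizer:
    # punctuation stays glued to the neighboring word.
    return text.split(" ")

def segtok_like_tokenize(text):
    # Rough approximation of segtok-style tokenization:
    # punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Great movie, truly!"
print(space_tokenize(text))        # ['Great', 'movie,', 'truly!']
print(segtok_like_tokenize(text))  # ['Great', 'movie', ',', 'truly', '!']
```

A model trained on the second token sequence will see out-of-vocabulary tokens like "movie," and "truly!" at prediction time if the first tokenizer is used, which explains the degraded results.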

sankaran45 added the bug label Jul 18, 2020
@alanakbik
Collaborator

Yes, that's a good point. We have long considered making segtok the default tokenizer instead, but are unsure if this is the best way to go.

@alanakbik
Collaborator

Merged a PR for this; it will be part of the next Flair release!
