
Inconsistent tokenizer use causes bad predictions ... #1768

Closed
sankaran45 opened this issue Jul 18, 2020 · 2 comments · Fixed by #1806
Labels
bug Something isn't working

Comments

@sankaran45

Describe the bug
I have CSV training/test files that I load with CSVClassificationCorpus and then use for training, etc. The evaluation that runs after training works fine. Then I manually load the CSV file and, for each line, call Sentence(...) and pass it to the predict function. This time the results are arbitrary and poor.

I looked at it a bit, and it turned out that by default Sentence uses SpaceTokenizer (if no use_tokenizer parameter is passed).

CSVClassificationCorpus, on the other hand, uses SegtokTokenizer by default, leading to completely different tokenization (and thus predictions) in the default case of not specifying these parameters.

So I fixed it by passing use_tokenizer=SegtokTokenizer() to my Sentence call before invoking predict.

Quite counter-intuitive ... not necessarily a bug, but posting in case someone else runs into the same issue.
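To make the mismatch concrete, here is a minimal standalone illustration (not Flair's actual code) of how a whitespace-only tokenizer and a segtok-style tokenizer split the same input differently; the `segtok_like_tokenize` helper is a hypothetical simplification of what SegtokTokenizer does:

```python
import re

def space_tokenize(text):
    # Whitespace-only splitting, as in Flair's SpaceTokenizer:
    # punctuation stays glued to the neighboring word.
    return text.split(" ")

def segtok_like_tokenize(text):
    # Rough approximation of segtok-style tokenization:
    # punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Great movie, truly!"
print(space_tokenize(text))        # ['Great', 'movie,', 'truly!']
print(segtok_like_tokenize(text))  # ['Great', 'movie', ',', 'truly', '!']
```

A model trained on the second token sequence will see out-of-vocabulary tokens like "movie," and "truly!" at prediction time if the first tokenizer is used, which explains the degraded results.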

sankaran45 added the bug label Jul 18, 2020
@alanakbik
Collaborator

Yes, that's a good point. We have long considered making segtok the default tokenizer instead, but are unsure if this is the best way to go.

@alanakbik
Collaborator

Merged a PR for this; it will be part of the next Flair release!
