Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixed string index out of range in embeddings.py #1135

Merged
merged 4 commits into from
Oct 4, 2019

Conversation

eurekaqq
Copy link
Contributor

@eurekaqq eurekaqq commented Sep 20, 2019

fixed issue1131,
This is because in embedding.py, token.text is an empty string.
The empty string in token.text is because the training data contains C1 control characters.

So, I added a condition to check if token.text is not empty and make C1 control char to correct char.

about C1 control char fixed stackoverflow ref

@pommedeterresautee
Copy link
Contributor

Hi @eurekaqq , not sure to understand, is the issue related to the way your dataset is encoded or a general issue with some UTF-8 encoded files on Python?
The idea behind is: is it better to detect the issue and display a warning to the final user to let him fix the way he/she wants? Or Should Flair silently repair the issue? Are C1 the only characters to fix?

@eurekaqq
Copy link
Contributor Author

Hi @pommedeterresautee , sorry about late reply.
The issue is a general issue with some UTF-8 encoded files.
If your dataset contains C1 control characters and then uses str.strip() or str.split()..., the C1 control characters will be the same as the empty characters and removed.

you can run this code snip:
the is b'\xc2\x85'.

temp = "the losers � ."

for idx, char in enumerate(temp.split()):
    if char:
        print(idx, char)

then output in my manjaro computer is like below:

0 the
1 losers
2 .

it lost �(b'\xc2\x85')

So, in the first commit is check the token.text is empty or not.
And the second commit, it is to change the C1 control character to the encoding with windows_1252, so that using str.strip(), it will not be an empty character and correct print out word.
by second commit, the output will be like below:

0 the
1 losers
2 …
3 .

@alanakbik
Copy link
Collaborator

Hello @eurekaqq thanks for adding this. Unfortunately when I run this version on the NEWSGROUPS corpus, I get different statistics than with the current version. It seems there are some very strang characters in the NEWSGROUPS corpus that get converted with this method, but not to correct symbols.

To reproduce:

from flair.datasets import NEWSGROUPS
corpus = NEWSGROUPS()
print(corpus.obtain_statistics())

Compare statistics before and after. I've looked into some converted symbols, for instance the file

~/.flair/datasets/newsgroups/original/20news-bydate-train/comp.windows.x/66923

is a Non-ISO extended-ASCII text file that contains the string J�rg Viola that gets converted with this code to J”rg Viola which is likely wrong since it should be an umlaut. So it seems the PR currently will sometimes convert characters into incorrect characters.

@eurekaqq
Copy link
Contributor Author

@alanakbik , thank you for your testing.
I will try the use NEWSGROUPS corpus and fix it!

@eurekaqq
Copy link
Contributor Author

Hi @alanakbik , I try to reproduce this problem.
and J�rg Viola in line 3097 of 20news-bydate-train.txt correctly converted.
you can run blow code to check:

temp = 'J�rg Viola'
print(temp)
print(temp.encode('utf-8'))
print(_restore_windows_1252_characters(temp))
print(_restore_windows_1252_characters(temp).encode('utf-8'))

then, output is like blow:

J�rg Viola
b'J\xc2\x94rg Viola'
J”rg Viola
b'J\xe2\x80\x9drg Viola'

0x94 in output2 is a C1 control character in utf-8. Therefore, you need to decode it by windows-1252.
Then, you will get %E2 %80 %9D in output4
about convert table, you can ref it.

@eurekaqq
Copy link
Contributor Author

by the way, in the reproduction.
I find the bug in open file without utf-8 encoding.

@eurekaqq
Copy link
Contributor Author

eurekaqq commented Oct 2, 2019

Hi @alanakbik , sorry, I don’t know much about umlaut.
Last week, maybe I misunderstood what you mean.
Your umlaut means Ĵ, right?
If so, I think this process has been converted correctly.
Because Ĵ is \xc4\xb4 or J\xcc\x82 in Unicode.
Here is J\xc2\x94.
If not, can you tell me what unicode you think is correct?

@alanakbik
Copy link
Collaborator

alanakbik commented Oct 2, 2019

Hello @eurekaqq it looks like the newsgroup corpus has some very weird (but rare) encoding artifacts. On all other datasets you're code runs good, in fact it corrects some errors in my IMDB dataset. So I think we can merge!

@alanakbik
Copy link
Collaborator

👍

1 similar comment
@yosipk
Copy link
Collaborator

yosipk commented Oct 4, 2019

👍

@alanakbik alanakbik merged commit b81bf84 into flairNLP:master Oct 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IndexError while making NER using flair example with custom dataset
4 participants