fixed string index out of range in embeddings.py #1135

eurekaqq · 2019-09-20T02:08:41Z

fixed issue1131,
This is because in embedding.py, token.text is an empty string.
The empty string in token.text is because the training data contains C1 control characters.

So, I added a condition to check if token.text is not empty and make C1 control char to correct char.

about C1 control char fixed stackoverflow ref

pommedeterresautee · 2019-09-20T08:42:12Z

Hi @eurekaqq , not sure to understand, is the issue related to the way your dataset is encoded or a general issue with some UTF-8 encoded files on Python?
The idea behind is: is it better to detect the issue and display a warning to the final user to let him fix the way he/she wants? Or Should Flair silently repair the issue? Are C1 the only characters to fix?

eurekaqq · 2019-09-20T12:13:19Z

Hi @pommedeterresautee , sorry about late reply.
The issue is a general issue with some UTF-8 encoded files.
If your dataset contains C1 control characters and then uses str.strip() or str.split()..., the C1 control characters will be the same as the empty characters and removed.

you can run this code snip:
the � is b'\xc2\x85'.

temp = "the losers � ."

for idx, char in enumerate(temp.split()):
    if char:
        print(idx, char)

then output in my manjaro computer is like below:

0 the
1 losers
2 .

it lost �(b'\xc2\x85')

So, in the first commit is check the token.text is empty or not.
And the second commit, it is to change the C1 control character to the encoding with windows_1252, so that using str.strip(), it will not be an empty character and correct print out word.
by second commit, the output will be like below:

0 the
1 losers
2 …
3 .

alanakbik · 2019-09-25T13:16:52Z

Hello @eurekaqq thanks for adding this. Unfortunately when I run this version on the NEWSGROUPS corpus, I get different statistics than with the current version. It seems there are some very strang characters in the NEWSGROUPS corpus that get converted with this method, but not to correct symbols.

To reproduce:

from flair.datasets import NEWSGROUPS
corpus = NEWSGROUPS()
print(corpus.obtain_statistics())

Compare statistics before and after. I've looked into some converted symbols, for instance the file

~/.flair/datasets/newsgroups/original/20news-bydate-train/comp.windows.x/66923

is a Non-ISO extended-ASCII text file that contains the string J�rg Viola that gets converted with this code to J”rg Viola which is likely wrong since it should be an umlaut. So it seems the PR currently will sometimes convert characters into incorrect characters.

eurekaqq · 2019-09-25T14:06:55Z

@alanakbik , thank you for your testing.
I will try the use NEWSGROUPS corpus and fix it!

eurekaqq · 2019-09-26T12:44:41Z

Hi @alanakbik , I try to reproduce this problem.
and J�rg Viola in line 3097 of 20news-bydate-train.txt correctly converted.
you can run blow code to check:

temp = 'J�rg Viola'
print(temp)
print(temp.encode('utf-8'))
print(_restore_windows_1252_characters(temp))
print(_restore_windows_1252_characters(temp).encode('utf-8'))

then, output is like blow:

J�rg Viola
b'J\xc2\x94rg Viola'
J”rg Viola
b'J\xe2\x80\x9drg Viola'

0x94 in output2 is a C1 control character in utf-8. Therefore, you need to decode it by windows-1252.
Then, you will get %E2 %80 %9D in output4
about convert table, you can ref it.

eurekaqq · 2019-09-26T12:47:20Z

by the way, in the reproduction.
I find the bug in open file without utf-8 encoding.

eurekaqq · 2019-10-02T13:29:34Z

Hi @alanakbik , sorry, I don’t know much about umlaut.
Last week, maybe I misunderstood what you mean.
Your umlaut means Ĵ, right?
If so, I think this process has been converted correctly.
Because Ĵ is \xc4\xb4 or J\xcc\x82 in Unicode.
Here is J\xc2\x94.
If not, can you tell me what unicode you think is correct?

alanakbik · 2019-10-02T15:33:20Z

Hello @eurekaqq it looks like the newsgroup corpus has some very weird (but rare) encoding artifacts. On all other datasets you're code runs good, in fact it corrects some errors in my IMDB dataset. So I think we can merge!

alanakbik · 2019-10-02T15:34:10Z

👍

yosipk · 2019-10-04T09:09:46Z

👍

eurekaqq_lin added 3 commits September 20, 2019 09:41

fixed string index out of range in embeddings.py

ab29beb

Fixed an issue about making the C1 control character to an empty str

a92a1dc

fixed coding style

778ace9

fixed open file without utf-8 encoding

1e8f52c

alanakbik merged commit b81bf84 into flairNLP:master Oct 4, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fixed string index out of range in embeddings.py #1135

fixed string index out of range in embeddings.py #1135

eurekaqq commented Sep 20, 2019 •

edited

Loading

pommedeterresautee commented Sep 20, 2019

eurekaqq commented Sep 20, 2019

alanakbik commented Sep 25, 2019

eurekaqq commented Sep 25, 2019

eurekaqq commented Sep 26, 2019

eurekaqq commented Sep 26, 2019

eurekaqq commented Oct 2, 2019

alanakbik commented Oct 2, 2019 •

edited

Loading

alanakbik commented Oct 2, 2019

yosipk commented Oct 4, 2019

fixed string index out of range in embeddings.py #1135

fixed string index out of range in embeddings.py #1135

Conversation

eurekaqq commented Sep 20, 2019 • edited Loading

pommedeterresautee commented Sep 20, 2019

eurekaqq commented Sep 20, 2019

alanakbik commented Sep 25, 2019

eurekaqq commented Sep 25, 2019

eurekaqq commented Sep 26, 2019

eurekaqq commented Sep 26, 2019

eurekaqq commented Oct 2, 2019

alanakbik commented Oct 2, 2019 • edited Loading

alanakbik commented Oct 2, 2019

yosipk commented Oct 4, 2019

eurekaqq commented Sep 20, 2019 •

edited

Loading

alanakbik commented Oct 2, 2019 •

edited

Loading