fixed string index out of range in embeddings.py #1135
Conversation
Hi @eurekaqq, I'm not sure I understand: is the issue related to the way your dataset is encoded, or a general issue with some UTF-8 encoded files in Python?
Hi @pommedeterresautee, sorry about the late reply. You can run this code snippet:

```python
temp = "the losers � ."
for idx, char in enumerate(temp.split()):
    if char:
        print(idx, char)
```

The output on my Manjaro machine is as below: the `�` token is lost. So, the first commit checks that `token.text` is not an empty string.
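To make the behaviour above concrete, here is a minimal, self-contained sketch. It assumes the character rendered as `�` is a C1 control character such as U+0085 (NEL), which Python's `str.split()` treats as whitespace; the actual character in the dataset may be different.

```python
# U+0085 (NEL) is a C1 control character that str.isspace() reports as whitespace,
# so str.split() swallows it together with the surrounding spaces.
temp = "the losers \u0085 ."

print("\u0085".isspace())  # True

for idx, token in enumerate(temp.split()):
    print(idx, token)
# 0 the
# 1 losers
# 2 .
# The control-character "token" is gone, so downstream token indices shift.
```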
Hello @eurekaqq, thanks for adding this. Unfortunately, when I run this version on the NEWSGROUPS corpus, I get different statistics than with the current version. It seems there are some very strange characters in the NEWSGROUPS corpus that get converted with this method, but not to correct symbols. To reproduce:

```python
from flair.datasets import NEWSGROUPS

corpus = NEWSGROUPS()
print(corpus.obtain_statistics())
```

Compare the statistics before and after. I've looked into some converted symbols; for instance, the file is a Non-ISO extended-ASCII text file that contains the string
@alanakbik, thank you for testing.
Hi @alanakbik, I tried to reproduce this problem:

```python
temp = 'J�rg Viola'
print(temp)
print(temp.encode('utf-8'))
print(_restore_windows_1252_characters(temp))
print(_restore_windows_1252_characters(temp).encode('utf-8'))
```

Then the output is as below:
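For anyone who wants to run the reproduction above outside of flair, here is a minimal sketch of what `_restore_windows_1252_characters` could look like, based on the widely shared Stack Overflow approach the PR description points to; the exact character range and error handling in embeddings.py may differ.

```python
import re

def _restore_windows_1252_characters(text: str) -> str:
    """Map C1 control characters (U+0080-U+009F) back to the Windows-1252
    characters their byte values encode; drop code points with no mapping."""
    def to_windows_1252(match: re.Match) -> str:
        try:
            return bytes([ord(match.group(0))]).decode("windows-1252")
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Windows-1252
            return ""
    return re.sub(r"[\u0080-\u009f]", to_windows_1252, text)

print(_restore_windows_1252_characters("the losers \u0085 ."))  # "the losers … ."
```

Note that this assumes the original bytes were Windows-1252; for files in other extended-ASCII encodings (such as the NEWSGROUPS file mentioned above) the mapping can produce the wrong symbol, which is what the statistics mismatch reflects.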
By the way, in the reproduction.
Hi @alanakbik, sorry, I don't know much about
Hello @eurekaqq, it looks like the newsgroup corpus has some very weird (but rare) encoding artifacts. On all other datasets your code runs well; in fact, it corrects some errors in my IMDB dataset. So I think we can merge!
👍 |
Fixed issue #1131.

This is because in embeddings.py, token.text is an empty string. token.text is empty because the training data contains C1 control characters. So, I added a condition to check that token.text is not empty, and to convert C1 control characters to the correct characters.
About the C1 control character fix, see the Stack Overflow reference.
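Below is a hedged sketch of the kind of guard the description refers to; the function name and structure are made up for illustration and this is not the actual diff in embeddings.py. It reuses the `_restore_windows_1252_characters` sketch shown earlier in the thread.

```python
def clean_token_texts(token_texts):
    """Illustrative only: restore C1 control characters, then skip tokens
    whose text ends up empty instead of indexing into an empty string."""
    cleaned = []
    for text in token_texts:
        text = _restore_windows_1252_characters(text)
        if not text:
            # an empty token.text previously reached code like text[0],
            # raising "IndexError: string index out of range"
            continue
        cleaned.append(text)
    return cleaned

print(clean_token_texts(["J\u00f6rg", "", "\u0085", "Viola"]))  # ['Jörg', '…', 'Viola']
```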