Add error handling and specify encodings. #129
Looks good, but maybe you could replace UTF8 in the documentation by UTF-8, which is the name of the standard ;).
src/finalfusion/scripts/util.py
Outdated
action="store_true",
default=False,
help=
"Whether to fail on malformed UTF8. Setting this flag replaces malformed UTF8"
UTF-8
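For context, a boolean flag like the one in this hunk could be wired up roughly as follows. This is only a sketch: the flag name `--lossy` and the reworded help text are assumptions for illustration, since the hunk above does not show the option name or the full help string.

```python
import argparse

# Hypothetical parser; the real script's option name is not visible in the hunk.
parser = argparse.ArgumentParser(description="Illustrative conversion script")
parser.add_argument(
    "--lossy",  # assumed flag name
    action="store_true",
    default=False,
    help="Replace malformed UTF-8 sequences with U+FFFD instead of failing.",
)

# Without the flag the value stays False (strict decoding);
# passing the flag switches to lossy decoding.
strict_args = parser.parse_args([])
lossy_args = parser.parse_args(["--lossy"])
print(strict_args.lossy, lossy_args.lossy)
```

With `action="store_true"` the option takes no value, so the help text reads best when it describes what passing the flag does.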
src/finalfusion/compat/text.py
Outdated
Path to a file with embeddings in word2vec binary format.
Path to a file with embeddings in text format with dimensions on the first line.
lossy : bool
If set to true, malformed UTF8 sequences in words will be replaced with the `U+FFFD`
UTF-8
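To illustrate the behaviour the `lossy` docstring describes, here is a minimal sketch using Python's built-in codec error handlers. The helper name `decode_word` is hypothetical and is not taken from finalfusion; the actual readers may decode words differently.

```python
def decode_word(raw: bytes, lossy: bool = False) -> str:
    """Decode a word read from an embeddings file (hypothetical helper).

    With lossy=True, malformed UTF-8 sequences are replaced with U+FFFD;
    otherwise decoding raises UnicodeDecodeError.
    """
    errors = "replace" if lossy else "strict"
    return raw.decode("utf-8", errors=errors)


malformed = b"caf\xff"  # 0xFF can never appear in valid UTF-8

# Lossy decoding keeps the readable prefix and substitutes U+FFFD.
print(repr(decode_word(malformed, lossy=True)))

# Strict decoding (the default here) fails on the same input.
try:
    decode_word(malformed)
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)
```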
src/finalfusion/compat/text.py
Outdated
@@ -57,14 +64,18 @@ def load_text(file: Union[str, bytes, int, PathLike]) -> Embeddings:
----------
file : str, bytes, int, PathLike
Path to a file with embeddings in word2vec binary format.
lossy : bool
If set to true, malformed UTF8 sequences in words will be replaced with the `U+FFFD`
UTF-8
src/finalfusion/compat/word2vec.py
Outdated
@@ -30,6 +31,9 @@ def load_word2vec(file: Union[str, bytes, int, PathLike]) -> Embeddings:
----------
file : str, bytes, int, PathLike
Path to a file with embeddings in word2vec binary format.
lossy : bool
If set to true, malformed UTF8 sequences in words will be replaced with the `U+FFFD`
UTF-8
Thanks, fixed!
There's some noise from refactoring the scripts, mostly due to the odd design of `Format.{to, from}`.