Add error handling and specify encodings. #129

Merged
merged 1 commit into from
Jun 9, 2020
sebpuetz
Member

@sebpuetz sebpuetz commented Jun 9, 2020

There's some noise from refactoring the scripts, mostly due to the odd design of Format.{to, from}.

@sebpuetz sebpuetz requested a review from danieldk June 9, 2020 09:33
@sebpuetz sebpuetz linked an issue Jun 9, 2020 that may be closed by this pull request
Member

@danieldk danieldk left a comment

Looks good, but maybe you could replace UTF8 in the documentation with UTF-8, which is the name of the standard ;).

action="store_true",
default=False,
help=
"Whether to fail on malformed UTF8. Setting this flag replaces malformed UTF8"
Member

UTF-8
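
For illustration, here is a minimal sketch of how a flag like the one under review could be declared with `argparse`. The flag name `--lossy`, the parser description, and the full help text are assumptions (the snippet above is truncated and does not show them); only `action="store_true"` and `default=False` come from the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the flag name and help wording are assumptions,
    # not the text that was merged in this pull request.
    parser = argparse.ArgumentParser(description="Illustrative conversion script.")
    parser.add_argument(
        "--lossy",
        action="store_true",
        default=False,
        help="Replace malformed UTF-8 sequences with U+FFFD instead of failing.",
    )
    return parser


# Without the flag the script would fail on malformed input; with it, it would not.
print(build_parser().parse_args([]).lossy)
print(build_parser().parse_args(["--lossy"]).lossy)
```

Because the flag uses `store_true`, strict behaviour stays the default and lossy decoding has to be requested explicitly.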

Path to a file with embeddings in word2vec binary format.
Path to a file with embeddings in text format with dimensions on the first line.
lossy : bool
If set to true, malformed UTF8 sequences in words will be replaced with the `U+FFFD`
Member

UTF-8

@@ -57,14 +64,18 @@ def load_text(file: Union[str, bytes, int, PathLike]) -> Embeddings:
----------
file : str, bytes, int, PathLike
Path to a file with embeddings in word2vec binary format.
lossy : bool
If set to true, malformed UTF8 sequences in words will be replaced with the `U+FFFD`
Member

UTF-8

@@ -30,6 +31,9 @@ def load_word2vec(file: Union[str, bytes, int, PathLike]) -> Embeddings:
----------
file : str, bytes, int, PathLike
Path to a file with embeddings in word2vec binary format.
lossy : bool
If set to true, malformed UTF8 sequences in words will be replaced with the `U+FFFD`
Member

UTF-8
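
The behaviour that the `lossy` docstrings describe can be sketched in plain Python with the built-in `bytes.decode` error handlers. This is not the library's implementation; `decode_word` is a hypothetical helper that only shows the difference between failing on malformed UTF-8 and replacing it with `U+FFFD`.

```python
def decode_word(raw: bytes, lossy: bool = False) -> str:
    """Decode a word, optionally replacing malformed UTF-8 with U+FFFD.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    # "replace" substitutes U+FFFD for malformed sequences,
    # "strict" raises UnicodeDecodeError instead.
    errors = "replace" if lossy else "strict"
    return raw.decode("utf-8", errors=errors)


malformed = b"caf\xff"  # 0xFF can never appear in valid UTF-8

print(repr(decode_word(malformed, lossy=True)))  # 'caf\ufffd'

try:
    decode_word(malformed)
except UnicodeDecodeError as err:
    print("strict decoding failed:", err.reason)
```

Lossy decoding is not reversible: two different malformed byte sequences can map to the same replacement character, so distinct words may collide after loading.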

@sebpuetz
Member Author

sebpuetz commented Jun 9, 2020

Thanks, fixed!

@sebpuetz sebpuetz merged commit d5e352d into master Jun 9, 2020
@sebpuetz sebpuetz deleted the lossy branch June 9, 2020 10:23
Development

Successfully merging this pull request may close these issues.

Handling malformed UTF8
2 participants