Add error handling and specify encodings. #129

sebpuetz · 2020-06-09T09:33:31Z

There's some noise from refactoring the scripts, mostly due to the odd design of Format.{to, from}.

danieldk

Looks good, but maybe you could replace UTF8 in the documentation by UTF-8, which is the name of the standard ;).

danieldk · 2020-06-09T09:56:44Z

src/finalfusion/scripts/util.py

+        action="store_true",
+        default=False,
+        help=
+        "Whether to fail on malformed UTF8. Setting this flag replaces malformed UTF8"
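
For readers following the review, the hunk above appears to register a boolean command-line switch. A minimal sketch of how such a switch behaves with `argparse` follows; the flag name `--lossy`, the helper `build_parser`, and the help wording are assumptions for illustration and are not taken from the diff, which is truncated here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real script's other arguments are not shown in the diff.
    parser = argparse.ArgumentParser(description="illustrative conversion script")
    parser.add_argument(
        "--lossy",  # assumed flag name, chosen to match the `lossy` parameter in the docstrings
        action="store_true",
        default=False,
        help="Replace malformed UTF-8 sequences with U+FFFD instead of failing.",
    )
    return parser


# Without the flag the value stays False (strict decoding); with it, True.
strict_args = build_parser().parse_args([])
lossy_args = build_parser().parse_args(["--lossy"])
```

With `action="store_true"`, the explicit `default=False` is redundant but harmless, so keeping it as in the diff only makes the default visible to readers.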


danieldk · 2020-06-09T09:58:01Z

src/finalfusion/compat/text.py

-        Path to a file with embeddings in word2vec binary format.
+        Path to a file with embeddings in text format with dimensions on the first line.
+    lossy : bool
+        If set to true, malformed UTF8 sequences in words will be replaced with the `U+FFFD`


danieldk · 2020-06-09T09:58:12Z

src/finalfusion/compat/text.py

@@ -57,14 +64,18 @@ def load_text(file: Union[str, bytes, int, PathLike]) -> Embeddings:
    ----------
    file : str, bytes, int, PathLike
        Path to a file with embeddings in word2vec binary format.
+    lossy : bool
+        If set to true, malformed UTF8 sequences in words will be replaced with the `U+FFFD`


danieldk · 2020-06-09T09:58:22Z

src/finalfusion/compat/word2vec.py

@@ -30,6 +31,9 @@ def load_word2vec(file: Union[str, bytes, int, PathLike]) -> Embeddings:
    ----------
    file : str, bytes, int, PathLike
        Path to a file with embeddings in word2vec binary format.
+    lossy : bool
+        If set to true, malformed UTF8 sequences in words will be replaced with the `U+FFFD`
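
The `lossy` behaviour described in these docstrings can be sketched with Python's built-in decoder error handlers. This is only an illustration under assumptions: `decode_word` is a hypothetical helper, and the diff does not show how finalfusion actually implements the option.

```python
def decode_word(raw: bytes, lossy: bool = False) -> str:
    """Decode a word read from an embeddings file.

    Hypothetical helper: with lossy=True, malformed UTF-8 is replaced by
    U+FFFD; otherwise decoding raises UnicodeDecodeError.
    """
    errors = "replace" if lossy else "strict"
    return raw.decode("utf-8", errors=errors)


malformed = b"caf\xff"  # 0xFF can never appear in valid UTF-8

# Lossy decoding keeps the valid prefix and substitutes the bad byte.
replaced = decode_word(malformed, lossy=True)

# Strict decoding reports the error instead of altering the word.
try:
    decode_word(malformed)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Replacing bytes changes the vocabulary entry, so two distinct malformed words can collapse into the same string; that is one reason to keep strict decoding as the default.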


sebpuetz · 2020-06-09T10:10:02Z

Thanks, fixed!

sebpuetz requested a review from danieldk June 9, 2020 09:33

sebpuetz linked an issue Jun 9, 2020 that may be closed by this pull request

Handling malformed UTF8 #91

Closed

danieldk approved these changes Jun 9, 2020

Add error handling and specify encodings.

2762bff

sebpuetz force-pushed the lossy branch from aa18005 to 2762bff Compare June 9, 2020 10:04

sebpuetz merged commit d5e352d into master Jun 9, 2020

sebpuetz deleted the lossy branch June 9, 2020 10:23
