
UnicodeDecodeError: Invalid continuation byte, when running pytest on new dataset created with tfds new on Windows #57

Open
cleong110 opened this issue Mar 1, 2024 · 4 comments


@cleong110 (Contributor)

Followed steps from #56 to get pytest running, and then used instructions from https://tensorflow.google.cn/datasets/add_dataset?hl=en#test_your_dataset to create a new dataset.

Then I got errors like this:

FAILED sign_language_datasets/datasets/new_dataset/new_dataset_dataset_builder_test.py::NewDatasetTest::test_baseclass - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 4352: invalid continuation byte

[screenshot: pytest failures with UnicodeDecodeError]
with tracebacks going to abstract_path.py

..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\testing\dataset_builder_testing.py:350: in _make_builder
    return self.dataset_class(  # pylint: disable=not-callable
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\logging\__init__.py:288: in decorator
    return function(*args, **kwargs)
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\dataset_builder.py:1336: in __init__
    super().__init__(**kwargs)
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\logging\__init__.py:288: in decorator
    return function(*args, **kwargs)
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\dataset_builder.py:284: in __init__
    self.info.initialize_from_bucket()
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\logging\__init__.py:168: in __call__
    return function(*args, **kwargs)
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\dataset_builder.py:473: in info
    info = self._info()
sign_language_datasets\datasets\new_dataset\new_dataset_dataset_builder.py:17: in _info
    return self.dataset_info_from_configs(
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\dataset_builder.py:1115: in dataset_info_from_configs
    metadata = self.get_metadata()
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\dataset_builder.py:242: in get_metadata
    return dataset_metadata.load(cls._get_pkg_dir_path())
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\dataset_metadata.py:83: in load
    raw_metadata = _read_files(pkg_path)
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\dataset_metadata.py:106: in _read_files
    return utils.tree.parallel_map(
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\utils\tree_utils.py:65: in parallel_map
    raise f.exception()
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\concurrent\futures\thread.py:58: in run
    result = self.fn(*self.args, **self.kwargs)
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\dataset_metadata.py:107: in <lambda>
    lambda f: f.read_text(encoding="utf-8"), name2path
..\..\..\miniconda3\envs\sign_language_datasets_source\lib\site-packages\etils\epath\abstract_path.py:157: in read_text
    return f.read()

I am on Windows, and this may be a Windows-only issue: when I do the same steps on Colab, the error does not occur. https://colab.research.google.com/drive/1X9sem_qFHNHgpRl-IqkHN0Mft8CBCp_O?usp=sharing
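For what it's worth, the failing byte (0xed) is exactly what you get when an accented character like í is saved in a single-byte encoding such as Latin-1 and then read back as UTF-8. A minimal sketch of that failure mode (the sample string is just an illustration, not the actual file contents):

```python
# Sketch of the failure mode, assuming the file contains Latin-1 bytes:
# 'í' is the single byte 0xed in Latin-1, which is not valid on its own in UTF-8.
raw = "Guaraní (Tupian)".encode("latin-1")

reason = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xed opens a three-byte UTF-8 sequence, but the next byte here is a
    # plain space, which is not a valid continuation byte.
    reason = err.reason

print(reason)  # invalid continuation byte
```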

I went to abstract_path.py and manually edited it to dump debug output to a .txt file.

original:

  def read_text(self, encoding: Optional[str] = None) -> str:
    """Reads contents of self as a string."""
    with self.open('r', encoding=encoding) as f:
      return f.read()

edited:

  def read_text(self, encoding: Optional[str] = None) -> str:
    """Reads contents of self as a string."""
    
    with self.open('r', encoding=encoding) as f:
      with open(r"C:\Users\Colin\projects\sign-language\datasets\sign_language_datasets\datasets\jw_sign\wtf.txt", "w") as wtf:
        wtf.write("WTF\n")
        wtf.write(f"self: {self}\n")
        wtf.write(f"f: {f}\n")
      print(f"f is {f}")
      print(f"f is {f}")
      print(f"f is {f}")
      print(f"f is {f}")
      return f.read()

output:

WTF
self: C:\Users\Colin\miniconda3\envs\sign_language_datasets_source\lib\site-packages\tensorflow_datasets\core\valid_tags.txt
f: <_io.TextIOWrapper name='C:\\Users\\Colin\\miniconda3\\envs\\sign_language_datasets_source\\lib\\site-packages\\tensorflow_datasets\\core\\valid_tags.txt' mode='r' encoding='utf-8'>

This finally led me to realize that what it actually wanted me to do, I think, was remove invalid tags?

So I opened up TAGS.txt to have a look, closed it, and then ran pytest again... and got a new error:

[screenshot: a different error]

Apparently opening and closing TAGS.txt made the error go away? I theorize it's something to do with the formatting of the .txt file on Windows.
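Rather than patching read_text() in site-packages, a small helper (hypothetical name, not part of TFDS) could pinpoint the first invalid byte in whatever file the traceback names:

```python
def find_first_bad_byte(path):
    """Return (offset, byte value) of the first invalid UTF-8 byte,
    or None if the file decodes cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return err.start, raw[err.start]
```

Run against the valid_tags.txt path from the debug dump above, this should report the same position 4352 and byte 0xed that the original traceback shows.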

@cleong110 (Contributor, Author)

TAGS.txt
^ Here is a .txt file that DOES cause the error
TAGS.txt
^ Here is one that does NOT

@cleong110 (Contributor, Author)

Looked up how to diff two files in Windows, got this:

(sign_language_datasets_source) C:\Users\Colin\projects\sign-language\datasets>FC C:\Users\Colin\projects\sign-language\datasets\sign_language_datasets\datasets\foo_dataset\TAGS.txt C:\Users\Colin\projects\sign-language\datasets\sign_language_datasets\datasets\new_dataset\TAGS.txt
Comparing files C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\FOO_DATASET\TAGS.txt and C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\NEW_DATASET\TAGS.TXT
***** C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\FOO_DATASET\TAGS.txt
content.language.gl # Contains text in language Galician / gl.
content.language.gn # Contains text in language Guaranφ.
content.language.got # Contains text in language Gothic.
***** C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\NEW_DATASET\TAGS.TXT
content.language.gl # Contains text in language Galician / gl.
content.language.gn # Contains text in language Guaran�.
content.language.got # Contains text in language Gothic.
*****

***** C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\FOO_DATASET\TAGS.txt
content.language.gub # Contains text in language Guajajara.
content.language.gun # Contains text in language Mbyß Guaranφ (Tupian).
content.language.ha # Contains text in language Hausa / ha.
***** C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\NEW_DATASET\TAGS.TXT
content.language.gub # Contains text in language Guajajara.
content.language.gun # Contains text in language Mby� Guaran� (Tupian).
content.language.ha # Contains text in language Hausa / ha.
*****

***** C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\FOO_DATASET\TAGS.txt
content.language.my # Contains text in language Burmese / my.
content.language.myu # Contains text in language Munduruk·.
content.language.myv # Contains text in language Erzya.
content.language.nb # Contains text in language Bokmσl, Norwegian.
content.language.ne # Contains text in language Nepali (macrolanguage) / ne.
***** C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\NEW_DATASET\TAGS.TXT
content.language.my # Contains text in language Burmese / my.
content.language.myu # Contains text in language Munduruk�.
content.language.myv # Contains text in language Erzya.
content.language.nb # Contains text in language Bokm�l, Norwegian.
content.language.ne # Contains text in language Nepali (macrolanguage) / ne.
*****

***** C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\FOO_DATASET\TAGS.txt
content.language.sm # Contains text in language Samoan.
content.language.sme # Contains text in language North Sßmi.
content.language.sms # Contains text in language Skolt Sami.
***** C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\NEW_DATASET\TAGS.TXT
content.language.sm # Contains text in language Samoan.
content.language.sme # Contains text in language North S�mi.
content.language.sms # Contains text in language Skolt Sami.
*****

***** C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\FOO_DATASET\TAGS.txt
content.language.tl # Contains text in language Tagalog / tl.
content.language.tpn # Contains text in language Tupi(nambß).
content.language.tr # Contains text in language Turkish / tr.
***** C:\USERS\COLIN\PROJECTS\SIGN-LANGUAGE\DATASETS\SIGN_LANGUAGE_DATASETS\DATASETS\NEW_DATASET\TAGS.TXT
content.language.tl # Contains text in language Tagalog / tl.
content.language.tpn # Contains text in language Tupi(namb�).
content.language.tr # Contains text in language Turkish / tr.
*****
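The substituted glyphs in the crashing file line up exactly with Latin-1 bytes being rendered through the console's legacy CP437 code page (this is an assumption about the console code page, but it explains every substitution in the diff: í→φ, á→ß, å→σ, ú→·):

```python
# Sketch: how Latin-1 accented characters show up when cmd.exe renders
# the raw bytes with code page 437 (assumed console code page).
for ch in "íáåú":
    byte = ch.encode("latin-1")   # one byte per character in Latin-1
    print(ch, "->", byte.decode("cp437"))
# í -> φ, á -> ß, å -> σ, ú -> ·
```

So "Guaranφ" is really "Guaraní" and "Bokmσl" is really "Bokmål" — i.e. the crashing file is Latin-1, not UTF-8, which is exactly what makes read_text(encoding="utf-8") blow up.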

@cleong110 (Contributor, Author)

foo_dataset is the one that causes crashes; new_dataset is the one that does not. It looks like Windows (or my version of epath) can't handle some of the symbols, but VS Code or whatever editor will just change them when resaving.
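That would also explain why resaving from an editor fixes it: the editor re-encodes the file as UTF-8 on save. The same one-off conversion can be scripted — a sketch assuming the broken file really is Latin-1 (adjust src_encoding if it turns out to be something else):

```python
def reencode_to_utf8(path, src_encoding="latin-1"):
    """Rewrite a text file in place as UTF-8, reading it with src_encoding."""
    # newline="" on both ends disables newline translation, so CRLF vs LF
    # line endings are preserved byte-for-byte.
    with open(path, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```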

@cleong110 (Contributor, Author)

Tried on an Ubuntu machine, no issue at all. Runs fine.
