
Add download size information to documentation #71

Open · cleong110 opened this issue Jun 3, 2024 · 14 comments
@cleong110 (Contributor)

A la tensorflow/datasets#120, it would be helpful to have an estimate of how large each dataset is before downloading. Ideally, a breakdown by feature would be nice.

Currently taking a crack at the following:

  • on a machine that has a very large hard drive, try downloading everything in the example notebook
  • run the builder "size in bytes" function from the tfds issue above (see the sketch after this list).
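
For reference, reading those sizes off a builder might look like the following — a minimal sketch, assuming the dataset has already been downloaded and prepared (autsl here is just an example name):

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the SL datasets with tfds

builder = tfds.builder("autsl")  # example; any registered dataset works
builder.download_and_prepare()   # sizes are only populated after this step

print(f"download size: {builder.info.download_size}")
print(f"dataset size:  {builder.info.dataset_size}")
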
@cleong110 commented Jun 12, 2024

What I'm trying:

  1. clone the tensorflow_datasets repo
  2. activate an environment with sign_language_datasets installed
  3. run the documentation scripts.

@cleong110 (Contributor Author)

Had to pip install pyyaml and pandas, then ran build_catalog.py, which complained about a missing "stable_versions.txt".

That seems to come from https://github.com/tensorflow/datasets/blob/8e64e46efe1fe2bc9488dbf266a4a5400c422c42/tensorflow_datasets/scripts/freeze_dataset_versions.py

@cleong110 (Contributor Author)

When I run THAT, it writes 5812 dataset versions to a file in my conda env:

/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/stable_versions.txt

@cleong110 commented Jun 12, 2024

Of course I want it to also register the sign language datasets, right? So I edited the script to import them as well, like so:

from absl import app

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # added: registers the SL datasets with tfds


def main(_):
  tfds.core.visibility.set_availables([
      tfds.core.visibility.DatasetType.TFDS_PUBLIC,
  ])

  registered_names = tfds.core.load.list_full_names()
  version_path = tfds.core.utils.tfds_write_path() / 'stable_versions.txt'
  version_path.write_text('\n'.join(registered_names))
  print(f'{len(registered_names)} datasets versions written to {version_path}.')


if __name__ == '__main__':
  app.run(main)

When I run it THEN, it writes 5858 dataset versions instead. Opening up stable_versions.txt, I see a few SL datasets, including autsl.
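
A quick sanity check (hypothetical snippet, reusing version_path from the script above):

print([n for n in version_path.read_text().splitlines() if n.startswith('autsl')])
# e.g. ['autsl/default/1.0.0', 'autsl/holistic/1.0.0', 'autsl/openpose/1.0.0']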

@cleong110 commented Jun 12, 2024

tfds_stable_versions_no_sl.txt
tfds_stable_versions_sl.txt
The two different versions of the .txt file, copied and renamed.

apparently the comm utility lets you find diffs easily

output:

comm -23 tfds_stable_versions_sl.txt tfds_stable_versions_no_sl.txt > sl_stable_versions.txt
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
comm: input is not in sorted order

OK, let's sort then.
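
(As an aside, the sorting requirement could be sidestepped with a set difference in Python — a minimal sketch using the filenames from above:)

sl = set(open('tfds_stable_versions_sl.txt').read().splitlines())
no_sl = set(open('tfds_stable_versions_no_sl.txt').read().splitlines())
with open('sl_stable_versions.txt', 'w') as f:
    f.write('\n'.join(sorted(sl - no_sl)))  # lines only in the SL run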

@cleong110 (Contributor Author)

List the filenames, pipe to GNU parallel (yes, I will cite it, don't worry), and sort each file, writing the output to *_sorted.txt:

ls tfds_stable_versions* | parallel sort --output {.}_sorted.txt {}

@cleong110 commented Jun 12, 2024

NOW:

comm -23 tfds_stable_versions_sl_sorted.txt tfds_stable_versions_no_sl_sorted.txt > tfds_stable_versions_sl_only.txt

Which gives us:

asl_citizen/default/1.0.0
aslg_pc12/0.0.1
asl_lex/annotations/2.0.0
asl_lex/default/2.0.0
asl_signs/default/1.0.0
autsl/default/1.0.0
autsl/holistic/1.0.0
autsl/openpose/1.0.0
bsl_corpus/annotations/1.0.0
bsl_corpus/default/1.0.0
chicago_fs_wild/default/2.0.0
dgs_corpus/annotations/3.0.0
dgs_corpus/default/3.0.0
dgs_corpus/holistic/3.0.0
dgs_corpus/openpose/3.0.0
dgs_corpus/sentences/3.0.0
dgs_corpus/videos/3.0.0
dgs_types/annotations/3.0.0
dgs_types/default/3.0.0
dgs_types/holistic/3.0.0
dicta_sign/annotations/1.0.0
dicta_sign/default/1.0.0
dicta_sign/poses/1.0.0
how2_sign/default/1.0.0
mediapi_skel/default/1.0.0
ngt_corpus/annotations/3.0.0
ngt_corpus/default/3.0.0
ngt_corpus/videos/3.0.0
rwth_phoenix2014_t/annotations/3.0.0
rwth_phoenix2014_t/default/3.0.0
rwth_phoenix2014_t/poses/3.0.0
rwth_phoenix2014_t/videos/3.0.0
sem_lex/default/1.0.0
sign2_mint/annotations/1.0.0
sign2_mint/default/1.0.0
sign_bank/default/1.0.0
sign_suisse/default/1.0.0
sign_suisse/holistic/1.0.0
sign_typ/default/1.0.0
sign_wordnet/default/0.2.0
spread_the_sign/default/1.0.0
swojs_glossario/annotations/1.0.0
swojs_glossario/default/1.0.0
wlasl/default/0.3.0
wmtslt/annotations/1.2.0
wmtslt/default/1.2.0

@cleong110 (Contributor Author)

I'm just gonna overwrite the stable_versions.txt with that...

cat tfds_stable_versions_sl_only.txt > /home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/stable_versions.txt

@cleong110 (Contributor Author)

Sigh: [screenshot: assertion error traceback]

Offending assertion: [screenshot: the failing assert in the tfds source]

@cleong110 commented Jun 12, 2024

Note also that it's using the document_datasets.py in the site-packages, not in the cloned repo.
/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/scripts/documentation/document_datasets.py

Just gonna comment that bit out and try again...

FileNotFoundError: Error for asl_citizen: [Errno 2] No such file or directory: '/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/scripts/documentation/tfds_to_pwc_links.json'

Digging into the code, that file is the "# Filepath for mapping between TFDS datasets and PapersWithCode entries."

OK, so dataset_markdown_builder has a bunch of sections we don't care about. What if we comment those out?


@cleong110 (Contributor Author)

Still no luck. Getting weird auth token errors. Tried a few datasets.


I give up. This seems like a dead end.

@cleong110 (Contributor Author)

Set up a script to simply loop through the available datasets and tfds.load every builder config. Then I can read download and dataset size from the returned ds_info (a sketch of the loop is below).

DGS Corpus is the one holdout, because the download process crashes very consistently. Even when passing it process_video=False, I have not figured out any way to download any config other than "annotations". Spent two hours trying. And tfds has no method to download only, without preparing.

Who decided that download_and_prepare was a good idea for a function? Functions should do one thing!
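
The loop is roughly the following — a minimal sketch; the dataset list is an illustrative subset, and the real script's bookkeeping (the defaultdict seen in the output below) is simplified to a plain dict:

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the SL datasets

results = {}
for name in ['autsl', 'dicta_sign', 'ngt_corpus']:  # illustrative subset
    for config in tfds.builder_cls(name).BUILDER_CONFIGS:
        key = f'{name}/{config.name}'
        try:
            # no download-only mode: this downloads AND prepares
            ds, ds_info = tfds.load(key, with_info=True)
            results[key] = {'download_size': ds_info.download_size,
                            'dataset_size': ds_info.dataset_size}
        except Exception as e:  # record the failure and move on
            results[key] = {'download_result': e}
print(results)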

@cleong110 (Contributor Author)

Managed to download many of the datasets and check the sizes, or log the error:

{
  'AUTSL/default': {'download_size': 22.66 GiB, 'dataset_size': 577.97 GiB},
  'AUTSL/holistic': {'download_size': 13.80 GiB, 'dataset_size': 22.40 GiB},
  'AUTSL/openpose': {'download_size': 1.03 GiB, 'dataset_size': 3.35 GiB},
  'AslLex/default': {'download_size': 1.92 MiB, 'dataset_size': 14.79 MiB},
  'AslLex/annotations': {'download_size': 1.92 MiB, 'dataset_size': 14.79 MiB},
  'ChicagoFSWild/default': {'download_result': ExtractError('Error while extracting /media/vlab/storage/data/tfds/downloads/dl.ttic.edu_ChicagoFSWildPlusqUnEJhkfLO-xyo77-A39kON0j8HEXykw-6Nwasi2iPY.tgz to /media/vlab/storage/data/tfds/downloads/extracted/TAR_GZ.dl.ttic.edu_ChicagoFSWildPlusqUnEJhkfLO-xyo77-A39kON0j8HEXykw-6Nwasi2iPY.tgz: ')},
  'DgsTypes/default': {'download_result': DownloadError('Failed to get url https://www.sign-lang.uni-hamburg.de/korpusdict/clips/3252569_1.mp4. HTTP code: 404.')},
  'DgsTypes/holistic': {'download_result': TypeError("Error while serializing feature `views/pose/data`: `TensorInfo(shape=(None, None, 1, 576, 3), dtype=float32)`: 'NoneType' object cannot be interpreted as an integer")},
  'DgsTypes/annotations': {'download_size': 336.11 MiB, 'dataset_size': 1.72 MiB},
  'DictaSign/default': {'download_size': 2.58 GiB, 'dataset_size': 3.34 GiB},
  'DictaSign/poses': {'download_size': 2.18 GiB, 'dataset_size': 3.34 GiB},
  'DictaSign/annotations': {'download_size': 7.09 MiB, 'dataset_size': 1.15 MiB},
  'How2Sign/default': {'download_result': DownloadError('Failed to get url https://drive.usercontent.google.com/download?id=1dYey1F_SeHets-UO8F9cE3VMhRBO-6e0&export=download. HTTP code: 404.')},
  'NGTCorpus/default': {'download_size': 185.58 GiB, 'dataset_size': 1.43 MiB},
  'NGTCorpus/videos': {'download_size': 185.58 GiB, 'dataset_size': 1.43 MiB},
  'NGTCorpus/annotations': {'download_size': 76.40 MiB, 'dataset_size': 389.65 KiB},
  'RWTHPhoenix2014T/default': {'download_result': SSLError(MaxRetryError("HTTPSConnectionPool(host='www-i6.informatik.rwth-aachen.de', port=443): Max retries exceeded with url: /ftp/pub/rwth-phoenix/2016/phoenix-2014-T.v3.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))"))},
  'RWTHPhoenix2014T/videos': {'download_result': SSLError(MaxRetryError("HTTPSConnectionPool(host='www-i6.informatik.rwth-aachen.de', port=443): Max retries exceeded with url: /ftp/pub/rwth-phoenix/2016/phoenix-2014-T.v3.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))"))},
  'RWTHPhoenix2014T/poses': {'download_size': 5.14 GiB, 'dataset_size': 7.67 GiB},
  'RWTHPhoenix2014T/annotations': {'download_size': 806.71 KiB, 'dataset_size': 1.90 MiB},
  'Sign2MINT/default': {'download_result': JSONDecodeError('Expecting value: line 1 column 1 (char 0)')},
  'Sign2MINT/annotations': {'download_result': JSONDecodeError('Expecting value: line 1 column 1 (char 0)')},
  'SignBank/default': {'download_size': 113.86 MiB, 'dataset_size': 140.10 MiB},
  'SignSuisse/default': {'download_size': 2.77 MiB, 'dataset_size': 4.97 MiB},
  'SignSuisse/holistic': {'download_size': 33.57 GiB, 'dataset_size': 9.96 GiB},
  'SignTyp/default': {'download_result': ConnectionError(MaxRetryError('HTTPSConnectionPool(host=\'signtyp.uconn.edu\', port=443): Max retries exceeded with url: /signpuddle/export.php (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7c2520789650>: Failed to resolve \'signtyp.uconn.edu\' ([Errno -2] Name or service not known)"))'))},
  'SignWordnet/default': {'download_result': ImportError('Please install nltk with: pip install nltk')},
  'SwojsGlossario/default': {'download_size': 352.28 KiB, 'dataset_size': 79.99 KiB},
  'SwojsGlossario/annotations': {'download_size': 352.28 KiB, 'dataset_size': 79.99 KiB},
  'Wlasl/default': {'download_result': Exception('die')},
  'asl_lex/default': {'download_size': 1.92 MiB, 'dataset_size': 14.79 MiB},
  'asl_lex/annotations': {'download_size': 1.92 MiB, 'dataset_size': 14.79 MiB},
  'autsl/default': {'download_size': 22.66 GiB, 'dataset_size': 577.97 GiB},
  'autsl/holistic': {'download_size': 13.80 GiB, 'dataset_size': 22.40 GiB},
  'autsl/openpose': {'download_size': 1.03 GiB, 'dataset_size': 3.35 GiB},
  'dgs_corpus/default': {'download_result': 'DGS CORPUS IS GARBAGE'},
  'dgs_corpus/videos': {'download_result': 'DGS CORPUS IS GARBAGE'},
  'dgs_corpus/openpose': {'download_size': 46.23 GiB, 'dataset_size': 27.56 GiB},
  'dgs_corpus/holistic': {'download_result': 'DGS CORPUS IS GARBAGE'},
  'dgs_corpus/annotations': {'download_result': 'DGS CORPUS IS GARBAGE'},
  'dgs_corpus/sentences': {'download_result': 'DGS CORPUS IS GARBAGE'},
  'dgs_types/default': {'download_result': DownloadError('Failed to get url https://www.sign-lang.uni-hamburg.de/korpusdict/clips/3252569_1.mp4. HTTP code: 404.')},
  'dgs_types/holistic': {'download_result': TypeError("Error while serializing feature `views/pose/data`: `TensorInfo(shape=(None, None, 1, 576, 3), dtype=float32)`: 'NoneType' object cannot be interpreted as an integer")},
  'dgs_types/annotations': {'download_size': 336.11 MiB, 'dataset_size': 1.72 MiB},
  'dicta_sign/default': {'download_size': 2.58 GiB, 'dataset_size': 3.34 GiB},
  'dicta_sign/poses': {'download_size': 2.18 GiB, 'dataset_size': 3.34 GiB},
  'dicta_sign/annotations': {'download_size': 7.09 MiB, 'dataset_size': 1.15 MiB},
  'ngt_corpus/default': {'download_size': 185.58 GiB, 'dataset_size': 1.43 MiB},
  'ngt_corpus/videos': {'download_size': 185.58 GiB, 'dataset_size': 1.43 MiB},
  'ngt_corpus/annotations': {'download_size': 76.40 MiB, 'dataset_size': 389.65 KiB},
  'rwth_phoenix2014_t/default': {'download_result': SSLError(MaxRetryError("HTTPSConnectionPool(host='www-i6.informatik.rwth-aachen.de', port=443): Max retries exceeded with url: /ftp/pub/rwth-phoenix/2016/phoenix-2014-T.v3.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))"))},
  'rwth_phoenix2014_t/videos': {'download_result': SSLError(MaxRetryError("HTTPSConnectionPool(host='www-i6.informatik.rwth-aachen.de', port=443): Max retries exceeded with url: /ftp/pub/rwth-phoenix/2016/phoenix-2014-T.v3.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))"))},
  'rwth_phoenix2014_t/poses': {'download_size': 5.14 GiB, 'dataset_size': 7.67 GiB},
  'rwth_phoenix2014_t/annotations': {'download_size': 806.71 KiB, 'dataset_size': 1.90 MiB},
  'sign_wordnet/default': {'download_result': ImportError('Please install nltk with: pip install nltk')},
  'swojs_glossario/default': {'download_size': 352.28 KiB, 'dataset_size': 79.99 KiB},
  'swojs_glossario/annotations': {'download_size': 352.28 KiB, 'dataset_size': 79.99 KiB},
  'wlasl/default': {'download_result': Exception('die')}
}
