
Add download size information to documentation #71

Open · cleong110 opened this issue Jun 3, 2024 · 14 comments
@cleong110 (Contributor)

A la tensorflow/datasets#120, it would be helpful to have an estimate of how large each dataset is before downloading. Ideally, a breakdown by feature would be nice.

Currently taking a crack at the following:

  • on a machine that has a very large hard drive, try downloading everything in the example notebook
  • run the builder "size in bytes" function from the tfds issue above (see the sketch after this list).
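
For reference, reading those sizes off a builder might look like the following — a minimal sketch, assuming the dataset has already been downloaded and prepared (autsl here is just an example name):

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the SL datasets with tfds

builder = tfds.builder("autsl")  # example; any registered dataset works
builder.download_and_prepare()   # sizes are only populated after this step

print(f"download size: {builder.info.download_size}")
print(f"dataset size:  {builder.info.dataset_size}")
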
@cleong110 commented Jun 12, 2024

What I'm trying:

  1. clone the tensorflow_datasets repo
  2. activate an environment with sign_language_datasets installed
  3. run the documentation scripts.

@cleong110 (Contributor Author)

Had to pip install pyyaml and pandas, then ran build_catalog.py, which complained about a missing "stable_versions.txt".

That seems to come from https://github.com/tensorflow/datasets/blob/8e64e46efe1fe2bc9488dbf266a4a5400c422c42/tensorflow_datasets/scripts/freeze_dataset_versions.py

@cleong110 (Contributor Author)

When I run THAT, it writes 5812 dataset versions to a file in my conda env:

/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/stable_versions.txt

@cleong110 commented Jun 12, 2024

Of course I want it to also register the sign language datasets, right? So I edited the script to import them as well, like so:

from absl import app

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # added: registers the SL datasets with tfds


def main(_):
  tfds.core.visibility.set_availables([
      tfds.core.visibility.DatasetType.TFDS_PUBLIC,
  ])

  registered_names = tfds.core.load.list_full_names()
  version_path = tfds.core.utils.tfds_write_path() / 'stable_versions.txt'
  version_path.write_text('\n'.join(registered_names))
  print(f'{len(registered_names)} datasets versions written to {version_path}.')


if __name__ == '__main__':
  app.run(main)

When I run it THEN, it writes 5858 dataset versions instead. Opening up stable_versions.txt, I see a few SL datasets, including autsl.
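
A quick sanity check (hypothetical snippet, reusing version_path from the script above):

print([n for n in version_path.read_text().splitlines() if n.startswith('autsl')])
# e.g. ['autsl/default/1.0.0', 'autsl/holistic/1.0.0', 'autsl/openpose/1.0.0']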

@cleong110 commented Jun 12, 2024

tfds_stable_versions_no_sl.txt
tfds_stable_versions_sl.txt
The two different versions of the .txt file, copied and renamed.

apparently the comm utility lets you find diffs easily

output:

comm -23 tfds_stable_versions_sl.txt tfds_stable_versions_no_sl.txt > sl_stable_versions.txt
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
comm: input is not in sorted order

OK, let's sort then.
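
(As an aside, the sorting requirement could be sidestepped with a set difference in Python — a minimal sketch using the filenames from above:)

sl = set(open('tfds_stable_versions_sl.txt').read().splitlines())
no_sl = set(open('tfds_stable_versions_no_sl.txt').read().splitlines())
with open('sl_stable_versions.txt', 'w') as f:
    f.write('\n'.join(sorted(sl - no_sl)))  # lines only in the SL run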

@cleong110 (Contributor Author)

List the filenames, pipe to GNU parallel (yes, I will cite it, don't worry), and sort each file, writing the output to *_sorted.txt:

ls tfds_stable_versions* | parallel sort --output {.}_sorted.txt {}

@cleong110 commented Jun 12, 2024

NOW:

comm -23 tfds_stable_versions_sl_sorted.txt tfds_stable_versions_no_sl_sorted.txt > tfds_stable_versions_sl_only.txt

Which gives us:

asl_citizen/default/1.0.0
aslg_pc12/0.0.1
asl_lex/annotations/2.0.0
asl_lex/default/2.0.0
asl_signs/default/1.0.0
autsl/default/1.0.0
autsl/holistic/1.0.0
autsl/openpose/1.0.0
bsl_corpus/annotations/1.0.0
bsl_corpus/default/1.0.0
chicago_fs_wild/default/2.0.0
dgs_corpus/annotations/3.0.0
dgs_corpus/default/3.0.0
dgs_corpus/holistic/3.0.0
dgs_corpus/openpose/3.0.0
dgs_corpus/sentences/3.0.0
dgs_corpus/videos/3.0.0
dgs_types/annotations/3.0.0
dgs_types/default/3.0.0
dgs_types/holistic/3.0.0
dicta_sign/annotations/1.0.0
dicta_sign/default/1.0.0
dicta_sign/poses/1.0.0
how2_sign/default/1.0.0
mediapi_skel/default/1.0.0
ngt_corpus/annotations/3.0.0
ngt_corpus/default/3.0.0
ngt_corpus/videos/3.0.0
rwth_phoenix2014_t/annotations/3.0.0
rwth_phoenix2014_t/default/3.0.0
rwth_phoenix2014_t/poses/3.0.0
rwth_phoenix2014_t/videos/3.0.0
sem_lex/default/1.0.0
sign2_mint/annotations/1.0.0
sign2_mint/default/1.0.0
sign_bank/default/1.0.0
sign_suisse/default/1.0.0
sign_suisse/holistic/1.0.0
sign_typ/default/1.0.0
sign_wordnet/default/0.2.0
spread_the_sign/default/1.0.0
swojs_glossario/annotations/1.0.0
swojs_glossario/default/1.0.0
wlasl/default/0.3.0
wmtslt/annotations/1.2.0
wmtslt/default/1.2.0

@cleong110 (Contributor Author)

I'm just gonna overwrite the stable_versions.txt with that...

cat tfds_stable_versions_sl_only.txt > /home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/stable_versions.txt

@cleong110 (Contributor Author)

Sigh: [screenshot: assertion error traceback]

Offending assertion: [screenshot: the failing assert in the tfds source]

@cleong110 commented Jun 12, 2024

Note also that it's using the document_datasets.py in the site-packages, not in the cloned repo.
/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/scripts/documentation/document_datasets.py

Just gonna comment that bit out and try again...

FileNotFoundError: Error for asl_citizen: [Errno 2] No such file or directory: '/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/scripts/documentation/tfds_to_pwc_links.json'

Digging into the code, that file is the "# Filepath for mapping between TFDS datasets and PapersWithCode entries."

OK, so dataset_markdown_builder has a bunch of sections we don't care about. What if we comment those out?


@cleong110 (Contributor Author)

Still no luck. Getting weird auth token errors. Tried a few datasets.


I give up. This seems like a dead end.

@cleong110 (Contributor Author)

Set up a script to simply loop through the available datasets and tfds.load every builder config. Then I can read download and dataset size from the returned ds_info (a sketch of the loop is below).

DGS Corpus is the one holdout, because the download process crashes very consistently. Even when passing it process_video=False, I have not figured out any way to download any config other than "annotations". Spent two hours trying. And tfds has no method to download only, without preparing.

Who decided that download_and_prepare was a good idea for a function? Functions should do one thing!
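
The loop is roughly the following — a minimal sketch; the dataset list is an illustrative subset, and the real script's bookkeeping (the defaultdict seen in the output below) is simplified to a plain dict:

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the SL datasets

results = {}
for name in ['autsl', 'dicta_sign', 'ngt_corpus']:  # illustrative subset
    for config in tfds.builder_cls(name).BUILDER_CONFIGS:
        key = f'{name}/{config.name}'
        try:
            # no download-only mode: this downloads AND prepares
            ds, ds_info = tfds.load(key, with_info=True)
            results[key] = {'download_size': ds_info.download_size,
                            'dataset_size': ds_info.dataset_size}
        except Exception as e:  # record the failure and move on
            results[key] = {'download_result': e}
print(results)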

@cleong110 (Contributor Author)

Managed to download many of the datasets and check the sizes, or log the error:

{
  'AUTSL/default': {'download_size': 22.66 GiB, 'dataset_size': 577.97 GiB},
  'AUTSL/holistic': {'download_size': 13.80 GiB, 'dataset_size': 22.40 GiB},
  'AUTSL/openpose': {'download_size': 1.03 GiB, 'dataset_size': 3.35 GiB},
  'AslLex/default': {'download_size': 1.92 MiB, 'dataset_size': 14.79 MiB},
  'AslLex/annotations': {'download_size': 1.92 MiB, 'dataset_size': 14.79 MiB},
  'ChicagoFSWild/default': {'download_result': ExtractError('Error while extracting /media/vlab/storage/data/tfds/downloads/dl.ttic.edu_ChicagoFSWildPlusqUnEJhkfLO-xyo77-A39kON0j8HEXykw-6Nwasi2iPY.tgz to /media/vlab/storage/data/tfds/downloads/extracted/TAR_GZ.dl.ttic.edu_ChicagoFSWildPlusqUnEJhkfLO-xyo77-A39kON0j8HEXykw-6Nwasi2iPY.tgz: ')},
  'DgsTypes/default': {'download_result': DownloadError('Failed to get url https://www.sign-lang.uni-hamburg.de/korpusdict/clips/3252569_1.mp4. HTTP code: 404.')},
  'DgsTypes/holistic': {'download_result': TypeError("Error while serializing feature `views/pose/data`: `TensorInfo(shape=(None, None, 1, 576, 3), dtype=float32)`: 'NoneType' object cannot be interpreted as an integer")},
  'DgsTypes/annotations': {'download_size': 336.11 MiB, 'dataset_size': 1.72 MiB},
  'DictaSign/default': {'download_size': 2.58 GiB, 'dataset_size': 3.34 GiB},
  'DictaSign/poses': {'download_size': 2.18 GiB, 'dataset_size': 3.34 GiB},
  'DictaSign/annotations': {'download_size': 7.09 MiB, 'dataset_size': 1.15 MiB},
  'How2Sign/default': {'download_result': DownloadError('Failed to get url https://drive.usercontent.google.com/download?id=1dYey1F_SeHets-UO8F9cE3VMhRBO-6e0&export=download. HTTP code: 404.')},
  'NGTCorpus/default': {'download_size': 185.58 GiB, 'dataset_size': 1.43 MiB},
  'NGTCorpus/videos': {'download_size': 185.58 GiB, 'dataset_size': 1.43 MiB},
  'NGTCorpus/annotations': {'download_size': 76.40 MiB, 'dataset_size': 389.65 KiB},
  'RWTHPhoenix2014T/default': {'download_result': SSLError(MaxRetryError("HTTPSConnectionPool(host='www-i6.informatik.rwth-aachen.de', port=443): Max retries exceeded with url: /ftp/pub/rwth-phoenix/2016/phoenix-2014-T.v3.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))"))},
  'RWTHPhoenix2014T/videos': {'download_result': SSLError(MaxRetryError("HTTPSConnectionPool(host='www-i6.informatik.rwth-aachen.de', port=443): Max retries exceeded with url: /ftp/pub/rwth-phoenix/2016/phoenix-2014-T.v3.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))"))},
  'RWTHPhoenix2014T/poses': {'download_size': 5.14 GiB, 'dataset_size': 7.67 GiB},
  'RWTHPhoenix2014T/annotations': {'download_size': 806.71 KiB, 'dataset_size': 1.90 MiB},
  'Sign2MINT/default': {'download_result': JSONDecodeError('Expecting value: line 1 column 1 (char 0)')},
  'Sign2MINT/annotations': {'download_result': JSONDecodeError('Expecting value: line 1 column 1 (char 0)')},
  'SignBank/default': {'download_size': 113.86 MiB, 'dataset_size': 140.10 MiB},
  'SignSuisse/default': {'download_size': 2.77 MiB, 'dataset_size': 4.97 MiB},
  'SignSuisse/holistic': {'download_size': 33.57 GiB, 'dataset_size': 9.96 GiB},
  'SignTyp/default': {'download_result': ConnectionError(MaxRetryError('HTTPSConnectionPool(host=\'signtyp.uconn.edu\', port=443): Max retries exceeded with url: /signpuddle/export.php (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7c2520789650>: Failed to resolve \'signtyp.uconn.edu\' ([Errno -2] Name or service not known)"))'))},
  'SignWordnet/default': {'download_result': ImportError('Please install nltk with: pip install nltk')},
  'SwojsGlossario/default': {'download_size': 352.28 KiB, 'dataset_size': 79.99 KiB},
  'SwojsGlossario/annotations': {'download_size': 352.28 KiB, 'dataset_size': 79.99 KiB},
  'Wlasl/default': {'download_result': Exception('die')},
  'asl_lex/default': {'download_size': 1.92 MiB, 'dataset_size': 14.79 MiB},
  'asl_lex/annotations': {'download_size': 1.92 MiB, 'dataset_size': 14.79 MiB},
  'autsl/default': {'download_size': 22.66 GiB, 'dataset_size': 577.97 GiB},
  'autsl/holistic': {'download_size': 13.80 GiB, 'dataset_size': 22.40 GiB},
  'autsl/openpose': {'download_size': 1.03 GiB, 'dataset_size': 3.35 GiB},
  'dgs_corpus/default': {'download_result': 'DGS CORPUS IS GARBAGE'},
  'dgs_corpus/videos': {'download_result': 'DGS CORPUS IS GARBAGE'},
  'dgs_corpus/openpose': {'download_size': 46.23 GiB, 'dataset_size': 27.56 GiB},
  'dgs_corpus/holistic': {'download_result': 'DGS CORPUS IS GARBAGE'},
  'dgs_corpus/annotations': {'download_result': 'DGS CORPUS IS GARBAGE'},
  'dgs_corpus/sentences': {'download_result': 'DGS CORPUS IS GARBAGE'},
  'dgs_types/default': {'download_result': DownloadError('Failed to get url https://www.sign-lang.uni-hamburg.de/korpusdict/clips/3252569_1.mp4. HTTP code: 404.')},
  'dgs_types/holistic': {'download_result': TypeError("Error while serializing feature `views/pose/data`: `TensorInfo(shape=(None, None, 1, 576, 3), dtype=float32)`: 'NoneType' object cannot be interpreted as an integer")},
  'dgs_types/annotations': {'download_size': 336.11 MiB, 'dataset_size': 1.72 MiB},
  'dicta_sign/default': {'download_size': 2.58 GiB, 'dataset_size': 3.34 GiB},
  'dicta_sign/poses': {'download_size': 2.18 GiB, 'dataset_size': 3.34 GiB},
  'dicta_sign/annotations': {'download_size': 7.09 MiB, 'dataset_size': 1.15 MiB},
  'ngt_corpus/default': {'download_size': 185.58 GiB, 'dataset_size': 1.43 MiB},
  'ngt_corpus/videos': {'download_size': 185.58 GiB, 'dataset_size': 1.43 MiB},
  'ngt_corpus/annotations': {'download_size': 76.40 MiB, 'dataset_size': 389.65 KiB},
  'rwth_phoenix2014_t/default': {'download_result': SSLError(MaxRetryError("HTTPSConnectionPool(host='www-i6.informatik.rwth-aachen.de', port=443): Max retries exceeded with url: /ftp/pub/rwth-phoenix/2016/phoenix-2014-T.v3.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))"))},
  'rwth_phoenix2014_t/videos': {'download_result': SSLError(MaxRetryError("HTTPSConnectionPool(host='www-i6.informatik.rwth-aachen.de', port=443): Max retries exceeded with url: /ftp/pub/rwth-phoenix/2016/phoenix-2014-T.v3.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))"))},
  'rwth_phoenix2014_t/poses': {'download_size': 5.14 GiB, 'dataset_size': 7.67 GiB},
  'rwth_phoenix2014_t/annotations': {'download_size': 806.71 KiB, 'dataset_size': 1.90 MiB},
  'sign_wordnet/default': {'download_result': ImportError('Please install nltk with: pip install nltk')},
  'swojs_glossario/default': {'download_size': 352.28 KiB, 'dataset_size': 79.99 KiB},
  'swojs_glossario/annotations': {'download_size': 352.28 KiB, 'dataset_size': 79.99 KiB},
  'wlasl/default': {'download_result': Exception('die')}
}
