Add download size information to documentation #71
Digging a bit deeper, tfds seems to have a script for generating documentation, document_datasets.py (https://github.com/tensorflow/datasets/blob/8e64e46efe1fe2bc9488dbf266a4a5400c422c42/tensorflow_datasets/scripts/documentation/document_datasets.py), which is used by another script, build_catalog.py (https://github.com/tensorflow/datasets/blob/8e64e46efe1fe2bc9488dbf266a4a5400c422c42/tensorflow_datasets/scripts/documentation/build_catalog.py).
What I'm trying:
Had to pip install pyyaml and pandas, then ran build_catalog.py, and it complained about a missing "stable_versions.txt". That file seems to come from https://github.com/tensorflow/datasets/blob/8e64e46efe1fe2bc9488dbf266a4a5400c422c42/tensorflow_datasets/scripts/freeze_dataset_versions.py
When I run THAT, it outputs 5812 dataset versions to a file in my conda env.
So of course I want it to also register the sign language datasets, right? So I edited the file to import that as well, like so:

```python
from absl import app
import tensorflow_datasets as tfds

# Importing the package registers the sign language dataset builders with tfds.
import sign_language_datasets.datasets


def main(_):
    tfds.core.visibility.set_availables([
        tfds.core.visibility.DatasetType.TFDS_PUBLIC,
    ])
    registered_names = tfds.core.load.list_full_names()
    version_path = tfds.core.utils.tfds_write_path() / 'stable_versions.txt'
    version_path.write_text('\n'.join(registered_names))
    print(f'{len(registered_names)} datasets versions written to {version_path}.')


if __name__ == '__main__':
    app.run(main)
```

When I run it THEN, it writes 5858 dataset versions instead. Opening up stable_versions, I see a few SL datasets including autsl.
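For a quick sanity check of which sign language entries made it in, the registered names can also be filtered directly instead of opening the file. A minimal sketch (the prefix list is a guess on my part, based on the datasets mentioned in this thread):

```python
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  # registers the extra builders

registered_names = tfds.core.load.list_full_names()

# Guessed prefixes for the sign language datasets mentioned here.
SL_PREFIXES = ('autsl', 'dgs_corpus')

sl_versions = [name for name in registered_names if name.startswith(SL_PREFIXES)]
print(f'{len(registered_names)} total, {len(sl_versions)} matching the guessed prefixes:')
print('\n'.join(sl_versions))
```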
tfds_stable_versions_no_sl.txt is apparently the output:
OK, let's sort then.
list filenames, pipe to
NOW:
Which gives us
Which, I'm just gonna overwrite the stable_versions.txt with that...
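For reference, a rough sketch of that diff-and-overwrite step in Python (assuming both files sit in the working directory and are named as above; the actual step was done by hand with shell tools):

```python
from pathlib import Path

# File names from this thread: the stock tfds output (no sign language
# datasets) and the output of the patched freeze script (with them).
baseline = set(Path('tfds_stable_versions_no_sl.txt').read_text().splitlines())
with_sl = set(Path('stable_versions.txt').read_text().splitlines())

# The difference should be exactly the sign language dataset versions.
sl_only = sorted(with_sl - baseline)
print(f'{len(sl_only)} sign language dataset versions:')
print('\n'.join(sl_only))

# Overwrite stable_versions.txt with a sorted copy, mirroring the manual step.
Path('stable_versions.txt').write_text('\n'.join(sorted(with_sl)))
```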
Note also that it's using the document_datasets.py from site-packages, not from the cloned repo. Just gonna comment that bit out and try again... Digging into the code, it's looking for the "# Filepath for mapping between TFDS datasets and PapersWithCode entries." OK, so dataset_markdown_builder has a bunch of sections we don't care about. What if we comment those out?
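One way to avoid picking up the site-packages copy at all (a sketch; the checkout path is hypothetical) is to put the cloned repo first on sys.path before anything imports tensorflow_datasets:

```python
import sys

# Hypothetical location of the cloned tensorflow/datasets checkout.
CLONE = '/path/to/datasets'

# Prepend it so that `tensorflow_datasets` (and the documentation scripts
# under tensorflow_datasets/scripts/documentation) resolve to the clone
# rather than to the copy installed in site-packages.
sys.path.insert(0, CLONE)

import tensorflow_datasets as tfds

print(tfds.__file__)  # should now point inside the clone
```

Running `pip install -e .` from inside the clone would achieve the same thing more permanently.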
Set up a script to simply loop through the available datasets; DGS Corpus is the one holdout, because the download process crashes very consistently, even when passing it
Who decided that
Managed to download many of the datasets and check the sizes, or log the error.
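A minimal sketch of that loop (the dataset names are placeholders, only the ones mentioned in this thread; the real script would presumably iterate over everything that is registered):

```python
import traceback

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  # registers the extra builders

# Placeholder names; swap in the full registered list in practice.
DATASETS = ['autsl', 'dgs_corpus']

for name in DATASETS:
    try:
        builder = tfds.builder(name)
        builder.download_and_prepare()
        info = builder.info
        print(f'{name}: download_size={info.download_size}, dataset_size={info.dataset_size}')
    except Exception:
        # DGS Corpus reportedly crashes here very consistently.
        print(f'{name}: FAILED')
        traceback.print_exc()
```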
A la tensorflow/datasets#120, it would be helpful to have an estimate of how large each dataset is before downloading. Ideally, a breakdown by feature would be nice.
Currently taking a crack at the following: