[Documentation] Display the download size for each dataset. #120

Closed
dynamicwebpaige opened this issue Mar 3, 2019 · 12 comments
Labels
enhancement New feature or request

Comments

@dynamicwebpaige
Contributor

Currently, just by looking at the list of datasets available in TFDS, there is no way to know the size of each dataset prior to downloading. Users may be operating under constrained disk space, and should be informed of the size of the dataset before requesting.

This feature enhancement would detail the download size of each dataset on the markdown file referenced above.

@dynamicwebpaige added the "enhancement" (New feature or request) and "help wanted" labels on Mar 3, 2019
@ChanchalKumarMaji
Contributor

I can work on this issue. Can you please assign it to me?

Also, please tell me what to do. Should I add a new column to the md file?

@ParthS007
Contributor

@dynamicwebpaige I also want to work on this issue.
@ChanchalKumarMaji Can we collaborate on this one?

@ChanchalKumarMaji
Contributor

@ParthS007 Let's collaborate.

@vijaysinghkadam

Hi,
I have never contributed to TensorFlow before and I want to start contributing.
Can I work on this issue?

@Conchylicultor
Member

Some pointers on this:
This should be a trivial change (probably fewer than 10 lines of code) to https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/scripts/document_datasets.py
It would just need to expose the builder.info.size_in_bytes field in the generated markdown.
The download size is automatically computed when downloading the dataset.
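
As a rough illustration of the kind of change described above, here is a minimal sketch that turns a mapping of dataset names to byte counts into a markdown table. The helper names (`dataset_size_row`, `dataset_size_table`) and the example sizes are hypothetical and are not taken from document_datasets.py; in the real script the numbers would presumably come from builder.info.size_in_bytes.

```python
# Hypothetical helpers for illustration; not the code in document_datasets.py.


def dataset_size_row(name, size_in_bytes):
    """Return one markdown table row with a dataset name and its size."""
    return "| `{}` | {} |".format(name, size_in_bytes)


def dataset_size_table(sizes):
    """Build a markdown table from a {dataset_name: size_in_bytes} mapping."""
    lines = ["| Dataset | Download size (bytes) |", "| --- | --- |"]
    for name in sorted(sizes):
        lines.append(dataset_size_row(name, sizes[name]))
    return "\n".join(lines)


# Made-up names and sizes, used only to exercise the helpers.
example_sizes = {"dataset_b": 2048, "dataset_a": 12312312}
print(dataset_size_table(example_sizes))
```

Sorting by name keeps the generated file stable between runs, which makes diffs of the regenerated documentation easier to review.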

@Anupam-tripathi-zz

Anupam-tripathi-zz commented Mar 3, 2019

I would also like to contribute to this issue, warning users by displaying the size of the dataset they want to download. Let's collaborate!

@Conchylicultor
Member

@anupam-tripathi Please note that the download size is already displayed when downloading a dataset: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_builder.py#L387

Without downloading, the user should already be able to get the info with:

import tensorflow_datasets as tfds

builder = tfds.builder('imagenet2012')
print(builder.info.size_in_bytes)

This issue is mostly to expose this information in the webpage doc.
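
For the constrained-disk-space case that motivated this issue, a reported size could be compared against free space before downloading. The following is a minimal sketch using only the standard library; the function name `fits_on_disk` and the 10% safety margin are assumptions made for illustration, and the prepared dataset may need more space than the download alone.

```python
import shutil


def fits_on_disk(size_in_bytes, path=".", safety_margin=0.1):
    """Return True if size_in_bytes plus a margin fits in the free space at path.

    Hypothetical helper: the margin is an arbitrary guess, not a value
    recommended by TFDS.
    """
    free_bytes = shutil.disk_usage(path).free
    return size_in_bytes * (1.0 + safety_margin) <= free_bytes


# Example with a made-up size of 5 GiB, checked against the current directory.
print(fits_on_disk(5 * 2**30))
```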

@ChanchalKumarMaji
Contributor

@Conchylicultor,

I have generated a new dataset documentation here. There are some datasets where I get the size as 0, and I can't figure out why this happens.

When I run the following code in Colab

builder = tfds.builder('imagenet2012')
print(builder.info.size_in_bytes)

it prints 0.

Should I open a new pull request so that you can see the changes I made in document_datasets.py?

@Conchylicultor
Member

Oh, sorry about this. Yes, ImageNet was a bad example because it is not automatically downloaded (due to the ImageNet licence, it has to be manually downloaded by the user).

Otherwise, it is possible that the most recent datasets do not have size information yet. We are pre-computing the size_in_bytes internally at Google, and when a user calls tfds.load, it downloads the dataset information from the internet (size, statistics about the dataset, ...). So after a new dataset is added or updated, it may take some time before the info becomes available.
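
Given that a size of 0 can mean "not known yet" rather than "empty", generated documentation could label such entries explicitly. A minimal sketch, assuming a hypothetical helper named `size_or_unknown` that is not part of TFDS:

```python
def size_or_unknown(size_in_bytes):
    """Describe a dataset size, treating 0 or None as missing information.

    Hypothetical helper: a zero here is assumed to mean the size has not
    been computed (or the dataset is manually downloaded), not that the
    dataset is empty.
    """
    if not size_in_bytes:
        return "unknown"
    return "{} bytes".format(size_in_bytes)


print(size_or_unknown(0))
print(size_or_unknown(12312312))
```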

@Conchylicultor
Member

Yes, please generate a pull request with your changes.

Also note that there is tfds.units.size_str for human-readable formatting:

>>> tfds.units.size_str(12312312)
'11.74 MiB'
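
To show where a figure like that comes from, here is a standalone sketch of binary-unit formatting. It is not the implementation of tfds.units.size_str; the name `human_size` and the unit table are assumptions chosen to mirror the MiB-style output shown above.

```python
# Binary (power-of-two) units, largest first. Illustrative only.
_UNITS = [
    ("PiB", 2**50),
    ("TiB", 2**40),
    ("GiB", 2**30),
    ("MiB", 2**20),
    ("KiB", 2**10),
]


def human_size(size_in_bytes):
    """Format a byte count using the largest unit that yields a value >= 1."""
    for name, factor in _UNITS:
        value = size_in_bytes / factor
        if value >= 1.0:
            return "{:.2f} {}".format(value, name)
    # Anything below 1 KiB is reported as a plain byte count.
    return "{} bytes".format(int(size_in_bytes))


print(human_size(12312312))  # 12312312 / 2**20 is about 11.74
```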

@ChanchalKumarMaji
Contributor

@Conchylicultor I have generated a pull request here. Please check.

@ParthS007
Contributor

@Conchylicultor, I have also added the sizes in table form at the start of the docs. Please review my PR here. Thanks :)

6 participants