Add load_dataset_builder #2500

Merged
merged 18 commits into from
Jul 5, 2021

mariosasko
Collaborator

@mariosasko mariosasko commented Jun 14, 2021

Adds the load_dataset_builder function. The good thing is that we can reuse this function to load the dataset info without downloading the dataset itself.

TODOs:

  • Add docstring and entry in the docs
  • Add tests

Closes #2484

Comment on lines 663 to 665
if _return_resolved_file_path:
    return builder_instance, resolved_file_path
return builder_instance
Collaborator Author

This part is not very nice. Maybe it's better to define base_path as an optional attribute of DatasetBuilder and then in DatasetBuilder.download_and_prepare we can do the following:

base_path = base_path if base_path is not None else self._base_path
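
As a minimal sketch of that idea (not the code that was merged): the class below is a hypothetical stand-in for DatasetBuilder, and everything apart from the fallback line quoted above is an assumption made for illustration.

```python
from typing import Optional


class SketchBuilder:
    """Hypothetical stand-in for DatasetBuilder; only the base_path logic is modelled."""

    def __init__(self, name: str, base_path: Optional[str] = None):
        self.name = name
        # Stored at construction time, so the loader no longer has to
        # return the resolved path alongside the builder instance.
        self._base_path = base_path

    def download_and_prepare(self, base_path: Optional[str] = None) -> Optional[str]:
        # An explicit argument wins; otherwise fall back to the stored attribute.
        base_path = base_path if base_path is not None else self._base_path
        # A real builder would download and prepare data here. The sketch only
        # returns the resolved value so that the fallback is observable.
        return base_path


builder = SketchBuilder("dummy", base_path="/data/scripts/dummy")
print(builder.download_and_prepare())             # falls back to the stored path
print(builder.download_and_prepare("/override"))  # explicit argument takes precedence
```

With the path stored on the instance, the loading function can return only the builder, which would make the `_return_resolved_file_path` branch unnecessary.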

Member

Good idea !

@mariosasko mariosasko changed the title Add load_dataset_builder [WIP] Add load_dataset_builder Jun 14, 2021
@albertvillanova
Member

albertvillanova commented Jun 14, 2021

Hi @mariosasko, thanks for taking on this issue.

Just a few logistical suggestions, as you are one of our most active contributors ❤️ :

  • When you start working on an issue, you can assign it to yourself by commenting on the issue page with the keyword #self-assign; we have implemented a GitHub Action to take care of that... 😉
  • While you are still working on your Pull Request, rather than putting [WIP] in the PR name, you can create a draft pull request: use the drop-down (on the right of the Create Pull Request button) and select Create Draft Pull Request, then click Draft Pull Request.

I hope you find these hints useful. 🤗

@mariosasko
Collaborator Author

@albertvillanova Thanks for the tips. When creating this PR, it slipped my mind that this should be a draft. GH has an option to convert already created PRs to draft PRs, but this requires write access to the repo, so maybe you can help.

@albertvillanova albertvillanova marked this pull request as draft June 14, 2021 17:31
@albertvillanova albertvillanova changed the title [WIP] Add load_dataset_builder Add load_dataset_builder Jun 14, 2021
@mariosasko
Collaborator Author

Ready for the review!

One additional change: I've modified the camelcase_to_snakecase/snakecase_to_camelcase conversion functions to fix the conversion of names with 2 or more consecutive underscores (e.g. camelcase_to_snakecase("__DummyDataset__") would previously return ___dummy_dataset__; notice the one extra underscore at the beginning). The implementation is based on the inflection library.
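
For illustration, here is a minimal sketch of a camel-to-snake conversion that follows the two substitution rules of inflection's `underscore` function. It is an assumption about the approach, not necessarily the exact code in this PR, and the regex variable names are made up for the example.

```python
import re

# Rules modelled on inflection.underscore; the variable names are illustrative.
_UPPER_UPPER = re.compile(r"([A-Z]+)([A-Z][a-z])")  # "HTTPResponse" -> "HTTP_Response"
_LOWER_UPPER = re.compile(r"([a-z\d])([A-Z])")      # "DummyDataset" -> "Dummy_Dataset"


def camelcase_to_snakecase(name: str) -> str:
    """Insert underscores at word boundaries only, then lowercase."""
    name = _UPPER_UPPER.sub(r"\1_\2", name)
    name = _LOWER_UPPER.sub(r"\1_\2", name)
    return name.lower()


print(camelcase_to_snakecase("__DummyDataset__"))  # __dummy_dataset__
print(camelcase_to_snakecase("HTTPResponse"))      # http_response
```

Because both rules only fire between two alphanumeric characters, existing leading and trailing underscores pass through unchanged. The reverse direction (snakecase_to_camelcase) is not covered by this sketch.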

@mariosasko mariosasko marked this pull request as ready for review June 30, 2021 16:50
Member

@lhoestq lhoestq left a comment

Thank you for adding load_dataset_builder :)
This is really helpful.

I just have one comment about the part that hashes the builder's code:

@mariosasko mariosasko requested a review from lhoestq July 2, 2021 22:56
Member

@lhoestq lhoestq left a comment

All looks good now!

Thank you for adding this :)

@lhoestq lhoestq merged commit b15b476 into huggingface:master Jul 5, 2021
@@ -431,7 +431,7 @@ For example, run the following to skip integrity verifications when loading the
 Loading datasets offline
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Each dataset builder (e.g. "squad") is a python script that is downloaded and cached from either from the huggingface/datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__.
+Each dataset builder (e.g. "squad") is a python script that is downloaded and cached from either from the 🤗Datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__.
Contributor

I'd suggest a space here: 🤗Datasets => 🤗 Datasets

@stas00
Contributor

stas00 commented Jul 5, 2021

Thank you for adding this feature, @mariosasko - this is really awesome!

Tried with:

python -c "from datasets import load_dataset_builder; b = load_dataset_builder('openwebtext-10k'); print(b.cache_dir)"
Using the latest cached version of the module from /home/stas/.cache/huggingface/modules/datasets_modules/datasets
/openwebtext-10k/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b (last modified on Wed May 12 
20:22:53 2021) 

since it couldn't be found locally at openwebtext-10k/openwebtext-10k.py 

or remotely (FileNotFoundError).

/home/stas/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b

The logger message (edited by me to add new lines to point the issues out) is a bit confusing to the user: what does FileNotFoundError refer to?

  1. Maybe replace FileNotFoundError with the location where it was looking for the file online. But then the remote file is there, so it is found.
  2. I'm not sure why it says "since it couldn't be found locally", as it is found locally in the cache folder. And again, what does "locally at openwebtext-10k/openwebtext-10k.py" mean, i.e. where does it look for it? Is it looking for ./openwebtext-10k/openwebtext-10k.py, or in some specific dir?

If the cached version always supersedes any other version, perhaps this is what it should say?

found cached version at xxx, not looking for a local at yyy, not downloading remote at zzz

@lhoestq
Member

lhoestq commented Jul 6, 2021

Hi ! Thanks for the comments

Regarding your last message:
You must pass stas/openwebtext-10k, as with load_dataset, instead of openwebtext-10k. Otherwise it doesn't know how to retrieve the builder from the HF Hub.

When you specify a dataset name without a slash, it tries to load a canonical dataset, or it looks locally at ./openwebtext-10k/openwebtext-10k.py.
Here, since openwebtext-10k is not a canonical dataset and doesn't exist locally at ./openwebtext-10k/openwebtext-10k.py, it raised a FileNotFoundError.
As a fallback, it managed to find the dataset script in your cache and used that one.
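
The lookup order described above can be sketched roughly as follows. This is only a model of the comment, not the library's loader: the function names, the `hub:`/`canonical:`/`local:`/`cache:` labels and the set-based lookups are all hypothetical.

```python
from pathlib import Path
from typing import List, Set


def candidate_locations(name: str, cwd: str = ".") -> List[str]:
    """Return, in order, the places a loader following the described rule would try."""
    if "/" in name:
        # A "user/dataset" name is looked up on the Hub under that namespace.
        return [f"hub:{name}"]
    # No slash: try a canonical dataset, then a local script named after the dataset.
    local_script = Path(cwd) / name / f"{name}.py"
    return [f"canonical:{name}", f"local:{local_script.as_posix()}"]


def resolve(name: str, available: Set[str], cached: Set[str]) -> str:
    """Pick the first available location, falling back to the cache."""
    for location in candidate_locations(name):
        if location in available:
            return location
    if name in cached:
        # The fallback that hid the incorrect name in the example above.
        return f"cache:{name}"
    raise FileNotFoundError(f"{name!r} not found at {candidate_locations(name)}")


print(candidate_locations("openwebtext-10k"))
print(candidate_locations("stas/openwebtext-10k"))
print(resolve("openwebtext-10k", available=set(), cached={"openwebtext-10k"}))
```

In the sketch, the bare name never reaches the Hub namespace, so it only succeeds because a cached copy exists.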

@stas00
Contributor

stas00 commented Jul 6, 2021

Oh, I see, so I actually used an incorrect input; it was a user error. Correcting it:

python -c "from datasets import load_dataset_builder; b = load_dataset_builder('stas/openwebtext-10k'); print(b.cache_dir)"
/home/stas/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b

Now there is no logger message. Got it!

OK, I'm not sure the magical recovery it did in the first place is the most beneficial in the long run. I'd rather it had failed and said: "incorrect input there is no such dataset as 'openwebtext-10k' at or ", because if it doesn't fail I may leave it in the code, and it'll fail later when another user tries to use my code and doesn't have the cache. Does that make sense? Giving me this url allows me to go to the datasets hub and realize that the dataset is missing the username qualifier.

Here since openwebtext-10k is not a canonical dataset and doesn't exist locally at ./openwebtext-10k/openwebtext-10k.py: it raised a FileNotFoundError.

Except it attached the exception name to "or remotely (FileNotFoundError)", which makes no sense.

Plus, for the local lookup, it's not clear what the path is relative to when it gets FileNotFoundError. Perhaps it would help to resolve the absolute path and use that in the message?
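
A minimal sketch of that suggestion, using only the standard library: both helper names and the message wording are hypothetical, not the library's actual output.

```python
from pathlib import Path


def absolute_script_path(name: str) -> Path:
    """Resolve ./<name>/<name>.py against the current working directory."""
    # resolve() works for paths that do not exist (strict=False is the default).
    return (Path(name) / f"{name}.py").resolve()


def missing_script_message(name: str) -> str:
    """Build a message that states exactly where the local lookup happened."""
    return (
        f"Couldn't find a dataset script for '{name}': "
        f"looked locally at {absolute_script_path(name)}"
    )


print(missing_script_message("openwebtext-10k"))
```

Printing the resolved path removes the ambiguity about which directory the relative path openwebtext-10k/openwebtext-10k.py refers to.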


Finally, the logger format is not set up, so the user gets a warning without knowing it's a warning. As you can see, the WARNING preamble is missing in #2500 (comment).

I.e. I had no idea it was warning me of something; I was just trying to make sense of the message. That's why I started the discussion; otherwise I'd have completely missed the fact that I had made an error.
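
For reference, this is how a level preamble can be added with Python's standard logging module. It is a generic sketch, not the library's logging setup; the logger name and format string are assumptions.

```python
import io
import logging

# Capture output in memory so the formatted record can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
# %(levelname)s puts the severity (e.g. WARNING) in front of every message.
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("sketch.loader")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output through the root logger

logger.warning("Using the latest cached version of the module")
print(stream.getvalue(), end="")
```

With such a formatter, the message is prefixed with WARNING, so the reader can tell at a glance that the fallback is not the normal path.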

@mariosasko mariosasko deleted the add-load_dataset_builder branch July 9, 2021 00:08
Development

Successfully merging this pull request may close these issues.

Implement loading a dataset builder
4 participants