-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add load_dataset_builder #2500
Add load_dataset_builder #2500
Conversation
src/datasets/load.py
Outdated
if _return_resolved_file_path: | ||
return builder_instance, resolved_file_path | ||
return builder_instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is not very nice. Maybe it's better to define base_path
as an optional attribute of DatasetBuilder
and then in DatasetBuilder.download_and_prepare
we can do the following:
base_path = base_path if base_path is not None else self._base_path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea !
Hi @mariosasko, thanks for taking on this issue. Just a few logistic suggestions, as you are one of our most active contributors ❤️ :
I hope you find these hints useful. 🤗 |
@albertvillanova Thanks for the tips. When creating this PR, it slipped my mind that this should be a draft. GH has an option to convert already created PRs to draft PRs, but this requires write access for the repo, so maybe you can help. |
… add-load_dataset_builder
Ready for the review! One additional change. I've modified the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding load_dataset_builder
:)
This is really helpful.
I just have one comment about the part that hashes the builder's code:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks all good now !
Thank you for adding this :)
@@ -431,7 +431,7 @@ For example, run the following to skip integrity verifications when loading the | |||
Loading datasets offline | |||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |||
|
|||
Each dataset builder (e.g. "squad") is a python script that is downloaded and cached from either from the huggingface/datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__. | |||
Each dataset builder (e.g. "squad") is a python script that is downloaded and cached from either from the 🤗Datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggested a white space here: 🤗Datasets => 🤗 Datasets
Thank you for adding this feature, @mariosasko - this is really awesome! Tried with:
The logger message (edited by me to add new lines to point the issues out) is a bit confusing to the user - that is what does
If the cached version always supersedes any other versions perhaps this is what it should say?
|
Hi ! Thanks for the comments Regarding your last message: When you specify a dataset name without a slash, it tries to load a canonical dataset or it looks locally at ./openwebtext-10k/openwebtext-10k.py |
Oh, I see, so I actually used an incorrect input. so it was a user error. Correcting it:
Now there is no logger message. Got it! OK, I'm not sure the magical recovery it did in first place is most beneficial in the long run. I'd have rather it failed and said: "incorrect input there is no such dataset as 'openwebtext-10k' at or " - because if it doesn't fail I may leave it in the code and it'll fail later when another user tries to use my code and won't have the cache. Does it make sense? Giving me
Except it slapped the exception name to Plus for the local it's not clear where is it looking relatively too when it gets Finally, the logger format is not set up so the user gets a warning w/o knowing it's a warning. As you can see it's missing the WARNING pre-amble in #2500 (comment) i.e. I had no idea it was warning me of something, I was just trying to make sense of the message that's why I started the discussion and otherwise I'd have completely missed the point of me making an error. |
Adds the
load_dataset_builder
function. The good thing is that we can reuse this function to load the dataset info without downloading the dataset itself.TODOs:
Closes #2484