Add load_dataset_builder #2500

mariosasko · 2021-06-14T14:27:45Z

Adds the load_dataset_builder function. The good thing is that we can reuse this function to load the dataset info without downloading the dataset itself.

TODOs:

Add docstring and entry in the docs
Add tests

Closes #2484

mariosasko · 2021-06-14T14:54:21Z

src/datasets/load.py

+    if _return_resolved_file_path:
+        return builder_instance, resolved_file_path
+    return builder_instance


This part is not very nice. Maybe it's better to define base_path as an optional attribute of DatasetBuilder and then in DatasetBuilder.download_and_prepare we can do the following:

base_path = base_path if base_path is not None else self._base_path

lhoestq · 2021-06-15

Good idea!
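
The fallback proposed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was merged: `SketchBuilder` is a hypothetical stand-in for `DatasetBuilder`, and the method returns the resolved path purely so the behavior is observable.

```python
from typing import Optional


class SketchBuilder:
    """Hypothetical stand-in for DatasetBuilder, used only for illustration."""

    def __init__(self, base_path: Optional[str] = None):
        # Optional attribute: stays None when the caller has no base path to record.
        self._base_path = base_path

    def download_and_prepare(self, base_path: Optional[str] = None) -> Optional[str]:
        # An explicit argument wins; otherwise fall back to the stored attribute.
        base_path = base_path if base_path is not None else self._base_path
        # A real builder would download and prepare data here; this sketch
        # only returns the path it would have used.
        return base_path


builder = SketchBuilder(base_path="/data/scripts")
print(builder.download_and_prepare())                   # uses the stored value
print(builder.download_and_prepare(base_path="/tmp/x"))  # explicit argument wins
```

With the attribute in place, the loading function would no longer need a private flag to hand the resolved path back to its caller.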

albertvillanova · 2021-06-14T16:43:47Z

Hi @mariosasko, thanks for taking on this issue.

Just a few logistical suggestions, as you are one of our most active contributors ❤️ :

When you start working on an issue, you can assign it to yourself by commenting on the issue page with the keyword: #self-assign; we have implemented a GitHub Action to take care of that... 😉
While you are still working on your Pull Request, instead of using [WIP] in the PR name, you can create a draft pull request: use the drop-down (to the right of the Create Pull Request button), select Create Draft Pull Request, then click Draft Pull Request.

I hope you find these hints useful. 🤗

mariosasko · 2021-06-14T17:22:28Z

@albertvillanova Thanks for the tips. When creating this PR, it slipped my mind that this should be a draft. GH has an option to convert already created PRs to draft PRs, but this requires write access for the repo, so maybe you can help.

mariosasko · 2021-06-30T16:50:24Z

Ready for the review!

One additional change. I've modified the camelcase_to_snakecase/snakecase_to_camelcase conversion functions to fix conversion of the names with 2 or more underscores (e.g. camelcase_to_snakecase("__DummyDataset__") would return ___dummy_dataset__; notice one extra underscore at the beginning). The implementation is based on the inflection library.
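
A minimal sketch of an underscore-preserving conversion, modeled on the two regular-expression passes used by the inflection library's `underscore` helper. The name `to_snake_case` is hypothetical, and the function actually changed in this PR may differ in its details.

```python
import re


def to_snake_case(name: str) -> str:
    """Hypothetical helper: convert CamelCase to snake_case, keeping existing underscores."""
    # Split an acronym from a following capitalized word: "HTTPResponse" -> "HTTP_Response".
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Split a lowercase letter or digit from a following capital: "DummyDataset" -> "Dummy_Dataset".
    name = re.sub(r"([a-z\d])([A-Z])", r"\1_\2", name)
    return name.lower()


print(to_snake_case("__DummyDataset__"))  # __dummy_dataset__
```

Because neither pattern can match an underscore in its first group, leading and trailing underscores pass through unchanged, which avoids the extra underscore described above.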

lhoestq

Thank you for adding load_dataset_builder :)
This is really helpful.

I just have one comment about the part that hashes the builder's code:

src/datasets/packaged_modules/__init__.py

lhoestq

Looks all good now !

Thank you for adding this :)

stas00 · 2021-07-05T17:15:18Z

docs/source/loading_datasets.rst

@@ -431,7 +431,7 @@ For example, run the following to skip integrity verifications when loading the
 Loading datasets offline
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Each dataset builder (e.g. "squad") is a python script that is downloaded and cached from either from the huggingface/datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__.
+Each dataset builder (e.g. "squad") is a python script that is downloaded and cached from either from the 🤗Datasets GitHub repository or from the `HuggingFace Hub <https://huggingface.co/datasets>`__.


I'd suggest adding a space here: 🤗Datasets => 🤗 Datasets

stas00 · 2021-07-05T17:27:25Z

Thank you for adding this feature, @mariosasko - this is really awesome!

Tried with:

python -c "from datasets import load_dataset_builder; b = load_dataset_builder('openwebtext-10k'); print(b.cache_dir)"
Using the latest cached version of the module from /home/stas/.cache/huggingface/modules/datasets_modules/datasets
/openwebtext-10k/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b (last modified on Wed May 12 
20:22:53 2021) 

since it couldn't be found locally at openwebtext-10k/openwebtext-10k.py 

or remotely (FileNotFoundError).

/home/stas/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b

The logger message (edited by me to add new lines to point the issues out) is a bit confusing to the user. That is, what does FileNotFoundError refer to?

Maybe replace FileNotFoundError with the location where it was looking for a file online. But then the remote file is there, so it is found.
I'm not sure why it says "since it couldn't be found locally", as it is found locally in the cache folder. And again, what does "locally at openwebtext-10k/openwebtext-10k.py" mean, i.e. where does it look for it? Is it looking for ./openwebtext-10k/openwebtext-10k.py, or in some specific dir?

If the cached version always supersedes any other versions perhaps this is what it should say?

found cached version at xxx, not looking for a local at yyy, not downloading remote at zzz

lhoestq · 2021-07-06T13:36:31Z

Hi ! Thanks for the comments

Regarding your last message:
You must pass stas/openwebtext-10k as in load_dataset instead of openwebtext-10k. Otherwise it doesn't know how to retrieve the builder from the HF Hub.

When you specify a dataset name without a slash, it tries to load a canonical dataset or it looks locally at ./openwebtext-10k/openwebtext-10k.py
Here since openwebtext-10k is not a canonical dataset and doesn't exist locally at ./openwebtext-10k/openwebtext-10k.py: it raised a FileNotFoundError.
As a fallback it managed to find the dataset script in your cache and it used this one.
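
The lookup order described in this comment can be modeled with a small sketch. It only restates the explanation above and is not the library's actual resolution code: `lookup_order` is a hypothetical helper, and listing the cache fallback for names that contain a slash is an assumption.

```python
from typing import List


def lookup_order(name: str) -> List[str]:
    """Hypothetical helper: list where a dataset script would be searched, in order."""
    steps = []
    if "/" in name:
        # A namespaced name such as "stas/openwebtext-10k" points at the HF Hub.
        steps.append(f"Hub repository '{name}'")
    else:
        # A bare name is tried as a canonical dataset, then as a local script.
        steps.append(f"canonical dataset '{name}'")
        steps.append(f"local script ./{name}/{name}.py")
    # Assumption: the cached script is the last resort in both cases.
    steps.append("previously cached script (fallback)")
    return steps


for step in lookup_order("openwebtext-10k"):
    print(step)
```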

stas00 · 2021-07-06T16:07:31Z

Oh, I see, so I actually used an incorrect input, i.e. it was a user error. Correcting it:

python -c "from datasets import load_dataset_builder; b = load_dataset_builder('stas/openwebtext-10k'); print(b.cache_dir)"
/home/stas/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b

Now there is no logger message. Got it!

OK, I'm not sure the magical recovery it did in the first place is the most beneficial in the long run. I'd rather it had failed and said: "incorrect input there is no such dataset as 'openwebtext-10k' at or " - because if it doesn't fail I may leave it in the code, and it'll fail later when another user tries to use my code and doesn't have the cache. Does that make sense? Giving me this url allows me to go to the datasets hub and realize that the dataset is missing the username qualifier.

Here since openwebtext-10k is not a canonical dataset and doesn't exist locally at ./openwebtext-10k/openwebtext-10k.py: it raised a FileNotFoundError.

Except it attached the exception name to "remotely (FileNotFoundError)", which makes no sense.

Plus, for the local lookup it's not clear what the path is relative to when it gets FileNotFoundError. Perhaps it would help to use the absolute path in the message?

Finally, the logger format is not set up so the user gets a warning w/o knowing it's a warning. As you can see it's missing the WARNING pre-amble in #2500 (comment)

i.e. I had no idea it was warning me of something, I was just trying to make sense of the message that's why I started the discussion and otherwise I'd have completely missed the point of me making an error.
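
A minimal sketch of the two suggestions in this comment, using only the standard library: a formatter that prints the level name, and a message that reports the absolute path that was checked. The logger name `sketch.load` and the message wording are hypothetical and do not reflect how the datasets library configures its logging.

```python
import io
import logging
import os


def make_logger(stream) -> logging.Logger:
    """Hypothetical logger whose output always starts with the level name."""
    logger = logging.getLogger("sketch.load")
    logger.setLevel(logging.WARNING)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    # %(levelname)s makes it obvious that the message is a WARNING.
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


name = "openwebtext-10k"
# Resolve the relative candidate path so the user sees exactly where the lookup happened.
local_candidate = os.path.abspath(os.path.join(name, f"{name}.py"))

buffer = io.StringIO()
log = make_logger(buffer)
log.warning("No dataset script found at %s; using the cached version instead", local_candidate)
output = buffer.getvalue()
print(output, end="")
```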

Add load_dataset_builder

bbd000d

mariosasko commented Jun 14, 2021

mariosasko changed the title ~~Add load_dataset_builder~~ [WIP] Add load_dataset_builder Jun 14, 2021

Fix

1a027f4

albertvillanova marked this pull request as draft June 14, 2021 17:31

albertvillanova changed the title ~~[WIP] Add load_dataset_builder~~ Add load_dataset_builder Jun 14, 2021

mariosasko added 3 commits June 15, 2021 20:45

Merge branch 'master' of https://github.com/huggingface/datasets into add-load_dataset_builder

0498124

Add docstring

9676f44

Remove _return_resolved_file_path arg

37a0a5e

mariosasko mentioned this pull request Jun 19, 2021

Can datasets remove duplicated rows? #2514

Open

mariosasko added 8 commits June 23, 2021 01:26

Improve camel-case/snake-case conversion

1950fa8

Add test

942c8c1

Fix test

005ebac

Merge branch 'master' of https://github.com/huggingface/datasets into add-load_dataset_builder

eeb1098

Fix packaged_modules

880e954

Improve test

a6d3ca9

Fix prepare_module test

1c1fffc

Doc improvement

2c8dc1e

mariosasko mentioned this pull request Jun 30, 2021

Existing cache for local dataset builder file updates is ignored with ignore_verifications=True #2561

Closed

Mention load_dataset_builder in docs

c89dab6

mariosasko marked this pull request as ready for review June 30, 2021 16:50

lhoestq reviewed Jul 1, 2021

src/datasets/packaged_modules/__init__.py (outdated, resolved)

mariosasko added 3 commits July 1, 2021 12:37

Remove replacements in packaged_module

26ce0f2

Try to trigger CI

ad5a723

Merge branch 'master' of https://github.com/huggingface/datasets into add-load_dataset_builder

0ac9792

mariosasko requested a review from lhoestq July 2, 2021 22:56

mention dataset_builder.info in the docs

d8f9df2

lhoestq approved these changes Jul 5, 2021

lhoestq merged commit b15b476 into huggingface:master Jul 5, 2021

stas00 reviewed Jul 5, 2021

docs/source/loading_datasets.rst (resolved)

albertvillanova mentioned this pull request Jul 6, 2021

Remove redundant prepare_module #2597

Merged

mariosasko deleted the add-load_dataset_builder branch July 9, 2021 00:08

mariosasko mentioned this pull request Jul 9, 2021

More consistent naming #2611

Merged

mariosasko mentioned this pull request Jul 20, 2021

Print absolute local paths in load_dataset error messages #2684

Merged
