Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BC-Breaking] Remove download and change root directory of CommonVoice #1082

Merged
merged 4 commits into from
Dec 18, 2020

Conversation

mthrok
Copy link
Collaborator

@mthrok mthrok commented Dec 9, 2020

After #1080, this PR

  • removes downloading feature and logics
  • [BC-breaking] expects the dataset directory to be provided by users as root argument
  • then deprecate the related constructor arguments

API-wise, it is not BC-braking, but users have to change the root directory they give to root argument.

root argument Directory where data set is expected
Before: <root_dir> <root_dir>/CommonVoice/<VERSION>/<LANGUAGE>
After: <root_dir> <root_dir>

If you were using, root="/data", url="english", version="cv-corpus-4-2019-12-10", then you need to change the root to /data/CommonVoice/cv-corpus-4-2019-12-10/en, (and you can should the url and version argument)

See #1077 for the agenda.
Closes #1083 #1077

@mthrok mthrok mentioned this pull request Dec 9, 2020
5 tasks
raise RuntimeError(
f"The hash of {filepath} does not match. Delete the file manually and retry."
)
extract_archive(archive)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep archive extraction, e.g. here

if not os.path.isdir(self._path):
if not os.path.isfile(archive):
checksum = _CHECKSUMS.get(url, None)
download_url(url, root, hash_value=checksum)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we can't download the dataset due to legal reason, we can still extract it.

If the user has not extracted the dataset yet, we can do so.

If we're about to extract and the user has provided a checksum (from the common voice website) we can also verify it.

Copy link
Collaborator Author

@mthrok mthrok Dec 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turned out that dataset archives have different directory structure for different version.

  • Version 4 - cv-corpus-4-2019-12-10
$ tar -tvf tmp/en.tar.gz
2019-12-09 17:47 .DS_Store
2019-12-10 08:09 clips/
2019-12-10 05:41 dev.tsv
2019-12-10 05:41 invalidated.tsv
2019-12-10 05:41 other.tsv
2019-12-10 05:41 test.tsv
2019-12-10 05:41 train.tsv
2019-12-10 05:41 validated.tsv
...
  • Version 5.1 - cv-corpus-5.1-2020-06-22
$ tar -tvf en.tar.gz
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/
2020-07-11 20:40 cv-corpus-5.1-2020-06-22/en/clips/
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/dev.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/invalidated.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/other.tsv
2020-07-11 20:41 cv-corpus-5.1-2020-06-22/en/reported.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/test.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/train.tsv
...

Also, because of this (and the fact that folder_in_archive argument which is not used properly), the previous implementation (download + extract) does not work. #1088 (I suspect it ever worked.)

The thing is that the current implementation expects the files to be presented at
<ROOT_DIR>/CommonVoice/<VERSION>/<LANG_CODE>.

But the previous code behavior extracts the archive at either
<ROOT_DIR>/ or
<ROOT_DIR>/<VERSION>/<LANG_CODE>.

I was thinking to resort to the original behavior when handling archive, but since there was not working behavior in the previous implementation when it comes to download + extract, I again propose not to try to extract the archive and ask users to extract the archive and provide the directory where the data are found. (and deprecate folder_in_archive) argument in addition to download and url arguments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offline discussion: change the root to expect the exact directory users extract the archive.

@mthrok mthrok force-pushed the common-voice-3 branch 4 times, most recently from acbf081 to 26e9de4 Compare December 15, 2020 20:03
@mthrok mthrok changed the title Remove execution path for download=True and clean up [BC-Breaking] Remove execution path for download=True and clean up Dec 16, 2020
Copy link
Contributor

@vincentqb vincentqb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. In fact, we could consider this simplified interface for other datasets too, cc #910.

for name, val in deprecated:
if val is not None:
warnings.warn(
f"`{name}` argument is no longer used and deprecated. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow-up, can we improve the message here? What should someone do who is currently already using this dataset?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought no longer used + Please remove it from the function call would suffice. What do you suggest?


base_url = "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com"
url = os.path.join(base_url, version, language + ext_archive)
"Please download the dataset and extract it manually.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow-up, we can provide a bit more guidance (a link or such) on how to go about this.

@mthrok mthrok merged commit 6b81024 into pytorch:master Dec 18, 2020
@mthrok mthrok deleted the common-voice-3 branch December 18, 2020 19:30
@mthrok mthrok changed the title [BC-Breaking] Remove execution path for download=True and clean up [BC-Breaking] Remove download and change root directory of CommonVoice Dec 18, 2020
mthrok pushed a commit to mthrok/audio that referenced this pull request Dec 13, 2022
Co-authored-by: holly1238 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants