-
Notifications
You must be signed in to change notification settings - Fork 683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BC-Breaking] Remove download and change root directory of CommonVoice #1082
Conversation
raise RuntimeError( | ||
f"The hash of {filepath} does not match. Delete the file manually and retry." | ||
) | ||
extract_archive(archive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep archive extraction, e.g. here
if not os.path.isdir(self._path): | ||
if not os.path.isfile(archive): | ||
checksum = _CHECKSUMS.get(url, None) | ||
download_url(url, root, hash_value=checksum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we can't download the dataset due to legal reason, we can still extract it.
If the user has not extracted the dataset yet, we can do so.
If we're about to extract and the user has provided a checksum (from the common voice website) we can also verify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turned out that dataset archives have different directory structure for different version.
- Version 4 -
cv-corpus-4-2019-12-10
$ tar -tvf tmp/en.tar.gz
2019-12-09 17:47 .DS_Store
2019-12-10 08:09 clips/
2019-12-10 05:41 dev.tsv
2019-12-10 05:41 invalidated.tsv
2019-12-10 05:41 other.tsv
2019-12-10 05:41 test.tsv
2019-12-10 05:41 train.tsv
2019-12-10 05:41 validated.tsv
...
- Version 5.1 -
cv-corpus-5.1-2020-06-22
$ tar -tvf en.tar.gz
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/
2020-07-11 20:40 cv-corpus-5.1-2020-06-22/en/clips/
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/dev.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/invalidated.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/other.tsv
2020-07-11 20:41 cv-corpus-5.1-2020-06-22/en/reported.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/test.tsv
2020-07-11 23:59 cv-corpus-5.1-2020-06-22/en/train.tsv
...
Also, because of this (and the fact that folder_in_archive
argument which is not used properly), the previous implementation (download + extract) does not work. #1088 (I suspect it ever worked.)
The thing is that the current implementation expects the files to be presented at
<ROOT_DIR>/CommonVoice/<VERSION>/<LANG_CODE>
.
But the previous code behavior extracts the archive at either
<ROOT_DIR>/
or
<ROOT_DIR>/<VERSION>/<LANG_CODE>
.
I was thinking to resort to the original behavior when handling archive, but since there was not working behavior in the previous implementation when it comes to download + extract, I again propose not to try to extract the archive and ask users to extract the archive and provide the directory where the data are found. (and deprecate folder_in_archive
) argument in addition to download
and url
arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
offline discussion: change the root
to expect the exact directory users extract the archive.
acbf081
to
26e9de4
Compare
26e9de4
to
29aef9f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. In fact, we could consider this simplified interface for other datasets too, cc #910.
for name, val in deprecated: | ||
if val is not None: | ||
warnings.warn( | ||
f"`{name}` argument is no longer used and deprecated. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow-up, can we improve the message here? What should someone do who is currently already using this dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought no longer used
+ Please remove it from the function call
would suffice. What do you suggest?
|
||
base_url = "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com" | ||
url = os.path.join(base_url, version, language + ext_archive) | ||
"Please download the dataset and extract it manually.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a follow-up, we can provide a bit more guidance (a link or such) on how to go about this.
Co-authored-by: holly1238 <[email protected]>
After #1080, this PR
root
argumentAPI-wise, it is not BC-braking, but users have to change the root directory they give to
root
argument.root
argument<root_dir>
<root_dir>/CommonVoice/<VERSION>/<LANGUAGE>
<root_dir>
<root_dir>
If you were using,
root="/data", url="english", version="cv-corpus-4-2019-12-10"
, then you need to change theroot
to/data/CommonVoice/cv-corpus-4-2019-12-10/en
, (and you can should theurl
andversion
argument)See #1077 for the agenda.
Closes #1083 #1077