Jehovah Witness Sign Language Resources #29
@ShesterG this is something you could perhaps take on ;) |
Chipping in here to provide a bit of assistance with Shester's dataset. Asked his permission to help! It seems he got started with this at https://github.com/ShesterG/datasets/tree/shester/sign_language_datasets/datasets/jw_sign |
Shester informs me that the annotations can be recreated via https://github.com/ShesterG/datasets/blob/shester/sign_language_datasets/datasets/jw_sign/create_index.py, it just takes a long time to run. They've been precomputed/saved off, they just need to be hosted somewhere. |
OK, for now the files are uploaded to https://drive.google.com/drive/folders/1QFmq5Byg0xTLgJ7sBdVuQBlgxrkzp9vV , thank you Shester! Now we need to...
|
One thing for us to consider: Google Drive will sometimes cause issues if too many people download the same files. See https://www.tensorflow.org/datasets/overview#manual_download_if_download_fails |
OK, the following actually manages to download newindex.list.gz, but saves it with a weird name. However, when I manually rename it, I can open the file with 7zip and see it's the right file.

```python
import tensorflow_datasets as tfds

if __name__ == "__main__":
    ####################################
    # try to download newindex.list.gz
    ####################################

    # downloads a 0 MB empty file
    google_drive_link_to_newindex = "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"

    # extract the ID from above and append it to "https://drive.google.com/uc?id="
    # downloads an actual file
    google_drive_link_to_newindex_take2 = "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"

    dl_manager = tfds.download.DownloadManager(download_dir="./foo")
    extracted_path = dl_manager.download(google_drive_link_to_newindex_take2)

    # ends up printing "foo\ucid_1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk",
    # which is the file ID from above followed by
    # "2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk" (no idea what that suffix is)
    print(extracted_path)
```
|
The implication here is that the strategy of using URLs like "https://drive.google.com/uc?id=" seems to work |
Here are my notes on newindex.list.gz (drive link): Keys: ['video_url', 'video_name', 'verse_lang', 'verse_name', 'verse_start', 'verse_end', 'duration', 'verse_unique', 'verseID']. First 10:
It's a pickled list, compressed with gzip. In compressed form it's about 19,000 KB (roughly 19 MB); decompressed, it's closer to 100 MB. |
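For reference, a minimal sketch of loading it, assuming it really is a gzip-compressed pickled list of dicts with the keys above (the local filename is just the one from the Drive folder):

```python
import gzip
import pickle

# Load the gzip-compressed pickled index (layout as described above).
with gzip.open("newindex.list.gz", "rb") as f:
    index = pickle.load(f)

print(len(index))                                    # number of index entries
print(index[0]["verse_name"], index[0]["video_url"])  # keys as listed above
```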
(Side note: investigate Parquet data format?) |
(or Arrow?) |
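A quick sketch of what that conversion might look like, assuming pandas plus pyarrow are available; filenames are placeholders:

```python
import gzip
import pickle

import pandas as pd  # Parquet I/O additionally needs pyarrow (or fastparquet)

with gzip.open("newindex.list.gz", "rb") as f:
    index = pickle.load(f)

# One row per index entry, one column per key; columnar and compressed on disk,
# and readable from basically any language, unlike a Python pickle.
pd.DataFrame(index).to_parquet("newindex.parquet")
```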
Another TODO at some point: Upload files to a better hosting platform, e.g. Zenodo or Zindi, to prevent issues with "too many downloads". |
JSON for DGS is parsed here:
|
And the JSON is created here:
|
Ah... "transcript ID". And they're not generated in the Python code, they're parsed from the source page via regex. |
https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md is a helpful guide for TFDS datasets. Also, apparently tfds.testing is a thing, used for example in dgs_corpus_test.py: https://tensorflow.google.cn/datasets/api_docs/python/tfds/testing |
https://tensorflow.google.cn/datasets/add_dataset?hl=en (actually, the helpful guide above is the source code for this page) |
Went and figured out how the index was created, and pushed an updated version of create_index.py: ShesterG#1. Now that I know the data a bit better, gonna move on to filling out the Builder class for JWSign |
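For orientation, a rough sketch of what that Builder class could look like, following the GeneratorBasedBuilder pattern from the TFDS guide; the feature names (guessed from the index keys) and the download step are my assumptions, not the actual implementation:

```python
import tensorflow_datasets as tfds


class JWSign(tfds.core.GeneratorBasedBuilder):
    """DatasetBuilder for JW Sign (sketch)."""

    VERSION = tfds.core.Version("1.0.0")

    def _info(self) -> tfds.core.DatasetInfo:
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                # Feature names guessed from the index keys above.
                "video_url": tfds.features.Text(),
                "verse_name": tfds.features.Text(),
                "verse_lang": tfds.features.Text(),
                "text": tfds.features.Text(),
            }),
        )

    def _split_generators(self, dl_manager: tfds.download.DownloadManager):
        # e.g. fetch the precomputed index via its "uc?id=" Drive link
        index_path = dl_manager.download(
            "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf")
        return {"train": self._generate_examples(index_path)}

    def _generate_examples(self, index_path):
        # yield (unique_key, example_dict) pairs built from the index
        ...
```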
Not being familiar with
Of course the next question is how to make tests! |
OK, so even if you just follow https://tensorflow.google.cn/datasets/add_dataset?hl=en#test_your_dataset and add nothing, you'll still get some basic unit tests: |
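For reference, the boilerplate test file from that guide looks roughly like this for JW Sign (the module path and class names here are my assumptions):

```python
"""jw_sign dataset tests (sketch following the TFDS add_dataset guide)."""
import tensorflow_datasets as tfds
from . import jw_sign


class JWSignTest(tfds.testing.DatasetBuilderTestCase):
    """Tests for jw_sign, run against fake data in a dummy_data/ folder."""
    DATASET_CLASS = jw_sign.JWSign
    SPLITS = {"train": 3}  # expected number of fake examples per split


if __name__ == "__main__":
    tfds.testing.test_main()
```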
OK, testing procedure:
|
All right, after messing around with #56, it seems that by deleting this file I am then able to run pytest. I had also run into another weird issue, #57, where pytest hit an error while telling me what a different error was. Now I can finally make some more progress on the JW Sign dataset. Let's see if I can make a version that at least downloads the spoken-language text, and maybe build a much-simplified index for testing purposes. |
In order to iterate/test the dataset I will need to:
|
OK, I did
And then repeatedly edited, pytested, using the |
OK, I'm starting from scratch. I made a new fork, this time forking off of https://github.com/sign-language-processing/datasets so that I'm up to date. |
I want to see if I can make a completely basic text-only dataset to start. |
Apparently Google Drive doesn't play nice. When I try to use tfds' download manager, it turns out Google likes to pop up a "can't scan this for viruses" page, and that's what gets downloaded instead of the actual file.
Here's my Colab notebook playing with it: https://colab.research.google.com/drive/1EMKnpKrDUHxq5COFM6Acm7PqAQmkdvTS?usp=sharing |
Workaround: split the text dict into one for every spoken language |
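A sketch of that workaround; the {language: {verse: text}} layout and the file names here are assumptions, not the real structure:

```python
import json

# Hypothetical layout of the combined text dict.
all_texts = {
    "en": {"GEN 1:1": "In the beginning..."},
    "fr": {"GEN 1:1": "Au commencement..."},
}

# Write one small JSON file per spoken language, so each download stays
# below whatever size triggers Drive's "can't scan for viruses" page.
for lang, verses in all_texts.items():
    with open(f"verses_{lang}.json", "w", encoding="utf-8") as f:
        json.dump(verses, f, ensure_ascii=False, indent=2)
```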
To get the links for all 51 .json files:
1. Go to the folder: https://drive.google.com/drive/folders/1r-ftcljPRm1kLasqCK_cYL9zc4mxE6_o
2. Select all the files (you have to scroll a bit, because it only shows 50 by default)
3. Right-click -> Share -> Copy links
|
And of course I can split those one by one and get each link into a format that tfds can download... except, how do I re-associate the filename with the link? |
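The link-splitting part itself is mechanical; a sketch (the helper name is hypothetical):

```python
import re


def to_direct_link(share_link: str) -> str:
    """Turn a Drive share link into the "uc?id=" form that tfds can download."""
    match = re.search(r"/d/([\w-]+)|[?&]id=([\w-]+)", share_link)
    if not match:
        raise ValueError(f"no Drive file ID found in {share_link!r}")
    file_id = match.group(1) or match.group(2)
    return f"https://drive.google.com/uc?id={file_id}"


print(to_direct_link(
    "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"))
# -> https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf
```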
I suppose I could just add the key back in? Then I do still have to download all 51 files, but at least the relevant info will still be inside each one.
|
OK, going with that for now; we can compress them later. I just want to get something running |
With a bit of munging I was able to download all the files, read the code, and then create a |
Gonna have to call it for today, but I added some notes to jw_sign.py for next time. |
TODO: code to generate the .json files containing text for each spoken language, on demand. Those need to be re-scraped each time |
We should add resources from JW, like the Bible.