
Jehovah Witness Sign Language Resources #29

Open
AmitMY opened this issue Feb 15, 2023 · 38 comments

Comments

@AmitMY
Contributor

AmitMY commented Feb 15, 2023

We should add resources from JW, like the Bible.

@bricksdont
Collaborator

@ShesterG this is something you could perhaps take on ;)

@cleong110
Contributor

Chipping in here to provide a bit of assistance with Shester's dataset. Asked his permission to help!

It seems he got started with this at https://github.com/ShesterG/datasets/tree/shester/sign_language_datasets/datasets/jw_sign

@cleong110
Contributor

Shester informs me that the annotations can be recreated via https://github.com/ShesterG/datasets/blob/shester/sign_language_datasets/datasets/jw_sign/create_index.py; it just takes a long time to run.

They've been precomputed and saved off; they just need to be hosted somewhere.

@cleong110
Contributor

cleong110 commented Jan 4, 2024

OK, for now the files are uploaded to https://drive.google.com/drive/folders/1QFmq5Byg0xTLgJ7sBdVuQBlgxrkzp9vV , thank you Shester! Now we need to...

@cleong110
Contributor

One thing for us to consider: Google Drive will sometimes cause issues if too many people download the same files. See
tensorflow/datasets#1482

and

https://www.tensorflow.org/datasets/overview#manual_download_if_download_fails

@cleong110
Contributor

OK, the following actually manages to download newindex.list.gz, but saves it with a weird name. However, when I manually rename it, I can open the file with 7-Zip and see it's the right file.

import tensorflow_datasets as tfds


if __name__ == "__main__":
    ####################################
    # Try to download newindex.list.gz
    ####################################

    # The share link downloads a 0 MB empty file.
    google_drive_link_to_newindex = "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"

    # Extract the ID from above and append it to "https://drive.google.com/uc?id=".
    # This downloads an actual file.
    google_drive_link_to_newindex_take2 = "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"
    dl_manager = tfds.download.DownloadManager(download_dir="./foo")

    downloaded_path = dl_manager.download(google_drive_link_to_newindex_take2)

    # Ends up printing "foo\ucid_1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk":
    # the ID from above, followed by "2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk",
    # which is presumably a hash tfds appends to keep downloaded files unique.
    print(downloaded_path)


@cleong110
Contributor

The implication here is that rewriting Drive links into the "https://drive.google.com/uc?id=" form seems to work.
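
For example, a tiny helper (my own sketch, not part of tfds) to rewrite a share link into that direct form:

import re

# Rewrite a Google Drive share link into the direct "uc?id=" form.
def drive_share_link_to_direct(url: str) -> str:
    match = re.search(r"/file/d/([\w-]+)", url)
    if match is None:
        raise ValueError(f"no Drive file ID found in {url!r}")
    return f"https://drive.google.com/uc?id={match.group(1)}"

# e.g. the newindex link from above:
print(drive_share_link_to_direct(
    "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"
))  # -> https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf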

@cleong110
Contributor

OK, the next thing I want to figure out is how to actually download and load files.

DGS Corpus actually includes a "dgs.json" in GitHub, about 440 kB in size.

https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/dgs_corpus/dgs.json

When I open it up in Firefox, I can see the format includes links to video files.

This is most similar, I think, to our "newindex.list.gz", in that there's a list of unique data items, with URL links to videos.

@cleong110
Contributor

cleong110 commented Jan 4, 2024

Here are my notes on newindex.list.gz (Drive link):

Keys: ['video_url', 'video_name', 'verse_lang', 'verse_name', 'verse_start', 'verse_end', 'duration', 'verse_unique', 'verseID']

First 10:

{'video_url': 'https://download-a.akamaihd.net/files/media_publication/f3/nwt_01_Ge_ALS_03_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_03_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 3:15', 'verse_start': '0.000000', 'verse_end': '31.198000', 'duration': 31.198, 'verse_unique': 'ALS Zan. 3:15', 'verseID': 'v1003015'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/59/nwt_01_Ge_ALS_39_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_39_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 39:2', 'verse_start': '0.000000', 'verse_end': '26.760000', 'duration': 26.76, 'verse_unique': 'ALS Zan. 39:2', 'verseID': 'v1039002'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/59/nwt_01_Ge_ALS_39_r720P.mp4', 'video_name': 'nwt_01_Ge_ALS_39_r720P', 'verse_lang': 'ALS', 'verse_name': 'Zan. 39:3', 'verse_start': '26.760000', 'verse_end': '47.848000', 'duration': 21.087999999999997, 'verse_unique': 'ALS Zan. 39:3', 'verseID': 'v1039003'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/10/nwt_03_Le_ALS_19_r720P.mp4', 'video_name': 'nwt_03_Le_ALS_19_r720P', 'verse_lang': 'ALS', 'verse_name': 'Lev. 19:18', 'verse_start': '0.000000', 'verse_end': '32.399000', 'duration': 32.399, 'verse_unique': 'ALS Lev. 19:18', 'verseID': 'v3019018'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/0c/nwt_03_Le_ALS_25_r720P.mp4', 'video_name': 'nwt_03_Le_ALS_25_r720P', 'verse_lang': 'ALS', 'verse_name': 'Lev. 25:10', 'verse_start': '0.000000', 'verse_end': '8.320000', 'duration': 8.32, 'verse_unique': 'ALS Lev. 25:10', 'verseID': 'v3025010'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/64/nwt_05_De_ALS_06_r720P.mp4', 'video_name': 'nwt_05_De_ALS_06_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 6:6', 'verse_start': '0.000000', 'verse_end': '7.341000', 'duration': 7.341, 'verse_unique': 'ALS Ligj. 6:6', 'verseID': 'v5006006'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/64/nwt_05_De_ALS_06_r720P.mp4', 'video_name': 'nwt_05_De_ALS_06_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 6:7', 'verse_start': '7.341000', 'verse_end': '24.024000', 'duration': 16.683, 'verse_unique': 'ALS Ligj. 6:7', 'verseID': 'v5006007'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/3d/nwt_05_De_ALS_10_r720P.mp4', 'video_name': 'nwt_05_De_ALS_10_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 10:20', 'verse_start': '0.000000', 'verse_end': '10.644000', 'duration': 10.644, 'verse_unique': 'ALS Ligj. 10:20', 'verseID': 'v5010020'}       
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/34/nwt_05_De_ALS_32_r720P.mp4', 'video_name': 'nwt_05_De_ALS_32_r720P', 'verse_lang': 'ALS', 'verse_name': 'Ligj. 32:4', 'verse_start': '0.000000', 'verse_end': '43.844000', 'duration': 43.844, 'verse_unique': 'ALS Ligj. 32:4', 'verseID': 'v5032004'}
{'video_url': 'https://download-a.akamaihd.net/files/media_publication/1e/nwt_09_1Sa_ALS_01_r720P.mp4', 'video_name': 'nwt_09_1Sa_ALS_01_r720P', 'verse_lang': 'ALS', 'verse_name': '1 Sam. 1:15', 'verse_start': '0.000000', 'verse_end': '23.557000', 'duration': 23.557, 'verse_unique': 'ALS 1 Sam. 1:15', 'verseID': 'v9001015'}

The file itself is a pickled list of dicts like these, compressed with gzip.

Compressed, it's about 19,000 KB, or roughly 19 MB; decompressed, it's closer to 100 MB.
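
For reference, loading it is just gzip + pickle (a minimal sketch, assuming the file really is a gzip-compressed pickled list of dicts as described above):

import gzip
import pickle

# Load the index: a gzip-compressed, pickled list of verse dicts.
with gzip.open("newindex.list.gz", "rb") as f:
    index = pickle.load(f)

print(len(index), "entries")
print(index[0]["verse_unique"], index[0]["video_url"])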

@cleong110
Contributor

(Side note: investigate Parquet data format?)

@cleong110
Contributor

(or Arrow?)
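
(Either way, converting would be straightforward once the index is loaded; a sketch with pandas, which uses pyarrow under the hood for Parquet:)

import gzip
import pickle

import pandas as pd  # pandas' to_parquet requires pyarrow (or fastparquet)

with gzip.open("newindex.list.gz", "rb") as f:
    index = pickle.load(f)

# Columnar storage should compress the ~19 MB index further and
# lets downstream tools load only the columns they need.
pd.DataFrame(index).to_parquet("newindex.parquet")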

@cleong110
Contributor

Another TODO at some point: Upload files to a better hosting platform, e.g. Zenodo or Zindi, to prevent issues with "too many downloads".

@cleong110
Contributor

What are the numbers in DGS? Unique IDs? Should we generate some for our dataset?

@cleong110
Contributor

JSON for DGS is parsed here:

def _split_generators(self, dl_manager: tfds.download.DownloadManager):

@cleong110
Contributor

And the JSON is created here, which calls the numbers "tr_id".

@cleong110
Contributor

Ah... "transcript ID". And they're not generated in the Python code, they're parsed from the source page via regex.

@cleong110
Contributor

https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md is a helpful guide for TFDS datasets.

Also, apparently tfds.testing is a thing, used for example in dgs_corpus_test.py: https://tensorflow.google.cn/datasets/api_docs/python/tfds/testing
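
Based on those docs, a minimal test for our builder would look something like this (JwSign and its module name are placeholders for whatever the builder class ends up being called):

import tensorflow_datasets as tfds

from . import jw_sign  # hypothetical module containing the builder


class JwSignTest(tfds.testing.DatasetBuilderTestCase):
    DATASET_CLASS = jw_sign.JwSign
    # Expected number of examples per split in the checked-in dummy data.
    SPLITS = {"train": 3}


if __name__ == "__main__":
    tfds.testing.test_main()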

@cleong110
Contributor

Actually, the helpful guide above is the source code for this page: https://tensorflow.google.cn/datasets/add_dataset?hl=en

@cleong110
Contributor

Went and figured out how the index was created, and pushed an updated version of create_index.py: ShesterG#1

Now that I know the data a bit better, gonna move on to filling out the Builder class for JWSign.

@cleong110
Contributor

From the presented slides for JWSign, this is what we're going for:

[slide from the JWSign presentation]

@cleong110
Contributor

Not being familiar with tfds or sign_language_datasets, I am attempting a "get basic functionality working and then test it" approach. But then I ran into the issue of not knowing how to test a dataset locally. #53 documents part of this, but the basic guide to testing is:

  1. Make sure you install from source.
  2. pip install pytest pytest-cov dill to get the testing deps.
  3. Run pytest . in whatever folder you want to run tests for, incl. the top level.

Of course the next question is how to make tests!

@cleong110
Contributor

OK, so even if you just follow https://tensorflow.google.cn/datasets/add_dataset?hl=en#test_your_dataset and add nothing, you'll still get some basic unit tests.

@cleong110
Contributor

OK, testing procedure:

conda create -n sign_language_datasets_source pip python=3.10  # with 3.11 on Windows there's no compatible tensorflow
conda activate sign_language_datasets_source
# navigate to the repo
git pull  # make sure it's up to date
python -m pip install .  # "python -m pip" ensures we're using the pip inside the conda env
python -m pip install pytest pytest-cov dill
pytest .

@cleong110
Contributor

cleong110 commented Mar 4, 2024

All right, after messing around with #56, it seems that by deleting this file I am then able to run pytest.

I had also run into another weird issue in #57, where pytest hit an error while telling me what a different error was.

Now I can finally proceed with the JW Sign dataset some more. Let's see if I can make a version which at least downloads the spoken-language text, and maybe make a much-simplified index for testing purposes.

@cleong110
Contributor

cleong110 commented Mar 4, 2024

In order to iterate/test the dataset I will need to:

# In the top-level directory of the repo, with __init__.py removed:
# make some change to the builder script, then
pytest ./sign_language_datasets/datasets/new_dataset/
# pip install .  # not actually necessary; you can simply run the test

@cleong110
Contributor

OK, I did:

# navigate to sign_language_datasets/datasets/
tfds new new_dataset  # create a new directory

And then I repeatedly edited and re-ran pytest, using the rwth_phoenix2014_t code as a base, until the VideoTest passed. Excellent.

@cleong110
Contributor

OK, I'm starting from scratch. I made a new fork, this time forking off of https://github.com/sign-language-processing/datasets so that I'm up to date.

@cleong110
Contributor

I want to see if I can make a completely basic text-only dataset to start.
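
Something like this is what I have in mind (a rough sketch only; the class name, features, and URL are placeholders, not the final JWSign schema):

import json

import tensorflow_datasets as tfds


class JwSign(tfds.core.GeneratorBasedBuilder):
    """Text-only starter builder: one example per verse."""

    VERSION = tfds.core.Version("0.0.1")

    def _info(self) -> tfds.core.DatasetInfo:
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "verse_id": tfds.features.Text(),
                "text": tfds.features.Text(),
            }),
        )

    def _split_generators(self, dl_manager: tfds.download.DownloadManager):
        # Placeholder link; the real URL would point at the hosted text data.
        path = dl_manager.download("https://drive.google.com/uc?id=<FILE_ID>")
        return {"train": self._generate_examples(path)}

    def _generate_examples(self, path):
        with open(path, encoding="utf-8") as f:
            verses = json.load(f)["data"]
        for verse_id, text in verses.items():
            yield verse_id, {"verse_id": verse_id, "text": text}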

@cleong110
Contributor

cleong110 commented Mar 4, 2024

Apparently Google Drive doesn't play nice. When I try to use tfds' download_and_extract method on the text51.dict.gz file, I get a .html instead.

Turns out Google likes to pop up a "can't scan this for viruses" message, and that's what gets downloaded.

The gdown library works, but it doesn't play nicely with tfds.

Here's my Colab notebook playing with it: https://colab.research.google.com/drive/1EMKnpKrDUHxq5COFM6Acm7PqAQmkdvTS?usp=sharing
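
For comparison, the gdown call that does work (using the newindex file ID from earlier as the example):

import gdown

# gdown knows how to click through Drive's "can't scan this file for
# viruses" page, which is what tfds ends up saving as .html.
url = "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"
gdown.download(url, "newindex.list.gz", quiet=False)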

@cleong110
Contributor

Workaround: split the text dict into one file per spoken language.
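
Something along these lines, assuming text51.dict.gz is a gzip-compressed pickled dict mapping language codes to {verseID: text} dicts (I haven't confirmed the exact structure):

import gzip
import json
import pickle

with gzip.open("text51.dict.gz", "rb") as f:
    text_by_lang = pickle.load(f)

# One small JSON file per spoken language; smaller files should
# hopefully avoid Drive's "can't scan for viruses" interstitial.
for lang, verses in text_by_lang.items():
    with open(f"text_{lang}.json", "w", encoding="utf-8") as out:
        json.dump({"spoken_language": lang, "data": verses}, out)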

@cleong110
Contributor

To get the links for all 51 .json files:

  1. Go to the folder: https://drive.google.com/drive/folders/1r-ftcljPRm1kLasqCK_cYL9zc4mxE6_o
  2. Select all the files; you have to scroll a bit because it only shows 50 by default.
  3. Right-click → Share → Copy links

https://drive.google.com/file/d/122Fs-e5O9SPELpE83FohdXky9r1QIpDB/view?usp=drive_link, https://drive.google.com/file/d/1-KptAhyZCfnxGG4OMOVHXfgFqfjhmduu/view?usp=drive_link, https://drive.google.com/file/d/1-QQKmrW0iI9lBLxihtnxOgT_xAQSWCt7/view?usp=drive_link, https://drive.google.com/file/d/12Ahgjl3wbho9ShwlGVdtUx-uDnGacPFn/view?usp=drive_link, https://drive.google.com/file/d/1DMTgyGq9td8XpWqMeez7Hv1Cy-UyIAmd/view?usp=drive_link, https://drive.google.com/file/d/1Dze5_WZyXkAq8gca9eiPnofDTbXiFCYy/view?usp=drive_link, https://drive.google.com/file/d/1FXZF2SMv4GZ4visJrUwxY9vnHIs2Gzli/view?usp=drive_link, https://drive.google.com/file/d/1PyUgjqq6gt5kf4Pa4fRJok8b5eech2An/view?usp=drive_link, https://drive.google.com/file/d/1jOInwyi6XAAkwJ5iMLc90NI9W8MlbARF/view?usp=drive_link, https://drive.google.com/file/d/10peSKPs99feSfsyWmkSmaONM8H2pGWEX/view?usp=drive_link, https://drive.google.com/file/d/1NoCcaKy_BPrlboXP5kx4_iP3DrOq0ITq/view?usp=drive_link, https://drive.google.com/file/d/1dNnGpaoMGR4IPhWuH-TIobDyeL3tcdHx/view?usp=drive_link, https://drive.google.com/file/d/1dqO8p1tD_UUXUw3pDMCYwOIAS2niKHgm/view?usp=drive_link, https://drive.google.com/file/d/1iwF3OKvo4WmWvlqXjoDqZ1j41C4qthjA/view?usp=drive_link, https://drive.google.com/file/d/1tXHY2m4-P_jD7I4xBvkbB3lUX8FXVNmi/view?usp=drive_link, https://drive.google.com/file/d/139UErDv_QeaAmm5n7l5IqsktC5b3hO5A/view?usp=drive_link, https://drive.google.com/file/d/1Z5iTuQGTl15oh_xm9cSrtJkkmfu9s7qz/view?usp=drive_link, https://drive.google.com/file/d/1kccYftVcapjNLXxE-VYZpIOVNcRoa2mI/view?usp=drive_link, https://drive.google.com/file/d/1r8ao3bUf4xcsyTqJp0AQBiM29Y2wXBcO/view?usp=drive_link, https://drive.google.com/file/d/1rY6VjXhQXL330uNpxmekrpi_xs3T2JeK/view?usp=drive_link, https://drive.google.com/file/d/1tOMJNzNYo-Bpo6rxDZW94tpBnH9lZadv/view?usp=drive_link, https://drive.google.com/file/d/13-C5Z3YFjEE4dpstt3hDbNORiGgB4BgP/view?usp=drive_link, https://drive.google.com/file/d/19-zA-4dsfB-LZcDWXiOKNnniNuEHCZh2/view?usp=drive_link, https://drive.google.com/file/d/1GR3A6NXnsoItIwaxQvflCXufhLV-xCz2/view?usp=drive_link, https://drive.google.com/file/d/1KApiflPkVm6Jn0sGw2OyRT__VAec_bFd/view?usp=drive_link, https://drive.google.com/file/d/1NibQFTL0gGUL9NYnYFjlk_uCIZSM-RqA/view?usp=drive_link, https://drive.google.com/file/d/1SINxYL1u2T-dG79TjQmj2AZsfQB8rTaC/view?usp=drive_link, https://drive.google.com/file/d/1wE9Po5-nrr9PS-xdT_F8WK-kDyYCDZAp/view?usp=drive_link, https://drive.google.com/file/d/1Df5j9YsEMdvNx9NR7zl1gE58mnIZkS06/view?usp=drive_link, https://drive.google.com/file/d/1K8HCwsEtdKba248wPxbZJcRDyP4NlfLp/view?usp=drive_link, https://drive.google.com/file/d/1hT0dqllsIUL5G6AKP_vA1bkzir1ZYT0q/view?usp=drive_link, https://drive.google.com/file/d/1mZLUo9k8VTRyPdvrJEduUxMQv4Cxdotk/view?usp=drive_link, https://drive.google.com/file/d/1sioyZKRvTfujJ0aeJYPOYAEXZz5pIpTp/view?usp=drive_link, https://drive.google.com/file/d/1ygGnPbz4ssjwXNyZRmOGgkbcFfOzHu1b/view?usp=drive_link, https://drive.google.com/file/d/1LOsj8qvhmyRtfmaimELeBIiD4xl1vFeW/view?usp=drive_link, https://drive.google.com/file/d/1NjK0YIAowCv5uMv4yEcKdi-dU6LoVwD6/view?usp=drive_link, https://drive.google.com/file/d/1Nv8ecYBPdogebdtGT4HcAGdRxI_hf-yI/view?usp=drive_link, https://drive.google.com/file/d/1_6nC34lGBDRSZVAM5msW4Ol-BbyvoMcK/view?usp=drive_link, https://drive.google.com/file/d/1eTHQKEotMJm20BKe--CLQfBqUUvlFZpo/view?usp=drive_link, https://drive.google.com/file/d/1jizBtuPzBA8Bcy5IMs-EeF2_q41A68zr/view?usp=drive_link, 
https://drive.google.com/file/d/1sJhcz_mwCGafr9hi0aLQkr1U91_cq6Qx/view?usp=drive_link, https://drive.google.com/file/d/1EOaMjlUVy-hNGLqH3zLfGtxSVRN0X9O1/view?usp=drive_link, https://drive.google.com/file/d/1HPz-ZDjJeomlqNpxc5cWsEO4P7liqiiU/view?usp=drive_link, https://drive.google.com/file/d/1_3H2N92wLAEIqi9VF735KeSPGqQtJE2Q/view?usp=drive_link, https://drive.google.com/file/d/1_C-cqwzEJI89tNjLlSsvHQoDgnBMELGr/view?usp=drive_link, https://drive.google.com/file/d/1gSlQrYvfB1m26npbNRYXP14idcn-_2aA/view?usp=drive_link, https://drive.google.com/file/d/1rrrn73YFhC4yjUwbPKxcSzaWQ8FzKXY2/view?usp=drive_link, https://drive.google.com/file/d/1z_KYeV2u0KgNOjZ_9ARnp4PnF_bJFkf5/view?usp=drive_link, https://drive.google.com/file/d/13wKXt6R4h_trTlDVtYXZRRPOFJ9omSnz/view?usp=drive_link, https://drive.google.com/file/d/1BQaj9_RC_lnsc3kJVSFKWw2o_hWp_X4c/view?usp=drive_link, https://drive.google.com/file/d/1P5fzxscp5uoq3AtxKthu29D7BYfV4nKe/view?usp=drive_link

@cleong110
Contributor

And of course I can split those one by one and get the link in a format that tfds can download...

...except, how do I re-associate the filename with the link?

@cleong110
Contributor

cleong110 commented Mar 4, 2024

I suppose I could just add the key back in? Then I do still have to download all 51 files, but at least the relevant info will still be inside each one.

{"spoken_language": "de", 
"data": {"v1001001": "1 \u00a0Am Anfang erschuf Gott Himmel und Erde.+", "v1001002": "2\u00a0\u00a0Die Erde nun war formlos und \u00f6de*. \u00dcber dem tief
}

@cleong110
Contributor

OK, going with that for now; we can compress them later. I just want to get something running.

@cleong110
Contributor

With a bit of munging I was able to download all the files, read the language code from each, and then create a dictionary of download URLs, which I saved to spoken_lang_text_file_download_urls.json.
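
A hypothetical reconstruction of that munging (drive_share_links.txt is assumed to hold the comma-separated share links pasted above):

import json
import re

import gdown

with open("drive_share_links.txt") as f:
    share_links = f.read().split(",")

# Download each file, read its "spoken_language" key, and map that
# language back to a direct download URL.
urls_by_lang = {}
for link in share_links:
    match = re.search(r"/file/d/([\w-]+)", link)
    if not match:
        continue
    direct_url = f"https://drive.google.com/uc?id={match.group(1)}"
    path = gdown.download(direct_url, quiet=True)
    with open(path, encoding="utf-8") as f:
        urls_by_lang[json.load(f)["spoken_language"]] = direct_url

with open("spoken_lang_text_file_download_urls.json", "w") as f:
    json.dump(urls_by_lang, f, indent=2)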

cleong110 added a commit to cleong110/datasets that referenced this issue Mar 4, 2024
@cleong110
Contributor

Gonna have to call it for today, but I added some notes to jw_sign.py for next time.

@cleong110
Contributor

TODO: code to generate the .json files containing text for each spoken language, on demand. Those need to be re-scraped each time
