Jehovah Witness Sign Language Resources #29
@ShesterG this is something you could perhaps take on ;) |
Chipping in here to provide a bit of assistance with Shester's dataset. Asked his permission to help! It seems he got started with this at https://github.com/ShesterG/datasets/tree/shester/sign_language_datasets/datasets/jw_sign |
Shester informs me that the annotations can be recreated via https://github.com/ShesterG/datasets/blob/shester/sign_language_datasets/datasets/jw_sign/create_index.py, it just takes a long time to run. They've been precomputed/saved off, they just need to be hosted somewhere. |
OK, for now the files are uploaded to https://drive.google.com/drive/folders/1QFmq5Byg0xTLgJ7sBdVuQBlgxrkzp9vV , thank you Shester! Now we need to...
|
One thing for us to consider: Google Drive will sometimes cause issues if too many people download the same files. See https://www.tensorflow.org/datasets/overview#manual_download_if_download_fails |
OK, the following actually manages to download newindex.list.gz, but saves it with a weird name. However, when I manually rename it, I can open the file with 7zip and see it's the right file.

```python
import tensorflow_datasets as tfds

if __name__ == "__main__":
    ####################################
    # try to download newindex.list.gz
    ####################################

    # downloads a 0 MB empty file
    google_drive_link_to_newindex = "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"

    # extract the ID from above and append it to "https://drive.google.com/uc?id="
    # downloads an actual file
    google_drive_link_to_newindex_take2 = "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf"

    dl_manager = tfds.download.DownloadManager(download_dir="./foo")
    extracted_path = dl_manager.download(google_drive_link_to_newindex_take2)

    # ends up printing "foo\ucid_1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk",
    # which is the file ID from above followed by
    # "2gL0ALrIqsmpQErZRz8dejCm0UbwN7MWPKpoimgtDwk" (no idea what that suffix is)
    print(extracted_path)
```
|
The implication here is that the strategy of using URLs like "https://drive.google.com/uc?id=" seems to work |
Here are my notes on newindex.list.gz (drive link): Keys: ['video_url', 'video_name', 'verse_lang', 'verse_name', 'verse_start', 'verse_end', 'duration', 'verse_unique', 'verseID']. First 10:
It's a pickled list, compressed with gzip. In compressed form it's about 19,000 KB (roughly 19 MB); decompressed, it's closer to 100 MB. |
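For reference, a minimal sketch of loading it, assuming it really is a gzip-compressed pickled list of dicts with the keys above (the local filename is just the one from the Drive folder):

```python
import gzip
import pickle

# Load the gzip-compressed pickled index (layout as described above).
with gzip.open("newindex.list.gz", "rb") as f:
    index = pickle.load(f)

print(len(index))                                    # number of index entries
print(index[0]["verse_name"], index[0]["video_url"])  # keys as listed above
```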
(Side note: investigate Parquet data format?) |
(or Arrow?) |
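A quick sketch of what that conversion might look like, assuming pandas plus pyarrow are available; filenames are placeholders:

```python
import gzip
import pickle

import pandas as pd  # Parquet I/O additionally needs pyarrow (or fastparquet)

with gzip.open("newindex.list.gz", "rb") as f:
    index = pickle.load(f)

# One row per index entry, one column per key; columnar and compressed on disk,
# and readable from basically any language, unlike a Python pickle.
pd.DataFrame(index).to_parquet("newindex.parquet")
```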
Another TODO at some point: Upload files to a better hosting platform, e.g. Zenodo or Zindi, to prevent issues with "too many downloads". |
JSON for DGS is parsed here:
|
And the JSON is created here:
|
Ah... "transcript ID". And they're not generated in the Python code, they're parsed from the source page via regex. |
https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md is a helpful guide for TFDS datasets. Also, apparently tfds.testing is a thing, used for example in dgs_corpus_test.py: https://tensorflow.google.cn/datasets/api_docs/python/tfds/testing |
https://tensorflow.google.cn/datasets/add_dataset?hl=en (actually, the helpful guide above is the source code for this page) |
Went and figured out how the index was created, and pushed an updated version of create_index.py: ShesterG#1. Now that I know the data a bit better, gonna move on to filling out the Builder class for JWSign |
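For orientation, a rough sketch of what that Builder class could look like, following the GeneratorBasedBuilder pattern from the TFDS guide; the feature names (guessed from the index keys) and the download step are my assumptions, not the actual implementation:

```python
import tensorflow_datasets as tfds


class JWSign(tfds.core.GeneratorBasedBuilder):
    """DatasetBuilder for JW Sign (sketch)."""

    VERSION = tfds.core.Version("1.0.0")

    def _info(self) -> tfds.core.DatasetInfo:
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                # Feature names guessed from the index keys above.
                "video_url": tfds.features.Text(),
                "verse_name": tfds.features.Text(),
                "verse_lang": tfds.features.Text(),
                "text": tfds.features.Text(),
            }),
        )

    def _split_generators(self, dl_manager: tfds.download.DownloadManager):
        # e.g. fetch the precomputed index via its "uc?id=" Drive link
        index_path = dl_manager.download(
            "https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf")
        return {"train": self._generate_examples(index_path)}

    def _generate_examples(self, index_path):
        # yield (unique_key, example_dict) pairs built from the index
        ...
```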
Not being familiar with
Of course the next question is how to make tests! |
OK, so even if you just follow https://tensorflow.google.cn/datasets/add_dataset?hl=en#test_your_dataset and add nothing, you'll still get some basic unit tests: |
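For reference, the boilerplate test file from that guide looks roughly like this for JW Sign (the module path and class names here are my assumptions):

```python
"""jw_sign dataset tests (sketch following the TFDS add_dataset guide)."""
import tensorflow_datasets as tfds
from . import jw_sign


class JWSignTest(tfds.testing.DatasetBuilderTestCase):
    """Tests for jw_sign, run against fake data in a dummy_data/ folder."""
    DATASET_CLASS = jw_sign.JWSign
    SPLITS = {"train": 3}  # expected number of fake examples per split


if __name__ == "__main__":
    tfds.testing.test_main()
```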
OK, testing procedure:
|
All right, after messing around with #56, it seems that by deleting this file I am then able to run pytest. I had also run into another weird issue, #57, where pytest hit an error while telling me what a different error was. Now I can finally make some more progress on the JW Sign dataset. Let's see if I can make a version that at least downloads the spoken-language text, and maybe build a much-simplified index for testing purposes. |
In order to iterate/test the dataset I will need to:
|
OK, I did
And then repeatedly edited, pytested, using the |
OK, I'm starting from scratch. I made a new fork, this time forking off of https://github.com/sign-language-processing/datasets so that I'm up to date. |
I want to see if I can make a completely basic text-only dataset to start. |
Apparently Google Drive doesn't play nice. When I try to use tfds' download manager, it turns out Google likes to pop up a "can't scan this for viruses" page, and that's what gets downloaded instead of the actual file.
Here's my Colab notebook playing with it: https://colab.research.google.com/drive/1EMKnpKrDUHxq5COFM6Acm7PqAQmkdvTS?usp=sharing |
Workaround: split the text dict into one for every spoken language |
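A sketch of that workaround; the {language: {verse: text}} layout and the file names here are assumptions, not the real structure:

```python
import json

# Hypothetical layout of the combined text dict.
all_texts = {
    "en": {"GEN 1:1": "In the beginning..."},
    "fr": {"GEN 1:1": "Au commencement..."},
}

# Write one small JSON file per spoken language, so each download stays
# below whatever size triggers Drive's "can't scan for viruses" page.
for lang, verses in all_texts.items():
    with open(f"verses_{lang}.json", "w", encoding="utf-8") as f:
        json.dump(verses, f, ensure_ascii=False, indent=2)
```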
To get the links for all 51 .json files:
1. Go to the folder: https://drive.google.com/drive/folders/1r-ftcljPRm1kLasqCK_cYL9zc4mxE6_o
2. Select all the files (you have to scroll a bit, because it only shows 50 by default)
3. Right-click -> Share -> Copy links
|
And of course I can split those one by one and get each link into a format that tfds can download... except, how do I re-associate the filename with the link? |
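The link-splitting part itself is mechanical; a sketch (the helper name is hypothetical):

```python
import re


def to_direct_link(share_link: str) -> str:
    """Turn a Drive share link into the "uc?id=" form that tfds can download."""
    match = re.search(r"/d/([\w-]+)|[?&]id=([\w-]+)", share_link)
    if not match:
        raise ValueError(f"no Drive file ID found in {share_link!r}")
    file_id = match.group(1) or match.group(2)
    return f"https://drive.google.com/uc?id={file_id}"


print(to_direct_link(
    "https://drive.google.com/file/d/1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf/view?usp=drive_link"))
# -> https://drive.google.com/uc?id=1LhyPOH6JrqmSYagL4SVHLW6SjwBhsnkf
```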
I suppose I could just add the key back in? Then I do still have to download all 51 files, but at least the relevant info will still be inside each one.
|
OK, going with that for now; we can compress them later. I just want to get something running |
With a bit of munging I was able to download all the files, read the code, and then create a |
Gonna have to call it for today, but I added some notes to jw_sign.py for next time. |
TODO: code to generate the .json files containing text for each spoken language, on demand. Those need to be re-scraped each time |
We should add resources from JW, like the Bible.