Add requirements and edit readme
damiaanr committed Jun 20, 2023
1 parent 857c9ee commit b59ef3f
Showing 2 changed files with 6 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -60,7 +60,7 @@ To provide a small additional overview of the quality of the dataset in terms of
A single listed language with a score of `1.000` means that every sample was detected as that language with full confidence (a perfect score). A single listed language with a score below perfect indicates that no other language was detected in any of the samples, but that Google was not always completely confident in its detection. In the latter case, considering that the authors of NewsCrawl deliberately scraped sentences of particular languages, the source set is highly likely to contain solely sentences of the intended source language. For other source languages, the language of some of the samples was detected to be a typologically similar one to the intended language that is often subject to *language-or-dialect* debates (*e.g.*, `cs` and `sk`, or `hr` and `bs`). However, in the case of `so` (Somali) and `rw` (Kinyarwanda), the source set turned out to contain a high number of English sentences and was therefore deemed less useful and consequently deleted from the dataset (including in the above specifications). While the original NewsCrawl dataset contains more noisy source sets (such as a large number of Ukrainian sentences being present in the Russian source set), these seem to have been effectively filtered using the provided cleaning steps (see above).

## 2. About the script: create a better version of GTNC!
-We encourage everyone to create newer versions of the dataset; either with a larger amount of samples (requiring more Cloud Translation API Credit) or using more recent versions of Google Translate or NewsCrawl(-like datasets). The code in this repository can be used out-of-the-box to create your own many-languages-to-one dataset. All code is thoroughly documented, type-annotated, and PEP8-compliant, and should be straightforward to understand. A `requirements.txt` file is provided to help you set up a working environment. In addition to the code, a couple of tips are shown below.
+We encourage everyone to create newer versions of the dataset; either with a larger amount of samples (requiring more Cloud Translation API Credit) or using more recent versions of Google Translate or NewsCrawl(-like datasets). The code in this repository can be used out-of-the-box to create your own many-languages-to-one dataset. All code is thoroughly documented, type-annotated, and PEP8-compliant, and should be straightforward to understand. A `requirements.txt` file is provided to help you set up a working environment. In addition to the code, a couple of tips are shown below. Note that the `evaluation` folder contains code from a stand-alone accuracy evaluation program based on WALS from which only part of the functionality is used to create the dataset.

### Connecting with Google's API services
Although Google provides documentation on [Authenticating and ‘how to start’](https://developers.google.com/people/quickstart/python) and on [how to use its Python library to translate text](https://cloud.google.com/python/docs/reference/translate/latest/google.cloud.translate_v3.services.translation_service.TranslationServiceClient#google_cloud_translate_v3_services_translation_service_TranslationServiceClient_translate_text), it may be handy to be aware of the following:
5 changes: 5 additions & 0 deletions requirements.txt
@@ -0,0 +1,5 @@
justext==3.0.0
matplotlib==3.6.2
numpy==1.23.4
protobuf==4.23.3
regex==2023.6.3
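
One possible way to confirm that a local environment matches the pins above is sketched below. It is not part of the repository: the helper names `parse_pins` and `check_pins` are hypothetical, and the sketch assumes every requirement uses the exact `name==version` form shown in this commit.

```python
from importlib import metadata

# Pins copied from the requirements.txt added in this commit.
REQUIREMENTS = """\
justext==3.0.0
matplotlib==3.6.2
numpy==1.23.4
protobuf==4.23.3
regex==2023.6.3
"""


def parse_pins(text):
    """Return {package: version} for every 'name==version' line (hypothetical helper)."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue  # assumption: only exact pins are of interest
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {package: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    for name, (pinned, installed) in check_pins(parse_pins(REQUIREMENTS)).items():
        print(f"{name}: pinned {pinned}, installed {installed or 'not installed'}")
```

Running the script prints nothing when every pinned package is installed at the pinned version; otherwise it lists each package that is missing or differs.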
