Note: Unless mentioned otherwise, input files are in CoNLL-U format.
- `all_genres_list.py`

  Takes the input `README.md` file for a treebank and returns the list of unique values listed in the `Genre` category of the machine-readable metadata. Multiple inputs are supported. Output defaults to stdout, from where it can be piped into a file if desired.

  Usage: `python3 all_genres_list.py <input_file(s)>`
- `average_sentence_length.py`

  Calculates the average sentence length in a given CoNLL-U file, given by the total number of syntactic words divided by the total number of sentences.

  Usage: `python3 average_sentence_length.py <input_file>`
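The computation can be sketched as follows; this is a minimal illustration, not the script itself. It counts only token lines whose first field is a plain integer ID, so multiword-token ranges (`4-5`) and empty nodes (`5.1`) are excluded from the word count:

```python
def average_sentence_length(path):
    """Average number of syntactic words per sentence in a CoNLL-U file."""
    words, sentences, in_sentence = 0, 0, False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line terminates a sentence
                if in_sentence:
                    sentences += 1
                    in_sentence = False
                continue
            if line.startswith("#"):          # comment/metadata line
                continue
            in_sentence = True
            token_id = line.split("\t", 1)[0]
            if token_id.isdigit():            # syntactic words only
                words += 1
    if in_sentence:                           # file may lack a trailing blank line
        sentences += 1
    return words / sentences if sentences else 0.0
```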
- `compare_treebank_bool.py`

  Determines whether the total number of sentences in a given input file exceeds 1000, returning `True` if it does and `False` otherwise.

  Usage: `python3 compare_treebank_bool.py <input_file>`
- `downsample.py`

  Downsamples a given CoNLL-U file to a given number of sentences, or to the number of sentences in another CoNLL-U formatted file.

  Arguments:
  - `-i`, `--input`: input file to be downsampled
  - `-n`, `--number`: number of sentences to downsample to
  - `-f`, `--file`: file whose number of instances the input file should be downsampled to
  - `-o`, `--output`: output file to write the downsampled data to; if not provided, defaults to `<input_file>_<downsampled_instances_count>.CoNLLu`
  - `-h`, `--help`: display a help message and exit

  Usage: `python3 downsample.py [-h] -i <input_file> (-n NUMBER | -f FILE) [-o <output_file>]`
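The core of the operation can be sketched as below. Random sampling of sentences is an assumption here (the script may instead truncate); the helper names are illustrative only:

```python
import random


def read_sentences(path):
    """Split a CoNLL-U file into sentence blocks (lists of raw lines)."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                current.append(line)
            elif current:                 # blank line closes a sentence block
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences


def downsample(sentences, n, seed=None):
    """Return a random sample of n sentences, preserving original order."""
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(sentences)), n))
    return [s for i, s in enumerate(sentences) if i in picked]
```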
- `get_coverage_scores.py`

  Calculates trigram coverage statistics, as a percentage of the trigrams in the target file. The file with the greater number of trigrams is selected as the source, while the other is selected as the target. Reports the score as the percentage of trigrams common to source and target, over the number of trigrams in the target.

  Arguments:
  - Arg1: file 1 in CoNLL-U format
  - Arg2: file 2 in CoNLL-U format

  Usage: `python3 get_coverage_scores.py <input_file1> <input_file2>`
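The selection and scoring logic can be sketched as follows, assuming POS trigrams extracted per sentence and set-based (unique-trigram) coverage; the function names are illustrative:

```python
def trigrams(pos_sequences):
    """Set of POS trigrams over a list of per-sentence POS tag lists."""
    grams = set()
    for tags in pos_sequences:
        grams.update(zip(tags, tags[1:], tags[2:]))
    return grams


def coverage(file_a_tags, file_b_tags):
    """Percentage of target trigrams that also occur in the source.

    The file with more unique trigrams acts as the source, the other
    as the target.
    """
    a, b = trigrams(file_a_tags), trigrams(file_b_tags)
    source, target = (a, b) if len(a) >= len(b) else (b, a)
    if not target:
        return 0.0
    return 100.0 * len(source & target) / len(target)
```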
- `get_formality_scores.py`

  NOT USED IN THE PIPELINE. Reads an input CoNLL-U file and calculates the F-score (as given by [Heylighen and Dewaele, 1999]), but using normalised frequencies. Can similarly be used to calculate the I-measure (as given by [Mosquera and Pozo, 2011]) by commenting out line 51 and including line 52.

  Usage: `python3 get_formality_scores.py <input_CoNLL-U_file>`
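For reference, the Heylighen and Dewaele F-score combines relative part-of-speech frequencies (as percentages of all tokens). A sketch over UD UPOS tags follows; the mapping of the original categories onto UPOS (articles to `DET`, prepositions to `ADP`) is an assumption, as is the exact tag inventory used by the script:

```python
from collections import Counter


def f_score(upos_tags):
    """F-score from a list of UPOS tags (frequencies as percentages).

    F = (noun + adjective + preposition + article
         - pronoun - verb - adverb - interjection + 100) / 2
    The UPOS mapping used here is an assumption.
    """
    counts = Counter(upos_tags)
    total = len(upos_tags)

    def freq(tag):
        return 100.0 * counts[tag] / total

    formal = freq("NOUN") + freq("ADJ") + freq("ADP") + freq("DET")
    deictic = freq("PRON") + freq("VERB") + freq("ADV") + freq("INTJ")
    return (formal - deictic + 100) / 2
```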
- `get_scores_with_sd.py`

  Calculates mean and standard deviation values when `theta_POS.py` is run multiple times on similar data (for example, 100 runs on slightly changing data). The first argument is the mode of the input file(s), followed by the input files from the second argument onwards. Multiple inputs are supported. Output defaults to stdout, from where it can be piped to the desired output file.

  Mode values:
  - `1`: TSV file with column 1 = field name, column 2 = values
  - `2`: file in which each field name and its values appear on separate lines, with an empty line separating field-name entries

  Usage: `python3 get_scores_with_sd.py <mode> <input_file(s)>`
- `get_unique_trigrams.py`

  Calculates, for a given input file:
  - the total number of POS trigrams
  - the total number of unique POS trigrams

  If a single `input_file` with a `.tsv` extension is given as the argument, the script instead reads the file and plots the first column as the x-values, with the subsequent columns plotted as lines in the graph. In all other cases, each input file is read as a CoNLL-U file and the values indicated above are computed for it.

  Usage: `python3 get_unique_trigrams.py <input_file(s)>`
- `kfold.py`

  Creates `test` and `training` folds for the given data, based on the number of folds given as an argument.

  Arguments:
  - Arg1: the number of folds (integer)
  - Arg2: input file in CoNLL-U format

  Usage: `python3 kfold.py <folds_count> <input_file>`
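The fold construction can be sketched as below; this shows only the index arithmetic (the real script reads and writes CoNLL-U sentences), and contiguous, non-shuffled folds are an assumption:

```python
def kfold_indices(n_sentences, k):
    """Yield (train, test) sentence-index lists for k contiguous folds.

    The first n_sentences % k folds get one extra sentence, so every
    sentence lands in exactly one test fold.
    """
    fold_sizes = [n_sentences // k + (1 if i < n_sentences % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_sentences)
                 if i < start or i >= start + size]
        yield train, test
        start += size
```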
- `klcpos3.py`

  Calculates the KLcpos3 measure between source and target treebanks, for single-source and multi-source weighted delexicalised parsing.

  Arguments:
  - `--source`: source candidate file(s), in CoNLL-U format
  - `--target`: target candidate file, in CoNLL-U format
  - `--single_source`: select a single source; the sources are displayed in decreasing order of the similarity measure
  - `--multi_source`: compute klcpos3^-4 as a similarity measure for weighted multi-source parsing; the output values are not normalised

  Usage: `python3 klcpos3.py [-h] -t <target_file> -s <source_file(s)> [--single_source | --multi_source]`
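KLcpos3 is the Kullback-Leibler divergence of the target treebank's coarse POS-trigram distribution from the source's. A sketch follows; the constant used to smooth trigrams unseen in the source is an assumption here, not necessarily what the script uses:

```python
import math
from collections import Counter


def klcpos3(target_trigrams, source_trigrams):
    """KL divergence of the target POS-trigram distribution from the source.

    Both arguments are iterables of POS trigram tuples. Trigrams unseen
    in the source are given a small pseudo-count (an assumption).
    """
    tgt = Counter(target_trigrams)
    src = Counter(source_trigrams)
    tgt_total = sum(tgt.values())
    src_total = sum(src.values())
    score = 0.0
    for gram, count in tgt.items():
        p = count / tgt_total
        q = src.get(gram, 0.5) / src_total   # smoothed source frequency
        score += p * math.log(p / q)
    return score
```

A lower score means the source distribution is closer to the target, which is why `--single_source` ranks sources by the derived similarity rather than the raw divergence.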
- `split_EDT_genres.py`, `split_fi_genres.py`, `split_PDT_genres.py` and `split_pl_genres.py`

  Split the given `input_file` into its constituent genres. Each takes a single CoNLL-U file as input.

  Usage: `python3 <python_file> <input_file>`
- `test_significance.py`

  For each given mean, tests how many of the other means it is statistically indistinguishable from, at the 95% confidence level.

  Usage: `python3 test_significance.py <input_file> <output_file>`
- `theta_POS.py`

  Reads the input file and calculates the symmetric metric θpos, which is the sum of the KLcpos3 scores calculated in either direction. Output defaults to stdout, from where it can be piped into a file.

  Input file format:

      file1 file2
      klcpos3(file1, file2) score
      klcpos3(file2, file1) score
      file1 file3
      klcpos3(file1, file3) score
      klcpos3(file3, file1) score
      file2 file3
      klcpos3(file2, file3) score
      klcpos3(file3, file2) score

  Output format (TSV, columns mark individual values):

      file1<space>file2	theta_pos_score
      file1<space>file3	theta_pos_score
      file2<space>file3	theta_pos_score
      ...

  Usage: `python3 theta_POS.py <input_file>`
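The pairing logic can be sketched as below, assuming the input format shown above (a `file1 file2` header line followed by the two directed `klcpos3` score lines); the function name is illustrative:

```python
def theta_pos_scores(lines):
    """Sum the two directed klcpos3 scores for each file pair.

    lines: iterable of input lines in the format described above.
    Returns a list of (pair, theta_pos) tuples, where pair is the
    header string, e.g. "file1 file2".
    """
    results, pair, total = [], None, 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("klcpos3"):
            total += float(line.split()[-1])   # directed score is last field
        else:                                  # header line starts a new pair
            if pair is not None:
                results.append((pair, total))
            pair, total = line, 0.0
    if pair is not None:
        results.append((pair, total))
    return results
```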
- `additive_genres_pl.sh`

  Batch file to check how multiple genres at the same time affect θpos scores.

- `genre_fi.sh` and `genre_pl.sh`

  Batch files to check the variance of θpos scores on an inter-genre basis in the Finnish-TDT and Polish-LFG treebanks respectively. For the Finnish data, the intra-genre variance is also calculated.

- `size_cs.sh` and `size_et.sh`

  Batch files to check the variance of θpos scores with the change in dataset size in the Czech-PDT and Estonian-EDT data respectively.

- `treebanks_to_compare.sh`

  Runs `compare_treebank_bool.py` on all the treebanks to indicate which treebanks should be compared with other treebanks of the same language. Stores the result in a TSV file, in which the treebanks of one language are separated from those of the next language by an empty line.

- `unique_trigrams_cs1.sh`, `unique_trigrams_cs2.sh` and `unique_trigrams_et.sh`

  Calculate the variance of the POS trigram counts in the Czech-PDT and Estonian-EDT data with the change in dataset size. `unique_trigrams_cs1.sh` and `unique_trigrams_cs2.sh` calculate the statistics for the Czech-PDT treebanks, and are split to allow parallel computation. Generates a `unique_trigrams` directory.