Note: Unless mentioned otherwise, input files are in CoNLL-U format.
- `all_genres_list.py`

  Takes the input `README.md` file for a treebank and returns the list of unique values listed in the `Genre` category of the machine-readable metadata. Multiple inputs are supported. Output defaults to stdout, from where it can be piped into a file if desired.

  Usage: `python3 all_genres_list.py <input_file(s)>`
- `average_sentence_length.py`

  Calculates the average sentence length in a given CoNLL-U file, given by the total number of syntactic words divided by the total number of sentences.

  Usage: `python3 average_sentence_length.py <input_file>`
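The computation can be sketched as follows; this is a minimal illustration, not the script itself. It counts only token lines whose first field is a plain integer ID, so multiword-token ranges (`4-5`) and empty nodes (`5.1`) are excluded from the word count:

```python
def average_sentence_length(path):
    """Average number of syntactic words per sentence in a CoNLL-U file."""
    words, sentences, in_sentence = 0, 0, False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line terminates a sentence
                if in_sentence:
                    sentences += 1
                    in_sentence = False
                continue
            if line.startswith("#"):          # comment/metadata line
                continue
            in_sentence = True
            token_id = line.split("\t", 1)[0]
            if token_id.isdigit():            # syntactic words only
                words += 1
    if in_sentence:                           # file may lack a trailing blank line
        sentences += 1
    return words / sentences if sentences else 0.0
```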
- `compare_treebank_bool.py`

  Determines whether the total number of sentences in a given input file exceeds 1000, returning `True` if it does and `False` otherwise.

  Usage: `python3 compare_treebank_bool.py <input_file>`
- `downsample.py`

  Downsamples a given CoNLL-U file to a given number of sentences, or to the number of sentences in another CoNLL-U formatted file.

  Arguments:
  - `-i`, `--input`: input file to be downsampled
  - `-n`, `--number`: number of sentences to downsample to
  - `-f`, `--file`: file whose number of instances the input file should be downsampled to
  - `-o`, `--output`: output file to write the downsampled data to; if not provided, defaults to `<input_file>_<downsampled_instances_count>.CoNLLu`
  - `-h`, `--help`: display a help message and exit

  Usage: `python3 downsample.py [-h] -i <input_file> (-n NUMBER | -f FILE) [-o <output_file>]`
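The core of the operation can be sketched as below. Random sampling of sentences is an assumption here (the script may instead truncate); the helper names are illustrative only:

```python
import random


def read_sentences(path):
    """Split a CoNLL-U file into sentence blocks (lists of raw lines)."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                current.append(line)
            elif current:                 # blank line closes a sentence block
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences


def downsample(sentences, n, seed=None):
    """Return a random sample of n sentences, preserving original order."""
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(sentences)), n))
    return [s for i, s in enumerate(sentences) if i in picked]
```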
- `get_coverage_scores.py`

  Calculates trigram coverage statistics, as a percentage of the trigrams in the target file. The file with the greater number of trigrams is selected as the source, while the other is selected as the target. Reports the score as the percentage of trigrams common to source and target, over the number of trigrams in the target.

  Arguments:
  - Arg1: file 1 in CoNLL-U format
  - Arg2: file 2 in CoNLL-U format

  Usage: `python3 get_coverage_scores.py <input_file1> <input_file2>`
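The selection and scoring logic can be sketched as follows, assuming POS trigrams extracted per sentence and set-based (unique-trigram) coverage; the function names are illustrative:

```python
def trigrams(pos_sequences):
    """Set of POS trigrams over a list of per-sentence POS tag lists."""
    grams = set()
    for tags in pos_sequences:
        grams.update(zip(tags, tags[1:], tags[2:]))
    return grams


def coverage(file_a_tags, file_b_tags):
    """Percentage of target trigrams that also occur in the source.

    The file with more unique trigrams acts as the source, the other
    as the target.
    """
    a, b = trigrams(file_a_tags), trigrams(file_b_tags)
    source, target = (a, b) if len(a) >= len(b) else (b, a)
    if not target:
        return 0.0
    return 100.0 * len(source & target) / len(target)
```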
- `get_formality_scores.py`

  NOT USED IN THE PIPELINE. Reads an input CoNLL-U file and calculates the F-score (as given by [Heylighen and Dewaele, 1999]), but using normalised frequencies. Can similarly be used to calculate the I-measure (as given by [Mosquera and Pozo, 2011]) by commenting out line 51 and including line 52.

  Usage: `python3 get_formality_scores.py <input_CoNLL-U_file>`
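For reference, the Heylighen and Dewaele F-score combines relative part-of-speech frequencies (as percentages of all tokens). A sketch over UD UPOS tags follows; the mapping of the original categories onto UPOS (articles to `DET`, prepositions to `ADP`) is an assumption, as is the exact tag inventory used by the script:

```python
from collections import Counter


def f_score(upos_tags):
    """F-score from a list of UPOS tags (frequencies as percentages).

    F = (noun + adjective + preposition + article
         - pronoun - verb - adverb - interjection + 100) / 2
    The UPOS mapping used here is an assumption.
    """
    counts = Counter(upos_tags)
    total = len(upos_tags)

    def freq(tag):
        return 100.0 * counts[tag] / total

    formal = freq("NOUN") + freq("ADJ") + freq("ADP") + freq("DET")
    deictic = freq("PRON") + freq("VERB") + freq("ADV") + freq("INTJ")
    return (formal - deictic + 100) / 2
```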
- `get_scores_with_sd.py`

  Calculates mean and standard deviation values when `theta_POS.py` is run multiple times on similar data (for example, 100 runs on slightly changing data). The first argument is the mode of the input file(s), followed by the input files from the second argument onwards. Multiple inputs are supported. Output defaults to stdout, from where it can be piped to the desired output file.

  Mode values:
  - `1`: TSV file with column 1 = field name, column 2 = values
  - `2`: file in which each field name and its values appear on separate lines, with an empty line separating field-name entries

  Usage: `python3 get_scores_with_sd.py <mode> <input_file(s)>`
- `get_unique_trigrams.py`

  Calculates, for a given input file:
  - the total number of POS trigrams
  - the total number of unique POS trigrams

  If a single `input_file` with a `.tsv` extension is given as the argument, the script instead reads the file and plots the first column as the x-values, with the subsequent columns plotted as lines in the graph. In all other cases, each input file is read as a CoNLL-U file and the values indicated above are computed for it.

  Usage: `python3 get_unique_trigrams.py <input_file(s)>`
- `kfold.py`

  Creates `test` and `training` folds for the given data, based on the number of folds given as an argument.

  Arguments:
  - Arg1: the number of folds (integer)
  - Arg2: input file in CoNLL-U format

  Usage: `python3 kfold.py <folds_count> <input_file>`
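The fold construction can be sketched as below; this shows only the index arithmetic (the real script reads and writes CoNLL-U sentences), and contiguous, non-shuffled folds are an assumption:

```python
def kfold_indices(n_sentences, k):
    """Yield (train, test) sentence-index lists for k contiguous folds.

    The first n_sentences % k folds get one extra sentence, so every
    sentence lands in exactly one test fold.
    """
    fold_sizes = [n_sentences // k + (1 if i < n_sentences % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_sentences)
                 if i < start or i >= start + size]
        yield train, test
        start += size
```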
- `klcpos3.py`

  Calculates the KLcpos3 measure between source and target treebanks, for single-source and multi-source weighted delexicalised parsing.

  Arguments:
  - `--source`: source candidate file(s), in CoNLL-U format
  - `--target`: target candidate file, in CoNLL-U format
  - `--single_source`: select a single source; the sources are displayed in decreasing order of the similarity measure
  - `--multi_source`: compute klcpos3^-4 as a similarity measure for weighted multi-source parsing; the output values are not normalised

  Usage: `python3 klcpos3.py [-h] -t <target_file> -s <source_file(s)> [--single_source | --multi_source]`
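KLcpos3 is the Kullback-Leibler divergence of the target treebank's coarse POS-trigram distribution from the source's. A sketch follows; the constant used to smooth trigrams unseen in the source is an assumption here, not necessarily what the script uses:

```python
import math
from collections import Counter


def klcpos3(target_trigrams, source_trigrams):
    """KL divergence of the target POS-trigram distribution from the source.

    Both arguments are iterables of POS trigram tuples. Trigrams unseen
    in the source are given a small pseudo-count (an assumption).
    """
    tgt = Counter(target_trigrams)
    src = Counter(source_trigrams)
    tgt_total = sum(tgt.values())
    src_total = sum(src.values())
    score = 0.0
    for gram, count in tgt.items():
        p = count / tgt_total
        q = src.get(gram, 0.5) / src_total   # smoothed source frequency
        score += p * math.log(p / q)
    return score
```

A lower score means the source distribution is closer to the target, which is why `--single_source` ranks sources by the derived similarity rather than the raw divergence.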
- `split_EDT_genres.py`, `split_fi_genres.py`, `split_PDT_genres.py` and `split_pl_genres.py`

  Split the given `input_file` into its constituent genres. Each takes a single CoNLL-U file as input.

  Usage: `python3 <python_file> <input_file>`
- `test_significance.py`

  For each given mean, tests how many of the other means it is statistically indistinguishable from, at the 95% confidence level.

  Usage: `python3 test_significance.py <input_file> <output_file>`
- `theta_POS.py`

  Reads the input file and calculates the symmetric metric θpos, which is the sum of the KLcpos3 scores calculated in either direction. Output defaults to stdout, from where it can be piped into a file.

  Input file format:

      file1 file2
      klcpos3(file1, file2) score
      klcpos3(file2, file1) score
      file1 file3
      klcpos3(file1, file3) score
      klcpos3(file3, file1) score
      file2 file3
      klcpos3(file2, file3) score
      klcpos3(file3, file2) score

  Output format (TSV, columns mark individual values):

      file1<space>file2	theta_pos_score
      file1<space>file3	theta_pos_score
      file2<space>file3	theta_pos_score
      ...

  Usage: `python3 theta_POS.py <input_file>`
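The pairing logic can be sketched as below, assuming the input format shown above (a `file1 file2` header line followed by the two directed `klcpos3` score lines); the function name is illustrative:

```python
def theta_pos_scores(lines):
    """Sum the two directed klcpos3 scores for each file pair.

    lines: iterable of input lines in the format described above.
    Returns a list of (pair, theta_pos) tuples, where pair is the
    header string, e.g. "file1 file2".
    """
    results, pair, total = [], None, 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("klcpos3"):
            total += float(line.split()[-1])   # directed score is last field
        else:                                  # header line starts a new pair
            if pair is not None:
                results.append((pair, total))
            pair, total = line, 0.0
    if pair is not None:
        results.append((pair, total))
    return results
```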
- `additive_genres_pl.sh`

  Batch file to check how multiple genres at the same time affect θpos scores.

- `genre_fi.sh` and `genre_pl.sh`

  Batch files to check the variance of θpos scores on an inter-genre basis in the Finnish-TDT and Polish-LFG treebanks respectively. For the Finnish data, the intra-genre variance is also calculated.

- `size_cs.sh` and `size_et.sh`

  Batch files to check the variance of θpos scores with the change in dataset size in the Czech-PDT and Estonian-EDT data respectively.

- `treebanks_to_compare.sh`

  Runs `compare_treebank_bool.py` on all the treebanks to indicate which treebanks should be compared with other treebanks of the same language. Stores the result in a TSV file, in which the treebanks of one language are separated from those of the next language by an empty line.

- `unique_trigrams_cs1.sh`, `unique_trigrams_cs2.sh` and `unique_trigrams_et.sh`

  Calculate the variance of the POS trigram counts in the Czech-PDT and Estonian-EDT data with the change in dataset size. `unique_trigrams_cs1.sh` and `unique_trigrams_cs2.sh` calculate the statistics for the Czech-PDT treebanks, and are split to allow parallel computation. Generates a `unique_trigrams` directory.