Programs included
In this package, we have the following programs:
- opf_split: This program randomly splits the dataset into training, evaluation and test sets.
- opf_train: This program executes the training phase of the OPF classifier proposed in [PapaIJIST09,PapaPR12].
- opf_learn: This program executes the learning phase, which learns from classification errors in the evaluation set, for the OPF classifier proposed in [PapaIJIST09,PapaPR12]. It substitutes opf_train.
- opf_classify: This program executes the test phase by classifying the test set with the OPF classifier proposed in [PapaIJIST09,PapaPR12].
- opfknn_train: This program executes the training phase of the OPF classifier proposed in [PapaISVC08].
- opfknn_classify: This program executes the test phase by classifying the test set with the OPF classifier proposed in [PapaISVC08].
- opf_accuracy: This program computes the accuracy over the training and/or test set.
- opf_accuracy4label: This program computes the accuracy over the training and/or test set for each label.
- opf_cluster: This program computes clusters by OPF. When the training set is unlabeled, it assigns consecutive numbers from 1 to N to the N clusters. Otherwise, it propagates the true labels of the roots to the nodes in their respective trees in order to evaluate the quality of the clustering. The resulting classifier is written to classifier.opf.
- opf_distance: This program computes distance functions and stores them in a precomputed distance file.
- opf_normalize: This program normalizes datasets.
- opf_info: This program retrieves basic information about OPF files, such as the dataset size and the number of labels and features.
- opf_fold: This program partitions the dataset into k folds.
- opf_merge: This program merges the folds, and it can be used together with the opf_fold program.
Usage: opf_split <P1> <P2> <P3> <P4> <P5>
P1: dataset in the OPF file format
P2: percentage of the training set size [0,1]
P3: percentage of the evaluation set size [0,1] (leave 0 in the case of no learning)
P4: percentage of the test set size [0,1]
P5: normalize features? 1 - Yes 0 - No
The sum P2 + P3 + P4 must be 1.
The features are normalized with the following equation:
N_i = (F_i - M_i)/S_i,
where F_i, M_i and S_i are, respectively, the feature i, the average of F_i and the standard deviation of F_i in the dataset.
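The normalization equation above can be illustrated with a minimal Python sketch. The function name `normalize_features`, the use of the population standard deviation (dividing by the number of samples) and the handling of constant features are assumptions made for illustration; the program itself may differ in these details.

```python
import math

def normalize_features(samples):
    """Apply N_i = (F_i - M_i) / S_i to every feature column."""
    n = len(samples)
    n_features = len(samples[0])
    # M_i: average of feature i over the dataset.
    means = [sum(row[i] for row in samples) / n for i in range(n_features)]
    # S_i: standard deviation of feature i (population form, an assumption).
    stds = [
        math.sqrt(sum((row[i] - means[i]) ** 2 for row in samples) / n)
        for i in range(n_features)
    ]
    # Constant features (S_i = 0) are mapped to 0.0 to avoid dividing by zero
    # (also an assumption).
    return [
        [(row[i] - means[i]) / stds[i] if stds[i] > 0 else 0.0
         for i in range(n_features)]
        for row in samples
    ]

data = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
normalized = normalize_features(data)
```

After normalization each non-constant feature has zero mean and unit standard deviation over the dataset.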
The program splits the dataset into two new files, training.opf and testing.opf, when P3 = 0, and it splits the dataset into three files, training.opf, evaluating.opf and testing.opf, otherwise.
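The role of P2, P3 and P4 can be sketched in Python as follows. This is only an illustration: `split_dataset` and its `seed` parameter are hypothetical names, the sketch ignores class labels, and the way opf_split actually draws the samples is not described here.

```python
import random

def split_dataset(samples, p_train, p_eval, p_test, seed=0):
    """Randomly split samples into training, evaluation and test lists."""
    # The three fractions must sum to 1, as required for P2 + P3 + P4.
    if abs(p_train + p_eval + p_test - 1.0) > 1e-9:
        raise ValueError("P2 + P3 + P4 must be 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(p_train * len(shuffled)))
    n_eval = int(round(p_eval * len(shuffled)))
    train = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_eval]
    # The test set receives whatever remains after the first two parts.
    test = shuffled[n_train + n_eval:]
    return train, evaluation, test

train, evaluation, test = split_dataset(range(10), 0.5, 0.2, 0.3)
```

With `p_eval` set to 0 the evaluation list is empty, which mirrors the case where only training.opf and testing.opf are produced.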
Usage: opf_train <P1> <P2>
P1: training set in the OPF file format
P2: precomputed distance file (leave it blank if you are not using this resource)
The program designs a classifier from training.opf and writes it to a file named classifier.opf, which is used by opf_classify for testing.
opf_train also outputs the following files:
- .out: it contains the predicted labels (training phase)
- .time: it contains the execution time in seconds (training phase)
- .acc: it contains the accuracy (training phase)
Usage: opf_learn <P1> <P2> <P3>
P1: training set in the OPF file format
P2: evaluation set in the OPF file format
P3: precomputed distance file (leave it blank if you are not using this resource)
The program substitutes opf_train when there is an evaluation set. It learns from the classification errors in the evaluation set without increasing the training set size, and writes the final classifier to a file named classifier.opf, which is used for testing by the program opf_classify.
opf_learn outputs the following file:
- .time: it contains the execution time in seconds (learning phase)
Usage: opf_classify <P1> <P2>
P1: test/training set in the OPF file format
P2: precomputed distance file (leave it blank if you are not using this resource)
opf_classify outputs the following files:
- .out: it contains the predicted labels (test phase)
- .time: it contains the execution time in seconds (test phase)
Usage: opfknn_train <P1> <P2> <P3>
P1: training set in the OPF file format
P2: kmax (maximum value for the k-neighborhood)
P3: precomputed distance file (leave it blank if you are not using this resource)
The program designs a classifier from training.opf and writes it to a file named classifier.opf, which is used by opfknn_classify for testing.
opfknn_train also outputs the following files:
- .out: it contains the predicted labels (training phase)
- .time: it contains the execution time in seconds (training phase)
- .acc: it contains the accuracy (training phase)
Usage: opfknn_classify <P1> <P2>
P1: test/training set in the OPF file format
P2: precomputed distance file (leave it blank if you are not using this resource)
opfknn_classify outputs the following files:
- .out: it contains the predicted labels (test phase)
- .time: it contains the execution time in seconds (test phase)
Usage: opf_accuracy <P1>
P1: data set in the OPF file format
opf_accuracy looks for a classified file with the same name as the data set file in P1 and the extension ".out" in order to compute the accuracy of that classification. It outputs a text file with the same name and the extension ".acc".
Usage: opf_accuracy4label <P1>
P1: data set in the OPF file format
opf_accuracy4label looks for a classified file with the same name as the data set file in P1 and the extension ".out" in order to compute the accuracy of that classification. It outputs a text file with the same name and the extension ".acc".
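The idea of an accuracy computed for each label can be sketched in Python as below. This sketch assumes the plain fraction of correctly predicted samples per true label; the exact measure computed by opf_accuracy and opf_accuracy4label is not specified here and may differ. The function name `accuracy_per_label` and the in-memory label lists are hypothetical stand-ins for the data set and its ".out" file.

```python
from collections import defaultdict

def accuracy_per_label(true_labels, predicted_labels):
    """Return {label: fraction of samples of that label predicted correctly}."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == predicted:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in sorted(totals)}

per_label = accuracy_per_label([1, 1, 2, 2], [1, 2, 2, 2])
```

In this toy case one of the two samples of label 1 is misclassified, while both samples of label 2 are correct.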
Usage: opf_cluster <P1> <P2> <P3> <P4> <P5>
P1: unlabeled data set in the OPF file format
P2: kmax (maximum degree for the knn graph)
P3: 0 (height), 1 (area) and 2 (volume)
P4: value of parameter P3 (integer) in (0-1)
P5: precomputed distance file (leave it blank if you are not using this resource)
P3 selects the criterion (height, area or volume) used to remove maxima from the probability density function (pdf).
Note: opf_cluster outputs the k value that minimized the cut in the graph, as well as the number of clusters obtained and a classifier written to the file classifier.opf. The predicted labels of the samples are also written to a ".out" file.
Usage: opf_knn_classify <P1> <P2>
P1: test/training set in the OPF file format
P2: precomputed distance file (leave it blank if you are not using this resource)
The opf_knn_classify outputs the following files:
- .out: it contains the predicted labels (test phase)
- .time: it contains the execution time in seconds (test phase)
One of the most important characteristics of the OPF classifier is that it can work with any distance function. The default is the Euclidean metric. The user can run the program opf_distance with the following distance function options.
Usage: opf_distance <P1> <P2> <P3>
P1: Dataset in the OPF file format
P2: Distance ID
1 - Euclidean
2 - Chi-Square
3 - Manhattan (L1)
4 - Canberra
5 - Squared Chord
6 - Squared Chi-Squared
7 - Bray-Curtis
P3: Distance normalization? 1 - Yes 0 - No
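For reference, four of the listed distances can be sketched in Python using their common textbook definitions. These are illustrative functions only, not the code used by opf_distance, whose implementations may differ in detail (for example in how zero denominators are handled, which is an assumption here).

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    """Sum of |x - y| / (|x| + |y|); terms with a zero denominator are skipped."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total

def bray_curtis(a, b):
    """Sum of |x - y| divided by the sum of |x + y|; 0.0 if the divisor is zero."""
    denom = sum(abs(x + y) for x, y in zip(a, b))
    if denom == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denom
```

The chi-square and squared-chord variants are left out because their definitions vary between references.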
The program computes the selected distance function between every pair of samples in the dataset and outputs a precomputed distance file (distances.dat). The sample identifier in the dataset is used here. The distance values may or may not be normalized, depending on P3. Users can also create their own distance file. The binary file format is:
<# of samples>
<Distance from sample 0 to sample 0> <Distance from sample 0 to sample 1> ...
<Distance from sample 1 to sample 0> <Distance from sample 1 to sample 1> ...
.
.
<Distance from sample n-1 to sample 0> <Distance from sample n-1 to sample 1> ...
Comment #1: Note that the file is an N x N matrix of distance values. It must be binary, with no blank spaces. The ASCII representation above is just for illustration.
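A user-made distance file following the layout above could be written with a short Python sketch like the one below. The data types are assumptions that the text above does not specify: the number of samples is written as a 32-bit integer and each distance as a 32-bit float, both in native byte order. Check them against the library source before relying on this. The function names are hypothetical.

```python
import struct

def write_distance_file(path, matrix):
    """Write an N x N distance matrix as: <N>, then N rows of N distances."""
    n = len(matrix)
    with open(path, "wb") as f:
        # Assumption: the sample count is a native 32-bit integer.
        f.write(struct.pack("=i", n))
        for row in matrix:
            if len(row) != n:
                raise ValueError("the distance matrix must be square")
            # Assumption: each distance is a native 32-bit float.
            f.write(struct.pack(f"={n}f", *row))

def read_distance_file(path):
    """Read back a file written by write_distance_file."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("=i", f.read(4))
        return [list(struct.unpack(f"={n}f", f.read(4 * n))) for _ in range(n)]
```

Under these assumptions, a file for N samples is exactly 4 + 4 * N * N bytes long, with no separators between values.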
Users who have their own datasets and do not need opf_split may still need to normalize the data. For that purpose there is the opf_normalize program, which employs the same normalization process used by opf_split.
Usage: opf_normalize <P1> <P2>
P1: input dataset in the OPF file format
P2: normalized output dataset in the OPF file format
opf_info retrieves basic information about OPF files, such as the dataset size and the number of labels and features.
Usage: opf_info <P1>
P1: OPF file
Users who need k-fold cross validation can use the opf_fold program, which partitions the dataset into k folds. The folds can then be merged with the opf_merge program.
Usage: opf_fold <P1> <P2> <P3>
P1: input dataset in the OPF file format
P2: k
P3: normalize features? 1 - Yes 0 - No
opf_merge merges n folds for a k-fold cross validation.
Usage: opf_merge <P1> <P2> ... <Pn>
P1: input dataset 1 in the OPF file format
P2: input dataset 2 in the OPF file format
Pn: input dataset n in the OPF file format
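The way folding and merging fit together in a k-fold cross validation can be sketched in Python. This is an in-memory illustration with hypothetical function names; it ignores class labels, whereas opf_fold and opf_merge work on OPF files and may distribute the samples differently.

```python
import random

def make_folds(samples, k, seed=0):
    """Shuffle the samples and deal them into k folds of near-equal size."""
    if k < 2 or k > len(samples):
        raise ValueError("k must be between 2 and the number of samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th sample starting at position i.
    return [shuffled[i::k] for i in range(k)]

def merge_folds(folds):
    """Concatenate several folds into one dataset."""
    merged = []
    for fold in folds:
        merged.extend(fold)
    return merged

folds = make_folds(range(10), k=5)
for i, test_fold in enumerate(folds):
    # Round i: fold i is the test set, the other k - 1 folds form the training set.
    training = merge_folds(folds[:i] + folds[i + 1:])
    assert len(training) + len(test_fold) == 10
```

In each round, one fold is held out for testing and the remaining k - 1 folds are merged into a single training set.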