Basic Configurations

Mode

First, choose what you want to do with your texts.

analyze or empty: If you only want to analyze your texts, choose analyze or leave the mode empty. The features of your texts are written to a file in tsv format.
cluster: If you want to find clusters of similar texts in your corpus, choose cluster.
train_classifier: If you have texts associated with two or more classes and want to train or evaluate a classifier on them, choose train_classifier. For further instructions on how to train a model, see models.
classify: If you have already trained a classifier, you can label a text with unknown class by choosing classify. For instructions on how to load a pretrained model, see models.
train_linear_regressor: If you have texts associated with continuous scores and want to train or evaluate a linear regressor on them, choose train_linear_regressor. For further instructions on how to train a model, see models.
score: If you have already trained a linear regression model, you can score a text with unknown rating by choosing score. For instructions on how to load a pretrained model, see models.
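
As a rough illustration of the mode rules above, the following sketch normalizes a mode value. The mode names come from the list above; the helper resolve_mode is hypothetical and not part of register.

```python
# Mode names as listed above; the helper itself is hypothetical.
VALID_MODES = {
    "analyze",
    "cluster",
    "train_classifier",
    "classify",
    "train_linear_regressor",
    "score",
}


def resolve_mode(mode):
    """Return the effective mode; an empty or missing mode means analyze."""
    if mode is None or mode.strip() == "":
        return "analyze"
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return mode


print(resolve_mode(""))
print(resolve_mode("train_classifier"))
```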

Corpora

Next, specify the texts or corpora you want to process with register. You can use multiple corpora. For each one, you have to define the path (path), which points either to a folder containing txt files or to a tsv file with one column 'id' for the text ids and one column 'text' for the actual texts. You also have to define the corpus language (lang). register lets you compare corpora from different languages, but keep in mind that this may not be meaningful for all feature packages (e.g. for word n-grams). If you want to give your corpus a name other than its folder name, set name (make sure that no two corpora share the same name).

path: path to corpus, folder with text files or tsv file (obligatory!)
lang: ISO language code of the corpus (obligatory!)
name: corpus name (optional)
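
The corpus rules above can be sketched as a small check. The keys path, lang and name are taken from this section; the function corpus_names, the example paths, and the use of the last path component as the default name for tsv files are assumptions made for illustration only.

```python
import os


def corpus_names(corpora):
    """Check required keys and return the effective name of each corpus."""
    names = []
    for corpus in corpora:
        for key in ("path", "lang"):
            if not corpus.get(key):
                raise ValueError(f"corpus entry is missing the required key '{key}'")
        # Assumption: without an explicit name, the last path component is used.
        default_name = os.path.basename(os.path.normpath(corpus["path"]))
        names.append(corpus.get("name") or default_name)
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate corpus names: {duplicates}")
    return names


# Hypothetical corpus entries: one folder of txt files, one tsv file.
corpora = [
    {"path": "corpora/news", "lang": "de"},
    {"path": "corpora/blogs.tsv", "lang": "en", "name": "blogs"},
]
print(corpus_names(corpora))
```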

Further, you can specify a class associated with the corpus (class) or point to a tsv file containing the text ids and their classes (or scores for linear regression) via path_targets. You can define train/test splits either by setting set to train, test or apply, or by pointing to a tsv file containing the set for each file via path_train_test_splits. For further instructions see models.

class: class/label associated with a corpus
path_targets: path to a tsv file containing the text id (e.g., document name) and the associated classes or scores in class/score
set: train, test or apply sets for training a model
path_train_test_splits: path to a tsv file containing the text id and the associated sets (train or test)
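
A minimal sketch of corpus entries for a two-class setup, using the keys class and set from the list above. The helper labeled_corpus, the paths and the labels are hypothetical, and how such entries are nested inside the full configuration file is not shown here; see the example configuration files for the actual layout.

```python
import json

# Set names as listed above.
VALID_SETS = {"train", "test", "apply"}


def labeled_corpus(path, lang, label, subset):
    """Build one corpus entry with a class label and a train/test/apply set."""
    if subset not in VALID_SETS:
        raise ValueError(f"set must be one of {sorted(VALID_SETS)}, got {subset!r}")
    return {"path": path, "lang": lang, "class": label, "set": subset}


# Hypothetical corpora: two classes for training, one corpus held out for testing.
corpora = [
    labeled_corpus("corpora/news_train", "de", "news", "train"),
    labeled_corpus("corpora/blogs_train", "de", "blog", "train"),
    labeled_corpus("corpora/news_test", "de", "news", "test"),
]
print(json.dumps(corpora, indent=2))
```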

Feature Packages

There are different feature packages you can choose from:

character n-grams
token n-grams
span n-grams
orthography
metrics
morphology
syntax
named entities
embeddings
emotion
formality (for German)

Further Configurations

Output

You can specify the path to the folder for the output of register in base_dir (default is the folder you are running the program from). If you train a model, you can get explanations for the model's output using SHAP by setting explanation_shap to true.
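
The output options could be resolved roughly as follows. The keys base_dir and explanation_shap are from the paragraph above; the helper resolve_output_options is hypothetical, and treating explanation_shap as false when it is not set is an assumption.

```python
import os


def resolve_output_options(config):
    """Fill in output settings that the configuration leaves unset."""
    # Default from the text above: the folder the program is run from.
    base_dir = config.get("base_dir") or os.getcwd()
    # Assumption: SHAP explanations are off unless explicitly enabled.
    explanation_shap = bool(config.get("explanation_shap", False))
    return {
        "base_dir": os.path.abspath(base_dir),
        "explanation_shap": explanation_shap,
    }


print(resolve_output_options({}))
print(resolve_output_options({"base_dir": "results", "explanation_shap": True}))
```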

Models

For the available machine-learning models, see models.

Languages

For specific configurations regarding the chosen languages see language.

Vectorizing and Scaling

For the available vectorizers and scaling options, see vectorizers and scaling.

Examples

Configure register via the config.json file in the src directory or via another json file, which you have to pass as an argument to register. See simple_config.json or advanced_config.json for a simple or advanced configuration example. Other examples are config_pt16.json or config_c18.json.
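
To illustrate working with a custom json file, the sketch below writes a small configuration to a temporary folder and reads it back. The file name my_config.json, the helper load_config, and the assumption that the configuration is a JSON object with mode and base_dir as top-level keys are illustrative only; the example files named above show the actual layout.

```python
import json
import os
import tempfile


def load_config(path):
    """Read a json configuration file and return it as a dictionary."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    # Assumption: the configuration is a JSON object at the top level.
    if not isinstance(config, dict):
        raise ValueError("expected a JSON object at the top level")
    return config


# Hypothetical minimal configuration, written and loaded again.
example = {"mode": "analyze", "base_dir": "output"}
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "my_config.json")
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump(example, handle, indent=2)
    loaded = load_config(config_path)

print(loaded)
```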
