NOTE: The code has been updated to work with newer versions of the pytorch and flair packages. Please note the new requirements, in particular:
- pytorch version 1.10.1
- flair version 0.10

The code will no longer work with the old requirements (the original models depended on deprecated pytorch versions and could only be used with Python 3.7.5 and Flair 0.1.0). The old version of the tagger code has been archived as release v1.0.0; for that version, Christof Schöch helpfully provided instructions on how to get the tagger to run in a virtual environment.
Please use GitHub's issue tracker if you encounter any problems.
These recognizers were developed by the DFG-funded project "Redewiedergabe - eine literatur- und sprachwissenschaftliche Korpusanalyse" (Leibniz Institute for the German Language / University of Würzburg, www.redewiedergabe.de) and (mostly) trained on data from Corpus Redewiedergabe.
They can be used to automatically detect and annotate the following four types of speech, thought and writing representation (STWR) in German texts.
STWR type | Example | Translation |
---|---|---|
direct | Dann sagte er: "Ich habe Hunger." | Then he said: "I'm hungry." |
free indirect ('erlebte Rede') | Er war ratlos. Woher sollte er denn hier bloß ein Mittagessen bekommen? | He was at a loss. Where should he ever find lunch here? |
indirect | Sie fragte, wo das Essen sei. | She asked where the food was. |
reported | Sie sprachen über das Mittagessen. | They talked about lunch. |
For more detailed descriptions of these STWR types please refer to the Redewiedergabe annotation guidelines (in German).
The recognizers are based on deep learning and utilize the FLAIR NLP framework.
Main Publication (please cite when using the recognizers):
Annelen Brunner, Ngoc Duyen Tanja Tu, Lukas Weimer, Fotis Jannidis: To BERT or not to BERT – Comparing contextual embeddings in a deep learning architecture for the automatic recognition of four types of speech, thought and writing representation, Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS), Zurich, Switzerland, June 23-25, 2020.
Other Publications:
Annelen Brunner, Ngoc Duyen Tanja Tu, Lukas Weimer, Fotis Jannidis: Deep learning for Free Indirect Representation, Proceedings of KONVENS 2019, Erlangen, Germany, pp. 241-245.
- Get our models
- Get our custom-trained language embeddings (BERT and FLAIR)
- Set up the recognizer
- Note: As an alternative to using the code provided here, you can also run our models directly via the FLAIR framework (see the sketch below)!
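As a minimal sketch of this route, the models can be loaded through FLAIR's model zoo. The model identifier used here is an assumption for illustration; check the FLAIR documentation for the exact registered names:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Model identifier assumed for illustration; see the FLAIR docs for the actual name.
tagger = SequenceTagger.load('de-historic-direct')

sentence = Sentence('Dann sagte er: "Ich habe Hunger."')
tagger.predict(sentence)
print(sentence.to_tagged_string())
```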
Each STWR type is recognized by a separate model. The downloads are zip archives; simply unpack them and move the folders into the directory rwtagger/models.
All models are named final-model.pt, and the name of the subfolder is used to locate the model you want to use. The default models are stored in directories named after their STWR type (direct, indirect, reported or freeIndirect).
The subfolders for alternative models have different names. If you want to use them, you have to edit the file rwtagger/config.txt.
Example: To use the direct model based on BERT embeddings, first download and unpack this alternative model. It is stored in a folder named direct_BERT. Move the folder into the directory rwtagger/models. Then add the following line to the file rwtagger/config.txt:
```
direct@direct_BERT
```
When you run the rwtagger script again, the BERT model will be used to recognize direct STWR instead of the default.
These are the models discussed in the KONVENS 2020 paper.
All models first encode the text with a customized language embedding (depending on the model; see the tables below) and were then trained for their STWR task using a deep learning architecture with two BiLSTM layers and one CRF layer.
The recognizers work on a token basis, and the scores are calculated per token as well.
Each model recognizes one specific type of STWR in a binary classification ("direct" vs. "x", "indirect" vs. "x", etc.).
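For orientation, here is a minimal sketch of how such an architecture (embeddings, two BiLSTM layers, CRF) can be assembled in flair 0.10. This is not the project's actual training code; the data folder, column layout, hidden size and the generic German word embeddings are assumptions for illustration:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger

# Token-level corpus with a binary 'direct' column ('direct' vs. 'x');
# the data folder and column layout are hypothetical.
corpus = ColumnCorpus('data/', {0: 'text', 1: 'direct'})

# Stand-ins for the custom-trained skipgram vectors, stacked with the
# custom-trained historical FLAIR embeddings (see below).
embeddings = StackedEmbeddings([
    WordEmbeddings('de'),
    FlairEmbeddings('de-historic-rw-forward'),
    FlairEmbeddings('de-historic-rw-backward'),
])

tagger = SequenceTagger(
    hidden_size=256,  # assumed; the exact value is not reported here
    embeddings=embeddings,
    tag_dictionary=corpus.make_label_dictionary(label_type='direct'),
    tag_type='direct',
    rnn_layers=2,     # two BiLSTM layers
    use_crf=True,     # CRF output layer
)
# Training would then proceed via flair's ModelTrainer.
```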
The training, validation and test corpora used to train and evaluate the taggers for direct, indirect and reported STWR are available here. Unfortunately, we cannot provide the exact data for the free indirect model due to copyright restrictions.
These are the best performing models as presented in the KONVENS 2020 paper. They are considered the default models for the recognizers.
Package with all 4 STWR models at once (~3 GB)
STWR type | F1 | Precision | Recall | Language embedding | Training and Test material | Download |
---|---|---|---|---|---|---|
direct | 0.85 | 0.93 | 0.78 | Skipgram with 500 dimensions & FLAIR embeddings (both custom trained) | historical German (19th to early 20th century), fiction and non-fiction | Direct model (~1.6 GB) |
indirect | 0.76 | 0.81 | 0.71 | BERT (custom finetuned) | historical German (19th to early 20th century), fiction and non-fiction | Indirect model (~460 MB) |
reported | 0.60 | 0.67 | 0.54 | BERT (custom finetuned) | historical German (19th to early 20th century), fiction and non-fiction | Reported model (~460 MB) |
free indirect | 0.59 | 0.78 | 0.47 | BERT (custom finetuned) | historical and modern German (late 19th century to current), only fiction | Free indirect model (~460 MB) |
As an alternative, we also provide the most successful models using an alternative language embedding. These were used in the comparisons in the KONVENS 2020 paper.
STWR type | F1 | Precision | Recall | Language embedding | Training and Test material | Download |
---|---|---|---|---|---|---|
direct | 0.80 | 0.87 | 0.74 | BERT (custom finetuned) | historical German (19th to early 20th century), fiction and non-fiction | Direct model (~ 460 MB) |
indirect | 0.74 | 0.77 | 0.71 | Skipgram with 300 dimensions & FLAIR embeddings (both custom trained) | historical German (19th to early 20th century), fiction and non-fiction | Indirect model (~788 MB) |
reported | 0.58 | 0.69 | 0.50 | Skipgram with 500 dimensions & FLAIR embeddings (both custom trained) | historical German (19th to early 20th century), fiction and non-fiction | Reported model (~1.6 GB) |
free indirect | 0.51 | 0.87 | 0.36 | Skipgram with 300 dimensions & FLAIR embeddings (both custom trained) | historical and modern German (late 19th century to current), only fiction | Free indirect model (~788 MB) |
Historical German (19th to early 20th century, fiction and non-fiction) was used for customizing/finetuning the language embeddings used by the recognizer modules.
- The fine-tuned BERT model is available at huggingface.co: https://huggingface.co/redewiedergabe/bert-base-historical-german-rw-cased (a loading sketch follows below)
- The custom-trained FLAIR model was integrated into the FLAIR framework and can be loaded with the following lines of code:

```python
from flair.embeddings import FlairEmbeddings

forward_embeddings = FlairEmbeddings('de-historic-rw-forward')
backward_embeddings = FlairEmbeddings('de-historic-rw-backward')
```
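For reference, a minimal sketch of loading the fine-tuned BERT model via the Hugging Face transformers library (standard AutoModel/AutoTokenizer usage, not code from this repository):

```python
from transformers import AutoModel, AutoTokenizer

# Model ID as published on huggingface.co
model_name = "redewiedergabe/bert-base-historical-german-rw-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```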
This GitHub repository contains scripts that handle data input and output and optionally calculate test scores. They allow you to run the recognizers from the command line.
For this, a Python environment with the necessary modules has to be set up. We provide a requirements file and give some instructions on how to set up a Python virtual environment to facilitate this.
The trained models must be downloaded separately before the recognizers are usable. Put them into the directory rwtagger/models. Models must always be named final-model.pt and be stored in a subfolder matching their type (direct, indirect, reported or freeIndirect).
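The expected layout for the default models is:

```
rwtagger/models/
├── direct/final-model.pt
├── freeIndirect/final-model.pt
├── indirect/final-model.pt
└── reported/final-model.pt
```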
The software was developed with Python 3.7.0, but also runs with Python 3.6.8 and should work with newer Python versions as well.
The following instructions explain how to set up the necessary Python modules in a virtual environment. Of course, you can also execute the recognizers in your regular Python environment if the necessary modules are installed there.
We cannot cover all variations of the setup of a Python virtual environment here, so the instructions explain only how to do it with Anaconda Python under Windows. If this does not work for you, please consult other instructions regarding Python virtual environments, e.g. this one.
Setup with Anaconda Python under Windows
- Make sure that you have at least Python 3.6.8 installed (3.7.0 is recommended; newer versions should work as well)
- If you have no experience with Python, we recommend installing Anaconda Python; then proceed in the 'Anaconda Powershell Prompt' console to avoid problems with path variables (NOTE: Anaconda has two different 'Prompt' consoles. These instructions assume you use 'Anaconda Powershell Prompt')
- If Python virtualenv is not already installed, execute the following code in the console:

```
pip install virtualenv
```
- Download and unpack this GitHub project
- Change into the directory tagger and execute the code below.
- NOTE: The code below installs the CPU version of pytorch, which works on all computers. If you want to use a GPU instead, refer to the pytorch download page to get the correct syntax. However, for the GPU to work with pytorch, you also have to make sure you have CUDA installed and configured correctly. For this, please refer to other guides, e.g. this one.
```
virtualenv venv
cd venv
.\Scripts\activate
# --> You should now see '(base) (venv)' at the beginning of your prompt line.

# Install pytorch. The code was tested with version 1.10.1.
# If you don't want to use a GPU and are willing to risk getting a newer version, simply type:
pip3 install torch torchvision torchaudio

# Alternatively, if you want more control over the version or want to use a GPU, skip
# the line above and refer to https://pytorch.org/get-started/locally to get the correct
# syntax for installation.

# Install all other required modules:
pip install -r ..\requirements.txt

# Change back to the rwtagger directory
cd ..\rwtagger
```
- To tokenize input texts, you need additional resources for the NLTK module. We recommend installing them in interactive mode: type `python` to open the Python interpreter, then type the following:

```python
import nltk
nltk.download('punkt')
exit()
```
- You can now execute the recognizers in this console window (after you have downloaded the recognizer models). Make sure that venv is the active environment (it should be visible in your prompt line). If you want to switch back to your regular Python environment, change back to venv/Scripts and type:

```
deactivate
```
After setting up the modules and putting the models into the appropriate model folders, you can use the recognizers to annotate your texts. For this, execute the script rwtagger.py in your console.
This script can be used to annotate textual data with the STWR types direct, freeIndirect, indirect and reported. It runs on CPU by default, but can use a GPU if the flag -gpu is specified (and your pytorch installation is properly set up to use the GPU). If the flag -conf is set, confidence values for the annotations are output as well.
Data input can be plain text or tsv files (encoding: UTF-8 in both cases). Tsv files must contain a tab-separated column format with one token per line and two columns: column 'tok' contains the tokens and is mandatory; column 'sentstart' codes sentence boundaries ('yes' for the first token of a sentence, 'no' otherwise). If this column is missing, the whole text is treated as a single sentence.
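For illustration, a minimal tsv input file (using the free indirect example sentence from the table above; columns are tab-separated) might look like this:

```
tok	sentstart
Er	yes
war	no
ratlos	no
.	no
Woher	yes
sollte	no
er	no
denn	no
hier	no
bloß	no
ein	no
Mittagessen	no
bekommen	no
?	no
```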
Result data will always be in tsv format. You can use the script util/tsv_to_excel.py to convert tsv files into Excel files for convenience.
To view the program help, execute `python rwtagger.py -h`
In this mode, the script simply predicts the category for each token in the input files. The results are written into a column named after the predicted category (e.g. 'direct_pred').
Some statistics such as running time are written to a folder called 'result_stats' in your output directory.
In this mode, the script does the same as in predict mode, but additionally calculates F1 scores, recall and precision between the predicted values and a gold standard.
The input must be a tsv file with a column for each STWR type you want to calculate scores for. For example, for direct, the file must contain a column named 'direct'. For each token, the value must be either 'direct' (positive case) or 'x' (negative case), just like the output format of the recognizer. The script adds an additional column 'direct_pred' and calculates the scores between those columns. A detailed analysis is written to a folder called 'result_stats' in your output directory.
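For example, a test-mode input file for the type direct might look like the snippet below (the 'direct_pred' column is added by the script). The exact token-level spans follow the Redewiedergabe annotation guidelines; the values shown here are illustrative:

```
tok	sentstart	direct
Dann	yes	x
sagte	no	x
er	no	x
:	no	x
"	no	direct
Ich	no	direct
habe	no	direct
Hunger	no	direct
.	no	direct
"	no	direct
```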
You can use the script util/create_testformat_from_rwcorpus.py to convert any files that are in the column-based text format of Corpus Redewiedergabe into a format that allows you to use them as input files for the rwtagger in test mode.
NOTE: It is safest to always place the option parameters after input_dir and output_dir.
The directory test contains some folders you can use for testing. Note that the data in the output folder will be overwritten whenever you call the script again.
```
python rwtagger.py input_dir output_dir
```
Simplest call: expects an input folder of plain text UTF-8 encoded files; tags all four STWR types and outputs tsv files with a column for each type; runs the tagger on CPU. (Note: this call may take some time, as it loads and executes all four taggers one after the other.)
```
python rwtagger.py input_dir output_dir -t direct indirect -conf
```
Annotates only the types direct and indirect; outputs confidence values for each annotation; expects an input folder of plain text UTF-8 encoded files.
```
python rwtagger.py input_dir output_dir -gpu -f tsv
```
Runs the tagger on GPU; the input format is not plain text but tsv (similar to the output format of the tagger: one token per line and markers for sentence starts; the column names must be 'tok' and 'sentstart'); annotates all four STWR types.
```
python rwtagger.py input_dir output_dir -m test -t reported
```
Runs the tagger and also calculates test scores for the STWR type reported; input files must be in tsv format and contain a column called 'reported' containing the gold standard annotations.