TibNorm is a utility for producing normalised versions of Tibetan texts to make them easier
for contemporary users to search and read, in line with current Tibetan writing conventions. As part
of the normalisation process, TibNorm:
- changes Tibetan numbers into Arabic numerals
- changes Tibetan brackets and quotation marks into the standard western equivalents
- changes non-standard “illegal” stacks into standard ones
- deletes a ། if found at the beginning of a line
- removes a ། if found after a ཀ, ག or ཤ, with or without a vowel
- adds a ་ between ང and །
- reduces two or more ་ to a single one
- changes ཌ་ or ཊ་ to གས་ unless preceded by a white space, tab, or new line.
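A few of the operations listed above can be sketched in Python as follows. This is a minimal illustration, not TibNorm's own code: the function name `sketch_normalise` is hypothetical, and the vowel handling is assumed to cover only the four basic vowel signs.

```python
import re

SHAD, TSHEG = "\u0f0d", "\u0f0b"          # ། and ་
KA, GA, SHA, NGA = "\u0f40", "\u0f42", "\u0f64", "\u0f44"  # ཀ ག ཤ ང
VOWELS = "\u0f72\u0f74\u0f7a\u0f7c"       # assumption: only i, u, e, o signs

# Tibetan digits U+0F20..U+0F29 mapped to the ASCII digits 0..9.
DIGITS = {0x0F20 + i: str(i) for i in range(10)}


def sketch_normalise(text: str) -> str:
    """Apply three of the listed operations (hypothetical helper)."""
    # Tibetan numbers -> Arabic numerals.
    text = text.translate(DIGITS)
    # Remove a shad after ka, ga or sha, with or without a vowel sign.
    text = re.sub(f"([{KA}{GA}{SHA}][{VOWELS}]?){SHAD}", r"\1", text)
    # Add a tsheg between nga and a shad.
    text = text.replace(NGA + SHAD, NGA + TSHEG + SHAD)
    return text


print(sketch_normalise("\u0f21\u0f29 \u0f42\u0f72\u0f0d \u0f44\u0f0d"))
```

In the real tool these rules are driven by the tables described further down rather than hard-coded.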
TibNorm also expands abbreviations so that they are shown in their full form. For
abbreviations in classical Tibetan, TibNorm draws from the list of over 6,000 classical Tibetan
abbreviations compiled by Bruno Lainé of the Tibetan Manuscript Project Vienna (TMPV) as part of
the project’s Resources for Kanjur and Tanjur Studies. In TibNorm, the user can manually change the
flag in the abbreviations table to exclude any abbreviation that they don’t want to expand.
TibNormCSV applies TibNorm's normalisation operations to text entries in a .csv column, rather than to .txt files stored in a given directory.
Ensure you have pandas installed. We have tested this code with Python==3.12.2 and pandas==2.2.3.
- Ensure your CSV file contains a column titled 'paragraph' which holds the text you wish to normalise.
- Change your table path in src/config.ini
- Set table_path to the absolute path where the folder tables is located. No need to wrap in quotation marks.
- Open your terminal and navigate to the TibNorm directory (the one that contains src).
- Execute the following command:
python src/main.py path/to/inputcsv/directory path/to/outputcsv/directory
- Find the results in your stated output CSV directory. The normalised text will be inserted in a new column titled 'normalised_paragraph' occurring after the 'paragraph' column.
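The shape of the input and output CSV files can be illustrated with a small standard-library sketch. TibNormCSV itself relies on pandas; the code below is only an illustration of where the new column ends up. `add_normalised_column` is a hypothetical helper, and `normalise` is a placeholder standing in for TibNorm's operations.

```python
import csv
import io


def normalise(text: str) -> str:
    """Placeholder for TibNorm's normalisation: only maps the digit ༡ to 1."""
    return text.replace("\u0f21", "1")


def add_normalised_column(src, dst) -> None:
    """Copy a CSV, inserting 'normalised_paragraph' right after 'paragraph'."""
    reader = csv.DictReader(src)
    fields = list(reader.fieldnames or [])
    pos = fields.index("paragraph") + 1  # ValueError if the column is missing
    out_fields = fields[:pos] + ["normalised_paragraph"] + fields[pos:]
    writer = csv.DictWriter(dst, fieldnames=out_fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["normalised_paragraph"] = normalise(row["paragraph"])
        writer.writerow(row)


source = io.StringIO("id,paragraph,source\n1,\u0f21,A\n")
result = io.StringIO()
add_normalised_column(source, result)
print(result.getvalue())
```

The other columns of the input file are carried over unchanged, in their original order.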
- The tables below contain signs or combinations to be processed (e.g., replaced with other signs, reduced to one sign, etc.)
- The columns of the tables are separated by a tab.
- Simple replacement of abbreviations, just as in table1.
- Columns:
- transcription: an abbreviated form.
- normalisation: a full-form.
- flag: 0 means that the replacement is cancelled, while 1 means that it is valid. You can modify this parameter in src/config.ini.
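How a tab-separated table with a flag column could drive simple replacements is sketched below. The sketch assumes the table has a header row with the three column names given above; the rows use Latin stand-ins rather than real abbreviations, and both function names are hypothetical.

```python
import csv
import io

# Hypothetical table content; the real entries live in the abbreviations table.
TABLE = (
    "transcription\tnormalisation\tflag\n"
    "AB\tALPHA-BETA\t1\n"
    "XY\tEX-WHY\t0\n"
)


def load_active_pairs(handle):
    """Return (transcription, normalisation) pairs whose flag is 1."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        (row["transcription"], row["normalisation"])
        for row in reader
        if row["flag"].strip() == "1"
    ]


def expand(text, pairs):
    """Apply each active pair with plain string replacement."""
    for short, full in pairs:
        text = text.replace(short, full)
    return text


pairs = load_active_pairs(io.StringIO(TABLE))
print(expand("AB and XY", pairs))  # the XY row is skipped because its flag is 0
```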
- Simple replacement except for abbreviations: e.g., ༠ → 0.
- This table contains combinations of two or more characters that are not allowed to come together: e.g., ག། → ག
- Columns:
- transcription: character(s) to be replaced with others.
- normalisation: character(s) with which the character(s) in transcription are to be replaced.
- Replacement using regular expressions: e.g., \n། → \n; ་་་་་་་་་་ → ་ (multiple tshegs are reduced to one)
- This replacement is performed with the re.sub function, which is slower than the simple replacement function (replace) used to normalise characters in table1. Therefore, whenever a character can be normalised without a regular expression, it is advisable to include it in table1.
- Columns:
- transcription: character(s) to be replaced with others. Regular expressions are applicable.
- normalisation: character(s) with which the character(s) in transcription are to be replaced. Regular expressions are applicable.
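The two regular-expression examples above can be reproduced with `re.sub` as in the following sketch. The rule list and the function name `apply_regex_rules` are illustrative assumptions, not the layout of the actual table.

```python
import re

SHAD, TSHEG = "\u0f0d", "\u0f0b"  # ། and ་

# (pattern, replacement) pairs mirroring the two examples above.
RULES = [
    (rf"\n{SHAD}", "\n"),          # shad at the start of a new line is dropped
    (rf"{TSHEG}{{2,}}", TSHEG),    # two or more tshegs become a single tsheg
]


def apply_regex_rules(text, rules):
    """Apply every regex rule in order with re.sub."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


sample = "a\n" + SHAD + "b" + TSHEG * 3 + "c"
print(repr(apply_regex_rules(sample, RULES)))
```

A fixed string such as a single digit needs no pattern matching, which is why plain `str.replace` is the cheaper choice for it.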
- Replacement with some exceptions: e.g., whitespace ( ) → tsheg (་), but spaces before and after numbers, alphabetic characters and ༄ should remain; ་། → །, but not when ་། is preceded by ང.
- Columns:
- transcription: character(s) to be replaced with others.
- normalisation: character(s) with which the character(s) in transcription are to be replaced.
- exception: If the character(s) in transcription appear before or after the character(s) in exception, the replacement is cancelled.
- exc_len: The maximum length, in characters, of a string matched by exception. For instance, [A-Za-z0-9\u4e00-\u9fff༄] matches exactly one character, so its maximum length is 1 (it is equivalent to [A-Za-z0-9\u4e00-\u9fff༄]{1}, although the Python code does not specify the length explicitly). By contrast, (?:ང|ངི|ངུ|ངེ|ངོ) matches strings of different lengths: ང alone is one character, while ང combined with a vowel is two. The maximum length in this case is therefore 2.
- scope: The side(s) on which exceptions are searched. With left, the characters to the left of the target, within a range of exc_len characters, are checked against exception. With right, the characters to the right are checked, and with both, both sides are checked.
- flag: 0 means that the replacement is cancelled, while 1 means that it is valid. You can modify this parameter in src/config.ini.
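The interplay of exception, exc_len and scope can be sketched as below. This is one possible reading of the columns, not TibNorm's actual implementation: the function `replace_with_exceptions` is hypothetical, the transcription is treated as a literal string, and the exception is required to sit immediately next to the target.

```python
import re

SHAD, TSHEG, NGA = "\u0f0d", "\u0f0b", "\u0f44"  # ། ་ ང


def replace_with_exceptions(text, target, repl, exception, exc_len, scope):
    """Replace target with repl unless exception matches right next to it."""
    if not target:
        raise ValueError("target must not be empty")
    exc = re.compile(exception)
    out, i, n = [], 0, len(target)
    while i < len(text):
        if not text.startswith(target, i):
            out.append(text[i])
            i += 1
            continue
        left = text[max(0, i - exc_len):i]       # up to exc_len chars before
        right = text[i + n:i + n + exc_len]      # up to exc_len chars after
        blocked = False
        if scope in ("left", "both"):
            # The exception must end exactly where the target begins.
            blocked = any(exc.fullmatch(left[k:]) for k in range(len(left)))
        if not blocked and scope in ("right", "both"):
            # The exception must start exactly where the target ends.
            blocked = any(exc.fullmatch(right[:k]) for k in range(1, len(right) + 1))
        out.append(target if blocked else repl)
        i += n
    return "".join(out)


# Space -> tsheg, except next to Latin letters or digits (exc_len 1, both sides).
print(replace_with_exceptions("ab cd", " ", TSHEG, "[A-Za-z0-9]", 1, "both"))
# Tsheg + shad -> shad, except after nga with an optional vowel (exc_len 2, left).
nga_exc = f"{NGA}[\u0f72\u0f74\u0f7a\u0f7c]?"
print(replace_with_exceptions(NGA + TSHEG + SHAD, TSHEG + SHAD, SHAD, nga_exc, 2, "left"))
```

Checking only a window of exc_len characters keeps the exception test local to the target instead of scanning the whole text.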
- Consider whether the entry you want to add is part of other words. If it is, you will need to use regular expressions to define exceptions.
- Some regular-expression sequences must be escaped by adding a backslash before them, e.g., \\n (\ + \n).
- The order of normalisation is assumed not to affect the final result. Nevertheless, to be safe, place any new normalisation at the bottom of the table.
- When adding a new entry to a table, verify that the replacement succeeds and that it does not affect other replacements. Comparing the output before and after adding the line with a diff tool is a convenient way to confirm this.
- A character with a vowel, e.g., ཏེ (length=2), or a ligature, e.g., བཀྲམས (length=5), is computationally regarded as multiple characters. Thus, for example, to refer to ཏ as a consonant, use a regular expression so that ཏ with any vowel is also included.
- When adding an abbreviated form and its full form in table1, it is advisable to use tsheg both before and after the abbreviated and full forms. This helps avoid mistaken replacements of the same form appearing in the middle of a syllable. However, a drawback is that an abbreviated form at the beginning of the sentence remains unreplaced (See issue).
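The last two notes can be checked directly in Python. The sketch below shows the code-point lengths mentioned above, a pattern that matches ཏ with or without a vowel (assuming only the four basic vowel signs), and the effect of tsheg-delimited entries, using a Latin stand-in instead of a real abbreviation.

```python
import re

TSHEG = "\u0f0b"                          # ་
TA = "\u0f4f"                             # ཏ
TE = "\u0f4f\u0f7a"                       # ཏེ: consonant plus vowel sign
STACK = "\u0f56\u0f40\u0fb2\u0f58\u0f66"  # བཀྲམས

print(len(TA), len(TE), len(STACK))  # 1 2 5

# ཏ followed by at most one of the vowel signs i, u, e, o (an assumption).
TA_ANY_VOWEL = re.compile(f"{TA}[\u0f72\u0f74\u0f7a\u0f7c]?")

# Tsheg-delimited entry with a Latin stand-in for an abbreviation.
short, full = TSHEG + "ab" + TSHEG, TSHEG + "ALPHA" + TSHEG
text = "ab" + TSHEG + "x" + TSHEG + "ab" + TSHEG + "y"
expanded = text.replace(short, full)
print(expanded)  # the first "ab" has no tsheg before it and stays unreplaced
```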
TibNorm was developed by Yuki Kyogoku of Leipzig University and then modified by Christina Sabbagh of SOAS, University of London (as TibNormCSV), for the Divergent Discourses project. The project is a joint study involving SOAS, University of London, and Leipzig University, funded by the AHRC in the UK and the DFG in Germany. Please acknowledge the project in any use of these materials. Copyright for the project resides with the two universities.