TibNormCSV

    TibNorm is a utility for producing normalised versions of Tibetan texts to make them easier 
    for contemporary users to search and read, in line with current Tibetan writing conventions. As part 
    of the normalisation process, TibNorm:

    - changes Tibetan numbers into Arabic numerals
    - changes Tibetan brackets and quotation marks into the standard western equivalents
    - changes non-standard “illegal” stacks into standard ones
    - deletes a ། if found at the beginning of a line
    - removes a ། if found after a ཀ, ག or ཤ, with or without a vowel
    - adds a ་ between ང and །
    - reduces two or more ་ to a single one
    - changes ཌ་ or ཊ་ to གས་ unless preceded by a white space, tab, or new line.

    TibNorm also expands abbreviations so that they are shown in their full form. For 
    abbreviations in classical Tibetan, TibNorm draws from the list of over 6,000 classical Tibetan 
    abbreviations compiled by Bruno Lainé of the Tibetan Manuscript Project Vienna (TMPV) as part of 
    the project’s Resources for Kanjur and Tanjur Studies. In TibNorm, the user can manually change the 
    flag in the abbreviations table to exclude any abbreviation that they don’t want to expand.
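
A few of the operations listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not TibNorm's actual code: the function name normalise_sketch is hypothetical, only four of the rules are covered, and the ང rule here handles the bare consonant without a vowel.

```python
import re

TSHEG = "\u0f0b"  # ་
SHAD = "\u0f0d"   # །
NGA = "\u0f44"    # ང

# Tibetan digits occupy U+0F20..U+0F29; map each one to its ASCII digit.
TIBETAN_DIGITS = {0x0F20 + i: str(i) for i in range(10)}


def normalise_sketch(text):
    """Apply four of the listed rules (hypothetical helper, not TibNorm's API)."""
    # Tibetan numbers -> Arabic numerals
    text = text.translate(TIBETAN_DIGITS)
    # delete a shad found at the beginning of a line
    text = re.sub("(?m)^" + SHAD, "", text)
    # reduce two or more tsheg to a single one
    text = re.sub(TSHEG + "{2,}", TSHEG, text)
    # add a tsheg between nga and shad
    text = text.replace(NGA + SHAD, NGA + TSHEG + SHAD)
    return text


sample = SHAD + "\u0f21\u0f22" + TSHEG * 3 + NGA + SHAD
print(normalise_sketch(sample))
```

The rules are applied in a fixed order here purely for readability; the real tool drives its replacements from the tables described further down.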

TibNormCSV applies TibNorm's normalisation operations to text entries in a .csv column, rather than to .txt files stored in a given directory.

How to use

Ensure you have pandas installed. We have tested this code with Python==3.12.2 and pandas==2.2.3.

Ensure your CSV file contains a column titled 'paragraph', which holds the text you wish to normalise.
Set your table path in src/config.ini:
- Set table_path to the absolute path of the tables folder. There is no need to wrap the path in quotation marks.
Open your terminal and navigate to the repository directory that contains src.
Execute the following command:

python src/main.py path/to/inputcsv/directory path/to/outputcsv/directory

Find the results in your stated output CSV directory. The normalised text will be inserted in a new column titled 'normalised_paragraph' occurring after the 'paragraph' column.
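
The column placement described above can be illustrated with a small sketch. It uses the standard-library csv module as a dependency-free stand-in for pandas, so it is not the project's own code; the function name add_normalised_column and the normalise callback are hypothetical.

```python
import csv
import io


def add_normalised_column(csv_text, normalise):
    """Return CSV text with 'normalised_paragraph' placed after 'paragraph'.

    `normalise` is any callable taking and returning a string; it stands in
    for the real normalisation step.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = list(reader.fieldnames)
    pos = fields.index("paragraph") + 1
    out_fields = fields[:pos] + ["normalised_paragraph"] + fields[pos:]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=out_fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["normalised_paragraph"] = normalise(row["paragraph"])
        writer.writerow(row)
    return out.getvalue()


# str.upper is used only as a visible placeholder for the normaliser.
print(add_normalised_column("id,paragraph,source\n1,abc,x\n", str.upper))
```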

Tables

Description of tables

The tables below contain signs or combinations to be processed (e.g., replaced with other signs, reduced to one sign, etc.).
The columns of the tables are separated by a tab.

Abbreviations

Simple replacement of abbreviations, just as in table1.
Columns:
1. transcription: an abbreviated form.
2. normalisation: a full-form.
3. flag: 0 means that the replacement is cancelled, while 1 means that it is valid. You can modify this parameter in src/config.ini.
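
As a rough illustration of how the flag column could gate a simple replacement, consider the sketch below. It assumes a tab-separated table with exactly three columns and no header row, and it uses Latin placeholder entries rather than real entries from the abbreviations list; the function names are hypothetical.

```python
def load_flagged_table(tsv_text):
    """Parse rows of transcription<TAB>normalisation<TAB>flag.

    Only rows whose flag is 1 are kept. The absence of a header row is
    an assumption of this sketch.
    """
    pairs = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        transcription, normalisation, flag = line.split("\t")
        if flag.strip() == "1":
            pairs.append((transcription, normalisation))
    return pairs


def apply_simple(text, pairs):
    """Apply each (old, new) pair with plain str.replace."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


# Placeholder rows: the second one is switched off by its flag.
table = "ab.\tabbreviation\t1\nex.\texample\t0\n"
print(apply_simple("ab. ex.", load_flagged_table(table)))
```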

Table1

Simple replacement of everything other than abbreviations: e.g., ༠ → 0.
This table also contains combinations of two or more characters that are not allowed to occur together: e.g., ག། → ག
Columns:
1. transcription: character(s) to be replaced with others.
2. normalisation: character(s) with which the character(s) in transcription are to be replaced.

Table2

Replacement using regular expressions: e.g., \n། → \n; ་་་་་་་་་་ → ་ (multiple tsheg are reduced to a single tsheg).
This replacement is performed with the re.sub function, which is slower than the simple replacement function (replace) used to normalise characters in table1. Therefore, whenever a character can be normalised without a regular expression, it is advisable to include it in table1.
Columns:
1. transcription: character(s) to be replaced with others. Regular expressions are applicable.
2. normalisation: character(s) with which the character(s) in transcription are to be replaced. Regular expressions are applicable.
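
The two examples given for this table can be sketched with re.sub as follows. The rule list and the function name apply_regex_rules are hypothetical; the sketch assumes each row is a (pattern, replacement) pair that has already been read from the table.

```python
import re

TSHEG = "\u0f0b"  # ་
SHAD = "\u0f0d"   # །

# (transcription, normalisation) pairs mirroring the two examples above.
REGEX_RULES = [
    ("\n" + SHAD, "\n"),   # shad at the start of a line is dropped
    (TSHEG + "+", TSHEG),  # a run of tsheg collapses to one
]


def apply_regex_rules(text, rules):
    """Apply each regular-expression rule in table order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


sample = "\u0f40" + TSHEG * 3 + "\u0f41\n" + SHAD + "\u0f42"
print(apply_regex_rules(sample, REGEX_RULES))
```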

Table3

Replacement with some exceptions: e.g., whitespace ( ) → tsheg (་), but spaces before and after numbers, alphabetic characters and ༄ should remain; ་། → །, but not when ་། is preceded by ང.
Columns:
1. transcription: character(s) to be replaced with others.
2. normalisation: character(s) with which the character(s) in transcription are to be replaced.
3. exception: If the character(s) in transcription appear before or after the character(s) in exception, the replacement is cancelled.
4. exc_len: The maximum length, in characters, of a match for exception. For instance, [A-Za-z0-9\u4e00-\u9fff༄] always matches a single character, so its maximum length is 1 (this is equivalent to [A-Za-z0-9\u4e00-\u9fff༄]{1}, though the Python code itself does not explicitly specify the length). Conversely, the alternatives in (?:ང|ངི|ངུ|ངེ|ངོ) vary in length: ང is a single character, while the others, combined with a vowel, count as two characters. Thus, the maximum length in this case is 2.
5. scope: This parameter defines where exceptions are searched for. When set to left, the characters to the left of the target character(s), within the range of exc_len, are checked for exceptions. right checks the right side instead, and both checks both sides.
6. flag: 0 means that the replacement is cancelled, while 1 means that it is valid. You can modify this parameter in src/config.ini.
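
One possible reading of how exception, exc_len and scope interact is sketched below. It is not the project's implementation: the function name is hypothetical, transcription is treated as a literal string, and the sketch assumes the exception must sit directly next to the match (anchored at the end of the left window or the start of the right window).

```python
import re

TSHEG = "\u0f0b"  # ་
SHAD = "\u0f0d"   # །
NGA = "\u0f44"    # ང

# ང alone or with one of the vowel signs i, u, e, o (maximum length 2).
NGA_EXCEPTION = "(?:" + "|".join(
    NGA + v for v in ("", "\u0f72", "\u0f74", "\u0f7a", "\u0f7c")
) + ")"


def replace_with_exceptions(text, transcription, normalisation,
                            exception, exc_len, scope):
    """Replace transcription unless an exception is adjacent within scope."""
    left_exc = re.compile("(?:" + exception + r")\Z")
    right_exc = re.compile("(?:" + exception + ")")
    pieces, pos = [], 0
    for m in re.finditer(re.escape(transcription), text):
        # Windows of at most exc_len characters on either side of the match.
        left = text[max(0, m.start() - exc_len):m.start()]
        right = text[m.end():m.end() + exc_len]
        blocked = False
        if scope in ("left", "both") and left_exc.search(left):
            blocked = True
        if scope in ("right", "both") and right_exc.match(right):
            blocked = True
        pieces.append(text[pos:m.start()])
        pieces.append(m.group(0) if blocked else normalisation)
        pos = m.end()
    pieces.append(text[pos:])
    return "".join(pieces)


# tsheg + shad becomes shad, except after nga.
sample = "\u0f40" + TSHEG + SHAD + " " + NGA + TSHEG + SHAD
print(replace_with_exceptions(sample, TSHEG + SHAD, SHAD,
                              NGA_EXCEPTION, 2, "left"))
```

The window slicing is what exc_len is used for in this sketch: it bounds how far from the match the exception pattern is allowed to look.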

Things to pay attention to, when adding a new entry to a table.

Consider whether the entry you want to add is part of other words. If it is, you will need to use regular expressions to define exceptions.
Some regular expressions should be escaped by adding a backslash before them, e.g., \\n (\ + \n).
The order of normalisation is assumed not to affect the final result; nevertheless, for safety, place new entries at the bottom of the table.
When adding a new entry to a table, verify that the replacement succeeds and that it does not affect other replacements. A diff tool is useful for comparing the output before and after adding the line.
A character with a vowel, e.g., ཏེ (length=2), or a ligature, e.g., བཀྲམས (length=5), is computationally regarded as multiple characters. Thus, for example, if you want to refer to ཏ as a consonant, you should use a regular expression so that ཏ with any vowel is also included.
When adding an abbreviated form and its full form in table1, it is advisable to use tsheg both before and after the abbreviated and full forms. This helps avoid mistaken replacements of the same form appearing in the middle of a syllable. However, a drawback is that an abbreviated form at the beginning of a sentence remains unreplaced (see issue).
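
The last two points can be checked with a short sketch. The names below are hypothetical, the vowel class covers only the four basic vowel signs (an assumption), and the abbreviation is a Latin placeholder rather than a real table entry.

```python
import re

TSHEG = "\u0f0b"  # ་
TA = "\u0f4f"     # ཏ
BASIC_VOWELS = "\u0f72\u0f74\u0f7a\u0f7c"  # vowel signs i, u, e, o

# Python counts code points, so a consonant plus vowel sign has length 2
# and a stacked syllable counts each component separately.
LEN_TE = len("\u0f4f\u0f7a")                    # ཏེ
LEN_BKRAMS = len("\u0f56\u0f40\u0fb2\u0f58\u0f66")  # བཀྲམས

# ཏ with or without one of the basic vowel signs.
TA_ANY_VOWEL = re.compile(TA + "[" + BASIC_VOWELS + "]?")


def replace_delimited(text, short, full):
    """Replace short only where it has a tsheg on both sides."""
    return text.replace(TSHEG + short + TSHEG, TSHEG + full + TSHEG)


print(LEN_TE, LEN_BKRAMS)
# The leading "ab" has no tsheg before it, so it is left unreplaced.
print(replace_delimited("ab" + TSHEG + "ab" + TSHEG + "xaby" + TSHEG, "ab", "FULL"))
```

The second print shows the drawback mentioned above: the form at the very start of the text is skipped because it has no preceding tsheg.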

Copyright

TibNorm was developed by Yuki Kyogoku of Leipzig University then modified by Christina Sabbagh (TibNormCSV) of SOAS, University of London, for the Divergent Discourses project. The project is a joint study involving SOAS University of London and Leipzig University, funded by the AHRC in the UK and the DFG in Germany. Please acknowledge the project in any use of these materials. Copyright for the project resides with the two universities.
