Divergent-Discourses/tibetan_tokenizers

Description

This repository contains a tokenization (word segmentation) utility for Tibetan, **botok_tokenizer.py**. It is based on Botok, a tokenizer developed by OpenPecha that handles both modern and classical Tibetan. Tokenization separates raw text into machine-readable lists of individual words for subsequent NLP tasks. Botok itself requires every file to be read into a string before it can be tokenized; botok_tokenizer.py simplifies that process by letting you point it at a single .txt file or at a whole folder/directory. It also uses only Botok's tokenizer component, not its POS tagger, which is not included in this utility. To run the tokenizer, place the raw text files in a folder and then run botok_tokenizer.py.
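For reference, this is roughly what tokenizing a string directly with Botok looks like (a minimal sketch based on Botok's documented WordTokenizer interface; the sample text and variable names are illustrative, and keyword arguments may vary between Botok versions):

```python
from botok import WordTokenizer

# Botok operates on strings, so a file would normally be read in first.
text = "བཀྲ་ཤིས་བདེ་ལེགས།"  # illustrative sample text
wt = WordTokenizer()
tokens = wt.tokenize(text, split_affixes=False)

# Each token object carries the surface form of the segmented word.
words = [t.text for t in tokens]
print(words)
```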

botok_tokenizer.py was developed by James Engels of SOAS University of London for the Divergent Discourses project. The project is a joint study involving SOAS University of London and Leipzig University, funded by the AHRC in the UK and the DFG in Germany.

Please acknowledge the project in any use of these materials. Copyright for the project resides with the two universities.

Usage

botok_tokenizer.py is input-agnostic and can interpret either directories or individual files. If a directory is given as input, all text files in that directory will be tokenized and the output stored at the directory level holding botok_tokenizer.py.

With botok_tokenizer.py at the same level of file structure as the desired input, call it from the command line with the following template:

python3 botok_tokenizer.py <input_path> <output_path>

Note: do not include the angle brackets in actual usage.
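For example, to tokenize every .txt file in a folder called my_corpus and write the results to a folder called tokenized (both folder names here are purely illustrative):

python3 botok_tokenizer.py my_corpus tokenized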

This script simplifies the process of tokenization by allowing files or directories, rather than strings, to be passed into Botok's tokenizer as inputs.
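The core idea can be sketched as follows. This is an illustrative outline of the file-and-directory handling, not the actual contents of botok_tokenizer.py; the function and variable names are hypothetical:

```python
import sys
from pathlib import Path
from botok import WordTokenizer


def tokenize_file(wt, in_file, out_dir):
    """Read one .txt file, tokenize it with Botok, and write the words out."""
    text = in_file.read_text(encoding="utf-8")
    tokens = wt.tokenize(text)
    words = [t.text for t in tokens]
    (out_dir / in_file.name).write_text(" ".join(words), encoding="utf-8")


if __name__ == "__main__":
    input_path, output_path = Path(sys.argv[1]), Path(sys.argv[2])
    output_path.mkdir(parents=True, exist_ok=True)
    wt = WordTokenizer()
    if input_path.is_dir():
        # Directory input: tokenize every .txt file it contains.
        for f in sorted(input_path.glob("*.txt")):
            tokenize_file(wt, f, output_path)
    else:
        # Single-file input.
        tokenize_file(wt, input_path, output_path)
```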