Added book extraction script
christos-c authored Aug 5, 2017
1 parent 80767dd commit b905e89
Showing 1 changed file with 10 additions and 1 deletion.
11 changes: 10 additions & 1 deletion README.md
@@ -19,7 +19,7 @@ at the Goethe Universität, has created tokenised versions of four languages
(Chinese, Japanese, Thai, Vietnamese). They are included in this collection but they can also be found
[here](https://www.hucompute.org/ressourcen/corpora).

Follow this link for [a collection of tools for reading/processing the corpus](https://github.com/christos-c/bible-corpus-tools). If you are looking for a quick way to generate a raw text version of each Bible, you can use the following Python snippet (replace `lang` with the name of the XML file):
If you are looking for a quick way to generate a raw text version of each Bible, you can use the following Python snippet (replace `lang` with the name of the XML file):
```
import xml.etree.ElementTree as ET
lang = 'English'
@@ -28,3 +28,12 @@ with open(lang + '.txt', 'w', encoding='utf-8') as out:
    for n in root.iter('seg'):
        out.write(n.text.strip() + '\n')
```
or for a specific book:
```
book_id = 'b.GEN'
with open(lang + '-' + book_id + '.txt', 'w', encoding='utf-8') as out:
    # Match <seg> elements that are grandchildren of the selected book's <div>
    for n in root.findall('.//div[@id="' + book_id + '"]/*/seg'):
        out.write(n.text.strip() + '\n')
```
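
The two snippets above can also be combined into one helper. The following is only a minimal sketch, not part of the corpus tooling: the function name `extract_verses` and the inline `SAMPLE` document are hypothetical, and the nesting it relies on (book `div`, then chapter `div`, then `seg`) is an assumption about the XML files. It uses `//seg` so that verses are found at any depth below the book, and it skips empty `seg` elements, for which `n.text` would be `None`.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, used here instead of a real corpus file.
# Assumed nesting: book <div> -> chapter <div> -> <seg> (one per verse).
SAMPLE = """<cesDoc>
  <text>
    <body>
      <div id="b.GEN" type="book">
        <div id="b.GEN.1" type="chapter">
          <seg id="b.GEN.1.1" type="verse"> verse one </seg>
          <seg id="b.GEN.1.2" type="verse"/>
        </div>
      </div>
      <div id="b.EXO" type="book">
        <div id="b.EXO.1" type="chapter">
          <seg id="b.EXO.1.1" type="verse">verse two</seg>
        </div>
      </div>
    </body>
  </text>
</cesDoc>"""


def extract_verses(root, book_id=None):
    """Return the stripped verse texts, optionally limited to one book."""
    if book_id is None:
        # Every <seg> in the document, in document order.
        segs = root.iter('seg')
    else:
        # Find the <div> with the given id, then every <seg> below it.
        segs = root.findall('.//div[@id="' + book_id + '"]//seg')
    # Empty <seg/> elements have text None; skip them instead of failing.
    return [s.text.strip() for s in segs if s.text and s.text.strip()]


root = ET.fromstring(SAMPLE)
all_verses = extract_verses(root)        # whole document
genesis = extract_verses(root, 'b.GEN')  # a single book
```

For a real file, `root` would instead come from parsing the XML file itself, and the returned list can be written out one verse per line as in the snippets above.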

Follow this link for [a collection of tools for reading/processing the corpus](https://github.com/christos-c/bible-corpus-tools).
