Added book extraction script

christos-c · Aug 5, 2017 · commit b905e89
1 parent 80767dd
Showing 1 changed file with 10 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -19,7 +19,7 @@ at the Goethe Universität, has created tokenised versions of four languages
 (Chinese, Japanese, Thai, Vietnamese). They are included in this collection but they can also be found 
 [here](https://www.hucompute.org/ressourcen/corpora).
 
-Follow this link for [a collection of tools for reading/processing the corpus](https://github.com/christos-c/bible-corpus-tools). If you are looking for a quick way to generating a raw text version of each Bible, you can use following Python snippet (replace `lang` with the name of the XML file):
+If you are looking for a quick way to generating a raw text version of each Bible, you can use following Python snippet (replace `lang` with the name of the XML file):
 ```
 import xml.etree.ElementTree as ET
 lang = 'English'
@@ -28,3 +28,12 @@ with open(lang + '.txt', 'w', encoding='utf-8') as out:
     for n in root.iter('seg'):
         out.write(n.text.strip() + '\n')
 ```
+or for a specific book:
+```
+book_id = 'b.GEN'
+with open(lang + '-' + book_id + '.txt', 'w', encoding='utf-8') as out:
+    for n in root.findall('.//div[@id="'+book_id+'"]/*seg'):
+        out.write(n.text.strip() + '\n')
+```
+
+Follow this link for [a collection of tools for reading/processing the corpus](https://github.com/christos-c/bible-corpus-tools).
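
A note on the book-specific snippet added in this commit: the path step `/*seg` is not standard XPath, and `n.text.strip()` raises `AttributeError` on a `seg` element that has no text. The sketch below is one possible, more defensive variant, not the committed code. The helper name `book_verses` and the inline `SAMPLE` document are hypothetical, and the `div`/`seg` nesting in the sample is an assumption based on the paths used in the snippet.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real corpus files are assumed to
# nest book divs, chapter divs and verse segs in a similar way.
SAMPLE = """<cesDoc><text><body>
<div id="b.GEN" type="book">
<div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse"> In the beginning </seg>
<seg id="b.GEN.1.2" type="verse"/>
</div>
</div>
<div id="b.EXO" type="book">
<div id="b.EXO.1" type="chapter">
<seg id="b.EXO.1.1" type="verse">Now these are the names</seg>
</div>
</div>
</body></text></cesDoc>"""


def book_verses(root, book_id):
    """Return the stripped text of every seg under the div with this id."""
    book = root.find('.//div[@id="' + book_id + '"]')
    if book is None:
        # Unknown book id: return nothing instead of raising.
        return []
    # iter('seg') walks all descendants, so the chapter level is covered
    # without spelling it out in the path. A seg without text becomes an
    # empty string, which keeps one output line per verse.
    return [(seg.text or '').strip() for seg in book.iter('seg')]


root = ET.fromstring(SAMPLE)
for verse in book_verses(root, 'b.GEN'):
    print(verse)
```

Keeping an empty line for a textless verse is a design choice made here so that line numbers stay comparable across languages; drop those entries if that does not matter for your use.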