GitHub - BrendanJercich/pySentencizer: This Python module is a simple sentencizer, tokenizer, and parts-of-speech tagger for the English language.

BrendanJercich / pySentencizer Public

forked from mtancret/pySentencizer

Notifications You must be signed in to change notification settings
Fork 0
Star 0

This Python module is a simple sentencizer, tokenizer, and parts-of-speech tagger for the English language.

matthew.tancreti.net/pySentencizer.html

0 stars 4 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
pysentencizer		pysentencizer
test		test
README		README
example.py		example.py
setup.py		setup.py
test.py		test.py

Repository files navigation

README for pySentencizer
================================================================================
Description

This Python module is a simple sentencizer, tokenizer, and parts-of-speech
tagger for the English language. It isn't the best at any of these tasks (see
the links below for related projects), but it offers some unique features in a
fast and easy to use module. A main feature is that tokens preserve their
original formatting, so parts of the text can be reproduced with the original
punctuation and white space intact.
================================================================================
Installation

pySentencize is simple enough that all of the code is in one file:
pysentencizer/pysentencizer.py. If you want, you can copy this file, along
with the lexicon found in pysentencizer/en, directly into your project.

If you do want to install the pysentencizer module into Python, use the
following.
$ python setup.py install
================================================================================
How To Use

Here is a quick example.

$ python
>>> import pysentencizer
>>> sentencizer = pysentencizer.Sentencizer()
>>> print sentencizer.sentencize("Testing.")
['Testing'/sent_start/para_start/word/capital/verb/VBG, '.'/sent_end/para_end/punc]

The script example.py demonstrates the ability to tokenize and then print
sentences with original formatting intact.
$ ./example.py < test/doyle-test.txt

['Mr.'/sent_start/para_start/word/capital/abbr/noun/NNP, 'Sherlock'/word/capital/noun/NNP, 'Holmes'/word/capital/noun/NNP, ','/punc, 'who'/word/pronoun/WP, 'was'/word/verb/VBD, 'usually'/word/adverb/RB, 'very'/word/adverb/RB, 'late'/word/adjective/JJ, 'in'/word/preposition/IN, 'the'/word/adjective/DT, 'mornings'/word/noun/NNS, ','/punc, 'save'/word/verb/VB, 'upon'/word/preposition/IN, 'those'/word/adjective/DT, 'not'/word/adverb/RB, 'infrequent'/word/adjective/JJ, 'occasions'/word/noun/NNS, 'when'/word/adverb/WRB, 'he'/word/pronoun/PRP, 'was'/word/verb/VBD, 'up'/word/preposition/IN, 'all'/word/adjective/DT, 'night'/word/adjective/JJ, ','/punc, 'was'/word/verb/VBD, 'seated'/word/verb/VBN, 'at'/word/preposition/IN, 'the'/word/adjective/DT, 'breakfast'/word/noun/NN, 'table'/word/adjective/JJ, '.'/sent_end/punc, 'I'/sent_start/word/capital/acronym/pronoun/PRP, 'stood'/word/verb/VBD, 'upon'/word/preposition/IN, 'the'/word/adjective/DT, 'hearth-rug'/word/adjective/JJ, 'and'/word/conjunction/CC, 'picked'/word/verb/VBD, 'up'/word/preposition/IN, 'the'/word/adjective/DT, 'stick'/word/verb/VB, 'which'/word/adjective/WDT, 'our'/word/pronoun/PRP$, 'visitor'/word/noun/NN, 'had'/word/verb/VBD, 'left'/word/verb/VBN, 'behind'/word/preposition/IN, 'him'/word/pronoun/PRP, 'the'/word/adjective/DT, 'night'/word/adjective/JJ, 'before'/word/preposition/IN, '.'/sent_end/punc, 'It'/sent_start/word/capital/pronoun/PRP, 'was'/word/verb/VBD, 'a'/word/adjective/DT, 'fine'/word/adjective/JJ, ','/punc, 'thick'/word/adjective/JJ, 'piece'/word/noun/NN, 'of'/word/preposition/IN, 'wood'/word/noun/NN, ','/punc, 'bulbous-headed'/word/adjective/JJ, ','/punc, 'of'/word/preposition/IN, 'the'/word/adjective/DT, 'sort'/word/noun/NN, 'which'/word/adjective/WDT, 'is'/word/verb/VBZ, 'known'/word/verb/VBN, 'as'/word/preposition/IN, 'a'/word/adjective/DT, '"'/punc, 'Penang'/word/capital/noun/NNP, 'lawyer'/word/noun/NN, '.'/punc, '"'/sent_end/punc, 'Just'/sent_start/word/capital/adverb/RB, 'under'/word/adjective/JJ, 'the'/word/adjective/DT, 'head'/word/noun/NN, 'was'/word/verb/VBD, 'a'/word/adjective/DT, 'broad'/word/adjective/JJ, 'silver'/word/noun/NN, 'band'/word/noun/NN, 'nearly'/word/adverb/RB, 'an'/word/adjective/DT, 'inch'/word/noun/NN, 'across'/word/preposition/IN, '.'/sent_end/punc, '"'/sent_start/punc, 'To'/word/capital/preposition/TO, 'James'/word/capital/noun/NNP, 'Mortimer'/word/capital/noun/NNP, ','/punc, 'M.R.C.S.'/word/capital/abbr/acronym/adjective/CD, ','/punc, 'from'/word/preposition/IN, 'his'/word/pronoun/PRP$, 'friends'/word/noun/NNS, 'of'/word/preposition/IN, 'the'/word/adjective/DT, 'C.C.H.'/word/capital/abbr/acronym/adjective/CD, ','/punc, '"'/punc, 'was'/word/verb/VBD, 'engraved'/word/verb/VBN, 'upon'/word/preposition/IN, 'it'/word/pronoun/PRP, ','/punc, 'with'/word/preposition/IN, 'the'/word/adjective/DT, 'date'/word/noun/NN, '"'/punc, '1884'/number, '.'/punc, '"'/sent_end/punc, 'It'/sent_start/word/capital/pronoun/PRP, 'was'/word/verb/VBD, 'just'/word/adjective/JJ, 'such'/word/adjective/JJ, 'a'/word/adjective/DT, 'stick'/word/verb/VB, 'as'/word/preposition/IN, 'the'/word/adjective/DT, 'old-fashioned'/word/adjective/JJ, 'family'/word/adverb/RB, 'practitioner'/word/noun/NN, 'used'/word/verb/VBN, 'to'/word/preposition/TO, 'carry—dignified', ','/punc, 'solid'/word/adjective/JJ, ','/punc, 'and'/word/conjunction/CC, 'reassuring'/word/adjective/JJ, '.'/sent_end/para_end/punc]

>>> Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated at the breakfast table. 
>>> I stood upon the hearth-rug and picked up the stick which our visitor had left behind him the night before. 
>>> It was a fine, thick piece of wood, bulbous-headed, of the sort which is known as a "Penang lawyer." 
>>> Just under the head was a broad silver band nearly an inch across. 
>>> "To James Mortimer, M.R.C.S., from his friends of the C.C.H.," was engraved upon it, with the date "1884." 
>>> It was just such a stick as the old-fashioned family practitioner used to carry—dignified, solid, and reassuring.
================================================================================
License

For licensing details, see individual file headers.
Software licensed under the Apache License, Version 2.0.
The parts-of-speech lexicon (en/brill-lexicon.dat) licensed under the MIT License.
Test texts downloaded from Project Gutenberg.
================================================================================
Related Projects

text-sentence
	A good Python sentence splitter, but it doesn't preserve the original
	formatting of the text.
	http://pypi.python.org/pypi/text-sentence/
nlplib
	Tags parts of speech using Eric Brill's lexicon and some predefined
	transition rules. Similar to tagging used in pySentencizer.
	http://jasonwiener.com/2006/01/20/simple-nlp-part-of-speech-tagger-in-python/
NLTK
	pySentencizer is meant to be small and easy to use. If you have more
	specific natural language processing requirements, you should be using
	NLTK or MontyLingua.
	http://www.nltk.org/
MontyLingua
	http://web.media.mit.edu/~hugo/montylingua/index.html#documentation
Brilltag
	Eric Brill's original parts-of-speech tagger.
	tagger https://github.com/aidanf/Brilltag