
Refactoring and cleanup.
meyersbs committed Dec 15, 2016
1 parent 991bc99 commit 1f22d9d
Showing 9 changed files with 88 additions and 179 deletions.
104 changes: 8 additions & 96 deletions README.md
@@ -1,19 +1,13 @@
[![Build Status](https://travis-ci.org/meyersbs/SPLAT.svg?branch=master)](https://travis-ci.org/meyersbs/SPLAT) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](/LICENSE.md) [![codecov](https://codecov.io/gh/meyersbs/SPLAT/branch/master/graph/badge.svg)](https://codecov.io/gh/meyersbs/SPLAT)
[![PyPI](https://img.shields.io/pypi/pyversions/SPLAT-library.svg?maxAge=2592000)](https://pypi.python.org/pypi/SPLAT-library/0.3.7) [![PyPI](https://img.shields.io/pypi/v/SPLAT-library.svg?maxAge=2592000)](https://pypi.python.org/pypi/SPLAT-library/0.3.7) [![Website](https://img.shields.io/website-up-down-green-red/http/splat-library.org.svg?maxAge=2592000)](http://splat-library.org/)

<img src="https://cdn.rawgit.com/meyersbs/SPLAT/master/logo.svg" width="20%">
<img src="https://cdn.rawgit.com/meyersbs/SPLAT/master/docs/logo.svg" width="20%">
<br>
<img src="https://cdn.rawgit.com/meyersbs/SPLAT/master/tag.svg" width="60%">

# <em>WARNING!</em>
Errors have been found in the calculation of some metrics. Calculations of the following metrics may be inaccurate:
* Idea Density
* Content Density
* Syllable Counts
<img src="https://cdn.rawgit.com/meyersbs/SPLAT/master/docs/tag.svg" width="60%">

- - - -
## Contact Information
&nbsp;&nbsp;&nbsp;&nbsp;Benjamin S. Meyers < <[email protected]> >
&nbsp;&nbsp;&nbsp;&nbsp;Benjamin S. Meyers <[[email protected]](mailto:[email protected])>

- - - -
## Project Description
@@ -22,7 +16,7 @@ SPLAT is a command-line application designed to make it easy for linguists (both
SPLAT is designed to help you gather linguistic features from text files and it is assumed that most input files will not be already annotated. In order for SPLAT to function properly, you should ensure that the input files that you provide do not contain any annotations. Because there are so many variations of linguistic annotation schemes, it would simply be impossible to account for all of them in the initial parsing of input files; it is easier for you to remove any existing annotations than it is for me to do so.

- - - -
## System Requirementsgit
## System Requirements
SPLAT is being developed and tested on 64-bit Ubuntu 15.10 with Python 3.4.3. Minimum requirements include:
* Python 3.4 or Later
* NLTK 3.1 or Later
@@ -34,11 +28,6 @@ SPLAT is being developed and tested on 64-bit Ubuntu 15.10 with Python 3.4.3. Mi
2. Run the following in a command line:
``` bash
pip3 install SPLAT-library

# Recommended, but not required.
echo 'alias splat="splat-cli"' >> ~/.bashrc
echo 'alias splat="splat-cli"' >> ~/.bash_profile
source .bashrc
```

To uninstall run the following in a command line.
@@ -53,95 +42,18 @@ To uninstall run the following in a command line.
splat --help # Provide helpful information
splat --info # Display version and copyright information
splat --usage # Display basic command line structure
splat bubble filename # Display the raw text from the file
splat splat filename # Display the raw text from the file
```

- - - -
## Analysis Functionality \& Usage
#### Types \& Tokens
```bash
splat tokens filename # List all Tokens
splat types filename # List all Types
splat ttr filename # Calculate Type-Token Ratio
splat wc filename # Word Count (Token Count)
splat uwc filename # Unique Word Count (Type Count)
```
##### Parts-Of-Speech
```bash
splat pos filename # List Tokens with their Parts-Of-Speech
splat poscounts filename # List Part-Of-Speech Tags with their Frequencies
```
#### Syntactic Complexity
```bash
splat cdensity filename # Calculate Content-Density
splat idensity filename # Calculate Idea Density
splat flesch filename # Calculate Flesch Readability Ease
splat kincaid filename # Calculate Flesch-Kincaid Grade Level
splat yngve filename # Calculate Yngve-Score
splat frazier filename # Calculate Frazier-Score
```
#### Listing Content \& Function Words
```bash
splat function filename # List all Function Words
splat content filename # List all Content Words
splat ufunction filename # Unique Function Words
splat ucontent filename # Unique Content Words
splat cfr filename # Calculate Content-Function Ratio
```
#### Utterances \& Sentences
```bash
splat utts filename # List all Utterances
splat sents filename # List all Sentences
splat alu filename # Average Utterance Length
splat als filename # Average Sentence Length
splat uttcount filename # Utterance Count
splat sentcount filename # Sentence Count
splat syllables filename # Display Number of Syllables
splat wpu filename # List the Number of Words in each Utterance
splat wps filename # List the number of Words in each Sentence
```
#### Frequency Distributions
```bash
splat mostfreq filename x # List the x Most Frequent Words
splat leastfreq filename x # List the x Least Frequent Words
splat plotfreq filename x # Draw and Display a Frequency Graph
```
#### Disfluencies
```bash
splat disfluencies filename # Calculate various Disfluency Counts
splat dpa filename # List the Number of Disfluencies per each Dialog Act
splat dpu filename # List the Number of Disfluencies in each Utterance
splat dps filename # List the Number of Disfluencies in each Sentence
```
#### Syntactic Parsing
```bash
splat trees filename # List Parse-Tree Strings for each Utterance
splat maxdepth filename # Calculate Max Tree Depth
splat drawtrees filename # Draw Parse Trees
```
#### Language Modeling
```bash
splat unigrams filename # List all Unigrams
splat bigrams filename # List all Bigrams
splat trigrams filename # List all Trigrams
splat ngrams filename n # List all n-grams
```
## Functionality \& Usage

- - - -
## Annotation Functionality \& Usage
```bash
splat annotate filename # Semi-Automatically annotate the Utterances
```
Coming Soon!

- - - -
## Acknowledgments
I would like to thank Emily Prud'hommeaux and Cissi Ovesdotter-Alm for their guidance during my initial development process. I would also like to thank Bryan Meyers, my brother, for letting me bounce ideas off of him, and for giving me wake-up calls when I was doing something in the less-than-intelligent (stupid) way.

| Name | Email | Website | GitHub |
|-----|-----|-----|-----|
| Emily Prud'hommeaux | < <[email protected]> > | < [CLaSP](http://www.rit.edu/clasp/people.html) > | |
| Cissi O. Alm | < <[email protected]> > | < [CLaSP](http://www.rit.edu/clasp/people.html) > | |
| Bryan T. Meyers | < <[email protected]> > | < [DataDrake](http://www.datadrake.com/) > | < [GitHub](https://github.com/DataDrake) > |
See [Acknowledgments](http://splat-library.org/#section5).

- - - -
## Licensing
File renamed without changes
File renamed without changes
1 change: 0 additions & 1 deletion requirements.txt
@@ -1,3 +1,2 @@
nltk
matplotlib
jsonpickle
6 changes: 3 additions & 3 deletions setup.py
@@ -2,7 +2,7 @@

setup(
name='SPLAT-library',
version='0.3.7',
version='0.3.8',
description='Speech Processing & Linguistic Analysis Tool',
long_description="SPLAT is a command-line application designed to make it easy for linguists (both computer-oriented and non-computer-oriented) to use the Natural Language Tool Kit (NLTK) for analyzing virtually any text file.\n\nSPLAT is designed to help you gather linguistic features from text files and it is assumed that most input files will not be already annotated. In order for SPLAT to function properly, you should ensure that the input files that you provide do not contain any annotations. Because there are so many variations of linguistic annotation schemes, it would simply be impossible to account for all of them in the initial parsing of input files; it is easier for you to remove any existing annotations than it is for me to do so.",
url='http://splat-library.org',
@@ -23,8 +23,8 @@
'splat.taggers',
'splat.tokenizers'
],
download_url='https://github.com/meyersbs/SPLAT/archive/v0.3.7.tar.gz',
requires=['matplotlib', 'nltk', 'jsonpickle'],
download_url='https://github.com/meyersbs/SPLAT/archive/v0.3.8.tar.gz',
requires=['matplotlib', 'nltk'],
classifiers=[
'Development Status :: 3 - Alpha',
'Intended Audience :: End Users/Desktop',
13 changes: 12 additions & 1 deletion splat/SPLAT.py
@@ -425,7 +425,6 @@ def treestrings(self):
""" Returns a list of parse trees. """
if self.__treestrings is None:
self.__treestrings = TreeStringParser().get_parse_trees(self.__utterances)
#print("Treestrings: " + str(self.__treestrings))
return self.__treestrings

def drawtrees(self):
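The `treestrings` hunk above shows the lazy-initialization pattern used throughout `SPLAT.py`: an expensive computation runs on first access and is cached in a private field thereafter. A minimal sketch of the same pattern (the `Document` class, its fake parser, and the `parse_calls` counter are hypothetical stand-ins, not SPLAT's API):

```python
class Document:
    """Caches an expensive computation the first time it is requested."""

    def __init__(self, utterances):
        self._utterances = utterances
        self._treestrings = None  # not computed yet
        self.parse_calls = 0      # instrumentation for this sketch only

    def treestrings(self):
        """Compute on first access, then return the cached result."""
        if self._treestrings is None:
            self.parse_calls += 1
            # stand-in for TreeStringParser().get_parse_trees(...)
            self._treestrings = ["(S {})".format(u) for u in self._utterances]
        return self._treestrings

doc = Document(["hello world", "goodbye"])
first = doc.treestrings()
second = doc.treestrings()
assert first is second       # same cached list is returned
assert doc.parse_calls == 1  # the parser ran only once
```

The trade-off is that the cached value never refreshes; that is fine here because a SPLAT's utterances do not change after construction.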
@@ -540,20 +539,32 @@ def dis(self):
def splat(self):
return self.__splat

def __str__(self):
""" Equivalent to Java's toString(). """
return self.splat()

##### JSON SERIALIZATION ###########################################################################################

def dump(self, out_file):
""" Dumps the JSON dictionary of this SPLAT to the specified file. """
json.dump(self.__dict__, out_file, default=jdefault)

def dumps(self):
""" Returns a string representation of the JSON dictionary for this SPLAT. """
return json.dumps(self.__dict__)

def load(self, in_file):
""" Given a file containing a JSON dictionary of a SPLAT, load that dictionary into a new SPLAT object. """
self.__dict__ = json.load(in_file)

def loads(self, data_str):
""" Given a string containing a JSON dictionary of a SPLAT, load that dictionary into a new SPLAT object. """
self.__dict__ = json.loads(data_str)


def jdefault(o):
"""
By default, JSON serialization doesn't do what I want it to do, so we have to explicitly tell it to serialize the
Python dictionary representation of the SPLAT object.
"""
return o.__dict__
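The new `dump`/`dumps`/`load`/`loads` methods round-trip the object's `__dict__` through JSON, with `jdefault` supplied as the `default=` hook so nested objects the `json` module cannot serialize natively fall back to their attribute dictionaries. A minimal sketch of that pattern under a hypothetical `Note` class (not part of SPLAT):

```python
import json

def jdefault(o):
    # Fall back to the object's attribute dictionary, as SPLAT.py does.
    return o.__dict__

class Note:
    def __init__(self, text="", tags=None):
        self.text = text
        self.tags = tags if tags is not None else []

    def dumps(self):
        """Serialize this object's attributes to a JSON string."""
        return json.dumps(self.__dict__, default=jdefault)

    def loads(self, data_str):
        """Replace this object's attributes with the parsed JSON dictionary."""
        self.__dict__ = json.loads(data_str)

a = Note("hello", ["demo"])
b = Note()
b.loads(a.dumps())
assert b.text == "hello" and b.tags == ["demo"]
```

One caveat of the `self.__dict__ = ...` assignment: everything comes back as plain dicts, lists, and strings, so any nested objects lose their original classes on load.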
12 changes: 8 additions & 4 deletions splat/__init__.py
@@ -34,11 +34,13 @@
except ImportError:
print("Oops! It looks like some essential NLTK data was not downloaded. Let's fix that.")
print("Downloading NLTK data...")
status = subprocess.call(["python3", "-m", "nltk.downloader", "stopwords", "names", "brown", "cmudict", "punkt", "averaged_perceptron_tagger"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
status = subprocess.call(["python3", "-m", "nltk.downloader", "stopwords", "names", "brown", "cmudict", "punkt",
"averaged_perceptron_tagger"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if status == 0:
print("Essential NLTK data was successfully downloaded!")
else:
print("Hmm... I couldn't download the essential NLTK data for you. I suggest running this command:\n\tpython3 -m nltk.downloader stopwords names punkt averaged_perceptron_tagger")
        print("Hmm... I couldn't download the essential NLTK data for you. I suggest running this command:\n\tpython3 "
              "-m nltk.downloader stopwords names punkt averaged_perceptron_tagger")

try:
import matplotlib
@@ -49,8 +51,10 @@
if status == 0:
print("matplotlib was successfully installed!")
else:
print("Hmm... I couldn't install matplotlib for you. You probably don't have root privileges. I suggest running this command:\n\tsudo pip3 install matplotlib")
        print("Hmm... I couldn't install matplotlib for you. You probably don't have root privileges. I suggest running "
              "this command:\n\tsudo pip3 install matplotlib")

java_status = subprocess.call(["which", "java"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if java_status != 0:
print("Java is not installed on your system. Java needs to be installed in order for me to do any part-of-speech tagging.\n\nPlease install java and try again.")
    print("Java is not installed on your system. Java needs to be installed in order for me to do any part-of-speech "
          "tagging.\n\nPlease install java and try again.")
26 changes: 12 additions & 14 deletions splat/complexity/idea_density.py
@@ -1,4 +1,7 @@
import sys, re
#!/usr/bin/env python3

##### PYTHON IMPORTS ###################################################################################################
import re

########################################################################################################################
##### INFORMATION ######################################################################################################
@@ -11,21 +14,16 @@
########################################################################################################################
########################################################################################################################

##### GLOBAL VARIABLES #################################################################################################

# global ADJ, ADV, VERB, NOUN, INTERR, PROP, FILLER, BE, NT, COMEGO, AUX
# global BEING, BECOMING, SEEMING, LINKING, CLINKING, CORREL, NEGPOL1, NEGPOL2

##### WORD CLASSES #####################################################################################################

ADJ = ["JJ", "JJR", "JJS"]
ADV = ["RB", "RBR", "RBS", "WRB"]
VERB = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "BES"]
VRB = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "BES"]
NOUN = ["NN", "NNS", "NNP", "NNPS"]
INTERR = ["WDT", "WP", "WPS", "WRB"]

# By default, words tagged with one of these parts-of-speech are considered propositions.
PROP = ADJ + ADV + VERB + INTERR + ["CC", "CD", "DT", "IN", "PDT", "POS", "PRP$", "PP$", "TO"]
PROP = ADJ + ADV + VRB + INTERR + ["CC", "CD", "DT", "IN", "PDT", "POS", "PRP$", "PP$", "TO"]

# A sentence consisting wholly of 'non-propositional fillers' is considered to be propositionless.
FILLER = ["and", "or", "but", "if", "that", "just", "you", "know"]
Expand Down Expand Up @@ -130,7 +128,7 @@ def apply_counting_rules(word_list, speech_mode=False):
True when analyzing transcribed speech that contains repetitions and filler words. It may result in undercounting
of well-edited English.
"""
global ADJ, ADV, VERB, NOUN, INTERR, PROP, FILLER, BE, NT, COMEGO, AUX, BEING, BECOMING, SEEMING, LINKING,\
global ADJ, ADV, VRB, NOUN, INTERR, PROP, FILLER, BE, NT, COMEGO, AUX, BEING, BECOMING, SEEMING, LINKING,\
CLINKING, CORREL, NEGPOL1, NEGPOL2, PUNCT

"""
Expand Down Expand Up @@ -255,7 +253,7 @@ def apply_counting_rules(word_list, speech_mode=False):

##### RULE 054 #####
## "that/DT" or "this/DT" is a pronoun, not a determiner, if the following word is a verb or an adverb.
if (word_list[i-1].token == "that" or word_list[i-1].token == "this") and (contains(VERB, word_list[i].tag) or
if (word_list[i-1].token == "that" or word_list[i-1].token == "this") and (contains(VRB, word_list[i].tag) or
contains(ADV, word_list[i].tag)):
word_list[i-1].tag = "PRP"
word_list[i-1].rulenumber = 54
@@ -276,7 +274,7 @@
dest = i # destination
while dest < len(word_list) - 1:
dest += 1
if word_list[dest].tag == "." or contains(VERB, word_list[dest].tag): break
if word_list[dest].tag == "." or contains(VRB, word_list[dest].tag): break
if dest > (i + 1):
word_list.insert(dest, WordObj(word_list[i].token, word_list[i].tag, True, True, 101))
word_list[i].tag = ""
Expand Down Expand Up @@ -383,7 +381,7 @@ def apply_counting_rules(word_list, speech_mode=False):

##### RULE 213 #####
## The bigram 'going to' is not a proposition when it immediately precedes a verb.
if contains(VERB, word_list[i].tag) and word_list[i-1].token == "to" and word_list[i-2].token == "going":
if contains(VRB, word_list[i].tag) and word_list[i-1].token == "to" and word_list[i-2].token == "going":
word_list[i-1].isprop = False
word_list[i-1].rulenumber = 213
word_list[i-2].isprop = False
Expand Down Expand Up @@ -471,14 +469,14 @@ def apply_counting_rules(word_list, speech_mode=False):

##### RULE 402 #####
## Bigrams of the form 'AUX VERB' are considered one proposition, not two.
if contains(VERB, word_list[i].tag) and contains(AUX, word_list[i-1].token):
if contains(VRB, word_list[i].tag) and contains(AUX, word_list[i-1].token):
word_list[i-1].isprop = False
word_list[i-1].rulenumber = 402

##### RULE 405 #####
## In trigrams of the form 'AUX NOT VERB', NOT and VERB are tagged as propositions. The same is true for
## trigrams of the form 'AUX ADV VERB'. For example: 'had always sung', 'would rather go'.
if (contains(VERB, word_list[i].tag) and (word_list[i-1].tag == "NOT") or
if (contains(VRB, word_list[i].tag) and (word_list[i-1].tag == "NOT") or
(contains(ADV, word_list[i-1].tag)) and contains(AUX, word_list[i-2].token)):
word_list[i-2].isprop = False
word_list[i-2].rulenumber = 405
