
Refactoring and cleanup.
meyersbs committed Dec 15, 2016
1 parent 991bc99 commit 1f22d9d
Showing 9 changed files with 88 additions and 179 deletions.
104 changes: 8 additions & 96 deletions README.md
@@ -1,19 +1,13 @@
[![Build Status](https://travis-ci.org/meyersbs/SPLAT.svg?branch=master)](https://travis-ci.org/meyersbs/SPLAT) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](/LICENSE.md) [![codecov](https://codecov.io/gh/meyersbs/SPLAT/branch/master/graph/badge.svg)](https://codecov.io/gh/meyersbs/SPLAT)
[![PyPI](https://img.shields.io/pypi/pyversions/SPLAT-library.svg?maxAge=2592000)](https://pypi.python.org/pypi/SPLAT-library/0.3.7) [![PyPI](https://img.shields.io/pypi/v/SPLAT-library.svg?maxAge=2592000)](https://pypi.python.org/pypi/SPLAT-library/0.3.7) [![Website](https://img.shields.io/website-up-down-green-red/http/splat-library.org.svg?maxAge=2592000)](http://splat-library.org/)

<img src="https://cdn.rawgit.com/meyersbs/SPLAT/master/logo.svg" width="20%">
<img src="https://cdn.rawgit.com/meyersbs/SPLAT/master/docs/logo.svg" width="20%">
<br>
<img src="https://cdn.rawgit.com/meyersbs/SPLAT/master/tag.svg" width="60%">

# <em>WARNING!</em>
Errors have been found in the calculation of some metrics. Calculations of the following metrics may be inaccurate:
* Idea Density
* Content Density
* Syllable Counts
<img src="https://cdn.rawgit.com/meyersbs/SPLAT/master/docs/tag.svg" width="60%">

- - - -
## Contact Information
&nbsp;&nbsp;&nbsp;&nbsp;Benjamin S. Meyers < <[email protected]> >
&nbsp;&nbsp;&nbsp;&nbsp;Benjamin S. Meyers <[[email protected]](mailto:[email protected])>

- - - -
## Project Description
@@ -22,7 +16,7 @@ SPLAT is a command-line application designed to make it easy for linguists (both
SPLAT is designed to help you gather linguistic features from text files and it is assumed that most input files will not be already annotated. In order for SPLAT to function properly, you should ensure that the input files that you provide do not contain any annotations. Because there are so many variations of linguistic annotation schemes, it would simply be impossible to account for all of them in the initial parsing of input files; it is easier for you to remove any existing annotations than it is for me to do so.

- - - -
## System Requirementsgit
## System Requirements
SPLAT is being developed and tested on 64-bit Ubuntu 15.10 with Python 3.4.3. Minimum requirements include:
* Python 3.4 or Later
* NLTK 3.1 or Later
@@ -34,11 +28,6 @@ SPLAT is being developed and tested on 64-bit Ubuntu 15.10 with Python 3.4.3. Mi
2. Run the following in a command line:
``` bash
pip3 install SPLAT-library

# Recommended, but not required.
echo 'alias splat="splat-cli"' >> ~/.bashrc
echo 'alias splat="splat-cli"' >> ~/.bash_profile
source .bashrc
```

To uninstall run the following in a command line.
@@ -53,95 +42,18 @@ To uninstall run the following in a command line.
splat --help # Provide helpful information
splat --info # Display version and copyright information
splat --usage # Display basic command line structure
splat bubble filename # Display the raw text from the file
splat splat filename # Display the raw text from the file
```

- - - -
## Analysis Functionality \& Usage
#### Types \& Tokens
```bash
splat tokens filename # List all Tokens
splat types filename # List all Types
splat ttr filename # Calculate Type-Token Ratio
splat wc filename # Word Count (Token Count)
splat uwc filename # Unique Word Count (Type Count)
```
##### Parts-Of-Speech
```bash
splat pos filename # List Tokens with their Parts-Of-Speech
splat poscounts filename # List Part-Of-Speech Tags with their Frequencies
```
#### Syntactic Complexity
```bash
splat cdensity filename # Calculate Content-Density
splat idensity filename # Calculate Idea Density
splat flesch filename # Calculate Flesch Readability Ease
splat kincaid filename # Calculate Flesch-Kincaid Grade Level
splat yngve filename # Calculate Yngve-Score
splat frazier filename # Calculate Frazier-Score
```
#### Listing Content \& Function Words
```bash
splat function filename # List all Function Words
splat content filename # List all Content Words
splat ufunction filename # Unique Function Words
splat ucontent filename # Unique Content Words
splat cfr filename # Calculate Content-Function Ratio
```
#### Utterances \& Sentences
```bash
splat utts filename # List all Utterances
splat sents filename # List all Sentences
splat alu filename # Average Utterance Length
splat als filename # Average Sentence Length
splat uttcount filename # Utterance Count
splat sentcount filename # Sentence Count
splat syllables filename # Display Number of Syllables
splat wpu filename # List the Number of Words in each Utterance
splat wps filename # List the number of Words in each Sentence
```
#### Frequency Distributions
```bash
splat mostfreq filename x # List the x Most Frequent Words
splat leastfreq filename x # List the x Least Frequent Words
splat plotfreq filename x # Draw and Display a Frequency Graph
```
#### Disfluencies
```bash
splat disfluencies filename # Calculate various Disfluency Counts
splat dpa filename # List the Number of Disfluencies per each Dialog Act
splat dpu filename # List the Number of Disfluencies in each Utterance
splat dps filename # List the Number of Disfluencies in each Sentence
```
#### Syntactic Parsing
```bash
splat trees filename # List Parse-Tree Strings for each Utterance
splat maxdepth filename # Calculate Max Tree Depth
splat drawtrees filename # Draw Parse Trees
```
#### Language Modeling
```bash
splat unigrams filename # List all Unigrams
splat bigrams filename # List all Bigrams
splat trigrams filename # List all Trigrams
splat ngrams filename n # List all n-grams
```
## Functionality \& Usage

- - - -
## Annotation Functionality \& Usage
```bash
splat annotate filename # Semi-Automatically annotate the Utterances
```
Coming Soon!

- - - -
## Acknowledgments
I would like to thank Emily Prud'hommeaux and Cissi Ovesdotter-Alm for their guidance during my initial development process. I would also like to thank Bryan Meyers, my brother, for letting me bounce ideas off of him, and for giving me wake-up calls when I was doing something in the less-than-intelligent (stupid) way.

| Name | Email | Website | GitHub |
|-----|-----|-----|-----|
| Emily Prud'hommeaux | < <[email protected]> > | < [CLaSP](http://www.rit.edu/clasp/people.html) > | |
| Cissi O. Alm | < <[email protected]> > | < [CLaSP](http://www.rit.edu/clasp/people.html) > | |
| Bryan T. Meyers | < <[email protected]> > | < [DataDrake](http://www.datadrake.com/) > | < [GitHub](https://github.com/DataDrake) > |
See [Acknowledgments](http://splat-library.org/#section5).

- - - -
## Licensing
File renamed without changes
File renamed without changes
1 change: 0 additions & 1 deletion requirements.txt
@@ -1,3 +1,2 @@
nltk
matplotlib
jsonpickle
6 changes: 3 additions & 3 deletions setup.py
@@ -2,7 +2,7 @@

setup(
name='SPLAT-library',
version='0.3.7',
version='0.3.8',
description='Speech Processing & Linguistic Analysis Tool',
long_description="SPLAT is a command-line application designed to make it easy for linguists (both computer-oriented and non-computer-oriented) to use the Natural Language Tool Kit (NLTK) for analyzing virtually any text file.\n\nSPLAT is designed to help you gather linguistic features from text files and it is assumed that most input files will not be already annotated. In order for SPLAT to function properly, you should ensure that the input files that you provide do not contain any annotations. Because there are so many variations of linguistic annotation schemes, it would simply be impossible to account for all of them in the initial parsing of input files; it is easier for you to remove any existing annotations than it is for me to do so.",
url='http://splat-library.org',
@@ -23,8 +23,8 @@
'splat.taggers',
'splat.tokenizers'
],
download_url='https://github.com/meyersbs/SPLAT/archive/v0.3.7.tar.gz',
requires=['matplotlib', 'nltk', 'jsonpickle'],
download_url='https://github.com/meyersbs/SPLAT/archive/v0.3.8.tar.gz',
requires=['matplotlib', 'nltk'],
classifiers=[
'Development Status :: 3 - Alpha',
'Intended Audience :: End Users/Desktop',
13 changes: 12 additions & 1 deletion splat/SPLAT.py
@@ -425,7 +425,6 @@ def treestrings(self):
""" Returns a list of parse trees. """
if self.__treestrings is None:
self.__treestrings = TreeStringParser().get_parse_trees(self.__utterances)
#print("Treestrings: " + str(self.__treestrings))
return self.__treestrings

def drawtrees(self):
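The `treestrings` hunk above shows the lazy-initialization pattern used throughout `SPLAT.py`: an expensive computation runs on first access and is cached in a private field thereafter. A minimal sketch of the same pattern (the `Document` class, its fake parser, and the `parse_calls` counter are hypothetical stand-ins, not SPLAT's API):

```python
class Document:
    """Caches an expensive computation the first time it is requested."""

    def __init__(self, utterances):
        self._utterances = utterances
        self._treestrings = None  # not computed yet
        self.parse_calls = 0      # instrumentation for this sketch only

    def treestrings(self):
        """Compute on first access, then return the cached result."""
        if self._treestrings is None:
            self.parse_calls += 1
            # stand-in for TreeStringParser().get_parse_trees(...)
            self._treestrings = ["(S {})".format(u) for u in self._utterances]
        return self._treestrings

doc = Document(["hello world", "goodbye"])
first = doc.treestrings()
second = doc.treestrings()
assert first is second       # same cached list is returned
assert doc.parse_calls == 1  # the parser ran only once
```

The trade-off is that the cached value never refreshes; that is fine here because a SPLAT's utterances do not change after construction.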
@@ -540,20 +539,32 @@ def dis(self):
def splat(self):
return self.__splat

def __str__(self):
""" Equivalent to Java's toString(). """
return self.splat()

##### JSON SERIALIZATION ###########################################################################################

def dump(self, out_file):
""" Dumps the JSON dictionary of this SPLAT to the specified file. """
json.dump(self.__dict__, out_file, default=jdefault)

def dumps(self):
""" Returns a string representation of the JSON dictionary for this SPLAT. """
return json.dumps(self.__dict__)

def load(self, in_file):
""" Given a file containing a JSON dictionary of a SPLAT, load that dictionary into a new SPLAT object. """
self.__dict__ = json.load(in_file)

def loads(self, data_str):
""" Given a string containing a JSON dictionary of a SPLAT, load that dictionary into a new SPLAT object. """
self.__dict__ = json.loads(data_str)


def jdefault(o):
"""
By default, JSON serialization doesn't do what I want it to do, so we have to explicitly tell it to serialize the
Python dictionary representation of the SPLAT object.
"""
return o.__dict__
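The new `dump`/`dumps`/`load`/`loads` methods round-trip the object's `__dict__` through JSON, with `jdefault` supplied as the `default=` hook so nested objects the `json` module cannot serialize natively fall back to their attribute dictionaries. A minimal sketch of that pattern under a hypothetical `Note` class (not part of SPLAT):

```python
import json

def jdefault(o):
    # Fall back to the object's attribute dictionary, as SPLAT.py does.
    return o.__dict__

class Note:
    def __init__(self, text="", tags=None):
        self.text = text
        self.tags = tags if tags is not None else []

    def dumps(self):
        """Serialize this object's attributes to a JSON string."""
        return json.dumps(self.__dict__, default=jdefault)

    def loads(self, data_str):
        """Replace this object's attributes with the parsed JSON dictionary."""
        self.__dict__ = json.loads(data_str)

a = Note("hello", ["demo"])
b = Note()
b.loads(a.dumps())
assert b.text == "hello" and b.tags == ["demo"]
```

One caveat of the `self.__dict__ = ...` assignment: everything comes back as plain dicts, lists, and strings, so any nested objects lose their original classes on load.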
12 changes: 8 additions & 4 deletions splat/__init__.py
@@ -34,11 +34,13 @@
except ImportError:
print("Oops! It looks like some essential NLTK data was not downloaded. Let's fix that.")
print("Downloading NLTK data...")
status = subprocess.call(["python3", "-m", "nltk.downloader", "stopwords", "names", "brown", "cmudict", "punkt", "averaged_perceptron_tagger"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
status = subprocess.call(["python3", "-m", "nltk.downloader", "stopwords", "names", "brown", "cmudict", "punkt",
"averaged_perceptron_tagger"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if status == 0:
print("Essential NLTK data was successfully downloaded!")
else:
print("Hmm... I couldn't download the essential NLTK data for you. I suggest running this command:\n\tpython3 -m nltk.downloader stopwords names punkt averaged_perceptron_tagger")
        print("Hmm... I couldn't download the essential NLTK data for you. I suggest running this command:\n\tpython3 "
              "-m nltk.downloader stopwords names punkt averaged_perceptron_tagger")

try:
import matplotlib
@@ -49,8 +51,10 @@
if status == 0:
print("matplotlib was successfully installed!")
else:
print("Hmm... I couldn't install matplotlib for you. You probably don't have root privileges. I suggest running this command:\n\tsudo pip3 install matplotlib")
        print("Hmm... I couldn't install matplotlib for you. You probably don't have root privileges. I suggest running "
              "this command:\n\tsudo pip3 install matplotlib")

java_status = subprocess.call(["which", "java"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if java_status != 0:
print("Java is not installed on your system. Java needs to be installed in order for me to do any part-of-speech tagging.\n\nPlease install java and try again.")
    print("Java is not installed on your system. Java needs to be installed in order for me to do any part-of-speech "
          "tagging.\n\nPlease install java and try again.")
26 changes: 12 additions & 14 deletions splat/complexity/idea_density.py
@@ -1,4 +1,7 @@
import sys, re
#!/usr/bin/env python3

##### PYTHON IMPORTS ###################################################################################################
import re

########################################################################################################################
##### INFORMATION ######################################################################################################
@@ -11,21 +14,16 @@
########################################################################################################################
########################################################################################################################

##### GLOBAL VARIABLES #################################################################################################

# global ADJ, ADV, VERB, NOUN, INTERR, PROP, FILLER, BE, NT, COMEGO, AUX
# global BEING, BECOMING, SEEMING, LINKING, CLINKING, CORREL, NEGPOL1, NEGPOL2

##### WORD CLASSES #####################################################################################################

ADJ = ["JJ", "JJR", "JJS"]
ADV = ["RB", "RBR", "RBS", "WRB"]
VERB = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "BES"]
VRB = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "BES"]
NOUN = ["NN", "NNS", "NNP", "NNPS"]
INTERR = ["WDT", "WP", "WPS", "WRB"]

# By default, words tagged with one of these parts-of-speech are considered propositions.
PROP = ADJ + ADV + VERB + INTERR + ["CC", "CD", "DT", "IN", "PDT", "POS", "PRP$", "PP$", "TO"]
PROP = ADJ + ADV + VRB + INTERR + ["CC", "CD", "DT", "IN", "PDT", "POS", "PRP$", "PP$", "TO"]

# A sentence consisting wholly of 'non-propositional fillers' is considered to be propositionless.
FILLER = ["and", "or", "but", "if", "that", "just", "you", "know"]
Expand Down Expand Up @@ -130,7 +128,7 @@ def apply_counting_rules(word_list, speech_mode=False):
True when analyzing transcribed speech that contains repetitions and filler words. It may result in undercounting
of well-edited English.
"""
global ADJ, ADV, VERB, NOUN, INTERR, PROP, FILLER, BE, NT, COMEGO, AUX, BEING, BECOMING, SEEMING, LINKING,\
global ADJ, ADV, VRB, NOUN, INTERR, PROP, FILLER, BE, NT, COMEGO, AUX, BEING, BECOMING, SEEMING, LINKING,\
CLINKING, CORREL, NEGPOL1, NEGPOL2, PUNCT

"""
Expand Down Expand Up @@ -255,7 +253,7 @@ def apply_counting_rules(word_list, speech_mode=False):

##### RULE 054 #####
## "that/DT" or "this/DT" is a pronoun, not a determiner, if the following word is a verb or an adverb.
if (word_list[i-1].token == "that" or word_list[i-1].token == "this") and (contains(VERB, word_list[i].tag) or
if (word_list[i-1].token == "that" or word_list[i-1].token == "this") and (contains(VRB, word_list[i].tag) or
contains(ADV, word_list[i].tag)):
word_list[i-1].tag = "PRP"
word_list[i-1].rulenumber = 54
@@ -276,7 +274,7 @@
dest = i # destination
while dest < len(word_list) - 1:
dest += 1
if word_list[dest].tag == "." or contains(VERB, word_list[dest].tag): break
if word_list[dest].tag == "." or contains(VRB, word_list[dest].tag): break
if dest > (i + 1):
word_list.insert(dest, WordObj(word_list[i].token, word_list[i].tag, True, True, 101))
word_list[i].tag = ""
Expand Down Expand Up @@ -383,7 +381,7 @@ def apply_counting_rules(word_list, speech_mode=False):

##### RULE 213 #####
## The bigram 'going to' is not a proposition when it immediately precedes a verb.
if contains(VERB, word_list[i].tag) and word_list[i-1].token == "to" and word_list[i-2].token == "going":
if contains(VRB, word_list[i].tag) and word_list[i-1].token == "to" and word_list[i-2].token == "going":
word_list[i-1].isprop = False
word_list[i-1].rulenumber = 213
word_list[i-2].isprop = False
Expand Down Expand Up @@ -471,14 +469,14 @@ def apply_counting_rules(word_list, speech_mode=False):

##### RULE 402 #####
## Bigrams of the form 'AUX VERB' are considered one proposition, not two.
if contains(VERB, word_list[i].tag) and contains(AUX, word_list[i-1].token):
if contains(VRB, word_list[i].tag) and contains(AUX, word_list[i-1].token):
word_list[i-1].isprop = False
word_list[i-1].rulenumber = 402

##### RULE 405 #####
## In trigrams of the form 'AUX NOT VERB', NOT and VERB are tagged as propositions. The same is true for
## trigrams of the form 'AUX ADV VERB'. For example: 'had always sung', 'would rather go'.
if (contains(VERB, word_list[i].tag) and (word_list[i-1].tag == "NOT") or
if (contains(VRB, word_list[i].tag) and (word_list[i-1].tag == "NOT") or
(contains(ADV, word_list[i-1].tag)) and contains(AUX, word_list[i-2].token)):
word_list[i-2].isprop = False
word_list[i-2].rulenumber = 405
