---
title: "Data Science Capstone Milestone Report"
author: "Rob Moore"
date: "December 28, 2015"
output: html_document
---
# Data Provenance
The data used for this project come from the [HC Corpora project](http://www.corpora.heliohost.org/),
presumably run by its namesake, Hans Christensen. The goal of HC Corpora is to collect examples of
languages as they are currently used by crawling the web and gathering text from publicly available
sources, namely blogs, Twitter feeds, and newspaper stories. The collection currently includes corpora
for over 60 languages or variants (such as US or UK English).
By design, the HC Corpora project retains only about 50% of the original source data. Although the
crawler attempts to capture material solely in the target language, the project readily acknowledges
that other languages typically appear in a given language's corpus. This can happen accidentally, when
the crawler misidentifies the language, or deliberately, when another language is 'borrowed' as part
of ordinary expression.
We are tasked with using the US English corpus as the basis of a model that suggests the next word in
a sentence given the words that precede it, in order to aid a user who is, for example, writing text
on a mobile phone, where typing can be tedious and error-prone.
The US English corpus is made up of three files, each of which contains a distinct source of data: blogs,
Twitter, or newspaper stories. The HC Corpora project provides the [following approximate statistics](http://www.corpora.heliohost.org/statistics.html#EnglishUSCorpus)
for the US English corpus:

| Source    | Words       | Characters  | Letters     | Lines     | Mean Word Length | Mean Words/Line |
|-----------|-------------|-------------|-------------|-----------|------------------|-----------------|
| Blogs     | 37,242,000  | 206,824,000 | 163,815,000 | 899,000   | 4.4              | 41.41           |
| Newspapers| 34,275,000  | 203,223,000 | 162,803,000 | 1,010,000 | 4.75             | 33.93           |
| Twitter   | 29,876,000  | 162,122,000 | 125,998,000 | 2,360,000 | 4.22             | 12.66           |
| Total     | 101,393,000 | 572,170,000 | 452,617,000 | 4,269,000 | 4.46             | 23.75           |

The three formats differ in the accepted or enforced norms of their respective media. For example, Twitter's 140-character limit forces short messages, with a mean of 12.66 words per line. Newspapers, likewise,
are typically restricted in story length by space constraints and by concerns such as reader attention and
the screen real estate reserved for advertisements. As a result, the average length of a newspaper article
is well above that of a Twitter message but nearly a quarter less than that of a blog entry, where writers
are given free rein to express themselves and where the expression "TL;DR" may have originated.

# Data Pre-processing
After downloading and extracting the corpus zip file, we used the `sample_lines` function from
the [LaF package](https://cran.r-project.org/web/packages/LaF/index.html) to take a random sample of 60% of the lines in each
of the files `final/en_US/en_US.twitter.txt`, `final/en_US/en_US.blogs.txt`, and `final/en_US/en_US.news.txt`. While the original source data identify the site each text came from, the collection date, the item type, and a categorization, the files we were provided contain only the text itself.
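
As a rough sketch of this step (the `sample_source` helper and the object names are ours for illustration, not taken from the original script), the sampling might look like this:

```{r sampling-sketch, eval = FALSE}
# Illustrative sketch of the sampling step: draw a ~60% random sample of lines
# from each source file without reading the whole file into memory.
library(LaF)

sample_source <- function(path, fraction = 0.6) {
  n_total <- determine_nlines(path)   # count lines in the file
  sample_lines(path, n = round(fraction * n_total), nlines = n_total)
}

tweets <- sample_source("final/en_US/en_US.twitter.txt")
blogs  <- sample_source("final/en_US/en_US.blogs.txt")
news   <- sample_source("final/en_US/en_US.news.txt")
```
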
Much of the text processing was done using the [Quanteda package](https://cran.r-project.org/web/packages/quanteda/index.html). We used this package to tokenize the texts into three separate variables and kept this separation throughout the rest of the processing. We tokenized each data source into sentences to aid in the use of sentence markers, as discussed below.
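
A minimal sketch of the sentence segmentation, using the current quanteda interface (which may differ from the release we used), with toy input standing in for the sampled lines:

```{r sentence-sketch, eval = FALSE}
# Segment raw lines into sentences and flatten the result into a character
# vector, one element per sentence.
library(quanteda)

raw_lines <- c("This is a tweet. It has two sentences.",
               "Another line with a single sentence.")

sentences <- unlist(as.list(tokens(raw_lines, what = "sentence")),
                    use.names = FALSE)
sentences
```
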
After tokenization, we filtered out profanities using a [list](https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) taken from Shutterstock, applying regular expressions to swap in a dummy token ("_UNK_") for each objectionable word so that its surrounding context was preserved. We followed the technique detailed by Jurafsky and Martin in [chapter 4](https://web.stanford.edu/~jurafsky/slp3/slides/4_LM.pdf) of their [_Speech and Language Processing_](https://web.stanford.edu/~jurafsky/slp3/). The approach they describe is for a different application, namely the substitution of unknown words when using a fixed or 'discovered' vocabulary, but it seemed to lend itself well to this task for the same reasons.
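
A sketch of the substitution, with a tiny stand-in word list in place of the Shutterstock file:

```{r profanity-sketch, eval = FALSE}
# Replace each objectionable word with the "_UNK_" placeholder so the
# surrounding context is preserved. The two-word list is a stand-in for the
# Shutterstock list.
badwords <- c("badword1", "badword2")
pattern  <- paste0("\\b(", paste(badwords, collapse = "|"), ")\\b")

sentences <- c("this badword1 stays in context", "a clean sentence")
gsub(pattern, "_UNK_", sentences, ignore.case = TRUE, perl = TRUE)
```
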
Because profanity removal is computationally intensive, we employed the [parallel package](https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf) to make use of the multiple CPUs in our equipment. We also found that setting the `perl` parameter in `gsub` yielded a significant performance improvement over the default POSIX regular-expression engine.
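
A hedged sketch of the parallelisation follows; `mclapply` forks one worker per core, and `pattern` and `sentences` are carried over from the previous sketch (on Windows a cluster-based function such as `parLapply` would be needed instead):

```{r parallel-sketch, eval = FALSE}
# Split the sentences into one chunk per core and run the substitution in
# parallel. `pattern` and `sentences` are as in the previous sketch.
library(parallel)

mask_profanity <- function(x) gsub(pattern, "_UNK_", x, ignore.case = TRUE, perl = TRUE)

n_cores <- detectCores()
chunks  <- split(sentences, cut(seq_along(sentences), n_cores, labels = FALSE))
cleaned <- unlist(mclapply(chunks, mask_profanity, mc.cores = n_cores),
                  use.names = FALSE)
```
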
We again used the Quanteda package to create unigrams, bigrams, and trigrams for each text source. Before building the bigrams and trigrams, we inserted beginning- and end-of-sentence tokens as specified by Jurafsky and Martin (see above). Quanteda stores results in document-feature matrices (DFMs), which provided a means to store the ngram data efficiently and to query it for frequency counts by feature.
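
A sketch of this step using the current quanteda interface (`tokens_ngrams()` plus `dfm()`; the release we used exposed the same functionality through an `ngrams` argument). The marker spelling and the toy sentences are illustrative:

```{r ngram-sketch, eval = FALSE}
# Add beginning/end-of-sentence markers, tokenize into words, and build
# document-feature matrices of unigrams, bigrams, and trigrams.
library(quanteda)

sentences <- c("the cat sat on the mat", "the dog barked")
marked    <- paste("BOS", sentences, "EOS")   # sentence boundary markers

toks       <- tokens(marked, what = "word")
unigramDfm <- dfm(toks)
bigramDfm  <- dfm(tokens_ngrams(toks, n = 2))
trigramDfm <- dfm(tokens_ngrams(toks, n = 3))
```
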
Finally, we extracted the frequency counts using the `topfeatures` function which, despite its name, can return the counts for all features in each DFM.
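
For example, asking for as many features as the DFM contains yields the full frequency table (`unigramDfm` as in the sketch above):

```{r frequency-sketch, eval = FALSE}
# Ask topfeatures() for every feature in the DFM to get a full, sorted
# frequency table rather than only the top few entries.
library(quanteda)

unigramFreq <- topfeatures(unigramDfm, n = nfeat(unigramDfm))
head(unigramFreq)
```
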
# Exploratory analysis
We kept the different text sources separate and looked at the data as independent units rather than as a whole. We analyzed each data source by evaluating the number of lines, sentences, tokens, and types found in it.
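
A sketch of how such counts can be computed with quanteda (toy input; the actual values reported below are loaded from precomputed results):

```{r counts-sketch, eval = FALSE}
# Count lines, sentences, tokens (N), and types (V) for one toy source.
library(quanteda)

raw_lines  <- c("This is a tweet. It has two sentences.", "Another line.")
sentences  <- unlist(as.list(tokens(raw_lines, what = "sentence")), use.names = FALSE)
toks       <- tokens(sentences, what = "word")

nLines     <- length(raw_lines)
nSentences <- length(sentences)
nTokens    <- sum(ntoken(toks))     # N: total running words
nTypes     <- nfeat(dfm(toks))      # V: distinct word forms
```
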
```{r chunk1, echo = FALSE}
load("all.RData")
options(scipen=12)
```
| Source    | Lines               | Sentences           | Tokens (N)       | Types (V)       |
|-----------|---------------------|---------------------|------------------|-----------------|
| Blogs     | `r nBlogLines`      | `r blogsSentences`  | `r blogsTokens`  | `r blogsTypes`  |
| Newspapers| `r nNewsLines`      | `r newsSentences`   | `r newsTokens`   | `r newsTypes`   |
| Twitter   | `r nTwitterLines`   | `r tweetsSentences` | `r tweetsTokens` | `r tweetsTypes` |
| Total     | `r nTotalLines`     | `r totalSentences`  | `r totalTokens`  | `r totalTypes`  |

Here we examine how term frequency differs across ngram types (unigrams, bigrams, and trigrams) for the Tweets. The frequency distributions for bigrams and trigrams are nearly identical.
```{r chunk2, echo = FALSE, fig.width = 3}
old.par <- par(mfrow=c(3, 1))
plot(tweetsUnigramFreq[1:100], log = "y", cex = .6, ylab = "Term frequency for Unigrams", xlab = "Rank")
plot(tweetsBigramFreq[1:100], log = "y", cex = .6, ylab = "Term frequency for Bigrams", xlab = "Rank")
plot(tweetsTrigramFreq[1:100], log = "y", cex = .6, ylab = "Term frequency for Trigrams", xlab = "Rank")
par(old.par)
```
Here we see the top unigrams for each data source (from the top: Tweets, blogs, and news), showing both the percentage of the full text that each of the top ten unigrams accounts for and the term-frequency distribution by rank.
```{r chunk3, echo = FALSE}
old.par <- par(mfrow=c(3,2))
tweetsUnigramPct <- tweetsUnigramFreq / tweetsTokens
plot(tweetsUnigramPct[1:10], type="b", xlab="Top Ten Unigrams for Tweets", ylab="Percentage of Full Text", xaxt ="n")
axis(1, 1:10, labels = names(tweetsUnigramPct[1:10]))
plot(tweetsUnigramFreq[1:100], log = "y", cex = .6, ylab = "Term frequency", xlab = "Rank")
blogsUnigramPct <- blogsUnigramFreq[1:10] / blogsTokens
plot(blogsUnigramPct[1:10], type="b", xlab="Top Ten Unigrams for Blogs", ylab="Percentage of Full Text", xaxt ="n")
axis(1, 1:10, labels = names(blogsUnigramPct[1:10]))
plot(blogsUnigramFreq[1:100], log = "y", cex = .6, ylab = "Term frequency", xlab = "Rank")
newsUnigramPct <- newsUnigramFreq[1:10] / newsTokens
plot(newsUnigramPct[1:10], type="b", xlab="Top Ten Unigrams for News", ylab="Percentage of Full Text", xaxt ="n")
axis(1, 1:10, labels = names(newsUnigramPct[1:10]))
plot(newsUnigramFreq[1:100], log = "y", cex = .6, ylab = "Term frequency", xlab = "Rank")
par(old.par)
```
As is obvious in the graphs below, the data follow Zipf's Law very closely. The law asserts that the frequency
of a word drops precipitously as its rank increases; in particular, the most frequent word occurs approximately
twice as often as the second most frequent word. When frequency is plotted against rank on a log-log scale,
the relationship appears as a straight line with a slope of roughly -1.
```{r chunk4, echo = FALSE, fig.cap="Zipf's Law in Action", fig.width = 3}
# demonstrate zipf's law - plot log frequency by log rank
old.par <- par(mfrow=c(3, 1))
plot(log10(1:100), log10(tweetsUnigramFreq[1:100]), xlab="log(rank)", ylab="log(frequency)", main = "Top 100 Tweet Words")
plot(log10(1:100), log10(blogsUnigramFreq[1:100]), xlab="log(rank)", ylab="log(frequency)", main = "Top 100 Blog Words")
plot(log10(1:100), log10(newsUnigramFreq[1:100]), xlab="log(rank)", ylab="log(frequency)", main = "Top 100 News Words")
par(old.par)
```
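
One quick sanity check on the claimed slope (not part of the original analysis) is to regress log frequency on log rank for the top 100 tweet unigrams; a coefficient near -1 supports the observation:

```{r zipf-slope-sketch, eval = FALSE}
# Fit a line to the log-log frequency/rank relationship for the top 100
# tweet unigrams and inspect the slope.
zipf_fit <- lm(log10(tweetsUnigramFreq[1:100]) ~ log10(1:100))
coef(zipf_fit)
```
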
# Future Directions
Our goal is fairly simple in that we would like to take the ngrams we produced above and create
a probabilistic model which will indicate the most likely candidates for the next word in the
sentence. Based on the data we have computed, we now have the basic counts required to calculate
these probabilities. We will use a conventional Markov Chain approach which allows us to narrow the 'candidate' word choices using the immediately preceding words in the sentence as clues. We have sufficient test data to allow for us to generate a perplexity measure which will provide confidence in the power of our model in usage with similar document types.
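
As a minimal illustration of how the counts turn into probabilities under such a Markov-chain model (the count vectors below are hypothetical, not drawn from our data), the conditional probability of the next word is the trigram count divided by the count of its bigram prefix:

```{r mle-sketch, eval = FALSE}
# Maximum-likelihood estimate: P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2).
trigramCounts <- c(i_love_you = 50, i_love_it = 30, i_love_pie = 5)
bigramCounts  <- c(i_love = 100)

p_next <- trigramCounts / bigramCounts["i_love"]
sort(p_next, decreasing = TRUE)   # most likely next words after "i love"
```
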
We are currently building out the algorithm to calculate these likelihoods and to correct for certain
issues that arise in the calculations, such as words that are absent from the corpus but are nevertheless
valid usage in the language. With such a large corpus, we anticipate being able to use a simpler approach
to accommodating these distortions in the data (see [Large Language Models in Machine Translation](http://www.aclweb.org/anthology/D07-1090.pdf) for more detail).
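
The simpler approach we have in mind is presumably something along the lines of the stupid backoff scheme described in that paper; a hedged sketch, with illustrative names and a fixed back-off factor, follows:

```{r backoff-sketch, eval = FALSE}
# Stupid-backoff style score: use the trigram relative frequency when the
# trigram was seen, otherwise back off to the bigram and then the unigram,
# discounting by a fixed factor at each step. Count vectors use names of the
# form "w1_w2_w3", matching the n-gram sketches above.
score_next <- function(w1, w2, w3, tri, bi, uni, lambda = 0.4) {
  tri_key <- paste(w1, w2, w3, sep = "_")
  bi_key  <- paste(w2, w3, sep = "_")
  if (!is.na(tri[tri_key])) {
    unname(tri[tri_key] / bi[paste(w1, w2, sep = "_")])
  } else if (!is.na(bi[bi_key])) {
    unname(lambda * bi[bi_key] / uni[w2])
  } else if (!is.na(uni[w3])) {
    unname(lambda^2 * uni[w3] / sum(uni))
  } else {
    0
  }
}
```
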
In terms of our application, we are mindful of the [limited CPU and RAM resources](http://shiny.rstudio.com/articles/shinyapps.html#application-instances) that we will have in the
shinyapps.io environment. Accordingly, we aim to make our current implementation more efficient both computationally
and in terms of its memory footprint. As it stands, the model we have implemented performs reasonably well on higher-end hardware but will not be viable in an environment as restricted as
our deployment target. Our intuition is that we can build a model that takes much less memory and can be
persisted rather than regenerated on deployment.
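
One way to do this, sketched here with illustrative object and file names, is to precompute the frequency tables once and serialise them so the application only has to load them at start-up:

```{r persist-sketch, eval = FALSE}
# Serialise the precomputed n-gram tables once, offline ...
model <- list(unigrams = unigramFreq,
              bigrams  = bigramCounts,
              trigrams = trigramCounts)
saveRDS(model, "ngram-model.rds", compress = "xz")

# ... and load them at application start-up instead of rebuilding them.
model <- readRDS("ngram-model.rds")
```
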
For the application itself, we are contemplating dividing the corpus up by source type before the user types in a sentence for completion. Since prediction is more accurate within a similar corpus type, we expect the user experience to be more satisfying.

# Code
Below we capture the R code chunks used to produce the analysis above.
```{r echo = FALSE}
library(knitr)
read_chunk('milestone.R', labels = "chunkTest")
```
```{r chunkTest, eval = FALSE}
```
```{r chunk2, eval = FALSE}
```
```{r chunk3, eval = FALSE}
```
```{r chunk4, eval = FALSE}
```
```{r chunk5, eval = FALSE}
```