Sentiment analysis - DT matrix

As I work with various packages related to text manipulation, I am beginning to realize what a mess the R package ecosystem can turn into: a variety of packages written by different contributors with no coordination between them, overlapping functionality, and colliding nomenclature. Many functions exist for “convenience” when base R could do the job. I have noticed the same thing with packages like dplyr. I have commenced learning dplyr on multiple occasions, only to find I don’t need it — I can do everything in base R without loading an extra package and learning new terminology. The problem I now encounter is that as these packages gain popularity, code snippets and examples use them, so I need to learn and understand the packages anyway to make sense of the examples.
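To illustrate the point, here is a grouped summary done both ways; the data frame is invented for the example, and this comparison is mine, not from the original workflow:

```r
# A made-up data frame of per-document word counts
df <- data.frame(doc  = c("ch1", "ch1", "ch2"),
                 freq = c(10, 5, 7))

# Base R: aggregate() does the grouped sum directly
aggregate(freq ~ doc, data = df, FUN = sum)

# dplyr: the same result with the pipe-and-verb idiom
library(dplyr)
df %>% group_by(doc) %>% summarise(freq = sum(freq))
```

Both produce the same two-row summary; the base R version simply avoids the extra dependency.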

In my previous post on text manipulation I discussed the process of creating a corpus object. In this post I will investigate what can be done with a document term matrix. Starting with the previous post’s corpus:

library(tm)  # DocumentTermMatrix() comes from the tm package
dtm <- DocumentTermMatrix(corp)

There are a variety of methods available to inspect the document term matrix:

> dtm
<<DocumentTermMatrix (documents: 17, terms: 5500)>>
Non-/sparse entries: 18083/75417
Sparsity : 81%
Maximal term length: 26
Weighting : term frequency (tf)
> dim(dtm)
[1] 17 5500
> inspect(dtm[2, 50:100])
<<DocumentTermMatrix (documents: 1, terms: 51)>>
Non-/sparse entries: 10/41
Sparsity : 80%
Maximal term length: 9
Weighting : term frequency (tf)

Terms
Docs accentu accept access accid accompani accord account accumul
chapter02.xhtml 0 0 0 0 0 1 1 0
Terms
Docs accus accustom ach achiev acid acquaint acquir acr across
chapter02.xhtml 0 0 0 0 0 0 0 0 0
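Besides `inspect()`, the tm package also offers `findFreqTerms()` and `findAssocs()` for poking at a document term matrix. A brief sketch, assuming `dtm` is the matrix built above (the thresholds here are arbitrary):

```r
# Terms appearing at least 100 times across all documents
findFreqTerms(dtm, lowfreq = 100)

# Terms whose per-document counts correlate with "rock" at 0.8 or higher
findAssocs(dtm, "rock", corlimit = 0.8)
```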

Note the sparsity is 81%. Remove sparse terms and inspect:

> dtms <- removeSparseTerms(dtm, 0.1) # This makes a matrix that is 10% empty space, maximum.   
> dtms
<<DocumentTermMatrix (documents: 17, terms: 66)>>
Non-/sparse entries: 1082/40
Sparsity : 4%
Maximal term length: 26
Weighting : term frequency (tf)

Now sparsity is down to 4%. Calculate word frequencies and plot them as a bar chart.

> freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)   
> head(freq, 14)
the said like one rock girl now littl miss look mrs know day
367 204 188 184 180 170 167 164 163 148 142 133 122
come
116
> wf <- data.frame(word=names(freq), freq=freq)
> head(wf)
word freq
the the 367
said said 204
like like 188
one one 184
rock rock 180
girl girl 170

> wf$nc <- nchar(as.character(wf$word))
> wf <- wf[wf$nc > 3, ]   # drop short words

> library(ggplot2)
> p <- ggplot(subset(wf, freq > 60), aes(word, freq))
> p <- p + geom_bar(stat="identity")
> p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
> p

We can use hierarchical clustering to group related words. I wouldn’t read much meaning into this for Picnic, but it is comforting to see the xml/html terms clustering together in the third group — a sort of positive control.


library(cluster)
dtms <- removeSparseTerms(dtm, 0.05) # This makes a matrix that is 5% empty space, maximum.
d <- dist(t(dtms), method="euclidean")
fit <- hclust(d=d, method="ward.D")  # "ward" was renamed "ward.D" in recent R versions
plot(fit, hang=-1)
groups <- cutree(fit, k=4) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=4, border="red") # draw dendrogram with red borders around the 4 clusters

We can also use K-means clustering:

library(fpc)
library(cluster)   # clusplot() comes from the cluster package
d <- dist(t(dtms), method="euclidean")
kfit <- kmeans(d, 4)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
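The choice of four clusters above is arbitrary. One common heuristic (my addition, not part of the original workflow) is to plot the total within-cluster sum of squares for a range of k and look for an “elbow”:

```r
# Assumes d is the distance matrix computed above
wss <- sapply(2:10, function(k) kmeans(d, k, nstart = 10)$tot.withinss)
plot(2:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The k where the curve flattens out is a reasonable candidate for the number of clusters.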

Back here I didn’t mention that, when creating the epub, it would display fine on my computer but would not display on my Nook. A solution was to pass the file through Calibre. I diffed the files coming out of Calibre with my originals but was not able to determine the minimum set of changes required for Nook compatibility. You can download the Calibre-modified epub here, and the original here. If you determine what those Nook requirements are, please inform me.
