Sentiment analysis

In my previous post on text manipulation I discussed the process of OCR and text munging to create a list of chapter contents. In this post I will investigate what can be done with a data frame; future posts will cover using a corpus and a document-term matrix.

Each chapter is an XML file, so read those into a variable and inspect:

## Indicate the working directory
> setwd("~/pahr/sentiment/")
> all.files <- list.files(paste(getwd(), "/xhtmlfiles", sep=""))
> all.files
 [1] "chapter10.xhtml" "chapter11.xhtml" "chapter12.xhtml" "chapter13.xhtml"
 [5] "chapter14.xhtml" "chapter15.xhtml" "chapter16.xhtml" "chapter17.xhtml"
 [9] "chapter1.xhtml"  "chapter2.xhtml"  "chapter3.xhtml"  "chapter4.xhtml"
[13] "chapter5.xhtml"  "chapter6.xhtml"  "chapter7.xhtml"  "chapter8.xhtml"
[17] "chapter9.xhtml"
>

Create a data frame that will serve both as the worklist I process through and as a place to hold data about each chapter. The data frame will contain a row for each chapter and tally information such as:

  • bname: base name of the chapter XML file
  • chpt: chapter number
  • paragraphs: number of paragraphs
  • total: total number of words
  • nosmall: number of words remaining after removing small (<4 character) words
  • uniques: number of unique words
  • nonstop: number of non-stop words
  • unnstop: number of unique non-stop words
d <- data.frame(matrix(ncol = 9, nrow = length(all.files)))
names(d) <- c("file.name","bname","chpt","paragraphs","total","nosmall","uniques","nonstop","unnstop")

d$file.name <- all.files
for(i in 1:nrow(d)){
  numc <- nchar(d[i,"file.name"])
  ## strip the ".xhtml" extension to get the base name
  d[i,"bname"] <- substring(d[i,"file.name"], 1, numc - 6)
  ## characters 8 onward of the base name hold the chapter number
  d[i,"chpt"] <- as.integer(substring(d[i,"file.name"], 8, numc - 6))
}
d <- d[order(d$chpt),]

> head(d)
        file.name    bname chpt paragraphs total nosmall uniques nonstop unnstop
9  chapter1.xhtml chapter1    1         NA    NA      NA      NA      NA      NA
10 chapter2.xhtml chapter2    2         NA    NA      NA      NA      NA      NA
11 chapter3.xhtml chapter3    3         NA    NA      NA      NA      NA      NA
12 chapter4.xhtml chapter4    4         NA    NA      NA      NA      NA      NA
13 chapter5.xhtml chapter5    5         NA    NA      NA      NA      NA      NA
14 chapter6.xhtml chapter6    6         NA    NA      NA      NA      NA      NA
>

I will read the chapter XML files into a list and at the same time count the number of paragraphs per chapter:

library(XML)  # provides xmlToList()

chpts <- vector(mode="list", length=nrow(d))

for(i in 1:nrow(d)){
  chpt.num <- d[i,"chpt"]
  chpts[[chpt.num]] <- xmlToList(paste(getwd(), "/xhtmlfiles/", d[i,"file.name"], sep=""))
  ## each child of <body> is one paragraph
  d[i,"paragraphs"] <- length(chpts[[chpt.num]]$body)
}

Each quote from a character is given its own paragraph, so a high paragraph count is indicative of lots of conversation.
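
For example, the most conversation-heavy chapters can be pulled straight out of the data frame (a minimal sketch, reusing the d built above):

## chapters ranked by paragraph count, most dialogue-heavy first
head(d[order(-d$paragraphs), c("chpt", "paragraphs")])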

Next, create a list for each parameter I would like to extract. Stop words are provided by the tidytext package.

library(tidytext)

total <- vector(mode="list", length=nrow(d))
nosmall <- vector(mode="list", length=nrow(d))
un <- vector(mode="list", length=nrow(d))        ## uniques, no blanks
non.stops <- vector(mode="list", length=nrow(d))
unstops <- vector(mode="list", length=nrow(d))

data(stop_words)  ## from the tidytext package

for(i in 1:nrow(d)){
  chpt.num <- d[i,"chpt"]
  ## strip punctuation and split the chapter body into words
  total[[chpt.num]] <- strsplit(gsub("[[:punct:]]", "", chpts[[chpt.num]])[2], " ", fixed=TRUE)
  d[i,"total"] <- length(total[[chpt.num]][[1]])
  ## eliminate words with fewer than 4 characters
  nosmall[[chpt.num]] <- total[[chpt.num]][[1]][!(nchar(total[[chpt.num]][[1]]) < 4)]
  d[i,"nosmall"] <- length(nosmall[[chpt.num]])
  ## uniques
  un[[chpt.num]] <- unique(nosmall[[chpt.num]])
  d[i,"uniques"] <- length(un[[chpt.num]])
  ## no stop words (but not unique)
  non.stops[[chpt.num]] <- nosmall[[chpt.num]][!(nosmall[[chpt.num]] %in% stop_words$word)]
  d[i,"nonstop"] <- length(non.stops[[chpt.num]])
  ## no stop words AND unique
  unstops[[chpt.num]] <- un[[chpt.num]][!(un[[chpt.num]] %in% stop_words$word)]
  d[i,"unnstop"] <- length(unstops[[chpt.num]])
}

> head(d)
        file.name    bname chpt paragraphs total nosmall uniques nonstop unnstop
9  chapter1.xhtml chapter1    1         50  5151    2854    1649    2061    1414
10 chapter2.xhtml chapter2    2         59  3490    1844    1077    1228     883
11 chapter3.xhtml chapter3    3         42  2904    1632     971    1107     786
12 chapter4.xhtml chapter4    4         59  4064    2011    1066    1290     843
13 chapter5.xhtml chapter5    5        100  6216    3267    1572    2124    1306
14 chapter6.xhtml chapter6    6         48  3305    1741    1028    1171     835
>

## plot total word counts by chapter, then overlay each filtered count
plot(d$chpt, d$total, type="o", ylab="Words", xlab="Chapter Number", main="Words by Chapter", ylim=c(0,9000))
points(d$chpt, d$nosmall, type="o", col="lightblue")
points(d$chpt, d$uniques, type="o", col="blue")
points(d$chpt, d$nonstop, type="o", col="red")
points(d$chpt, d$unnstop, type="o", col="orange")

## add a legend in the upper-left corner
legend(1, 9000, c("Total words","Big words (> 3 chars)","Unique(Big words)","Non-stop(Big words)","Unique*Non-stop(Big words)"), col=c("black","lightblue","blue","red","orange"), lty=1, pch=1, cex=0.9)

The word count trends are the same for all categories, which is expected. I am most interested in “Non-stop(Big words)”, the red line, as I don’t want to normalize word dosage: if the word “happy” is present 20 times, I want the 20x dosage of the happiness sentiment that I wouldn’t get using unique words. To visually inspect the word lists I will simply pull out the first 50 words from each category for chapter 2:
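
A quick sketch of that inspection, reusing the per-chapter lists built above:

head(total[[2]][[1]], 50)
head(nosmall[[2]], 50)
head(un[[2]], 50)
head(non.stops[[2]], 50)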

Comparing nosmall to non.stops, the first two words eliminated are words 9 and 24, “several” and “through”, two words I would agree don’t contribute to sentiment or content.
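
The eliminated positions can also be found programmatically rather than by eye (a quick sketch, assuming the same chapter 2 list and the tidytext stop_words):

## positions in nosmall[[2]] that match a stop word and so were dropped
head(which(nosmall[[2]] %in% stop_words$word))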

Next I will make a word cloud of the entire book. To do so I must get the words into a data frame.

## start the data frame with chapter 1's words
word <- non.stops[[1]]
chpt <- rep(1, length(word))
pahr.words <- data.frame(cbind(chpt, word))

## append the words from the remaining chapters
for(i in 2:nrow(d)){
  word <- non.stops[[i]]
  chpt <- rep(i, length(word))
  holder <- cbind(chpt, word)
  pahr.words <- rbind(pahr.words, holder)
  rm(holder)
}

library(wordcloud)
wordcloud(pahr.words$word, max.words = 100, random.order = FALSE)
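
Before reading the cloud, the top frequencies can be checked numerically (a quick sketch, assuming pahr.words is built as above):

## ten most frequent non-stop words across the whole book
head(sort(table(as.character(pahr.words$word)), decreasing = TRUE), 10)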

Appropriately, “rock” is the most frequent word. The word cloud contains many proper nouns. I will make a vector of these nouns, remove them from the collection of words, and re-plot:

> prop.nouns <- c("Albert","Miranda","Mike","Michael","Edith","Irma","Sara","Dora","Appleyard","Hussey","Fitzhubert","Bumpher","Leopold","McCraw","Marion","Woodend","Lumley")
> cloud.words <- as.character(pahr.words$word)
> ioi <- (cloud.words %in% prop.nouns)
> summary(ioi)
   Mode   FALSE    TRUE    NA's
logical   24524    1194       0
> cw2 <- cloud.words[!ioi]
>
> wordcloud(cw2, max.words = 100, random.order = FALSE)
>

In the next post I will look at what can be done with a corpus.
