Sentiment analysis - Corpus

In a previous post on text manipulation I discussed text mining manipulations that could be performed with a data frame. In this post I will explore what can be done with a corpus. Start by importing the text manipulation package tm. tm has many useful methods for creating a corpus from various sources. My texts are in a directory as xhtml files, one per chapter. I will use VCorpus(DirSource()) to read the files into a corpus data object:

> library(tm)
>
> myfiles <- file.path(getwd(), "xhtmlfiles")
> corp <- VCorpus(DirSource(myfiles), readerControl = list(language="en"))
>
> length(corp)
[1] 17
> corp[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 14639

> writeLines(as.character(corp[[2]]))
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">

<head><meta charset="UTF-8" /></head>
<body><p>Chapter 11</p><p> </p><p> Mrs Fitzhubert at the breakfast table looked out on to the mist-shrouded garden, and decided to instruct the maids to begin putting away the chintzes preparatory to the....
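
The listing above shows that the xhtml markup is read in along with the chapter text. One possible way to deal with that, sketched here as an assumption rather than something done in this post, is a custom transformation that blanks out anything that looks like a tag. The helper name strip_tags and the one-line demo document are hypothetical; content_transformer(), tm_map(), VectorSource() and stripWhitespace are tm functions.

```r
library(tm)

# Hypothetical helper: replace anything of the form <...> with a space.
strip_tags <- content_transformer(function(x) gsub("<[^>]+>", " ", x))

# A tiny in-memory stand-in for one xhtml chapter file.
demo <- VCorpus(VectorSource(
  "<body><p>Chapter 2</p><p>The creek ran sluggishly.</p></body>"
))

demo <- tm_map(demo, strip_tags)       # drop the markup
demo <- tm_map(demo, stripWhitespace)  # collapse the gaps left behind

content(demo[[1]])
```

A regular expression like this is only a rough filter. For messier markup a real XML parser would be the safer choice.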

The variable “corp” is a 17-member list, with one chapter per member. tm provides many useful functions for word munging, referred to as “transformations”, which are applied with the tm_map() function. Below I remove stop words, punctuation and proper nouns, convert to lower case, stem (i.e. remove common endings like “ing”, “es”, “s”) and strip extra white space:

corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, removePunctuation, preserve_intra_word_dashes = TRUE)
prop.nouns <- c("Albert","Miranda","Mike","Michael","Edith","Irma","Sara","Dora","Appleyard","Hussey","Fitzhubert","Bumpher","Leopold","McCraw","Marion","Woodend","Lumley","pp","p" )
corp <- tm_map(corp, removeWords, prop.nouns)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, stripWhitespace)

> writeLines(as.character(corp[[2]]))
xml version10 encodingutf-8
html xmlnshttpwwww3org1999xhtml

headmeta charsetutf-8 head
bodypchapt 2pp
manmad improv natur picnic ground consist sever circl flat stone serv fireplac wooden privi shape japanes pagoda the creek close summer ran sluggish long dri grass now almost disappear re-appear shallow pool lunch set larg white tablecloth close shade heat sun two three spre
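
Words such as “the” survive in the output above because removeWords matches case-sensitively and the stop word list is lower case, so a sentence-initial “The” is not removed when stop words are dropped before tolower. The sketch below shows one alternative ordering on a made-up sentence: proper nouns first (while they are still capitalised), then lower-casing, then stop words. The demo text and the shortened demo_nouns vector are assumptions for illustration only.

```r
library(tm)

# Made-up one-sentence "chapter" for the demo.
demo <- VCorpus(VectorSource(
  "The creek ran close to Miranda. In addition, the pool was shallow."
))

# Hypothetical, shortened list of proper nouns.
demo_nouns <- c("Miranda")

demo <- tm_map(demo, removeWords, demo_nouns)            # names still capitalised
demo <- tm_map(demo, content_transformer(tolower))       # then lower-case everything
demo <- tm_map(demo, removeWords, stopwords("english"))  # "the" and "in" now match
demo <- tm_map(demo, removePunctuation)
demo <- tm_map(demo, stripWhitespace)

content(demo[[1]])
```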

A corpus object allows for the addition of metadata. I will add two events per chapter, which may be useful as overlays during graphing:

meta(corp[[1]], "event1") <- "Exposition of main characters"
meta(corp[[1]], "event2") <- "Journey to the rock"
meta(corp[[2]], "event1") <- "Picnic"
meta(corp[[2]], "event2") <- "Crossing the creek"
meta(corp[[3]], "event1") <- "A surprising number of human beings are without purpose..."
meta(corp[[3]], "event2") <- "Edith screams, girls disappear"
meta(corp[[4]], "event1") <- "Sara hasn't memorized 'The Hesperus'"
meta(corp[[4]], "event2") <- "Drag returns from the Rock"
meta(corp[[5]], "event1") <- "Michael interviewed by Constable Bumpher"
meta(corp[[5]], "event2") <- "The red cloud"
meta(corp[[6]], "event1") <- "The garden party"
meta(corp[[6]], "event2") <- "Mike decides to search for the girls"
meta(corp[[7]], "event1") <- "Mike decides to spend the night on the rock"
meta(corp[[7]], "event2") <- "Mike hallucinates on the rock"
meta(corp[[8]], "event1") <- "Michael rescued on the rock"
meta(corp[[8]], "event2") <- "Irma is found alive"
meta(corp[[9]], "event1") <- "Letters to/from parents"
meta(corp[[9]], "event2") <- "Sara informed of her debts to the school"
meta(corp[[10]], "event1") <- "Visit from the Spracks"
meta(corp[[10]], "event2") <- "Michael and Irma meet, date, break up"
meta(corp[[11]], "event1") <- "Michael avoids luncheon with Irma"
meta(corp[[11]], "event2") <- "Fitzhuberts entertain Irma"
meta(corp[[12]], "event1") <- "Irma visits the gymnasium"
meta(corp[[12]], "event2") <- "Mademoiselle de Poitiers threatens Dora Lumley"
meta(corp[[13]], "event1") <- "Reg collects his sister Dora"
meta(corp[[13]], "event2") <- "Reg and Dora die in a fire"
meta(corp[[14]], "event1") <- "Albert describes a dream about his kid sister"
meta(corp[[14]], "event2") <- "Mr Leopold thanks Albert with a cheque"
meta(corp[[15]], "event1") <- "Mrs Appleyard lies about Sara's situation"
meta(corp[[15]], "event2") <- "Mademoiselle de Poitiers reminisces about Sara"
meta(corp[[16]], "event1") <- "Mademoiselle de Poitiers' letter to Constable Bumpher"
meta(corp[[16]], "event2") <- "Sara found dead"
meta(corp[[17]], "event1") <- "Newspaper extract"
meta(corp[[17]], "event2") <- ""

> meta(corp[[2]], "event2")
[1] "Crossing the creek"
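
The same kind of per-chapter metadata could also be assigned in a loop from a lookup table, which may be easier to maintain than one line per tag. This is only a sketch: the three-document demo corpus and the events data frame are hypothetical stand-ins for the 17 chapters, and only meta() and its replacement form are taken from the code above.

```r
library(tm)

# Stand-in corpus with three "chapters"; the real one has 17.
demo <- VCorpus(VectorSource(c("chapter one", "chapter two", "chapter three")))

# Hypothetical lookup table: one row per chapter.
events <- data.frame(
  event1 = c("Exposition of main characters", "Picnic",
             "A surprising number of human beings are without purpose..."),
  event2 = c("Journey to the rock", "Crossing the creek",
             "Edith screams, girls disappear"),
  stringsAsFactors = FALSE
)

# Attach both tags to every document.
for (i in seq_len(length(demo))) {
  meta(demo[[i]], "event1") <- events$event1[i]
  meta(demo[[i]], "event2") <- events$event2[i]
}

# Collect one tag across all chapters, e.g. for plot labels.
event1_labels <- vapply(seq_len(length(demo)),
                        function(i) meta(demo[[i]], "event1"),
                        character(1))
event1_labels
```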

The corpus object is a list of lists: the main object has 17 elements, one per chapter, and each chapter element is itself a list. Its “content” element holds the lines of the original xhtml file as a character vector, in which each entry is a line of markup, a blank line, or a paragraph of text. For the second chapter, corp[[2]]$content has 18 elements, and the first paragraph of the chapter is element 6:

> length(corp[[2]]$content)
[1] 18
> corp[[2]]$content[1]
[1] "xml version10 encodingutf-8"
> corp[[2]]$content[2]
[1] "html xmlnshttpwwww3org1999xhtml"
> corp[[2]]$content[3]
[1] ""
> corp[[2]]$content[4]
[1] "headmeta charsetutf-8 head"
> corp[[2]]$content[5]
[1] "bodypchapt 2pp"
> corp[[2]]$content[6]
[1] " manmad improv natur picnic ground consist sever circl flat stone serv fireplac wooden privi shape japanes pagoda the creek close summer ran sluggish long dri grass now almost disappear re-appear shallow pool lunch set larg white tablecloth close shade heat sun two three spread gum in addit chicken pie angel cake jelli tepid banana insepar australian picnic cook provid handsom ice cake shape heart tom oblig cut mould piec tin mr boil two immens billycan tea fire bark leav now enjoy pipe shadow drag keep watch eye hors tether shade"
>
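
To make the structure concrete, here is a small sketch that builds a single document by hand and inspects it the same way. The four-line demo text is invented. PlainTextDocument() and content() are available once tm is loaded, and the header/paragraph split at element 3 is an assumption of this toy example, not of the real chapters.

```r
library(tm)

# Hand-built document: header line, blank line, then two "paragraphs".
doc <- PlainTextDocument(c("bodypchapt 2pp", "",
                           " manmad improv natur picnic ground",
                           " lunch set larg white tablecloth"))

class(content(doc))    # the content is a character vector
length(content(doc))   # one element per original line

# Keep only the paragraphs (here elements 3 onward) and drop empty lines.
paras <- content(doc)[3:length(content(doc))]
paras <- paras[nzchar(paras)]

# Collapse the chapter into a single string if one is needed downstream.
chapter_text <- paste(trimws(paras), collapse = " ")
chapter_text
```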

This cleaned corpus is the end product of the preprocessing stage and will be the input for the document term matrix discussed in the next post.
