Prologue

This tutorial is a continuation of the series devoted to text analysis on the returned students press corpus. In the previous tutorials, we used the quanteda package to compute text statistics on our corpus. In this new instalment, we rely on the same package to explore the relations between words, known as “co-occurrences” or “collocations”.

This experiment is based on the excellent text mining tutorial by Wiedemann and Niekler (2017) (see References).

This tutorial demonstrates how to perform co-occurrence analysis with R and the quanteda package. It shows how different significance measures can be used to extract semantic links between words.

Load the required packages and define a few familiar default options:

options(stringsAsFactors = FALSE)
library(quanteda)
library(dplyr) # needed below for %>%, distinct() and filter()


# read the RS dataset
library(readr)
textdata <- read_csv("Data/rs_full_text.csv",
                     col_types = cols(X1 = col_skip())) %>%
  distinct()

# create the corpus object with an additional variable for year

rs_corpus <- corpus(textdata$Text,
                    docnames = textdata$DocID,
                    docvars = data.frame(year = substr(textdata$Date, 1, 4)))


Inspect the corpus:

# original corpus length and its first document
ndoc(rs_corpus)
## [1] 2739


The corpus contains 2739 unique documents (press articles).

Inspect the first 200 characters of the first document:

substr(texts(rs_corpus)[1], 1, 200)
##                                                                                                                                                                                                 1324674682 
## "                                           Many University Men on Shanghai Mayor's Staff                 Gen Hwang Fu Mayor of the Chinese area at Shanghai and himself scholar of note has announced th"

Sentence detection

Separating the text into semantic analysis units is an important step for co-occurrence analysis. Context windows can be, for instance, documents, paragraphs, sentences, or neighboring words. One of the most frequently used context windows is the sentence.
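
For completeness, quanteda can also count co-occurrences directly within a window of neighboring words through its fcm() (feature co-occurrence matrix) function. The following minimal sketch counts co-occurrences within a window of five tokens; it is shown for illustration only and is not used in the rest of this tutorial:

# co-occurrence counts within a 5-token window (illustration only)
window_tokens <- tokens(rs_corpus, remove_punct = TRUE) %>% tokens_tolower()
window_fcm <- fcm(window_tokens, context = "window", window = 5)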

Documents are decomposed into sentences, and each sentence is treated as a separate (quasi-)document in a new quanteda corpus object. The further application of the quanteda functions remains the same. In contrast to previous exercises, however, we now work with sentences stored as individual documents in the corpus.

Important: the sentence segmentation must take place before the other preprocessing steps, because the sentence segmentation model relies on intact word forms and punctuation marks.

The following code uses a quanteda function to reshape the corpus into sentences:

corpus_sentences <- corpus_reshape(rs_corpus, to = "sentences")

ndoc(corpus_sentences)
## [1] 10445


The reshaped corpus contains 10445 documents (sentences), which is abnormally low given the relatively large number of articles, unless the original corpus is mostly made up of short articles (fewer than 5 sentences each on average). As we show in another tutorial, this may reflect missing punctuation or issues in text segmentation prior to the extraction of the documents with the enpchina package.
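
We can quantify this with a quick sanity check. The document names produced by corpus_reshape carry a “.1”, “.2”, … suffix per original article (as visible in the output below), so the number of segments per article can be recovered from them:

# average number of "sentences" per article: 10445 / 2739, i.e. roughly 3.8
ndoc(corpus_sentences) / ndoc(rs_corpus)

# distribution of segments per original article
segments_per_doc <- table(gsub("\\.\\d+$", "", docnames(corpus_sentences)))
summary(as.integer(segments_per_doc))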

Let’s inspect a few examples:

texts(corpus_sentences)[1]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          1324674682.1 
## "Many University Men on Shanghai Mayor's Staff                 Gen Hwang Fu Mayor of the Chinese area at Shanghai and himself scholar of note has announced the appointment of Cabinet of ten members practically all of whom are graduates of Chinese colleges or are returned students from America or Europe The list includes the following                 Police commissioner Shen Pu Jen Assistant commissioner Chang Zien Chun Graduate of Nanyang College Commissioner of Revenue Shu Zien Fu banker formerly director of Rank of China in Hangchow Comniiss ion of Public Works Dr Shen Chung Ye1' Graduate of Tung Chi and returned from Germany got his Ph in civil engineering in Germ my Commissioner of Public Utilitv Huang Bah Ziau Graduate of Tung Clri and returned student from Germany formerly er of works in Hankow Commissioner of Charity Dept Wang Han Tsc member of local gentry Commissioner of Harbor Dept Lee N'ien Tse returned student from Germany formerly commissioner of industry in Shansi Province Commissioner of Education Chnu Ching Nung American returned student formerly of Shanghai College and Kuang Hua University Commissioner of agriculture commerce and labor Pan Kong Zu tS JR formerly journalist in Shanghai member of the political committee in Shanghai Commissioner of Land Chcni Ycen Belgian returned student formerly the Chinese director of China- l rench Technical College in Shanghai 10) Commissioner of Health Dr Hu Hou-ki an American returned student who has been connected with the Rockefeller Hospital at Peking In the Department of Health there are also four other returned students who will have charge of the various divisions and laboratories"


texts(corpus_sentences)[2]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               1326716896.1 
## "THREE American returned students two British returned students one French returned student one German returned student one graduate of the Shensi University and two Japanese returned students were elected at the first section of the long-expected elections tor the Senate held on June 20' and 21 in Peking For the first time in the history of the Republic had so many foreign-educated Chinese                 voted Over four hundred registered and some three hundred cast their votes The majority of them are graduates of Western or Japanese universities They questioned the legality of the new Parliament to be convened but they parti in the election nevertheless on the ground that they should get into the new legislature which may be illegal their own representatives who may render some service to the public and at least can keep them informed of what is going on therein THOSE who were elected Senators are Tsur former president of the Tsing Hua College Ho Yen-sun secretary to Liang Shih-yi and Chen Huai-chang Confucianist who are returned students from America Lo Hung-nien of the Bank of China and Wang Shih-ching editor of the Peking Daily News returned students from England Wu Chun from Ger many Wu Ching-lien from France former Minister to Italy Ting Yung and Wei Sze-kan from Japan and Hsu Shih graduate of the Shensi University They were elected from the first section in which both the candidates for senatorship and voters must possess special literary or educational qualifications namely scholars returned students who have been back at least for three years prior to June 10, 1918, authors whose books have been recognized by the Ministry of Education The voting took place in the Hall of the House of Representa tives and the election was presided over bv Fu Tseng-hsiang Minister of Education"


In these two examples, we observe that the algorithm was not able to split the article into sentences, since the punctuation was missing in the original text. Therefore we will not actually rely on the sentence as the text unit for analyzing collocations. What is referred to as a “sentence” in this tutorial may be an actual sentence, when the punctuation was correctly transcribed in the original article, or it may be a concatenation of successive sentences from a given article (sometimes even the entire article) when punctuation was missing or incorrectly transcribed.

CAUTION: When the tokenization goes well, the size of the newly decomposed corpus usually increases significantly. Older computers may run into trouble because of insufficient memory during this preprocessing step.

We now return to our usual preprocessing chain and apply it to the separated sentences:

# Preprocessing of the corpus of sentences
corpus_tokens <- corpus_sentences %>% 
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% 
  tokens_tolower() %>% 
  tokens_remove(pattern = stopwords(), padding = TRUE)

# calculate multi-word unit candidates
rs_collocations <- textstat_collocations(corpus_tokens, min_count = 25)

# discard boilerplate and OCR-split candidates
rs_collocations_reduced <- rs_collocations %>% 
  filter(!collocation %in% c("years ago", "china press", "road tel", "govern ment",
                             "road telephone", "last week", "last year", "took place",
                             "per month", "per cent", "daily news", "chin ese",
                             "let us", "co ltd", "page col", "take place",
                             "turner property", "palmer turner", "chi nese",
                             "address box", "pre sent", "re turned"))

# merge the retained multi-word units into single tokens
corpus_tokens <- tokens_compound(corpus_tokens, rs_collocations_reduced)


Again, we create a document-term matrix. Only word forms which occur in at least 10 documents (sentences) are taken into account. An upper limit is not set (Inf = infinite):

minimumFrequency <- 10


Additionally, we are interested in the joint occurrence of words in a sentence. For this, we do not need the exact count of how often the terms occur, but only the information whether they occur together or not. This can be encoded in a binary document-term matrix: weighting the DTM with the “boolean” scheme of dfm_weight writes a 1 into the DTM if the term is contained in a sentence and a 0 if not:

# Create DTM, prune vocabulary and set binary values for presence/absence of types
binDTM <- corpus_tokens %>% 
  tokens_remove("") %>%
  dfm() %>% 
  dfm_trim(min_docfreq = minimumFrequency, max_docfreq = Inf) %>% 
  dfm_weight("boolean")
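
As an optional check, the boolean weighting can be verified directly: after dfm_weight("boolean"), the largest cell value in the matrix should be 1:

# all cells should now be 0 or 1
max(binDTM)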

Counting co-occurrences

Counting joint word occurrences is easily done via a matrix multiplication on the binary DTM. For this purpose, the transposed matrix (dimensions: nTypes x nDocs) is multiplied by the original matrix (nDocs x nTypes), which results in a term-term matrix (dimensions: nTypes x nTypes).

# Matrix multiplication for cooccurrence counts
coocCounts <- t(binDTM) %*% binDTM


Let’s look at a snippet of the result. The matrix has nTerms rows and columns and is symmetric. Each cell contains the number of joint occurrences. In the diagonal, the frequencies of single occurrences of each term are encoded:

as.matrix(coocCounts[202:205, 202:205])
##            books recognized ministry voting
## books        318         22       38      3
## recognized    22        175       29      2
## ministry      38         29      788      9
## voting         3          2        9     36


Interpret as follows: “voting” appears together with “ministry” 9 times in the 10445 sentences (or multi-sentence segments of articles) in the RS collection. “Voting” alone occurs in 36 sentences.
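
These numbers can be read straight off the term-term matrix by name:

# joint occurrences of the two terms
coocCounts["ministry", "voting"]
# single occurrences of "voting" (the diagonal)
coocCounts["voting", "voting"]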

Statistical significance

In order not only to count joint occurrences but also to assess their importance, we have to determine their significance. Different significance measures can be used. To calculate the significance of the joint occurrence of a term i (coocTerm) with any other term j, we also need various counts:

  • k - Number of all context units in the corpus
  • ki - Number of occurrences of coocTerm
  • kj - Number of occurrences of comparison term j
  • kij - Number of joint occurrences of coocTerm and j

These quantities can be calculated for any term coocTerm as follows:

coocTerm <- "returned"
k <- nrow(binDTM)
ki <- sum(binDTM[, coocTerm])
kj <- colSums(binDTM)
names(kj) <- colnames(binDTM)
kij <- coocCounts[coocTerm, ]


An implementation in R for mutual information, Dice, and log-likelihood may look like this. At the end of each block, the result is sorted so that the most significant co-occurrences appear at the top of the list:

########## MI: log(k*kij / (ki * kj)) ##########
mutualInformationSig <- log(k * kij / (ki * kj))
mutualInformationSig <- mutualInformationSig[order(mutualInformationSig, decreasing = TRUE)]

########## DICE: 2 X&Y / X + Y ##############
dicesig <- 2 * kij / (ki + kj)
dicesig <- dicesig[order(dicesig, decreasing=TRUE)]

########## Log Likelihood ###################
logsig <- 2 * ((k * log(k)) - (ki * log(ki)) - (kj * log(kj)) + (kij * log(kij)) 
               + (k - ki - kj + kij) * log(k - ki - kj + kij) 
               + (ki - kij) * log(ki - kij) + (kj - kij) * log(kj - kij) 
               - (k - ki) * log(k - ki) - (k - kj) * log(k - kj))
logsig <- logsig[order(logsig, decreasing=T)]


The results of the four variants (raw frequency plus the three significance measures) for the statistical extraction of co-occurrence terms are shown in the data frame below:

# Put all significance statistics in one Data-Frame
resultOverView <- data.frame(
  names(sort(kij, decreasing=T)[1:10]), sort(kij, decreasing=T)[1:10],
  names(mutualInformationSig[1:10]), mutualInformationSig[1:10], 
  names(dicesig[1:10]), dicesig[1:10], 
  names(logsig[1:10]), logsig[1:10],
  row.names = NULL)
colnames(resultOverView) <- c("Freq-terms", "Freq", "MI-terms", "MI", "Dice-Terms", "Dice", "LL-Terms", "LL")
print(resultOverView)
##    Freq-terms Freq          MI-terms       MI Dice-Terms      Dice LL-Terms
## 1    returned  798          returned 2.571770   returned 1.0000000 shanghai
## 2       china  544              fide 2.253316       last 0.2703901    china
## 3    shanghai  473         conqueror 2.215095   shanghai 0.2664038       mr
## 4     chinese  461           stamped 2.140987       time 0.2635659     time
## 5          mr  389       boxer_funds 2.119785     return 0.2616382     last
## 6         one  382 shanghai_departed 2.112238         mr 0.2604620   return
## 7         now  368            misled 2.060944   received 0.2564103  chinese
## 8        time  357                gr 2.060944   students 0.2540943     made
## 9        made  357        accusation 2.060944    america 0.2527806 received
## 10       also  347              cali 2.060944       made 0.2480028      now
##          LL
## 1  419.3500
## 2  347.3295
## 3  334.0355
## 4  325.4366
## 5  321.5680
## 6  281.8972
## 7  278.5335
## 8  276.6139
## 9  266.4750
## 10 264.5079


It can be seen that raw frequency is a poor indicator of meaning constitution. Mutual information (MI) rather emphasizes rare events in the data. Dice and log-likelihood yield well interpretable contexts.

Visualization of co-occurrence

In the following, we create a network visualization of significant co-occurrences.

For this, we provide the calculation of the co-occurrence significance measures, which we have just introduced, as a single function in the file calculateCoocStatistics.R. This function can be imported into the current R session with the source command:

# Read in the source code for the co-occurrence calculation
source("https://tm4ss.github.io/calculateCoocStatistics.R")
# Definition of a parameter for the representation of the co-occurrences of a concept
numberOfCoocs <- 15
# Determination of the term whose co-occurrences are to be measured.
coocTerm <- "returned"


We use the imported function calculateCoocStatistics to calculate the co-occurrences for the target term “returned”.

coocs <- calculateCoocStatistics(coocTerm, binDTM, measure="LOGLIK")


Display the numberOfCoocs main terms:

print(coocs[1:numberOfCoocs])
## shanghai    china       mr     time     last   return  chinese     made 
## 419.3500 347.3295 334.0355 325.4366 321.5680 281.8972 278.5335 276.6139 
## received      now students  america  present     also      men 
## 266.4750 264.5079 261.4810 257.1062 249.9254 249.1432 241.7029


To acquire an extended semantic environment of the target term, “secondary co-occurrence” terms can be computed for each co-occurrence term of the target term. This results in a graph that can be visualized with special layout algorithms (e.g. Force Directed Graph).

Network graphs can be created and visualized in R with the igraph package. Any graph object can be created from a three-column data frame. Each row in that data frame is a triple. Each triple encodes an edge: two nodes (source, target) and an edge weight value.

For a term co-occurrence network, each triple consists of the target word, a co-occurring word and the significance of their joint occurrence. We denote the values with from, to, sig.

resultGraph <- data.frame(from = character(), to = character(), sig = numeric(0))


The process of gathering the network for the target term runs in two steps. First, we obtain all significant co-occurrence terms for the target term. Second, we obtain all co-occurrences of the co-occurrence terms from step one.

Intermediate results for each term are stored as temporary triples named tmpGraph. With the rbind command (“row bind”, used for the concatenation of data frames), all tmpGraph objects are appended to the complete network object stored in resultGraph:

# The structure of the temporary graph object is equal to that of the resultGraph
tmpGraph <- data.frame(from = character(), to = character(), sig = numeric(0))

# Fill the third column first, to create the correct number of rows
tmpGraph[1:numberOfCoocs, 3] <- coocs[1:numberOfCoocs]
# Enter the search word into the first column of all rows
tmpGraph[, 1] <- coocTerm
# Enter the co-occurrences into the second column of the respective rows
tmpGraph[, 2] <- names(coocs)[1:numberOfCoocs]
# Set the significances
tmpGraph[, 3] <- coocs[1:numberOfCoocs]

# Attach the triples to resultGraph
resultGraph <- rbind(resultGraph, tmpGraph)


We iterate over the most significant numberOfCoocs co-occurrences of the search term (may take a while…enough time to grab a cup of coffee ;-)):

for (i in 1:numberOfCoocs){
  
  # Calling up the co-occurrence calculation for term i from the search words co-occurrences
  newCoocTerm <- names(coocs)[i]
  coocs2 <- calculateCoocStatistics(newCoocTerm, binDTM, measure="LOGLIK")
  
  # inspect the top co-occurrences of the current term if needed
  # print(coocs2[1:10])
  
  # Structure of the temporary graph object
  tmpGraph <- data.frame(from = character(), to = character(), sig = numeric(0))
  tmpGraph[1:numberOfCoocs, 3] <- coocs2[1:numberOfCoocs]
  tmpGraph[, 1] <- newCoocTerm
  tmpGraph[, 2] <- names(coocs2)[1:numberOfCoocs]
  tmpGraph[, 3] <- coocs2[1:numberOfCoocs]
  
  # Append the result to the result graph, skipping the first row,
  # which typically duplicates an edge that is already in the graph
  resultGraph <- rbind(resultGraph, tmpGraph[2:nrow(tmpGraph), ])
}


As a result, resultGraph now contains all numberOfCoocs * numberOfCoocs edges of a term co-occurrence network:
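
A quick count confirms the size of the network: the first block contributed numberOfCoocs edges for the seed term, and each of the numberOfCoocs loop iterations appended numberOfCoocs - 1 more (assuming every call returned the full number of terms):

# 15 + 15 * 14 = 225 edges in total
nrow(resultGraph)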

# Sample of some examples from resultGraph
resultGraph[sample(nrow(resultGraph), 6), ]


The igraph package offers multiple visualizations for graph objects. Graph objects can be created from triple lists, such as those we just generated. In the next step, we load igraph and create a visualization of all nodes and edges from the object resultGraph:

require(igraph)

# set seed for graph plot
set.seed(1)

# Create the graph object as undirected graph
graphNetwork <- graph.data.frame(resultGraph, directed = F)

# Identify all nodes with fewer than 2 edges
verticesToRemove <- V(graphNetwork)[degree(graphNetwork) < 2]
# Remove these vertices from the graph
graphNetwork <- delete.vertices(graphNetwork, verticesToRemove) 

# Assign colors to nodes (search term blue, others orange)
V(graphNetwork)$color <- ifelse(V(graphNetwork)$name == coocTerm, 'cornflowerblue', 'orange') 

# Set edge colors
E(graphNetwork)$color <- adjustcolor("DarkGray", alpha.f = .5)
# scale significance between 1 and 10 for edge width
E(graphNetwork)$width <- scales::rescale(E(graphNetwork)$sig, to = c(1, 10))

# Curve the edges slightly
E(graphNetwork)$curved <- 0.15 
# Size the nodes by their degree of networking (scaled between 5 and 15)
V(graphNetwork)$size <- scales::rescale(log(degree(graphNetwork)), to = c(5, 15))

# Define the frame and spacing for the plot
par(mai=c(0,0,1,0)) 

# Final Plot
plot(
  graphNetwork,             
  layout = layout.fruchterman.reingold, # Force Directed Layout 
  main = paste(coocTerm, 'Graph'),
  vertex.label.family = "sans",
  vertex.label.cex = 0.8,
  vertex.shape = "circle",
  vertex.label.dist = 0.5,          # Labels of the nodes moved slightly
  vertex.frame.color = adjustcolor("darkgray", alpha.f = .5),
  vertex.label.color = 'black',     # Color of node names
  vertex.label.font = 2,            # Font of node names
  vertex.label = V(graphNetwork)$name   # node names
)

References

Wiedemann, Gregor, and Andreas Niekler. 2017. “Hands-on: A Five Day Text Mining Course for Humanists and Social Scientists in R.” In Proceedings of the Workshop on Teaching NLP for Digital Humanities (Teach4DH2017), Berlin, Germany, September 12, 2017, 57–65. http://ceur-ws.org/Vol-1918/wiedemann.pdf.