Prologue

This tutorial is a continuation of the series devoted to text analysis of the returned students press corpus. In the previous tutorial, we used the package quanteda to perform basic text statistics on the corpus. In this new instalment, we rely on the same package to extract key terms from documents and (sub-)collections, using TF-IDF and the log-likelihood statistic together with a reference corpus. We also show how to handle multi-word units such as "United States".

This experiment is based on the excellent tm4ss tutorial by Wiedemann and Niekler (2017).

Multi-word tokenization

As in the previous tutorial, we read the CSV data file containing the press articles related to the returned students, and we preprocess the corpus object with a sequence of quanteda functions:

options(stringsAsFactors = FALSE)
library(quanteda)
library(dplyr)   # provides the pipe operator and distinct()/filter() used below

# read the RS corpus data
library(readr)
rs_full_text <- read_csv("Data/rs_full_text.csv",
col_types = cols(X1 = col_skip()))
## Warning: Missing column names filled in: 'X1' [1]
textdata <- rs_full_text %>% distinct()

rs_corpus <- corpus(textdata$Text, docnames = textdata$DocID)

# Preprocessing of the corpus
corpus_tokens <- rs_corpus %>% 
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% 
  tokens_tolower()  %>% 
  tokens_remove(pattern = stopwords(), padding = T)

In addition, we introduce the handling of multi-word units (MWUs), also known as collocations in linguistics. MWUs are units comprising two or more semantically related tokens, such as “machine learning”, which form a distinct new sense. Further, named entities such as “Chiang Kai-shek” or “United States” can be regarded as collocations, too. They can be inferred automatically with a statistical test: if two terms occur as direct neighbors significantly more often than expected by chance, they can be treated as collocations.

Quanteda provides two functions for handling MWUs:

  • textstat_collocations performs a statistical test to identify collocation candidates.
  • tokens_compound concatenates collocation terms in each document with a separation character, e.g. _. In this way, the two terms are treated as a single new vocabulary type by any subsequent text processing algorithm (see the toy illustration below).
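
As a toy illustration (a minimal sketch with made-up sentences, not part of the corpus workflow), the two functions can be combined as follows:

toy_corpus <- corpus(c("The United States welcomed the returned students.",
                       "Many returned students sailed to the United States."))
toy_tokens <- toy_corpus %>% 
  tokens(remove_punct = TRUE) %>% 
  tokens_tolower() %>% 
  tokens_remove(stopwords(), padding = TRUE)

# candidate MWUs occurring at least twice in the toy corpus
toy_collocations <- textstat_collocations(toy_tokens, min_count = 2)

# compound them: "united states" becomes the single token "united_states"
tokens_compound(toy_tokens, toy_collocations)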

First we calculate multi-word unit candidates:

rs_collocations <- textstat_collocations(corpus_tokens, min_count = 25)


We check the top 25 collocations:

head(rs_collocations, 25)


As expected, the terms we used for building the corpus top the list (“returned students”, “returned student”). More interestingly, the next most frequent collocation (“United States”) reflects the huge attraction that this country exerted on Chinese overseas students. The United States actually emerged as the top destination of choice for Chinese students abroad from the First Remission of the Boxer Indemnity (1908) onwards.

We check the bottom 25 collocations:

tail(rs_collocations, 25)


Caution: for the collocation statistics to be aware of the deleted stop words, you need to add the parameter padding = TRUE to the tokens_remove function above.
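
To see why, here is a toy sentence (a minimal sketch, not from the corpus):

toy <- tokens("the students of the united states")
# without padding, the removed stop words close the gaps, so "students" and
# "united" wrongly appear to be direct neighbours:
tokens_remove(toy, stopwords(), padding = FALSE)
# with padding, empty strings keep the original token positions, so collocation
# counts are not distorted:
tokens_remove(toy, stopwords(), padding = TRUE)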

If you do not want all of the suggested collocation pairs to be treated as MWUs in the subsequent analysis, you can simply remove the rows containing unwanted pairs from the rs_collocations object:

rs_collocations_reduced <- rs_collocations %>% 
  filter(!collocation %in% c("years ago", "china press", "road tel", "govern ment",
                             "road telephone", "last week", "last year", "took place",
                             "per month", "per cent", "daily news", "chin ese",
                             "let us", "co ltd", "page col", "take place",
                             "turner property", "palmer turner", "chi nese",
                             "address box", "pre sent", "re turned"))


In this example, we remove the false collocations that result from OCR errors (“chi nese”, “govern ment”) and the poorly informative ones that refer to advertisements (“road tel”, “turner property”), newspaper titles (“china press”, “daily news”) or vague time indicators (“last week”, “took place”, “per month”), although the latter may hint at the importance of club meetings in the social life of the returned students. We then concatenate the remaining collocations in the tokens object:

# compound collocations
corpus_tokens <- tokens_compound(corpus_tokens, rs_collocations_reduced)


Finally, we create a Document-Term-Matrix as usual, but this time with unigram tokens and concatenated MWU tokens:

# Create DTM (also remove the empty padding tokens)
DTM <- corpus_tokens %>% 
  tokens_remove("") %>%
  dfm() 

dim(DTM)
## [1]   2739 112928


The new DTM contains 2739 documents and 112928 distinct tokens (words or multi-word units).

TF-IDF

A widely used method to weight terms according to their semantic contribution to a document is the term frequency–inverse document frequency measure (TF-IDF). The idea is that the more often a term occurs in a document, the more it contributes to that document's content. At the same time, the more documents a term occurs in, the less informative it is for any single document. The product of both measures is the resulting weight.
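
For reference, the weight computed below can be written as a small helper function (a minimal sketch; the function name and arguments are illustrative and not part of the tutorial):

# tf-idf weight of a term in a document (or sub-corpus):
# tf(term, doc) * log(N / df(term))
tf_idf_weight <- function(term_freq, doc_freq, n_docs) {
  term_freq * log(n_docs / doc_freq)
}

# e.g. a term occurring 10 times in the target, present in 50 of the 2739 documents
tf_idf_weight(10, 50, 2739)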

As an example, let us compute TF-IDF weights for all the terms in the American periodical China Weekly Review:

# Compute IDF: log(N / n_i)
number_of_docs <- nrow(DTM)
term_in_docs <- colSums(DTM > 0)
idf <- log(number_of_docs / term_in_docs)

# Compute TF
cwr <- which(textdata$Source == "The China Weekly Review")
tf <- colSums(DTM[cwr, ])

# Compute TF-IDF
tf_idf <- tf * idf
names(tf_idf) <- colnames(DTM)


The last operation assigns the column names to the resulting term weight vector. If we now sort the tf-idf weights in decreasing order, we get the most important terms for the China Weekly Review, according to this weight:

sort(tf_idf, decreasing = T)[1:20]
##      <NA>     iiing      hiug      <NA>      <NA>      <NA>     pro-j     chahi 
## 1266.4557 1092.3180 1084.4027  920.5966  664.4425  617.3972  601.5665  585.7358 
## chung-tai tub-baths      <NA>      <NA>      <NA>  murtin's      <NA>  hsi-hung 
##  570.5539  569.9051  561.9897  530.3283  510.7787  498.6669  491.1097  482.8362 
##    cvw1uu      <NA>      <NA>      <NA> 
##  474.9209  474.9209  467.0055  463.5380


Unfortunately, the words with the highest tf-idf weights are non-words resulting from errors in Optical Character Recognition (OCR). On the other hand, this makes tf-idf a useful measure for detecting the most frequent OCR errors.

We can also focus on a single document or a single year. For instance, let us focus on the article titled “ENGINEERING EDUCATION IN CHINA” published in the Peking Gazette on April 17, 1917 (DocID: 19170407). Only the top 20 words are displayed:

# Compute TF
doc1416611994 <- which(textdata$DocID == "1416611994")
tf2 <- as.vector(DTM[doc1416611994, ])

# Compute TF-IDF
tf_idf2 <- tf2 * idf
names(tf_idf2) <- colnames(DTM)

sort(tf_idf2, decreasing = T)[1:20]
##   technical engineering      german      school     textile    thorough 
##   30.876167   18.389568   12.142224   11.888412   11.463011   10.021912 
##         ure    students     machine     college       equip     english 
##    9.741651    9.721472    9.052525    9.050809    8.720000    8.645698 
##   work_done   graduates        ita-       year9      schonl  machanical 
##    8.503573    8.020803    7.915348    7.915348    7.915348    7.915348 
##    etectric       t-ian 
##    7.915348    7.915348


The terms are highly indicative of the topic of the article (engineering). Had we relied on term frequency alone, we would have obtained a list of stop words as the most important terms. By re-weighting with the inverse document frequency, we see a heavy focus on engineering and education, as expected from the title of the article. Besides general terms related to engineering (technical, engineering, machine) or to education (school, students, college, graduates), we also find more specific terms related to specialized fields or applications (textile, mechanical) and to the country of education (German, English). Incidentally, the list also contains words with OCR mistakes (“schonl” instead of “school”, “machanical” instead of “mechanical”, etc.).

Let’s now focus on the year 1917:

# First create a new variable for the year
textdata$year <- substr(textdata$Date, 0, 4)
# Compute TF for the year 1917 
doc1917 <- which(textdata$year == "1917")
tf3 <- colSums(DTM[doc1917, ])

# Compute TF-IDF
tf_idf3 <- tf3 * idf
names(tf_idf3) <- colnames(DTM)

sort(tf_idf3, decreasing = T)[1:20]
##    5anton      chlh      funs journal's      <NA>   advanta      <NA>    kwangs 
##  538.2437  403.6828  261.2065  253.2911  248.1040  245.3758  239.6246  231.7690 
##    lifted    missee      <NA>      isze      <NA>  genei'al    amonir corneille 
##  231.5667  221.6297  221.6297  213.7144  205.7991  202.2216  189.9684  182.0530 
##      <NA>      <NA>   patriek   enables 
##  182.0530  167.1443  166.2223  163.5650


Again, tokens garbled by OCR top the list. The larger the amount of text on which we apply tf-idf, the more likely we are to obtain such bizarre words.

By the way, the quanteda package provides a convenient function for computing the tf-idf weights of a given DTM: dfm_tfidf(DTM).
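
For example (a minimal sketch, not part of the original workflow; note that dfm_tfidf() uses base-10 logarithms and raw counts by default, so the absolute weights differ from the natural-log computation above, although the ranking of terms is comparable):

# weight the whole DTM at once with the built-in function
DTM_tfidf <- dfm_tfidf(DTM)

# inspect, for instance, the 20 highest-weighted terms of the first document
w <- as.vector(DTM_tfidf[1, ])
names(w) <- colnames(DTM_tfidf)
sort(w, decreasing = TRUE)[1:20]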

Log-likelihood ratio test

We now use a more sophisticated method based on a comparison corpus (or a comparison sample within our corpus) and the log-likelihood statistic.

The great advantage of the log-likelihood ratio test is that it takes the size of the corpora into account. We can use it to compare word frequencies across different corpora or, in our case, across different periodicals, genres of articles, successive decades or sub-periods. In the first step, we define a reference/comparison sample and a target sample. Then we apply the log-likelihood ratio test to the two corpora and compare the results. CAUTION: the preprocessing of the comparison corpus must be identical to the preprocessing of the target corpus to achieve meaningful results!

  • For comparing genres or periodicals, we split our corpus into as many sub-corpora as it contains periodicals (12) or genres of articles (10). We then define a reference/comparison sample and the target corpora to compare.
  • For comparing decades or periods, we split our corpus into a series of mini corpora/samples, as many samples as we have decades or sub-periods.

If we want to compare word frequencies over decades, for instance, we first need to define a reference/comparison decade and a target decade. Then we apply a series of log-likelihood ratio tests to the successive decades. We can start by comparing the first and the last decade in our corpus. As an alternative, we can also define a split year (e.g. end of 1905, 1911, end of WWI…) and compare frequencies before and after that year.
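
Concretely, for a given term, the statistic computed below compares observed and expected frequencies. With a and b the counts of the term in the target and comparison corpora, and c and d the total number of tokens in each corpus (these are the variable names used in the code further down):

Expected1 = c * (a + b) / (c + d)
Expected2 = d * (a + b) / (c + d)
LL = 2 * (a * log(a / Expected1) + b * log(b / Expected2))

A large LL value means that the observed counts deviate strongly from what we would expect if the term were used with the same relative frequency in both corpora.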

Let’s start with a simple experiment and compare the vocabulary in the two most important periodicals - the American China Weekly Review and the British North-China Herald. First we define the target corpus, for instance, the collection of all articles published in the China Weekly Review:

targetDTM <- DTM

termCountsTarget <- colSums(targetDTM[cwr, ])
names(termCountsTarget) <- colnames(targetDTM)


In termCountsTarget we now have the term frequencies for the China Weekly Review again.

As a comparison corpus, we select all the articles published in the British periodical North-China Herald:

# create comparison corpus 

textdata_compare <- textdata %>% filter(Source == "The North China Herald")
corpus_compare <- corpus(textdata_compare$Text, docnames = textdata_compare$DocID) 

corpus_compare_tokens <- corpus_compare %>% 
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% 
  tokens_tolower()  %>% 
  tokens_remove(pattern = stopwords(), padding = T)

# Create DTM
comparisonDTM <- corpus_compare_tokens %>% 
  tokens_compound(rs_collocations_reduced) %>%   # use the same MWU list as for the target corpus
  tokens_remove("") %>%
  dfm() 

termCountsComparison <- colSums(comparisonDTM)


In termCountsComparison we now have the frequencies of all terms in the comparison corpus.

Let us now calculate the log-likelihood ratio test by comparing the frequencies of a term in both corpora, taking the size of both corpora into account. First for a single term, for instance “china”:

# Loglikelihood for a single term
term <- "china"

# Determine variables
a <- termCountsTarget[term]
b <- termCountsComparison[term]
c <- sum(termCountsTarget)
d <- sum(termCountsComparison)

# Compute log likelihood test
Expected1 = c * (a+b) / (c+d)
Expected2 = d * (a+b) / (c+d)
t1 <- a * log((a/Expected1))
t2 <- b * log((b/Expected2))
logLikelihood <- 2 * (t1 + t2)

print(logLikelihood)
##   china 
## 4410.24


The LL value indicates whether the term occurs significantly more or less frequently in the target counts than we would expect from the comparison counts. Specific significance thresholds are defined for the LL values:

  • 95th percentile; 5% level; p < 0.05; critical value = 3.84
  • 99th percentile; 1% level; p < 0.01; critical value = 6.63
  • 99.9th percentile; 0.1% level; p < 0.001; critical value = 10.83
  • 99.99th percentile; 0.01% level; p < 0.0001; critical value = 15.13

With R it is easy to calculate the LL-value for all terms at once. This is possible because many computing operations in R can be applied not only to individual values, but to entire vectors and matrices. For example, a / 2 results in a single value a divided by 2 if a is a single number. If a is a vector, the result is also a vector, in which all values are divided by 2.
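
A minimal illustration (the variable x is used for demonstration only):

x <- 10
x / 2            # a single value: 5
x <- c(10, 20, 30)
x / 2            # a vector: 5 10 15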

ATTENTION: A comparison of term occurrences between two documents/corpora is actually only useful if the term occurs in both units. Since, however, we also want to include terms which are not contained in the comparison corpus (the termCountsComparison vector contains 0 values for these terms), we simply add 1 inside the logarithm whenever a count is zero during the test (see the trick below). This is necessary to avoid NaN values which would otherwise result from applying the log function to 0 values during the LL test. Alternatively, the test could be performed only on terms that actually occur in both corpora.

First, let’s have a look into the set of terms only occurring in the target document, but not in the comparison corpus:

# use set operation to get terms only occurring in target document
uniqueTerms <- setdiff(names(termCountsTarget), names(termCountsComparison))
# Have a look into a random selection of terms unique in the target corpus
sample(uniqueTerms, 20)
##  [1] "cliemiats"        "largely-attended" "vivian"           "linguan"         
##  [5] "pecans"           "rocke"            "ayencles"         "tso-ling"        
##  [9] "iton"             "svniphonv"        "non-communist"    "alwa"            
## [13] "1478j"            "tented"           "subconsciously"   "pocket-size"     
## [17] "shik"             "housi"            "kuverttl"         "cartas"


Now we calculate the statistics the same way as above, but with vectors. Since there might be terms in the target counts which we did not observe in the comparison corpus, we need to make both vocabularies match. For this, we append the terms unique to the target as zero counts to the comparison frequency vector.

Moreover, we use a little trick to check for zero counts of the frequency values in a or b when computing t1 or t2. If a count is zero, the log function would produce an NaN value, which we want to avoid. In this case, the a == 0 (resp. b == 0) expression adds 1 to the argument of the log function, which then yields 0:

# Create vector of zeros to append to comparison counts
zeroCounts <- rep(0, length(uniqueTerms))
names(zeroCounts) <- uniqueTerms
termCountsComparison <- c(termCountsComparison, zeroCounts)

# Get list of terms to compare from intersection of target and comparison vocabulary
termsToCompare <- intersect(names(termCountsTarget), names(termCountsComparison))

# Calculate statistics (same as above, but now with vectors!)
a <- termCountsTarget[termsToCompare]
b <- termCountsComparison[termsToCompare]
c <- sum(termCountsTarget)
d <- sum(termCountsComparison)
Expected1 = c * (a+b) / (c+d)
Expected2 = d * (a+b) / (c+d)
t1 <- a * log((a/Expected1) + (a == 0))
t2 <- b * log((b/Expected2) + (b == 0))
logLikelihood <- 2 * (t1 + t2)

# Compare relative frequencies to indicate over/underuse
relA <- a / c
relB <- b / d


Display ten underused terms (terms whose relative frequency is lower in the target than in the comparison corpus):

# underused terms 
head(logLikelihood[relA < relB], 10)
##        many  university         men    shanghai       staff         gen 
## 1410.433358  553.887485 1469.257425 2295.913060  198.051158  241.174005 
##       hwang          fu       mayor     chinese 
##    3.173342  117.361363   60.293501 4372.478139

Let’s take a look at the results: first the 50 most overused terms, then the most underused terms, and finally the most significant terms compared to their raw frequency.

Terms overused in the China Weekly Review compared to the North China Herald:

# top terms (overuse in targetCorpus compared to comparisonCorpus)
sort(logLikelihood, decreasing=TRUE)[1:50]
##      china    chinese         mr        one   shanghai       made       time 
##  4410.2402  4372.4781  3665.5234  3335.1826  2295.9131  1904.0053  1808.8050 
##        now     peking        may government       said       also     people 
##  1803.4403  1766.0244  1667.5913  1639.0312  1615.8135  1558.1110  1542.8078 
##       work        new        men        two       upon       many          s 
##  1540.8788  1518.8936  1469.2574  1437.5240  1410.5506  1410.4334  1357.8082 
##    foreign       much        ing    present   students      first    country 
##  1347.0838  1326.4570  1310.5903  1284.9000  1272.5102  1245.5368  1243.9501 
##        can      great         re        mrs         dr       must         ed 
##  1243.7744  1156.6832  1126.5365  1120.1898  1095.1266  1086.8697  1069.4163 
##      japan    general   japanese       well       good     canton       tion 
##  1066.2430  1040.8562  1021.8162   950.4160   940.8960   939.3093   934.5493 
##        way    british       even        man      given        day       make 
##   929.7893   920.2692   900.8410   891.7091   880.6025   878.2485   866.3224 
##      order 
##   847.2824


Terms underused in the China Weekly Review compared to the North China Herald:

# bottom terms (underuse in targetCorpus compared to comparisonCorpus)
sort(logLikelihood, decreasing=FALSE)[1:25]
##         embraces       indefinite          looters       compulsion 
##     0.0002171946     0.0002171946     0.0002171946     0.0002171946 
##          zealand           insect           pulpit    road_shanghai 
##     0.0002171946     0.0002171946     0.0002171946     0.0002171946 
## chinese_citizens         explains             cer-          grammar 
##     0.0004397459     0.0006515837     0.0022634746     0.0022634746 
##        frederick         brighter        smothered           tastes 
##     0.0022634746     0.0022634746     0.0022634746     0.0022634746 
##          mixture             sors          strains      premiership 
##     0.0022634746     0.0022634746     0.0022634746     0.0022634746 
##          abiding              roy            sybil           tremen 
##     0.0022634746     0.0022634746     0.0022634746     0.0022634746 
##     well-trained 
##     0.0022634746


We can also see that terms with comparatively low frequencies are identified by the LL test as statistically significant when compared to the reference corpus:

llTop100 <- sort(logLikelihood, decreasing=TRUE)[1:100]
frqTop100 <- termCountsTarget[names(llTop100)]
frqLLcomparison <- data.frame(llTop100, frqTop100)
head(frqLLcomparison, 10) 
# Number of significantly overused terms (p < 0.01, critical value 6.63)
sum(logLikelihood > 6.63)
## [1] NA


The NA result is due to NaN values in the logLikelihood vector: they come from terms that occur neither in the target nor in the comparison counts, so that both observed and expected frequencies are zero. Adding na.rm = TRUE to sum(), or restricting the test to terms that actually occur in at least one of the two corpora, returns the actual number of significantly overused key terms for the China Weekly Review.

Visualization

Finally, we visualize the 50 most significant terms as a wordcloud. This can be realized simply with the wordcloud2 package. In addition to the words and their weights (here we use the log-likelihood values), we adjust the default size and shuffling parameters. Feel free to try different parameters to modify the rendering of the wordcloud.

require(wordcloud2)
## Loading required package: wordcloud2
top50 <- sort(logLikelihood, decreasing = TRUE)[1:50]
top50_df <- data.frame(word = names(top50), count = top50, row.names = NULL)
wordcloud2(top50_df, shuffle = F, size = 0.5)


The wordcloud shows the 50 terms that were the most significantly overused in the China Weekly Review compared to the North-China Herald in the “returned students” (RS) collection.

Iteration

Key term extraction cannot be done for single documents, but only for entire (sub-)corpora, and depending on the comparison corpus, the results may vary. Instead of comparing a single document/corpus to another document or corpus, we now compare collections/samples of articles within our corpus to other collections/samples of articles in the same corpus. We use the decade as the basis for sampling our corpus.

For this, we iterate over all decades using a for-loop. Within the loop, we use a logical vector (Boolean TRUE/FALSE values) to split the DTM into two sub-matrices: the rows of the currently selected decade and the rows of all other decades. From these matrices, the counts of target and comparison frequencies are created. We have outsourced the statistical computation of the log-likelihood measure shown above into the function calculateLogLikelihood, which we load with the source command before the loop. The function takes both frequency vectors as input parameters and outputs an LL value for each term of the target vector.

Results of the LL key term extraction are again visualized as a wordcloud. Instead of plotting the wordcloud in RStudio, this time we write the visualization to disk as a PDF file in the wordclouds folder. After the for-loop has completed, the folder should contain one wordcloud PDF per decade (ten in our corpus).

First we create the variable “decade”:

textdata$year <- substr(textdata$Date, 0, 4)
textdata$decade <- paste0(substr(textdata$Date, 0, 3), "0")


We define the target corpus:

targetDTM <- DTM

Then we load the function calculateLogLikelihood:

source("https://tm4ss.github.io/calculateLogLikelihood.R")
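
In case the URL is not reachable, a minimal re-implementation consistent with the vectorized computation above could look like the sketch below (the hosted tm4ss function may differ in details, e.g. additional significance filtering):

calculateLogLikelihood <- function(termCountsTarget, termCountsComparison) {
  # align vocabularies: treat terms missing from the comparison counts as zeros
  uniqueTerms <- setdiff(names(termCountsTarget), names(termCountsComparison))
  zeroCounts <- rep(0, length(uniqueTerms))
  names(zeroCounts) <- uniqueTerms
  termCountsComparison <- c(termCountsComparison, zeroCounts)
  termsToCompare <- intersect(names(termCountsTarget), names(termCountsComparison))
  
  # observed and expected frequencies (terms absent from both corpora yield NaN)
  a <- termCountsTarget[termsToCompare]
  b <- termCountsComparison[termsToCompare]
  c <- sum(termCountsTarget)
  d <- sum(termCountsComparison)
  Expected1 <- c * (a + b) / (c + d)
  Expected2 <- d * (a + b) / (c + d)
  t1 <- a * log((a / Expected1) + (a == 0))
  t2 <- b * log((b / Expected2) + (b == 0))
  2 * (t1 + t2)
}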


We run the for-loop:

rs_decades <- unique(textdata$decade)
for (decade in rs_decades) {
  
  cat("Extracting terms per decade", decade, "\n")
  
  selector_logical_idx <- textdata$decade == decade
  
  decadeDTM <- targetDTM[selector_logical_idx, ]
  termCountsTarget <- colSums(decadeDTM)
  
  otherDTM <- targetDTM[!selector_logical_idx, ]
  termCountsComparison <- colSums(otherDTM)
  
  loglik_terms <- calculateLogLikelihood(termCountsTarget, termCountsComparison)
  
  top50 <- sort(loglik_terms, decreasing = TRUE)[1:50]
  
  fileName <- paste0("wordclouds/", decade, ".pdf")
  pdf(fileName, width = 9, height = 7)
  wordcloud::wordcloud(names(top50), top50, max.words = 50, scale = c(3, .9), colors = RColorBrewer::brewer.pal(8, "Dark2"), random.order = F)
  dev.off()
  
}
## Extracting terms per decade 1920
## Extracting terms per decade 1910
## Extracting terms per decade 1930
## Extracting terms per decade 1950
## Extracting terms per decade 1940
## Extracting terms per decade 1900
## Extracting terms per decade 1870
## Extracting terms per decade 1890
## Extracting terms per decade 1840
## Extracting terms per decade 1880

Concluding remarks

References

Wiedemann, Gregor, and Andreas Niekler. 2017. “Hands-on: A Five Day Text Mining Course for Humanists and Social Scientists in R.” In Proceedings of the Workshop on Teaching NLP for Digital Humanities (Teach4DH2017), Berlin, Germany, September 12, 2017., 57–65. http://ceur-ws.org/Vol-1918/wiedemann.pdf.