Abstract
This tutorial shows how to extract key terms from documents and (sub-)collections with TF-IDF and the log-likelihood statistic, using a reference corpus. We also show how to handle multi-word units such as "United States" with the quanteda package.
This tutorial is a continuation of the series devoted to text analysis of the returned students press corpus. In the previous tutorial, we used the package quanteda to compute basic text statistics on the corpus. In this new instalment, we rely on the same package to extract key terms from documents and (sub-)collections with TF-IDF and the log-likelihood statistic, using a reference corpus. We also show how to handle multi-word units such as "United States".
This experiment is based on the excellent tutorial by Wiedemann and Niekler (2017).
As in the previous tutorial, we read the CSV data file containing the press articles related to the returned students and preprocess the corpus object with a sequence of quanteda functions:
options(stringsAsFactors = FALSE)
library(quanteda)
library(dplyr) # needed below for distinct() and filter()
# read the RS corpus data
library(readr)
rs_full_text <- read_csv("Data/rs_full_text.csv",
col_types = cols(X1 = col_skip()))
## Warning: Missing column names filled in: 'X1' [1]
textdata <- rs_full_text %>% distinct()
rs_corpus <- corpus(textdata$Text, docnames = textdata$DocID)
# Preprocessing of the corpus
corpus_tokens <- rs_corpus %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_tolower() %>%
tokens_remove(pattern = stopwords(), padding = T)
In addition, we introduce the handling of multi-word units (MWUs), also known as collocations in linguistics. MWUs are units comprising two or more semantically related tokens, such as 'machine learning', which form a distinct new sense. Further, named entities such as "Chiang Kai-shek" or "United States" can be regarded as collocations, too. They can be inferred automatically with a statistical test: if two terms occur as direct neighbors significantly more often than expected by chance, they can be treated as collocations.
Quanteda provides two functions for handling MWUs: textstat_collocations() to identify collocation candidates and tokens_compound() to merge them into single tokens.
First we calculate multi-word unit candidates:
rs_collocations <- textstat_collocations(corpus_tokens, min_count = 25)
We check top collocations (top 25):
head(rs_collocations, 25)
As expected, the terms we used for building the corpus top the list (returned students, returned student). More interestingly, the next most frequent collocation (United States) reflects the huge attraction that this country exerted on Chinese overseas students. The United States actually emerged as the top destination of choice for Chinese students abroad from the First Remission of the Boxer Indemnity (1908) onwards.
We check the bottom collocations (last 25):
tail(rs_collocations, 25)
Caution: for the collocation statistics to take the deleted stop words into account, you need to add the parameter padding = TRUE to the tokens_remove function above.
If you do not want all of the suggested collocation pairs to be considered as MWUs in the subsequent analysis, you can simply remove the rows containing unwanted pairs from the rs_collocations object:
rs_collocations_reduced <- rs_collocations %>% filter(!collocation %in% c("years ago", "china press", "road tel", "govern ment", "road telephone", "years ago", "last week", "last year", "took place", "per month", "per cent", "daily news", "chin ese", "let us", "co ltd", "page col", "take place", "turner property", "palmer turner", "chi nese", "address box", "pre sent", "re turned"))
In this example, we remove false collocations resulting from OCR errors ("chi nese", "chin ese") and poorly informative ones referring to advertisements ("road tel", "turner property"), newspaper titles ("china press", "daily news") or vague time indicators ("last week", "took place", "per month"), although the latter may indicate the importance of club meetings in the social life of the returned students:
# compound collocations
corpus_tokens <- tokens_compound(corpus_tokens, rs_collocations_reduced)
Finally, we create a Document-Term-Matrix as usual, but this time with unigram tokens and concatenated MWU tokens:
# Create DTM (also remove padding empty term)
DTM <- corpus_tokens %>%
tokens_remove("") %>%
dfm()
dim(DTM)
## [1] 2739 112928
The new DTM contains 2,739 documents and 112,928 distinct tokens (words or multi-word units).
A widely used method to weight terms according to their semantic contribution to a document is the term frequency-inverse document frequency measure (TF-IDF). The idea is that the more often a term occurs in a document, the more it contributes to the document's content, while the more documents a term occurs in, the less informative it is for any single document. The product of both measures is the resulting weight.
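Formally, with N the number of documents in the corpus and n_t the number of documents containing term t, the weight computed below for a term t in a document (or sub-corpus) d is the following (the code below uses the natural logarithm; other bases only rescale the weights):

$$\operatorname{tf\text{-}idf}(t,d) = \operatorname{tf}(t,d)\cdot \log\frac{N}{n_t}$$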
Let us compute TF-IDF weights for all terms in the American periodical China Weekly Review for instance:
# Compute IDF: log(N / n_i)
number_of_docs <- nrow(DTM)
term_in_docs <- colSums(DTM > 0)
idf <- log(number_of_docs / term_in_docs)
# Compute TF
cwr <- which(textdata$Source == "The China Weekly Review")
tf <- colSums(DTM[cwr, ]) # aggregate term counts over all China Weekly Review articles
# Compute TF-IDF
tf_idf <- tf * idf
names(tf_idf) <- colnames(DTM)
The last operation is to append the column names again to the resulting term weight vector. If we now sort the tf-idf weights decreasingly, we get the most important terms for the China Weekly Review, according to this weight:
sort(tf_idf, decreasing = T)[1:20]
## <NA> iiing hiug <NA> <NA> <NA> pro-j chahi
## 1266.4557 1092.3180 1084.4027 920.5966 664.4425 617.3972 601.5665 585.7358
## chung-tai tub-baths <NA> <NA> <NA> murtin's <NA> hsi-hung
## 570.5539 569.9051 561.9897 530.3283 510.7787 498.6669 491.1097 482.8362
## cvw1uu <NA> <NA> <NA>
## 474.9209 474.9209 467.0055 463.5380
Unfortunately, the words with the highest td-idf are non-words resulting from errors in Optical Character Recognition (OCR). But then the tf-idf is a useful measure for detecting the most frequent OCR errors.
We can also focus on a single document or a single year. For instance, let us focus on the article titled "ENGINEERING EDUCATION IN CHINA", published in the Peking Gazette on April 17, 1917 (DocID: 19170407). Only the top 20 words are displayed:
# Compute TF
doc1416611994 <- which(textdata$DocID == "1416611994")
tf2 <- as.vector(DTM[doc1416611994, ])
# Compute TF-IDF
tf_idf2 <- tf2 * idf
names(tf_idf2) <- colnames(DTM)
sort(tf_idf2, decreasing = T)[1:20]
## technical engineering german school textile thorough
## 30.876167 18.389568 12.142224 11.888412 11.463011 10.021912
## ure students machine college equip english
## 9.741651 9.721472 9.052525 9.050809 8.720000 8.645698
## work_done graduates ita- year9 schonl machanical
## 8.503573 8.020803 7.915348 7.915348 7.915348 7.915348
## etectric t-ian
## 7.915348 7.915348
The terms are highly indicative of the topic of the article (engineering). Had we relied on term frequency alone, we would have obtained a list of stop words as the most important terms. By re-weighting with the inverse document frequency, we see a heavy focus on engineering and education, as expected from the title of the article. Besides general terms related to engineering (technical, engineering, machine) or to education (school, students, college, graduates), we also find more specific terms related to specialized fields or applications (textile, mechanical, electric) and to the country of education (German, English). Incidentally, the list also contains words with OCR mistakes ("schonl" instead of "school", "machanical" instead of "mechanical", etc.).
Let’s now focus on the year 1917:
# First create a new variable for the year
textdata$year <- substr(textdata$Date, 0, 4)
# Compute TF for the year 1917
doc1917 <- which(textdata$year == "1917")
tf3 <- colSums(DTM[doc1917, ]) # aggregate term counts over all 1917 articles
# Compute TF-IDF
tf_idf3 <- tf3 * idf
names(tf_idf3) <- colnames(DTM)
sort(tf_idf3, decreasing = T)[1:20]
## 5anton chlh funs journal's <NA> advanta <NA> kwangs
## 538.2437 403.6828 261.2065 253.2911 248.1040 245.3758 239.6246 231.7690
## lifted missee <NA> isze <NA> genei'al amonir corneille
## 231.5667 221.6297 221.6297 213.7144 205.7991 202.2216 189.9684 182.0530
## <NA> <NA> patriek enables
## 182.0530 167.1443 166.2223 163.5650
Again, poorly OCRized tokens top the list. The larger the amount of text to which we apply TF-IDF, the more likely we are to obtain such garbled words.
By the way, the quanteda package provides a convenient function for computing TF-IDF weights of a given DTM: dfm_tfidf(DTM).
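As a quick check, here is a minimal sketch reusing the DTM and the cwr index from above. Note that dfm_tfidf() uses base-10 logarithms by default, so the absolute weights differ from our manual natural-log computation, although the ranking of terms is comparable:
# weight the whole DTM with quanteda's built-in TF-IDF
DTM_tfidf <- dfm_tfidf(DTM)
# aggregate the weights over the China Weekly Review articles and inspect the top terms
cwr_tfidf <- colSums(DTM_tfidf[cwr, ])
sort(cwr_tfidf, decreasing = TRUE)[1:20]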
We now turn to a more sophisticated method that relies on a comparison corpus (or a comparison sample within our corpus) and the log-likelihood statistic.
The great advantage of the log-likelihood ratio test is that it takes the size of the corpora into account. We can use it to compare word frequencies across different corpora, or in our case, across different periodicals, genres of articles, successive decades or sub-periods. In a first step, we define a reference/comparison sample and a target sample. Then we apply the log-likelihood ratio test to the two corpora and compare the results. CAUTION: the preprocessing of the comparison corpus must be identical to the preprocessing of the target corpus to achieve meaningful results!
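Concretely, the statistic computed step by step further below works as follows: let a and b be the counts of a term in the target and comparison corpus, and c and d the total number of tokens in each corpus. The expected counts and the log-likelihood ratio are then:

$$E_1 = \frac{c\,(a+b)}{c+d}, \qquad E_2 = \frac{d\,(a+b)}{c+d}, \qquad LL = 2\left(a\log\frac{a}{E_1} + b\log\frac{b}{E_2}\right)$$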
If we want to compare word frequencies over decades, for instance, we first need to define a reference/comparison decade and a target decade. Then we apply a series of log-likelihood ratio tests on the successive decades. We can start by comparing the first and the last decade in our corpus. As an alternative, we can also define a split year (e.g. the end of 1905, 1911, the end of WWI…) and compare frequencies before and after that year.
Let’s start with a simple experiment and compare the vocabulary in the two most important periodicals - the American China Weekly Review and the British North-China Herald. First we define the target corpus, for instance, the collection of all articles published in the China Weekly Review:
targetDTM <- DTM
termCountsTarget <- colSums(targetDTM[cwr, ])
names(termCountsTarget) <- colnames(targetDTM)
In termCountsTarget we have the tf for the China Weekly Review again.
As a comparison corpus, we select all the articles published in the British periodical North-China Herald:
# create comparison corpus
textdata_compare <- textdata %>% filter(Source == "The North China Herald")
corpus_compare <- corpus(textdata_compare$Text, docnames = textdata_compare$DocID)
corpus_compare_tokens <- corpus_compare %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_tolower() %>%
tokens_remove(pattern = stopwords(), padding = T)
# Create DTM
comparisonDTM <- corpus_compare_tokens %>%
tokens_compound(rs_collocations_reduced) %>% # use the same reduced collocation list as for the target corpus
tokens_remove("") %>%
dfm()
termCountsComparison <- colSums(comparisonDTM)
In termCountsComparison we now have the frequencies of all (target) terms in the comparison corpus.
Let us now calculate the log-likelihood ratio test by comparing the frequencies of a term in both corpora, taking the size of both corpora into account. First for a single term, for instance "china":
# Loglikelihood for a single term
term <- "china"
# Determine variables
a <- termCountsTarget[term]
b <- termCountsComparison[term]
c <- sum(termCountsTarget)
d <- sum(termCountsComparison)
# Compute log likelihood test
Expected1 = c * (a+b) / (c+d)
Expected2 = d * (a+b) / (c+d)
t1 <- a * log((a/Expected1))
t2 <- b * log((b/Expected2))
logLikelihood <- 2 * (t1 + t2)
print(logLikelihood)
## china
## 4410.24
The LL value indicates whether the term occurs significantly more or less frequently in the target counts than we would expect from the comparison counts. The following significance thresholds are commonly used for LL values:
95th percentile; 5% level; p < 0.05; critical value = 3.84
99th percentile; 1% level; p < 0.01; critical value = 6.63
99.9th percentile; 0.1% level; p < 0.001; critical value = 10.83
99.99th percentile; 0.01% level; p < 0.0001; critical value = 15.13
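These critical values are simply quantiles of the chi-squared distribution with one degree of freedom and can be recomputed in R if needed:
# chi-squared quantiles corresponding to the thresholds above
qchisq(c(0.95, 0.99, 0.999, 0.9999), df = 1)
# approximately 3.84, 6.63, 10.83 and 15.13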
With R it is easy to calculate the LL-value for all terms at once. This is possible because many computing operations in R can be applied not only to individual values, but to entire vectors and matrices. For example, a / 2 results in a single value a divided by 2 if a is a single number. If a is a vector, the result is also a vector, in which all values are divided by 2.
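A tiny illustration of this vectorized behaviour (using a throwaway variable x so as not to overwrite the a defined above):
x <- 10
x / 2 # a single value: 5
x <- c(10, 20, 30)
x / 2 # element-wise: 5 10 15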
ATTENTION: A comparison of term occurrences between two documents/corpora is actually only useful if the term occurs in both units. Since, however, we also want to include terms which are not contained in the comparison corpus (the termCountsComparison vector contains 0 values for these terms), we simply add 1 inside the logarithm whenever a count is zero. This is necessary to avoid NaN values which would otherwise result from applying the log function to 0 values during the LL test. Alternatively, the test could be performed only on terms that actually occur in both corpora.
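A quick illustration of the problem, and of the (a == 0) correction used further below, which adds 1 inside the log only when the count is zero:
log(0) # -Inf
0 * log(0) # NaN
0 * log(0 + (0 == 0)) # 0: the correction avoids the NaN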
First, let's have a look at the set of terms that occur in the target corpus but not in the comparison corpus:
# use set operation to get terms only occurring in target document
uniqueTerms <- setdiff(names(termCountsTarget), names(termCountsComparison))
# Have a look into a random selection of terms unique in the target corpus
sample(uniqueTerms, 20)
## [1] "cliemiats" "largely-attended" "vivian" "linguan"
## [5] "pecans" "rocke" "ayencles" "tso-ling"
## [9] "iton" "svniphonv" "non-communist" "alwa"
## [13] "1478j" "tented" "subconsciously" "pocket-size"
## [17] "shik" "housi" "kuverttl" "cartas"
Now we calculate the statistics the same way as above, but with vectors. Since there might be terms in the target counts which we did not observe in the comparison corpus, we need to make both vocabularies match. For this, we append the terms unique to the target as zero counts to the comparison frequency vector.
Moreover, we use a little trick to check for zero counts in a or b when computing t1 or t2. If a count is zero, the log function would produce an NaN value, which we want to avoid. In this case, the a == 0 (resp. b == 0) expression adds 1 inside the log, which yields 0 after applying the log function:
# Create vector of zeros to append to comparison counts
zeroCounts <- rep(0, length(uniqueTerms))
names(zeroCounts) <- uniqueTerms
termCountsComparison <- c(termCountsComparison, zeroCounts)
# Get list of terms to compare from intersection of target and comparison vocabulary
termsToCompare <- intersect(names(termCountsTarget), names(termCountsComparison))
# Calculate statistics (same as above, but now with vectors!)
a <- termCountsTarget[termsToCompare]
b <- termCountsComparison[termsToCompare]
c <- sum(termCountsTarget)
d <- sum(termCountsComparison)
Expected1 = c * (a+b) / (c+d)
Expected2 = d * (a+b) / (c+d)
t1 <- a * log((a/Expected1) + (a == 0))
t2 <- b * log((b/Expected2) + (b == 0))
logLikelihood <- 2 * (t1 + t2)
# Compare relative frequencies to indicate over/underuse
relA <- a / c
relB <- b / d
Display ten of the underused terms (terms whose relative frequency in the target corpus is lower than in the comparison corpus):
# underused terms
head(logLikelihood[relA < relB], 10)
## many university men shanghai staff gen
## 1410.433358 553.887485 1469.257425 2295.913060 198.051158 241.174005
## hwang fu mayor chinese
## 3.173342 117.361363 60.293501 4372.478139
Let's take a look at the results: first the 50 terms with the highest LL values, then the terms with the lowest LL values, and finally a comparison of LL values with raw frequencies.
Terms overused in the China Weekly Review compared to the North China Herald:
# top terms (overuse in targetCorpus compared to comparisonCorpus)
sort(logLikelihood, decreasing=TRUE)[1:50]
## china chinese mr one shanghai made time
## 4410.2402 4372.4781 3665.5234 3335.1826 2295.9131 1904.0053 1808.8050
## now peking may government said also people
## 1803.4403 1766.0244 1667.5913 1639.0312 1615.8135 1558.1110 1542.8078
## work new men two upon many s
## 1540.8788 1518.8936 1469.2574 1437.5240 1410.5506 1410.4334 1357.8082
## foreign much ing present students first country
## 1347.0838 1326.4570 1310.5903 1284.9000 1272.5102 1245.5368 1243.9501
## can great re mrs dr must ed
## 1243.7744 1156.6832 1126.5365 1120.1898 1095.1266 1086.8697 1069.4163
## japan general japanese well good canton tion
## 1066.2430 1040.8562 1021.8162 950.4160 940.8960 939.3093 934.5493
## way british even man given day make
## 929.7893 920.2692 900.8410 891.7091 880.6025 878.2485 866.3224
## order
## 847.2824
Terms underused in the China Weekly Review compared to the North China Herald:
# bottom terms (underuse in targetCorpus compared to comparisonCorpus)
sort(logLikelihood, decreasing=FALSE)[1:25]
## embraces indefinite looters compulsion
## 0.0002171946 0.0002171946 0.0002171946 0.0002171946
## zealand insect pulpit road_shanghai
## 0.0002171946 0.0002171946 0.0002171946 0.0002171946
## chinese_citizens explains cer- grammar
## 0.0004397459 0.0006515837 0.0022634746 0.0022634746
## frederick brighter smothered tastes
## 0.0022634746 0.0022634746 0.0022634746 0.0022634746
## mixture sors strains premiership
## 0.0022634746 0.0022634746 0.0022634746 0.0022634746
## abiding roy sybil tremen
## 0.0022634746 0.0022634746 0.0022634746 0.0022634746
## well-trained
## 0.0022634746
We also see that terms with comparatively low frequencies are identified by the LL test as statistically significant compared to the reference corpus:
llTop100 <- sort(logLikelihood, decreasing=TRUE)[1:100]
frqTop100 <- termCountsTarget[names(llTop100)]
frqLLcomparison <- data.frame(llTop100, frqTop100)
head(frqLLcomparison, 10)
# Number of significantly overused terms (p < 0.01)
sum(logLikelihood > 6.63)
## [1] NA
The NA result comes from NaN values in the logLikelihood vector: terms that occur in neither the target nor the comparison counts have an expected count of zero, so a/Expected1 evaluates to NaN. To count the significantly overused terms, these missing values have to be excluded, for instance with sum(logLikelihood > 6.63 & relA > relB, na.rm = TRUE).
Finally, we visualize the 50 most significant terms as a word cloud. This can be done with a function from the wordcloud2 package. In addition to the words and their weights (here we use the LL values), we override the default shuffle and size parameters. Feel free to try different parameters to modify the word cloud rendering.
require(wordcloud2)
## Loading required package: wordcloud2
top50 <- sort(logLikelihood, decreasing = TRUE)[1:50]
top50_df <- data.frame(word = names(top50), count = top50, row.names = NULL)
wordcloud2(top50_df, shuffle = F, size = 0.5)
The wordcloud shows the 50 terms that were the most significantly overused in the China Weekly Review compared to the North-China Herald in the “returned students” (RS) collection.
Key term extraction cannot be done for single documents, but only for entire (sub-)corpora, and depending on the comparison corpus the results may vary. Instead of comparing a single document or corpus to another document or corpus, we now compare collections (samples) of articles within our corpus to the other articles in the same corpus. We use the decade as the basis for sampling our corpus.
For this, we iterate over all decades using a for-loop. Within the loop, we use a logical vector (Boolean TRUE/FALSE values) to split the DTM into two sub-matrices: the rows of the currently selected decade and the rows of all other decades. From these matrices, the target and comparison frequency counts are created. The statistical computation of the log-likelihood measure described above is outsourced to the function calculateLogLikelihood, which we load with the source command before running the loop. The function takes both frequency vectors as input parameters and outputs an LL value for each term of the target vector.
Results of the LL key term extraction are again visualized as word clouds. Instead of plotting the word clouds in RStudio, this time we write the visualizations as PDF files to disk, in the wordclouds folder. After the for-loop has completed, the folder should contain ten word cloud PDFs, one for each decade.
First we create the variable “decade”:
textdata$year <- substr(textdata$Date, 0, 4)
textdata$decade <- paste0(substr(textdata$Date, 0, 3), "0")
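This assumes that the Date field is stored as a character string beginning with the four-digit year; the date value in the following one-liner is just a made-up example:
# e.g. an article dated "1917-04-07" is assigned to the decade "1910"
paste0(substr("1917-04-07", 1, 3), "0")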
We define the target corpus:
targetDTM <- DTM
Then we load the function calculateLogLikelihood
source("https://tm4ss.github.io/calculateLogLikelihood.R")
We run the loop over all decades:
rs_decades <- unique(textdata$decade)
# create the output folder if it does not exist yet
dir.create("wordclouds", showWarnings = FALSE)
for (decade in rs_decades) {
cat("Extracting terms per decade", decade, "\n")
selector_logical_idx <- textdata$decade == decade
decadeDTM <- targetDTM[selector_logical_idx, ]
termCountsTarget <- colSums(decadeDTM)
otherDTM <- targetDTM[!selector_logical_idx, ]
termCountsComparison <- colSums(otherDTM)
loglik_terms <- calculateLogLikelihood(termCountsTarget, termCountsComparison)
top50 <- sort(loglik_terms, decreasing = TRUE)[1:50]
fileName <- paste0("wordclouds/", decade, ".pdf")
pdf(fileName, width = 9, height = 7)
wordcloud::wordcloud(names(top50), top50, max.words = 50, scale = c(3, .9), colors = RColorBrewer::brewer.pal(8, "Dark2"), random.order = F)
dev.off()
}
## Extracting terms per decade 1920
## Extracting terms per decade 1910
## Extracting terms per decade 1930
## Extracting terms per decade 1950
## Extracting terms per decade 1940
## Extracting terms per decade 1900
## Extracting terms per decade 1870
## Extracting terms per decade 1890
## Extracting terms per decade 1840
## Extracting terms per decade 1880
Wiedemann, Gregor, and Andreas Niekler. 2017. “Hands-on: A Five Day Text Mining Course for Humanists and Social Scientists in R.” In Proceedings of the Workshop on Teaching NLP for Digital Humanities (Teach4DH2017), Berlin, Germany, September 12, 2017., 57–65. http://ceur-ws.org/Vol-1918/wiedemann.pdf.