Abstract
This document experiments with alternative topic models to further explore the “returned students” corpus. It supplements two previous experiments based on structural topic modeling and tidy-data approaches. Additional features include topic modeling over time, topic ranking, and filtering documents based on their topical content.
This document demonstrates the use of topic models on a corpus (here, the “returned students” press corpus) for the extraction of latent semantic contexts from the documents. It is largely based on the following tutorial.
The process starts, as usual, with reading the corpus data. For this experiment, we use the press corpus related to “returned students” extracted from the ProQuest Collection of Chinese Historical Newspapers. More details about how the corpus was created can be found here.
We load the pre-built corpus:
library(readr)
library(dplyr)  # for the pipe and distinct()

rs_full_text <- read_csv("NER_output/rs_full_text.csv",
                         col_types = cols(X1 = col_skip()))

rs_full_text %>%
  distinct(DocID, Date, Title, Source, Text) %>%
  head()
The corpus contains 2739 documents (articles); only the first six are shown. The table indicates the unique identifier of each article (DocID), the date of publication (Date, in YYYYMMDD format), the title of the article (Title), the periodical it came from (Source), and the full text of the article (Text).
Some articles, especially “Opinion” articles or articles from “People and Events” newspaper sections, are rather long documents (more details about the classification of articles here). Others are very short (e.g. advertisements). Document length clearly affects the results of topic modeling. For very short or very long articles, it can make sense to concatenate or split single documents to obtain longer or shorter textual units for modeling, as sketched below. We will try various strategies later in the process; for now, let’s take the full text of each article as the basic unit for topic modeling.
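As an illustration of the splitting strategy, here is a sketch (not applied in this tutorial) using quanteda’s corpus_reshape(), which can break long articles into sentences, or into paragraphs when the text contains blank-line separators. The object names rs_unique and rs_sentences are purely illustrative:

library(quanteda)
library(dplyr)

# deduplicate as in the preprocessing below, then reshape into shorter units
rs_unique <- distinct(rs_full_text, DocID, Date, Title, Source, Text)
rs_sentences <- corpus(rs_unique$Text, docnames = rs_unique$DocID) %>%
  corpus_reshape(to = "sentences")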
We load the packages:
options(stringsAsFactors = FALSE)
library(quanteda)
library(quanteda.textstats)  # provides textstat_collocations() with quanteda >= 3.0
require(topicmodels)
The first thing we need to do is to turn our original corpus into a document-term matrix (DTM) that the topic model algorithm (LDA) can read. During preprocessing, the text is tokenized, and words are stemmed and converted to lowercase. Punctuation, special characters and stopwords are removed, since they tend to occur as “noise” in the estimated topics of the LDA model. In addition, we use the textstat_collocations() function to identify multi-word expressions and retain only the most frequent ones (at least 25 occurrences):
rs_full_text <- rs_full_text %>% distinct(DocID, Date, Title, Source, Text)
rs_corpus <- corpus(rs_full_text$Text, docnames = rs_full_text$DocID)

corpus_tokens <- rs_corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_wordstem() %>%
  tokens_remove(pattern = stopwords(), padding = TRUE)

# identify multi-word expressions occurring at least 25 times and compound them
rs_collocations <- textstat_collocations(corpus_tokens, min_count = 25)
corpus_tokens <- tokens_compound(corpus_tokens, rs_collocations)
After the preprocessing, we have two corpus objects: rs_corpus, which holds the original article texts, and corpus_tokens, which holds the preprocessed and compounded tokens.
When we create the DTM, we retain only the terms that occur with a certain minimum frequency in the corpus. This primarily speeds up the model calculation. In this case, we remove the terms that occur in less than 1% of all documents:
DTM <- corpus_tokens %>%
  tokens_remove("") %>%
  dfm() %>%
  dfm_trim(min_docfreq = 0.01, max_docfreq = Inf, docfreq_type = "prop")
dim(DTM)
## [1] 2739 5970
The matrix contains 2739 documents and 5970 terms.
For topic modeling, not only language-specific stopwords but also domain-specific terms can be considered uninformative. Therefore, we also remove some of the most frequent terms to improve the modeling. This includes the keywords we used to query and build the corpus, as well as the most frequent non-words reflecting OCR errors (based on our previous exploration of the text in this tutorial):
top_terms <- c("returned", "students", "china", "chinese", "chines",
               "tho", "ing", "will", "one", "peke", "educ", "polic", "said")
DTM <- DTM[, !(colnames(DTM) %in% top_terms)]
Due to vocabulary pruning, some rows of the DTM may now be empty (documents that have lost all their terms). Since LDA cannot handle empty documents, we remove them from the DTM and from the metadata:
sel_idx <- rowSums(DTM) > 0
DTM <- DTM[sel_idx, ]
rs_full_text <- rs_full_text[sel_idx, ]
We can now calculate a topic model on the processed corpus.
As an unsupervised machine learning method, topic modeling is well suited to the exploration of data. The calculation of topic models aims to determine the proportionate composition of a fixed number of topics in the documents of a collection. It is useful to experiment with different parameters in order to find the settings that best suit your own analysis needs.
For parameterized models such as Latent Dirichlet Allocation (LDA), the number of topics K is the most important parameter to define in advance. How an optimal K should be selected depends on various factors. If K is too small, the collection is divided into a few very general semantic contexts. If K is too large, the collection is divided into too many topics of which some may overlap and others are hardly interpretable.
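One possible way to compare several candidate values of K before committing to one is sketched below, assuming the ldatuning package is installed; it refits the model for each K, so it can take a while on the full corpus. The name k_scores is illustrative:

library(ldatuning)

# compare candidate values of K with two commonly used metrics
k_scores <- FindTopicsNumber(
  convert(DTM, to = "topicmodels"),  # coerce the quanteda dfm if necessary
  topics  = seq(5, 30, by = 5),
  metrics = c("CaoJuan2009", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 1),
  verbose = TRUE
)
FindTopicsNumber_plot(k_scores)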
For our first analysis, we choose a thematic “resolution” of K = 10 topics (based on our previous experiment with structural topic modeling). In contrast to a resolution of 100 or more topics, this number can be evaluated qualitatively with relative ease. We also set the seed of the random number generator to ensure reproducible results across repeated model inferences.
require(topicmodels)
K <- 10
topicModel <- LDA(DTM, K, method="Gibbs", control=list(iter = 500, seed = 1, verbose = 25))
## K = 10; V = 5961; M = 2739
## Sampling 500 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Iteration 325 ...
## Iteration 350 ...
## Iteration 375 ...
## Iteration 400 ...
## Iteration 425 ...
## Iteration 450 ...
## Iteration 475 ...
## Iteration 500 ...
## Gibbs sampling completed!
Depending on the size of the vocabulary, the collection size and the number K, the inference of topic models can take a very long time. This calculation may take several minutes. If it takes too long, reduce the vocabulary in the DTM by increasing the minimum frequency in the previous step.
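For example, a more aggressive trimming step (a sketch; 2% instead of 1% minimum document frequency) yields a smaller vocabulary and hence faster sampling. The object name DTM_small is illustrative:

# a more aggressive trim: keep only terms occurring in at least 2% of documents
DTM_small <- corpus_tokens %>%
  tokens_remove("") %>%
  dfm() %>%
  dfm_trim(min_docfreq = 0.02, docfreq_type = "prop")
dim(DTM_small)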
Fitting the model was the “easy part”: the rest of the analysis involves exploring and interpreting the fitted model. Below we rely on the posterior() function from topicmodels; the tidytext package offers an alternative, tidy()-based workflow (see the sketch that follows).
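A minimal sketch of that tidy workflow, assuming the tidytext and dplyr packages are installed (the object names topic_terms and doc_topics are illustrative); the rest of this tutorial sticks to posterior():

library(tidytext)
library(dplyr)

# term-topic probabilities (beta) in long format
topic_terms <- tidy(topicModel, matrix = "beta")
topic_terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 10)

# document-topic probabilities (gamma) in long format
doc_topics <- tidy(topicModel, matrix = "gamma")
head(doc_topics)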
The topic model inference results in two (approximate) posterior probability distributions: a distribution theta over the K topics within each document, and a distribution beta over the V terms of the vocabulary within each topic.
Let’s take a closer look at these results:
Format of the resulting object:
tmResult <- posterior(topicModel)
attributes(tmResult)
## $names
## [1] "terms" "topics"
Length of vocabulary:
ncol(DTM)
## [1] 5961
Word-topic probabilities (beta):
beta <- tmResult$terms
dim(beta)
## [1] 10 5961
Rows in beta sum to 1:
rowSums(beta)
## 1 2 3 4 5 6 7 8 9 10
## 1 1 1 1 1 1 1 1 1 1
Size of collection (2739 documents):
nrow(DTM)
## [1] 2739
Document-topic probabilities (theta):
theta <- tmResult$topics
dim(theta)
## [1] 2739 10
Rows in theta sum to 1:
rowSums(theta)[1:10]
## 1324674682 1326716896 1326716374 1416388974 1326717530 1319877652 1419906755
## 1 1 1 1 1 1 1
## 1319880525 1371328348 1326718133
## 1 1 1
Let’s take a look at the 10 most likely terms within the term probabilities beta of the inferred topics:
terms(topicModel, 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "work" "japan" "student" "peopl" "report" "road"
## [2,] "countri" "nation" "school" "can" "ministri" "appli"
## [3,] "can" "japanes" "work" "life" "appoint" "ii"
## [4,] "onli" "american" "univers" "may" "order" "hous"
## [5,] "must" "govern" "colleg" "world" "offic" "room"
## [6,] "time" "america" "church" "onli" "state" "box"
## [7,] "use" "countri" "studi" "us" "minist" "china_press"
## [8,] "much" "ot" "christian" "must" "offici" "let"
## [9,] "ani" "polici" "institut" "men" "court" "ol"
## [10,] "system" "war" "also" "man" "receiv" "bo"
## Topic 7 Topic 8 Topic 9 Topic 10
## [1,] "bank" "canton" "day" "mr"
## [2,] "shanghai" "nank" "time" "shanghai"
## [3,] "busi" "report" "s" "mrs"
## [4,] "industri" "armi" "citi" "dr"
## [5,] "railway" "troop" "men" "miss"
## [6,] "compani" "provinc" "made" "member"
## [7,] "year" "japanes" "two" "held"
## [8,] "new" "militari" "place" "return"
## [9,] "now" "parti" "mani" "presid"
## [10,] "trade" "general" "veri" "club"
The resulting topics are very similar to those we identified in our previous experiments based on structural topic modeling and alternative LDA implementations.
Based on the top terms above, this model found topics related to business and industry (topic 7), Japan and international relations (topic 2), education (topic 3), government service and officialdom (topic 5), advertisements (topic 6), Canton government and military operations (topic 8), and club life in Shanghai (topic 10). The distribution of words within each topic, however, differs from previous models. Moreover, the model produced rather unspecific “non-topics” (topics 4 and 9), partly made up of generic terms and non-words that we should have removed earlier in the preprocessing step.
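If we wanted to refit the model without these noise terms, one option (a sketch, not applied here, reusing the non-words surfacing in the topics above) is to extend the custom stopword list, rebuild the trimmed DTM, and run LDA() again. The names more_noise and DTM_clean are illustrative:

# extend the list of uninformative terms with the OCR non-words identified above
more_noise <- c("ii", "ol", "bo", "ot", "lie", "s")
DTM_clean <- DTM[, !(colnames(DTM) %in% more_noise)]
DTM_clean <- DTM_clean[rowSums(DTM_clean) > 0, ]  # drop any documents emptied by the pruning
# topicModel_clean <- LDA(DTM_clean, K, method = "Gibbs", control = list(iter = 500, seed = 1))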
For the next steps, we want to give the topics more descriptive names than just numbers. Therefore, we simply concatenate the five most likely terms of each topic into a string that serves as a pseudo-name for each topic.
top5termsPerTopic <- terms(topicModel, 5)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse=" ")
topicNames
## Topic 1 Topic 2
## "work countri can onli must" "japan nation japanes american govern"
## Topic 3 Topic 4
## "student school work univers colleg" "peopl can life may world"
## Topic 5 Topic 6
## "report ministri appoint order offic" "road appli ii hous room"
## Topic 7 Topic 8
## "bank shanghai busi industri railway" "canton nank report armi troop"
## Topic 9 Topic 10
## "day time s citi men" "mr shanghai mrs dr miss"
Although wordclouds may not be optimal for scientific purposes, they can provide a quick visual overview of a set of terms. Let’s look at some topics as wordclouds. In the following code, you can set the variable topicToViz1 to any value between 1 and 10 to display other topics (e.g. topic 1):
require(wordcloud2)
topicToViz1 <- 1
You may also select a topic by a term contained in its pseudo-name (e.g. “bank”, which selects topic 7 here); be careful with terms that occur in several topic names, as grep() returns all matches and we keep only the first:
require(wordcloud2)
topicToViz1 <- grep('bank', topicNames)[1]
Then select the 40 most probable terms from the topic by sorting the term-topic-probability vector in decreasing order, extract the probabilities of each of the 40 terms, and visualize the terms as a wordcloud:
top40terms1 <- sort(tmResult$terms[topicToViz1, ], decreasing = TRUE)[1:40]
words1 <- names(top40terms1)
probabilities1 <- unname(top40terms1)
wordcloud2(data.frame(words1, probabilities1), shuffle = FALSE, size = 0.8)
Let us now look more closely at the distribution of topics within individual documents. To this end, we visualize the distribution in three sample documents.
Let us first take a look at the contents of three sample documents:
exampleIds <- c(2, 100, 200)
cat(rs_corpus[exampleIds[1]])
## THREE American returned students two British returned students one French returned student one German returned student one graduate of the Shensi University and two Japanese returned students were elected at the first section of the long-expected elections tor the Senate held on June 20' and 21 in Peking For the first time in the history of the Republic had so many foreign-educated Chinese voted Over four hundred registered and some three hundred cast their votes The majority of them are graduates of Western or Japanese universities They questioned the legality of the new Parliament to be convened but they parti in the election nevertheless on the ground that they should get into the new legislature which may be illegal their own representatives who may render some service to the public and at least can keep them informed of what is going on therein THOSE who were elected Senators are Tsur former president of the Tsing Hua College Ho Yen-sun secretary to Liang Shih-yi and Chen Huai-chang Confucianist who are returned students from America Lo Hung-nien of the Bank of China and Wang Shih-ching editor of the Peking Daily News returned students from England Wu Chun from Ger many Wu Ching-lien from France former Minister to Italy Ting Yung and Wei Sze-kan from Japan and Hsu Shih graduate of the Shensi University They were elected from the first section in which both the candidates for senatorship and voters must possess special literary or educational qualifications namely scholars returned students who have been back at least for three years prior to June 10, 1918, authors whose books have been recognized by the Ministry of Education The voting took place in the Hall of the House of Representa tives and the election was presided over bv Fu Tseng-hsiang Minister of Education
cat(rs_corpus[exampleIds[2]])
## THE PRESS BY-LAW July 14 We publish to-day letter from valued Chinese corresponent which gives us an opportunity of correcting misconceptions of the Press Licensing By-law of which we would earnestly ask our Chinese readers to clear their and their friends' minds Without closing identities if such man as Western Returned Student really believes that the Council seeks to kill growing Chines opinion what can be the state of mind of the less educated 1 One can handly blame them seeing how industriously they have heen misinformed to use no harsher term by foreigners who ought to have known better But the plain fact is and it cannot be too strongly emphasized that the last thing the Council desires is to check or in any way interfere with newspapers and fair criticism interpreting these terms in the most liberal sense What iit does wish to do is to stop publica tion of rumours that food and water are being poisoned or pasters depicting Chinese being sold as slaves We hardly think that Western Returned Student will disagree with such desire Of course all this discussion is mere waste of time Sir Havilland de Sausmarez has shown that the licence conditions aire unworkable and the Consular Body has plainly intimated that it does not mean to approve them If Western Returned Student and those like him will use their influence to spread knowledge of these facts among the Chinese they will be doing far more valuable work than in giving renewed currency to mischievous inaccuracies
cat(rs_corpus[exampleIds[3]])
## And They Found Some Money in The Treasury We present the following as proof that occasionally the unexpected does happen even in China Several months ago when Marshal Sun Chuan-fang was in control of Shanghai someone induced him to establish department of Public Health in the Chinese areas adjoining the Foreign Settlements at Shanghai The head of the new Health Service was Dr Hou-ki Hu an American returned student who in the brief space of few months and in the face of complications produced by the disturbed political situation was however able to get his department going in pretty good order Health stations and dispensaries were established and considerable progress was made in the dissemination of health propaganda particularly from the standpoint of vaccination against smallpox But as so often happens in China the war came on and Marshal Sun was defeated and forced to get out When Chiang Kai-shek came in he in accordance with political practice all over the world appointed his own officials includ ing new doctor in the Health Department The new man who is now functioning is Dr Liu Zee-tse returned student from Germany who has been local practitioner at Shanghai for some years But this is the interesting part of the story When Dr Liu arrived to take over from Dr Hu he found Dr Hu still in the office and the transfer of records and even the balance in the bank was made in thoroughly formal and official manner just as they do it in New York or London It is reported that this is the first case where the Southern cr Kuomintang Party has been able to find records in order and money actually in the bank to be transferred over to the new incumbent
After looking into the documents, we visualize the topic distributions within the documents using the packages reshape2 and ggplot2:
Load libraries for visualization:
library(reshape2)
library(ggplot2)
Get topic proportions from the example documents:
N <- length(exampleIds)
topicProportionExamples <- theta[exampleIds,]
colnames(topicProportionExamples) <- topicNames
vizDataFrame <- melt(cbind(data.frame(topicProportionExamples), document = factor(1:N)), variable.name = "topic", id.vars = "document")
Visualize the proportions of topics in each sample document:
ggplot(data = vizDataFrame, aes(topic, value, fill = document)) +
  geom_bar(stat = "identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  facet_wrap(~ document, ncol = N)
The barplots show how topics are distributed within each document according to the model. In the current model, all three documents show at least a small share of each topic, but one or two topics dominate each of them. For instance, topic 4 (actually a non-topic) dominates in document 2, whereas topic 5 (official appointments) dominates in document 1. None of the ten topics clearly prevails in document 3, but topics 2 and 8, related to Japan and Canton, are slightly more prevalent.
The topic distribution within a document can be controlled with the alpha parameter of the model. Higher alpha priors result in a more even distribution of topics within a document, while lower alpha priors lead the inference process to concentrate the probability mass on a few topics per document.
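To build intuition, here is a small illustration (a sketch, assuming the gtools package is installed) drawing document-topic distributions from a symmetric Dirichlet prior with a high and a low alpha:

library(gtools)

set.seed(1)
round(rdirichlet(1, rep(5,   10)), 2)  # high alpha: probability mass spread fairly evenly
round(rdirichlet(1, rep(0.5, 10)), 2)  # low alpha: mass concentrated on a few topics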
In the previous model calculation, the alpha prior was left at the implementation’s default value (50/K, i.e. 5 for K = 10 topics). This default does not necessarily correspond to the results one would like to have as an analyst: depending on our analysis interest, we might prefer a more peaky or a more even distribution of topics per document.
Let us change the alpha prior to a lower value and see how this affects the topic distributions in the model.
First take a look at alpha from the previous model (5):
attr(topicModel, "alpha")
## [1] 5
Let’s change alpha to 0.5:
topicModel2 <- LDA(DTM, K, method="Gibbs", control=list(iter = 500, verbose = 25, seed = 1, alpha = 0.5))
## K = 10; V = 5961; M = 2739
## Sampling 500 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Iteration 325 ...
## Iteration 350 ...
## Iteration 375 ...
## Iteration 400 ...
## Iteration 425 ...
## Iteration 450 ...
## Iteration 475 ...
## Iteration 500 ...
## Gibbs sampling completed!
tmResult <- posterior(topicModel2)
theta <- tmResult$topics
beta <- tmResult$terms
Now visualize the topic distributions in the three documents again to see whether there are differences in the distribution structure.
First get topic proportions from example documents:
topicNames2 <- apply(terms(topicModel2, 5), 2, paste, collapse = " ")
topicProportionExamples2 <- theta[exampleIds,]
colnames(topicProportionExamples2) <- topicNames2
vizDataFrame2 <- melt(cbind(data.frame(topicProportionExamples2), document = factor(1:N)), variable.name = "topic", id.vars = "document")
Visualize the new topic distributions:
ggplot(data = vizDataFrame2, aes(topic, value, fill = document)) +
  geom_bar(stat = "identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  facet_wrap(~ document, ncol = N)
What are the differences in the distribution structure? The prevalence of certain topics in each document is much stronger in the new model with a lower alpha (0.5) than in the previous model with a tenfold higher alpha (5). Topic 5 clearly dominates in document 1. Topic 2 prevails in document 2 (even ahead of the non-topic 4). Document 3 is a mixture of more evenly distributed topics, yet dominated by topics 4 and 8.
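We can also quantify this difference by comparing, for each document, the share of its single most prominent topic under both models (a quick illustrative check, not part of the original workflow; theta_alpha5 and theta_alpha05 are illustrative names):

# share of the single most prominent topic per document under each alpha setting
theta_alpha5  <- posterior(topicModel)$topics    # default alpha (5)
theta_alpha05 <- posterior(topicModel2)$topics   # alpha = 0.5
summary(apply(theta_alpha5,  1, max))
summary(apply(theta_alpha05, 1, max))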
What are the defining topics within the collection? There are different approaches to answer this question and to bring the topics into a meaningful order.
Before ranking the topics themselves, we first derive more meaningful topic labels by re-ranking the top terms of each topic with a specific score [2]. The idea behind this re-ranking is similar to that of TF-IDF: a term that appears among the top terms of many topics is not very useful for characterizing any single one of them. Hence, the scoring favors less general, more specific terms to describe a topic.
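As an aside, here is a minimal sketch of one such re-ranking score, computed directly from beta; it is similar in spirit to, but not necessarily identical with, what lda::top.topic.words() computes with by.score = TRUE (term_score and log_beta are illustrative names):

# TF-IDF-like re-ranking score: a term scores highly in topic k if it is
# probable in k but improbable, on average, across the other topics
log_beta   <- log(beta + 1e-5)
term_score <- beta * sweep(log_beta, 2, colMeans(log_beta), "-")
# five highest-scoring terms for, e.g., topic 1
colnames(beta)[order(term_score[1, ], decreasing = TRUE)[1:5]]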
Re-rank the top topic terms used for the topic names:
topicNames <- apply(lda::top.topic.words(beta, 5, by.score = TRUE), 2, paste, collapse = " ")
Approach 1: We sort topics according to their probability within the entire collection:
topicProportions <- colSums(theta) / nrow(DTM)  # mean topic probabilities over all documents
names(topicProportions) <- topicNames           # assign the topic names created above
sort(topicProportions, decreasing = TRUE)       # show mean proportions in decreasing order
## mr mrs shanghai miss dr countri peopl can nation must
## 0.17856684 0.16899219
## ministri appoint report court state citi girl women man littl
## 0.11044525 0.10821871
## student christian church school colleg japanes japan troop armi nank
## 0.09559074 0.08289550
## bank cotton industri railway trade canton organ member committe union
## 0.08155210 0.06523564
## ol bo ii ot lie appli china_press box room tel
## 0.06341015 0.04509288
We recognize that some topics are far more likely to occur in the corpus than others. These capture rather general thematic coherences, while other topics correspond to more specific contents.
Approach 2 (Rank-1): count how often each topic appears as the most probable (primary) topic of a document:
countsOfPrimaryTopics <- rep(0, K)
names(countsOfPrimaryTopics) <- topicNames
for (i in 1:nrow(DTM)) {
  topicsPerDoc <- theta[i, ]                                 # topic distribution for document i
  primaryTopic <- order(topicsPerDoc, decreasing = TRUE)[1]  # position of the most probable topic
  countsOfPrimaryTopics[primaryTopic] <- countsOfPrimaryTopics[primaryTopic] + 1
}
sort(countsOfPrimaryTopics, decreasing = TRUE)
## mr mrs shanghai miss dr countri peopl can nation must
## 665 581
## ministri appoint report court state citi girl women man littl
## 309 271
## student christian church school colleg japanes japan troop armi nank
## 266 183
## bank cotton industri railway trade canton organ member committe union
## 168 132
## appli china_press box room tel ol bo ii ot lie
## 102 62
The distribution has not changed much, except for the last two topics. Generally, the Rank-1 method places topics with rather specific thematic coherences in the upper ranks of the list.
This sorting of topics can be used for further analysis steps such as the semantic interpretation of topics found in the collection, the analysis of time series of the most important topics or the filtering of the original collection based on specific sub-topics.
The fact that a topic model conveys topic probabilities for each document makes it possible to use it for thematic filtering of a collection. As a filter, we select only those documents whose probability for a given topic exceeds a certain threshold (for example, every document in which topic X accounts for more than Y percent of the content).
In the following, we will select documents based on their topic content and display the resulting document quantity over time. In this case, we focus on the bank/business topic, but you can select another topic:
topicToFilter1 <- 1 # you can set this manually ...
# ... or have it selected by a term in the topic name
topicToFilter1 <- grep('bank ', topicNames)[1]
topicThreshold <- 0.1 # minimum share of content must be attributed to the selected topic
selectedDocumentIndexes <- (theta[, topicToFilter1] >= topicThreshold)
filteredCorpus <- rs_corpus %>% corpus_subset(subset = selectedDocumentIndexes)
# show length of filtered corpus
filteredCorpus
## Corpus consisting of 708 documents.
## 1419906755 :
## " Those Returned St..."
##
## 1319880525 :
## " fr Ht fc fr MILLA..."
##
## 1420645297 :
## " BASKET BALL IN PE..."
##
## 1371778568 :
## " Verein Wa Origina..."
##
## 1369950451 :
## " BOXER MONEY PLEA ..."
##
## 1324901182 :
## " Chinese Journal I..."
##
## [ reached max_ndoc ... 702 more documents ]
Our filtered corpus contains 708 documents in which the selected (bank) topic accounts for at least 10% of the content.
Filtering by topic is useful to focus on the documents that are most related to a specific topic you are interested in. You can increase the minimum probability in order to restrict the size of the filtered corpus and thereby refine your research.
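For instance, here is an illustrative way to see how the size of the filtered corpus shrinks as the threshold grows (the vector thresholds is illustrative):

# number of documents exceeding increasing probability thresholds for the selected topic
thresholds <- c(0.1, 0.2, 0.3, 0.5)
sapply(thresholds, function(x) sum(theta[, topicToFilter1] >= x))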
In a last step, we take a distant view of the topics in the data over time. For this, we aggregate mean topic proportions per decade over all articles related to returned students. These aggregated topic proportions can then be visualized, e.g. as a bar plot.
# append decade information for aggregation
rs_full_text$decade <- paste0(substr(rs_full_text$Date, 0, 3), "0")
# get mean topic proportions per decade
topic_proportion_per_decade <- aggregate(theta, by = list(decade = rs_full_text$decade), mean)
# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(K+1)] <- topicNames
# reshape data frame
vizDataFrame3 <- melt(topic_proportion_per_decade, id.vars = "decade")
# plot topic proportions per decade as bar plot
require(pals)
ggplot(vizDataFrame3, aes(x = decade, y = value, fill = variable)) +
  geom_bar(stat = "identity") + ylab("proportion") +
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "topic") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
The visualization shows that the topic related to law cases (black) clearly dominates the earlier decades of the corpus (1840-1870) and declines from 1900 onwards. Except for the 1880s, Japan and international relations (blue) remained an important topic throughout the entire period. Philosophical debates (people, life) (purple) gained currency in the 1880s, issues related to government appointments rose in the 1900s, and club life in Shanghai developed in the 1910s-1930s (light green). Topics related to Canton, communism and the war with Japan (orange), as well as education and missionary work (dark green), were the most important topics in the most recent articles (1940s-1950s).
Here are a few suggestions to take advantage of this topic model exercise:
Good luck!