Abstract
This document experiments with alternative topic models to further explore the “returned students” corpus. It supplements two previous experiments based on structural topic modeling and tidy-data approaches. Additional features include topic modeling over time, topic ranking, and filtering documents based on their topical content.
This document demonstrates the use of topic models on a corpus (here, the “returned students” press corpus) for the extraction of latent semantic contexts from the documents. It is largely based on the following tutorial.
The process starts, as usual, with reading the corpus data. For this experiment, we use the press corpus related to “returned students” extracted from the ProQuest Collection of Chinese Historical Newspapers. More details about how the corpus was created can be found here.
We load the pre-built corpus:
library(readr)
library(dplyr)  # for the pipe and distinct()

rs_full_text <- read_csv("NER_output/rs_full_text.csv",
                         col_types = cols(X1 = col_skip()))

rs_full_text %>%
  distinct(DocID, Date, Title, Source, Text) %>%
  head()
The corpus contains 2739 documents (articles); only the first six are shown. The table indicates the unique identifier of each article (DocID), the date of publication (Date, in YYYYMMDD format), the title of the article (Title), the periodical it came from (Source), and the full text of the article (Text).
Some articles, especially “Opinion” articles or articles from “People and Events” newspaper sections, are rather long documents (more details about the classification of articles here). Others are very short (e.g. advertisements). Document length clearly affects the results of topic modeling. For very short or very long articles, it can make sense to concatenate or split single documents to obtain longer or shorter textual units for modeling, as sketched below. We will try various strategies later in the process; for now, let’s take the full text of each article as the basic unit for topic modeling.
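As an illustration of the splitting strategy, here is a sketch (not applied in this tutorial) using quanteda’s corpus_reshape(), which can break long articles into sentences, or into paragraphs when the text contains blank-line separators. The object names rs_unique and rs_sentences are purely illustrative:

library(quanteda)
library(dplyr)

# deduplicate as in the preprocessing below, then reshape into shorter units
rs_unique <- distinct(rs_full_text, DocID, Date, Title, Source, Text)
rs_sentences <- corpus(rs_unique$Text, docnames = rs_unique$DocID) %>%
  corpus_reshape(to = "sentences")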
We load the packages:
options(stringsAsFactors = FALSE)
library(quanteda)
library(quanteda.textstats)  # provides textstat_collocations() with quanteda >= 3.0
require(topicmodels)
The first thing we need to do is to turn our original corpus into a document-term matrix (DTM) that the topic model algorithm (LDA) can read. During preprocessing, the text is tokenized, and words are stemmed and converted to lowercase. Punctuation, special characters and stopwords are removed, since they tend to occur as “noise” in the estimated topics of the LDA model. In addition, we use the textstat_collocations() function to identify multi-word expressions and retain only the most frequent ones (at least 25 occurrences):
rs_full_text <- rs_full_text %>% distinct(DocID, Date, Title, Source, Text)
rs_corpus <- corpus(rs_full_text$Text, docnames = rs_full_text$DocID)

corpus_tokens <- rs_corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_wordstem() %>%
  tokens_remove(pattern = stopwords(), padding = TRUE)

# identify multi-word expressions occurring at least 25 times and compound them
rs_collocations <- textstat_collocations(corpus_tokens, min_count = 25)
corpus_tokens <- tokens_compound(corpus_tokens, rs_collocations)
After the preprocessing, we have two corpus objects: rs_corpus, which holds the original article texts, and corpus_tokens, which holds the preprocessed and compounded tokens.
When we create the DTM, we retain only the terms that occur with a certain minimum frequency in the corpus. This primarily speeds up the model calculation. In this case, we remove the terms that occur in less than 1% of all documents:
DTM <- corpus_tokens %>%
  tokens_remove("") %>%
  dfm() %>%
  dfm_trim(min_docfreq = 0.01, max_docfreq = Inf, docfreq_type = "prop")
dim(DTM)
## [1] 2739 5970
The matrix contains 2739 documents and 5970 terms.
For topic modeling, not only language-specific stopwords but also domain-specific terms can be considered uninformative. Therefore, we also remove some of the most frequent terms to improve the modeling. This includes the keywords we used to query and build the corpus, as well as the most frequent non-words reflecting OCR errors (based on our previous exploration of the text in this tutorial):
top_terms <- c("returned", "students", "china", "chinese", "chines",
               "tho", "ing", "will", "one", "peke", "educ", "polic", "said")
DTM <- DTM[, !(colnames(DTM) %in% top_terms)]
Due to vocabulary pruning, some rows of the DTM may now be empty (documents that have lost all their terms). Since LDA cannot handle empty documents, we remove them from the DTM and from the metadata:
sel_idx <- rowSums(DTM) > 0
DTM <- DTM[sel_idx, ]
rs_full_text <- rs_full_text[sel_idx, ]
We can now calculate a topic model on the processed corpus.
As an unsupervised machine learning method, topic modeling is well suited to the exploration of data. The calculation of topic models aims to determine the proportionate composition of a fixed number of topics in the documents of a collection. It is useful to experiment with different parameters in order to find the settings that best suit your own analysis needs.
For parameterized models such as Latent Dirichlet Allocation (LDA), the number of topics K is the most important parameter to define in advance. How an optimal K should be selected depends on various factors. If K is too small, the collection is divided into a few very general semantic contexts. If K is too large, the collection is divided into too many topics of which some may overlap and others are hardly interpretable.
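One possible way to compare several candidate values of K before committing to one is sketched below, assuming the ldatuning package is installed; it refits the model for each K, so it can take a while on the full corpus. The name k_scores is illustrative:

library(ldatuning)

# compare candidate values of K with two commonly used metrics
k_scores <- FindTopicsNumber(
  convert(DTM, to = "topicmodels"),  # coerce the quanteda dfm if necessary
  topics  = seq(5, 30, by = 5),
  metrics = c("CaoJuan2009", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 1),
  verbose = TRUE
)
FindTopicsNumber_plot(k_scores)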
For our first analysis, we choose a thematic “resolution” of K = 10 topics (based on our previous experiment with structural topic modeling). In contrast to a resolution of 100 or more topics, this number can be evaluated qualitatively with relative ease. We also set the seed of the random number generator to ensure reproducible results across repeated model inferences.
require(topicmodels)
K <- 10
topicModel <- LDA(DTM, K, method="Gibbs", control=list(iter = 500, seed = 1, verbose = 25))
## K = 10; V = 5961; M = 2739
## Sampling 500 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Iteration 325 ...
## Iteration 350 ...
## Iteration 375 ...
## Iteration 400 ...
## Iteration 425 ...
## Iteration 450 ...
## Iteration 475 ...
## Iteration 500 ...
## Gibbs sampling completed!
Depending on the size of the vocabulary, the collection size and the number K, the inference of topic models can take a very long time. This calculation may take several minutes. If it takes too long, reduce the vocabulary in the DTM by increasing the minimum frequency in the previous step.
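For example, a more aggressive trimming step (a sketch; 2% instead of 1% minimum document frequency) yields a smaller vocabulary and hence faster sampling. The object name DTM_small is illustrative:

# a more aggressive trim: keep only terms occurring in at least 2% of documents
DTM_small <- corpus_tokens %>%
  tokens_remove("") %>%
  dfm() %>%
  dfm_trim(min_docfreq = 0.02, docfreq_type = "prop")
dim(DTM_small)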
Fitting the model was the “easy part”: the rest of the analysis involves exploring and interpreting the fitted model. Below we rely on the posterior() function from topicmodels; the tidytext package offers an alternative, tidy()-based workflow (see the sketch that follows).
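A minimal sketch of that tidy workflow, assuming the tidytext and dplyr packages are installed (the object names topic_terms and doc_topics are illustrative); the rest of this tutorial sticks to posterior():

library(tidytext)
library(dplyr)

# term-topic probabilities (beta) in long format
topic_terms <- tidy(topicModel, matrix = "beta")
topic_terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 10)

# document-topic probabilities (gamma) in long format
doc_topics <- tidy(topicModel, matrix = "gamma")
head(doc_topics)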
The topic model inference results in two (approximate) posterior probability distributions: a distribution theta over the K topics within each document, and a distribution beta over the V terms of the vocabulary within each topic.
Let’s take a closer look at these results:
Format of the resulting object:
tmResult <- posterior(topicModel)
attributes(tmResult)
## $names
## [1] "terms" "topics"
Length of vocabulary:
ncol(DTM)
## [1] 5961
Word-topic probabilities (beta):
beta <- tmResult$terms
dim(beta)
## [1] 10 5961
Rows in beta sum to 1:
rowSums(beta)
## 1 2 3 4 5 6 7 8 9 10
## 1 1 1 1 1 1 1 1 1 1
Size of collection (2739 documents):
nrow(DTM)
## [1] 2739
Document-topic probabilities (theta):
theta <- tmResult$topics
dim(theta)
## [1] 2739 10
Rows in theta sum to 1:
rowSums(theta)[1:10]
## 1324674682 1326716896 1326716374 1416388974 1326717530 1319877652 1419906755
## 1 1 1 1 1 1 1
## 1319880525 1371328348 1326718133
## 1 1 1
Let’s take a look at the 10 most likely terms within the term probabilities beta of the inferred topics:
terms(topicModel, 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "work" "japan" "student" "peopl" "report" "road"
## [2,] "countri" "nation" "school" "can" "ministri" "appli"
## [3,] "can" "japanes" "work" "life" "appoint" "ii"
## [4,] "onli" "american" "univers" "may" "order" "hous"
## [5,] "must" "govern" "colleg" "world" "offic" "room"
## [6,] "time" "america" "church" "onli" "state" "box"
## [7,] "use" "countri" "studi" "us" "minist" "china_press"
## [8,] "much" "ot" "christian" "must" "offici" "let"
## [9,] "ani" "polici" "institut" "men" "court" "ol"
## [10,] "system" "war" "also" "man" "receiv" "bo"
## Topic 7 Topic 8 Topic 9 Topic 10
## [1,] "bank" "canton" "day" "mr"
## [2,] "shanghai" "nank" "time" "shanghai"
## [3,] "busi" "report" "s" "mrs"
## [4,] "industri" "armi" "citi" "dr"
## [5,] "railway" "troop" "men" "miss"
## [6,] "compani" "provinc" "made" "member"
## [7,] "year" "japanes" "two" "held"
## [8,] "new" "militari" "place" "return"
## [9,] "now" "parti" "mani" "presid"
## [10,] "trade" "general" "veri" "club"
The resulting topics are very similar to those we identified in our previous experiments based on structural topic modeling and alternative LDA implementations.
Based on the top terms above, this model found topics related to business and industry (topic 7), Japan and international relations (topic 2), education (topic 3), government service and officialdom (topic 5), advertisements (topic 6), Canton government and military operations (topic 8), and club life in Shanghai (topic 10). The distribution of words within each topic, however, differs from previous models. Moreover, the model produced rather unspecific “non-topics” (topics 4 and 9), partly made up of generic terms and non-words that we should have removed earlier in the preprocessing step.
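If we wanted to refit the model without these noise terms, one option (a sketch, not applied here, reusing the non-words surfacing in the topics above) is to extend the custom stopword list, rebuild the trimmed DTM, and run LDA() again. The names more_noise and DTM_clean are illustrative:

# extend the list of uninformative terms with the OCR non-words identified above
more_noise <- c("ii", "ol", "bo", "ot", "lie", "s")
DTM_clean <- DTM[, !(colnames(DTM) %in% more_noise)]
DTM_clean <- DTM_clean[rowSums(DTM_clean) > 0, ]  # drop any documents emptied by the pruning
# topicModel_clean <- LDA(DTM_clean, K, method = "Gibbs", control = list(iter = 500, seed = 1))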
For the next steps, we want to give the topics more descriptive names than just numbers. Therefore, we simply concatenate the five most likely terms of each topic into a string that serves as a pseudo-name for each topic.
top5termsPerTopic <- terms(topicModel, 5)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse=" ")
topicNames
## Topic 1 Topic 2
## "work countri can onli must" "japan nation japanes american govern"
## Topic 3 Topic 4
## "student school work univers colleg" "peopl can life may world"
## Topic 5 Topic 6
## "report ministri appoint order offic" "road appli ii hous room"
## Topic 7 Topic 8
## "bank shanghai busi industri railway" "canton nank report armi troop"
## Topic 9 Topic 10
## "day time s citi men" "mr shanghai mrs dr miss"
Although wordclouds may not be optimal for scientific purposes, they can provide a quick visual overview of a set of terms. Let’s look at some topics as wordclouds. In the following code, you can set the variable topicToViz1 to any value between 1 and 10 to display other topics (e.g. topic 1):
require(wordcloud2)
topicToViz1 <- 1
You may also select a topic by a term contained in its pseudo-name (e.g. “bank”, which selects topic 7 here); be careful with terms that occur in several topic names, as grep() returns all matches and we keep only the first:
require(wordcloud2)
topicToViz1 <- grep('bank', topicNames)[1]
Then select the 40 most probable terms from the topic by sorting the term-topic-probability vector in decreasing order, extract the probabilities of each of the 40 terms, and visualize the terms as a wordcloud:
top40terms1 <- sort(tmResult$terms[topicToViz1, ], decreasing = TRUE)[1:40]
words1 <- names(top40terms1)
probabilities1 <- unname(top40terms1)
wordcloud2(data.frame(words1, probabilities1), shuffle = FALSE, size = 0.8)
Let us now look more closely at the distribution of topics within individual documents. To this end, we visualize the distribution in three sample documents.
Let us first take a look at the contents of three sample documents:
exampleIds <- c(2, 100, 200)
cat(rs_corpus[exampleIds[1]])
## THREE American returned students two British returned students one French returned student one German returned student one graduate of the Shensi University and two Japanese returned students were elected at the first section of the long-expected elections tor the Senate held on June 20' and 21 in Peking For the first time in the history of the Republic had so many foreign-educated Chinese voted Over four hundred registered and some three hundred cast their votes The majority of them are graduates of Western or Japanese universities They questioned the legality of the new Parliament to be convened but they parti in the election nevertheless on the ground that they should get into the new legislature which may be illegal their own representatives who may render some service to the public and at least can keep them informed of what is going on therein THOSE who were elected Senators are Tsur former president of the Tsing Hua College Ho Yen-sun secretary to Liang Shih-yi and Chen Huai-chang Confucianist who are returned students from America Lo Hung-nien of the Bank of China and Wang Shih-ching editor of the Peking Daily News returned students from England Wu Chun from Ger many Wu Ching-lien from France former Minister to Italy Ting Yung and Wei Sze-kan from Japan and Hsu Shih graduate of the Shensi University They were elected from the first section in which both the candidates for senatorship and voters must possess special literary or educational qualifications namely scholars returned students who have been back at least for three years prior to June 10, 1918, authors whose books have been recognized by the Ministry of Education The voting took place in the Hall of the House of Representa tives and the election was presided over bv Fu Tseng-hsiang Minister of Education
cat(rs_corpus[exampleIds[2]])
## THE PRESS BY-LAW July 14 We publish to-day letter from valued Chinese corresponent which gives us an opportunity of correcting misconceptions of the Press Licensing By-law of which we would earnestly ask our Chinese readers to clear their and their friends' minds Without closing identities if such man as Western Returned Student really believes that the Council seeks to kill growing Chines opinion what can be the state of mind of the less educated 1 One can handly blame them seeing how industriously they have heen misinformed to use no harsher term by foreigners who ought to have known better But the plain fact is and it cannot be too strongly emphasized that the last thing the Council desires is to check or in any way interfere with newspapers and fair criticism interpreting these terms in the most liberal sense What iit does wish to do is to stop publica tion of rumours that food and water are being poisoned or pasters depicting Chinese being sold as slaves We hardly think that Western Returned Student will disagree with such desire Of course all this discussion is mere waste of time Sir Havilland de Sausmarez has shown that the licence conditions aire unworkable and the Consular Body has plainly intimated that it does not mean to approve them If Western Returned Student and those like him will use their influence to spread knowledge of these facts among the Chinese they will be doing far more valuable work than in giving renewed currency to mischievous inaccuracies
cat(rs_corpus[exampleIds[3]])
## And They Found Some Money in The Treasury We present the following as proof that occasionally the unexpected does happen even in China Several months ago when Marshal Sun Chuan-fang was in control of Shanghai someone induced him to establish department of Public Health in the Chinese areas adjoining the Foreign Settlements at Shanghai The head of the new Health Service was Dr Hou-ki Hu an American returned student who in the brief space of few months and in the face of complications produced by the disturbed political situation was however able to get his department going in pretty good order Health stations and dispensaries were established and considerable progress was made in the dissemination of health propaganda particularly from the standpoint of vaccination against smallpox But as so often happens in China the war came on and Marshal Sun was defeated and forced to get out When Chiang Kai-shek came in he in accordance with political practice all over the world appointed his own officials includ ing new doctor in the Health Department The new man who is now functioning is Dr Liu Zee-tse returned student from Germany who has been local practitioner at Shanghai for some years But this is the interesting part of the story When Dr Liu arrived to take over from Dr Hu he found Dr Hu still in the office and the transfer of records and even the balance in the bank was made in thoroughly formal and official manner just as they do it in New York or London It is reported that this is the first case where the Southern cr Kuomintang Party has been able to find records in order and money actually in the bank to be transferred over to the new incumbent
After looking into the documents, we visualize the topic distributions within the documents using the packages reshape2 and ggplot2:
Load libraries for visualization:
library(reshape2)
library(ggplot2)
Get topic proportions from the example documents:
N <- length(exampleIds)
topicProportionExamples <- theta[exampleIds,]
colnames(topicProportionExamples) <- topicNames
vizDataFrame <- melt(cbind(data.frame(topicProportionExamples), document = factor(1:N)), variable.name = "topic", id.vars = "document")
Visualize the proportions of topics in each sample document:
ggplot(data = vizDataFrame, aes(topic, value, fill = document)) +
  geom_bar(stat = "identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  facet_wrap(~ document, ncol = N)
The barplots show how topics are distributed within each document according to the model. In the current model, all three documents show at least a small share of each topic, but one or two topics dominate each of them. For instance, topic 4 (actually a non-topic) dominates in document 2, whereas topic 5 (official appointments) dominates in document 1. None of the ten topics clearly prevails in document 3, but topics 2 and 8, related to Japan and Canton, are slightly more prevalent.
The topic distribution within a document can be controlled with the alpha parameter of the model. Higher alpha priors result in a more even distribution of topics within a document, while lower alpha priors lead the inference process to concentrate the probability mass on a few topics per document.
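To build intuition, here is a small illustration (a sketch, assuming the gtools package is installed) drawing document-topic distributions from a symmetric Dirichlet prior with a high and a low alpha:

library(gtools)

set.seed(1)
round(rdirichlet(1, rep(5,   10)), 2)  # high alpha: probability mass spread fairly evenly
round(rdirichlet(1, rep(0.5, 10)), 2)  # low alpha: mass concentrated on a few topics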
In the previous model calculation, the alpha prior was left at the implementation’s default value (50/K, i.e. 5 for K = 10 topics). This default does not necessarily correspond to the results one would like to have as an analyst: depending on our analysis interest, we might prefer a more peaky or a more even distribution of topics per document.
Let us change the alpha prior to a lower value and see how this affects the topic distributions in the model.
First take a look at alpha from the previous model (5):
attr(topicModel, "alpha")
## [1] 5
Let’s change alpha to 0.5:
topicModel2 <- LDA(DTM, K, method="Gibbs", control=list(iter = 500, verbose = 25, seed = 1, alpha = 0.5))
## K = 10; V = 5961; M = 2739
## Sampling 500 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Iteration 325 ...
## Iteration 350 ...
## Iteration 375 ...
## Iteration 400 ...
## Iteration 425 ...
## Iteration 450 ...
## Iteration 475 ...
## Iteration 500 ...
## Gibbs sampling completed!
tmResult <- posterior(topicModel2)
theta <- tmResult$topics
beta <- tmResult$terms
Now visualize the topic distributions in the three documents again to see whether there are differences in the distribution structure.
First get topic proportions from example documents:
topicNames2 <- apply(terms(topicModel2, 5), 2, paste, collapse = " ")
topicProportionExamples2 <- theta[exampleIds,]
colnames(topicProportionExamples2) <- topicNames2
vizDataFrame2 <- melt(cbind(data.frame(topicProportionExamples2), document = factor(1:N)), variable.name = "topic", id.vars = "document")
Visualize the new topic distributions:
ggplot(data = vizDataFrame2, aes(topic, value, fill = document)) +
  geom_bar(stat = "identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  facet_wrap(~ document, ncol = N)
What are the differences in the distribution structure? The prevalence of certain topics in each document is much stronger in the new model with a lower alpha (0.5) than in the previous model with a tenfold higher alpha (5). Topic 5 clearly dominates in document 1. Topic 2 prevails in document 2 (even ahead of the non-topic 4). Document 3 is a mixture of more evenly distributed topics, yet dominated by topics 4 and 8.
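We can also quantify this difference by comparing, for each document, the share of its single most prominent topic under both models (a quick illustrative check, not part of the original workflow; theta_alpha5 and theta_alpha05 are illustrative names):

# share of the single most prominent topic per document under each alpha setting
theta_alpha5  <- posterior(topicModel)$topics    # default alpha (5)
theta_alpha05 <- posterior(topicModel2)$topics   # alpha = 0.5
summary(apply(theta_alpha5,  1, max))
summary(apply(theta_alpha05, 1, max))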
What are the defining topics within the collection? There are different approaches to answer this question and to bring the topics into a meaningful order.
Before ranking the topics themselves, we first derive more meaningful topic labels by re-ranking the top terms of each topic with a specific score [2]. The idea behind this re-ranking is similar to that of TF-IDF: a term that appears among the top terms of many topics is not very useful for characterizing any single one of them. Hence, the scoring favors less general, more specific terms to describe a topic.
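As an aside, here is a minimal sketch of one such re-ranking score, computed directly from beta; it is similar in spirit to, but not necessarily identical with, what lda::top.topic.words() computes with by.score = TRUE (term_score and log_beta are illustrative names):

# TF-IDF-like re-ranking score: a term scores highly in topic k if it is
# probable in k but improbable, on average, across the other topics
log_beta   <- log(beta + 1e-5)
term_score <- beta * sweep(log_beta, 2, colMeans(log_beta), "-")
# five highest-scoring terms for, e.g., topic 1
colnames(beta)[order(term_score[1, ], decreasing = TRUE)[1:5]]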
Re-rank the top topic terms used for the topic names:
topicNames <- apply(lda::top.topic.words(beta, 5, by.score = TRUE), 2, paste, collapse = " ")
Approach 1: We sort topics according to their probability within the entire collection:
topicProportions <- colSums(theta) / nrow(DTM)  # mean topic probabilities over all documents
names(topicProportions) <- topicNames           # assign the topic names created above
sort(topicProportions, decreasing = TRUE)       # show mean proportions in decreasing order
## mr mrs shanghai miss dr countri peopl can nation must
## 0.17856684 0.16899219
## ministri appoint report court state citi girl women man littl
## 0.11044525 0.10821871
## student christian church school colleg japanes japan troop armi nank
## 0.09559074 0.08289550
## bank cotton industri railway trade canton organ member committe union
## 0.08155210 0.06523564
## ol bo ii ot lie appli china_press box room tel
## 0.06341015 0.04509288
We recognize that some topics are far more likely to occur in the corpus than others. These capture rather general thematic coherences, while other topics correspond to more specific contents.
Approach 2 (Rank-1): count how often each topic appears as the most probable (primary) topic of a document:
countsOfPrimaryTopics <- rep(0, K)
names(countsOfPrimaryTopics) <- topicNames
for (i in 1:nrow(DTM)) {
  topicsPerDoc <- theta[i, ]                                 # topic distribution for document i
  primaryTopic <- order(topicsPerDoc, decreasing = TRUE)[1]  # position of the most probable topic
  countsOfPrimaryTopics[primaryTopic] <- countsOfPrimaryTopics[primaryTopic] + 1
}
sort(countsOfPrimaryTopics, decreasing = TRUE)
## mr mrs shanghai miss dr countri peopl can nation must
## 665 581
## ministri appoint report court state citi girl women man littl
## 309 271
## student christian church school colleg japanes japan troop armi nank
## 266 183
## bank cotton industri railway trade canton organ member committe union
## 168 132
## appli china_press box room tel ol bo ii ot lie
## 102 62
The distribution has not changed much, except for the last two topics. Generally, the Rank-1 method places topics with rather specific thematic coherences in the upper ranks of the list.
This sorting of topics can be used for further analysis steps such as the semantic interpretation of topics found in the collection, the analysis of time series of the most important topics or the filtering of the original collection based on specific sub-topics.
The fact that a topic model conveys topic probabilities for each document makes it possible to use it for thematic filtering of a collection. As a filter, we select only those documents whose probability for a given topic exceeds a certain threshold (for example, every document in which topic X accounts for more than Y percent of the content).
In the following, we will select documents based on their topic content and display the resulting document quantity over time. In this case, we focus on the bank/business topic, but you can select another topic:
topicToFilter1 <- 1 # you can set this manually ...
# ... or have it selected by a term in the topic name
topicToFilter1 <- grep('bank ', topicNames)[1]
topicThreshold <- 0.1 # minimum share of content must be attributed to the selected topic
selectedDocumentIndexes <- (theta[, topicToFilter1] >= topicThreshold)
filteredCorpus <- rs_corpus %>% corpus_subset(subset = selectedDocumentIndexes)
# show length of filtered corpus
filteredCorpus
## Corpus consisting of 708 documents.
## 1419906755 :
## " Those Returned St..."
##
## 1319880525 :
## " fr Ht fc fr MILLA..."
##
## 1420645297 :
## " BASKET BALL IN PE..."
##
## 1371778568 :
## " Verein Wa Origina..."
##
## 1369950451 :
## " BOXER MONEY PLEA ..."
##
## 1324901182 :
## " Chinese Journal I..."
##
## [ reached max_ndoc ... 702 more documents ]
Our filtered corpus contains 708 documents in which the selected (bank) topic accounts for at least 10% of the content.
Filtering by topic is useful to focus on the documents that are most related to a specific topic you are interested in. You can increase the minimum probability in order to restrict the size of the filtered corpus and thereby refine your research.
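For instance, here is an illustrative way to see how the size of the filtered corpus shrinks as the threshold grows (the vector thresholds is illustrative):

# number of documents exceeding increasing probability thresholds for the selected topic
thresholds <- c(0.1, 0.2, 0.3, 0.5)
sapply(thresholds, function(x) sum(theta[, topicToFilter1] >= x))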
In a last step, we take a distant view of the topics in the data over time. For this, we aggregate mean topic proportions per decade over all articles related to returned students. These aggregated topic proportions can then be visualized, e.g. as a bar plot.
# append decade information for aggregation
rs_full_text$decade <- paste0(substr(rs_full_text$Date, 0, 3), "0")
# get mean topic proportions per decade
topic_proportion_per_decade <- aggregate(theta, by = list(decade = rs_full_text$decade), mean)
# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(K+1)] <- topicNames
# reshape data frame
vizDataFrame3 <- melt(topic_proportion_per_decade, id.vars = "decade")
# plot topic proportions per decade as bar plot
require(pals)
ggplot(vizDataFrame3, aes(x = decade, y = value, fill = variable)) +
  geom_bar(stat = "identity") + ylab("proportion") +
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "topic") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
The visualization shows that the topic related to law cases (black) clearly dominates the earlier decades of the corpus (1840-1870) and declines from 1900 onwards. Except for the 1880s, Japan and international relations (blue) remained an important topic throughout the entire period. Philosophical debates (people, life) (purple) gained currency in the 1880s, issues related to government appointments rose in the 1900s, and club life in Shanghai developed in the 1910s-1930s (light green). Topics related to Canton, communism and the war with Japan (orange), as well as education and missionary work (dark green), were the most important topics in the most recent articles (1940s-1950s).
Here are a few suggestions to take advantage of this topic model exercise:
Good luck!