Mapping the Transnational Public Sphere in modern China (1)

Structural Topic Modeling of the Shenbao

Cécile Armand

2022-09-21

Abstract

This is the companion documentation to a conference paper initially entitled “Local elites with global outreach: A bilingual topic modeling of the Shanghai press (1919-49)” given at the ESSCS conference “New Foundations in Chinese Digital History” held at Aix-Marseille University, 24-25 June 2022. This paper aims to map the formation of a transnational public sphere in republican China using a bilingual structural topic modeling approach. This document focuses on the Chinese-language corpus built from the newspaper Shenbao.

Research context

This paper investigates the formation of a transnational public sphere in republican China through the joint empirical study of two key institutions: a non-state transnational organization, the Rotary Club, and its representations in the Shanghai press, which has long been considered a key medium for shaping and disseminating information in modern China (Rankin, 1990; Huang, 1993; Wakeman, 1993; Wagner, 2007). Previous research on the Chinese public sphere presents two main limitations. On the one hand, scholars have focused on theoretical discussions regarding the transferability of Western concepts to China, instead of examining the concrete manifestations of the public sphere in the press and how social actors put it into practice. On the other hand, scholars who have used the press as a source have essentially relied on the close reading of subjectively selected articles, which makes it difficult to contextualize their findings and to assess whether, or to what extent, the selected texts or passages were representative of larger trends.

Taking advantage of the massive, multilingual corpora recently made available in full text by the ENP-China (Elites, Networks and Power in modern China) project, this paper introduces a mixed-method approach based on topic modeling to enable a change of scale in the analysis of the historical press and to overcome certain limitations of manual reading.

Topic modeling is a computational, statistical method aimed at automatically detecting hidden themes (topics) in large collections of unstructured texts, based on the co-occurrences of words in documents. In this paper, we rely on structural topic modeling (STM). STM is based on Latent Dirichlet Allocation (LDA), a probabilistic model that treats topics as mixtures of words and documents as mixtures of topics. This implies that words can belong to different topics, while topics can be represented in multiple documents in varying proportions. In addition, STM is able to incorporate document metadata such as the date of publication, which makes it possible to analyze topical changes over time. More specifically, we use the stm R package, which includes several built-in functions designed to facilitate the exploration of topics, including various visualizations and statistical outputs.
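The mixture assumption can be made concrete with a toy example. The sketch below uses invented numbers (two topics, four placeholder words, three documents). The objects `beta`, `theta` and `expected` are illustrative names chosen here, not outputs of the stm package.

```r
# Toy illustration of the mixture assumption behind LDA/STM (invented numbers).
# beta: one row per topic, giving a probability distribution over the vocabulary.
beta <- matrix(c(0.5, 0.3, 0.1, 0.1,
                 0.1, 0.1, 0.4, 0.4),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("topic1", "topic2"),
                               c("meeting", "hotel", "toys", "children")))

# theta: one row per document, giving its proportions over the topics.
theta <- matrix(c(0.9, 0.1,
                  0.5, 0.5,
                  0.2, 0.8),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("doc1", "doc2", "doc3"),
                                c("topic1", "topic2")))

# Each row of both matrices is a probability distribution (sums to 1).
rowSums(beta)
rowSums(theta)

# Expected word distribution of each document: a topic-weighted mix of beta.
expected <- theta %*% beta
round(expected, 2)
```

In this toy setting every word has a non-zero probability in both topics, and doc2 draws equally on the two topics, which is what "mixtures of words" and "mixtures of topics" mean in practice.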

The purpose of this research is twofold. Substantively, our key questions include: How did the Shanghai press report on the Rotary Club? How did the organization mediate between elite/society, business/politics, localism/internationalism? What does this reveal about how the press itself functioned as a public sphere? How did this emerging public sphere change over time and vary across languages? Methodologically, we aim to design a reliable method for conducting a bilingual, dynamic topic modeling of the historical press. More specifically, we address three major challenges: (1) to identify topics across multiple languages (in this paper, English and Chinese), (2) to trace topical changes over time, and (3) to adapt topic modeling to the heterogeneity of newspaper content, particularly to brevity-style articles made up of short pieces of unrelated news.

In this document, we focus on the Chinese-language newspaper Shenbao (Chinese)¹. For the English-language press (ProQuest Collection of Chinese Newspapers), see the counterpart document. Our workflow follows four main steps. First, we build the corpus from the ENP-China textbase using the HistText package. Second, we prepare the text data and build the topic models using the stm package. Next, we explore and label the topics using various visualizations and statistical measures. Finally, we analyze the effect of time on topic prevalence based on the date of publication.

Note: The purpose of this document is to describe our workflow and to make our methodological choices more explicit, testable and replicable. Historical questions and interpretations are kept to a minimum. For a comprehensive literature review and a detailed interpretation of the findings embedded in the final narrative, see the companion research paper to be published in the Journal of Digital History (JDH).

Corpus building

Load packages

library(histtext)
library(tidyverse)

We first search for “扶輪社” (fulunshe) in the Shenbao. Since we are investigating a very specific organization with few possible homonyms and a low degree of ambiguity, we can rely on simple keywords. We simply exclude the quasi-homonym “國學扶輪社” (guoxue fulunshe), which referred to a publishing enterprise established in the early 20th century with no connection to the Rotary Club. Additionally, we restrict the query to the period from 1919 onward, when the first Rotary Club in China was established in Shanghai:

rotaryzh_doc <- search_documents_ex('"扶輪社" NOT "國學扶輪社"', corpus="shunpao", dates="[1919 TO 1947]")
head(rotaryzh_doc)

When we retrieved the full text of the documents (not done here), we realized that the results contained many articles in which the Rotary Club was just mentioned in passing, amidst unrelated pieces of news. Using the entire document as a text unit would only reflect the messy structure of these texts. In order to alleviate the issue, we propose to apply topic modeling on finer segments of text instead of the entire document.

Example of problematic documents (the queried term is highlighted in red):

view_document("SPSP193602290401", "shunpao", query = '"扶輪社"')

## NULL

In the above example, the length of the targeted segment is 56 characters, whereas the total length of the “article” is 9169 characters.

Retrieve concordance

Instead of retrieving entire documents, therefore, we will retrieve finer strings of characters using the “concordance” function included in the histtext package. This function returns the queried terms in their context. The main challenge at this stage is to define the right context size. After a careful examination of a sample of articles, we decided to set the threshold at 100 characters to minimize the risk of overlap in cases when articles contain several occurrences of the queried terms:

rotaryzh_conc100 <- search_concordance_ex('"扶輪社" NOT "國學扶輪社"', 
                                          context = 100, corpus="shunpao", 
                                          dates="[1919 TO 1947]")

head(rotaryzh_conc100)
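The trade-off behind the context size can be illustrated on a toy string. The sketch below is a base-R approximation of a keyword-in-context extraction, not the histtext implementation; `kwic_window` is a hypothetical helper introduced only for this illustration, and the text is invented.

```r
# Toy keyword-in-context extraction in base R (NOT the histtext implementation).
# `kwic_window` is a hypothetical helper introduced for illustration only.
kwic_window <- function(text, keyword, context) {
  # Start positions of every literal occurrence of the keyword
  starts <- as.integer(gregexpr(keyword, text, fixed = TRUE)[[1]])
  if (starts[1] == -1) return(data.frame())
  ends <- starts + nchar(keyword) - 1
  data.frame(
    Before  = substring(text, pmax(starts - context, 1), starts - 1),
    Matched = substring(text, starts, ends),
    After   = substring(text, ends + 1, pmin(ends + context, nchar(text))),
    stringsAsFactors = FALSE
  )
}

text <- "aaaaa KEY bbbbb ccccc KEY ddddd"

# A narrow window keeps the two occurrences separate ...
narrow <- kwic_window(text, "KEY", context = 6)
# ... whereas a wide window makes the two contexts overlap.
wide <- kwic_window(text, "KEY", context = 20)

narrow
wide
```

With the wide window, the context of the first occurrence already contains the second occurrence, so the same passage would be counted twice once the occurrences are merged back into documents. A smaller window reduces this duplication at the cost of less context.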

The concordance table contains seven columns, including the unique identifier of the document (DocId), the date of publication, the title of the article (Title), the name of the periodical (Source), the queried terms (Matched), and the terms preceding (Before) and following (After) the key words.

We can first count the number of occurrences per article:

rotaryzh_conc100 <- rotaryzh_conc100 %>% group_by(DocId) %>% add_tally()

rotaryzh_conc100 %>% arrange(desc(n))

Next, we create a new variable for the merged text:

# First merge "before" and the "matched" term into a new "Text" variable
rotaryzh_conc100 <- rotaryzh_conc100 %>% mutate(Text = paste0(Before, Matched)) 
head(rotaryzh_conc100)

# Next, merge the text resulting from previous operation (Text) with "after"
rotaryzh_conc100 <- rotaryzh_conc100 %>% mutate(Text = paste0(Text, After)) 
head(rotaryzh_conc100)

Finally, we reunite the documents by merging all the occurrences they each contain:

library(data.table)

rotaryzh_100_united <- rotaryzh_conc100 %>%
  group_by(DocId, Date, Title, Source, grp = rleid(DocId)) %>% 
  summarise(Text = str_c(Text, collapse=' '), .groups = 'drop') %>%
  ungroup %>%
  select(-grp)

rotaryzh_100_united

We obtain 467 reshaped documents spanning from 1922 to 1947. Let’s examine the distribution of documents over years:

# create variable for years   
rotaryzh_100_united <- rotaryzh_100_united %>%
  mutate(year = stringr::str_sub(Date,0,4)) %>% 
  mutate(year = as.numeric(year)) 

rotaryzh_100_united %>% 
  group_by(year) %>% count() %>%
  ggplot(aes(x=year, y=n)) + 
  geom_col(alpha = 0.8) + 
  labs(title = "The Rotary Club in the Shenbao",
       subtitle = "Number of articles mentioning '扶輪社'",
       x = "Year", 
       y = "Number of articles")

Compute number of characters in each text

rotaryzh_100_united <- rotaryzh_100_united %>% mutate(nchar = nchar(Text))

Save and export the results as a CSV file:

write.csv(rotaryzh_100_united, "rotaryzh_100_united.csv")

Tokenize the text

The next step is tokenization, which consists of segmenting the Chinese text into meaningful units (tokens, which can be considered the equivalent of words). For this purpose, we rely on the jiebaR package, the R interface to the jieba segmenter. Although jieba was initially designed for tokenizing contemporary Chinese texts, it nonetheless gave satisfactory results on our corpus.

Load packages:

library(jiebaR)
library(jiebaRD)

Initialize jiebaR worker

cutter <- worker()

Test the worker on the first document

cutter[rotaryzh_100_united$Text[1]]

Define the segmenting function

seg_x <- function(x) {str_c(cutter[x], collapse = " ")}

Apply the function to each document (each row of the Text column)

x.out <- sapply(rotaryzh_100_united$Text, seg_x, USE.NAMES = FALSE)

Attach the segmented text back to the data frame

rotaryzh_100_united$text.seg <- x.out

Inspect the first two rows of the data frame

head(rotaryzh_100_united$text.seg, 2)

## [1] "家 二十八日 上午 九時 至 十時 半 參觀 各 工廠 十一時 至 十二時 開 演講會 列席 者 聖約翰 大學 學生 十二時 半開 交誼會 列席 者 扶輪社 社員 下午 四時 開 討論會 於 崑山 路 女青年會 列席 者 即 爲 該會 之 幹事 六時 三刻 至 八時 開 演講會 列席 者 南洋 大學 學生 二十"
## [2] "家 二十八日 上午 九時 至 十時 半 參觀 各 工廠 十一時 至 十二時 開 演講會 列席 者 聖約翰 大學 學生 十二時 半開 交誼會 列席 者 扶輪社 社員 下午 四時 開 討論會 於 崑山 路 女青年會 列席 者 即 爲 該會 之 幹事 六時 三刻 至 八時 開 演講會 列席 者 南洋 大學 學生 二十"

Count tokens and characters:

library(quanteda)

rotaryzh_100_united <- rotaryzh_100_united %>% 
  mutate(ntoken = ntoken(text.seg)) %>% 
  mutate(nchar = nchar(Text))

Add metadata

Finally, we incorporate the date of publication into the metadata in view of analyzing topical changes over time. We create different variables for enabling different degrees of temporal granularity.

Create variable for years

rotaryzh_100_united <- rotaryzh_100_united %>%
  mutate(year = stringr::str_sub(Date,0,4)) %>% 
  mutate(year = as.numeric(year))

Create variable for decades

rotaryzh_100_united$decade <- paste0(substr(rotaryzh_100_united$Date, 0, 3), "0")

Create time periods based on the historian’s prior knowledge of the Rotary Club and of the political context in pre-1949 China:

rotaryzh_100_united$period <- cut(rotaryzh_100_united$year, breaks = c(1919, 1929, 1937, 1948), 
                                labels = c("1919-1929", "1930-1937", "1938-1948"), 
                                include.lowest = TRUE)
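To check how boundary years are assigned, the same breaks can be applied to a handful of test years. This is a minimal base-R sketch; the `years` vector is invented for illustration. By default cut() builds right-closed intervals, so 1929 falls in the first period and 1937 in the second, while include.lowest = TRUE keeps 1919 in the first one.

```r
# Toy check of how boundary years are binned (years chosen for illustration).
years <- c(1919, 1929, 1930, 1937, 1938, 1947)

period <- cut(years,
              breaks = c(1919, 1929, 1937, 1948),
              labels = c("1919-1929", "1930-1937", "1938-1948"),
              include.lowest = TRUE)

# Decade derived from the first three digits of the year.
decade <- paste0(substr(as.character(years), 1, 3), "0")

data.frame(years, period, decade)
```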

Select relevant variables

rotaryzh_conc100_corpus <- rotaryzh_100_united %>% 
  select(DocId, Source, Title, Text, 
         text.seg, nchar, ntoken, 
         Date, year, decade, period)

Save and export the tokenized corpus

write.csv(rotaryzh_conc100_corpus, "rotaryzh_conc100_corpus.csv")

Pre-processing

Next, we prepare the text data to make it readable by topic model algorithms. We exclude a customized list of stop words, namely the queried term used to build the corpus (扶輪社) and terms that are too common in this context (上海, 中國). We also remove the words that contain fewer than 2 characters (wordLengths = c(2, Inf)) and the words that do not occur in more than 2 documents (lower.thresh = 2).

Load packages:

library(stm)
library(stminsights)

Pre-processing

# select metadata
meta <- rotaryzh_conc100_corpus %>% transmute(DocId, Title, Date, year, decade, period, ntoken, nchar)  

# create corpus
corpus <- stm::textProcessor(rotaryzh_conc100_corpus$text.seg,
                             metadata = meta, 
                             stem = FALSE, 
                             wordLengths = c(2, Inf), 
                             verbose = FALSE, 
                             customstopwords = c("上海", "扶輪社", "中國")) 
stm::plotRemoved(corpus$documents, lower.thresh = seq(0, 10, by = 5))

out <- stm::prepDocuments(corpus$documents, 
                          corpus$vocab, 
                          corpus$meta, 
                          lower.thresh = 2)

## Removing 4767 of 5688 terms (5505 of 12414 tokens) due to frequency 
## Removing 5 Documents with No Words 
## Your corpus now has 462 documents, 921 terms and 6909 tokens.

4767 of 5688 terms (5505 of 12414 tokens) were removed due to frequency. Five documents with no words were removed (these are documents whose full text was misplaced in the “title” field). The final corpus contains 462 documents, 921 terms and 6909 tokens.
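The logic of the frequency threshold can be reproduced on a toy corpus. The sketch below is a base-R illustration with invented tokens, not the code of prepDocuments. It assumes, following the stm documentation, that a word is kept only if it appears in more than lower.thresh documents; `lower_thresh`, `doc_freq`, `kept_vocab` and `kept_docs` are names chosen here.

```r
# Toy illustration of a document-frequency threshold (base R, invented tokens).
docs <- list(
  d1 = c("meeting", "hotel", "speech"),
  d2 = c("meeting", "hotel", "toys"),
  d3 = c("meeting", "toys", "children"),
  d4 = c("meeting", "hotel", "rotary"),
  d5 = c("banquet")
)
lower_thresh <- 2  # hypothetical name mirroring the lower.thresh argument

# Number of documents in which each word appears at least once.
doc_freq <- table(unlist(lapply(docs, unique)))

# Keep only the words that appear in MORE than lower_thresh documents.
kept_vocab <- names(doc_freq)[as.vector(doc_freq) > lower_thresh]
kept_docs  <- lapply(docs, function(d) d[d %in% kept_vocab])

doc_freq
kept_vocab

# Documents left without any word would be dropped at this stage.
names(kept_docs)[lengths(kept_docs) == 0]
```

Here "toys" appears in exactly two documents and is therefore removed, and d5 ends up empty because its only word is too rare. This mirrors how rare words and empty documents disappear from the corpus above.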

Before we go on building the models, a sound reflex is to inspect more closely which documents were removed:

out$docs.removed

## [1] 165 195 280 284 430

rotaryzh_conc100_corpus[c(165, 195, 280, 284, 430),  ]

Similarly, let’s examine the words that were removed:

wordsremoved <- as_tibble(out$words.removed) 
wordsremoved

We notice that many of the removed words are English words inserted in the Chinese text.

Model building

Choosing the right number of topics k remains a highly debated question, and there is no definitive solution. Most topic modeling tools provide a set of metrics, such as held-out likelihood, residual analysis, average exclusivity and semantic coherence, to help the researcher determine the optimal number of topics for a given corpus. According to the authors of the stm package manual, for small corpora ranging from a few hundred to a few thousand documents, a good starting range is between 5 and 50 topics². Ultimately, however, only the researcher’s interpretational needs can determine the most appropriate number of topics for a given research question.

In the stm R package, the searchK function provides a wide range of metrics to guide our choice, including held-out likelihood, residual analysis, average exclusivity and semantic coherence. Only the default properties (held-out likelihood, residuals, semantic coherence, lower bound) are displayed below:

set.seed(1111)
K<-seq(5,50, by=10) 
kresult <- searchK(out$documents, out$vocab, K, prevalence =~ year, data=out$meta, verbose=FALSE)
plot(kresult)

After several experiments, we decided to build three models with 5, 10 and 20 topics, which will enable us to navigate different levels of granularity:

# 5-topic model
mod.5 <- stm::stm(out$documents, 
                   out$vocab, K=5, 
                   prevalence =~ year, 
                   data=out$meta, verbose = FALSE)

# 10-topic model
mod.10 <- stm::stm(out$documents, 
                   out$vocab, K=10, 
                   prevalence =~ year, 
                   data=out$meta, verbose = FALSE)

# 20-topic model
mod.20 <- stm::stm(out$documents, 
                   out$vocab, K=20, 
                   prevalence =~ year, 
                   data=out$meta, verbose = FALSE)

Next, we estimate the effect of the year of publication on topic prevalence in each model, in order to analyze topical changes over time:

year5 <- stm::estimateEffect(1:5 ~ year, mod.5, metadata=out$meta)
year10 <- stm::estimateEffect(1:10 ~ year, mod.10, metadata=out$meta)
year20 <- stm::estimateEffect(1:20 ~ year, mod.20, metadata=out$meta)
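Conceptually, estimateEffect fits a regression in which documents are the units, the outcome is the proportion of a topic, and the covariates come from the metadata. The base-R sketch below illustrates only that core idea with invented values; unlike estimateEffect, it ignores the uncertainty attached to the estimated topic proportions. `toy` and `fit` are names chosen here.

```r
# Toy sketch: regress one topic's proportion on year (invented values).
# This ignores the uncertainty in theta that estimateEffect() accounts for.
toy <- data.frame(
  year  = c(1922, 1926, 1930, 1934, 1938, 1942, 1946),
  topic = c(0.10, 0.14, 0.22, 0.25, 0.33, 0.38, 0.41)
)

fit <- lm(topic ~ year, data = toy)

# A positive slope means the topic becomes more prevalent over time.
coef(fit)["year"]

# Fitted prevalence at the start and at the end of the period.
fitted_range <- predict(fit, newdata = data.frame(year = c(1922, 1946)))
fitted_range
```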

Finally, we save the models as an “.RData” file to save time and computational power in the future:

save.image('rotaryzh.RData')

Model evaluation

To compare the three models, we can plot the semantic coherence of topics against their exclusivity. As the plot suggests, the higher the number of topics, the lower their semantic coherence and the higher their exclusivity:

mod5df<-as.data.frame(cbind(c(1:5),exclusivity(mod.5), semanticCoherence(model=mod.5, out$documents), "SB5T"))
mod10df<-as.data.frame(cbind(c(1:10),exclusivity(mod.10), semanticCoherence(model=mod.10, out$documents), "SB10T"))
mod20df<-as.data.frame(cbind(c(1:20),exclusivity(mod.20), semanticCoherence(model=mod.20, out$documents), "SB20T"))

models<-rbind(mod5df, mod10df, mod20df)
colnames(models)<-c("Topic","Exclusivity", "SemanticCoherence", "Model")

models$Exclusivity<-as.numeric(as.character(models$Exclusivity))
models$SemanticCoherence<-as.numeric(as.character(models$SemanticCoherence))

options(repr.plot.width=7, repr.plot.height=6, repr.plot.res=100)

plotmodels <-ggplot(models, aes(SemanticCoherence, Exclusivity, color = Model))+
  geom_point(size = 2, alpha = 0.7) + 
  geom_text(aes(label=Topic), nudge_y=.04)+
  labs(x = "Semantic coherence",
       y = "Exclusivity",
       title = "Comparing exclusivity and semantic coherence", 
       subtitle = "Chinese-language corpus (Shenbao)")


plotmodels
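Semantic coherence, one of the two axes of this plot, rewards topics whose most probable words tend to occur together in the same documents. The sketch below follows the general form of the measure proposed by Mimno et al. (2011) on an invented binary document-term matrix. It is a simplified illustration, not the code of stm::semanticCoherence, and `coherence` is a hypothetical helper.

```r
# Toy semantic coherence in the spirit of Mimno et al. (2011); invented data.
# dtm: binary document-term matrix (rows = documents, columns = words).
dtm <- matrix(c(1, 1, 1, 0,
                1, 1, 0, 0,
                1, 0, 1, 0,
                0, 0, 0, 1,
                1, 1, 0, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(NULL, c("meeting", "hotel", "speech", "toys")))

# `coherence` is a hypothetical helper; top_words must be ordered
# from the most to the least probable word of the topic.
coherence <- function(dtm, top_words) {
  total <- 0
  for (m in 2:length(top_words)) {
    for (l in 1:(m - 1)) {
      co_docs <- sum(dtm[, top_words[m]] & dtm[, top_words[l]])  # D(v_m, v_l)
      docs_l  <- sum(dtm[, top_words[l]])                        # D(v_l)
      total   <- total + log((co_docs + 1) / docs_l)
    }
  }
  total
}

coherent   <- coherence(dtm, c("meeting", "hotel", "speech"))
incoherent <- coherence(dtm, c("meeting", "toys", "speech"))
c(coherent = coherent, incoherent = incoherent)
```

The first word list scores higher (closer to zero) because its words co-occur in the same toy documents, whereas "toys" rarely appears alongside "meeting" or "speech".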

Model exploration

In the first step, we highly recommend using the package stminsights to explore the models:

library(stminsights)
run_stminsights()

Stminsights is an R Shiny application which provides a set of visualizations and statistical tools for exploring the topics in one or across multiple models. While building on the stm package itself, it greatly facilitates the preliminary exploration. In the next sections, we provide the full code for reproducing and adjusting the outputs produced through the stminsights application.

Topic proportions θ

The “stm” package stores the document-topic proportions and the topic-word distributions in two matrices, θ (which is also referred to, somewhat confusingly, as γ) and β. We can take a closer look at θ, which can be called directly from the model. Alternatively, it is possible and perhaps more convenient to use the built-in function “make.dt()”. The latter makes it possible to incorporate the metadata, which in our case is helpful since we aim to examine the influence of the date of publication on topic prevalence. The table below displays the proportions of topics for each document, along with their metadata.

Extract topic proportions for each model:

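# use out$meta (462 rows) rather than meta (467 rows) so that the metadata match the documents kept in the models
topicprop5<-make.dt(mod.5, out$meta)
topicprop10<-make.dt(mod.10, out$meta)
topicprop20<-make.dt(mod.20, out$meta)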

Consulting the table might be a bit cumbersome unless we want to examine the topic proportions of a specific document. The “plot.STM” function associated with the “hist” argument helps to better visualize the estimates of document-topic proportions:

plot.STM(mod.5, "hist")

plot.STM(mod.10, "hist")

plot.STM(mod.20, "hist")
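Because the topic proportions sit alongside the metadata in such a table, they can also be summarised by group. The sketch below works on an invented data frame loosely shaped like the output of make.dt() (one column per topic plus a metadata column). The column names and values are assumptions made for illustration.

```r
# Toy table loosely shaped like the output of make.dt(): proportions + metadata.
toy_dt <- data.frame(
  docnum = 1:6,
  Topic1 = c(0.7, 0.6, 0.2, 0.3, 0.1, 0.2),
  Topic2 = c(0.3, 0.4, 0.8, 0.7, 0.9, 0.8),
  period = c("1919-1929", "1919-1929", "1930-1937",
             "1930-1937", "1938-1948", "1938-1948")
)

# Mean proportion of each topic within each period.
by_period <- aggregate(cbind(Topic1, Topic2) ~ period, data = toy_dt, FUN = mean)
by_period

# Dominant topic of each document (the column with the highest proportion).
topic_cols <- c("Topic1", "Topic2")
toy_dt$dominant <- topic_cols[max.col(as.matrix(toy_dt[, topic_cols]),
                                      ties.method = "first")]
table(toy_dt$period, toy_dt$dominant)
```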

Next, we can examine more closely the words that define the topics in order to better understand what each topic is really about.

Word per topic β

In the stm package, the function “plot.STM” with argument “summary” displays the general distribution of topics (which topics are overall more common in the corpus) along with the most common words for each topic. In the example below, we set the number of desired words to 5:

plot.STM(mod.5,"summary", n=5)

plot.STM(mod.10, "summary", n=5)

plot.STM(mod.20, "summary", n=5)

Alternatively, we can plot the word probabilities per topic as bar plots using a tidy approach. In the example below, we focus on the 10-topic model:

# load packages

library(tidyverse)
library(tidytext)

td_beta10_zh <- tidytext::tidy(mod.10) 

options(repr.plot.width=7, repr.plot.height=8, repr.plot.res=100) 

td_beta10_zh %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  labs(x = NULL, y = expression(beta),
       title = "Highest word probabilities for each topic",
       subtitle = "Different words are associated with different topics",
       caption = "Based on Shenbao (扶輪社 Corpus)")

Let’s explore the top 10 words for each topic:

topic10zh_top_words <- td_beta10_zh %>%
  group_by(topic) %>%
  top_n(10, beta) 

topic10zh_top_words %>% arrange(topic, desc(beta))

Topic labeling

The function “labelTopics” (or sageLabels) provides a deeper insight on the popular words in each topic. In addition to words probabilities, other metrics can be computed, including the FREX words (FREX weights words by frequency and exclusivity to the topic), lift words (frequency divided by frequency in other topics), and score (similar to lift, but with log frequencies). In the example below, we set the number of words to 10:

labelTopics(mod.5, n=10)

## Topic 1 Top Words:
##       Highest Prob: 玩具, 兒童, 醫院, 修理, 苦兒, 團體, 耶誕, 貧苦, 放映, 徵求 
##       FREX: 玩具, 兒童, 醫院, 修理, 苦兒, 耶誕, 貧苦, 放映, 市長, 大戲院 
##       Lift: 不知, 中山, 之後, 事務, 二十四日, 免費, 入座, 公用局, 募集, 北上 
##       Score: 玩具, 兒童, 苦兒, 耶誕, 貧苦, 大戲院, 放映, 破舊, 徵集, 醫院 
## Topic 2 Top Words:
##       Highest Prob: 組織, 國際, 昨日, 討論, 博士, 年會, 全國, 學生, 該會, 會長 
##       FREX: 學生, 區域, 紀念會, 衛生, 擴大, 一年, 百餘人, 一百, 八十一, 發生 
##       Lift: 二十餘, 共有, 創立, 廳長, 擴大, 案件, 永久, 盛况, 致力, 評判員 
##       Score: 擴大, 紀念會, 一年, 一百, 會長, 區域, 組織, 全國, 擔任, 討論 
## Topic 3 Top Words:
##       Highest Prob: 國際, 社員, 演說, 大會, 舉行, 昨日, 美國, 代表, 主席, 世界 
##       FREX: 此次, 目的, 日本, 及其, 人士, 親善, 努力, 和平, 吾人, 昨在 
##       Lift: 一人, 三年, 主持, 二十六日, 交涉, 人士, 人類, 全球, 共同, 最大 
##       Score: 國際, 親善, 目的, 此次, 代表, 席間, 演說, 童子軍, 發表, 世界 
## Topic 4 Top Words:
##       Highest Prob: 常會, 下午, 本週, 演講, 舉行, 總會, 定於, 飯店, 十二時, 星期四 
##       FREX: 常會, 本週, 十二時, 聯華, 三十分, 五月, 八日, 十八日, 今午, 廿八日 
##       Lift: 三十分, 九畝, 事項, 二十七日, 五時, 五時半, 休息, 先生, 入社, 八十週年 
##       Score: 常會, 本週, 下午, 聯華, 十二時, 廿八日, 事項, 總會, 今午, 八日 
## Topic 5 Top Words:
##       Highest Prob: 扶輪, 今日, 社長, 委員會, 乞丐, 會議, 出席, 天津, 問題, 公司 
##       FREX: 扶輪, 委員會, 乞丐, 問題, 錦標, 救世軍, 香港, 收容所, 太平洋, 諸君 
##       Lift: 一部, 世運, 事畢, 二十一日, 人員, 兩國, 公使, 出口, 協助, 原定 
##       Score: 乞丐, 扶輪, 救世軍, 諸君, 收容所, 光明, 香港, 出席, 錦標, 天津

labelTopics(mod.10, n=10)

## Topic 1 Top Words:
##       Highest Prob: 玩具, 兒童, 苦兒, 修理, 耶誕, 貧苦, 放映, 大戲院, 醫院, 影片 
##       FREX: 玩具, 苦兒, 耶誕, 貧苦, 放映, 大戲院, 破舊, 徵集, 一件, 照例 
##       Lift: 貧苦, 不知, 多少, 小朋友, 微集, 徵集, 慈幼, 我們, 敬向, 有着 
##       Score: 玩具, 耶誕, 苦兒, 大戲院, 徵集, 破舊, 照例, 兒童, 公映, 時節 
## Topic 2 Top Words:
##       Highest Prob: 國際, 大會, 代表, 出席, 天津, 會員, 會議, 香港, 精神, 服務 
##       FREX: 香港, 各地, 此間, 出席, 天津, 法國, 總統, 國際, 服務, 會議 
##       Lift: 創立, 歡迎會, 獎學金, 第六, 籌備, 號輪, 訪問, 閉幕, 各地, 外交部長 
##       Score: 國際, 天津, 各地, 香港, 代表, 大會, 親善, 出席, 外交部長, 閉幕 
## Topic 3 Top Words:
##       Highest Prob: 社員, 組織, 世界, 各國, 主席, 該社, 昨日, 演說, 該會, 集會 
##       FREX: 衛生, 社員, 世界, 各國, 國籍, 組織, 集會, 主席, 最大, 國民 
##       Lift: 交涉, 最大, 國民, 沿海, 名譽, 國籍, 歐洲, 衛生, 四年, 一面 
##       Score: 世界, 組織, 社員, 國籍, 親善, 目的, 衛生, 集會, 沿海, 大城市 
## Topic 4 Top Words:
##       Highest Prob: 下午, 討論, 都城, 時間, 區域, 十二時, 飯店, 擴大, 事項, 先生 
##       FREX: 擴大, 事項, 先生, 一年, 一百, 八仙, 區域, 愛物, 擔任, 江西 
##       Lift: 五時半, 擴大, 每日, 研究班, 誦經, 一年, 一百, 並將, 事項, 先生 
##       Score: 事項, 擴大, 宴會, 八仙, 一百, 星期五, 並將, 一年, 區域, 大城市 
## Topic 5 Top Words:
##       Highest Prob: 演講, 下午, 公司, 七時, 八日, 廿八日, 發表, 十日, 十九日, 麻瘋 
##       FREX: 七時, 八日, 廿八日, 發表, 麻瘋, 浦東, 畢業, 通過, 事件, 十九日 
##       Lift: 北上, 南門, 商團, 專科, 師範學校, 日軍, 東門, 發表, 碩士, 總幹事 
##       Score: 八日, 廿八日, 浦東, 禁止, 休業, 永年, 沙田, 福利, 七時, 練習 
## Topic 6 Top Words:
##       Highest Prob: 美國, 舉行, 聚餐會, 飯店, 聚餐, 演講, 乞丐, 十二時, 星期四, 問題 
##       FREX: 乞丐, 救世軍, 今午, 收容所, 聚餐會, 本屆, 自由, 星期, 社友, 情形 
##       Lift: 兩國, 完成, 收容所, 由於, 積極, 今午, 卡爾登, 多邀, 改於, 救世軍 
##       Score: 今午, 聚餐會, 乞丐, 救世軍, 收容所, 社友, 十二時, 聚餐, 三十分, 美國 
## Topic 7 Top Words:
##       Highest Prob: 下午, 總會, 常會, 二時, 聯華, 馬路, 聯會, 廿七日, 十五日, 十八日 
##       FREX: 聯會, 八十一, 六日, 外灘, 漢口路, 陳列, 馬路, 廿七日, 光明, 總會 
##       Lift: 九畝, 八十週年, 午三, 天后宫, 山東, 工業, 民國, 没心肝, 納税, 育會 
##       Score: 八十一, 常會, 聯華, 光明, 外灘, 漢口路, 納税, 總會, 馬路, 下午 
## Topic 8 Top Words:
##       Highest Prob: 常會, 本週, 定於, 舉行, 假座, 飯店, 屆時, 中午, 本報, 星期四 
##       FREX: 本週, 定於, 屆時, 大華, 駐華, 假座, 本報, 常會, 中央, 將由 
##       Lift: 二十四日, 入社, 公使, 播音, 不及, 俄童, 大華, 將由, 探悉, 本週 
##       Score: 本週, 常會, 定於, 五月, 週會, 將由, 假座, 屆時, 公使, 俄童 
## Topic 9 Top Words:
##       Highest Prob: 比賽, 萬國, 錦標, 扶輪, 國際, 網球, 兒童, 委員會, 醫院, 團體 
##       FREX: 網球, 贈送, 錦標, 優勝者, 歷年, 紅十字會, 比賽, 網球賽, 萬國, 著名 
##       Lift: 募集, 基金, 截止, 拳擊賽, 施行, 案件, 殘廢, 永久, 童子, 虹橋路 
##       Score: 網球, 比賽, 錦標, 幼童, 萬國, 兒童, 扶輪, 網球賽, 贈送, 紅十字會 
## Topic 10 Top Words:
##       Highest Prob: 席間, 中外, 童子軍, 舉行, 下午, 學生, 代表, 昨日, 上午, 招待 
##       FREX: 席間, 童子軍, 德國, 注意, 列席, 學生, 中外, 參觀, 市長, 招待 
##       Lift: 公用局, 參加者, 夫婦, 新聞, 領袖, 二十, 創設, 吳鐵城, 報紙, 孔祥熙 
##       Score: 席間, 列席, 童子軍, 中外, 市長, 杭州, 學生, 紀錄, 第一, 來賓

labelTopics(mod.20, n=10)

## Topic 1 Top Words:
##       Highest Prob: 公司, 濟南, 婦女, 十二日, 美國, 明日, 本周, 耶誕節, 夫人, 霞飛路 
##       FREX: 婦女, 濟南, 本周, 耶誕節, 公司, 十二日, 霞飛路, 明日, 廿五日, 一行 
##       Lift: 本周, 耶誕節, 北上, 霞飛路, 婦女, 濟南, 十二日, 一行, 並請, 設備 
##       Score: 本周, 耶誕節, 濟南, 公司, 十二日, 婦女, 廿五日, 霞飛路, 明日, 此外 
## Topic 2 Top Words:
##       Highest Prob: 討論, 區域, 擴大, 昨日, 一年, 一百, 進行, 增加, 成立, 下午 
##       FREX: 擴大, 一百, 一年, 區域, 決議, 討論, 增加, 擔任, 休業, 進行 
##       Lift: 商團, 閉幕, 休業, 決議, 一百, 擴大, 第九十七, 一年, 擔任, 區域 
##       Score: 擴大, 一百, 區域, 一年, 增加, 擔任, 決議, 大城市, 目標, 並將 
## Topic 3 Top Words:
##       Highest Prob: 演講, 影片, 主席, 昨日, 該會, 馬來, 昨在, 世界, 飯店, 組織 
##       FREX: 演講, 馬來, 影片, 昨在, 經理, 麻瘋, 慶祝, 放映, 人民, 關於 
##       Lift: 沿海, 聽者, 經理, 馬來, 孔雀, 貨幣, 勝利, 過去, 上海市, 演講 
##       Score: 演講, 馬來, 影片, 經理, 沿海, 麻瘋, 人民, 社友, 孔雀, 自由 
## Topic 4 Top Words:
##       Highest Prob: 昨日, 飯店, 本埠, 代表, 社員, 舉行, 該社, 會員, 來賓, 主席 
##       FREX: 昨日, 宴會, 來賓, 起立, 邀請, 十餘, 登君, 中西, 本埠, 本報 
##       Lift: 宴會, 登君, 市政府, 當晚, 起立, 十餘, 紳商, 循例, 邀請, 創設 
##       Score: 宴會, 來賓, 昨日, 起立, 已有, 飯店, 登君, 列席, 紳商, 十餘 
## Topic 5 Top Words:
##       Highest Prob: 德國, 報告, 定於, 愛物, 美國, 展覽會, 加入, 十一月, 中華, 童子軍 
##       FREX: 德國, 愛物, 我們, 消息, 十一月, 加入, 展覽會, 禁止, 報告, 各校 
##       Lift: 優良, 評判員, 愛物, 德國, 禁止, 案件, 我們, 正在, 報紙, 各校 
##       Score: 德國, 禁止, 愛物, 消息, 加入, 展覽會, 我們, 報告, 報紙, 十一月 
## Topic 6 Top Words:
##       Highest Prob: 下午, 飯店, 十二時, 都城, 聚餐, 今日, 今午, 時間, 中午, 事項 
##       FREX: 今午, 事項, 八仙, 江西, 十二時, 先生, 時間, 都城, 今日, 地點 
##       Lift: 事項, 五時半, 研究班, 誦經, 八仙, 樂隊, 江西, 演奏, 今午, 先生 
##       Score: 今午, 事項, 八仙, 十二時, 都城, 球隊, 聚餐, 時間, 星期五, 江西 
## Topic 7 Top Words:
##       Highest Prob: 國際, 代表, 出席, 社長, 會議, 開會, 香港, 大會, 協會, 全國 
##       FREX: 會議, 八十一, 香港, 出席, 社長, 太平洋, 創立, 籌備, 開會, 代表 
##       Lift: 創立, 籌備, 八十一, 總社, 世運, 號輪, 東京, 太平洋, 昨晚, 香港 
##       Score: 八十一, 國際, 香港, 代表, 創立, 出席, 社長, 各地, 籌備, 會議 
## Topic 8 Top Words:
##       Highest Prob: 舉行, 飯店, 盛大, 假座, 大華, 本埠, 跳舞會, 紀念, 學校, 本月 
##       FREX: 大華, 跳舞會, 建築, 盛大, 公使, 學校, 舉行, 俄童, 二十四日, 紀念 
##       Lift: 入社, 公用局, 俄童, 公使, 建築, 週會, 大華, 二十四日, 跳舞會, 商務 
##       Score: 大華, 舉行, 俄童, 週會, 跳舞會, 建築, 盛大, 公使, 飯店, 紀念 
## Topic 9 Top Words:
##       Highest Prob: 玩具, 兒童, 苦兒, 耶誕, 醫院, 修理, 貧苦, 大戲院, 破舊, 電影 
##       FREX: 玩具, 苦兒, 耶誕, 大戲院, 破舊, 徵集, 照例, 兒童, 公映, 貧苦 
##       Lift: 不知, 不論, 分發, 分送, 募集, 同樂會, 多少, 小朋友, 徵集, 慈幼 
##       Score: 玩具, 耶誕, 苦兒, 大戲院, 破舊, 徵集, 兒童, 照例, 公映, 貧苦 
## Topic 10 Top Words:
##       Highest Prob: 組織, 席間, 世界, 演說, 社員, 各國, 國籍, 目的, 人士, 努力 
##       FREX: 席間, 國籍, 世界, 努力, 組織, 人士, 無不, 從事, 名人, 演說 
##       Lift: 席間, 名人, 國籍, 顯著, 凡屬, 無不, 國民, 最大, 交涉, 進展 
##       Score: 席間, 組織, 世界, 親善, 目的, 國籍, 大城市, 最大, 進行, 國民 
## Topic 11 Top Words:
##       Highest Prob: 乞丐, 救世軍, 委員會, 收容所, 杭州, 計劃, 救濟, 衛生, 工部局, 中外 
##       FREX: 乞丐, 救世軍, 收容所, 衛生, 救濟, 杭州, 委員會, 有所, 局長, 收容 
##       Lift: 施行, 此事, 育會, 收容所, 救世軍, 一點, 乞丐, 名譽, 局長, 收容 
##       Score: 乞丐, 救世軍, 收容所, 計劃, 衛生, 杭州, 育會, 名譽, 有所, 局長 
## Topic 12 Top Words:
##       Highest Prob: 學生, 比賽, 萬國, 網球, 錦標, 扶輪, 第一, 大學, 列席, 慈善 
##       FREX: 網球, 第一, 學生, 委託, 錦標, 比賽, 體育, 慈善, 萬國, 業餘 
##       Lift: 拳擊賽, 永久, 錦標賽, 十五, 委託, 業餘, 第一, 維斯, 網球, 銀盃 
##       Score: 第一, 網球, 錦標, 比賽, 委託, 列席, 學生, 萬國, 網球賽, 扶輪 
## Topic 13 Top Words:
##       Highest Prob: 常會, 本週, 定於, 中午, 五月, 舉行, 假座, 屆時, 星期四, 飯店 
##       FREX: 常會, 本週, 定於, 五月, 將由, 屆時, 九月, 探悉, 中午, 十月 
##       Lift: 將由, 常會, 探悉, 本週, 九月, 大美, 五月, 該項, 定於, 照舊 
##       Score: 常會, 本週, 定於, 五月, 將由, 中午, 屆時, 九月, 假座, 星期四 
## Topic 14 Top Words:
##       Highest Prob: 國際, 社員, 美國, 及其, 法國, 主席, 世界, 此次, 此間, 發表演說 
##       FREX: 法國, 此間, 發表演說, 及其, 開幕, 發表, 大總統, 全球, 互相, 原因 
##       Lift: 加拿大, 開幕, 此間, 發表演說, 大總統, 全球, 法國, 原因, 發表, 事實 
##       Score: 開幕, 法國, 大總統, 及其, 國際, 發表演說, 下列, 此間, 親善, 全球 
## Topic 15 Top Words:
##       Highest Prob: 年會, 招待, 舉行, 團體, 國際, 市長, 會員, 俱樂部, 本市, 事務 
##       FREX: 年會, 招待, 市長, 事務, 該校, 俱樂部, 鐵路, 本市, 團體, 第一屆 
##       Lift: 市中心區, 年會, 該校, 鐵路, 共謀, 土地, 供給, 招待, 休息, 事務 
##       Score: 年會, 市長, 招待, 俱樂部, 事務, 第一屆, 該校, 團體, 紅十字會, 漢口 
## Topic 16 Top Words:
##       Highest Prob: 下午, 十八日, 十五日, 博士, 總會, 十日, 十一日, 紀念會, 十七日, 陳列 
##       FREX: 十八日, 十七日, 十一日, 山東, 紀念會, 十五日, 南京路, 外白渡橋, 天安, 市政廳 
##       Lift: 八十週年, 外白渡橋, 天安, 山東, 市政廳, 聯歡會, 西門, 解釋, 十八日, 同鄉會 
##       Score: 十八日, 紀念會, 十五日, 解釋, 十七日, 外白渡橋, 天安, 市政廳, 聯歡會, 十一日 
## Topic 17 Top Words:
##       Highest Prob: 大會, 童子軍, 方面, 六日, 中外, 代表, 天津, 服務, 歡迎, 比賽 
##       FREX: 六日, 方面, 童子軍, 市民, 日來, 專電, 外交, 二十二日, 西人, 廿一日 
##       Lift: 九畝, 事畢, 二十餘, 評議員, 六日, 日來, 堪稱, 專電, 外交, 維持會 
##       Score: 六日, 日來, 童子軍, 紀念會, 專電, 天津, 市民, 九畝, 評議員, 方面 
## Topic 18 Top Words:
##       Highest Prob: 下午, 總會, 馬路, 聯華, 廿八日, 協會, 四川, 廿七日, 二時, 租界 
##       FREX: 廿八日, 馬路, 廿七日, 五時, 約翰, 聯華, 四川, 租界, 納税, 總會 
##       Lift: 納税, 午三, 南門, 專科, 師範學校, 茶會, 茶話會, 附屬中學, 中學, 召開 
##       Score: 廿八日, 下午, 聯華, 廿七日, 約翰, 馬路, 納税, 四川, 五時, 總會 
## Topic 19 Top Words:
##       Highest Prob: 公司, 八日, 天津, 洋行, 諸君, 五日, 對於, 光明, 商會, 目的 
##       FREX: 八日, 洋行, 光明, 諸君, 永年, 沙田, 福利, 日前, 足以, 略謂 
##       Lift: 東門, 没心肝, 二三, 二二, 光明, 八日, 新光, 永年, 沙田, 油漆 
##       Score: 洋行, 八日, 光明, 天津, 諸君, 永年, 沙田, 福利, 二三, 二二 
## Topic 20 Top Words:
##       Highest Prob: 聚餐會, 舉行, 聚餐, 十二時, 飯店, 三十分, 社員, 演說, 星期四, 於今 
##       FREX: 聚餐會, 三十分, 聚餐, 於今, 十二時, 社友, 第六, 宗旨, 檀香山, 改於 
##       Lift: 改於, 聚餐會, 多邀, 於今, 第六, 午間, 三十分, 卡爾, 本日, 半在 
##       Score: 聚餐會, 聚餐, 十二時, 於今, 社友, 檀香山, 三十分, 改於, 星期四, 飯店

For example, we can display the 10 FREX words for selected topics (1 and 3) in the 10-topic model:

plot.STM(mod.10, "labels", topics=c(1,3), labeltype="frex", n=10, width=60)
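The difference between ranking words by probability and ranking them by how specific they are to a topic can be shown on a toy topic-word matrix. The sketch below uses invented numbers and simplified definitions: a lift-style ratio (probability in the topic divided by the overall word probability, assuming equally prevalent topics) and an exclusivity share. It does not reproduce the exact FREX, lift or score computations of the stm package.

```r
# Toy topic-word probabilities (rows = topics, columns = words); invented numbers.
beta <- matrix(c(0.40, 0.30, 0.25, 0.05,
                 0.40, 0.05, 0.05, 0.50),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("topic1", "topic2"),
                               c("meeting", "toys", "children", "tennis")))

# Overall word probabilities, ASSUMING both topics are equally prevalent.
word_prob <- colMeans(beta)

# Lift-style ratio: probability in the topic relative to the overall probability.
lift <- sweep(beta, 2, word_prob, "/")

# Exclusivity: share of a word's probability mass that belongs to each topic.
exclusivity <- sweep(beta, 2, colSums(beta), "/")

round(lift, 2)
round(exclusivity, 2)

# Ranking by probability versus ranking by exclusivity for topic 1.
names(sort(beta["topic1", ], decreasing = TRUE))
names(sort(exclusivity["topic1", ], decreasing = TRUE))
```

In this toy case "meeting" is the most probable word of topic 1 but is shared equally with topic 2, so it drops behind "toys" and "children" once specificity is taken into account. This is why the "Highest Prob" and "FREX" lists above can differ.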

Word clouds

Word clouds provide a more intuitive way of visualizing word prevalence in topics. The example below displays the word clouds of the two “international” topics (2 and 3):

par(mfrow=c(1,2), mar=c(0,0,2,2))
cloud(mod.10, topic = 2, scale = c(4, 0.4))
cloud(mod.10, topic = 3, scale = c(4, 0.4))

Perspective

We can set the “type” argument to “perspectives” to compare topics two by two. This type of plot helps to distinguish between topics that share many similar words. For instance, we can compare two internationally minded topics (2 and 3) in the 10-topic model:

par(mfrow=c(1,1))
plot(mod.10, type="perspectives", topics=c(2, 3))

We can also compare two topics that both deal with meetings (1 and 8):

par(mfrow=c(1,1))
plot(mod.10, type="perspectives", topics=c(8, 1))

Quotations

Some topics may still be unclear and require that we look closely at a sample of representative documents in order to better understand how words translate into concrete sentences in the original articles. To retrieve the representative documents for a given topic, we can use the “findThoughts” function and then display the documents with the companion function “plotQuote”.

To apply this function, we first need to remove the five documents that were eliminated during the pre-processing:

docremoved <- rotaryzh_conc100_corpus[-c(165, 195, 280, 284, 430),  ]

Then, we can extract and plot the 3 most representative documents for topics 1 and 3:

thoughts1 <- findThoughts(mod.10,texts=docremoved$Text, topics=1, n=3)$docs[[1]]
thoughts3 <- findThoughts(mod.10,texts=docremoved$Text, topics=3, n=3)$docs[[1]]

par(mfrow=c(1,2), mar=c(0,0,2,2))
plotQuote(thoughts1, width=50, maxwidth=500, text.cex=0.5, main="Topic 1")
plotQuote(thoughts3, width=50, maxwidth=500, text.cex=0.5, main="Topic 3")

Topic correlation

Topic correlation serves to map connections between topics in a given model. The relations between topics are based on the correlation of their proportions across documents (θ): two topics are positively correlated when they tend to be discussed within the same documents. The “stm” package provides two options for estimating topic correlations. The “simple” method simply thresholds the covariances, whereas the “huge” method uses a semi-parametric procedure. Let’s compare the two approaches:

corrsimple <- topicCorr(mod.20, method = "simple", verbose = FALSE)
corrhuge <- topicCorr(mod.20, method = "huge", verbose = FALSE)
par(mfrow=c(1,2), mar=c(0,0,2,2))
plot(corrsimple, main = "Simple method")
plot(corrhuge, main = "Huge method")
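
As a rough illustration of the thresholding idea behind the “simple” method, the sketch below correlates the columns of a toy document-topic matrix and keeps only the links above a cutoff. The matrix is invented, and the code is a simplified approximation under the assumption that the method works on document-topic proportions; it is not the package's implementation:

```r
# Toy document-topic proportions: 5 documents (rows), 3 topics (columns)
theta <- rbind(c(0.45, 0.45, 0.10),
               c(0.40, 0.50, 0.10),
               c(0.10, 0.10, 0.80),
               c(0.05, 0.15, 0.80),
               c(0.30, 0.30, 0.40))

# Correlation between topics across documents
cor_mat <- cor(theta)

# Keep only positive correlations above the cutoff as links
cutoff <- 0.01
adj <- ifelse(cor_mat > cutoff, 1, 0)
diag(adj) <- 0  # a topic is not linked to itself
print(adj)
```

Here topics 1 and 2 are linked because they appear together in the same documents, while topic 3 remains isolated.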

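We can then use the packages stminsights and ggraph to visualize both topic correlations (link widths) and topic proportions (label sizes):

# extract network (get_network() is provided by the stminsights package)
library(stminsights)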
stm_corrs <- get_network(model = mod.20,
                         method = 'simple',
                         labels = paste('Topic', 1:20),
                         cutoff = 0.001,
                         cutiso = TRUE)

# plot network with ggraph 
library(ggraph)

ggraph(stm_corrs, layout = 'fr') +
  geom_edge_link(
    aes(edge_width = weight),
    label_colour = '#fc8d62',
    edge_colour = '#377eb8') +
  geom_node_point(size = 4, colour = 'black')  +
  geom_node_label(
    aes(label = name, size = props),
    colour = 'black',  repel = TRUE, alpha = 0.85) +
  scale_size(range = c(2, 10), labels = scales::percent) +
  labs(size = 'Topic Proportion',  edge_width = 'Topic Correlation', title = "Simple method") + 
  scale_edge_width(range = c(1, 3)) +
  theme_graph()

Interactive visualization

The “stm” package also includes a “toLDAvis” function, which relies on the LDAvis package to produce an interactive visualization of the topic model. The main graphical elements include:

- Default topic circles - K circles, one for each topic, whose areas are set to be proportional to the proportions of the topics across the N total tokens in the corpus.
- Red bars - represent the estimated number of times a given term was generated by a given topic.
- Blue bars - represent the overall frequency of each term in the corpus.
- Topic-term circles - K×W circles whose areas are set to be proportional to the frequencies with which a given term is estimated to have been generated by the topics.

stm::toLDAvis(mod.5, docs = out$documents)
stm::toLDAvis(mod.10, docs = out$documents)
stm::toLDAvis(mod.20, docs = out$documents)

Topics over time

In this section, we analyze the effect of time (date of publication) on topic prevalence in the three models.

First, select topic proportions:

topic5prop <- topicprop5 %>% select(c(2:6))
topic10prop <- topicprop10 %>% select(c(2:11))
topic20prop <- topicprop20 %>% select(c(2:21))

Compute topic proportions per year:

topic_proportion_per_year5 <- aggregate(topic5prop, by = list(Year = rotaryzh_conc100_corpus$year), mean)
topic_proportion_per_year10 <- aggregate(topic10prop, by = list(Year = rotaryzh_conc100_corpus$year), mean)
topic_proportion_per_year20 <- aggregate(topic20prop, by = list(Year = rotaryzh_conc100_corpus$year), mean)

Reshape data frame:

library(reshape)
vizDataFrame5y <- melt(topic_proportion_per_year5, id.vars = "Year")
vizDataFrame10y <- melt(topic_proportion_per_year10, id.vars = "Year")
vizDataFrame20y <- melt(topic_proportion_per_year20, id.vars = "Year")
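
The two steps above (averaging per year, then reshaping to a long format) can be checked on a small example. The sketch below uses invented proportions and base R only, building the long table by hand instead of calling melt; the column names mirror those expected by the plots, but the data are purely illustrative:

```r
# Toy topic proportions for 4 documents and their (hypothetical) years
props <- data.frame(Topic1 = c(0.8, 0.6, 0.3, 0.1),
                    Topic2 = c(0.2, 0.4, 0.7, 0.9))
year <- c(1930, 1930, 1931, 1931)

# The year vector must have one entry per document in the proportions table
stopifnot(nrow(props) == length(year))

# Mean proportion of each topic per year
per_year <- aggregate(props, by = list(Year = year), FUN = mean)

# Long format: one row per year and topic
long <- data.frame(
  Year     = rep(per_year$Year, times = 2),
  variable = rep(c("Topic1", "Topic2"), each = nrow(per_year)),
  value    = c(per_year$Topic1, per_year$Topic2)
)
print(long)
```

Because each document's proportions sum to one, the yearly means also sum to one, which is why the stacked bars all reach the same height.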

Plot topic proportions per year as bar plots:

library(pals)  # provides the alphabet() colour palette

# 5-topic model: 
ggplot(vizDataFrame5y, aes(x=Year, y=value, fill=variable)) + 
  geom_bar(stat = "identity") + ylab("proportion") + 
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "Topic") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title="The Rotary Club (扶輪社) in the Shenbao", 
       subtitle = "Topic proportion over time (5-topic model)")

# 10-topic model:
ggplot(vizDataFrame10y, aes(x=Year, y=value, fill=variable)) + 
  geom_bar(stat = "identity") + ylab("proportion") + 
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "Topic") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title="The Rotary Club (扶輪社) in the Shenbao", 
       subtitle = "Topic proportion over time (10-topic model)")

# 20-topic model:
ggplot(vizDataFrame20y, aes(x=Year, y=value, fill=variable)) + 
  geom_bar(stat = "identity") + ylab("proportion") + 
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "Topic") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs(title="The Rotary Club (扶輪社) in the Shenbao", 
       subtitle = "Topic proportion over time (20-topic model)")

Concluding remarks

Methodologically, the contribution of this paper is threefold:

First, it has offered a simple yet efficient solution, based on concordance, to the problem of article segmentation in digitized newspapers. This preliminary method needs to be refined and adjusted to different types of texts depending on their varying length and relevance to the research question.
Second, this paper constitutes a rare instance of cross-lingual comparison involving the Chinese language during its transitional stage between classical and modern Chinese. As a low-resource language, pre-modern Chinese presents significant challenges for the application of natural language processing tools and computer-assisted text analysis. As E. Kaske has demonstrated (Kaske, 2007), the Chinese language was highly unstable during the century under study, which was a pivotal phase in the creation of a standard vernacular (baihua) national language (guoyu). In his data-driven study of the Shenbao, P. Magistry has shown (Magistry, 2019) that the Chinese language actually evolved through six main stages between 1872 and 1949. Further research should pay greater attention to the decisions made during the pre-processing phase and design strict protocols for evaluating the impact of tokenization and language variation on the resulting topics. From a multilingual perspective, future research could also benefit from more sophisticated techniques for the automatic alignment of topics across languages. It could also investigate the differences between the various English periodicals included in the ProQuest collection, especially between the British North China Herald, the American China Weekly Review, and the Chinese-owned China Press.
Third, this research has demonstrated the value of combining several models with different numbers of topics (k), instead of focusing on a single, definitive model. This multi-model approach is particularly appropriate when dealing with corpora of different sizes and with different structures. This multi-scalar reading of corpora enables scholars to navigate between different levels of granularity and to select in each model the topics that are the most relevant to the research question.

The results of this topic modeling exercise can be used as a starting point for addressing more specific research questions. The inferred topics point to the existence of two main categories of articles that can be further investigated using appropriate methods. On the one hand, topics related to meetings, organization, and philanthropy are generally rich in names of individuals, organizations, and locations. Named entity recognition (NER) and network analysis can then be used to automatically extract the names of these actors and further analyze their connections. On the other hand, topics related to lectures and discussions (forums), which are richer in semantic content, lend themselves to a deeper examination of the discourses articulated by the various actors, using methods such as semantic and sentiment analysis. Finally, while the Rotary Club has served as a test case in this paper, our methodology can be extended to investigate other public sphere institutions and more abstract concepts related to the public sphere. Furthermore, it can be transposed to similar digitized texts in English, Chinese, and possibly other languages, beyond the specific corpora used in this research.

References

Armand, Cécile. “Foreign Clubs with Chinese Flavor: The Rotary Club of Shanghai and the Politics of Language.” In Knowledge, Power, and Networks: Elites in Transition in Modern China, edited by Cécile Armand, Christian Henriot, and Huei-min Sun, 233–59. Leiden: Brill, 2022.
Huang, Philip C.C. “‘Public Sphere’/‘Civil Society’ in China? The Third Realm between State and Society.” Modern China 19, no. 2 (1993): 216–40.
Kaske, Elisabeth. The Politics of Language in Chinese Education, 1895-1919. Vol. 82. Sinica Leidensia. Leiden: Brill, 2007.
Magistry, Pierre. “Language(s) of the Shun-Pao, a Computational Linguistics Account.” In 10th International Conference of Digital Archives and Digital Humanities. Taipei, Taiwan, 2019.
Rankin, Mary Backus. “The Origins of a Chinese Public Sphere. Local Elites and Community Affairs in the Late Imperial Period.” Études Chinoises 9, no. 2 (1990): 13–60.
Wagner, Rudolf G., ed. Joining the Global Public: Word, Image, and City in Early Chinese Newspapers, 1870-1910. Albany, NY: State University of New York Press, 2007.
Wakeman, Frederic. “The Civil Society and Public Sphere Debate: Western Reflections on Chinese Political Culture.” Modern China 19, no. 2 (1993): 108–38.

Acknowledgements

This research has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 788476) and a CCFK grant.

Shenbao was a leading newspaper published in Shanghai between 1872 and 1949. Despite low literacy rates among the Chinese population, its circulation reached 150,000 copies in the 1930s, making it one of the two most widely circulated newspapers in China. Although it catered primarily to Shanghai's intellectual, political, and business elites, its readership widened in the 1930s. Although it was printed in Shanghai, Shenbao had national and even international coverage, and it also circulated among overseas Chinese.
Margaret Roberts et al., “Stm: Estimation of the Structural Topic Model,” September 18, 2020, 64–65, https://CRAN.R-project.org/package=stm