Mapping the Transnational Public Sphere in modern China (2)

Structural Topic Modeling of the English-language press (ProQuest)

Cécile Armand

2022-12-11

Abstract

This is the companion documentation to a conference paper initially entitled “Local elites with global outreach: A bilingual topic modeling of the Shanghai press (1919-49)” given at the ESSCS conference “New Foundations in Chinese Digital History” held at Aix-Marseille University, 24-25 June 2022. This paper aims to map the formation of a transnational public sphere in republican China using a bilingual structural topic modeling approach. This document focuses on the English-language corpus built from the ProQuest Chinese newspaper collection.

Research context

This paper seeks to investigate the formation of a transnational public sphere in republican China, through the joint empirical study of two key institutions: a non-state transnational organization, the Rotary Club, and its representations in the Shanghai press, which has long been considered a key medium for shaping and disseminating information in modern China (Rankin, 1990; Huang, 1993; Wakeman, 1993; Wagner, 2007). Previous research on the Chinese public sphere presents two main limitations. On the one hand, scholars have focused on theoretical discussions regarding the transferability of Western concepts to China, instead of examining the concrete manifestations of the public sphere in the press and how it was put into practice by social actors. On the other hand, scholars who have used the press as a source have essentially relied on the close reading of subjectively selected articles, without providing the means to contextualize their findings or to assess whether, and to what extent, the selected texts or passages were representative of larger trends.

Taking advantage of the massive, multilingual corpora recently made available in full text by the ENP-China (Elites, Networks and Power in modern China) project, this paper introduces a mixed-method approach based on topic modeling to enable a change of scale in the analysis of the historical press and to overcome certain limitations of manual reading. Topic modeling is a computational, statistical method aimed at automatically detecting hidden themes (topics) in large collections of unstructured texts, based on the co-occurrences of words in documents. In this paper, we rely on structural topic modeling (STM). STM builds on Latent Dirichlet Allocation (LDA), a probabilistic model that treats topics as mixtures of words and documents as mixtures of topics. This implies that words can belong to different topics, while topics can be represented in multiple documents with varying proportions. In addition, STM is able to incorporate document metadata such as the date of publication, which makes it possible to analyze topical changes over time. More specifically, we use the stm R package, which includes several built-in functions designed to facilitate the exploration of topics, including various visualizations and statistical outputs.
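The mixture assumption can be made concrete with a toy example. The sketch below is purely illustrative: the two topics, the four-word vocabulary and all probabilities are made-up values, not estimates from our corpus.

```r
# Toy illustration of the mixture assumption (all numbers are hypothetical)

# beta: one row per topic, one column per word; each row sums to 1
beta <- rbind(
  topic1 = c(meeting = 0.5, hotel = 0.3, toys = 0.1, children = 0.1),
  topic2 = c(meeting = 0.1, hotel = 0.1, toys = 0.4, children = 0.4)
)

# theta: share of each topic in a single document; sums to 1
theta <- c(topic1 = 0.75, topic2 = 0.25)

# Expected word distribution of the document:
# a theta-weighted mixture of the two topic-word distributions
doc_word_prob <- drop(theta %*% beta)
doc_word_prob
```

In this toy document, "meeting" receives most of its probability from topic 1, while "toys" and "children" are mostly contributed by topic 2, even though the latter accounts for only a quarter of the document.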

The purpose of this research is twofold. Substantively, our key questions include: How did the Shanghai press report on the Rotary Club? How did the organization mediate between elite/society, business/politics, localism/internationalism? What does this reveal about how the press itself functioned as a public sphere? How did this emerging public sphere change over time and vary across languages? Methodologically, we aim to design a reliable method for applying bilingual, dynamic topic modeling to the historical press. More specifically, we address three major challenges: (1) to identify topics across multiple languages (in this paper, English and Chinese), (2) to trace topical changes over time, and (3) to adapt topic modeling to the heterogeneity of newspaper content, particularly to brevity-style articles made up of short pieces of unrelated news.

In this document, we focus on the English-language press, based on the ProQuest Chinese Newspapers Collection (CNC). For the Chinese-language press (Shenbao), see the counterpart document. Our workflow follows four main steps. First, we build the corpus from the ENP-China textbase using the HistText package. Second, we prepare the text data and build the topic models using the stm package. Next, we explore and label the topics using various visualizations and statistical measures. Finally, we analyze the effect of time on topic prevalence based on the date of publication.

Note: The purpose of this document is to describe our workflow and to make our methodological choices more explicit, testable and replicable. Historical questions and interpretations are kept to a minimum. For a comprehensive literature review and detailed interpretation of the findings embedded in the final narrative, see the companion research paper to be published in the Journal of Digital History (JDH).

Corpus building

Load packages

library(histtext)
library(tidyverse)

We first search for the “Rotary Club” in the ProQuest collection. Since we are investigating a very specific organization with few possible homonyms and a low degree of ambiguity, we can rely on a simple keyword query. We restrict the query to the period from 1919 onwards, when the first Rotary Club in China was established in Shanghai:

rotary_eng_docs <- search_documents_ex('"rotary club"', corpus= "proquest", 
                      filter_query = c("publisher", "category"),
                      dates="[1919 TO 1949]", 
                      verbose = FALSE)

head(rotary_eng_docs)

When we retrieved the full text of the documents (not done here), we realized that the results contained many articles in which the Rotary Club was just mentioned in passing, amidst unrelated pieces of news. Using the entire document as a text unit would only reflect the messy structure of these texts. In order to alleviate the issue, we propose to apply topic modeling on finer segments of text instead of the entire document.

Example of problematic documents:

view_document(1319907286, "proquest", query = '"rotary club"')

## NULL

In the above example, the length of the targeted segment is 38 words, whereas the total length of the “article” is 2091 words.

Retrieve concordance

Instead of retrieving entire documents, therefore, we retrieve shorter strings of characters using the concordance function included in the histtext package (“search_concordance_ex”). This function returns the queried terms in their context. The main challenge at this stage is to define the right context size. After a careful examination of a sample of articles, we set the context size to 400 characters in order to minimize the risk of overlap when an article contains several occurrences of the queried terms:

rotary_eng_conc400 <- histtext::search_concordance_ex('"rotary club"', corpus = "proquest", 
                      filter_query = c("publisher", "category"),
                      dates="[1919 TO 1949]", 
                      context_size = 400, verbose = FALSE) 

head(rotary_eng_conc400)

The concordance table contains seven columns: the unique identifier of the document (DocId), the date of publication (Date), the title of the article (Title), the name of the periodical (Source), the queried terms (Matched), and the terms preceding (Before) and following (After) the keywords.
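The logic of a concordance with a fixed context size can be sketched in base R. The helper below, `extract_context()`, is hypothetical and is not part of histtext; it only illustrates how the Before, Matched and After columns relate to the context size, using a made-up sentence and a context of 10 characters instead of 400.

```r
# Hypothetical helper (not a histtext function): keyword-in-context extraction
extract_context <- function(text, pattern, context_size) {
  hits <- gregexpr(pattern, text, ignore.case = TRUE)[[1]]
  if (hits[1] == -1) {
    # no match: return an empty table with the same columns
    return(data.frame(Before = character(0), Matched = character(0),
                      After = character(0), stringsAsFactors = FALSE))
  }
  starts <- as.integer(hits)
  ends <- starts + attr(hits, "match.length") - 1
  data.frame(
    # up to context_size characters before the match
    Before = substring(text, pmax(1, starts - context_size), starts - 1),
    Matched = substring(text, starts, ends),
    # up to context_size characters after the match
    After = substring(text, ends + 1, pmin(nchar(text), ends + context_size)),
    stringsAsFactors = FALSE
  )
}

# Made-up example text
text <- "Local news. The Rotary Club met at the Astor House. Shipping notes follow."
conc <- extract_context(text, "rotary club", context_size = 10)
conc
```

With a small context size, the unrelated "Shipping notes" are left out of the text unit, which is the effect we are seeking when working with brevity-style articles.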

Filter collections

In the next step, we remove the Hong Kong-based periodical South China Morning Post to focus on mainland periodicals:

rotary_eng_conc400_filtered <- rotary_eng_conc400 %>% 
  # drop the two publisher names under which the South China Morning Post appears
  filter(!Source %in% c("South China Morning Post Publishers Limited", "South China Morning Post Ltd.")) %>% 
  # extract the year from the first four characters of the date
  mutate(Year = as.numeric(stringr::str_sub(Date, 1, 4))) %>% 
  filter(Year > 1918) 

head(rotary_eng_conc400_filtered)

The filtered corpus contains 3,620 instances and 2,468 documents.
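The distinction between instances (rows of the concordance table) and documents (distinct DocId values) can be checked with two base R expressions. The sketch below uses a made-up table with hypothetical identifiers.

```r
# Toy concordance table (hypothetical values): one row per match
toy_conc <- data.frame(
  DocId = c(101, 101, 102, 103, 103, 103),
  Matched = "Rotary Club",
  stringsAsFactors = FALSE
)

n_instances <- nrow(toy_conc)                  # number of matches
n_documents <- length(unique(toy_conc$DocId))  # number of distinct articles

c(instances = n_instances, documents = n_documents)
```

Applied to rotary_eng_conc400_filtered, the same two expressions should return the number of instances and documents reported above.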

Next, we retrieve the genre of the article as metadata, which will be used later to filter out irrelevant contents. Since the category of the article cannot be retrieved directly with the “search_concordance_ex” function in the current version of histtext, we need to use the function “get_search_fields_content” instead.

rotary_eng_doc <- histtext::get_search_fields_content(rotary_eng_conc400_filtered, "proquest",
                                                     search_fields = c("title", "date", "publisher", "category", "fulltext"),
                                                     verbose = FALSE)

head(rotary_eng_doc)

In the next steps, we retain only unique documents, we discard the variables that are redundant with the concordance table (except for “DocId” which will be used for joining the two tables) and we clean the content of the “category” field:

rotary_eng_doc <- rotary_eng_doc %>% 
  unique() %>%
  select(DocId, category) %>% 
  # strip the square brackets and single quotes that wrap the category labels
  mutate(category = str_remove_all(category, "[\\[\\]']"))

head(rotary_eng_doc)
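To make the cleaning step concrete, here is a minimal base R sketch on a made-up value. The raw format shown (a bracketed, quoted list) is an assumption inferred from the characters removed above, not a documented property of the ProQuest metadata.

```r
# Hypothetical raw value of the "category" field
raw_category <- "['Classified Advertisement', 'Advertisement']"

# Remove square brackets and single quotes in one pass
clean_category <- gsub("[][']", "", raw_category)
clean_category
```

The cleaned string has the same form as the category labels used for filtering in the next step.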

Let’s inspect the various categories and their relative importance in our corpus:

rotary_eng_doc %>% select(DocId, category) %>% 
  group_by(category) %>% 
  count(sort = TRUE)

On this basis, we decide to filter out the following non-significant categories:

rotary_eng_doc_filtered <- rotary_eng_doc %>% filter(!category %in% c("Advertisement", 
                                                                    "Classified Advertisement, Advertisement", 
                                                                    "General Information", 
                                                                    "Table of Contents, Front Matter"))

The resulting corpus contains 2,387 documents. We can then join the list of documents with the concordance table:

# join on the document identifier, the only column shared by the two tables
rotary_join <- inner_join(rotary_eng_doc_filtered, rotary_eng_conc400_filtered, by = "DocId")

head(rotary_join)

The concordance table now includes the category of article among the metadata.

Create text units

In the next steps, we will transform the text data so that it can be processed by the topic model.

First, we create a new variable “Text” for the merged text:

# We first merge "Before" and "Matched" into a new "Text" variable
rotary_merged <- rotary_join %>% mutate(Text = paste(Before, Matched, sep = " ")) 
head(rotary_merged)

# Second, we merge "After" with the text resulting from previous operation (Text): 
rotary_merged <- rotary_merged %>% mutate(Text = paste(Text, After, sep = " ")) 
head(rotary_merged)

Finally, we re-unite each document to include all the occurrences it may contain, and we retain only one instance for each document:

library(data.table)

rotary_united <- rotary_merged %>%
  group_by(DocId, Date, Year, Source, category, Title, grp = rleid(DocId)) %>% 
  summarise(Text = str_c(Text, collapse=' '), .groups = 'drop') %>%
  ungroup %>%
  select(-grp)

head(rotary_united)
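The effect of this re-uniting step can be illustrated on a made-up table. The sketch below uses base R's `aggregate()` rather than the dplyr pipeline above, and the texts are hypothetical.

```r
# Toy table (hypothetical): document 1 contains two matches, document 2 one
toy <- data.frame(
  DocId = c(1, 1, 2),
  Text = c("first match context", "second match context", "single match context"),
  stringsAsFactors = FALSE
)

# Collapse all text segments belonging to the same document into one string
united <- aggregate(Text ~ DocId, data = toy, FUN = paste, collapse = " ")
united
```

Each document now appears once, with its segments concatenated in their original order.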

Our final corpus contains 2,387 reshaped documents spanning from 1919 to 1948:

rotary_united %>% 
  group_by(Year) %>% count(Source) %>%
  ggplot(aes(x=Year, y=n, fill=Source)) + 
  geom_col(alpha = 0.8) + 
  labs(title = "The Rotary Club in the English-language press",
       subtitle = "Number of articles mentioning the club",
       x = "Year", 
       y = "Number of articles",
       caption = "Based on ProQuest Chinese Newspapers Collection")

The peak in the 1930s largely reflects the increase in the volume of periodicals during this period. It also coincides with the growth of Chinese membership and the most active period in the history of the Rotary Club. It was followed by a dramatic decline during the Sino-Japanese war (1937-1945), when most foreign periodicals ceased publication; the foreign press never fully recovered afterwards.

In addition, we observe significant differences between periodicals. Three major publications, in fact, dominated the corpus:

rotary_united %>% 
  group_by(Source)  %>%  
  summarise(n = n()) %>%
  mutate(ptg = paste0(round(n / sum(n) * 100, 0), "%")) %>%
  arrange(desc(n))

Before we move to the next step (text pre-processing), we recommend saving the dataset as a csv file:

write.csv(rotary_united, "rotary_eng_conc400.csv")

Pre-processing

Next, we need to prepare the text data to make it readable by topic model algorithms.

In the pre-processing phase, we chose neither to stem nor to lemmatize words because, at this stage, we wanted to preserve all the nuances conveyed in the original texts. We removed words containing fewer than four characters, as well as words occurring in five documents or fewer (lower.thresh = 5). We also removed a customized list of stop words, including the terms used to query the corpus and terms that are highly frequent in this context, such as “China” and “Chinese”. Based on these parameters, 14,517 out of 16,895 terms (22,589 out of 82,569 tokens) were removed due to frequency. The final corpus contains 2,387 documents, 2,378 terms and 59,980 tokens.

meta <- rotary_united %>% transmute(DocId, Date, Year, Source, category, Title)

corpus <- stm::textProcessor(rotary_united$Text,
                             metadata = meta, 
                             stem = FALSE, 
                             wordLengths = c(4, Inf), 
                             verbose = FALSE, 
                             customstopwords = c("rotary", "club", "china", "chinese", "will", "one", "two"))

# plot the number of words and documents removed at thresholds 0, 5 and 10
stm::plotRemoved(corpus$documents, lower.thresh = seq(0, 10, by = 5))

out <- stm::prepDocuments(corpus$documents, 
                          corpus$vocab, 
                          corpus$meta, 
                          lower.thresh = 5)

## Removing 14517 of 16895 terms (22589 of 82569 tokens) due to frequency 
## Your corpus now has 2387 documents, 2378 terms and 59980 tokens.
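The frequency threshold can be illustrated with a toy corpus in base R. This is a simplified sketch of the principle, not the internal code of prepDocuments; the three documents are made up, and "tiffln" stands in for an OCR error.

```r
# Toy corpus (hypothetical): each element is the list of words in one document
docs <- list(
  c("shanghai", "meeting", "hotel"),
  c("shanghai", "toys", "children"),
  c("shanghai", "meeting", "tiffln")   # "tiffln" mimics an OCR error
)

# Document frequency: number of documents in which each term occurs
doc_freq <- table(unlist(lapply(docs, unique)))

# Keep only the terms that occur in more than lower_thresh documents
lower_thresh <- 1
kept <- names(doc_freq[doc_freq > lower_thresh])
removed <- names(doc_freq[doc_freq <= lower_thresh])

kept
removed
```

Rare terms, including the misspelled one, are discarded, while terms shared across documents are retained. In the log above, the figures are consistent: 16895 - 14517 = 2378 terms and 82569 - 22589 = 59980 tokens.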

Before we move on to building the models, it is good practice to inspect more closely which words were removed:

wordsremoved <- as_tibble(out$words.removed) 
wordsremoved

Notice that many of the removed words include errors in optical character recognition (OCR) and other noisy contents resulting from the process of digitization.

Model building

Choosing the right number of topics k remains a highly debated question, and there is no definitive solution. Most topic modeling tools provide a set of metrics, such as held-out likelihood, residual analysis, average exclusivity and semantic coherence, to help the researcher determine a suitable number of topics for a given corpus. According to the authors of the stm package manual, for small corpora ranging from a few hundred to a few thousand documents, a good starting range is between 5 and 50 topics ¹. Ultimately, however, only the researcher’s interpretational needs can determine the most appropriate number of topics for a given research question.

In the stm R package, the searchK function provides a wide range of metrics to guide our choice, including held-out likelihood, residual analysis, average exclusivity and semantic coherence. Only the default properties (held-out likelihood, residuals, semantic coherence, lower bound) are displayed below:

# load stm so that its functions can be called without the stm:: prefix
library(stm)

set.seed(1111)
K <- seq(5, 50, by = 10)
kresult <- searchK(out$documents, out$vocab, K, data = out$meta, prevalence = ~ Year + Source + category, verbose = FALSE)
plot(kresult)
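It is worth checking which values of k the grid actually contains. The short sketch below prints the grid used above and, as a hypothetical alternative that is not part of our workflow, a finer grid in steps of 5.

```r
# Grid used above: starts at 5 and increases by 10
K <- seq(5, 50, by = 10)
K
# 5 15 25 35 45

# Hypothetical alternative: a finer grid in steps of 5
K_alt <- seq(5, 50, by = 5)
K_alt
```

The grid used above does not include 10 or 20, so the diagnostics for these values have to be read between neighbouring points; a finer grid would evaluate them directly, at the cost of a longer computation.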

After several experiments, we decided to build three models with 5, 10 and 20 topics, which will enable us to navigate different levels of granularity:

# 5-topic model
mod.5 <- stm::stm(out$documents, 
                   out$vocab, K=5, 
                   data=out$meta, 
                   prevalence =~ Year + Source + category, 
                   verbose = FALSE)

# 10-topic model
mod.10 <- stm::stm(out$documents, 
                   out$vocab, K=10, 
                   data=out$meta, 
                   prevalence =~ Year + Source + category, 
                   verbose = FALSE)

# 20-topic model
mod.20 <- stm::stm(out$documents, 
                   out$vocab, K=20, 
                   data=out$meta, 
                   prevalence =~ Year + Source + category,  
                   verbose = FALSE)

Next, we estimate the effect of the year of publication on topic prevalence in order to analyze topical changes over time:

year5 <- stm::estimateEffect(1:5 ~ Year, mod.5, meta=out$meta)
year10 <- stm::estimateEffect(1:10 ~ Year, mod.10, meta=out$meta)
year20 <- stm::estimateEffect(1:20 ~ Year, mod.20, meta=out$meta)
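To give an intuition of what a formula such as `1:5 ~ Year` estimates, the sketch below regresses the proportion of a single topic on the year using ordinary least squares. This is a deliberately simplified stand-in on made-up data: estimateEffect additionally accounts for the uncertainty in the estimated topic proportions, which this sketch ignores.

```r
# Toy data (hypothetical): proportion of one topic in 30 documents, one per year,
# rising by 0.005 per year with a small alternating disturbance
toy <- data.frame(Year = 1919:1948)
toy$theta_k <- 0.10 + 0.005 * (toy$Year - 1919) + rep(c(0.01, -0.01), 15)

# Linear regression of the topic proportion on the year
fit <- lm(theta_k ~ Year, data = toy)
slope <- unname(coef(fit)["Year"])
slope
```

A positive slope indicates that the topic becomes more prevalent over time. Because Year enters the formula as a plain numeric variable, the estimated effect is linear; the stm package also provides the `s()` function for smooth, non-linear terms.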

Additionally, we can incorporate other relevant metadata such as the source (name of periodical) and the genre of article (category), if we want to investigate differences in topical contents between periodicals or between categories of articles.

Estimate source effect:

source5 <- stm::estimateEffect(1:5 ~ Source, mod.5, meta=out$meta)
source10 <- stm::estimateEffect(1:10 ~ Source, mod.10, meta=out$meta)
source20 <- stm::estimateEffect(1:20 ~ Source, mod.20, meta=out$meta)

Estimate category effect:

category5 <- stm::estimateEffect(1:5 ~ category, mod.5, meta=out$meta)
category10 <- stm::estimateEffect(1:10 ~ category, mod.10, meta=out$meta)
category20 <- stm::estimateEffect(1:20 ~ category, mod.20, meta=out$meta)

Finally, we save the entire workspace, including the models and the estimated effects, as an “.RData” file to save time and computational power in the future:

save.image('rotaryeng.RData')

Model evaluation

To compare the three models, we can plot the semantic coherence of topics against their exclusivity. As the plot suggests, the higher the number of topics, the lower their semantic coherence and the higher their exclusivity:

# one row per topic: exclusivity and semantic coherence, labelled by model
mod5df <- data.frame(Topic = 1:5,
                     Exclusivity = stm::exclusivity(mod.5),
                     SemanticCoherence = stm::semanticCoherence(model = mod.5, documents = out$documents),
                     Model = "PQ5T")
mod10df <- data.frame(Topic = 1:10,
                      Exclusivity = stm::exclusivity(mod.10),
                      SemanticCoherence = stm::semanticCoherence(model = mod.10, documents = out$documents),
                      Model = "PQ10T")
mod20df <- data.frame(Topic = 1:20,
                      Exclusivity = stm::exclusivity(mod.20),
                      SemanticCoherence = stm::semanticCoherence(model = mod.20, documents = out$documents),
                      Model = "PQ20T")

models <- rbind(mod5df, mod10df, mod20df)

options(repr.plot.width=7, repr.plot.height=6, repr.plot.res=100)

plotmodels <-ggplot(models, aes(SemanticCoherence, Exclusivity, color = Model))+
  geom_point(size = 2, alpha = 0.7) + 
  geom_text(aes(label=Topic), nudge_y=.04)+
  labs(x = "Semantic coherence",
       y = "Exclusivity",
       title = "Comparing exclusivity and semantic coherence", 
       subtitle = "English-language corpus (ProQuest)")


plotmodels

Model exploration

In the first step, we highly recommend using the package stminsights to explore the models:

library(stminsights)
run_stminsights()

stminsights is an R Shiny application which provides a set of visualizations and statistical tools for exploring the topics in one model or across multiple models. While building on the stm package itself, it greatly facilitates the preliminary exploration. In the next sections, we provide the full code for reproducing and adjusting the outputs produced through the stminsights application.

Topic proportions θ

The package “stm” stores the document-topic proportions and the topic-word distributions in two matrices, θ (which is also referred to, somewhat confusingly, as γ, for instance in the tidytext package) and β. We can take a closer look at θ, which can be called directly from the model. Alternatively, it is possible and perhaps more convenient to use the built-in function “make.dt()”. The latter makes it possible to incorporate the metadata, which in our case is helpful since we aim to examine the influence of the date of publication on topic prevalence. The table below displays the proportions of topics for each document, along with their metadata.

Extract topic proportions for each model:

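# use the metadata returned by prepDocuments, which is aligned with the modelled documents
topicprop5 <- stm::make.dt(mod.5, meta = out$meta)
topicprop10 <- stm::make.dt(mod.10, meta = out$meta)
topicprop20 <- stm::make.dt(mod.20, meta = out$meta)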
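The structure of θ can be illustrated with a toy matrix. The sketch below uses made-up proportions for three documents and three topics; the column names follow the Topic1, Topic2, ... pattern, which we assume matches the output of make.dt.

```r
# Toy document-topic matrix (hypothetical values): one row per document
theta <- rbind(
  doc1 = c(0.70, 0.20, 0.10),
  doc2 = c(0.10, 0.10, 0.80),
  doc3 = c(0.30, 0.60, 0.10)
)
colnames(theta) <- paste0("Topic", 1:3)

# Each document is a mixture of topics: its proportions sum to 1
rowSums(theta)

# Dominant topic of each document
dominant <- colnames(theta)[apply(theta, 1, which.max)]
dominant

# Average prevalence of each topic across the corpus
prevalence <- colMeans(theta)
prevalence
```

The same operations can be applied to the topic columns of the tables created above, for instance to retrieve the documents in which a given topic dominates.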

Consulting the table might be a bit cumbersome unless we want to examine the topic proportions of a specific document. The “plot.STM” function associated with the “hist” argument helps to better visualize the estimates of document-topic proportions:

plot.STM(mod.5, "hist")

plot.STM(mod.10, "hist")

plot.STM(mod.20, "hist")

Next, we can examine more closely the words that define the topics in order to better understand what each topic is really about.

Word per topic β

In the stm package, the function “plot.STM” with argument “summary” displays the general distribution of topics (which topics are overall more common in the corpus) along with the most common words for each topic. In the example below, we set the number of desired words to 5:

plot.STM(mod.5,"summary", n=5)

plot.STM(mod.10, "summary", n=5)

plot.STM(mod.20, "summary", n=5)

Alternatively, we can plot word probabilities per topic as bar plots using a tidy approach. In the example below, we focus on the 10-topic model. The plots display the 10 most probable words for each topic:

# load packages

library(tidyverse)
library(tidytext)

# extract β proportions for the 10-topic model 

td_beta10_eng <- tidytext::tidy(mod.10) 

# plot the distribution of words (top 10 words) for each topic 

options(repr.plot.width=7, repr.plot.height=8, repr.plot.res=100) 

td_beta10_eng %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  labs(x = NULL, y = expression(beta),
       title = "Highest word probabilities for each topic",
       subtitle = "Different words are associated with different topics",
       caption = "Based on ProQuest Chinese Newspapers Collection")

Explore the top 10 words for each topic:

topic10eng_top_words <- td_beta10_eng %>%
  group_by(topic) %>%
  top_n(10, beta) 

topic10eng_top_words %>% arrange(topic, desc(beta))

Topic labeling

The function “labelTopics” (or “sageLabels”) provides deeper insight into the most characteristic words of each topic. In addition to word probabilities, it computes other metrics, including FREX (which weights words by their overall frequency and their exclusivity to the topic), lift (frequency divided by frequency in other topics), and score (similar to lift, but with log frequencies). In the example below, we set the number of words to 10:

labelTopics(mod.5, n=10)

## Topic 1 Top Words:
##       Highest Prob: shanghai, president, tiffin, members, meeting, held, yesterday, hotel, weekly, rotarians 
##       FREX: fong, elected, district, presided, hangchow, chair, fitch, thursdays, delegates, president 
##       Lift: chair, heads, informative, luther, mccullough, yesterdays, chekiang, fort, franz, ningpo 
##       Score: president, tiffin, heads, elected, fong, metropole, ningpo, guests, weekly, conference 
## Topic 2 Top Words:
##       Highest Prob: shanghai, children, hospital, toys, made, school, christmas, work, committee, russian 
##       FREX: children, toys, christmas, charity, poor, drive, funds, hospital, child, fund 
##       Lift: card, charitable, refugee, advisory, beggars, bouts, brigadier, cases, cash, charity 
##       Score: toys, children, christmas, stands, hospital, charity, card, poor, funds, needy 
## Topic 3 Top Words:
##       Highest Prob: nanking, address, foreign, said, subject, shanghai, delivered, speech, told, government 
##       FREX: economic, tsinan, chinas, government, declared, bureau, arnold, greater, julean, delivered 
##       Lift: co-operative, economic, kann, lamb, necessity, bright, discusses, domestic, economics, holmes 
##       Score: planning, nanking, economic, tsinan, government, address, attache, julean, chinas, declared 
## Topic 4 Top Words:
##       Highest Prob: shanghai, first, said, international, world, shield, present, work, local, great 
##       FREX: shield, scouts, troop, competition, jamboree, scout, troops, team, trophy, jewish 
##       Lift: consisted, correct, displays, hanbury, jamboree, north-, resolution, troop, athletic, award 
##       Score: hobby, shield, troop, scouts, jamboree, competition, scout, trophy, prize, team 
## Topic 5 Top Words:
##       Highest Prob: shanghai, meeting, american, held, hotel, thursday, members, weekly, next, today 
##       FREX: brevities, tomorrow, closed, american, speak, columbia, next, regular, clock, parisien 
##       Lift: columbia, finest, genera, lounge, mason, potter, steamship, vacation, brevities, doremus 
##       Score: finest, american, thursday, meeting, brevities, astor, hotel, weekly, speak, chamber
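
The difference between the “Highest Prob” and “FREX” rows can be illustrated with a toy example. The sketch below is a simplified approximation of the idea behind FREX (a weighted harmonic mean of a frequency rank and an exclusivity rank), not the exact implementation used by the stm package; the topic-word probabilities are made up.

```r
# Toy topic-word matrix (hypothetical probabilities); each row sums to 1
beta <- rbind(
  t1 = c(shanghai = 0.40, meeting = 0.30, toys = 0.05, hotel = 0.25),
  t2 = c(shanghai = 0.40, meeting = 0.05, toys = 0.50, hotel = 0.05)
)

# Exclusivity: share of a word's probability mass that falls in each topic
excl <- sweep(beta, 2, colSums(beta), "/")

# Within-topic rank scores in (0, 1]; a higher score means a higher rank
freq_score <- t(apply(beta, 1, rank)) / ncol(beta)
excl_score <- t(apply(excl, 1, rank)) / ncol(beta)

# Weighted harmonic mean of the two scores (w = weight given to frequency)
w <- 0.5
frex <- 1 / (w / freq_score + (1 - w) / excl_score)

# Top word of each topic according to each metric
top_prob <- colnames(beta)[apply(beta, 1, which.max)]
top_frex <- colnames(frex)[apply(frex, 1, which.max)]
top_prob
top_frex
```

In the toy example, "shanghai" is the most probable word of the first topic but, being equally probable in both topics, it is not exclusive to either; the FREX-style score therefore ranks "meeting" first. The same pattern is visible in the output above, where "shanghai" appears among the highest-probability words of all five topics but never among their FREX words.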

labelTopics(mod.10, n=10)

## Topic 1 Top Words:
##       Highest Prob: shanghai, president, member, fong, international, secretary, board, elected, past, harris 
##       FREX: directors, board, fong, convention, officers, elected, harris, secretary, treasurer, honorary 
##       Lift: heads, luther, past-president, absent, joseph, richard, studio, survived, sanzetti, elect 
##       Score: president, elected, fong, heads, directors, conference, board, harris, officers, convention 
## Topic 2 Top Words:
##       Highest Prob: children, hospital, toys, shanghai, christmas, made, funds, charity, year, building 
##       FREX: toys, charity, hospital, drive, children, card, christmas, poor, matinee, needy 
##       Lift: bouts, card, henningsen, stands, suffering, beds, brigadier, camps, cash, charity 
##       Score: toys, children, christmas, hospital, stands, card, charity, needy, matinee, poor 
## Topic 3 Top Words:
##       Highest Prob: nanking, wang, international, foreign, hangchow, district, affairs, president, special, minister 
##       FREX: wang, tsinan, hangchow, nanking, soochow, capital, affairs, bureau, anniversary, ministry 
##       Lift: lamb, vice-minister, chapter, hansen, ministry, tsinan, holmes, planning, treaty, foochow 
##       Score: nanking, planning, wang, hangchow, tsinan, charter, ningpo, soochow, governor, affairs 
## Topic 4 Top Words:
##       Highest Prob: work, shanghai, public, committee, community, interest, relief, church, municipal, service 
##       FREX: amoy, ricksha, relief, problem, work, church, might, settlement, done, resolution 
##       Lift: muni, resolution, traffic, assisting, beggars, carrying, churches, cipal, hobby, kulangsu 
##       Score: hobby, ricksha, problem, relief, amoy, resolution, pullers, camp, beggars, essay 
## Topic 5 Top Words:
##       Highest Prob: american, shanghai, states, united, university, america, addressed, foreign, company, hongkong 
##       FREX: american, chamber, commercial, julean, america, hongkong, states, addressed, commerce, united 
##       Lift: finest, genera, publishing, ager, allen, domestic, doremus, julean, legion, married 
##       Score: american, states, finest, chamber, united, attache, trade, julean, commerce, commercial 
## Topic 6 Top Words:
##       Highest Prob: school, shanghai, miss, russian, shield, scouts, ball, troop, presented, boys 
##       FREX: shield, scouts, troop, team, jamboree, scout, cathedral, trophy, russian, school 
##       Lift: cathedral, displays, emigrants, hanbury, stall, teams, thos, willis, athletic, compete 
##       Score: shield, troop, school, scouts, jamboree, russian, competition, miss, trophy, thos 
## Topic 7 Top Words:
##       Highest Prob: meeting, hotel, held, shanghai, weekly, thursday, tiffin, metropole, speaker, yesterday 
##       FREX: metropole, weekly, brevities, meeting, closed, hotel, regular, next, speaker, metro 
##       Lift: chun, informative, metro, metropolo, slides, brevities, illustrated, unclaimed, origin, mctropole 
##       Score: metropole, weekly, meeting, hotel, tiffin, thursday, speaker, held, brevities, chun 
## Topic 8 Top Words:
##       Highest Prob: members, tiffin, guests, shanghai, house, astor, rotarians, party, held, dinner 
##       FREX: guests, attendance, astor, house, dinner, tuesday, party, clock, ladies, dance 
##       Lift: dance, feng, wore, enjoyable, fathers, sang, piano, guests, songs, tables 
##       Score: feng, guests, astor, tiffin, house, dinner, party, dance, ladies, attendance 
## Topic 9 Top Words:
##       Highest Prob: address, japan, japanese, government, peking, delivered, last, members, present, international 
##       FREX: government, relations, declared, tokyo, japanese, london, pacific, japan, manchuria, weeks 
##       Lift: italy, mukden, osaka, asserted, berlin, expedition, legal, manchester, okecki, operating 
##       Score: italy, government, japan, declared, manila, economic, japanese, tokyo, pacific, peking 
## Topic 10 Top Words:
##       Highest Prob: said, shanghai, world, years, address, speech, gave, talk, members, editor 
##       FREX: said, world, daily, editor, greater, history, speaking, paper, harry, peace 
##       Lift: depression, alexander, crisis, north-, zaria, privilege, pott, remarked, lord, daily 
##       Score: alexander, said, world, speech, editor, north-, talk, greater, crisis, speaking

labelTopics(mod.20, n=10)

## Topic 1 Top Words:
##       Highest Prob: shanghai, president, international, past, fong, member, local, harris, fitch, george 
##       FREX: fong, past, convention, fitch, representing, harris, luther, george, founder, vienna 
##       Lift: heads, douglas, luther, richard, founder, fong, vienna, convention, interna, past 
##       Score: fong, president, past, heads, harris, fitch, convention, representing, luther, george 
## Topic 2 Top Words:
##       Highest Prob: hospital, toys, children, shanghai, christmas, charity, made, year, funds, ward 
##       FREX: hospital, card, charity, toys, matinee, drive, orthopedic, ward, christmas, stand 
##       Lift: bouts, card, refugee, stands, boxing, crippled, damaged, fortunate, reconditioned, camps 
##       Score: toys, hospital, christmas, card, children, charity, stands, matinee, needy, drive 
## Topic 3 Top Words:
##       Highest Prob: foreign, trade, arnold, week, addresses, julean, various, commercial, members, commissioner 
##       FREX: arnold, julean, trade, attache, addresses, foreign, form, planning, railway, commercial 
##       Lift: arnold, attache, julean, machinery, lepers, planning, domestic, trade, muni, sees 
##       Score: planning, arnold, julean, attache, trade, foreign, form, commercial, addresses, chinas 
## Topic 4 Top Words:
##       Highest Prob: work, committee, relief, public, shanghai, community, international, done, interest, service 
##       FREX: relief, done, bring, ricksha, bodies, traffic, long, better, hobby, page 
##       Lift: peaceful, urging, famine, forth, hobby, carrying, dollars, churches, bodies, duty 
##       Score: hobby, relief, done, ricksha, problem, resolution, bring, show, work, better 
## Topic 5 Top Words:
##       Highest Prob: american, shanghai, states, united, company, member, addressed, thursday, university, commerce 
##       FREX: chamber, states, commerce, united, american, senator, company, judge, addressed, columbia 
##       Lift: finest, genera, publishing, shansi, sportif, chamber, doremus, senator, scottish, steamship 
##       Score: american, chamber, states, united, commerce, finest, company, senator, shansi, columbia 
## Topic 6 Top Words:
##       Highest Prob: school, shield, scouts, troop, russian, shanghai, president, jamboree, camp, scout 
##       FREX: shield, scouts, troop, jamboree, scout, jewish, troops, millington, cathedral, camp 
##       Lift: scouts, troop, cubs, displays, hanbury, jamboree, jewish, millington, scout, thos 
##       Score: shield, troop, scouts, jamboree, scout, russian, competition, jewish, school, thos 
## Topic 7 Top Words:
##       Highest Prob: shanghai, road, chang, institution, issue, work, building, charge, official, appeal 
##       FREX: issue, institution, chang, latest, pagoda, blind, organ, shown, appeal, official 
##       Lift: pagoda, chun, issue, moved, printing, latest, sick, institution, civil, contains 
##       Score: institution, chun, issue, chang, pagoda, appeal, latest, blind, mittee, sick 
## Topic 8 Top Words:
##       Highest Prob: hotel, shanghai, meeting, held, weekly, metropole, thursday, members, yesterday, speaker 
##       FREX: metropole, hotel, brevities, usual, weekly, metro, thursday, informative, re-union, reunion 
##       Lift: feng, informative, unclaimed, reunion, slides, metropole, brevities, metro, oregon, commander 
##       Score: metropole, hotel, weekly, thursday, feng, speaker, brevities, tiffin, usual, re-union 
## Topic 9 Top Words:
##       Highest Prob: meeting, today, program, speak, regular, held, next, closed, tomorrow, shang 
##       FREX: today, closed, tomorrow, program, regular, speak, thurs, postponed, reminded, hold 
##       Lift: italy, tomorrow, closed, morrow, today, reminded, postponed, transportation, todays, thurs 
##       Score: today, italy, closed, tomorrow, program, speak, meeting, thurs, regular, postponed 
## Topic 10 Top Words:
##       Highest Prob: said, hongkong, recently, well, talk, pacific, institute, great, read, speech 
##       FREX: australia, pacific, hongkong, institute, relations, story, read, soviet, never, think 
##       Lift: soviet, alexander, bright, depression, scotland, australia, ancient, wrong, security, story 
##       Score: alexander, said, hongkong, woodhead, australia, pacific, lord, institute, soviet, manila 
## Topic 11 Top Words:
##       Highest Prob: members, shanghai, dinner, evening, given, party, clock, ladies, night, afternoon 
##       FREX: clock, tuesday, evening, dinner, ladies, dance, wives, friday, night, instead 
##       Lift: stage, wives, lounge, gala, tuesday, arts, dance, clock, reservations, dancing 
##       Score: stage, dinner, dance, evening, tuesday, clock, ladies, wives, ball, majestic 
## Topic 12 Top Words:
##       Highest Prob: wang, conference, district, hangchow, governor, members, international, president, tsinan, held 
##       FREX: governor, wang, conference, tsinan, district, hangchow, ningpo, delegates, mayor, formed 
##       Lift: mccullough, wuhu, chekiang, governor, tsinan, foochow, ningpo, wang, conference, district 
##       Score: wuhu, wang, conference, governor, hangchow, tsinan, district, ningpo, delegates, mayor 
## Topic 13 Top Words:
##       Highest Prob: nanking, government, national, bureau, affairs, soochow, special, central, capital, minister 
##       FREX: soochow, bureau, nanking, ministry, pullers, capital, administration, central, affairs, government 
##       Lift: co-operative, lamb, necessity, pullers, soong, okecki, request, treaty, hansen, soochow 
##       Score: nanking, lamb, soochow, pullers, bureau, ministry, government, okecki, co-operative, central 
## Topic 14 Top Words:
##       Highest Prob: address, shanghai, delivered, present, last, speech, editor, years, subject, history 
##       FREX: daily, editor, settlement, modern, political, north-, greater, gold, economic, delivered 
##       Lift: north-, beggars, cold, discusses, kann, newspapers, area, pott, pointing, politics 
##       Score: area, address, north-, editor, daily, delivered, economic, situation, beggars, greater 
## Topic 15 Top Words:
##       Highest Prob: miss, shanghai, french, school, russian, children, girls, society, race, donation 
##       FREX: miss, donation, race, donations, girls, french, thanks, receipts, orthodox, kings 
##       Lift: emigrants, oeuvres, orthodox, willis, acknowledge, harvey, sassoon, cash, garage, kings 
##       Score: miss, willis, donation, russian, donations, girls, orthodox, children, toys, french 
## Topic 16 Top Words:
##       Highest Prob: shanghai, meet, international, first, radio, local, team, tennis, american, presented 
##       FREX: tennis, radio, prize, student, awarded, schools, team, contest, class, field 
##       Lift: fathers, prize, student, training, winner, athletic, awarded, bursary, dorothy, jean 
##       Score: winner, team, tennis, trophy, tournament, contest, prize, essay, scholarship, competition 
## Topic 17 Top Words:
##       Highest Prob: tiffin, meeting, guests, held, address, members, gave, weekly, yesterday, interesting 
##       FREX: tiffin, gave, guests, attendance, thursdays, visiting, interesting, presided, chair, introduced 
##       Lift: chapman, attempts, ladows, tavern, outlines, aviation, attendance, thursdays, yesterdays, laid 
##       Score: tiffin, astor, guests, attendance, attempts, weekly, thursdays, house, introduced, presided 
## Topic 18 Top Words:
##       Highest Prob: tientsin, left, peking, church, meeting, house, astor, union, service, shanghai 
##       FREX: tientsin, left, church, photo, peking, yang, frank, invitation, arrived, woman 
##       Lift: yang, defense, photo, sanzetti, studio, sino-foreign, fore, fortnightly, ject, tientsin 
##       Score: yang, tientsin, church, astor, left, peking, sanzetti, photo, woman, frank 
## Topic 19 Top Words:
##       Highest Prob: japanese, life, rotarians, rotarian, business, members, years, first, great, said 
##       FREX: spirit, addressing, japanese, life, chicago, rotarian, probably, factory, canada, harbin 
##       Lift: osaka, mail, becomes, harbin, addressing, remarked, africa, spirit, probably, discovered 
##       Score: mail, rotarian, chicago, spirit, addressing, japanese, harbin, tokyo, travels, membership 
## Topic 20 Top Words:
##       Highest Prob: president, general, elected, secretary, board, shanghai, year, meeting, directors, chairman 
##       FREX: elected, directors, board, secretary, general, honorary, officers, treasurer, vice, head 
##       Lift: named, directors, election, treasurer, elect, ensuing, elected, philippines, board, honorary 
##       Score: elected, directors, president, board, named, officers, secretary, treasurer, berents, honorary

For example, we can display the 10 FREX words for the three topics that mixed local and international concerns in the 10-topic model:

plot.STM(mod.10, "labels", topics=c(1,5,10), label="frex", n=10, width=50)
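To make the difference between the "Highest Prob" and "FREX" rankings more concrete, the sketch below computes a simplified FREX-style score on a toy topic-word matrix. It assumes that FREX is the harmonic mean of a word's within-topic frequency rank and its exclusivity rank. The matrix `beta`, the function `frex_sketch` and the weight `w` are hypothetical names introduced for illustration; this is not the code used by the “stm” package.

```r
# Toy topic-word matrix (rows = topics, columns = words); each row sums to 1.
# All values are invented for illustration.
beta <- rbind(
  topic1 = c(shanghai = 0.40, meeting = 0.30, relief = 0.25, scouts = 0.05),
  topic2 = c(shanghai = 0.40, meeting = 0.10, relief = 0.05, scouts = 0.45)
)

# Simplified FREX-style score (hypothetical helper, not the stm implementation).
frex_sketch <- function(beta, w = 0.5) {
  # Exclusivity: share of each word's probability mass held by each topic
  excl <- sweep(beta, 2, colSums(beta), "/")
  # Within each topic, turn both quantities into ranks scaled to (0, 1]
  freq_rank <- t(apply(beta, 1, rank)) / ncol(beta)
  excl_rank <- t(apply(excl, 1, rank)) / ncol(beta)
  # Weighted harmonic mean of the two scaled ranks
  1 / (w / excl_rank + (1 - w) / freq_rank)
}

frex <- frex_sketch(beta)
top_prob <- colnames(beta)[apply(beta, 1, which.max)]
top_frex <- colnames(frex)[apply(frex, 1, which.max)]
print(rbind(top_prob, top_frex))
```

In this toy example, "shanghai" is the most probable word in the first topic but is shared equally with the second one, so the FREX-style score favors "meeting", which is both frequent and more specific to that topic.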

Word clouds

Word clouds provide a more intuitive, though less rigorous, way of visualizing word prevalence in topics. The example below displays the word clouds of the same three topics as above:

cloud(mod.10, topic = 1, scale = c(4, 0.4))

cloud(mod.10, topic = 5, scale = c(4, 0.4))

cloud(mod.10, topic = 10, scale = c(4, 0.4))

Quotations

Some topics may remain unclear and require a closer look at a sample of documents to better understand what they are about. The “findThoughts” function retrieves the documents that are most representative of each topic, which can then be plotted with the “plotQuote” function.

T10_thoughts1 <- findThoughts(mod.10,texts=rotary_united$Text, topics=1, n=5)$docs[[1]]
T10_thoughts5 <- findThoughts(mod.10,texts=rotary_united$Text, topics=5, n=5)$docs[[1]]
T10_thoughts10 <- findThoughts(mod.10,texts=rotary_united$Text, topics=10, n=5)$docs[[1]]

par(mfrow=c(1,3), mar=c(1,1,2,2))
plotQuote(T10_thoughts1, width=50, maxwidth=400, text.cex=0.5, main="Topic 1")
plotQuote(T10_thoughts5, width=50, maxwidth=400, text.cex=0.5, main="Topic 5")
plotQuote(T10_thoughts10, width=50, maxwidth=400, text.cex=0.5, main="Topic 10")
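The selection of representative documents can be illustrated with a minimal base R sketch, assuming that "most representative" means the documents with the highest estimated proportion of the topic. The matrix `theta`, the vector `texts` and the helper `top_documents` are hypothetical and stand in for the fitted model and the corpus.

```r
# Toy document-topic matrix (rows = documents, columns = topics); rows sum to 1.
theta <- rbind(
  c(0.70, 0.20, 0.10),
  c(0.10, 0.80, 0.10),
  c(0.55, 0.15, 0.30),
  c(0.05, 0.05, 0.90)
)
# Invented snippets standing in for the article texts
texts <- c("Rotary tiffin at the Metropole",
           "Scout jamboree report",
           "Weekly meeting notice",
           "Famine relief appeal")

# Hypothetical helper: return the n documents with the highest share of a topic
top_documents <- function(theta, texts, topic, n = 2) {
  idx <- order(theta[, topic], decreasing = TRUE)[seq_len(n)]
  data.frame(index = idx, proportion = theta[idx, topic], text = texts[idx])
}

top1 <- top_documents(theta, texts, topic = 1)
print(top1)
```

Reading the returned texts alongside the top words is what allows an otherwise opaque topic to be labeled.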

Topic correlation

Topic correlation helps to better structure the exploration of the model. It shows relations between topics based on how they co-occur within documents: in the “stm” package, correlations are computed from the estimated document-level topic proportions, so that positively correlated topics are those likely to be discussed in the same articles. The package provides two options for estimating topic correlations. The “simple” method simply thresholds the covariances whereas the “huge” method uses the semi-parametric procedure in the “huge” package. Let’s compare the two procedures:

corrsimple <- topicCorr(mod.20, method = "simple", verbose = FALSE)
corrhuge <- topicCorr(mod.20, method = "huge", verbose = FALSE)
par(mfrow=c(1,2), mar=c(0,0,2,2))
plot(corrsimple, main = "Simple method")
plot(corrhuge, main = "Huge method")
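The logic of the “simple” method can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package code: we assume that the correlations are computed on the document-topic proportions and that an edge is kept whenever the correlation exceeds a small positive cutoff. The matrix `theta` and the `cutoff` value are invented for the example.

```r
# Toy document-topic matrix (rows = documents, columns = topics); rows sum to 1.
theta <- rbind(
  c(0.60, 0.30, 0.10),
  c(0.50, 0.40, 0.10),
  c(0.10, 0.10, 0.80),
  c(0.20, 0.10, 0.70),
  c(0.45, 0.35, 0.20)
)
colnames(theta) <- paste("Topic", 1:3)

cutoff <- 0.01                      # assumed threshold for keeping an edge
cormat <- cor(theta)                # topic-by-topic correlation matrix
adjacency <- (cormat > cutoff) * 1  # 1 = edge kept, 0 = edge dropped
diag(adjacency) <- 0                # ignore self-correlations

print(round(cormat, 2))
print(adjacency)
```

Here Topics 1 and 2 tend to appear in the same documents and are linked, whereas Topic 3 is negatively correlated with both and remains isolated.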

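We can further use the “ggraph” package to visualize the topic proportions and the correlation weights with greater precision. The network itself is extracted with the get_network function, which is provided by the “stminsights” package. Let’s start with the simple method: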

# extract network 
stm_corrs <- get_network(model = mod.20,
                         method = 'simple',
                         labels = paste('Topic', 1:20),
                         cutoff = 0.001,
                         cutiso = TRUE)

# plot network with ggraph 
library(ggraph)

ggraph(stm_corrs, layout = 'fr') +
  geom_edge_link(
    aes(edge_width = weight),
    label_colour = '#fc8d62',
    edge_colour = '#377eb8') +
  geom_node_point(size = 4, colour = 'black')  +
  geom_node_label(
    aes(label = name, size = props),
    colour = 'black',  repel = TRUE, alpha = 0.85) +
  scale_size(range = c(2, 10), labels = scales::percent) +
  labs(size = 'Topic Proportion',  edge_width = 'Topic Correlation', title = "Simple method") + 
  scale_edge_width(range = c(1, 3)) +
  theme_graph()

Interactive visualization

The “stm” package also includes a function “toLDAvis”, which relies on the “LDAvis” package to produce an interactive visualization of the topic model. The main graphical elements include:

- Default topic circles - K circles, one for each topic, whose areas are set to be proportional to the proportions of the topics across the N total tokens in the corpus.
- Red bars - represent the estimated number of times a given term was generated by a given topic.
- Blue bars - represent the overall frequency of each term in the corpus.
- Topic-term circles - K×W circles whose areas are set to be proportional to the frequencies with which a given term is estimated to have been generated by the topics.

stm::toLDAvis(mod.5, docs=out$documents)
stm::toLDAvis(mod.10, docs=out$documents)
stm::toLDAvis(mod.20, docs=out$documents)
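The quantities behind these graphical elements can be approximated by hand. The sketch below is a simplified illustration, not the LDAvis code: `theta` (document-topic proportions), `phi` (topic-term probabilities) and `doc_length` (tokens per document) are toy values invented for the example, and the overall term frequency is estimated from the model, whereas the interactive visualization takes the observed counts as input.

```r
# Toy inputs, invented for illustration
theta <- rbind(c(0.9, 0.1),        # documents x topics
               c(0.2, 0.8),
               c(0.5, 0.5))
phi <- rbind(c(0.6, 0.3, 0.1),     # topics x terms
             c(0.1, 0.2, 0.7))
colnames(phi) <- c("tiffin", "meeting", "scouts")
doc_length <- c(100, 300, 200)     # number of tokens per document

# Topic circles: expected number of tokens per topic, as a share of all tokens
topic_tokens <- colSums(theta * doc_length)
topic_share <- topic_tokens / sum(topic_tokens)

# Red bars: expected number of times each term is generated by each topic
term_topic_freq <- phi * topic_tokens

# Blue bars: overall term frequency (here estimated by summing over topics)
term_freq <- colSums(term_topic_freq)

print(round(topic_share, 3))
print(term_topic_freq)
print(term_freq)
```

Weighting by document length matters: a topic that dominates a few long articles can occupy a larger circle than a topic spread thinly over many short notices.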

In the next section, we propose to analyze the effect of time (date of publication) on topic prevalence in the three models.

Topics over time

First, select the columns holding the topic proportions in each model:

topic5prop <- topicprop5 %>% select(c(2:6))
topic10prop <- topicprop10 %>% select(c(2:11))
topic20prop <- topicprop20 %>% select(c(2:21))

Compute the mean topic proportions per year:

topic_proportion_per_year5 <- aggregate(topic5prop, by = list(Year = rotary_united$Year), mean)
topic_proportion_per_year10 <- aggregate(topic10prop, by = list(Year = rotary_united$Year), mean)
topic_proportion_per_year20 <- aggregate(topic20prop, by = list(Year = rotary_united$Year), mean)

Reshape the data frames to long format:

library(reshape)
vizDataFrame5y <- melt(topic_proportion_per_year5, id.vars = "Year")
vizDataFrame10y <- melt(topic_proportion_per_year10, id.vars = "Year")
vizDataFrame20y <- melt(topic_proportion_per_year20, id.vars = "Year")
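The two steps above (yearly averaging, then reshaping to long format) can be checked on a toy example. The sketch below uses base R only, so the reshaping is done by hand instead of calling melt. The data frame `toy` and its values are invented for illustration.

```r
# Toy document-level topic proportions with a publication year
toy <- data.frame(
  Year   = c(1920, 1920, 1921, 1921),
  Topic1 = c(0.8, 0.6, 0.3, 0.1),
  Topic2 = c(0.2, 0.4, 0.7, 0.9)
)

# Step 1: mean proportion of each topic per year
per_year <- aggregate(toy[, c("Topic1", "Topic2")],
                      by = list(Year = toy$Year), FUN = mean)

# Step 2: long format with one row per (year, topic) pair,
# mirroring the Year / variable / value columns used for plotting
long <- data.frame(
  Year     = rep(per_year$Year, times = 2),
  variable = rep(c("Topic1", "Topic2"), each = nrow(per_year)),
  value    = c(per_year$Topic1, per_year$Topic2)
)

print(per_year)
print(long)
```

Because the proportions of each document sum to one, the yearly means also sum to one, which is why the stacked bars in the plots all have the same height.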

Plot topic proportions per year as bar plots:

library(pals)

# 5-topic model: 
ggplot(vizDataFrame5y, aes(x=Year, y=value, fill=variable)) + 
  geom_bar(stat = "identity") + ylab("proportion") + 
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "Topic") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title="The Rotary Club in the English-language press", 
       subtitle = "Topic proportion over time (5-topic model)")

# 10-topic model:
ggplot(vizDataFrame10y, aes(x=Year, y=value, fill=variable)) + 
  geom_bar(stat = "identity") + ylab("proportion") + 
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "Topic") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title="The Rotary Club in the English-language press", 
       subtitle = "Topic proportion over time (10-topic model)")

# 20-topic model:
ggplot(vizDataFrame20y, aes(x=Year, y=value, fill=variable)) + 
  geom_bar(stat = "identity") + ylab("proportion") + 
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "Topic") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs(title="The Rotary Club in the English-language press", 
       subtitle = "Topic proportion over time (20-topic model)")

Concluding remarks

Methodologically, the contribution of this paper is threefold:

First, it has offered a simple yet efficient solution, based on concordance, to the problem of article segmentation in digitized newspapers. This preliminary method needs to be refined and adjusted to different types of texts depending on their varying length and relevance to the research question.
Second, this paper constitutes a rare instance of cross-lingual comparison involving the Chinese language during its transitional stage between classical and modern Chinese. As a low-resource language, pre-modern Chinese presents significant challenges for the application of natural language processing tools and computer-assisted text analysis. As E. Kaske has demonstrated (Kaske, 2007), the Chinese language was highly unstable during the century under study, which was a pivotal phase in the creation of a standard vernacular (baihua) national language (guoyu). In his data-driven study of the Shenbao, P. Magistry has shown that the Chinese language actually evolved through six main stages between 1872 and 1949. Further research should pay greater attention to the decisions made during the pre-processing phase and design strict protocols for evaluating the impact of tokenization and language variation on the resulting topics. From a multilingual perspective, future research could also benefit from more sophisticated techniques for the automatic alignment of topics across languages. It could also investigate the differences between the various English periodicals included in the ProQuest collection, especially between the British North China Herald, the American China Weekly Review, and the Chinese-owned China Press.
Third, this research has demonstrated the value of combining several models with different numbers of topics (k), instead of focusing on a single, definitive model. This multi-model approach is particularly appropriate when dealing with corpora of different sizes and with different structures. This multi-scalar reading of corpora enables scholars to navigate between different levels of granularity and to select in each model the topics that are the most relevant to the research question.

The results of this topic modeling exercise can be used as a starting point for addressing more specific research questions. The inferred topics point to the existence of two main categories of articles that can be further investigated using appropriate methods. On the one hand, topics related to meetings, organization, and philanthropy are generally rich in names of individuals, organizations, and locations. Named entity recognition (NER) and network analysis can then be used to automatically extract the names of these actors and further analyze their connections. On the other hand, topics related to lectures and discussions (forums), which are richer in semantic content, lend themselves to a deeper examination of the discourses articulated by the various actors, using methods such as semantic and sentiment analysis. Finally, while the Rotary Club has served as a test case in this paper, our methodology can be extended to investigate other public sphere institutions and more abstract concepts related to the public sphere. Furthermore, it can be transposed to similar digitized texts in English, Chinese, and possibly other languages, beyond the specific corpora used in this research.

References

Armand, Cécile. “Foreign Clubs with Chinese Flavor: The Rotary Club of Shanghai and the Politics of Language.” In Knowledge, Power, and Networks: Elites in Transition in Modern China, edited by Cécile Armand, Christian Henriot, and Huei-min Sun, 233–59. Leiden: Brill, 2022.
Huang, Philip C.C. “‘Public Sphere’/‘Civil Society’ in China? The Third Realm between State and Society.” Modern China 19, no. 2 (1993): 216–40.
Kaske, Elisabeth. The Politics of Language in Chinese Education, 1895-1919. Vol. 82. Sinica Leidensia. Leiden: Brill, 2007.
Magistry, Pierre. “Languages(s) of the Shun-Pao, a Computational Linguistics Account.” In 10th International Conference of Digital Archives and Digital Humanities. Taipei, Taiwan, 2019.
Rankin, Mary Backus. “The Origins of a Chinese Public Sphere. Local Elites and Community Affairs in the Late Imperial Period.” Études Chinoises 9, no. 2 (1990): 13–60.
Wagner, Rudolf G, ed. Joining the Global Public: Word, Image, and City in Early Chinese Newspapers, 1870-1910. Albany, NY: State University of New York Press, 2007.
Wakeman, Frederic. “The Civil Society and Public Sphere Debate: Western Reflections on Chinese Political Culture.” Modern China 19, no. 2 (1993): 108–38.

Acknowledgements

This research has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 788476) and a CCFK grant.

Margaret Roberts et al., “Stm: Estimation of the Structural Topic Model,” September 18, 2020, 64–65, https://CRAN.R-project.org/package=stm