This is the companion documentation to a conference paper initially entitled “Local elites with global outreach: A bilingual topic modeling of the Shanghai press (1919-49)” given at the ESSCS conference “New Foundations in Chinese Digital History” held at Aix-Marseille University, 24-25 June 2022. This paper aims to map the formation of a transnational public sphere in republican China using a bilingual structural topic modeling approach. This document focuses on the English-language corpus built from the ProQuest Chinese newspaper collection.
This paper seeks to investigate the formation of a transnational public sphere in republican China, through the joint empirical study of two key institutions – a non-state transnational organization, the Rotary Club, and its representations in the Shanghai press, which has long been considered a key medium for shaping and disseminating information in modern China (Rankin, 1990; Huang, 1993; Wakeman, 1993; Wagner, 2007). Previous research on the Chinese public sphere presents two main limitations. On the one hand, scholars have focused on theoretical discussions regarding the transferability of Western concepts to China, instead of examining the concrete manifestations of the public sphere in the press and how it was put into practice by social actors. On the other hand, scholars who have used the press as a source have essentially relied on the close reading of subjectively selected articles, which makes it impossible to contextualize their findings or to assess whether, and to what extent, the selected texts or passages were representative of larger trends.
Taking advantage of the massive, multilingual corpora recently made available in full text by the ENP-China (Elites, Networks and Power in modern China) project, this paper introduces a mixed-method approach based on topic modeling to enable a change of scale in the analysis of the historical press and to overcome certain limitations of manual reading.
Topic modeling is a computational, statistical method aimed at automatically detecting hidden themes (topics) in large collections of unstructured texts, based on the co-occurrences of words in documents. In this paper, we rely on structural topic modeling (STM). STM builds on Latent Dirichlet Allocation (LDA), a probabilistic model that treats topics as mixtures of words and documents as mixtures of topics. This implies that a word can belong to several topics, while a topic can be represented in multiple documents in varying proportions. In addition, STM can incorporate document metadata such as the date of publication, which makes it possible to analyze topical changes over time. More specifically, we use the stm R package, which includes several built-in functions designed to facilitate the exploration of topics, including various visualizations and statistical outputs.
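To make the idea of "mixtures" concrete, here is a minimal sketch in base R. All numbers, topic labels and words are invented for illustration and are not taken from our corpus. Each topic is a probability distribution over the vocabulary, each document is a probability distribution over topics, and the expected word distribution of a document is the corresponding weighted average of the topic distributions.

```r
# Toy illustration (invented numbers): 2 topics over a 4-word vocabulary.
vocab <- c("luncheon", "speech", "charity", "hospital")

# beta: each row is a topic, i.e. a probability distribution over words.
beta <- rbind(
  meetings     = c(0.50, 0.40, 0.05, 0.05),
  philanthropy = c(0.05, 0.05, 0.40, 0.50)
)
colnames(beta) <- vocab

# theta: each row is a document, i.e. a probability distribution over topics.
theta <- rbind(
  doc1 = c(0.9, 0.1),
  doc2 = c(0.3, 0.7)
)
colnames(theta) <- rownames(beta)

# Expected word distribution of each document:
# a mixture of the topic distributions, weighted by the topic proportions.
word_probs <- theta %*% beta
print(round(word_probs, 3))
```

Note that the same word ("luncheon") has non-zero probability under both topics, and that each row of `word_probs` still sums to one.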
The purpose of this research is twofold. Substantively, our key questions include: How did the Shanghai press report on the Rotary Club? How did the organization mediate between elite/society, business/politics, localism/internationalism? What does this reveal about how the press itself functioned as a public sphere? How did this emerging public sphere change over time and vary across languages? Methodologically, we aim to design a reliable method for applying bilingual, dynamic topic modeling to the historical press. More specifically, we address three major challenges: (1) to identify topics across multiple languages (in this paper, English and Chinese), (2) to trace topical changes over time, and (3) to adapt topic modeling to the heterogeneity of newspaper content, particularly to news-in-brief articles made up of short pieces of unrelated news.
In this document, we focus on the English-language press, based on the ProQuest Chinese Newspapers Collection (CNC). For the Chinese-language press (Shenbao), see the counterpart document. Our workflow follows four main steps. First, we build the corpus from the ENP-China textbase using the HistText package. Second, we prepare the text data and build the topic models using the stm package. Next, we explore and label the topics using various visualizations and statistical measures. Finally, we analyze the effect of time on topic prevalence based on the date of publication.
Note: The purpose of this document is to describe our workflow and to make our methodological choices more explicit, testable and replicable. Historical questions and interpretations are kept to a minimum. For a comprehensive literature review and a detailed interpretation of the findings embedded in the final narrative, see the companion research paper to be published in the Journal of Digital History (JDH).
We first search for “Rotary Club” in the ProQuest collection. Since we are investigating a very specific organization with few possible homonyms and a low degree of ambiguity, we can rely on simple keywords. We restrict the query to the period from 1919, when the first Rotary Club in China was established in Shanghai, to 1949:
library(histtext)
library(dplyr)

rotary_eng_docs <- search_documents_ex('"rotary club"',
  corpus = "proquest",
  filter_query = c("publisher", "category"),
  dates = "[1919 TO 1949]",
  verbose = FALSE)
head(rotary_eng_docs)
When we retrieved the full text of the documents (not done here), we realized that the results contained many articles in which the Rotary Club was only mentioned in passing, amidst unrelated pieces of news. Using the entire document as the text unit would only reflect the messy structure of these texts. To alleviate this issue, we propose to apply topic modeling to finer segments of text instead of the entire document.
Example of a problematic document:
view_document(1319907286, "proquest", query = '"rotary club"')
In the above example, the length of the targeted segment is 38 words, whereas the total length of the “article” is 2091 words.
Instead of retrieving entire documents, therefore, we will retrieve finer strings of characters using the “concordance” function included in the histtext package. This function returns the queried terms in their context. The main challenge at this stage is to define the right context size. After a careful examination of a sample of articles, we decided to set the threshold at 400 characters to minimize the risk of overlap in cases when articles contain several occurrences of the queried terms:
rotary_eng_conc400 <- histtext::search_concordance_ex('"rotary club"',
  corpus = "proquest",
  filter_query = c("publisher", "category"),
  dates = "[1919 TO 1949]",
  context_size = 400,
  verbose = FALSE)
head(rotary_eng_conc400)
The concordance table contains seven columns: the unique identifier of the document (DocId), the date of publication (Date), the title of the article (Title), the name of the periodical (Source), the queried terms (Matched), and the text preceding (Before) and following (After) the queried terms.
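To illustrate what a context window does, the following base R sketch extracts the text before and after each match of a pattern. The function `toy_concordance()` is a hypothetical helper written for this explanation; it is not the histtext implementation, which queries the ENP-China server and may handle boundaries differently.

```r
# Hypothetical helper, NOT the histtext implementation: return every match of
# `pattern` in `text` with up to `context_size` characters on each side.
toy_concordance <- function(text, pattern, context_size = 400) {
  hits <- gregexpr(pattern, text, ignore.case = TRUE)[[1]]
  if (hits[1] == -1) {
    # no match: return an empty table with the same columns
    return(data.frame(Before = character(0), Matched = character(0),
                      After = character(0), stringsAsFactors = FALSE))
  }
  starts <- as.integer(hits)
  ends <- starts + attr(hits, "match.length") - 1
  data.frame(
    Before  = substring(text, pmax(1, starts - context_size), starts - 1),
    Matched = substring(text, starts, ends),
    After   = substring(text, ends + 1, pmin(nchar(text), ends + context_size)),
    stringsAsFactors = FALSE
  )
}

# Invented example sentence, with a small window to keep the output readable
article <- "Shipping news. The Rotary Club met at noon on Thursday. Weather: fair."
conc <- toy_concordance(article, "rotary club", context_size = 10)
print(conc)
```

With a window of 400 characters on each side, two occurrences that are closer than 800 characters to each other would produce overlapping segments, which is why the window size has to be chosen with the typical spacing of occurrences in mind.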
In the next step, we remove the Hong Kong-based periodical South China Morning Post to focus on mainland periodicals:
rotary_eng_conc400_filtered <- rotary_eng_conc400 %>%
  # drop the two publisher names under which the South China Morning Post appears
  filter(!Source %in% c("South China Morning Post Publishers Limited", "South China Morning Post Ltd.")) %>%
  # extract the year from the first four characters of the date
  mutate(Year = as.numeric(stringr::str_sub(Date, 1, 4))) %>%
  filter(Year > 1918)
head(rotary_eng_conc400_filtered)
The filtered corpus contains 3,620 instances and 2,468 documents.
Next, we retrieve the genre of each article as metadata, which will be used later to filter out irrelevant content. Since the category of the article cannot be retrieved directly with the “search_concordance_ex” function in the current version of histtext, we need to use the function “get_search_fields_content” instead.
rotary_eng_doc <- histtext::get_search_fields_content(rotary_eng_conc400_filtered, "proquest",
  search_fields = c("title", "date", "publisher", "category", "fulltext"),
  verbose = FALSE)
head(rotary_eng_doc)
In the next steps, we retain only unique documents, we discard the variables that are redundant with the concordance table (except for “DocId” which will be used for joining the two tables) and we clean the content of the “category” field:
rotary_eng_doc <- rotary_eng_doc %>%
  unique() %>%
  select(DocId, category) %>%
  # strip the square brackets and single quotes wrapped around category names
  mutate(category = stringr::str_remove_all(category, "[\\[\\]']"))
head(rotary_eng_doc)
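As a quick check of what this cleaning step does, here is a base R sketch applied to two sample strings. The raw format shown (a bracketed, quoted list) is an assumption inferred from the characters being removed, and "News" is an invented category label used only for illustration.

```r
# Assumed raw format of the "category" field (inferred, not confirmed)
raw <- c("['Classified Advertisement', 'Advertisement']", "['News']")

# Remove square brackets and single quotes, keeping the comma-separated labels
clean <- gsub("\\[|\\]|'", "", raw)
print(clean)
```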
Let’s inspect the various categories and their relative importance in our corpus:
rotary_eng_doc %>% select(DocId, category) %>% group_by(category) %>% count(sort = TRUE)
On this basis, we decide to filter out the following non-significant categories:
rotary_eng_doc_filtered <- rotary_eng_doc %>% filter(!category %in% c("Advertisement", "Classified Advertisement, Advertisement", "General Information", "Table of Contents, Front Matter"))
The resulting corpus contains 2,387 documents. We can then join the list of documents with the concordance table:
# join on the document identifier, the only column shared by the two tables
rotary_join <- inner_join(rotary_eng_doc_filtered, rotary_eng_conc400_filtered, by = "DocId")
head(rotary_join)
The concordance table now includes the category of article among the metadata.
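The effect of the inner join can be checked on a toy example. The sketch below uses base R `merge()`, which performs an inner join by default, as a stand-in for `inner_join()`; the tables and category labels are invented. Concordance rows whose document was filtered out (here DocId 3) are dropped, while documents with several occurrences keep one row per occurrence.

```r
# Invented toy tables standing in for the document list and the concordance
docs <- data.frame(DocId = c(1, 2),
                   category = c("News", "Editorial"),
                   stringsAsFactors = FALSE)
conc <- data.frame(DocId = c(1, 1, 2, 3),
                   Matched = rep("Rotary Club", 4),
                   stringsAsFactors = FALSE)

# Inner join on DocId: only identifiers present in both tables are kept
joined <- merge(docs, conc, by = "DocId")
print(joined)
```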
In the next steps, we will transform the text data so that it can be processed by the topic model.
First, we create a new variable “Text” for the merged text:
# We first merge "Before" and "Matched" into a new "Text" variable
rotary_merged <- rotary_join %>%
  mutate(Text = paste(Before, Matched, sep = " "))
head(rotary_merged)
# Second, we merge "After" with the text resulting from the previous operation (Text)
rotary_merged <- rotary_merged %>%
  mutate(Text = paste(Text, After, sep = " "))
head(rotary_merged)
Finally, we re-unite each document to include all the occurrences it may contain, and we retain only one instance for each document:
library(data.table)
rotary_united <- rotary_merged %>%
  # rleid() gives consecutive rows of the same document a common group id
  group_by(DocId, Date, Year, Source, category, Title, grp = rleid(DocId)) %>%
  summarise(Text = stringr::str_c(Text, collapse = " "), .groups = "drop") %>%
  select(-grp)
head(rotary_united)
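The logic of this re-uniting step can be sketched on a toy table. The example below uses base R `aggregate()` instead of the dplyr/data.table combination above, and groups by DocId only; the identifiers and texts are invented.

```r
# Invented toy table: document 101 contains two occurrences, document 102 one
toy <- data.frame(
  DocId = c(101, 101, 102),
  Text = c("first passage", "second passage", "only passage"),
  stringsAsFactors = FALSE
)

# Collapse all passages of the same document into a single string
united <- aggregate(Text ~ DocId, data = toy, FUN = paste, collapse = " ")
print(united)
```

After this step there is exactly one row per document, which is the unit the topic model will treat as a "document".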
Our final corpus contains 2,387 reshaped documents spanning from 1919 to 1948:
library(ggplot2)
rotary_united %>%
  group_by(Year) %>%
  count(Source) %>%
  ggplot(aes(x = Year, y = n, fill = Source)) +
  geom_col(alpha = 0.8) +
  labs(title = "The Rotary Club in the English-language press",
       subtitle = "Number of articles mentioning the club",
       x = "Year",
       y = "Number of articles",
       caption = "Based on ProQuest Chinese Newspaper Collection")
The peak in the 1930s largely reflects the increase in the volume of periodicals during this period. It also coincides with the growth of Chinese membership and the most active period in the history of the Rotary Club. It was followed by a dramatic decline during the Sino-Japanese war (1937-1945), when most foreign periodicals ceased publication; they never fully recovered afterwards.
In addition, we observe significant differences between periodicals. Three major publications, in fact, dominated the corpus:
rotary_united %>%
  group_by(Source) %>%
  summarise(n = n()) %>%
  # share of each periodical in the corpus, as a rounded percentage
  mutate(ptg = paste0(round(n / sum(n) * 100, 0), "%")) %>%
  arrange(desc(n))
Before we move to the next step (text pre-processing), we recommend saving the dataset as a CSV file.
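The corresponding code chunk is not reproduced in this document. A minimal sketch in base R could look as follows; the file name and the use of a temporary directory are assumptions made for the example, and a small stand-in table replaces `rotary_united` so that the code runs on its own.

```r
# Stand-in for `rotary_united`, so that the example is self-contained
toy_united <- data.frame(
  DocId = c(101, 102),
  Year = c(1925, 1931),
  Text = c("first text", "second text"),
  stringsAsFactors = FALSE
)

# Hypothetical output path; replace with a path in your own project
out_file <- file.path(tempdir(), "rotary_united.csv")
write.csv(toy_united, out_file, row.names = FALSE)

# Reload the file to check that the text survived the round trip
reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
print(reloaded)
```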
Next, we need to prepare the text data to make it readable by topic model algorithms.
In the pre-processing phase, we chose neither to stem nor to lemmatize words because, at this stage, we wanted to maintain all possible nuances conveyed in the original texts. We removed words that contained fewer than four characters or occurred in fewer than five documents. We also removed a customized list of stop words, especially the terms used to query the corpus, as well as high-frequency terms in this context, such as “China” and “Chinese”. Based on these parameters, 14,517 out of 16,895 terms (22,589 of 82,569 tokens) were removed due to frequency. The final corpus contains 2,387 documents, 2,378 terms and 59,980 tokens.
meta <- rotary_united %>%
  transmute(DocId, Date, Year, Source, category, Title)
corpus <- stm::textProcessor(rotary_united$Text,
  metadata = meta,
  stem = FALSE,
  wordLengths = c(4, Inf),
  verbose = FALSE,
  customstopwords = c("rotary", "club", "china", "chinese", "will", "one", "two"))
# plot the number of words and documents removed for thresholds 0, 5 and 10
stm::plotRemoved(corpus$documents, lower.thresh = seq(0, 10, by = 5))
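The frequency filter described above can be illustrated with a small base R sketch. In the stm workflow, this kind of filter is normally applied with `stm::prepDocuments()` and its `lower.thresh` argument; since that call is not shown in this section, the example below only reproduces the underlying logic on invented data, with a lower threshold than ours to keep the example small.

```r
# Invented toy corpus: each element is the list of words of one document
docs <- list(
  c("luncheon", "speech", "member"),
  c("luncheon", "charity", "member"),
  c("luncheon", "speech", "hospital")
)

# Document frequency: in how many documents does each term appear at least once?
doc_freq <- table(unlist(lapply(docs, unique)))

# Keep terms found in at least `min_docs` documents
# (we use five in our corpus; 2 is used here only because the toy corpus is tiny)
min_docs <- 2
kept <- names(doc_freq)[doc_freq >= min_docs]
removed <- setdiff(names(doc_freq), kept)

print(kept)
print(removed)
```

Rare terms make up most of the vocabulary but only a small share of the tokens, which is why removing them shrinks the number of terms far more than the number of tokens.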