Abstract
This is the companion documentation to a conference paper initially entitled “Local elites with global outreach: A bilingual topic modeling of the Shanghai press (1919-49)” given at the ESSCS conference “New Foundations in Chinese Digital History” held at Aix-Marseille University, 24-25 June 2022. This paper aims to map the formation of a transnational public sphere in republican China using a bilingual structural topic modeling approach. This document focuses on the Chinese-language corpus built from the newspaper Shenbao.
This paper seeks to investigate the formation of a transnational public sphere in republican China through the joint empirical study of two key institutions: a non-state transnational organization, the Rotary Club, and its representations in the Shanghai press, which has long been considered a key medium for shaping and disseminating information in modern China (Rankin, 1990; Huang, 1993; Wakeman, 1993; Wagner, 2007). Previous research on the Chinese public sphere presents two main limitations. On the one hand, scholars have focused on theoretical discussions regarding the transferability of Western concepts to China, instead of examining the concrete manifestations of the public sphere in the press and how it was put into practice by social actors. On the other hand, scholars who have used the press as a source have essentially relied on the close reading of subjectively selected articles, without making it possible to contextualize their findings or to assess whether, and to what extent, the selected texts or passages were representative of larger trends.
Taking advantage of the massive, multilingual corpora recently made available in full text by the ENP-China (Elites, Networks and Power in modern China) project, this paper introduces a mixed-method approach based on topic modeling to enable a change of scale in the analysis of the historical press and to overcome certain limitations of manual reading.
Topic modeling is a computational, statistical method aimed at automatically detecting hidden themes (topics) in large collections of unstructured texts, based on the co-occurrences of words in documents. In this paper, we rely on structural topic modeling (STM). STM is based on Latent Dirichlet Allocation (LDA), a probabilistic model that treats topics as mixtures of words and documents as mixtures of topics. This implies that words can belong to different topics, while topics can be represented in multiple documents in varying proportions. In addition, STM can incorporate document metadata such as the date of publication, which makes it possible to analyze topical changes over time. More specifically, we use the stm R package, which includes several built-in functions designed to facilitate the exploration of topics, including various visualizations and statistical outputs.
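The mixture idea described above can be made concrete with a small base R sketch. It is only an illustration of how topic-word and document-topic proportions combine, not the estimation procedure used by the stm package. The vocabulary, the topic names and all probabilities are invented for the example.

```r
# Toy illustration of the mixture idea behind LDA-style topic models.
# All words, topic names and numbers are invented; they are not estimates
# from the Shenbao corpus.

vocab <- c("club", "lunch", "speech", "charity", "hospital", "fund")

# beta: one row per topic, one column per word. Each row is a probability
# distribution over the vocabulary (topics as mixtures of words).
beta <- rbind(
  meetings     = c(0.40, 0.30, 0.20, 0.05, 0.03, 0.02),
  philanthropy = c(0.10, 0.02, 0.03, 0.35, 0.25, 0.25)
)
colnames(beta) <- vocab

# theta: one row per document, one column per topic. Each row is a
# probability distribution over topics (documents as mixtures of topics).
theta <- rbind(
  doc1 = c(0.9, 0.1),
  doc2 = c(0.5, 0.5),
  doc3 = c(0.2, 0.8)
)
colnames(theta) <- rownames(beta)

# Expected word distribution of each document: a topic-weighted average
# of the topic-word distributions.
word_probs <- theta %*% beta
round(word_probs, 3)
```

In this toy setting the word "club" has a non-zero probability under both topics, and each document draws on both topics in different proportions, which is the property the paragraph above describes.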
The purpose of this research is twofold. Substantively, our key questions include: How did the Shanghai press report on the Rotary Club? How did the organization mediate between elite/society, business/politics, localism/internationalism? What does this reveal about how the press itself functioned as a public sphere? How did this emerging public sphere change over time and vary across languages? Methodologically, we aim to design a reliable method for bilingual, dynamic topic modeling of the historical press. More specifically, we address three major challenges: (1) to identify topics across multiple languages (in this paper, English and Chinese), (2) to trace topical changes over time, and (3) to adapt topic modeling to the heterogeneity of newspaper content, particularly to brevity-style articles made up of short pieces of unrelated news.
In this document, we focus on the Chinese-language newspaper Shenbao (申報). For the English-language press (ProQuest Collection of Chinese Newspapers), see the counterpart document. Our workflow follows four main steps. First, we build the corpus from the ENP-China textbase using the HistText package. Second, we prepare the text data and build the topic models using the stm package. Third, we explore and label the topics using various visualizations and statistical measures. Finally, we analyze the effect of time on topic prevalence based on the date of publication.
Note: The purpose of this document is to describe our workflow and to make our methodological choices more explicit, testable and replicable. Historical questions and interpretations are kept to a minimum. For a comprehensive literature review and a detailed interpretation of the findings embedded in the final narrative, see the companion research paper to be published in the Journal of Digital History (JDH).
Load packages
library(histtext)
library(tidyverse)
We first search for “扶輪社” (fulunshe), the Chinese name of the Rotary
Club, in the Shenbao. Since we are investigating a very specific
organization with few possible homonyms and a low degree of ambiguity,
we can rely on simple keywords. We only exclude the quasi-homonym
“國學扶輪社” (guoxue fulunshe), which referred to a publishing
enterprise established in the early 20th century with no connection to
the Rotary Club. Additionally, we restrict the query to the period from
1919 onward, when the first Rotary Club in China was established in
Shanghai:
rotaryzh_doc <- search_documents_ex('"扶輪社" NOT "國學扶輪社"', corpus="shunpao", dates="[1919 TO 1947]")
head(rotaryzh_doc)
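The logic of the query can be illustrated on a few toy strings in base R. This is a sketch under the assumption that NOT acts as a document-level boolean filter in the search backend, which is not specified here. The four titles are invented and do not come from the corpus.

```r
# Toy titles (invented for illustration); not taken from the Shenbao.
texts <- c(
  "上海扶輪社昨開常會",
  "國學扶輪社出版新書",
  "扶輪社與國學扶輪社",
  "本埠商業新聞"
)

# fixed = TRUE matches the strings literally rather than as regular expressions.
has_rotary  <- grepl("扶輪社", texts, fixed = TRUE)
has_homonym <- grepl("國學扶輪社", texts, fixed = TRUE)

# Keep texts that mention the Rotary Club and do not mention the publisher.
keep <- has_rotary & !has_homonym
texts[keep]
```

Under this assumption only the first title is kept. The third one is dropped even though it also mentions the Rotary Club, because the exclusion applies to the whole text; this is a possible side effect of excluding the quasi-homonym.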
When we retrieved the full text of the documents (not done
here), we realized that the results contained many articles in which the
Rotary Club was just mentioned in passing, amidst unrelated pieces of
news. Using the entire document as a text unit would only reflect the
messy structure of these texts. In order to alleviate the issue, we
propose to apply topic modeling on finer segments of text instead of the
entire document.
Example of problematic documents (the queried term is highlighted in red):
view_document("SPSP193602290401", "shunpao", query = '"扶輪社"')
## NULL
In the above example, the length of the targeted segment is 56
characters, whereas the total length of the “article” is 9169
characters.
Instead of retrieving entire documents, therefore, we retrieve shorter strings of characters using the concordance function included in the histtext package (search_concordance_ex), which returns the queried terms in their context. The main challenge at this stage is to define the right context size. After a careful examination of a sample of articles, we set the context size to 100 characters to minimize the risk of overlap between contexts when an article contains several occurrences of the queried terms:
rotaryzh_conc100 <- search_concordance_ex('"扶輪社" NOT "國學扶輪社"',
                                          context = 100, corpus = "shunpao",
                                          dates = "[1919 TO 1947]")
head(rotaryzh_conc100)
The concordance table contains seven columns: the unique identifier of
the document (DocId), the date of publication, the title of the article
(Title), the name of the periodical (Source), the queried terms
(Matched), and the text preceding (Before) and following (After) the
queried terms.
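To make the Before, Matched and After columns and the choice of context size more tangible, here is a minimal base R sketch of a keyword-in-context window. The helper kwic_window is hypothetical and is not part of histtext; it only mimics the general idea on a toy string.

```r
# Hypothetical helper (not part of histtext): for each occurrence of `term`
# in `text`, return up to `context` characters before and after the match.
kwic_window <- function(text, term, context = 100) {
  starts <- as.integer(gregexpr(term, text, fixed = TRUE)[[1]])
  if (starts[1] == -1) {
    # No match: return an empty table with the same columns.
    return(data.frame(Before = character(0), Matched = character(0),
                      After = character(0), stringsAsFactors = FALSE))
  }
  ends <- starts + nchar(term) - 1
  data.frame(
    Before  = substring(text, pmax(starts - context, 1), starts - 1),
    Matched = substring(text, starts, ends),
    After   = substring(text, ends + 1, pmin(ends + context, nchar(text))),
    stringsAsFactors = FALSE
  )
}

# Toy string with two occurrences of the queried term, seven characters apart.
toy <- "aaaaa KEY bbbbb KEY ccccc"
kwic_window(toy, "KEY", context = 6)
kwic_window(toy, "KEY", context = 3)
```

With a context of 6, the After segment of the first match and the Before segment of the second both contain "bbbbb", so the same text would be counted twice. With a context of 3 the two windows no longer overlap. The same trade-off applies when an article contains several occurrences of the queried terms.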
We can first count the number of occurrences per article:
rotaryzh_conc100 <- rotaryzh_conc100 %>% group_by(DocId) %>% add_tally()
rotaryzh_conc100 %>% arrange(desc(n))
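Because the table above keeps one row per occurrence, documents with many occurrences appear many times in the sorted output. As a complement, the following base R sketch shows one way to summarize occurrences per document on a toy table. The identifiers are invented, and the sketch does not use the real concordance table.

```r
# Toy concordance table (invented identifiers), one row per occurrence.
toy_conc <- data.frame(
  DocId = c("doc_a", "doc_a", "doc_a", "doc_b", "doc_c", "doc_c"),
  stringsAsFactors = FALSE
)

# Occurrences per document, sorted from most to least frequent.
per_doc <- sort(table(toy_conc$DocId), decreasing = TRUE)
per_doc

# Number of documents with 1, 2, 3, ... occurrences.
table(as.vector(per_doc))
```

The second table gives the distribution of occurrences across documents, which helps to see how many articles mention the queried terms only once.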