The golden age of the returned students in China (1)

Building a corpus from the English-language press

Cécile Armand

2021-05-20

Abstract

This tutorial develops a complete workflow for building, exploring and analyzing a corpus of historical newspapers using the enpchina package and additional packages available in RStudio. It relies on a concrete case study - the emergence of the “returned students” as a specific social group and their contribution to the post-imperial reconstruction of China (pre-1949). We argue that the quantitative and qualitative analysis of historical newspapers may contribute to a new understanding of the returned students’ role in Chinese history, by enlarging the picture and transcending the biographical or sector-based perspective that has dominated the scholarship so far. This experiment is based on the ProQuest Chinese Newspapers Collection (CNC).

Prologue

This tutorial develops a complete workflow for building, exploring and analyzing a corpus of historical newspapers using the enpchina package and additional packages available in RStudio.

This tutorial relies on a concrete case study: the emergence of 留美學生 liumei xuesheng (American returned students) as a specific social group and their contribution to the “reconstruction” of China during the post-imperial period (pre-1949). We argue that the quantitative and qualitative analysis of historical newspapers may contribute to a new understanding of the returned students’ role in Chinese history. It allows us to enlarge the picture and to move beyond the existing scholarship, which has so far tended to focus on the anecdotal (biographies or sector-based studies).

For this case study, we focus on English-language periodicals from the ProQuest Chinese Newspapers Collection (CNC). In a future tutorial, we will also explore a major Chinese newspaper - the Shenbao - which has been made available in full text by the ERC-funded project “Elites, Networks and Power in modern China” (ENP-China).

This tutorial relies on the enpchina package, which the ENP-China team developed specifically for exploring these massive corpora of historical newspapers. As the tutorial progresses, we will also use a variety of packages available in the RStudio environment. The major advantage of RStudio is that it offers an integrated framework for open-ended exploration, experimentation, analysis and manipulation of the data initially extracted with the enpchina package.

Throughout the tutorial you will learn how to:

Build a corpus
Explore and analyze its textual content using various methods (word frequency, collocation, sentiment analysis, topic modeling, text reuse)
Extract named entities and map their networks
Develop more advanced techniques for working with metadata (classification of articles) and coping with some major digitization problems (OCR, text segmentation).

This tutorial focuses on the first step: how to build a corpus from a keyword-based query and conduct a preliminary exploration of this corpus.

Obtain the list of articles (search_documents)

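We try different keywords to find articles that mention returned students:

library(enpchina) # search_documents(), search_concordance(), count_documents(), get_documents()
library(DT)       # datatable() for interactive tables
rs_docs <- search_documents('"returned students" | "returned student" | "returnee"', "proquest")
datatable(rs_docs)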

We obtain 2739 documents spanning from 1841 to 1953.
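
These figures can be checked directly on the result table. The following is a minimal sketch on a toy data frame with invented rows. It assumes, as the later code in this tutorial suggests, that the table has a Date column stored as a string whose first four characters are the year.

```r
# Toy stand-in for rs_docs (invented rows, for illustration only).
# Assumption: Date is a string that starts with a four-digit year.
toy_docs <- data.frame(
  Date   = c("1884-03-12", "1921-06-01", "1953-01-15"),
  Source = c("Peking Gazette", "The Canton Times", "The Peking Leader"),
  stringsAsFactors = FALSE
)

n_docs <- nrow(toy_docs)                           # number of documents
years  <- as.integer(substr(toy_docs$Date, 1, 4))  # extract the year
span   <- range(years)                             # earliest and latest year

n_docs
span
```

Applied to the real table, the same two calls would give the number of documents and the period covered.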

Concordance Table (search_concordance)

rs_conc <- search_concordance('"returned students" | "returned student" | "returnee"', corpus = "proquest", context_size = 100)
datatable(rs_conc)

The concordance table contains 3832 occurrences, which means that returned students may be mentioned several times in the same article. It also reveals that the first four mentions (prior to 1884) result from OCR errors, so we will remove them from the list. The first actual mention of returned students appeared in 1884.
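
One possible way to drop the early OCR errors and to check for repeated mentions is sketched below on a toy table with invented rows. The column names DocId and Date are assumptions made for illustration; the actual names in rs_conc should be checked with names(rs_conc).

```r
library(dplyr)

# Toy stand-in for rs_conc (invented rows, for illustration only).
# DocId and Date are hypothetical column names.
toy_conc <- tibble(
  DocId = c("a", "b", "c", "c", "d"),
  Date  = c("1841-02-01", "1860-07-15", "1884-03-12", "1884-03-12", "1921-06-01")
)

# Keep only the occurrences dated 1884 or later
toy_conc_clean <- toy_conc %>%
  mutate(Year = as.integer(substr(Date, 1, 4))) %>%
  filter(Year >= 1884)

# Count occurrences per document to spot articles with several mentions
per_doc <- toy_conc_clean %>%
  count(DocId, sort = TRUE)

per_doc
```

Filtering by year works here only because all the spurious hits predate the first genuine mention; otherwise the rows would have to be removed individually.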

Distribution

Distribution over time (count_documents)

The distribution is uneven over time. Returned students were barely mentioned in the late 19th century:

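library(dplyr)     # %>%, mutate(), group_by(), summarise(), filter()
library(ggplot2)   # ggplot(), geom_col(), labs()
library(lubridate) # year()
enpchina::count_documents('"returned students" | "returned student" | "returnee"', "proquest") %>%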
  mutate(Date=lubridate::as_date(Date,"%y%m%d")) %>% 
  mutate(Year= year(Date)) %>%
  group_by(Year) %>% summarise(N=sum(N)) %>% 
  filter(Year >= 1884) %>%
  ggplot(aes(Year,N)) + 
  geom_col(alpha = 0.8) +
  labs(title = "The returned students in the English-language press (1884-1953)", 
       subtitle = "Distribution of articles over time", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"",
       y = "Number of articles", size = 2)

Mentions of returned students began to emerge in the first decades of the 20th century and peaked in the 1920s. They remained frequent until the Sino-Japanese War (1937-1945), declined and even disappeared in the early 1940s, then briefly reappeared in the postwar years.

enpchina::count_documents('"returned students" | "returned student" | "returnee"', "proquest") %>% 
  mutate(Date=lubridate::as_date(Date,"%y%m%d")) %>% 
  mutate(Year= year(Date)) %>%  
  group_by(Year) %>% summarise(N=sum(N)) %>% 
  filter(Year > 1890) %>%
  ggplot(aes(Year,N)) + 
  geom_col(alpha = 0.8) +
  labs(title = "The returned students in the English-language press (1890-1953)", 
       subtitle = "Distribution of articles over time", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"",
       y = "Number of articles", size = 2)
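
To read the trend numerically rather than from the plot, the yearly counts can be aggregated by decade. This is a minimal sketch on invented numbers; it assumes a table with the same Year and N columns as the one built in the pipelines above.

```r
library(dplyr)

# Toy yearly counts (invented numbers, for illustration only)
toy_counts <- tibble(
  Year = c(1905, 1912, 1921, 1925, 1928, 1934, 1947),
  N    = c(2, 10, 40, 55, 60, 30, 5)
)

by_decade <- toy_counts %>%
  mutate(Decade = (Year %/% 10) * 10) %>%  # e.g. 1921 becomes 1920
  group_by(Decade) %>%
  summarise(N = sum(N)) %>%
  arrange(desc(N))                         # busiest decade first

by_decade
```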

Distribution across periodicals

rs_docs %>% 
  group_by(Source) %>% 
  count(sort=TRUE)

The returned students were mostly mentioned in three major periodicals that were based in Shanghai but circulated nationwide: the American China Weekly Review (761), the British North-China Herald (740) and the Chinese-American China Press (719). They also appeared quite frequently in the missionary Chinese Recorder (132), in the Canton Times (112) and the Shanghai Times (88). Despite the importance of Tsinghua College in preparing students departing to the United States, returned students were less often mentioned in Beijing-based periodicals - Peking Daily News (65), Peking Gazette (59) and Peking Leader (30). This, however, also reflects the fact that the collection of Beijing periodicals available through ProQuest is far less complete than that of the Shanghai press.
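
The weight of each periodical is easier to compare as a share of the whole corpus. The sketch below simply reuses the counts quoted in the paragraph above and the total of 2739 documents reported earlier; the labels are shorthand and not necessarily the exact strings stored in the Source column.

```r
# Counts quoted in the text (labels are shorthand, not the exact Source values)
counts <- c(
  "China Weekly Review" = 761,
  "North-China Herald"  = 740,
  "China Press"         = 719,
  "Chinese Recorder"    = 132,
  "Canton Times"        = 112,
  "Shanghai Times"      = 88,
  "Peking Daily News"   = 65,
  "Peking Gazette"      = 59,
  "Peking Leader"       = 30
)
total_docs <- 2739  # size of the corpus reported earlier

# Share of each periodical, in percent of all documents
share <- round(100 * counts / total_docs, 1)

# Combined share of the three largest periodicals
top3_share <- round(100 * sum(counts[1:3]) / total_docs, 1)

share
top3_share
```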

Periodicals over time

library(scales) # provides pretty_breaks()
rs_docs %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  filter(Year > 1890) %>% 
  group_by(Year) %>% 
  count(Source) %>%
  ggplot(aes(x=Year, y=n, fill=Source)) + 
  geom_col()  + 
  scale_x_discrete(breaks = pretty_breaks())  +
  labs(title = "The returned students in the English-language press", 
       subtitle = "Distribution of articles by periodical", 
       fill = "Periodical",
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"",
       y = "Number of articles", size = 2)

The North-China Herald mentioned returned students throughout the entire period. Mentions in the Peking Daily News, Peking Gazette and Canton Times were restricted to the 1910s, and those in the two local newspapers Shanghai Gazette and Shanghai Times to the mid-1910s and early 1920s (which reflects the state of the collection). Returned students began to appear in the China Weekly Review and the China Press in the 1920s (which reflects the development of the American press in China), and these two periodicals dominated until the end of the period.

Let’s focus on the three most important periodicals:
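
The periods of coverage described above can also be tabulated by taking the first and last year in which each periodical appears. This is a minimal sketch on invented rows, assuming the Source and Date columns used elsewhere in this tutorial.

```r
library(dplyr)

# Toy stand-in for rs_docs (invented rows, for illustration only)
toy_docs <- tibble(
  Source = c("Peking Gazette", "Peking Gazette", "The Canton Times",
             "The Peking Leader", "The Peking Leader"),
  Date   = c("1915-04-02", "1917-09-20", "1919-01-05",
             "1918-11-11", "1929-03-03")
)

coverage <- toy_docs %>%
  mutate(Year = as.integer(substr(Date, 1, 4))) %>%
  group_by(Source) %>%
  summarise(First = min(Year), Last = max(Year), Articles = n()) %>%
  arrange(First)  # earliest periodical first

coverage
```

Such a table only shows when a periodical mentions returned students within the collection; it cannot distinguish editorial silence from gaps in digitization.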

# Keep the periodicals with more than 700 articles (the three largest)
top3 <- rs_docs %>% 
  group_by(Source) %>%
  count() %>%
  filter(n > 700)
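
The threshold of 700 works for this particular query but would need adjusting for another one. As an alternative, not used in this tutorial, the three largest sources can be selected by rank with dplyr's slice_max(). The sketch below runs on invented data.

```r
library(dplyr)

# Toy stand-in for rs_docs: one row per article (invented data)
toy_docs <- tibble(
  Source = c(rep("A", 5), rep("B", 4), rep("C", 3), rep("D", 1))
)

top3_toy <- toy_docs %>%
  count(Source) %>%     # one row per source, with its count in column n
  slice_max(n, n = 3)   # keep the three sources with the highest counts

top3_toy
```

Note that slice_max() keeps ties by default, so it may return more than three rows when several sources share the third-highest count.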

rs_docs %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  filter(Year > 1890) %>% 
  group_by(Year) %>% 
  count(Source) %>%
  filter(Source %in% top3$Source)%>%
  ggplot(aes(x=Year, y=n, fill=Source)) + 
  geom_col()  + 
  scale_x_discrete(breaks = pretty_breaks())  +
  labs(title = "The returned students in the English-language press", 
       subtitle = "Distribution of articles by periodical over time", 
       fill = "Three top periodicals",
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"",
       y = "Number of articles", size = 2)

rs_docs %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  filter(Year > 1890) %>% 
  filter(Source %in% c("Peking Daily News", "Peking Gazette", "The Peking Leader", "The Canton Times"))%>%
  group_by(Source, Year) %>% 
  count() %>%
  ggplot(aes(x=Year, y=n, fill=Source)) + 
  geom_col()  + 
  scale_x_discrete(breaks = pretty_breaks())  +
  labs(title = "The returned students in the English-language press (1890-1953)", 
       subtitle = "Distribution of articles by periodical over time (outside of Shanghai)", 
       fill = "Non-Shanghai Periodicals",
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"",
       y = "Number of articles", size = 2)

Retrieve full texts

rs_docs_ft <- enpchina::get_documents(rs_docs, "proquest")

## [1] 1
## [1] 501
## [1] 1001
## [1] 1501
## [1] 2001
## [1] 2501

datatable(rs_docs_ft)

Save the results

At this stage, it is recommended to save and export the results as CSV files:

write.csv(rs_docs, "rs_docs.csv", row.names = FALSE) # list of documents
write.csv(rs_conc, "rs_conc.csv", row.names = FALSE) # concordance table
write.csv(rs_docs_ft, "rs_full_text.csv", row.names = FALSE) # full texts
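
Before moving on, it can be worth confirming that an exported file reads back as expected. The following is a minimal round-trip sketch on a toy data frame written to a temporary file; the rows and the file name are invented for illustration.

```r
# Toy stand-in for rs_docs (invented rows, for illustration only)
toy_docs <- data.frame(
  Date   = c("1884-03-12", "1921-06-01"),
  Source = c("Peking Gazette", "The Canton Times"),
  stringsAsFactors = FALSE
)

# Write to a temporary file; row.names = FALSE avoids an extra index column
path <- file.path(tempdir(), "toy_docs.csv")
write.csv(toy_docs, path, row.names = FALSE)

# Read the file back and compare its shape with the original
reloaded <- read.csv(path, stringsAsFactors = FALSE)
same_shape <- identical(dim(reloaded), dim(toy_docs))

same_shape
```

By default, write.csv() also writes the row names as an unnamed first column, which then reappears as an extra column when the file is read back.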

In the next section, we will see how we can take advantage of various R packages for text analysis to delve deeper into the corpus.

To be continued…