Abstract
This guide aims to demonstrate how China historians can take advantage of the “enpchina” package to explore massive corpora of historical newspapers - i.e. ProQuest “Chinese Newspapers Collection” - taking the Rotary Club of Shanghai as a case study.
This guide aims to present the various functions available in the “enpchina” R package. This package was developed by the European Research Council (ERC) project “Elites, Networks and Power in modern China” (ENP-China), especially by our computational linguist Pierre Magistry.
This package was designed to enable China historians to explore massive corpora of historical newspapers. Ultimately, we hope to demonstrate how historians can harness digital techniques to explore historical sources at an unprecedented scale and develop alternative approaches to historical research. Such practices do not displace qualitative analyses but complement and contextualize the close reading of historical documents.
The “enpchina” package runs in RStudio. We chose R and RStudio because the R language is relatively simple to handle for novice programmers (compared to Python, for instance) and because RStudio, a widely used Integrated Development Environment (IDE), provides a single environment for performing the complete chain of operations, from extraction to analysis.
This guide addresses historians with no prior knowledge of programming. We provide the code only for the sake of traceability, and its display is optional: readers can skip the code and focus on the results and analyses. For technical details on the development of the package, please refer to the dedicated GitHub page (forthcoming).
This tutorial will demonstrate how historians can use the various functions included in the “enpchina” package in order to master the complete chain of operations, from the extraction of historical information to the analysis and interpretation of the extracted data. In the following, we provide only basic analyses and interpretations of the results. For this demonstration, we will rely on a concrete case study: the Rotary Club of Shanghai (Shanghai fulunshe 上海扶輪社). Moreover, this guide applies specifically to the English-language press (ProQuest collection of Historical Newspapers: Chinese Newspapers Collection, CNC). Another tutorial is devoted to the Chinese-language daily Shenbao 申報.
In this guide, we will use the various functions available in the “enpchina” package in order to perform the complete chain of operations from extracting historical information to analyzing and interpreting the data extracted:
First, we need to load the “enpchina” package (and other necessary packages) in RStudio:
library(enpchina)
library(dplyr)
library(lubridate)
library(ggplot2)
library(tidygraph)
library(igraph)
library(tidyr)
library(stringr)
library(htmlwidgets)
library(htmltools)
library(webshot)
library(scales) # provides pretty_breaks(), used in the faceted plots below
The basic command below lists the corpora available with the “enpchina” package:
enpchina::list_corpora()
## [1] "kmt9k" "imh" "wikibio-zh" "proquest" "wikibio-en"
## [6] "shunpao" "shunpao2" "waiguozaihua" "wikibio"
The corpus we are interested in for this tutorial is “proquest”.
The Chinese Newspapers Collection (CNC) on ProQuest contains twelve English-language newspapers published in China from 1832 to 1953. Most of them were published in Shanghai, but the collection also includes one newspaper from Guangzhou (Canton Times) and three from Beijing (Peking Daily News, Peking Gazette, Peking Leader). Their temporal and spatial scope varies greatly. This study focuses on three major Shanghai-based periodicals that had the longest lifespan and the widest circulation: the British pioneer North-China Herald (1850-1941), the American China Weekly Review (1917-1953) (formerly Millard’s Review of the Far East) and the Sino-American China Press (Dalubao 大陸報) (1925-1938). Although based in Shanghai, their readership was truly national in scope. The North-China Herald reached a circulation of 10,000 in the early 1930s, with an increasing number of Chinese readers (1933 Circulation Census). The corpus also includes the Hong Kong-based South-China Morning Post (1903-2001).
Founded in Chicago in 1905, the Rotary Club was introduced in China after World War I. The organization aimed to enable business and professional men to socialize and legitimize their position by promoting higher standards of business and devising a new ethics of public service. The first club was established in Shanghai in 1919 - Shanghai fulunshe 上海扶輪社 - with some twenty members, exclusively foreigners. Its membership increased to over 150 and Chinese members became the majority before the Sino-Japanese war (1937-1945). In 1948, a Chinese-speaking club - the Rotary Club of Shanghai West (Huxi fulunshe 滬西扶輪社) - was established in parallel with the original fulunshe in order to incorporate non-English speakers. After the Communist takeover and the founding of the People’s Republic of China (PRC), the organization was terminated in 1952.
Let’s start with a broad picture of the presence of the Rotary Club of Shanghai in the English-language press. The following graph plots the number of articles that mention the Rotary Club of Shanghai in the ProQuest Collection of Chinese Newspapers between 1919 and 1948, using two keywords - “Shanghai Rotary” and “Rotary Club of Shanghai”.
enpchina::count_documents('"Shanghai Rotary" | "Rotary Club of Shanghai"', "proquest") %>%
mutate(Date=lubridate::as_date(Date,"%y%m%d")) %>%
mutate(Year= year(Date)) %>%
group_by(Year) %>% summarise(N=sum(N)) %>%
filter (Year>=1919 & Year<=1950) %>%
ggplot(aes(Year,N)) + geom_col(alpha = 0.8) +
labs(title = "The Rotary Club of Shanghai in the English-language press",
subtitle = "Number of articles mentioning the club",
x = "Year",
y = "Number of articles")
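The aggregation step of this pipeline can be tried out without access to the ENP-China server. The sketch below is a minimal illustration on a made-up table: the column names `Date` and `N` follow the code above, but the ISO date format and the counts are assumptions, not actual output of `count_documents()`.

```r
library(dplyr)
library(lubridate)

# Hypothetical daily counts; the real table comes from enpchina::count_documents()
daily_counts <- tibble(
  Date = c("1925-05-30", "1925-06-02", "1936-03-12", "1936-11-05", "1951-01-10"),
  N    = c(2, 1, 4, 3, 1)
)

yearly_counts <- daily_counts %>%
  mutate(Year = year(ymd(Date))) %>%          # parse the date, then keep the year
  group_by(Year) %>%
  summarise(N = sum(N), .groups = "drop") %>% # one total per year
  filter(Year >= 1919, Year <= 1950)          # same window as in the plot above

yearly_counts
```

The resulting table has one row per year and can be passed directly to `ggplot()` as in the original chunk.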
Distribution over time
The English-language press covers almost the entire 30 years of existence of the Rotary Club of Shanghai, from its founding in 1919 until the early years of the People’s Republic of China (1949-1952). Its presence spans from 1919 to 1948, yet with great variations over time. The club remained discreet during the first decade, with fewer than 25 occurrences per year, except in 1925, which coincided with the May Thirtieth Incident in Shanghai and was also a year during which the club was particularly active (American School Scholarship). Its publicity increased in the 1930s, with an average of 75 articles per year, and reached a peak in 1936 (which coincides with the Annual Convention of Rotary and a turning point in the history of the organization at both the local level - Berents’ resignation as president of the Shanghai Club - and the international level - Wang Zhengting’s resignation as district governor). It declined during the Sino-Japanese war (1937-1945), which also reflects the suspension of most newspapers during the war. Overall, the growing presence of the club in the press during the 1930s corresponds to its most active period and the most significant increase in Chinese membership. From a comparative perspective, the Rotary Club of Shanghai received much more attention and space in the English-language press than in the Chinese-language press (Shenbao) (cf. Case 1). Furthermore, it appeared earlier in English newspapers and its presence was more continuous than in the Shenbao.
One might object that the results are biased by the fact that the ProQuest collection comprises several newspapers, whereas the Chinese press in our research is represented exclusively by the Shenbao. As we will see below, however, the Rotary Club was mostly mentioned in three main English periodicals, which considerably reduces the presumed bias.
Distribution across publications
The graph below reveals that the coverage of the Rotary Club in the ProQuest collection varied greatly between periodicals.
search_documents('"Shanghai Rotary"| "Rotary Club of Shanghai"', "proquest") %>%
mutate(Year = stringr::str_sub(Date,0,4)) %>%
group_by(Year) %>% count(Source) %>%
ggplot(aes(x=Year, y=n, fill=Source)) +
geom_col(alpha = 0.8) +
scale_x_discrete(breaks = c(1919, 1925, 1930, 1935, 1940, 1948)) +
theme(legend.position="bottom",
legend.text = element_text(size=8)) +
labs(title = "The Rotary Club of Shanghai in the English-language press (1919-1948)",
subtitle = "Distribution of mentions between newspapers",
caption = "based on data extracted from the ProQuest collection 'Chinese Historical Newspapers'",
y = "Number of articles mentioning the club")
As the table below demonstrates, the Rotary Club was principally mentioned in three major periodicals: the China Press (583 articles, 62%), the North-China Herald (209, 22%) and the China Weekly Review (137, 15%).
by_source <- search_documents('"Shanghai Rotary"| "Rotary Club of Shanghai"', "proquest") %>%
group_by(Source) %>%
count() %>%
arrange(desc(n))
by_source
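The percentages quoted above can be recomputed from the counts. The sketch below is a minimal check that uses the three article counts given in the text and the total of 936 documents reported for the query in the corpus-building section. The labels in the `Source` column are illustrative and may not match the exact strings stored in the corpus.

```r
library(dplyr)

# Article counts quoted in the text; the Source labels are illustrative
by_source_counts <- tibble(
  Source = c("The China Press", "The North-China Herald", "The China Weekly Review"),
  n      = c(583, 209, 137)
)
total_articles <- 936  # size of the table of documents returned by the query

by_source_share <- by_source_counts %>%
  mutate(share = round(100 * n / total_articles))  # share of all articles, in percent

by_source_share

# Articles published in the other titles of the collection
other_articles <- total_articles - sum(by_source_counts$n)
other_articles
```

With the real data, `total_articles` would simply be `sum(by_source$n)` computed on the full table rather than a hard-coded figure.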
The graph below reveals that the distribution in each periodical also varies greatly over time. Rotary’s presence was more continuous in the China Press than in the two other papers. It covers the years 1925-1937 with no interruption. The China Weekly Review is the only periodical that covered the earlier years of the club, whereas the North-China Herald (NCH) only really began to publicize its activities in the late 1920s (the only mention before that date relates to the founding of the club in 1919). Interestingly, this coincides with the incorporation in 1927 of F.C. Millington (manager of the British advertising agency Millington, Ltd), who maintained strong connections with the British newspaper. The NCH is also the only publication to cover the wartime period.
search_documents('"Shanghai Rotary"| "Rotary Club of Shanghai"', "proquest") %>%
mutate(Year = stringr::str_sub(Date,0,4)) %>%
group_by(Year) %>% count(Source) %>%
filter(n>3) %>%
ggplot(aes(x=Year, y=n, fill=Source)) +
geom_col(alpha = 0.8) +
facet_wrap(~ Source, nrow = 3)+
scale_y_continuous(breaks = pretty_breaks())+
scale_x_discrete(breaks = c(1919, 1925, 1930, 1935, 1940, 1948))+
theme(legend.position="bottom",
legend.text = element_text(size=8))+
labs(title = "The Rotary Club of Shanghai in the English-language press (1919-1948)",
subtitle = "Distribution of articles between three major periodicals",
caption = "based on data extracted from the ProQuest database 'Chinese Historical Newspapers'",
y = "Number of articles mentioning the club")
search_documents('"Shanghai Rotary"| "Rotary Club of Shanghai"', "proquest") %>%
mutate(Year = stringr::str_sub(Date,0,4)) %>%
group_by(Year) %>% count(Source) %>%
filter(n>3) %>%
ggplot(aes(x=Year, y=n)) +
geom_col(alpha = 0.8) +
facet_wrap(~ Source, nrow = 3)+
scale_y_continuous(breaks = pretty_breaks())+
scale_x_discrete(breaks = c(1919, 1925, 1930, 1935, 1940, 1948))+
theme(legend.position="bottom",
legend.text = element_text(size=8))+
labs(title = "The Rotary Club of Shanghai in the English-language press (1919-1948)",
subtitle = "Distribution of articles between three major periodicals",
caption = "based on data extracted from the ProQuest database 'Chinese Historical Newspapers'",
y = "Number of articles mentioning the club")
# 2. Build the corpus
Next, we propose to build a corpus containing all the articles that mention the Rotary Club of Shanghai in the English-language press. We proceed in two steps: 1. We list all the articles mentioning the club with their metadata. 2. We retrieve the full text of the articles.
The function “search_documents()” produces the list of unique documents (articles) mentioning the club, regardless of the number of occurrences in each document. The command below produces the table of results (docs). For the sake of legibility, only the first ten results are displayed:
docs <- search_documents('"Shanghai Rotary"| "Rotary Club of Shanghai"', "proquest")
docs
The table of documents contains four columns: the identifier of the article (Id), its date of publication (Date), its title (Title), and the name of the periodical (Source).
The function “search_concordance()” produces the concordance table (conc), which contains all mentions of the club and helps situate each occurrence in its context. In this example, we chose to retain 30 characters on either side of the keyword, but one can easily increase or decrease the length of the segment by modifying the number after “context_size =”.
conc <- search_concordance('"Shanghai Rotary"| "Rotary Club of Shanghai"', corpus = "proquest", context_size = 30)
conc
Compared to the table of documents, the concordance table contains three additional columns:
The table of documents contains 936 results (unique articles), whereas the concordance table contains 1054 results, which means that the club may be mentioned several times in the same article.
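To see why a concordance table has more rows than a table of documents, the sketch below rebuilds a tiny keyword-in-context table by hand with `stringr`. It is only an illustration of the principle on made-up snippets: the column names `Before`, `Matched` and `After` are hypothetical and do not necessarily match the output of `search_concordance()`.

```r
library(dplyr)
library(stringr)

# Made-up article snippets; the real texts are held on the ENP-China server
toy_docs <- tibble(
  Id   = c("a1", "a2"),
  Text = c(
    "The Shanghai Rotary met on Thursday. The Rotary Club of Shanghai then elected its officers.",
    "A speaker addressed the Rotary Club of Shanghai at the weekly tiffin."
  )
)

pattern      <- "Shanghai Rotary|Rotary Club of Shanghai"
context_size <- 30  # number of characters kept on each side of the keyword

hits <- str_locate_all(toy_docs$Text, pattern)  # one matrix of positions per document

toy_conc <- bind_rows(lapply(seq_along(hits), function(i) {
  m <- hits[[i]]
  tibble(
    Id      = toy_docs$Id[i],
    Before  = str_sub(toy_docs$Text[i], pmax(m[, "start"] - context_size, 1), m[, "start"] - 1),
    Matched = str_sub(toy_docs$Text[i], m[, "start"], m[, "end"]),
    After   = str_sub(toy_docs$Text[i], m[, "end"] + 1, m[, "end"] + context_size)
  )
}))

nrow(toy_docs)  # number of documents
nrow(toy_conc)  # number of occurrences
toy_conc
```

The first snippet contains two mentions, so the toy concordance has one more row than the toy list of documents, which mirrors the gap between 936 documents and 1054 occurrences observed above.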
Let’s have a closer look at the content of the articles. For this purpose, we will retrieve the full text of the documents and create a sub-corpus on this basis. The table below displays only the first ten documents:
corpus_with_fulltexts <- enpchina::get_documents(docs, "proquest")
## [1] 1
## [1] 501
corpus_with_fulltexts
The above table contains the same columns as in the list of documents, plus an additional column (“Text”) that contains the full text of each article.
Furthermore, the function “proquest_view()” provides access to the original article, based on its unique identifier:
# proquest_view(1369908142) # restricted access to authorized users only (login and password required)
At this stage, we recommend saving and exporting the results as CSV files.
write.csv(docs, "docs.csv") # list of documents
write.csv(conc, "concordance.csv") # concordance table
write.csv(corpus_with_fulltexts, "fulltext.csv") # full texts
Who was related to the Rotary Club of Shanghai? Named entity recognition (NER) is one tool that can help answer the question. The function “ner_on_corpus()” included in the package allows users to extract the name of all the persons, organizations, places and events mentioned in any corpus of documents. The algorithm used for detecting named entities in the ProQuest corpus relies on OntoNotes. It identifies eight distinct types of entity: person, organization, location, date, time, money, percentage, and miscellaneous entities.
In order to identify the actors associated with the Rotary Club of Shanghai, we apply this function to the corpus of articles mentioning the Rotary Club in ProQuest.
docs_pq <- search_documents('"Shanghai Rotary"| "Rotary Club of Shanghai"', "proquest")
full_text_pq <- enpchina::get_documents(docs_pq, "proquest") %>% rename(Id = "DocID")
ner_results_pq <- enpchina::ner_on_corpus(full_text_pq, "proquest")
ner_results_pq
The table above lists all the entities recognized in our “Rotary” corpus (only the first ten results are displayed). In addition to the document metadata (DocID), the table of results contains four new columns:
In the “Rotary Club” corpus, “persons” represent the largest category (31%), followed by organizations (26%) and locations (25%). Miscellaneous entities represent 10% and temporal or numerical entities (date, time, money, percent) less than 10%.
ner_count_pq <- ner_results_pq %>%
group_by(Type) %>%
count() %>%
arrange(desc(n))
ner_count_pq
ner_results_pq %>%
group_by(Type) %>%
count() %>%
ggplot(aes(reorder(Type, n), n)) + geom_col(alpha = 0.8) +
labs(title = "The network of the Rotary Club of Shanghai in the English-language press",
subtitle = "Named entities associated with the club, per category",
x = "Type of entity",
y = "Number of entities",
caption = "Based on the ProQuest collection of Chinese Historical Newspapers")
If we retain only distinct entities, we obtain a different distribution. Organizations now rank first (39%), followed by persons (33%) and locations (14%). Organizations are less likely to be repeated than individual persons, who may be mentioned a greater number of times. Other entities total only 12%.
ner_u_count <- ner_results_pq %>%
group_by(Type) %>%
distinct(Text) %>%
count() %>%
arrange(desc(n))
ner_u_count
ner_results_pq %>%
group_by(Type) %>%
distinct(Text) %>%
count() %>%
ggplot(aes(reorder(Type, n), n)) + geom_col(alpha = 0.8) +
labs(title = "The network of the Rotary Club of Shanghai in the English-language press",
subtitle = "Unique named entities, per category",
x = "Type of entity",
y = "Number of entities",
caption = "Based on the ProQuest collection of Chinese Historical Newspapers")
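The difference between counting all mentions and counting distinct entities is easy to see on a small example. The sketch below uses a made-up table with the `Type` and `Text` columns used above. The names are taken from this guide, but the rows themselves are invented for illustration.

```r
library(dplyr)

# Made-up NER output with the Type and Text columns used above
toy_ner <- tibble(
  Type = c("PERSON", "PERSON", "PERSON", "ORGANIZATION", "ORGANIZATION", "LOCATION"),
  Text = c("George Fitch", "George Fitch", "Fong Sec", "Rotary Club", "Metropole Hotel", "Shanghai")
)

# All mentions: a repeated name counts every time it appears
mentions <- toy_ner %>% count(Type, name = "mentions")

# Distinct entities: each name counts once per type
unique_entities <- toy_ner %>% distinct(Type, Text) %>% count(Type, name = "entities")

comparison <- left_join(mentions, unique_entities, by = "Type") %>%
  mutate(share_mentions = round(100 * mentions / sum(mentions)),
         share_entities = round(100 * entities / sum(entities)))

comparison
```

In this toy table, persons dominate the mentions because one name is repeated, whereas persons and organizations are tied once duplicates are removed. This is the same mechanism that changes the ranking in the real corpus.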
We can also visualize the distribution over time:
ner_results_pq_y <- ner_results_pq %>%
mutate(DocID = as.character(DocID)) %>%
rename(Id = DocID)
ner_results_pq_y <- full_join(docs_pq, ner_results_pq_y) %>%
mutate(Year = stringr::str_sub(Date,0,4)) %>%
na.omit()
ggplot(data = ner_results_pq_y) +
geom_bar(mapping = aes(x = Year, fill = Type), alpha = 0.8)+
scale_x_discrete(breaks = c(1920, 1925, 1930, 1935, 1940))+
labs(title = "The Rotary Club of Shanghai in the English-language press",
subtitle = "Distribution of named entities over time (1919-1948)",
x = "Year",
y = "Number of entities",
fill = "Type of Entity",
caption = "Based on the ProQuest collection of Chinese Historical Newspapers")
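The join that attaches a year to each entity can be sketched on toy data. In the example below, the identifiers, the date format and the rows are all assumptions made for illustration. The point is only to show the key being converted to a common type and named explicitly in the join.

```r
library(dplyr)
library(stringr)

# Made-up document metadata and entities; the Date format is an assumption
toy_docs <- tibble(Id   = c("101", "102"),
                   Date = c("1925-05-30", "1936-03-12"))
toy_entities <- tibble(DocID = c(101, 101, 102),
                       Type  = c("PERSON", "LOCATION", "PERSON"),
                       Text  = c("George Fitch", "Shanghai", "Fong Sec"))

entities_by_year <- toy_entities %>%
  mutate(Id = as.character(DocID)) %>%  # give the key the same type in both tables
  select(-DocID) %>%
  inner_join(toy_docs, by = "Id") %>%   # naming the key avoids accidental joins on other columns
  mutate(Year = str_sub(Date, 1, 4)) %>%
  count(Year, Type)

entities_by_year
```

Unlike `full_join()` followed by `na.omit()`, `inner_join()` keeps only the entities that have a matching document, so no rows with missing values are created in the first place.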
ner_results_pq_y <- ner_results_pq %>%
mutate(DocID = as.character(DocID)) %>%
rename(Id = DocID)
ner_results_pq_y <- full_join(docs_pq, ner_results_pq_y) %>%
mutate(Year = stringr::str_sub(Date,0,4)) %>%
na.omit()
ggplot(data = ner_results_pq_y) +
geom_bar(mapping = aes(x = Year, fill = Type))+
facet_wrap(~ Type, nrow = 4)+
scale_x_discrete(breaks = c(1920, 1925, 1930, 1935, 1940))+
scale_y_continuous(breaks = pretty_breaks())+
labs(title = "The Rotary Club of Shanghai in the English-language press",
subtitle = "Distribution of named entities over time (1919-1948)",
x = "Year",
y = "Number of entities",
fill = "Type of Entity",
caption = "Based on the ProQuest collection of Chinese Historical Newspapers")
Let’s focus on the three most frequent types of entities:
ner_results_pq_main <- ner_results_pq_y %>%
filter(Type %in% c("ORGANIZATION", "PERSON", "LOCATION"))
ggplot(data = ner_results_pq_main) +
geom_bar(mapping = aes(x = Year, fill = Type), alpha = 0.8)+
scale_x_discrete(breaks = c(1920, 1925, 1930, 1935, 1940))+
labs(title = "The Rotary Club of Shanghai in the English-language press",
subtitle = "Distribution of the 3 main types of named entity over time",
x = "Year",
y = "Number of entities",
fill = "Type of Entity",
caption = "Based on the ProQuest collection of Chinese Historical Newspapers")
ner_results_pq_main <- ner_results_pq_y %>%
filter(Type %in% c("ORGANIZATION", "PERSON", "LOCATION"))
ggplot(data = ner_results_pq_main) +
geom_bar(mapping = aes(x = Year, fill = Type), alpha = 0.8)+
facet_wrap(~ Type, nrow = 3)+
scale_x_discrete(breaks = c(1920, 1925, 1930, 1935, 1940))+
labs(title = "The Rotary Club of Shanghai in the English-language press",
subtitle = "Distribution of the 3 main types of named entity over time",
x = "Year",
y = "Number of entities",
fill = "Type of Entity",
caption = "Based on the ProQuest collection of Chinese Historical Newspapers")
## Explore entities
Who (what persons) is most frequently associated with the club? In what position?
Which organizations are related to the club? Of what nature (private companies, schools or universities, administration, local or national government, philanthropic or voluntary associations)? What does this suggest about the activities of the club and its cooperation with other institutions?
Which locations and events are the most often mentioned in connection with the club - local meetings and places of meeting, or larger events and more abstract locations?
First, we build an index based on the number of occurrences for each entity (only the first ten results are displayed):
ner_index_pq <- ner_results_pq %>% group_by(Type, Text) %>%
tally() %>%
arrange(desc(n))
ner_index_pq
As shown in the table above, the algorithm extracts personal pronouns as “persons”. We need to remove these pronouns in order to obtain more accurate results:
ner_index_pq2 <- ner_index_pq %>%
filter(!Text %in% c("his", "her", "him", "she", "he", "him", "His", "Her", "She", "He"))
ner_index_pq2
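Listing every capitalization of each pronoun by hand is error-prone. As a possible alternative, the sketch below normalizes the text before comparing it with a short list of lowercase pronouns. The toy index and its counts are made up for illustration.

```r
library(dplyr)
library(stringr)

# Made-up index of entities with the Type, Text and n columns used above
toy_index <- tibble(
  Type = "PERSON",
  Text = c("He", "George Fitch", "HIS", "Fong Sec", "her "),
  n    = c(40, 12, 3, 9, 2)
)

pronouns <- c("he", "him", "his", "she", "her")

toy_index_clean <- toy_index %>%
  # trim stray spaces and lowercase the text before testing it against the list
  filter(!str_to_lower(str_trim(Text)) %in% pronouns)

toy_index_clean
```

Only the comparison is case-insensitive: the `Text` column itself is left untouched, so proper names keep their original capitalization.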
We can then filter by type and examine separately each category of entity.
ner_pers_pq <- ner_index_pq2 %>% filter(Type == "PERSON")
ner_pers_pq
The 10 most frequent persons include several Chinese and foreign leaders of the Rotary Club, such as E.F. Harris (founding member and president of the club in 1933), Wang Zhengting (a diplomat who held various positions in Rotary International), Kuang Fuzhuo (Fong Sec) (president of the club in 1932-1933 and Rotary International officer), and George Fitch (founding member and president of the club in 1931-1932). The most common patronyms, however, may include homonyms (Wang, Wu, Chen, Chang, Wong). Such cases call for a closer examination. Overall, these results contrast sharply with the list of persons drawn from the Shenbao, which does not include any Rotarian in the top 10 (Case 1).
Note: The algorithm mistook Hankow (a city) for a person.
ner_org_pq <- ner_index_pq2 %>% filter(Type == "ORGANIZATION")
ner_org_pq
Outside the Rotary Club, the organizations most frequently associated with Rotary include the Shanghai Municipal Council (162), the Navy (81), and the Metropole Hotel. The latter was the main meeting place of the Rotary Club in the 1930s. The club co-organized various events with the Navy and often invited Navy officers and members of the Shanghai Municipal Council as guests of honor at its special events. From a comparative perspective, the Shenbao and the English press share two major organizations: the Rotary Club and the Shanghai Municipal Council (Case 1).
Note: The algorithm mistook two cities (Hongkong, Qingdao) for organizations.
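Misclassified entities such as these can be reassigned with a small, hand-made correction table. The sketch below is one possible way to do it, not part of the package: the table `corrections` and the column `CorrectType` are hypothetical, and all counts except the 162 mentions of the Shanghai Municipal Council are made up. The place names are spelled as in the notes above and may differ from the strings found in the corpus.

```r
library(dplyr)

# Toy index; counts are invented except for the Shanghai Municipal Council
toy_index <- tibble(
  Type = c("PERSON", "ORGANIZATION", "ORGANIZATION", "ORGANIZATION"),
  Text = c("Hankow", "Hongkong", "Qingdao", "Shanghai Municipal Council"),
  n    = c(5, 4, 3, 162)
)

# Hand-made correction table (hypothetical): entity text and the type it should have
corrections <- tibble(
  Text        = c("Hankow", "Hongkong", "Qingdao"),
  CorrectType = "LOCATION"
)

toy_index_fixed <- toy_index %>%
  left_join(corrections, by = "Text") %>%
  mutate(Type = coalesce(CorrectType, Type)) %>%  # use the correction when there is one
  select(-CorrectType)

toy_index_fixed
```

Keeping the corrections in a separate table documents every manual decision and makes it possible to apply the same fixes again after the extraction is rerun.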
ner_loc_pq <- ner_index_pq2 %>% filter(Type == "LOCATION")
ner_loc_pq
Outside Shanghai and China, the 10 most frequent locations fall into three main categories:
ner_time_pq <- ner_index_pq2 %>% filter(Type %in% c("DATE", "TIME"))
ner_time_pq
The 10 most frequent temporal markers are mostly weekdays and particular moments of the day. “Thursday” tops the list as the day on which the Rotary Club of Shanghai held its regular tiffin. Special events were held on other weekdays or during the weekend. Although Rotary tiffins took place at lunch time, “noon” does not appear in the list. Press reports place more emphasis on exceptional meeting times - outdoor activities in the afternoon, dinners in the evening, summer trips.
Note: Rotary “tiffins” (or luncheons) consisted in having lunch together while attending a lecture on a timely topic of interest or related to Rotarians’ field of specialty (classification talks). In Shanghai, such tiffins took place every Thursday at noon. They were restricted to Rotarians and their guests. In addition to regular meetings, the club held special events (dinner dancing, garden party, Valentine’s Day, Christmas, tennis or golf competition) open to the Rotarians’ family and special guests. Closed meetings took place once a year for the election of officers and new members, and more occasionally in critical circumstances, such as the “Lincheng Incident” in 1923, which involved the kidnapping of several foreigners on the Pukow-Tientsin train, or the Japanese invasion of Manchuria in September 1931.
NER raises questions as to why certain entities appear in connection to the club and how to interpret their presence. Some are obvious (Rotary, China, United States, members of the club), but others are more intriguing. They call for a more in-depth inquiry, which may in turn open unexpected research paths.
Don’t forget to save the results!
write.csv(ner_results_pq_y, "ner_results_pq.csv") # all results
write.csv(ner_index_pq, "ner_index_pq.csv")
While simple lists of entities are useful for basic statistical analyses, they obscure the links that exist between entities - textual co-occurrences that may refer to actual, social relations. By contrast, network graphs provide a powerful tool to explore these relations.
In the next sections, we propose to build two types of network based on the list of entities we previously extracted: two-mode networks linking entities with the documents in which they appear, and one-mode networks linking entities that co-occur in the same documents. In the last step, we will also create a two-mode network linking entities of different kinds (e.g. persons and organizations).
First we select only persons in the list of named entities and we remove personal pronouns:
ner_results_pers_pq <- ner_results_pq %>%
filter(Type == "PERSON") %>%
filter(!Text %in% c("his", "her", "him", "she", "he", "His", "Her", "She", "He")) %>%
mutate(Text= str_remove_all(Text, "-- ")) %>%
mutate(Text= str_remove_all(Text, "--"))
We prepare the edge list linking persons with documents:
persondata_pq <- ner_results_pers_pq %>%
select(DocID, Text)
persondata_pq
We prepare the edges and nodes of the network and build the two-mode network Persons - Documents using igraph and tidygraph:
edges_pq1 <- persondata_pq %>% transmute(from=DocID, to=Text)
ig_pq1 <- graph_from_data_frame(d=edges_pq1, vertices=NULL, directed = FALSE)
tg_pq1 <- tidygraph::as_tbl_graph(ig_pq1)
Finally, we project the network into Padagraph - a powerful network graph visualization tool developed by our computational linguist Pierre Magistry:
tg_pq1 %N>% mutate(label=name) %>% enpchina::in_padagraph("pq-PersDoc")
The graph is accessible here. At first you need to click on “global” or “+10” to display the graph.
Now we want to project this two-mode network into a one-mode network linking persons to persons through documents.
First we create an edge list in the form of a table linking the source person (from) to the target person (to), which is the standard format for an igraph object:
edges_pq2 <- inner_join(persondata_pq, persondata_pq, by = "DocID") %>%
filter(Text.x < Text.y) %>%
transmute(from=Text.x, to=Text.y) %>%
distinct()
edges_pq2 %>% arrange(from, to)
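The inner_join() function joins the table with itself through DocID, creating one row for each pair of persons mentioned in the same document. The filter on Text.x < Text.y discards self-pairs and keeps each pair in a single order, and distinct() removes the duplicates that arise when the same pair co-occurs in several documents.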
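The self-join is easier to follow on a handful of rows. The sketch below applies the same steps to a made-up table of persons and documents. As a variation that is not in the original code, it counts the pairs instead of calling `distinct()`, so that the number of shared documents is kept as a hypothetical `weight` column.

```r
library(dplyr)

# Made-up edge list: which person is mentioned in which document
toy_persondata <- tibble(
  DocID = c(1, 1, 1, 2, 2),
  Text  = c("Fong Sec", "George Fitch", "Percy Chu", "Fong Sec", "George Fitch")
)

toy_pairs <- inner_join(toy_persondata, toy_persondata, by = "DocID") %>%
  filter(Text.x < Text.y) %>%                        # drop self-pairs and mirrored pairs
  count(from = Text.x, to = Text.y, name = "weight") # number of documents shared by each pair

toy_pairs
```

Recent versions of dplyr warn about a many-to-many relationship in this kind of join. The warning is expected here and can be silenced by adding `relationship = "many-to-many"` to `inner_join()` in dplyr 1.1.0 or later.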
Next we create the one-mode network Person-to-person using igraph and tidygraph:
edges_pers_pq_tg <- edges_pq2 %>% transmute(from=from, to=to)
ig_pq2 <- graph_from_data_frame(d=edges_pers_pq_tg, vertices=NULL, directed = FALSE)
tg_pq2 <- tidygraph::as_tbl_graph(ig_pq2)
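As an alternative to building the person-to-person edge list by hand, igraph can project a two-mode graph directly. The sketch below is a minimal illustration on made-up data, assuming that each person-document pair appears only once in the edge list. It relies on `igraph::bipartite_projection()`, which stores the number of shared documents in an edge attribute called `weight`.

```r
library(igraph)

# Made-up two-mode edge list: documents (from) and the persons they mention (to)
toy_edges <- data.frame(
  from = c("doc1", "doc1", "doc2", "doc2", "doc3"),
  to   = c("Fong Sec", "George Fitch", "Fong Sec", "George Fitch", "Percy Chu")
)

g <- graph_from_data_frame(toy_edges, directed = FALSE)

# Mark the two kinds of nodes: TRUE for persons, FALSE for documents
V(g)$type <- V(g)$name %in% toy_edges$to

# Keep only the projection on the persons
person_graph <- bipartite_projection(g, which = "true")

V(person_graph)$name
E(person_graph)$weight
```

A person who never shares a document with anyone else, such as the third person in this toy example, remains in the projection as an isolated node.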
Finally we project the one-mode network Person-Person into Padagraph:
tg_pq2 %N>% mutate(label=name) %>% enpchina::in_padagraph("pq-PersPers")
The graph is accessible here. At first you need to click on “global” or “+10” to display the graph.
Depending on your research questions, you can replace persons by organizations, locations, or events. You may also build a multimodal network linking entities of different nature (e.g. persons and organizations):
Let’s say we want to create a two-mode network linking persons and organizations using Padagraph.
First we select only persons and organizations in the original list of named entities:
ner_results_pq_pers_org <- ner_results_pq %>%
filter(Type %in% c("PERSON", "ORGANIZATION")) %>%
select(Type, Text, DocID)
ner_results_pq_pers_org
We create the edge list linking persons and organizations by joining the table of entities with itself through DocID:
edges_pq3 <- ner_results_pq_pers_org %>%
inner_join(ner_results_pq_pers_org, by = "DocID") %>%
# keep only person-organization pairs so that the network is strictly two-mode
filter(Type.x == "PERSON", Type.y == "ORGANIZATION") %>%
transmute(from=Text.x, to=Text.y) %>%
distinct()
edges_pq3 %>%
arrange(from, to)
We create the two-mode network Person-to-Organization using igraph and tidygraph:
edges_pers_org_pq_tg <- edges_pq3 %>% transmute(from=from, to=to)
ig_pq3 <- graph_from_data_frame(d=edges_pers_org_pq_tg, vertices=NULL, directed = FALSE)
tg_pq3 <- tidygraph::as_tbl_graph(ig_pq3)
Finally we project the two-mode network Person-Organization into Padagraph:
tg_pq3 %N>% mutate(label=name) %>% enpchina::in_padagraph("pq-PersOrg")
The graph is accessible [here](https://pdg.enpchina.eu/rstudio?gid=pq-PersOrg#+10). At first you need to click on “global” or “+10” to display the graph.
You may save the edge lists to rebuild the networks of entities with any social network analysis software (Gephi, Cytoscape…):
write.csv(edges_pq1, "edgelist_pers_doc_pq.csv") # two-mode network person-document
write.csv(edges_pq2, "edgelist_pers_pers_pq.csv") # one-mode network person-person
write.csv(edges_pq3, "edgelist_pers_org_pq.csv") # two-mode network person-org
Relying on the rosters of members available in the archives of Rotary International, we established a list of 131 Chinese Rotarians in Shanghai from 1919 to 1951. This list is available on Zenodo ENP-China Community. We propose to search for these individuals in our corpus in order to determine how active they were in the club. We assume that the number of times they appear in the Rotary sub-corpus (containing only documents related to the Rotary Club) is a preliminary indicator of the degree of their commitment to the club.
The first thing we need to do is to import the list of members and format their names so that they can be used as queries. We used the Wade-Giles transliteration, which was commonly used in the English-language press. The table below indicates their full name, surname, given name, initials, and most common name. Only the first 10 members are displayed:
library(readr)
RotarianList_allNation <- read_delim("RotarianList_allNation.csv",
";", escape_double = FALSE, trim_ws = TRUE)
rotarian_wg <- RotarianList_allNation %>% filter(Nationality == "Chinese") %>%
rename(FullName = Name_eng, Surname = LastName, GivenName = FirstName, CommonName = Common)
rotarian_wg <- rotarian_wg %>%
select(FullName, Surname, GivenName, Initials, CommonName) %>%
na.omit() %>%
arrange(FullName)
rotarian_wg
Due to transliteration issues, we experimented with various possible spellings for their names in Wade-Giles (Surname only, Surname+Initials, Surname+Given Name) in order to obtain the most accurate results. The last column indicates the most common spelling that eventually yields the best results.
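The candidate spellings can be generated programmatically before testing them against the corpus. The sketch below is a hypothetical illustration on two members named in this guide. The order of the name parts and the way initials are derived are assumptions, and the real list may store initials differently.

```r
library(dplyr)
library(stringr)

# Two illustrative members; the real list is imported from the CSV file above
toy_members <- tibble(
  Surname   = c("Chu", "Koo"),
  GivenName = c("Percy", "Wellington")
)

toy_queries <- toy_members %>%
  mutate(Initial = str_c(str_sub(GivenName, 1, 1), ".")) %>%  # first letter of the given name
  transmute(
    Surname,
    surname_only    = str_glue('"{Surname}"'),                # broadest query, many homonyms
    initial_surname = str_glue('"{Initial} {Surname}"'),
    given_surname   = str_glue('"{GivenName} {Surname}"')     # most specific query
  )

toy_queries
```

Each column holds a quoted phrase that can be passed as a query, so the number of results returned by the different variants can be compared for each person.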
In order to identify the Chinese Rotarians in the press and examine their relation with the club, we proceed in two steps:
We found all 131 individuals from the original list. The disambiguation of homonyms, however, would probably revise these results downward. In the Rotary subcorpus, the 131 Rotarians total 226 occurrences, i.e. an average of 1.7 per person, with great variations from one individual to the other. In the whole corpus, they represent 5,247 occurrences (40/person), with even greater discrepancies between them. We observe that the number of identified Rotarians is higher in ProQuest, but they appear a greater number of times in the Shenbao. The following sections describe in detail the method for obtaining these results and offer a preliminary interpretation.
The table below lists all the Rotarians mentioned in the ProQuest Collection of Chinese Newspapers in connection with the Rotary Club. It contains six columns: their full name (FullName) and the common name used for the query (CommonName), the identifier of the article in which they appear (Id), the date of publication (Date), the title of the article (Title), and the name of the periodical (Source). Only the first ten results are displayed:
rotarianwg <- rotarian_wg %>%
select(CommonName) %>% na.omit() %>%
mutate(Queries=str_glue('"{CommonName}"'))
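multiple_search <- function(queries, corpus) {
# run the first query to initialise the table of results
results <- enpchina::search_documents(queries[1], corpus) %>%
mutate(Q=queries[1])
# run the remaining queries (the first one is skipped to avoid searching it twice)
for(q in queries[-1]){
new_result <- enpchina::search_documents(q, corpus) %>%
mutate(Q=q)
results <- dplyr::bind_rows(results, new_result)
}
distinct(results)
}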
rotarians_in_proquest <- multiple_search(rotarianwg$Queries, "proquest")
rotarians_in_pq_subcorpus <- inner_join(docs_pq, rotarians_in_proquest, by = "Id")
rotarians_in_pq_subcorpus <- rotarians_in_pq_subcorpus %>%
select(Q, Id, Date.x, Title.x, Source.x) %>%
rename (Date= Date.x) %>%
rename (Title= Title.x) %>%
rename (Source= Source.x) %>%
rename (Queries= Q)
rotarians_in_pq_subcorpus <- full_join(rotarians_in_pq_subcorpus, rotarianwg)
rotarians_in_pq_subcorpus <- select(rotarians_in_pq_subcorpus,-c(Queries))
rotarians_in_pq_subcorpus <- inner_join(rotarians_in_pq_subcorpus, rotarian_wg, by = "CommonName") %>%
select(FullName, CommonName, Id, Date, Title, Source)
rotarians_in_pq_subcorpus
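The logic of the multiple search (one query per name, results stacked with a column recording the query) can be checked without access to the ENP-China server. The sketch below is written in base R under explicit assumptions: `multiple_search_with()` and `fake_search()` are hypothetical names, the toy data are invented, and the search function is passed as an argument so that a stand-in can replace `enpchina::search_documents()`.

```r
# Same idea as multiple_search(), but the search function is injected,
# so the stacking logic can be tested offline.
multiple_search_with <- function(queries, corpus, search_fun) {
  results <- lapply(queries, function(q) {
    res <- search_fun(q, corpus)
    res$Q <- rep(q, nrow(res))   # remember which query produced each row
    res
  })
  unique(do.call(rbind, results))
}

# HYPOTHETICAL stand-in for enpchina::search_documents(), with invented data
fake_search <- function(query, corpus) {
  toy <- data.frame(Id    = c(1, 2, 2),
                    Title = c("Rotary dinner", "Social notes", "Social notes"),
                    Name  = c("Percy Chu", "Percy Chu", "Percy Kwok"),
                    stringsAsFactors = FALSE)
  name <- gsub('"', "", query)   # strip the exact-phrase quotes
  toy[toy$Name == name, c("Id", "Title")]
}

hits <- multiple_search_with(c('"Percy Chu"', '"Percy Kwok"', '"Nobody"'),
                             "proquest", fake_search)
print(hits)
```

A query without any hit simply contributes no rows, which is why a later join with the full list of names is needed to keep track of individuals who were not found.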
Based on these results, we propose to measure the strength of each individual’s connection with the club by counting the number of times they appear in the subcorpus. The table below displays the first ten individuals, ranked in decreasing order of importance (number of occurrences):
rotarians_in_pq_subcorpus_uniq <- rotarians_in_pq_subcorpus %>%
distinct(FullName, Id) %>%
group_by(FullName) %>%
count() %>%
arrange(desc(n))
rotarians_in_pq_subcorpus_uniq
The table reveals strong discrepancies between Rotarians. The five most frequently mentioned represent 35% of all occurrences. The next five on the list, with more than 2 occurrences each, represent 11%, while the remaining 121 Rotarians appear only once:
rotarians_in_pq_subcorpus_uniq %>%
group_by(n) %>%
count() %>%
ggplot(aes(n, nn))+
geom_smooth()+
geom_jitter(alpha = 0.5)+
scale_x_log10("Number of occurrences")+
scale_y_continuous()+
labs(title = "The Chinese Rotarians in the English-language press",
subtitle = "Index of commitment to the Rotary Club of Shanghai",
x = "Number of occurrences",
y = "Number of individuals",
caption = "Based on the ProQuest collection of Chinese Historical Newspapers")
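The concentration figures quoted before the plot (the share of all occurrences held by the most frequently mentioned individuals) can be recomputed from the column of counts. This is a minimal sketch in base R: `share_of_top()` is a hypothetical helper and the counts are invented for illustration, not taken from the corpus.

```r
# Toy counts (HYPOTHETICAL): one value per individual, like the column `n`
n_per_person <- c(12, 9, 7, 6, 5, 3, 3, 3, 3, 3, rep(1, 10))

# Share of all occurrences held by the k most frequently mentioned individuals
share_of_top <- function(counts, k) {
  sorted <- sort(counts, decreasing = TRUE)
  sum(sorted[seq_len(min(k, length(sorted)))]) / sum(sorted)
}

top5_share <- share_of_top(n_per_person, 5)
print(round(100 * top5_share, 1))   # percentage held by the top five
```

With the real data, the same function could be applied to `rotarians_in_pq_subcorpus_uniq$n`.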
The ranking generally reflects members’ positions in the club. Zhu Boquan 朱博泉 (Percy Chu) held a dozen positions during his membership, as officer, director and chairman of important committees. As a banker, he served as the treasurer of the club for five consecutive years (1928-1935) and was eventually elected president in 1934-1935. Li Yuanxin 李元信 (William Yinson Lee) sat on the board of directors and chaired the program committee for two consecutive years (1934-1936), among other committees. Guo Baoshu 郭寶樹 (Percy Kwok) was elected vice-president of the club in 1930 and chaired several important committees (Club Service, Fellowship). The diplomat Gu Weijun 顾维钧 (Wellington Koo) did not hold any position locally, but as an honorary member of the club, he was frequently invited as a guest of honor to its special events. Xu Jianping 許建屏 (Jabin Hsu) was a member of the board of directors (1924-1925) and of the Fellowship committee (1933-1934). Zhu Shen’en 朱神恩 (Luther Jee) was the first Chinese president, elected in 1926. He also chaired the Rotary Extension committee (which sponsored the creation of new clubs in China) in 1930 and 1933, and contributed to other important committees (Program, Fellowship, Club Service).
Some important leaders are noticeably missing, however. This applies especially to Kuang Fuzhuo 鄺富灼 (Fong Sec), the second Chinese president of the club elected in 1932. Kuang was also appointed to prominent positions in Rotary International. He actively campaigned for the sinicization of the organization and sponsored the creation of the first Chinese-speaking club in Shanghai in 1936. His anomalously low position in the list results from transliteration and OCR inaccuracies.
From a comparative perspective, however, this index is a better reflection of the Rotarians’ actual influence in the club than the one drawn from the Shenbao (Case 1).
We are also interested in measuring the Rotarians’ reputation and social importance more generally, based on the number of times they appear in the ProQuest corpus as a whole. To this end, we extend the query to the entire corpus. We obtain a table with 5,247 results, which corresponds to the total number of occurrences for all members. In this table, a person may be mentioned several times in the same article. The table includes the name of the person (FullName, CommonName), the unique identifier of the article (Id), the date of publication (Date), the title of the article (Title), and the name of the newspaper (Source). Only the first 10 results are displayed:
rotarianwg <- rotarian_wg %>%
select(CommonName) %>% na.omit() %>%
mutate(Queries=str_glue('"{CommonName}"'))
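# Run each query against the corpus and stack the results,
# keeping track of the query that produced each row (column Q)
multiple_search <- function(queries, corpus) {
  results <- enpchina::search_documents(queries[1], corpus) %>%
    mutate(Q = queries[1])
  # Start from the second query: the first one has already been run above
  for (q in queries[-1]) {
    new_result <- enpchina::search_documents(q, corpus) %>%
      mutate(Q = q)
    results <- dplyr::bind_rows(results, new_result)
  }
  distinct(results)
}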
rotarians_in_proquest <- multiple_search(rotarianwg$Queries, "proquest")
rotarians_in_proquest <- rotarians_in_proquest %>%
rename (Queries= Q)
rotarians_in_proquest <- full_join(rotarians_in_proquest, rotarianwg)
rotarians_in_proquest <- select(rotarians_in_proquest,-c(Queries))
rotarians_in_proquest <- inner_join(rotarians_in_proquest, rotarian_wg, by = "CommonName") %>%
select(FullName, CommonName, Id, Date, Title, Source)
rotarians_in_proquest
Based on these results, we built an index of reputation for each individual by counting the number of times they appear in the ProQuest corpus. Only the first 10 individuals are displayed below:
rotarians_in_proquest_uniq <- rotarians_in_proquest %>%
distinct(FullName, Id) %>%
group_by(FullName) %>%
count() %>%
arrange(desc(n))
rotarians_in_proquest_uniq
The ranking in the whole corpus is slightly different from the one based on the subcorpus. The most frequently mentioned names include highly committed Rotarians (e.g. Gu Weijun/Wellington Koo, Xu Jianping/Jabin Hsu, Zhu Boquan/Percy Chu, Guo Baoshu/Percy Kwok, Li Yuanxin/Yinson Lee) but also individuals who rank lower in the Rotary subcorpus, such as Zhu Shenzhi 祝慎之 (Ernest Tso), Huang Hanyan 黃漢彥 (James Wong) or Philip Ho, who occur only once in connection with the club. Except for the diplomat Gu Weijun (Wellington Koo), the list highlights professionals and businessmen rather than politicians.
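# Keep one row per individual, so that each bar counts the number of
# individuals for a given range of occurrences (matching the y-axis label)
rotarians_in_proquest_uniq %>%
  ungroup() %>%
  ggplot(aes(x = n)) +
  geom_histogram(alpha = 0.8) +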
labs(title = "The Chinese Rotarians in the English-language press",
subtitle = "Index of reputation",
x = "Number of occurrences",
y = "Number of individuals",
caption = "Based on the ProQuest collection of Chinese Historical Newspapers")
From these differences, three main profiles may be distinguished:
Comparing more systematically the general distribution and the relative position of each individual in the two corpora will help determine whether the ranking reflects their importance strictly in the club and/or their social influence more generally beyond the club.
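One possible way to carry out such a systematic comparison is to rank individuals in each corpus and examine the gap between the two ranks, together with a rank correlation. The sketch below uses base R and an invented table: the names and counts are hypothetical, and the column names `ProQuest` and `Rotary` are only assumed to mirror the two indices discussed here.

```r
# HYPOTHETICAL index table with the two counts per individual
index <- data.frame(
  FullName = c("A", "B", "C", "D", "E"),
  ProQuest = c(500, 120, 80, 60, 10),   # occurrences in the whole corpus
  Rotary   = c(3, 12, 9, 1, 1)          # occurrences in the Rotary subcorpus
)

# Rank 1 = most frequently mentioned; ties share the best rank
index$rank_proquest <- rank(-index$ProQuest, ties.method = "min")
index$rank_rotary   <- rank(-index$Rotary, ties.method = "min")

# Positive gap: better ranked in the whole corpus than in the club subcorpus
index$rank_gap <- index$rank_rotary - index$rank_proquest

# Spearman's rho summarizes how closely the two rankings agree
rho <- cor(index$ProQuest, index$Rotary, method = "spearman")
print(index)
print(rho)
```

A large positive gap would point to individuals whose visibility extends well beyond the club, while a negative gap would suggest a reputation built primarily within it.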
From a comparative perspective, to conclude, we notice two major differences with the Chinese-language press (Case 1). In the subcorpus, the ranking in ProQuest reflects more strongly the members’ relative influence in the club. If we consider the whole corpus, businessmen and professionals rank higher in the English-language press, whereas the Shenbao tends to highlight diplomats and politicians.
At this stage, it is recommended to save and export the results. You can export the complete lists of occurrences for each individual in the two corpora and the synthetic indices we visualized above. In order to facilitate the comparison between the two corpora, we combined the two indices (general reputation in the entire corpus and index of commitment to the Rotary) into a single table:
write.csv(rotarians_in_proquest, "rotarians_in_proquest.csv") # list of occurrences in the ProQuest
write.csv(rotarians_in_pq_subcorpus, "rotarians_in_pq_subcorpus.csv") # list of occurrences in the "Rotary" subcorpus
rotarians_pq_index <- full_join(rotarians_in_proquest_uniq, rotarians_in_pq_subcorpus_uniq, by = "FullName") %>%
rename(ProQuest = n.x) %>%
rename(Rotary = n.y)
write.csv(rotarians_pq_index, "rotarians_pq_index.csv") # synthetic indices for each individual
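A full join leaves a missing value (NA) wherever an individual is present in only one of the two tables. Whether such gaps should be read as zero occurrences is an analytical choice. The following base R sketch shows one way to make that choice explicit; the counts are invented, and `merge(..., all = TRUE)` is used here as a base R equivalent of a full join.

```r
# HYPOTHETICAL counts: one table per corpus, same structure as the two indices
reputation <- data.frame(FullName = c("Koo, V.K.W.", "Wong, James", "Ho, Philip"),
                         n = c(900, 60, 50))
commitment <- data.frame(FullName = c("Koo, V.K.W.", "Ho, Philip"),
                         n = c(8, 1))

# Full join on the name; the clashing columns become n.x and n.y
index <- merge(reputation, commitment, by = "FullName", all = TRUE,
               suffixes = c(".x", ".y"))
names(index)[names(index) == "n.x"] <- "ProQuest"
names(index)[names(index) == "n.y"] <- "Rotary"

# ASSUMPTION: absence from the subcorpus is interpreted as zero occurrences
index$Rotary[is.na(index$Rotary)] <- 0
print(index)
```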
How and why did these indices evolve over time? We propose to examine the variations in their reputation and commitment to the club during the course of their life. The following analyses focus on the 15 most popular Rotarians.
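# Attach the two indices (ProQuest, Rotary) to each occurrence, matching on the name
rotarians_in_proquest <- full_join(rotarians_in_proquest, rotarians_pq_index, by = "FullName")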
integer_breaks <- function(n = 5, ...) {
fxn <- function(x) {
breaks <- floor(pretty(x, n, ...))
names(breaks) <- attr(breaks, "labels")
breaks
}
return(fxn)
}
rotarians_in_proquest %>%
mutate(Year = stringr::str_sub(Date,0,4)) %>%
filter(ProQuest>47) %>%
group_by(Year) %>%
count(FullName)%>%
ggplot(aes(Year, n, fill = as.factor(FullName))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ FullName, scales = "free_y", ncol = 3) +
scale_x_discrete(breaks = pretty_breaks())+
scale_y_continuous(breaks = pretty_breaks())+
labs(x = "Year", y = "Number of occurrences",
title = "The 15 most popular Chinese Rotarians in the English-language press (> 47 occ.)",
subtitle = "Index of reputation over time (1850-1950)",
caption = "Based on data extracted from the 'ProQuest' collection of Chinese Historical Newspapers")
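Because the year is extracted as a character string and plotted on a discrete axis, years without any mention do not appear on the axis, which may compress the timeline. One possible workaround, sketched below in base R, is to convert the year to an integer and fill the missing years with zero before plotting. The dates are invented, and the "YYYY-MM-DD" format is an assumption about the Date column.

```r
# HYPOTHETICAL publication dates, assumed to start with a four-digit year
dates <- c("1931-05-02", "1931-11-20", "1934-03-10", "1935-09-07")

# Extract the year as an integer rather than as a string
years <- as.integer(substr(dates, 1, 4))

# Count occurrences per year over the full range, including empty years
all_years <- seq(min(years), max(years))
counts    <- table(factor(years, levels = all_years))
yearly    <- data.frame(Year = all_years, n = as.integer(counts))
print(yearly)
```

A table of this kind can be plotted on a continuous x-axis, so that gaps in an individual's press coverage remain visible.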
For most Rotarians, the period of popularity spans the 1920s to the 1940s. Their appearance in the English-language press follows the same regular pattern, starting with a few occurrences in the 1920s, followed by a peak in the 1930s, and eventually a slow decrease by the end of the war.
This general pattern reflects a generational effect. The majority of the Chinese Rotarians in Shanghai were born during the waning years of the Qing dynasty, between 1880 and 1896 (31, 53%), with two peaks in 1884 and 1896. They joined the club in their thirties or forties, i.e. at the height of their professional career (Armand 2021, Liu 2012). The peaks on the graphs therefore coincide with the most active phase in their life.
There are specific deviations from this general pattern, however.
The orthopedist Niu Huisheng (1892-1937), for instance, appeared and disappeared earlier than his fellow Rotarians, for natural reasons: he died of illness in 1937. Conversely, younger members such as the cotton industrialist David Kwok and the financier Philip Ho appeared later and over a shorter period. Their years of birth cannot be ascertained, but they joined the club later than average, in 1930 and 1936 respectively.
rotarians_in_proquest %>%
mutate(Year = stringr::str_sub(Date,0,4)) %>%
filter(ProQuest>47) %>%
group_by(Year) %>%
count(FullName)%>%
filter(FullName %in% c("New, Way-Sung", "Kwok, David", "Ho, Philip")) %>%
ggplot(aes(Year, n, fill = as.factor(FullName))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ FullName, scales = "free_y", ncol = 3) +
scale_x_discrete(breaks = pretty_breaks())+
scale_y_continuous(breaks = integer_breaks())+
labs(x = "Year", y = "Number of occurrences",
title = "The most popular Chinese Rotarians in the English-language press",
subtitle = "Three deviating profiles",
caption = "Based on data extracted from the 'ProQuest' collection of Chinese Historical Newspapers")
The most remarkable profile is certainly Gu’s (Wellington Koo), whose public appearance as a diplomat extended over almost 50 years, from the late Qing dynasty to the early years of the People’s Republic of China (PRC). By contrast, other Rotarians, such as the pediatrician Ernest Tso or the clothing industrialist James Wong, experienced a less continuous trajectory:
rotarians_in_proquest %>%
mutate(Year = stringr::str_sub(Date,0,4)) %>%
filter(ProQuest>47) %>%
group_by(Year) %>%
count(FullName)%>%
filter(FullName %in% c("Koo, V.K.W.", "Wong, James", "Tso, Ernest")) %>%
ggplot(aes(Year, n, fill = as.factor(FullName))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ FullName, scales = "free_y", ncol = 3) +
scale_x_discrete(breaks = pretty_breaks())+
scale_y_continuous(breaks = integer_breaks())+
labs(x = "Year", y = "Number of occurrences",
title = "The most popular Chinese Rotarians in the English-language press",
subtitle = "Enduring vs. discrete popularity",
caption = "Based on data extracted from the 'ProQuest' collection of Chinese Historical Newspapers")
Similarly, we propose to examine how the commitment of the most popular Rotarians changed over time. We focus on the Rotarians who are mentioned more than twice in the subcorpus:
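# Attach the two indices (ProQuest, Rotary) to each occurrence, matching on the name
rotarians_in_pq_subcorpus <- full_join(rotarians_in_pq_subcorpus, rotarians_pq_index, by = "FullName")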
rotarians_in_pq_subcorpus %>%
mutate(Year = stringr::str_sub(Date,0,4)) %>%
filter(Rotary>2) %>%
group_by(Year) %>%
count(FullName) %>%
ggplot(aes(Year, n, fill = as.factor(FullName))) +
geom_col(alpha = 0.9, show.legend = FALSE)+
facet_wrap(~ FullName, scales = "free_y", ncol = 3) +
scale_x_discrete(breaks = pretty_breaks())+
scale_y_continuous(breaks = integer_breaks())+
labs(x = "Year", y = "Number of occurrences",
title = "The most committed Chinese Rotarians in the English-language press (> 2 occ.)",
subtitle = "Index of commitment (1921-1938)",
caption = "Based on the 'ProQuest' collection of Chinese Historical Newspapers")
The general pattern is roughly bell-shaped. It consists of a peak of activity in the mid-1930s, preceded by a short appearance that points to their admission to the club, which usually occurred five to ten years earlier (around 1925), and followed by a final decrease at the outbreak of the war.
Variations to this pattern are of three main types:
In order to further analyze the interactions between Rotarians within and outside of the club, we propose to build a two-mode network linking individuals with the documents in which they are mentioned in the Rotary subcorpus. “Two-mode” means that the network contains two categories of nodes: individuals (represented by triangles on the graph) and documents (square nodes). We chose to retain only the members that appear at least 3 times (1107 articles).
First, we create the corpus containing the documents that mention the Rotarians who appeared at least 3 times in the corpus:
subcorpus_pq <- rotarians_in_proquest %>%
group_by(Id) %>%
count() %>%
filter(n>2)
subcorpus_pq <- enpchina::get_documents(subcorpus_pq, "proquest")
Then we create the node list of individuals:
regexps_pq <- rotarianwg %>%
transmute(
Regexp = CommonName,
Type="PER"
)
We create the edge list:
rotarianwg_mentions <- enpchina::extract_regexps_from_subcorpus(subcorpus_pq, regexps_pq)
edges_pq <- rotarianwg_mentions %>% transmute(from=DocID, to=Match)
We customize the shape of the nodes representing individuals (triangles):
rotarianwg_nodes <- rotarianwg_mentions %>%
select(Match) %>% distinct() %>%
transmute(
name=Match,
label=Match,
Text=" ",
shape="triangle"
)
Similarly, we create the node list of documents:
docs_nodes_pq <- subcorpus_pq %>% distinct() %>%
transmute(
name=as.character(DocID),
label=str_sub(str_replace_all(Title, "[^[:alnum:]]", " "),1,10),
Text=str_replace_all(Text, "[^[:alnum:]]", " "),
shape="square"
)
We can finally project the graph into “Padagraph”:
ig_pq4 <- graph_from_data_frame(d=edges_pq, directed = FALSE, vertices=(docs_nodes_pq %>% bind_rows(rotarianwg_nodes)))
tg_pq4 <- tidygraph::as_tbl_graph(ig_pq4)
tg_pq4 %>% enpchina::in_padagraph("PQRotarians")
The interactive version of the graph can be explored in Padagraph. First, click on “global” or “+10” to display the graph.
The graph contains a total of 55 nodes (36 documents and 19 individuals) and 123 edges. Square nodes represent documents whereas triangular nodes represent individuals. Distinct colors represent clusters, i.e. groups of more densely connected nodes.
Click on any node or edge to display its individual properties. The interactive panel on the right contains two tabs: the “details” tab displays the attributes of any selected node or edge, whereas the “list” tab lists all the nodes that constitute the network. You can sort them by label (name), type (individual or document), cluster, or degree centrality (number of neighbors).
The panel on the left allows users to zoom in/out, change the layout, the size of the label and the general settings. One can navigate the graph by focusing on a particular node and explore its ego-network by adding or removing a selected number of its neighbors. Alternatively, one can use the list of linked entities in the right-side panel. The panel also provides access to the full text associated with each document-node. You can use the search engine at the bottom to search individual names. Click on “global” to return to the original network.
Padagraph is a powerful tool for examining patterns of co-occurrence in large corpora. Such visualization makes it possible to explore entities in their broader context (individuals in documents, documents in corpora). It assists researchers in identifying the most relevant documents to focus on for close reading (nodes with the highest degree), with no a priori knowledge of their content. In addition, the colored clusters raise intriguing questions about why certain individuals were likely to group and meet more frequently than others - in what context, based on which shared activities and common interests?
This graph more specifically allows us to examine the Rotarians’ co-participation in social events. The largest nodes refer either to the most attended meetings, i.e. the events that involved the highest number of participants, or to the most active individuals, who participated in a large number of meetings. The latter reproduces the index of commitment we built earlier. For instance, four Rotarians (Percy Chu, Percy Kwok, David Kwok, James Tsao) attended the dinner party of the Rotary Club held at the French Club in September 1935 (doc n°1425819368).
To take another example, five Chinese Rotarians sent flowers for the funeral of Mrs. Liang (the US-born wife of an American-returned student) in March 1934 (doc n°1416482396). In these cases, the Rotarians did not actually meet, but their co-occurrence in relation to the same event is indicative of how their networks of personal contacts may have intersected. Similarly, the fact that four other Rotarians donated to the Flood Relief Fund in September 1935 (doc. n°1371425931) is indicative of their shared concern for social relief and possibly of their common origin from the flooded Jiangsu province.
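Co-occurrences of this kind can also be counted directly from the edge list, by projecting the two-mode network (documents and persons) onto a one-mode network of persons. The sketch below is one possible approach in base R, not the method used by Padagraph: the edge list is invented and only assumed to have the same shape as `edges_pq` (from = document, to = person).

```r
# HYPOTHETICAL edge list: from = document, to = person
edges <- data.frame(
  from = c("doc1", "doc1", "doc1", "doc2", "doc2", "doc3"),
  to   = c("Percy Chu", "Percy Kwok", "David Kwok",
           "Percy Chu", "Percy Kwok", "David Kwok"),
  stringsAsFactors = FALSE
)

# Incidence matrix: rows = documents, columns = persons (1 = mentioned)
incidence <- (unclass(table(edges$from, edges$to)) > 0) * 1

# Degree in the two-mode network
docs_per_person <- colSums(incidence)   # number of documents per person
persons_per_doc <- rowSums(incidence)   # number of persons per document

# One-mode projection: number of documents shared by each pair of persons
co_occurrence <- crossprod(incidence)
diag(co_occurrence) <- 0                # ignore self-pairs
print(co_occurrence)
```

The resulting matrix can serve as a weighted adjacency matrix in which each cell gives the number of documents two individuals share.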
The graph also helps detect possible inaccuracies in the indexation of documents. This particularly applies to the “Chinese social notes” section in the North-China Herald, which usually features several events on the same page, with the effect of accidentally bringing together the names of individuals who did not actually meet. It may also help disambiguate homonyms and other false positives, as in the case of the three Rotarians’ wives who sponsored the work of a hospital on Bubbling Well Road (doc n°1425464955). Since they are referred to by the same name as their husbands, preceded by “Mrs”, the algorithm mistook them for their husbands. The ambiguity, however, reveals the extent to which the network of Rotarians’ wives mirrored their husbands’.
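False positives of the “Mrs” type could in principle be filtered out with a regular expression that rejects a name when it is preceded by the title. The sketch below illustrates the idea with base R and Perl-compatible lookbehinds on invented sentences. Whether the matching function of the package accepts such patterns is not documented here, so this is an assumption to be tested rather than an established feature.

```r
# HYPOTHETICAL sentences illustrating the ambiguity
texts <- c("Mr. Percy Chu presided over the meeting.",
           "Mrs. Percy Chu sponsored the hospital.",
           "Percy Chu and Mrs. Percy Chu attended.")

# Match the name only when it is NOT preceded by "Mrs. " or "Mrs "
pattern <- "(?<!Mrs\\. )(?<!Mrs )Percy Chu"

# perl = TRUE is required for lookbehind assertions in base R
mentions_husband <- grepl(pattern, texts, perl = TRUE)
print(mentions_husband)
```

The second sentence is rejected because its only occurrence of the name follows the title, whereas the third is kept because the husband is also mentioned on his own.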
Finally, it is possible to export the edge and node lists so as to reuse them with any Social Network Analysis software.
write.csv(edges_pq, "edgelist_pq.csv")
write.csv(docs_nodes_pq , "docs_nodes_pq.csv")
write.csv(regexps_pq, "pers_nodes_pq.csv")
Cécile Armand, “Foreign Clubs with Chinese Flavor: The Rotary Club of Shanghai and the Politics of Language,” in Cécile Armand, Christian Henriot and Huei-min Sung (eds.), Knowledge, Power, and Networks: Elites in Transition in Modern China (forthcoming).
Guan Yuting and Chen Yunqian, “Minguoshiqi Zhongguo Fulunshe Fazhan Chutan (The Development of Rotary Clubs in Republican China),” Jiangxi Shehuikexue, no. 6 (2009): 156.
Jiang Pei and Geng Keyan, “Minguoshiqi Tianjin Zujie Waiqiao Jingying Shetuan – Fulunshe Shu Lun (Foreign Elite Organization in Republican Tianjin: Tianjin Foreign Settlements: The Tianjin Rotary Club),” Lishi Jiaoxue (Xiaban Yuekan) 12, no. 673 (2013): 7.
Liu Bensen, “Jindai Shanghai Shangye Jingying Yu Fulunshe (The Shanghai Rotary Club and Business Elites in Republican Shanghai),” Suzhou Keji Xueyuan Xuebao (Shehuikexue Ban) 29, no. 5 (2012): 64–69.