Abstract
This document is part of a series of three scripts developed to support a comprehensive study of early Chinese PhDs from 1905 to 1962. The study is based on a dataset derived from three catalogs of Chinese doctoral dissertations compiled by the librarian and bibliographer Yuan Tongli 袁同禮 (1895–1965), which cover the United States (1905–1960), the United Kingdom (1916–1961), and Continental Europe (1905–1962). Following the first script, which focuses on the dissertations themselves, this second script analyzes the social and educational backgrounds of their authors.
The analysis draws on external data, specifically the China University Student Dataset (CUSD) compiled by the Lee-Campbell Research Group. This dataset offers detailed information on the social and geographic family backgrounds of nearly 300,000 students who enrolled in Chinese universities during the late Qing and Republican periods. It includes critical information not provided in the original dissertation catalogs, most notably students’ place and date of birth, gender, and the educational institutions they attended both in China and abroad. Since the dataset is not yet publicly available, we provide only the subset resulting from the match with our initial dataset of Chinese PhDs.
Of the 4,683 individuals in the main dataset, 3,716 (79.4%) were successfully matched with entries in the CUSD-OS dataset. Match coverage was highest for those who studied in the United States (83.4%), followed closely by the United Kingdom (83.2%), and was markedly lower for Continental Europe (71.3%).
# load packages and dataset
library(tidyverse) # includes readr, dplyr, tidyr and ggplot2
phd_cus_all <- read_delim("Data/phd_cus_all.csv",
                          delim = ";", escape_double = FALSE, trim_ws = TRUE)
# number of matched PhDs per region
phd_cus_all %>% count(Region, sort = TRUE)
| Region | Matched in CUSD-OS | (%) |
|---|---|---|
| USA | 2,315 | 83.4% |
| Europe | 1,113 | 71.3% |
| UK | 288 | 83.2% |
| Total | 3,716 | 79.4% |
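The coverage figures in the table can be derived by flagging which records of the full PhD list have a counterpart in the external dataset. The sketch below is illustrative only: `master`, `matched_ids`, and the toy values are hypothetical stand-ins, not the authors' data or matching procedure, and it uses base R so that it runs without extra packages.

```r
# Toy data (hypothetical): `master` stands in for the full list of PhDs,
# `matched_ids` for the IDs that were found in the external dataset.
master <- data.frame(
  ID     = 1:10,
  Region = c("USA", "USA", "USA", "USA", "UK", "UK",
             "Europe", "Europe", "Europe", "Europe"),
  stringsAsFactors = FALSE
)
matched_ids <- c(1, 2, 3, 5, 6, 7, 8)

# Flag each PhD as matched or not
master$matched <- master$ID %in% matched_ids

# Matched count, total, and match rate (%) per region
n_matched <- tapply(master$matched, master$Region, sum)
n_total   <- table(master$Region)
coverage <- data.frame(
  Region  = names(n_matched),
  Matched = as.integer(n_matched),
  Total   = as.integer(n_total[names(n_matched)]),
  stringsAsFactors = FALSE
)
coverage$Percent <- round(coverage$Matched / coverage$Total * 100, 1)

# Overall match rate (%)
overall <- round(mean(master$matched) * 100, 1)

print(coverage)
print(overall)
```

Applied to the figures reported above, the same arithmetic gives `round(3716 / 4683 * 100, 1)`, that is, 79.4.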
It is important to note that biographical information—such as
gender and provincial origin—is not systematically available. As a
result, even when a Chinese PhD was successfully matched with a
corresponding record in the CUSD, their biographical data may still be
incomplete.
colSums(is.na(phd_cus_all))
## ID NameZH University
## 0 0 0
## cus_Univ_Eng cus_Univ_For_ZhT IDCUS
## 936 591 0
## cus_NameZH cus_NameZH_uniq cus_Name_Alt_Eng
## 0 3716 3715
## cus_Name_Alt_Zh cus_Zi cus_Birth_Loc_ZhT
## 3632 3716 3699
## cus_Birth_Prov_ZhT cus_Gender cus_Univ_China_ZhT
## 3007 1445 2855
## cus_Univ_For_Py cus_City cus_Lat
## 3341 1021 1197
## cus_Long cus_Country_Eng cus_Discipline_Level2
## 1211 0 2524
## cus_Discipline_Level2_ZhT cus_Discipline_Level1 cus_Discipline_Level1_ZhT
## 2524 2371 2372
## cus_Discipline_Level0 cus_Discipline_Level0_ZhT cus_Code_Level1
## 1828 1829 1828
## cus_Major cus_BirthYear cus_DeathYear
## 1817 3716 3716
## cus_NameZH_Uniq cus_Univ_China_Orig Region
## 3714 3099 0
## cus_Birth_Country_ZhT
## 3716
sapply(phd_cus_all, function(x) sum(is.na(x)) / length(x) * 100)
## ID NameZH University
## 0.00000 0.00000 0.00000
## cus_Univ_Eng cus_Univ_For_ZhT IDCUS
## 25.18837 15.90420 0.00000
## cus_NameZH cus_NameZH_uniq cus_Name_Alt_Eng
## 0.00000 100.00000 99.97309
## cus_Name_Alt_Zh cus_Zi cus_Birth_Loc_ZhT
## 97.73950 100.00000 99.54252
## cus_Birth_Prov_ZhT cus_Gender cus_Univ_China_ZhT
## 80.92034 38.88590 76.82992
## cus_Univ_For_Py cus_City cus_Lat
## 89.90850 27.47578 32.21206
## cus_Long cus_Country_Eng cus_Discipline_Level2
## 32.58881 0.00000 67.92250
## cus_Discipline_Level2_ZhT cus_Discipline_Level1 cus_Discipline_Level1_ZhT
## 67.92250 63.80517 63.83208
## cus_Discipline_Level0 cus_Discipline_Level0_ZhT cus_Code_Level1
## 49.19268 49.21959 49.19268
## cus_Major cus_BirthYear cus_DeathYear
## 48.89666 100.00000 100.00000
## cus_NameZH_Uniq cus_Univ_China_Orig Region
## 99.94618 83.39612 0.00000
## cus_Birth_Country_ZhT
## 100.00000
As shown above, gender information is missing for approximately
39% of matched individuals, the province of birth is unreported in 81% of
cases, and the university previously attended in China is unknown for
over 83%. All three pieces of information (gender, province of birth, and
previous university in China) are available together for only 359
individuals, less than 10% of the initial population:
phd_cus_complete <- phd_cus_all[complete.cases(phd_cus_all[, c("cus_Gender", "cus_Birth_Prov_ZhT", "cus_Univ_China_Orig")]), ]
phd_cus_complete
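To make the logic of the completeness filter explicit, here is a minimal sketch on toy data. The column names are those of `phd_cus_all`; the values, and the objects `toy`, `key_cols`, `is_complete`, `n_complete`, and `share_complete`, are invented for illustration.

```r
# Toy records (hypothetical values) with the three key columns
toy <- data.frame(
  cus_Gender          = c("M", "F", NA, "M", "M"),
  cus_Birth_Prov_ZhT  = c("Jiangsu", NA, "Guangdong", "Zhejiang", "Jiangsu"),
  cus_Univ_China_Orig = c("Tsinghua", "Yenching", NA, NA, "Jiaotong"),
  stringsAsFactors = FALSE
)

key_cols <- c("cus_Gender", "cus_Birth_Prov_ZhT", "cus_Univ_China_Orig")

# TRUE only for rows where none of the three key columns is missing
is_complete <- complete.cases(toy[, key_cols])

n_complete     <- sum(is_complete)                    # number of complete rows
share_complete <- round(mean(is_complete) * 100, 1)   # share of all rows, in %

print(n_complete)
print(share_complete)
```

With the counts reported in the text, 359 complete records out of 3,716 matched individuals correspond to about 9.7%, which is consistent with the "less than 10%" stated above.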
Nevertheless, we have obtained valuable information about the
PhD holders that was not available in the original dissertation
catalog.
# phd_all_master is the master table of PhDs; it is not created in this script
# and is assumed to have been loaded beforehand
phd_all_master %>% count(Sex, sort = TRUE)
phd_all_master %>% count(Region, Sex) # by Region
Birth year data is available for a limited number of cases, primarily for PhDs awarded in the United States. The CUSD-OS dataset contributes additional birth years for just over 2,600 PhDs. To address remaining gaps, we calculated the average age at graduation based on known cases. Although there were some variations—particularly among the earliest PhDs, whose ages at graduation ranged from their early 20s to over 50—a general trend toward convergence and standardization is evident over time. Most individuals completed their degrees between the ages of 28 and 34.
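One plausible reading of this gap-filling step is sketched below on toy data: compute the age at graduation where the birth year is known, take the rounded mean, and subtract it from the degree year where the birth year is missing. `Degree_Year`, `Birth_year_corr`, and `Age_Grad` are column names used in this script; `Birth_year_est`, `Birth_estimated`, and all values are hypothetical, and the authors' exact procedure may differ.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  Degree_Year     = c(1920, 1931, 1935, 1948, 1950),
  Birth_year_corr = c(1893, 1901, NA, 1918, NA)
)

# Age at graduation for cases with a known birth year
toy$Age_Grad <- toy$Degree_Year - toy$Birth_year_corr

# Average age at graduation, based on known cases only
mean_age <- round(mean(toy$Age_Grad, na.rm = TRUE))

# Keep track of which birth years are estimates rather than documented values
toy$Birth_estimated <- is.na(toy$Birth_year_corr)

# Fill the gaps: estimated birth year = degree year - average age
toy$Birth_year_est <- ifelse(toy$Birth_estimated,
                             toy$Degree_Year - mean_age,
                             toy$Birth_year_corr)

print(toy)
```

Keeping a flag such as `Birth_estimated` makes it possible to exclude imputed values from analyses where they would artificially narrow the age distribution.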
library(hrbrthemes) # provides theme_ipsum()
phd_all_master %>% group_by(Birth_year_corr) %>% count(sort = TRUE)
# histogram of birth years, one bin per year
phd_all_master %>%
  drop_na(Birth_year_corr) %>%
  ggplot(aes(x = Birth_year_corr)) +
  geom_histogram(binwidth = 1, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
  theme_ipsum() +
  theme(plot.title = element_text(size = 15)) +
  labs(title = "Chinese PhDs Abroad (1905-1962)",
       subtitle = "Year of Birth",
       x = "Year",
       y = "Frequency",
       caption = "Based on Yuan T'ung-li's Guides to Doctoral Dissertations (1961, 1963, 1964)")
# Reshape to long format (one row per person and event) and convert years to numeric
generation <- phd_all_master %>%
  pivot_longer(cols = c(Birth_year_corr, Degree_Year, Death_year),
               names_to = "Event", values_to = "Year",
               values_transform = list(Year = as.numeric)) %>%
  drop_na(Year) %>%
  mutate(Event = factor(Event, levels = c("Birth_year_corr", "Degree_Year", "Death_year")))
# Plot
ggplot(generation, aes(x = Year , fill = Event)) +
geom_histogram(position = "identity", alpha = 0.8, bins = 30) +
labs(title = "Chinese PhDs in the US (1905-1962)",
subtitle = "Birth, Graduation, and Death Year",
x = "Year",
y = "Frequency",
caption = "Based on Yuan T'ung-li's Guides to Doctoral Dissertations") +
scale_fill_manual(
values = c("Birth_year_corr" = "light green", "Degree_Year" = "steelblue", "Death_year" = "red"),
labels = c("Birth", "Graduation", "Death")
) +
theme_minimal() +
theme(legend.position = "bottom")
# distribution of age at graduation (cases with a known birth year only)
phd_all_master %>%
  filter(!is.na(Birth_year_corr)) %>%
  count(Age_Grad) %>%
  ggplot(aes(Age_Grad, n)) +
  geom_col(alpha = 0.8) +
  geom_smooth(span = .25) +
  labs(title = "Chinese PhDs Abroad (1905-1962)",
       subtitle = "Age at graduation",
       x = "Age",
       y = "Frequency",
       caption = "Based on Yuan Tongli's Guides to Doctoral Dissertations")
# geom_hex() requires the hexbin package to be installed
phd_all_master %>%
  drop_na(Birth_year_corr, Age_Grad) %>%
  ggplot(aes(x = Birth_year_corr, y = Age_Grad)) +
  geom_hex(bins = 35) +
  scale_fill_continuous(type = "viridis") +
  theme_bw() +
  labs(title = "Chinese PhDs Abroad (1905-1962)",
       subtitle = "Age at graduation in relation to generation",
       x = "Year of birth",
       y = "Age",
       caption = "Based on Yuan Tongli's Guides to Doctoral Dissertations")
# province of birth; percentages are computed over the records with a known province (709)
phd_cus_all %>%
  drop_na(cus_Birth_Prov_ZhT) %>%
  count(cus_Birth_Prov_ZhT, sort = TRUE) %>%
  mutate(percent = round(n / sum(n) * 100, 2))
## USA
phd_cus_USA_match %>% group_by(cus_Birth_Prov_ZhT) %>% count(sort = TRUE) %>% mutate(percent = round(n/2315*100, 2))
## United Kingdom
phd_cus_UK_match %>% group_by(cus_Birth_Prov_ZhT) %>% count(sort = TRUE) %>% mutate(percent = round(n/288*100, 2))
## Europe
phd_cus_EUR_match %>% group_by(cus_Birth_Prov_ZhT) %>% count(sort = TRUE) %>% mutate(percent = round(n/1113*100, 2))
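The regional tables above use `phd_cus_USA_match`, `phd_cus_UK_match`, and `phd_cus_EUR_match`, which are not created in this script, together with hard-coded denominators. Assuming these objects are simply the subsets of `phd_cus_all` by `Region` (an assumption, not something the source states), the same tables can be sketched with computed denominators. The toy data and the helper `province_share()` are hypothetical.

```r
# Toy data (hypothetical values); column names follow phd_cus_all
toy <- data.frame(
  Region             = c("USA", "USA", "USA", "UK", "UK", "Europe"),
  cus_Birth_Prov_ZhT = c("Jiangsu", "Jiangsu", NA, "Guangdong", "Jiangsu", "Zhejiang"),
  stringsAsFactors = FALSE
)

# One data frame per region (assumed equivalent of the *_match objects)
by_region <- split(toy, toy$Region)

# Count provinces within one region; NA is kept as its own category and the
# percentage is taken over all records of that region, as in the code above
province_share <- function(df) {
  counts <- table(df$cus_Birth_Prov_ZhT, useNA = "ifany")
  out <- data.frame(
    province = names(counts),
    n        = as.integer(counts),
    stringsAsFactors = FALSE
  )
  out$percent <- round(out$n / nrow(df) * 100, 2)
  out[order(-out$n), ]
}

shares <- lapply(by_region, province_share)
print(shares$USA)
```

Computing the denominator from `nrow(df)` avoids having to update constants such as 2315, 288, or 1113 by hand if the matched subset changes.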
# load package
library(networkD3)
# select variables
birth_region <- phd_cus_all %>% select(cus_Birth_Prov_ZhT, Region) %>% na.omit()
# create links and compute weight
birth_region_link <- birth_region %>%
rename(source = cus_Birth_Prov_ZhT, target = Region) %>%
group_by(source, target) %>%
count() %>%
rename(value = "n")
# create nodes
birth_region_node <- data.frame(
name=c(as.character(birth_region_link$source),
as.character(birth_region_link$target)) %>% unique()
)
# compute zero-based node indices, as required by networkD3
birth_region_link$IDsource <- match(birth_region_link$source, birth_region_node$name)-1
birth_region_link$IDtarget <- match(birth_region_link$target, birth_region_node$name)-1
# Make the Network
p1 <- sankeyNetwork(Links = birth_region_link, Nodes = birth_region_node,
Source = "IDsource", Target = "IDtarget",
Value = "value", NodeID = "name",
fontSize = 14,
fontFamily = "Arial",
nodeWidth = 30,
sinksRight=FALSE)
p1
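# university attended in China; percentages are computed over the records
# with a known university (861, not the 709 used for provinces)
phd_cus_all %>%
  drop_na(cus_Univ_China_ZhT) %>%
  count(cus_Univ_China_ZhT, sort = TRUE) %>%
  mutate(percent = round(n / sum(n) * 100, 2))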
## USA
phd_cus_USA_match %>% drop_na(cus_Univ_China_ZhT) %>% group_by(cus_Univ_China_ZhT) %>% count(sort = TRUE) %>% mutate(percent = round(n/2315*100, 2))
## United Kingdom
phd_cus_UK_match %>% drop_na(cus_Univ_China_ZhT) %>% group_by(cus_Univ_China_ZhT) %>% count(sort = TRUE) %>% mutate(percent = round(n/288*100, 2))
## Europe
phd_cus_EUR_match %>% drop_na(cus_Univ_China_ZhT) %>% group_by(cus_Univ_China_ZhT) %>% count(sort = TRUE) %>% mutate(percent = round(n/1113*100, 2))
# select variables
china_us <- phd_cus_USA_match %>% select(cus_Univ_China_ZhT, cus_Univ_Eng) %>% na.omit()
# create links and compute weight
china_us_link <- china_us %>%
rename(source = cus_Univ_China_ZhT, target = cus_Univ_Eng) %>%
group_by(source, target) %>%
count() %>%
rename(value = "n")
# keep only links observed more than once, to keep the diagram readable
china_us_link <- china_us_link %>% filter(value > 1)
# create nodes
china_us_node <- data.frame(
name=c(as.character(china_us_link$source),
as.character(china_us_link$target)) %>% unique()
)
# compute zero-based node indices, as required by networkD3
china_us_link$IDsource <- match(china_us_link$source, china_us_node$name)-1
china_us_link$IDtarget <- match(china_us_link$target, china_us_node$name)-1
# Make the Network
p2 <- sankeyNetwork(Links = china_us_link, Nodes = china_us_node,
Source = "IDsource", Target = "IDtarget",
Value = "value", NodeID = "name",
fontSize = 14,
fontFamily = "Arial",
nodeWidth = 30,
sinksRight=FALSE)
p2
# select variables
china_europe <- phd_cus_EUR_match %>% select(cus_Univ_China_ZhT, cus_Univ_Eng) %>% na.omit()
# create links and compute weight
china_europe_link <- china_europe %>%
rename(source = cus_Univ_China_ZhT, target = cus_Univ_Eng) %>%
group_by(source, target) %>%
count() %>%
rename(value = "n")
# create nodes
china_europe_node <- data.frame(
name=c(as.character(china_europe_link$source),
as.character(china_europe_link$target)) %>% unique()
)
# compute zero-based node indices, as required by networkD3
china_europe_link$IDsource <- match(china_europe_link$source, china_europe_node$name)-1
china_europe_link$IDtarget <- match(china_europe_link$target, china_europe_node$name)-1
# Make the Network
p3 <- sankeyNetwork(Links = china_europe_link, Nodes = china_europe_node,
Source = "IDsource", Target = "IDtarget",
Value = "value", NodeID = "name",
fontSize = 14,
fontFamily = "Arial",
nodeWidth = 30,
sinksRight=FALSE)
p3
# select variables
china_uk <- phd_cus_UK_match %>% select(cus_Univ_China_ZhT, cus_Univ_Eng) %>% na.omit()
# create links and compute weight
china_uk_link <- china_uk %>%
rename(source = cus_Univ_China_ZhT, target = cus_Univ_Eng) %>%
group_by(source, target) %>%
count() %>%
rename(value = "n")
# create nodes
china_uk_node <- data.frame(
name=c(as.character(china_uk_link$source),
as.character(china_uk_link$target)) %>% unique()
)
# compute zero-based node indices, as required by networkD3
china_uk_link$IDsource <- match(china_uk_link$source, china_uk_node$name)-1
china_uk_link$IDtarget <- match(china_uk_link$target, china_uk_node$name)-1
# Make the Network
p4 <- sankeyNetwork(Links = china_uk_link, Nodes = china_uk_node,
Source = "IDsource", Target = "IDtarget",
Value = "value", NodeID = "name",
fontSize = 14,
fontFamily = "Arial",
nodeWidth = 30,
sinksRight=FALSE)
p4
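The four Sankey diagrams above repeat the same preparation steps: select two columns, drop incomplete pairs, count each source-target pair, build the node list, and compute zero-based indices. As a sketch only, these steps could be wrapped in a helper. `make_sankey_inputs()` and its arguments are hypothetical names, the toy data are invented, and the helper uses base R so that it can be checked without networkD3 installed.

```r
# Build the `links` and `nodes` tables expected by networkD3::sankeyNetwork().
# df: data frame; source_col / target_col: column names (character);
# min_value: smallest link weight to keep (2 reproduces `filter(value > 1)`).
make_sankey_inputs <- function(df, source_col, target_col, min_value = 1) {
  pairs <- as.data.frame(df[, c(source_col, target_col)])
  pairs <- pairs[complete.cases(pairs), , drop = FALSE]
  names(pairs) <- c("source", "target")
  pairs$value <- 1L

  # Weight of each link = number of individuals sharing the same pair
  links <- aggregate(value ~ source + target, data = pairs, FUN = sum)
  links <- links[links$value >= min_value, , drop = FALSE]

  # Unique node names, sources first, then targets
  nodes <- data.frame(
    name = unique(c(as.character(links$source), as.character(links$target))),
    stringsAsFactors = FALSE
  )

  # networkD3 expects zero-based indices into the node table
  links$IDsource <- match(links$source, nodes$name) - 1
  links$IDtarget <- match(links$target, nodes$name) - 1

  list(links = links, nodes = nodes)
}

# Toy example (hypothetical values); column names follow phd_cus_all
toy <- data.frame(
  cus_Univ_China_ZhT = c("A", "A", "B", NA, "B"),
  cus_Univ_Eng       = c("X", "X", "X", "Y", "Z"),
  stringsAsFactors = FALSE
)
inp <- make_sankey_inputs(toy, "cus_Univ_China_ZhT", "cus_Univ_Eng")
print(inp$links)

# With networkD3 loaded, the result can be passed on as in the code above:
# sankeyNetwork(Links = inp$links, Nodes = inp$nodes,
#               Source = "IDsource", Target = "IDtarget",
#               Value = "value", NodeID = "name")
```

A single helper keeps the four diagrams consistent, for instance if the `value > 1` threshold applied to the US links were to be used for the European and British ones as well.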
In the third and final script, we will attempt to trace the post-graduation careers of Chinese PhDs using external data from web sources such as Wikipedia and Baidu.
We extend our sincere gratitude to the Lee-Campbell Research Group for generously sharing their datasets prior to their publication.
Liang Chen 梁晨, Ren Yunzhu 任韵竹, and Li Zhongqing 李中清. Qishanlinzhe: Zhongguo Xiandai Zhishi Jieceng de Xingcheng yu Tezheng Yanjiu 启山林者:中国现代知识阶层的形成与特征研究,1912–1952 (The Enlightenment of the Mountains and Forests: A Study on the Formation and Characteristics of China’s Modern Intellectual Class). Forthcoming.
Liang Chen, Zhang Hao, Li Lan, Ruan Danqing, Cameron Campbell, and James Lee. Wusheng de Geming: Beijing Daxue, Suzhou Daxue de Xuesheng Shehui Laiyuan 《无声的革命: 北京大学、苏州大学的学生社会来源 1949–2002》 (Silent Revolution: The Social Origins of Peking University and Soochow University Undergraduates, 1949–2002). Beijing: Beijing Joint Publishing, 2013.
Yuan, T’ung-li. A Guide to Doctoral Dissertations by Chinese Students in America, 1905–1960. Washington, D.C.: Published under the auspices of the Sino-American Cultural Society, 1961.
Yuan, T’ung-li. Doctoral Dissertations by Chinese Students in Great Britain and Northern Ireland, 1916–1961. Taipei: Chinese Cultural Research Institute, 1963.
Yuan, T’ung-li. A Guide to Doctoral Dissertations by Chinese Students in Continental Europe, 1907–1962. Taipei: Chinese Culture Quarterly Review, 1964.