Chinese PhDs in the United States, Europe, and United Kingdom (1905-1962)

Episode 1: Mapping the Dissertations

Cécile Armand, Christian Henriot

2025-05-19

Abstract

This document is part of a series of three scripts developed to support a comprehensive study of early Chinese PhDs from 1905 to 1962. The study is based on a dataset derived from three catalogs of Chinese doctoral dissertations compiled by the librarian and bibliographer Yuan Tongli 袁同禮 (1895–1965), which cover the United States (1905–1960), the United Kingdom (1916–1961), and Continental Europe (1905–1962). The first script examines the dissertations themselves, while the second and third analyze the authors’ backgrounds and their post-graduation trajectories.

# load packages

library(readr)
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(FactoMineR)
library(Factoshiny)
library(networkD3)
library(BTM)
library(histtext)

Introduction

This document is part of a series of three scripts developed to support a comprehensive study of early Chinese PhDs from 1905 to 1962. It is based on the three catalogs of Chinese doctoral dissertations compiled by the librarian and bibliographer Yuan Tongli 袁同禮 (1895–1965), covering the United States (1905–1960), the United Kingdom (1916–1961), and Continental Europe (1905–1962). These scripts provide the complete data and code underlying our analysis and narrative—material that could not be included in the final published work—to ensure transparency and enable rigorous traceability of our methods for interested readers.

The documentation series follows the general structure of the paper and includes:

Script 1 analyzes the dissertations themselves, examining their temporal and geographical distribution, disciplinary majors, host institutions, and research topics. This analysis is based exclusively on the original dissertation catalog.
Script 2 focuses on the authors—the Chinese PhDs who produced these dissertations—exploring their backgrounds using additional external data compiled by the Lee-Campbell Research Group.
Script 3 investigates the post-graduation trajectories of these PhDs, particularly their experiences following the 1949 Communist Revolution. This analysis leverages biographical entries available in online sources such as Wikipedia and Baidu.

As illustrated by the figure and table below, we lost a significant number of individuals when incorporating external sources, resulting in a reduced sample size compared to our initial population. Nonetheless, the integration of qualitative biographical information enables us to draw more nuanced and in-depth insights into the life experiences of a substantial subset of PhDs. This approach ultimately supports robust empirical conclusions grounded in rich, contextual data.

Region	Matched in CUSD-OS	Matched (%)	Found in Baidu/Wikipedia	Found (%)
USA	2,315	83.4%	681	24.5%
Europe	1,113	71.3%	268	17.2%
UK	288	83.2%	130	37.6%
Total	3,716	79.4%	1,079	23.0%

The Early Chinese PhD Population Through Concentric Circles of Documentation
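
The percentages in the table imply the size of each regional population, which the table does not state directly. The following is a minimal sketch, assuming each percentage is the count divided by that region's full catalog population. It backs out approximate denominators and recomputes the second percentage column as a consistency check. The object names are illustrative, and the derived populations are rounded estimates rather than figures from the source.

```r
# Counts and percentages copied from the table above.
coverage <- data.frame(
  region      = c("USA", "Europe", "UK"),
  matched     = c(2315, 1113, 288),   # matched in CUSD-OS
  matched_pct = c(83.4, 71.3, 83.2),
  found       = c(681, 268, 130)      # found in Baidu/Wikipedia
)

# Assumption: pct = count / population, so population ~ count / (pct / 100).
coverage$implied_pop <- round(coverage$matched / (coverage$matched_pct / 100))

# Recompute the second percentage column from the implied populations.
coverage$found_pct <- round(100 * coverage$found / coverage$implied_pop, 1)

coverage
sum(coverage$implied_pop)  # implied overall population
```

The implied regional populations add up to about 4,683, slightly fewer than the 4,717 rows reported for the dataset. The difference may reflect rounding in the percentages or records left out of the matching step; the source does not say.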

Dataset

The dataset phd_all_master is a compilation of the three lists of doctors from the U.S., U.K. and Europe:

library(readr)
phd_all_master <- read_delim("Data/phd_all_master.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)

phd_all_master

names(phd_all_master)

##  [1] "ID"              "Srce_ID"         "NameWG"          "NameZH"         
##  [5] "NamePY"          "Sex"             "Birth_year_Src"  "Birth_year_corr"
##  [9] "Death_year"      "Age_Grad"        "Country"         "University"     
## [13] "School"          "Discipline_Srce" "Degree"          "Degree_Year"    
## [17] "Degree_Period2"  "Degree_Period"   "Thesis"          "Title_Src"      
## [21] "Discipline"      "Field"           "Region"          "City"           
## [25] "State"           "RID"             "Lat"             "Long"

The dataset contains 4,717 unique dissertations (rows) and 28 attributes describing these dissertations and their authors, including:

identifiers: unique individual (doctor) identifier (ID) and source identifier (Srce_ID)
biographical information on doctors: names, gender, year of birth, year of death
information on dissertations: age at graduation, region and country of graduation, university of graduation, discipline of graduation, year of graduation, title of dissertation (as in source, and translated)
spatial data: city of university, geographical coordinates
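
The structural claims above (one row per dissertation, 28 attributes) can be checked once the file is loaded. The sketch below runs on a small made-up stand-in for phd_all_master with only a few of its columns. The values are invented and the helper name `describe_table` is hypothetical. The key column is passed as an argument because the source does not state which identifier is unique per row.

```r
# Toy stand-in for phd_all_master (invented values, subset of columns).
toy <- data.frame(
  ID          = c("P001", "P002", "P003", "P004"),
  RID         = c("R001", "R002", "R003", "R004"),
  Region      = c("USA", "USA", "Europe", "UK"),
  Degree_Year = c(1921, 1935, NA, 1950),
  stringsAsFactors = FALSE
)

describe_table <- function(df, key) {
  list(
    n_rows     = nrow(df),
    n_cols     = ncol(df),
    key_unique = !anyDuplicated(df[[key]]),  # TRUE when no key value repeats
    missing    = colSums(is.na(df))          # NA count per column
  )
}

info <- describe_table(toy, key = "RID")
info
```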

Geographical Distribution

Geographical distribution of PhDs by world region and by country:

phd_all_master %>% group_by(Region) %>% count(sort = TRUE)

phd_all_master %>% group_by(Country) %>% count(sort = TRUE)

Timeline

When did they graduate?

phd_all_master %>% drop_na(Degree_Year) %>% 
  ggplot( aes(x=Degree_Year)) +
  geom_histogram( binwidth=1, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
  ggtitle("Bin size = 1") +
  theme_ipsum() +
  theme(
    plot.title = element_text(size=15)
  ) +
  labs(title = "Chinese PhDs Abroad (1905-1962)",
       subtitle = "Year of Graduation",
       x = "Year",
       y = "Frequency", 
       caption = "Source: Yuan T'ung-li's Guides to Doctoral Dissertations (1961, 1962, 1964)")

By Region

phd_all_master %>% 
  group_by(Region, Degree_Year) %>% 
  count() %>% 
  ggplot( aes(x=Degree_Year, y=n, group=Region, fill=Region)) +
  geom_area() +
  scale_fill_viridis(discrete = TRUE) +
  theme(legend.position="none") +
  labs(title = "Chinese PhDs Abroad (1905-1962)",
       subtitle = "Year of Graduation",
       x = "Year",
       y = "Frequency", 
       caption = "Source: Yuan T'ung-li's Guides to Doctoral Dissertations (1961, 1962, 1964)") +
  theme_ipsum() +
  theme(
    legend.position="none",
    panel.spacing = unit(0, "lines"),
    strip.text.x = element_text(size = 8),
    plot.title = element_text(size=13)
  ) +
  facet_wrap(~Region, ncol = 1)

Periodization

phd_all_master %>% group_by(Degree_Period) %>% count(sort = TRUE) # taking the Nationalist takeover as a cut-off

phd_all_master %>% group_by(Degree_Period2) %>% count(sort = TRUE) # taking end of WWI as a cut-off
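
Degree_Period is stored ready-made in the dataset. Assuming it was derived from Degree_Year using the four intervals that appear as labels elsewhere in this script (1905-1926, 1927-1937, 1938-1951, 1952-1962), the binning could be reproduced as in the sketch below. The breaks for Degree_Period2 are not given in the source, so they are not reconstructed here.

```r
# Boundary years of each period, used to check the binning.
years <- c(1905, 1926, 1927, 1937, 1938, 1951, 1952, 1962)

period <- cut(
  years,
  breaks = c(1904, 1926, 1937, 1951, 1962),  # right-closed intervals
  labels = c("1905-1926", "1927-1937", "1938-1951", "1952-1962")
)

data.frame(years, period)
table(period)
```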

Universities

From which university did they graduate?

Distribution by institution, as well as distribution of institutions by region, country, and period of graduation:

phd_all_master %>% group_by(University) %>% count(sort = TRUE)

phd_all_master %>% group_by(Region, University) %>% count(sort = TRUE) # by region

phd_all_master %>% group_by(Country, University) %>% count(sort = TRUE) # by country

phd_all_master %>% group_by(Degree_Period, University) %>% count() %>% arrange(Degree_Period, desc(n)) # by period

Cities (GIS)

library(leaflet)

map_all <- phd_all_master %>% group_by(City, Lat, Long) %>% 
  count() %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers( radius = ~log(n)*3,
                    label = ~City,
                    color = "white",
                    weight = 1,
                    opacity = 0.6,
                    fill = TRUE,
                    fillColor = "red",
                    fillOpacity = 0.9,
                    stroke = TRUE,
                    popup = ~paste( "City:", City ,
                                    "<br>",
                                    "Chinese PhDs, 1905-1962:", n))

# add legend 

legend_html <- "
<div style='background-color: white; padding: 10px;'>
  <h4>Chinese PhDs, 1905-1962</h4>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: red; width: 20px; height: 20px; border-radius: 50%; margin-right: 5px;'></div>
    <span>10</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: red; width: 25px; height: 25px; border-radius: 50%; margin-right: 5px;'></div>
    <span>20</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: red; width: 30px; height: 30px; border-radius: 50%; margin-right: 5px;'></div>
    <span>50</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: red; width: 35px; height: 35px; border-radius: 50%; margin-right: 5px;'></div>
    <span>100</span>
  </div>
</div>
"

# Add the custom legend to the map
map_all %>% 
  addControl(html = legend_html, position = "bottomright")
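
The marker radius is set to log(n)*3, which compresses the gap between small and large cities. A quick check of the values this formula produces, sketched here with illustrative counts, also shows that a city with a single dissertation gets a radius of zero. The log1p variant is only a possible alternative and is not what the maps in this script use.

```r
radius_log   <- function(n) log(n) * 3    # formula used in addCircleMarkers()
radius_log1p <- function(n) log1p(n) * 3  # hypothetical alternative: log(n + 1) * 3

n <- c(1, 10, 20, 50, 100)
data.frame(
  n,
  radius_log   = round(radius_log(n), 2),
  radius_log1p = round(radius_log1p(n), 2)
)
```

The legend circles are drawn at fixed widths of 20 to 35 px, so they indicate an order of magnitude and do not correspond exactly to these radii.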

1905-1926

map1 <- phd_all_master %>% 
  filter(Degree_Period == "1905-1926") %>% 
  group_by(City, Lat, Long) %>% 
  count() %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers( radius = ~log(n)*3,
                    label = ~City,
                    color = "white",
                    weight = 1,
                    opacity = 0.6,
                    fill = TRUE,
                    fillColor = "orange",
                    fillOpacity = 0.9,
                    stroke = TRUE,
                    popup = ~paste( "City:", City ,
                                    "<br>",
                                    "Chinese PhDs, 1905-1926:", n)) 


legend1 <- "
<div style='background-color: white; padding: 10px;'>
  <h4>Chinese PhDs, 1905-1926</h4>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: orange; width: 20px; height: 20px; border-radius: 50%; margin-right: 5px;'></div>
    <span>10</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: orange; width: 25px; height: 25px; border-radius: 50%; margin-right: 5px;'></div>
    <span>20</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: orange; width: 30px; height: 30px; border-radius: 50%; margin-right: 5px;'></div>
    <span>50</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: orange; width: 35px; height: 35px; border-radius: 50%; margin-right: 5px;'></div>
    <span>100</span>
  </div>
</div>
"

# Add the custom legend to the map
map1 %>% 
  addControl(html = legend1, position = "bottomright")

1927-1937

map2 <- phd_all_master %>% 
  filter(Degree_Period == "1927-1937") %>% 
  group_by(City, Lat, Long) %>% 
  count() %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers( radius = ~log(n)*3,
                    label = ~City,
                    color = "white",
                    weight = 1,
                    opacity = 0.6,
                    fill = TRUE,
                    fillColor = "blue",
                    fillOpacity = 0.9,
                    stroke = TRUE,
                    popup = ~paste( "City:", City ,
                                    "<br>",
                                    "Chinese PhDs, 1927-1937:", n)) 


legend2 <- "
<div style='background-color: white; padding: 10px;'>
  <h4>Chinese PhDs, 1927-1937</h4>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: blue; width: 20px; height: 20px; border-radius: 50%; margin-right: 5px;'></div>
    <span>10</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: blue; width: 25px; height: 25px; border-radius: 50%; margin-right: 5px;'></div>
    <span>20</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: blue; width: 30px; height: 30px; border-radius: 50%; margin-right: 5px;'></div>
    <span>50</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: blue; width: 35px; height: 35px; border-radius: 50%; margin-right: 5px;'></div>
    <span>100</span>
  </div>
</div>
"

# Add the custom legend to the map
map2 %>% 
  addControl(html = legend2, position = "bottomright")

1938-1951

map3 <- phd_all_master %>% 
  filter(Degree_Period == "1938-1951") %>% 
  group_by(City, Lat, Long) %>% 
  count() %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers( radius = ~log(n)*3,
                    label = ~City,
                    color = "white",
                    weight = 1,
                    opacity = 0.6,
                    fill = TRUE,
                    fillColor = "darkgreen",
                    fillOpacity = 0.9,
                    stroke = TRUE,
                    popup = ~paste( "City:", City ,
                                    "<br>",
                                    "Chinese PhDs, 1938-1951:", n)) 


legend3 <- "
<div style='background-color: white; padding: 10px;'>
  <h4>Chinese PhDs, 1938-1951</h4>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: darkgreen; width: 20px; height: 20px; border-radius: 50%; margin-right: 5px;'></div>
    <span>10</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: darkgreen; width: 25px; height: 25px; border-radius: 50%; margin-right: 5px;'></div>
    <span>20</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: darkgreen; width: 30px; height: 30px; border-radius: 50%; margin-right: 5px;'></div>
    <span>50</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: darkgreen; width: 35px; height: 35px; border-radius: 50%; margin-right: 5px;'></div>
    <span>100</span>
  </div>
</div>
"

# Add the custom legend to the map
map3 %>% 
  addControl(html = legend3, position = "bottomright")

1952-1962

map4 <- phd_all_master %>% 
  filter(Degree_Period == "1952-1962") %>% 
  group_by(City, Lat, Long) %>% 
  count() %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers( radius = ~log(n)*3,
                    label = ~City,
                    color = "white",
                    weight = 1,
                    opacity = 0.6,
                    fill = TRUE,
                    fillColor = "purple",
                    fillOpacity = 0.9,
                    stroke = TRUE,
                    popup = ~paste( "City:", City ,
                                    "<br>",
                                    "Chinese PhDs, 1952-1962:", n)) 


legend4 <- "
<div style='background-color: white; padding: 10px;'>
  <h4>Chinese PhDs, 1952-1962</h4>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: purple; width: 20px; height: 20px; border-radius: 50%; margin-right: 5px;'></div>
    <span>10</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: purple; width: 25px; height: 25px; border-radius: 50%; margin-right: 5px;'></div>
    <span>20</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: purple; width: 30px; height: 30px; border-radius: 50%; margin-right: 5px;'></div>
    <span>50</span>
  </div>
  <div style='display: flex; align-items: center;'>
    <div style='background-color: purple; width: 35px; height: 35px; border-radius: 50%; margin-right: 5px;'></div>
    <span>100</span>
  </div>
</div>
"

# Add the custom legend to the map
map4 %>% 
  addControl(html = legend4, position = "bottomright")
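
The five legends differ only in title and color. A small helper could generate the markup instead of repeating it. The function below is a hypothetical sketch (the name `make_legend` and its arguments are not part of the original script), built with base R string functions and reusing the sizes and labels from the blocks above.

```r
make_legend <- function(title, color,
                        sizes  = c(20, 25, 30, 35),     # circle width/height in px
                        labels = c(10, 20, 50, 100)) {  # counts shown next to circles
  # One flex row per circle; sprintf() is vectorized over sizes and labels.
  rows <- sprintf(
    paste0(
      "<div style='display: flex; align-items: center;'>",
      "<div style='background-color: %s; width: %dpx; height: %dpx; ",
      "border-radius: 50%%; margin-right: 5px;'></div>",
      "<span>%d</span></div>"
    ),
    color, as.integer(sizes), as.integer(sizes), as.integer(labels)
  )
  paste0(
    "<div style='background-color: white; padding: 10px;'>",
    "<h4>", title, "</h4>",
    paste(rows, collapse = ""),
    "</div>"
  )
}

legend1 <- make_legend("Chinese PhDs, 1905-1926", "orange")
cat(legend1)
```

The resulting string can be passed to addControl(html = ..., position = "bottomright") in the same way as the hand-written legends.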

Disciplines

# all regions included 
phd_all_master %>% group_by(Field) %>% count(sort = TRUE)

phd_all_master %>% group_by(Discipline) %>% count(sort = TRUE)

# by regions  

phd_all_master %>% group_by(Region) %>% count(Field) %>% arrange(Region, desc(n))

phd_all_master %>% group_by(Region) %>% count(Discipline) %>% arrange(Region, desc(n))

# by periods  

phd_all_master %>% group_by(Degree_Period) %>% count(Field) %>% arrange(Degree_Period, desc(n))

phd_all_master %>% group_by(Degree_Period) %>% count(Discipline) %>% arrange(Degree_Period, desc(n))

Correspondence Analyses

Region and Field (CA)

region_field <- phd_all_master %>% 
  drop_na(Field) %>%
  select(Field, Region) %>% 
  group_by(Region, Field) %>% 
  tally() %>% 
  spread(key = Field, value = n) 

# read first column as row names

region_field <- column_to_rownames(region_field, var = "Region") 

# replace NA values with 0

region_field <- mutate_all(region_field, ~replace(., is.na(.), 0))

# load packages for CA

library(FactoMineR)

res.ca1<-CA(region_field,graph=FALSE)
plot.CA(res.ca1,cex=0.9,cex.main=0.9,cex.axis=0.9,title="Chinese PhDs: Region and Field of Study")
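
Correspondence analysis decomposes the departure of a contingency table from independence. The base R sketch below uses an invented Region-by-Field table (the records are made up, not taken from the dataset). It shows that the total inertia distributed across the CA axes equals the chi-square statistic divided by the table total, and that the squared singular values of the standardized residuals give the inertia per axis.

```r
# Invented toy records; the real analysis uses phd_all_master.
toy <- data.frame(
  Region = c("USA", "USA", "USA", "Europe", "Europe", "UK", "UK", "UK"),
  Field  = c("Sciences", "Sciences", "Humanities", "Humanities",
             "Medicine", "Sciences", "Medicine", "Humanities")
)

tab <- unclass(table(toy$Region, toy$Field))  # contingency table as a plain matrix
n   <- sum(tab)
P   <- tab / n                                # correspondence matrix
E   <- outer(rowSums(P), colSums(P))          # expected proportions under independence
S   <- (P - E) / sqrt(E)                      # standardized residuals

inertia  <- sum(S^2)                          # total inertia
axis_eig <- svd(S)$d^2                        # inertia carried by each axis
chi2     <- sum((tab - n * E)^2 / (n * E))    # Pearson chi-square statistic

c(inertia = inertia, chi2_over_n = chi2 / n)
round(axis_eig, 4)
```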

Country and Discipline (CA)

country_disc <- phd_all_master %>% 
  drop_na(Discipline) %>%
  select(Discipline, Country) %>% 
  group_by(Country, Discipline) %>% 
  tally() %>% 
  spread(key = Discipline, value = n) 

# read first column as row names

country_disc <- column_to_rownames(country_disc, var = "Country") 

# replace NA values with 0

country_disc <- mutate_all(country_disc, ~replace(., is.na(.), 0))

# load packages for CA

library(factoextra)

res.ca2<-CA(country_disc,graph=FALSE)

# Get the contributions of the columns (disciplines)
col_contrib <- get_ca_col(res.ca2)$contrib


# For example, highlight the top 15 contributing disciplines (first dimension)
top_disciplines <- names(sort(col_contrib[,1], decreasing = TRUE))[1:15]

fviz_ca_biplot(res.ca2, 
               repel = TRUE, 
               title = "Chinese PhDs: Country and Discipline",
               cex = 0.5, 
               cex.main = 0.8, 
               cex.axis = 1,
               col.row = "blue",               # Set country labels and dots to blue
               col.col = "red",              # Set discipline labels and dots to red
               pointsize.col = col_contrib[,1]/max(col_contrib[,1]) * 5,  # Scale discipline dot sizes
               label = "all",                  # Show labels for both rows and columns
               select.col = list(name = top_disciplines))  # Limit discipline labels to top contributors

Region, Period, Field (MCA)

mca_field_period <- phd_all_master %>% distinct(RID, Region, Degree_Period2, Field) %>% drop_na(RID, Region, Degree_Period2, Field) %>% rename(Period = Degree_Period2)
mca_field_period <- column_to_rownames(mca_field_period, "RID")

res.MCA<-MCA(mca_field_period,graph=FALSE)
plot.MCA(res.MCA, choix='var',title="Variables Plot",col.var=c(1,2,3))

# Create the MCA plot
mca_plot <- fviz_mca_var(res.MCA,
                         col.var = c("black", "black", "black", "red", "red", "red", "red", "red", 
                                     "lightgreen", "lightgreen", "lightgreen", "lightgreen", "lightgreen", "lightgreen"),  # Updated colors
                         repel = TRUE,   # Avoid overlapping labels
                         title = "Chinese PhDs: Region, Field, and Graduation Period",  # Title
                         label = "var")  # Show only variable labels

# Customize the legend using ggplot2
mca_plot + 
  scale_color_manual(
    values = c("black" = "black", "lightgreen" = "lightgreen", "red" = "red"),  # Map colors
    labels = c("Region", "Field", "Period")  # Custom legend labels
  ) +
  guides(color = guide_legend(title = "Variables"))

library(explor)

explor(res.MCA)

res <- explor::prepare_results(res.MCA)
explor::MCA_var_plot(res, xax = 1, yax = 2, var_sup = FALSE, var_sup_choice = ,
                     var_lab_min_contrib = 0, col_var = "Variable", symbol_var = NULL, size_var = "Cos2",
                     size_range = c(52.5, 700), labels_size = 10, point_size = 56, transitions = TRUE,
                     labels_positions = NULL, labels_prepend_var = FALSE, xlim = c(-2.7, 2.53),
                     ylim = c(-2.04, 3.19))
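
The MCA input above is built with distinct() followed by column_to_rownames(). Row names must be unique, while distinct() keeps every distinct combination of RID, region, period, and field, so a record listed under two fields would survive as two rows and make the row-name step fail. Whether such records exist in the data is not stated. The base R sketch below, on invented rows, only illustrates how one might check for them beforehand.

```r
# Invented rows: R002 appears with two different fields.
toy <- data.frame(
  RID    = c("R001", "R002", "R002", "R003"),
  Region = c("USA", "Europe", "Europe", "UK"),
  Period = c("1927-1937", "1927-1937", "1927-1937", "1938-1951"),
  Field  = c("Sciences", "Medicine", "Sciences", "Humanities")
)

toy <- unique(toy)                               # same role as distinct()
dup_ids <- unique(toy$RID[duplicated(toy$RID)])  # identifiers that would break row names
dup_ids

# One possible policy (an assumption, not the authors'): keep the first row per RID.
toy_unique <- toy[!duplicated(toy$RID), ]
rownames(toy_unique) <- toy_unique$RID
toy_unique$RID <- NULL
toy_unique
```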

Region, Period, Discipline

mca_disc_period <- phd_all_master %>% distinct(RID, Region, Degree_Period2, Discipline) %>% drop_na(RID, Region, Degree_Period2, Discipline) %>% rename(Period = Degree_Period2)
mca_disc_period <- column_to_rownames(mca_disc_period, "RID")

res.MCA2<-MCA(mca_disc_period,graph=FALSE)
plot.MCA(res.MCA2, choix='var',title="Variables Plot",col.var=c(1,2,3))

res <- explor::prepare_results(res.MCA2)

Research Topics

# filter theses in the Humanities and Social Sciences (1620)

phd_shs <- phd_all_master %>% filter(Field == "Humanities and social sciences") %>% drop_na(Thesis)

Term Frequencies

library(tidyverse)
library(tidytext)

# clean titles 

phd_shs <- phd_shs %>% 
  mutate(title = str_replace_all(Thesis, "[:digit:]", "")) %>% # remove digits
  mutate(title = str_remove(title, "^The\\b")) %>% # remove leading article only (keeps "Theory")
  mutate(title = str_remove(title, "^An?\\b")) %>% # remove leading "A"/"An" only (keeps "American")
  mutate(title = str_replace_all(title, "- ", " ")) %>% 
  mutate(title = str_replace_all(title, "'s", "")) %>% 
  mutate(title = str_replace_all(title, "\\, ", " ")) %>% 
  mutate(title = str_replace_all(title, "China -", "China ")) %>% 
  mutate(title = str_remove(title, "-$")) %>% # remove hyphen at end of string 
  mutate(title = str_remove(title, "\\.$")) %>% # remove period at end of string
  mutate(title = trimws(title, "both")) %>%  
  mutate(title = str_squish(title)) %>% 
  relocate(title, .after= Thesis) %>% 
  mutate(published = str_extract(title, "Published")) %>% # flag titles marked as published
  relocate(published, .after= title) 
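
An anchored pattern such as ^The or ^A matches the first letters of any title, not only a leading article. The sketch below uses base R sub() on invented titles to show the difference a word boundary makes. The titles and helper names are illustrative.

```r
titles <- c("The Chinese family", "Theory of value in China",
            "A study of rural credit", "American policy in China")

# Without a word boundary, "Theory" and "American" lose their first letters.
strip_naive <- function(x) sub("^A", "", sub("^The", "", x))

# \\b requires the match to end at a word boundary; \\s* drops the following space.
strip_word <- function(x) sub("^(The|An?)\\b\\s*", "", x, perl = TRUE)

data.frame(title = titles,
           naive = strip_naive(titles),
           bounded = strip_word(titles))
```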

# tokenization (unigram)

library(tidytext) 

data("stop_words")

shs_unigram <- phd_shs %>% 
  unnest_tokens(output = word, input = title) %>% 
  anti_join(stop_words, by = "word")  # remove stop words 

shs_unigram <- shs_unigram %>% mutate(lgth = nchar(word))

shs_unigram_count <- shs_unigram %>% 
  group_by(word) %>% 
  tally() %>% 
  arrange(desc(n))

shs_unigram_count

# tf_idf by disciplines

shs_tf_idf_discipline <- shs_unigram %>% 
  count(Discipline, word)  %>%
  bind_tf_idf(word, Discipline, n) %>%
  arrange(desc(tf_idf))

shs_tf_idf_discipline %>% 
  group_by(Discipline) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(tf_idf, word, fill = Discipline)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Discipline, scales = "free") +
  labs(x = "tf-idf", y = "word", 
       title = "Highest tf-idf words in the dissertation titles", 
       subtitle = "tf-idf by discipline", 
       caption = "Source: Yuan, T’ung-li (1961, 1962)")

# tf-idf by period 

shs_tf_idf_period <- shs_unigram %>% filter(lgth >3) %>% 
  filter(!word == "preface") %>% 
  count(Degree_Period, word)  %>%
  bind_tf_idf(word, Degree_Period, n) %>%
  arrange(desc(tf_idf))

shs_tf_idf_period %>% 
  group_by(Degree_Period) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(tf_idf, word, fill = Degree_Period)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Degree_Period, scales = "free") +
  labs(x = "tf-idf", y = "word", 
       title = "Highest tf-idf words in the dissertation titles", 
       subtitle = "tf-idf by period", 
       caption = "Source: Yuan, T’ung-li (1961, 1962)")

# tf-idf by Region 

shs_tf_idf_region <- shs_unigram %>%  
  count(Region, word)  %>%
  bind_tf_idf(word, Region, n) %>%
  arrange(desc(tf_idf))

shs_tf_idf_region %>% 
  group_by(Region) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(tf_idf, word, fill = Region)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Region, scales = "free") +
  labs(x = "tf-idf", y = "word", 
       title = "Highest tf-idf words in the dissertation titles", 
       subtitle = "tf-idf by region", 
       caption = "Source: Yuan, T’ung-li (1961, 1962)")
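
For reference, the tf-idf score computed by bind_tf_idf() is a term's share of all tokens in a group, multiplied by the natural log of the number of groups divided by the number of groups containing the term. Below is a worked base R sketch on invented counts (the words and numbers are illustrative only).

```r
# One row per (group, word) pair, as produced by count(Discipline, word).
counts <- data.frame(
  Discipline = c("Law", "Law", "Law", "Education", "Education", "Economics", "Economics"),
  word       = c("china", "treaty", "court", "china", "school", "china", "trade"),
  n          = c(4, 3, 1, 5, 5, 2, 6)
)

n_docs <- length(unique(counts$Discipline))  # number of groups ("documents")

# Term frequency: share of the group's tokens.
counts$tf <- counts$n / ave(counts$n, counts$Discipline, FUN = sum)

# Document frequency: each pair occurs once, so counting rows per word works.
doc_freq <- table(counts$word)
counts$idf    <- log(n_docs / as.numeric(doc_freq[counts$word]))
counts$tf_idf <- counts$tf * counts$idf

counts[order(-counts$tf_idf), ]
```

A word such as "china" that occurs in every group receives a score of zero, which is why ubiquitous terms do not appear among the top-ranked words.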

Semantic Networks

library(widyr) 

word_pairs <- shs_unigram %>%
  pairwise_count(word, RID, sort = TRUE) 

word_pairs_filtered <- word_pairs %>% filter(!item1 == "study") %>% filter(!item2 == "study")

set.seed(2024)

library(igraph)
library(tidygraph)
library(ggraph)

word_pairs_filtered %>%
  filter(n > 5) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "orange", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, point.padding  = unit(0.2, "lines")) +
  theme_void()+
  labs(title = "Word co-occurrences in dissertation titles", 
       subtitle = "Most frequent pairs (n>5)", 
       caption = "Source: Yuan, T’ung-li (1961, 1962)")
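
pairwise_count() counts, for each pair of words, the number of titles (identified here by RID) in which both occur. The same quantity can be obtained from a title-by-word incidence matrix, as in this base R sketch on invented tokens.

```r
tokens <- data.frame(
  RID  = c(1, 1, 1, 2, 2, 3, 3),
  word = c("china", "law", "treaty", "china", "law", "china", "trade")
)

# Title-by-word incidence matrix: 1 if the word occurs in the title, else 0.
m <- unclass(table(tokens$RID, tokens$word)) > 0
storage.mode(m) <- "numeric"

co <- crossprod(m)  # word-by-word matrix: number of titles shared by each pair
diag(co) <- 0       # drop self-pairs
co
```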

# Create a two-mode network linking titles and the words they contain 

edge_2m <- shs_unigram %>% select(Thesis, word)
node_title <- edge_2m %>% distinct(Thesis) %>% rename(Name = Thesis) %>% mutate(Type = "Title")
node_word <- edge_2m %>% distinct(word) %>% rename(Name = word) %>% mutate(Type = "Word")
node <- bind_rows(node_title, node_word)


library(igraph)
title_net <- graph_from_data_frame(edge_2m, directed = FALSE, vertices = node)


# Community detection with Louvain

set.seed(2025)
title_net_cluster <- cluster_louvain(title_net)

# Extract community membership 

title_net_cluster_df <- data.frame(title_net_cluster$membership,
                                   title_net_cluster$names) %>% 
  group_by(title_net_cluster.membership) %>% 
  add_tally() %>% # add size of clusters
  rename(member = title_net_cluster.names, cluster = title_net_cluster.membership, size = n)

## add node type and degree centrality

node_type <- node %>% rename(member = Name)
title_net_cluster_df <- left_join(title_net_cluster_df, node_type)

deg_raw <- degree(title_net, mode = "all", normalized = FALSE) # separate name avoids masking igraph's degree()
degree_df <- data.frame(degree = deg_raw)
degree_df <- rownames_to_column(degree_df, "member")

title_net_cluster_df <- left_join(title_net_cluster_df, degree_df)

# Plot the networks

# Index node size on degree centrality 

deg_cent <- degree(title_net, mode = "all", normalized = TRUE)
V(title_net)$size <- deg_cent * 100 

# Index node color on their type

V(title_net)$color <- ifelse(V(title_net)$Type == "Title", "red", "orange")

plot(title_net, 
     vertex.size = V(title_net)$size,
     vertex.color = V(title_net)$color,
     vertex.label.color = "black",
     vertex.label.cex = V(title_net)$size/10,
     main = "Dissertation Title Network")

# Plot the communities

V(title_net)$group <- title_net_cluster$membership
V(title_net)$color <- title_net_cluster$membership 

plot(title_net_cluster, title_net, 
     vertex.label=NA,
     vertex.label.color = "black", 
     vertex.label.cex = 0.5, 
     vertex.size=1.8,
     main="Semantic Communities", 
     sub = "Louvain method")
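
Node sizes in the first network plot above are based on normalized degree. With normalized = TRUE, igraph divides each node's degree by the number of other nodes in the graph (N - 1). Below is a base R sketch of that arithmetic on an invented two-mode edge list, assuming each title-word pair is listed only once.

```r
# Invented edges linking titles (T1 to T3) to the words they contain.
edges <- data.frame(
  title = c("T1", "T1", "T2", "T2", "T3"),
  word  = c("china", "law", "china", "trade", "china")
)

nodes <- unique(c(edges$title, edges$word))

# Degree = number of edges touching each node (titles and words alike).
deg      <- table(factor(c(edges$title, edges$word), levels = nodes))
deg_raw  <- setNames(as.numeric(deg), names(deg))
deg_norm <- deg_raw / (length(nodes) - 1)  # divide by N - 1

round(deg_norm * 100, 1)  # the script multiplies by 100 to obtain plotting sizes
```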

Biterm Topic Modeling

library(BTM)

x <- shs_unigram %>% filter(!word == "study") %>% 
  select(RID, word)


set.seed(2025)
model10  <- BTM(x, k = 10, beta = 0.01, iter = 1000, trace = 100)

model15  <- BTM(x, k = 15, beta = 0.01, iter = 1000, trace = 100)

model20  <- BTM(x, k = 20, beta = 0.01, iter = 1000, trace = 100)

model25  <- BTM(x, k = 25, beta = 0.01, iter = 1000, trace = 100)
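
A biterm topic model is fitted on word pairs (biterms) that co-occur within the same short text rather than on per-document word counts, which suits titles of only a few words. The sketch below is a rough illustration, assuming every unordered pair of tokens within a title counts as a biterm; the package's own extraction may differ in detail. It lists the pairs for invented titles, and the helper name `biterms` is hypothetical.

```r
biterms <- function(tokens) {
  # A title with fewer than two tokens yields no pairs.
  if (length(tokens) < 2) {
    return(data.frame(term1 = character(0), term2 = character(0)))
  }
  pairs <- t(combn(tokens, 2))  # all unordered pairs, in title order
  data.frame(term1 = pairs[, 1], term2 = pairs[, 2])
}

biterms(c("chinese", "family", "law"))
nrow(biterms(c("economic", "development", "china", "industrialization")))  # choose(4, 2)
```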

# Most frequent terms and scores for each model

# model10$theta
# terms(model10, top_n = 10)
# topicterms10
# scores <- predict(model10, newdata = x)
# head(scores, 10)
# hist(scores)


# model20$theta
# terms(model20)
# model20 <- terms(model20, top_n = 10)
# topicterms10
# scores <- predict(model10, newdata = x)
# head(scores, 10)
# hist(scores)

# Most frequent terms for each model 

library(BTM)

## 10-topic model 
terms(model10, top_n = 10)

## [[1]]
##         token probability
## 1       china  0.02625494
## 2   political  0.02450475
## 3    economic  0.02341089
## 4     chinese  0.01991052
## 5   influence  0.01597261
## 6         sen  0.01509752
## 7         sun  0.01509752
## 8         yat  0.01509752
## 9      social  0.01487875
## 10 philosophy  0.01465997
## 
## [[2]]
##          token probability
## 1      chinese 0.032772745
## 2    political 0.018459858
## 3       france 0.015823274
## 4      century 0.013563344
## 5  comparative 0.013563344
## 6      england 0.012810034
## 7          mid 0.012056724
## 8       school 0.010926759
## 9    religious 0.009796795
## 10       china 0.009420140
## 
## [[3]]
##          token probability
## 1        china  0.03591563
## 2      chinese  0.03536648
## 3    education  0.02723903
## 4      special  0.01944107
## 5       united  0.01867226
## 6    reference  0.01713464
## 7       school  0.01493803
## 8      program  0.01274142
## 9     students  0.01208244
## 10 educational  0.01142346
## 
## [[4]]
##            token probability
## 1        chinese  0.07174477
## 2            law  0.04844577
## 3          china  0.02537746
## 4         family  0.01153647
## 5        century  0.01130578
## 6         income  0.01130578
## 7    comparative  0.01084442
## 8            tax  0.01038305
## 9     individual  0.01015237
## 10 international  0.01015237
## 
## [[5]]
##                token probability
## 1              china  0.04161432
## 2            chinese  0.02801051
## 3            central  0.02187546
## 4         government  0.01867456
## 5               asia  0.01814108
## 6        development  0.01814108
## 7              local  0.01787434
## 8  industrialization  0.01760760
## 9            special  0.01387322
## 10         reference  0.01333974
## 
## [[6]]
##           token probability
## 1       chinese  0.02240356
## 2        theory  0.01728348
## 3     reference  0.01568345
## 4  experimental  0.01504344
## 5      learning  0.01440343
## 6        united  0.01440343
## 7    psychology  0.01344342
## 8       reading  0.01312341
## 9       special  0.01216339
## 10     relation  0.01152338
## 
## [[7]]
##            token probability
## 1          china  0.06767850
## 2  international  0.02494593
## 3        chinese  0.02463017
## 4        foreign  0.02178835
## 5      relations  0.02063057
## 6           sino  0.01515743
## 7         policy  0.01473642
## 8          trade  0.01231561
## 9        british  0.01157884
## 10      japanese  0.01115783
## 
## [[8]]
##          token probability
## 1        china 0.079265500
## 2      chinese 0.041746170
## 3    education 0.026704637
## 4          law 0.022648493
## 5       social 0.022310481
## 6       modern 0.015550241
## 7      ancient 0.011325091
## 8   philosophy 0.010987079
## 9       feudal 0.009128014
## 10 comparative 0.008959008
## 
## [[9]]
##           token probability
## 1         china  0.04866493
## 2       chinese  0.02296422
## 3         labor  0.02022562
## 4         rural  0.01875099
## 5       service  0.01811900
## 6        united  0.01411643
## 7        church  0.01306312
## 8  organization  0.01264180
## 9      economic  0.01137783
## 10          war  0.01116717
## 
## [[10]]
##            token probability
## 1            law  0.03367361
## 2  international  0.02540902
## 3         london  0.02020539
## 4          china  0.01592005
## 5        britain  0.01530786
## 6      political  0.01438957
## 7       american  0.01408347
## 8          basis  0.01285909
## 9    supervision  0.01194080
## 10        united  0.01194080

## 15-topic model 
terms(model15, top_n = 10)

## [[1]]
##          token probability
## 1      chinese  0.04227906
## 2          law  0.04086991
## 3          god  0.01926302
## 4         idea  0.01832359
## 5     doctrine  0.01691444
## 6           li  0.01597501
## 7    evolution  0.01456587
## 8     marriage  0.01456587
## 9        court  0.01268701
## 10 obligations  0.01268701
## 
## [[2]]
##          token probability
## 1        china  0.06098115
## 2      chinese  0.03597539
## 3    education  0.02742079
## 4     economic  0.02610470
## 5  educational  0.01974358
## 6    reference  0.01930488
## 7   philosophy  0.01798879
## 8       social  0.01733075
## 9      special  0.01733075
## 10      theory  0.01447921
## 
## [[3]]
##          token probability
## 1      chinese  0.03296725
## 2     students  0.02052170
## 3     selected  0.01783077
## 4        basis  0.01480348
## 5      schools  0.01480348
## 6       theory  0.01446711
## 7      college  0.01379438
## 8     american  0.01345801
## 9  comparative  0.01143982
## 10     english  0.01143982
## 
## [[4]]
##            token probability
## 1            law  0.06936891
## 2        chinese  0.05258273
## 3          china  0.02962515
## 4  international  0.02567546
## 5        century  0.01555438
## 6         german  0.01259211
## 7          civil  0.01209840
## 8         period  0.01185154
## 9       japanese  0.01111097
## 10         power  0.01111097
## 
## [[5]]
##            token probability
## 1  international  0.03728616
## 2            law  0.02757723
## 3          legal  0.02447038
## 4          china  0.02369366
## 5        special  0.01942174
## 6          world  0.01825667
## 7      reference  0.01476145
## 8         public  0.01398474
## 9        britain  0.01359638
## 10       federal  0.01359638
## 
## [[6]]
##                token probability
## 1              china  0.03496597
## 2        development  0.02887673
## 3            nations  0.02730532
## 4             united  0.02710889
## 5      international  0.02259107
## 6            chinese  0.01905538
## 7            special  0.01826967
## 8          reference  0.01748397
## 9           economic  0.01571612
## 10 industrialization  0.01237686
## 
## [[7]]
##         token probability
## 1   reference  0.02797119
## 2     british  0.02634514
## 3       china  0.02276785
## 4     chinese  0.02276785
## 5     special  0.02244264
## 6     foreign  0.01951576
## 7      united  0.01723930
## 8   relations  0.01658889
## 9      policy  0.01431243
## 10 diplomatic  0.01268639
## 
## [[8]]
##         token probability
## 1    students  0.02740171
## 2       china  0.02583612
## 3     chinese  0.02231355
## 4      social  0.02035656
## 5     english  0.01996517
## 6     factors  0.01722539
## 7    personal  0.01722539
## 8        york  0.01683399
## 9  conference  0.01487700
## 10     school  0.01331142
## 
## [[9]]
##         token probability
## 1       china  0.06842045
## 2   relations  0.04140673
## 3     chinese  0.03635744
## 4        sino  0.02550146
## 5     central  0.02045216
## 6      treaty  0.01893738
## 7      soviet  0.01792752
## 8  government  0.01565534
## 9  historical  0.01388808
## 10 diplomatic  0.01212083
## 
## [[10]]
##            token probability
## 1          china  0.04243260
## 2          labor  0.02794460
## 3         united  0.02656479
## 4        service  0.02552993
## 5            war  0.02208041
## 6        control  0.01828593
## 7  international  0.01759603
## 8        chinese  0.01656117
## 9      education  0.01552631
## 10   development  0.01449145
## 
## [[11]]
##         token probability
## 1       china  0.06001739
## 2   education  0.05882661
## 3     chinese  0.02381786
## 4     program  0.02381786
## 5  curriculum  0.01881661
## 6       rural  0.01881661
## 7      school  0.01548244
## 8      united  0.01548244
## 9   secondary  0.01524429
## 10   proposed  0.01429167
## 
## [[12]]
##          token probability
## 1       income  0.03437954
## 2   individual  0.02380285
## 3       george  0.02274518
## 4          tax  0.02274518
## 5        north  0.01904334
## 6     carolina  0.01798567
## 7  flexibility  0.01745683
## 8        built  0.01692800
## 9        motor  0.01375499
## 10   reference  0.01269732
## 
## [[13]]
##          token probability
## 1    political  0.05130770
## 2        china  0.04070279
## 3      chinese  0.02895140
## 4       social  0.02665845
## 5   philosophy  0.01977959
## 6          sen  0.01977959
## 7          sun  0.01977959
## 8          yat  0.01977959
## 9  development  0.01576692
## 10     science  0.01433383
## 
## [[14]]
##        token probability
## 1      china  0.09781908
## 2    chinese  0.03621726
## 3    foreign  0.02379038
## 4     system  0.01686683
## 5      trade  0.01384887
## 6    banking  0.01331629
## 7    special  0.01296124
## 8    dynasty  0.01118597
## 9     modern  0.01083091
## 10 reference  0.01029833
## 
## [[15]]
##            token probability
## 1        chinese  0.09670194
## 2   experimental  0.02467254
## 3       learning  0.02121907
## 4        reading  0.02121907
## 5       analysis  0.01776561
## 6  psychological  0.01628555
## 7     psychology  0.01529885
## 8     characters  0.01431214
## 9      movements  0.01283209
## 10       culture  0.01233874
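
Several topics above contain runs of tokens with exactly the same probability, for instance `sen`, `sun`, and `yat` in topic 13 of the 15-topic model. In a biterm topic model the smoothed topic-word probability depends on how often a token was assigned to the topic, so identical values point to identical assignment counts. This is what one would expect when a multiword name is split into tokens that always appear together. The sketch below flags such ties. It uses base R only, the values are copied from the output above, and `tied_tokens` is a hypothetical helper name, not part of BTM.

```r
# Top-10 terms of topic 13 in the 15-topic model, copied from the output above
topic13 <- data.frame(
  token = c("political", "china", "chinese", "social", "philosophy",
            "sen", "sun", "yat", "development", "science"),
  probability = c(0.05130770, 0.04070279, 0.02895140, 0.02665845, 0.01977959,
                  0.01977959, 0.01977959, 0.01977959, 0.01576692, 0.01433383),
  stringsAsFactors = FALSE
)

# Hypothetical helper: group tokens whose probabilities coincide
# at the printed precision (8 decimal places by default)
tied_tokens <- function(topic, digits = 8) {
  key <- round(topic$probability, digits)
  groups <- split(topic$token, key)
  unname(groups[lengths(groups) > 1])
}

tied_tokens(topic13)
```

A tie is only a hint: tokens can share a count by coincidence, as `philosophy` does here, so flagged groups still need to be checked against the titles themselves.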

## 20-topic model 
terms(model20, top_n = 10)

## [[1]]
##            token probability
## 1          basis  0.02323834
## 2        chinese  0.02076670
## 3           east  0.02027237
## 4    supervision  0.02027237
## 5       selected  0.01928372
## 6       american  0.01878939
## 7  international  0.01780074
## 8         county  0.01483477
## 9       economic  0.01384612
## 10           war  0.01384612
## 
## [[2]]
##          token probability
## 1        china  0.03397222
## 2      nations  0.03188179
## 3       united  0.02848482
## 4      chinese  0.02639439
## 5       social  0.02456525
## 6     analysis  0.02273612
## 7  development  0.02012307
## 8     economic  0.01907786
## 9  educational  0.01594220
## 10        life  0.01150002
## 
## [[3]]
##                token probability
## 1            special  0.04613937
## 2          reference  0.04322781
## 3              china  0.03494104
## 4        development  0.02508651
## 5           economic  0.02486254
## 6            chinese  0.02015924
## 7             united  0.02015924
## 8  industrialization  0.01456007
## 9             theory  0.01433611
## 10            system  0.01388817
## 
## [[4]]
##            token probability
## 1          china  0.08499351
## 2        chinese  0.03278874
## 3        foreign  0.03103797
## 4  international  0.02435321
## 5         policy  0.01973754
## 6            war  0.01941922
## 7          trade  0.01766845
## 8          labor  0.01544020
## 9    development  0.01496272
## 10          sino  0.01225698
## 
## [[5]]
##           token probability
## 1         china  0.06519758
## 2       chinese  0.05261633
## 3     education  0.05071008
## 4        modern  0.02745382
## 5        theory  0.01563507
## 6  constitution  0.01449132
## 7      movement  0.01411007
## 8         adult  0.01334757
## 9        social  0.01296632
## 10  obligations  0.01220382
## 
## [[6]]
##         token probability
## 1       china  0.07027134
## 2     chinese  0.04010004
## 3  government  0.03216023
## 4     central  0.02223546
## 5       local  0.01945652
## 6  provincial  0.01628059
## 7     ancient  0.01469263
## 8         law  0.01469263
## 9     dynasty  0.01310467
## 10     people  0.01310467
## 
## [[7]]
##         token probability
## 1       china  0.04946290
## 2   political  0.04420779
## 3    economic  0.02535124
## 4       ideas  0.02411475
## 5  philosophy  0.02349650
## 6         sen  0.02133263
## 7         sun  0.02133263
## 8         yat  0.02133263
## 9   confucius  0.01885964
## 10     social  0.01855052
## 
## [[8]]
##             token probability
## 1           china  0.09112718
## 2         chinese  0.02510984
## 3       treatment  0.01767126
## 4        critical  0.01488180
## 5    organization  0.01488180
## 6         favored  0.01348706
## 7          nation  0.01348706
## 8           rural  0.01348706
## 9  administrative  0.01255724
## 10        dynasty  0.01209233
## 
## [[9]]
##           token probability
## 1       chinese  0.05594034
## 2      learning  0.02547125
## 3      students  0.02547125
## 4       english  0.02001410
## 5  experimental  0.02001410
## 6       college  0.01910457
## 7     political  0.01910457
## 8    psychology  0.01819505
## 9       science  0.01501171
## 10      factors  0.01410219
## 
## [[10]]
##         token probability
## 1   relations  0.05329461
## 2       china  0.04225257
## 3      soviet  0.03265081
## 4        sino  0.02640966
## 5  diplomatic  0.02448931
## 6      treaty  0.02400922
## 7     chinese  0.02352913
## 8         tax  0.01920833
## 9     central  0.01680789
## 10      north  0.01680789
## 
## [[11]]
##            token probability
## 1            law  0.10140350
## 2        chinese  0.04198894
## 3  international  0.04119674
## 4          legal  0.02957789
## 5         status  0.02139189
## 6          china  0.02112782
## 7    comparative  0.01848717
## 8         united  0.01690278
## 9       american  0.01294181
## 10         civil  0.01294181
## 
## [[12]]
##          token probability
## 1      british  0.03346773
## 2        china  0.02907924
## 3        power  0.02085082
## 4       france  0.01591377
## 5      balance  0.01481664
## 6      chinese  0.01426808
## 7  comparative  0.01426808
## 8       market  0.01371952
## 9       treaty  0.01371952
## 10     england  0.01317096
## 
## [[13]]
##             token probability
## 1           china  0.04550410
## 2           motor  0.02747757
## 3            land  0.02661917
## 4  transportation  0.02490236
## 5         federal  0.02404395
## 6  administration  0.02146873
## 7       existence  0.02061033
## 8      efficiency  0.01717670
## 9        revision  0.01631830
## 10         system  0.01545989
## 
## [[14]]
##           token probability
## 1       chinese  0.05031443
## 2          idea  0.02124710
## 3        social  0.01565723
## 4  confucianism  0.01453925
## 5           god  0.01453925
## 6            ti  0.01453925
## 7        method  0.01398027
## 8      classics  0.01286229
## 9   development  0.01286229
## 10    francisco  0.01286229
## 
## [[15]]
##          token probability
## 1    education  0.05806760
## 2        china  0.05758970
## 3       school  0.02557020
## 4      chinese  0.02461439
## 5   curriculum  0.02055222
## 6      program  0.01983536
## 7     teachers  0.01601214
## 8    secondary  0.01529528
## 9  comparative  0.01457843
## 10    children  0.01410053
## 
## [[16]]
##        token probability
## 1    chinese  0.04990971
## 2      rural  0.02364490
## 3    service  0.02101842
## 4     school  0.01970518
## 5  political  0.01904856
## 6     united  0.01707870
## 7    program  0.01642208
## 8   theories  0.01576546
## 9     church  0.01510883
## 10     china  0.01379559
## 
## [[17]]
##             token probability
## 1         century  0.01933974
## 2          united  0.01933974
## 3        paradise  0.01832240
## 4         germany  0.01730505
## 5        literary  0.01730505
## 6         aspects  0.01628771
## 7  czechoslovakia  0.01628771
## 8          poland  0.01628771
## 9       political  0.01628771
## 10         rights  0.01628771
## 
## [[18]]
##            token probability
## 1        chinese  0.04747806
## 2       analysis  0.02374215
## 3     philosophy  0.01874512
## 4      reference  0.01624660
## 5             wu  0.01499735
## 6     missionary  0.01437272
## 7           john  0.01249883
## 8           june  0.01249883
## 9        special  0.01249883
## 10 psychological  0.01187420
## 
## [[19]]
##          token probability
## 1      english  0.04045552
## 2      chinese  0.03609944
## 3     language  0.01991972
## 4          mid  0.01991972
## 5      century  0.01805283
## 6       france  0.01805283
## 7        china  0.01743054
## 8      reading  0.01618594
## 9  seventeenth  0.01431905
## 10 achievement  0.01369675
## 
## [[20]]
##       token probability
## 1    feudal  0.03939604
## 2       law  0.03574893
## 3   chinese  0.02918414
## 4     china  0.02699588
## 5   studies  0.02188993
## 6  kingship  0.02116051
## 7   inquiry  0.01897225
## 8    nature  0.01824282
## 9    period  0.01751340
## 10       po  0.01386630

## 25-topic model 
terms(model25, top_n = 10)

## [[1]]
##       token probability
## 1   chinese  0.07815845
## 2     china  0.03059063
## 3      life  0.03059063
## 4        lu  0.02153010
## 5    native  0.01926496
## 6   tibetan  0.01926496
## 7   century  0.01813240
## 8  movement  0.01699983
## 9      pali  0.01699983
## 10 sanskrit  0.01699983
## 
## [[2]]
##         token probability
## 1    american  0.03200245
## 2    analysis  0.03132169
## 3     stories  0.02247183
## 4       china  0.01974880
## 5  literature  0.01906804
## 6     chinese  0.01770653
## 7       basis  0.01566425
## 8    critical  0.01498349
## 9        fair  0.01362197
## 10     return  0.01362197
## 
## [[3]]
##           token probability
## 1       special  0.04467763
## 2     reference  0.04421229
## 3       chinese  0.04095488
## 4       english  0.02187580
## 5      relation  0.01908374
## 6         china  0.01768771
## 7      american  0.01582633
## 8  experimental  0.01489565
## 9       dialect  0.01443030
## 10      reading  0.01443030
## 
## [[4]]
##              token probability
## 1           george  0.03700598
## 2           social  0.02714101
## 3           novels  0.02467476
## 4  characteristics  0.01974228
## 5         relation  0.01974228
## 6          attempt  0.01727603
## 7       historical  0.01727603
## 8            major  0.01604291
## 9          charles  0.01480979
## 10          mental  0.01480979
## 
## [[5]]
##          token probability
## 1        china  0.09179844
## 2    education  0.04541429
## 3      chinese  0.03016916
## 4       social  0.02497932
## 5  educational  0.01978949
## 6       people  0.01751893
## 7      factors  0.01719457
## 8      history  0.01687021
## 9       school  0.01622148
## 10       rural  0.01524838
## 
## [[6]]
##          token probability
## 1       income  0.04448151
## 2        china  0.03924916
## 3   individual  0.03336277
## 4          tax  0.03205468
## 5       theory  0.02616829
## 6        north  0.02486020
## 7     carolina  0.02224402
## 8        labor  0.02224402
## 9  flexibility  0.02158998
## 10       built  0.02093594
## 
## [[7]]
##           token probability
## 1    philosophy  0.05234942
## 2     confucius  0.03252310
## 3          john  0.02617867
## 4  confucianism  0.02538562
## 5         dewey  0.02459257
## 6      doctrine  0.02379952
## 7           neo  0.02379952
## 8         moral  0.02142036
## 9           god  0.01983425
## 10    political  0.01983425
## 
## [[8]]
##          token probability
## 1        china  0.08201941
## 2      special  0.03971972
## 3      chinese  0.03455335
## 4    reference  0.03164727
## 5     economic  0.02486640
## 6    education  0.02389771
## 7  development  0.02292901
## 8       system  0.01905423
## 9       malaya  0.01711684
## 10   relations  0.01679394
## 
## [[9]]
##          token probability
## 1      chinese  0.03803895
## 2      english  0.03289994
## 3     learning  0.02776093
## 4  comparative  0.02673313
## 5   manchurian  0.02159412
## 6      ethical  0.01748291
## 7         life  0.01748291
## 8        moral  0.01748291
## 9         text  0.01748291
## 10  obligation  0.01645511
## 
## [[10]]
##        token probability
## 1     london  0.03474916
## 2    chinese  0.03226767
## 3    studies  0.03061334
## 4    science  0.02978618
## 5    variety  0.02895901
## 6      china  0.02647752
## 7  political  0.02647752
## 8   american  0.01903305
## 9        law  0.01903305
## 10    school  0.01737872
## 
## [[11]]
##         token probability
## 1     chinese  0.05778911
## 2    students  0.04165127
## 3     college  0.03071918
## 4     english  0.01770478
## 5     reading  0.01770478
## 6  university  0.01718421
## 7    language  0.01458133
## 8      theory  0.01458133
## 9    children  0.01354018
## 10     method  0.01354018
## 
## [[12]]
##         token probability
## 1       china  0.06858973
## 2     central  0.04589627
## 3   relations  0.04438337
## 4  government  0.03127159
## 5   political  0.02421140
## 6       local  0.02168991
## 7      school  0.02017701
## 8  provincial  0.01866411
## 9  diplomatic  0.01815981
## 10     modern  0.01765551
## 
## [[13]]
##            token probability
## 1        nations  0.06253402
## 2         united  0.05488652
## 3          china  0.03689242
## 4    development  0.02384669
## 5         league  0.02384669
## 6  international  0.02159743
## 7       economic  0.02114757
## 8         county  0.01664905
## 9    supervision  0.01664905
## 10         basis  0.01484964
## 
## [[14]]
##        token probability
## 1    chinese  0.05209155
## 2      china  0.04674004
## 3  political  0.02925846
## 4   economic  0.02854493
## 5        law  0.02854493
## 6      ideas  0.02747462
## 7        sen  0.02462049
## 8        sun  0.02462049
## 9        yat  0.02462049
## 10    social  0.02247989
## 
## [[15]]
##            token probability
## 1          china  0.02935055
## 2       economic  0.02889907
## 3         united  0.02844759
## 4          labor  0.02709316
## 5        british  0.02573873
## 6  international  0.02528725
## 7        chinese  0.02167543
## 8          rural  0.02032100
## 9     government  0.01806361
## 10        church  0.01670918
## 
## [[16]]
##      token probability
## 1    china  0.10527618
## 2  chinese  0.02857362
## 3  foreign  0.02665606
## 4      war  0.02013634
## 5   policy  0.01802702
## 6    legal  0.01764351
## 7    trade  0.01725999
## 8   status  0.01649297
## 9      law  0.01476716
## 10 banking  0.01400013
## 
## [[17]]
##                token probability
## 1              motor  0.03415248
## 2  industrialization  0.03325396
## 3     transportation  0.03325396
## 4          developed  0.02337032
## 5            special  0.02247181
## 6          countries  0.02157330
## 7        development  0.02157330
## 8              china  0.02067478
## 9          reference  0.01977627
## 10            theory  0.01887776
## 
## [[18]]
##         token probability
## 1   education  0.05035715
## 2       china  0.04102180
## 3     chinese  0.03621268
## 4     program  0.02574577
## 5      school  0.02574577
## 6  curriculum  0.01923931
## 7    children  0.01839064
## 8   secondary  0.01810775
## 9     schools  0.01725908
## 10  reference  0.01527886
## 
## [[19]]
##           token probability
## 1       service  0.03662012
## 2        united  0.02656951
## 3       chinese  0.02441581
## 4      marriage  0.02154420
## 5        period  0.01580100
## 6       control  0.01508310
## 7  organization  0.01508310
## 8           car  0.01436520
## 9     functions  0.01436520
## 10   techniques  0.01436520
## 
## [[20]]
##            token probability
## 1           sino  0.04298268
## 2  international  0.04062762
## 3          china  0.03856695
## 4      relations  0.03238493
## 5         treaty  0.03150179
## 6            law  0.02973550
## 7         soviet  0.02090405
## 8       japanese  0.02060966
## 9     diplomatic  0.02002090
## 10       chinese  0.01972652
## 
## [[21]]
##           token probability
## 1       chinese  0.07496441
## 2         china  0.04300532
## 3       factors  0.03196490
## 4      learning  0.02905953
## 5        social  0.02731631
## 6        modern  0.02499201
## 7  constitution  0.02266771
## 8   development  0.01976234
## 9      movement  0.01976234
## 10     analysis  0.01860019
## 
## [[22]]
##             token probability
## 1         aspects  0.02320353
## 2         charles  0.01963513
## 3          church  0.01963513
## 4           china  0.01606673
## 5       christian  0.01606673
## 6         account  0.01517463
## 7          effect  0.01428253
## 8        injuries  0.01428253
## 9  responsibility  0.01428253
## 10           role  0.01339043
## 
## [[23]]
##            token probability
## 1            law  0.10452022
## 2        chinese  0.06466581
## 3  international  0.02684479
## 4    comparative  0.01911792
## 5          civil  0.01708453
## 6     succession  0.01667785
## 7         family  0.01545782
## 8         german  0.01545782
## 9           code  0.01423778
## 10       century  0.01301775
## 
## [[24]]
##          token probability
## 1      chinese  0.06572251
## 2   philosophy  0.02547696
## 3      ancient  0.02038259
## 4          law  0.01936371
## 5          tzu  0.01783540
## 6       system  0.01579765
## 7       nature  0.01528821
## 8  obligations  0.01477878
## 9   conference  0.01426934
## 10          li  0.01426934
## 
## [[25]]
##             token probability
## 1          united  0.03836408
## 2         england  0.03134066
## 3          france  0.02539777
## 4       reference  0.02485751
## 5     comparative  0.02053540
## 6  administration  0.01999514
## 7     application  0.01999514
## 8          theory  0.01999514
## 9         control  0.01837435
## 10            mid  0.01729382
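
Because `terms()` returns one small table per topic, it is hard to see at a glance which tokens recur across topics (here `china` and `chinese` appear in most of them). A minimal sketch of one way to stack the per-topic tables into a single long table and count recurrences is given below. It uses base R only. The input values are the first five terms of topics 4 and 5 from the ten-topic output printed before the 15-topic model, and `stack_topic_terms` is a hypothetical helper name introduced for illustration.

```r
# First five terms of topics 4 and 5, copied from the ten-topic output above
top_terms <- list(
  data.frame(token = c("chinese", "law", "china", "family", "century"),
             probability = c(0.07174477, 0.04844577, 0.02537746,
                             0.01153647, 0.01130578),
             stringsAsFactors = FALSE),
  data.frame(token = c("china", "chinese", "central", "government", "asia"),
             probability = c(0.04161432, 0.02801051, 0.02187546,
                             0.01867456, 0.01814108),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: stack a terms()-style list into one long data frame
# with an extra `topic` column
stack_topic_terms <- function(term_list, topic_ids = seq_along(term_list)) {
  stacked <- Map(
    function(tbl, id) data.frame(topic = id, tbl, stringsAsFactors = FALSE),
    term_list, topic_ids
  )
  do.call(rbind, stacked)
}

long_terms <- stack_topic_terms(top_terms, topic_ids = c(4, 5))

# Number of topics in which each token reaches the top-n list
topic_counts <- sort(table(long_terms$token), decreasing = TRUE)

# Tokens shared by more than one topic
names(topic_counts)[topic_counts > 1]
```

The same helper should accept the full list returned by `terms(model, top_n = 10)`, since that list has the same `token` and `probability` columns as the tables printed above.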

# Plot the topics

library(BTM)
library(textplot)
library(ggraph)
library(concaveman)

plot(model10, top_n = 10,
     title = "Dissertation titles of Chinese PhDs abroad (1905-1962)",
     subtitle = "Biterm topic model with 10 topics")

plot(model15, top_n = 10,
     title = "Dissertation titles of Chinese PhDs abroad (1905-1962)",
     subtitle = "Biterm topic model with 15 topics")

plot(model20, top_n = 10,
     title = "Dissertation titles of Chinese PhDs abroad (1905-1962)",
     subtitle = "Biterm topic model with 20 topics")

plot(model25, top_n = 10,
     title = "Dissertation titles of Chinese PhDs abroad (1905-1962)",
     subtitle = "Biterm topic model with 25 topics")

By Region

# Build Models 

# For each region: drop the token "study", keep rows with lgth > 3,
# and retain only the document id (RID) and token columns expected by BTM()
usa <- shs_unigram %>%
  filter(word != "study", lgth > 3, Region == "USA") %>%
  select(RID, word)
uk <- shs_unigram %>%
  filter(word != "study", lgth > 3, Region == "UK") %>%
  select(RID, word)
eu <- shs_unigram %>%
  filter(word != "study", lgth > 3, Region == "Europe") %>%
  select(RID, word)

set.seed(321)
model10us  <- BTM(usa, k = 10, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:12 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:12 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:12 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:12 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:13 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:13 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:13 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:13 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:14 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:14 Start Gibbs sampling iteration 901/1000

model15us  <- BTM(usa, k = 15, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:14 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:15 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:15 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:15 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:16 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:16 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:16 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:17 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:17 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:17 Start Gibbs sampling iteration 901/1000

model20us  <- BTM(usa, k = 20, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:18 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:18 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:19 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:19 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:19 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:20 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:20 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:21 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:21 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:22 Start Gibbs sampling iteration 901/1000

model25us  <- BTM(usa, k = 25, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:22 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:23 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:23 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:24 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:24 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:25 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:25 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:26 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:26 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 901/1000

model5uk  <- BTM(uk, k = 5, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:27 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 901/1000

model10uk  <- BTM(uk, k = 10, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:27 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 901/1000

model15uk  <- BTM(uk, k = 15, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:28 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 901/1000

model10eu  <- BTM(eu, k = 10, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:28 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 901/1000

model15eu  <- BTM(eu, k = 15, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:29 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:30 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:30 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:30 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:30 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:30 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:31 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:31 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:31 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:31 Start Gibbs sampling iteration 901/1000

model20eu  <- BTM(eu, k = 20, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:31 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:32 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:32 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:32 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:32 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:33 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:33 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:33 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:33 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:33 Start Gibbs sampling iteration 901/1000
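
The regional workflow above repeats the same filtering pipeline three times and the same `BTM()` call eleven times, with the seed set once before the first model. A minimal sketch of a more compact variant follows. It is an illustration and not the authors' code. `region_tokens` and `fit_btm_grid` are hypothetical helpers, and `toy_unigram` is an invented stand-in that only mimics the column names of `shs_unigram` (its `lgth` values are arbitrary). Resetting the seed before every fit assumes that `set.seed()` governs the BTM sampler, as the package's own examples suggest.

```r
library(dplyr)
library(BTM)

# Invented toy table with the same column names as shs_unigram (values are made up)
toy_unigram <- data.frame(
  RID    = rep(paste0("d", 1:6), each = 3),
  word   = c("study", "chinese", "law",
             "china", "foreign", "trade",
             "chinese", "education", "rural",
             "china", "law", "treaty",
             "study", "banking", "china",
             "chinese", "family", "law"),
  lgth   = rep(c(5, 6, 4, 2, 7, 5), each = 3),
  Region = rep(c("USA", "USA", "USA", "USA", "UK", "Europe"), each = 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: same filters as the script, parameterised by region
region_tokens <- function(unigrams, keep_region, min_lgth = 3, stopwords = "study") {
  unigrams %>%
    filter(!word %in% stopwords, lgth > min_lgth, Region == keep_region) %>%
    select(RID, word)
}

# Hypothetical helper: fit one BTM per value of k, resetting the seed each time
# so that any single model can be refitted on its own with the same result
fit_btm_grid <- function(tokens, ks, seed = 321, iter = 1000) {
  fits <- lapply(ks, function(k) {
    set.seed(seed)
    BTM(tokens, k = k, beta = 0.01, iter = iter, trace = FALSE)
  })
  names(fits) <- paste0("k", ks)
  fits
}

usa_toy    <- region_tokens(toy_unigram, "USA")
toy_models <- fit_btm_grid(usa_toy, ks = c(2, 3), iter = 20)

terms(toy_models$k2, top_n = 3)
```

With the real data, the call would take the form `fit_btm_grid(region_tokens(shs_unigram, "USA"), ks = c(10, 15, 20, 25))`. The fitted models would differ from those reported above, because the original script draws all models from a single random stream.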

# Most frequent terms for each model 

## U.S. theses models with 10, 20, and 25 topics
terms(model10us)

## [[1]]
##      token probability
## 1  chinese  0.03615592
## 2 american  0.03389665
## 3   method  0.02335337
## 4  stories  0.02109409
## 5    basis  0.01808173
## 
## [[2]]
##           token probability
## 1         china  0.05946573
## 2       chinese  0.02801636
## 3       foreign  0.02405216
## 4 international  0.01929511
## 5      american  0.01850227
## 
## [[3]]
##        token probability
## 1    chinese  0.05854293
## 2   students  0.04042357
## 3    college  0.02021353
## 4    reading  0.01916818
## 5 university  0.01603214
## 
## [[4]]
##       token probability
## 1   chinese  0.03929508
## 2 reference  0.03764072
## 3   special  0.03267766
## 4    united  0.02812818
## 5     china  0.02564665
## 
## [[5]]
##         token probability
## 1       china  0.03343227
## 2      feudal  0.02407311
## 3 development  0.01805650
## 4      visual  0.01671948
## 5    kingship  0.01605097
## 
## [[6]]
##         token probability
## 1       china  0.06230025
## 2   education  0.05694646
## 3     chinese  0.02433702
## 4     program  0.01865876
## 5 educational  0.01736087
## 
## [[7]]
##           token probability
## 1        united  0.04545990
## 2       nations  0.04055451
## 3         china  0.02518428
## 4       chinese  0.02354915
## 5 international  0.02060591
## 
## [[8]]
##         token probability
## 1       china  0.02826449
## 2    economic  0.02562682
## 3      united  0.02487320
## 4 development  0.01997468
## 5     century  0.01771382
## 
## [[9]]
##        token probability
## 1    chinese  0.05317800
## 2    factors  0.02524894
## 3   learning  0.02148926
## 4 conference  0.01880378
## 5   analysis  0.01826668
## 
## [[10]]
##        token probability
## 1      china  0.03966740
## 2    chinese  0.02786303
## 3 philosophy  0.02408563
## 4     income  0.02314128
## 5  influence  0.01841954

terms(model20us)

## [[1]]
##        token probability
## 1      china  0.04960921
## 2    chinese  0.04357648
## 3  relations  0.02882978
## 4    foreign  0.02815948
## 5 diplomatic  0.02614857
## 
## [[2]]
##        token probability
## 1   analysis  0.03482582
## 2    chinese  0.02638575
## 3       farm  0.02216572
## 4 attainment  0.02111071
## 5       goal  0.02111071
## 
## [[3]]
##         token probability
## 1     chinese  0.07694527
## 2    american  0.03527237
## 3     stories  0.03527237
## 4 comparative  0.02565555
## 5     foreign  0.02031287
## 
## [[4]]
##      token probability
## 1 selected  0.02619861
## 2   united  0.02560332
## 3    china  0.02500804
## 4   county  0.02500804
## 5    hsien  0.02500804
## 
## [[5]]
##            token probability
## 1         united  0.04412680
## 2        special  0.04242983
## 3      reference  0.03960155
## 4 administration  0.03111672
## 5        control  0.02828844
## 
## [[6]]
##           token probability
## 1    conference  0.04726590
## 2       reading  0.04070302
## 3 international  0.03939044
## 4       chinese  0.03151498
## 5         ports  0.03151498
## 
## [[7]]
##         token probability
## 1      income  0.05945520
## 2  individual  0.05299408
## 3       built  0.03619518
## 4    carolina  0.03619518
## 5 flexibility  0.03619518
## 
## [[8]]
##         token probability
## 1      theory  0.03411998
## 2       china  0.02946831
## 3 development  0.02869304
## 4    economic  0.02326609
## 5   countries  0.02171554
## 
## [[9]]
##       token probability
## 1  students  0.03831224
## 2    school  0.03422614
## 3   chinese  0.03065081
## 4 community  0.02707548
## 5   factors  0.02656472
## 
## [[10]]
##           token probability
## 1       nations  0.05618346
## 2 international  0.04339048
## 3       british  0.04060939
## 4         china  0.03615966
## 5        united  0.03449101
## 
## [[11]]
##       token probability
## 1     china  0.06079296
## 2   chinese  0.05755110
## 3      land  0.02675344
## 4   special  0.02189065
## 5 formation  0.01945926
## 
## [[12]]
##       token probability
## 1 political  0.05849726
## 2   studies  0.03120309
## 3     china  0.02730392
## 4      asia  0.02535434
## 5   central  0.02437954
## 
## [[13]]
##       token probability
## 1 education  0.06472831
## 2     china  0.05649166
## 3     adult  0.04354835
## 4    social  0.02825171
## 5    theory  0.02707505
## 
## [[14]]
##         token probability
## 1     chinese  0.06193396
## 2    economic  0.05883764
## 3 development  0.03484124
## 4       trade  0.02555230
## 5 application  0.02090784
## 
## [[15]]
##        token probability
## 1      china  0.02764983
## 2    chinese  0.02501777
## 3   paradise  0.02370173
## 4     school  0.02238570
## 5 activities  0.02106967
## 
## [[16]]
##        token probability
## 1      china  0.08358536
## 2  education  0.05556414
## 3    chinese  0.03817166
## 4 curriculum  0.02705980
## 5    program  0.02367793
## 
## [[17]]
##          token probability
## 1        china  0.08623988
## 2    education  0.06232563
## 3       united  0.04348409
## 4    relations  0.02101918
## 5 organization  0.01884516
## 
## [[18]]
##         token probability
## 1    learning  0.04082702
## 2     chinese  0.03674500
## 3   reference  0.02381859
## 4 development  0.01973657
## 5  psychology  0.01905624
## 
## [[19]]
##        token probability
## 1 philosophy  0.04990286
## 2  education  0.03610141
## 3      china  0.03132398
## 4  political  0.02760821
## 5  confucius  0.02548491
## 
## [[20]]
##      token probability
## 1  chinese  0.05086197
## 2 american  0.03560592
## 3  english  0.02967301
## 4 children  0.02628278
## 5 language  0.02458766

terms(model25us)

## [[1]]
##       token probability
## 1 reference  0.05539808
## 2   special  0.04791253
## 3     china  0.04042698
## 4    united  0.03294142
## 5  economic  0.02345972
## 
## [[2]]
##     token probability
## 1   china  0.07144218
## 2   rural  0.03810567
## 3 chinese  0.02994163
## 4   hsien  0.02790062
## 5  united  0.02790062
## 
## [[3]]
##            token probability
## 1          china  0.04183630
## 2        england  0.03347090
## 3 administration  0.02789396
## 4        systems  0.02510550
## 5   organization  0.02324652
## 
## [[4]]
##       token probability
## 1   chinese  0.07711511
## 2 education  0.03588933
## 3  learning  0.03512589
## 4   reading  0.03283557
## 5     adult  0.03130869
## 
## [[5]]
##       token probability
## 1   chinese  0.06755160
## 2     china  0.03800701
## 3 formation  0.03378635
## 4   capital  0.03167602
## 5     prior  0.02956569
## 
## [[6]]
##       token probability
## 1   chinese  0.05573972
## 2 education  0.04002229
## 3    health  0.03716458
## 4   effects  0.03144915
## 5    modern  0.02573372
## 
## [[7]]
##       token probability
## 1 reference  0.05299017
## 2   special  0.05020170
## 3   english  0.04834272
## 4   chinese  0.04090681
## 5  american  0.02510550
## 
## [[8]]
##           token probability
## 1       nations  0.05453459
## 2        united  0.04293272
## 3       british  0.03365122
## 4   development  0.02959057
## 5 international  0.02611001
## 
## [[9]]
##       token probability
## 1     china  0.07719285
## 2   chinese  0.03524325
## 3 political  0.03076863
## 4    social  0.02853132
## 5      land  0.02461602
## 
## [[10]]
##         token probability
## 1       china  0.05406438
## 2 development  0.04760086
## 3     chinese  0.03761179
## 4    american  0.02174679
## 5    cultural  0.02057161
## 
## [[11]]
##        token probability
## 1 individual  0.05849426
## 2     income  0.03462392
## 3      north  0.03462392
## 4      built  0.03343041
## 5   carolina  0.03343041
## 
## [[12]]
##         token probability
## 1     chinese  0.05869632
## 2     college  0.04663708
## 3    students  0.04663708
## 4  university  0.02573441
## 5 comparative  0.02091071
## 
## [[13]]
##       token probability
## 1     china  0.04847091
## 2     local  0.03197369
## 3    united  0.02991153
## 4   central  0.02784938
## 5 relations  0.02578723
## 
## [[14]]
##        token probability
## 1      china  0.09039554
## 2  relations  0.06583420
## 3    foreign  0.04618513
## 4 diplomatic  0.04029041
## 5       asia  0.03341324
## 
## [[15]]
##      token probability
## 1    china  0.06063153
## 2   feudal  0.04961012
## 3    world  0.03445568
## 4 kingship  0.03170033
## 5 analysis  0.02205659
## 
## [[16]]
##       token probability
## 1    school  0.04613211
## 2   chinese  0.03881071
## 3  personal  0.03002504
## 4 community  0.02929290
## 5  children  0.02709648
## 
## [[17]]
##           token probability
## 1       effects  0.04388325
## 2      economic  0.03657181
## 3 international  0.03510953
## 4       control  0.03072266
## 5    attainment  0.02926037
## 
## [[18]]
##      token probability
## 1  studies  0.05649850
## 2   united  0.03307800
## 3  variety  0.03307800
## 4 american  0.02894498
## 5  aspects  0.02343427
## 
## [[19]]
##         token probability
## 1     chinese  0.04424594
## 2      county  0.03810238
## 3 supervision  0.03564495
## 4 application  0.03441624
## 5       basis  0.03318753
## 
## [[20]]
##         token probability
## 1    analysis  0.05674320
## 2    selected  0.02837906
## 3    business  0.02390052
## 4 cooperative  0.02390052
## 5      stores  0.02390052
## 
## [[21]]
##      token probability
## 1 railroad  0.05428631
## 2 american  0.03361073
## 3   united  0.02714961
## 4     fair  0.02585739
## 5   return  0.02585739
## 
## [[22]]
##               token probability
## 1            theory  0.04421236
## 2       development  0.04294951
## 3         influence  0.04042381
## 4         countries  0.03537241
## 5 industrialization  0.03537241
## 
## [[23]]
##           token probability
## 1    conference  0.07260775
## 2 international  0.03765494
## 3         ports  0.03496626
## 4     waterways  0.03227758
## 5         china  0.02690022
## 
## [[24]]
##       token probability
## 1     china  0.05466366
## 2 education  0.03426982
## 3    united  0.02937530
## 4      plan  0.02529653
## 5   control  0.02284927
## 
## [[25]]
##        token probability
## 1  education  0.10542486
## 2      china  0.06980996
## 3    chinese  0.04036831
## 4    program  0.03324532
## 5 curriculum  0.02849667

## UK theses models with 5 and 10 topics
terms(model5uk)

## [[1]]
##       token probability
## 1    treaty  0.04974356
## 2 reference  0.04041859
## 3   special  0.04041859
## 4     china  0.03731026
## 5   british  0.03109361
## 
## [[2]]
##       token probability
## 1   special  0.05121918
## 2      asia  0.04313620
## 3 borrowing  0.04313620
## 4 reference  0.04313620
## 5   capital  0.03774754
## 
## [[3]]
##        token probability
## 1     george  0.06235715
## 2      china  0.04388780
## 3 historical  0.03465312
## 4    charles  0.02541845
## 5    dickens  0.02541845
## 
## [[4]]
##       token probability
## 1 political  0.06648579
## 2    london  0.04613715
## 3   science  0.04206742
## 4   english  0.03935427
## 5   chinese  0.03121481
## 
## [[5]]
##           token probability
## 1 international  0.06078770
## 2        soviet  0.04648752
## 3        london  0.04291247
## 4     relations  0.03814574
## 5          sino  0.03814574

terms(model10uk)

## [[1]]
##        token probability
## 1      trade  0.06866824
## 2    british  0.04722282
## 3 government  0.04722282
## 4      local  0.04722282
## 5      china  0.04293373
## 
## [[2]]
##        token probability
## 1    chinese  0.07424330
## 2      china  0.05800765
## 3  relations  0.04873014
## 4 diplomatic  0.03713325
## 5     treaty  0.03713325
## 
## [[3]]
##           token probability
## 1 international  0.06768164
## 2        soviet  0.06599002
## 3        london  0.06260678
## 4     relations  0.05584031
## 5          sino  0.05414869
## 
## [[4]]
##       token probability
## 1 reference  0.10995782
## 2   special  0.10740125
## 3      asia  0.04093059
## 4  economic  0.03837403
## 5   capital  0.03581746
## 
## [[5]]
##        token probability
## 1      power  0.04035719
## 2      china  0.03459600
## 3 principles  0.03171540
## 4  structure  0.03171540
## 5 biological  0.02883480
## 
## [[6]]
##        token probability
## 1 government  0.12420928
## 2  borrowing  0.09464972
## 3    british  0.04735442
## 4   employed  0.04735442
## 5    methods  0.04735442
## 
## [[7]]
##         token probability
## 1      treaty  0.08361484
## 2     special  0.07316733
## 3   reference  0.06620233
## 4 parliaments  0.06271983
## 5       peace  0.06271983
## 
## [[8]]
##         token probability
## 1   political  0.08349025
## 2      london  0.06601961
## 3     science  0.06019606
## 4 comparative  0.04466660
## 5     english  0.04272542
## 
## [[9]]
##        token probability
## 1     george  0.10135851
## 2 historical  0.06912273
## 3    charles  0.05070228
## 4    dickens  0.05070228
## 5      eliot  0.05070228
## 
## [[10]]
##         token probability
## 1       china  0.08936645
## 2     dialect  0.07820262
## 3       tones  0.07820262
## 4 descriptive  0.05587497
## 5     chengtu  0.04471114

## Europe theses models with 10 and 20 topics
terms(model10eu)

## [[1]]
##     token probability
## 1  france  0.02956595
## 2   power  0.02442628
## 3  sacred  0.02314137
## 4   china  0.01671678
## 5 blessed  0.01543186
## 
## [[2]]
##           token probability
## 1         china  0.06866391
## 2         labor  0.04291763
## 3       service  0.03576588
## 4  organization  0.03290518
## 5 international  0.03075966
## 
## [[3]]
##     token probability
## 1   china  0.09764387
## 2 chinese  0.05069201
## 3  system  0.03895404
## 4   trade  0.03842050
## 5 foreign  0.03735341
## 
## [[4]]
##      token probability
## 1  chinese  0.06692331
## 2    china  0.02205060
## 3 relation  0.01732716
## 4   orphan  0.01417820
## 5 phonetic  0.01417820
## 
## [[5]]
##          token probability
## 1        china  0.08832970
## 2      chinese  0.06004295
## 3       modern  0.03060164
## 4   government  0.02482884
## 5 relationship  0.02078787
## 
## [[6]]
##           token probability
## 1       chinese  0.05436131
## 2         china  0.05129913
## 3          east  0.02603616
## 4     manchuria  0.02220844
## 5 international  0.02067735
## 
## [[7]]
##       token probability
## 1   chinese  0.04516429
## 2  doctrine  0.02608347
## 3 political  0.02544745
## 4    school  0.02163128
## 5     china  0.02099526
## 
## [[8]]
##        token probability
## 1    chinese  0.05043972
## 2   marriage  0.03009042
## 3      civil  0.02832092
## 4 succession  0.02743617
## 5       code  0.02389716
## 
## [[9]]
##        token probability
## 1      china  0.10834750
## 2    chinese  0.02891428
## 3 historical  0.02307360
## 4  relations  0.02102936
## 5    foreign  0.01986123
## 
## [[10]]
##        token probability
## 1    chinese  0.05225227
## 2 obligation  0.03681761
## 3   critical  0.03206848
## 4      moral  0.02731936
## 5    history  0.02375751

terms(model20eu)

## [[1]]
##          token probability
## 1        china  0.14846765
## 2       regime  0.04455184
## 3 organization  0.03795401
## 4    education  0.03630456
## 5       public  0.03630456
## 
## [[2]]
##       token probability
## 1   nations  0.06205839
## 2     china  0.04310137
## 3    united  0.03793127
## 4 manchuria  0.03620791
## 5    league  0.03103781
## 
## [[3]]
##         token probability
## 1       china  0.09289948
## 2      france  0.04866865
## 3 comparative  0.03392504
## 4       labor  0.02507888
## 5    teaching  0.02507888
## 
## [[4]]
##        token probability
## 1 obligation  0.06306916
## 2      moral  0.04905898
## 3       role  0.03971886
## 4      basis  0.02570868
## 5    charles  0.02570868
## 
## [[5]]
##       token probability
## 1     china  0.13346553
## 2   chinese  0.05203354
## 3 relations  0.03770754
## 4    powers  0.03242954
## 5 evolution  0.02790554
## 
## [[6]]
##     token probability
## 1   power  0.03703780
## 2 chinese  0.03086826
## 3 dynasty  0.02881175
## 4 liturgy  0.02881175
## 5    qing  0.02675523
## 
## [[7]]
##          token probability
## 1        china  0.09596712
## 2   government  0.06159597
## 3 relationship  0.03295334
## 4       modern  0.03008908
## 5    relations  0.02865695
## 
## [[8]]
##     token probability
## 1   china  0.07744694
## 2 chinese  0.05281033
## 3 ancient  0.04401154
## 4   power  0.03697251
## 5 balance  0.03169324
## 
## [[9]]
##          token probability
## 1      chinese  0.07946436
## 2       system  0.06212984
## 3        china  0.04768440
## 4       social  0.04768440
## 5 contribution  0.02746078
## 
## [[10]]
##       token probability
## 1     labor  0.06872255
## 2   service  0.05794466
## 3    german  0.03773610
## 4 political  0.03773610
## 5      life  0.02965268
## 
## [[11]]
##        token probability
## 1      china  0.08141824
## 2   critical  0.07195320
## 3      legal  0.04355810
## 4 historical  0.03030705
## 5 commercial  0.02841404
## 
## [[12]]
##        token probability
## 1    chinese  0.05989970
## 2      legal  0.05830259
## 3      china  0.05031703
## 4 historical  0.03594302
## 5     status  0.03274879
## 
## [[13]]
##     token probability
## 1   china  0.10056870
## 2 chinese  0.05606672
## 3 foreign  0.05433287
## 4   trade  0.05144314
## 5  policy  0.03179291
## 
## [[14]]
##        token probability
## 1    chinese  0.06381977
## 2  character  0.03343664
## 3 succession  0.03039832
## 4    worship  0.03039832
## 5     family  0.02584085
## 
## [[15]]
##      token probability
## 1  chinese  0.06758844
## 2    china  0.06115247
## 3   modern  0.03648124
## 4 doctrine  0.03219059
## 5    court  0.02897260
## 
## [[16]]
##       token probability
## 1   chinese  0.04382192
## 2 classical  0.04183092
## 3   studies  0.04183092
## 4    orphan  0.03585792
## 5   history  0.03187592
## 
## [[17]]
##         token probability
## 1     chinese  0.09881440
## 2       civil  0.04360271
## 3       china  0.03924389
## 4        code  0.03197919
## 5 development  0.02907332
## 
## [[18]]
##       token probability
## 1     china  0.04728346
## 2   chinese  0.04203071
## 3 political  0.02714793
## 4     ideas  0.02627248
## 5  economic  0.02452156
## 
## [[19]]
##            token probability
## 1        chinese  0.04191482
## 2          china  0.03443271
## 3         social  0.02994344
## 4       phonetic  0.02695059
## 5 czechoslovakia  0.02395774
## 
## [[20]]
##       token probability
## 1   chinese  0.04796592
## 2   society  0.03505698
## 3 christian  0.03321285
## 4    german  0.03136872
## 5      pope  0.02768045
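
The per-topic listings above are easier to compare across models once they are stacked into a single table. The following is a minimal base-R sketch, assuming only that `terms()` returns a list of data frames with `token` and `probability` columns, as the printed output suggests. `flatten_terms` and `toy_terms` are hypothetical names introduced for illustration; the toy values are copied from the first two topics of `model5uk`.

```r
# Toy stand-in for the list returned by terms(); values copied from the
# first two topics of model5uk above, truncated to two tokens each.
toy_terms <- list(
  data.frame(token = c("treaty", "reference"),
             probability = c(0.04974356, 0.04041859)),
  data.frame(token = c("special", "asia"),
             probability = c(0.05121918, 0.04313620))
)

# Hypothetical helper: stack the per-topic data frames and record the topic index.
flatten_terms <- function(term_list) {
  out <- do.call(rbind, lapply(seq_along(term_list), function(i) {
    cbind(topic = i, term_list[[i]])
  }))
  rownames(out) <- NULL
  out
}

flat <- flatten_terms(toy_terms)
print(flat)
```

With a fitted model at hand, `flatten_terms(terms(model10uk, top_n = 5))` would presumably give one row per topic-token pair, which makes it straightforward to spot tokens such as "china" or "chinese" that recur across many topics.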

# Plot the topics

plot(model10us, top_n = 8, 
     title = "Dissertation titles of Chinese PhDs in the USA (1905-1962)", 
     subtitle = "Biterm topic model with 10 topics")

plot(model15us, top_n = 10, 
     title = "Dissertation titles of Chinese PhDs in the USA (1905-1962)", 
     subtitle = "Biterm topic model with 15 topics")

plot(model20us, top_n = 8, 
     title = "Dissertation titles of Chinese PhDs in the USA (1905-1962)", 
     subtitle = "Biterm topic model with 20 topics")

plot(model25us, top_n = 10, 
     title = "Dissertation titles of Chinese PhDs in the USA (1905-1962)", 
     subtitle = "Biterm topic model with 25 topics")

plot(model5uk, top_n = 10, 
     title = "Dissertation titles of Chinese PhDs in the UK (1905-1962)", 
     subtitle = "Biterm topic model with 5 topics")

plot(model10uk, top_n = 8, 
     title = "Dissertation titles of Chinese PhDs in the UK (1905-1962)", 
     subtitle = "Biterm topic model with 10 topics")

plot(model15uk, top_n = 8, 
     title = "Dissertation titles of Chinese PhDs in the UK (1905-1962)", 
     subtitle = "Biterm topic model with 15 topics")

plot(model10eu, top_n = 10, 
     title = "Dissertation titles of Chinese PhDs in Europe (1905-1962)", 
     subtitle = "Biterm topic model with 10 topics")

plot(model15eu, top_n = 10, 
     title = "Dissertation titles of Chinese PhDs in Europe (1905-1962)", 
     subtitle = "Biterm topic model with 15 topics")

plot(model20eu, top_n = 10, 
     title = "Dissertation titles of Chinese PhDs in Europe (1905-1962)", 
     subtitle = "Biterm topic model with 20 topics")
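
Because the ten plot calls differ only in region and number of topics, the labels can be generated rather than typed by hand. This is a small sketch, not part of the original script: `btm_labels` is a hypothetical helper, and the `years` default simply reuses the range shown in the titles.

```r
# Hypothetical helper: build the title and subtitle for one model.
btm_labels <- function(region, k, years = "1905-1962") {
  list(
    title = sprintf("Dissertation titles of Chinese PhDs in %s (%s)", region, years),
    subtitle = sprintf("Biterm topic model with %d topics", k)
  )
}

labs_us10 <- btm_labels("the USA", 10)
print(labs_us10$title)

# With a fitted model available (not run here):
# plot(model10us, top_n = 8,
#      title = labs_us10$title, subtitle = labs_us10$subtitle)
```

Generating the strings in one place keeps punctuation and date ranges consistent across all figures.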

By Period

# Build one token table per degree period, dropping the generic word "study"
# and any token of three characters or fewer.
p1 <- shs_unigram %>% 
  filter(word != "study", lgth > 3, Degree_Period == "1905-1926") %>% 
  select(RID, word)
p2 <- shs_unigram %>% 
  filter(word != "study", lgth > 3, Degree_Period == "1927-1937") %>% 
  select(RID, word)
p3 <- shs_unigram %>% 
  filter(word != "study", lgth > 3, Degree_Period == "1938-1951") %>% 
  select(RID, word)
p4 <- shs_unigram %>% 
  filter(word != "study", lgth > 3, Degree_Period == "1952-1962") %>% 
  select(RID, word)
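
The four subsets can also be produced in one pass by filtering once and splitting on the period column. The sketch below uses base R on a toy table; `toy_unigram` is a hypothetical stand-in for `shs_unigram` with invented rows, and only the column names are taken from the code above.

```r
# Toy stand-in for shs_unigram with the columns used above (rows invented).
toy_unigram <- data.frame(
  RID = c("r1", "r1", "r2", "r2", "r3"),
  word = c("study", "education", "china", "law", "treaty"),
  lgth = c(5, 9, 5, 3, 6),
  Degree_Period = c("1905-1926", "1905-1926", "1927-1937",
                    "1927-1937", "1952-1962"),
  stringsAsFactors = FALSE
)

# Apply the same two filters once, then split into one table per period.
kept <- subset(toy_unigram, word != "study" & lgth > 3)
by_period <- split(kept[, c("RID", "word")], kept$Degree_Period)

print(names(by_period))
```

Each element of `by_period` has the two-column shape (document id, token) that the `BTM()` calls above receive, and it is retrieved by period label rather than by a numbered variable.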


model10p1  <- BTM(p1, k = 10, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:41 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 901/1000

model10p2  <- BTM(p2, k = 10, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:41 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 901/1000

model10p3  <- BTM(p3, k = 10, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:43 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 901/1000

model10p4  <- BTM(p4, k = 10, beta = 0.01, iter = 1000, trace = 100)

## 2025-05-19 16:11:44 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:45 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:45 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:45 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:45 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:45 Start Gibbs sampling iteration 901/1000
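
Keeping the period models in a named list reduces the risk of referring to the wrong object when the same call is repeated for each period. The sketch below is one possible arrangement, not the authors' code: `fit_by_period` and its `seed` argument are hypothetical, and setting a seed before fitting (as the BTM package examples do) is intended to make runs repeatable.

```r
# Hypothetical helper: fit one BTM per period, resetting the seed before each
# fit. The BTM package is only needed when the function is actually called.
fit_by_period <- function(tokens_by_period, k = 10, seed = 2025) {
  lapply(tokens_by_period, function(tokens) {
    set.seed(seed)
    BTM::BTM(tokens, k = k, beta = 0.01, iter = 1000, trace = FALSE)
  })
}

# Usage (not run here): name the inputs by period, then look models up by name.
# models <- fit_by_period(list("1905-1926" = p1, "1927-1937" = p2,
#                              "1938-1951" = p3, "1952-1962" = p4))
# terms(models[["1938-1951"]], top_n = 10)
```

Because `lapply()` preserves names, each model is addressed by its period label, so the label in the comment and the object being inspected cannot drift apart.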

# Most frequent terms for each period

# 1905-1926
terms(model10p1, top_n = 10)

## [[1]]
##           token probability
## 1     education  0.07934297
## 2     reference  0.06579991
## 3       special  0.06386519
## 4         china  0.05612630
## 5      national  0.04258324
## 6       chinese  0.04064852
## 7          life  0.03871380
## 8          view  0.03484435
## 9      criteria  0.03290963
## 10 construetion  0.02710546
## 
## [[2]]
##             token probability
## 1       political  0.06721447
## 2     comparative  0.04033636
## 3       education  0.04033636
## 4      philosophy  0.03649663
## 5         chinese  0.03457677
## 6          social  0.03457677
## 7       confucius  0.02881717
## 8  interpretation  0.02881717
## 9          united  0.02881717
## 10        ancient  0.02497744
## 
## [[3]]
##         token probability
## 1       china  0.12115507
## 2   relations  0.04585264
## 3  commercial  0.03111955
## 4       essay  0.03111955
## 5      france  0.03111955
## 6      modern  0.02784553
## 7   political  0.02620852
## 8  historical  0.02293450
## 9  industrial  0.02293450
## 10  religious  0.02293450
## 
## [[4]]
##         token probability
## 1     british  0.07220492
## 2  government  0.04333120
## 3       labor  0.04126879
## 4     history  0.03920639
## 5   reference  0.03508157
## 6     special  0.03508157
## 7     chinese  0.03301916
## 8    american  0.03095675
## 9       china  0.02476953
## 10     french  0.02476953
## 
## [[5]]
##             token probability
## 1         chinese  0.10274479
## 2     immigration  0.03786473
## 3          united  0.03065583
## 4           china  0.02885361
## 5       evolution  0.02885361
## 6           rural  0.02705138
## 7          empire  0.02164471
## 8          school  0.02164471
## 9    constitution  0.01984249
## 10 administration  0.01623804
## 
## [[6]]
##          token probability
## 1      germany  0.03993813
## 2  comparative  0.03245441
## 3         east  0.03245441
## 4        italy  0.03245441
## 5     taxation  0.02995984
## 6         life  0.02746526
## 7       advent  0.02497069
## 8        close  0.02497069
## 9     decrease  0.02497069
## 10     eastern  0.02497069
## 
## [[7]]
##          token probability
## 1        china  0.08354048
## 2      central  0.05768887
## 3   government  0.05371169
## 4        local  0.04774594
## 5   provincial  0.04575735
## 6  governments  0.03979160
## 7   regulation  0.03780301
## 8       modern  0.03581443
## 9       public  0.02984867
## 10       essay  0.02587150
## 
## [[8]]
##             token probability
## 1           china  0.15661796
## 2       political  0.04433643
## 3         history  0.03103993
## 4  constitutional  0.02660777
## 5       relations  0.02660777
## 6       financial  0.02217560
## 7         foreign  0.02217560
## 8          reform  0.02217560
## 9        economic  0.02069821
## 10         powers  0.02069821
## 
## [[9]]
##          token probability
## 1        basis  0.05953716
## 2       county  0.05953716
## 3     selected  0.05953716
## 4  supervision  0.05953716
## 5      reading  0.04580310
## 6      chinese  0.03435805
## 7       agents  0.02978003
## 8  application  0.02978003
## 9   elementary  0.02978003
## 10 instruetion  0.02978003
## 
## [[10]]
##           token probability
## 1       chinese  0.11407323
## 2      learning  0.08318483
## 3    psychology  0.06180056
## 4       factors  0.04754437
## 5    characters  0.04516834
## 6  experimental  0.04279231
## 7      analysis  0.04041628
## 8         means  0.03566422
## 9   preliminary  0.03091216
## 10      process  0.03091216

# 1927-1937
terms(model10p2, top_n = 10)

## [[1]]
##       token probability
## 1     china  0.04899251
## 2   chinese  0.04387466
## 3     rural  0.04095017
## 4   english  0.02194098
## 5   dynasty  0.02120986
## 6    church  0.02047874
## 7    school  0.01974762
## 8    system  0.01974762
## 9    united  0.01974762
## 10 american  0.01682313
## 
## [[2]]
##           token probability
## 1         china  0.06838861
## 2     education  0.03989533
## 3       chinese  0.03182224
## 4     reference  0.02897291
## 5  organization  0.02707336
## 6       special  0.02564870
## 7        social  0.01757560
## 8       control  0.01710071
## 9         legal  0.01567605
## 10       united  0.01567605
## 
## [[3]]
##          token probability
## 1      century  0.03001434
## 2    reference  0.02834734
## 3    political  0.02668034
## 4     economic  0.02584684
## 5        china  0.02334634
## 6      chinese  0.02001234
## 7      england  0.01834534
## 8  sovereignty  0.01751184
## 9       france  0.01584484
## 10     special  0.01584484
## 
## [[4]]
##            token probability
## 1     conference  0.03242113
## 2        nations  0.02470365
## 3          basis  0.02393190
## 4  international  0.02084491
## 5         league  0.02084491
## 6        british  0.01930141
## 7          ports  0.01852967
## 8      waterways  0.01852967
## 9           york  0.01775792
## 10         moral  0.01698617
## 
## [[5]]
##         token probability
## 1     chinese  0.07743780
## 2       china  0.03743160
## 3      school  0.02710742
## 4  succession  0.01936429
## 5     century  0.01871903
## 6     stories  0.01807377
## 7       civil  0.01742851
## 8  literature  0.01742851
## 9    children  0.01678324
## 10       code  0.01549272
## 
## [[6]]
##            token probability
## 1          china  0.08837499
## 2        chinese  0.05128084
## 3        foreign  0.03018809
## 4       economic  0.02727874
## 5          trade  0.02291473
## 6  international  0.01927805
## 7      evolution  0.01491403
## 8         policy  0.01491403
## 9    development  0.01455036
## 10    historical  0.01455036
## 
## [[7]]
##          token probability
## 1  comparative  0.04007703
## 2       george  0.04007703
## 3   historical  0.03232268
## 4      charles  0.02844551
## 5        china  0.02586073
## 6    education  0.02456834
## 7       france  0.02456834
## 8      chinese  0.02198356
## 9     teaching  0.01810639
## 10     england  0.01552161
## 
## [[8]]
##       token probability
## 1   service  0.04323503
## 2    united  0.03459050
## 3  american  0.03335556
## 4      role  0.02100622
## 5   chinese  0.01977129
## 6   effects  0.01853636
## 7     motor  0.01853636
## 8   freight  0.01730142
## 9  doctrine  0.01606649
## 10    pupil  0.01483156
## 
## [[9]]
##            token probability
## 1          china  0.09384180
## 2      relations  0.02681610
## 3     diplomatic  0.02215345
## 4  international  0.02098778
## 5        century  0.01807362
## 6           sino  0.01749079
## 7       treaties  0.01749079
## 8        foreign  0.01690796
## 9           asia  0.01632513
## 10      japanese  0.01574229
## 
## [[10]]
##           token probability
## 1       chinese  0.04028166
## 2         china  0.03289792
## 3         based  0.02148668
## 4      relation  0.01947294
## 5       special  0.01880169
## 6   educational  0.01745919
## 7        credit  0.01477419
## 8  experimental  0.01477419
## 9  associations  0.01410294
## 10    education  0.01276044

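# 1938-1951
# NOTE: the listing reproduced below is identical to the 1927-1937 output
# (model10p2); re-run this call to obtain the 1938-1951 terms.
terms(model10p3, top_n = 10)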

## [[1]]
##       token probability
## 1     china  0.04899251
## 2   chinese  0.04387466
## 3     rural  0.04095017
## 4   english  0.02194098
## 5   dynasty  0.02120986
## 6    church  0.02047874
## 7    school  0.01974762
## 8    system  0.01974762
## 9    united  0.01974762
## 10 american  0.01682313
## 
## [[2]]
##           token probability
## 1         china  0.06838861
## 2     education  0.03989533
## 3       chinese  0.03182224
## 4     reference  0.02897291
## 5  organization  0.02707336
## 6       special  0.02564870
## 7        social  0.01757560
## 8       control  0.01710071
## 9         legal  0.01567605
## 10       united  0.01567605
## 
## [[3]]
##          token probability
## 1      century  0.03001434
## 2    reference  0.02834734
## 3    political  0.02668034
## 4     economic  0.02584684
## 5        china  0.02334634
## 6      chinese  0.02001234
## 7      england  0.01834534
## 8  sovereignty  0.01751184
## 9       france  0.01584484
## 10     special  0.01584484
## 
## [[4]]
##            token probability
## 1     conference  0.03242113
## 2        nations  0.02470365
## 3          basis  0.02393190
## 4  international  0.02084491
## 5         league  0.02084491
## 6        british  0.01930141
## 7          ports  0.01852967
## 8      waterways  0.01852967
## 9           york  0.01775792
## 10         moral  0.01698617
## 
## [[5]]
##         token probability
## 1     chinese  0.07743780
## 2       china  0.03743160
## 3      school  0.02710742
## 4  succession  0.01936429
## 5     century  0.01871903
## 6     stories  0.01807377
## 7       civil  0.01742851
## 8  literature  0.01742851
## 9    children  0.01678324
## 10       code  0.01549272
## 
## [[6]]
##            token probability
## 1          china  0.08837499
## 2        chinese  0.05128084
## 3        foreign  0.03018809
## 4       economic  0.02727874
## 5          trade  0.02291473
## 6  international  0.01927805
## 7      evolution  0.01491403
## 8         policy  0.01491403
## 9    development  0.01455036
## 10    historical  0.01455036
## 
## [[7]]
##          token probability
## 1  comparative  0.04007703
## 2       george  0.04007703
## 3   historical  0.03232268
## 4      charles  0.02844551
## 5        china  0.02586073
## 6    education  0.02456834
## 7       france  0.02456834
## 8      chinese  0.02198356
## 9     teaching  0.01810639
## 10     england  0.01552161
## 
## [[8]]
##       token probability
## 1   service  0.04323503
## 2    united  0.03459050
## 3  american  0.03335556
## 4      role  0.02100622
## 5   chinese  0.01977129
## 6   effects  0.01853636
## 7     motor  0.01853636
## 8   freight  0.01730142
## 9  doctrine  0.01606649
## 10    pupil  0.01483156
## 
## [[9]]
##            token probability
## 1          china  0.09384180
## 2      relations  0.02681610
## 3     diplomatic  0.02215345
## 4  international  0.02098778
## 5        century  0.01807362
## 6           sino  0.01749079
## 7       treaties  0.01749079
## 8        foreign  0.01690796
## 9           asia  0.01632513
## 10      japanese  0.01574229
## 
## [[10]]
##           token probability
## 1       chinese  0.04028166
## 2         china  0.03289792
## 3         based  0.02148668
## 4      relation  0.01947294
## 5       special  0.01880169
## 6   educational  0.01745919
## 7        credit  0.01477419
## 8  experimental  0.01477419
## 9  associations  0.01410294
## 10    education  0.01276044

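# 1952-1962
# NOTE: the listing reproduced below is identical to the 1927-1937 output
# (model10p2); re-run this call to obtain the 1952-1962 terms.
terms(model10p4, top_n = 10)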

## [[1]]
##       token probability
## 1     china  0.04899251
## 2   chinese  0.04387466
## 3     rural  0.04095017
## 4   english  0.02194098
## 5   dynasty  0.02120986
## 6    church  0.02047874
## 7    school  0.01974762
## 8    system  0.01974762
## 9    united  0.01974762
## 10 american  0.01682313
## 
## [[2]]
##           token probability
## 1         china  0.06838861
## 2     education  0.03989533
## 3       chinese  0.03182224
## 4     reference  0.02897291
## 5  organization  0.02707336
## 6       special  0.02564870
## 7        social  0.01757560
## 8       control  0.01710071
## 9         legal  0.01567605
## 10       united  0.01567605
## 
## [[3]]
##          token probability
## 1      century  0.03001434
## 2    reference  0.02834734
## 3    political  0.02668034
## 4     economic  0.02584684
## 5        china  0.02334634
## 6      chinese  0.02001234
## 7      england  0.01834534
## 8  sovereignty  0.01751184
## 9       france  0.01584484
## 10     special  0.01584484
## 
## [[4]]
##            token probability
## 1     conference  0.03242113
## 2        nations  0.02470365
## 3          basis  0.02393190
## 4  international  0.02084491
## 5         league  0.02084491
## 6        british  0.01930141
## 7          ports  0.01852967
## 8      waterways  0.01852967
## 9           york  0.01775792
## 10         moral  0.01698617
## 
## [[5]]
##         token probability
## 1     chinese  0.07743780
## 2       china  0.03743160
## 3      school  0.02710742
## 4  succession  0.01936429
## 5     century  0.01871903
## 6     stories  0.01807377
## 7       civil  0.01742851
## 8  literature  0.01742851
## 9    children  0.01678324
## 10       code  0.01549272
## 
## [[6]]
##            token probability
## 1          china  0.08837499
## 2        chinese  0.05128084
## 3        foreign  0.03018809
## 4       economic  0.02727874
## 5          trade  0.02291473
## 6  international  0.01927805
## 7      evolution  0.01491403
## 8         policy  0.01491403
## 9    development  0.01455036
## 10    historical  0.01455036
## 
## [[7]]
##          token probability
## 1  comparative  0.04007703
## 2       george  0.04007703
## 3   historical  0.03232268
## 4      charles  0.02844551
## 5        china  0.02586073
## 6    education  0.02456834
## 7       france  0.02456834
## 8      chinese  0.02198356
## 9     teaching  0.01810639
## 10     england  0.01552161
## 
## [[8]]
##       token probability
## 1   service  0.04323503
## 2    united  0.03459050
## 3  american  0.03335556
## 4      role  0.02100622
## 5   chinese  0.01977129
## 6   effects  0.01853636
## 7     motor  0.01853636
## 8   freight  0.01730142
## 9  doctrine  0.01606649
## 10    pupil  0.01483156
## 
## [[9]]
##            token probability
## 1          china  0.09384180
## 2      relations  0.02681610
## 3     diplomatic  0.02215345
## 4  international  0.02098778
## 5        century  0.01807362
## 6           sino  0.01749079
## 7       treaties  0.01749079
## 8        foreign  0.01690796
## 9           asia  0.01632513
## 10      japanese  0.01574229
## 
## [[10]]
##           token probability
## 1       chinese  0.04028166
## 2         china  0.03289792
## 3         based  0.02148668
## 4      relation  0.01947294
## 5       special  0.01880169
## 6   educational  0.01745919
## 7        credit  0.01477419
## 8  experimental  0.01477419
## 9  associations  0.01410294
## 10    education  0.01276044

# Plot the topics
# plot() on a BTM model relies on the textplot, ggraph, and concaveman packages

library(textplot)
library(ggraph)
library(concaveman)

plot(model10p1, top_n = 10, 
     title = "Dissertation titles of Chinese PhDs abroad (1905-1926)", 
     subtitle = "Biterm topic model with 10 topics")

plot(model10p2, top_n = 8, 
     title = "Dissertation titles of Chinese PhDs abroad (1927-1937)", 
     subtitle = "Biterm topic model with 10 topics")

plot(model10p3, top_n = 10, 
     title = "Dissertation titles of Chinese PhDs abroad (1938-1951)", 
     subtitle = "Biterm topic model with 10 topics")

plot(model10p4, top_n = 10, 
     title = "Dissertation titles of Chinese PhDs abroad (1952-1962)", 
     subtitle = "Biterm topic model with 10 topics")
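
The per-topic lists printed by terms() are hard to compare across topics or periods. The following is a minimal sketch of how they could be stacked into one table, assuming terms() returns a list with one token/probability data frame per topic, as the printed output suggests. The toy_terms object and the tidy_topic_terms() helper are illustrative names that do not appear in the original script.

```r
library(dplyr)

# Toy stand-in for the output of terms(model, top_n = ...):
# a list holding one data frame (token, probability) per topic.
toy_terms <- list(
  data.frame(token = c("china", "rural"), probability = c(0.049, 0.041)),
  data.frame(token = c("china", "education"), probability = c(0.068, 0.040))
)

# Hypothetical helper: stack the per-topic data frames, keep the topic index
# in a column, and rank tokens within each topic by decreasing probability.
tidy_topic_terms <- function(term_list) {
  bind_rows(term_list, .id = "topic") %>%
    mutate(topic = as.integer(topic)) %>%
    group_by(topic) %>%
    mutate(rank = row_number(desc(probability))) %>%
    ungroup()
}

topic_table <- tidy_topic_terms(toy_terms)
topic_table
```

With a fitted model, the same helper could presumably be applied as tidy_topic_terms(terms(model10p4, top_n = 10)), and the resulting tables for the four periods bound together for comparison.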

Named Entity Recognition

Extract named entities (persons, locations/GPE, organizations, events) from titles to identify China-focused theses:

library(histtext)

# keep only records with a dissertation title, then run NER on the titles
phd_all_master_ner <- phd_all_master %>% drop_na(Thesis)
phd_all_master_ner <- ner_on_df(phd_all_master_ner, text_column = "Thesis", id_column = "RID")

A total of 2885 named entities were found, distributed as follows:

phd_all_master_ner %>% group_by(Type) %>% count(sort = TRUE)

The named entities were curated and manually coded using a combination of R (tidyverse) and Excel to classify the entities and identify those related to China and Chinese issues. The resulting dataset was used for statistical analyses of individual entity types and served as a basis for analyzing China-centered theses.
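
The coding itself was done manually, so the following is only a sketch of how a first keyword-based pass could pre-flag candidate entities before manual review. The keyword list, the toy entities table, and the flag_china() helper are all hypothetical; the Text column name is an assumption about the NER output.

```r
library(dplyr)
library(stringr)

# Hypothetical keyword list; it does not reproduce the manual coding scheme.
china_keywords <- c("china", "chinese", "sino", "peking", "shanghai", "manchuria")

# Toy entities table; the Text column name is assumed for illustration.
toy_entities <- tibble(
  RID  = c(1, 2, 3),
  Text = c("Sino-Japanese War", "League of Nations", "Shanghai"),
  Type = c("EVENT", "ORG", "GPE")
)

# Flag entities whose text matches any keyword (case-insensitive).
flag_china <- function(entities, keywords) {
  pattern <- regex(str_c(keywords, collapse = "|"), ignore_case = TRUE)
  entities %>% mutate(china_candidate = str_detect(Text, pattern))
}

flagged <- flag_china(toy_entities, china_keywords)
flagged
```

A flag of this kind only narrows the list to review: entities that concern China without naming it would still need to be caught by hand.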

Locations/GPE

# load the cleaned list of locations/GPE

library(readr)
phd_loc_clean_meta <- read_delim("Data/phd_loc_clean_meta.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)

phd_loc_clean_meta %>% group_by(Place) %>% count(sort = TRUE)

Events

# load the cleaned list of events

library(readr)
phd_event_clean_meta <- read_delim("Data/phd_event_clean_meta.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)

phd_event_clean_meta %>% group_by(EventType) %>% count(sort = TRUE)

phd_event_clean_meta %>% group_by(Event2) %>% count(sort = TRUE)

Organizations

# load the cleaned list of organizations

library(readr)
phd_org_clean_meta <- read_delim("Data/phd_org_clean_meta.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)

phd_org_clean_meta %>% group_by(Text_clean2) %>% count(sort = TRUE)

phd_org_clean_meta %>% group_by(type2) %>% count(sort = TRUE)

phd_org_clean_meta %>% group_by(country) %>% count(sort = TRUE)

Persons

# load the cleaned list of persons

library(readr)
phd_pers_clean_meta <- read_delim("Data/phd_pers_clean_meta.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)

phd_pers_clean_meta %>% group_by(Text_clean2) %>% count(sort = TRUE)

China-Centered Theses

# load the compiled list of China-centered theses

library(readr)
china_theses_all <- read_delim("Data/china_theses_all.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)

names(china_theses_all)

##  [1] "RID"         "Region"      "Country"     "Discipline"  "Degree_Year"
##  [6] "Thesis"      "Title_Src"   "totalDisc"   "Field"       "totalField"

The dataset of China-centered theses includes the following information: a unique dissertation identifier (RID); region and country of graduation (Region, Country); discipline and field of study (Discipline, Field); year of graduation (Degree_Year); dissertation title in both its original language and translated form (Title_Src, Thesis); and the total number of dissertations within each discipline and field (totalDisc, totalField), which is used for statistical computations.

Disciplines

china_theses_all %>% group_by(Field) %>% count(sort = TRUE)

china_theses_all %>% group_by(Discipline) %>% count(sort = TRUE)

# Proportion of China-centered theses within each field and discipline

china_theses_fields <- china_theses_all %>% group_by(Field, totalField) %>% count()
china_theses_fields <- china_theses_fields %>% mutate(percent = round(n/totalField*100, 2))
china_theses_fields %>% arrange(desc(n))

china_theses_disciplines <- china_theses_all %>% group_by(Discipline, totalDisc) %>% count()
china_theses_disciplines <- china_theses_disciplines %>% mutate(percent = round(n/totalDisc*100, 2))
china_theses_disciplines %>% arrange(desc(n))

china_theses_disciplines %>% arrange(desc(percent))

Of the 1,622 dissertations in the humanities and social sciences, 849 (52.3%) focused on China. In absolute numbers, economics, education, and international law and relations produced the most China-focused dissertations. Proportionally, however, history, sociology, and anthropology were the disciplines most concerned with China.
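
As a quick check of the share reported above, the percentage can be recomputed from the two counts given in the text. The variable names are illustrative only.

```r
china_hss <- 849   # China-focused dissertations (humanities and social sciences)
total_hss <- 1622  # all dissertations in the humanities and social sciences

# Share of China-focused dissertations, in percent, rounded to one decimal
share_percent <- round(china_hss / total_hss * 100, 1)
share_percent
```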

Time Distribution

library(hrbrthemes)
library(viridis)

# All Disciplines

china_theses_all %>%
  ggplot(aes(x = Degree_Year)) +
  geom_histogram(binwidth = 1, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
  theme_ipsum() +
  theme(
    plot.title = element_text(size = 15)
  ) +
  labs(title = "China-focused theses",
       x = "Year",
       y = "Frequency",
       caption = "Source: Yuan, T’ung-li")

# Humanities and Social Sciences

china_theses_all %>%
  filter(Field == "Humanities and social sciences") %>%
  ggplot(aes(x = Degree_Year)) +
  geom_histogram(binwidth = 1, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
  theme_ipsum() +
  theme(
    plot.title = element_text(size = 15)
  ) +
  labs(title = "China-focused theses (Humanities & Social Sciences)",
       x = "Year",
       y = "Frequency",
       caption = "Source: Yuan, T’ung-li")

# Comparative Trend

# yearly counts of all HSS theses and of China-centered HSS theses
shs_all <- phd_all_master %>%
  filter(Field == "Humanities and social sciences") %>%
  group_by(Degree_Year) %>%
  count() %>%
  rename(TotalTheses = n)
shs_china <- china_theses_all %>%
  filter(Field == "Humanities and social sciences") %>%
  group_by(Degree_Year) %>%
  count() %>%
  rename(ChinaTheses = n)
theses_year <- left_join(shs_all, shs_china, by = "Degree_Year")

# years without any China-centered thesis get a count of 0
theses_year <- theses_year %>%
  mutate(ChinaTheses = ifelse(is.na(ChinaTheses), 0, ChinaTheses))

theses_year <- theses_year %>% drop_na(Degree_Year)
theses_year <- theses_year %>% mutate(ratio = round(ChinaTheses/TotalTheses, 2))

# linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
ggplot(theses_year, aes(x = Degree_Year)) +
  geom_line(aes(y = ChinaTheses, color = "ChinaTheses"), linewidth = 1) +
  geom_line(aes(y = TotalTheses, color = "TotalTheses"), linewidth = 1) +
  labs(title = "Comparative Trend (Humanities & Social Sciences Only)",
       x = "Year",
       y = "Frequency") +
  scale_color_manual(name = "Legend",
                     values = c("ChinaTheses" = "red", "TotalTheses" = "steelblue"),
                     labels = c("ChinaTheses" = "China-centered", "TotalTheses" = "All Theses (HSS)")) +
  theme_minimal()
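
The ratio column computed in theses_year is not plotted in the chart of absolute counts. As a minimal sketch, the yearly share of China-centered theses could be drawn as its own line. The toy_year data and the plot_china_share() helper are illustrative only, and the values are invented for the example.

```r
library(dplyr)
library(ggplot2)

# Toy yearly counts standing in for theses_year (values are illustrative only).
toy_year <- tibble(
  Degree_Year = 1930:1934,
  TotalTheses = c(20, 25, 30, 28, 40),
  ChinaTheses = c(10, 15, 12, 14, 30)
) %>%
  mutate(ratio = round(ChinaTheses / TotalTheses, 2))

# Hypothetical helper: plot the yearly share of China-centered theses.
plot_china_share <- function(df) {
  ggplot(df, aes(x = Degree_Year, y = ratio)) +
    geom_line(color = "red", linewidth = 1) +
    scale_y_continuous(limits = c(0, 1)) +
    labs(title = "Share of China-centered theses (Humanities & Social Sciences)",
         x = "Year",
         y = "Share of all theses") +
    theme_minimal()
}

share_plot <- plot_china_share(toy_year)
share_plot
```

On the real data the call would presumably be plot_china_share(theses_year). Years with very few theses produce unstable ratios, so such a plot may be easier to read alongside the absolute counts.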

Conclusion

This first document has focused on the doctoral dissertations themselves, remaining within the confines of a single data source: the original dissertation catalogs compiled by Yuan Tongli. In the next documents, we will shift our attention to the authors of these dissertations: who the Chinese PhDs were, what their backgrounds were, and what trajectories they followed after graduation. To do so, we will draw on external datasets, including the Chinese University Students’ Datasets (CUSD-OS) compiled by the Lee-Campbell Research Group, as well as biographical entries from web-based knowledge platforms such as Wikipedia and Baidu.

References

Yuan, T’ung-li. A Guide to Doctoral Dissertations by Chinese Students in America, 1905–1960. Washington, D.C.: Published under the auspices of the Sino-American Cultural Society, 1961.
Yuan, T’ung-li. Doctoral Dissertations by Chinese Students in Great Britain and Northern Ireland, 1916–1961. Taipei: Chinese Cultural Research Institute, 1963.
Yuan, T’ung-li. A Guide to Doctoral Dissertations by Chinese Students in Continental Europe, 1907–1962. Taipei: Chinese Culture Quarterly Review, 1964.