Abstract
This document is part of a series of three scripts developed to support a comprehensive study of early Chinese PhDs from 1905 to 1962. The study is based on a dataset derived from three catalogs of Chinese doctoral dissertations compiled by the librarian and bibliographer Yuan Tongli 袁同禮 (1895–1965), which cover the United States (1905–1960), the United Kingdom (1916–1961), and Continental Europe (1905–1962). The first script examines the dissertations themselves, while the second and third analyze the authors' backgrounds and their post-graduation trajectories.
# load packages
library(readr)
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(FactoMineR)
library(Factoshiny)
library(networkD3)
library(BTM)
library(histtext)
This document is part of a series of three scripts developed to support a comprehensive study of early Chinese PhDs from 1905 to 1962. It is based on the three catalogs of Chinese doctoral dissertations compiled by the librarian and bibliographer Yuan Tongli 袁同禮 (1895–1965), covering the United States (1905–1960), the United Kingdom (1916–1961), and Continental Europe (1905–1962). These scripts provide the complete data and code underlying our analysis and narrative—material that could not be included in the final published work—to ensure transparency and enable rigorous traceability of our methods for interested readers.
The documentation series follows the general structure of the paper and includes:
As illustrated by the figure and table below, we lost a significant number of individuals when incorporating external sources, resulting in a reduced sample size compared to our initial population. Nonetheless, the integration of qualitative biographical information enables us to draw more nuanced and in-depth insights into the life experiences of a substantial subset of PhDs. This approach ultimately supports robust empirical conclusions grounded in rich, contextual data.
| | Matched in CUSD-OS | (%) | Found in Baidu/Wikipedia | (%) |
|---|---|---|---|---|
| USA | 2,315 | 83.4% | 681 | 24.5% |
| Europe | 1,113 | 71.3% | 268 | 17.2% |
| UK | 288 | 83.2% | 130 | 37.6% |
| Total | 3,716 | 79.4% | 1,079 | 23.0% |
The dataset phd_all_master is a compilation of the three lists of doctors from the U.S., U.K. and Europe:
library(readr)
phd_all_master <- read_delim("Data/phd_all_master.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
phd_all_master
names(phd_all_master)
## [1] "ID" "Srce_ID" "NameWG" "NameZH"
## [5] "NamePY" "Sex" "Birth_year_Src" "Birth_year_corr"
## [9] "Death_year" "Age_Grad" "Country" "University"
## [13] "School" "Discipline_Srce" "Degree" "Degree_Year"
## [17] "Degree_Period2" "Degree_Period" "Thesis" "Title_Src"
## [21] "Discipline" "Field" "Region" "City"
## [25] "State" "RID" "Lat" "Long"
The dataset contains 4,717 unique dissertations (rows) and 28 attributes on these dissertations and their authors, including
Geographical distribution of PhDs by world region and by country:
phd_all_master %>% group_by(Region) %>% count(sort = TRUE)
phd_all_master %>% group_by(Country) %>% count(sort = TRUE)
When did they graduate?
phd_all_master %>% drop_na(Degree_Year) %>%
ggplot(aes(x = Degree_Year)) +
geom_histogram(binwidth = 1, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) + # one bin per year
theme_ipsum() +
theme(
plot.title = element_text(size = 15)
) +
labs(title = "Chinese PhDs Abroad (1905-1962)",
subtitle = "Year of Graduation",
x = "Year",
y = "Frequency",
caption = "Source: Yuan T'ung-li's Guides to Doctoral Dissertations (1961, 1962, 1964)")
phd_all_master %>%
group_by(Region, Degree_Year) %>%
count() %>%
ggplot( aes(x=Degree_Year, y=n, group=Region, fill=Region)) +
geom_area() +
scale_fill_viridis(discrete = TRUE) +
theme(legend.position="none") +
labs(title = "Chinese PhDs Abroad (1905-1962)",
subtitle = "Year of Graduation",
x = "Year",
y = "Frequency",
caption = "Source: Yuan T'ung-li's Guides to Doctoral Dissertations (1961, 1962, 1964)") +
theme_ipsum() +
theme(
legend.position="none",
panel.spacing = unit(0, "lines"),
strip.text.x = element_text(size = 8),
plot.title = element_text(size=13)
) +
facet_wrap(~Region, ncol = 1)
phd_all_master %>% group_by(Degree_Period) %>% count(sort = TRUE) # taking the Nationalist takeover as a cut-off
phd_all_master %>% group_by(Degree_Period2) %>% count(sort = TRUE) # taking end of WWI as a cut-off
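The four graduation periods used in this script (1905-1926, 1927-1937, 1938-1951, 1952-1962) are stored in Degree_Period. How that column was built is not shown here. As a minimal sketch, assuming the periods are simple year intervals, the same labels could be derived from Degree_Year with base R's cut(). The degree_year vector below is invented for illustration and is not taken from the dataset.

```r
# Hypothetical sample of graduation years (not taken from the dataset)
degree_year <- c(1905, 1926, 1927, 1937, 1938, 1951, 1952, 1962)

# cut() uses right-closed intervals by default:
# (1904,1926], (1926,1937], (1937,1951], (1951,1962]
degree_period <- cut(
  degree_year,
  breaks = c(1904, 1926, 1937, 1951, 1962),
  labels = c("1905-1926", "1927-1937", "1938-1951", "1952-1962")
)

# Number of sample years falling in each period
table(degree_period)
```

Keeping the break points in one vector makes it easy to test an alternative periodization, such as the one stored in Degree_Period2.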
From which university did they graduate?
Distribution by institution, as well as distribution of institutions by region, country, and period of graduation:
phd_all_master %>% group_by(University) %>% count(sort = TRUE)
phd_all_master %>% group_by(Region, University) %>% count(sort = TRUE) # by region
phd_all_master %>% group_by(Country, University) %>% count(sort = TRUE) # by country
phd_all_master %>% group_by(Degree_Period, University) %>% count() %>% arrange(Degree_Period, desc(n)) # by period
library(leaflet)
map_all <- phd_all_master %>% group_by(City, Lat, Long) %>%
count() %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers( radius = ~log(n)*3,
label = ~City,
color = "white",
weight = 1,
opacity = 0.6,
fill = TRUE,
fillColor = "red",
fillOpacity = 0.9,
stroke = TRUE,
popup = ~paste( "City:", City ,
"<br>",
"Chinese PhDs, 1905-1962:", n))
# add legend
legend_html <- "
<div style='background-color: white; padding: 10px;'>
<h4>Chinese PhDs, 1905-1962</h4>
<div style='display: flex; align-items: center;'>
<div style='background-color: red; width: 20px; height: 20px; border-radius: 50%; margin-right: 5px;'></div>
<span>10</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: red; width: 25px; height: 25px; border-radius: 50%; margin-right: 5px;'></div>
<span>20</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: red; width: 30px; height: 30px; border-radius: 50%; margin-right: 5px;'></div>
<span>50</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: red; width: 35px; height: 35px; border-radius: 50%; margin-right: 5px;'></div>
<span>100</span>
</div>
</div>
"
# Add the custom legend to the map
map_all %>%
addControl(html = legend_html, position = "bottomright")
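The markers are sized with radius = ~log(n)*3, so the radius grows with the natural logarithm of the number of PhDs per city. The short check below, a sketch in base R, prints the radii implied by the four counts shown in the legend. It also shows a side effect of the formula: a city with a single PhD gets a radius of zero. The log1p() variant at the end is only one possible alternative and is our assumption, not part of the original script.

```r
# Counts displayed in the custom legend
legend_counts <- c(10, 20, 50, 100)

# Marker radius in pixels, as computed by radius = ~log(n)*3 (natural log)
radius_px <- log(legend_counts) * 3
round(radius_px, 1)

# A city with a single PhD gets radius 0, because log(1) is 0
radius_single <- log(1) * 3
radius_single

# Possible alternative (an assumption, not the authors' choice):
# log1p(n) = log(1 + n) keeps single-PhD cities visible
radius_alt <- log1p(c(1, legend_counts)) * 3
round(radius_alt, 1)
```

Because the legend circles have fixed widths of 20 to 35 pixels, they should be read as an indicative scale rather than an exact match of the marker sizes.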
map1 <- phd_all_master %>%
filter(Degree_Period == "1905-1926") %>%
group_by(City, Lat, Long) %>%
count() %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers( radius = ~log(n)*3,
label = ~City,
color = "white",
weight = 1,
opacity = 0.6,
fill = TRUE,
fillColor = "orange",
fillOpacity = 0.9,
stroke = TRUE,
popup = ~paste( "City:", City ,
"<br>",
"Chinese PhDs, 1905-1926:", n))
legend1 <- "
<div style='background-color: white; padding: 10px;'>
<h4>Chinese PhDs, 1905-1926</h4>
<div style='display: flex; align-items: center;'>
<div style='background-color: orange; width: 20px; height: 20px; border-radius: 50%; margin-right: 5px;'></div>
<span>10</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: orange; width: 25px; height: 25px; border-radius: 50%; margin-right: 5px;'></div>
<span>20</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: orange; width: 30px; height: 30px; border-radius: 50%; margin-right: 5px;'></div>
<span>50</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: orange; width: 35px; height: 35px; border-radius: 50%; margin-right: 5px;'></div>
<span>100</span>
</div>
</div>
"
# Add the custom legend to the map
map1 %>%
addControl(html = legend1, position = "bottomright")
map2 <- phd_all_master %>%
filter(Degree_Period == "1927-1937") %>%
group_by(City, Lat, Long) %>%
count() %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers( radius = ~log(n)*3,
label = ~City,
color = "white",
weight = 1,
opacity = 0.6,
fill = TRUE,
fillColor = "blue",
fillOpacity = 0.9,
stroke = TRUE,
popup = ~paste( "City:", City ,
"<br>",
"Chinese PhDs, 1927-1937:", n))
legend2 <- "
<div style='background-color: white; padding: 10px;'>
<h4>Chinese PhDs, 1927-1937</h4>
<div style='display: flex; align-items: center;'>
<div style='background-color: blue; width: 20px; height: 20px; border-radius: 50%; margin-right: 5px;'></div>
<span>10</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: blue; width: 25px; height: 25px; border-radius: 50%; margin-right: 5px;'></div>
<span>20</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: blue; width: 30px; height: 30px; border-radius: 50%; margin-right: 5px;'></div>
<span>50</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: blue; width: 35px; height: 35px; border-radius: 50%; margin-right: 5px;'></div>
<span>100</span>
</div>
</div>
"
# Add the custom legend to the map
map2 %>%
addControl(html = legend2, position = "bottomright")
map3 <- phd_all_master %>%
filter(Degree_Period == "1938-1951") %>%
group_by(City, Lat, Long) %>%
count() %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers( radius = ~log(n)*3,
label = ~City,
color = "white",
weight = 1,
opacity = 0.6,
fill = TRUE,
fillColor = "darkgreen",
fillOpacity = 0.9,
stroke = TRUE,
popup = ~paste( "City:", City ,
"<br>",
"Chinese PhDs, 1938-1951:", n))
legend3 <- "
<div style='background-color: white; padding: 10px;'>
<h4>Chinese PhDs, 1938-1951</h4>
<div style='display: flex; align-items: center;'>
<div style='background-color: darkgreen; width: 20px; height: 20px; border-radius: 50%; margin-right: 5px;'></div>
<span>10</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: darkgreen; width: 25px; height: 25px; border-radius: 50%; margin-right: 5px;'></div>
<span>20</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: darkgreen; width: 30px; height: 30px; border-radius: 50%; margin-right: 5px;'></div>
<span>50</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: darkgreen; width: 35px; height: 35px; border-radius: 50%; margin-right: 5px;'></div>
<span>100</span>
</div>
</div>
"
# Add the custom legend to the map
map3 %>%
addControl(html = legend3, position = "bottomright")
map4 <- phd_all_master %>%
filter(Degree_Period == "1952-1962") %>%
group_by(City, Lat, Long) %>%
count() %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers( radius = ~log(n)*3,
label = ~City,
color = "white",
weight = 1,
opacity = 0.6,
fill = TRUE,
fillColor = "purple",
fillOpacity = 0.9,
stroke = TRUE,
popup = ~paste( "City:", City ,
"<br>",
"Chinese PhDs, 1952-1962:", n))
legend4 <- "
<div style='background-color: white; padding: 10px;'>
<h4>Chinese PhDs, 1952-1962</h4>
<div style='display: flex; align-items: center;'>
<div style='background-color: purple; width: 20px; height: 20px; border-radius: 50%; margin-right: 5px;'></div>
<span>10</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: purple; width: 25px; height: 25px; border-radius: 50%; margin-right: 5px;'></div>
<span>20</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: purple; width: 30px; height: 30px; border-radius: 50%; margin-right: 5px;'></div>
<span>50</span>
</div>
<div style='display: flex; align-items: center;'>
<div style='background-color: purple; width: 35px; height: 35px; border-radius: 50%; margin-right: 5px;'></div>
<span>100</span>
</div>
</div>
"
# Add the custom legend to the map
map4 %>%
addControl(html = legend4, position = "bottomright")
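The five legends differ only in their title and fill color. As a sketch of how the repetition could be reduced, the hypothetical helper make_legend() below (not part of the original script) builds the same HTML structure from a title and a color using base R only. The default sizes and counts are copied from the legends above.

```r
# Hypothetical helper: build the legend HTML for a given title and colour
make_legend <- function(title, colour,
                        sizes = c(20, 25, 30, 35),
                        counts = c(10, 20, 50, 100)) {
  # One flex row per legend entry: a coloured circle followed by its count
  rows <- sprintf(
    paste0(
      "<div style='display: flex; align-items: center;'>",
      "<div style='background-color: %s; width: %dpx; height: %dpx; ",
      "border-radius: 50%%; margin-right: 5px;'></div>",
      "<span>%d</span></div>"
    ),
    colour, sizes, sizes, counts
  )
  # Wrap the rows in the white box with the title
  paste0(
    "<div style='background-color: white; padding: 10px;'>",
    "<h4>", title, "</h4>",
    paste(rows, collapse = ""),
    "</div>"
  )
}

legend_example <- make_legend("Chinese PhDs, 1905-1926", "orange")
legend_example
```

The resulting string can be passed to addControl(html = ...) in the same way as the handwritten legends.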
# all regions included
phd_all_master %>% group_by(Field) %>% count(sort = TRUE)
phd_all_master %>% group_by(Discipline) %>% count(sort = TRUE)
# by regions
phd_all_master %>% group_by(Region) %>% count(Field) %>% arrange(Region, desc(n))
phd_all_master %>% group_by(Region) %>% count(Discipline) %>% arrange(Region, desc(n))
# by periods
phd_all_master %>% group_by(Degree_Period) %>% count(Field) %>% arrange(Degree_Period, desc(n))
phd_all_master %>% group_by(Degree_Period) %>% count(Discipline) %>% arrange(Degree_Period, desc(n))
region_field <- phd_all_master %>%
drop_na(Field) %>%
select(Field, Region) %>%
group_by(Region, Field) %>%
tally() %>%
spread(key = Field, value = n)
# read first column as row names
region_field <- column_to_rownames(region_field, var = "Region")
# replace NA values with 0
region_field <- mutate_all(region_field, ~replace(., is.na(.), 0))
# load packages for CA
library(FactoMineR)
res.ca1<-CA(region_field,graph=FALSE)
plot.CA(res.ca1,cex=0.9,cex.main=0.9,cex.axis=0.9,title="Chinese PhDs: Region and Field of Study")
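Correspondence analysis takes a contingency table as input: one row per region, one column per field, and counts in the cells, with zeros where a combination never occurs. The script builds this table with tally(), spread() and an explicit replacement of NA by 0. As an illustrative alternative, base R's xtabs() produces the same layout in one step. The toy data frame below is made up and only stands in for phd_all_master.

```r
# Toy data standing in for phd_all_master (values are made up)
toy <- data.frame(
  Region = c("America", "America", "Europe", "Europe", "Europe"),
  Field  = c("Sciences", "Humanities", "Sciences", "Sciences", "Medicine")
)

# Cross-tabulate Region by Field; absent combinations are counted as 0
region_field_toy <- xtabs(~ Region + Field, data = toy)
region_field_toy

# Plain data frame with regions as row names and one column per field
region_field_df <- as.data.frame.matrix(region_field_toy)
region_field_df
```

A data frame of this shape can be handed to CA() in the same way as region_field.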
country_disc <- phd_all_master %>%
drop_na(Discipline) %>%
select(Discipline, Country) %>%
group_by(Country, Discipline) %>%
tally() %>%
spread(key = Discipline, value = n)
# read first column as row names
country_disc <- column_to_rownames(country_disc, var = "Country")
# replace NA values with 0
country_disc <- mutate_all(country_disc, ~replace(., is.na(.), 0))
# load packages for CA
library(factoextra)
res.ca2<-CA(country_disc,graph=FALSE)
# Get the contributions of the columns (disciplines)
col_contrib <- get_ca_col(res.ca2)$contrib
# For example, highlight the top 15 contributing disciplines on the first dimension
top_disciplines <- names(sort(col_contrib[,1], decreasing = TRUE))[1:15]
fviz_ca_biplot(res.ca2,
repel = TRUE,
title = "Chinese PhDs: Country and Discipline",
cex = 0.5,
cex.main = 0.8,
cex.axis = 1,
col.row = "blue", # Set country labels and dots to blue
col.col = "red", # Set discipline labels and dots to red
pointsize.col = col_contrib[,1]/max(col_contrib[,1]) * 5, # Scale discipline dot sizes
label = "all", # Show labels for both rows and columns
select.col = list(name = top_disciplines)) # Limit discipline labels to top contributors
mca_field_period <- phd_all_master %>% distinct(RID, Region, Degree_Period2, Field) %>% drop_na(RID, Region, Degree_Period2, Field) %>% rename(Period = Degree_Period2)
mca_field_period <- column_to_rownames(mca_field_period, "RID")
res.MCA<-MCA(mca_field_period,graph=FALSE)
plot.MCA(res.MCA, choix='var',title="Variables Plot",col.var=c(1,2,3))
# Create the MCA plot
mca_plot <- fviz_mca_var(res.MCA,
col.var = c("black", "black", "black", "red", "red", "red", "red", "red",
"lightgreen", "lightgreen", "lightgreen", "lightgreen", "lightgreen", "lightgreen"), # Updated colors
repel = TRUE, # Avoid overlapping labels
title = "Chinese PhDs: Region, Field, and Graduation Period", # Title
label = "var") # Show only variable labels
# Customize the legend using ggplot2
mca_plot +
scale_color_manual(
values = c("black" = "black", "lightgreen" = "lightgreen", "red" = "red"), # Map colors
labels = c("Region", "Field", "Period") # Custom legend labels
) +
guides(color = guide_legend(title = "Variables"))
library(explor)
explor(res.MCA)
res <- explor::prepare_results(res.MCA)
explor::MCA_var_plot(res, xax = 1, yax = 2, var_sup = FALSE, var_sup_choice = ,
var_lab_min_contrib = 0, col_var = "Variable", symbol_var = NULL, size_var = "Cos2",
size_range = c(52.5, 700), labels_size = 10, point_size = 56, transitions = TRUE,
labels_positions = NULL, labels_prepend_var = FALSE, xlim = c(-2.7, 2.53),
ylim = c(-2.04, 3.19))
mca_disc_period <- phd_all_master %>% distinct(RID, Region, Degree_Period2, Discipline) %>% drop_na(RID, Region, Degree_Period2, Discipline) %>% rename(Period = Degree_Period2)
mca_disc_period <- column_to_rownames(mca_disc_period, "RID")
res.MCA2<-MCA(mca_disc_period,graph=FALSE)
plot.MCA(res.MCA2, choix='var',title="Variables Plot",col.var=c(1,2,3))
res <- explor::prepare_results(res.MCA2)
# filter theses in the Humanities and Social Sciences (1620)
phd_shs <- phd_all_master %>% filter(Field == "Humanities and social sciences") %>% drop_na(Thesis)
library(tidyverse)
library(tidytext)
# clean titles
phd_shs <- phd_shs %>%
mutate(title = str_replace_all(Thesis, "[:digit:]", "")) %>% # remove digits
mutate(title = str_remove(title, "^The\\s+")) %>% # remove leading article "The" (whole word only, so "Theory" is kept)
mutate(title = str_remove(title, "^A\\s+")) %>% # remove leading article "A" (whole word only, so "American" is kept)
mutate(title = str_replace_all(title, "- ", " ")) %>%
mutate(title = str_replace_all(title, "'s", "")) %>%
mutate(title = str_replace_all(title, "\\, ", " ")) %>%
mutate(title = str_replace_all(title, "China -", "China ")) %>%
mutate(title = str_remove(title, "-$")) %>% # remove hyphen at end of string
mutate(title = str_remove(title, "\\.$")) %>% # remove period at end of string
mutate(title = trimws(title, "both")) %>%
mutate(title = str_squish(title)) %>%
relocate(title, .after = Thesis) %>%
mutate(published = str_extract(title, "Published")) %>%
relocate(published, .after = title)
# tokenization (unigram)
library(tidytext)
data("stop_words")
shs_unigram <- phd_shs %>%
unnest_tokens(output = word, input = title) %>%
anti_join(stop_words, by = "word") # remove stop words
shs_unigram <- shs_unigram %>% mutate(lgth = nchar(word))
shs_unigram_count <- shs_unigram %>%
group_by(word) %>%
tally() %>%
arrange(desc(n))
shs_unigram_count
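To make the tokenization step concrete, the sketch below reproduces its logic in base R on two made-up titles: lowercase the text, split it into words, and drop stop words. The three-word stop list is a stand-in chosen for illustration. The script itself relies on unnest_tokens() and the much longer stop_words list from tidytext.

```r
# Made-up titles and a tiny stop-word list (illustration only)
titles <- c("The Foreign Trade of China", "A Study of Chinese Law")
stop_small <- c("the", "of", "a")

# Lowercase, split on anything that is not a letter,
# then drop empty strings and stop words
tokenize <- function(s) {
  w <- strsplit(tolower(s), "[^a-z]+")[[1]]
  w[nzchar(w) & !(w %in% stop_small)]
}

tokens <- lapply(titles, tokenize)
tokens
```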
# tf_idf by disciplines
shs_tf_idf_discipline <- shs_unigram %>%
count(Discipline, word) %>%
bind_tf_idf(word, Discipline, n) %>%
arrange(desc(tf_idf))
shs_tf_idf_discipline %>%
group_by(Discipline) %>%
top_n(10, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Discipline)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Discipline, scales = "free") +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the dissertation titles",
subtitle = "tf-idf by discipline",
caption = "Source: Yuan, T’ung-li (1961, 1962)")
# tf-idf by period
shs_tf_idf_period <- shs_unigram %>% filter(lgth >3) %>%
filter(!word == "preface") %>%
count(Degree_Period, word) %>%
bind_tf_idf(word, Degree_Period, n) %>%
arrange(desc(tf_idf))
shs_tf_idf_period %>%
group_by(Degree_Period) %>%
top_n(10, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Degree_Period)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Degree_Period, scales = "free") +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the dissertation titles",
subtitle = "tf-idf by period",
caption = "Source: Yuan, T’ung-li (1961, 1962)")
# tf-idf by Region
shs_tf_idf_region <- shs_unigram %>%
count(Region, word) %>%
bind_tf_idf(word, Region, n) %>%
arrange(desc(tf_idf))
shs_tf_idf_region %>%
group_by(Region) %>%
top_n(10, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Region)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Region, scales = "free") +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the dissertation titles",
subtitle = "tf-idf by region",
caption = "Source: Yuan, T’ung-li (1961, 1962)")
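The tf-idf score used in the three plots above rewards words that are frequent within one group (discipline, period, or region) but rare across groups. The sketch below computes it by hand in base R on made-up counts, following the usual definition: term frequency multiplied by the natural log of the inverse document frequency, which is also how bind_tf_idf() is documented. Each group plays the role of a "document".

```r
# Toy word counts per group (made-up values, not from the dataset)
counts <- data.frame(
  group = c("Law", "Law", "Education", "Education"),
  word  = c("china", "treaty", "china", "school"),
  n     = c(4, 1, 3, 2)
)

# Term frequency: share of a word among all words of its group
counts$tf <- counts$n / ave(counts$n, counts$group, FUN = sum)

# Inverse document frequency: ln(number of groups / groups containing the word)
# Each row is a unique (group, word) pair, so rows per word = groups with the word
n_groups <- length(unique(counts$group))
groups_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_groups / groups_with_word)

# tf-idf: a word present in every group gets a score of 0
counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

In this toy example "china" occurs in both groups and therefore scores zero, which explains why ubiquitous words such as "china" or "chinese" do not dominate the tf-idf plots even though they top the raw frequency counts.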
library(widyr)
word_pairs <- shs_unigram %>%
pairwise_count(word, RID, sort = TRUE)
word_pairs_filtered <- word_pairs %>% filter(!item1 == "study") %>% filter(!item2 == "study")
set.seed(2024)
library(igraph)
library(tidygraph)
library(ggraph)
word_pairs_filtered %>%
filter(n > 5) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
geom_node_point(color = "orange", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, point.padding = unit(0.2, "lines")) +
theme_void()+
labs(title = "Word co-occurrences in dissertation titles",
subtitle = "Most frequent pairs (n>5)",
caption = "Source: Yuan, T’ung-li (1961, 1962)")
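The co-occurrence count behind this network is the number of titles in which two words appear together. As a minimal sketch in base R, using made-up tokens, the same quantity can be obtained by multiplying a title-by-word incidence matrix with itself. The script uses pairwise_count() from widyr for this step.

```r
# Toy tokens: one row per (title, word), as in shs_unigram (values made up)
tokens <- data.frame(
  RID  = c(1, 1, 1, 2, 2, 3, 3),
  word = c("china", "law", "treaty", "china", "law", "china", "education")
)

# Title-by-word incidence matrix: 1 if the word occurs in the title, else 0
tab <- table(tokens$RID, tokens$word)
incidence <- matrix(as.numeric(tab > 0), nrow = nrow(tab), dimnames = dimnames(tab))

# Word-by-word co-occurrence: number of titles shared by each pair of words
cooc <- crossprod(incidence)
diag(cooc) <- 0  # drop self-pairs

cooc["china", "law"]
```

The matrix is symmetric, so each pair is recorded in both orders.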
# Create a two-mode network linking titles and the words they contain
edge_2m <- shs_unigram %>% select(Thesis, word)
node_title <- edge_2m %>% distinct(Thesis) %>% rename(Name = Thesis) %>% mutate(Type = "Title")
node_word <- edge_2m %>% distinct(word) %>% rename(Name = word) %>% mutate(Type = "Word")
node <- bind_rows(node_title, node_word)
library(igraph)
title_net <- graph_from_data_frame(edge_2m, directed = FALSE, vertices = node)
# Community detection with Louvain
set.seed(2025)
title_net_cluster <- cluster_louvain(title_net)
# Extract community membership
title_net_cluster_df <- data.frame(title_net_cluster$membership,
title_net_cluster$names) %>%
group_by(title_net_cluster.membership) %>%
add_tally() %>% # add size of clusters
rename(member = title_net_cluster.names, cluster = title_net_cluster.membership, size = n)
## add node type and degree centrality
node_type <- node %>% rename(member = Name)
title_net_cluster_df <- left_join(title_net_cluster_df, node_type)
degree <- degree(title_net, mode = "all", normalized = FALSE)
degree_df <- as.data.frame(degree)
degree_df <- rownames_to_column(degree_df, "member")
title_net_cluster_df <- left_join(title_net_cluster_df, degree_df)
# Plot the networks
# Index node size on degree centrality
deg_cent <- degree(title_net, mode = "all", normalized = TRUE)
V(title_net)$size <- deg_cent * 100
# Index node color on their type
V(title_net)$color <- ifelse(V(title_net)$Type == "Title", "red", "orange")
plot(title_net,
vertex.size = V(title_net)$size,
vertex.color = V(title_net)$color,
vertex.label.color = "black",
vertex.label.cex = V(title_net)$size/10,
main = "Dissertation Title Network")
# Plot the communities
V(title_net)$group <- title_net_cluster$membership
V(title_net)$color <- title_net_cluster$membership
plot(title_net_cluster, title_net,
vertex.label=NA,
vertex.label.color = "black",
vertex.label.cex = 0.5,
vertex.size=1.8,
main="Semantic Communities",
sub = "Louvain method")
library(BTM)
x <- shs_unigram %>% filter(!word == "study") %>%
select(RID, word)
set.seed(2025)
model10 <- BTM(x, k = 10, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:10:40 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:10:40 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:10:41 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:10:41 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:10:42 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:10:42 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:10:43 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:10:43 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:10:44 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:10:44 Start Gibbs sampling iteration 901/1000
model15 <- BTM(x, k = 15, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:10:45 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:10:45 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:10:46 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:10:47 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:10:47 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:10:48 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:10:49 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:10:49 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:10:50 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:10:51 Start Gibbs sampling iteration 901/1000
model20 <- BTM(x, k = 20, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:10:51 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:10:52 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:10:53 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:10:54 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:10:54 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:10:55 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:10:56 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:10:57 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:10:57 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:10:58 Start Gibbs sampling iteration 901/1000
model25 <- BTM(x, k = 25, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:10:59 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:00 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:01 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:02 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:03 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:04 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:05 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:06 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:06 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:07 Start Gibbs sampling iteration 901/1000
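A biterm topic model is suited to short texts such as titles because it models pairs of words that co-occur in the same text (biterms) rather than word counts per document. The BTM package derives these pairs internally. The sketch below, written in base R on made-up tokens, only illustrates what a biterm is and is not the package's implementation.

```r
# Toy tokens: one row per (title, word), values made up
x_toy <- data.frame(
  RID  = c(1, 1, 1, 2, 2),
  word = c("china", "foreign", "trade", "chinese", "law"),
  stringsAsFactors = FALSE
)

# List every unordered word pair within each title
biterms <- do.call(rbind, lapply(split(x_toy$word, x_toy$RID), function(w) {
  if (length(w) < 2) return(NULL)   # a one-word title yields no biterm
  pairs <- t(combn(sort(w), 2))     # all 2-word combinations, one per row
  data.frame(term1 = pairs[, 1], term2 = pairs[, 2])
}))

biterms
```

A title of three words yields three biterms and a title of two words yields one, so the number of pairs grows quickly with title length.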
# Most frequent terms and scores for each model
# model10$theta
# terms(model10, top_n = 10)
# topicterms10
# scores <- predict(model10, newdata = x)
# head(scores, 10)
# hist(scores)
# model20$theta
# terms(model20)
# model20 <- terms(model20, top_n = 10)
# topicterms10
# scores <- predict(model10, newdata = x)
# head(scores, 10)
# hist(scores)
# Most frequent terms for each model
library(BTM)
## 10-topic model
terms(model10, top_n = 10)
## [[1]]
## token probability
## 1 china 0.02625494
## 2 political 0.02450475
## 3 economic 0.02341089
## 4 chinese 0.01991052
## 5 influence 0.01597261
## 6 sen 0.01509752
## 7 sun 0.01509752
## 8 yat 0.01509752
## 9 social 0.01487875
## 10 philosophy 0.01465997
##
## [[2]]
## token probability
## 1 chinese 0.032772745
## 2 political 0.018459858
## 3 france 0.015823274
## 4 century 0.013563344
## 5 comparative 0.013563344
## 6 england 0.012810034
## 7 mid 0.012056724
## 8 school 0.010926759
## 9 religious 0.009796795
## 10 china 0.009420140
##
## [[3]]
## token probability
## 1 china 0.03591563
## 2 chinese 0.03536648
## 3 education 0.02723903
## 4 special 0.01944107
## 5 united 0.01867226
## 6 reference 0.01713464
## 7 school 0.01493803
## 8 program 0.01274142
## 9 students 0.01208244
## 10 educational 0.01142346
##
## [[4]]
## token probability
## 1 chinese 0.07174477
## 2 law 0.04844577
## 3 china 0.02537746
## 4 family 0.01153647
## 5 century 0.01130578
## 6 income 0.01130578
## 7 comparative 0.01084442
## 8 tax 0.01038305
## 9 individual 0.01015237
## 10 international 0.01015237
##
## [[5]]
## token probability
## 1 china 0.04161432
## 2 chinese 0.02801051
## 3 central 0.02187546
## 4 government 0.01867456
## 5 asia 0.01814108
## 6 development 0.01814108
## 7 local 0.01787434
## 8 industrialization 0.01760760
## 9 special 0.01387322
## 10 reference 0.01333974
##
## [[6]]
## token probability
## 1 chinese 0.02240356
## 2 theory 0.01728348
## 3 reference 0.01568345
## 4 experimental 0.01504344
## 5 learning 0.01440343
## 6 united 0.01440343
## 7 psychology 0.01344342
## 8 reading 0.01312341
## 9 special 0.01216339
## 10 relation 0.01152338
##
## [[7]]
## token probability
## 1 china 0.06767850
## 2 international 0.02494593
## 3 chinese 0.02463017
## 4 foreign 0.02178835
## 5 relations 0.02063057
## 6 sino 0.01515743
## 7 policy 0.01473642
## 8 trade 0.01231561
## 9 british 0.01157884
## 10 japanese 0.01115783
##
## [[8]]
## token probability
## 1 china 0.079265500
## 2 chinese 0.041746170
## 3 education 0.026704637
## 4 law 0.022648493
## 5 social 0.022310481
## 6 modern 0.015550241
## 7 ancient 0.011325091
## 8 philosophy 0.010987079
## 9 feudal 0.009128014
## 10 comparative 0.008959008
##
## [[9]]
## token probability
## 1 china 0.04866493
## 2 chinese 0.02296422
## 3 labor 0.02022562
## 4 rural 0.01875099
## 5 service 0.01811900
## 6 united 0.01411643
## 7 church 0.01306312
## 8 organization 0.01264180
## 9 economic 0.01137783
## 10 war 0.01116717
##
## [[10]]
## token probability
## 1 law 0.03367361
## 2 international 0.02540902
## 3 london 0.02020539
## 4 china 0.01592005
## 5 britain 0.01530786
## 6 political 0.01438957
## 7 american 0.01408347
## 8 basis 0.01285909
## 9 supervision 0.01194080
## 10 united 0.01194080
## 15-topic model
terms(model15, top_n = 10)
## [[1]]
## token probability
## 1 chinese 0.04227906
## 2 law 0.04086991
## 3 god 0.01926302
## 4 idea 0.01832359
## 5 doctrine 0.01691444
## 6 li 0.01597501
## 7 evolution 0.01456587
## 8 marriage 0.01456587
## 9 court 0.01268701
## 10 obligations 0.01268701
##
## [[2]]
## token probability
## 1 china 0.06098115
## 2 chinese 0.03597539
## 3 education 0.02742079
## 4 economic 0.02610470
## 5 educational 0.01974358
## 6 reference 0.01930488
## 7 philosophy 0.01798879
## 8 social 0.01733075
## 9 special 0.01733075
## 10 theory 0.01447921
##
## [[3]]
## token probability
## 1 chinese 0.03296725
## 2 students 0.02052170
## 3 selected 0.01783077
## 4 basis 0.01480348
## 5 schools 0.01480348
## 6 theory 0.01446711
## 7 college 0.01379438
## 8 american 0.01345801
## 9 comparative 0.01143982
## 10 english 0.01143982
##
## [[4]]
## token probability
## 1 law 0.06936891
## 2 chinese 0.05258273
## 3 china 0.02962515
## 4 international 0.02567546
## 5 century 0.01555438
## 6 german 0.01259211
## 7 civil 0.01209840
## 8 period 0.01185154
## 9 japanese 0.01111097
## 10 power 0.01111097
##
## [[5]]
## token probability
## 1 international 0.03728616
## 2 law 0.02757723
## 3 legal 0.02447038
## 4 china 0.02369366
## 5 special 0.01942174
## 6 world 0.01825667
## 7 reference 0.01476145
## 8 public 0.01398474
## 9 britain 0.01359638
## 10 federal 0.01359638
##
## [[6]]
## token probability
## 1 china 0.03496597
## 2 development 0.02887673
## 3 nations 0.02730532
## 4 united 0.02710889
## 5 international 0.02259107
## 6 chinese 0.01905538
## 7 special 0.01826967
## 8 reference 0.01748397
## 9 economic 0.01571612
## 10 industrialization 0.01237686
##
## [[7]]
## token probability
## 1 reference 0.02797119
## 2 british 0.02634514
## 3 china 0.02276785
## 4 chinese 0.02276785
## 5 special 0.02244264
## 6 foreign 0.01951576
## 7 united 0.01723930
## 8 relations 0.01658889
## 9 policy 0.01431243
## 10 diplomatic 0.01268639
##
## [[8]]
## token probability
## 1 students 0.02740171
## 2 china 0.02583612
## 3 chinese 0.02231355
## 4 social 0.02035656
## 5 english 0.01996517
## 6 factors 0.01722539
## 7 personal 0.01722539
## 8 york 0.01683399
## 9 conference 0.01487700
## 10 school 0.01331142
##
## [[9]]
## token probability
## 1 china 0.06842045
## 2 relations 0.04140673
## 3 chinese 0.03635744
## 4 sino 0.02550146
## 5 central 0.02045216
## 6 treaty 0.01893738
## 7 soviet 0.01792752
## 8 government 0.01565534
## 9 historical 0.01388808
## 10 diplomatic 0.01212083
##
## [[10]]
## token probability
## 1 china 0.04243260
## 2 labor 0.02794460
## 3 united 0.02656479
## 4 service 0.02552993
## 5 war 0.02208041
## 6 control 0.01828593
## 7 international 0.01759603
## 8 chinese 0.01656117
## 9 education 0.01552631
## 10 development 0.01449145
##
## [[11]]
## token probability
## 1 china 0.06001739
## 2 education 0.05882661
## 3 chinese 0.02381786
## 4 program 0.02381786
## 5 curriculum 0.01881661
## 6 rural 0.01881661
## 7 school 0.01548244
## 8 united 0.01548244
## 9 secondary 0.01524429
## 10 proposed 0.01429167
##
## [[12]]
## token probability
## 1 income 0.03437954
## 2 individual 0.02380285
## 3 george 0.02274518
## 4 tax 0.02274518
## 5 north 0.01904334
## 6 carolina 0.01798567
## 7 flexibility 0.01745683
## 8 built 0.01692800
## 9 motor 0.01375499
## 10 reference 0.01269732
##
## [[13]]
## token probability
## 1 political 0.05130770
## 2 china 0.04070279
## 3 chinese 0.02895140
## 4 social 0.02665845
## 5 philosophy 0.01977959
## 6 sen 0.01977959
## 7 sun 0.01977959
## 8 yat 0.01977959
## 9 development 0.01576692
## 10 science 0.01433383
##
## [[14]]
## token probability
## 1 china 0.09781908
## 2 chinese 0.03621726
## 3 foreign 0.02379038
## 4 system 0.01686683
## 5 trade 0.01384887
## 6 banking 0.01331629
## 7 special 0.01296124
## 8 dynasty 0.01118597
## 9 modern 0.01083091
## 10 reference 0.01029833
##
## [[15]]
## token probability
## 1 chinese 0.09670194
## 2 experimental 0.02467254
## 3 learning 0.02121907
## 4 reading 0.02121907
## 5 analysis 0.01776561
## 6 psychological 0.01628555
## 7 psychology 0.01529885
## 8 characters 0.01431214
## 9 movements 0.01283209
## 10 culture 0.01233874
## 20-topic model
terms(model20, top_n = 10)
## [[1]]
## token probability
## 1 basis 0.02323834
## 2 chinese 0.02076670
## 3 east 0.02027237
## 4 supervision 0.02027237
## 5 selected 0.01928372
## 6 american 0.01878939
## 7 international 0.01780074
## 8 county 0.01483477
## 9 economic 0.01384612
## 10 war 0.01384612
##
## [[2]]
## token probability
## 1 china 0.03397222
## 2 nations 0.03188179
## 3 united 0.02848482
## 4 chinese 0.02639439
## 5 social 0.02456525
## 6 analysis 0.02273612
## 7 development 0.02012307
## 8 economic 0.01907786
## 9 educational 0.01594220
## 10 life 0.01150002
##
## [[3]]
## token probability
## 1 special 0.04613937
## 2 reference 0.04322781
## 3 china 0.03494104
## 4 development 0.02508651
## 5 economic 0.02486254
## 6 chinese 0.02015924
## 7 united 0.02015924
## 8 industrialization 0.01456007
## 9 theory 0.01433611
## 10 system 0.01388817
##
## [[4]]
## token probability
## 1 china 0.08499351
## 2 chinese 0.03278874
## 3 foreign 0.03103797
## 4 international 0.02435321
## 5 policy 0.01973754
## 6 war 0.01941922
## 7 trade 0.01766845
## 8 labor 0.01544020
## 9 development 0.01496272
## 10 sino 0.01225698
##
## [[5]]
## token probability
## 1 china 0.06519758
## 2 chinese 0.05261633
## 3 education 0.05071008
## 4 modern 0.02745382
## 5 theory 0.01563507
## 6 constitution 0.01449132
## 7 movement 0.01411007
## 8 adult 0.01334757
## 9 social 0.01296632
## 10 obligations 0.01220382
##
## [[6]]
## token probability
## 1 china 0.07027134
## 2 chinese 0.04010004
## 3 government 0.03216023
## 4 central 0.02223546
## 5 local 0.01945652
## 6 provincial 0.01628059
## 7 ancient 0.01469263
## 8 law 0.01469263
## 9 dynasty 0.01310467
## 10 people 0.01310467
##
## [[7]]
## token probability
## 1 china 0.04946290
## 2 political 0.04420779
## 3 economic 0.02535124
## 4 ideas 0.02411475
## 5 philosophy 0.02349650
## 6 sen 0.02133263
## 7 sun 0.02133263
## 8 yat 0.02133263
## 9 confucius 0.01885964
## 10 social 0.01855052
##
## [[8]]
## token probability
## 1 china 0.09112718
## 2 chinese 0.02510984
## 3 treatment 0.01767126
## 4 critical 0.01488180
## 5 organization 0.01488180
## 6 favored 0.01348706
## 7 nation 0.01348706
## 8 rural 0.01348706
## 9 administrative 0.01255724
## 10 dynasty 0.01209233
##
## [[9]]
## token probability
## 1 chinese 0.05594034
## 2 learning 0.02547125
## 3 students 0.02547125
## 4 english 0.02001410
## 5 experimental 0.02001410
## 6 college 0.01910457
## 7 political 0.01910457
## 8 psychology 0.01819505
## 9 science 0.01501171
## 10 factors 0.01410219
##
## [[10]]
## token probability
## 1 relations 0.05329461
## 2 china 0.04225257
## 3 soviet 0.03265081
## 4 sino 0.02640966
## 5 diplomatic 0.02448931
## 6 treaty 0.02400922
## 7 chinese 0.02352913
## 8 tax 0.01920833
## 9 central 0.01680789
## 10 north 0.01680789
##
## [[11]]
## token probability
## 1 law 0.10140350
## 2 chinese 0.04198894
## 3 international 0.04119674
## 4 legal 0.02957789
## 5 status 0.02139189
## 6 china 0.02112782
## 7 comparative 0.01848717
## 8 united 0.01690278
## 9 american 0.01294181
## 10 civil 0.01294181
##
## [[12]]
## token probability
## 1 british 0.03346773
## 2 china 0.02907924
## 3 power 0.02085082
## 4 france 0.01591377
## 5 balance 0.01481664
## 6 chinese 0.01426808
## 7 comparative 0.01426808
## 8 market 0.01371952
## 9 treaty 0.01371952
## 10 england 0.01317096
##
## [[13]]
## token probability
## 1 china 0.04550410
## 2 motor 0.02747757
## 3 land 0.02661917
## 4 transportation 0.02490236
## 5 federal 0.02404395
## 6 administration 0.02146873
## 7 existence 0.02061033
## 8 efficiency 0.01717670
## 9 revision 0.01631830
## 10 system 0.01545989
##
## [[14]]
## token probability
## 1 chinese 0.05031443
## 2 idea 0.02124710
## 3 social 0.01565723
## 4 confucianism 0.01453925
## 5 god 0.01453925
## 6 ti 0.01453925
## 7 method 0.01398027
## 8 classics 0.01286229
## 9 development 0.01286229
## 10 francisco 0.01286229
##
## [[15]]
## token probability
## 1 education 0.05806760
## 2 china 0.05758970
## 3 school 0.02557020
## 4 chinese 0.02461439
## 5 curriculum 0.02055222
## 6 program 0.01983536
## 7 teachers 0.01601214
## 8 secondary 0.01529528
## 9 comparative 0.01457843
## 10 children 0.01410053
##
## [[16]]
## token probability
## 1 chinese 0.04990971
## 2 rural 0.02364490
## 3 service 0.02101842
## 4 school 0.01970518
## 5 political 0.01904856
## 6 united 0.01707870
## 7 program 0.01642208
## 8 theories 0.01576546
## 9 church 0.01510883
## 10 china 0.01379559
##
## [[17]]
## token probability
## 1 century 0.01933974
## 2 united 0.01933974
## 3 paradise 0.01832240
## 4 germany 0.01730505
## 5 literary 0.01730505
## 6 aspects 0.01628771
## 7 czechoslovakia 0.01628771
## 8 poland 0.01628771
## 9 political 0.01628771
## 10 rights 0.01628771
##
## [[18]]
## token probability
## 1 chinese 0.04747806
## 2 analysis 0.02374215
## 3 philosophy 0.01874512
## 4 reference 0.01624660
## 5 wu 0.01499735
## 6 missionary 0.01437272
## 7 john 0.01249883
## 8 june 0.01249883
## 9 special 0.01249883
## 10 psychological 0.01187420
##
## [[19]]
## token probability
## 1 english 0.04045552
## 2 chinese 0.03609944
## 3 language 0.01991972
## 4 mid 0.01991972
## 5 century 0.01805283
## 6 france 0.01805283
## 7 china 0.01743054
## 8 reading 0.01618594
## 9 seventeenth 0.01431905
## 10 achievement 0.01369675
##
## [[20]]
## token probability
## 1 feudal 0.03939604
## 2 law 0.03574893
## 3 chinese 0.02918414
## 4 china 0.02699588
## 5 studies 0.02188993
## 6 kingship 0.02116051
## 7 inquiry 0.01897225
## 8 nature 0.01824282
## 9 period 0.01751340
## 10 po 0.01386630
## 25-topic model
terms(model25, top_n = 10)
## [[1]]
## token probability
## 1 chinese 0.07815845
## 2 china 0.03059063
## 3 life 0.03059063
## 4 lu 0.02153010
## 5 native 0.01926496
## 6 tibetan 0.01926496
## 7 century 0.01813240
## 8 movement 0.01699983
## 9 pali 0.01699983
## 10 sanskrit 0.01699983
##
## [[2]]
## token probability
## 1 american 0.03200245
## 2 analysis 0.03132169
## 3 stories 0.02247183
## 4 china 0.01974880
## 5 literature 0.01906804
## 6 chinese 0.01770653
## 7 basis 0.01566425
## 8 critical 0.01498349
## 9 fair 0.01362197
## 10 return 0.01362197
##
## [[3]]
## token probability
## 1 special 0.04467763
## 2 reference 0.04421229
## 3 chinese 0.04095488
## 4 english 0.02187580
## 5 relation 0.01908374
## 6 china 0.01768771
## 7 american 0.01582633
## 8 experimental 0.01489565
## 9 dialect 0.01443030
## 10 reading 0.01443030
##
## [[4]]
## token probability
## 1 george 0.03700598
## 2 social 0.02714101
## 3 novels 0.02467476
## 4 characteristics 0.01974228
## 5 relation 0.01974228
## 6 attempt 0.01727603
## 7 historical 0.01727603
## 8 major 0.01604291
## 9 charles 0.01480979
## 10 mental 0.01480979
##
## [[5]]
## token probability
## 1 china 0.09179844
## 2 education 0.04541429
## 3 chinese 0.03016916
## 4 social 0.02497932
## 5 educational 0.01978949
## 6 people 0.01751893
## 7 factors 0.01719457
## 8 history 0.01687021
## 9 school 0.01622148
## 10 rural 0.01524838
##
## [[6]]
## token probability
## 1 income 0.04448151
## 2 china 0.03924916
## 3 individual 0.03336277
## 4 tax 0.03205468
## 5 theory 0.02616829
## 6 north 0.02486020
## 7 carolina 0.02224402
## 8 labor 0.02224402
## 9 flexibility 0.02158998
## 10 built 0.02093594
##
## [[7]]
## token probability
## 1 philosophy 0.05234942
## 2 confucius 0.03252310
## 3 john 0.02617867
## 4 confucianism 0.02538562
## 5 dewey 0.02459257
## 6 doctrine 0.02379952
## 7 neo 0.02379952
## 8 moral 0.02142036
## 9 god 0.01983425
## 10 political 0.01983425
##
## [[8]]
## token probability
## 1 china 0.08201941
## 2 special 0.03971972
## 3 chinese 0.03455335
## 4 reference 0.03164727
## 5 economic 0.02486640
## 6 education 0.02389771
## 7 development 0.02292901
## 8 system 0.01905423
## 9 malaya 0.01711684
## 10 relations 0.01679394
##
## [[9]]
## token probability
## 1 chinese 0.03803895
## 2 english 0.03289994
## 3 learning 0.02776093
## 4 comparative 0.02673313
## 5 manchurian 0.02159412
## 6 ethical 0.01748291
## 7 life 0.01748291
## 8 moral 0.01748291
## 9 text 0.01748291
## 10 obligation 0.01645511
##
## [[10]]
## token probability
## 1 london 0.03474916
## 2 chinese 0.03226767
## 3 studies 0.03061334
## 4 science 0.02978618
## 5 variety 0.02895901
## 6 china 0.02647752
## 7 political 0.02647752
## 8 american 0.01903305
## 9 law 0.01903305
## 10 school 0.01737872
##
## [[11]]
## token probability
## 1 chinese 0.05778911
## 2 students 0.04165127
## 3 college 0.03071918
## 4 english 0.01770478
## 5 reading 0.01770478
## 6 university 0.01718421
## 7 language 0.01458133
## 8 theory 0.01458133
## 9 children 0.01354018
## 10 method 0.01354018
##
## [[12]]
## token probability
## 1 china 0.06858973
## 2 central 0.04589627
## 3 relations 0.04438337
## 4 government 0.03127159
## 5 political 0.02421140
## 6 local 0.02168991
## 7 school 0.02017701
## 8 provincial 0.01866411
## 9 diplomatic 0.01815981
## 10 modern 0.01765551
##
## [[13]]
## token probability
## 1 nations 0.06253402
## 2 united 0.05488652
## 3 china 0.03689242
## 4 development 0.02384669
## 5 league 0.02384669
## 6 international 0.02159743
## 7 economic 0.02114757
## 8 county 0.01664905
## 9 supervision 0.01664905
## 10 basis 0.01484964
##
## [[14]]
## token probability
## 1 chinese 0.05209155
## 2 china 0.04674004
## 3 political 0.02925846
## 4 economic 0.02854493
## 5 law 0.02854493
## 6 ideas 0.02747462
## 7 sen 0.02462049
## 8 sun 0.02462049
## 9 yat 0.02462049
## 10 social 0.02247989
##
## [[15]]
## token probability
## 1 china 0.02935055
## 2 economic 0.02889907
## 3 united 0.02844759
## 4 labor 0.02709316
## 5 british 0.02573873
## 6 international 0.02528725
## 7 chinese 0.02167543
## 8 rural 0.02032100
## 9 government 0.01806361
## 10 church 0.01670918
##
## [[16]]
## token probability
## 1 china 0.10527618
## 2 chinese 0.02857362
## 3 foreign 0.02665606
## 4 war 0.02013634
## 5 policy 0.01802702
## 6 legal 0.01764351
## 7 trade 0.01725999
## 8 status 0.01649297
## 9 law 0.01476716
## 10 banking 0.01400013
##
## [[17]]
## token probability
## 1 motor 0.03415248
## 2 industrialization 0.03325396
## 3 transportation 0.03325396
## 4 developed 0.02337032
## 5 special 0.02247181
## 6 countries 0.02157330
## 7 development 0.02157330
## 8 china 0.02067478
## 9 reference 0.01977627
## 10 theory 0.01887776
##
## [[18]]
## token probability
## 1 education 0.05035715
## 2 china 0.04102180
## 3 chinese 0.03621268
## 4 program 0.02574577
## 5 school 0.02574577
## 6 curriculum 0.01923931
## 7 children 0.01839064
## 8 secondary 0.01810775
## 9 schools 0.01725908
## 10 reference 0.01527886
##
## [[19]]
## token probability
## 1 service 0.03662012
## 2 united 0.02656951
## 3 chinese 0.02441581
## 4 marriage 0.02154420
## 5 period 0.01580100
## 6 control 0.01508310
## 7 organization 0.01508310
## 8 car 0.01436520
## 9 functions 0.01436520
## 10 techniques 0.01436520
##
## [[20]]
## token probability
## 1 sino 0.04298268
## 2 international 0.04062762
## 3 china 0.03856695
## 4 relations 0.03238493
## 5 treaty 0.03150179
## 6 law 0.02973550
## 7 soviet 0.02090405
## 8 japanese 0.02060966
## 9 diplomatic 0.02002090
## 10 chinese 0.01972652
##
## [[21]]
## token probability
## 1 chinese 0.07496441
## 2 china 0.04300532
## 3 factors 0.03196490
## 4 learning 0.02905953
## 5 social 0.02731631
## 6 modern 0.02499201
## 7 constitution 0.02266771
## 8 development 0.01976234
## 9 movement 0.01976234
## 10 analysis 0.01860019
##
## [[22]]
## token probability
## 1 aspects 0.02320353
## 2 charles 0.01963513
## 3 church 0.01963513
## 4 china 0.01606673
## 5 christian 0.01606673
## 6 account 0.01517463
## 7 effect 0.01428253
## 8 injuries 0.01428253
## 9 responsibility 0.01428253
## 10 role 0.01339043
##
## [[23]]
## token probability
## 1 law 0.10452022
## 2 chinese 0.06466581
## 3 international 0.02684479
## 4 comparative 0.01911792
## 5 civil 0.01708453
## 6 succession 0.01667785
## 7 family 0.01545782
## 8 german 0.01545782
## 9 code 0.01423778
## 10 century 0.01301775
##
## [[24]]
## token probability
## 1 chinese 0.06572251
## 2 philosophy 0.02547696
## 3 ancient 0.02038259
## 4 law 0.01936371
## 5 tzu 0.01783540
## 6 system 0.01579765
## 7 nature 0.01528821
## 8 obligations 0.01477878
## 9 conference 0.01426934
## 10 li 0.01426934
##
## [[25]]
## token probability
## 1 united 0.03836408
## 2 england 0.03134066
## 3 france 0.02539777
## 4 reference 0.02485751
## 5 comparative 0.02053540
## 6 administration 0.01999514
## 7 application 0.01999514
## 8 theory 0.01999514
## 9 control 0.01837435
## 10 mid 0.01729382
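The per-topic lists printed above are hard to compare by eye, especially across the 10-, 15-, 20- and 25-topic models. The following is a minimal sketch, assuming only that `terms()` returns a list of data frames with `token` and `probability` columns (as the printed output suggests). It stacks such a list into one long table with a topic index. The helper name `tidy_terms` and the toy input are illustrative and not part of the original script; the toy values are copied from topics 11 and 15 of the 20-topic model.

```r
# Hypothetical helper: stack a list of per-topic term tables into one data frame.
tidy_terms <- function(term_list) {
  stacked <- lapply(seq_along(term_list), function(i) {
    topic_terms <- term_list[[i]]
    data.frame(
      topic = i,                              # position of the topic in the list
      rank = seq_len(nrow(topic_terms)),      # rank of the token within the topic
      token = topic_terms$token,
      probability = topic_terms$probability,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, stacked)
}

# Toy input mimicking the printed structure (top four tokens of two topics).
toy_terms <- list(
  data.frame(token = c("law", "chinese", "international", "legal"),
             probability = c(0.10140350, 0.04198894, 0.04119674, 0.02957789),
             stringsAsFactors = FALSE),
  data.frame(token = c("education", "china", "school", "chinese"),
             probability = c(0.05806760, 0.05758970, 0.02557020, 0.02461439),
             stringsAsFactors = FALSE)
)
long_terms <- tidy_terms(toy_terms)

# Tokens that appear in more than one topic.
shared <- names(which(table(long_terms$token) > 1))
```

With the fitted models, the same helper could presumably be applied as `tidy_terms(terms(model20, top_n = 10))`. Tokens such as "china" and "chinese", which recur in most topics, would then be easy to flag.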
# Plot the topics
library(BTM)
library(textplot)
library(ggraph)
library(concaveman)
plot(model10, top_n = 10,
     title = "Dissertation titles of Chinese PhDs abroad (1905-1962)",
     subtitle = "Biterm topic model with 10 topics")
plot(model15, top_n = 10,
     title = "Dissertation titles of Chinese PhDs abroad (1905-1962)",
     subtitle = "Biterm topic model with 15 topics")
plot(model20, top_n = 10,
     title = "Dissertation titles of Chinese PhDs abroad (1905-1962)",
     subtitle = "Biterm topic model with 20 topics")
plot(model25, top_n = 10,
     title = "Dissertation titles of Chinese PhDs abroad (1905-1962)",
     subtitle = "Biterm topic model with 25 topics")
# Build models by region
# For each region, drop the token "study", keep rows with lgth > 3,
# and retain only the document id (RID) and the token
usa <- shs_unigram %>%
  filter(word != "study", lgth > 3, Region == "USA") %>%
  select(RID, word)
uk <- shs_unigram %>%
  filter(word != "study", lgth > 3, Region == "UK") %>%
  select(RID, word)
eu <- shs_unigram %>%
  filter(word != "study", lgth > 3, Region == "Europe") %>%
  select(RID, word)
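The three regional subsets apply the same filters and differ only in the `Region` value. A minimal base R sketch of that step as a single function is given below. It assumes the column names used in the script (`RID`, `word`, `lgth`, `Region`). The helper `region_tokens`, its arguments, and the toy data frame are hypothetical and serve only to make the example runnable.

```r
# Hypothetical helper: keep (RID, word) pairs for one region after token filtering.
# `min_lgth` and `drop_word` are illustrative arguments mirroring the filters in the script.
region_tokens <- function(tokens, region, min_lgth = 3, drop_word = "study") {
  keep <- tokens$word != drop_word &
    tokens$lgth > min_lgth &
    tokens$Region == region
  # which() drops NA conditions, as dplyr::filter() does
  tokens[which(keep), c("RID", "word")]
}

# Toy stand-in for shs_unigram (invented rows, for illustration only).
toy_unigram <- data.frame(
  RID = c(1, 1, 2, 2, 3),
  word = c("study", "education", "law", "treaty", "rural"),
  lgth = c(5, 9, 3, 6, 5),
  Region = c("USA", "USA", "UK", "UK", "Europe"),
  stringsAsFactors = FALSE
)

# One subset per region, returned as a named list.
toy_regions <- c("USA", "UK", "Europe")
by_region <- lapply(setNames(toy_regions, toy_regions),
                    function(r) region_tokens(toy_unigram, r))
```

Keeping the subsets in a named list makes it harder for one region's filters to drift from the others when the preprocessing is revised.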
set.seed(321)
model10us <- BTM(usa, k = 10, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:12 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:12 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:12 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:12 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:13 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:13 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:13 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:13 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:14 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:14 Start Gibbs sampling iteration 901/1000
model15us <- BTM(usa, k = 15, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:14 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:15 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:15 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:15 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:16 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:16 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:16 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:17 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:17 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:17 Start Gibbs sampling iteration 901/1000
model20us <- BTM(usa, k = 20, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:18 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:18 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:19 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:19 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:19 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:20 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:20 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:21 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:21 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:22 Start Gibbs sampling iteration 901/1000
model25us <- BTM(usa, k = 25, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:22 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:23 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:23 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:24 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:24 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:25 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:25 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:26 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:26 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 901/1000
model5uk <- BTM(uk, k = 5, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 901/1000
model10uk <- BTM(uk, k = 10, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:27 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 901/1000
model15uk <- BTM(uk, k = 15, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 901/1000
model10eu <- BTM(eu, k = 10, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:28 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 901/1000
model15eu <- BTM(eu, k = 15, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:29 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:30 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:30 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:30 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:30 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:30 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:31 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:31 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:31 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:31 Start Gibbs sampling iteration 901/1000
model20eu <- BTM(eu, k = 20, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:31 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:32 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:32 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:32 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:32 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:33 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:33 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:33 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:33 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:33 Start Gibbs sampling iteration 901/1000
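In the chunk above, `set.seed(321)` is called once before all the regional fits, so the random state reaching each model depends on the fits that ran before it. The sketch below shows one possible alternative: re-seeding before every fit, so that any single model can be rerun in isolation. It assumes that the sampler draws from R's random number generator, as the script's own use of `set.seed()` implies. The helper `fit_over_k` and the stand-in fitter `dummy_fit` are hypothetical; the stand-in is there only so that the example runs without BTM installed.

```r
# Hypothetical helper: fit one model per value of k, re-seeding before each fit.
# `fit_fun` is any function(data, k) that returns a fitted model.
fit_over_k <- function(data, ks, fit_fun, seed = 321) {
  models <- lapply(ks, function(k) {
    set.seed(seed)  # same seed before every fit, independent of run order
    fit_fun(data, k)
  })
  setNames(models, paste0("k", ks))
}

# Stand-in fitter: records k and a single random draw instead of fitting a model.
dummy_fit <- function(data, k) list(k = k, draw = runif(1))

# The k = 20 result is the same whether or not other fits ran before it.
run1 <- fit_over_k(data = NULL, ks = c(10, 15, 20), fit_fun = dummy_fit)
run2 <- fit_over_k(data = NULL, ks = 20, fit_fun = dummy_fit)
```

With the real data, the fitter would wrap the call used in the script, for example `fit_over_k(usa, c(10, 15, 20, 25), function(d, k) BTM(d, k = k, beta = 0.01, iter = 1000, trace = 100))`. Note that this changes the random state relative to the original run, so the resulting topics would not match the output printed in this document.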
# Most frequent terms for each model
## U.S. theses models with 10, 20, and 25 topics
terms(model10us)
## [[1]]
## token probability
## 1 chinese 0.03615592
## 2 american 0.03389665
## 3 method 0.02335337
## 4 stories 0.02109409
## 5 basis 0.01808173
##
## [[2]]
## token probability
## 1 china 0.05946573
## 2 chinese 0.02801636
## 3 foreign 0.02405216
## 4 international 0.01929511
## 5 american 0.01850227
##
## [[3]]
## token probability
## 1 chinese 0.05854293
## 2 students 0.04042357
## 3 college 0.02021353
## 4 reading 0.01916818
## 5 university 0.01603214
##
## [[4]]
## token probability
## 1 chinese 0.03929508
## 2 reference 0.03764072
## 3 special 0.03267766
## 4 united 0.02812818
## 5 china 0.02564665
##
## [[5]]
## token probability
## 1 china 0.03343227
## 2 feudal 0.02407311
## 3 development 0.01805650
## 4 visual 0.01671948
## 5 kingship 0.01605097
##
## [[6]]
## token probability
## 1 china 0.06230025
## 2 education 0.05694646
## 3 chinese 0.02433702
## 4 program 0.01865876
## 5 educational 0.01736087
##
## [[7]]
## token probability
## 1 united 0.04545990
## 2 nations 0.04055451
## 3 china 0.02518428
## 4 chinese 0.02354915
## 5 international 0.02060591
##
## [[8]]
## token probability
## 1 china 0.02826449
## 2 economic 0.02562682
## 3 united 0.02487320
## 4 development 0.01997468
## 5 century 0.01771382
##
## [[9]]
## token probability
## 1 chinese 0.05317800
## 2 factors 0.02524894
## 3 learning 0.02148926
## 4 conference 0.01880378
## 5 analysis 0.01826668
##
## [[10]]
## token probability
## 1 china 0.03966740
## 2 chinese 0.02786303
## 3 philosophy 0.02408563
## 4 income 0.02314128
## 5 influence 0.01841954
terms(model20us)
## [[1]]
## token probability
## 1 china 0.04960921
## 2 chinese 0.04357648
## 3 relations 0.02882978
## 4 foreign 0.02815948
## 5 diplomatic 0.02614857
##
## [[2]]
## token probability
## 1 analysis 0.03482582
## 2 chinese 0.02638575
## 3 farm 0.02216572
## 4 attainment 0.02111071
## 5 goal 0.02111071
##
## [[3]]
## token probability
## 1 chinese 0.07694527
## 2 american 0.03527237
## 3 stories 0.03527237
## 4 comparative 0.02565555
## 5 foreign 0.02031287
##
## [[4]]
## token probability
## 1 selected 0.02619861
## 2 united 0.02560332
## 3 china 0.02500804
## 4 county 0.02500804
## 5 hsien 0.02500804
##
## [[5]]
## token probability
## 1 united 0.04412680
## 2 special 0.04242983
## 3 reference 0.03960155
## 4 administration 0.03111672
## 5 control 0.02828844
##
## [[6]]
## token probability
## 1 conference 0.04726590
## 2 reading 0.04070302
## 3 international 0.03939044
## 4 chinese 0.03151498
## 5 ports 0.03151498
##
## [[7]]
## token probability
## 1 income 0.05945520
## 2 individual 0.05299408
## 3 built 0.03619518
## 4 carolina 0.03619518
## 5 flexibility 0.03619518
##
## [[8]]
## token probability
## 1 theory 0.03411998
## 2 china 0.02946831
## 3 development 0.02869304
## 4 economic 0.02326609
## 5 countries 0.02171554
##
## [[9]]
## token probability
## 1 students 0.03831224
## 2 school 0.03422614
## 3 chinese 0.03065081
## 4 community 0.02707548
## 5 factors 0.02656472
##
## [[10]]
## token probability
## 1 nations 0.05618346
## 2 international 0.04339048
## 3 british 0.04060939
## 4 china 0.03615966
## 5 united 0.03449101
##
## [[11]]
## token probability
## 1 china 0.06079296
## 2 chinese 0.05755110
## 3 land 0.02675344
## 4 special 0.02189065
## 5 formation 0.01945926
##
## [[12]]
## token probability
## 1 political 0.05849726
## 2 studies 0.03120309
## 3 china 0.02730392
## 4 asia 0.02535434
## 5 central 0.02437954
##
## [[13]]
## token probability
## 1 education 0.06472831
## 2 china 0.05649166
## 3 adult 0.04354835
## 4 social 0.02825171
## 5 theory 0.02707505
##
## [[14]]
## token probability
## 1 chinese 0.06193396
## 2 economic 0.05883764
## 3 development 0.03484124
## 4 trade 0.02555230
## 5 application 0.02090784
##
## [[15]]
## token probability
## 1 china 0.02764983
## 2 chinese 0.02501777
## 3 paradise 0.02370173
## 4 school 0.02238570
## 5 activities 0.02106967
##
## [[16]]
## token probability
## 1 china 0.08358536
## 2 education 0.05556414
## 3 chinese 0.03817166
## 4 curriculum 0.02705980
## 5 program 0.02367793
##
## [[17]]
## token probability
## 1 china 0.08623988
## 2 education 0.06232563
## 3 united 0.04348409
## 4 relations 0.02101918
## 5 organization 0.01884516
##
## [[18]]
## token probability
## 1 learning 0.04082702
## 2 chinese 0.03674500
## 3 reference 0.02381859
## 4 development 0.01973657
## 5 psychology 0.01905624
##
## [[19]]
## token probability
## 1 philosophy 0.04990286
## 2 education 0.03610141
## 3 china 0.03132398
## 4 political 0.02760821
## 5 confucius 0.02548491
##
## [[20]]
## token probability
## 1 chinese 0.05086197
## 2 american 0.03560592
## 3 english 0.02967301
## 4 children 0.02628278
## 5 language 0.02458766
terms(model25us)
## [[1]]
## token probability
## 1 reference 0.05539808
## 2 special 0.04791253
## 3 china 0.04042698
## 4 united 0.03294142
## 5 economic 0.02345972
##
## [[2]]
## token probability
## 1 china 0.07144218
## 2 rural 0.03810567
## 3 chinese 0.02994163
## 4 hsien 0.02790062
## 5 united 0.02790062
##
## [[3]]
## token probability
## 1 china 0.04183630
## 2 england 0.03347090
## 3 administration 0.02789396
## 4 systems 0.02510550
## 5 organization 0.02324652
##
## [[4]]
## token probability
## 1 chinese 0.07711511
## 2 education 0.03588933
## 3 learning 0.03512589
## 4 reading 0.03283557
## 5 adult 0.03130869
##
## [[5]]
## token probability
## 1 chinese 0.06755160
## 2 china 0.03800701
## 3 formation 0.03378635
## 4 capital 0.03167602
## 5 prior 0.02956569
##
## [[6]]
## token probability
## 1 chinese 0.05573972
## 2 education 0.04002229
## 3 health 0.03716458
## 4 effects 0.03144915
## 5 modern 0.02573372
##
## [[7]]
## token probability
## 1 reference 0.05299017
## 2 special 0.05020170
## 3 english 0.04834272
## 4 chinese 0.04090681
## 5 american 0.02510550
##
## [[8]]
## token probability
## 1 nations 0.05453459
## 2 united 0.04293272
## 3 british 0.03365122
## 4 development 0.02959057
## 5 international 0.02611001
##
## [[9]]
## token probability
## 1 china 0.07719285
## 2 chinese 0.03524325
## 3 political 0.03076863
## 4 social 0.02853132
## 5 land 0.02461602
##
## [[10]]
## token probability
## 1 china 0.05406438
## 2 development 0.04760086
## 3 chinese 0.03761179
## 4 american 0.02174679
## 5 cultural 0.02057161
##
## [[11]]
## token probability
## 1 individual 0.05849426
## 2 income 0.03462392
## 3 north 0.03462392
## 4 built 0.03343041
## 5 carolina 0.03343041
##
## [[12]]
## token probability
## 1 chinese 0.05869632
## 2 college 0.04663708
## 3 students 0.04663708
## 4 university 0.02573441
## 5 comparative 0.02091071
##
## [[13]]
## token probability
## 1 china 0.04847091
## 2 local 0.03197369
## 3 united 0.02991153
## 4 central 0.02784938
## 5 relations 0.02578723
##
## [[14]]
## token probability
## 1 china 0.09039554
## 2 relations 0.06583420
## 3 foreign 0.04618513
## 4 diplomatic 0.04029041
## 5 asia 0.03341324
##
## [[15]]
## token probability
## 1 china 0.06063153
## 2 feudal 0.04961012
## 3 world 0.03445568
## 4 kingship 0.03170033
## 5 analysis 0.02205659
##
## [[16]]
## token probability
## 1 school 0.04613211
## 2 chinese 0.03881071
## 3 personal 0.03002504
## 4 community 0.02929290
## 5 children 0.02709648
##
## [[17]]
## token probability
## 1 effects 0.04388325
## 2 economic 0.03657181
## 3 international 0.03510953
## 4 control 0.03072266
## 5 attainment 0.02926037
##
## [[18]]
## token probability
## 1 studies 0.05649850
## 2 united 0.03307800
## 3 variety 0.03307800
## 4 american 0.02894498
## 5 aspects 0.02343427
##
## [[19]]
## token probability
## 1 chinese 0.04424594
## 2 county 0.03810238
## 3 supervision 0.03564495
## 4 application 0.03441624
## 5 basis 0.03318753
##
## [[20]]
## token probability
## 1 analysis 0.05674320
## 2 selected 0.02837906
## 3 business 0.02390052
## 4 cooperative 0.02390052
## 5 stores 0.02390052
##
## [[21]]
## token probability
## 1 railroad 0.05428631
## 2 american 0.03361073
## 3 united 0.02714961
## 4 fair 0.02585739
## 5 return 0.02585739
##
## [[22]]
## token probability
## 1 theory 0.04421236
## 2 development 0.04294951
## 3 influence 0.04042381
## 4 countries 0.03537241
## 5 industrialization 0.03537241
##
## [[23]]
## token probability
## 1 conference 0.07260775
## 2 international 0.03765494
## 3 ports 0.03496626
## 4 waterways 0.03227758
## 5 china 0.02690022
##
## [[24]]
## token probability
## 1 china 0.05466366
## 2 education 0.03426982
## 3 united 0.02937530
## 4 plan 0.02529653
## 5 control 0.02284927
##
## [[25]]
## token probability
## 1 education 0.10542486
## 2 china 0.06980996
## 3 chinese 0.04036831
## 4 program 0.03324532
## 5 curriculum 0.02849667
## UK theses models with 5 and 10 topics
terms(model5uk)
## [[1]]
## token probability
## 1 treaty 0.04974356
## 2 reference 0.04041859
## 3 special 0.04041859
## 4 china 0.03731026
## 5 british 0.03109361
##
## [[2]]
## token probability
## 1 special 0.05121918
## 2 asia 0.04313620
## 3 borrowing 0.04313620
## 4 reference 0.04313620
## 5 capital 0.03774754
##
## [[3]]
## token probability
## 1 george 0.06235715
## 2 china 0.04388780
## 3 historical 0.03465312
## 4 charles 0.02541845
## 5 dickens 0.02541845
##
## [[4]]
## token probability
## 1 political 0.06648579
## 2 london 0.04613715
## 3 science 0.04206742
## 4 english 0.03935427
## 5 chinese 0.03121481
##
## [[5]]
## token probability
## 1 international 0.06078770
## 2 soviet 0.04648752
## 3 london 0.04291247
## 4 relations 0.03814574
## 5 sino 0.03814574
terms(model10uk)
## [[1]]
## token probability
## 1 trade 0.06866824
## 2 british 0.04722282
## 3 government 0.04722282
## 4 local 0.04722282
## 5 china 0.04293373
##
## [[2]]
## token probability
## 1 chinese 0.07424330
## 2 china 0.05800765
## 3 relations 0.04873014
## 4 diplomatic 0.03713325
## 5 treaty 0.03713325
##
## [[3]]
## token probability
## 1 international 0.06768164
## 2 soviet 0.06599002
## 3 london 0.06260678
## 4 relations 0.05584031
## 5 sino 0.05414869
##
## [[4]]
## token probability
## 1 reference 0.10995782
## 2 special 0.10740125
## 3 asia 0.04093059
## 4 economic 0.03837403
## 5 capital 0.03581746
##
## [[5]]
## token probability
## 1 power 0.04035719
## 2 china 0.03459600
## 3 principles 0.03171540
## 4 structure 0.03171540
## 5 biological 0.02883480
##
## [[6]]
## token probability
## 1 government 0.12420928
## 2 borrowing 0.09464972
## 3 british 0.04735442
## 4 employed 0.04735442
## 5 methods 0.04735442
##
## [[7]]
## token probability
## 1 treaty 0.08361484
## 2 special 0.07316733
## 3 reference 0.06620233
## 4 parliaments 0.06271983
## 5 peace 0.06271983
##
## [[8]]
## token probability
## 1 political 0.08349025
## 2 london 0.06601961
## 3 science 0.06019606
## 4 comparative 0.04466660
## 5 english 0.04272542
##
## [[9]]
## token probability
## 1 george 0.10135851
## 2 historical 0.06912273
## 3 charles 0.05070228
## 4 dickens 0.05070228
## 5 eliot 0.05070228
##
## [[10]]
## token probability
## 1 china 0.08936645
## 2 dialect 0.07820262
## 3 tones 0.07820262
## 4 descriptive 0.05587497
## 5 chengtu 0.04471114
## Europe theses models with 10 and 20 topics
terms(model10eu)
## [[1]]
## token probability
## 1 france 0.02956595
## 2 power 0.02442628
## 3 sacred 0.02314137
## 4 china 0.01671678
## 5 blessed 0.01543186
##
## [[2]]
## token probability
## 1 china 0.06866391
## 2 labor 0.04291763
## 3 service 0.03576588
## 4 organization 0.03290518
## 5 international 0.03075966
##
## [[3]]
## token probability
## 1 china 0.09764387
## 2 chinese 0.05069201
## 3 system 0.03895404
## 4 trade 0.03842050
## 5 foreign 0.03735341
##
## [[4]]
## token probability
## 1 chinese 0.06692331
## 2 china 0.02205060
## 3 relation 0.01732716
## 4 orphan 0.01417820
## 5 phonetic 0.01417820
##
## [[5]]
## token probability
## 1 china 0.08832970
## 2 chinese 0.06004295
## 3 modern 0.03060164
## 4 government 0.02482884
## 5 relationship 0.02078787
##
## [[6]]
## token probability
## 1 chinese 0.05436131
## 2 china 0.05129913
## 3 east 0.02603616
## 4 manchuria 0.02220844
## 5 international 0.02067735
##
## [[7]]
## token probability
## 1 chinese 0.04516429
## 2 doctrine 0.02608347
## 3 political 0.02544745
## 4 school 0.02163128
## 5 china 0.02099526
##
## [[8]]
## token probability
## 1 chinese 0.05043972
## 2 marriage 0.03009042
## 3 civil 0.02832092
## 4 succession 0.02743617
## 5 code 0.02389716
##
## [[9]]
## token probability
## 1 china 0.10834750
## 2 chinese 0.02891428
## 3 historical 0.02307360
## 4 relations 0.02102936
## 5 foreign 0.01986123
##
## [[10]]
## token probability
## 1 chinese 0.05225227
## 2 obligation 0.03681761
## 3 critical 0.03206848
## 4 moral 0.02731936
## 5 history 0.02375751
terms(model20eu)
## [[1]]
## token probability
## 1 china 0.14846765
## 2 regime 0.04455184
## 3 organization 0.03795401
## 4 education 0.03630456
## 5 public 0.03630456
##
## [[2]]
## token probability
## 1 nations 0.06205839
## 2 china 0.04310137
## 3 united 0.03793127
## 4 manchuria 0.03620791
## 5 league 0.03103781
##
## [[3]]
## token probability
## 1 china 0.09289948
## 2 france 0.04866865
## 3 comparative 0.03392504
## 4 labor 0.02507888
## 5 teaching 0.02507888
##
## [[4]]
## token probability
## 1 obligation 0.06306916
## 2 moral 0.04905898
## 3 role 0.03971886
## 4 basis 0.02570868
## 5 charles 0.02570868
##
## [[5]]
## token probability
## 1 china 0.13346553
## 2 chinese 0.05203354
## 3 relations 0.03770754
## 4 powers 0.03242954
## 5 evolution 0.02790554
##
## [[6]]
## token probability
## 1 power 0.03703780
## 2 chinese 0.03086826
## 3 dynasty 0.02881175
## 4 liturgy 0.02881175
## 5 qing 0.02675523
##
## [[7]]
## token probability
## 1 china 0.09596712
## 2 government 0.06159597
## 3 relationship 0.03295334
## 4 modern 0.03008908
## 5 relations 0.02865695
##
## [[8]]
## token probability
## 1 china 0.07744694
## 2 chinese 0.05281033
## 3 ancient 0.04401154
## 4 power 0.03697251
## 5 balance 0.03169324
##
## [[9]]
## token probability
## 1 chinese 0.07946436
## 2 system 0.06212984
## 3 china 0.04768440
## 4 social 0.04768440
## 5 contribution 0.02746078
##
## [[10]]
## token probability
## 1 labor 0.06872255
## 2 service 0.05794466
## 3 german 0.03773610
## 4 political 0.03773610
## 5 life 0.02965268
##
## [[11]]
## token probability
## 1 china 0.08141824
## 2 critical 0.07195320
## 3 legal 0.04355810
## 4 historical 0.03030705
## 5 commercial 0.02841404
##
## [[12]]
## token probability
## 1 chinese 0.05989970
## 2 legal 0.05830259
## 3 china 0.05031703
## 4 historical 0.03594302
## 5 status 0.03274879
##
## [[13]]
## token probability
## 1 china 0.10056870
## 2 chinese 0.05606672
## 3 foreign 0.05433287
## 4 trade 0.05144314
## 5 policy 0.03179291
##
## [[14]]
## token probability
## 1 chinese 0.06381977
## 2 character 0.03343664
## 3 succession 0.03039832
## 4 worship 0.03039832
## 5 family 0.02584085
##
## [[15]]
## token probability
## 1 chinese 0.06758844
## 2 china 0.06115247
## 3 modern 0.03648124
## 4 doctrine 0.03219059
## 5 court 0.02897260
##
## [[16]]
## token probability
## 1 chinese 0.04382192
## 2 classical 0.04183092
## 3 studies 0.04183092
## 4 orphan 0.03585792
## 5 history 0.03187592
##
## [[17]]
## token probability
## 1 chinese 0.09881440
## 2 civil 0.04360271
## 3 china 0.03924389
## 4 code 0.03197919
## 5 development 0.02907332
##
## [[18]]
## token probability
## 1 china 0.04728346
## 2 chinese 0.04203071
## 3 political 0.02714793
## 4 ideas 0.02627248
## 5 economic 0.02452156
##
## [[19]]
## token probability
## 1 chinese 0.04191482
## 2 china 0.03443271
## 3 social 0.02994344
## 4 phonetic 0.02695059
## 5 czechoslovakia 0.02395774
##
## [[20]]
## token probability
## 1 chinese 0.04796592
## 2 society 0.03505698
## 3 christian 0.03321285
## 4 german 0.03136872
## 5 pope 0.02768045
# Plot the topics
plot(model10us, top_n = 8,
title = "Dissertation titles of Chinese PhDs in the USA (1905-1962)",
subtitle = "Biterm topic model with 10 topics")
plot(model15us, top_n = 10,
title = "Dissertation titles of Chinese PhDs in the USA (1905-1962)",
subtitle = "Biterm topic model with 15 topics")
plot(model20us, top_n = 8,
title = "Dissertation titles of Chinese PhDs in the USA (1905-1962)",
subtitle = "Biterm topic model with 20 topics")
plot(model25us, top_n = 10,
title = "Dissertation titles of Chinese PhDs in the USA (1905-1962)",
subtitle = "Biterm topic model with 25 topics")
plot(model5uk, top_n = 10,
title = "Dissertation titles of Chinese PhDs in the UK (1905-1962)",
subtitle = "Biterm topic model with 5 topics")
plot(model10uk, top_n = 8,
title = "Dissertation titles of Chinese PhDs in the UK (1905-1962)",
subtitle = "Biterm topic model with 10 topics")
plot(model15uk, top_n = 8,
title = "Dissertation titles of Chinese PhDs in the UK (1905-1962)",
subtitle = "Biterm topic model with 15 topics")
plot(model10eu, top_n = 10,
title = "Dissertation titles of Chinese PhDs in Europe (1905-1962)",
subtitle = "Biterm topic model with 10 topics")
plot(model15eu, top_n = 10,
title = "Dissertation titles of Chinese PhDs in Europe (1905-1962)",
subtitle = "Biterm topic model with 15 topics")
plot(model20eu, top_n = 10,
title = "Dissertation titles of Chinese PhDs in Europe (1905-1962)",
subtitle = "Biterm topic model with 20 topics")
p1 <- shs_unigram %>% filter(!word == "study") %>%
filter(lgth >3 ) %>% filter(Degree_Period == "1905-1926") %>%
select(RID, word)
p2 <- shs_unigram %>% filter(!word == "study") %>%
filter(lgth >3 ) %>% filter(Degree_Period == "1927-1937") %>%
select(RID, word)
p3 <- shs_unigram %>% filter(!word == "study") %>%
filter(lgth >3 ) %>% filter(Degree_Period == "1938-1951") %>%
select(RID, word)
p4 <- shs_unigram %>% filter(!word == "study") %>%
filter(lgth >3 ) %>% filter(Degree_Period == "1952-1962") %>%
select(RID, word)
model10p1 <- BTM(p1, k = 10, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 901/1000
model10p2 <- BTM(p2, k = 10, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:41 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:42 Start Gibbs sampling iteration 901/1000
model10p3 <- BTM(p3, k = 10, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:43 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 901/1000
model10p4 <- BTM(p4, k = 10, beta = 0.01, iter = 1000, trace = 100)
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 1/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 101/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 201/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 301/1000
## 2025-05-19 16:11:44 Start Gibbs sampling iteration 401/1000
## 2025-05-19 16:11:45 Start Gibbs sampling iteration 501/1000
## 2025-05-19 16:11:45 Start Gibbs sampling iteration 601/1000
## 2025-05-19 16:11:45 Start Gibbs sampling iteration 701/1000
## 2025-05-19 16:11:45 Start Gibbs sampling iteration 801/1000
## 2025-05-19 16:11:45 Start Gibbs sampling iteration 901/1000
# Most frequent terms for each period
# 1905-1926
terms(model10p1, top_n = 10)
## [[1]]
## token probability
## 1 education 0.07934297
## 2 reference 0.06579991
## 3 special 0.06386519
## 4 china 0.05612630
## 5 national 0.04258324
## 6 chinese 0.04064852
## 7 life 0.03871380
## 8 view 0.03484435
## 9 criteria 0.03290963
## 10 construetion 0.02710546
##
## [[2]]
## token probability
## 1 political 0.06721447
## 2 comparative 0.04033636
## 3 education 0.04033636
## 4 philosophy 0.03649663
## 5 chinese 0.03457677
## 6 social 0.03457677
## 7 confucius 0.02881717
## 8 interpretation 0.02881717
## 9 united 0.02881717
## 10 ancient 0.02497744
##
## [[3]]
## token probability
## 1 china 0.12115507
## 2 relations 0.04585264
## 3 commercial 0.03111955
## 4 essay 0.03111955
## 5 france 0.03111955
## 6 modern 0.02784553
## 7 political 0.02620852
## 8 historical 0.02293450
## 9 industrial 0.02293450
## 10 religious 0.02293450
##
## [[4]]
## token probability
## 1 british 0.07220492
## 2 government 0.04333120
## 3 labor 0.04126879
## 4 history 0.03920639
## 5 reference 0.03508157
## 6 special 0.03508157
## 7 chinese 0.03301916
## 8 american 0.03095675
## 9 china 0.02476953
## 10 french 0.02476953
##
## [[5]]
## token probability
## 1 chinese 0.10274479
## 2 immigration 0.03786473
## 3 united 0.03065583
## 4 china 0.02885361
## 5 evolution 0.02885361
## 6 rural 0.02705138
## 7 empire 0.02164471
## 8 school 0.02164471
## 9 constitution 0.01984249
## 10 administration 0.01623804
##
## [[6]]
## token probability
## 1 germany 0.03993813
## 2 comparative 0.03245441
## 3 east 0.03245441
## 4 italy 0.03245441
## 5 taxation 0.02995984
## 6 life 0.02746526
## 7 advent 0.02497069
## 8 close 0.02497069
## 9 decrease 0.02497069
## 10 eastern 0.02497069
##
## [[7]]
## token probability
## 1 china 0.08354048
## 2 central 0.05768887
## 3 government 0.05371169
## 4 local 0.04774594
## 5 provincial 0.04575735
## 6 governments 0.03979160
## 7 regulation 0.03780301
## 8 modern 0.03581443
## 9 public 0.02984867
## 10 essay 0.02587150
##
## [[8]]
## token probability
## 1 china 0.15661796
## 2 political 0.04433643
## 3 history 0.03103993
## 4 constitutional 0.02660777
## 5 relations 0.02660777
## 6 financial 0.02217560
## 7 foreign 0.02217560
## 8 reform 0.02217560
## 9 economic 0.02069821
## 10 powers 0.02069821
##
## [[9]]
## token probability
## 1 basis 0.05953716
## 2 county 0.05953716
## 3 selected 0.05953716
## 4 supervision 0.05953716
## 5 reading 0.04580310
## 6 chinese 0.03435805
## 7 agents 0.02978003
## 8 application 0.02978003
## 9 elementary 0.02978003
## 10 instruetion 0.02978003
##
## [[10]]
## token probability
## 1 chinese 0.11407323
## 2 learning 0.08318483
## 3 psychology 0.06180056
## 4 factors 0.04754437
## 5 characters 0.04516834
## 6 experimental 0.04279231
## 7 analysis 0.04041628
## 8 means 0.03566422
## 9 preliminary 0.03091216
## 10 process 0.03091216
# 1927-1937
terms(model10p2, top_n = 10)
## [[1]]
## token probability
## 1 china 0.04899251
## 2 chinese 0.04387466
## 3 rural 0.04095017
## 4 english 0.02194098
## 5 dynasty 0.02120986
## 6 church 0.02047874
## 7 school 0.01974762
## 8 system 0.01974762
## 9 united 0.01974762
## 10 american 0.01682313
##
## [[2]]
## token probability
## 1 china 0.06838861
## 2 education 0.03989533
## 3 chinese 0.03182224
## 4 reference 0.02897291
## 5 organization 0.02707336
## 6 special 0.02564870
## 7 social 0.01757560
## 8 control 0.01710071
## 9 legal 0.01567605
## 10 united 0.01567605
##
## [[3]]
## token probability
## 1 century 0.03001434
## 2 reference 0.02834734
## 3 political 0.02668034
## 4 economic 0.02584684
## 5 china 0.02334634
## 6 chinese 0.02001234
## 7 england 0.01834534
## 8 sovereignty 0.01751184
## 9 france 0.01584484
## 10 special 0.01584484
##
## [[4]]
## token probability
## 1 conference 0.03242113
## 2 nations 0.02470365
## 3 basis 0.02393190
## 4 international 0.02084491
## 5 league 0.02084491
## 6 british 0.01930141
## 7 ports 0.01852967
## 8 waterways 0.01852967
## 9 york 0.01775792
## 10 moral 0.01698617
##
## [[5]]
## token probability
## 1 chinese 0.07743780
## 2 china 0.03743160
## 3 school 0.02710742
## 4 succession 0.01936429
## 5 century 0.01871903
## 6 stories 0.01807377
## 7 civil 0.01742851
## 8 literature 0.01742851
## 9 children 0.01678324
## 10 code 0.01549272
##
## [[6]]
## token probability
## 1 china 0.08837499
## 2 chinese 0.05128084
## 3 foreign 0.03018809
## 4 economic 0.02727874
## 5 trade 0.02291473
## 6 international 0.01927805
## 7 evolution 0.01491403
## 8 policy 0.01491403
## 9 development 0.01455036
## 10 historical 0.01455036
##
## [[7]]
## token probability
## 1 comparative 0.04007703
## 2 george 0.04007703
## 3 historical 0.03232268
## 4 charles 0.02844551
## 5 china 0.02586073
## 6 education 0.02456834
## 7 france 0.02456834
## 8 chinese 0.02198356
## 9 teaching 0.01810639
## 10 england 0.01552161
##
## [[8]]
## token probability
## 1 service 0.04323503
## 2 united 0.03459050
## 3 american 0.03335556
## 4 role 0.02100622
## 5 chinese 0.01977129
## 6 effects 0.01853636
## 7 motor 0.01853636
## 8 freight 0.01730142
## 9 doctrine 0.01606649
## 10 pupil 0.01483156
##
## [[9]]
## token probability
## 1 china 0.09384180
## 2 relations 0.02681610
## 3 diplomatic 0.02215345
## 4 international 0.02098778
## 5 century 0.01807362
## 6 sino 0.01749079
## 7 treaties 0.01749079
## 8 foreign 0.01690796
## 9 asia 0.01632513
## 10 japanese 0.01574229
##
## [[10]]
## token probability
## 1 chinese 0.04028166
## 2 china 0.03289792
## 3 based 0.02148668
## 4 relation 0.01947294
## 5 special 0.01880169
## 6 educational 0.01745919
## 7 credit 0.01477419
## 8 experimental 0.01477419
## 9 associations 0.01410294
## 10 education 0.01276044
# 1938-1951
terms(model10p3, top_n = 10)
# 1952-1962
terms(model10p4, top_n = 10)
# Plot the topics
plot(model10p1, top_n = 10,
title = "Dissertation titles of Chinese PhDs abroad (1905-1926)",
subtitle = "Biterm topic model with 10 topics")
plot(model10p2, top_n = 8,
title = "Dissertation titles of Chinese PhDs abroad (1927-1937)",
subtitle = "Biterm topic model with 10 topics")
plot(model10p3, top_n = 10,
title = "Dissertation titles of Chinese PhDs abroad (1938-1951)",
subtitle = "Biterm topic model with 10 topics")
plot(model10p4, top_n = 10,
title = "Dissertation titles of Chinese PhDs abroad (1952-1962)",
subtitle = "Biterm topic model with 10 topics")
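A biterm topic model is estimated from unordered pairs of words ("biterms") that co-occur in the same short text, which is why it suits short documents such as dissertation titles. The sketch below, in base R, enumerates the biterms of one made-up title. The title and the helper `make_biterms()` are illustrative assumptions and are not part of the `BTM` package, which constructs biterms internally.

```r
# Hypothetical helper: list all unordered word pairs (biterms) in one short text.
make_biterms <- function(tokens) {
  if (length(tokens) < 2) {
    return(data.frame(term1 = character(0), term2 = character(0)))
  }
  pairs <- t(combn(tokens, 2))  # one row per unordered pair
  data.frame(term1 = pairs[, 1], term2 = pairs[, 2])
}

# Made-up title, already lower-cased and stripped of short words
toy_title <- c("rural", "education", "china")
toy_biterms <- make_biterms(toy_title)
print(toy_biterms)
```

A title of three words yields three biterms. Because every pair within a title counts as a co-occurrence, even very short titles contribute usable information to the model.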
Extract named entities (persons, locations/GPE, organizations, events) from titles to identify China-focused theses:
library(histtext)
phd_all_master_ner <- phd_all_master %>% drop_na(Thesis)
phd_all_master_ner <- ner_on_df(phd_all_master_ner, text_column = "Thesis", id_column = "RID")
A total of 2885 named entities were found, distributed as
follows:
phd_all_master_ner %>% group_by(Type) %>% count(sort = TRUE)
The named entities were curated and manually coded using a
combination of R (tidyverse) and Excel to classify the entities and
identify those related to China and Chinese issues. The resulting
dataset was used for statistical analyses of individual entity types and
served as a basis for analyzing China-centered theses.
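The manual coding described above cannot be reproduced from code alone, but a keyword-based first pass can help prioritize entities for review. The following base R sketch is only an illustration under assumptions: the toy entities, the column names `Text` and `Type`, and the keyword list are hypothetical and do not come from the curated files.

```r
# Toy entities standing in for NER output (hypothetical values and column names)
toy_entities <- data.frame(
  RID  = c(1, 2, 3, 4),
  Text = c("Manchuria", "Charles Dickens", "League of Nations", "Chengtu"),
  Type = c("LOC", "PERS", "ORG", "LOC")
)

# Hypothetical keyword list; a real list would be longer and hand-curated
china_keywords <- c("china", "chinese", "manchuria", "chengtu", "sino")
pattern <- paste(china_keywords, collapse = "|")

# Flag entities whose text matches any keyword, ignoring case
toy_entities$china_candidate <- grepl(pattern, toy_entities$Text, ignore.case = TRUE)
print(toy_entities)
```

Such a flag only marks candidates. Entities that relate to China without containing an obvious keyword would still require manual review.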
# load the cleaned list of locations/GPE
library(readr)
phd_loc_clean_meta <- read_delim("Data/phd_loc_clean_meta.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
phd_loc_clean_meta %>% group_by(Place) %>% count(sort = TRUE)
# load the cleaned list of events
phd_event_clean_meta <- read_delim("Data/phd_event_clean_meta.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
phd_event_clean_meta %>% group_by(EventType) %>% count(sort = TRUE)
phd_event_clean_meta %>% group_by(Event2) %>% count(sort = TRUE)
# load the cleaned list of organizations
phd_org_clean_meta <- read_delim("Data/phd_org_clean_meta.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
phd_org_clean_meta %>% group_by(Text_clean2) %>% count(sort = TRUE)
phd_org_clean_meta %>% group_by(type2) %>% count(sort = TRUE)
phd_org_clean_meta %>% group_by(country) %>% count(sort = TRUE)
# load the cleaned list of persons
phd_pers_clean_meta <- read_delim("Data/phd_pers_clean_meta.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
phd_pers_clean_meta %>% group_by(Text_clean2) %>% count(sort = TRUE)
# load the compiled list of China-centered theses
china_theses_all <- read_delim("Data/china_theses_all.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
names(china_theses_all)
## [1] "RID" "Region" "Country" "Discipline" "Degree_Year"
## [6] "Thesis" "Title_Src" "totalDisc" "Field" "totalField"
The dataset of China-centered theses includes the following
information: a unique dissertation identifier (RID); region and country
of graduation (Region, Country); discipline and field of study
(Discipline, Field); year of graduation (Degree_Year); dissertation
title in both its original language and translated form (Title_Src,
Thesis); and the total number of dissertations within each discipline
and field (totalDisc, totalField), which is used for statistical
computations.
china_theses_all %>% group_by(Field) %>% count(sort = TRUE)
china_theses_all %>% group_by(Discipline) %>% count(sort = TRUE)
# Proportion
china_theses_fields <- china_theses_all %>% group_by(Field, totalField) %>% count()
china_theses_fields <- china_theses_fields %>% mutate(percent = round(n/totalField*100, 2))
china_theses_fields %>% arrange(desc(n))
china_theses_disciplines <- china_theses_all %>% group_by(Discipline, totalDisc) %>% count()
china_theses_disciplines <- china_theses_disciplines %>% mutate(percent = round(n/totalDisc*100, 2))
china_theses_disciplines %>% arrange(desc(n))
china_theses_disciplines %>% arrange(desc(percent))
Of the 1,622 dissertations in the humanities and social sciences,
849 (52.3%) focused on China. In absolute numbers, the disciplines with
the most China-focused dissertations are economics, education, and
international law and relations; in proportion to their size, history,
sociology, and anthropology are the most concerned with China.
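The headline share can be checked directly from the two counts reported above. This is a small worked example in R, and the variable names are illustrative.

```r
# Counts reported in the text
china_hss <- 849    # China-focused dissertations in the humanities and social sciences
total_hss <- 1622   # all dissertations in the humanities and social sciences

# Same formula as the percent column computed earlier, rounded to one decimal
share_hss <- round(china_hss / total_hss * 100, 1)
print(share_hss)
```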
library(hrbrthemes)
library(viridis)
# All Disciplines
china_theses_all %>%
ggplot(aes(x = Degree_Year)) +
geom_histogram(binwidth = 1, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
theme_ipsum() +
theme(
plot.title = element_text(size = 15)
) +
labs(title = "China-focused theses",
x = "Year",
y = "Frequency",
caption = "Source: Yuan, T’ung-li")
# Humanities and Social Sciences
china_theses_all %>%
filter(Field == "Humanities and social sciences") %>%
ggplot(aes(x = Degree_Year)) +
geom_histogram(binwidth = 1, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
theme_ipsum() +
theme(
plot.title = element_text(size = 15)
) +
labs(title = "China-focused theses (Humanities & Social Sciences)",
x = "Year",
y = "Frequency",
caption = "Source: Yuan, T’ung-li")
# Comparative Trend
shs_all <- phd_all_master %>% filter(Field == "Humanities and social sciences") %>% group_by(Degree_Year) %>% count() %>% rename(TotalTheses = n)
shs_china <- china_theses_all %>% filter(Field == "Humanities and social sciences") %>% group_by(Degree_Year) %>% count() %>% rename(ChinaTheses = n)
theses_year <- left_join(shs_all, shs_china, by = "Degree_Year")
theses_year <- theses_year %>%
mutate(ChinaTheses = ifelse(is.na(ChinaTheses), 0, ChinaTheses))
theses_year <- theses_year %>% drop_na(Degree_Year)
theses_year <- theses_year %>% mutate(ratio = round(ChinaTheses/TotalTheses, 2))
ggplot(theses_year, aes(x = Degree_Year)) +
geom_line(aes(y = ChinaTheses, color = "ChinaTheses"), size = 1) +
geom_line(aes(y = TotalTheses, color = "TotalTheses"), size = 1) +
labs(title = "Comparative Trend (Humanities & Social Sciences Only)",
x = "Year",
y = "Frequency") +
scale_color_manual(name = "Legend",
values = c("ChinaTheses" = "red", "TotalTheses" = "steelblue"),
labels = c("ChinaTheses" = "China-centered", "TotalTheses" = "All Theses (HSS)")) +
theme_minimal()
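The yearly comparison relies on a left join followed by zero-filling the years in which no China-centered thesis was recorded. A minimal base R sketch of the same logic follows. The counts are invented for illustration, and `merge()` stands in for `left_join()`.

```r
# Invented yearly counts, for illustration only
toy_all   <- data.frame(Degree_Year = c(1920, 1921, 1922), TotalTheses = c(4, 5, 2))
toy_china <- data.frame(Degree_Year = c(1920, 1922), ChinaTheses = c(3, 1))

# Keep every year from toy_all (all.x = TRUE is the base R analogue of a left join)
toy_year <- merge(toy_all, toy_china, by = "Degree_Year", all.x = TRUE)

# Years without a China-centered thesis come back as NA; treat them as zero
toy_year$ChinaTheses[is.na(toy_year$ChinaTheses)] <- 0

# Share of China-centered theses per year
toy_year$ratio <- round(toy_year$ChinaTheses / toy_year$TotalTheses, 2)
print(toy_year)
```

Without the zero-fill step, the ratio for a year with no China-centered thesis would be `NA` and the plotted line would show a gap.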
This first document has focused on the doctoral dissertations themselves, remaining within the confines of a single data source: the original dissertation catalogs compiled by Yuan Tongli. In the next documents, we will shift our attention to the authors of these dissertations: who the Chinese PhDs were, what their backgrounds were, and what trajectories they followed after graduation. To do so, we draw on external datasets, including the Chinese University Students’ Datasets (CUSD-OS) compiled by the Lee-Campbell Research Group, as well as biographical entries from web-based knowledge platforms such as Wikipedia and Baidu.
Yuan, T’ung-li. A Guide to Doctoral Dissertations by Chinese Students in America, 1905–1960. Washington, D.C.: Published under the auspices of the Sino-American Cultural Society, 1961.
Yuan, T’ung-li. Doctoral Dissertations by Chinese Students in Great Britain and Northern Ireland, 1916–1961. Taipei: Chinese Cultural Research Institute, 1963.
Yuan, T’ung-li. A Guide to Doctoral Dissertations by Chinese Students in Continental Europe, 1907–1962. Taipei: Chinese Culture Quarterly Review, 1964.