Abstract
This is the third installment of our tutorial series devoted to a place-based study of American University Men in China. In this tutorial, we rely on community detection (Louvain algorithm) to detect, visualize and analyze subgroups of alumni within place-based networks.
In the previous tutorial, we learnt how to build and analyze the network of places and its transposed network of colleges using various centrality measures. In this new installment, we will see how we can detect communities of more densely connected places/colleges within the two networks.
The purpose is twofold:
Before we proceed to community detection, let’s recapitulate the successive operations for building place-based networks.
Summary of previous steps
# load packages
library(Places)
library(tidyverse)
library(igraph)
# load original data
aucplaces <- read_delim("Data/aucdata.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
aucplaces <- as.data.frame(aucplaces)
# retrieve places
Result1 <- places(aucplaces, "Name_eng", "University")
result1df <- as.data.frame(Result1$PlacesData)
# load annotated data (places manually labeled with qualitative attributes)
place_attributes <- read_csv("Data/place_attributes.csv",
col_types = cols(...1 = col_skip())) # manually labeled places
# load university data (region)
univ_region <- read_delim("Data/univ_region.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
# create network of places
bimod<-table(Result1$Edgelist$Places, Result1$Edgelist$Set)
PlacesMat<-bimod %*% t(bimod)
diag(PlacesMat)<-0
Pla1Net<-graph_from_adjacency_matrix(PlacesMat, mode="undirected", weighted = TRUE)
# create network of colleges
bimod2<-table(Result1$Edgelist$Set, Result1$Edgelist$Places)
PlacesMat2<-bimod2 %*% t(bimod2)
diag(PlacesMat2)<-0
Pla2Net<-graph_from_adjacency_matrix(PlacesMat2, mode="undirected", weighted = TRUE)
# extract main components
Pla1NetMC <- induced_subgraph(Pla1Net, vids = components(Pla1Net)$membership == 1)
Pla2NetMC <- induced_subgraph(Pla2Net, vids = components(Pla2Net)$membership == 3)
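To make the projection step more concrete, here is a minimal sketch on a toy edge list. The place and set labels (`P1`, `A`, etc.) and the `toy_*` names are invented for illustration and are not taken from the AUC data. The logic mirrors the `table()` and matrix-product steps used above: two places are tied when they share at least one set, and the edge weight counts the sets they share.

```r
library(igraph)

# Hypothetical toy edge list: each row links a place to a set (e.g. a college)
toy_edges <- data.frame(
  Places = c("P1", "P1", "P2", "P2", "P3"),
  Set    = c("A",  "B",  "B",  "C",  "C")
)

# Two-mode incidence matrix: places in rows, sets in columns
toy_bimod <- table(toy_edges$Places, toy_edges$Set)

# One-mode projection: entry [i, j] counts the sets shared by places i and j
toy_mat <- toy_bimod %*% t(toy_bimod)
diag(toy_mat) <- 0  # drop self-ties

toy_net <- graph_from_adjacency_matrix(toy_mat, mode = "undirected", weighted = TRUE)
print(toy_mat)
```

In this toy case, P1 and P2 are tied through set B, P2 and P3 through set C, and P1 and P3 share nothing, so the projected network is a simple chain.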
We proceed in four steps:
We tested four clustering methods that are available in igraph: Louvain (lvc), spin glass (sg), fast greedy (fg), and edge betweenness (Girvan-Newman) (eb). The first two methods are non-hierarchical, whereas the last two are hierarchical clustering methods. The results are summarized in the table below:
Code | Method | Network | ModScore | NbrClusters | MinSize | MaxSize |
---|---|---|---|---|---|---|
lvc1 | Louvain | Places | 0.52 | 7 | 10 | 46 |
lvc2 | Louvain | Colleges | 0.49 | 8 | 6 | 21 |
sg1 | Spin glass | Places | 0.52 | 8 | 10 | 36 |
sg2 | Spin glass | Colleges | 0.48 | 11 | 3 | 21 |
fg1 | Fast greedy | Places | 0.49 | 5 | 31 | 41 |
fg2 | Fast greedy | Colleges | 0.49 | 8 | 3 | 21 |
eb1 | Girvan-Newman (edge betweenness) | Places | 0.48 | 43 | 1 | 33 |
eb2 | Girvan-Newman (edge betweenness) | Colleges | 0.46 | 13 | 2 | 20 |
The following lines of code were used to cluster the networks
using the various methods and evaluate their respective performance:
# Louvain
## detect communities
lvc1 <- cluster_louvain(Pla1NetMC)
lvc2 <- cluster_louvain(Pla2NetMC)
## inspect results
print(lvc1) # 7 groups, modularity score (mod): 0.52
print(lvc2) # 8 groups, modularity score (mod): 0.49
# communities sizes
sizes(lvc1) # from 10 to 46
sizes(lvc2) # from 6 to 21
# Spin glass
## detect communities
sg1 <- cluster_spinglass(Pla1NetMC)
sg2 <- cluster_spinglass(Pla2NetMC)
## inspect results
print(sg1) # 8 groups, mod: 0.52
print(sg2) # 11 groups, mod: 0.48 # larger number of smaller groups
sizes(sg1) # size of clusters ranges from 10 to 36 nodes
sizes(sg2) # size of clusters ranges from 3 to 21 nodes
# Fast greedy
## detect communities
fg1 <- cluster_fast_greedy(Pla1NetMC)
fg2 <- cluster_fast_greedy(Pla2NetMC)
## inspect results
print(fg1) # 5 groups, mod: 0.49 # smaller number of larger groups (less skewed, more egalitarian)
print(fg2) # 8 groups, mod: 0.49 # larger number of smaller groups (more skewed, less egalitarian)
sizes(fg1) # size of clusters : ranges from 31 to 41 nodes
sizes(fg2) # size of clusters : ranges from 3 to 21 nodes
# Girvan-Newman (Edge Betweenness)
## detect communities
eb1 <- cluster_edge_betweenness(Pla1NetMC)
eb2 <- cluster_edge_betweenness(Pla2NetMC)
## inspect results
print(eb1) # 43, mod: 0.48 # larger number of communities (more dispersed)
print(eb2) # 13, mod: 0.46 # larger number of smaller groups
sizes(eb1) # size of clusters : ranges from 1 to 33 nodes
sizes(eb2) # size of clusters : ranges from 2 to 20 nodes
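The summary table can also be assembled programmatically rather than by reading each printout. The sketch below is only an illustration: the helper name `summarize_clusterings` is hypothetical, and it runs on Zachary's karate club network (shipped with igraph) as a stand-in for the AUC networks. In the tutorial's setting, the same helper could presumably be applied to a named list such as `list(lvc1 = lvc1, sg1 = sg1, fg1 = fg1, eb1 = eb1)`.

```r
library(igraph)

# Hypothetical helper: summarize a named list of igraph communities objects
summarize_clusterings <- function(comms) {
  data.frame(
    Code        = names(comms),
    ModScore    = round(sapply(comms, modularity), 2),
    NbrClusters = sapply(comms, length),
    MinSize     = sapply(comms, function(x) min(sizes(x))),
    MaxSize     = sapply(comms, function(x) max(sizes(x))),
    row.names   = NULL
  )
}

# Stand-in data: Zachary's karate club network (34 nodes), not the AUC data
g <- make_graph("Zachary")
set.seed(42)  # Louvain is randomized; fixing the seed makes the run repeatable
toy <- list(
  lvc = cluster_louvain(g),
  fg  = cluster_fast_greedy(g)
)

toy_summary <- summarize_clusterings(toy)
print(toy_summary)
```

Because Louvain and spin glass are randomized, scores and cluster counts may vary slightly from one run to the next unless a seed is set.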
In the next steps, we will rely on the Louvain
algorithm because it reaches the highest modularity scores on both
networks (tied with spin glass on the network of places and with fast
greedy on the network of colleges) and it provides the most meaningful
results from a humanistic perspective.
Modularity measures how well a network has been partitioned into communities. It compares the density of ties within each community with the density that would be expected if ties were distributed at random (or according to another suitably defined baseline). The higher the score, the more densely connected the nodes within a community are, compared to how connected they would be in a random network.
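As an illustration of this definition, the following sketch computes modularity by hand-picked partitions on a toy graph (two triangles joined by one bridge). The graph and the two membership vectors are invented for the example. For an undirected graph with m edges, each community contributes (internal edges / m) minus (sum of member degrees / 2m) squared.

```r
library(igraph)

# Toy graph: two triangles (1-2-3 and 4-5-6) joined by a single bridge (3-4)
g <- make_graph(c(1,2, 2,3, 1,3, 4,5, 5,6, 4,6, 3,4), directed = FALSE)

good_split <- c(1, 1, 1, 2, 2, 2)  # one community per triangle
bad_split  <- c(1, 2, 1, 2, 1, 2)  # arbitrary split cutting across the triangles

q_good <- modularity(g, good_split)
q_bad  <- modularity(g, bad_split)

# By hand for the good split: 7 edges in total, 3 internal edges and
# a degree sum of 7 in each community, so Q = 2 * (3/7 - (7/14)^2) = 5/14
c(good = q_good, bad = q_bad, by_hand = 5/14)
```

The partition that follows the triangles scores about 0.36, whereas the arbitrary split scores below zero, meaning it is worse than what a random baseline would give.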
The Louvain algorithm is known for being one of the fastest modularity-based algorithms and for working well with large graphs (hundreds or thousands of nodes and edges). It also reveals a hierarchy of communities at different scales, which is useful for understanding the global functioning of a network. The method consists of repeated application of two steps. The first step is a “greedy” assignment of nodes to communities, favoring local optimizations of modularity. The second step is the definition of a new coarse-grained network based on the communities found in the first step. These two steps are repeated until no further modularity-increasing reassignments of communities are possible.
More details about the Louvain method can be found here.
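The behavior described above can be observed on a small example. This is a minimal sketch on a toy graph (two five-node cliques joined by one bridge), not on the AUC networks. The `memberships` element is assumed to hold one row per aggregation level, as documented for igraph's multi-level output; it simply prints `NULL` if a given igraph version does not expose it.

```r
library(igraph)

# Toy graph: two 5-node cliques joined by one bridge edge (not the AUC data)
g <- disjoint_union(make_full_graph(5), make_full_graph(5))
g <- add_edges(g, c(5, 6))

set.seed(1)  # the order in which nodes are visited is randomized
lv <- cluster_louvain(g)

membership(lv)   # final community of each node
lv$memberships   # assumed: one row per aggregation level (the hierarchy)
modularity(lv)   # modularity of the final partition
```

On a graph this clear-cut, the algorithm should recover the two cliques. On real networks the result can change between runs, which is one reason to fix the seed when reporting community counts.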
## detect communities with Louvain
lvc1 <- cluster_louvain(Pla1NetMC)
lvc2 <- cluster_louvain(Pla2NetMC)
Network of places
V(Pla1NetMC)$group <- lvc1$membership # create a group for each community
V(Pla1NetMC)$color <- lvc1$membership # node color reflects group membership
plot(lvc1, Pla1NetMC, vertex.label=V(Pla1NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = 0.5,
vertex.size=1.8,
main="Communities of places (Louvain method)")
Network of colleges
V(Pla2NetMC)$group <- lvc2$membership # create a group for each community
V(Pla2NetMC)$color <- lvc2$membership # node color reflects group membership
plot(lvc2, Pla2NetMC, vertex.label=V(Pla2NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = 0.5,
vertex.size=3,
main="Communities of colleges (Louvain method)")
In the two graphs above, each community is represented by a
distinct color. Black ties refer to edges within communities, whereas
red edges refer to edges across/between communities.
These visualizations contain too much information to be really useful. In order to make sense of the results, we need to extract and label each community individually.
# store membership data
louv_places <- data.frame(lvc1$membership,
lvc1$names) %>%
group_by(lvc1.membership) %>%
add_tally() %>% # add size of clusters
rename(PlaceLabel = lvc1.names, lvcluster = lvc1.membership, size = n)
louv_univ <- data.frame(lvc2$membership,
lvc2$names) %>%
group_by(lvc2.membership) %>%
add_tally() %>% # add size of clusters
rename(University = lvc2.names, lvcluster = lvc2.membership,
size = n)
# join with places and colleges attributes (place details and colleges popularity index = total number of curricula)
louv_places_detail <- inner_join(louv_places, place_attributes, by = "PlaceLabel") # join place details
univ_count <- aucplaces %>% group_by(University) %>% count() # compute index of popularity
louv_univ_detail <- inner_join(louv_univ, univ_count, by = "University") # join with colleges clusters
louv_univ_detail <- louv_univ_detail %>% rename(NbrCurricula = n) # rename n column
Extract communities from the network of places
glvc1 <- induced_subgraph(Pla1NetMC, V(Pla1NetMC)$group==1) # 35 nodes
glvc2 <- induced_subgraph(Pla1NetMC, V(Pla1NetMC)$group==2) # 25 nodes
glvc3 <- induced_subgraph(Pla1NetMC, V(Pla1NetMC)$group==3) # 21 nodes
glvc4 <- induced_subgraph(Pla1NetMC, V(Pla1NetMC)$group==4) # 20 nodes
glvc5 <- induced_subgraph(Pla1NetMC, V(Pla1NetMC)$group==5) # 46 nodes
glvc6 <- induced_subgraph(Pla1NetMC, V(Pla1NetMC)$group==6) # 26 nodes
glvc7 <- induced_subgraph(Pla1NetMC, V(Pla1NetMC)$group==7) # 10 nodes
Extract communities from the network of universities
glvcu1 <- induced_subgraph(Pla2NetMC, V(Pla2NetMC)$group==1) # 18 nodes
glvcu2 <- induced_subgraph(Pla2NetMC, V(Pla2NetMC)$group==2) # 14 nodes
glvcu3 <- induced_subgraph(Pla2NetMC, V(Pla2NetMC)$group==3) # 10 nodes
glvcu4 <- induced_subgraph(Pla2NetMC, V(Pla2NetMC)$group==4) # 21 nodes
glvcu5 <- induced_subgraph(Pla2NetMC, V(Pla2NetMC)$group==5) # 14 nodes
glvcu6 <- induced_subgraph(Pla2NetMC, V(Pla2NetMC)$group==6) # 15 nodes
glvcu7 <- induced_subgraph(Pla2NetMC, V(Pla2NetMC)$group==7) # 6 nodes
glvcu8 <- induced_subgraph(Pla2NetMC, V(Pla2NetMC)$group==8) # 6 nodes
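When the number of communities is not known in advance, the same extraction can be written as a loop. The sketch below is a hedged alternative, shown on Zachary's karate club network as a stand-in. The names `ids` and `subgraphs` are illustrative, and the list replaces the individually named objects (`glvc1`, `glvc2`, ...) used in this tutorial.

```r
library(igraph)

# Stand-in data: Zachary's karate club network, not the AUC data
g <- make_graph("Zachary")
set.seed(42)
lv <- cluster_louvain(g)
V(g)$group <- lv$membership  # same convention as in the tutorial

# One induced subgraph per community, stored in a named list
ids <- sort(unique(V(g)$group))
subgraphs <- lapply(ids, function(k) induced_subgraph(g, V(g)$group == k))
names(subgraphs) <- paste0("group", ids)

sapply(subgraphs, gorder)  # number of nodes in each community
```

Storing the subgraphs in a list makes it easy to apply the same metric to all of them with `sapply()`.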
We chose to label the communities of places according to the college(s) that most frequently appear in the places included in each community. In order to identify the college(s) that give each community its coherence, we simply looked for the places that contain only one college and a high number of students.
louv_places_detail <- louv_places_detail %>% mutate(lvcluster = as.factor(lvcluster))
louv_places_detail$cluster_name <- fct_recode(louv_places_detail$lvcluster,
"Columbia" ="1",
"Princeton-NYU"="2",
"Pennsylvania"="3",
"Chicago" ="4",
"California-Cornell-Michigan"="5",
"Harvard-MIT" = "6",
"Yale-Vanderbilt"= "7")
Similarly, we labeled the communities of colleges according to
the colleges that concentrated the largest number of curricula. In order
to find them, we computed the number of curricula per college and we
sorted the colleges in each cluster by decreasing order of
importance.
louv_univ_detail <- louv_univ_detail %>% mutate(lvcluster = as.factor(lvcluster))
louv_univ_detail$cluster_name <- fct_recode(louv_univ_detail$lvcluster,
"Pennsylvania-Yale-Illinois" ="1",
"Harvard-MIT"="2",
"California-Cornell"="3",
"Columbia-NYU" ="4",
"Chicago"="5",
"Princeton-George Washington" = "6",
"Michigan"= "7",
"Wisconsin-Purdue"= "8")
plot(glvc1, vertex.label=V(Pla1NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = 0.5,
vertex.size= 5,
main="Places Group 1 (Columbia)")
plot(glvcu4, vertex.label=V(Pla2NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = degree(glvcu4)*0.1,
vertex.size= degree(glvcu4)*3, # node size proportionate to node degree (in cluster)
main="Colleges Group 4 (Columbia-NYU)")
plot(glvc2, vertex.label=V(Pla1NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = 0.5,
vertex.size= 5,
main="Places Group 2 (Princeton-NYU)")
plot(glvcu6, vertex.label=V(Pla2NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = degree(glvcu6)*0.15,
vertex.size= degree(glvcu6)*1.5, # node size proportionate to node degree (in cluster)
main="Colleges Group 6 (Princeton-G.Washington-Wooster-Northwestern)")
plot(glvc3, vertex.label=V(Pla1NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = 0.5,
vertex.size= 5, # constant node size
main="Places Group 3 (Pennsylvania)")
plot(glvcu1, vertex.label=V(Pla2NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = degree(glvcu1)*0.2,
vertex.size= degree(glvcu1), # node size proportionate to node degree (in cluster)
main="Colleges Group 1 (Pennsylvania-Yale-Illinois)")
plot(glvc4, vertex.label=V(Pla1NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = 0.5,
vertex.size= 5,
main="Places Group 4 (Chicago)")
plot(glvcu5, vertex.label=V(Pla2NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = degree(glvcu5)*0.1,
vertex.size= degree(glvcu5)*3, # node size proportionate to node degree (in cluster)
main="Colleges Group 5 (Chicago)")
plot(glvc5, vertex.label=V(Pla1NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = 0.5,
vertex.size= 5,
main="Places Group 5 (California-Cornell-Michigan)")
plot(glvcu3, vertex.label=V(Pla2NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = degree(glvcu3)*0.2,
vertex.size= degree(glvcu3)*2, # node size proportionate to node degree (in cluster)
main="Colleges Group 3 (California-Cornell)")
plot(glvcu7, vertex.label=V(Pla2NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = degree(glvcu7)*0.5,
vertex.size= degree(glvcu7)*5, # node size proportionate to node degree (in cluster)
main="Colleges Group 7 (Michigan)")
plot(glvcu8, vertex.label=V(Pla2NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = degree(glvcu8)*0.5,
vertex.size= degree(glvcu8)*5, # node size proportionate to node degree (in cluster)
main="Colleges Group 8 (Purdue)")
plot(glvc6, vertex.label=V(Pla1NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = 0.5,
vertex.size= 5, # constant node size
main="Places Group 6 (Harvard-MIT)")
plot(glvcu2, vertex.label=V(Pla2NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = degree(glvcu2)*0.2,
vertex.size= degree(glvcu2)*2, # node size proportionate to node degree (in cluster)
main="Colleges Group 2 (Harvard-MIT)")
plot(glvc7, vertex.label=V(Pla1NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = 0.5,
vertex.size= 5, # constant node size
main="Places Group 7 (Yale-Vanderbilt)")
plot(glvcu1, vertex.label=V(Pla2NetMC)$id,
vertex.label.color = "black",
vertex.label.cex = degree(glvcu1)*0.2,
vertex.size= degree(glvcu1), # node size proportionate to node degree (in cluster)
main="Colleges Group 1 (Pennsylvania-Yale-Illinois)")
No | Label | Size | Shape |
---|---|---|---|
PG1 | Columbia | 35 | Hairball |
PG2 | Princeton-NYU | 25 | Chain |
PG3 | Pennsylvania | 21 | Hairball |
PG4 | Chicago | 20 | Hairball |
PG5 | California-Cornell-Michigan | 46 | Hybrid |
PG6 | Harvard-MIT (Technical) | 26 | Hybrid |
PG7 | Yale | 10 | Hairball |
PG = Place-based groups |
No | Label | Size | Shape |
---|---|---|---|
UG1 | Pennsylvania-Yale-Illinois | 18 | Chain |
UG2 | Harvard-MIT | 14 | Chain |
UG3 | California-Cornell | 10 | Bipolar |
UG4 | Columbia-NYU | 21 | Star |
UG5 | Chicago | 14 | Chain |
UG6 | Princeton-George Washington | 15 | Chain |
UG7 | Michigan | 6 | Star |
UG8 | Purdue | 6 | Star |
UG = University-based groups |
PGNo | PGLabel | PGSize | PGShape | UGNo | UGLabel | UGSize | UGShape |
---|---|---|---|---|---|---|---|
PG1 | Columbia | 35 | Hairball | UG4 | Columbia-NYU | 21 | Star |
PG2 | Princeton-NYU | 25 | Chain | UG6, UG4 | Princeton-George Washington, Columbia-NYU | 15, 21 | Chain, Star |
PG3 | Pennsylvania | 21 | Hairball | UG1 | Pennsylvania-Yale-Illinois | 18 | Chain |
PG4 | Chicago | 20 | Hairball | UG5 | Chicago | 14 | Chain |
PG5 | California-Cornell-Michigan | 46 | Hybrid | UG3, UG7 | California-Cornell, Michigan | 10, 6 | Bipolar, Star |
PG6 | Harvard-MIT (Technical) | 26 | Hybrid | UG2 | Harvard-MIT (Technical) | 14 | Chain |
PG7 | Yale | 10 | Hairball | UG1 | Pennsylvania-Yale-Illinois | 18 | Chain |
These visual comparisons need to be substantiated with an analysis of community structures based on their topographical metrics. For this structural analysis, we rely on the following metrics:
We proceed in two steps:
order1 <- gorder(glvc1)
order2 <- gorder(glvc2)
order3 <- gorder(glvc3)
order4 <- gorder(glvc4)
order5 <- gorder(glvc5)
order6 <- gorder(glvc6)
order7 <- gorder(glvc7)
order <- c(order1, order2, order3, order4, order5, order6, order7)
order
## [1] 35 25 21 20 46 26 10
size1 <- gsize(glvc1)
size2 <- gsize(glvc2)
size3 <- gsize(glvc3)
size4 <- gsize(glvc4)
size5 <- gsize(glvc5)
size6 <- gsize(glvc6)
size7 <- gsize(glvc7)
size <- c(size1, size2, size3, size4, size5, size6, size7)
size
## [1] 500 59 173 113 152 165 25
d1 <- diameter(glvc1)
d2 <- diameter(glvc2)
d3 <- diameter(glvc3)
d4 <- diameter(glvc4)
d5 <- diameter(glvc5)
d6 <- diameter(glvc6)
d7 <- diameter(glvc7)
diameter <- c(d1, d2, d3, d4, d5, d6, d7)
diameter
## [1] 3 6 3 3 5 5 3
Communities PG2 (Princeton-NYU), PG5
(California-Cornell-Michigan) and PG6 (Harvard-MIT) present larger
diameters than other communities (6 and 5 respectively), which is
directly related to their chain-based or hybrid structures. Large
diameters tend to impede or reduce circulations between places in these
groups. By contrast, hairball communities (PG1-Columbia) present smaller
diameters (3), which implies that there were more intense circulations
between Columbia graduates (members of the Columbia community). We also
notice that there is no linear relation between order (number of nodes)
and diameter (longest path). Although PG1 (Columbia) and PG7 (Yale) have
the same diameter (3), the former contains a larger number of nodes (35)
than the latter (10). We can verify this observation by simply plotting
order/diameter:
plot(order, diameter)
#### Average distance
avdist1 <- average.path.length(glvc1)
avdist2 <- average.path.length(glvc2)
avdist3 <- average.path.length(glvc3)
avdist4 <- average.path.length(glvc4)
avdist5 <- average.path.length(glvc5)
avdist6 <- average.path.length(glvc6)
avdist7 <- average.path.length(glvc7)
avgpath <- c(avdist1, avdist2, avdist3, avdist4, avdist5, avdist6, avdist7)
avgpath
## [1] 1.164706 2.456667 1.180952 1.442105 2.546860 1.738462 1.488889
The largest average distances between places are found in
communities 5 (California-Cornell-Michigan) and 2 (Princeton-NYU)
(>2). The smallest average distances are found in communities 1 (Columbia)
and 3 (Pennsylvania) (<1.2). Larger distances are associated with
chain-based (or hybrid-chain) structures, whereas smaller distances are
associated with hairball structures.
How densely connected are the nodes (places) in each community?
ed1 <- edge_density(glvc1)
ed2 <- edge_density(glvc2)
ed3 <- edge_density(glvc3)
ed4 <- edge_density(glvc4)
ed5 <- edge_density(glvc5)
ed6 <- edge_density(glvc6)
ed7 <- edge_density(glvc7)
density <- c(ed1, ed2, ed3, ed4, ed5, ed6, ed7)
density
## [1] 0.8403361 0.1966667 0.8238095 0.5947368 0.1468599 0.5076923 0.5555556
The maximum densities are to be found in communities PG1
(Columbia) and PG3 (Pennsylvania) (> 0.8). They are clearly
associated with hairball structures. Minimum densities are to be found
in communities PG5 (California-Cornell-Michigan) and PG2 (Princeton-NYU)
(<0.2). They are associated with chain-based (or hybrid chain)
structures.
How many ties do the nodes have, on average, in each community?
deg1 <- mean(degree(glvc1))
deg2 <- mean(degree(glvc2))
deg3 <- mean(degree(glvc3))
deg4 <- mean(degree(glvc4))
deg5 <- mean(degree(glvc5))
deg6 <- mean(degree(glvc6))
deg7 <- mean(degree(glvc7))
avgdegree <- c(deg1, deg2, deg3, deg4, deg5, deg6, deg7)
avgdegree
## [1] 28.571429 4.720000 16.476190 11.300000 6.608696 12.692308 5.000000
The largest average degree is to be found in community PG1
(Columbia: 28.57), the smallest in PG2 (Princeton-NYU) (4.72). Larger
average degrees are associated with hairball structures (PG1-Columbia,
PG3-Pennsylvania, and PG4-Chicago), whereas smaller average degrees are
associated with chain-based structures (e.g. PG2-Princeton-NYU).
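Density and average degree both follow directly from order (number of nodes N) and size (number of edges E): in an undirected network without self-ties, density is 2E / (N(N-1)) and average degree is 2E / N. The short check below recomputes both from the order and size values printed earlier. The variable names are illustrative, and only base R is needed.

```r
# Order (nodes) and size (edges) of the seven place-based communities,
# copied from the outputs printed earlier in this tutorial
n_nodes <- c(35, 25, 21, 20, 46, 26, 10)
n_edges <- c(500, 59, 173, 113, 152, 165, 25)

# Undirected network without self-ties:
# density = 2E / (N (N - 1)), average degree = 2E / N
density_check   <- 2 * n_edges / (n_nodes * (n_nodes - 1))
avgdegree_check <- 2 * n_edges / n_nodes

round(density_check, 4)
round(avgdegree_check, 2)
```

The recomputed values agree with the `edge_density()` and `mean(degree())` outputs, which is a quick way to confirm that the community sizes used in the tables are consistent.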
To what extent are the communities spaced out or bunched up? Since igraph does not offer a built-in function for this metric, we need to write our own function, relying on this tutorial:
# Create function
Compactness <- function(g) {
gra.geo <- distances(g) ## get geodesics
gra.rdist <- 1/gra.geo ## get reciprocal of geodesics
diag(gra.rdist) <- NA ## assign NA to diagonal
gra.rdist[gra.rdist == Inf] <- 0 ## replace infinity with 0
# Compactness = mean of reciprocal distances
comp.igph <- mean(gra.rdist, na.rm=TRUE)
return(comp.igph)
}
# Apply function to each community of places
comp1 <- Compactness(glvc1)
comp2 <- Compactness(glvc2)
comp3 <- Compactness(glvc3)
comp4 <- Compactness(glvc4)
comp5 <- Compactness(glvc5)
comp6 <- Compactness(glvc6)
comp7 <- Compactness(glvc7)
compactness <- c(comp1, comp2, comp3, comp4, comp5, comp6, comp7)
compactness
## [1] 0.9142857 0.5080556 0.9031746 0.7912281 0.4751047 0.7161026 0.7703704
Hairball-type communities (PG1-Columbia, PG3-Pennsylvania and to
a lesser extent, PG4-Chicago and PG7-Yale) are naturally more compact
than chain-based (PG2-Princeton-NYU) and hybrid
(PG5-California-Cornell-Michigan) ones.
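To get a feel for the scale of this index, it can be applied to a few textbook graphs whose answer is known in advance. The sketch below restates the `Compactness()` function so that it runs on its own; the three toy graphs are illustrative and unrelated to the AUC data.

```r
library(igraph)

# Restated from the tutorial: mean of reciprocal geodesic distances
Compactness <- function(g) {
  gra.geo <- distances(g)             # geodesic distances
  gra.rdist <- 1 / gra.geo            # reciprocal distances
  diag(gra.rdist) <- NA               # ignore self-distances
  gra.rdist[gra.rdist == Inf] <- 0    # guard against infinite values
  mean(gra.rdist, na.rm = TRUE)
}

c_full  <- Compactness(make_full_graph(5))                  # everyone adjacent
c_star  <- Compactness(make_star(4, mode = "undirected"))   # hub with 3 leaves
c_empty <- Compactness(make_empty_graph(3, directed = FALSE)) # no ties at all

c(full = c_full, star = c_star, empty = c_empty)
```

A complete graph scores 1, a graph without ties scores 0, and the four-node star scores 0.75 (three pairs at distance 1 and three pairs at distance 2).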
How many triangles can we find in each community? To what extent are they likely to break down into subgroups?
trans1 <- transitivity(glvc1)
trans2 <- transitivity(glvc2)
trans3 <- transitivity(glvc3)
trans4 <- transitivity(glvc4)
trans5 <- transitivity(glvc5)
trans6 <- transitivity(glvc6)
trans7 <- transitivity(glvc7)
transitivity <- c(trans1, trans2, trans3, trans4, trans5, trans6, trans7)
transitivity
## [1] 0.9918694 0.6517572 0.9877676 0.9516240 0.6625310 0.9639922 0.8571429
Hairball communities (PG1-Columbia, PG3-Pennsylvania,
PG6-Harvard-MIT, PG4-Chicago) generally contain more triangles than
chain-based communities (PG2-Princeton-NYU,
PG5-California-Cornell-Michigan). Higher transitivity scores (close to
1) mean that the networks contain a large number of triangles and are
more likely to break down into subgroups.
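The transitivity score returned by igraph's default setting is the share of connected triples of nodes that are closed into triangles. A minimal sketch on three toy graphs (invented for the example) shows the two extremes and an intermediate case.

```r
library(igraph)

# Complete graph: every connected triple is closed
t_full <- transitivity(make_full_graph(5))

# Star: many connected triples through the hub, but no triangle
t_star <- transitivity(make_star(5, mode = "undirected"))

# One triangle (1-2-3) plus a pendant node 4 attached to node 3:
# 5 connected triples, 3 of which belong to the triangle
g_mix <- make_graph(c(1,2, 2,3, 1,3, 3,4), directed = FALSE)
t_mix <- transitivity(g_mix)

c(full = t_full, star = t_star, mixed = t_mix)
```

The complete graph scores 1, the star scores 0, and the mixed graph scores 0.6.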
To what extent are the communities dominated by one or a few nodes?
eigen1 <- centr_eigen(glvc1)$centralization
eigen2 <- centr_eigen(glvc2)$centralization
eigen3 <- centr_eigen(glvc3)$centralization
eigen4 <- centr_eigen(glvc4)$centralization
eigen5 <- centr_eigen(glvc5)$centralization
eigen6 <- centr_eigen(glvc6)$centralization
eigen7 <- centr_eigen(glvc7)$centralization
dominance <- c(eigen1, eigen2, eigen3, eigen4, eigen5, eigen6, eigen7)
dominance
## [1] 0.08877289 0.68899847 0.10203164 0.26848306 0.70183553 0.32723282 0.34480619
Smaller values imply lower dominance and more egalitarianism.
Based on this metric (eigenvector centralization), we observe that
chain-based communities (PG5-California-Cornell-Michigan and
PG2-Princeton-NYU) are more dominated and less “egalitarian” than
hairball structures (PG1-Columbia and PG3-Pennsylvania).
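To illustrate what this index captures, the sketch below compares two toy graphs of the same order: a star, where one hub holds all the ties, and a complete graph, where all nodes are equivalent. Only the ordering of the two scores is commented on here, since the exact normalized value depends on the theoretical maximum that igraph uses.

```r
library(igraph)

# Star: one hub connected to nine leaves (strongly dominated)
c_star <- centr_eigen(make_star(10, mode = "undirected"))$centralization

# Complete graph: all nodes have the same eigenvector centrality
c_full <- centr_eigen(make_full_graph(10))$centralization

c(star = c_star, full = c_full)
```

The complete graph scores essentially zero, while the star scores much higher, which matches the reading of the index as a measure of dominance by one or a few nodes.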
Finally, we compile these metrics in order to identify structural profiles based on their possible combinations using PCA/HCPC.
louvain_places_pca <- cbind(order, size, diameter, avgpath, density, avgdegree, compactness, transitivity, dominance)
rownames(louvain_places_pca) <- c("PG1.Columbia", "PG2.Princeton-NYU", "PG3.Pennsylvania", "PG4.Chicago","PG5.California-Cornell-Michigan","PG6.Harvard","PG7.Yale")
write.csv(louvain_places_pca, file="louvain_places_pca.csv")
louvain_places_pca <- read.csv("~/Places AUC/Markdown/Data/louvain_places_pca.csv", row.names=1)
Load packages
library(FactoMineR)
library(Factoshiny)
library(factoextra)
res.PCA<-PCA(louvain_places_pca,graph=FALSE)
plot.PCA(res.PCA,choix='var',title="PCA Graph of Variables (Topographical metrics)")
The first two dimensions capture 95% of the information (74% on the
first dimension and 21% on the second). Six dimensions are necessary to
capture 100% of the information.
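The share of information carried by each dimension can be cross-checked without FactoMineR. The sketch below swaps in base R's `prcomp()` on standardized variables, which should give comparable proportions, and rebuilds the input matrix from the metric values printed earlier. The object names `m`, `pca` and `var_share` are illustrative.

```r
# Metrics of the seven place-based communities, copied from the printed outputs
m <- cbind(
  order        = c(35, 25, 21, 20, 46, 26, 10),
  size         = c(500, 59, 173, 113, 152, 165, 25),
  diameter     = c(3, 6, 3, 3, 5, 5, 3),
  avgpath      = c(1.164706, 2.456667, 1.180952, 1.442105, 2.546860, 1.738462, 1.488889),
  density      = c(0.8403361, 0.1966667, 0.8238095, 0.5947368, 0.1468599, 0.5076923, 0.5555556),
  avgdegree    = c(28.571429, 4.720000, 16.476190, 11.300000, 6.608696, 12.692308, 5.000000),
  compactness  = c(0.9142857, 0.5080556, 0.9031746, 0.7912281, 0.4751047, 0.7161026, 0.7703704),
  transitivity = c(0.9918694, 0.6517572, 0.9877676, 0.9516240, 0.6625310, 0.9639922, 0.8571429),
  dominance    = c(0.08877289, 0.68899847, 0.10203164, 0.26848306, 0.70183553, 0.32723282, 0.34480619)
)
rownames(m) <- paste0("PG", 1:7)

# Base-R PCA on standardized variables (swapped in for FactoMineR::PCA)
pca <- prcomp(m, scale. = TRUE)
var_share <- pca$sdev^2 / sum(pca$sdev^2)

round(100 * var_share, 1)          # percentage of variance per dimension
round(100 * cumsum(var_share), 1)  # cumulative percentage
```

With seven communities, the centered data can span at most six dimensions, so the seventh component carries no variance. This is why six dimensions are enough to capture all the information.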
On the graph of variables, density, transitivity and compactness are positively and strongly correlated with the first dimension, whereas diameter, average path and dominance are negatively correlated with the same dimension. Order and size are positively and strongly correlated to the second dimension. Average degree is positively correlated to the two dimensions.
Dense, compact and more egalitarian communities with high transitivity (many triangles) and average degree are situated on the right side of the graph, whereas sparse communities, with long average path (geodesic distance) dominated by one or few nodes are located on the left side of the graph. Communities with large number of nodes and edges (large order and size) are situated at the top of the graph (above the x axis), whereas smaller communities (with fewer nodes and edges) are located below.
plot.PCA(res.PCA,title="PCA Graph of Individuals (Communities of places)")
The former group (including more compact and egalitarian communities) is associated with hairball style communities, e.g. PG1-Columbia and PG3-Pennsylvania. The latter group (more sparse and less egalitarian communities) is associated with chain-based structures e.g. PG5-California-Cornell-Michigan and PG2-Princeton-NYU.
The Columbia community (PG1) in the right-hand corner contains the largest number of edges (500), whereas the California-Cornell-Michigan community (PG5) in the left-hand corner contains the largest number of nodes (46 places). The Yale (PG7) and Chicago (PG4) communities contain fewer nodes and edges, which is why they are located at the bottom of the graph. The Harvard community (PG6) is closer to the mean profile since it coincides with the point of origin.
In the next section, we perform a hierarchical clustering (HCPC) on all six dimensions in order to group the communities according to their structural profiles.
res.PCA<-PCA(louvain_places_pca,ncp=6,graph=FALSE)
res.HCPC<-HCPC(res.PCA,nb.clust=3,consol=FALSE,graph=FALSE)
plot.HCPC(res.HCPC,choice='tree',title='Hierarchical Tree')
plot.HCPC(res.HCPC,choice='map',draw.tree=FALSE,title='Factor Map')
plot.HCPC(res.HCPC,choice='3D.map',ind.names=FALSE,centers.plot=FALSE,angle=60,title='3D Tree on Factor Map')
The groups were identified:
Warning: the partition is not significantly determined by topographical metrics (p-value >0.05).
order1 <- gorder(glvcu1)
order2 <- gorder(glvcu2)
order3 <- gorder(glvcu3)
order4 <- gorder(glvcu4)
order5 <- gorder(glvcu5)
order6 <- gorder(glvcu6)
order7 <- gorder(glvcu7)
order8 <- gorder(glvcu8)
order2 <- c(order1, order2, order3, order4, order5, order6, order7, order8)
order2
## [1] 18 14 10 21 14 15 6 6
size1 <- gsize(glvcu1)
size2 <- gsize(glvcu2)
size3 <- gsize(glvcu3)
size4 <- gsize(glvcu4)
size5 <- gsize(glvcu5)
size6 <- gsize(glvcu6)
size7 <- gsize(glvcu7)
size8 <- gsize(glvcu8)
size2 <- c(size1, size2, size3, size4, size5, size6, size7, size8)
size2
## [1] 21 14 12 29 16 20 5 6
The Columbia-NYU community (UG4) contains the largest number of
nodes (21) and edges (29), followed by the Pennsylvania-Yale-Illinois (UG1)
and Princeton-George Washington (UG6) constellations (18 and 15 nodes,
21 and 20 edges, respectively). The Michigan and Purdue communities are
the smallest ones based on the number of nodes and edges they include (6
nodes each, 5 and 6 edges, respectively). Harvard-MIT (UG2), Chicago
(UG5) and California-Cornell (UG3) stand in between, with 14, 14 and 10
nodes and 14, 16 and 12 edges, respectively.
d1 <- diameter(glvcu1)
d2 <- diameter(glvcu2)
d3 <- diameter(glvcu3)
d4 <- diameter(glvcu4)
d5 <- diameter(glvcu5)
d6 <- diameter(glvcu6)
d7 <- diameter(glvcu7)
d8 <- diameter(glvcu8)
diameter2 <- c(d1, d2, d3, d4, d5, d6, d7, d8)
diameter2
## [1] 6 5 3 5 4 5 3 3
The Pennsylvania-Yale-Illinois constellation (UG1) presents the
largest diameter (6), followed by the Harvard-MIT (UG2), Columbia-NYU
(UG4) and Princeton-George Washington (UG6) communities (5 for each).
Except for Columbia, large diameters are generally associated with
chain-based structures which impede or reduce circulations.
Bipolar (UG3-California-Cornell) and star-like communities
(UG7-Michigan, UG8-Purdue) present smaller diameters (3), which
facilitates circulations within these communities. We notice that there
is no linear relation between order (number of nodes) and diameter. For
example, although UG4 (Columbia-NYU) contains more nodes than UG2
(Harvard-MIT) and UG6 (Princeton-George Washington) (21, 14 and 15
respectively), the three communities present exactly the same diameter
(5). This can be verified by simply plotting order/diameter:
plot(order2, diameter2)
#### Average distance
avdist1 <- average.path.length(glvcu1)
avdist2 <- average.path.length(glvcu2)
avdist3 <- average.path.length(glvcu3)
avdist4 <- average.path.length(glvcu4)
avdist5 <- average.path.length(glvcu5)
avdist6 <- average.path.length(glvcu6)
avdist7 <- average.path.length(glvcu7)
avdist8 <- average.path.length(glvcu8)
avgpath2 <- c(avdist1, avdist2, avdist3, avdist4, avdist5, avdist6, avdist7, avdist8)
avgpath2
## [1] 2.555556 2.483516 2.000000 2.071429 2.230769 2.638095 1.866667 1.800000
The largest average distances between colleges are found in the UG6
(Princeton-George Washington) and UG1 (Pennsylvania-Yale-Illinois)
constellations (>2.5). The smallest average distances occur in communities UG7
(Michigan) and UG8 (Purdue) (<2) and, to a lesser extent, in
UG4-Columbia-NYU (close to 2). Larger distances are associated with
chain-based or hybrid structures, whereas smaller distances are
associated with star-like communities.
How densely connected are the nodes (colleges) in each community?
ed1 <- edge_density(glvcu1)
ed2 <- edge_density(glvcu2)
ed3 <- edge_density(glvcu3)
ed4 <- edge_density(glvcu4)
ed5 <- edge_density(glvcu5)
ed6 <- edge_density(glvcu6)
ed7 <- edge_density(glvcu7)
ed8 <- edge_density(glvcu8)
density2 <- c(ed1, ed2, ed3, ed4, ed5, ed6, ed7, ed8)
density2
## [1] 0.1372549 0.1538462 0.2666667 0.1380952 0.1758242 0.1904762 0.3333333
## [8] 0.4000000
Maximum densities are to be found in communities UG8 (Purdue)
and UG7 (Michigan) (> 0.3). They are associated with star-like
structures and highly specialized curricula (engineers, professionals).
Minimum densities are to be found in constellations UG1
(Pennsylvania-Yale-Illinois) and UG4 (Columbia-NYU) (<0.2). They are
associated with chain-based (or hybrid) structures and less specialized
curricula (humanities, sciences).
How many ties do the nodes have, on average, in each community?
deg1 <- mean(degree(glvcu1))
deg2 <- mean(degree(glvcu2))
deg3 <- mean(degree(glvcu3))
deg4 <- mean(degree(glvcu4))
deg5 <- mean(degree(glvcu5))
deg6 <- mean(degree(glvcu6))
deg7 <- mean(degree(glvcu7))
deg8 <- mean(degree(glvcu8))
avgdegree2 <- c(deg1, deg2, deg3, deg4, deg5, deg6, deg7, deg8)
avgdegree2
## [1] 2.333333 2.000000 2.400000 2.761905 2.285714 2.666667 1.666667 2.000000
The largest average degree (2.76) is to be found in the Columbia-NYU
community (UG4), the smallest (1.67) in the Michigan community (UG7). There
appears to be no clear relation between average degree and structures in
college-based communities. Columbia and Michigan communities both
present star-like structures.
To what extent are the communities spaced out or bunched up? Since igraph does not offer a built-in function for this metric, we need to write our own function, relying on this tutorial:
# Create function
Compactness <- function(g) {
gra.geo <- distances(g) ## get geodesics
gra.rdist <- 1/gra.geo ## get reciprocal of geodesics
diag(gra.rdist) <- NA ## assign NA to diagonal
gra.rdist[gra.rdist == Inf] <- 0 ## replace infinity with 0
# Compactness = mean of reciprocal distances
comp.igph <- mean(gra.rdist, na.rm=TRUE)
return(comp.igph)
}
# Apply function to each community of colleges
comp1 <- Compactness(glvcu1)
comp2 <- Compactness(glvcu2)
comp3 <- Compactness(glvcu3)
comp4 <- Compactness(glvcu4)
comp5 <- Compactness(glvcu5)
comp6 <- Compactness(glvcu6)
comp7 <- Compactness(glvcu7)
comp8 <- Compactness(glvcu8)
compactness2 <- c(comp1, comp2, comp3, comp4, comp5, comp6, comp7, comp8)
compactness2
## [1] 0.4497821 0.4851648 0.5888889 0.4939683 0.5228938 0.4885714 0.6333333
## [8] 0.6666667
Star-like, specialized communities (UG8-Purdue, UG7-Michigan)
present higher compactness than others.
How many triangles can we find in each community? To what extent are they likely to break down into subgroups?
trans1 <- transitivity(glvcu1)
trans2 <- transitivity(glvcu2)
trans3 <- transitivity(glvcu3)
trans4 <- transitivity(glvcu4)
trans5 <- transitivity(glvcu5)
trans6 <- transitivity(glvcu6)
trans7 <- transitivity(glvcu7)
trans8 <- transitivity(glvcu8)
transitivity2 <- c(trans1, trans2, trans3, trans4, trans5, trans6, trans7, trans8)
transitivity2
## [1] 0.16216216 0.08108108 0.28125000 0.17647059 0.18000000 0.38181818 0.00000000
## [8] 0.33333333
Communities UG6 (Princeton-George Washington), UG8 (Purdue) and
UG3 (California-Cornell) present higher transitivity than other
communities. UG7 (Michigan) does not contain any triangle (null
transitivity). There is no clear relation between transitivity and global
shape and no correspondence between transitivity scores in the two
networks.
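As a reminder of what the transitivity score counts, here is a minimal sketch on a toy graph (a triangle with one pendant node). The graph is hypothetical. The hand computation uses the usual definition of global transitivity: three times the number of triangles divided by the number of connected triples.

```r
library(igraph)

# Toy graph: a triangle (1-2-3) plus a pendant node (4) attached to node 3
g <- make_graph(c(1, 2, 2, 3, 1, 3, 3, 4), directed = FALSE)

# count_triangles() counts triangles per vertex, so each one appears three times
n_triangles <- sum(count_triangles(g)) / 3
# connected triples centred on each node: choose(degree, 2)
n_triples <- sum(choose(degree(g), 2))

3 * n_triangles / n_triples   # 0.6
transitivity(g)               # same value (global transitivity)
```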
To what extent are the communities dominated by one or a few nodes?
eigen1 <- centr_eigen(glvcu1)$centralization
eigen2 <- centr_eigen(glvcu2)$centralization
eigen3 <- centr_eigen(glvcu3)$centralization
eigen4 <- centr_eigen(glvcu4)$centralization
eigen5 <- centr_eigen(glvcu5)$centralization
eigen6 <- centr_eigen(glvcu6)$centralization
eigen7 <- centr_eigen(glvcu7)$centralization
eigen8 <- centr_eigen(glvcu8)$centralization
dominance2 <- c(eigen1, eigen2, eigen3, eigen4, eigen5, eigen6, eigen7, eigen8)
dominance2
## [1] 0.7714981 0.7613397 0.5840721 0.7848368 0.7481689 0.6429198 0.6557278
## [8] 0.6014263
Communities UG4-Columbia-NYU, UG1-Pennsylvania-Yale-Illinois and
UG2-Harvard-MIT are less egalitarian than others. Except for Columbia
(UG4), chain-based communities generally present higher dominance than
star-like or bipolar communities. We observe a negative correspondence
between place-based and college-based communities from the perspective
of dominance. For example, Columbia and Pennsylvania communities are
less egalitarian than other college-based communities, whereas in the
network of places, they are the most egalitarian of all communities.
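The metric-by-metric calls above repeat the same pattern for each community. One possible shortcut, sketched here under the assumption that the community subgraphs are stored in a named list, is to apply each metric with `sapply()`. The three toy graphs stand in for the real subgraphs and are purely illustrative.

```r
library(igraph)

# Hypothetical stand-ins for the community subgraphs
communities <- list(
  star  = make_star(6, mode = "undirected"),
  chain = make_ring(6, circular = FALSE),
  dense = make_full_graph(4)
)

# Same definition of compactness as earlier in this section
compactness <- function(g) {
  rdist <- 1 / distances(g)
  diag(rdist) <- NA
  mean(rdist, na.rm = TRUE)
}

# One row per community, one column per metric
metrics <- data.frame(
  order        = sapply(communities, vcount),
  size         = sapply(communities, ecount),
  compactness  = sapply(communities, compactness),
  transitivity = sapply(communities, transitivity),
  dominance    = sapply(communities, function(g) centr_eigen(g)$centralization)
)
metrics
```

The resulting data frame has the same shape as the table compiled for the PCA, which avoids maintaining eight numbered variables per metric.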
Finally, we compile these metrics in order to identify structural profiles based on their possible combinations using PCA/HCPC.
louvain_univ_pca <- cbind(order2, size2, diameter2, avgpath2, density2, avgdegree2, compactness2, transitivity2, dominance2)
rownames(louvain_univ_pca) <- c("UG1.Pennsylvania-Yale-Illinois", "UG2.Harvard-MIT", "UG3.California-Cornell", "UG4.Columbia-NYU","UG5.Chicago","UG6.Princeton-George Washington","UG7.Michigan", "UG8.Purdue")
write.csv(louvain_univ_pca, file="louvain_univ_pca.csv")
louvain_univ_pca <- read.csv("~/Places AUC/Markdown/Data/louvain_univ_pca.csv", row.names=1)
Load packages
library(FactoMineR)
library(Factoshiny)
library(factoextra)
res.PCA<-PCA(louvain_univ_pca,graph=FALSE)
plot.PCA(res.PCA,choix='var',title="PCA Graph of Variables (Topographical metrics)")
The first two dimensions capture 89% of the information - 70% on the
first dimension and 19% on the second. Seven dimensions are necessary to
capture 100% of the information. On the graph of variables, diameter,
order, size and average path are positively correlated with the first
dimension, whereas density and compactness are negatively correlated
with the same dimension. Transitivity is positively and strongly
associated with the second dimension. Average degree is positively
correlated with both dimensions, whereas dominance is positively
associated with the first dimension, but negatively with the second.
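The percentages of information quoted above are shares of variance per dimension. The sketch below shows how such shares can be computed, using base R's `prcomp()` in place of FactoMineR and a small table of invented values (the `toy` data frame is hypothetical). In FactoMineR, the equivalent figures are stored in `res.PCA$eig`.

```r
# Hypothetical metrics for six communities (illustrative values only)
toy <- data.frame(
  order        = c(30, 25, 12, 40, 18, 9),
  diameter     = c(8, 7, 4, 9, 5, 3),
  density      = c(0.08, 0.10, 0.25, 0.06, 0.15, 0.33),
  transitivity = c(0.16, 0.08, 0.28, 0.18, 0.38, 0.00)
)

pca <- prcomp(toy, scale. = TRUE)            # PCA on standardised variables
explained <- pca$sdev^2 / sum(pca$sdev^2)    # share of variance per dimension

round(100 * explained, 1)           # percentage per dimension
round(100 * cumsum(explained), 1)   # cumulative percentage
pca$rotation[, 1:2]                 # variable loadings on the first two dimensions
```

The sign and size of the loadings indicate which variables pull in the same or in opposite directions on each dimension.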
plot.PCA(res.PCA,title="PCA Graph of Individuals (Communities of colleges)")
Large and loose communities, characterized by a large number of nodes and edges and a large diameter but low density and compactness, are located on the right side of the graph, whereas small but dense and compact communities are located on the left side. The former group essentially includes the Columbia community (UG4) and the Pennsylvania-Yale-Illinois constellation (UG1). The latter includes the Purdue (UG8) and California-Cornell (UG3) communities. The Princeton constellation (UG6), characterized by a large average degree, is located in the top right-hand corner, opposite the Michigan community (UG7). The highly dominated Harvard-MIT community (UG2) is located in the bottom right-hand corner. Being closer to the origin than the others, the Chicago community (UG5) best represents the mean profile.
In the next section, we perform a hierarchical clustering (HCPC) on all seven dimensions in order to group the communities according to their structural profiles.
res.PCA<-PCA(louvain_univ_pca,ncp=7,graph=FALSE)
res.HCPC<-HCPC(res.PCA,nb.clust=4,consol=FALSE,graph=FALSE)
plot.HCPC(res.HCPC,choice='tree',title='Hierarchical Tree')
plot.HCPC(res.HCPC,choice='map',draw.tree=FALSE,title='Factor Map')
plot.HCPC(res.HCPC,choice='3D.map',ind.names=FALSE,centers.plot=FALSE,angle=60,title='3D Tree on Factor Map')
Four classes were detected:
Note: The partition is most strongly characterized by dominance (p value 0.00066).
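For readers who want to see the logic of clustering on principal components without FactoMineR, here is a rough base R sketch: Ward clustering on the PCA scores, then cutting the tree into a fixed number of classes. This is a swapped-in approximation, not the HCPC procedure itself, and the `toy` table and its row names are invented for illustration.

```r
# Hypothetical metrics for six communities (illustrative values only)
toy <- data.frame(
  order        = c(30, 25, 12, 40, 18, 9),
  diameter     = c(8, 7, 4, 9, 5, 3),
  density      = c(0.08, 0.10, 0.25, 0.06, 0.15, 0.33),
  transitivity = c(0.16, 0.08, 0.28, 0.18, 0.38, 0.00),
  row.names    = c("A", "B", "C", "D", "E", "F")
)

scores  <- prcomp(toy, scale. = TRUE)$x              # coordinates on all dimensions
tree    <- hclust(dist(scores), method = "ward.D2")  # Ward hierarchical clustering
classes <- cutree(tree, k = 3)                       # cut the tree into three classes

split(names(classes), classes)                       # members of each class
```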
In order to enrich this structural analysis, the next section seeks to better characterize the profiles of community members. It provides a method for visually comparing community membership. Since we do not have attributes for colleges, we will focus on the communities of places (PG).
For this analysis, we rely on a dataset which was created earlier in this tutorial (louv_places_detail). It contains the list of places with their quantitative and qualitative attributes, and the cluster which they belong to (see “Community detection - Extract Communities”).
How many students, on average, did each place contain in each community? Did some communities contain more populated places than others, which were instead defined by singular trajectories? We use boxplots below to compare the typical number (median) and the range of students per place in each community.
louv_places_detail %>%
ggplot(aes(reorder(cluster_name, NbElements), NbElements, color = cluster_name)) +
geom_boxplot(alpha = 0.8, show.legend = FALSE) +
coord_flip() +
labs(x = "Communities", y = "Number of students per place") +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni Communities (Louvain)",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
The Yale-Vanderbilt community ranks first based on the average
number of students, but it presents a lower dispersion than California,
Harvard, Pennsylvania, Columbia, and Princeton. The Chicago community is
more homogeneous with fewer students per place (maximum of 5).
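Boxplots display medians and quartiles rather than means, so it can help to print the underlying figures alongside the plot. The sketch below does this in base R on a small invented table. `toy_places` is a hypothetical stand-in for `louv_places_detail` that only borrows its column names.

```r
# Hypothetical stand-in for louv_places_detail (illustrative values only)
toy_places <- data.frame(
  cluster_name = c("A", "A", "A", "B", "B", "B"),
  NbElements   = c(2, 3, 10, 2, 2, 5)
)

# Median, mean, minimum and maximum number of students per place, by community
stats <- sapply(
  split(toy_places$NbElements, toy_places$cluster_name),
  function(x) c(median = median(x), mean = mean(x), min = min(x), max = max(x))
)
t(stats)   # one row per community
```

In this toy example, community A has a higher mean than median because of a single heavily populated place, which is the kind of dispersion the boxplots make visible.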
How many colleges did the students attend, on average, in each community?
louv_places_detail %>%
ggplot(aes(reorder(cluster_name, NbSets), NbSets, color = cluster_name)) +
geom_boxplot(alpha = 0.8, show.legend = FALSE) +
coord_flip() +
labs(x = "Communities", y = "Number of colleges per place") +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni Communities (Louvain)",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
The ranking based on college attendance follows the exact reverse
order. The members of the Yale-Vanderbilt community attended fewer
colleges than the average (fewer than two). The Columbia community
presents the widest range of curricula. Other communities point
to intermediate situations. The Princeton-NYU community shows the
maximum dispersion, but most of its members attended fewer than two
universities.
ggplot(data = louv_places_detail, mapping = aes(x = NbSets, y = NbElements, color = cluster_name)) +
geom_jitter(show.legend = FALSE) +
facet_wrap(~ cluster_name) +
labs(x = "Number of colleges per place", y = "Number of students per place") +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni Communities (Louvain)",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
If we combine the number of students with the number of colleges in
each community, three profiles can be defined:
louv_places_detail %>%
group_by(cluster_name, Nationality) %>%
count() %>%
ggplot(aes(reorder(Nationality, n), n, fill = Nationality)) +
geom_col(alpha = 0.8, show.legend = TRUE) +
facet_wrap(~ cluster_name) +
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = c(0.6, .2),
legend.title = element_text(size=12),
legend.text = element_text(size=10),
legend.box.background = element_rect(color="darkgrey", size=1)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni Communities (Louvain)",
fill = "Students' nationalities",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
The above plots reveal three national profiles:
# First, we create a function that serves to define integer breaks on the *y* axes of the faceted plots further below:
integer_breaks <- function(n = 5, ...) {
fxn <- function(x) {
breaks <- floor(pretty(x, n, ...))
names(breaks) <- attr(breaks, "labels")
breaks
}
return(fxn)
}
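The break function can be tried on its own before passing it to a ggplot2 scale. The sketch below copies the definition so that it runs in isolation and calls it on two hypothetical count ranges. Rounding down can create repeated values, which is why the result is also passed through `unique()` here (an optional extra step, not part of the definition above).

```r
# Copy of the helper defined above, so that this block runs on its own
integer_breaks <- function(n = 5, ...) {
  fxn <- function(x) {
    breaks <- floor(pretty(x, n, ...))
    names(breaks) <- attr(breaks, "labels")
    breaks
  }
  return(fxn)
}

make_breaks <- integer_breaks()     # returns a function of the axis range
small <- make_breaks(c(0, 3))       # hypothetical range of counts
large <- make_breaks(c(0, 47))

unique(small)   # whole-number breaks only
unique(large)
```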
louv_places_detail %>%
group_by(cluster_name, Region_nbr) %>%
count() %>%
ggplot(aes(reorder(Region_nbr, n), n, fill = Region_nbr)) +
geom_col(alpha = 0.8, show.legend = TRUE) +
facet_wrap(~ cluster_name) +
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = c(0.6, .2),
legend.title = element_text(size=12),
legend.text = element_text(size=10),
legend.box.background = element_rect(color="darkgrey", size=1)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Geographical coverage",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
The plots reveal three main regional profiles:
louv_places_detail %>%
group_by(cluster_name, Mobility) %>%
count() %>%
ggplot(aes(reorder(Mobility, n), n, fill = Mobility)) +
geom_col(alpha = 0.8, show.legend = TRUE) +
facet_wrap(~ cluster_name) +
scale_x_discrete(
name = NULL, breaks = NULL) +
scale_fill_discrete(
limits = c("INTER", "INTRA", "NULL"),
labels = c("Inter-region (high)", "Intra-region (low)", "Null (same college)")
) +
theme(legend.position = c(0.6, .2),
legend.title = element_text(size=12),
legend.text = element_text(size=10),
legend.box.background = element_rect(color="darkgrey", size=1)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Geographical mobility",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
Students’ mobility provides a finer perspective on the geographical
profile of academic communities:
louv_places_detail %>%
group_by(cluster_name, Region_code) %>%
count() %>%
ggplot(aes(reorder(Region_code, n), n, fill = Region_code)) +
geom_col(alpha = 0.8, show.legend = TRUE) +
facet_wrap(~ cluster_name) +
scale_x_discrete(
name = NULL, breaks = NULL) +
scale_fill_discrete(
limits = c("EAST", "EM", "EO", "MID", "MO", "OTHER", "OTHER - US"),
labels = c("East Coast", "East-Midwest", "East-Other", "Midwest", "Mid-Other", "Other (non-USA)", "Other (USA)")
) +
theme(legend.position = c(0.6, .17),
legend.title = element_text(size=10),
legend.text = element_text(size=8),
legend.box.background = element_rect(color="darkgrey", size=0.8)) +
guides(fill=guide_legend(ncol=2)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Geographical coverage",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
A closer examination of the region of study reveals three main
regional groups:
Finally, the two sets of plots below combine the degree of mobility with the region of study:
louv_places_detail$region_label <- factor(louv_places_detail$Region_code, levels = c("EAST", "EM", "EO", "MID", "MO", "OTHER", "OTHER - US"),
labels = c("East Coast", "East-Midwest", "East-Other", "Midwest", "Mid-Other", "Other (non-USA)", "Other (USA)"))
louv_places_detail %>%
group_by(cluster_name, Region_nbr) %>%
count(region_label) %>%
ggplot(aes(reorder(cluster_name, n), n, fill = region_label)) +
geom_col(alpha = 0.8, show.legend = TRUE, position = "dodge") +
facet_grid(Region_nbr ~ cluster_name, scales = "free") +
scale_y_continuous(breaks = integer_breaks())+
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = "bottom",
legend.title = element_text(size=10),
legend.text = element_text(size=8),
legend.box.background = element_rect(color="darkgrey", size=1)) +
guides(fill=guide_legend(nrow=2)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Region of study",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
louv_places_detail %>%
group_by(cluster_name, Mobility) %>%
count(region_label) %>%
ggplot(aes(reorder(cluster_name, n), n, fill = region_label)) +
geom_col(alpha = 0.8, show.legend = TRUE, position = "dodge") +
facet_grid(Mobility ~ cluster_name, scales = "free") +
scale_y_continuous(breaks = integer_breaks())+
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = "bottom",
legend.title = element_text(size=10),
legend.text = element_text(size=8),
legend.box.background = element_rect(color="darkgrey", size=1)) +
guides(fill=guide_legend(nrow=2)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Region of study",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
louv_places_detail %>%
group_by(cluster_name, period_nbr) %>%
count() %>%
ggplot(aes(reorder(period_nbr, n), n, fill = period_nbr)) +
geom_col(alpha = 0.8, show.legend = TRUE) +
facet_wrap(~ cluster_name) +
scale_x_discrete(
name = NULL, breaks = NULL) +
scale_fill_discrete(
limits = c("DIAC", "SYNC"),
labels = c("Diachronic", "Synchronic")
) +
theme(legend.position = c(0.6, .2),
legend.title = element_text(size=12),
legend.text = element_text(size=10),
legend.box.background = element_rect(color="darkgrey", size=1)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Community detection (Louvain algorithm)",
fill = "Period of study",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
The above plots suggest a strong generational influence on the
formation of alumni networks. Indeed, all communities are dominated by
students who attended the same colleges during the same period. The Yale
and California communities present the largest proportions of such
“synchronic” curricula (over 20%). By contrast, Columbia, Harvard,
and Chicago attracted students from more diverse generations (synchronic
places represent less than 15%). The Pennsylvania and NYU-Princeton
communities present an intermediate situation.
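Percentages such as those quoted above are shares within each community, which the count-based bar plots do not show directly. A minimal way to obtain them, sketched in base R on invented data, is a row-wise `prop.table()`. `toy` is a hypothetical stand-in that reuses the column names `cluster_name` and `period_nbr`.

```r
# Hypothetical stand-in for louv_places_detail (illustrative values only)
toy <- data.frame(
  cluster_name = c(rep("A", 10), rep("B", 10)),
  period_nbr   = c(rep("SYNC", 8), rep("DIAC", 2), rep("SYNC", 9), rep("DIAC", 1))
)

counts <- table(toy$cluster_name, toy$period_nbr)   # places per community and period type
shares <- prop.table(counts, margin = 1)            # proportions within each community

round(100 * shares, 1)                              # percentages, one row per community
```

The same pattern applies to any of the categorical attributes compared in this section (nationality, mobility, field, degree).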
louv_places_detail %>%
group_by(cluster_name, period_group) %>%
count() %>%
ggplot(aes(reorder(period_group, n), n, fill = period_group)) +
geom_col(alpha = 0.8, show.legend = TRUE) +
facet_wrap(~ cluster_name) +
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = c(0.6, .2),
legend.title = element_text(size=12),
legend.text = element_text(size=10),
legend.box.background = element_rect(color="darkgrey", size=1)) +
guides(fill=guide_legend(ncol=2)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Period of study",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
A closer examination of the period of study reveals that younger
generations of students, especially post-WWI, dominated in the
Princeton-NYU and Chicago communities. By contrast, California,
Pennsylvania, and Yale-Vanderbilt include a significant proportion of
earlier generations (pre-1909). Pre-Boxer students, however, are
represented in all communities. Finally, Columbia and Harvard-MIT
present a “cascade” distribution, with decreasing membership as we move
backward in time.
The final plot combines the time scope with the exact period of study:
louv_places_detail$period_label <- factor(louv_places_detail$period_nbr, levels = c("DIAC", "SYNC"),
labels = c("Diachronic", "Synchronic"))
louv_places_detail %>%
group_by(cluster_name, period_label) %>%
count(period_group) %>%
ggplot(aes(reorder(cluster_name, n), n, fill = period_group)) +
geom_col(alpha = 0.8, show.legend = TRUE, position = "dodge") +
facet_grid(period_label ~ cluster_name, scales = "free") +
scale_y_continuous(breaks = integer_breaks())+
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = "bottom",
legend.title = element_text(size=10),
legend.text = element_text(size=8),
legend.box.background = element_rect(color="darkgrey", size=1)) +
guides(fill=guide_legend(nrow=2)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Period of study",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
Did academic communities specialize in particular fields of study?
louv_places_detail %>%
group_by(cluster_name, field_nbr) %>%
count() %>%
ggplot(aes(reorder(field_nbr, n), n, fill = field_nbr)) +
geom_col(alpha = 0.8, show.legend = TRUE) +
facet_wrap(~ cluster_name) +
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = c(0.6, .2),
legend.title = element_text(size=12),
legend.text = element_text(size=10),
legend.box.background = element_rect(color="darkgrey", size=1)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Academic specialization (range)",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
Single-field curricula dominated in all communities. California,
Princeton-NYU, and Yale-Vanderbilt presented a larger share of
multidisciplinary curricula (20%) than other communities. Let’s have
a closer look at their disciplinary profiles.
louv_places_detail %>%
group_by(cluster_name, field_group) %>%
count() %>%
ggplot(aes(reorder(field_group, n), n, fill = field_group)) +
geom_col(alpha = 0.8, show.legend = TRUE) +
facet_wrap(~ cluster_name) +
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = c(0.6, .17),
legend.title = element_text(size=8),
legend.text = element_text(size=6),
legend.box.background = element_rect(color="darkgrey", size=0.8)) +
guides(fill=guide_legend(ncol=4)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Field of study",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
Finally, the plots below combine the range of specialization with
the dominant field of study in each community:
louv_places_detail %>%
group_by(cluster_name, field_nbr) %>%
count(field_group) %>%
ggplot(aes(reorder(cluster_name, n), n, fill = field_group)) +
geom_col(alpha = 0.8, show.legend = TRUE, position = "dodge") +
facet_grid(field_nbr ~ cluster_name, scales = "free") +
scale_y_continuous(breaks = integer_breaks())+
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = "bottom",
legend.title = element_text(size=8),
legend.text = element_text(size=6),
legend.box.background = element_rect(color="darkgrey", size=0.8)) +
guides(fill=guide_legend(nrow=2)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Field of study",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
To sum up, single-field curricula present three possible
combinations:
Multifield curricula present five possible combinations, from the narrower to the wider range of choices:
Did the communities differ according to the academic degrees obtained by their members? Which communities presented the highest levels of qualification?
louv_places_detail %>%
group_by(cluster_name, LevelNbr) %>%
count() %>%
ggplot(aes(reorder(LevelNbr, n), n, fill = LevelNbr)) +
geom_col(alpha = 0.8, show.legend = TRUE) +
facet_wrap(~ cluster_name) +
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = c(0.6, .2),
legend.title = element_text(size=10),
legend.text = element_text(size=8),
legend.box.background = element_rect(color="darkgrey", size=1)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Level of qualification (range)",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
All communities include graduates with equivalent levels of
qualification. California, Yale, and Harvard-MIT present a wider range
of degrees (19-20% of multilevel curricula) than other communities,
especially Pennsylvania and Chicago (only 10%).
louv_places_detail %>%
group_by(cluster_name, degree_high) %>%
count() %>%
ggplot(aes(reorder(degree_high, n), n, fill = degree_high)) +
geom_col(alpha = 0.8, show.legend = TRUE) +
facet_wrap(~ cluster_name) +
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = c(0.6, .15),
legend.title = element_text(size=10),
legend.text = element_text(size=8),
legend.box.background = element_rect(color="darkgrey", size=1)) +
guides(fill=guide_legend(ncol=2)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Highest degree",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
Doctors dominate in the California and Chicago communities. By
contrast, lower degrees (bachelor) dominate in the Harvard-MIT and
Princeton-NYU communities, which also include a larger share of certified
engineers and special degrees. Master graduates dominate in the
Yale-Vanderbilt community. Finally, the Columbia and Pennsylvania
communities present a “cascade” distribution decreasing from master to
doctorate, and eventually, bachelor degrees.
The final plot combines the range of qualification with the highest degrees obtained in each community:
louv_places_detail %>%
group_by(cluster_name, LevelNbr) %>%
count(degree_high) %>%
ggplot(aes(reorder(cluster_name, n), n, fill = degree_high)) +
geom_col(alpha = 0.8, show.legend = TRUE, position = "dodge") +
facet_grid(LevelNbr ~ cluster_name, scales = "free") +
scale_y_continuous(breaks = integer_breaks())+
scale_x_discrete(
name = NULL, breaks = NULL) +
theme(legend.position = "bottom",
legend.title = element_text(size=10),
legend.text = element_text(size=8),
legend.box.background = element_rect(color="darkgrey", size=1)) +
guides(fill=guide_legend(nrow=2)) +
labs(y = "Number of places", x = NULL) +
labs(title = "American University Men of China: a place-based network analysis",
subtitle = "Alumni communities (Louvain)",
fill = "Highest degree",
caption = "Based on data extracted from the roster of the American University Club of China, 1936")
From this exercise in community detection using the Louvain algorithm, we can define three types of academic communities whose particular structures are shaped by their members’ curricula:
Similarly, we can identify three types of college-based communities based on their structure and membership:
Community detection is not the end point. One can choose to focus on a specific community to examine its particular structure and membership in more detail. The output of community detection can also be used for dividing the whole population into more coherent sub-groups. It can serve as a starting point for further analyses, such as sequence analysis or the analysis of professional affiliation networks. Alternatively, community membership can be treated as a new attribute and used as a supplementary variable in multidimensional analyses of places, students and colleges. Finally, the communities identified can help to better contextualize individual stories.
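As a pointer for such follow-up work, the sketch below shows one way to store Louvain membership as a vertex attribute and to isolate a single community as its own graph. It runs on a toy network of two cliques joined by one edge. The attribute name `community` and the toy graph are assumptions made for illustration.

```r
library(igraph)

# Toy network: two 5-node cliques joined by a single bridging edge
g <- disjoint_union(make_full_graph(5), make_full_graph(5))
g <- add_edges(g, c(1, 6))

lv <- cluster_louvain(g)                       # Louvain community detection
V(g)$community <- as.vector(membership(lv))    # membership stored as a vertex attribute

table(V(g)$community)                          # size of each community

# Extract the first community as a separate graph for closer inspection
first <- induced_subgraph(g, vids = which(V(g)$community == 1))
vcount(first)
```

Once stored as an attribute, membership can be exported with the other vertex attributes and joined to the tables of places, students or colleges.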
In the next tutorial, we will see how we can filter our population of American University Men in order to trace the formation and evolution of their academic trajectories over time.