Abstract
This is a multidirectional exploratory study of Shanghai industrialists in the Shenbao. The first installment is based on the extraction of documents using the two common terms that designated “industrialists” in the Chinese press (工業家, 實業家).
The two terms, 工業家 and 實業家, were the common designations for “industrialists” in the Chinese press in the Republican period. Yet, as we shall see, both terms had appeared at the end of the empire, in 1892 for 實業家 and in 1900 for 工業家. While 工業家 is an unambiguous term for “industrialist”, 實業家 can refer both to “entrepreneur” (in other sectors, including banking) and to “industrialist”. My purpose is to extract all the texts that contain either of these terms and to extract all the named entities (person, organization, location) mentioned in these texts. The second stage of this survey is to link these entities to the events to which these terms may be related.
Loading the required libraries
The enpchina library is the gateway to the Shenbao corpus. It provides all the functions needed to explore the corpus. The other libraries serve different purposes. The dplyr, stringr and tidyr libraries serve to manipulate data (order, arrange, clean, select, filter, etc.). The lubridate library serves to adapt the format of dates. The ggplot2 library serves for visualization. The tidygraph and igraph libraries serve to produce graph visualizations of the data. You can simplify the loading of libraries by loading tidyverse, a suite that includes dplyr, stringr, ggplot2 and tidyr (and more). The tidytext library serves for text mining and for the conversion of texts to tidy format in R.
library(enpchina)
library(dplyr)
library(stringr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(tidygraph)
library(igraph)
List all available corpora: this is a control operation that lists all the corpora that can be explored with the enpchina package. If the list of corpora appears in the console below, it confirms that the enpchina package works correctly.
enpchina::list_corpora()
Search the terms in the corpus, using the search_documents() function. This will retrieve all the documents in which the terms appear:
Ind_searchall <- search_documents('"工業家"|"實業家"|"工業實業家"', "shunpao")
Results: the search yielded 3,342 observations. Yet it does not show the contribution of each term.
Search by individual entries to get a view of the results for each term. The syntax is to create a named object (e.g., Ind_search1) that holds the result of querying the ‘“工業家”’ expression in the Shenbao corpus (“shunpao”). To search for expressions, rather than individual characters (or words in English), the target expression must be enclosed in double quote marks, themselves wrapped in single quote marks. Single quote marks should not be used alone. Double quote marks used alone will produce results for each of the characters (e.g., for 工, for 業, and for 家).
Ind_search1 <- search_documents('"工業家"', "shunpao")
Ind_search2 <- search_documents('"實業家"', "shunpao")
Ind_search3 <- search_documents('"工業實業家"', "shunpao")
Only the first two terms produced results. 工業家 refers very explicitly to “industrialist” with no ambiguity. While the term 實業 alone usually designates “industry”, as in 實業部, the term 實業家 has a wider scope that includes the broader notion of “entrepreneur”.
Results: 工業家 yielded 523 observations with a time span covering 1900-1949. 實業家 yielded 2,848 observations with a longer time span covering 1892-1949. The total number of observations (3,371) is greater than the number obtained from the combined search of all three terms (3,342), presumably because articles that contain both terms are counted once in the combined search and twice in the separate searches.
Although we plan to analyze the two results files separately, I produce a file that amalgamates the two results files above.
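The difference between the sum of the separate searches and the combined search can be checked by looking for documents that appear in both result sets. The following is a minimal sketch on invented toy tables, assuming only that each result table carries an Id column as in the search_documents() output; the Ids are made up.

```r
library(dplyr)

# Toy stand-ins for two result tables (invented Ids, not real Shenbao documents)
search_a <- tibble(Id = c("d1", "d2", "d3"))
search_b <- tibble(Id = c("d3", "d4", "d5", "d6"))

# Documents returned by both queries
both <- intersect(search_a$Id, search_b$Id)

# Sum of the separate counts versus the number of distinct documents
n_sum  <- nrow(search_a) + nrow(search_b)
n_docs <- n_distinct(c(search_a$Id, search_b$Id))

# The gap between the two figures equals the number of shared documents
gap <- n_sum - n_docs
gap == length(both)
```

Applied to Ind_search1 and Ind_search2, the same comparison would show how many articles mention both terms.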
Ind_bind <- bind_rows(Ind_search1, Ind_search2)
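Once the two tables are bound, the origin of each row is no longer visible. One possible variant, sketched here on invented toy tables, uses the .id argument of bind_rows() to record the source table. The labels gyj and syj and the column name Term are illustrative choices, not part of the original script.

```r
library(dplyr)

# Toy result tables (invented content)
res1 <- tibble(Id = c("d1", "d2"), Title = c("t1", "t2"))
res2 <- tibble(Id = c("d2", "d3"), Title = c("t2", "t3"))

# Bind the rows and keep the name of the source table in a new "Term" column
combined <- bind_rows(list(gyj = res1, syj = res2), .id = "Term")

# Rows per source, and number of distinct documents after binding
rows_per_term <- combined %>% count(Term)
n_unique_docs <- combined %>% distinct(Id) %>% nrow()
```

A document found by both searches then appears twice, once under each label, which makes the overlap easy to spot.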
We are interested in displaying all the occurrences of 工業家 or 實業家 in the corpus. For this purpose we use the search_concordance() function, which places the searched terms in the context of the sentence in which they appear. The syntax is slightly different: one needs to make the corpus explicit (corpus = “shunpao”). The context_size argument serves to determine the number of characters (words in English) around the queried expression or word (e.g., here 120 means 60 characters before and 60 characters after the queried expression).
Ind_conc1 <- search_concordance('"工業家"', corpus = "shunpao", context_size = 120)
Ind_conc2 <- search_concordance('"實業家"', corpus = "shunpao", context_size = 120)
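To make the idea of a context window concrete, here is a small sketch with stringr on an invented sentence. It only illustrates the principle of keeping a fixed number of characters on each side of a term. It is not how search_concordance() is implemented, and the variable half_window is a hypothetical name.

```r
library(stringr)

# Invented sentence for illustration only (not taken from the corpus)
text <- "上海著名工業家某某昨日抵滬"
term <- "工業家"
half_window <- 2  # characters kept on each side of the term

# Locate the first occurrence of the term (character positions)
pos <- str_locate(text, fixed(term))

# Cut the window, clamping at the boundaries of the text
context <- str_sub(
  text,
  start = max(1, pos[1, "start"] - half_window),
  end   = min(str_length(text), pos[1, "end"] + half_window)
)
context
```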
We also create a single file that binds all concordance files
Ind_conc <- bind_rows(Ind_conc1, Ind_conc2)
We count the number of articles on each term per year and plot the results as a histogram for 工業家. The count_documents function counts and sums up the number of documents that contain 工業家; “mutate” does two operations in succession: first it transforms the initial Date (19441206) into a computer-readable format (1944-12-06), second it adds an extra variable (Year column) with only the year; group_by and summarise sum up the counts by year; “filter” is used to select only the 1892-1949 period; “ggplot” turns the data into a histogram. The other elements are elements of text that can be changed freely (anything between quotation marks inside labs()).
# Count the documents mentioning 工業家, aggregate by year, and plot
enpchina::count_documents('"工業家"', "shunpao") %>%
  # ymd() parses dates written as 19441206 (four-digit year, month, day)
  mutate(Date = lubridate::ymd(Date)) %>%
  mutate(Year = year(Date)) %>%
  group_by(Year) %>% summarise(N = sum(N)) %>%
  filter(Year >= 1892 & Year <= 1949) %>%
  ggplot(aes(Year, N)) + geom_col() +
  labs(title = "工業家 in the Shenbao",
       subtitle = "Number of articles mentioning 工業家",
       x = "Year",
       y = "Number of articles")
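The date conversion and yearly aggregation can be tried on a toy table before running them on the corpus. This is a minimal sketch with invented counts, assuming that dates arrive as eight-digit strings such as 19441206 and that the count column is called N, as in the pipeline above.

```r
library(dplyr)
library(lubridate)

# Invented daily counts for illustration
daily <- tibble(
  Date = c("19001105", "19001220", "19441206"),
  N    = c(1, 2, 4)
)

yearly <- daily %>%
  mutate(Date = ymd(Date),      # "19441206" becomes the date 1944-12-06
         Year = year(Date)) %>% # keep only the year
  group_by(Year) %>%
  summarise(N = sum(N))         # one row per year

yearly
```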
We count the number of articles on each term per year and plot the results as a histogram for 實業家.
# Count the documents mentioning 實業家, aggregate by year, and plot
enpchina::count_documents('"實業家"', "shunpao") %>%
  # ymd() parses dates written as 19441206 (four-digit year, month, day)
  mutate(Date = lubridate::ymd(Date)) %>%
  mutate(Year = year(Date)) %>%
  group_by(Year) %>% summarise(N = sum(N)) %>%
  filter(Year >= 1892 & Year <= 1949) %>%
  ggplot(aes(Year, N)) + geom_col() +
  labs(title = "實業家 in the Shenbao",
       subtitle = "Number of articles mentioning 實業家",
       x = "Year",
       y = "Number of articles")
We want to retrieve the full text of articles for each term using the get_documents() function. This will collect the same metadata as in the search_documents() function, with an additional variable that will contain the full text of the articles.
gyj_full_text <- enpchina::get_documents(Ind_search1, "shunpao")
syj_full_text <- enpchina::get_documents(Ind_search2, "shunpao")
We rename the DocID column to Id in the full text files to enable a join by Id. This is a temporary issue because the name of the ID column is different in search_documents() and get_documents(). We shall correct this.
gyj_full_text <- gyj_full_text %>% rename(Id = DocID)
syj_full_text <- syj_full_text %>% rename(Id = DocID)
We join the concordance results file and the full text file. This creates a file with one row per occurrence (concordance), so the number of rows exceeds the number of documents. Yet this is meant to facilitate the quick visual identification of the searched terms in the full text of each article.
gyj_concfull <- inner_join(gyj_full_text, Ind_conc1, by = "Id")
syj_concfull <- inner_join(syj_full_text, Ind_conc2, by = "Id")
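The reason why the joined file has more rows than there are documents can be seen on a toy example. The sketch below uses invented tables; the column names Text and Context are hypothetical stand-ins for the full text and the concordance line.

```r
library(dplyr)

# Two documents, one of which contains the term twice (invented content)
full_text <- tibble(
  Id   = c("d1", "d2"),
  Text = c("full text of d1", "full text of d2")
)
conc <- tibble(
  Id      = c("d1", "d1", "d2"),
  Context = c("hit 1", "hit 2", "hit 3")
)

# Each concordance line is paired with the full text of its document
joined <- inner_join(full_text, conc, by = "Id")
nrow(joined)
```

The full text of d1 is repeated on two rows, one for each occurrence of the term.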
We save all the results files to be stored in the Data folder or for export.
write_csv(gyj_full_text, "gyj_full_text.csv")
write_csv(syj_full_text, "syj_full_text.csv")
write_csv(Ind_conc1, "Ind_conc1.csv")
write_csv(Ind_conc2, "Ind_conc2.csv")
write_csv(gyj_concfull, "gyj_concfull.csv")
write_csv(syj_concfull, "syj_concfull.csv")
We perform NER (Named Entity Recognition) to extract the names of persons, organizations, and places related to the two searched terms. We use the initial results files from the get_documents() query. We use the name of the results file (e.g., gyj_full_text) as a base to run NER directly in the Shenbao corpus.
gyj_ner_results <- enpchina::ner_on_corpus(gyj_full_text, "shunpao")
syj_ner_results <- enpchina::ner_on_corpus(syj_full_text, "shunpao")
NER extracts 2,531 entities for 工業家 and 16,204 entities for 實業家, distributed in 4 categories. One difficulty with the NER output for the Shenbao is that the entity types come with a confidence index, which makes grouping by type impossible. We need to generate a new Type2 variable by extracting the basic type name of each entity. To achieve this, we use the str_extract() function (stringr).
gyj_ner_results <- gyj_ner_results %>% mutate(Type2 = str_extract(Type, "GPE|PER|ORG|LOC"))
syj_ner_results <- syj_ner_results %>% mutate(Type2 = str_extract(Type, "GPE|PER|ORG|LOC"))
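The effect of str_extract() can be checked on a few invented labels. The exact form of the Type values returned by the NER function is not shown here, so the "PER:0.93" layout below is only an assumption made for illustration; the regular expression is the one used in the script.

```r
library(stringr)

# Hypothetical raw labels: a type name followed by a confidence score
type_raw <- c("PER:0.93", "ORG:0.71", "GPE:0.88", "LOC:0.65", "MISC:0.40")

# Keep only the first of the four type names found in each label
type_clean <- str_extract(type_raw, "GPE|PER|ORG|LOC")
type_clean
```

Labels that contain none of the four type names yield NA, which is worth checking before grouping by Type2.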
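We count the results by type of entity.
# Store the counts first, then open them in the viewer
# (piping into View would leave the stored object empty)
gyj_ner_count <- gyj_ner_results %>%
  group_by(Type2) %>%
  count()
View(gyj_ner_count)
syj_ner_count <- syj_ner_results %>%
  group_by(Type2) %>%
  count()
View(syj_ner_count)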
We count the number of unique entities, without duplicates.
gyj_ner_count_uniq <- gyj_ner_results %>%
distinct(Type2, Text) %>%
group_by(Type2) %>%
count() %>% rename(nuq = "n")
syj_ner_count_uniq <- syj_ner_results %>%
distinct(Type2, Text) %>%
group_by(Type2) %>%
count() %>% rename(nuq = "n")
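The total and unique counts can also be placed side by side in one table. This is a small sketch on invented entities, reusing the column names Type2, Text and nuq from the script.

```r
library(dplyr)

# Invented NER output: three person mentions (two names), two mentions of one organization
ner <- tibble(
  Type2 = c("PER", "PER", "PER", "ORG", "ORG"),
  Text  = c("A", "A", "B", "X", "X")
)

# All mentions per type
total <- ner %>% count(Type2)

# Distinct entities per type
uniq <- ner %>% distinct(Type2, Text) %>% count(Type2, name = "nuq")

# One row per type with both figures
summary_tbl <- left_join(total, uniq, by = "Type2")
summary_tbl
```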
In this section, we shall build a series of networks, both two-mode networks and one-mode networks.
We start with a two-mode network linking persons and documents, and we project this network in Padagraph (a graph visualization tool outside of R).
Graph 1a - First we select only the persons in the list of named entities for 工業家. We use the “filter” operator to select only “PER” in the Type2 variable.
gyj_pers <- gyj_ner_results %>% filter(Type2== "PER")
gyj_pers_uniq <- gyj_pers %>% distinct(Type2, Text)
We retain only the persons who appeared at least twice. This is an arbitrary choice to eliminate persons of lesser importance in the current analysis. We use the “filter” operator to select the persons who are mentioned more than once (n > 1).
gyj_pers_top2 <- gyj_pers %>% group_by(Type2, Text) %>%
count() %>%
filter(n>1)
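The threshold can be tested on a toy list of names before applying it to the full results. The sketch below uses invented names; sort = TRUE is an optional addition that puts the most frequently mentioned persons first.

```r
library(dplyr)

# Invented person mentions
pers <- tibble(Text = c("A", "A", "A", "B", "C", "C"))

# Count the mentions per person and keep those mentioned more than once
top2 <- pers %>%
  count(Text, sort = TRUE) %>%
  filter(n > 1)

top2
```

Raising the threshold (for example n > 2) is a simple way to see how sensitive the selection is to this arbitrary cut-off.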
We prepare an edge list linking Person(s) with Documents. We select only the persons with the “filter” operator and we select only two variables: DocID (Document ID) and Text (Person).
persondata <- gyj_ner_results %>%
filter(Type2 == "PER") %>%
select(DocID, Text)
We prepare the edges and nodes of the network and create the two-mode network Persons - Documents with igraph and tidygraph.
edges <- persondata %>% transmute(from=DocID, to=Text)
ig <- graph_from_data_frame(d=edges, vertices=NULL, directed = FALSE)
tg <- tidygraph::as_tbl_graph(ig)
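In a two-mode network it is often useful to record which nodes are documents and which are persons. The following sketch, on an invented edge list, adds a logical type attribute to the nodes with igraph. The attribute is an optional addition and is not part of the original script.

```r
library(dplyr)
library(igraph)

# Invented edge list: documents in "from", persons in "to"
edges <- tibble(
  from = c("doc1", "doc1", "doc2"),
  to   = c("A", "B", "B")
)

g <- graph_from_data_frame(d = edges, directed = FALSE)

# Mark the mode of each node: TRUE for persons, FALSE for documents
V(g)$type <- V(g)$name %in% edges$to

n_nodes   <- vcount(g)
n_edges   <- ecount(g)
n_persons <- sum(V(g)$type)
```

This marking relies on document identifiers and person names never being identical strings, which should be verified on real data.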
We project the two-mode network Persons - Documents based on igraph and tidygraph into Padagraph
tg %N>% mutate(label=name) %>%
enpchina::in_padagraph("gyj-PerDoc")
Graph 1a is accessible here : https://pdg.enpchina.eu/rstudio?gid=gyj-PerDoc
Since these are the same operations and script as above, we do not provide the same level of detailed explanation here.
Graph 1b - We select the persons only in the list of named entities for 實業家
syj_pers <- syj_ner_results %>% filter(Type2== "PER")
syj_pers_uniq <- syj_pers %>% distinct(Type2, Text)
We retain only the persons who appeared at least twice
syj_pers_top2 <- syj_pers %>% group_by(Type2, Text) %>%
count() %>%
filter(n>1)
We prepare an edge list linking Person(s) with documents
persondata2 <- syj_ner_results %>%
filter(Type2 == "PER") %>%
select(DocID, Text)
We prepare the edges and nodes of the network and create the two-mode network Persons - Documents with igraph and tidygraph.
edges <- persondata2 %>% transmute(from=DocID, to=Text)
ig <- graph_from_data_frame(d=edges, vertices=NULL, directed = FALSE)
tg <- tidygraph::as_tbl_graph(ig)
We project the two-mode network Persons-Documents based on igraph and tidygraph into Padagraph
tg %N>% mutate(label=name) %>%
enpchina::in_padagraph("syj-PerDoc")
Graph 1b is accessible here : https://pdg.enpchina.eu/rstudio?gid=syj-PerDoc
We build here a different network to link each person to all the persons that are mentioned in the same articles.
Graph 2a - We build a one-mode network linking persons to persons using Padagraph for 工業家. We create the list of documents in which persons appear, based on the count criterion above. The operation below matches (a form of join) the persons in the Text column against the selected list of persons. We use the same edge list as above (persondata) and we create a new edge list (small_pers_data) by keeping only the names of persons (Text) present in the list of persons who appear at least twice, as selected above (gyj_pers_top2).
small_pers_data <- persondata %>%
filter(Text %in% gyj_pers_top2$Text)
We create an edge list in the form of a table with “from” and “to” designating persons. Here inner_join joins the table with itself through DocID, which creates a link for each pair of persons mentioned in the same document. The filter keeps each pair only once and drops self-pairs, and distinct eliminates the pairs repeated across documents.
edges_Pers <- small_pers_data %>%
inner_join(small_pers_data, by = "DocID") %>%
filter(Text.x < Text.y) %>%
transmute(from=Text.x, to=Text.y) %>%
distinct()
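The self-join can be followed step by step on a toy table. The sketch below uses invented mentions. As a variation on the script, it counts the number of shared documents per pair instead of calling distinct(), which yields an edge weight; the column name weight is a hypothetical choice.

```r
library(dplyr)

# Invented mentions: A, B and C share document d1; A and B also share d2
mentions <- tibble(
  DocID = c("d1", "d1", "d1", "d2", "d2"),
  Text  = c("A", "B", "C", "A", "B")
)

# Pair every person with every other person of the same document.
# Recent dplyr versions may warn about a many-to-many relationship,
# which is expected for a self-join of this kind.
pairs <- mentions %>%
  inner_join(mentions, by = "DocID") %>%
  filter(Text.x < Text.y) %>%           # keep each pair once, drop self-pairs
  transmute(from = Text.x, to = Text.y)

# Number of documents in which each pair appears together
weighted <- pairs %>% count(from, to, name = "weight")
weighted
```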
We create a one-mode network Person-to-person with igraph and tidygraph.
edgesTest <- edges_Pers %>% transmute(from=from, to=to)
ig2 <- graph_from_data_frame(d=edgesTest, vertices=NULL, directed = FALSE)
tg2 <- tidygraph::as_tbl_graph(ig2)
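Before sending the network to Padagraph, a first reading can be obtained in R by ranking the persons by their number of connections. This is a minimal sketch on an invented edge list, using centrality_degree() from tidygraph; it is an optional step that the original script does not include.

```r
library(dplyr)
library(igraph)
library(tidygraph)

# Invented person-to-person edge list
edges <- tibble(
  from = c("A", "A", "A", "B"),
  to   = c("B", "C", "D", "C")
)

g  <- graph_from_data_frame(d = edges, directed = FALSE)
tg <- as_tbl_graph(g)

# Compute the degree of each node and return the node table, highest degree first
node_degree <- tg %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree()) %>%
  as_tibble() %>%
  arrange(desc(degree))

node_degree
```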
We project the one-mode network Person-Person based on igraph and tidygraph into Padagraph
tg2 %N>% mutate(label=name) %>%
enpchina::in_padagraph("gyj-PerPer")
Graph 2a is accessible here : https://pdg.enpchina.eu/rstudio?gid=gyj-PerPer
We repeat the process for 實業家. We do not provide the same level of detail on the script, as it is the same as above for the one-mode network of 工業家.
Graph 2b - We want to build a one-mode network linking persons to persons using Padagraph for 實業家. We create the list of documents in which persons appear, based on the count criterion above. The operation below matches (a form of join) the persons in the Text column against the selected list of persons.
small_pers_data2 <- persondata2 %>%
filter(Text %in% syj_pers_top2$Text)
We create an edge list in the form of a table with “from” and “to” designating persons. Here inner_join joins the table with itself through DocID, which creates a link for each pair of persons mentioned in the same document. The filter keeps each pair only once and drops self-pairs, and distinct eliminates the pairs repeated across documents.
edges_Pers2 <- small_pers_data2 %>%
inner_join(small_pers_data2, by = "DocID") %>%
filter(Text.x < Text.y) %>%
transmute(from=Text.x, to=Text.y) %>%
distinct()
We create a one-mode network Person-to-person with igraph and tidygraph
edgesTest2 <- edges_Pers2 %>% transmute(from=from, to=to)
ig2 <- graph_from_data_frame(d=edgesTest2, vertices=NULL, directed = FALSE)
tg2 <- tidygraph::as_tbl_graph(ig2)
We project the one-mode network Person-Person based on igraph and tidygraph into Padagraph
tg2 %N>% mutate(label=name) %>%
enpchina::in_padagraph("syj-PerPer")
Graph 2b is accessible here : https://pdg.enpchina.eu/rstudio?gid=syj-PerPer
Graph 3 - We build a two-mode network linking persons and organizations using Padagraph. The approach is the same as for the Person-Documents two-mode network, but here we do not have the two categories (Person, Organization) in two distinct variables. We need to create a new variable to separate Person and organization from the Text variable.
Graph 3a - First, we select Persons only in the list of named entities for 工業家. This creates a node list for Persons.
gyj_pers <- gyj_ner_results %>% filter(Type2== "PER")
gyj_pers_uniq <- gyj_pers %>% distinct(Type2, Text)
We retain only the persons who appeared at least twice.
gyj_pers_top2 <- gyj_pers %>% group_by(Type2, Text) %>%
count() %>%
filter(n>1)
Second, we select Organizations only in the list of named entities for 工業家. This creates a node list for Organizations.
gyj_org <- gyj_ner_results %>% filter(Type2== "ORG")
gyj_org_uniq <- gyj_org %>% distinct(Type2, Text)
Third, we create a file that binds the results for Persons and Organizations. This creates the whole node list (name of person or organization and type).
gyj_data <- bind_rows(gyj_pers, gyj_org)
We create an edge list in the form of a table with “from” and “to” designating persons and organizations for 工業家. Here the inner_join operator joins the table with itself through DocID, which creates a link for each pair of entities mentioned in the same document. The filter keeps each pair only once and drops self-pairs, and distinct eliminates the pairs repeated across documents.
edges_OrgPer <- gyj_data %>%
inner_join(gyj_data, by = "DocID") %>%
filter(Text.x < Text.y) %>%
transmute(from=Text.x, to=Text.y) %>%
distinct()
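Because the table of persons and organizations is joined with itself, the resulting edge list pairs every entity with every other entity of the same document, so it will generally also contain person-person and organization-organization pairs. If a strictly two-mode edge list is wanted, one possible variant is to join a table of persons with a table of organizations. The sketch below uses invented entities; the column names person and organization are hypothetical.

```r
library(dplyr)

# Invented NER output for two documents
ents <- tibble(
  DocID = c("d1", "d1", "d1", "d2", "d2"),
  Type2 = c("PER", "PER", "ORG", "PER", "ORG"),
  Text  = c("A", "B", "X", "A", "Y")
)

persons <- ents %>% filter(Type2 == "PER") %>% select(DocID, person = Text)
orgs    <- ents %>% filter(Type2 == "ORG") %>% select(DocID, organization = Text)

# Every edge now links one person to one organization of the same document
edges_po <- inner_join(persons, orgs, by = "DocID") %>%
  transmute(from = person, to = organization) %>%
  distinct()

edges_po
```

With this layout the "from" column holds only persons and the "to" column only organizations, which keeps the two modes separate when the graph is built.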
Two-mode network - Person to Organization for 工業家. We create the two-mode network Person-to-Organization with igraph and tidygraph.
edgesTest3 <- edges_OrgPer %>% transmute(from=from, to=to)
ig2 <- graph_from_data_frame(d=edgesTest3, vertices=NULL, directed = FALSE)
tg2 <- tidygraph::as_tbl_graph(ig2)
We project the two-mode network Person-Organization based on igraph and tidygraph into Padagraph
tg2 %N>% mutate(label=name) %>%
enpchina::in_padagraph("gyj-PerOrg")
Graph 3a is accessible here : https://pdg.enpchina.eu/rstudio?gid=gyj-PerOrg
Graph 3b - We repeat the process above for 實業家. We provide only the basic script, as all the explanations are to be found in the section of the script above (Person-to-Organization for 工業家).
First, we select Persons only in the list of named entities for 實業家.
syj_pers <- syj_ner_results %>% filter(Type2== "PER")
syj_pers_uniq <- syj_pers %>% distinct(Type2, Text)
We retain only the persons who appeared at least twice (not used below; we keep all).
syj_pers_top2 <- syj_pers %>% group_by(Type2, Text) %>%
count() %>%
filter(n>1)
Second we select Organizations only in the list of named entities for 實業家
syj_org <- syj_ner_results %>% filter(Type2== "ORG")
syj_org_uniq <- syj_org %>% distinct(Type2, Text)
We retain only the organizations that appeared at least twice (not used below to build the graph; we keep all).
syj_org_top2 <- syj_org %>% group_by(Type2, Text) %>%
count() %>%
filter(n>1)
Third, we create a file that binds the results for Persons and Organizations by binding rows.
syj_data <- bind_rows(syj_pers, syj_org)
We create an edge list in the form of a table with “from” and “to” designating persons and organizations for 實業家. Here inner_join joins the table with itself through DocID, which creates a link for each pair of entities mentioned in the same document. The filter keeps each pair only once and drops self-pairs, and distinct eliminates the pairs repeated across documents.
edges_OrgPer2 <- syj_data %>%
inner_join(syj_data, by = "DocID") %>%
filter(Text.x < Text.y) %>%
transmute(from=Text.x, to=Text.y) %>%
distinct()
Two-mode network - Person to Organization for 實業家. We create the two-mode network Person-to-Organization with igraph and tidygraph.
edgesTest4 <- edges_OrgPer2 %>% transmute(from=from, to=to)
ig2 <- graph_from_data_frame(d=edgesTest4, vertices=NULL, directed = FALSE)
tg2 <- tidygraph::as_tbl_graph(ig2)
We project the two-mode network Person-Organization based on igraph and tidygraph into Padagraph
tg2 %N>% mutate(label=name) %>%
enpchina::in_padagraph("syj-PerOrg")
Graph 3b is accessible here : https://pdg.enpchina.eu/rstudio?gid=syj-PerOrg