Case 1

The purpose here is to delete all the white spaces in the file (head tails and end tails). This is something that the eye can hardly spot, especially white spaces in end tails, whereas this can create unwelcome duplicates of the same data in a spreadsheet (very bad in pivot tables, GIS, SNA, etc.).

This is the original file with just one column containing the surname and first name with white spaces (head tail/end tail)

Example

Original Desired Output
 Young, Robt. Heyden Young, Robt. Heyden
Young, S. C.  Young, S. C.
 Young, William Alexander Young, William Alexander
 龔厂樵 龔厂樵
龔去非  龔去非


Upload case file 1

Case1 <- read_delim("Case1.csv", ";", escape_double = FALSE, trim_ws = FALSE)
Case1


We propose two solutions.


Script 1

The first solution is the automatic removal of white spaces at the time of import: import the csv file and tick the box “trim spaces”

Case1a <- read_delim("Case1.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case1a


Script 2

The second solution consists in using the “trim_ws” functions. We create three distinct columns to check the result for each parameter (left/right/both).

Case1b <- read_delim("Case1.csv", ";", escape_double = FALSE, trim_ws = FALSE)
Case1b$left_strip <- trimws(Case1b$Original, which = c("left"))
Case1b$right_strip <- trimws(Case1b$Original, which = c("right"))
Case1b$both_strip <- trimws(Case1b$Original, which = c("both"))
Case1b


Case 2

We want to split the last name and first name when the two terms are separated by a comma (or any other separator). This is very handy when one needs to enter distinct data into a database or to do an analysis of each component. Such separation can apply to other types of data.

Example:

Original Desired Output
Young, Robt. Heyden Young Robt. Heyden
Young, S. C. Young S. C.
Young, William Alexander Young William Alexander
Young, William Stewart Young William Stewart


Upload case file 2

This is the original file with just one column containing the last name and first name separated by a comma (or any other separator).


We propose two solutions.


Script 1

The first solution consists in erasing the string of characters before (extract First name) and after (to extract Last Name) the comma.

We proceed in two steps:

  1. We create two columns named FirstName and LastName that will contain the final results
  2. We use the “str_replace” function to extract the first name and the last name by deleting in each case the information before and after the comma that separates first name and last name. The first parameter is the column that we want to use as a base (Original). The second parameter is the regular expression that we use to catch the defined pattern in the Original (“[^,]+,” or ",.*“). The third parameter is what will come instead of that captured pattern, which is empty (”")
Case2a <- Case2 %>% mutate(Original,
                           FirstName=str_replace(Original,"[^,]+, ",""),
                           LastName=str_replace(Original,",.*",""))
Case2a


Script 2

The second solution consists in extracting the string of characters before (to extract last name) and after (to extract first name).

We propose to proceed in three steps:

  1. We create a column named FirstName and LastName that contain intermediary results
  2. We use the “str_extract” function to extract the first name and the last name. The first parameter is the column that we want to use as a base (Original). The second parameter is the regular expression that we want to extract in the Original (“[^,]+,” or ",.*")
  3. We remove ‘,’ or ‘,’ from the intermediary columns and we create two columns “LastName2” and “FirstName2” that contain the final results.
Case2b <- Case2 %>% 
  mutate(Original, 
         LastName=str_extract(Original,"[^,]+,"), 
         FirstName=str_extract(Original,",.*")) %>% 
  mutate(LastName2 = str_remove_all(LastName,",")) %>% 
  mutate(FirstName2 = str_remove_all(FirstName,", "))
Case2b


Case 3

This is a second example of splitting up data: Surname and Given Name separated by a slash (but no white space).

Example

Original Desired Output
Young/Robt. Heyden Young Robt. Heyden
Young/S. C. Young S. C.
Young/William Alexander Young William Alexander
Young/William Stewart Young William Stewart


Upload case file 3

This is the original file with just one column with Surname and Given name separated by a slash sign.


Script

The solution consists in erasing the string of characters before (extract First name) and after (to extract Last Name) the slash sign.


Case 4

This is a third example of splitting up data: Split Surname and Given Name separated by a white space

Example

Original Desired Output
Adachi Tsunayuki Adachi Tsunayuki
Aibara Kuragoro Aibara Kuragoro
Aida Isuneo Aida Isuneo
Akabane Shiro Akabane Shiro
Akamatsu Noriyoshi Akamatsu Noriyoshi


Upload case file 4

This is the original file with just one column containing the surname and the first name separated by a white space.


Script

The solution we propose consists in erasing the string of characters before (extract First name) and after (to extract Last Name) the white space using the function “str_replace”.

Case4a <- Case4 %>% mutate(
  FirstName=str_replace(Original,"[^ ]+ ",""),
  LastName=str_replace(Original," .*",""))
Case4a


Case 5

This is a trickier example of splitting up data: Split Surname and Given Names separated by repeated white spaces

Example

Original Desired Output
Baron Alexandre de   Forth-Rouen de Forth-Rouen Alexandre Baron
Jules   Berthemy Berthemy Jules
Comte Charles   de Lallemand de Lallemand Charles Comte
Comte Julien   de Rochechouart de Rochechouart Julien Comte
Vicomte   Brenier de Montmorand de Montmorand Brenier Vicomte
Albert   Bourée Bourée Albert


Upload case file 5

This is the original file with just one column containing the surname, the first name and the title (if any)


Script

We propose to proceed in three steps:

  1. First we separate the title and the full name
  2. Next we extract the given name and the surname
  3. Last we use the variable given name to extract the surname from the Original variable.


Step 1

Case5c <- Case5 %>% mutate(Original2 = str_remove_all(Original,"(Baron|Comte|Vicomte) "))


Step 2

Case5d <- Case5c %>% mutate(FirstName = str_extract(Original2,"[^ ]+"))


Step 3

Case5e <- Case5d %>% mutate(LastName = str_remove_all(Original2, FirstName)) %>% 
  mutate(LastName = trimws(LastName, which = c("left")))


Case 6

Split Chinese Surname and Given Names: this case is specific to Chinese names, first because we often handle Chinese characters; second because there is never any separation/separator between Surname and Given name in Chinese; third because Chinese names can vary from 2 to 4 characters, with a variety of combination when the Surname has more than one character.

Example

Original Desired Output
龔厂樵 厂樵
龔去非 去非
龔同元 同元
龔國富 國富
龔年
龔懷
歐陽祖經 歐陽 祖經
歐陽竟無 歐陽 竟無
歐陽遵詮 歐陽 遵詮
歐陽遵 歐陽 文煥
歐游阿 歐游 阿菊


Upload case file 6

This is the initial file with one column for original full names (“Original”). We want to split the original full names into surname and given name.

Case6 <- read_csv("Case6.csv")
Case6


Script

Case 6 presents a particular challenge due to the varying composition of Chinese surnames and names. The main difficulty is compound surnames since there is no way to find a clear separator with the given name as in the names with one-character surnames and one- or two-character given names. We present a solution in three steps:


Step 1

We separate surname and given name based on the place and number of characters for the surname and given name in the string of characters. This step catches all the one-character surnames and one- or two-character given names, but not the compound surnames (e.g., 歐陽). This is a simple solution, but it will imply manual checking to identify the compound surnames.

Case6$chinese_surname <- substr(Case6$Original,1,1)
Case6$chinese_given_name <- substr(Case6$Original,2,10)
Case6


Step 2

In this step, we apply the conditional function (case_when) to define a pattern based on a given compound name (or a list of compound names separated by | [vertical bar]). The list can be extended as much as necessary. This will extract the compound names from the “Original” column. For a truly systematic use of a list, one needs to create a vector with all the compound surnames and use a loop function to extract them. This is a function that we address in Case 23.

Case6b <- Case6 %>% 
  mutate(Surname=case_when(str_detect(Original, "歐陽")~"歐陽", 
                           TRUE~substr(Original,1,1)))
Case6b


Step 3

We extract the given name by substracting the content of the “Surname” column from the “Original” column.

Case6c <- Case6b %>% mutate(GivenName= str_remove_all(Original, Surname))
Case6c


Case 7

Transliteration of Chinese characters to Pinyin: this is a requirement in all Western-language publications. When dealing with a long list of names, this can become especially cumbersome, time-consuming, and tedious.

Example

Original Original2 Desired Output
厂樵 gong changqiao
去非 gong qufei
同元 gong tongyuan
國富 gong guofu
gong nian
gong huai
歐陽 祖經 ouyang zujing
歐陽 竟無 ouyang jingwu
歐陽 遵詮 ouyang zunquan


Upload case file 7

This is the initial file with two columns: “Original_S” for Surname and “Original_GN” for given name

Case7 <- read_delim("Case7.csv", ";", escape_double = FALSE, 
    trim_ws = TRUE)
knitr::kable(Case7)
Original_S Original_GN
厂樵
去非
同元
國富
歐陽 祖經
歐陽 竟無
歐陽 遵詮
歐陽 文煥
歐游 阿菊


We propose two solutions to solve the case.


Script 1

The first solution consists in using the function “sinogram_to_py” in the “enpchina” package. It works very well, but since it relies on a dictionary that lists all the possible pronunciations of characters, the final result is not ideal.

library(enpchina)
Case7 %>% mutate(Surname_py = sinograms_to_py(c(Original_S))) %>% 
  mutate(GivenName_py = sinograms_to_py(c(Original_GN)))


Script 2

The second solution consists in using the package pinyin. There are still some issues with characters that have more than one pronunciation. The transformation is made based on the most frequent transliteration, but it is not absolutely perfect either. Yet this provides an easy way to transliterate long lists of Chinese characters.

library(pinyin)
Case7 %>% mutate(Surname_py = py(c(Original_S), sep = "", other_replace = NULL, dic = pydic(method='toneless'))) %>% 
  mutate(GivenName_py = py(c(Original_GN), sep = "", other_replace = NULL, dic = pydic(method='toneless'))) %>% 
  mutate(GivenName_py = str_to_title(GivenName_py)) %>% 
  mutate(Surname_py = str_to_title(Surname_py))


Case 8

Capitalize first letters of the names: this is a simple operation of editing, depending on how one wants to present data in a table for example

Example

Original Desired   Output
gong changqiao Gong Changqiao
gong qufei Gong Qufei
gong tongyuan Gong Tongyuan
gong guofu Gong Guofu
gong nian Gong Nian
gong huai Gong Huai
ouyang zujing Ouyang Zujing
ouyang jingwu Ouyang Jingwu
ouyang zunquan Ouyang Zunquan
ouyang zun Ouyang Zun
ouyou a Ouyou A


Upload case file 8

Case8 <- read_delim("Case8.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case8


Script

The solution consists in using the str_to_title function.

Case8a <- Case8 %>% mutate(Surname = str_to_title(Original_S)) %>% 
  mutate(GivenName = str_to_title(Original_GN))
Case8a


Case 9

Reassembling Surname and Given name into a Full Name: data may come in a format that does not meet one’s needs. In this case, we have a case in which the Surname and Given name are placed in distinct columns. The objective to to merge them in a single column.

Example

Original Original Desired Output
Qu Zuhui Qu Zuhui
Yan Renguang Yan Renguang
Nie Guang Nie Guang
Yan Kaiyuan Yan Kaiyuan
Yan Daoyuan Yan Daoyuan
Yan Dunjian Yan Dunjian


Upload case file 9

This is the original file with the Surname and Given name in two distinct columns.

Case9 <- read_csv("Case9.csv", trim_ws = FALSE)
Case9


Script

The solution consists in using the “paste” function with the indication of a separator (here white space)

Case9a <- Case9 %>% mutate(FullName = paste(Original, Original2, sep = " "))
Case9a

Case 10

Split province and location: this is a recurrent motif in data transformation, especially in Chinese where locations (e.g., birth place) are given as a compound name (江蘇無錫,江蘇省無錫縣). We want to extract the exact name of the location and retain the designation of its administrative level.

Example

Original Desired Output Desired Output
江蘇省寳山縣 江蘇 寳山縣
浙江省紹興縣 浙江 紹興縣
江西省萬安縣 江西 萬安縣
湖南省王山 湖南 省王山
河南省 河南
東京市 東京
湖南長沙 湖南 長沙
廣東省興寧縣 廣東 興寧縣
江蘇省杭縣 江蘇 杭縣
浙江省紹興縣 浙江 紹興縣


Upload case file 10

This is the original file with the name of the province and location in a single column

Case10 <- read_csv("Case10.csv", trim_ws = FALSE)
Case10


Script

We propose various modes of splitting and extracting the information:


Mode 1

A. To split and extract the name of the province based on a pattern: with this method we define the list of terms of be extracted and we search the file for matching terms. This is quite useful when the number of items in the list is limited.

Case10A <-Case10 %>% mutate (Province = str_extract(Original,
                                                    "(臺灣|福建|浙江|遼寧|湖南|河北|安徽|湖北|江西|山東|四川河南|廣西|陝西|貴州|山西|吉林|黑龍江|安東|雲南|海南島|新疆|廿肅|察哈爾|龍江|靑海|江蘇|廣東|四川|河南)"))
Case10A


Mode 2

B. To extract only the administrative levels: we proceed to extract the designation of administrative levels based on a pattern (here the name of the level: 省, 縣)

Case10B <-Case10 %>% mutate (Prov = str_extract(Original, "省")) %>% 
  mutate (Xian = str_extract(Original, "縣"))
Case10B


Mode 3

C. To extract the name of the Location: here we do it by removing anything before the character 省

Case10C <- Case10 %>% mutate(Location=str_replace(Original, "[^省]+省",""))
Case10C


Mode 4

D. To extract the name of the Province: we do it by extracting anything before the character 省

Case10D <- Case10C %>% mutate(Province=str_extract(Original,"[^省]+"))
Case10D


Mode 5

E. To extract the name of the Location and the Province (by combining the two formula C & D )

Case10E <- Case10 %>% mutate(Location=str_replace(Original, "[^省]+省","")) %>% 
  mutate(Province=str_extract(Original,"[^省]+"))
Case10E


Case 11

Split Institution and Position: the objective is to extract separately the name of the positions and the name of the institutions listed in a single column as a single compound of characters in Chinese.

Example

Original Desired Output Desired Output
國民黨中央黨部社會部長兼特工總部主任 國民黨中央黨部社會部 部長
國民黨中央黨部社會部長兼特工總部主任 特工總部 主任
四川省政府主席 四川省政府 主席
蒙古聯合自治政府副主席 蒙古聯合自治政府 副主席
新中國日報總主筆 新中國日報 總主筆
龍煙鐵礦株式會社理事長 龍煙鐵礦株式會社 理事長
西南建設委員會秘書長 西南建設委員會 秘書長
民衆敎育幹部訓練所長 民衆敎育幹部訓練所 所長
神州大學敎務主任 神州大學 神敎務主任


Upload case file 11

This is the original file with the name of the institutions and positions in a single column

Case11 <- read_delim("Case11.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case11


Script

We proceed in several steps to take into account the presence of not just a single position, but sometimes dual positions.


Step 1

First, we separate the rows with two or more positions into distinct rows. The separator here is “兼”, but it could also be “及”, “並” or any form of Chinese punctuation (as in case 12).

Case11S <- Case11 %>% separate_rows(Original, sep="兼")
Case11S


Step 2

Second, we split the positions based on pattern matching. This approach is similar to Case10A in case 10. It relies on a pre-defined list of terms.

Case11A <-Case11S %>% mutate (Position = str_extract(Original,
         "主任|部長|副主席|總主筆|理事長|秘書長|所長|敎務主任|主席"))
Case11A


Step 3

Third, we extract the name of the institution. Do achieve this, we use the content of the “Position” column to remove all the names of positions in the Original column. This create a new Institution column.

Case11B <- Case11A %>% mutate(Institution = str_remove_all(Original, Position))
Case11B


Case12

This is about splitting the various sequences of positions

Example

This is the original table with two columns. The second column contains information that needs to be split up so that each piece of information in Information is placed in a distinct row linked to the item (person) listed in Person column. The splitting is based on the elements of punctuation in the sentences. This can apply to any other form of punctuation or separator, including characters that may indicate a new item (e.g., in Chinese 及, 並, 兼, etc.). Note that the output for information could be further refined to separate the terms after の as we show in case 13 below.

Person Information
楊壽枬 國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長
沈士遠 浙江高等師範學校、國立北京大學、燕京大學の敎授
陳其釆 故陳其美の實弟、陳果夫、陳立夫の伯父


This is what we expect to achieve:

Original Original Desired Output Desired Output
楊壽枬 國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長 楊壽枬 國民政府水利委員會委員長
楊壽枬 前清時代の舉人出身
楊壽枬 長蘆鹽運使
楊壽枬 廣東海關監督
楊壽枬 山東財政廳長
楊壽枬 財政次長
陳其釆 故陳其美の實弟、陳果夫、陳立夫の伯父 陳其釆 陳果夫
陳其釆 陳立夫の伯父
陳其釆 故陳其美の實弟
沈士遠 浙江高等師範學校、國立北京大學、燕京大學の敎授 沈士遠 浙江高等師範學
沈士遠 國立北京大學
沈士遠 燕京大學の敎授


Upload case file 12

Case12 <- read_delim("Case12.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case12


Script

Case12A <- Case12 %>% separate_rows(Information, sep="。|、")


Case 13

Separate a set of attributes attached to an individual in Japanese.

Example

Original Information
施肇基 唐紹儀の甥。施肇曾の弟
沈士遠 燕京大學の敎授
聶其杰 曾國藩の孫


This is what we expect to achieve:

Desired   Output Desired Output Desired Output
施肇基 唐紹儀
施肇基 唐紹儀
聶其杰 曾國藩
沈士遠 燕京大學 敎授


Upload case file 13

Case13 <- read_delim("Case13.csv", ",", escape_double = FALSE,
trim_ws = TRUE)
Case13

We proceed in two steps: 1. First, separate row with dual information 2. Second, extract the information through pattern matching This is done in a single operation that separates, extracts the data, and place it in a new column (Status)


Script

Case13S <- Case13 %>% separate_rows(Information, sep="。") %>% 
  mutate (Status = str_extract(Information, "甥|弟|孫|敎授"))


Case 14

Separate a set of attributes attached to an individual in distinct columns: in the present case, we want to split up the information in the Information column and distribute it in distinct columns.

Example

Original Information
楊壽枬 國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長
沈士遠 浙江高等師範學校、國立北京大學、燕京大學の敎授


This is what we expect to achieve:

Desired   Output Desired Output Desired Output Desired Output Desired Output Desired Output Desired Output
楊壽枬 國民政府水利委員會委員長 前清時代の舉人出身 長蘆鹽運使 廣東海關監督 山東財政廳長 財政次長
沈士遠 浙江高等師範學校 國立北京大學 燕京大學の敎授


Upload case file 14

Case14 <- read_delim("Case14.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case14


Script

Case14S <- Case14 %>% separate(Information, c("Pos1", "Pos2", "Pos3", "Pos4"), sep="。|、", extra = "drop")


Case 15

Split transliterated Full name and Chinese Full name: spliting could be done on the basis of blank spaces, but if there is more than one blank space, the splitting will produce unwanted separations. We turn to splitting based on the type of script (here Chinese characters)

Example

Original
Tsai-ch’un 載淳
Wên-hsiang 文祥
Ch’ung-shih 崇實
Ch’ên Li 陳澧
Chao Chih-ch’ien 趙之謙
Tso Tsung-t’ang 左宗棠


This is what we expect to achieve:

Desired   Output Desired Output
載淳 Tsai-ch’un
文祥 Wên-hsiang
崇實 Ch’ung-shih
陳澧 Ch’ên Li
趙之謙 Chao Chih-ch’ien
左宗棠 Tso Tsung-t’ang


Upload case file 15

Case15 <- read_delim("Case15.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case15


Script

Case15B <- Case15 %>% extract(Original, c("NameZh"), "(\\p{Han}+)", remove=FALSE) %>%
  mutate (NameTrans = str_remove_all(Original, NameZh))


Case 16

Split transliterated Full name and Chinese Full name (no white space): splitting is based on the type of script and it works regardless of the presence or absence of any separator

Example

Tsai-ch’un載淳 載淳
Wên-hsiang文祥 文祥
Ch’ung-shih崇實 崇實
Ch’ên Li陳澧 陳澧
Chao Chih-ch’ien趙之謙 趙之謙
Tso Tsung-t’ang左宗棠 左宗棠

This is what we expect to achieve:

Desired   Output Desired Output
載淳 Tsai-ch’un
文祥 Wên-hsiang
崇實 Ch’ung-shih
陳澧 Ch’ên Li
趙之謙 Chao Chih-ch’ien
左宗棠 Tso Tsung-t’ang


Upload case file 16

Case16 <- read_delim("Case16.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case16


Script

Case16A <- Case16 %>% extract(Original, c("NameZh"), "(\\p{Han}+)", remove=FALSE) %>%
  mutate (NameTrans = str_remove_all(Original, NameZh))

Case 17

Transform Wade-Giles transliteration into Pinyin transliteration: the conversion of names in Wade & Giles transliteration to pinyin can come handy when dealing with names extracted from Western sources (including academic literature) published before 1981 when pinyin became the norm for academic publishing.

Example

Original Desired Output
Tsai-ch’un Zai Chun
Wên-hsiang Wen Xiang
Ch’ung-shih Chong Shi
Ch’ên Li Chen Li
Chao Chih-ch’ien Zhao Shiqian
Tso Tsung-t’ang Zuo Zongtang


Upload case file 17

Case17 <- read_delim("Case17.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case17


Script

Because the library that we use still produces some minor errors, we need to make additional corrections to the initial output of the “wade_to_py” function. This will be corrected in due time.

Case17A <- Case17 %>% mutate(NameWG = wade_to_py(Original)) %>%
  mutate(NameWG2 = str_remove_all(NameWG, "-")) %>%
  mutate(NameWG3 = str_remove_all(NameWG2, "[:punct:]")) %>%
  mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>%
  mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>% 
  mutate(NameWG5 = str_replace(NameWG4, "Chê", "Che"))


Case 18

Change all the names in capitals into names with a capital letter on the first letter of each word only. This is a simple, but useful edit operation to adapt the format of names. This can be used for the transformation of titles, etc.

Example

Original Desired Output
CH’EN KUO-JUI Ch’ên Kuo-jui
CHAO CHIH-CH’IEN Chao Chih-ch’ien
TSO TSUNG-T’ANG Tso Tsung-t’ang
CHIN HO Chin Ho
PAO CH’AO Pao Ch’ao


Upload case file 18

Case18 <- read_delim("Case18.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case18


Script

Case18A <- Case18 %>% mutate(Name = str_to_title(Original))


Case 19

Change all the names into capital letters: this is also an edit operation doing the opposite of Case 18.

Example

Original Desired Output
CH’EN KUO-JUI Ch’ên Kuo-jui
CHAO CHIH-CH’IEN Chao Chih-ch’ien
TSO TSUNG-T’ANG Tso Tsung-t’ang
CHIN HO Chin Ho
PAO CH’AO Pao Ch’ao


Upload case file 19

Case19 <- read_delim("Case19.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case19


Script


Case 20

Split the elements of information in the original “sentence” into distinct columns: in this particular case, the commas are followed by a white space, which we included in the split formula.

Example

Original Desired Output Desired Output Desired Output Desired Output Desired Output Desired Output Desired Output Desired Output
Chow, T. C. (1922), 周贊衡(柱臣),   北平地質調查研 Chow, T. C. (1922) 周贊衡(柱臣) 北平地質調查研 Chow T. C. (1922) 1922 北平地質調查研


Upload case file 20

Case20 <- read_delim("Case20.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case20


Script

To achieve the desired output above, we need to proceed in successive steps to split up and refine the information until we reach the desired distribution of data.


Step 1

First, split up the information based on separators. In this case, commas are followed by a white space, which we include in the split formula.

Case20S <- Case20 %>% separate(Original, c("Surname", "InitDate", "NameZhFull", "Institution"), sep=", ", remove = FALSE)


Step 2

Second, extract everything before the opening parenthesis

Case20B <- Case20S %>% tidyr::extract(InitDate, c("GivenName"), "([^(]+),", remove=FALSE)


Step 3

Third, extract the initials of the person name

Case20C <- Case20S %>% tidyr::extract(InitDate, c("Initials"), "([^,]+)\\(", remove=FALSE)


Step 4

Fourth, extract the year

Case20D <- Case20C %>% tidyr::extract(InitDate, c("Date"), "([[:digit:]]+)", remove=FALSE) 


Step 5

Fifth, extract the Chinese full name

Case20E <- Case20D %>% tidyr::extract(Original, c("NameZhFull"), ",([^()]+),", remove=FALSE)


Step 6

Sixth, extract the Chinese name before parenthesis

Case20F <- Case20E %>% tidyr::extract(NameZhFull, c("NameZh"), "([\\p{Han}]+)", remove=FALSE)


Step 7

Seventh, extract the Chinese name between parenthesis - Solution 1

Case20G <- Case20F %>% tidyr::extract(NameZhFull, c("NameZhAlt"), "(([\\p{Han}]+))", remove=FALSE) %>%
  tidyr::extract(NameZhAlt, c("NameZhAlt"), "(\\p{Han}+)", remove=FALSE)


Step 8

Eighth, extract the Chinese name between parenthesis - Solution 2

Case20Gca <- Case20F %>% mutate(NameAlt = str_remove_all(NameZhFull, NameZh)) %>% 
  tidyr::extract(NameAlt, c("NameAlt"), "([\\p{Han}]+)", remove=FALSE)


Case21

Split the elements of information in the original sentence" into distinct columns

Example

Grabau,   A. W. (1922), The Geological Survey of China, Peiping.
Barbour, G. B. (1922), 4   Clareville Grove, London SW. 7, England.
Fortuyn, A. B. D. (1931) ,   Department of Anatomy, P. U. M. C., Peiping.
Garretson, M. W.(Mrs.) (1925),   26, Greendale Ave, Mount Vernoon, N. Y., U. S. A.


This is what we expect to achieve:

Desired Output
Grabau, A. W. (1922) Grabau, A. W. Grabau Grabau, A. W. (1922) 1922 The Geological Survey of China Peiping.
Barbour, G. B. (1922) Barbour, G. B. Barbour, G. B. G. B. (1922) 1922 4 Clareville Grove London SW. 7
Fortuyn, A. B. D. (1931) Fortuyn, A. B. D. Fortuyn A. B. D. (1931) 1931 Department of Anatomy P. U. M. C. Peiping.
Garretson, M. W.(Mrs.) (1925),   26, Greendale Ave, Mount Vernoon, N. Y., U. S. A. Garretson, M. W.(Mrs.) (1925), 26, Greendale Ave, Mount   Vernoon, N. Y., U. S. A. Garretson, M. W.(Mrs.) (1925) Garretson, M. W.(Mrs.) Garretson, M. W. Garretson M. W. 26, Greendale Ave Mount Vernoon


Upload case file 21

Case21 <- read_delim("Case21.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case21


Script

Because this is a complex sentence, we need to proceed in several successive steps. Some steps only serve to isolate a chunk of information on which further splitting is applied.


Step 1

Extract the Surname and initials

Case21SI <- Case21 %>% tidyr::extract(Original, c("SurnIt"), "([^(]+)", remove=FALSE)


Step 2

Extract the date between parenthesis

Case21Y <- Case21SI %>% tidyr::extract(Original, c("Year"), "([[:digit:]]+)", remove=FALSE) 


Step 3

Extract everything after the closing parenthesis and comma -> not perfect


Step 4

Split the items separated by commas in Information


Case22

Split the elements of information in a complex bibliographical reference into distinct columns

Example

Original
CAI Shicun (TSAI   Shih-chun ; TSHAI Che-tchhwen) (dossier d’archives A-122) 蔡時椿 Incidents   ou accidents consécutifs aux injections de lipiodol employées comme moyen de   diagnostic des tumeurs intra-rachidiennes Lyon : Bosc Frères & Riou,   1929 71 p., 26 cm Thèse : Médecine Cote : CH TH 021


This is what we expect to achieve:

Desired   Output Desired Output Desired Output Desired Output Desired Output Desired Output Desired Output Desired Output Desired Output
CAI Shicun (TSAI Shih-chun ; TSHAI Che-tchhwen) (dossier d’archives A-122) Bosc Frères & Riou, 192971 p.,   26 cmThèse : MédecineCote : CH TH 024 蔡時椿 Incidents ou accidents consécutifs aux injections de lipiodol   employées comme moyen de diagnostic des tumeurs intra-rachidiennes Lyon Bosc Frères & Riou 1929 71 p.


Upload case file 22

Case22 <- read_delim("Case22.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case22


Script

Because this is a complex sentence, we need to proceed in several successive steps. Some steps only serve to isolate a chunk of information on which further splitting is applied.


Step 1

Extract the name in pinyin before the first opening parenthesis

Case22A <- Case22 %>% tidyr::extract(Original, c("NamePY"), "([^(]+)", remove=FALSE)


Step 2

Extract the text in the groups between parenthesis

Case22B <- Case22A %>% mutate(Information = str_remove_all(Original, NamePY))


Step 3

Extract the text alone from the groups between parenthesis

Case22C <- Case22B %>% tidyr::extract(Information, c("NameWG"), "([^\\(\\)]+)", remove=FALSE) 
Case22D <- Case22C %>% mutate(Information2 = str_remove_all(Information, NameWG))


Step 4

Suppress the white space before the second opening parenthesis

Case22E <- Case22D %>% mutate(Information2 = str_remove_all(Information, NameWG)) %>% 
  mutate(Information2 = str_remove_all(Information2, "(\\(\\) )")) 


Step 5

Extract the source reference

Case22F <- Case22E %>% tidyr::extract(Information2, c("Source"), "([^\\(\\)]+)", remove=FALSE)


Step 6

Extract the Chinese name

Case22G <- Case22F %>% tidyr::extract(Information2, c("NameZh"), "([\\p{Han}]+)", remove=FALSE)


Step 7

Extract the dissertation title and references (caught with the Chinese name)

Case22H <- Case22G %>% tidyr::extract(Original, c("Dissertation"), "([\\p{Han}].+)", remove=FALSE)


Step 8

Remove the Chinese name

Case22H2 <- Case22H %>% mutate(Dissertation2 = str_remove_all(Dissertation, "([\\p{Han}]+ )"))


Case23

Extract positions based on a list of positions

Example

name work
何素璞 世界紅卍字會會長。濟南道院統掌
陳群 中央政治委員會委員。國民政府內政部長
張群 最高國防會議秘書長、西南建設委員會秘書長、行政院秘書長、四川省政府主席、國民精神總動員委員會秘書長
諸靑來 國民政府交通部長。上海銀行週報。大厦、持志、光華各大學敎授、神州大學敎務主任
黃敏中 國民政府敎育部常務次長。國民黨員、浙江省黨部常務委員
楊壽枬 國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長
許繼祥 國民政府海軍部常務次長。北京政府時代海軍部軍法司長、全國海道測量局長、全國海岸巡防處長


This is what we expect to achieve:

name work Position Institution
何素璞 世界紅卍字會會長 會長 世界紅卍字會
何素璞 濟南道院統掌 統掌 濟南道院
陳群 中央政治委員會委員 委員 中央政治會
陳群 國民政府內政部長 部長 國民政府內政
張群 最高國防會議秘書長 秘書長 最高國防會議
張群 西南建設委員會秘書長 秘書長 西南建設委員會
張群 行政院秘書長 秘書長 行政院
張群 四川省政府主席 主席 四川省政府
張群 國民精神總動員委員會秘書長 秘書長 國民精神總動員委員會
諸靑來 國民政府交通部長 部長 國民政府交通
諸靑來 上海銀行週報編輯 編輯 上海銀行週報
諸靑來 光華各大學敎授 敎授 光華各大學
諸靑來 神州大學敎務主任 主任 神州大學敎務
黃敏中 國民政府敎育部常務次長 常務次長 國民政府敎育部
黃敏中 國民黨員 黨員 國民
黃敏中 浙江省黨部常務委員 常務委員 浙江省黨部
楊壽枬 國民政府水利委員會委員長 委員長 國民政府水利委員會
楊壽枬 長蘆鹽運使 運使 長蘆鹽
楊壽枬 廣東海關監督 監督 廣東海關
楊壽枬 山東財政廳長 廳長 山東財政
楊壽枬 財政次長 次長 財政
許繼祥 國民政府海軍部常務次長 常務次長 國民政府海軍部
許繼祥 北京政府時代海軍部軍法司長 司長 北京政府時代海軍部軍法
許繼祥 全國海道測量局長 局長 全國海道測量
許繼祥 全國海岸巡防處長 處長 全國海岸巡防


Upload case file 23

Case23 <- read_delim("Case23.csv", ",", escape_double = FALSE,
trim_ws = TRUE)
Case23


Script


Step 1

We import the list of positions

PosList <- read_csv("PosList.csv")


Step 2

We create a vector that lists all the names of positions. The original data in the Position column in the PosList file is turned into a vector.

positions_reference <- unlist(PosList)


Step 3

We create a loop by which each position name will be searched to find a match in the Case23 file.

positions_reference_vec <- paste0("(", paste(positions_reference, sep = "", collapse = "|"), ")$")


Step 4

We split multiple positions into separate rows

Case23A <- Case23 %>% separate_rows(work, sep="。|、")


Step 5

We extract the positions

Case23B <- Case23A %>% mutate(Position = str_extract(work, positions_reference_vec))


Step 6

We create the Institution column by removing the whole set of positions from the original work column