Data Transformation

Cécile Armand

Aix-Marseille University (ENP-China)

Christian Henriot

Aix-Marseille University (ENP-China)

Nora Van den Bosch

Aix-Marseille University (ENP-China)

Weiting Guo

Aix-Marseille University (ENP-China)

2021-03-30

Abstract

With the present Markdown script we mean to provide a series of operations for the transformation of data, including messy data, to fit the requirements of clean tabular data. Our purpose is to facilitate the production of data in Chinese studies according to the norms of academia in Western countries. For example, Chinese characters need to be transliterated into pinyin. Yet, our script can be adapted to fit the needs of other languages.

Data transformation is a tedious task that requires rigor and accuracy. While small data sets can be processed by hand in a spreadsheet, larger data sets can quickly become time consuming, with a greater risk of making mistakes. An R script enables the systematic processing of data without erasing the original data set on which the transformation is being applied.

We propose here a wide range of examples, from simple text editing to the extraction of data from complex sentences. The examples we use apply to both English (or any latin script) and Chinese. Altogether, we present 23 case studies. Some involve a single operation, some involve a series of steps to reach the desired results. All case studies are presented in the same way:

First, we present the data set in its original format

Second, we display the desired output of data transformation

Third, we provide the code to transform the data and to display the results

Case 1

The purpose here is to delete all the white spaces in the file (head tails and end tails). This is something that the eye can hardly spot, especially white spaces in end tails, whereas this can create unwelcome duplicates of the same data in a spreadsheet (very bad in pivot tables, GIS, SNA, etc.).

This is the original file with just one column containing the surname and first name with white spaces (head tail/end tail)

Example

Original	Desired Output
Young, Robt. Heyden	Young, Robt. Heyden
Young, S. C.	Young, S. C.
Young, William Alexander	Young, William Alexander
龔厂樵	龔厂樵
龔去非	龔去非

Upload case file 1

Case1 <- read_delim("Case1.csv", ";", escape_double = FALSE, trim_ws = FALSE)
Case1

We propose two solutions.

Script 1

The first solution is the automatic removal of white spaces at the time of import: import the csv file and tick the box “trim spaces”

Case1a <- read_delim("Case1.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case1a

Script 2

The second solution consists in using the “trim_ws” functions. We create three distinct columns to check the result for each parameter (left/right/both).

Case1b <- read_delim("Case1.csv", ";", escape_double = FALSE, trim_ws = FALSE)
Case1b$left_strip <- trimws(Case1b$Original, which = c("left"))
Case1b$right_strip <- trimws(Case1b$Original, which = c("right"))
Case1b$both_strip <- trimws(Case1b$Original, which = c("both"))
Case1b

Case 2

We want to split the last name and first name when the two terms are separated by a comma (or any other separator). This is very handy when one needs to enter distinct data into a database or to do an analysis of each component. Such separation can apply to other types of data.

Example:

Original	Desired Output
Young, Robt. Heyden	Young	Robt. Heyden
Young, S. C.	Young	S. C.
Young, William Alexander	Young	William Alexander
Young, William Stewart	Young	William Stewart

Upload case file 2

This is the original file with just one column containing the last name and first name separated by a comma (or any other separator).

We propose two solutions.

Script 1

The first solution consists in erasing the string of characters before (extract First name) and after (to extract Last Name) the comma.

We proceed in two steps:

We create two columns named FirstName and LastName that will contain the final results
We use the “str_replace” function to extract the first name and the last name by deleting in each case the information before and after the comma that separates first name and last name. The first parameter is the column that we want to use as a base (Original). The second parameter is the regular expression that we use to catch the defined pattern in the Original (“[^,]+,” or ",.*“). The third parameter is what will come instead of that captured pattern, which is empty (”")

Case2a <- Case2 %>% mutate(Original,
                           FirstName=str_replace(Original,"[^,]+, ",""),
                           LastName=str_replace(Original,",.*",""))
Case2a

Script 2

The second solution consists in extracting the string of characters before (to extract last name) and after (to extract first name).

We propose to proceed in three steps:

We create a column named FirstName and LastName that contain intermediary results
We use the “str_extract” function to extract the first name and the last name. The first parameter is the column that we want to use as a base (Original). The second parameter is the regular expression that we want to extract in the Original (“[^,]+,” or ",.*")
We remove ‘,’ or ‘,’ from the intermediary columns and we create two columns “LastName2” and “FirstName2” that contain the final results.

Case2b <- Case2 %>% 
  mutate(Original, 
         LastName=str_extract(Original,"[^,]+,"), 
         FirstName=str_extract(Original,",.*")) %>% 
  mutate(LastName2 = str_remove_all(LastName,",")) %>% 
  mutate(FirstName2 = str_remove_all(FirstName,", "))
Case2b

Case 3

This is a second example of splitting up data: Surname and Given Name separated by a slash (but no white space).

Example

Original	Desired Output
Young/Robt. Heyden	Young	Robt. Heyden
Young/S. C.	Young	S. C.
Young/William Alexander	Young	William Alexander
Young/William Stewart	Young	William Stewart

Upload case file 3

This is the original file with just one column with Surname and Given name separated by a slash sign.

Script

The solution consists in erasing the string of characters before (extract First name) and after (to extract Last Name) the slash sign.

Case 4

This is a third example of splitting up data: Split Surname and Given Name separated by a white space

Example

Original	Desired Output
Adachi Tsunayuki	Adachi	Tsunayuki
Aibara Kuragoro	Aibara	Kuragoro
Aida Isuneo	Aida	Isuneo
Akabane Shiro	Akabane	Shiro
Akamatsu Noriyoshi	Akamatsu	Noriyoshi

Upload case file 4

This is the original file with just one column containing the surname and the first name separated by a white space.

Script

The solution we propose consists in erasing the string of characters before (extract First name) and after (to extract Last Name) the white space using the function “str_replace”.

Case4a <- Case4 %>% mutate(
  FirstName=str_replace(Original,"[^ ]+ ",""),
  LastName=str_replace(Original," .*",""))
Case4a

Case 5

This is a trickier example of splitting up data: Split Surname and Given Names separated by repeated white spaces

Example

Original	Desired Output
Baron Alexandre de Forth-Rouen	de Forth-Rouen	Alexandre	Baron
Jules Berthemy	Berthemy	Jules
Comte Charles de Lallemand	de Lallemand	Charles	Comte
Comte Julien de Rochechouart	de Rochechouart	Julien	Comte
Vicomte Brenier de Montmorand	de Montmorand	Brenier	Vicomte
Albert Bourée	Bourée	Albert

Upload case file 5

This is the original file with just one column containing the surname, the first name and the title (if any)

Script

We propose to proceed in three steps:

First we separate the title and the full name
Next we extract the given name and the surname
Last we use the variable given name to extract the surname from the Original variable.

Step 1

Case5c <- Case5 %>% mutate(Original2 = str_remove_all(Original,"(Baron|Comte|Vicomte) "))

Step 2

Case5d <- Case5c %>% mutate(FirstName = str_extract(Original2,"[^ ]+"))

Step 3

Case5e <- Case5d %>% mutate(LastName = str_remove_all(Original2, FirstName)) %>% 
  mutate(LastName = trimws(LastName, which = c("left")))

Case 6

Split Chinese Surname and Given Names: this case is specific to Chinese names, first because we often handle Chinese characters; second because there is never any separation/separator between Surname and Given name in Chinese; third because Chinese names can vary from 2 to 4 characters, with a variety of combination when the Surname has more than one character.

Example

Original	Desired Output
龔厂樵	龔	厂樵
龔去非	龔	去非
龔同元	龔	同元
龔國富	龔	國富
龔年	龔	年
龔懷	龔	懷
歐陽祖經	歐陽	祖經
歐陽竟無	歐陽	竟無
歐陽遵詮	歐陽	遵詮
歐陽遵	歐陽	文煥
歐游阿	歐游	阿菊

Upload case file 6

This is the initial file with one column for original full names (“Original”). We want to split the original full names into surname and given name.

Case6 <- read_csv("Case6.csv")
Case6

Script

Case 6 presents a particular challenge due to the varying composition of Chinese surnames and names. The main difficulty is compound surnames since there is no way to find a clear separator with the given name as in the names with one-character surnames and one- or two-character given names. We present a solution in three steps:

Step 1

We separate surname and given name based on the place and number of characters for the surname and given name in the string of characters. This step catches all the one-character surnames and one- or two-character given names, but not the compound surnames (e.g., 歐陽). This is a simple solution, but it will imply manual checking to identify the compound surnames.

Case6$chinese_surname <- substr(Case6$Original,1,1)
Case6$chinese_given_name <- substr(Case6$Original,2,10)
Case6

Step 2

In this step, we apply the conditional function (case_when) to define a pattern based on a given compound name (or a list of compound names separated by | [vertical bar]). The list can be extended as much as necessary. This will extract the compound names from the “Original” column. For a truly systematic use of a list, one needs to create a vector with all the compound surnames and use a loop function to extract them. This is a function that we address in Case 23.

Case6b <- Case6 %>% 
  mutate(Surname=case_when(str_detect(Original, "歐陽")~"歐陽", 
                           TRUE~substr(Original,1,1)))
Case6b

Step 3

We extract the given name by substracting the content of the “Surname” column from the “Original” column.

Case6c <- Case6b %>% mutate(GivenName= str_remove_all(Original, Surname))
Case6c

Case 7

Transliteration of Chinese characters to Pinyin: this is a requirement in all Western-language publications. When dealing with a long list of names, this can become especially cumbersome, time-consuming, and tedious.

Example

Original	Original2	Desired Output
龔	厂樵	gong	changqiao
龔	去非	gong	qufei
龔	同元	gong	tongyuan
龔	國富	gong	guofu
龔	年	gong	nian
龔	懷	gong	huai
歐陽	祖經	ouyang	zujing
歐陽	竟無	ouyang	jingwu
歐陽	遵詮	ouyang	zunquan

Upload case file 7

This is the initial file with two columns: “Original_S” for Surname and “Original_GN” for given name

Case7 <- read_delim("Case7.csv", ";", escape_double = FALSE, 
    trim_ws = TRUE)
knitr::kable(Case7)

Original_S	Original_GN
龔	厂樵
龔	去非
龔	同元
龔	國富
龔	年
龔	懷
歐陽	祖經
歐陽	竟無
歐陽	遵詮
歐陽	文煥
歐游	阿菊

We propose two solutions to solve the case.

Script 1

The first solution consists in using the function “sinogram_to_py” in the “enpchina” package. It works very well, but since it relies on a dictionary that lists all the possible pronunciations of characters, the final result is not ideal.

library(enpchina)
Case7 %>% mutate(Surname_py = sinograms_to_py(c(Original_S))) %>% 
  mutate(GivenName_py = sinograms_to_py(c(Original_GN)))

Script 2

The second solution consists in using the package pinyin. There are still some issues with characters that have more than one pronunciation. The transformation is made based on the most frequent transliteration, but it is not absolutely perfect either. Yet this provides an easy way to transliterate long lists of Chinese characters.

library(pinyin)
Case7 %>% mutate(Surname_py = py(c(Original_S), sep = "", other_replace = NULL, dic = pydic(method='toneless'))) %>% 
  mutate(GivenName_py = py(c(Original_GN), sep = "", other_replace = NULL, dic = pydic(method='toneless'))) %>% 
  mutate(GivenName_py = str_to_title(GivenName_py)) %>% 
  mutate(Surname_py = str_to_title(Surname_py))

Case 8

Capitalize first letters of the names: this is a simple operation of editing, depending on how one wants to present data in a table for example

Example

Original		Desired Output
gong	changqiao	Gong	Changqiao
gong	qufei	Gong	Qufei
gong	tongyuan	Gong	Tongyuan
gong	guofu	Gong	Guofu
gong	nian	Gong	Nian
gong	huai	Gong	Huai
ouyang	zujing	Ouyang	Zujing
ouyang	jingwu	Ouyang	Jingwu
ouyang	zunquan	Ouyang	Zunquan
ouyang	zun	Ouyang	Zun
ouyou	a	Ouyou	A

Upload case file 8

Case8 <- read_delim("Case8.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case8

Script

The solution consists in using the str_to_title function.

Case8a <- Case8 %>% mutate(Surname = str_to_title(Original_S)) %>% 
  mutate(GivenName = str_to_title(Original_GN))
Case8a

Case 9

Reassembling Surname and Given name into a Full Name: data may come in a format that does not meet one’s needs. In this case, we have a case in which the Surname and Given name are placed in distinct columns. The objective to to merge them in a single column.

Example

Original	Original	Desired Output
Qu	Zuhui	Qu Zuhui
Yan	Renguang	Yan Renguang
Nie	Guang	Nie Guang
Yan	Kaiyuan	Yan Kaiyuan
Yan	Daoyuan	Yan Daoyuan
Yan	Dunjian	Yan Dunjian

Upload case file 9

This is the original file with the Surname and Given name in two distinct columns.

Case9 <- read_csv("Case9.csv", trim_ws = FALSE)
Case9

Script

The solution consists in using the “paste” function with the indication of a separator (here white space)

Case9a <- Case9 %>% mutate(FullName = paste(Original, Original2, sep = " "))
Case9a

Case 10

Split province and location: this is a recurrent motif in data transformation, especially in Chinese where locations (e.g., birth place) are given as a compound name (江蘇無錫，江蘇省無錫縣). We want to extract the exact name of the location and retain the designation of its administrative level.

Example

Original	Desired Output	Desired Output
江蘇省寳山縣	江蘇	寳山縣
浙江省紹興縣	浙江	紹興縣
江西省萬安縣	江西	萬安縣
湖南省王山	湖南	省王山
河南省	河南
東京市	東京	市
湖南長沙	湖南	長沙
廣東省興寧縣	廣東	興寧縣
江蘇省杭縣	江蘇	杭縣
浙江省紹興縣	浙江	紹興縣

Upload case file 10

This is the original file with the name of the province and location in a single column

Case10 <- read_csv("Case10.csv", trim_ws = FALSE)
Case10

Script

We propose various modes of splitting and extracting the information:

Mode 1

A. To split and extract the name of the province based on a pattern: with this method we define the list of terms of be extracted and we search the file for matching terms. This is quite useful when the number of items in the list is limited.

Case10A <-Case10 %>% mutate (Province = str_extract(Original,
                                                    "(臺灣|福建|浙江|遼寧|湖南|河北|安徽|湖北|江西|山東|四川河南|廣西|陝西|貴州|山西|吉林|黑龍江|安東|雲南|海南島|新疆|廿肅|察哈爾|龍江|靑海|江蘇|廣東|四川|河南)"))
Case10A

Mode 2

B. To extract only the administrative levels: we proceed to extract the designation of administrative levels based on a pattern (here the name of the level: 省, 縣)

Case10B <-Case10 %>% mutate (Prov = str_extract(Original, "省")) %>% 
  mutate (Xian = str_extract(Original, "縣"))
Case10B

Mode 3

C. To extract the name of the Location: here we do it by removing anything before the character 省

Case10C <- Case10 %>% mutate(Location=str_replace(Original, "[^省]+省",""))
Case10C

Mode 4

D. To extract the name of the Province: we do it by extracting anything before the character 省

Case10D <- Case10C %>% mutate(Province=str_extract(Original,"[^省]+"))
Case10D

Mode 5

E. To extract the name of the Location and the Province (by combining the two formula C & D )

Case10E <- Case10 %>% mutate(Location=str_replace(Original, "[^省]+省","")) %>% 
  mutate(Province=str_extract(Original,"[^省]+"))
Case10E

Case 11

Split Institution and Position: the objective is to extract separately the name of the positions and the name of the institutions listed in a single column as a single compound of characters in Chinese.

Example

Original	Desired Output	Desired Output
國民黨中央黨部社會部長兼特工總部主任	國民黨中央黨部社會部	部長
國民黨中央黨部社會部長兼特工總部主任	特工總部	主任
四川省政府主席	四川省政府	主席
蒙古聯合自治政府副主席	蒙古聯合自治政府	副主席
新中國日報總主筆	新中國日報	總主筆
龍煙鐵礦株式會社理事長	龍煙鐵礦株式會社	理事長
西南建設委員會秘書長	西南建設委員會	秘書長
民衆敎育幹部訓練所長	民衆敎育幹部訓練所	所長
神州大學敎務主任	神州大學	神敎務主任

Upload case file 11

This is the original file with the name of the institutions and positions in a single column

Case11 <- read_delim("Case11.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case11

Script

We proceed in several steps to take into account the presence of not just a single position, but sometimes dual positions.

Step 1

First, we separate the rows with two or more positions into distinct rows. The separator here is “兼”, but it could also be “及”, “並” or any form of Chinese punctuation (as in case 12).

Case11S <- Case11 %>% separate_rows(Original, sep="兼")
Case11S

Step 2

Second, we split the positions based on pattern matching. This approach is similar to Case10A in case 10. It relies on a pre-defined list of terms.

Case11A <-Case11S %>% mutate (Position = str_extract(Original,
         "主任|部長|副主席|總主筆|理事長|秘書長|所長|敎務主任|主席"))
Case11A

Step 3

Third, we extract the name of the institution. Do achieve this, we use the content of the “Position” column to remove all the names of positions in the Original column. This create a new Institution column.

Case11B <- Case11A %>% mutate(Institution = str_remove_all(Original, Position))
Case11B

Case12

This is about splitting the various sequences of positions

Example

This is the original table with two columns. The second column contains information that needs to be split up so that each piece of information in Information is placed in a distinct row linked to the item (person) listed in Person column. The splitting is based on the elements of punctuation in the sentences. This can apply to any other form of punctuation or separator, including characters that may indicate a new item (e.g., in Chinese 及, 並, 兼, etc.). Note that the output for information could be further refined to separate the terms after の as we show in case 13 below.

Person	Information
楊壽枬	國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長
沈士遠	浙江高等師範學校、國立北京大學、燕京大學の敎授
陳其釆	故陳其美の實弟、陳果夫、陳立夫の伯父

This is what we expect to achieve:

Original	Original	Desired Output	Desired Output
楊壽枬	國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長	楊壽枬	國民政府水利委員會委員長
		楊壽枬	前清時代の舉人出身
		楊壽枬	長蘆鹽運使
		楊壽枬	廣東海關監督
		楊壽枬	山東財政廳長
		楊壽枬	財政次長
陳其釆	故陳其美の實弟、陳果夫、陳立夫の伯父	陳其釆	陳果夫
		陳其釆	陳立夫の伯父
		陳其釆	故陳其美の實弟
沈士遠	浙江高等師範學校、國立北京大學、燕京大學の敎授	沈士遠	浙江高等師範學
		沈士遠	國立北京大學
		沈士遠	燕京大學の敎授

Upload case file 12

Case12 <- read_delim("Case12.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case12

Script

Case12A <- Case12 %>% separate_rows(Information, sep="。|、")

Case 13

Separate a set of attributes attached to an individual in Japanese.

Example

Original	Information
施肇基	唐紹儀の甥。施肇曾の弟
沈士遠	燕京大學の敎授
聶其杰	曾國藩の孫

This is what we expect to achieve:

Desired Output	Desired Output	Desired Output
施肇基	唐紹儀	甥
施肇基	唐紹儀	弟
聶其杰	曾國藩	孫
沈士遠	燕京大學	敎授

Upload case file 13

Case13 <- read_delim("Case13.csv", ",", escape_double = FALSE,
trim_ws = TRUE)
Case13

We proceed in two steps: 1. First, separate row with dual information 2. Second, extract the information through pattern matching This is done in a single operation that separates, extracts the data, and place it in a new column (Status)

Script

Case13S <- Case13 %>% separate_rows(Information, sep="。") %>% 
  mutate (Status = str_extract(Information, "甥|弟|孫|敎授"))

Case 14

Separate a set of attributes attached to an individual in distinct columns: in the present case, we want to split up the information in the Information column and distribute it in distinct columns.

Example

Original	Information
楊壽枬	國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長
沈士遠	浙江高等師範學校、國立北京大學、燕京大學の敎授

This is what we expect to achieve:

Desired Output	Desired Output	Desired Output	Desired Output	Desired Output	Desired Output	Desired Output
楊壽枬	國民政府水利委員會委員長	前清時代の舉人出身	長蘆鹽運使	廣東海關監督	山東財政廳長	財政次長
沈士遠	浙江高等師範學校	國立北京大學	燕京大學の敎授

Upload case file 14

Case14 <- read_delim("Case14.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case14

Script

Case14S <- Case14 %>% separate(Information, c("Pos1", "Pos2", "Pos3", "Pos4"), sep="。|、", extra = "drop")

Case 15

Split transliterated Full name and Chinese Full name: spliting could be done on the basis of blank spaces, but if there is more than one blank space, the splitting will produce unwanted separations. We turn to splitting based on the type of script (here Chinese characters)

Example

Original
Tsai-ch’un 載淳
Wên-hsiang 文祥
Ch’ung-shih 崇實
Ch’ên Li 陳澧
Chao Chih-ch’ien 趙之謙
Tso Tsung-t’ang 左宗棠

This is what we expect to achieve:

Desired Output	Desired Output
載淳	Tsai-ch’un
文祥	Wên-hsiang
崇實	Ch’ung-shih
陳澧	Ch’ên Li
趙之謙	Chao Chih-ch’ien
左宗棠	Tso Tsung-t’ang

Upload case file 15

Case15 <- read_delim("Case15.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case15

Script

Case15B <- Case15 %>% extract(Original, c("NameZh"), "(\\p{Han}+)", remove=FALSE) %>%
  mutate (NameTrans = str_remove_all(Original, NameZh))

Case 16

Split transliterated Full name and Chinese Full name (no white space): splitting is based on the type of script and it works regardless of the presence or absence of any separator

Example

Tsai-ch’un載淳	載淳
Wên-hsiang文祥	文祥
Ch’ung-shih崇實	崇實
Ch’ên Li陳澧	陳澧
Chao Chih-ch’ien趙之謙	趙之謙
Tso Tsung-t’ang左宗棠	左宗棠

This is what we expect to achieve:

Desired Output	Desired Output
載淳	Tsai-ch’un
文祥	Wên-hsiang
崇實	Ch’ung-shih
陳澧	Ch’ên Li
趙之謙	Chao Chih-ch’ien
左宗棠	Tso Tsung-t’ang

Upload case file 16

Case16 <- read_delim("Case16.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case16

Script

Case16A <- Case16 %>% extract(Original, c("NameZh"), "(\\p{Han}+)", remove=FALSE) %>%
  mutate (NameTrans = str_remove_all(Original, NameZh))

Case 17

Transform Wade-Giles transliteration into Pinyin transliteration: the conversion of names in Wade & Giles transliteration to pinyin can come handy when dealing with names extracted from Western sources (including academic literature) published before 1981 when pinyin became the norm for academic publishing.

Example

Original	Desired Output
Tsai-ch’un	Zai Chun
Wên-hsiang	Wen Xiang
Ch’ung-shih	Chong Shi
Ch’ên Li	Chen Li
Chao Chih-ch’ien	Zhao Shiqian
Tso Tsung-t’ang	Zuo Zongtang

Upload case file 17

Case17 <- read_delim("Case17.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case17

Script

Because the library that we use still produces some minor errors, we need to make additional corrections to the initial output of the “wade_to_py” function. This will be corrected in due time.

Case17A <- Case17 %>% mutate(NameWG = wade_to_py(Original)) %>%
  mutate(NameWG2 = str_remove_all(NameWG, "-")) %>%
  mutate(NameWG3 = str_remove_all(NameWG2, "[:punct:]")) %>%
  mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>%
  mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>% 
  mutate(NameWG5 = str_replace(NameWG4, "Chê", "Che"))

Case 18

Change all the names in capitals into names with a capital letter on the first letter of each word only. This is a simple, but useful edit operation to adapt the format of names. This can be used for the transformation of titles, etc.

Example

Original	Desired Output
CH’EN KUO-JUI	Ch’ên Kuo-jui
CHAO CHIH-CH’IEN	Chao Chih-ch’ien
TSO TSUNG-T’ANG	Tso Tsung-t’ang
CHIN HO	Chin Ho
PAO CH’AO	Pao Ch’ao

Upload case file 18

Case18 <- read_delim("Case18.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case18

Script

Case18A <- Case18 %>% mutate(Name = str_to_title(Original))

Case 19

Change all the names into capital letters: this is also an edit operation doing the opposite of Case 18.

Example

Original	Desired Output
CH’EN KUO-JUI	Ch’ên Kuo-jui
CHAO CHIH-CH’IEN	Chao Chih-ch’ien
TSO TSUNG-T’ANG	Tso Tsung-t’ang
CHIN HO	Chin Ho
PAO CH’AO	Pao Ch’ao

Upload case file 19

Case19 <- read_delim("Case19.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case19

Script

Case 20

Split the elements of information in the original “sentence” into distinct columns: in this particular case, the commas are followed by a white space, which we included in the split formula.

Example

Original	Desired Output	Desired Output	Desired Output	Desired Output	Desired Output	Desired Output	Desired Output	Desired Output
Chow， T. C. (1922），周贊衡（柱臣），北平地質調查研	Chow， T. C. (1922)	周贊衡（柱臣)	北平地質調查研	Chow	T. C.	(1922）	1922	北平地質調查研

Upload case file 20

Case20 <- read_delim("Case20.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case20

Script

To achieve the desired output above, we need to proceed in successive steps to split up and refine the information until we reach the desired distribution of data.

Step 1

First, split up the information based on separators. In this case, commas are followed by a white space, which we include in the split formula.

Case20S <- Case20 %>% separate(Original, c("Surname", "InitDate", "NameZhFull", "Institution"), sep="， ", remove = FALSE)

Step 2

Second, extract everything before the opening parenthesis

Case20B <- Case20S %>% tidyr::extract(InitDate, c("GivenName"), "([^(]+)，", remove=FALSE)

Step 3

Third, extract the initials of the person name

Case20C <- Case20S %>% tidyr::extract(InitDate, c("Initials"), "([^，]+)\\(", remove=FALSE)

Step 4

Fourth, extract the year

Case20D <- Case20C %>% tidyr::extract(InitDate, c("Date"), "([[:digit:]]+)", remove=FALSE)

Step 5

Fifth, extract the Chinese full name

Case20E <- Case20D %>% tidyr::extract(Original, c("NameZhFull"), "，([^()]+)，", remove=FALSE)

Step 6

Sixth, extract the Chinese name before parenthesis

Case20F <- Case20E %>% tidyr::extract(NameZhFull, c("NameZh"), "([\\p{Han}]+)", remove=FALSE)

Step 7

Seventh, extract the Chinese name between parenthesis - Solution 1

Case20G <- Case20F %>% tidyr::extract(NameZhFull, c("NameZhAlt"), "(（[\\p{Han}]+）)", remove=FALSE) %>%
  tidyr::extract(NameZhAlt, c("NameZhAlt"), "(\\p{Han}+)", remove=FALSE)

Step 8

Eighth, extract the Chinese name between parenthesis - Solution 2

Case20Gca <- Case20F %>% mutate(NameAlt = str_remove_all(NameZhFull, NameZh)) %>% 
  tidyr::extract(NameAlt, c("NameAlt"), "([\\p{Han}]+)", remove=FALSE)

Case21

Split the elements of information in the original sentence" into distinct columns

Example

Grabau， A. W. (1922）， The Geological Survey of China， Peiping.
Barbour， G. B. (1922)， 4 Clareville Grove， London SW. 7， England.
Fortuyn， A. B. D. (1931) ， Department of Anatomy， P. U. M. C.， Peiping.
Garretson， M. W.(Mrs.) (1925)， 26， Greendale Ave， Mount Vernoon， N. Y.， U. S. A.

This is what we expect to achieve:

Desired Output
Grabau， A. W. (1922)	Grabau， A. W.	Grabau	Grabau， A. W.	(1922）	1922	The Geological Survey of China	Peiping.
Barbour， G. B. (1922)	Barbour， G. B.	Barbour， G. B.	G. B.	(1922）	1922	4 Clareville Grove	London	SW. 7
Fortuyn， A. B. D. (1931)	Fortuyn， A. B. D.	Fortuyn	A. B. D.	(1931）	1931	Department of Anatomy	P. U. M. C.	Peiping.
Garretson， M. W.(Mrs.) (1925)， 26， Greendale Ave， Mount Vernoon， N. Y.， U. S. A.	Garretson， M. W.(Mrs.) (1925)， 26， Greendale Ave， Mount Vernoon， N. Y.， U. S. A.	Garretson， M. W.(Mrs.) (1925)	Garretson， M. W.(Mrs.)	Garretson， M. W.	Garretson	M. W.	26， Greendale Ave	Mount Vernoon

Upload case file 21

Case21 <- read_delim("Case21.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case21

Script

Because this is a complex sentence, we need to proceed in several successive steps. Some steps only serve to isolate a chunk of information on which further splitting is applied.

Step 1

Extract the Surname and initials

Case21SI <- Case21 %>% tidyr::extract(Original, c("SurnIt"), "([^(]+)", remove=FALSE)

Step 2

Extract the date between parenthesis

Case21Y <- Case21SI %>% tidyr::extract(Original, c("Year"), "([[:digit:]]+)", remove=FALSE)

Step 3

Extract everything after the closing parenthesis and comma -> not perfect

Step 4

Split the items separated by commas in Information

Case22

Split the elements of information in a complex bibliographical reference into distinct columns

Example

Original
CAI Shicun (TSAI Shih-chun ; TSHAI Che-tchhwen) (dossier d’archives A-122) 蔡時椿 Incidents ou accidents consécutifs aux injections de lipiodol employées comme moyen de diagnostic des tumeurs intra-rachidiennes Lyon : Bosc Frères & Riou, 1929 71 p., 26 cm Thèse : Médecine Cote : CH TH 021

This is what we expect to achieve:

Desired Output	Desired Output	Desired Output	Desired Output	Desired Output	Desired Output	Desired Output	Desired Output	Desired Output
CAI Shicun	(TSAI Shih-chun ; TSHAI Che-tchhwen)	(dossier d’archives A-122) Bosc Frères & Riou, 192971 p., 26 cmThèse : MédecineCote : CH TH 024	蔡時椿	Incidents ou accidents consécutifs aux injections de lipiodol employées comme moyen de diagnostic des tumeurs intra-rachidiennes	Lyon	Bosc Frères & Riou	1929	71 p.

Upload case file 22

Case22 <- read_delim("Case22.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case22

Script

Because this is a complex sentence, we need to proceed in several successive steps. Some steps only serve to isolate a chunk of information on which further splitting is applied.

Step 1

Extract the name in pinyin before the first opening parenthesis

Case22A <- Case22 %>% tidyr::extract(Original, c("NamePY"), "([^(]+)", remove=FALSE)

Step 2

Extract the text in the groups between parenthesis

Case22B <- Case22A %>% mutate(Information = str_remove_all(Original, NamePY))

Step 3

Extract the text alone from the groups between parenthesis

Case22C <- Case22B %>% tidyr::extract(Information, c("NameWG"), "([^\\(\\)]+)", remove=FALSE) 
Case22D <- Case22C %>% mutate(Information2 = str_remove_all(Information, NameWG))

Step 4

Suppress the white space before the second opening parenthesis

Case22E <- Case22D %>% mutate(Information2 = str_remove_all(Information, NameWG)) %>% 
  mutate(Information2 = str_remove_all(Information2, "(\\(\\) )"))

Step 5

Extract the source reference

Case22F <- Case22E %>% tidyr::extract(Information2, c("Source"), "([^\\(\\)]+)", remove=FALSE)

Step 6

Extract the Chinese name

Case22G <- Case22F %>% tidyr::extract(Information2, c("NameZh"), "([\\p{Han}]+)", remove=FALSE)

Step 7

Extract the dissertation title and references (caught with the Chinese name)

Case22H <- Case22G %>% tidyr::extract(Original, c("Dissertation"), "([\\p{Han}].+)", remove=FALSE)

Step 8

Remove the Chinese name

Case22H2 <- Case22H %>% mutate(Dissertation2 = str_remove_all(Dissertation, "([\\p{Han}]+ )"))

Case23

Extract positions based on a list of positions

Example

name	work
何素璞	世界紅卍字會會長。濟南道院統掌
陳群	中央政治委員會委員。國民政府內政部長
張群	最高國防會議秘書長、西南建設委員會秘書長、行政院秘書長、四川省政府主席、國民精神總動員委員會秘書長
諸靑來	國民政府交通部長。上海銀行週報。大厦、持志、光華各大學敎授、神州大學敎務主任
黃敏中	國民政府敎育部常務次長。國民黨員、浙江省黨部常務委員
楊壽枬	國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長
許繼祥	國民政府海軍部常務次長。北京政府時代海軍部軍法司長、全國海道測量局長、全國海岸巡防處長

This is what we expect to achieve:

name	work	Position	Institution
何素璞	世界紅卍字會會長	會長	世界紅卍字會
何素璞	濟南道院統掌	統掌	濟南道院
陳群	中央政治委員會委員	委員	中央政治會
陳群	國民政府內政部長	部長	國民政府內政
張群	最高國防會議秘書長	秘書長	最高國防會議
張群	西南建設委員會秘書長	秘書長	西南建設委員會
張群	行政院秘書長	秘書長	行政院
張群	四川省政府主席	主席	四川省政府
張群	國民精神總動員委員會秘書長	秘書長	國民精神總動員委員會
諸靑來	國民政府交通部長	部長	國民政府交通
諸靑來	上海銀行週報編輯	編輯	上海銀行週報
諸靑來	光華各大學敎授	敎授	光華各大學
諸靑來	神州大學敎務主任	主任	神州大學敎務
黃敏中	國民政府敎育部常務次長	常務次長	國民政府敎育部
黃敏中	國民黨員	黨員	國民
黃敏中	浙江省黨部常務委員	常務委員	浙江省黨部
楊壽枬	國民政府水利委員會委員長	委員長	國民政府水利委員會
楊壽枬	長蘆鹽運使	運使	長蘆鹽
楊壽枬	廣東海關監督	監督	廣東海關
楊壽枬	山東財政廳長	廳長	山東財政
楊壽枬	財政次長	次長	財政
許繼祥	國民政府海軍部常務次長	常務次長	國民政府海軍部
許繼祥	北京政府時代海軍部軍法司長	司長	北京政府時代海軍部軍法
許繼祥	全國海道測量局長	局長	全國海道測量
許繼祥	全國海岸巡防處長	處長	全國海岸巡防

Upload case file 23

Case23 <- read_delim("Case23.csv", ",", escape_double = FALSE,
trim_ws = TRUE)
Case23

Script

Step 1

We import the list of positions

PosList <- read_csv("PosList.csv")

Step 2

We create a vector that lists all the names of positions. The original data in the Position column in the PosList file is turned into a vector.

positions_reference <- unlist(PosList)

Step 3

We create a loop by which each position name will be searched to find a match in the Case23 file.

positions_reference_vec <- paste0("(", paste(positions_reference, sep = "", collapse = "|"), ")$")

Step 4

We split multiple positions into separate rows

Case23A <- Case23 %>% separate_rows(work, sep="。|、")

Step 5

We extract the positions

Case23B <- Case23A %>% mutate(Position = str_extract(work, positions_reference_vec))

Step 6

We create the Institution column by removing the whole set of positions from the original work column