Abstract
With the present Markdown script we mean to provide a series of operations for the transformation of data, including messy data, to fit the requirements of clean tabular data. Our purpose is to facilitate the production of data in Chinese studies according to the norms of academia in Western countries. For example, Chinese characters need to be transliterated into pinyin. Yet, our script can be adapted to fit the needs of other languages.
Data transformation is a tedious task that requires rigor and accuracy. While small data sets can be processed by hand in a spreadsheet, larger data sets can quickly become time consuming, with a greater risk of making mistakes. An R script enables the systematic processing of data without erasing the original data set on which the transformation is being applied.
We propose here a wide range of examples, from simple text editing to the extraction of data from complex sentences. The examples we use apply to both English (or any latin script) and Chinese. Altogether, we present 23 case studies. Some involve a single operation, some involve a series of steps to reach the desired results. All case studies are presented in the same way:
First, we present the data set in its original format
Second, we display the desired output of data transformation
Third, we provide the code to transform the data and to display the results
The purpose here is to delete all the white spaces in the file (head tails and end tails). This is something that the eye can hardly spot, especially white spaces in end tails, whereas this can create unwelcome duplicates of the same data in a spreadsheet (very bad in pivot tables, GIS, SNA, etc.).
This is the original file with just one column containing the surname and first name with white spaces (head tail/end tail)
Original | Desired Output |
---|---|
Young, Robt. Heyden | Young, Robt. Heyden |
Young, S. C. | Young, S. C. |
Young, William Alexander | Young, William Alexander |
龔厂樵 | 龔厂樵 |
龔去非 | 龔去非 |
Case1 <- read_delim("Case1.csv", ";", escape_double = FALSE, trim_ws = FALSE)
Case1
We propose two solutions.
The first solution is the automatic removal of white spaces at the time of import: import the csv file and tick the box “trim spaces”
Case1a <- read_delim("Case1.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case1a
The second solution consists in using the “trim_ws” functions. We create three distinct columns to check the result for each parameter (left/right/both).
Case1b <- read_delim("Case1.csv", ";", escape_double = FALSE, trim_ws = FALSE)
Case1b$left_strip <- trimws(Case1b$Original, which = c("left"))
Case1b$right_strip <- trimws(Case1b$Original, which = c("right"))
Case1b$both_strip <- trimws(Case1b$Original, which = c("both"))
Case1b
We want to split the last name and first name when the two terms are separated by a comma (or any other separator). This is very handy when one needs to enter distinct data into a database or to do an analysis of each component. Such separation can apply to other types of data.
Original | Desired Output | |
---|---|---|
Young, Robt. Heyden | Young | Robt. Heyden |
Young, S. C. | Young | S. C. |
Young, William Alexander | Young | William Alexander |
Young, William Stewart | Young | William Stewart |
This is the original file with just one column containing the last name and first name separated by a comma (or any other separator).
We propose two solutions.
The first solution consists in erasing the string of characters before (extract First name) and after (to extract Last Name) the comma.
We proceed in two steps:
Case2a <- Case2 %>% mutate(Original,
FirstName=str_replace(Original,"[^,]+, ",""),
LastName=str_replace(Original,",.*",""))
Case2a
The second solution consists in extracting the string of characters before (to extract last name) and after (to extract first name).
We propose to proceed in three steps:
Case2b <- Case2 %>%
mutate(Original,
LastName=str_extract(Original,"[^,]+,"),
FirstName=str_extract(Original,",.*")) %>%
mutate(LastName2 = str_remove_all(LastName,",")) %>%
mutate(FirstName2 = str_remove_all(FirstName,", "))
Case2b
This is a second example of splitting up data: Surname and Given Name separated by a slash (but no white space).
Original | Desired Output | |
---|---|---|
Young/Robt. Heyden | Young | Robt. Heyden |
Young/S. C. | Young | S. C. |
Young/William Alexander | Young | William Alexander |
Young/William Stewart | Young | William Stewart |
This is the original file with just one column with Surname and Given name separated by a slash sign.
The solution consists in erasing the string of characters before (extract First name) and after (to extract Last Name) the slash sign.
This is a third example of splitting up data: Split Surname and Given Name separated by a white space
Original | Desired Output | |
---|---|---|
Adachi Tsunayuki | Adachi | Tsunayuki |
Aibara Kuragoro | Aibara | Kuragoro |
Aida Isuneo | Aida | Isuneo |
Akabane Shiro | Akabane | Shiro |
Akamatsu Noriyoshi | Akamatsu | Noriyoshi |
This is the original file with just one column containing the surname and the first name separated by a white space.
The solution we propose consists in erasing the string of characters before (extract First name) and after (to extract Last Name) the white space using the function “str_replace”.
Case4a <- Case4 %>% mutate(
FirstName=str_replace(Original,"[^ ]+ ",""),
LastName=str_replace(Original," .*",""))
Case4a
This is a trickier example of splitting up data: Split Surname and Given Names separated by repeated white spaces
Original | Desired Output | ||
---|---|---|---|
Baron Alexandre de Forth-Rouen | de Forth-Rouen | Alexandre | Baron |
Jules Berthemy | Berthemy | Jules | |
Comte Charles de Lallemand | de Lallemand | Charles | Comte |
Comte Julien de Rochechouart | de Rochechouart | Julien | Comte |
Vicomte Brenier de Montmorand | de Montmorand | Brenier | Vicomte |
Albert Bourée | Bourée | Albert |
This is the original file with just one column containing the surname, the first name and the title (if any)
We propose to proceed in three steps:
Case5c <- Case5 %>% mutate(Original2 = str_remove_all(Original,"(Baron|Comte|Vicomte) "))
Case5d <- Case5c %>% mutate(FirstName = str_extract(Original2,"[^ ]+"))
Case5e <- Case5d %>% mutate(LastName = str_remove_all(Original2, FirstName)) %>%
mutate(LastName = trimws(LastName, which = c("left")))
Split Chinese Surname and Given Names: this case is specific to Chinese names, first because we often handle Chinese characters; second because there is never any separation/separator between Surname and Given name in Chinese; third because Chinese names can vary from 2 to 4 characters, with a variety of combination when the Surname has more than one character.
Original | Desired Output | |
---|---|---|
龔厂樵 | 龔 | 厂樵 |
龔去非 | 龔 | 去非 |
龔同元 | 龔 | 同元 |
龔國富 | 龔 | 國富 |
龔年 | 龔 | 年 |
龔懷 | 龔 | 懷 |
歐陽祖經 | 歐陽 | 祖經 |
歐陽竟無 | 歐陽 | 竟無 |
歐陽遵詮 | 歐陽 | 遵詮 |
歐陽遵 | 歐陽 | 文煥 |
歐游阿 | 歐游 | 阿菊 |
This is the initial file with one column for original full names (“Original”). We want to split the original full names into surname and given name.
Case6 <- read_csv("Case6.csv")
Case6
Case 6 presents a particular challenge due to the varying composition of Chinese surnames and names. The main difficulty is compound surnames since there is no way to find a clear separator with the given name as in the names with one-character surnames and one- or two-character given names. We present a solution in three steps:
We separate surname and given name based on the place and number of characters for the surname and given name in the string of characters. This step catches all the one-character surnames and one- or two-character given names, but not the compound surnames (e.g., 歐陽). This is a simple solution, but it will imply manual checking to identify the compound surnames.
Case6$chinese_surname <- substr(Case6$Original,1,1)
Case6$chinese_given_name <- substr(Case6$Original,2,10)
Case6
In this step, we apply the conditional function (case_when) to define a pattern based on a given compound name (or a list of compound names separated by | [vertical bar]). The list can be extended as much as necessary. This will extract the compound names from the “Original” column. For a truly systematic use of a list, one needs to create a vector with all the compound surnames and use a loop function to extract them. This is a function that we address in Case 23.
Case6b <- Case6 %>%
mutate(Surname=case_when(str_detect(Original, "歐陽")~"歐陽",
TRUE~substr(Original,1,1)))
Case6b
We extract the given name by substracting the content of the “Surname” column from the “Original” column.
Case6c <- Case6b %>% mutate(GivenName= str_remove_all(Original, Surname))
Case6c
Transliteration of Chinese characters to Pinyin: this is a requirement in all Western-language publications. When dealing with a long list of names, this can become especially cumbersome, time-consuming, and tedious.
Original | Original2 | Desired Output | |
---|---|---|---|
龔 | 厂樵 | gong | changqiao |
龔 | 去非 | gong | qufei |
龔 | 同元 | gong | tongyuan |
龔 | 國富 | gong | guofu |
龔 | 年 | gong | nian |
龔 | 懷 | gong | huai |
歐陽 | 祖經 | ouyang | zujing |
歐陽 | 竟無 | ouyang | jingwu |
歐陽 | 遵詮 | ouyang | zunquan |
This is the initial file with two columns: “Original_S” for Surname and “Original_GN” for given name
Case7 <- read_delim("Case7.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
knitr::kable(Case7)
Original_S | Original_GN |
---|---|
龔 | 厂樵 |
龔 | 去非 |
龔 | 同元 |
龔 | 國富 |
龔 | 年 |
龔 | 懷 |
歐陽 | 祖經 |
歐陽 | 竟無 |
歐陽 | 遵詮 |
歐陽 | 文煥 |
歐游 | 阿菊 |
We propose two solutions to solve the case.
The first solution consists in using the function “sinogram_to_py” in the “enpchina” package. It works very well, but since it relies on a dictionary that lists all the possible pronunciations of characters, the final result is not ideal.
library(enpchina)
Case7 %>% mutate(Surname_py = sinograms_to_py(c(Original_S))) %>%
mutate(GivenName_py = sinograms_to_py(c(Original_GN)))
The second solution consists in using the package pinyin. There are still some issues with characters that have more than one pronunciation. The transformation is made based on the most frequent transliteration, but it is not absolutely perfect either. Yet this provides an easy way to transliterate long lists of Chinese characters.
library(pinyin)
Case7 %>% mutate(Surname_py = py(c(Original_S), sep = "", other_replace = NULL, dic = pydic(method='toneless'))) %>%
mutate(GivenName_py = py(c(Original_GN), sep = "", other_replace = NULL, dic = pydic(method='toneless'))) %>%
mutate(GivenName_py = str_to_title(GivenName_py)) %>%
mutate(Surname_py = str_to_title(Surname_py))
Capitalize first letters of the names: this is a simple operation of editing, depending on how one wants to present data in a table for example
Original | Desired Output | ||
---|---|---|---|
gong | changqiao | Gong | Changqiao |
gong | qufei | Gong | Qufei |
gong | tongyuan | Gong | Tongyuan |
gong | guofu | Gong | Guofu |
gong | nian | Gong | Nian |
gong | huai | Gong | Huai |
ouyang | zujing | Ouyang | Zujing |
ouyang | jingwu | Ouyang | Jingwu |
ouyang | zunquan | Ouyang | Zunquan |
ouyang | zun | Ouyang | Zun |
ouyou | a | Ouyou | A |
Case8 <- read_delim("Case8.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case8
The solution consists in using the str_to_title function.
Case8a <- Case8 %>% mutate(Surname = str_to_title(Original_S)) %>%
mutate(GivenName = str_to_title(Original_GN))
Case8a
Reassembling Surname and Given name into a Full Name: data may come in a format that does not meet one’s needs. In this case, we have a case in which the Surname and Given name are placed in distinct columns. The objective to to merge them in a single column.
Original | Original | Desired Output |
---|---|---|
Qu | Zuhui | Qu Zuhui |
Yan | Renguang | Yan Renguang |
Nie | Guang | Nie Guang |
Yan | Kaiyuan | Yan Kaiyuan |
Yan | Daoyuan | Yan Daoyuan |
Yan | Dunjian | Yan Dunjian |
This is the original file with the Surname and Given name in two distinct columns.
Case9 <- read_csv("Case9.csv", trim_ws = FALSE)
Case9
The solution consists in using the “paste” function with the indication of a separator (here white space)
Case9a <- Case9 %>% mutate(FullName = paste(Original, Original2, sep = " "))
Case9a
Split province and location: this is a recurrent motif in data transformation, especially in Chinese where locations (e.g., birth place) are given as a compound name (江蘇無錫,江蘇省無錫縣). We want to extract the exact name of the location and retain the designation of its administrative level.
Original | Desired Output | Desired Output |
---|---|---|
江蘇省寳山縣 | 江蘇 | 寳山縣 |
浙江省紹興縣 | 浙江 | 紹興縣 |
江西省萬安縣 | 江西 | 萬安縣 |
湖南省王山 | 湖南 | 省王山 |
河南省 | 河南 | |
東京市 | 東京 | 市 |
湖南長沙 | 湖南 | 長沙 |
廣東省興寧縣 | 廣東 | 興寧縣 |
江蘇省杭縣 | 江蘇 | 杭縣 |
浙江省紹興縣 | 浙江 | 紹興縣 |
This is the original file with the name of the province and location in a single column
Case10 <- read_csv("Case10.csv", trim_ws = FALSE)
Case10
We propose various modes of splitting and extracting the information:
A. To split and extract the name of the province based on a pattern: with this method we define the list of terms of be extracted and we search the file for matching terms. This is quite useful when the number of items in the list is limited.
Case10A <-Case10 %>% mutate (Province = str_extract(Original,
"(臺灣|福建|浙江|遼寧|湖南|河北|安徽|湖北|江西|山東|四川河南|廣西|陝西|貴州|山西|吉林|黑龍江|安東|雲南|海南島|新疆|廿肅|察哈爾|龍江|靑海|江蘇|廣東|四川|河南)"))
Case10A
B. To extract only the administrative levels: we proceed to extract the designation of administrative levels based on a pattern (here the name of the level: 省, 縣)
Case10B <-Case10 %>% mutate (Prov = str_extract(Original, "省")) %>%
mutate (Xian = str_extract(Original, "縣"))
Case10B
C. To extract the name of the Location: here we do it by removing anything before the character 省
Case10C <- Case10 %>% mutate(Location=str_replace(Original, "[^省]+省",""))
Case10C
D. To extract the name of the Province: we do it by extracting anything before the character 省
Case10D <- Case10C %>% mutate(Province=str_extract(Original,"[^省]+"))
Case10D
E. To extract the name of the Location and the Province (by combining the two formula C & D )
Case10E <- Case10 %>% mutate(Location=str_replace(Original, "[^省]+省","")) %>%
mutate(Province=str_extract(Original,"[^省]+"))
Case10E
Split Institution and Position: the objective is to extract separately the name of the positions and the name of the institutions listed in a single column as a single compound of characters in Chinese.
Original | Desired Output | Desired Output |
---|---|---|
國民黨中央黨部社會部長兼特工總部主任 | 國民黨中央黨部社會部 | 部長 |
國民黨中央黨部社會部長兼特工總部主任 | 特工總部 | 主任 |
四川省政府主席 | 四川省政府 | 主席 |
蒙古聯合自治政府副主席 | 蒙古聯合自治政府 | 副主席 |
新中國日報總主筆 | 新中國日報 | 總主筆 |
龍煙鐵礦株式會社理事長 | 龍煙鐵礦株式會社 | 理事長 |
西南建設委員會秘書長 | 西南建設委員會 | 秘書長 |
民衆敎育幹部訓練所長 | 民衆敎育幹部訓練所 | 所長 |
神州大學敎務主任 | 神州大學 | 神敎務主任 |
This is the original file with the name of the institutions and positions in a single column
Case11 <- read_delim("Case11.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case11
We proceed in several steps to take into account the presence of not just a single position, but sometimes dual positions.
First, we separate the rows with two or more positions into distinct rows. The separator here is “兼”, but it could also be “及”, “並” or any form of Chinese punctuation (as in case 12).
Case11S <- Case11 %>% separate_rows(Original, sep="兼")
Case11S
Second, we split the positions based on pattern matching. This approach is similar to Case10A in case 10. It relies on a pre-defined list of terms.
Case11A <-Case11S %>% mutate (Position = str_extract(Original,
"主任|部長|副主席|總主筆|理事長|秘書長|所長|敎務主任|主席"))
Case11A
Third, we extract the name of the institution. Do achieve this, we use the content of the “Position” column to remove all the names of positions in the Original column. This create a new Institution column.
Case11B <- Case11A %>% mutate(Institution = str_remove_all(Original, Position))
Case11B
This is about splitting the various sequences of positions
This is the original table with two columns. The second column contains information that needs to be split up so that each piece of information in Information is placed in a distinct row linked to the item (person) listed in Person column. The splitting is based on the elements of punctuation in the sentences. This can apply to any other form of punctuation or separator, including characters that may indicate a new item (e.g., in Chinese 及, 並, 兼, etc.). Note that the output for information could be further refined to separate the terms after の as we show in case 13 below.
Person | Information |
---|---|
楊壽枬 | 國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長 |
沈士遠 | 浙江高等師範學校、國立北京大學、燕京大學の敎授 |
陳其釆 | 故陳其美の實弟、陳果夫、陳立夫の伯父 |
This is what we expect to achieve:
Original | Original | Desired Output | Desired Output |
---|---|---|---|
楊壽枬 | 國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長 | 楊壽枬 | 國民政府水利委員會委員長 |
楊壽枬 | 前清時代の舉人出身 | ||
楊壽枬 | 長蘆鹽運使 | ||
楊壽枬 | 廣東海關監督 | ||
楊壽枬 | 山東財政廳長 | ||
楊壽枬 | 財政次長 | ||
陳其釆 | 故陳其美の實弟、陳果夫、陳立夫の伯父 | 陳其釆 | 陳果夫 |
陳其釆 | 陳立夫の伯父 | ||
陳其釆 | 故陳其美の實弟 | ||
沈士遠 | 浙江高等師範學校、國立北京大學、燕京大學の敎授 | 沈士遠 | 浙江高等師範學 |
沈士遠 | 國立北京大學 | ||
沈士遠 | 燕京大學の敎授 |
Case12 <- read_delim("Case12.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case12
Case12A <- Case12 %>% separate_rows(Information, sep="。|、")
Separate a set of attributes attached to an individual in Japanese.
Original | Information |
---|---|
施肇基 | 唐紹儀の甥。施肇曾の弟 |
沈士遠 | 燕京大學の敎授 |
聶其杰 | 曾國藩の孫 |
This is what we expect to achieve:
Desired Output | Desired Output | Desired Output |
---|---|---|
施肇基 | 唐紹儀 | 甥 |
施肇基 | 唐紹儀 | 弟 |
聶其杰 | 曾國藩 | 孫 |
沈士遠 | 燕京大學 | 敎授 |
Case13 <- read_delim("Case13.csv", ",", escape_double = FALSE,
trim_ws = TRUE)
Case13
We proceed in two steps: 1. First, separate row with dual information 2. Second, extract the information through pattern matching This is done in a single operation that separates, extracts the data, and place it in a new column (Status)
Case13S <- Case13 %>% separate_rows(Information, sep="。") %>%
mutate (Status = str_extract(Information, "甥|弟|孫|敎授"))
Separate a set of attributes attached to an individual in distinct columns: in the present case, we want to split up the information in the Information column and distribute it in distinct columns.
Original | Information |
---|---|
楊壽枬 | 國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長 |
沈士遠 | 浙江高等師範學校、國立北京大學、燕京大學の敎授 |
This is what we expect to achieve:
Desired Output | Desired Output | Desired Output | Desired Output | Desired Output | Desired Output | Desired Output |
---|---|---|---|---|---|---|
楊壽枬 | 國民政府水利委員會委員長 | 前清時代の舉人出身 | 長蘆鹽運使 | 廣東海關監督 | 山東財政廳長 | 財政次長 |
沈士遠 | 浙江高等師範學校 | 國立北京大學 | 燕京大學の敎授 |
Case14 <- read_delim("Case14.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case14
Case14S <- Case14 %>% separate(Information, c("Pos1", "Pos2", "Pos3", "Pos4"), sep="。|、", extra = "drop")
Split transliterated Full name and Chinese Full name: spliting could be done on the basis of blank spaces, but if there is more than one blank space, the splitting will produce unwanted separations. We turn to splitting based on the type of script (here Chinese characters)
Original |
---|
Tsai-ch’un 載淳 |
Wên-hsiang 文祥 |
Ch’ung-shih 崇實 |
Ch’ên Li 陳澧 |
Chao Chih-ch’ien 趙之謙 |
Tso Tsung-t’ang 左宗棠 |
This is what we expect to achieve:
Desired Output | Desired Output |
---|---|
載淳 | Tsai-ch’un |
文祥 | Wên-hsiang |
崇實 | Ch’ung-shih |
陳澧 | Ch’ên Li |
趙之謙 | Chao Chih-ch’ien |
左宗棠 | Tso Tsung-t’ang |
Case15 <- read_delim("Case15.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case15
Case15B <- Case15 %>% extract(Original, c("NameZh"), "(\\p{Han}+)", remove=FALSE) %>%
mutate (NameTrans = str_remove_all(Original, NameZh))
Split transliterated Full name and Chinese Full name (no white space): splitting is based on the type of script and it works regardless of the presence or absence of any separator
Tsai-ch’un載淳 | 載淳 |
---|---|
Wên-hsiang文祥 | 文祥 |
Ch’ung-shih崇實 | 崇實 |
Ch’ên Li陳澧 | 陳澧 |
Chao Chih-ch’ien趙之謙 | 趙之謙 |
Tso Tsung-t’ang左宗棠 | 左宗棠 |
This is what we expect to achieve:
Desired Output | Desired Output |
---|---|
載淳 | Tsai-ch’un |
文祥 | Wên-hsiang |
崇實 | Ch’ung-shih |
陳澧 | Ch’ên Li |
趙之謙 | Chao Chih-ch’ien |
左宗棠 | Tso Tsung-t’ang |
Case16 <- read_delim("Case16.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case16
Case16A <- Case16 %>% extract(Original, c("NameZh"), "(\\p{Han}+)", remove=FALSE) %>%
mutate (NameTrans = str_remove_all(Original, NameZh))
Transform Wade-Giles transliteration into Pinyin transliteration: the conversion of names in Wade & Giles transliteration to pinyin can come handy when dealing with names extracted from Western sources (including academic literature) published before 1981 when pinyin became the norm for academic publishing.
Original | Desired Output |
---|---|
Tsai-ch’un | Zai Chun |
Wên-hsiang | Wen Xiang |
Ch’ung-shih | Chong Shi |
Ch’ên Li | Chen Li |
Chao Chih-ch’ien | Zhao Shiqian |
Tso Tsung-t’ang | Zuo Zongtang |
Case17 <- read_delim("Case17.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case17
Because the library that we use still produces some minor errors, we need to make additional corrections to the initial output of the “wade_to_py” function. This will be corrected in due time.
Case17A <- Case17 %>% mutate(NameWG = wade_to_py(Original)) %>%
mutate(NameWG2 = str_remove_all(NameWG, "-")) %>%
mutate(NameWG3 = str_remove_all(NameWG2, "[:punct:]")) %>%
mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>%
mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>%
mutate(NameWG5 = str_replace(NameWG4, "Chê", "Che"))
Change all the names in capitals into names with a capital letter on the first letter of each word only. This is a simple, but useful edit operation to adapt the format of names. This can be used for the transformation of titles, etc.
Original | Desired Output |
---|---|
CH’EN KUO-JUI | Ch’ên Kuo-jui |
CHAO CHIH-CH’IEN | Chao Chih-ch’ien |
TSO TSUNG-T’ANG | Tso Tsung-t’ang |
CHIN HO | Chin Ho |
PAO CH’AO | Pao Ch’ao |
Case18 <- read_delim("Case18.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case18
Case18A <- Case18 %>% mutate(Name = str_to_title(Original))
Change all the names into capital letters: this is also an edit operation doing the opposite of Case 18.
Original | Desired Output |
---|---|
CH’EN KUO-JUI | Ch’ên Kuo-jui |
CHAO CHIH-CH’IEN | Chao Chih-ch’ien |
TSO TSUNG-T’ANG | Tso Tsung-t’ang |
CHIN HO | Chin Ho |
PAO CH’AO | Pao Ch’ao |
Case19 <- read_delim("Case19.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case19
Split the elements of information in the original “sentence” into distinct columns: in this particular case, the commas are followed by a white space, which we included in the split formula.
Original | Desired Output | Desired Output | Desired Output | Desired Output | Desired Output | Desired Output | Desired Output | Desired Output |
---|---|---|---|---|---|---|---|---|
Chow, T. C. (1922), 周贊衡(柱臣), 北平地質調查研 | Chow, T. C. (1922) | 周贊衡(柱臣) | 北平地質調查研 | Chow | T. C. | (1922) | 1922 | 北平地質調查研 |
Case20 <- read_delim("Case20.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case20
To achieve the desired output above, we need to proceed in successive steps to split up and refine the information until we reach the desired distribution of data.
First, split up the information based on separators. In this case, commas are followed by a white space, which we include in the split formula.
Case20S <- Case20 %>% separate(Original, c("Surname", "InitDate", "NameZhFull", "Institution"), sep=", ", remove = FALSE)
Second, extract everything before the opening parenthesis
Case20B <- Case20S %>% tidyr::extract(InitDate, c("GivenName"), "([^(]+),", remove=FALSE)
Third, extract the initials of the person name
Case20C <- Case20S %>% tidyr::extract(InitDate, c("Initials"), "([^,]+)\\(", remove=FALSE)
Fourth, extract the year
Case20D <- Case20C %>% tidyr::extract(InitDate, c("Date"), "([[:digit:]]+)", remove=FALSE)
Fifth, extract the Chinese full name
Case20E <- Case20D %>% tidyr::extract(Original, c("NameZhFull"), ",([^()]+),", remove=FALSE)
Sixth, extract the Chinese name before parenthesis
Case20F <- Case20E %>% tidyr::extract(NameZhFull, c("NameZh"), "([\\p{Han}]+)", remove=FALSE)
Seventh, extract the Chinese name between parenthesis - Solution 1
Case20G <- Case20F %>% tidyr::extract(NameZhFull, c("NameZhAlt"), "(([\\p{Han}]+))", remove=FALSE) %>%
tidyr::extract(NameZhAlt, c("NameZhAlt"), "(\\p{Han}+)", remove=FALSE)
Eighth, extract the Chinese name between parenthesis - Solution 2
Case20Gca <- Case20F %>% mutate(NameAlt = str_remove_all(NameZhFull, NameZh)) %>%
tidyr::extract(NameAlt, c("NameAlt"), "([\\p{Han}]+)", remove=FALSE)
Split the elements of information in the original sentence" into distinct columns
Grabau, A. W. (1922), The Geological Survey of China, Peiping. |
---|
Barbour, G. B. (1922), 4 Clareville Grove, London SW. 7, England. |
Fortuyn, A. B. D. (1931) , Department of Anatomy, P. U. M. C., Peiping. |
Garretson, M. W.(Mrs.) (1925), 26, Greendale Ave, Mount Vernoon, N. Y., U. S. A. |
This is what we expect to achieve:
Desired Output | ||||||||
---|---|---|---|---|---|---|---|---|
Grabau, A. W. (1922) | Grabau, A. W. | Grabau | Grabau, A. W. | (1922) | 1922 | The Geological Survey of China | Peiping. | |
Barbour, G. B. (1922) | Barbour, G. B. | Barbour, G. B. | G. B. | (1922) | 1922 | 4 Clareville Grove | London | SW. 7 |
Fortuyn, A. B. D. (1931) | Fortuyn, A. B. D. | Fortuyn | A. B. D. | (1931) | 1931 | Department of Anatomy | P. U. M. C. | Peiping. |
Garretson, M. W.(Mrs.) (1925), 26, Greendale Ave, Mount Vernoon, N. Y., U. S. A. | Garretson, M. W.(Mrs.) (1925), 26, Greendale Ave, Mount Vernoon, N. Y., U. S. A. | Garretson, M. W.(Mrs.) (1925) | Garretson, M. W.(Mrs.) | Garretson, M. W. | Garretson | M. W. | 26, Greendale Ave | Mount Vernoon |
Case21 <- read_delim("Case21.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case21
Because this is a complex sentence, we need to proceed in several successive steps. Some steps only serve to isolate a chunk of information on which further splitting is applied.
Extract the Surname and initials
Case21SI <- Case21 %>% tidyr::extract(Original, c("SurnIt"), "([^(]+)", remove=FALSE)
Extract the date between parenthesis
Case21Y <- Case21SI %>% tidyr::extract(Original, c("Year"), "([[:digit:]]+)", remove=FALSE)
Extract everything after the closing parenthesis and comma -> not perfect
Split the items separated by commas in Information
Split the elements of information in a complex bibliographical reference into distinct columns
Original |
---|
CAI Shicun (TSAI Shih-chun ; TSHAI Che-tchhwen) (dossier d’archives A-122) 蔡時椿 Incidents ou accidents consécutifs aux injections de lipiodol employées comme moyen de diagnostic des tumeurs intra-rachidiennes Lyon : Bosc Frères & Riou, 1929 71 p., 26 cm Thèse : Médecine Cote : CH TH 021 |
This is what we expect to achieve:
Desired Output | Desired Output | Desired Output | Desired Output | Desired Output | Desired Output | Desired Output | Desired Output | Desired Output |
---|---|---|---|---|---|---|---|---|
CAI Shicun | (TSAI Shih-chun ; TSHAI Che-tchhwen) | (dossier d’archives A-122) Bosc Frères & Riou, 192971 p., 26 cmThèse : MédecineCote : CH TH 024 | 蔡時椿 | Incidents ou accidents consécutifs aux injections de lipiodol employées comme moyen de diagnostic des tumeurs intra-rachidiennes | Lyon | Bosc Frères & Riou | 1929 | 71 p. |
Case22 <- read_delim("Case22.csv", ";", escape_double = FALSE,
trim_ws = TRUE)
Case22
Because this is a complex sentence, we need to proceed in several successive steps. Some steps only serve to isolate a chunk of information on which further splitting is applied.
Extract the name in pinyin before the first opening parenthesis
Case22A <- Case22 %>% tidyr::extract(Original, c("NamePY"), "([^(]+)", remove=FALSE)
Extract the text in the groups between parenthesis
Case22B <- Case22A %>% mutate(Information = str_remove_all(Original, NamePY))
Extract the text alone from the groups between parenthesis
Case22C <- Case22B %>% tidyr::extract(Information, c("NameWG"), "([^\\(\\)]+)", remove=FALSE)
Case22D <- Case22C %>% mutate(Information2 = str_remove_all(Information, NameWG))
Suppress the white space before the second opening parenthesis
Case22E <- Case22D %>% mutate(Information2 = str_remove_all(Information, NameWG)) %>%
mutate(Information2 = str_remove_all(Information2, "(\\(\\) )"))
Extract the source reference
Case22F <- Case22E %>% tidyr::extract(Information2, c("Source"), "([^\\(\\)]+)", remove=FALSE)
Extract the Chinese name
Case22G <- Case22F %>% tidyr::extract(Information2, c("NameZh"), "([\\p{Han}]+)", remove=FALSE)
Extract the dissertation title and references (caught with the Chinese name)
Case22H <- Case22G %>% tidyr::extract(Original, c("Dissertation"), "([\\p{Han}].+)", remove=FALSE)
Remove the Chinese name
Case22H2 <- Case22H %>% mutate(Dissertation2 = str_remove_all(Dissertation, "([\\p{Han}]+ )"))
Extract positions based on a list of positions
name | work |
---|---|
何素璞 | 世界紅卍字會會長。濟南道院統掌 |
陳群 | 中央政治委員會委員。國民政府內政部長 |
張群 | 最高國防會議秘書長、西南建設委員會秘書長、行政院秘書長、四川省政府主席、國民精神總動員委員會秘書長 |
諸靑來 | 國民政府交通部長。上海銀行週報。大厦、持志、光華各大學敎授、神州大學敎務主任 |
黃敏中 | 國民政府敎育部常務次長。國民黨員、浙江省黨部常務委員 |
楊壽枬 | 國民政府水利委員會委員長。前清時代の舉人出身。長蘆鹽運使、廣東海關監督、山東財政廳長、財政次長 |
許繼祥 | 國民政府海軍部常務次長。北京政府時代海軍部軍法司長、全國海道測量局長、全國海岸巡防處長 |
This is what we expect to achieve:
name | work | Position | Institution |
---|---|---|---|
何素璞 | 世界紅卍字會會長 | 會長 | 世界紅卍字會 |
何素璞 | 濟南道院統掌 | 統掌 | 濟南道院 |
陳群 | 中央政治委員會委員 | 委員 | 中央政治會 |
陳群 | 國民政府內政部長 | 部長 | 國民政府內政 |
張群 | 最高國防會議秘書長 | 秘書長 | 最高國防會議 |
張群 | 西南建設委員會秘書長 | 秘書長 | 西南建設委員會 |
張群 | 行政院秘書長 | 秘書長 | 行政院 |
張群 | 四川省政府主席 | 主席 | 四川省政府 |
張群 | 國民精神總動員委員會秘書長 | 秘書長 | 國民精神總動員委員會 |
諸靑來 | 國民政府交通部長 | 部長 | 國民政府交通 |
諸靑來 | 上海銀行週報編輯 | 編輯 | 上海銀行週報 |
諸靑來 | 光華各大學敎授 | 敎授 | 光華各大學 |
諸靑來 | 神州大學敎務主任 | 主任 | 神州大學敎務 |
黃敏中 | 國民政府敎育部常務次長 | 常務次長 | 國民政府敎育部 |
黃敏中 | 國民黨員 | 黨員 | 國民 |
黃敏中 | 浙江省黨部常務委員 | 常務委員 | 浙江省黨部 |
楊壽枬 | 國民政府水利委員會委員長 | 委員長 | 國民政府水利委員會 |
楊壽枬 | 長蘆鹽運使 | 運使 | 長蘆鹽 |
楊壽枬 | 廣東海關監督 | 監督 | 廣東海關 |
楊壽枬 | 山東財政廳長 | 廳長 | 山東財政 |
楊壽枬 | 財政次長 | 次長 | 財政 |
許繼祥 | 國民政府海軍部常務次長 | 常務次長 | 國民政府海軍部 |
許繼祥 | 北京政府時代海軍部軍法司長 | 司長 | 北京政府時代海軍部軍法 |
許繼祥 | 全國海道測量局長 | 局長 | 全國海道測量 |
許繼祥 | 全國海岸巡防處長 | 處長 | 全國海岸巡防 |
Case23 <- read_delim("Case23.csv", ",", escape_double = FALSE,
trim_ws = TRUE)
Case23
We import the list of positions
PosList <- read_csv("PosList.csv")
We create a vector that lists all the names of positions. The original data in the Position column in the PosList file is turned into a vector.
positions_reference <- unlist(PosList)
We create a loop by which each position name will be searched to find a match in the Case23 file.
positions_reference_vec <- paste0("(", paste(positions_reference, sep = "", collapse = "|"), ")$")
We split multiple positions into separate rows
Case23A <- Case23 %>% separate_rows(work, sep="。|、")
We extract the positions
Case23B <- Case23A %>% mutate(Position = str_extract(work, positions_reference_vec))
We create the Institution column by removing the whole set of positions from the original work column