7 Chinese-specific functions
A key feature of HistText is to provide a set of functions designed to process documents in “transitional Chinese,” a term that we coined to refer to the Chinese language as it evolved from the near-classical language of the administration and imperial publications from the 1850s to the near-contemporary Chinese of the late 1940s (Blouin et al. 2023).
7.1 Tokenization
Tokenization refers to the operation of segmenting a text into tokens, which are the most elementary semantic units in a text. This is a crucial step for text analysis. Currently, two models are available in HistText, as listed below. The trftc_shunpao:zh:cws model is based on the initial annotation campaign conducted by Huang Hen-hsen (Academia Sinica) in 2021. The trftc_shunpao_23:zh:cws model is a refined model based on a second annotation campaign conducted by the ENP-China project in 2023 (Blouin et al. 2023).
list_cws_models()
The tokenizer can be applied to a corpus built with HistText (see section X) using the function cws_on_corpus, or it can be used directly on a specific data frame provided by the researcher using the function cws_on_df.
Below we provide an example for each case.
7.1.1 Corpus
The cws_on_corpus function includes the following arguments:
- docids: the ‘DocId’ column returned by the search_documents function.
- corpus: the corpus to be used, chosen from the available corpora in MCTB.
- model: allows you to select the specific model to be used. If not specified, it defaults to the model set for the chosen corpus.
- field: the field containing the input text to be tokenized.
- detailed_output: When set to TRUE, it returns a data frame with one row per token, including the position in the text and confidence scores.
- token_separator: specifies the character used to separate each token. By default, a single white space is used.
cws_on_corpus(
  docids,
  corpus = "__",
  model = "__default__",
  field = "__default__",
  detailed_output = FALSE,
  token_separator = " ",
  batch_size = 10,
  verbose = TRUE
)
Below is an example to illustrate how the function works:
# create sample corpus
sample_corpus <- histtext::search_documents('"共產黨員"', "imh-zh")

# tokenize the corpus
tokenized_corpus <- histtext::cws_on_corpus(sample_corpus$DocId, "imh-zh", detailed_output = FALSE)

tokenized_corpus
kable(tokenized_corpus, caption = "Tokenized corpus") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
If you wish to display the detailed output:
tokenized_corpus_detailed <- histtext::cws_on_corpus(sample_corpus$DocId, "imh-zh", detailed_output = TRUE)
kable(tokenized_corpus_detailed, caption = "Tokenized corpus with detailed output") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
7.1.2 Data frame
The function cws_on_df follows a similar structure:
- df: the data frame containing the texts to tokenize.
- text_column: name of the column that contains the input text to be tokenized.
- id_column: the column used to associate ids with the CWS outputs. By default, the row index in ‘df’ is used.
- model: the model to be used (use ‘list_cws_models()’ to get the available models).
- detailed_output: if TRUE, returns a data frame with one row per token (with positions and confidence scores).
- token_separator: the character used to separate each token. By default, a single white space is used.
cws_on_df(
  df,
  text_column,
  id_column = NULL,
  model = "trftc_shunpao_23:zh:cws",
  detailed_output = FALSE,
  token_separator = " ",
  verbose = TRUE
)
To illustrate how the tokenizer functions on a data frame, we provide a sample dataset (sample_df) that contains four documents extracted from the Shenbao between 1874 and 1889. The data frame includes five columns: DocId, Date, Title, Source, Text.
kable(sample_df, caption = "Sample dataframe") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
We apply the tokenizer on the “Text” column of the sample data frame:
tokenized_df <- cws_on_df(
  sample_df,
  text_column = "Text",
  id_column = "DocId",
  model = "trftc_shunpao_23:zh:cws",
  detailed_output = FALSE,
  token_separator = " ",
  verbose = TRUE
)
The function returns a data frame with two columns, one with the tokenized text (Text) and another with the original document ids (DocId):
kable(tokenized_df, caption = "Tokenized dataframe") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
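Once tokenized, the Text column holds tokens joined by the chosen token_separator, so standard R string functions can recover the individual tokens. Below is a minimal sketch using a hypothetical tokenized string (not actual HistText output), assuming the default single-space separator:

```r
# Hypothetical tokenized output, with tokens separated by a single
# white space as produced by the default token_separator
tokenized_text <- "共產黨 員 在 上海 開 會"

# Split on the separator to recover the individual tokens
tokens <- strsplit(tokenized_text, " ", fixed = TRUE)[[1]]

length(tokens)  # number of tokens
table(tokens)   # token frequencies
```

The same pattern can be applied column-wise (e.g. with sapply or dplyr::mutate) to compute token counts for every document in the tokenized data frame.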
7.2 Conversions
Historical sources use various forms of transliteration of Chinese characters. The function wade_to_py converts the standard (but obsolete) Wade-Giles transliteration system into pinyin. In the example below, some issues remain in the conversion output, which we correct with a few additional lines of code. These issues will be addressed in the near future.
library(readr)
library(dplyr)
library(stringr)

wgconv <- read_csv("wgconv.csv")

wgconv %>%
  mutate(NameWG = wade_to_py(Original)) %>%
  mutate(NameWG2 = str_remove_all(NameWG, "-")) %>%
  mutate(NameWG3 = str_remove_all(NameWG2, "[:punct:]")) %>%
  mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>%
  mutate(NameWG5 = str_replace(NameWG4, "Chê", "Che")) %>%
  select(Original, NameWG5) %>%
  rename(output = NameWG5)
## # A tibble: 6 × 2
## Original output
## <chr> <chr>
## 1 Tsai-ch'un Zaichun
## 2 Wên-hsiang Wenxiang
## 3 Ch'ung-shih Chongshi
## 4 Ch'ên Li Chen Li
## 5 Chao Chih-ch'ien Zhao Zhiqian
## 6 Tso Tsung-t'ang Zuo Zongtang
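If the same corrections are needed repeatedly, the post-processing steps can be collected into a small helper applied to the output of wade_to_py. The function below is a sketch under that assumption; clean_py is a hypothetical name, and the replacement rules simply mirror the stringr calls above:

```r
library(stringr)

# Hypothetical helper: post-processing for wade_to_py() output,
# reproducing the correction steps from the pipeline above
clean_py <- function(x) {
  x |>
    str_remove_all("-") |>          # drop syllable hyphens
    str_remove_all("[:punct:]") |>  # drop remaining punctuation (e.g. apostrophes)
    str_replace("uê", "We") |>      # fix residual "uê"
    str_replace("Chê", "Che")       # fix residual "Chê"
}

clean_py("Chê-hsi")  # hyphen removed, "Chê" corrected to "Che"
```

A helper like this keeps the mutate chain to a single call, e.g. mutate(output = clean_py(wade_to_py(Original))).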