7 Chinese-specific functions

A key feature of HistText is a set of functions designed to process documents in “transitional Chinese,” a term we coined to refer to the Chinese language as it evolved from the near-classical language of administration and imperial publications of the 1850s to the near-contemporary Chinese of the late 1940s (Blouin et al. 2023).

7.1 Tokenization

Tokenization is the operation of segmenting a text into tokens, its most elementary semantic units, and is a crucial step for text analysis. Two models are currently available in HistText, as listed below. The trftc_shunpao:zh:cws model is based on the initial annotation campaign conducted by Huang Hen-hsen (Academia Sinica) in 2021. The trftc_shunpao_23:zh:cws model is a refinement based on a second annotation campaign conducted by the ENP-China project in 2023 (Blouin et al. 2023).

list_cws_models()
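## Both models described above should appear in the returned list:
## "trftc_shunpao:zh:cws" and "trftc_shunpao_23:zh:cws"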


The tokenizer can be applied to a corpus built with HistText (see section X) using the function cws_on_corpus, or directly to a data frame provided by the researcher using the function cws_on_df.

Below we provide an example for each case.

7.1.1 Corpus

The cws_on_corpus function includes the following arguments:

  • docids: the ‘DocId’ column returned by the search_documents function.
  • corpus: the corpus to be used, chosen from the available corpora in MCTB.
  • model: the specific model to be used. If not specified, it defaults to the model set for the chosen corpus.
  • field: the field containing the input text to be tokenized. If not specified, it defaults to the main text field of the chosen corpus.
  • detailed_output: when set to TRUE, returns a data frame with one row per token, including the position in the text and confidence scores.
  • token_separator: the character used to separate each token. By default, a single white space is used.
  • batch_size: the number of documents processed per batch (10 by default).
  • verbose: whether to print progress messages (TRUE by default).
cws_on_corpus(
  docids,
  corpus = "__",
  model = "__default__",
  field = "__default__",
  detailed_output = FALSE,
  token_separator = " ",
  batch_size = 10,
  verbose = TRUE
)


Below is an example to illustrate how the function works:

library(knitr)
library(kableExtra)

# create a sample corpus by searching for "共產黨員" (Communist Party member)
sample_corpus <- histtext::search_documents('"共產黨員"', "imh-zh")

# tokenize the documents retrieved above
tokenized_corpus <- histtext::cws_on_corpus(sample_corpus$DocId, "imh-zh", detailed_output = FALSE)

tokenized_corpus

kable(tokenized_corpus, caption = "Tokenized corpus") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE, position = "left")
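
The token_separator argument controls the delimiter inserted between tokens. As a quick sketch reusing the sample corpus above, a slash can make the token boundaries easier to see:

# illustrative variant: separate tokens with "/" instead of a white space
tokenized_slash <- histtext::cws_on_corpus(
  sample_corpus$DocId,
  "imh-zh",
  token_separator = "/"
)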


If you wish to display the detailed output:

tokenized_corpus_detailed <- histtext::cws_on_corpus(sample_corpus$DocId, "imh-zh", detailed_output = TRUE)

kable(tokenized_corpus_detailed, caption = "Tokenized corpus with detailed output") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE, position = "left")
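
Since the detailed output contains one row per token, the table can grow large; a quick way to inspect only the first rows in the console:

# peek at the first per-token rows (positions and confidence scores)
head(tokenized_corpus_detailed)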

7.1.2 Data frame

The function cws_on_df follows a similar structure:

  • df: the data frame that contains the texts to tokenize.
  • text_column: the name of the column that contains the input text to be tokenized.
  • id_column: the name of the column used to associate ids with the CWS outputs. By default, the row index in ‘df’ is used.
  • model: the model to be used (use ‘list_cws_models()’ to get the available models).
  • detailed_output: if TRUE, returns a data frame with one row per token (with positions and confidence scores).
  • token_separator: the character used to separate each token (a normal white space by default).
  • verbose: whether to print progress messages (TRUE by default).

cws_on_df(
  df,
  text_column,
  id_column = NULL,
  model = "trftc_shunpao_23:zh:cws",
  detailed_output = FALSE,
  token_separator = " ",
  verbose = TRUE
)

To illustrate how the tokenizer works on a data frame, we provide a sample dataset (sample_df) that contains four documents extracted from the Shenbao between 1874 and 1889. The data frame includes five columns: DocId, Date, Title, Source, and Text.

kable(sample_df, caption = "Sample data frame") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE, position = "left")


We apply the tokenizer to the “Text” column of the sample data frame:

tokenized_df <- cws_on_df(
  sample_df,
  text_column = "Text",
  id_column = "DocId",
  model = "trftc_shunpao_23:zh:cws",
  detailed_output = FALSE,
  token_separator = " ",
  verbose = TRUE
)


The function returns a data frame with two columns: one with the tokenized text (Text) and another with the original ids of the documents (DocId):

kable(tokenized_df, caption = "Tokenized data frame") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE, position = "left")
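
If no id_column is supplied, the function falls back on the row index of the input data frame, as noted in the argument list above. A minimal sketch:

# illustrative variant: omit id_column, so the row indices of sample_df
# serve as ids in the output
tokenized_df_rowids <- cws_on_df(sample_df, text_column = "Text")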

7.2 Conversions

Historical sources use various systems of transliteration for Chinese characters. The function wade_to_py converts the standard (but now obsolete) Wade-Giles transliteration system into pinyin. In the example below, a few issues remain in the conversion, which we correct with additional lines of code; these issues will be addressed in the near future.
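
As a minimal illustration, wade_to_py can be applied directly to a character vector; the raw output may still contain the artifacts that the pipeline below corrects:

# convert a few Wade-Giles names to pinyin
histtext::wade_to_py(c("Tso Tsung-t'ang", "Ch'ung-shih"))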

library(readr)
library(dplyr)
library(stringr)

wgconv <- read_csv("wgconv.csv")

# convert to pinyin, then remove hyphens/punctuation and fix residual characters
wgconv %>% mutate(NameWG = wade_to_py(Original)) %>%
  mutate(NameWG2 = str_remove_all(NameWG, "-")) %>%
  mutate(NameWG3 = str_remove_all(NameWG2, "[:punct:]")) %>%
  mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>%
  mutate(NameWG5 = str_replace(NameWG4, "Chê", "Che")) %>%
  select(Original, NameWG5) %>%
  rename(output = NameWG5)
## # A tibble: 6 × 2
##   Original         output      
##   <chr>            <chr>       
## 1 Tsai-ch'un       Zaichun     
## 2 Wên-hsiang       Wenxiang    
## 3 Ch'ung-shih      Chongshi    
## 4 Ch'ên Li         Chen Li     
## 5 Chao Chih-ch'ien Zhao Zhiqian
## 6 Tso Tsung-t'ang  Zuo Zongtang
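
Until these conversion issues are fixed, the corrections can be bundled into a small helper function (our own sketch, reusing the stringr operations above):

# hypothetical helper: apply wade_to_py and the manual corrections in one call
clean_wade_to_py <- function(x) {
  x %>%
    wade_to_py() %>%
    str_remove_all("-") %>%
    str_remove_all("[:punct:]") %>%
    str_replace("uê", "We") %>%
    str_replace("Chê", "Che")
}

wgconv %>% mutate(output = clean_wade_to_py(Original))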

References

Blouin, Baptiste, Hen-Hsen Huang, Christian Henriot, and Cécile Armand. 2023. “Unlocking Transitional Chinese: Word Segmentation in Modern Historical Texts.” In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities & 8th International Workshop on Computational Linguistics for Uralic Languages.