1 Introduction

This document is conceived as a practical guide for the R HistText library. The HistText library (or package) is a set of functions developed by the ENP-China Project in the R programming language designed for the exploration and data mining of Chinese-language and English-language digital corpora. Its main purpose is to place in the hands of historians, and more generally humanists, a set of ready-made tools to search, extract, and visualize textual data from large-scale multilingual corpora.

HistText represents the culmination of a longstanding and fruitful collaboration between historians and computer scientists that aimed at exploring machine learning in historical research. This symbiotic partnership has been instrumental in achieving optimized implementations, enhanced performance, and improved usability of HistText.

The main artisans of the HistText package are:

  1. Pierre Magistry (Ph.D., initially a postdoctoral researcher in the ENP-China Project, now an associate professor at INALCO) laid the foundation for HistText with the creation of the R ‘enpchina’ library. This library offered essential functionalities for querying documents, retrieving full-text content, and extracting named entities from diverse corpora.
  2. Jeremy Auguste (Ph.D., postdoctoral researcher in the ENP-China Project) refined the functionalities of HistText. He focused particularly on improving the ‘extended’ search and concordance features, introducing filters that narrow down results more precisely by time, publication, field, and other metadata. Additionally, Auguste spearheaded the development of the user interface in R-Shiny, designed to cater to non-programming users.
  3. Baptiste Blouin, then a Ph.D. candidate, contributed to enhancing the named entity recognition (NER) capabilities for Chinese sources. As a postdoctoral researcher, Blouin further advanced HistText into a comprehensive application. He made significant contributions to improving the R-Shiny interface, incorporating a diverse array of data visualizations that enhance the user experience. Behind the scenes, Blouin also organized several annotation campaigns focused on tokenization, named entity recognition, and event extraction in Chinese historical sources.

We initially developed this library to facilitate the exploration and extraction of data from the resources collected in the course of the ENP-China project, but as the HistText library developed, we realized that it could have broader uses. Its functions can be applied to any corpus of a similar nature, provided it meets three basic requirements:

  • to be stored on a SolR server
  • to be full text
  • to be fully segmented

We developed the HistText library because the providers of historical sources, even when the sources are available online, offer only very limited search functions. The HistText package features advanced capabilities for querying and extracting data from any available field (title, text, section) in a given corpus, and for filtering by type of field and by date.

Using the HistText library requires basic skills in R, such as creating a project, loading libraries, and running a script. The heavy work, however, is done by the HistText library: all the user needs to do is substitute the query terms in the scripts that we provide below. Because we wanted to provide concrete examples of how the functions operate and how to write a proper script, we wrote this manual as a Markdown document, so that the user can simply copy and paste the code to reproduce the proposed operations.

The HistText library is also available on GitLab [https://gitlab.com/histtext] for those interested in implementing the same set of functions and applying them to their own corpora. This requires the skills to connect the library to the relevant corpora on a given server, which can be done easily with the help of a computer scientist.

For an in-depth presentation and discussion of HistText’s history, architecture, and broader contribution to the field of computational humanities, please refer to our paper (Blouin, Henriot, and Armand 2023).

The complete list of functions is available in the appendix.

2 Set Up

2.1 Installation and configuration

devtools::install_gitlab("enpchina/histtext-r-client", auth_token = "replace with your gitlab token")

Configuration of the package (replace fields with actual server information)

histtext::set_config_file(domain = "https://rapi.enpchina.eu", 
                          user = "user_info", password = "user_info_password")
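
To avoid writing a password directly into a script, the credentials can be read from environment variables before calling set_config_file. The following is a minimal sketch, not part of the package: the helper read_histtext_credentials and the variable names HISTTEXT_USER and HISTTEXT_PASSWORD are assumptions made for this example.

```r
# Hypothetical helper: read HistText credentials from environment variables.
# The names HISTTEXT_USER and HISTTEXT_PASSWORD are assumptions chosen for
# this example; they are not defined by the histtext package.
read_histtext_credentials <- function() {
  user <- Sys.getenv("HISTTEXT_USER", unset = "")
  password <- Sys.getenv("HISTTEXT_PASSWORD", unset = "")
  if (!nzchar(user) || !nzchar(password)) {
    stop("Set HISTTEXT_USER and HISTTEXT_PASSWORD before configuring histtext.")
  }
  list(user = user, password = password)
}

# Configure only when the package is installed and both variables are set,
# so that the sketch can be run safely on any machine.
if (requireNamespace("histtext", quietly = TRUE) &&
    nzchar(Sys.getenv("HISTTEXT_USER")) &&
    nzchar(Sys.getenv("HISTTEXT_PASSWORD"))) {
  creds <- read_histtext_credentials()
  histtext::set_config_file(domain = "https://rapi.enpchina.eu",
                            user = creds$user, password = creds$password)
}
```

The variables can be set for the current session with Sys.setenv(), or stored in a user-level .Renviron file so that they do not appear in shared scripts.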

If successfully configured, the following command will return “OK”:

histtext::get_server_status()

Now you can load the library:

library(histtext)

2.2 Available Corpora

The function list_corpora serves to list all the corpora available on the Modern China Text Base created by the ENP-China Project. The corpora are stored on a SolR server. Each corpus is labeled with the specific name to be used in the search functions (see below):

histtext::list_corpora()
##  [1] "archives"             "chinajournal-pages"   "csmo-pages"          
##  [4] "dongfangzz"           "elder_workers"        "elder_workers_format"
##  [7] "imh-en"               "imh-zh"               "kmt9k"               
## [10] "ncbras"               "proquest"             "reports-en"          
## [13] "reports-fr"           "scmp-recent"          "shimingru-diary"     
## [16] "shunpao"              "shunpao-revised"      "shunpao-tok"         
## [19] "waiguozaihua"         "wikibio-en"           "wikibio-zh"          
## [22] "zhanggangdiary"
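
Because the search functions expect the exact corpus label, it can help to check a name against this list before running a query. The sketch below works on a copy of the names shown above so that it runs without a server connection; the helper check_corpus is hypothetical and not part of the package.

```r
# Corpus names copied from the list_corpora() output shown above. With a live
# connection, this vector would instead come from histtext::list_corpora().
corpora <- c(
  "archives", "chinajournal-pages", "csmo-pages", "dongfangzz",
  "elder_workers", "elder_workers_format", "imh-en", "imh-zh", "kmt9k",
  "ncbras", "proquest", "reports-en", "reports-fr", "scmp-recent",
  "shimingru-diary", "shunpao", "shunpao-revised", "shunpao-tok",
  "waiguozaihua", "wikibio-en", "wikibio-zh", "zhanggangdiary"
)

# Hypothetical helper: stop early when a corpus label is misspelled.
check_corpus <- function(name, available) {
  if (!name %in% available) {
    stop("Unknown corpus '", name, "'. Check the output of list_corpora().")
  }
  invisible(name)
}

# List every version of the Shenbao corpus by matching on the label prefix.
shunpao_variants <- grep("^shunpao", corpora, value = TRUE)
print(shunpao_variants)

# Passes silently for a valid label; a label such as "shenbao" would raise an error.
check_corpus("shunpao-revised", corpora)
```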


### Brief description

Periodicals:

  • shunpao: Chinese newspaper Shenbao 申報 (1872-1949): original version from the provider (GetHong)
  • shunpao-revised: Chinese newspaper Shenbao 申報 (1872-1949): corrected version by the ENP-China project (date formatting, correction of titles mixed up with text, segmentation of extra-long articles)
  • proquest: English-language periodicals from the ProQuest Chinese Newspapers Collection (CNC)
  • dongfangzz: Dongfang zazhi 東方雜誌 (1904-1948)
  • ncbras: Journal of the North China Branch of the Royal Asiatic Society (1858-1948)
  • chinajournal-pages: The China Journal (1904-1949) (access at page level)
  • csmo-pages: Chinese Students’ Monthly (1906-1931) (access at page level)
  • scmp-recent: South China Morning Post (1954-2000) (subset from the ProQuest collection)
  • cmj: China Medical Journal (1887-1949)
  • elder_workers_format: corpus of interviews of Shanghai workers (1953-1958) [not public]

Other printed sources:

Archives:

Wikipedia:

  • wikibio-en: corpus of biographies of individuals active in modern China extracted from Wikipedia (English)
  • wikibio-zh: corpus of biographies of individuals active in modern China extracted from Wikipedia (Chinese)

Diaries and Memoirs:

The content of the Modern China Text Base is expanding continuously. The presentation above may not reflect the most recent state of its collections.

2.2.1 Statistics

Comprehensive statistics on all the corpora have been pre-computed and are available on GitLab. For each corpus, a specific folder provides access to all the CSV tables and the visualizations (see images below).

[Figures: Statistics homepage; List of folders (ProQuest); Collection statistics for ProQuest; North China Herald corpus statistics]

3 Query functions

3.2 Basic concordance

The function search_concordance serves to explore the queried terms in their context. It takes three arguments: the queried term(s) (within quotation marks, as in the search_documents function), the targeted corpus (preceded by “corpus =”), and an optional argument, context_size, that sets the size of the context (number of characters on each side of the queried term). In the example below, we search for “扶輪社” in the Shenbao corpus and set the context size to 100:

concs <- search_concordance('"扶輪社"', corpus = "shunpao", context_size = 100)

concs


The output is similar to the table generated by the search_documents function, with three additional columns containing the queried term (Matching) and the text before (Before) and after (After). In the table above, each row no longer represents a unique document, but each occurrence of the queried term in the documents. The concordance table usually contains more rows than the table of documents, since the queried term may appear several times in the same document.
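
To make the structure of a concordance table concrete, the sketch below builds one by hand from two invented documents, using base R only. It illustrates the idea of a context window and of one row per occurrence; it is not the HistText implementation, and the helper toy_concordance and the DocId column name are assumptions made for this example.

```r
# Toy documents (invented for illustration; not taken from any corpus).
docs_toy <- data.frame(
  DocId = c("doc1", "doc2"),
  Text  = c("The Rotary Club met today. The Rotary Club elected officers.",
            "A new Rotary Club opened in Shanghai."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per occurrence of `term`, with up to
# `context_size` characters of context on each side of the match.
toy_concordance <- function(docs, term, context_size = 10) {
  rows <- list()
  for (i in seq_len(nrow(docs))) {
    text <- docs$Text[i]
    starts <- gregexpr(term, text, fixed = TRUE)[[1]]
    if (starts[1] == -1) next  # the term does not occur in this document
    for (s in starts) {
      e <- s + nchar(term) - 1  # position of the last character of the match
      rows[[length(rows) + 1]] <- data.frame(
        DocId    = docs$DocId[i],
        Before   = substr(text, max(1, s - context_size), s - 1),
        Matching = substr(text, s, e),
        After    = substr(text, e + 1, min(nchar(text), e + context_size)),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

concs_toy <- toy_concordance(docs_toy, "Rotary Club", context_size = 10)
print(concs_toy)

# Number of occurrences per document: more rows than documents.
print(table(concs_toy$DocId))
```

Here two documents yield three concordance rows, because the term occurs twice in the first document.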

3.3 Full Text Retrieval

The function get_documents serves to retrieve the full text of the documents. It relies on the results of the search_documents function and takes two arguments: the name of the variable holding the table of results, and the targeted corpus (within quotation marks):

docs_ft <- histtext::get_documents(docs, "shunpao")
## [1] 1
## [1] 11
## [1] 21
## [1] 31
## [1] 41
## [1] 51
## [1] 61
## [1] 71
## [1] 81
## [1] 91
## [1] 101
## [1] 111
## [1] 121
## [1] 131
## [1] 141
## [1] 151
## [1] 161
## [1] 171
## [1] 181
## [1] 191
## [1] 201
## [1] 211
## [1] 221
## [1] 231
## [1] 241
## [1] 251
## [1] 261
## [1] 271
## [1] 281
## [1] 291
## [1] 301
## [1] 311
## [1] 321
## [1] 331
## [1] 341
## [1] 351
## [1] 361
## [1] 371
## [1] 381
## [1] 391
## [1] 401
## [1] 411
## [1] 421
## [1] 431
## [1] 441
## [1] 451
## [1] 461
## [1] 471
## [1] 481
## [1] 491
## [1] 501
docs_eng_ft <- histtext::get_documents(docs_eng, "proquest") 
## [1] 1
## [1] 11
## [1] 21
## [1] 31
## [1] 41
## [1] 51
## [1] 61
## [1] 71
## [1] 81
## [1] 91
## [1] 101
## [1] 111
## [1] 121
## [1] 131
## [1] 141
## [1] 151
## [1] 161
## [1] 171
## [1] 181
## [1] 191
## [1] 201
## [1] 211
## [1] 221
## [1] 231
## [1] 241
## [1] 251
## [1] 261
## [1] 271
## [1] 281
## [1] 291
## [1] 301
## [1] 311
## [1] 321
## [1] 331
## [1] 341
## [1] 351
## [1] 361
## [1] 371
## [1] 381
## [1] 391
## [1] 401
## [1] 411
## [1] 421
## [1] 431
## [1] 441
## [1] 451
## [1] 461
## [1] 471
## [1] 481
## [1] 491
## [1] 501
## [1] 511
## [1] 521
## [1] 531
## [1] 541
## [1] 551
## [1] 561
## [1] 571
## [1] 581
## [1] 591
## [1] 601
## [1] 611
## [1] 621
## [1] 631
## [1] 641
## [1] 651
## [1] 661
## [1] 671
## [1] 681
## [1] 691
## [1] 701
## [1] 711
## [1] 721
## [1] 731
## [1] 741
## [1] 751
## [1] 761
## [1] 771
## [1] 781
docs_ft
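
Once the full text has been retrieved, a quick sanity check is to look at the length of each document and set aside empty ones. The sketch below uses a small invented table so that it runs without a server connection; the column names DocId, Title, and Text are assumptions about the shape of the output and should be checked against names(docs_ft).

```r
# Invented stand-in for a full-text table; the column names are assumptions.
docs_ft_toy <- data.frame(
  DocId = c("a1", "a2", "a3"),
  Title = c("First", "Second", "Third"),
  Text  = c("short text", "", "a somewhat longer text body"),
  stringsAsFactors = FALSE
)

# Length of each document in characters.
docs_ft_toy$Length <- nchar(docs_ft_toy$Text)

# Keep only documents with some text, longest first.
non_empty <- docs_ft_toy[docs_ft_toy$Length > 0, ]
non_empty <- non_empty[order(-non_empty$Length), ]
print(non_empty[, c("DocId", "Title", "Length")])
```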