1 Introduction

This document is conceived as a practical guide for the R HistText library. The HistText library (or package) is a set of functions developed in the R programming language by the ENP-China Project for the exploration and data mining of Chinese-language and English-language digital corpora. Its main purpose is to place in the hands of historians, and more generally humanists, a set of ready-made tools to search, extract, and visualize textual data from large-scale multilingual corpora.

HistText represents the culmination of a longstanding and fruitful collaboration between historians and computer scientists that aimed at exploring machine learning in historical research. This symbiotic partnership has been instrumental in achieving optimized implementations, enhanced performance, and improved usability of HistText.

The main artisans of the HistText package are:

  1. Pierre Magistry (Ph.D.), initially a postdoctoral researcher in the ENP-China Project and now an associate professor at INALCO, laid the foundation for HistText with the creation of the R ‘enpchina’ library. This library offered essential functionalities for querying documents, retrieving full-text content, and extracting named entities from diverse corpora.
  2. Jeremy Auguste (Ph.D.), postdoctoral researcher in the ENP-China Project, refined the functionalities of HistText. He focused particularly on improving the ‘extended’ search and concordance features, introducing filters that narrow results more precisely by time, publication, field, and other metadata. Additionally, Auguste spearheaded the development of the user interface in R-Shiny, designed for users who do not program.
  3. Baptiste Blouin, then a Ph.D. candidate, contributed to enhancing the named entity recognition (NER) capabilities for Chinese sources. As a postdoctoral researcher, Blouin further advanced HistText into a comprehensive application. He made significant contributions to improving the R-Shiny interface, incorporating a diverse array of data visualizations that enhance the user experience. Behind the scenes, he also organized several annotation campaigns focused on tokenization, named entity recognition, and event extraction in Chinese historical sources.

We initially developed this library to facilitate the exploration and extraction of data from the resources collected in the course of the ENP-China project, but as the HistText library developed, we realized that it could have a broader use. Its functions can be applied to any corpus of a similar nature, provided it meets three basic requirements:

  • to be stored on a Solr server
  • to be full text
  • to be fully segmented

We developed the HistText library because providers of historical sources, even when those sources are available online, offer only very limited search functions. The HistText package features advanced capabilities for querying and extracting data from any available field (title, text, section) in a given corpus and for filtering by type of field and by date.

Using the HistText library requires basic R skills: creating a project, loading libraries, and running a script. The HistText library does the heavy lifting; all the user needs to do is substitute their own query terms in the scripts that we provide below. Because we wanted to provide concrete examples of how the functions operate and how to write a proper script, we wrote this manual as a Markdown document, so the user can simply copy and paste the code to reproduce the proposed operations.

The HistText library is also available on GitLab for those interested in applying the same set of functions to their own corpora. This requires the skills to connect the library to the relevant corpora on a given server, which can be done easily with the help of a computer scientist.

For an in-depth presentation and discussion of HistText’s history, architecture, and broader contribution to the field of computational humanities, please refer to our paper (Blouin, Henriot, and Armand 2023).

The complete list of functions is available in the appendix.

References

Blouin, Baptiste, Christian Henriot, and Cécile Armand. 2023. “HistText: An Application for Leveraging Large-Scale Historical Textbases.” Journal of Data Mining and Digital Humanities. https://shs.hal.science/halshs-04178820.