1 Introduction

This document is conceived as a practical guide for the R HistText library. The HistText library (or package) is a set of functions, developed in the R programming language by the ENP-China Project, for the exploration and data mining of Chinese-language and English-language digital corpora. Its main purpose is to place in the hands of historians, and more generally humanists, a set of ready-made tools to search, extract, and visualize textual data from large-scale multilingual corpora.

HistText represents the culmination of a longstanding and fruitful collaboration between historians and computer scientists that aimed at exploring machine learning in historical research. This symbiotic partnership has been instrumental in achieving optimized implementations, enhanced performance, and improved usability of HistText.

The main architects of the HistText package are:

  1. Pierre Magistry (Ph.D.), initially a postdoctoral researcher in the ENP-China Project and now an associate professor at INALCO, laid the foundation for HistText with the creation of the R ‘enpchina’ library. This library offered essential functionalities for querying documents, retrieving full-text content, and extracting named entities from diverse corpora.
  2. Jeremy Auguste (Ph.D.), a postdoctoral researcher in the ENP-China Project, refined the functionalities of HistText. He focused particularly on improving the ‘extended’ search and concordance features, introducing filters that allow results to be narrowed down more precisely by time, publication, field, and other metadata. Auguste also spearheaded the development of the user interface in R-Shiny, designed to cater to non-programming users.
  3. Baptiste Blouin, then a Ph.D. candidate, contributed to enhancing the named entity recognition (NER) capabilities for Chinese sources. As a postdoctoral researcher, Blouin further advanced HistText into a comprehensive application. He made significant contributions to improving the R-Shiny interface, incorporating a diverse array of data visualizations that enhance the user experience. Behind the scenes, Blouin also organized several annotation campaigns focused on tokenization, named entity recognition, and event extraction in Chinese historical sources.

We initially developed this library to facilitate the exploration and extraction of data from the resources collected in the course of the ENP-China project, but as the HistText library developed, we realized that it could have a broader usage. Its functions can be applied to any corpus of a similar nature, provided it meets three basic requirements:

  • to be stored on a SolR server
  • to be full text
  • to be fully segmented

We developed the HistText library because providers of historical sources, even when the sources are available online, offer only very limited search functions. The HistText package features advanced capabilities for querying and extracting data from any available field (title, text, section) in a given corpus, and for filtering by type of field and by date.

Using the HistText library requires only basic skills in R, such as creating a project, loading libraries, and running a script. The heavy work is done by the HistText library itself: all the user needs to do is substitute the terms of the queries in the scripts that we provide below. Because we wanted to provide concrete examples of how the functions operate and how to write a proper script, we wrote this manual as a Markdown document, so that users can simply copy and paste the code to reproduce the proposed operations.

The HistText library is also available on GitLab [https://gitlab.com/histtext] for those interested in implementing the same set of functions and applying them to their own corpora. This requires connecting the library to the relevant corpora on a given server, which can be done easily with the help of a computer scientist.

For an in-depth presentation and discussion of HistText’s history, architecture, and broader contribution to the field of computational humanities, please refer to our paper (Blouin, Henriot, and Armand 2023).

The complete list of functions is available in the appendix.

2 Set Up

2.1 Installation and configuration

devtools::install_gitlab("enpchina/histtext-r-client", auth_token = "replace with your gitlab token")

Configure the package (replace the fields with your actual server information):

histtext::set_config_file(domain = "https://rapi.enpchina.eu", 
                          user = "user_info", password = "user_info_password")

If successfully configured, the following command will return “OK”

histtext::get_server_status()

Now you can load the library:

library(histtext)

2.2 Available Corpora

The function list_corpora serves to list all the corpora available on the Modern China Text Base created by the ENP-China Project. The corpora are stored on a SolR server. Each corpus is labeled with the specific name to be used in the search functions (see below):

histtext::list_corpora()
##  [1] "archives"             "chinajournal-pages"   "csmo-pages"          
##  [4] "dongfangzz"           "elder_workers"        "elder_workers_format"
##  [7] "imh-en"               "imh-zh"               "kmt9k"               
## [10] "ncbras"               "proquest"             "reports-en"          
## [13] "reports-fr"           "scmp-recent"          "shimingru-diary"     
## [16] "shunpao"              "shunpao-revised"      "shunpao-tok"         
## [19] "waiguozaihua"         "wikibio-en"           "wikibio-zh"          
## [22] "zhanggangdiary"


Brief description

Periodicals:

  • shunpao: Chinese newspaper Shenbao 申報 (1872-1949): original version from the provider (GetHong)
  • shunpao-revised: Chinese newspaper Shenbao 申報 (1872-1949): corrected version by the ENP-China project (date formatting, correction of titles mixed up with text, segmentation of extra-long articles)
  • proquest: English-language periodicals from the ProQuest Chinese Newspapers Collection (CNC)
  • dongfangzz: Dongfang zazhi 東方雜誌 (1904-1948)
  • ncbras: Journal of the North China Branch of the Royal Asiatic Society (1858-1948)
  • chinajournal-pages: The China Journal (1904-1949) (access at page level)
  • csmo-pages: Chinese Students’ Monthly (1906-1931) (access at page level)
  • scmp-recent: South China Morning Post (1954-2000) (subset from the ProQuest collection)
  • cmj: China Medical Journal (1887-1949)
  • elder_workers_format: corpus of interviews of Shanghai workers (1953-1958) [not public]

Other printed sources:

Archives:

Wikipedia:

  • wikibio-en: corpus of biographies of individuals active in modern China extracted from Wikipedia (English)
  • wikibio-zh: corpus of biographies of individuals active in modern China extracted from Wikipedia (Chinese)

Diaries and Memoirs:

The content of the Modern China Text Base is expanding continuously. The presentation above may not reflect the most recent state of its collections.

2.2.1 Statistics

Comprehensive statistics on all the corpora have been pre-computed and are available on GitLab. For each corpus, a specific folder provides access to all the CSV tables and the visualizations (see images below).

[Figures: statistics homepage; list of folders (ProQuest); collection statistics for ProQuest; North China Herald corpus statistics]

3 Query functions

3.2 Basic concordance

The function search_concordance serves to explore the queried terms in their context. The function is composed of three arguments: the queried term(s) (within quotation marks, as in the search_documents function), the targeted corpus (preceded by “corpus =”), and an optional argument to determine the size of the context (the number of characters on each side of the queried term). In the example below, we search “扶輪社” in the Shenbao corpus, and we set 100 as the desired context:

concs <- search_concordance('"扶輪社"', corpus = "shunpao", context_size = 100)

concs


The output is similar to the table generated by the search_documents function, with three additional columns containing the queried term (Matching) and the text before (Before) and after (After) it. In this table, each row no longer represents a unique document but a single occurrence of the queried term. The concordance table therefore usually contains more rows than the table of documents, since the queried term may appear several times in the same document.
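For instance, the concordance table can be summarized to count the number of matches per document. The sketch below assumes that the document identifier column is named DocId, as in the tables returned by search_documents, and uses dplyr:

library(dplyr)

# count the occurrences of the queried term per document (DocId column assumed)
concs %>%
  count(DocId, sort = TRUE) %>%
  filter(n > 1)   # keep only documents where the term appears more than once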

3.3 Full Text Retrieval

The function get_documents serves to retrieve the full text of the documents. The function relies on the results of the search_documents function. It is composed of two arguments: the name of the variable (table of results) and the targeted corpus (within quotation marks):

docs_ft <- histtext::get_documents(docs, "shunpao")
## [1] 1
## [1] 11
## ...
## [1] 501
docs_eng_ft <- histtext::get_documents(docs_eng, "proquest") 
## [1] 1
## [1] 11
## ...
## [1] 781
docs_ft


The function generates the same table as search_documents, with an additional column that contains the full text of the document (Text).
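As a quick check on the retrieved texts, you can inspect the length of the Text column, for instance with base R:

# distribution of document lengths (in characters) in the full-text table
summary(nchar(docs_ft$Text))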


3.4 Close reading

If you want to read individual documents more closely, use the view_document function. The document will appear in your Viewer panel.

view_document("SPSP193602290401", "shunpao-revised")


If you want to display a term of interest in the document, add the term after the “query” argument as indicated below:

view_document("SPSP193602290401", "shunpao-revised", query = '"扶輪社"')

3.4.1 ProQuest Documents

The function proquest_view() is designed to display the original document on the ProQuest platform, based on the ID of the document (for ProQuest subscribers only):

proquest_view(1371585080)

3.4.2 Other documents

The function view_document() is designed to display the full text of a single document with the functionality of highlighting selected key words in the text. The function is composed of three arguments:

  • docid = document identifier
  • corpus = name of the corpus
  • query = list of queried terms (query string or vector of query strings).
view_document(1371585080, "proquest", query = c("Nanking", "club"))
view_document("SPSP193011141210", "shunpao", query = c("扶輪社", "飯店"))

3.5 Advanced functions

The package provides more advanced functions to perform multi-field queries.

The function search_documents serves to query terms in fields other than the content of articles. For instance, one can perform queries based on the title of articles and/or date of publication. It is possible to search several fields at the same time. The function is composed of five arguments, as described below:

  • q: the queried term (q)
  • corpus: the corpus (within quotation marks)
  • search_fields: the selected field(s) (use the function list_search_fields to obtain the list of searchable fields available in the targeted corpus, as described below)
  • start: when the query returns more than 100,000 results, this argument allows you to start a second round from a specified row number (e.g. start = 100001)
  • dates: this argument serves to filter the search results by specific dates or date ranges. Use accepts_date_queries to test whether the targeted corpus supports date filters (see below).
search_documents <- function(q, 
                                  corpus="shunpao", 
                                  search_fields=c(),
                                  start=0,
                                  dates=c())


To obtain the list of searchable fields available in the targeted corpus, use the function list_search_fields. For instance, the example below shows that three fields can be searched in the Dongfang zazhi 東方雜誌 - the title, the author, or the full text of the article:

list_search_fields("dongfangzz")
## [1] "text"    "title"   "authors"


If no field is selected explicitly, queries will be performed in all fields.

The function list_filter_fields() lists the other available fields, which are not meant to be searched directly (they are technically searchable, but not relevant as search targets) and can instead be used to filter results:

list_filter_fields("dongfangzz")
## [1] "category" "volume"   "issue"


In the above example, you can filter the results of your search in the Dongfang zazhi by the category of article, the volume and issue.

The function list_possible_filters() can be applied to any filterable field to display its contents:

list_possible_filters("dongfangzz", "category") %>% arrange(desc(N))


The first column “Value” contains the possible filters that you can use. The second column “N” indicates the number of documents in the corresponding filter value.

Use accepts_date_queries to test whether the targeted corpus supports date filters. For instance, the example below shows that it is possible to search dates in the Shenbao or Dongfang zazhi, but not in Wikipedia (Chinese or English):

histtext::accepts_date_queries("shunpao")
## [1] TRUE
histtext::accepts_date_queries("dongfangzz")
## [1] TRUE
histtext::accepts_date_queries("wikibio-zh")
## [1] FALSE
histtext::accepts_date_queries("wikibio-en")
## [1] FALSE


3.5.1 Multifield queries

For instance, search the term 扶輪社 in the Shenbao in the title only:

search_documents('"扶輪社"', corpus="shunpao", search_fields= "title")


Search the same term 扶輪社 in the Shenbao in both title and full text:

search_documents('"扶輪社"', corpus="shunpao", search_fields=c("title", "text"))

3.5.2 Date filtering

Search the term 扶輪社 in the Shenbao in all possible fields for the year 1933 only:

search_documents('"扶輪社"', corpus="shunpao", dates="1933")


Search the term 扶輪社 in the Shenbao in all possible fields for March 1933 only:

search_documents('"扶輪社"', corpus="shunpao", dates="1933-03")


Search the term 扶輪社 in the Shenbao in all possible fields for the two years 1933 and 1940:

search_documents('"扶輪社"', corpus="shunpao", dates=c("1933", "1940"))


Search the term 扶輪社 in the Shenbao in all possible fields from 1930 to 1940:

search_documents('"扶輪社"', corpus="shunpao", dates="[1930 TO 1940]")


Search the term 扶輪社 in the Shenbao in all possible fields from 1930 to 1940 and 1945:

search_documents('"扶輪社"', corpus="shunpao", dates=c("[1930 TO 1940]", "1945"))


Combined query on different fields and dates:

combined_search <- search_documents('"扶輪社"', corpus="shunpao",
                      search_fields="title",
                      dates=c("[1930 TO 1940]", "1945"))

combined_search

3.5.3 Concordance

The extended search also applies to concordance through the function search_concordance, which takes six arguments: the same five as in search_documents, plus an additional argument for the context size:

  • q: the queried term (q)
  • corpus: the corpus (within quotation marks)
  • search_fields: the selected field(s) (use the function list_search_fields to obtain the list of searchable fields available in the targeted corpus, as described above)
  • context_size: the size of the context (number of characters before/after)
  • start: when the query returns more than 100,000 results, this argument allows you to start a second round from a specified row number (e.g. start = 100001)
  • dates: this argument serves to filter the search results by specific dates or date ranges. Use accepts_date_queries to test whether the targeted corpus supports date filters (see above).
search_concordance <- function(q, 
                                   corpus="shunpao", 
                                   search_fields=c(),
                                   context_size=30, 
                                   start=0,
                                   dates=c())


The advanced functions that we described above - list_search_fields(), list_filter_fields(), and list_possible_filters() - can also be applied in combination with the search_concordance() function.
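For instance, a concordance restricted to article titles and to a given date range combines the same arguments as in the search_documents examples above:

combined_concs <- search_concordance('"扶輪社"', corpus = "shunpao",
                                      search_fields = "title",
                                      context_size = 50,
                                      dates = "[1930 TO 1940]")

combined_concs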

3.5.4 Word embeddings

In HistText, pre-computed word embeddings (i.e., learned representations of text in which words with similar meanings have similar representations) can be used to enhance queries by incorporating similar terms. Please note that this function is currently under construction. However, it is already accessible through the HistText Search interface.

4 Corpus Statistics

4.1 Word frequencies

4.1.1 Number of occurrences per article

The function stats_count provides statistics on the queried term(s). With the default parameters, you can obtain the number of occurrences of the keyword(s) per article.
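The examples in this section assume that df is a table of documents with their full text (the output of get_documents) for the keyword “noble”. One possible way to build such a table, using the English-language imh-en corpus purely as an illustration, is:

# build a sample full-text table for the statistics examples (corpus chosen for illustration)
docs_noble <- histtext::search_documents('"noble"', "imh-en")
df <- histtext::get_documents(docs_noble, "imh-en")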

stats_count(df,"noble")

4.1.2 Mean number of occurrences over time

By adding the over_time=TRUE parameter, you can obtain the mean number of occurrences of the keyword per year:

stats_count(df,"noble",over_time=TRUE)

4.1.3 Percentage compared to other words over time

By adding the by_char=TRUE parameter, we can contrast the number of occurrences of the queried term with the overall number of tokens in the text per year.

stats_count(df,"noble",over_time=TRUE,by_char=TRUE)

4.1.4 Plot or dataframe

For all statistical functions, it is possible to retrieve the statistics dataframe without generating the graph by setting to_plot=FALSE.

stats_count(df,"noble",over_time=TRUE,by_char=TRUE,to_plot=FALSE)

4.1.5 Interactivity

All visualizations are created using Plotly to make them interactive. However, it is possible to generate a non-interactive graph by setting ly=FALSE.

stats_count(df,"noble",over_time=TRUE,by_char=TRUE,ly=FALSE)

4.2 Character count

The function count_character_bins counts the number of characters in the text and then groups them by “bins”, i.e. series of ranges of numerical value into which data are sorted in statistical analysis. It is possible to manually set the number of bins using the nb_bins argument, as shown below.

count_character_bins(df)
count_character_bins(df,nb_bins=20)

4.4 Document Term Matrix (DTM)

The classic_dtm function allows you to explore the semantic content of the text in more depth. To use this function, you need to have the following packages installed:

install.packages("stopwords")
install.packages("quanteda")
install.packages("quanteda.textstats")

4.4.1 Top words

The classic configuration allows you to see the top ‘n’ words present in the given texts. You can set the number of desired words manually using the top argument:

classic_dtm(df,top=20)

4.4.2 Top words over time

By adding over_time=TRUE it is possible to display the frequency of these top “n” words over time:

classic_dtm(df,top=20,over_time=TRUE)

4.4.3 Document Similarity

By adding doc_simi=TRUE it is possible to calculate the similarity between documents with respect to the frequencies of the words they use.

install.packages("ggdendro")
classic_dtm(df,top=20,doc_simi=TRUE)

5 Named Entity Recognition (NER)

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that serves to extract the names of real-world entities (persons, organizations, places, times, currencies, etc.) mentioned in a corpus of documents. In HistText, the function ner_on_corpus applies a default NER model that varies according to the language (English or Chinese) and the nature (register of the language, OCR quality, etc.) of the queried corpus.

The default model is based on the spaCy software and the OntoNotes ontology. It categorizes entities into the following main types: persons (PERSON), organizations (ORG), locations (LOC), geopolitical entities (GPE) (countries, administrative regions…), temporal entities (DATE, TIME), numerical entities (MONEY, PERCENTAGE), and miscellaneous entities (MISC).

5.1 Named Entity Extraction

The function ner_on_corpus is composed of two arguments: the collection of documents with their full text (output of the “get_documents()” function), and the targeted corpus.

Example in English (ProQuest):

docs_ner_eng <- ner_on_corpus(docs_eng, corpus = "proquest")
## 1/785
## 11/785
## ...
## 781/785
docs_ner_eng


Example in Chinese (Shenbao):

docs_ner_zh <- ner_on_corpus(docs_ft, corpus = "shunpao")
## 1/507
## 11/507
## ...
## 501/507
docs_ner_zh


The function returns a table with five columns:

  • Id: unique identifier of the document
  • Text: the name of the entity (as given in the source)
  • Type: the type of the entity extracted with its confidence index (measuring the accuracy of the classification by the algorithm)
  • Start: the position of the character immediately preceding the entity in the text
  • End: the position of the character immediately following the entity in the text
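For instance, to get a quick overview of the most frequent entities of a given type, the NER table can be summarized with dplyr (a minimal sketch based on the Text and Type columns described above):

library(dplyr)

# twenty most frequent organizations in the English NER results
docs_ner_eng %>%
  filter(Type == "ORG") %>%
  count(Text, sort = TRUE) %>%
  head(20)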

The HistText package includes alternative models tailored to the nature of the text in the different corpora. In particular, it provides specific models for texts with noisy Optical Character Recognition (OCR) and/or missing punctuation.

The function list_ner_models() lists all the models available in the package:

list_ner_models()
##  [1] "spacy_sm:en:ner"            "spacy:en:ner"              
##  [3] "spacy_sm:zh:ner"            "spacy:zh:ner"              
##  [5] "trftc_noisy_nopunct:en:ner" "trftc_noisy:en:ner"        
##  [7] "trftc_nopunct:zh:ner"       "trftc_camembert:fr:ner"    
##  [9] "trftc_person_1class:zh:ner" "trftc_person_4class:zh:ner"


There are two main families of models: spaCy and trftc (Transformer Token Classification), the latter developed specifically by the ENP-China team (Baptiste Blouin, Jeremy Auguste) to handle issues that pertain to historical corpora in multiple languages.

| Name | Model | Language | Features |
|---|---|---|---|
| spacy_sm:en:ner | spaCy | English | Default model for large corpora (faster but less reliable) |
| spacy_sm:zh:ner | spaCy | Chinese | Default model for large corpora (faster but less reliable) |
| spacy:en:ner | spaCy | English | Default model for large or small corpora (slower but more reliable) |
| spacy:zh:ner | spaCy | Chinese | Default model for large or small corpora (slower but more reliable) |
| trftc_noisy_nopunct:en:ner | TrfTC | English | Model for texts with noisy OCR and missing punctuation |
| trftc_noisy:en:ner | TrfTC | English | Model for texts with noisy OCR |
| trftc_nopunct:zh:ner | TrfTC | Chinese | Model for texts with no punctuation |
| trftc_camembert:fr:ner | TrfTC | French | Model for texts in French (based on BERT) |
| trftc_person_1class:zh:ner | TrfTC | Chinese | Model trained to identify persons in any category |
| trftc_person_4class:zh:ner | TrfTC | Chinese | Model trained to detect persons’ names with embedded titles (王少年 for 王局長少年, 王君少年) |

To know which default model is used for a given corpus, use the function get_default_ner_model():

get_default_ner_model("proquest")
get_default_ner_model("imh-zh")


To specify the model you want to apply, use the argument “model =” in the ner_on_corpus() function:

ner_on_corpus(docs_ft, corpus = "proquest", model = "trftc_noisy_nopunct:en:ner")
ner_on_corpus(docs_ft, corpus = "shunpao", model = "trftc_person_4class:zh:ner")

5.2 Padagraph Visualization

The data extracted through NER can be visualized as a network graph using Padagraph. To enable this function, one first needs to create a network object using libraries such as igraph or tidygraph. The following lines of code detail the successive steps to transform the results of NER into an edge list and eventually into a network object. The last step consists of applying the function in_padagraph() to project the tidygraph object into Padagraph. In the following example, we build a two-mode network linking persons or organizations with the documents in which they appear:

library(dplyr)      # data manipulation (filter, select, rename, etc.)
library(igraph)     # graph_from_data_frame
library(tidygraph)  # as_tbl_graph, activate

# create the edge list linking documents with persons or organizations 
edge_ner_eng <- docs_ner_eng %>% 
  filter(Type %in% c("PERSON", "ORG")) %>% # select persons and organizations
  select(DocID, Text) %>% # retain only the relevant variables (from/to)
  filter(!Text %in% c("He","She","His","Her","Him", "Hers", "Chin", "Mrs", "Mr", "Gen", "Madame", "Madam", "he","she","his","her","him", "hers")) # remove personal pronouns
# retain entities that appeared at least twice
topnode_docs_eng <- edge_ner_eng %>% 
  group_by(Text) %>% 
  add_tally() %>% 
  filter(n>1) %>% 
  distinct(Text)
# build the network with igraph/tidygraph
topedge <- edge_ner_eng %>% 
  filter(Text %in% topnode_docs_eng$Text) %>% 
  rename(from = DocID, to = Text)
ig <- graph_from_data_frame(d=topedge, directed = FALSE)
tg <- tidygraph::as_tbl_graph(ig) %>% 
  activate(nodes) %>% 
  mutate(label=name)
# project in padagraph
tg %>% histtext::in_padagraph("RotaryNetwork") 


The function get_padagraph_url directly returns the URL for displaying the graph:

tg %>% histtext::get_padagraph_url("RotaryNetwork") 

[Figure: Rotary Network in ProQuest]
The URL for the graph is created as a permanent link that can be used in third-party applications or to return to the network later.

The function load_in_padagraph() serves to import a tidygraph object into “Padagraph”. The function is composed of three arguments:

  • filepath: the path to a file which contains a tidygraph (created with the function ‘save_graph’ included in HistText)
  • name: the name to be given to the graph
  • show_graph: if TRUE, show the graph in a RStudio viewer

The function returns a URL that displays the tidygraph object in Padagraph.

save_graph(graph, filepath) 
load_in_padagraph(filepath, name, show_graph = TRUE) 

5.3 NER on external documents

HistText includes two functions to apply NER to external documents, i.e. documents that can be loaded directly into RStudio:

  1. run_ner(): to be applied to a character string
  2. ner_on_df(): to be applied to a dataframe (e.g. a text extracted from a PDF document or scraped from the web, in dataframe format)

In addition, HistText includes a function, load_pdf_as_df, that transforms a pre-OCRed PDF document into a dataframe.

The function run_ner() takes two arguments: the text to be analyzed (text) and the model to be used (model) (use ‘list_ner_models()’ to obtain the available models):

run_ner(text, model = "spacy:en:ner", verbose = FALSE)
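A minimal usage sketch (the sentence is only an illustration; the exact shape of the returned table depends on the model):

# apply NER to a short English string with the default spaCy model
run_ner("Sun Yat-sen arrived in Shanghai in October 1911.", model = "spacy:en:ner")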

The function ner_on_df requires at least two arguments: the dataframe and the name of the column that contains the text (Text); the language model to be applied (English or Chinese) can also be specified. In the example below, we add the id_column argument so that each row is associated with its document identifier (DocId). We also choose to process only the first ten rows of the dataframe, but this restriction should be removed to process the whole document.

ner_df <- ner_on_df(dplyr::slice(docs_eng_ft, 1:10), "Text", id_column="DocId", model = "spacy:en:ner") 
## 1/10
ner_df


If applied to a PDF document, the ner_on_df() workflow involves three steps:

  1. Load the PDF document
  2. Convert the text into a dataframe (df) using the function load_pdf_as_df
  3. Apply the function to the dataframe.

In the example below, we choose to display only the first ten rows of the document, but this line of code should be removed to process the whole document. To eliminate the pages that do not contain text or that contain little text (pages with images), we set the minimum limit to at least 100 characters per page. The identifier argument serves to assign the name of the document (as you choose to label it) to the numbered pages (by default, pages appear simply as numbers). A full sketch of the PDF workflow follows the code below.

dplyr::slice(docs_eng_ft, 1:10)
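A possible version of the full PDF workflow is sketched below. This is a sketch only: it assumes that load_pdf_as_df takes the path to the PDF file and returns a data frame with the page text in a Text column and the page number in a DocId column; check the function documentation for the exact argument and column names.

library(dplyr)

# 1. load the PDF as a data frame (one row per page) -- argument and column names assumed
pdf_df <- histtext::load_pdf_as_df("my_document.pdf")

# 2. keep only the pages with at least 100 characters of text
pdf_df <- pdf_df %>% filter(nchar(Text) >= 100)

# 3. apply NER to the text column
pdf_ner <- histtext::ner_on_df(pdf_df, "Text", id_column = "DocId", model = "spacy:en:ner")
pdf_ner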

6 Question & Answer

The implementation of question-and-answer (Q&A) queries is another valuable functionality provided by the HistText library. This feature enables researchers to target and extract specific content from natural-language texts based on user-defined queries. By formulating questions or prompts, researchers can use the Q&A feature to extract data from documents in natural language. Q&A functions in HistText are particularly effective for retrieving biographical information.

Two models are currently available in HistText: one for Chinese and one for English. You can use list_qa_models() to list the available models:

histtext::list_qa_models()

6.1 Basic usage

The most basic use is to ask a single question:

imh_en_df <- histtext::search_documents('"member of party"', "imh-en")

histtext::qa_on_corpus(imh_en_df, "What is his full name?", "imh-en")


Alternatively, you can ask multiple variants of a question:

histtext::qa_on_corpus(imh_en_df, c("What is his full name?", "What name?"), "imh-en")

6.2 More complex usage

A more advanced usage of Q&A can be achieved when questions depend on previous questions:

questions <- list("name:full" = c("What is his full name?", "What name?"),
                  "education:location" = c("Where {name:full} study at?", "Where study at?"))
histtext::qa_on_corpus(imh_en_df, questions, "imh-en")


You can also specify the number of answers that a question should be allowed to produce:

histtext::qa_on_corpus(imh_en_df, questions, "imh-en", max_answers = list("education:location" = 2))


Examples of questions on which the models were trained can be accessed using the following functions:

histtext::biography_questions("en")
histtext::biography_questions("zh")

7 Chinese-specific functions

A key feature of HistText is to provide a set of functions designed to process documents in “transitional Chinese”, a term that we coined to refer to the Chinese language as it evolved from the near-classical language of the administration and imperial publications from the 1850s to the near-contemporary Chinese of the late 1940s (Blouin et al. 2023).

7.1 Tokenization

Tokenization refers to the operation of segmenting a text into tokens, which are the most elementary semantic units in a text. This is a crucial step for text analysis. Currently, two models are available in HistText, as listed below. The trftc_shunpao:zh:cws model is based on the initial annotation campaign conducted by Huang Hen-hsen (Academia Sinica) in 2021. The trftc_shunpao_23:zh:cws model is a refined model based on a second annotation campaign conducted by the ENP-China project in 2023 (Blouin et al. 2023).

list_cws_models()


The tokenizer can be applied to a corpus built with HistText (see section X) using the function cws_on_corpus, or it can be used directly on a specific data frame provided by the researcher using the function cws_on_df.

Below we provide an example for each case.

7.1.1 Corpus

The cws_on_corpus function includes the following arguments:

  • docids: the ‘DocId’ column returned by the search_documents function.
  • corpus: the corpus to be used, chosen from the available corpora in MCTB.
  • model: allows you to select the specific model to be used. If not specified, it defaults to the model set for the chosen corpus.
  • field: the field that contains the text to be tokenized.
  • detailed_output: When set to TRUE, it returns a data frame with one row per token, including the position in the text and confidence scores.
  • token_separator: specifies the character used to separate each token. By default, a single white space is used.
cws_on_corpus(
  docids,
  corpus = "__",
  model = "__default__",
  field = "__default__",
  detailed_output = FALSE,
  token_separator = " ",
  batch_size = 10,
  verbose = TRUE
)



Below is an example to illustrate how the function works:

# create sample corpus 
sample_corpus <- histtext::search_documents('"共產黨員"', "imh-zh")

# tokenize the corpus
tokenized_corpus <- histtext::cws_on_corpus(sample_corpus, "imh-zh", detailed_output = FALSE)

tokenized_corpus

library(knitr)       # kable
library(kableExtra)  # kable_styling

kable(tokenized_corpus, caption = "Tokenized corpus") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")


If you wish to display the detailed output:

tokenized_corpus_detailed <- histtext::cws_on_corpus(sample_corpus, "imh-zh", detailed_output = TRUE)

kable(tokenized_corpus_detailed, caption = "Tokenized corpus with detailed output") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

7.1.2 Data frame

The function cws_on_df follows a similar structure:

  • df: data frame which contains the texts to tokenize
  • text_column: name of the column which contains the input text to be tokenized
  • id_column: name of the column used to associate ids with the CWS outputs. By default, the row index in ‘df’ is used.
  • model: selection of the model to be used (use ‘list_cws_models()’ to get the available models)
  • detailed_output: if TRUE, return a dataframe with one row per token (with positions and confidence scores)
  • token_separator: the character to use to separate each token (default uses a normal white space)

cws_on_df(
  df,
  text_column,
  id_column = NULL,
  model = "trftc_shunpao_23:zh:cws",
  detailed_output = FALSE,
  token_separator = " ",
  verbose = TRUE
)

To illustrate how the tokenizer functions on a data frame, we provide a sample dataset (sample_df) that contains four documents extracted from the Shenbao between 1874 and 1889. The data frame includes five columns: DocId, Date, Title, Source, Text.

kable(sample_df, caption = "Sample dataframe") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")


We apply the tokenizer on the “Text” column of the sample data frame:

tokenized_df <- cws_on_df(
  sample_df,
  text_column = "Text",
  id_column = "DocId",
  model = "trftc_shunpao_23:zh:cws",
  detailed_output = FALSE,
  token_separator = " ",
  verbose = TRUE
)


The function returns a data frame with two columns, one with the Tokenized text (Text) and another with the original ids of the documents (DocId):

kable(tokenized_df, caption = "Tokenized dataframe") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

7.2 Conversions

Historical sources use various forms of transliteration of Chinese characters. The function wade_to_py serves to convert the standard (but obsolete) Wade-Giles transliteration system into pinyin. In the example below, some issues remain in the conversion output, which we correct with a few additional lines of code. These issues will be addressed in the near future.

library(readr)    # read_csv
library(dplyr)    # mutate, select, rename
library(stringr)  # str_remove_all, str_replace

wgconv <- read_csv("wgconv.csv")

wgconv %>% mutate(NameWG = wade_to_py(Original)) %>%
  mutate(NameWG2 = str_remove_all(NameWG, "-")) %>%
  mutate(NameWG3 = str_remove_all(NameWG2, "[:punct:]")) %>%
  mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>%
  mutate(NameWG5 = str_replace(NameWG4, "Chê", "Che")) %>% 
  select(Original, NameWG5) %>% 
  rename(output = NameWG5)

8 Additional features

8.1 Regular Expressions

The function extract_regexps_from_subcorpus() is designed to search a list of regular expressions in a corpus of documents. The function is composed of two arguments:

  • “corpus”: consists of a table with the documents (must include DocId, Title and Text columns), typically the output of the “get_documents()” function (e.g. “docs_eng_ft”).
  • “regexps”: a table that indicates the pattern(s) to look for (including two columns “Regexp” and “Type”)

The function returns a three-column table: the document id in which the pattern was found, the type of pattern, and the matched term (pattern).

regexpample <- read_delim("regexps.csv", delim = ";", escape_double = FALSE, trim_ws = TRUE) # load example table with regular expressions to search
regexp_output <- extract_regexps_from_subcorpus(docs_eng_ft, regexpample)
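If you do not have a CSV file at hand, the regexps table can also be built directly in R. The two patterns below are purely illustrative; only the column names (Regexp and Type) are required by the function:

library(tibble)

# build a small table of regular expressions to search for
regexpample <- tibble(
  Regexp = c("Rotary Club of [A-Z][a-z]+", "[0-9]+ members"),
  Type   = c("club", "membership")
)

regexp_output <- extract_regexps_from_subcorpus(docs_eng_ft, regexpample)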

8.2 Data transformation


This document presents a substantial workflow for the transformation of data.

9 Appendix

This table describes the functions available in HistText as of November 2023.


Control Functions

| Function | Description |
|---|---|
| accepts_date_queries | Check if a corpus accepts date queries |
| get_default_ner_model | Get the name of the default NER model for a given corpus |
| get_error_status | Retrieve the error status of a response |
| get_server_status | Get the status of the server |
| list_corpora | List available collections in SolR |

Query Functions

| Function | Description |
|---|---|
| search_documents | Search for documents |
| search_documents_ex | Extended search for documents |
| search_concordance | KWIC search in ENP corpora |
| search_concordance_ex | Extended KWIC search in ENP corpora |
| search_concordance_on_df | KWIC search in a custom dataframe |
| get_documents | Retrieve documents from IDs |
| count_documents | Get the number of articles matching a query, by date |
| count_search_documents | Count the number of documents that can be returned by a query |
| view_document | View a single document in RStudio |

Data extraction functions

| Function | Description |
|---|---|
| ner_on_corpus | Apply Named Entity Recognition on a corpus |
| ner_on_df | Apply Named Entity Recognition on the specified column of a dataframe |
| run_ner | Apply Named Entity Recognition on a string |
| run_qa | Apply Question-Answering on a string |
| qa_on_corpus | Apply Question-Answering on a corpus |
| qa_on_df | Apply Question-Answering on the specified column of a dataframe |
| extract_regexps_from_subcorpus | Apply a collection of regular expressions to a collection of documents |

Advanced functions

| Function | Description |
|---|---|
| list_search_fields | List possible search fields for a given corpus |
| get_search_fields_content | Retrieve the content associated with each search field |
| list_filter_fields | List possible filter fields for a given corpus |
| list_ner_models | List available NER models on the server |
| list_possible_filters | List possible filter values for a given filter field |
| list_precomputed_corpora | List corpora with precomputed annotations |
| list_precomputed_fields | List fields of a given corpus that have precomputed annotations |
| list_qa_models | List available Q&A models on the server |
| load_pdf_as_df | Load the text from a PDF into a data frame |
| proquest_view | Display an entry from the ProQuest corpus |

Chinese-specific functions

| Function | Description |
|---|---|
| list_cws_models | List available CWS models on the server |
| run_cws | Apply Chinese Word Segmentation on a string |
| get_default_cws_model | Get the name of the default CWS model for a given corpus |
| cws_on_corpus | Apply Chinese Word Segmentation on a corpus |
| cws_on_df | Apply Chinese Word Segmentation on the specified column of a dataframe |
| sinograms_to_py | Sinograms (漢字) to pinyin conversion |
| wade_to_py | Wade-Giles to pinyin conversion |

Graph functions

| Function | Description |
|---|---|
| get_padagraph_url | Send a tidygraph to Padagraph and return the URL |
| in_padagraph | Send a tidygraph to Padagraph and display it |
| load_in_padagraph | Load and send a previously saved graph object into Padagraph |
| save_graph | Save a tidygraph into a file |

Server functions

| Function | Description |
|---|---|
| query_server_get | GET a resource from the server |
| query_server_post | POST a file to the server |
| set_config_file | Set the config file in order to specify the server URL to use (and other needed information) |

10 Further documentation

  • ENP-China R package: Update 0.2.5, 08-06-2021, by Jeremy Auguste.
  • HistText Updates (1.0.0), 07-10-2021, by Jeremy Auguste.
  • HistText 1.6.2: Updates & News, 27-09-2022, by Jeremy Auguste.

Bibliography

Blouin, Baptiste, Christian Henriot, and Cécile Armand. 2023. “HistText: An Application for Leveraging Large-Scale Historical Textbases.” Journal of Data Mining and Digital Humanities. https://shs.hal.science/halshs-04178820.
Blouin, Baptiste, Hen-Hsen Huang, Christian Henriot, and Cécile Armand. 2023. “Unlocking Transitional Chinese: Word Segmentation in Modern Historical Texts.” Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities & 8th International Workshop on Computational Linguistics for Uralic Languages.