1 Introduction

This document is conceived as a practical guide for the R HistText library. The HistText library (or package) is a set of functions, developed in the R programming language by the ENP-China Project, for the exploration and data mining of Chinese-language and English-language digital corpora. Its main purpose is to place in the hands of historians, and more generally humanists, a set of ready-made tools to search, extract, and visualize textual data from large-scale multilingual corpora.

HistText represents the culmination of a longstanding and fruitful collaboration between historians and computer scientists that aimed at exploring machine learning in historical research. This symbiotic partnership has been instrumental in achieving optimized implementations, enhanced performance, and improved usability of HistText.

The main architects of the HistText package are:

  1. Pierre Magistry (Ph.D.), initially a postdoctoral researcher in the ENP-China Project and now an associate professor at INALCO, laid the foundation for HistText with the creation of the R ‘enpchina’ library. This library offered essential functionalities for querying documents, retrieving full-text content, and extracting named entities from diverse corpora.
  2. Jeremy Auguste (Ph.D.), a postdoctoral researcher in the ENP-China Project, refined the functionalities of HistText. He focused particularly on improving the ‘extended’ search and concordance features, introducing filters that allow results to be narrowed down more precisely by time, publication, field, and other metadata. Auguste also spearheaded the development of the user interface in R-Shiny, designed to cater to non-programming users.
  3. Baptiste Blouin, then a Ph.D. candidate, contributed to enhancing the named entity recognition (NER) capabilities for Chinese sources. As a postdoctoral researcher, Blouin further advanced HistText into a comprehensive application. He made significant contributions to improving the R-Shiny interface, incorporating a diverse array of data visualizations that enhance the user experience. Behind the scenes, Blouin also organized several annotation campaigns focused on tokenization, named entity recognition, and event extraction in Chinese historical sources.

We initially developed this library to facilitate the exploration and extraction of data from the resources collected in the course of the ENP-China project, but as the HistText library developed, we realized that it could have a broader usage. Its functions can be applied to any corpus of a similar nature, provided it meets three basic requirements:

  • to be stored on a SolR server
  • to be full text
  • to be fully segmented

We developed the HistText library because providers of historical sources, even when the sources are available online, offer only very limited search functions. The HistText package features advanced capabilities for querying and extracting data from any available field (title, text, section) in a given corpus, and for filtering by type of field and by date.

Using the HistText library requires only basic skills in R, such as creating a project, loading libraries, and running a script. The heavy work is done by the HistText library itself: all the user needs to do is substitute the terms of the queries in the scripts that we provide below. Because we wanted to provide concrete examples of how the functions operate and how to write a proper script, we wrote this manual as a Markdown document, so that users can simply copy and paste the code to reproduce the proposed operations.

The HistText library is also available on GitLab [https://gitlab.com/histtext] for those interested in implementing the same set of functions and applying them to their own corpora. This requires connecting the library to the relevant corpora on a given server, which can be done easily with the help of a computer scientist.

For an in-depth presentation and discussion of HistText’s history, architecture, and broader contribution to the field of computational humanities, please refer to our paper (Blouin, Henriot, and Armand 2023).

The complete list of functions is available in the appendix.

2 Set Up

2.1 Installation and configuration

devtools::install_gitlab("enpchina/histtext-r-client", auth_token = "replace with your gitlab token")

Configure the package (replace the fields with your actual server information):

histtext::set_config_file(domain = "https://rapi.enpchina.eu", 
                          user = "user_info", password = "user_info_password")

If successfully configured, the following command will return “OK”

histtext::get_server_status()

Now you can load the library:

library(histtext)

2.2 Available Corpora

The function list_corpora serves to list all the corpora available on the Modern China Text Base created by the ENP-China Project. The corpora are stored on a SolR server. Each corpus is labeled with the specific name to be used in the search functions (see below):

histtext::list_corpora()
##  [1] "archives"             "chinajournal-pages"   "csmo-pages"          
##  [4] "dongfangzz"           "elder_workers"        "elder_workers_format"
##  [7] "imh-en"               "imh-zh"               "kmt9k"               
## [10] "ncbras"               "proquest"             "reports-en"          
## [13] "reports-fr"           "scmp-recent"          "shimingru-diary"     
## [16] "shunpao"              "shunpao-revised"      "shunpao-tok"         
## [19] "waiguozaihua"         "wikibio-en"           "wikibio-zh"          
## [22] "zhanggangdiary"


Brief description

Periodicals:

  • shunpao: Chinese newspaper Shenbao 申報 (1872-1949): original version from the provider (GetHong)
  • shunpao-revised: Chinese newspaper Shenbao 申報 (1872-1949): corrected version by the ENP-China project (date formatting, correction of titles mixed up with text, segmentation of extra-long articles)
  • proquest: English-language periodicals from the ProQuest Chinese Newspapers Collection (CNC)
  • dongfangzz: Dongfang zazhi 東方雜誌 (1904-1948)
  • ncbras: Journal of the North China Branch of the Royal Asiatic Society (1858-1948)
  • chinajournal-pages: The China Journal (1904-1949) (access at page level)
  • csmo-pages: Chinese Students’ Monthly (1906-1931) (access at page level)
  • scmp-recent: South China Morning Post (1954-2000) (subset from the ProQuest collection)
  • cmj: China Medical Journal (1887-1949)
  • elder_workers_format: corpus of interviews of Shanghai workers (1953-1958) [not public]

Other printed sources:

Archives:

Wikipedia:

  • wikibio-en: corpus of biographies of individuals active in modern China extracted from Wikipedia (English)
  • wikibio-zh: corpus of biographies of individuals active in modern China extracted from Wikipedia (Chinese)

Diaries and Memoirs:

The content of the Modern China Text Base is expanding continuously. The presentation above may not reflect the most recent state of its collections.

2.2.1 Statistics

Comprehensive statistics on all the corpora have been pre-computed and are available on GitLab. For each corpus, a specific folder provides access to all the CSV tables and the visualizations (see images below).

[Figures: statistics homepage; list of folders (ProQuest); collection statistics for ProQuest; North China Herald corpus statistics]

3 Query functions

3.2 Basic concordance

The function search_concordance serves to explore the queried terms in their context. The function is composed of three arguments: the queried term(s) (within quotation marks, as in the search_documents function), the targeted corpus (preceded by “corpus =”), and an optional argument to determine the size of the context (the number of characters on each side of the queried term). In the example below, we search “扶輪社” in the Shenbao corpus, and we set 100 as the desired context:

concs <- search_concordance('"扶輪社"', corpus = "shunpao", context_size = 100)

concs


The output is similar to the table generated by the search_documents function, with three additional columns containing the queried term (Matching) and the text before (Before) and after (After) it. In this table, each row no longer represents a unique document but a single occurrence of the queried term. The concordance table therefore usually contains more rows than the table of documents, since the queried term may appear several times in the same document.
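For instance, the concordance table can be summarized to count the number of matches per document. The sketch below assumes that the document identifier column is named DocId, as in the tables returned by search_documents, and uses dplyr:

library(dplyr)

# count the occurrences of the queried term per document (DocId column assumed)
concs %>%
  count(DocId, sort = TRUE) %>%
  filter(n > 1)   # keep only documents where the term appears more than once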

3.3 Full Text Retrieval

The function get_documents serves to retrieve the full text of the documents. The function relies on the results of the search_documents function. It is composed of two arguments: the name of the variable (table of results) and the targeted corpus (within quotation marks):

docs_ft <- histtext::get_documents(docs, "shunpao")
## [1] 1
## [1] 11
## ...
## [1] 501
docs_eng_ft <- histtext::get_documents(docs_eng, "proquest") 
## [1] 1
## [1] 11
## ...
## [1] 781
docs_ft


The function generates the same table as search_documents, with an additional column that contains the full text of the document (Text).
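As a quick check on the retrieved texts, you can inspect the length of the Text column, for instance with base R:

# distribution of document lengths (in characters) in the full-text table
summary(nchar(docs_ft$Text))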


3.4 Close reading

If you want to read individual documents more closely, use the view_document function. The document will appear in your Viewer panel.

view_document("SPSP193602290401", "shunpao-revised")


If you want to display a term of interest in the document, add the term after the “query” argument as indicated below:

view_document("SPSP193602290401", "shunpao-revised", query = '"扶輪社"')

3.4.1 ProQuest Documents

The function proquest_view() is designed to display the original document on the ProQuest platform, based on the ID of the document (for ProQuest subscribers only):

proquest_view(1371585080)

3.4.2 Other documents

The function view_document() is designed to display the full text of a single document with the functionality of highlighting selected key words in the text. The function is composed of three arguments:

  • docid = document identifier
  • corpus = name of the corpus
  • query = list of queried terms (query string or vector of query strings).
view_document(1371585080, "proquest", query = c("Nanking", "club"))
view_document("SPSP193011141210", "shunpao", query = c("扶輪社", "飯店"))

3.5 Advanced functions

The package provides more advanced functions to perform multi-field queries.

The function search_documents serves to query terms in fields other than the content of articles. For instance, one can perform queries based on the title of articles and/or date of publication. It is possible to search several fields at the same time. The function is composed of five arguments, as described below:

  • q: the queried term (q)
  • corpus: the corpus (within quotation marks)
  • search_fields: the selected field(s) (use the function list_search_fields to obtain the list of searchable fields available in the targeted corpus, as described below)
  • start: when the query returns more than 100,000 results, this argument allows you to start a second round from a specified row number (e.g. start = 100001)
  • dates: this argument serves to filter the search results by specific dates or date ranges. Use accepts_date_queries to test whether the targeted corpus supports date filters (see below).
search_documents <- function(q, 
                                  corpus="shunpao", 
                                  search_fields=c(),
                                  start=0,
                                  dates=c())


To obtain the list of searchable fields available in the targeted corpus, use the function list_search_fields. For instance, the example below shows that three fields can be searched in the Dongfang zazhi 東方雜誌 - the title, the author, or the full text of the article:

list_search_fields("dongfangzz")
## [1] "text"    "title"   "authors"


If no field is selected explicitly, queries will be performed in all fields.

The function list_filter_fields() lists the other available fields, which are not meant to be searched directly (they are technically searchable, but not relevant as search targets) and can instead be used to filter results:

list_filter_fields("dongfangzz")
## [1] "category" "volume"   "issue"


In the above example, you can filter the results of your search in the Dongfang zazhi by the category of article, the volume and issue.

The function list_possible_filters() can be applied to any filterable field to display its contents:

list_possible_filters("dongfangzz", "category") %>% arrange(desc(N))


The first column “Value” contains the possible filters that you can use. The second column “N” indicates the number of documents in the corresponding filter value.

Use accepts_date_queries to test whether the targeted corpus supports date filters. For instance, the example below shows that it is possible to search dates in the Shenbao or Dongfang zazhi, but not in Wikipedia (Chinese or English):

histtext::accepts_date_queries("shunpao")
## [1] TRUE
histtext::accepts_date_queries("dongfangzz")
## [1] TRUE
histtext::accepts_date_queries("wikibio-zh")
## [1] FALSE
histtext::accepts_date_queries("wikibio-en")
## [1] FALSE


3.5.1 Multifield queries

For instance, search the term 扶輪社 in the Shenbao in the title only:

search_documents('"扶輪社"', corpus="shunpao", search_fields= "title")


Search the same term 扶輪社 in the Shenbao in both title and full text:

search_documents('"扶輪社"', corpus="shunpao", search_fields=c("title", "text"))

3.5.2 Date filtering

Search the term 扶輪社 in the Shenbao in all possible fields for the year 1933 only:

search_documents('"扶輪社"', corpus="shunpao", dates="1933")


Search the term 扶輪社 in the Shenbao in all possible fields for March 1933 only:

search_documents('"扶輪社"', corpus="shunpao", dates="1933-03")


Search the term 扶輪社 in the Shenbao in all possible fields for the two years 1933 and 1940:

search_documents('"扶輪社"', corpus="shunpao", dates=c("1933", "1940"))


Search the term 扶輪社 in the Shenbao in all possible fields from 1930 to 1940:

search_documents('"扶輪社"', corpus="shunpao", dates="[1930 TO 1940]")


Search the term 扶輪社 in the Shenbao in all possible fields from 1930 to 1940 and 1945:

search_documents('"扶輪社"', corpus="shunpao", dates=c("[1930 TO 1940]", "1945"))


Combined query on different fields and dates:

combined_search <- search_documents('"扶輪社"', corpus="shunpao",
                      search_fields="title",
                      dates=c("[1930 TO 1940]", "1945"))

combined_search

3.5.3 Concordance

The extended search also applies to concordance through the function search_concordance, which takes six arguments: the same five as in search_documents, plus an additional argument for the context size:

  • q: the queried term (q)
  • corpus: the corpus (within quotation marks)
  • search_fields: the selected field(s) (use the function list_search_fields to obtain the list of searchable fields available in the targeted corpus, as described above)
  • context_size: the size of the context (number of characters before/after)
  • start: when the query returns more than 100,000 results, this argument allows you to start a second round from a specified row number (e.g. start = 100001)
  • dates: this argument serves to filter the search results by specific dates or date ranges. Use accepts_date_queries to test whether the targeted corpus supports date filters (see above).
search_concordance <- function(q, 
                                   corpus="shunpao", 
                                   search_fields=c(),
                                   context_size=30, 
                                   start=0,
                                   dates=c())


The advanced functions that we described above - list_search_fields(), list_filter_fields(), and list_possible_filters() - can also be applied in combination with the search_concordance() function.
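For instance, a concordance restricted to article titles and to a given date range combines the same arguments as in the search_documents examples above:

combined_concs <- search_concordance('"扶輪社"', corpus = "shunpao",
                                      search_fields = "title",
                                      context_size = 50,
                                      dates = "[1930 TO 1940]")

combined_concs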

3.5.4 Word embeddings

In HistText, pre-computed word embeddings (i.e., learned representations of text in which words with similar meanings have similar representations) can be used to enhance queries by incorporating similar terms. Please note that this function is currently under construction. However, it is already accessible through the HistText Search interface.

4 Corpus Statistics

4.1 Word frequencies

4.1.1 Number of occurrences per article

The function stats_count provides statistics on the queried term(s). With the default parameters, you can obtain the number of occurrences of the keyword(s) per article.
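The examples in this section assume that df is a table of documents with their full text (the output of get_documents) for the keyword “noble”. One possible way to build such a table, using the English-language imh-en corpus purely as an illustration, is:

# build a sample full-text table for the statistics examples (corpus chosen for illustration)
docs_noble <- histtext::search_documents('"noble"', "imh-en")
df <- histtext::get_documents(docs_noble, "imh-en")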

stats_count(df,"noble")

4.1.2 Mean number of occurrences over time

By adding the over_time=TRUE parameter, you can obtain the mean number of occurrences of the keyword per year:

stats_count(df,"noble",over_time=TRUE)

4.1.3 Percentage compared to other words over time

By adding the by_char=TRUE parameter, we can contrast the number of occurrences of the queried term with the overall number of tokens in the text per year.

stats_count(df,"noble",over_time=TRUE,by_char=TRUE)

4.1.4 Plot or dataframe

For all statistical functions, it is possible to retrieve the statistics dataframe without generating the graph by setting to_plot=FALSE.

stats_count(df,"noble",over_time=TRUE,by_char=TRUE,to_plot=FALSE)

4.1.5 Interactivity

All visualizations are created using Plotly to make them interactive. However, it is possible to generate a non-interactive graph by setting ly=FALSE.

stats_count(df,"noble",over_time=TRUE,by_char=TRUE,ly=FALSE)

4.2 Character count

The function count_character_bins counts the number of characters in the text and then groups them by “bins”, i.e. series of ranges of numerical value into which data are sorted in statistical analysis. It is possible to manually set the number of bins using the nb_bins argument, as shown below.

count_character_bins(df)
count_character_bins(df,nb_bins=20)

4.4 Document Term Matrix (DTM)

The classic_dtm function allows you to explore the semantic content of the text in more depth. To use this function, you need to have the following packages installed:

install.packages("stopwords")
install.packages("quanteda")
install.packages("quanteda.textstats")

4.4.1 Top words

The classic configuration allows you to see the top ‘n’ words present in the given texts. You can set the number of desired words manually using the top argument:

classic_dtm(df,top=20)

4.4.2 Top words over time

By adding over_time=TRUE it is possible to display the frequency of these top “n” words over time:

classic_dtm(df,top=20,over_time=TRUE)

4.4.3 Document Similarity

By adding doc_simi=TRUE it is possible to calculate the similarity between documents with respect to the frequencies of the words they use.

install.packages("ggdendro")
classic_dtm(df,top=20,doc_simi=TRUE)

5 Named Entity Recognition (NER)

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that serves to extract the names of real-world entities (persons, organizations, places, times, currencies, etc.) mentioned in a corpus of documents. In HistText, the function ner_on_corpus applies a default NER model that varies according to the language (English or Chinese) and the nature (register of the language, OCR quality, etc.) of the queried corpus.

The default model is based on the spaCy software and the OntoNotes ontology. It categorizes entities into the following main types: persons (PERSON), organizations (ORG), locations (LOC), geopolitical entities (GPE) (countries, administrative regions…), temporal entities (DATE, TIME), numerical entities (MONEY, PERCENTAGE), and miscellaneous entities (MISC).

5.1 Named Entity Extraction

The function ner_on_corpus is composed of two arguments: the collection of documents with their full text (output of the “get_documents()” function), and the targeted corpus.

Example in English (ProQuest):

docs_ner_eng <- ner_on_corpus(docs_eng, corpus = "proquest")
## 1/785
## 11/785
## ...
## 781/785
docs_ner_eng


Example in Chinese (Shenbao):

docs_ner_zh <- ner_on_corpus(docs_ft, corpus = "shunpao")
## 1/507
## 11/507
## ...
## 501/507
docs_ner_zh


The function returns a table with five columns:

  • Id: unique identifier of the document
  • Text: the name of the entity (as given in the source)
  • Type: the type of the entity extracted with its confidence index (measuring the accuracy of the classification by the algorithm)
  • Start: the position of the character immediately preceding the entity in the text
  • End: the position of the character immediately following the entity in the text
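For instance, to get a quick overview of the most frequent entities of a given type, the NER table can be summarized with dplyr (a minimal sketch based on the Text and Type columns described above):

library(dplyr)

# twenty most frequent organizations in the English NER results
docs_ner_eng %>%
  filter(Type == "ORG") %>%
  count(Text, sort = TRUE) %>%
  head(20)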

The HistText package includes alternative models tailored to the nature of the text in the different corpora. In particular, it provides specific models for texts with noisy Optical Character Recognition (OCR) and/or missing punctuation.

The function list_ner_models() lists all the models available in the package:

list_ner_models()
##  [1] "spacy_sm:en:ner"            "spacy:en:ner"              
##  [3] "spacy_sm:zh:ner"            "spacy:zh:ner"              
##  [5] "trftc_noisy_nopunct:en:ner" "trftc_noisy:en:ner"        
##  [7] "trftc_nopunct:zh:ner"       "trftc_camembert:fr:ner"    
##  [9] "trftc_person_1class:zh:ner" "trftc_person_4class:zh:ner"


There are two main families of models: spaCy and trftc (Transformer Token Classification), the latter developed specifically by the ENP-China team (Baptiste Blouin, Jeremy Auguste) to handle issues that pertain to historical corpora in multiple languages.

| Name | Model | Language | Features |
|---|---|---|---|
| spacy_sm:en:ner | spaCy | English | Default model for large corpora (faster but less reliable) |
| spacy_sm:zh:ner | spaCy | Chinese | Default model for large corpora (faster but less reliable) |
| spacy:en:ner | spaCy | English | Default model for large or small corpora (slower but more reliable) |
| spacy:zh:ner | spaCy | Chinese | Default model for large or small corpora (slower but more reliable) |
| trftc_noisy_nopunct:en:ner | TrfTC | English | Model for texts with noisy OCR and missing punctuation |
| trftc_noisy:en:ner | TrfTC | English | Model for texts with noisy OCR |
| trftc_nopunct:zh:ner | TrfTC | Chinese | Model for texts with no punctuation |
| trftc_camembert:fr:ner | TrfTC | French | Model for texts in French (based on BERT) |
| trftc_person_1class:zh:ner | TrfTC | Chinese | Model trained to identify persons in any category |
| trftc_person_4class:zh:ner | TrfTC | Chinese | Model trained to detect persons’ names with embedded titles (王少年 for 王局長少年, 王君少年) |

To know which default model is used for a given corpus, use the function get_default_ner_model():

get_default_ner_model("proquest")
get_default_ner_model("imh-zh")


To specify the model you want to apply, use the argument “model =” in the ner_on_corpus() function:

ner_on_corpus(docs_ft, corpus = "proquest", model = "trftc_noisy_nopunct:en:ner")
ner_on_corpus(docs_ft, corpus = "shunpao", model = "trftc_person_4class:zh:ner")

5.2 Padagraph Visualization

The data extracted through NER can be visualized as a network graph using Padagraph. To enable this function, one first needs to create a network object using libraries such as igraph or tidygraph. The following lines of code detail the successive steps to transform the results of NER into an edge list and eventually into a network object. The last step consists of applying the function in_padagraph() to project the tidygraph object into Padagraph. In the following example, we build a two-mode network linking persons or organizations with the documents in which they appear:

library(dplyr)      # data manipulation (filter, select, rename, etc.)
library(igraph)     # graph_from_data_frame
library(tidygraph)  # as_tbl_graph, activate

# create the edge list linking documents with persons or organizations 
edge_ner_eng <- docs_ner_eng %>% 
  filter(Type %in% c("PERSON", "ORG")) %>% # select persons and organizations
  select(DocID, Text) %>% # retain only the relevant variables (from/to)
  filter(!Text %in% c("He","She","His","Her","Him", "Hers", "Chin", "Mrs", "Mr", "Gen", "Madame", "Madam", "he","she","his","her","him", "hers")) # remove personal pronouns
# retain entities that appeared at least twice
topnode_docs_eng <- edge_ner_eng %>% 
  group_by(Text) %>% 
  add_tally() %>% 
  filter(n>1) %>% 
  distinct(Text)
# build the network with igraph/tidygraph
topedge <- edge_ner_eng %>% 
  filter(Text %in% topnode_docs_eng$Text) %>% 
  rename(from = DocID, to = Text)
ig <- graph_from_data_frame(d=topedge, directed = FALSE)
tg <- tidygraph::as_tbl_graph(ig) %>% 
  activate(nodes) %>% 
  mutate(label=name)
# project in padagraph
tg %>% histtext::in_padagraph("RotaryNetwork") 


The function get_padagraph_url directly returns the URL for displaying the graph:

tg %>% histtext::get_padagraph_url("RotaryNetwork") 

[Figure: Rotary Network in ProQuest]
The URL for the graph is created as a permanent link that can be used in third-party applications or to return to the network later.

The function load_in_padagraph() serves to import a tidygraph object into “Padagraph”. The function is composed of three arguments:

  • filepath: the path to a file which contains a tidygraph (created with the function ‘save_graph’ included in HistText)
  • name: the name to be given to the graph
  • show_graph: if TRUE, show the graph in a RStudio viewer

The function returns a URL that displays the tidygraph object in Padagraph.

save_graph(graph, filepath) 
load_in_padagraph(filepath, name, show_graph = TRUE) 

5.3 NER on external documents

HistText includes two functions to apply NER to external documents, i.e. documents that can be loaded directly into RStudio:

  1. run_ner(): to be applied to a character string
  2. ner_on_df(): to be applied to a dataframe (e.g. a text extracted from a PDF document or scraped from the web, in dataframe format)

In addition, HistText includes a function, load_pdf_as_df, that transforms a pre-OCRed PDF document into a dataframe.

The function run_ner() takes two arguments: the text to be analyzed (text) and the model to be used (model) (use ‘list_ner_models()’ to obtain the available models):

run_ner(text, model = "spacy:en:ner", verbose = FALSE)
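A minimal usage sketch (the sentence is only an illustration; the exact shape of the returned table depends on the model):

# apply NER to a short English string with the default spaCy model
run_ner("Sun Yat-sen arrived in Shanghai in October 1911.", model = "spacy:en:ner")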

The function ner_on_df requires at least two arguments: the dataframe and the name of the column that contains the text (Text); the language model to be applied (English or Chinese) can also be specified. In the example below, we add the id_column argument so that each row is associated with its document identifier (DocId). We also choose to process only the first ten rows of the dataframe, but this restriction should be removed to process the whole document.

ner_df <- ner_on_df(dplyr::slice(docs_eng_ft, 1:10), "Text", id_column="DocId", model = "spacy:en:ner") 
## 1/10
ner_df


If applied to a PDF document, the ner_on_df() workflow involves three steps:

  1. Load the PDF document
  2. Convert the text into a dataframe (df) using the function load_pdf_as_df
  3. Apply the function to the dataframe.

In the example below, we choose to display only the first ten rows of the document, but this line of code should be removed to process the whole document. To eliminate the pages that do not contain text or that contain little text (pages with images), we set the minimum limit to at least 100 characters per page. The identifier argument serves to assign the name of the document (as you choose to label it) to the numbered pages (by default, pages appear simply as numbers). A full sketch of the PDF workflow follows the code below.

dplyr::slice(docs_eng_ft, 1:10)
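A possible version of the full PDF workflow is sketched below. This is a sketch only: it assumes that load_pdf_as_df takes the path to the PDF file and returns a data frame with the page text in a Text column and the page number in a DocId column; check the function documentation for the exact argument and column names.

library(dplyr)

# 1. load the PDF as a data frame (one row per page) -- argument and column names assumed
pdf_df <- histtext::load_pdf_as_df("my_document.pdf")

# 2. keep only the pages with at least 100 characters of text
pdf_df <- pdf_df %>% filter(nchar(Text) >= 100)

# 3. apply NER to the text column
pdf_ner <- histtext::ner_on_df(pdf_df, "Text", id_column = "DocId", model = "spacy:en:ner")
pdf_ner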

6 Question & Answer

The implementation of question-and-answer (Q&A) queries is another valuable functionality provided by the HistText library. This feature enables researchers to target and extract specific content from natural-language texts based on user-defined queries. By formulating questions or prompts, researchers can use the Q&A feature to extract data from documents in natural language. Q&A functions in HistText are particularly effective for retrieving biographical information.

Two models are currently available in HistText: one for Chinese and one for English. You can use list_qa_models() to list the available models:

histtext::list_qa_models()

6.1 Basic usage

The most basic use is to ask a single question:

imh_en_df <- histtext::search_documents('"member of party"', "imh-en")

histtext::qa_on_corpus(imh_en_df, "What is his full name?", "imh-en")


Alternatively, you can ask multiple variants of a question:

histtext::qa_on_corpus(imh_en_df, c("What is his full name?", "What name?"), "imh-en")

6.2 More complex usage

A more advanced usage of Q&A can be achieved when questions depend on previous questions:

questions <- list("name:full" = c("What is his full name?", "What name?"),
                  "education:location" = c("Where {name:full} study at?", "Where study at?"))
histtext::qa_on_corpus(imh_en_df, questions, "imh-en")


You can also specify the number of answers that a question should be allowed to produce:

histtext::qa_on_corpus(imh_en_df, questions, "imh-en", max_answers = list("education:location" = 2))


Examples of questions on which the models were trained can be accessed using the following functions:

histtext::biography_questions("en")
histtext::biography_questions("zh")

7 Chinese-specific functions

A key feature of HistText is to provide a set of functions designed to process documents in “transitional Chinese”, a term that we coined to refer to the Chinese language as it evolved from the near-classical language of the administration and imperial publications from the 1850s to the near-contemporary Chinese of the late 1940s (Blouin et al. 2023).

7.1 Tokenization

Tokenization refers to the operation of segmenting a text into tokens, which are the most elementary semantic units in a text. This is a crucial step for text analysis. Currently, two models are available in HistText, as listed below. The trftc_shunpao:zh:cws model is based on the initial annotation campaign conducted by Huang Hen-hsen (Academia Sinica) in 2021. The trftc_shunpao_23:zh:cws model is a refined model based on a second annotation campaign conducted by the ENP-China project in 2023 (Blouin et al. 2023).

list_cws_models()


The tokenizer can be applied to a corpus built with HistText (see section X) using the function cws_on_corpus, or it can be used directly on a specific data frame provided by the researcher using the function cws_on_df.

Below we provide an example for each case.

7.1.1 Corpus

The cws_on_corpus function includes the following arguments:

  • docids: the ‘DocId’ column returned by the search_documents function.
  • corpus: the corpus to be used, chosen from the available corpora in MCTB.
  • model: allows you to select the specific model to be used. If not specified, it defaults to the model set for the chosen corpus.
  • field: the field that contains the text to be tokenized.
  • detailed_output: When set to TRUE, it returns a data frame with one row per token, including the position in the text and confidence scores.
  • token_separator: specifies the character used to separate each token. By default, a single white space is used.
cws_on_corpus(
  docids,
  corpus = "__",
  model = "__default__",
  field = "__default__",
  detailed_output = FALSE,
  token_separator = " ",
  batch_size = 10,
  verbose = TRUE
)



Below is an example to illustrate how the function works:

# create sample corpus 
sample_corpus <- histtext::search_documents('"共產黨員"', "imh-zh")

# tokenize the corpus
tokenized_corpus <- histtext::cws_on_corpus(sample_corpus, "imh-zh", detailed_output = FALSE)

tokenized_corpus

library(knitr)       # kable
library(kableExtra)  # kable_styling

kable(tokenized_corpus, caption = "Tokenized corpus") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")


If you wish to display the detailed output:

tokenized_corpus_detailed <- histtext::cws_on_corpus(sample_corpus, "imh-zh", detailed_output = TRUE)

kable(tokenized_corpus_detailed, caption = "Tokenized corpus with detailed output") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

7.1.2 Data frame

The function cws_on_df follows a similar structure:

  • df: data frame which contains the texts to tokenize
  • text_column: name of the column which contains the input text to be tokenized
  • id_column: name of the column used to associate ids with the CWS outputs. By default, the row index in ‘df’ is used.
  • model: selection of the model to be used (use ‘list_cws_models()’ to get the available models)
  • detailed_output: if TRUE, return a dataframe with one row per token (with positions and confidence scores)
  • token_separator: the character to use to separate each token (default uses a normal white space)

cws_on_df(
  df,
  text_column,
  id_column = NULL,
  model = "trftc_shunpao_23:zh:cws",
  detailed_output = FALSE,
  token_separator = " ",
  verbose = TRUE
)

To illustrate how the tokenizer functions on a data frame, we provide a sample dataset (sample_df) that contains four documents extracted from the Shenbao between 1874 and 1889. The data frame includes five columns: DocId, Date, Title, Source, Text.

kable(sample_df, caption = "Sample dataframe") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")


We apply the tokenizer on the “Text” column of the sample data frame:

tokenized_df <- cws_on_df(
  sample_df,
  text_column = "Text",
  id_column = "DocId",
  model = "trftc_shunpao_23:zh:cws",
  detailed_output = FALSE,
  token_separator = " ",
  verbose = TRUE
)


The function returns a data frame with two columns, one with the Tokenized text (Text) and another with the original ids of the documents (DocId):

kable(tokenized_df, caption = "Tokenized dataframe") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

7.2 Conversions

Historical sources use various forms of transliteration of Chinese characters. The function wade_to_py serves to convert the standard (but obsolete) Wade-Giles transliteration system into pinyin. In the example below, some issues remain in the conversion output, which we correct with a few additional lines of code. These issues will be addressed in the near future.

library(readr)    # read_csv
library(dplyr)    # mutate, select, rename
library(stringr)  # str_remove_all, str_replace

wgconv <- read_csv("wgconv.csv")

wgconv %>% mutate(NameWG = wade_to_py(Original)) %>%
  mutate(NameWG2 = str_remove_all(NameWG, "-")) %>%
  mutate(NameWG3 = str_remove_all(NameWG2, "[:punct:]")) %>%
  mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>%
  mutate(NameWG5 = str_replace(NameWG4, "Chê", "Che")) %>% 
  select(Original, NameWG5) %>% 
  rename(output = NameWG5)

8 Additional features

8.1 Regular Expressions

The function extract_regexps_from_subcorpus() is designed to search a list of regular expressions in a corpus of documents. The function is composed of two arguments:

  • “corpus”: consists of a table with the documents (must include DocId, Title and Text columns), typically the output of the “get_documents()” function (e.g. “docs_eng_ft”).
  • “regexps”: a table that indicates the pattern(s) to look for (including two columns “Regexp” and “Type”)

The function returns a three-column table: the document id in which the pattern was found, the type of pattern, and the matched term (pattern).

regexpample <- read_delim("regexps.csv", delim = ";", escape_double = FALSE, trim_ws = TRUE) # load example table with regular expressions to search
regexp_output <- extract_regexps_from_subcorpus(docs_eng_ft, regexpample)
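If you do not have a CSV file at hand, the regexps table can also be built directly in R. The two patterns below are purely illustrative; only the column names (Regexp and Type) are required by the function:

library(tibble)

# build a small table of regular expressions to search for
regexpample <- tibble(
  Regexp = c("Rotary Club of [A-Z][a-z]+", "[0-9]+ members"),
  Type   = c("club", "membership")
)

regexp_output <- extract_regexps_from_subcorpus(docs_eng_ft, regexpample)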

8.2 Data transformation


This document presents a substantial workflow for the transformation of data.

9 Appendix

This table describes the functions available in HistText as of November 2023.


Control Functions

| Function | Description |
|---|---|
| accepts_date_queries | Check if a corpus accepts date queries |
| get_default_ner_model | Get the name of the default NER model for a given corpus |
| get_error_status | Retrieve the error status of a response |
| get_server_status | Get the status of the server |
| list_corpora | List available collections in SolR |

Query Functions

| Function | Description |
|---|---|
| search_documents | Search for documents |
| search_documents_ex | Extended search for documents |
| search_concordance | KWIC search in ENP corpora |
| search_concordance_ex | Extended KWIC search in ENP corpora |
| search_concordance_on_df | KWIC search in a custom dataframe |
| get_documents | Retrieve documents from IDs |
| count_documents | Get the number of articles matching a query, by date |
| count_search_documents | Count the number of documents that can be returned by a query |
| view_document | View a single document in RStudio |

Data extraction functions

| Function | Description |
|---|---|
| ner_on_corpus | Apply Named Entity Recognition on a corpus |
| ner_on_df | Apply Named Entity Recognition on the specified column of a dataframe |
| run_ner | Apply Named Entity Recognition on a string |
| run_qa | Apply Question-Answering on a string |
| qa_on_corpus | Apply Question-Answering on a corpus |
| qa_on_df | Apply Question-Answering on the specified column of a dataframe |
| extract_regexps_from_subcorpus | Apply a collection of regular expressions to a collection of documents |

Advanced functions

| Function | Description |
|---|---|
| list_search_fields | List possible search fields for a given corpus |
| get_search_fields_content | Retrieve the content associated with each search field |
| list_filter_fields | List possible filter fields for a given corpus |
| list_ner_models | List available NER models on the server |
| list_possible_filters | List possible filter values for a given filter field |
| list_precomputed_corpora | List corpora with precomputed annotations |
| list_precomputed_fields | List fields of a given corpus that have precomputed annotations |
| list_qa_models | List available Q&A models on the server |
| load_pdf_as_df | Load the text from a PDF into a data frame |
| proquest_view | Display an entry from the ProQuest corpus |

Chinese-specific functions

| Function | Description |
|---|---|
| list_cws_models | List available CWS models on the server |
| run_cws | Apply Chinese Word Segmentation on a string |
| get_default_cws_model | Get the name of the default CWS model for a given corpus |
| cws_on_corpus | Apply Chinese Word Segmentation on a corpus |
| cws_on_df | Apply Chinese Word Segmentation on the specified column of a dataframe |
| sinograms_to_py | Sinograms (漢字) to pinyin conversion |
| wade_to_py | Wade-Giles to pinyin conversion |

Graph functions

| Function | Description |
|---|---|
| get_padagraph_url | Send a tidygraph to Padagraph and return the URL |
| in_padagraph | Send a tidygraph to Padagraph and display it |
| load_in_padagraph | Load and send a previously saved graph object into Padagraph |
| save_graph | Save a tidygraph into a file |

Server functions

| Function | Description |
|---|---|
| query_server_get | GET a resource from the server |
| query_server_post | POST a file to the server |
| set_config_file | Set the config file in order to specify the server URL to use (and other needed information) |

10 Further documentation

  • ENP-China R package: Update 0.2.5, 08-06-2021, by Jeremy Auguste.
  • HistText Updates (1.0.0), 07-10-2021, by Jeremy Auguste.
  • HistText 1.6.2: Updates & News, 27-09-2022, by Jeremy Auguste.

Bibliography

Blouin, Baptiste, Christian Henriot, and Cécile Armand. 2023. “HistText: An Application for Leveraging Large-Scale Historical Textbases.” Journal of Data Mining and Digital Humanities. https://shs.hal.science/halshs-04178820.
Blouin, Baptiste, Hen-Hsen Huang, Christian Henriot, and Cécile Armand. 2023. “Unlocking Transitional Chinese: Word Segmentation in Modern Historical Texts.” Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities & 8th International Workshop on Computational Linguistics for Uralic Languages.