Abstract
This manual presents the functionalities of the HistText R library. It consists of two main sections: (1) searching and building a corpus, and (2) more advanced exploration.
This document is conceived as a practical guide for the R HistText library. The HistText library (or package) is a set of functions developed by the ENP-China Project in the R programming language designed for the exploration and data mining of Chinese-language and English-language digital corpora. Its main purpose is to place in the hands of historians, and more generally humanists, a set of ready-made tools to search, extract, and visualize textual data from large-scale multilingual corpora.
HistText represents the culmination of a longstanding and fruitful collaboration between historians and computer scientists that aimed at exploring machine learning in historical research. This symbiotic partnership has been instrumental in achieving optimized implementations, enhanced performance, and improved usability of HistText.
The main artisans of the HistText package are:
We initially developed this library to facilitate the exploration and extraction of data from the resources collected in the course of the ENP-China project, but as the HistText library developed, we realized that it could have a broader usage. Its functions can be applied to any corpus of a similar nature, provided it meets three basic requirements:
We developed the HistText library primarily because the providers of historical sources, even when the sources are available online, offer only very limited search functions. The HistText package features advanced capabilities for querying and extracting data from any available field (title, text, section) in a given corpus and for filtering by type of field and by date.
The use of the HistText library requires basic skills in R, such as creating a project, loading libraries, and running a script. The heavy work is done by the HistText library: all the user needs to do is substitute the terms of the queries in the scripts that we provide below. Because we wanted to provide concrete examples of how the functions operate and how to write a proper script, we wrote this manual as a Markdown document, which allows the user to simply copy and paste the code to reproduce the proposed operations.
The HistText library is also available on GitLab [https://gitlab.com/histtext] for those interested in implementing the same set of functions and applying them to their own corpora. This requires connecting the library to the relevant corpora on a given server, which can be done easily with the help of a computer scientist.
For an in-depth presentation and discussion of HistText's history, architecture, and broader contribution to the field of computational humanities, please refer to our paper (Blouin, Henriot, and Armand 2023).
The complete list of functions is available in the appendix.
# Install the HistText client from GitLab (requires the devtools package)
devtools::install_gitlab("enpchina/histtext-r-client", auth_token = "replace with your gitlab token")
Configuration of the package (replace fields with actual server information)
histtext::set_config_file(domain = "https://rapi.enpchina.eu",
user = "user_info", password = "user_info_password")
If successfully configured, the following command will return "OK":
histtext::get_server_status()
Now you can load the library:
library(histtext)
The function list_corpora serves to list all the corpora available in the Modern China Text Base created by the ENP-China Project. The corpora are stored on a Solr server. Each corpus is labeled with the specific name to be used in the search functions (see below):
histtext::list_corpora()
## [1] "archives" "chinajournal-pages" "csmo-pages"
## [4] "dongfangzz" "elder_workers" "elder_workers_format"
## [7] "imh-en" "imh-zh" "kmt9k"
## [10] "ncbras" "proquest" "reports-en"
## [13] "reports-fr" "scmp-recent" "shimingru-diary"
## [16] "shunpao" "shunpao-revised" "shunpao-tok"
## [19] "waiguozaihua" "wikibio-en" "wikibio-zh"
## [22] "zhanggangdiary"
### Brief description
Periodicals:
Other printed sources:
Archives:
Wikipedia:
Diaries and Memoirs
The content of the Modern China Text Base is expanding continuously. The presentation above may not reflect the most recent state of its collections.
Comprehensive statistics on all the corpora have been pre-computed and are available on GitLab. For each corpus, a specific folder provides access to all the CSV tables and the visualizations (see images below).
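If you download one of these pre-computed tables, it can be loaded and plotted directly in R. The sketch below is only illustrative: the file name and the column names (Year, Documents) are hypothetical and should be adapted to the actual table retrieved from GitLab.
library(readr)
library(ggplot2)
corpus_stats <- read_csv("shunpao_documents_per_year.csv")  # hypothetical file name
ggplot(corpus_stats, aes(Year, Documents)) +
  geom_col() +
  labs(x = "Year", y = "Number of documents")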
The function search_documents serves to find documents based on one or several terms. The function is composed of two main arguments: the queried term(s) and the targeted corpus. If the term consists of just one word or character, double quotation marks are sufficient. For compound terms, in English or Chinese, wrap the double-quoted term in single quotation marks, as in the examples below:
search_documents("Rotary", "proquest")
search_documents('"Rotary Club"', "proquest")
search_documents('"扶輪社"', "shunpao-revised")
It is also possible to run a query with multiple terms using Boolean operators, as in the examples below (here | stands for OR). For a detailed list of the possible operators, see this document:
search_documents('"Rotary" | "Rotary Club"', "proquest")
search_documents('"扶輪社" | "上海扶輪社"', "shunpao-revised")
The function generates a table with four columns indicating the
unique identifier of each document (DocId), the date of publication
(Date, in YYYYMMDD format), the title of the article (Title), and the
source (Source), e.g. the name of the periodical in the ProQuest
collection. In the table below, each row represents a unique
document:
docs_eng <- search_documents('"Shanghai Rotary Club"', "proquest")
docs <- search_documents('"扶輪社"', "shunpao-revised")
docs
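The result is a regular R data frame, so it can be manipulated with standard tidyverse verbs. As a minimal sketch (assuming the dplyr and lubridate packages are installed), the code below converts the Date column and keeps only the articles published after 1930:
library(dplyr)
library(lubridate)
docs %>%
  mutate(Date = as_date(as.character(Date), format = "%Y%m%d")) %>%  # Date is stored as YYYYMMDD
  filter(year(Date) > 1930)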
The count_search_documents function allows researchers
to determine the number of documents that can be returned by a
particular query without retrieving the actual documents. This function
aids researchers in understanding the potential size and scale of their
query results, enabling them to gauge the feasibility and magnitude of
their research endeavors before executing resource-intensive
queries.
histtext::count_search_documents('"上海"', "shunpao-revised")
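Because counting is much cheaper than retrieving the documents themselves, a common pattern is to check the size of a query before running the full search. A minimal sketch, assuming the function returns a single number as its name suggests (the threshold of 5,000 is arbitrary and only for illustration):
n_docs <- histtext::count_search_documents('"上海扶輪社"', "shunpao-revised")
if (n_docs < 5000) {  # arbitrary threshold, for illustration only
  docs_sh <- histtext::search_documents('"上海扶輪社"', "shunpao-revised")
}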
The function count_documents serves to count the documents matching a query, which makes it possible to visualize their distribution over time:
histtext::count_documents('"上海扶輪社"', "shunpao-revised") %>%
mutate(Date=lubridate::as_date(Date,"%y%m%d")) %>%
mutate(Year= year(Date)) %>%
group_by(Year) %>% summarise(N=sum(N)) %>%
filter (Year>=1920) %>%
ggplot(aes(Year,N)) + geom_col(alpha = 0.8) +
labs(title = "The Rotary Club of Shanghai in the Shenbao",
subtitle = "Number of articles mentioning 上海扶輪社",
x = "Year",
y = "Number of articles")
Counting can also be applied to specific fields that vary according
to the queried corpora. The list of possible fields can be obtained with
the function list_filter_fields():
list_filter_fields("proquest")
## [1] "publisher" "category"
list_filter_fields("dongfangzz")
## [1] "category" "volume" "issue"
For example, if we want to count the number of documents in Dongfang zazhi for the queried term "調查" in the category field:
histtext::count_documents('"調查"', "dongfangzz", by_field = "category") %>%
arrange(desc(N))
In the example below, we count the number of documents by publisher in the "ProQuest" corpus:
histtext::count_documents('"Rotary Club"', "proquest", by_field = "publisher") %>%
arrange(desc(N))
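As with the time series above, the resulting counts can be passed directly to ggplot2. The sketch below plots the ten publishers with the most matching articles; it assumes that the table returned by count_documents contains a column named publisher (the queried field) and a count column N, as in the previous examples.
library(dplyr)
library(ggplot2)
pub_counts <- histtext::count_documents('"Rotary Club"', "proquest", by_field = "publisher")
pub_counts %>%
  slice_max(N, n = 10) %>%                          # ten publishers with the most articles
  ggplot(aes(x = reorder(publisher, N), y = N)) +   # "publisher" column name is an assumption
  geom_col() +
  coord_flip() +
  labs(x = "Publisher", y = "Number of articles")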
The function search_concordance serves to explore the queried terms in their context. The function is composed of three arguments: the queried term(s) (within quotation marks, as in the search_documents function), the targeted corpus (preceded by "corpus ="), and an optional argument to determine the size of the context (number of characters on each side of the queried term). In the example below, we search for "扶輪社" in the Shenbao corpus, and we set 100 as the desired context:
concs <- search_concordance('"扶輪社"', corpus = "shunpao", context_size = 100)
concs
The output is similar to the table generated by the
search_documents function, with three additional columns
containing the queried term (Matching) and the text before (Before) and
after (After). In the table above, each row no longer represents a unique document but a single occurrence of the queried term. The concordance table usually contains more rows than the table of documents, since the queried term may appear several times in the same document.
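Since each row of the concordance table corresponds to one occurrence, the difference between the two tables can be inspected directly, for example by counting the number of matches per document. A minimal sketch, assuming the dplyr package is loaded:
library(dplyr)
concs %>%
  count(DocId, sort = TRUE)  # number of occurrences of 扶輪社 in each document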
The function get_documents serves to retrieve the full text of the documents. The function relies on the results of the search_documents function. It is composed of two arguments: the name of the variable (table of results) and the targeted corpus (within quotation marks):
docs_ft <- histtext::get_documents(docs, "shunpao")
## [1] 1
## [1] 11
## ...
## [1] 501
docs_eng_ft <- histtext::get_documents(docs_eng, "proquest")
## [1] 1
## [1] 11
## ...
## [1] 781
docs_ft
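The full-text table can then be processed like any other data frame, for example to measure the length of each article or to export the corpus for analysis outside R. The sketch below assumes that the retrieved text is stored in a column named Text; this name is hypothetical, so check the actual structure with colnames(docs_ft) first.
library(dplyr)
library(readr)
docs_ft %>%
  mutate(Length = nchar(Text)) %>%  # "Text" is a hypothetical column name
  arrange(desc(Length))
write_csv(docs_ft, "shenbao_rotary_fulltext.csv")  # export the full corpus to CSV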