Abstract
This manual presents the functionalities of the HistText R library. It consists of two main sections: (1) searching and building a corpus, and (2) more advanced exploration.
This document is conceived as a practical guide to the HistText R library. The HistText library (or package) is a set of functions developed by the ENP-China Project in the R programming language, designed for the exploration and data mining of Chinese-language and English-language digital corpora. Its main purpose is to place in the hands of historians, and more generally humanists, a set of ready-made tools to search, extract, and visualize textual data from large-scale multilingual corpora.
HistText represents the culmination of a longstanding and fruitful collaboration between historians and computer scientists that aimed at exploring machine learning in historical research. This symbiotic partnership has been instrumental in achieving optimized implementations, enhanced performance, and improved usability of HistText.
The main artisans of the HistText package are:
We initially developed this library to facilitate the exploration and extraction of data from the resources collected in the course of the ENP-China project, but as the HistText library developed, we realized that it could have broader uses. Its functions can be applied to any corpus of a similar nature, provided it meets three basic requirements:
Basically, we developed the HistText library because providers of historical sources, even when these sources are available online, offer only very limited search functions. The HistText package features advanced capabilities for querying and extracting data from any available field (title, text, section) in a given corpus, and for filtering by type of field and by date.
The use of the HistText library requires basic skills in R, such as the basic notions for creating a project, loading libraries, and running a script. But the heavy work is done by the HistText library. All the user needs to do is substitute the terms of the queries in the scripts that we provide below. Because we wanted to provide concrete examples of how the functions operate and how to write a proper script, we have written this manual as a Markdown document. It allows the user to simply copy and paste the code to reproduce the proposed operations.
The HistText library is also available on GitLab [https://gitlab.com/histtext] for those who would like to implement the same set of functions and apply them to their own corpora. Connecting the library to the relevant corpora on a given server requires some technical skills, but this can be done easily with the help of a computer scientist.
For an in-depth presentation and discussion of HistText’s history, architecture, and broader contribution to the field of computational humanities, please refer to our paper (Blouin, Henriot, and Armand 2023).
The complete list of functions is available in the appendix.
devtools::install_gitlab("enpchina/histtext-r-client", auth_token = "replace with your gitlab token")
Configuration of the package (replace fields with actual server information)
histtext::set_config_file(domain = "https://rapi.enpchina.eu",
user = "user_info", password = "user_info_password")
If successfully configured, the following command will return “OK”
histtext::get_server_status()
Now you can load the library
library(histtext)
The function list_corpora serves to list all the corpora available on the Modern China Text Base created by the ENP-China Project. The corpora are stored on a SolR server. Each corpus is labeled with the specific name to be used in the search functions (see below):
histtext::list_corpora()
## [1] "archives" "chinajournal-pages" "csmo-pages"
## [4] "dongfangzz" "elder_workers" "elder_workers_format"
## [7] "imh-en" "imh-zh" "kmt9k"
## [10] "ncbras" "proquest" "reports-en"
## [13] "reports-fr" "scmp-recent" "shimingru-diary"
## [16] "shunpao" "shunpao-revised" "shunpao-tok"
## [19] "waiguozaihua" "wikibio-en" "wikibio-zh"
## [22] "zhanggangdiary"
### Brief description
Periodicals:
Other printed sources:
Archives:
Wikipedia:
Diaries and Memoirs:
The content of the Modern China Text Base is expanding continuously. The presentation above may not reflect the most recent state of its collections.
Comprehensive statistics on all the corpora have been pre-computed and are available on GitLab. For each corpus, a specific folder provides access to all the CSV tables and the visualizations (see images below).
The function search_documents serves to find documents based on one or several terms. The function is composed of two main arguments: the queried term(s) and the targeted corpus. If the term consists of just one word or character, double quotation marks are sufficient. For compound expressions, in English or Chinese, wrap the double-quoted expression in single quotation marks, as in the examples below:
search_documents("Rotary", "proquest")
search_documents('"Rotary Club"', "proquest")
search_documents('"扶輪社"', "shunpao-revised")
It is also possible to run a query with multiple terms using Boolean operators, as in the examples below (here | = OR). For a detailed list of possible operators in R, see this document:
search_documents('"Rotary" | "Rotary Club"', "proquest")
search_documents('"扶輪社" | "上海扶輪社"', "shunpao-revised")
The function generates a table with four columns indicating the
unique identifier of each document (DocId), the date of publication
(Date, in YYYYMMDD format), the title of the article (Title), and the
source (Source), e.g. the name of the periodical in the ProQuest
collection. In the table below, each row represents a unique
document:
docs_eng <- search_documents('"Shanghai Rotary Club"', "proquest")
docs <- search_documents('"扶輪社"', "shunpao-revised")
docs
The count_search_documents function allows researchers
to determine the number of documents that can be returned by a
particular query without retrieving the actual documents. This function
aids researchers in understanding the potential size and scale of their
query results, enabling them to gauge the feasibility and magnitude of
their research endeavors before executing resource-intensive
queries.
histtext::count_search_documents('"上海"', "shunpao-revised")
The function count_documents serves to visualize the
distribution of documents matching a query over time:
library(dplyr)
library(lubridate)
library(ggplot2)

histtext::count_documents('"上海扶輪社"', "shunpao-revised") %>%
  mutate(Date = lubridate::as_date(Date, format = "%Y%m%d")) %>%
  mutate(Year = year(Date)) %>%
  group_by(Year) %>% summarise(N = sum(N)) %>%
  filter(Year >= 1920) %>%
  ggplot(aes(Year, N)) + geom_col(alpha = 0.8) +
  labs(title = "The Rotary Club of Shanghai in the Shenbao",
       subtitle = "Number of articles mentioning 上海扶輪社",
       x = "Year",
       y = "Number of articles")
Counting can also be applied to specific fields that vary according
to the queried corpora. The list of possible fields can be obtained with
the function list_filter_fields():
list_filter_fields("proquest")
## [1] "publisher" "category"
list_filter_fields("dongfangzz")
## [1] "category" "volume" "issue"
For example, if we want to count the number of documents in
Dongfang zazhi for the queried term “調查” in the category
field:
histtext::count_documents('"調查"', "dongfangzz", by_field = "category") %>%
arrange(desc(N))
In the example below, we count the number of documents by
publisher in the “ProQuest” corpus:
histtext::count_documents('"Rotary Club"', "proquest", by_field = "publisher") %>%
arrange(desc(N))
The function search_concordance serves to explore the queried terms in their context. The function is composed of three arguments: the queried term(s) (within quotation marks, as in the search_documents function), the targeted corpus (preceded by “corpus =”), and an optional argument to determine the size of the context (number of characters on each side of the queried term). In the example below, we search “扶輪社” in the Shenbao corpus, and we set 100 as the desired context:
concs <- search_concordance('"扶輪社"', corpus = "shunpao", context_size = 100)
concs
The output is similar to the table generated by the
search_documents function, with three additional columns
containing the queried term (Matching) and the text before (Before) and
after (After). In the table above, each row no longer represents a unique document; instead, each row corresponds to one occurrence of the queried term. The concordance table therefore usually contains more rows than the table of documents, since the queried term may appear several times in the same document.
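This difference can be checked quickly by comparing the number of concordance rows with the number of distinct documents (a minimal sketch, which only assumes the DocId column described above):
nrow(concs)                       # one row per occurrence of the queried term
dplyr::n_distinct(concs$DocId)    # number of distinct documents containing it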
The function get_documents serves to retrieve the full text of the documents. The function relies on the results of the search_documents function. It is composed of two arguments: the name of the variable (table of results) and the targeted corpus (within quotation marks):
docs_ft <- histtext::get_documents(docs, "shunpao")
## [1] 1
## [1] 11
## ...
## [1] 501
docs_eng_ft <- histtext::get_documents(docs_eng, "proquest")
## [1] 1
## [1] 11
## ...
## [1] 781
docs_ft
The function generates the same table as search_documents, with
an additional column that contains the full text of the document
(Text).
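Since retrieving full texts can take some time, it may be convenient to save the resulting table locally for later reuse (a minimal sketch using readr; the file name is purely illustrative):
library(readr)
# Save the full-text table so the documents do not have to be downloaded again
write_csv(docs_ft, "shenbao_rotary_fulltext.csv")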
If you want to read individual documents more closely, use the view_document function. The document will appear in your Viewer panel.
view_document("SPSP193602290401", "shunpao-revised")
If you want to highlight a term of interest in the document, add the term after the “query” argument as indicated below:
view_document("SPSP193602290401", "shunpao-revised", query = '"扶輪社"')
The function proquest_view() is designed to display the original document on the ProQuest platform, based on the ID of the document (for ProQuest subscribers only):
proquest_view(1371585080)
The function view_document() is designed to display the full text of a single document with the functionality of highlighting selected key words in the text. The function is composed of three arguments:
view_document(1371585080, "proquest", query = c("Nanking", "club"))
view_document("SPSP193011141210", "shunpao", query = c("扶輪社", "飯店"))
The package provides more advanced functions to perform multi-field queries.
The function search_documents serves to query terms in fields other than the content of articles. For instance, one can perform queries based on the title of articles and/or date of publication. It is possible to search several fields at the same time. The function is composed of five arguments, as described below:
search_documents <- function(q,
corpus="shunpao",
search_fields=c(),
start=0,
dates=c())
To obtain the list of searchable fields available in the
targeted corpus, use the function list_search_fields.
For instance, the example below shows that three fields can be searched
in the Dongfang zazhi 東方雜誌 - the title, the author, or the
full text of the article:
list_search_fields("dongfangzz")
## [1] "text" "title" "authors"
If no field is selected explicitly, queries will be performed in
all fields.
The function list_filter_fields() lists the other available fields, which are meant for filtering rather than full-text search (they are technically searchable, but this is rarely relevant):
list_filter_fields("dongfangzz")
## [1] "category" "volume" "issue"
In the above example, you can filter the results of your search
in the Dongfang zazhi by the category of article, the volume
and issue.
The function list_possible_filters() can be applied to any filterable field to display its contents:
list_possible_filters("dongfangzz", "category") %>% arrange(desc(N))
The first column “Value” contains the possible filters that you
can use. The second column “N” indicates the number of documents in the
corresponding filter value.
Use accepts_date_queries to test whether the targeted corpus supports date filters. For instance, the example below shows that it is possible to search dates in the Shenbao or Dongfang zazhi, but not in Wikipedia (Chinese or English):
histtext::accepts_date_queries("shunpao")
## [1] TRUE
histtext::accepts_date_queries("dongfangzz")
## [1] TRUE
histtext::accepts_date_queries("wikibio-zh")
## [1] FALSE
histtext::accepts_date_queries("wikibio-en")
## [1] FALSE
For instance, search the term 扶輪社 in the Shenbao in the title only:
search_documents('"扶輪社"', corpus="shunpao", search_fields= "title")
Search the same term 扶輪社 in the Shenbao in both
title and full text :
search_documents('"扶輪社"', corpus="shunpao", search_fields=c("title", "text"))
Search the term 扶輪社 in the Shenbao in all possible fields for the year 1933 only:
search_documents('"扶輪社"', corpus="shunpao", dates="1933")
Search the term 扶輪社 in the Shenbao in all possible
fields for March 1933 only:
search_documents('"扶輪社"', corpus="shunpao", dates="1933-03")
Search the term 扶輪社 in the Shenbao in all possible fields for the two years 1933 and 1940:
search_documents('"扶輪社"', corpus="shunpao", dates=c("1933", "1940"))
Search the term 扶輪社 in the Shenbao in all possible
fields from 1930 to 1940:
search_documents('"扶輪社"', corpus="shunpao", dates="[1930 TO 1940]")
Search the term 扶輪社 in the Shenbao in all possible
fields from 1930 to 1940 and 1945:
search_documents('"扶輪社"', corpus="shunpao", dates=c("[1930 TO 1940]", "1945"))
Combined query on different fields and dates:
combined_search <- search_documents('"扶輪社"', corpus="shunpao",
search_fields="title",
dates=c("[1930 TO 1940]", "1945"))
combined_search
The extended search also applies to concordances through the function search_concordance, which is composed of six arguments: the same five as in search_documents, plus an additional argument for the context size:
search_concordance <- function(q,
corpus="shunpao",
search_fields=c(),
context_size=30,
start=0,
dates=c())
The advanced functions that we described above -
list_search_fields(),
list_filter_fields(), and
list_possible_filters() - can also be applied in
combination with the search_concordance() function.
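For example, the sketch below reuses the arguments documented above to restrict a concordance search to article titles published between 1930 and 1940 (the query and parameter values are illustrative):
# Concordance restricted to titles and to the 1930-1940 period
search_concordance('"扶輪社"', corpus = "shunpao",
                   search_fields = "title",
                   context_size = 50,
                   dates = "[1930 TO 1940]")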
In HistText, pre-computed word embeddings (i.e., learned representations for text where words with similar meanings have similar representations), can be utilized to enhance queries by incorporating similar terms. Please note that this function is currently under construction. However, it is already accessible through the HistText Search interface.
The function stats_count provides statistics on the queried term(s). With the default parameters, it returns the number of occurrences of the keyword(s) per article.
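In the examples below, df stands for a table of full-text documents such as the output of get_documents(). The sketch below shows one possible way to build it; the query term "noble" and the ProQuest corpus are used purely for illustration:
# Illustrative setup: build a small full-text data frame to pass to stats_count()
df <- histtext::get_documents(
  dplyr::slice(histtext::search_documents('"noble"', "proquest"), 1:50),
  "proquest")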
stats_count(df,"noble")
By adding the over_time=TRUE parameter, you can obtain the average frequency at which the keyword appears over the years:
stats_count(df,"noble",over_time=TRUE)
By adding the by_char=TRUE parameter, you can contrast the number of occurrences of the queried term with the overall number of tokens in the text per year:
stats_count(df,"noble",over_time=TRUE,by_char=TRUE)
For all statistical functions, it is possible to retrieve the statistics dataframe without generating the graph by setting to_plot=FALSE.
stats_count(df,"noble",over_time=TRUE,by_char=TRUE,to_plot=FALSE)
All visualizations are created using Plotly to make them interactive.
However, it is possible to generate a non-interactive graph by setting ly=FALSE.
stats_count(df,"noble",over_time=TRUE,by_char=TRUE,ly=FALSE)
The function count_character_bins counts the number of characters in the texts and then groups them into “bins”, i.e. ranges of values into which the data are sorted for statistical analysis. It is possible to set the number of bins manually using the nb_bins argument, as shown below.
count_character_bins(df)
count_character_bins(df,nb_bins=20)
The classic_dtm function allows you to explore the semantic content of the text in more depth. To use this function, you need to have the following packages installed:
install.packages("stopwords")
install.packages("quanteda")
install.packages("quanteda.textstats")
The classic configuration allows you to see the top ‘n’ words present in the given texts. You can set the number of desired words manually using the top argument:
classic_dtm(df,top=20)
By adding over_time=TRUE
it is possible to display the
frequency of these top “n” words over time:
classic_dtm(df,top=20,over_time=TRUE)
By adding doc_simi=TRUE
it is possible to calculate the
similarity between documents with respect to the frequencies of the
words they use.
install.packages("ggdendro")
classic_dtm(df,top=20,doc_simi=TRUE)
Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that serves to extract the names of real-world entities (persons, organizations, places, time, currency, etc.) mentioned in a corpus of documents. In HistText, the function ner_on_corpus applies a default NER model that varies according to the language (English or Chinese) or the nature of the queried corpus (state of the language, OCR quality, etc.).
The default model is based on the software spaCy and the ontology OntoNotes. This model categorizes entities into eight main types: persons (PERS), organizations (ORG), locations (LOC), geopolitical entities (GPE) (countries, administrative regions…), temporal entities (DATE, TIME), numerical entities (MONEY, PERCENTAGE), and miscellaneous entities (MISC).
The function ner_on_corpus is composed of two arguments: the collection of documents with their full text (output of the “get_documents()” function), and the targeted corpus.
Example in English (ProQuest):
docs_ner_eng <- ner_on_corpus(docs_eng, corpus = "proquest")
## 1/785
## 11/785
## ...
## 781/785
docs_ner_eng
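As a quick check, you can tabulate the extracted entities by type (a minimal sketch; it only assumes the Type column that is also used in the network example further below):
# Count the extracted entities by type (PERSON, ORG, GPE, DATE, etc.)
dplyr::count(docs_ner_eng, Type, sort = TRUE)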
Example in Chinese (Shenbao):
docs_ner_zh <- ner_on_corpus(docs_ft, corpus = "shunpao")
## 1/507
## 11/507
## ...
## 501/507
docs_ner_zh
The function returns a table with five columns:
The histtext package includes alternative models tailored to the nature of the text in the different corpora. Mostly, it provides specific models for text with noisy Optical Character Recognition (OCR) and/or missing punctuation.
The function list_ner_models() lists all the models available in the package:
list_ner_models()
## [1] "spacy_sm:en:ner" "spacy:en:ner"
## [3] "spacy_sm:zh:ner" "spacy:zh:ner"
## [5] "trftc_noisy_nopunct:en:ner" "trftc_noisy:en:ner"
## [7] "trftc_nopunct:zh:ner" "trftc_camembert:fr:ner"
## [9] "trftc_person_1class:zh:ner" "trftc_person_4class:zh:ner"
There are two main models: spaCy and trftc
(Transformer Token Classification) - a model specifically developed by
the ENP-China team (Baptiste Blouin, Jeremy Auguste) to handle specific
issues that pertain to historical corpora in multiple languages.
Name | Model | Language | Features |
---|---|---|---|
spacy_sm:en:ner | spaCy | English | Default model for large corpora (faster but less reliable) |
spacy_sm:zh:ner | spaCy | Chinese | Default model for large corpora (faster but less reliable) |
spacy:en:ner | spaCy | English | Default model for large or small corpora (slower but more reliable) |
spacy:zh:ner | spaCy | Chinese | Default model for large or small corpora (slower but more reliable) |
trftc_noisy_nopunct:en:ner | TrfTC | English | Model for texts with noisy OCR and missing punctuation |
trftc_noisy:en:ner | TrfTC | English | Model for texts with noisy OCR |
trftc_nopunct:zh:ner | TrfTC | Chinese | Model for texts with no punctuation |
trftc_camembert:fr:ner | TrfTC | French | Models for texts in French (based on BERT) |
trftc_person_1class:zh:ner | TrfTC | Chinese | Model trained to identify persons in any category |
trftc_person_4class:zh:ner | TrfTC | Chinese | Model trained to detect persons’ names with embedded titles (王少年 for 王局長少年, 王君少年) |
To know which default model is used for a given corpus, use the function get_default_ner_model():
get_default_ner_model("proquest")
get_default_ner_model("imh-zh")
To specify the model you want to apply, use the argument “model
=” in the ner_on_corpus() function:
ner_on_corpus(docs_ft, corpus = "proquest", model = "trftc_noisy_nopunct:en:ner")
ner_on_corpus(docs_ft, corpus = "shunpao", model = "trftc_person_4class:zh:ner")
The data extracted through NER can be visualized as a network graph using Padagraph. To enable this, one first needs to create a network object using libraries such as igraph or tidygraph. The following lines of code detail the successive steps to transform the results of NER into an edge list and, eventually, a network object. The last step applies the function in_padagraph() to project the tidygraph object into Padagraph. In the following example, we build a two-mode network linking persons or organizations with the documents in which they appear:
library(dplyr)
library(igraph)
library(tidygraph)

# create the edge list linking documents with persons or organizations
edge_ner_eng <- docs_ner_eng %>%
  filter(Type %in% c("PERSON", "ORG")) %>%  # select persons and organizations
  select(DocID, Text) %>%                   # retain only the relevant variables (from/to)
  filter(!Text %in% c("He","She","His","Her","Him","Hers","Chin","Mrs","Mr","Gen",
                      "Madame","Madam","he","she","his","her","him","hers"))  # remove personal pronouns
# retain entities that appeared at least twice
topnode_docs_eng <- edge_ner_eng %>%
group_by(Text) %>%
add_tally() %>%
filter(n>1) %>%
distinct(Text)
# build the network with igraph/tidygraph
topedge <- edge_ner_eng %>%
filter(Text %in% topnode_docs_eng$Text) %>%
rename(from = DocID, to = Text)
ig <- graph_from_data_frame(d=topedge, directed = FALSE)
tg <- tidygraph::as_tbl_graph(ig) %>%
activate(nodes) %>%
mutate(label=name)
# project in padagraph
tg %>% histtext::in_padagraph("RotaryNetwork")
The function get_padagraph_url directly returns the URL
for displaying the graph:
tg %>% histtext::get_padagraph_url("RotaryNetwork")
The URL for the graph is created as a permanent link that can be used in third-party applications or to return to the network later.
The function load_in_padagraph() serves to load a previously saved graph object and send it to Padagraph. The function is composed of three arguments:
The function returns a URL that displays the graph in Padagraph.
save_graph(graph, filepath)
load_in_padagraph(filepath, name, show_graph = TRUE)
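For instance, the graph built in the previous section can be saved to disk and later reloaded into Padagraph (a minimal sketch; the file name is purely illustrative):
# Save the tidygraph object built above, then reload and display it in Padagraph
save_graph(tg, "rotary_network.graph")   # illustrative file name
load_in_padagraph("rotary_network.graph", "RotaryNetwork", show_graph = TRUE)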
HistText includes two functions to implement NER on external documents, i.e. documents that can be loaded directly into RStudio:
In addition, HistText includes a function that transforms a pre-OCRed PDF document into a dataframe.
The function run_ner() takes two arguments: the text to be analyzed (text) and the model to be used (model) (use ‘list_ner_models()’ to obtain the available models):
run_ner(text, model = "spacy:en:ner", verbose = FALSE)
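A minimal sketch of the function applied to a short string (the sentence is invented for illustration):
# Apply NER to a short piece of text with the English spaCy model
run_ner("The Shanghai Rotary Club met at the Metropole Hotel in 1936.",
        model = "spacy:en:ner")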
The function ner_on_df requires at least two arguments: the name of the column that contains the text (Text) and the language model to apply (English or Chinese). In the example below, we add the id_column argument to associate each row with its document identifier (DocId). We also choose to process only the first ten rows, but this line of code should be removed to process the whole set of documents.
ner_df <- ner_on_df(dplyr::slice(docs_eng_ft, 1:10), "Text", id_column="DocId", model = "spacy:en:ner")
## 1/10
ner_df
If applied to a PDF document, the use of ner_on_df() involves three steps:
In the example below, we choose to process only the first ten rows of the document, but this line of code should be removed to process the whole document. To eliminate the pages that do not contain text or that contain little text (pages with images), we set the minimum limit to at least 100 characters per page. The identifier argument serves to assign the name of the document (as you choose to label it) to the numbered pages (by default, pages appear simply as numbers).
dplyr::slice(docs_eng_ft, 1:10)
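As an end-to-end sketch of the three steps, the lines below assume that load_pdf_as_df() takes the path to a pre-OCRed PDF and returns a data frame with one row per page and a Text column; these argument and column names are assumptions, so check ?load_pdf_as_df before adapting the code:
# 1. Load the pre-OCRed PDF into a data frame (one row per page) -- argument
#    and column names are assumptions, see ?load_pdf_as_df
pdf_df <- load_pdf_as_df("report.pdf")
# 2. Remove pages with little or no text (fewer than 100 characters)
pdf_df <- dplyr::filter(pdf_df, nchar(Text) >= 100)
# 3. Apply NER to the remaining pages
pdf_ner <- ner_on_df(pdf_df, "Text", model = "spacy:en:ner")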
The implementation of question-and-answer (Q&A) queries is another valuable functionality provided by the HistText library. This feature enables researchers to target and extract specific content from natural-language texts based on user-defined queries. By formulating questions or prompts, researchers can use the Q&A feature to extract data from documents in natural language. Q&A functions in HistText are particularly effective for retrieving biographical information.
Two models are currently available in HistText: one for Chinese and one for English. You can use the list_qa_models() function to list the available models:
histtext::list_qa_models()
The most basic use is to ask a single question:
imh_en_df <- histtext::search_documents('"member of party"', "imh-en")
histtext::qa_on_corpus(imh_en_df, "What is his full name?", "imh-en")
Alternatively, you can ask multiple variants of a question:
histtext::qa_on_corpus(imh_en_df, c("What is his full name?", "What name?"), "imh-en")
A more advanced usage of Q&A can be achieved when questions depend on previous questions:
questions <- list("name:full" = c("What is his full name?", "What name?"),
"education:location" = c("Where {name:full} study at?", "Where study at?"))
histtext::qa_on_corpus(imh_en_df, questions, "imh-en")
You can also specify the number of answers that a question should be allowed to produce:
histtext::qa_on_corpus(imh_en_df, questions, "imh-en", max_answers = list("education:location" = 2))
Examples of questions on which the models were trained can be accessed using the following functions:
histtext::biography_questions("en")
histtext::biography_questions("zh")
A key feature of HistText is to provide a set of functions designed to process documents in “transitional Chinese”, a term that we coined to refer to the Chinese language as it evolved from the near-classical language of the administration and imperial publications from the 1850s to the near-contemporary Chinese of the late 1940s (Blouin et al. 2023).
Tokenization refers to the operation of segmenting a text into tokens, which are the most elementary semantic units in a text. This is a crucial step for text analysis. Currently, two models are available in HistText, as listed below. The trftc_shunpao:zh:cws model is based on the initial annotation campaign conducted by Huang Hen-hsen (Academia Sinica) in 2021. The trftc_shunpao_23:zh:cws model is a refined model based on a second annotation campaign conducted by the ENP-China project in 2023 (Blouin et al. 2023).
list_cws_models()
The tokenizer can be applied to a corpus built with HistText (see section X) using the function cws_on_corpus, or it can be used directly on a specific data frame provided by the researcher using the function cws_on_df.
Below we provide an example for each case.
The cws_on_corpus function includes the following arguments:
cws_on_corpus(
docids,
corpus = "__",
model = "__default__",
field = "__default__",
detailed_output = FALSE,
token_separator = " ",
batch_size = 10,
verbose = TRUE
)
Below is an example to illustrate how the function works:
# create sample corpus
imh_df <- histtext::search_documents('"共產黨員"', "imh-zh")
# tokenize the corpus
tokenized_corpus <- histtext::cws_on_corpus(imh_df, "imh-zh", detailed_output = FALSE)
tokenized_corpus
library(knitr)
library(kableExtra)

kable(tokenized_corpus, caption = "Tokenized corpus") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
If you wish to display the detailed output:
tokenized_corpus_detailed <- histtext::cws_on_corpus(imh_df, "imh-zh", detailed_output = TRUE)
kable(tokenized_corpus_detailed, caption = "Tokenized corpus with detailed output") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
The function cws_on_df follows a similar structure:
- df: data frame containing the texts to tokenize
- text_column: name of the column that contains the input text to be tokenized
- id_column: name of the column used to associate ids with the CWS outputs; by default, the row index in ‘df’ is used
- model: model to be used (use ‘list_cws_models()’ to get the available models)
- detailed_output: if TRUE, returns a dataframe with one row per token (with positions and confidence scores)
- token_separator: the character used to separate tokens (by default, a normal white space)
cws_on_df(
df,
text_column,
id_column = NULL,
model = "trftc_shunpao_23:zh:cws",
detailed_output = FALSE,
token_separator = " ",
verbose = TRUE
)
To illustrate how the tokenizer functions on a data frame, we provide a sample dataset (sample_df) that contains four documents extracted from the Shenbao between 1874 and 1889. The data frame includes five columns: DocId, Date, Title, Source, Text.
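If the bundled sample is not at hand, a comparable data frame can be assembled with the functions presented earlier (the query term and the number of documents are purely illustrative):
# Rebuild a small full-text sample from the Shenbao (illustrative query)
sample_docs <- histtext::search_documents('"輪船"', "shunpao", dates = "[1874 TO 1889]")
sample_df <- histtext::get_documents(dplyr::slice(sample_docs, 1:4), "shunpao")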
kable(sample_df, caption = "Sample dataframe") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
We apply the tokenizer on the “Text” column of the sample data
frame:
tokenized_df <- cws_on_df(
sample_df,
text_column = "Text",
id_column = "DocId",
model = "trftc_shunpao_23:zh:cws",
detailed_output = FALSE,
token_separator = " ",
verbose = TRUE
)
The function returns a data frame with two columns, one with the
Tokenized text (Text) and another with the original ids of the documents
(DocId):
kable(tokenized_df, caption = "Tokenized dataframe") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
Historical sources use various forms of transliteration of Chinese characters. The function wade_to_py serves to convert the standard (but now obsolete) Wade-Giles transliteration system into pinyin. In the example below, a few issues remain in the conversion output, which we correct with additional lines of code. These issues will be addressed in the near future.
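Before turning to the full example, a minimal call on a plain character vector shows the basic behaviour (the names are illustrative input):
# Convert a few Wade-Giles romanizations to pinyin (illustrative input)
wade_to_py(c("Mao Tse-tung", "Chou En-lai", "Chiang Kai-shek"))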
library(readr)
library(dplyr)
library(stringr)

wgconv <- read_csv("wgconv.csv")

wgconv %>% mutate(NameWG = wade_to_py(Original)) %>%
  mutate(NameWG2 = str_remove_all(NameWG, "-")) %>%
  mutate(NameWG3 = str_remove_all(NameWG2, "[:punct:]")) %>%
  mutate(NameWG4 = str_replace(NameWG3, "uê", "We")) %>%
  mutate(NameWG5 = str_replace(NameWG4, "Chê", "Che")) %>%
  select(Original, NameWG5) %>%
  rename(output = NameWG5)
The function extract_regexps_from_subcorpus() is designed to search a list of regular expressions in a corpus of documents. The function is composed of two arguments:
The function returns a three-column table: the document id in which the pattern was found, the type of pattern, and the matched term (pattern).
regexpample <- read_delim("regexps.csv", delim = ";", escape_double = FALSE, trim_ws = TRUE) # load example table with regular expressions to search
regexp_output <- extract_regexps_from_subcorpus(docs_eng_ft, regexpample)
This document has presented a complete workflow for searching, extracting, and transforming textual data with HistText.
This table describes the 42 functions available in HistText as of November 2023.
Control Functions | 
---|---
accepts_date_queries | Check if a corpus accepts date queries
get_default_ner_model | Get the name of the default NER model for a given corpus
get_error_status | Retrieve the error status of a response
get_server_status | Get the status of the server
list_corpora | List available collections in SolR

Query Functions | 
---|---
search_documents | Search for documents
search_documents_ex | Extended search for documents
search_concordance | KWIC search in ENP corpora
search_concordance_ex | Extended KWIC search in ENP corpora
search_concordance_on_df | KWIC search in a custom dataframe
get_documents | Retrieve documents from their IDs
count_documents | Get the number of articles matching a query, by date
count_search_documents | Count the number of documents that can be returned by a query
view_document | View a single document in RStudio

Data extraction functions | 
---|---
ner_on_corpus | Apply Named Entity Recognition on a corpus
ner_on_df | Apply Named Entity Recognition on the specified column of a dataframe
run_ner | Apply Named Entity Recognition on a string
run_qa | Apply Question-Answering on a string
qa_on_corpus | Apply Question-Answering on a corpus
qa_on_df | Apply Question-Answering on the specified column of a dataframe
extract_regexps_from_subcorpus | Apply a collection of regular expressions to a collection of documents

Advanced functions | 
---|---
list_search_fields | List possible search fields for a given corpus
get_search_fields_content | Retrieve the content associated with each search field
list_filter_fields | List possible filter fields for a given corpus
list_ner_models | List available NER models on the server
list_possible_filters | List possible filter values for a given filter field
list_precomputed_corpora | List corpora with precomputed annotations
list_precomputed_fields | List fields of a given corpus that have precomputed annotations
list_qa_models | List available QA models on the server
load_pdf_as_df | Load the text from a PDF into a data frame
proquest_view | Display an entry from the ProQuest corpus

Chinese-specific functions | 
---|---
list_cws_models | List available CWS models on the server
run_cws | Apply Chinese Word Segmentation on a string
get_default_cws_model | Get the name of the default CWS model for a given corpus
cws_on_corpus | Apply Chinese Word Segmentation on a corpus
cws_on_df | Apply Chinese Word Segmentation on the specified column of a dataframe
sinograms_to_py | Sinogram (漢字) to pinyin conversion
wade_to_py | Wade-Giles to pinyin conversion

Graph functions | 
---|---
get_padagraph_url | Send a tidygraph to Padagraph and return the URL
in_padagraph | Send a tidygraph to Padagraph and display it
load_in_padagraph | Load and send a previously saved graph object into Padagraph
save_graph | Save a tidygraph into a file

Server functions | 
---|---
query_server_get | GET a resource from the server
query_server_post | POST a file to the server
set_config_file | Set the config file to specify the server URL to use (plus other needed information)
ENP-China R package: Update 0.2.5, 08-06-2021, by Jeremy Auguste. HistText Updates (1.0.0), 07-10-2021, by Jeremy Auguste. HistText 1.6.2: Updates & News, 27-09-2022, by Jeremy Auguste.