3 Query functions
3.1 Basic search
The function search_documents finds documents based on one or several terms. It takes two main arguments: the queried term(s) and the targeted corpus. For a single word or character, enclose the term in double quotation marks. For compound words, in English or Chinese, additionally wrap the double-quoted expression in single quotation marks, as in the examples below:
search_documents("Rotary", "proquest")
search_documents('"Rotary Club"', "proquest")
search_documents('"扶輪社"', "shunpao-revised")
It is also possible to run a query with multiple terms using Boolean operators, as in the example below (here | = OR). For a detailed list of possible operators in R, see this document:
search_documents('"Rotary" | "Rotary Club"', "proquest")
search_documents('"扶輪社" | "上海扶輪社"', "shunpao-revised")
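When the list of terms is long, the query string can be assembled programmatically. The sketch below is only an illustration: build_or_query is a hypothetical helper, not a HistText function, and it assumes that every term should be double-quoted and joined with the OR operator.

```r
# Hypothetical helper (not part of histtext): wrap each term in double
# quotes and join the terms with the OR operator "|".
build_or_query <- function(terms) {
  stopifnot(is.character(terms), length(terms) > 0)
  paste0('"', terms, '"', collapse = " | ")
}

query <- build_or_query(c("Rotary", "Rotary Club"))
print(query)

# The resulting string could then be passed to the search function, e.g.
# search_documents(query, "proquest")
```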
The function returns a table with four columns: the unique identifier of each document (DocId), the date of publication (Date, in YYYYMMDD format), the title of the article (Title), and the source (Source), e.g. the name of the periodical in the ProQuest collection. In the table below, each row represents a unique document:
docs_eng <- search_documents('"Shanghai Rotary Club"', "proquest")
docs <- search_documents('"扶輪社"', "shunpao-revised")
docs
## # A tibble: 507 × 4
## DocId Date Title Source
## <chr> <chr> <chr> <chr>
## 1 SPSP193010291607 19301029 扶輪社 shunpao
## 2 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao
## 3 SPSP193011141210 19301114 扶輪社奬品待領 shunpao
## 4 SPSP193001161619 19300116 扶輪社今午聚餐 shunpao
## 5 SPSP194812170412 19481217 滬西扶輪社成立 shunpao
## 6 SPSP193301230735 19330123 濟扶輪社註册紀念 shunpao
## 7 SPSP192902151513 19290215 扶輪社昨午聚餐會 shunpao
## 8 SPSP194701290425 19470129 扶輪社創立人 海理斯逝世 shunpao
## 9 SPSP193105211127 19310521 扶輪社今午聚餐 shunpao
## 10 SPSP193002201422 19300220 扶輪社今午聚餐 shunpao
## # ℹ 497 more rows
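Since the Date column is returned as text, it usually has to be parsed before sorting or grouping by period. The following is a minimal sketch on a mock table: docs_mock is a stand-in built by hand from the first rows shown above, and a base data.frame is assumed to behave like the returned tibble for this purpose.

```r
# Mock of the first rows of a search_documents() result (values copied
# from the table above); a base data.frame stands in for the tibble.
docs_mock <- data.frame(
  DocId = c("SPSP193010291607", "SPSP194704140406", "SPSP193011141210"),
  Date  = c("19301029", "19470414", "19301114"),
  stringsAsFactors = FALSE
)

# Date is stored as text in YYYYMMDD format; parse it explicitly.
docs_mock$Date <- as.Date(docs_mock$Date, format = "%Y%m%d")

# Once parsed, dates can be reduced to their year or used for sorting.
docs_mock$Year <- as.integer(format(docs_mock$Date, "%Y"))
docs_mock[order(docs_mock$Date), ]
```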
The count_search_documents function returns the number of documents that match a query without retrieving the documents themselves. This makes it possible to gauge the size of a result set before running a resource-intensive query.
histtext::count_search_documents('"上海"', "shunpao-revised")
The function count_documents serves to visualize the distribution of documents matching a query over time:
histtext::count_documents('"上海扶輪社"', "shunpao-revised") %>%
  mutate(Date = lubridate::as_date(Date, format = "%Y%m%d")) %>%
  mutate(Year = lubridate::year(Date)) %>%
  group_by(Year) %>% summarise(N = sum(N)) %>%
  filter(Year >= 1920) %>%
  ggplot(aes(Year, N)) + geom_col(alpha = 0.8) +
  labs(title = "The Rotary Club of Shanghai in the Shenbao",
       subtitle = "Number of articles mentioning 上海扶輪社",
       x = "Year",
       y = "Number of articles")
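The aggregation step in this pipeline (from dated counts to yearly totals) can be checked without querying the server. The sketch below uses base R on a hand-made table: the Date and N columns are assumed from the pipeline above, and the counts are invented for illustration only.

```r
# Hypothetical output of count_documents(): one row per date with the
# number of matching documents (the counts below are invented).
counts <- data.frame(
  Date = c("19300116", "19300220", "19310521", "19330123", "19330316"),
  N    = c(1L, 2L, 1L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Derive the year from the YYYYMMDD string and sum the counts per year.
counts$Year <- as.integer(substr(counts$Date, 1, 4))
yearly <- aggregate(N ~ Year, data = counts, FUN = sum)
yearly
```

The resulting yearly table has one row per year and is what the bar chart ultimately displays.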
Counting can also be applied to specific fields that vary according to the queried corpora. The list of possible fields can be obtained with the function list_filter_fields():
list_filter_fields("proquest")
## [1] "publisher" "category"
list_filter_fields("dongfangzz")
## [1] "category" "volume" "issue"
For example, to count the documents in Dongfang zazhi that match the queried term “調查”, grouped by the category field:
histtext::count_documents('"調查"', "dongfangzz", by_field = "category") %>%
  arrange(desc(N))
## # A tibble: 295 × 2
## GroupField N
## <chr> <int>
## 1 — 1865
## 2 圖書廣告 591
## 3 內外時報 277
## 4 時事日誌 220
## 5 現代史料 145
## 6 補白 115
## 7 內務 92
## 8 國際 85
## 9 法令 82
## 10 中國大事記 81
## # ℹ 285 more rows
In the example below, we count the number of documents by publisher in the “ProQuest” corpus:
histtext::count_documents('"Rotary Club"', "proquest", by_field = "publisher") %>%
  arrange(desc(N))
## # A tibble: 13 × 2
## GroupField N
## <chr> <int>
## 1 South China Morning Post Ltd. 4612
## 2 The China Press 1426
## 3 The North China Herald 689
## 4 The China Weekly Review 321
## 5 The Shanghai Times 15
## 6 The Shanghai Gazette 14
## 7 The Chinese Recorder 11
## 8 The Canton Times 1
## 9 Peking Daily News 0
## 10 Peking Gazette 0
## 11 The China Critic 0
## 12 The Chinese Repository 0
## 13 The Peking Leader 0
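Raw counts by field are often easier to compare as shares. As a rough sketch, the block below rebuilds the first nine rows of the table above by hand (a data.frame standing in for the returned tibble), drops publishers without any match and adds a percentage column; the Share column name is an arbitrary choice.

```r
# First nine rows of the table above, typed in by hand.
publishers <- data.frame(
  GroupField = c("South China Morning Post Ltd.", "The China Press",
                 "The North China Herald", "The China Weekly Review",
                 "The Shanghai Times", "The Shanghai Gazette",
                 "The Chinese Recorder", "The Canton Times",
                 "Peking Daily News"),
  N = c(4612L, 1426L, 689L, 321L, 15L, 14L, 11L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Drop publishers without any match, then express counts as percentages
# of all matching documents.
publishers <- publishers[publishers$N > 0, ]
publishers$Share <- round(100 * publishers$N / sum(publishers$N), 1)
publishers
```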
3.2 Basic concordance
The function search_concordance serves to explore the queried terms in their context. It takes three arguments: the queried term(s) (within quotation marks, as in the search_documents function), the targeted corpus (preceded by “corpus =”) and an optional argument that sets the size of the context (number of characters on each side of the queried term). In the example below, we search “扶輪社” in the Shenbao corpus, with a context size of 100:
concs <- search_concordance('"扶輪社"', corpus = "shunpao", context_size = 100)
concs
## # A tibble: 776 × 7
## DocId Date Title Source Before Matched After
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 SPSP193010291607 19301029 扶輪社 shunpao "本埠" 扶輪社 "前…
## 2 SPSP193010291607 19301029 扶輪社 shunpao "" 扶輪社 ""
## 3 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao "國際" 扶輪社 "第…
## 4 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao "一日,上午在… 扶輪社 "此…
## 5 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao "舉行園遊會,… 扶輪社 "節…
## 6 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao "" 扶輪社 "年…
## 7 SPSP193011141210 19301114 扶輪社奬品待領 shunpao "前" 扶輪社 "爲…
## 8 SPSP193011141210 19301114 扶輪社奬品待領 shunpao "學校籌欵事曾… 扶輪社 "限…
## 9 SPSP193011141210 19301114 扶輪社奬品待領 shunpao "多茲扶輪社限… 扶輪社 "事…
## 10 SPSP193011141210 19301114 扶輪社奬品待領 shunpao "" 扶輪社 "奬…
## # ℹ 766 more rows
The output is similar to the table generated by the search_documents function, with three additional columns containing the queried term (Matched) and the text before (Before) and after (After) it. In the table above, each row no longer represents a unique document, but one occurrence of the queried term in a document. The concordance table usually contains more rows than the table of documents, since the queried term may appear several times in the same document.
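To see how occurrences are distributed across documents, the concordance rows can be counted per DocId. The sketch below does this in base R on a mock table that reproduces only the DocId column of the ten rows shown above; concs_mock is a stand-in for the real result.

```r
# DocId column of the ten concordance rows shown above.
concs_mock <- data.frame(
  DocId = c(rep("SPSP193010291607", 2),
            rep("SPSP194704140406", 4),
            rep("SPSP193011141210", 4)),
  stringsAsFactors = FALSE
)

# Number of occurrences of the queried term in each document.
occurrences <- table(concs_mock$DocId)
sort(occurrences, decreasing = TRUE)

# Number of distinct documents behind the concordance rows.
n_documents <- length(occurrences)
n_documents
```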
3.3 Full Text Retrieval
The function get_documents serves to retrieve the full text of the documents. The function relies on the results of the search_documents function. It is composed of two arguments: the name of the variable (table of results) and the targeted corpus (within quotation marks):
docs_ft <- histtext::get_documents(docs, "shunpao")
docs_eng_ft <- histtext::get_documents(docs_eng, "proquest")
docs_ft
## # A tibble: 507 × 5
## DocId Date Title Source Text
## <chr> <chr> <chr> <chr> <chr>
## 1 SPSP193010291607 19301029 扶輪社 shunpao 本埠扶輪社前…
## 2 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao 國際扶輪社第九十…
## 3 SPSP193011141210 19301114 扶輪社奬品待領 shunpao 前扶輪社爲俄…
## 4 SPSP193001161619 19300116 扶輪社今午聚餐 shunpao 上海扶輪社 …
## 5 SPSP194812170412 19481217 滬西扶輪社成立 shunpao 上海扶輪社與滬西…
## 6 SPSP193301230735 19330123 濟扶輪社註册紀念 shunpao 濟南扶輪社二十一…
## 7 SPSP192902151513 19290215 扶輪社昨午聚餐會 shunpao 昨日中午、上…
## 8 SPSP194701290425 19470129 扶輪社創立人 海理斯逝世 shunpao 一九〇〇年創立全…
## 9 SPSP193105211127 19310521 扶輪社今午聚餐 shunpao 上海扶輪社今…
## 10 SPSP193002201422 19300220 扶輪社今午聚餐 shunpao 上海扶輪社、…
## # ℹ 497 more rows
The function generates the same table as search_documents, with an additional column that contains the full text of the document (Text).
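Once the full text is available, simple measures can be computed directly on the Text column. The following sketch counts the occurrences of a fixed string in each document; the table, its identifiers and its texts are invented placeholders, and count_term is a hypothetical helper rather than a HistText function.

```r
# Placeholder table (identifiers and texts are invented for illustration).
docs_ft_mock <- data.frame(
  DocId = c("doc1", "doc2"),
  Text  = c("The Rotary Club met today. The Rotary Club elected officers.",
            "A tiffin was held at the club."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count non-overlapping occurrences of a fixed
# string in each text (gregexpr returns -1 when there is no match).
count_term <- function(text, term) {
  matches <- gregexpr(term, text, fixed = TRUE)
  vapply(matches, function(m) sum(m > 0), integer(1))
}

docs_ft_mock$Hits <- count_term(docs_ft_mock$Text, "Rotary Club")
docs_ft_mock[, c("DocId", "Hits")]
```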
3.4 Close reading
To read individual documents more closely, use the view_document function. The document will appear in your Viewer panel.
view_document("SPSP193602290401", "shunpao-revised")
If you want to display a term of interest in the document, add the term after the “query” argument as indicated below:
view_document("SPSP193602290401", "shunpao-revised", query = '"扶輪社"')
3.4.1 ProQuest Documents
The function proquest_view() is designed to display the original document on the ProQuest platform, based on the ID of the document (for ProQuest subscribers only):
proquest_view(1371585080)
3.4.2 Other documents
The function view_document() displays the full text of a single document and can highlight selected keywords in the text. It takes three arguments:
- docid = document identifier
- corpus = name of the corpus
- query = list of queried terms (query string or vector of query strings).
view_document(1371585080, "proquest", query = c("Nanking", "club"))
view_document("SPSP193011141210", "shunpao", query = c("扶輪社", "飯店"))
3.5 Advanced functions
The package provides more advanced functions to perform multi-field queries.
The function search_documents serves to query terms in fields other than the content of articles. For instance, one can perform queries based on the title of articles and/or date of publication. It is possible to search several fields at the same time. The function is composed of five arguments, as described below:
- q: the queried term(s)
- corpus: the corpus (within quotation marks)
- search_fields: the selected field(s) (use the function list_search_fields to obtain the list of searchable fields available in the targeted corpus, as described below)
- start: when the query returns more than 100,000 results, this argument makes it possible to run a second round starting from a specified row number (e.g. start = 100001)
- dates: this argument serves to filter the search results by specific dates or date ranges. Use accepts_date_queries to test whether the targeted corpus supports date filters (see below).
search_documents <- function(q,
                             corpus="shunpao",
                             search_fields=c(),
                             start=0,
                             dates=c())
To obtain the list of searchable fields available in the targeted corpus, use the function list_search_fields. For instance, the example below shows that three fields can be searched in the Dongfang zazhi 東方雜誌 - the title, the author, or the full text of the article:
list_search_fields("dongfangzz")
## [1] "text" "title" "authors"
If no field is selected explicitly, queries will be performed in all fields.
The function list_filter_fields() lists the remaining fields, which serve as filters rather than search fields (they are technically searchable, but searching them is rarely relevant):
list_filter_fields("dongfangzz")
## [1] "category" "volume" "issue"
In the above example, you can filter the results of your search in the Dongfang zazhi by the category of article, the volume and issue.
The function list_possible_filters() can be applied to any filterable field to display its contents:
list_possible_filters("dongfangzz", "category") %>% arrange(desc(N))
## # A tibble: 295 × 2
## Value N
## <chr> <int>
## 1 圖書廣告 26567
## 2 普通廣告 22830
## 3 — 8040
## 4 東方画報 4292
## 5 東方畫報 4212
## 6 補白 1439
## 7 文苑 1298
## 8 圖書廣吿 1290
## 9 現代史料 1137
## 10 內外時報 1108
## # ℹ 285 more rows
The first column “Value” contains the possible filters that you can use. The second column “N” indicates the number of documents in the corresponding filter value.
Use accepts_date_queries to test whether the targeted corpus supports date filters. For instance, the example below shows that it is possible to search dates in the Shenbao or Dongfang zazhi, but not in Wikipedia (Chinese or English):
histtext::accepts_date_queries("shunpao")
## [1] TRUE
histtext::accepts_date_queries("dongfangzz")
## [1] TRUE
histtext::accepts_date_queries("wikibio-zh")
## [1] FALSE
histtext::accepts_date_queries("wikibio-en")
## [1] FALSE
3.5.1 Multifield queries
For instance, search the term 扶輪社 in the Shenbao in the title only:
search_documents('"扶輪社"', corpus="shunpao", search_fields= "title")
## # A tibble: 137 × 4
## DocId Date Title Source
## <chr> <chr> <chr> <chr>
## 1 SPSP193010291607 19301029 扶輪社 shunpao
## 2 SPSP193603291401 19360329 國際扶輪社 shunpao
## 3 SPSP194108120708 19410812 扶輪社常會 shunpao
## 4 SPSP194109100721 19410910 扶輪社常會 shunpao
## 5 SPSP192806221513 19280622 扶輪社開會紀 shunpao
## 6 SPSP194010170804 19401017 扶輪社婦女會 shunpao
## 7 SPSP192906131512 19290613 扶輪社今午聚餐 shunpao
## 8 SPSP193011141210 19301114 扶輪社奬品待領 shunpao
## 9 SPSP193002131629 19300213 扶輪社明晚年宴 shunpao
## 10 SPSP192910171508 19291017 扶輪社今午聚餐 shunpao
## # ℹ 127 more rows
Search the same term 扶輪社 in the Shenbao in both title and full text :
search_documents('"扶輪社"', corpus="shunpao", search_fields=c("title", "text"))
## # A tibble: 507 × 4
## DocId Date Title Source
## <chr> <chr> <chr> <chr>
## 1 SPSP193010291607 19301029 扶輪社 shunpao
## 2 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao
## 3 SPSP193011141210 19301114 扶輪社奬品待領 shunpao
## 4 SPSP193001161619 19300116 扶輪社今午聚餐 shunpao
## 5 SPSP194812170412 19481217 滬西扶輪社成立 shunpao
## 6 SPSP193301230735 19330123 濟扶輪社註册紀念 shunpao
## 7 SPSP192902151513 19290215 扶輪社昨午聚餐會 shunpao
## 8 SPSP194701290425 19470129 扶輪社創立人 海理斯逝世 shunpao
## 9 SPSP193105211127 19310521 扶輪社今午聚餐 shunpao
## 10 SPSP193002201422 19300220 扶輪社今午聚餐 shunpao
## # ℹ 497 more rows
3.5.2 Date filtering
Search the term 扶輪社 in the Shenbao in all possible fields for the year 1933 only:
search_documents('"扶輪社"', corpus="shunpao", dates="1933")
## # A tibble: 12 × 4
## DocId Date Title Source
## <chr> <chr> <chr> <chr>
## 1 SPSP193301230735 19330123 濟扶輪社註册紀念 shunpao
## 2 SPSP193303161321 19330316 扶輪社今午聚餐會 shunpao
## 3 SPSP193310141010 19331014 吳市長發展大上海願望 前日在上海扶輪社演說… shunpao
## 4 SPSP193304301314 19330430 長江流域麻瘋問題之嚴重 shunpao
## 5 SPSP193308271813 19330827 今日單打有劇戰 shunpao
## 6 SPSP193303081109 19330308 中外人士主辦之上海電影皇后競賽 shunpao
## 7 SPSP193309021701 19330902 華絲直接貿易 繆鐘秀在紐約接洽 以免再… shunpao
## 8 SPSP193310042801 19331004 汽車新聞 shunpao
## 9 SPSP193303281601 19330328 美國承認蘇俄具有兩項條件 收回戰債與禁止… shunpao
## 10 SPSP193303091201 19330309 婦女團體昨紀念三八婦女節 shunpao
## 11 SPSP193303290901 19330329 本市新聞朱子橋報告作戰經過 地方協會 shunpao
## 12 SPSP193303161301 19330316 黑省抗日將領徐子鶴等明日北返 各界設宴歡… shunpao
Search the term 扶輪社 in the Shenbao in all possible fields for March 1933 only:
search_documents('"扶輪社"', corpus="shunpao", dates="1933-03")
## # A tibble: 6 × 4
## DocId Date Title Source
## <chr> <chr> <chr> <chr>
## 1 SPSP193303161321 19330316 扶輪社今午聚餐會 shunp…
## 2 SPSP193303081109 19330308 中外人士主辦之上海電影皇后競賽 shunp…
## 3 SPSP193303281601 19330328 美國承認蘇俄具有兩項條件 收回戰債與禁止宣… shunp…
## 4 SPSP193303091201 19330309 婦女團體昨紀念三八婦女節 shunp…
## 5 SPSP193303290901 19330329 本市新聞朱子橋報告作戰經過 地方協會 shunp…
## 6 SPSP193303161301 19330316 黑省抗日將領徐子鶴等明日北返 各界設宴歡送… shunp…
Search the term 扶輪社 in the Shenbao in all possible fields for the two years 1933 and 1940:
search_documents('"扶輪社"', corpus="shunpao", dates=c("1933", "1940"))
## # A tibble: 57 × 4
## DocId Date Title Source
## <chr> <chr> <chr> <chr>
## 1 SPSP193301230735 19330123 濟扶輪社註册紀念 shunpao
## 2 SPSP194011301015 19401130 扶輪社玩具醫院 徴爾兒童玩具 shunpao
## 3 SPSP194006091007 19400609 救濟全市乞丐難民由救世軍負責 工部局扶輪社… shunpao
## 4 SPSP194010170804 19401017 扶輪社婦女會 shunpao
## 5 SPSP194007020710 19400702 扶輪社 明午常會 shunpao
## 6 SPSP194007091105 19400709 扶輪社本週常會 shunpao
## 7 SPSP194011270807 19401127 扶輪社本週常會 shunpao
## 8 SPSP194011050710 19401105 扶輪社本週常會 shunpao
## 9 SPSP193303161321 19330316 扶輪社今午聚餐會 shunpao
## 10 SPSP194005070707 19400507 上海扶輪社 慶祝母親節 shunpao
## # ℹ 47 more rows
Search the term 扶輪社 in the Shenbao in all possible fields from 1930 to 1940:
search_documents('"扶輪社"', corpus="shunpao", dates="[1930 TO 1940]")
## # A tibble: 231 × 4
## DocId Date Title Source
## <chr> <chr> <chr> <chr>
## 1 SPSP193010291607 19301029 扶輪社 shunpao
## 2 SPSP193011141210 19301114 扶輪社奬品待領 shunpao
## 3 SPSP193001161619 19300116 扶輪社今午聚餐 shunpao
## 4 SPSP193301230735 19330123 濟扶輪社註册紀念 shunpao
## 5 SPSP193105211127 19310521 扶輪社今午聚餐 shunpao
## 6 SPSP193002201422 19300220 扶輪社今午聚餐 shunpao
## 7 SPSP193002151512 19300215 扶輪社昨晚年宴盛况 shunpao
## 8 SPSP193002131629 19300213 扶輪社明晚年宴 shunpao
## 9 SPSP193012301428 19301230 扶輪社今午聚餐 shunpao
## 10 SPSP193005221419 19300522 扶輪社今日聚餐 shunpao
## # ℹ 221 more rows
Search the term 扶輪社 in the Shenbao in all possible fields from 1930 to 1940 and 1945:
search_documents('"扶輪社"', corpus="shunpao", dates=c("[1930 TO 1940]", "1945"))
## # A tibble: 231 × 4
## DocId Date Title Source
## <chr> <chr> <chr> <chr>
## 1 SPSP193010291607 19301029 扶輪社 shunpao
## 2 SPSP193011141210 19301114 扶輪社奬品待領 shunpao
## 3 SPSP193001161619 19300116 扶輪社今午聚餐 shunpao
## 4 SPSP193301230735 19330123 濟扶輪社註册紀念 shunpao
## 5 SPSP193105211127 19310521 扶輪社今午聚餐 shunpao
## 6 SPSP193002201422 19300220 扶輪社今午聚餐 shunpao
## 7 SPSP193002151512 19300215 扶輪社昨晚年宴盛况 shunpao
## 8 SPSP193002131629 19300213 扶輪社明晚年宴 shunpao
## 9 SPSP193012301428 19301230 扶輪社今午聚餐 shunpao
## 10 SPSP193005221419 19300522 扶輪社今日聚餐 shunpao
## # ℹ 221 more rows
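The range strings used above can also be generated from variables, which is convenient when the period is computed elsewhere in a script. In the sketch below, date_range is a hypothetical helper (not part of HistText) that only reproduces the "[FROM TO TO]" notation shown in the examples.

```r
# Hypothetical helper: build a range string in the notation used above.
date_range <- function(from, to) {
  stopifnot(from <= to)
  sprintf("[%s TO %s]", from, to)
}

# A range combined with a single year, as in the example above.
dates <- c(date_range(1930, 1940), "1945")
dates

# The vector could then be passed to the dates argument, e.g.
# search_documents('"扶輪社"', corpus = "shunpao", dates = dates)
```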
Combined query on different fields and dates:
combined_search <- search_documents('"扶輪社"', corpus="shunpao",
                                    search_fields="title",
                                    dates=c("[1930 TO 1940]", "1945"))
combined_search
## # A tibble: 68 × 4
## DocId Date Title Source
## <chr> <chr> <chr> <chr>
## 1 SPSP193010291607 19301029 扶輪社 shunpao
## 2 SPSP193603291401 19360329 國際扶輪社 shunpao
## 3 SPSP194010170804 19401017 扶輪社婦女會 shunpao
## 4 SPSP193011141210 19301114 扶輪社奬品待領 shunpao
## 5 SPSP193002131629 19300213 扶輪社明晚年宴 shunpao
## 6 SPSP193006191417 19300619 扶輪社今日聚餐 shunpao
## 7 SPSP193105211127 19310521 扶輪社今午聚餐 shunpao
## 8 SPSP193005221419 19300522 扶輪社今日聚餐 shunpao
## 9 SPSP193006051424 19300605 扶輪社今午聚餐 shunpao
## 10 SPSP193012301428 19301230 扶輪社今午聚餐 shunpao
## # ℹ 58 more rows
3.5.3 Concordance
The extended search options also apply to concordances through the function search_concordance. It takes six arguments: the same five as search_documents, plus an additional argument for the context size:
- q: the queried term(s)
- corpus: the corpus (within quotation marks)
- search_fields: the selected field(s) (use the function list_search_fields to obtain the list of searchable fields available in the targeted corpus, as described above)
- context_size: the size of the context (number of characters before/after)
- start: when the query returns more than 100,000 results, this argument makes it possible to run a second round starting from a specified row number (e.g. start = 100001)
- dates: this argument serves to filter the search results by specific dates or date ranges. Use accepts_date_queries to test whether the targeted corpus supports date filters (see above).
search_concordance <- function(q,
                               corpus="shunpao",
                               search_fields=c(),
                               context_size=30,
                               start=0,
                               dates=c())
The advanced functions that we described above - list_search_fields(), list_filter_fields(), and list_possible_filters() - can also be applied in combination with the search_concordance() function.
3.5.4 Word embeddings
In HistText, pre-computed word embeddings (i.e., learned representations for text where words with similar meanings have similar representations), can be utilized to enhance queries by incorporating similar terms. Please note that this function is currently under construction. However, it is already accessible through the HistText Search interface.