3 Query functions

3.2 Basic concordance

The function search_concordance shows the queried terms in their context. It takes three arguments: the queried term(s) (within quotation marks, as in the search_documents function), the targeted corpus (preceded by “corpus =”), and an optional argument that sets the size of the context (the number of characters on each side of the queried term). In the example below, we search for “扶輪社” in the Shenbao corpus with a context of 100 characters:

concs <- search_concordance('"扶輪社"', corpus = "shunpao", context_size = 100)

concs
## # A tibble: 776 × 7
##    DocId            Date     Title          Source  Before         Matched After
##    <chr>            <chr>    <chr>          <chr>   <chr>          <chr>   <chr>
##  1 SPSP193010291607 19301029 扶輪社         shunpao "本埠"         扶輪社  "前… 
##  2 SPSP193010291607 19301029 扶輪社         shunpao ""             扶輪社  ""   
##  3 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao "國際"         扶輪社  "第… 
##  4 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao "一日,上午在… 扶輪社  "此… 
##  5 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao "舉行園遊會,… 扶輪社  "節… 
##  6 SPSP194704140406 19470414 扶輪社年會閉幕 shunpao ""             扶輪社  "年… 
##  7 SPSP193011141210 19301114 扶輪社奬品待領 shunpao "前"           扶輪社  "爲… 
##  8 SPSP193011141210 19301114 扶輪社奬品待領 shunpao "學校籌欵事曾… 扶輪社  "限… 
##  9 SPSP193011141210 19301114 扶輪社奬品待領 shunpao "多茲扶輪社限… 扶輪社  "事… 
## 10 SPSP193011141210 19301114 扶輪社奬品待領 shunpao ""             扶輪社  "奬… 
## # ℹ 766 more rows


The output is similar to the table generated by the search_documents function, with three additional columns containing the queried term (Matched) and the text before (Before) and after (After) it. In the table above, each row no longer represents a unique document but a single occurrence of the queried term. The concordance table therefore usually contains more rows than the table of documents, since the queried term may appear several times in the same document.
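The relation between occurrences and documents can be checked by counting rows per DocId. The sketch below is a minimal illustration on a toy table (concs_toy is a hypothetical stand-in that mimics the first ten rows shown above); it uses base R only and does not call the HistText server.

```r
# Toy stand-in for a concordance table: one row per occurrence
concs_toy <- data.frame(
  DocId = c(rep("SPSP193010291607", 2),
            rep("SPSP194704140406", 4),
            rep("SPSP193011141210", 4)),
  Matched = "扶輪社",
  stringsAsFactors = FALSE
)

# Number of occurrences per document, most frequent first
hits_per_doc <- as.data.frame(table(DocId = concs_toy$DocId),
                              responseName = "Hits",
                              stringsAsFactors = FALSE)
hits_per_doc <- hits_per_doc[order(-hits_per_doc$Hits), ]

# Number of distinct documents behind the occurrences
n_docs <- length(unique(concs_toy$DocId))

print(hits_per_doc)
print(n_docs)
```

The same two steps should apply to a real concordance table such as concs, provided it has a DocId column as in the output above.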

3.3 Full Text Retrieval

The function get_documents retrieves the full text of documents. It relies on the results of the search_documents function and takes two arguments: the table of results (the name of the variable that holds it) and the targeted corpus (within quotation marks):

docs_ft <- histtext::get_documents(docs, "shunpao")
docs_eng_ft <- histtext::get_documents(docs_eng, "proquest")
docs_ft
## # A tibble: 507 × 5
##    DocId            Date     Title                    Source  Text             
##    <chr>            <chr>    <chr>                    <chr>   <chr>            
##  1 SPSP193010291607 19301029 扶輪社                   shunpao   本埠扶輪社前…
##  2 SPSP194704140406 19470414 扶輪社年會閉幕           shunpao 國際扶輪社第九十…
##  3 SPSP193011141210 19301114 扶輪社奬品待領           shunpao   前扶輪社爲俄…
##  4 SPSP193001161619 19300116 扶輪社今午聚餐           shunpao   上海扶輪社 …
##  5 SPSP194812170412 19481217 滬西扶輪社成立           shunpao 上海扶輪社與滬西…
##  6 SPSP193301230735 19330123 濟扶輪社註册紀念         shunpao 濟南扶輪社二十一…
##  7 SPSP192902151513 19290215 扶輪社昨午聚餐會         shunpao   昨日中午、上…
##  8 SPSP194701290425 19470129 扶輪社創立人 海理斯逝世 shunpao 一九〇〇年創立全…
##  9 SPSP193105211127 19310521 扶輪社今午聚餐           shunpao   上海扶輪社今…
## 10 SPSP193002201422 19300220 扶輪社今午聚餐           shunpao   上海扶輪社、…
## # ℹ 497 more rows


The function generates the same table as search_documents, with an additional column that contains the full text of the document (Text).
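Once the Text column is available, simple measures can be computed locally. The sketch below counts how often a term occurs in each text. It is only an illustration: docs_toy and its texts are invented stand-ins for a get_documents() result, and the counting relies on base R rather than on any HistText function.

```r
# Toy stand-in for a full-text table (texts invented for illustration)
docs_toy <- data.frame(
  DocId = c("DOC1", "DOC2", "DOC3"),
  Text = c("本埠扶輪社前日開會", "上海扶輪社與滬西扶輪社", "今日單打有劇戰"),
  stringsAsFactors = FALSE
)

term <- "扶輪社"

# gregexpr() returns the match positions, or -1 when there is no match
matches <- gregexpr(term, docs_toy$Text, fixed = TRUE)

# Count the positive positions to get the number of occurrences per text
docs_toy$Hits <- vapply(matches, function(m) sum(m > 0), integer(1))

print(docs_toy[, c("DocId", "Hits")])
```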


3.4 Close reading

To read individual documents more closely, use the view_document function. The document will appear in your Viewer panel.

view_document("SPSP193602290401", "shunpao-revised")


To highlight a term of interest in the document, pass it to the “query” argument, as shown below:

view_document("SPSP193602290401", "shunpao-revised", query = '"扶輪社"')

3.4.1 ProQuest Documents

The function proquest_view() is designed to display the original document on the ProQuest platform, based on the ID of the document (for ProQuest subscribers only):

proquest_view(1371585080)

3.4.2 Other documents

The function view_document() displays the full text of a single document and can highlight selected keywords in the text. It takes three arguments:

  • docid = document identifier
  • corpus = name of the corpus
  • query = list of queried terms (query string or vector of query strings).

view_document(1371585080, "proquest", query = c("Nanking", "club"))
view_document("SPSP193011141210", "shunpao", query = c("扶輪社", "飯店"))

3.5 Advanced functions

The package provides more advanced functions to perform multi-field queries.

The function search_documents can also query terms in fields other than the content of articles. For instance, one can perform queries based on the title of articles and/or the date of publication. It is possible to search several fields at the same time. The function takes five arguments, as described below:

  • q: the queried term(s)
  • corpus: the corpus (within quotation marks)
  • search_fields: the selected field(s) (use the function list_search_fields to obtain the list of searchable fields available in the targeted corpus, as described below)
  • start: when the query returns more than 100,000 results, this argument makes it possible to run a second round of retrieval starting from a specified row number (e.g. start = 100001)
  • dates: this argument filters the search results by specific dates or date ranges. Use accepts_date_queries to test whether the targeted corpus supports date filters (see below).

search_documents <- function(q,
                                  corpus="shunpao",
                                  search_fields=c(),
                                  start=0,
                                  dates=c())
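
The start argument suggests retrieving large result sets in successive rounds. The sketch below shows one possible loop, under stated assumptions: fetch_all and mock_search are hypothetical helpers that are not part of HistText, the mock replaces the server so that the code runs offline, and the offset is advanced by the number of rows already received. Whether the second round should begin at 100000 or 100001 depends on how the server counts rows, so check this against a real query before relying on it.

```r
# Hypothetical mock of a paged search: returns at most `page_size` rows
# located after row number `start` (stands in for the real server)
mock_search <- function(start, page_size = 4, total = 10) {
  ids <- seq_len(total)
  keep <- ids[ids > start & ids <= start + page_size]
  data.frame(DocId = sprintf("DOC%03d", keep), stringsAsFactors = FALSE)
}

# Hypothetical helper: call `fetch_page(start)` until a short page comes back
fetch_all <- function(fetch_page, page_size) {
  pages <- list()
  start <- 0
  repeat {
    page <- fetch_page(start)
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < page_size) break   # last (possibly empty) page reached
    start <- start + nrow(page)         # assumed offset convention
  }
  do.call(rbind, pages)
}

all_rows <- fetch_all(function(s) mock_search(s), page_size = 4)
print(nrow(all_rows))
```

With the real package, fetch_page would wrap a call such as search_documents(q, corpus, start = s), and page_size would be the 100,000-row limit mentioned above.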


To obtain the list of searchable fields available in the targeted corpus, use the function list_search_fields. For instance, the example below shows that three fields can be searched in the Dongfang zazhi 東方雜誌 - the title, the author, or the full text of the article:

list_search_fields("dongfangzz")
## [1] "text"    "title"   "authors"


If no field is selected explicitly, queries will be performed in all fields.

The function list_filter_fields() lists the other available fields, which are meant for filtering rather than searching (they are technically searchable, but searching them is rarely relevant):

list_filter_fields("dongfangzz")
## [1] "category" "volume"   "issue"


In the above example, you can filter the results of your search in the Dongfang zazhi by the category of article, the volume and issue.

The function list_possible_filters() can be applied to any filterable field to display its contents:

library(dplyr)  # provides %>% and arrange()
list_possible_filters("dongfangzz", "category") %>% arrange(desc(N))
## # A tibble: 295 × 2
##    Value        N
##    <chr>    <int>
##  1 圖書廣告 26567
##  2 普通廣告 22830
##  3 —         8040
##  4 東方画報  4292
##  5 東方畫報  4212
##  6 補白      1439
##  7 文苑      1298
##  8 圖書廣吿  1290
##  9 現代史料  1137
## 10 內外時報  1108
## # ℹ 285 more rows


The first column, “Value”, contains the possible filter values that you can use. The second column, “N”, gives the number of documents for each value.
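
The output above contains variant spellings of what appears to be the same category (東方画報 / 東方畫報, 圖書廣告 / 圖書廣吿). If you want combined counts, the values can be merged locally. The sketch below is only an illustration on a toy copy of five rows from the table above; the variants lookup, including the choice of which spelling to keep, is an assumption and not something HistText provides.

```r
# Toy copy of five rows from the category table shown above
cats <- data.frame(
  Value = c("圖書廣告", "普通廣告", "東方画報", "東方畫報", "圖書廣吿"),
  N = c(26567L, 22830L, 4292L, 4212L, 1290L),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: variant spelling -> spelling to keep (an assumption)
variants <- c("東方画報" = "東方畫報", "圖書廣吿" = "圖書廣告")

# Replace known variants, leave every other value untouched
canonical <- ifelse(cats$Value %in% names(variants),
                    variants[cats$Value],
                    cats$Value)

# Sum the document counts per merged category, largest first
merged <- aggregate(N ~ Category,
                    data = data.frame(Category = canonical, N = cats$N,
                                      stringsAsFactors = FALSE),
                    FUN = sum)
merged <- merged[order(-merged$N), ]

print(merged)
```

Note that a filter applied on the server still matches one spelling at a time, so each variant has to be queried separately.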

Use accepts_date_queries to test whether the targeted corpus supports date filters. For instance, the example below shows that it is possible to search dates in the Shenbao or Dongfang zazhi, but not in Wikipedia (Chinese or English):

histtext::accepts_date_queries("shunpao")
## [1] TRUE
histtext::accepts_date_queries("dongfangzz")
## [1] TRUE
histtext::accepts_date_queries("wikibio-zh")
## [1] FALSE
histtext::accepts_date_queries("wikibio-en")
## [1] FALSE


3.5.1 Multifield queries

For instance, search for the term 扶輪社 in the titles of Shenbao articles only:

search_documents('"扶輪社"', corpus="shunpao", search_fields= "title")
## # A tibble: 137 × 4
##    DocId            Date     Title          Source 
##    <chr>            <chr>    <chr>          <chr>  
##  1 SPSP193010291607 19301029 扶輪社         shunpao
##  2 SPSP193603291401 19360329 國際扶輪社     shunpao
##  3 SPSP194108120708 19410812 扶輪社常會     shunpao
##  4 SPSP194109100721 19410910 扶輪社常會     shunpao
##  5 SPSP192806221513 19280622 扶輪社開會紀   shunpao
##  6 SPSP194010170804 19401017 扶輪社婦女會   shunpao
##  7 SPSP192906131512 19290613 扶輪社今午聚餐 shunpao
##  8 SPSP193011141210 19301114 扶輪社奬品待領 shunpao
##  9 SPSP193002131629 19300213 扶輪社明晚年宴 shunpao
## 10 SPSP192910171508 19291017 扶輪社今午聚餐 shunpao
## # ℹ 127 more rows


Search for the same term 扶輪社 in the Shenbao in both the title and the full text:

search_documents('"扶輪社"', corpus="shunpao", search_fields=c("title", "text"))
## # A tibble: 507 × 4
##    DocId            Date     Title                    Source 
##    <chr>            <chr>    <chr>                    <chr>  
##  1 SPSP193010291607 19301029 扶輪社                   shunpao
##  2 SPSP194704140406 19470414 扶輪社年會閉幕           shunpao
##  3 SPSP193011141210 19301114 扶輪社奬品待領           shunpao
##  4 SPSP193001161619 19300116 扶輪社今午聚餐           shunpao
##  5 SPSP194812170412 19481217 滬西扶輪社成立           shunpao
##  6 SPSP193301230735 19330123 濟扶輪社註册紀念         shunpao
##  7 SPSP192902151513 19290215 扶輪社昨午聚餐會         shunpao
##  8 SPSP194701290425 19470129 扶輪社創立人 海理斯逝世 shunpao
##  9 SPSP193105211127 19310521 扶輪社今午聚餐           shunpao
## 10 SPSP193002201422 19300220 扶輪社今午聚餐           shunpao
## # ℹ 497 more rows

3.5.2 Date filtering

Search for the term 扶輪社 in the Shenbao in all possible fields for the year 1933 only:

search_documents('"扶輪社"', corpus="shunpao", dates="1933")
## # A tibble: 12 × 4
##    DocId            Date     Title                                       Source 
##    <chr>            <chr>    <chr>                                       <chr>  
##  1 SPSP193301230735 19330123 濟扶輪社註册紀念                            shunpao
##  2 SPSP193303161321 19330316 扶輪社今午聚餐會                            shunpao
##  3 SPSP193310141010 19331014 吳市長發展大上海願望 前日在上海扶輪社演說… shunpao
##  4 SPSP193304301314 19330430 長江流域麻瘋問題之嚴重                      shunpao
##  5 SPSP193308271813 19330827 今日單打有劇戰                              shunpao
##  6 SPSP193303081109 19330308 中外人士主辦之上海電影皇后競賽              shunpao
##  7 SPSP193309021701 19330902 華絲直接貿易  繆鐘秀在紐約接洽  以免再… shunpao
##  8 SPSP193310042801 19331004 汽車新聞                                    shunpao
##  9 SPSP193303281601 19330328 美國承認蘇俄具有兩項條件  收回戰債與禁止… shunpao
## 10 SPSP193303091201 19330309 婦女團體昨紀念三八婦女節                    shunpao
## 11 SPSP193303290901 19330329 本市新聞朱子橋報告作戰經過  地方協會      shunpao
## 12 SPSP193303161301 19330316 黑省抗日將領徐子鶴等明日北返  各界設宴歡… shunpao


Search for the term 扶輪社 in the Shenbao in all possible fields for March 1933 only:

search_documents('"扶輪社"', corpus="shunpao", dates="1933-03")
## # A tibble: 6 × 4
##   DocId            Date     Title                                         Source
##   <chr>            <chr>    <chr>                                         <chr> 
## 1 SPSP193303161321 19330316 扶輪社今午聚餐會                              shunp…
## 2 SPSP193303081109 19330308 中外人士主辦之上海電影皇后競賽                shunp…
## 3 SPSP193303281601 19330328 美國承認蘇俄具有兩項條件  收回戰債與禁止宣… shunp…
## 4 SPSP193303091201 19330309 婦女團體昨紀念三八婦女節                      shunp…
## 5 SPSP193303290901 19330329 本市新聞朱子橋報告作戰經過  地方協會        shunp…
## 6 SPSP193303161301 19330316 黑省抗日將領徐子鶴等明日北返  各界設宴歡送… shunp…


Search for the term 扶輪社 in the Shenbao in all possible fields for the two years 1933 and 1940:

search_documents('"扶輪社"', corpus="shunpao", dates=c("1933", "1940"))
## # A tibble: 57 × 4
##    DocId            Date     Title                                       Source 
##    <chr>            <chr>    <chr>                                       <chr>  
##  1 SPSP193301230735 19330123 濟扶輪社註册紀念                            shunpao
##  2 SPSP194011301015 19401130 扶輪社玩具醫院 徴爾兒童玩具                shunpao
##  3 SPSP194006091007 19400609 救濟全市乞丐難民由救世軍負責 工部局扶輪社… shunpao
##  4 SPSP194010170804 19401017 扶輪社婦女會                                shunpao
##  5 SPSP194007020710 19400702 扶輪社 明午常會                            shunpao
##  6 SPSP194007091105 19400709 扶輪社本週常會                              shunpao
##  7 SPSP194011270807 19401127 扶輪社本週常會                              shunpao
##  8 SPSP194011050710 19401105 扶輪社本週常會                              shunpao
##  9 SPSP193303161321 19330316 扶輪社今午聚餐會                            shunpao
## 10 SPSP194005070707 19400507 上海扶輪社 慶祝母親節                      shunpao
## # ℹ 47 more rows


Search for the term 扶輪社 in the Shenbao in all possible fields from 1930 to 1940:

search_documents('"扶輪社"', corpus="shunpao", dates="[1930 TO 1940]")
## # A tibble: 231 × 4
##    DocId            Date     Title              Source 
##    <chr>            <chr>    <chr>              <chr>  
##  1 SPSP193010291607 19301029 扶輪社             shunpao
##  2 SPSP193011141210 19301114 扶輪社奬品待領     shunpao
##  3 SPSP193001161619 19300116 扶輪社今午聚餐     shunpao
##  4 SPSP193301230735 19330123 濟扶輪社註册紀念   shunpao
##  5 SPSP193105211127 19310521 扶輪社今午聚餐     shunpao
##  6 SPSP193002201422 19300220 扶輪社今午聚餐     shunpao
##  7 SPSP193002151512 19300215 扶輪社昨晚年宴盛况 shunpao
##  8 SPSP193002131629 19300213 扶輪社明晚年宴     shunpao
##  9 SPSP193012301428 19301230 扶輪社今午聚餐     shunpao
## 10 SPSP193005221419 19300522 扶輪社今日聚餐     shunpao
## # ℹ 221 more rows


Search for the term 扶輪社 in the Shenbao in all possible fields from 1930 to 1940, plus the year 1945:

search_documents('"扶輪社"', corpus="shunpao", dates=c("[1930 TO 1940]", "1945"))
## # A tibble: 231 × 4
##    DocId            Date     Title              Source 
##    <chr>            <chr>    <chr>              <chr>  
##  1 SPSP193010291607 19301029 扶輪社             shunpao
##  2 SPSP193011141210 19301114 扶輪社奬品待領     shunpao
##  3 SPSP193001161619 19300116 扶輪社今午聚餐     shunpao
##  4 SPSP193301230735 19330123 濟扶輪社註册紀念   shunpao
##  5 SPSP193105211127 19310521 扶輪社今午聚餐     shunpao
##  6 SPSP193002201422 19300220 扶輪社今午聚餐     shunpao
##  7 SPSP193002151512 19300215 扶輪社昨晚年宴盛况 shunpao
##  8 SPSP193002131629 19300213 扶輪社明晚年宴     shunpao
##  9 SPSP193012301428 19301230 扶輪社今午聚餐     shunpao
## 10 SPSP193005221419 19300522 扶輪社今日聚餐     shunpao
## # ℹ 221 more rows


Combined query on different fields and dates:

combined_search <- search_documents('"扶輪社"', corpus="shunpao",
                      search_fields="title",
                      dates=c("[1930 TO 1940]", "1945"))

combined_search
## # A tibble: 68 × 4
##    DocId            Date     Title          Source 
##    <chr>            <chr>    <chr>          <chr>  
##  1 SPSP193010291607 19301029 扶輪社         shunpao
##  2 SPSP193603291401 19360329 國際扶輪社     shunpao
##  3 SPSP194010170804 19401017 扶輪社婦女會   shunpao
##  4 SPSP193011141210 19301114 扶輪社奬品待領 shunpao
##  5 SPSP193002131629 19300213 扶輪社明晚年宴 shunpao
##  6 SPSP193006191417 19300619 扶輪社今日聚餐 shunpao
##  7 SPSP193105211127 19310521 扶輪社今午聚餐 shunpao
##  8 SPSP193005221419 19300522 扶輪社今日聚餐 shunpao
##  9 SPSP193006051424 19300605 扶輪社今午聚餐 shunpao
## 10 SPSP193012301428 19301230 扶輪社今午聚餐 shunpao
## # ℹ 58 more rows
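
The Date column is returned as text in the form YYYYMMDD, so it has to be parsed before results can be grouped or filtered by date locally. The sketch below shows one way to do this in base R on a toy copy of the first five rows above (res_toy is a hypothetical stand-in for combined_search).

```r
# Toy copy of the first five rows of the result table above
res_toy <- data.frame(
  DocId = c("SPSP193010291607", "SPSP193603291401", "SPSP194010170804",
            "SPSP193011141210", "SPSP193002131629"),
  Date = c("19301029", "19360329", "19401017", "19301114", "19300213"),
  stringsAsFactors = FALSE
)

# Convert the YYYYMMDD strings into Date objects
res_toy$DateParsed <- as.Date(res_toy$Date, format = "%Y%m%d")

# Extract the year and count documents per year
res_toy$Year <- as.integer(format(res_toy$DateParsed, "%Y"))
per_year <- table(res_toy$Year)

# Keep only the documents published in 1930
in_1930 <- res_toy[res_toy$Year == 1930, ]

print(per_year)
```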

3.5.3 Concordance

The extended search options also apply to concordances through the function search_concordance. It takes six arguments: the same as search_documents, plus an additional argument for the context size:

  • q: the queried term(s)
  • corpus: the corpus (within quotation marks)
  • search_fields: the selected field(s) (use the function list_search_fields to obtain the list of searchable fields available in the targeted corpus, as described above)
  • context_size: the size of the context (number of characters before/after)
  • start: when the query returns more than 100,000 results, this argument makes it possible to run a second round of retrieval starting from a specified row number (e.g. start = 100001)
  • dates: this argument filters the search results by specific dates or date ranges. Use accepts_date_queries to test whether the targeted corpus supports date filters (see above).

search_concordance <- function(q,
                                   corpus="shunpao",
                                   search_fields=c(),
                                   context_size=30,
                                   start=0,
                                   dates=c())
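
To make the role of context_size concrete, the sketch below builds a small concordance locally in base R. It is not the HistText implementation: kwic_sketch is a hypothetical helper written for this illustration, and it only mimics the Before / Matched / After layout with a window counted in characters.

```r
# Hypothetical helper: local keyword-in-context table for one text
kwic_sketch <- function(text, term, context_size = 30) {
  # Character positions of every literal occurrence of `term`
  hits <- as.integer(gregexpr(term, text, fixed = TRUE)[[1]])
  if (hits[1] == -1) {
    return(data.frame(Before = character(0), Matched = character(0),
                      After = character(0), stringsAsFactors = FALSE))
  }
  len <- nchar(term)
  data.frame(
    # Up to `context_size` characters before the match
    Before = substring(text, pmax(hits - context_size, 1), hits - 1),
    Matched = substring(text, hits, hits + len - 1),
    # Up to `context_size` characters after the match
    After = substring(text, hits + len, hits + len - 1 + context_size),
    stringsAsFactors = FALSE
  )
}

k <- kwic_sketch("the club met at noon and the club elected officers",
                 "club", context_size = 5)
print(k)
```

A larger context_size widens both windows. At the edges of a text the window is simply cut short, which may be why some Before and After cells in the earlier output are empty.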


The advanced functions that we described above - list_search_fields(), list_filter_fields(), and list_possible_filters() - can also be applied in combination with the search_concordance() function.

3.5.4 Word embeddings

In HistText, pre-computed word embeddings (i.e., learned representations of text in which words with similar meanings have similar representations) can be used to enhance queries by incorporating similar terms. Please note that this function is currently under construction. However, it is already accessible through the HistText Search interface.