• 1 Update 0.2.5
    • 1.1 What’s new?
    • 1.2 The search_*_ex functions
      • 1.2.1 Purposes of the two functions
      • 1.2.2 Function prototype
      • 1.2.3 Companion functions
      • 1.2.4 Expected date format
      • 1.2.5 How to use the search_*_ex functions?
    • 1.3 Working with documents not supported by the server
      • 1.3.1 Loading a PDF as a dataframe
      • 1.3.2 Applying NER on a dataframe
    • 1.4 Other related news
    • 1.5 Known issues

1 Update 0.2.5

1.1 What’s new?

  • Extended versions of the search functions: search_documents_ex and search_concordance_ex
  • Companion functions: list_search_fields and accepts_date_queries
  • Function to apply NER on any dataframe: ner_on_df
  • Function to load the text content of a PDF into a dataframe: load_pdf_as_df
  • Function to directly retrieve a padagraph URL: get_padagraph_url
  • list_corpora now only displays collections that can be queried

1.2 The search_*_ex functions

1.2.1 Purposes of the two functions

  • Search in other text fields (e.g. document titles)
  • Filter by date(s)
  • Retrieve more than 100,000 results

1.2.2 Function prototype

search_concordance_ex <- function(q, 
                                  corpus="imh", 
                                  search_fields=c(),
                                  context_size=30, 
                                  start=0,
                                  dates=c())
  • q: the search query (same as in search_concordance)
  • corpus: name of the corpus to search in (use list_corpora to get possible names)
  • search_fields: name(s) of the field(s) to search in (use list_search_fields to get available search fields). If no field is explicitly given, searches by default in all possible fields.
  • context_size: size of the context window (same as in search_concordance)
  • start: index of the first row to retrieve. Only useful when the search query returns more than 100,000 results.
  • dates: date(s) used to filter the search results. Can be date ranges. Only works with corpora that include dates (see accepts_date_queries).
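
The start parameter suggests a paging pattern for result sets larger than 100,000 rows. The following is a minimal sketch, not part of the package: fetch_all and fetch_page are hypothetical names, and it assumes that start is a zero-based row offset and that a single call returns at most page_size rows. A stub stands in for the server so that the loop can be run offline.

```r
# Hypothetical helper: call `fetch_page(start)` repeatedly and stack the pages.
fetch_all <- function(fetch_page, page_size = 100000, max_pages = 50) {
  pages <- list()
  start <- 0
  for (i in seq_len(max_pages)) {
    page <- fetch_page(start)
    if (is.null(page) || nrow(page) == 0) break
    pages[[length(pages) + 1]] <- page
    # A short page means there is nothing left to retrieve.
    if (nrow(page) < page_size) break
    start <- start + nrow(page)
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Offline stub standing in for the server: 25 rows served 10 at a time.
fake_rows <- data.frame(Id = sprintf("DOC%02d", 1:25), stringsAsFactors = FALSE)
fake_fetch <- function(start) {
  idx <- seq_len(nrow(fake_rows))
  idx <- idx[idx > start & idx <= start + 10]
  fake_rows[idx, , drop = FALSE]
}

all_rows <- fetch_all(fake_fetch, page_size = 10)
nrow(all_rows)
```

With the package loaded, fetch_page could be a wrapper such as function(s) search_concordance_ex(q, corpus = "shunpao", start = s), provided the assumptions above hold for the server.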

1.2.3 Companion functions

List possible search fields in a corpus:

# For the shunpao corpus
enpchina::list_search_fields("shunpao")
## [1] "text"  "title"
# For the imh corpus
enpchina::list_search_fields("imh")
## [1] "book"   "page"   "date"   "story"  "bookno"

Check whether a corpus supports date filters:

# On the shunpao corpus
enpchina::accepts_date_queries("shunpao")
## [1] TRUE
# On the wikibio-zh corpus
enpchina::accepts_date_queries("wikibio-zh")
## [1] FALSE
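
The two companion functions can be used to validate arguments before sending a query. The sketch below is only an illustration: check_search_args is a hypothetical helper, and the field list and date support flag are passed in as plain values (copied from the outputs above) so that the example runs without a server connection.

```r
# Hypothetical helper (not part of enpchina): validate arguments before querying.
check_search_args <- function(search_fields, dates, available_fields, supports_dates) {
  unknown <- setdiff(search_fields, available_fields)
  if (length(unknown) > 0) {
    stop("Unknown search field(s): ", paste(unknown, collapse = ", "))
  }
  if (length(dates) > 0 && !supports_dates) {
    stop("This corpus does not accept date filters")
  }
  invisible(TRUE)
}

# Values copied from the list_search_fields("shunpao") output shown above.
shunpao_fields <- c("text", "title")
ok <- check_search_args("title", "1933", shunpao_fields, supports_dates = TRUE)

# A date filter on a corpus without dates (such as wikibio-zh) is rejected.
err <- tryCatch(
  check_search_args(character(0), "1933", character(0), supports_dates = FALSE),
  error = function(e) conditionMessage(e)
)
```

In a live session, available_fields and supports_dates would come from list_search_fields(corpus) and accepts_date_queries(corpus).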

1.2.4 Expected date format

Two types of dates can be used:

  • Date points: YYYY-MM-DD or YYYY-MM or YYYY
  • Date ranges: [DATE1 TO DATE2] or [DATE1;DATE2] where DATE1 and DATE2 are date points (range is inclusive)

Examples:

  • Date points:
    • 1899-12-22: Only results dated 22 December 1899 are retrieved.
    • 1899-12: All results in December 1899 are retrieved.
    • 1899: All results in 1899 are retrieved.
  • Date ranges:
    • [1899 TO 1912] : All results from 1899 to 1912 are retrieved.
    • [1899-12 TO 1912-01-15] : All results from December 1899 to the 15th of January 1912 are retrieved.
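
The formats above can be checked on the client side before a query is sent. This is a rough sketch using base R regular expressions; is_date_filter is a hypothetical helper, and it only checks the shape of the string, not whether the month and day are valid calendar values.

```r
# Shape of a date point: YYYY, YYYY-MM or YYYY-MM-DD.
point <- "[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?"
point_re <- paste0("^", point, "$")
# Shape of a date range: [DATE1 TO DATE2] or [DATE1;DATE2].
range_re <- paste0("^\\[", point, "( TO |;)", point, "\\]$")

# Hypothetical helper: TRUE for strings that look like a date point or range.
is_date_filter <- function(x) grepl(point_re, x) | grepl(range_re, x)

candidates <- c("1899-12-22", "1899-12", "1899",
                "[1899 TO 1912]", "[1899-12 TO 1912-01-15]", "[1899;1912]",
                "22/12/1899", "1899 TO 1912")
valid <- is_date_filter(candidates)
```

The first six candidates match one of the documented formats; the last two (a day-first date and a range without brackets) do not.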

1.2.5 How to use the search_*_ex functions?

1.2.5.1 Specifying the search fields

Search for 蔣介石 (Chiang Kai-shek) in the titles of the shunpao collection:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields="title")
Id                Date      Title         Source
<chr>             <chr>     <chr>         <chr>
SPSP193303261314  19330326  蔣介石        shunpao
SPSP193206070403  19320607  蔣介石        shunpao
SPSP192808060404  19280806  蔣介石返京    shunpao
SPSP192609011703  19260901  蔣介石軼事    shunpao
SPSP192709190402  19270919  蔣介石昨抵杭  shunpao
SPSP193303290819  19330329  馮致蔣介石函  shunpao
SPSP193201290811  19320129  蔣介石緩返奉  shunpao
SPSP193201140414  19320114  蔣介石昨蒞杭  shunpao
SPSP192911020403  19291102  蔣介石到許昌  shunpao
SPSP192801060702  19280106  蔣介石過鎭記  shunpao

Search for 蔣介石 in the titles OR in the content of the shunpao collection:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields=c("title", "text"))
Id                Date      Title             Source
<chr>             <chr>     <chr>             <chr>
SPSP193303261314  19330326  蔣介石            shunpao
SPSP193303261314  19330326  蔣介石            shunpao
SPSP192808060404  19280806  蔣介石返京        shunpao
SPSP192808060404  19280806  蔣介石返京        shunpao
SPSP192711100420  19271110  歸國途中之蔣介石  shunpao
SPSP192711100420  19271110  歸國途中之蔣介石  shunpao
SPSP192711100420  19271110  歸國途中之蔣介石  shunpao
SPSP192711100420  19271110  歸國途中之蔣介石  shunpao
SPSP192710220709  19271022  蔣介石在日之行蹤  shunpao
SPSP192710220709  19271022  蔣介石在日之行蹤  shunpao

The same results are returned when search_fields is omitted, because the shunpao corpus has only two search fields:

search_concordance_ex("\"蔣介石\"", corpus="shunpao")
Id                Date      Title             Source
<chr>             <chr>     <chr>             <chr>
SPSP193303261314  19330326  蔣介石            shunpao
SPSP193303261314  19330326  蔣介石            shunpao
SPSP192808060404  19280806  蔣介石返京        shunpao
SPSP192808060404  19280806  蔣介石返京        shunpao
SPSP192711100420  19271110  歸國途中之蔣介石  shunpao
SPSP192711100420  19271110  歸國途中之蔣介石  shunpao
SPSP192711100420  19271110  歸國途中之蔣介石  shunpao
SPSP192711100420  19271110  歸國途中之蔣介石  shunpao
SPSP192710220709  19271022  蔣介石在日之行蹤  shunpao
SPSP192710220709  19271022  蔣介石在日之行蹤  shunpao
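
The output above lists some documents several times, presumably once per match in the document (an assumption based on how concordance results usually work, not something stated here). A small base R sketch for counting matches per document and for listing each document once, using the Id and Date values displayed above:

```r
# Id and Date values copied from the ten rows displayed above.
hits <- data.frame(
  Id = rep(c("SPSP193303261314", "SPSP192808060404",
             "SPSP192711100420", "SPSP192710220709"),
           times = c(2, 2, 4, 2)),
  Date = rep(c("19330326", "19280806", "19271110", "19271022"),
             times = c(2, 2, 4, 2)),
  stringsAsFactors = FALSE
)

# Number of rows per document identifier.
hits_per_doc <- table(hits$Id)

# One row per distinct document.
docs <- unique(hits)
```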

1.2.5.2 Filtering by date

Search for 蔣介石 in documents from 1933:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="1933")
Id                Date
<chr>             <chr>
SPSP193303261314  19330326
SPSP193303261314  19330326
SPSP193303290819  19330329
SPSP193305161106  19330516
SPSP193307300307  19330730
SPSP193303050850  19330305
SPSP193303120602  19330312
SPSP193303120602  19330312
SPSP193303100303  19330310
SPSP193303072306  19330307

Search for 蔣介石 in documents from March 1933:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="1933-03")
Id                Date
<chr>             <chr>
SPSP193303261314  19330326
SPSP193303261314  19330326
SPSP193303290819  19330329
SPSP193303050850  19330305
SPSP193303120602  19330312
SPSP193303120602  19330312
SPSP193303100303  19330310
SPSP193303072306  19330307
SPSP193303280905  19330328
SPSP193303301004  19330330

Search for 蔣介石 in documents from 1933 OR 1940:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates=c("1933", "1940"))
Id                Date
<chr>             <chr>
SPSP193303261314  19330326
SPSP193303261314  19330326
SPSP193303290819  19330329
SPSP193305161106  19330516
SPSP193307300307  19330730
SPSP193303050850  19330305
SPSP193303120602  19330312
SPSP193303120602  19330312
SPSP193303100303  19330310
SPSP193303072306  19330307

Search for 蔣介石 in documents from 1930 to 1940:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="[1930 TO 1940]")
Id                Date
<chr>             <chr>
SPSP193303261314  19330326
SPSP193303261314  19330326
SPSP193201140414  19320114
SPSP193201140414  19320114
SPSP193204150719  19320415
SPSP193204150719  19320415
SPSP193203140121  19320314
SPSP193203140121  19320314
SPSP193202110833  19320211
SPSP193202110833  19320211

Search for 蔣介石 in documents from 1930 to 1940 OR from 1945:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates=c("[1930 TO 1940]", "1945"))
Id                Date
<chr>             <chr>
SPSP193303261314  19330326
SPSP193303261314  19330326
SPSP193201140414  19320114
SPSP193201140414  19320114
SPSP193204150719  19320415
SPSP193204150719  19320415
SPSP193203140121  19320314
SPSP193203140121  19320314
SPSP193202110833  19320211
SPSP193202110833  19320211
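
Date ranges can also be generated programmatically, for instance to split a long period into smaller ranges that are queried one at a time. A minimal sketch follows; year_ranges is a hypothetical helper, and it relies only on the documented [DATE1 TO DATE2] syntax and on ranges being inclusive.

```r
# Hypothetical helper: split [from_year, to_year] into consecutive,
# non-overlapping ranges of at most `width` years each.
year_ranges <- function(from_year, to_year, width = 10) {
  starts <- seq(from_year, to_year, by = width)
  ends <- pmin(starts + width - 1, to_year)
  sprintf("[%d TO %d]", starts, ends)
}

filters <- year_ranges(1930, 1945, width = 10)
filters
## [1] "[1930 TO 1939]" "[1940 TO 1945]"
```

Each element of filters can be passed as the dates argument of a separate search_concordance_ex call, or the whole vector can be passed at once to combine the ranges with OR.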

1.2.5.3 Combining search fields and date filters

Search for 蔣介石 in document titles from 1930 to 1940 OR from 1945:

search_concordance_ex("\"蔣介石\"", corpus="shunpao",
                      search_fields="title",
                      dates=c("[1930 TO 1940]", "1945"))
Id                Date
<chr>             <chr>
SPSP193303261314  19330326
SPSP193206070403  19320607
SPSP193303290819  19330329
SPSP193201290811  19320129
SPSP193201140414  19320114
SPSP193203140121  19320314
SPSP193204150719  19320415
SPSP193206120408  19320612
SPSP193206200410  19320620
SPSP193203230610  19320323

Search for 蔣介石 in document texts from 1930 to 1940 OR from 1945:

search_concordance_ex("\"蔣介石\"", corpus="shunpao",
                      search_fields="text",
                      dates=c("[1930 TO 1940]", "1945"))
Id                Date      Title                            Source
<chr>             <chr>     <chr>                            <chr>
SPSP194501130214  19450113  宋子文入閣與重慶的煩悶 吉田東祐  shunpao
SPSP194501130214  19450113  宋子文入閣與重慶的煩悶 吉田東祐  shunpao
SPSP194501130214  19450113  宋子文入閣與重慶的煩悶 吉田東祐  shunpao
SPSP194501130214  19450113  宋子文入閣與重慶的煩悶 吉田東祐  shunpao
SPSP194501130214  19450113  宋子文入閣與重慶的煩悶 吉田東祐  shunpao
SPSP194501130214  19450113  宋子文入閣與重慶的煩悶 吉田東祐  shunpao
SPSP193703190430  19370319  旅歐隨筆                         shunpao
SPSP193703190430  19370319  旅歐隨筆                         shunpao
SPSP193703190430  19370319  旅歐隨筆                         shunpao
SPSP194503090116  19450309  舊金山會議 蔣不出席              shunpao

1.3 Working with documents not supported by the server

1.3.1 Loading a PDF as a dataframe

Prerequisite: Your PDF file must be in a folder accessible from your R script.

Basic usage:

pdf_df <- load_pdf_as_df("./example-en.pdf")
dplyr::slice(pdf_df, 1:10)
Page
<int>
1
2
3
7
9
11
12
13
15
16

Only consider pages with a minimum of 100 characters:

pdf_df <- load_pdf_as_df("./example-en.pdf", min_text_length = 100)
dplyr::slice(pdf_df, 1:10)
Page
<int>
1
2
11
12
13
15
16
17
18
19

Change the identifier prefix:

pdf_df <- load_pdf_as_df("./example-en.pdf", min_text_length = 100, identifier="example-en")
dplyr::slice(pdf_df, 1:10)
Page
<int>
1
2
11
12
13
15
16
17
18
19
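
To make the effect of min_text_length concrete, the sketch below applies a comparable filter by hand to a toy dataframe. It is an illustration only: the Page and Text column names follow the examples in this document, the toy contents are made up, and whether the package compares with >= or > is an assumption.

```r
# Toy stand-in for the result of load_pdf_as_df (contents are made up).
pages <- data.frame(
  Page = 1:4,
  Text = c(strrep("a", 250), "", strrep("b", 40), strrep("c", 120)),
  stringsAsFactors = FALSE
)

# Keep only pages whose text has at least 100 characters
# (assumed to be roughly what min_text_length = 100 does).
long_pages <- pages[nchar(pages$Text) >= 100, , drop = FALSE]
long_pages$Page
## [1] 1 4
```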

1.3.2 Applying NER on a dataframe

Prerequisite: Have a dataframe that contains at least one column with text.

In the following examples, we’ll use the pdf_df dataframe generated in the previous example.

Basic usage (on English texts):

# The second argument is the name of the column on which to apply NER
ner_df <- ner_on_df(dplyr::slice(pdf_df, 1:10), "Text", model = "eng")
## 10/10
ner_df
Type
<chr>
ORGANIZATION
PERSON
MONEY
MISC
LOCATION
PERSON
PERSON
LOCATION
MISC
LOCATION

Specify the id column:

ner_df <- ner_on_df(dplyr::slice(pdf_df, 1:10), "Text", id_column="Id", model = "eng") 
## 10/10
ner_df
Type
<chr>
ORGANIZATION
PERSON
MONEY
MISC
LOCATION
PERSON
PERSON
LOCATION
MISC
LOCATION

On Chinese texts:

pdfzh_df <- load_pdf_as_df("./example-zh.pdf", min_text_length = 10, identifier="example-zh")
## PDF error: Can't get Fields array
nerzh_df <- ner_on_df(dplyr::slice(pdfzh_df, 1:10), "Text", id_column="Id", model = "mdn") 
## 10/10
nerzh_df
Type
<chr>
PER (0.7877)
PER (0.7245)
PER (0.8904)
PER (0.8638)
PER (0.8142)
PER (0.8904)
PER (0.7676)
PER (0.9126)
PER (0.7965)
ORG (0.6444)
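
With the "mdn" model, the Type column combines a label and a number in parentheses, which looks like a confidence score (an assumption; the document does not say what the number means). A short sketch for splitting the two parts with base R, using values copied from the output above:

```r
# Values copied from the Type column shown above.
types <- c("PER (0.7877)", "PER (0.7245)", "ORG (0.6444)")

# Pattern: an upper-case label, a space, then a number in parentheses.
pattern <- "^([A-Z]+) \\(([0-9.]+)\\)$"
label <- sub(pattern, "\\1", types)
score <- as.numeric(sub(pattern, "\\2", types))

# Example: keep only entities whose score is at least 0.7.
keep <- score >= 0.7
label[keep]
## [1] "PER" "PER"
```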

1.5 Known issues

  • ner_on_df can produce duplicates.
  • Very simple queries can time out because they return too many results. This mostly happens on the proquest corpus. A fix is coming soon.
  • There is a hard limit of 100,000 results. This will be fixed at the same time as the timeout issue.
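
Until the duplicate issue in ner_on_df is fixed, exact duplicate rows can be dropped on the client side. The sketch below uses a toy dataframe; apart from Type, the column names (Id, Text) are hypothetical placeholders and may differ from what ner_on_df actually returns.

```r
# Toy stand-in for a ner_on_df result containing one exact duplicate row.
ner_toy <- data.frame(
  Id = c("doc1", "doc1", "doc2"),
  Text = c("Shanghai", "Shanghai", "Shanghai"),
  Type = c("LOCATION", "LOCATION", "LOCATION"),
  stringsAsFactors = FALSE
)

# unique() keeps the first occurrence of each distinct row.
ner_dedup <- unique(ner_toy)
nrow(ner_dedup)
## [1] 2
```

Note that unique() also collapses genuine repeated mentions of the same entity within a document if no other column (such as a position) distinguishes them.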