1 Update 0.2.5

1.1 What’s new?

  • Extended versions of the search functions: search_documents_ex and search_concordance_ex
  • Companion functions: list_search_fields and accepts_date_queries
  • Function to apply NER on any dataframe: ner_on_df
  • Function to load the text content of a PDF into a dataframe: load_pdf_as_df
  • Function to directly retrieve a padagraph URL: get_padagraph_url
  • list_corpora now only displays collections that can be queried

1.2 The search_*_ex functions

1.2.1 Purposes of the two functions

  • Search in other text fields (e.g. document titles)
  • Filter by date(s)
  • Retrieve more than 100,000 results

1.2.2 Function prototype

search_concordance_ex <- function(q, 
                                  corpus="imh", 
                                  search_fields=c(),
                                  context_size=30, 
                                  start=0,
                                  dates=c())
  • q: the search query (same as in search_concordance)
  • corpus: name of the corpus to search in (use list_corpora to get possible names)
  • search_fields: name(s) of the field(s) to search in (use list_search_fields to get the available search fields). If no field is given, all available fields are searched.
  • context_size: size of the context window (same as in search_concordance)
  • start: index of the first row to retrieve. Only useful when the search query returns more than 100,000 results.
  • dates: date(s) used to filter the search results. Can be date ranges. Only works with corpora that include dates (see accepts_date_queries).
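
The start parameter can be used to walk through a large result set page by page. The following is a minimal sketch, assuming that each call returns a dataframe of at most 100,000 rows and that start is a zero-based row offset. fetch_all_pages and its search_page argument are hypothetical names, not part of enpchina.

```r
# Hypothetical helper (not part of enpchina): collect every page of a paged search.
# `search_page` is any function(start) returning a dataframe of results that
# begins at row offset `start`; `page_size` is the assumed per-call limit.
fetch_all_pages <- function(search_page, page_size = 100000) {
  pages <- list()
  start <- 0
  repeat {
    page <- search_page(start)
    if (nrow(page) == 0) break            # nothing left to retrieve
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < page_size) break     # a short page is the last one
    start <- start + nrow(page)
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Possible usage with the real function (requires access to the server):
# all_hits <- fetch_all_pages(function(s)
#   enpchina::search_concordance_ex("\"蔣介石\"", corpus = "shunpao", start = s))
```

The search function is passed in as an argument so that the loop can be tried on a small stub before it is pointed at the server.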

1.2.3 Companion functions

List possible search fields in a corpus:

# For the shunpao corpus
enpchina::list_search_fields("shunpao")
## [1] "text"  "title"
# For the imh corpus
enpchina::list_search_fields("imh")
## [1] "book"   "page"   "date"   "story"  "bookno"

Check whether a corpus supports date filters:

# On the shunpao corpus
enpchina::accepts_date_queries("shunpao")
## [1] TRUE
# On the wikibio corpus
enpchina::accepts_date_queries("wikibio-zh")
## [1] FALSE
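
The two companion functions can be combined into a check that runs before a search. The following is a sketch only: check_search_args is a hypothetical helper, not part of enpchina, and it assumes that list_search_fields returns a character vector and accepts_date_queries a single logical value, as in the outputs above.

```r
# Hypothetical pre-flight check (not part of enpchina).
# The two lookup functions are parameters so that the check can be tried
# without server access; by default the enpchina functions are used.
check_search_args <- function(corpus, search_fields = c(), dates = c(),
                              list_fields = enpchina::list_search_fields,
                              accepts_dates = enpchina::accepts_date_queries) {
  if (length(search_fields) > 0) {
    unknown <- setdiff(search_fields, list_fields(corpus))
    if (length(unknown) > 0) {
      stop("Unknown search field(s) for ", corpus, ": ",
           paste(unknown, collapse = ", "))
    }
  }
  if (length(dates) > 0 && !accepts_dates(corpus)) {
    stop("Corpus ", corpus, " does not support date filters")
  }
  invisible(TRUE)
}

# Possible usage (requires access to the server):
# check_search_args("shunpao", search_fields = "title", dates = "1933")
```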

1.2.4 Expected date format

Two types of dates can be used:

  • Date points: YYYY-MM-DD or YYYY-MM or YYYY
  • Date ranges: [DATE1 TO DATE2] or [DATE1;DATE2] where DATE1 and DATE2 are date points (range is inclusive)

Examples:

  • Date points:
    • 1899-12-22: Only results with the date 22 December 1899 are retrieved.
    • 1899-12: All results in December 1899 are retrieved.
    • 1899: All results in 1899 are retrieved.
  • Date ranges:
    • [1899 TO 1912]: All results from 1899 to 1912 are retrieved.
    • [1899-12 TO 1912-01-15]: All results from December 1899 to the 15th of January 1912 are retrieved.
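
The formats above can be checked locally before a query is sent. The following is a minimal sketch: is_valid_date_filter is a hypothetical helper, not part of enpchina, and it only checks the shape of the string (it does not reject impossible dates such as 1899-13-40).

```r
# Hypothetical validator for the date formats described above.
# A date point is YYYY, YYYY-MM or YYYY-MM-DD.
date_point <- "[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?"
# A date range is [DATE1 TO DATE2] or [DATE1;DATE2].
date_range <- paste0("\\[", date_point, "( TO |;)", date_point, "\\]")
date_pattern <- paste0("^(", date_point, "|", date_range, ")$")

# Returns one logical value per element of `dates`.
is_valid_date_filter <- function(dates) {
  grepl(date_pattern, dates)
}

is_valid_date_filter(c("1899-12", "[1899 TO 1912]", "1899/12"))
## [1]  TRUE  TRUE FALSE
```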

1.2.5 How to use the search_*_ex functions?

1.2.5.1 Specifying the search fields

Search for 蔣介石 in the titles of the shunpao collection:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields="title")

Search for 蔣介石 in the titles OR in the content of the shunpao collection:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields=c("title", "text"))

The following call gives the same result, because the shunpao corpus has only two search fields:

search_concordance_ex("\"蔣介石\"", corpus="shunpao")

1.2.5.2 Filtering by date

Search for 蔣介石 in documents from 1933:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="1933")

Search for 蔣介石 in documents from March 1933:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="1933-03")

Search for 蔣介石 in documents from 1933 OR 1940:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates=c("1933", "1940"))

Search for 蔣介石 in documents from 1930 to 1940:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="[1930 TO 1940]")

Search for 蔣介石 in documents from 1930 to 1940 OR in 1945:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates=c("[1930 TO 1940]", "1945"))

1.2.5.3 Combining search fields and date filters

Search for 蔣介石 in document titles from 1930 to 1940 OR in 1945:

search_concordance_ex("\"蔣介石\"", corpus="shunpao",
                      search_fields="title",
                      dates=c("[1930 TO 1940]", "1945"))

Search for 蔣介石 in document texts from 1930 to 1940 OR in 1945:

search_concordance_ex("\"蔣介石\"", corpus="shunpao",
                      search_fields="text",
                      dates=c("[1930 TO 1940]", "1945"))

1.3 Working with documents not supported by the server

1.3.1 Loading a PDF as a dataframe

Prerequisite: Your PDF file must be in a folder accessible from your R script.

Basic usage:

pdf_df <- load_pdf_as_df("./example-en.pdf")
dplyr::slice(pdf_df, 1:10)

Only consider pages with a minimum of 100 characters:

pdf_df <- load_pdf_as_df("./example-en.pdf", min_text_length = 100)
dplyr::slice(pdf_df, 1:10)

Change the identifier prefix:

pdf_df <- load_pdf_as_df("./example-en.pdf", min_text_length = 100, identifier="example-en")
dplyr::slice(pdf_df, 1:10)
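
A similar length filter can also be applied after loading, for instance to try several thresholds without reading the PDF again. The following is a sketch on a toy dataframe: the Id and Text column names follow the ones used with ner_on_df in this document, while the row values and the format of the identifiers are made up for illustration.

```r
# Toy stand-in for a dataframe returned by load_pdf_as_df();
# the identifier format and the texts are assumptions.
toy_df <- data.frame(
  Id = c("example-en_1", "example-en_2", "example-en_3"),
  Text = c(strrep("a", 150), "short page", strrep("b", 100)),
  stringsAsFactors = FALSE
)

# Keep only the rows whose text has at least 100 characters.
long_pages <- toy_df[nchar(toy_df$Text) >= 100, , drop = FALSE]
long_pages$Id
## [1] "example-en_1" "example-en_3"
```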

1.3.2 Applying NER on a dataframe

Prerequisite: Have a dataframe that contains at least one column with text.

In the following examples, we’ll use the pdf_df dataframe generated in the previous example.

Basic usage (on English texts):

# The second argument is the name of the column on which to apply NER
ner_df <- ner_on_df(dplyr::slice(pdf_df, 1:10), "Text", model = "eng")
## 10/10
ner_df

Specify the id column:

ner_df <- ner_on_df(dplyr::slice(pdf_df, 1:10), "Text", id_column="Id", model = "eng") 
## 10/10
ner_df

On Chinese texts:

pdfzh_df <- load_pdf_as_df("./example-zh.pdf", min_text_length = 10, identifier="example-zh")
## PDF error: Can't get Fields array
nerzh_df <- ner_on_df(dplyr::slice(pdfzh_df, 1:10), "Text", id_column="Id", model = "mdn") 
## 10/10
nerzh_df

1.5 Known issues

  • ner_on_df can produce duplicates.
  • Very basic queries can time out because they return too many results. This mostly happens on the proquest corpus. A fix is coming soon.
  • Hard limit of 100,000 results. Will be fixed at the same time as the timeout issue.
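
Until the duplicate issue in ner_on_df is fixed, fully identical rows can be dropped with base R. The following is a sketch on a toy dataframe: the column names and entity labels are assumptions, since the exact layout of the ner_on_df result is not described here.

```r
# Toy stand-in for a ner_on_df() result; column names and labels are assumptions.
toy_ner_df <- data.frame(
  Id = c("example-en_1", "example-en_1", "example-en_2"),
  Text = c("Shanghai", "Shanghai", "Sun Yat-sen"),
  Type = c("LOC", "LOC", "PERS"),
  stringsAsFactors = FALSE
)

# unique() on a dataframe keeps the first occurrence of each distinct row.
ner_dedup <- unique(toy_ner_df)
nrow(ner_dedup)
## [1] 2
```

This only removes rows that are identical in every column. If the same entity legitimately occurs several times in one document, deduplicating also removes those repeated occurrences.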