1 Update 0.2.5

1.1 What’s new?

  • Extended versions of the search functions: search_documents_ex and search_concordance_ex
  • Companion functions: list_search_fields and accepts_date_queries
  • Function to apply NER on any dataframe: ner_on_df
  • Function to load the text content of a PDF into a dataframe: load_pdf_as_df
  • Function to directly retrieve a padagraph URL: get_padagraph_url
  • list_corpora now only displays collections that can be queried

1.2 The search_*_ex functions

1.2.1 Purposes of the two functions

  • Search in other text fields (e.g. document titles)
  • Filter by date(s)
  • Retrieve more than 100,000 results

1.2.2 Function prototype

search_concordance_ex <- function(q, 
                                  corpus="imh", 
                                  search_fields=c(),
                                  context_size=30, 
                                  start=0,
                                  dates=c())
  • q: the search query (same as in search_concordance)
  • corpus: name of the corpus to search in (use list_corpora to get possible names)
  • search_fields: name(s) of the field(s) to search in (use list_search_fields to get available search fields). If no field is given, all available fields are searched.
  • context_size: size of the context window (same as in search_concordance)
  • start: index of the first row to retrieve. Only useful when the search query returns more than 100,000 results.
  • dates: date(s) used to filter the search results. Can be date ranges. Only works with corpora that include dates (see accepts_date_queries).
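
The start parameter suggests a simple paging loop for result sets above 100,000 rows. The sketch below is an illustration, not part of enpchina: fetch_all_pages is a hypothetical helper, and it assumes that each call returns a data frame of at most page_size rows and that start counts rows from 0, as the default value suggests.

```r
# Hypothetical helper, not part of enpchina: collect every page of a
# paginated search into one data frame.
# `fetch_page` is any function that takes a `start` index and returns a data frame.
# `page_size` is the assumed maximum number of rows returned per call.
fetch_all_pages <- function(fetch_page, page_size = 100000) {
  pages <- list()
  start <- 0
  repeat {
    page <- fetch_page(start)
    if (nrow(page) == 0) break
    pages[[length(pages) + 1]] <- page
    # A page shorter than page_size means the last results were reached.
    if (nrow(page) < page_size) break
    start <- start + nrow(page)
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Intended use (assumes enpchina is loaded and the corpus is reachable):
# all_rows <- fetch_all_pages(function(start) {
#   search_concordance_ex("\"蔣介石\"", corpus = "shunpao", start = start)
# })
```

Passing the search call as a function keeps the loop independent of the query, so the same helper could wrap search_documents_ex as well, provided it pages the same way.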

1.2.3 Companion functions

List possible search fields in a corpus:

# For the shunpao corpus
enpchina::list_search_fields("shunpao")
## [1] "text"  "title"
# For the imh corpus
enpchina::list_search_fields("imh")
## [1] "book"   "page"   "date"   "story"  "bookno"

Check whether a corpus supports date filters:

# On the shunpao corpus
enpchina::accepts_date_queries("shunpao")
## [1] TRUE
# On the wikibio corpus
enpchina::accepts_date_queries("wikibio-zh")
## [1] FALSE
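
The field list can be used to catch a misspelled or unsupported field before a search is sent. The following is a minimal sketch: validate_search_fields is a hypothetical helper that is not part of enpchina, and the field names are copied from the list_search_fields("shunpao") output shown above rather than fetched live.

```r
# Hypothetical helper, not part of enpchina: stop early if a requested
# field is not among the fields available for the corpus.
validate_search_fields <- function(requested, available) {
  unknown <- setdiff(requested, available)
  if (length(unknown) > 0) {
    stop("Unknown search field(s): ", paste(unknown, collapse = ", "))
  }
  invisible(requested)
}

# Field names copied from the shunpao output above; in practice this
# vector would come from enpchina::list_search_fields("shunpao").
shunpao_fields <- c("text", "title")

validate_search_fields("title", shunpao_fields)    # passes silently
# validate_search_fields("date", shunpao_fields)   # would raise an error
```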

1.2.4 Expected date format

Two types of dates can be used:

  • Date points: YYYY-MM-DD or YYYY-MM or YYYY
  • Date ranges: [DATE1 TO DATE2] or [DATE1;DATE2] where DATE1 and DATE2 are date points (range is inclusive)

Examples:

  • Date points:
    • 1899-12-22: Only results with the date 22 December 1899 are retrieved.
    • 1899-12: All results in December 1899 are retrieved.
    • 1899: All results in 1899 are retrieved.
  • Date ranges:
    • [1899 TO 1912]: All results from 1899 to 1912 are retrieved.
    • [1899-12 TO 1912-01-15]: All results from December 1899 to the 15th of January 1912 are retrieved.
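
The formats above can be checked locally before a query is sent. The sketch below is an assumption-laden illustration: is_date_point, is_date_range and is_valid_date_query are hypothetical helpers, not part of enpchina, and they only test the shape of the string (they do not reject impossible dates such as 1899-13-40). The package's own parser is not shown in this note and may behave differently.

```r
# Hypothetical helpers, not part of enpchina.
# A date point is YYYY, YYYY-MM or YYYY-MM-DD.
date_point <- "[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?"

is_date_point <- function(x) {
  grepl(paste0("^", date_point, "$"), x)
}

# A date range is [DATE1 TO DATE2] or [DATE1;DATE2].
is_date_range <- function(x) {
  grepl(paste0("^\\[", date_point, "( TO |;)", date_point, "\\]$"), x)
}

is_valid_date_query <- function(x) {
  is_date_point(x) | is_date_range(x)
}

# The first five strings follow the documented formats; the last one does not.
is_valid_date_query(c("1899-12-22", "1899-12", "1899",
                      "[1899 TO 1912]", "[1899-12;1912-01-15]",
                      "22/12/1899"))
```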

1.2.5 How to use the search_*_ex functions?

1.2.5.1 Specifying the search fields

Search for 蔣介石 in the titles of the shunpao collection:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields="title")

Search for 蔣介石 in the titles OR in the content of the shunpao collection:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields=c("title", "text"))