1 Update 0.2.5

1.1 What’s new?

Extended versions of the search functions: search_documents_ex and search_concordance_ex
Companion functions: list_search_fields and accepts_date_queries
Function to apply NER on any dataframe: ner_on_df
Function to load the text content of a PDF into a dataframe: load_pdf_as_df
Function to directly retrieve a padagraph URL: get_padagraph_url
list_corpora now only displays collections that can be queried

1.2 The search_*_ex functions

1.2.1 Purposes of the two functions

Search in other text fields (e.g. document titles)
Filter by date(s)
Retrieve more than 100,000 results

1.2.2 Function prototype

search_concordance_ex <- function(q, 
                                  corpus="imh", 
                                  search_fields=c(),
                                  context_size=30, 
                                  start=0,
                                  dates=c())

q: the search query (same as in search_concordance)
corpus: name of the corpus to search in (use list_corpora to get possible names)
search_fields: name(s) of the field(s) to search in (use list_search_fields to get available search fields). If no field is explicitly given, searches by default in all possible fields.
context_size: size of the context window (same as in search_concordance)
start: index of the first row to retrieve. Only useful when the search query returns more than 100,000 results.
dates: date(s) used to filter the search results. Can be date ranges. Only works with corpora that include dates (see accepts_date_queries).
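
The start parameter suggests that large result sets can be retrieved page by page. The following is a minimal sketch of such a loop, not part of the package: fetch_all, fetch_page, page_size and max_pages are hypothetical names, and the sketch assumes that each call returns a dataframe holding at most page_size rows. A fake fetcher stands in for the server so that the example runs on its own.

```r
# Generic pager (hypothetical helper, not part of enpchina).
# `fetch_page(start)` must return a dataframe of at most `page_size` rows,
# beginning at the 0-based row index `start`.
fetch_all <- function(fetch_page, page_size = 100000, max_pages = 50) {
  pages <- list()
  start <- 0
  for (i in seq_len(max_pages)) {
    page <- fetch_page(start)
    if (nrow(page) == 0) break          # nothing left to retrieve
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < page_size) break   # last (partial) page reached
    start <- start + nrow(page)
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Stand-in for the server: 25 fake rows served 10 at a time.
fake_rows <- data.frame(id = 1:25)
fake_fetch <- function(start) {
  idx <- seq_len(nrow(fake_rows))
  idx <- idx[idx > start & idx <= start + 10]
  fake_rows[idx, , drop = FALSE]
}

all_rows <- fetch_all(fake_fetch, page_size = 10)
nrow(all_rows)
```

With the package loaded, fetch_page could wrap a real query, for instance function(s) search_concordance_ex("\"蔣介石\"", corpus="shunpao", start=s), provided the assumption about the returned dataframe holds.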

1.2.3 Companion functions

List possible search fields in a corpus:

# For the shunpao corpus
enpchina::list_search_fields("shunpao")

## [1] "text"  "title"

# For the imh corpus
enpchina::list_search_fields("imh")

## [1] "book"   "page"   "date"   "story"  "bookno"

Tell if a corpus supports date filters:

# On the shunpao corpus
enpchina::accepts_date_queries("shunpao")

## [1] TRUE

# On the wikibio corpus
enpchina::accepts_date_queries("wikibio-zh")

## [1] FALSE
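
The output of list_search_fields can be used to validate field names before a query is sent. The sketch below is only an illustration: check_search_fields is a hypothetical helper, and the available fields are hard-coded from the shunpao output shown above so that the example runs without a server connection.

```r
# Hypothetical helper: stop early if a requested field is not available.
check_search_fields <- function(requested, available) {
  unknown <- setdiff(requested, available)
  if (length(unknown) > 0) {
    stop("Unknown search field(s): ", paste(unknown, collapse = ", "),
         ". Available: ", paste(available, collapse = ", "))
  }
  invisible(requested)
}

# Fields copied from the list_search_fields("shunpao") output above.
shunpao_fields <- c("text", "title")

ok <- check_search_fields("title", shunpao_fields)

# "book" exists in imh but not in shunpao, so this call raises an error.
bad <- tryCatch(check_search_fields("book", shunpao_fields),
                error = function(e) conditionMessage(e))
```

In a real session, the second argument would come from enpchina::list_search_fields("shunpao") instead of the hard-coded vector.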

1.2.4 Expected date format

Two types of dates can be used:

Date points: YYYY-MM-DD or YYYY-MM or YYYY
Date ranges: [DATE1 TO DATE2] or [DATE1;DATE2] where DATE1 and DATE2 are date points (range is inclusive)

Examples:

Date points:
- 1899-12-22: Only results with the date 22 December 1899 are retrieved.
- 1899-12: All results in December 1899 are retrieved.
- 1899: All results in 1899 are retrieved.
Date ranges:
- [1899 TO 1912]: All results from 1899 to 1912 are retrieved.
- [1899-12 TO 1912-01-15]: All results from December 1899 to 15 January 1912 are retrieved.
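
The formats above can be checked locally before a query is sent. The following sketch is not part of the package: is_date_filter is a hypothetical helper that only checks the shape of the string (digits and separators). It does not verify that the month or day exists, and it assumes that ranges are written exactly as "[DATE1 TO DATE2]" or "[DATE1;DATE2]".

```r
# Shape of a date point: YYYY, YYYY-MM or YYYY-MM-DD.
date_point <- "[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?"

# Hypothetical helper: TRUE for strings shaped like a date point or a range.
is_date_filter <- function(x) {
  point_re <- paste0("^", date_point, "$")
  range_re <- paste0("^\\[", date_point, "( TO |;)", date_point, "\\]$")
  grepl(point_re, x) | grepl(range_re, x)
}

candidates <- c("1899-12-22", "1899-12", "1899",
                "[1899 TO 1912]", "[1899-12;1912-01-15]",
                "22/12/1899")
is_date_filter(candidates)
```

Only the last candidate, which uses a day/month/year layout, is rejected.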

1.2.5 How to use the search_*_ex functions?

1.2.5.1 Specifying the search fields

Search for 蔣介石 in the titles of the shunpao collection:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields="title")

Search for 蔣介石 in the titles OR in the content of the shunpao collection:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields=c("title", "text"))

The following call gives the same result, because the shunpao corpus only has two search fields:

search_concordance_ex("\"蔣介石\"", corpus="shunpao")

1.2.5.2 Filtering by date

Search for 蔣介石 in documents from 1933:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="1933")

Search for 蔣介石 in documents from March 1933:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="1933-03")

Search for 蔣介石 in documents from 1933 OR 1940:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates=c("1933", "1940"))

Search for 蔣介石 in documents from 1930 to 1940:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="[1930 TO 1940]")

Search for 蔣介石 in documents from 1930 to 1940 OR from 1945:

search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates=c("[1930 TO 1940]", "1945"))
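
When the dates come from variables rather than literals, the filter strings can be built programmatically. A small sketch, where date_range is a hypothetical helper that simply formats the "[DATE1 TO DATE2]" syntax described earlier:

```r
# Hypothetical helper: format an inclusive date range.
date_range <- function(from, to) sprintf("[%s TO %s]", from, to)

# Same filter as in the last example: 1930 to 1940 OR 1945.
dates <- c(date_range("1930", "1940"), "1945")
dates

# One date point per year, e.g. every second year of a decade.
every_other_year <- as.character(seq(1930, 1940, by = 2))
every_other_year
```

Either vector can then be passed as the dates argument of search_concordance_ex.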

1.2.5.3 Combining search fields and date filters

Search for 蔣介石 in document titles from 1930 to 1940 OR from 1945:

search_concordance_ex("\"蔣介石\"", corpus="shunpao",
                      search_fields="title",
                      dates=c("[1930 TO 1940]", "1945"))

Search for 蔣介石 in document texts from 1930 to 1940 OR from 1945:

search_concordance_ex("\"蔣介石\"", corpus="shunpao",
                      search_fields="text",
                      dates=c("[1930 TO 1940]", "1945"))

1.3 Working with documents not supported by the server

1.3.1 Loading a PDF as a dataframe

Prerequisite: Your PDF file must be in a folder accessible from your R script.

Basic usage:

pdf_df <- load_pdf_as_df("./example-en.pdf")
dplyr::slice(pdf_df, 1:10)

Only consider pages with a minimum of 100 characters:

pdf_df <- load_pdf_as_df("./example-en.pdf", min_text_length = 100)
dplyr::slice(pdf_df, 1:10)

Change the identifier prefix:

pdf_df <- load_pdf_as_df("./example-en.pdf", min_text_length = 100, identifier="example-en")
dplyr::slice(pdf_df, 1:10)
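
If a dataframe has already been loaded without min_text_length, a comparable filter can be applied afterwards. The sketch below uses a toy dataframe with made-up rows, and assumes that the text is stored in a column named Text, as in the NER examples of the next section. It is not a description of how load_pdf_as_df filters pages internally.

```r
# Toy dataframe standing in for the result of load_pdf_as_df (made-up rows).
pages <- data.frame(
  Id = c("doc_1", "doc_2", "doc_3"),
  Text = c("Short page.", strrep("a", 150), ""),
  stringsAsFactors = FALSE
)

# Keep only the rows whose text has at least 100 characters.
long_pages <- pages[nchar(pages$Text) >= 100, , drop = FALSE]
long_pages$Id
```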

1.3.2 Applying NER on a dataframe

Prerequisite: Have a dataframe that contains at least one column with text.

In the following examples, we’ll use the pdf_df dataframe generated in the previous example.

Basic usage (on English texts):

# The second argument is the name of the column on which to apply NER
ner_df <- ner_on_df(dplyr::slice(pdf_df, 1:10), "Text", model = "eng")

## 10/10

ner_df

Specify the id column:

ner_df <- ner_on_df(dplyr::slice(pdf_df, 1:10), "Text", id_column="Id", model = "eng")

## 10/10

ner_df

On Chinese texts:

pdfzh_df <- load_pdf_as_df("./example-zh.pdf", min_text_length = 10, identifier="example-zh")

## PDF error: Can't get Fields array<0a>

nerzh_df <- ner_on_df(dplyr::slice(pdfzh_df, 1:10), "Text", id_column="Id", model = "mdn")

## 10/10

nerzh_df

1.5 Known issues

ner_on_df can produce duplicate rows.
Very basic queries can time out when they match too many results. This mostly happens on the proquest corpus. A fix is coming soon.
Results are subject to a hard limit of 100,000. This will be fixed at the same time as the timeout issue.
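
Until the first issue is fixed, exact duplicate rows can be dropped after the call. The sketch below uses a toy dataframe: the column names and values are hypothetical and do not reproduce the real output of ner_on_df.

```r
# Toy dataframe with one exact duplicate row (hypothetical columns).
ner_like <- data.frame(
  Id = c("example-en_1", "example-en_1", "example-en_2"),
  Entity = c("Shanghai", "Shanghai", "Shanghai"),
  Type = c("LOC", "LOC", "LOC"),
  stringsAsFactors = FALSE
)

# unique() keeps the first occurrence of each fully identical row.
deduped <- unique(ner_like)
nrow(deduped)
```

The same entity found in two different documents is kept, since only rows that are identical in every column are removed.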

ENP-China R package: Update 0.2.5

Jeremy Auguste

08/06/2021

