search_documents_ex and search_concordance_ex
list_search_fields and accepts_date_queries
ner_on_df
load_pdf_as_df
get_padagraph_url
list_corpora now only displays collections that can be queried

search_concordance_ex <- function(q,
                                  corpus="imh",
                                  search_fields=c(),
                                  context_size=30,
                                  start=0,
                                  dates=c())
q: the search query (same as in search_concordance)
corpus: name of the corpus to search in (use list_corpora to get possible names)
search_fields: name(s) of the field(s) to search in (use list_search_fields to get the available search fields). If no field is explicitly given, all available fields are searched by default.
context_size: size of the context window (same as in search_concordance)
start: index of the first row to retrieve. Only useful when the search query returns more than 100,000 results.
dates: date(s) used to filter the search results. Can be date ranges. Only works with corpora that include dates (see accepts_date_queries).

List possible search fields in a corpus:
# For the shunpao corpus
enpchina::list_search_fields("shunpao")
## [1] "text" "title"
# For the imh corpus
enpchina::list_search_fields("imh")
## [1] "book" "page" "date" "story" "bookno"
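Because the available fields differ between corpora, it can help to validate the requested fields before running a query. The following is a minimal sketch in base R; check_search_fields is a hypothetical helper, not part of enpchina, and the field names are copied from the shunpao output above.

```r
# Hypothetical helper (not part of enpchina): stop early when a requested
# search field is not among the fields available for the corpus.
check_search_fields <- function(requested, available) {
  unknown <- setdiff(requested, available)
  if (length(unknown) > 0) {
    stop("Unknown search field(s): ", paste(unknown, collapse = ", "),
         ". Available: ", paste(available, collapse = ", "))
  }
  requested
}

# Field names copied from the shunpao output; in practice they would come
# from enpchina::list_search_fields("shunpao").
available <- c("text", "title")
fields <- check_search_fields("title", available)
```

The validated vector can then be passed as the search_fields argument.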
Tell if a corpus supports date filters:
# On the shunpao corpus
enpchina::accepts_date_queries("shunpao")
## [1] TRUE
# On the wikibio corpus
enpchina::accepts_date_queries("wikibio-zh")
## [1] FALSE
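One possible use of this check is to drop a date filter when the corpus cannot honor it. The sketch below assumes this is the desired behavior; dates_if_supported is a hypothetical helper, not an enpchina function.

```r
# Hypothetical helper (not part of enpchina): keep the date filter only when
# the corpus accepts date queries, otherwise warn and return an empty filter.
dates_if_supported <- function(dates, supports_dates) {
  if (length(dates) > 0 && !isTRUE(supports_dates)) {
    warning("This corpus does not accept date queries; ignoring 'dates'.")
    return(c())
  }
  dates
}

# In practice, the second argument would be the result of
# enpchina::accepts_date_queries(corpus_name).
kept <- dates_if_supported("1933", TRUE)
dropped <- suppressWarnings(dates_if_supported("1933", FALSE))
```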
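Two types of dates can be used: single dates (a year such as "1933", or a year and month such as "1933-03") and date ranges (such as "[1930 TO 1940]").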
Examples:
Search for 蔣介石 in the titles of the shunpao collection:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields="title")
Search for 蔣介石 in the titles OR in the content of the shunpao collection:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields=c("title", "text"))
The same result is obtained without search_fields, because the shunpao corpus only has two search fields:
search_concordance_ex("\"蔣介石\"", corpus="shunpao")
Search for 蔣介石 in documents from 1933:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="1933")
Search for 蔣介石 in documents from March 1933:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="1933-03")
Search for 蔣介石 in documents from 1933 OR 1940:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates=c("1933", "1940"))
Search for 蔣介石 in documents from 1930 to 1940:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="[1930 TO 1940]")
Search for 蔣介石 in documents from 1930 to 1940 OR from 1945:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates=c("[1930 TO 1940]", "1945"))
Search for 蔣介石 in document titles from 1930 to 1940 OR from 1945:
search_concordance_ex("\"蔣介石\"", corpus="shunpao",
                      search_fields="title",
                      dates=c("[1930 TO 1940]", "1945"))
Search for 蔣介石 in document texts from 1930 to 1940 OR from 1945:
search_concordance_ex("\"蔣介石\"", corpus="shunpao",
                      search_fields="text",
                      dates=c("[1930 TO 1940]", "1945"))
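When ranges are computed rather than typed by hand, the bracketed range syntax used in these examples can be generated with a small helper. This is only a sketch; date_range is a hypothetical name, and the format is inferred from the examples above rather than from a formal specification.

```r
# Hypothetical helper (not part of enpchina): build a range string in the
# "[FROM TO TO]" form used in the examples above.
date_range <- function(from, to) {
  sprintf("[%s TO %s]", from, to)
}

# Combine a range with a single year, as in the last examples.
dates <- c(date_range("1930", "1940"), "1945")
```

The resulting vector can be passed as the dates argument of search_concordance_ex.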
Prerequisite: Your PDF file must be in a folder accessible from your R script.
Basic usage:
pdf_df <- load_pdf_as_df("./example-en.pdf")
dplyr::slice(pdf_df, 1:10)
Only consider pages with a minimum of 100 characters:
pdf_df <- load_pdf_as_df("./example-en.pdf", min_text_length = 100)
dplyr::slice(pdf_df, 1:10)
Change the identifier prefix:
pdf_df <- load_pdf_as_df("./example-en.pdf", min_text_length = 100, identifier="example-en")
dplyr::slice(pdf_df, 1:10)
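Since the PDF must be reachable from the R script, a quick path check before loading can give a clearer error than a failed load. A minimal sketch in base R follows; resolve_pdf_path is a hypothetical helper, not part of enpchina.

```r
# Hypothetical helper (not part of enpchina): fail with a readable message
# when the PDF cannot be found, otherwise return its absolute path.
resolve_pdf_path <- function(path) {
  if (!file.exists(path)) {
    stop("PDF not found: ", path, " (working directory: ", getwd(), ")")
  }
  normalizePath(path)
}

# Demonstration with a temporary placeholder file standing in for a real PDF.
placeholder <- tempfile(fileext = ".pdf")
file.create(placeholder)
checked_path <- resolve_pdf_path(placeholder)
```

The checked path could then be given to load_pdf_as_df in place of the relative path.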
Prerequisite: Have a dataframe that contains at least one column with text.
In the following examples, we’ll use the pdf_df dataframe generated in the previous example.
Basic usage (on English texts):
# The second argument is the name of the column on which to apply NER
ner_df <- ner_on_df(dplyr::slice(pdf_df, 1:10), "Text", model = "eng")
## 10/10
ner_df
Specify the id column:
ner_df <- ner_on_df(dplyr::slice(pdf_df, 1:10), "Text", id_column="Id", model = "eng")
## 10/10
ner_df
On Chinese texts:
pdfzh_df <- load_pdf_as_df("./example-zh.pdf", min_text_length = 10, identifier="example-zh")
## PDF error: Can't get Fields array
nerzh_df <- ner_on_df(dplyr::slice(pdfzh_df, 1:10), "Text", id_column="Id", model = "mdn")
## 10/10
nerzh_df
ner_on_df can produce duplicates.
proquest. Fix coming soon.
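Since ner_on_df can produce duplicates, a possible workaround is to drop exact duplicate rows from its output. The sketch below uses a toy dataframe; the column names (Id, Entity, Type) are assumptions for illustration and may not match the columns actually returned by ner_on_df.

```r
# Toy stand-in for NER output; the column names are assumed, not taken
# from the actual ner_on_df result.
ner_example <- data.frame(
  Id = c("doc-1", "doc-1", "doc-2"),
  Entity = c("Shanghai", "Shanghai", "Beijing"),
  Type = c("LOC", "LOC", "LOC"),
  stringsAsFactors = FALSE
)

# Keep only the first occurrence of each fully identical row.
ner_unique <- ner_example[!duplicated(ner_example), , drop = FALSE]
rownames(ner_unique) <- NULL
```

The same duplicated() filter can be applied to a real result such as ner_df, whatever its columns are, because it compares entire rows.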