search_documents_ex and search_concordance_ex
list_search_fields and accepts_date_queries
ner_on_df
load_pdf_as_df
get_padagraph_url
list_corpora now only displays collections that can be queried

search_concordance_ex <- function(q,
                                  corpus="imh",
                                  search_fields=c(),
                                  context_size=30,
                                  start=0,
                                  dates=c())
q: the search query (same as in search_concordance)
corpus: name of the corpus to search in (use list_corpora to get possible names)
search_fields: name(s) of the field(s) to search in (use list_search_fields to get the available search fields). If no field is given explicitly, all possible fields are searched by default.
context_size: size of the context window (same as in search_concordance)
start: index of the first row to retrieve. Only useful when the search query returns more than 100,000 results.
dates: date(s) used to filter the search results. Can be date ranges. Only works with corpora that include dates (see accepts_date_queries).

List possible search fields in a corpus:
# For the shunpao corpus
enpchina::list_search_fields("shunpao")
## [1] "text" "title"
# For the imh corpus
enpchina::list_search_fields("imh")
## [1] "book" "page" "date" "story" "bookno"
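Because the available fields differ from one corpus to another (the two listings above share no field name), it can help to check the requested fields before running a search. A minimal sketch, assuming a hypothetical helper named check_search_fields that is not part of enpchina; its available argument is meant to receive the vector returned by list_search_fields:

```r
# Hypothetical helper (not part of enpchina): stop early when a requested
# field is not among the fields available for the corpus.
check_search_fields <- function(requested, available) {
  unknown <- setdiff(requested, available)
  if (length(unknown) > 0) {
    stop("Unknown search field(s): ", paste(unknown, collapse = ", "),
         ". Available: ", paste(available, collapse = ", "))
  }
  invisible(requested)
}

# Field names copied from the shunpao listing above.
shunpao_fields <- c("text", "title")
check_search_fields("title", shunpao_fields)    # passes silently
# check_search_fields("story", shunpao_fields)  # would raise an error
```

In real use, the second argument would be enpchina::list_search_fields("shunpao") rather than a hard-coded vector.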
Tell if a corpus supports date filters:
# On the shunpao corpus
enpchina::accepts_date_queries("shunpao")
## [1] TRUE
# On the wikibio corpus
enpchina::accepts_date_queries("wikibio-zh")
## [1] FALSE
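This check can be used to guard a date filter before it is passed to a search function. A minimal sketch, assuming a hypothetical wrapper named dates_if_supported (not part of enpchina); the accepts_fn argument is an illustrative addition that defaults to accepts_date_queries and can be replaced by a stub for offline testing:

```r
# Hypothetical wrapper (not part of enpchina): drop the date filter, with a
# warning, when the corpus does not accept date queries.
dates_if_supported <- function(corpus, dates,
                               accepts_fn = enpchina::accepts_date_queries) {
  if (length(dates) > 0 && !isTRUE(accepts_fn(corpus))) {
    warning("Corpus '", corpus, "' does not accept date queries; ignoring dates.")
    return(c())
  }
  dates
}

# Intended use (requires enpchina and access to the corpora):
# search_concordance_ex("\"蔣介石\"", corpus = "shunpao",
#                       dates = dates_if_supported("shunpao", "1933"))
```

Returning an empty vector matches the default value of the dates argument, so the search simply runs unfiltered.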
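Two types of dates can be used: single dates (a year such as "1933", or a year and month such as "1933-03") and date ranges (such as "[1930 TO 1940]"). When several dates are given in a vector, documents matching any of them are returned.
Examples: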
Search for 蔣介石 in the titles of the shunpao collection:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields="title")
Id <chr> | Date <chr> | Title <chr> | Source <chr> | |
---|---|---|---|---|
SPSP193303261314 | 19330326 | 蔣介石 | shunpao | |
SPSP193206070403 | 19320607 | 蔣介石 | shunpao | |
SPSP192808060404 | 19280806 | 蔣介石返京 | shunpao | |
SPSP192609011703 | 19260901 | 蔣介石軼事 | shunpao | |
SPSP192709190402 | 19270919 | 蔣介石昨抵杭 | shunpao | |
SPSP193303290819 | 19330329 | 馮致蔣介石函 | shunpao | |
SPSP193201290811 | 19320129 | 蔣介石緩返奉 | shunpao | |
SPSP193201140414 | 19320114 | 蔣介石昨蒞杭 | shunpao | |
SPSP192911020403 | 19291102 | 蔣介石到許昌 | shunpao | |
SPSP192801060702 | 19280106 | 蔣介石過鎭記 | shunpao |
Search for 蔣介石 in the titles OR in the content of the shunpao collection:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", search_fields=c("title", "text"))
Id <chr> | Date <chr> | Title <chr> | Source <chr> | |
---|---|---|---|---|
SPSP193303261314 | 19330326 | 蔣介石 | shunpao | |
SPSP193303261314 | 19330326 | 蔣介石 | shunpao | |
SPSP192808060404 | 19280806 | 蔣介石返京 | shunpao | |
SPSP192808060404 | 19280806 | 蔣介石返京 | shunpao | |
SPSP192711100420 | 19271110 | 歸國途中之蔣介石 | shunpao | |
SPSP192711100420 | 19271110 | 歸國途中之蔣介石 | shunpao | |
SPSP192711100420 | 19271110 | 歸國途中之蔣介石 | shunpao | |
SPSP192711100420 | 19271110 | 歸國途中之蔣介石 | shunpao | |
SPSP192710220709 | 19271022 | 蔣介石在日之行蹤 | shunpao | |
SPSP192710220709 | 19271022 | 蔣介石在日之行蹤 | shunpao |
Same result without search_fields (because the shunpao corpus only has 2 search fields):
search_concordance_ex("\"蔣介石\"", corpus="shunpao")
Id <chr> | Date <chr> | Title <chr> | Source <chr> | |
---|---|---|---|---|
SPSP193303261314 | 19330326 | 蔣介石 | shunpao | |
SPSP193303261314 | 19330326 | 蔣介石 | shunpao | |
SPSP192808060404 | 19280806 | 蔣介石返京 | shunpao | |
SPSP192808060404 | 19280806 | 蔣介石返京 | shunpao | |
SPSP192711100420 | 19271110 | 歸國途中之蔣介石 | shunpao | |
SPSP192711100420 | 19271110 | 歸國途中之蔣介石 | shunpao | |
SPSP192711100420 | 19271110 | 歸國途中之蔣介石 | shunpao | |
SPSP192711100420 | 19271110 | 歸國途中之蔣介石 | shunpao | |
SPSP192710220709 | 19271022 | 蔣介石在日之行蹤 | shunpao | |
SPSP192710220709 | 19271022 | 蔣介石在日之行蹤 | shunpao |
Search for 蔣介石 in documents from 1933:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="1933")
Id <chr> | Date <chr> | |
---|---|---|
SPSP193303261314 | 19330326 | |
SPSP193303261314 | 19330326 | |
SPSP193303290819 | 19330329 | |
SPSP193305161106 | 19330516 | |
SPSP193307300307 | 19330730 | |
SPSP193303050850 | 19330305 | |
SPSP193303120602 | 19330312 | |
SPSP193303120602 | 19330312 | |
SPSP193303100303 | 19330310 | |
SPSP193303072306 | 19330307 |
Search for 蔣介石 in documents from March 1933:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="1933-03")
Id <chr> | Date <chr> | |
---|---|---|
SPSP193303261314 | 19330326 | |
SPSP193303261314 | 19330326 | |
SPSP193303290819 | 19330329 | |
SPSP193303050850 | 19330305 | |
SPSP193303120602 | 19330312 | |
SPSP193303120602 | 19330312 | |
SPSP193303100303 | 19330310 | |
SPSP193303072306 | 19330307 | |
SPSP193303280905 | 19330328 | |
SPSP193303301004 | 19330330 |
Search for 蔣介石 in documents from 1933 OR 1940:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates=c("1933", "1940"))
Id <chr> | Date <chr> | |
---|---|---|
SPSP193303261314 | 19330326 | |
SPSP193303261314 | 19330326 | |
SPSP193303290819 | 19330329 | |
SPSP193305161106 | 19330516 | |
SPSP193307300307 | 19330730 | |
SPSP193303050850 | 19330305 | |
SPSP193303120602 | 19330312 | |
SPSP193303120602 | 19330312 | |
SPSP193303100303 | 19330310 | |
SPSP193303072306 | 19330307 |
Search for 蔣介石 in documents from 1930 to 1940:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates="[1930 TO 1940]")
Id <chr> | Date <chr> | |
---|---|---|
SPSP193303261314 | 19330326 | |
SPSP193303261314 | 19330326 | |
SPSP193201140414 | 19320114 | |
SPSP193201140414 | 19320114 | |
SPSP193204150719 | 19320415 | |
SPSP193204150719 | 19320415 | |
SPSP193203140121 | 19320314 | |
SPSP193203140121 | 19320314 | |
SPSP193202110833 | 19320211 | |
SPSP193202110833 | 19320211 |
Search for 蔣介石 in documents from 1930 to 1940 OR from 1945:
search_concordance_ex("\"蔣介石\"", corpus="shunpao", dates=c("[1930 TO 1940]", "1945"))
Id <chr> | Date <chr> | |
---|---|---|
SPSP193303261314 | 19330326 | |
SPSP193303261314 | 19330326 | |
SPSP193201140414 | 19320114 | |
SPSP193201140414 | 19320114 | |
SPSP193204150719 | 19320415 | |
SPSP193204150719 | 19320415 | |
SPSP193203140121 | 19320314 | |
SPSP193203140121 | 19320314 | |
SPSP193202110833 | 19320211 | |
SPSP193202110833 | 19320211 |
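The range strings used in these calls can also be built from variables instead of being typed by hand. A minimal sketch, assuming a hypothetical helper named date_range (not part of enpchina) that only reproduces the "[START TO END]" syntax shown in the examples; only year bounds appear in the source, so other bound formats are untested here:

```r
# Hypothetical helper (not part of enpchina): format a date range in the
# "[START TO END]" syntax used in the examples above.
date_range <- function(from, to) {
  sprintf("[%s TO %s]", from, to)
}

# Combine a range and a single year, as in the call above.
dates <- c(date_range(1930, 1940), "1945")
print(dates)
## [1] "[1930 TO 1940]" "1945"
```

The resulting vector can be passed as the dates argument of search_concordance_ex.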
Search for 蔣介石 in document titles from 1930 to 1940 OR from 1945:
search_concordance_ex("\"蔣介石\"", corpus="shunpao",
search_fields="title",
dates=c("[1930 TO 1940]", "1945"))
Id <chr> | Date <chr> | |
---|---|---|
SPSP193303261314 | 19330326 | |
SPSP193206070403 | 19320607 | |
SPSP193303290819 | 19330329 | |
SPSP193201290811 | 19320129 | |
SPSP193201140414 | 19320114 | |
SPSP193203140121 | 19320314 | |
SPSP193204150719 | 19320415 | |
SPSP193206120408 | 19320612 | |
SPSP193206200410 | 19320620 | |
SPSP193203230610 | 19320323 |
Search for 蔣介石 in document texts from 1930 to 1940 OR from 1945:
search_concordance_ex("\"蔣介石\"", corpus="shunpao",
search_fields="text",
dates=c("[1930 TO 1940]", "1945"))
Id <chr> | Date <chr> | Title <chr> | Source <chr> | |
---|---|---|---|---|
SPSP194501130214 | 19450113 | 宋子文入閣與重慶的煩悶 吉田東祐 | shunpao | |
SPSP194501130214 | 19450113 | 宋子文入閣與重慶的煩悶 吉田東祐 | shunpao | |
SPSP194501130214 | 19450113 | 宋子文入閣與重慶的煩悶 吉田東祐 | shunpao | |
SPSP194501130214 | 19450113 | 宋子文入閣與重慶的煩悶 吉田東祐 | shunpao | |
SPSP194501130214 | 19450113 | 宋子文入閣與重慶的煩悶 吉田東祐 | shunpao | |
SPSP194501130214 | 19450113 | 宋子文入閣與重慶的煩悶 吉田東祐 | shunpao | |
SPSP193703190430 | 19370319 | 旅歐隨筆 | shunpao | |
SPSP193703190430 | 19370319 | 旅歐隨筆 | shunpao | |
SPSP193703190430 | 19370319 | 旅歐隨筆 | shunpao | |
SPSP194503090116 | 19450309 | 舊金山會議 蔣不出席 | shunpao |
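The description of the start parameter suggests that a query with more than 100,000 results has to be retrieved in several calls. A minimal sketch, assuming (this is not stated in the source) that start is a zero-based row offset and that each call returns at most 100,000 rows; fetch_all_concordance and its search_fn argument are hypothetical names, with search_fn defaulting to search_concordance_ex:

```r
# Hypothetical helper (not part of enpchina): call the search function
# repeatedly, advancing `start`, until a page comes back short or empty.
fetch_all_concordance <- function(q, corpus, page_size = 100000,
                                  search_fn = enpchina::search_concordance_ex,
                                  ...) {
  pages <- list()
  start <- 0
  repeat {
    page <- search_fn(q, corpus = corpus, start = start, ...)
    if (nrow(page) == 0) break                 # nothing left to fetch
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < page_size) break          # short page: last one
    start <- start + nrow(page)
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)                        # stack all pages
}

# Intended use (requires enpchina and access to the corpora):
# all_hits <- fetch_all_concordance("\"蔣介石\"", corpus = "shunpao",
#                                   search_fields = "text")
```

If the actual page size differs from 100,000, the page_size argument would need to be adjusted accordingly.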
Prerequisite: Your PDF file must be in a folder accessible from your R script.
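A quick way to verify this prerequisite is to resolve the path before loading the file. A minimal sketch using base R only; resolve_pdf is a hypothetical helper name, not an enpchina function:

```r
# Hypothetical helper (not part of enpchina): fail with a readable message
# when the PDF cannot be found from the current working directory.
resolve_pdf <- function(path) {
  if (!file.exists(path)) {
    stop("Cannot find '", path, "' from working directory ", getwd())
  }
  normalizePath(path)
}

# Intended use:
# pdf_df <- load_pdf_as_df(resolve_pdf("./example-en.pdf"))
```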
Basic usage:
pdf_df <- load_pdf_as_df("./example-en.pdf")
dplyr::slice(pdf_df, 1:10)
Page <int> | |
---|---|
1 | |
2 | |
3 | |
7 | |
9 | |
11 | |
12 | |
13 | |
15 | |
16 |
Only consider pages with a minimum of 100 characters:
pdf_df <- load_pdf_as_df("./example-en.pdf", min_text_length = 100)
dplyr::slice(pdf_df, 1:10)
Page <int> | |
---|---|
1 | |
2 | |
11 | |
12 | |
13 | |
15 | |
16 | |
17 | |
18 | |
19 |
Change the identifier prefix:
pdf_df <- load_pdf_as_df("./example-en.pdf", min_text_length = 100, identifier="example-en")
dplyr::slice(pdf_df, 1:10)
Page <int> | |
---|---|
1 | |
2 | |
11 | |
12 | |
13 | |
15 | |
16 | |
17 | |
18 | |
19 |
Prerequisite: Have a dataframe that contains at least one column with text.
In the following examples, we’ll use the pdf_df
dataframe generated in the previous example.
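The prerequisite can be checked before running NER. A minimal sketch in base R; has_text_column is a hypothetical helper, and the toy dataframe below is invented for illustration (its Id, Text and Page column names follow the calls and tables in this document, but its contents do not come from example-en.pdf):

```r
# Hypothetical helper (not part of enpchina): TRUE when `column` exists in
# the dataframe and holds character data.
has_text_column <- function(df, column) {
  is.data.frame(df) && column %in% names(df) && is.character(df[[column]])
}

# Toy dataframe for illustration only.
toy_df <- data.frame(Id = c("p1", "p2"),
                     Text = c("First page.", "Second page."),
                     Page = 1:2,
                     stringsAsFactors = FALSE)

has_text_column(toy_df, "Text")   # character column
has_text_column(toy_df, "Page")   # integer column, not text
```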
Basic usage (on English texts):
# The second argument is the name of the column on which to apply NER
ner_df <- ner_on_df(dplyr::slice(pdf_df, 1:10), "Text", model = "eng")
## 10/10
ner_df
Type <chr> | |
---|---|
ORGANIZATION | |
PERSON | |
MONEY | |
MISC | |
LOCATION | |
PERSON | |
PERSON | |
LOCATION | |
MISC | |
LOCATION |
Specify the id column:
ner_df <- ner_on_df(dplyr::slice(pdf_df, 1:10), "Text", id_column="Id", model = "eng")
## 10/10
ner_df
Type <chr> | |
---|---|
ORGANIZATION | |
PERSON | |
MONEY | |
MISC | |
LOCATION | |
PERSON | |
PERSON | |
LOCATION | |
MISC | |
LOCATION |
On Chinese texts:
pdfzh_df <- load_pdf_as_df("./example-zh.pdf", min_text_length = 10, identifier="example-zh")
## PDF error: Can't get Fields array
nerzh_df <- ner_on_df(dplyr::slice(pdfzh_df, 1:10), "Text", id_column="Id", model = "mdn")
## 10/10
nerzh_df
Type <chr> | |
---|---|
PER (0.7877) | |
PER (0.7245) | |
PER (0.8904) | |
PER (0.8638) | |
PER (0.8142) | |
PER (0.8904) | |
PER (0.7676) | |
PER (0.9126) | |
PER (0.7965) | |
ORG (0.6444) |
ner_on_df can produce duplicates.
proquest. Fix coming soon.
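Until a fix is available, duplicated rows in a result dataframe can be removed on the user side. A minimal sketch in base R, assuming that duplicates are rows identical in every column; drop_duplicate_rows is a hypothetical helper, and the toy dataframe is invented for illustration (only its Type column name is taken from the tables above):

```r
# Hypothetical helper (not part of enpchina): keep the first occurrence of
# each fully identical row.
drop_duplicate_rows <- function(df) {
  df[!duplicated(df), , drop = FALSE]
}

# Toy data for illustration only.
toy_ner <- data.frame(Id = c("a", "a", "b"),
                      Type = c("PERSON", "PERSON", "LOCATION"),
                      stringsAsFactors = FALSE)

deduped <- drop_duplicate_rows(toy_ner)
nrow(deduped)
## [1] 2
```

The same approach applies to an actual ner_df; dplyr::distinct(ner_df) is an equivalent alternative for dplyr users.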