- search_*_ex functions.
- list_filter_fields lists the available filter fields in a corpus.
- list_possible_filters lists all possible values for a filter field.
- search_* and search_*_ex functions now automatically fetch all results, even for queries that return a lot of rows.
- get_search_fields_content function that can be used to retrieve content from given search fields.
- view_document can be used to show the content of a document with optional highlighting.

By default, when using multiple search fields, the search_*_ex functions apply an OR between the fields. This means that the search query has to be found in at least one of the specified search fields.
You may sometimes want to fetch only the results where the search query is found in all search fields. To do so, set the logical operator to AND with the search_operator argument.
Example (using OR - default behavior):
histtext::search_documents_ex("\"蔣介石\"", corpus = "shunpao",
                              search_fields = c("title", "text"),
                              search_operator = "OR")
Example (using AND):
histtext::search_documents_ex("\"蔣介石\"", corpus = "shunpao",
                              search_fields = c("title", "text"),
                              search_operator = "AND")
In addition to dates (for corpora that support them), it is now possible to filter search queries by using other types of information (such as publishers or a book name).
Before filtering queries, you need to know which filter fields are available for a given corpus.
histtext::list_filter_fields("proquest")
## [1] "date" "publisher" "category"
histtext::list_filter_fields("imh-en")
## [1] "book" "page" "date" "bookno"
Note: Regular search fields may also be used as filter fields.
To filter search queries, it helps to know which values a given filter field can take.
Listing the different publishers found in proquest:
histtext::list_possible_filters("proquest", "publisher")
The first column, Value, contains the possible filters that you can use. The second column, N, gives the number of documents that have the corresponding filter value.
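Once such a table is available, the Value column can be narrowed down with ordinary data frame operations before a filter is built. The following is only a minimal sketch in base R, assuming the result is a data frame with the columns Value and N described above. The filters data frame is a hand-made stand-in: its counts are placeholders and "Some Other Paper" is a made-up publisher, not real output from proquest.

```r
# Stand-in for the result of list_possible_filters("proquest", "publisher").
# The counts are placeholders and "Some Other Paper" is a made-up value;
# neither comes from the real corpus.
filters <- data.frame(
  Value = c("The China Press", "The North China Herald", "Some Other Paper"),
  N = c(300, 200, 5),
  stringsAsFactors = FALSE
)

# Keep only values attached to at least 100 documents, most frequent first.
frequent <- filters[filters$N >= 100, ]
frequent <- frequent[order(-frequent$N), ]

print(frequent$Value)
```

The threshold of 100 is arbitrary; the point is only that the values to filter on can be selected from the table rather than copied by hand.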
Listing the different document categories found in proquest:
histtext::list_possible_filters("proquest", "category")
Once the desired filters are chosen, they can be used with search_documents_ex and search_concordance_ex by creating a named vector (a named list in the examples below) where the names are the fields to filter on and the values are the filters. This named vector is then passed to the filter_query argument of the search_*_ex functions.
Example filtering by book name in imh-zh:
histtext::search_documents_ex("人", "imh-zh",
                              filter_query = list(book = '"上海人名錄"'))
Example filtering by publisher name (The North China Herald) in proquest:
histtext::search_documents_ex('"Students in China"', "proquest",
                              filter_query = list(publisher = '"The North China Herald"'))
Example filtering by publisher name (The China Press) in proquest:
histtext::search_documents_ex('"Students in China"', "proquest",
                              filter_query = list(publisher = '"The China Press"'))
It is possible to filter on multiple values of the same filter field, on multiple fields, or both.
Example filtering by publisher name (The China Press) and by titles which contain “Students”:
histtext::search_documents_ex("CORPORATION", "proquest",
                              search_fields = "fulltext",
                              filter_query = list(publisher = '"The China Press"',
                                                  title = '"Students"'))
Example filtering by two book names:
histtext::search_documents_ex("人", "imh-zh",
                              filter_query = list(book = '"上海人名錄" OR "上海總商會同人錄"'))
Example filtering by two publisher names:
histtext::search_documents_ex('"Students in China"', "proquest",
                              filter_query = list(publisher = '"The China Press" OR "The North China Herald"'))
Example filtering by two publisher names and by the “Article” category name:
histtext::search_documents_ex('"Students in China"', "proquest",
                              filter_query = list(publisher = '"The China Press" OR "The North China Herald"',
                                                  category = 'Article'))
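When the list of values is long, the OR expression can be assembled programmatically instead of typed by hand. The following is a minimal sketch in base R. Note that or_filter is a hypothetical helper written for this illustration, not a histtext function: it wraps each value in double quotes and joins them with OR, producing the same string format as the examples above. It assumes the values contain no double quotes themselves.

```r
# Hypothetical helper (not part of histtext): wrap each value in double
# quotes and join the quoted values with " OR ".
or_filter <- function(values) {
  paste(sprintf('"%s"', values), collapse = " OR ")
}

publishers <- c("The China Press", "The North China Herald")

# Named list in the same form as the filter_query examples above.
filter_query <- list(publisher = or_filter(publishers),
                     category = "Article")

# Show the generated filter string for the publisher field.
cat(filter_query$publisher, "\n")
```

The resulting list could then be passed as the filter_query argument of search_documents_ex, as in the hand-written examples.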
By default, when using multiple filter fields, the search_*_ex functions apply an AND between the fields, which means that a document is returned only if it matches all filters. This behavior can be changed with the filter_operator argument.
Example using “AND” – the default behavior:
histtext::search_documents_ex('"TRADE CORPORATION"', "proquest",
                              search_fields = "fulltext",
                              filter_query = list(title = '"Display Ad"',
                                                  category = '"Advertisement"'),
                              filter_operator = "AND")
Example using “OR”:
histtext::search_documents_ex('"TRADE CORPORATION"', "proquest",
                              search_fields = "fulltext",
                              filter_query = list(title = '"Display Ad"',
                                                  category = '"Advertisement"'),
                              filter_operator = "OR")
Both kinds of filters (dates and field filters) can of course be used together. Documents then have to match both the given dates and the other filters.
Example:
histtext::search_documents_ex('"Students in China"', "proquest",
                              filter_query = list(publisher = '"The China Press" OR "The North China Herald"'),
                              dates = c("1933", "1936", "1941"))
Note: A “date” filter field also appears in the output of the list_filter_fields function. I advise against using this field with the filter_query argument; use the dates argument instead, which is far more powerful.
Before 1.0, the only way to retrieve document content from Solr was the get_documents function. A limitation of this function is that it can only retrieve some predefined fields. Moreover, it sometimes outputs content that is not actually stored in Solr (e.g. titles in IMH).
Version 1.0 adds the function get_search_fields_content, which returns a data frame with the content of the user-specified search (and filter) fields.
Example:
search1 <- histtext::search_documents_ex("\"蔣介石\"", corpus = "shunpao",
                                         search_fields = c("title", "text"),
                                         search_operator = "AND")
histtext::get_search_fields_content(search1, "shunpao",
                                    search_fields = c("text", "title", "page", "num"),
                                    verbose = FALSE)
It is now possible to view the content of a document inside an RStudio panel with the view_document function. It can also highlight a specified query in the document.
Example:
histtext::view_document("SPSP191712011128", "shunpao", query = "\"蔣介石\"")
Currently available at: https://analytics.huma-num.fr/Jeremy.Auguste/rpackage/search/
Features: