1 What’s in the news?

2 New features

  • Added a way to change the logical operator used for search queries that use multiple search fields in search_*_ex functions.
  • Filter queries are now supported by the search_*_ex functions.
  • list_filter_fields lists the filter fields available in a corpus.
  • list_possible_filters lists all possible values for a filter field.
  • search_* and search_*_ex functions now automatically fetch all results, even for queries that return a lot of rows.
  • New get_search_fields_content function that can be used to retrieve content from given search fields.
  • view_document can be used to show the content of a document with optional highlighting.

2.1 Changing the logical operator when using multiple search fields

By default, when using multiple search fields, the search_*_ex functions apply an OR between the fields. This means that the search query only has to be found in at least one of the specified search fields.

You may sometimes want to fetch only the results where the search query is found in all search fields. This can be done by setting the search_operator argument to AND.

Example (using OR - default behavior):

histtext::search_documents_ex("\"蔣介石\"", corpus="shunpao", 
                              search_fields=c("title", "text"),
                              search_operator = "OR")

Example (using AND):

histtext::search_documents_ex("\"蔣介石\"", corpus="shunpao", 
                              search_fields=c("title", "text"),
                              search_operator = "AND")

2.2 Filtering search queries

In addition to dates (for corpora that support them), it is now possible to filter search queries by using other types of information (such as publishers or a book name).

2.2.1 Listing available filter fields

Before filtering queries, you need to know which filter fields are available for a given corpus.

histtext::list_filter_fields("proquest")
## [1] "date"      "publisher" "category"
histtext::list_filter_fields("imh-en")
## [1] "book"   "page"   "date"   "bookno"

Note: Regular search fields may also be used as filter fields.

2.2.2 Listing possible values for a given filter field

In order to filter search queries, it can be helpful to have an idea of the possible filters that you could apply for a given filter field.

Listing the different publishers found in proquest:

histtext::list_possible_filters("proquest", "publisher")

The first column Value contains the possible filters that you can use. The second column N informs you of the number of documents that have the corresponding filter value.

Listing the different document categories found in proquest:

histtext::list_possible_filters("proquest", "category")
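Since the result has the Value and N columns described above, it can be convenient to look only at the most frequent values before choosing a filter. The following is a minimal sketch, assuming the result behaves like a regular data frame; top_filter_values is a hypothetical helper written for this example and is not part of histtext.

```r
# Hypothetical helper (not part of histtext): keep the n most frequent values.
# Assumes `filters` is data-frame-like with a numeric column named N.
top_filter_values <- function(filters, n = 10) {
  # Sort rows by document count, largest first.
  ordered <- filters[order(filters$N, decreasing = TRUE), , drop = FALSE]
  # Keep at most the first n rows.
  head(ordered, n)
}

# Usage (requires access to the HistText server):
# publishers <- histtext::list_possible_filters("proquest", "publisher")
# top_filter_values(publishers, n = 5)
```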

2.2.3 Using filter queries

Once the desired filters are chosen, they can be used with search_documents_ex and search_concordance_ex by creating a named list in which the names are the fields to filter on and the values are the filters. The named list is then passed to the filter_query argument of the search_*_ex functions.

Example filtering by book name in imh-zh:

histtext::search_documents_ex("人", "imh-zh", 
                              filter_query = list(book = '"上海人名錄"'))

Example filtering by publisher name (The North China Herald) in proquest:

histtext::search_documents_ex('"Students in China"', "proquest", 
                              filter_query = list(publisher = '"The North China Herald"'))

Example filtering by publisher name (The China Press) in proquest:

histtext::search_documents_ex('"Students in China"', "proquest", 
                              filter_query = list(publisher = '"The China Press"'))

2.2.4 Using multiple filters

It is possible to filter multiple values on the same filter field and/or to filter on multiple fields.

2.2.4.1 Filtering on multiple different fields

Example filtering by publisher name (The China Press) and by titles which contain “Students”:

histtext::search_documents_ex("CORPORATION", "proquest",
                              search_fields = "fulltext",
                              filter_query = list(publisher = '"The China Press"',
                                                  title = '"Students"'))

2.2.4.2 Filtering with multiple filters on the same field

Example filtering by two book names:

histtext::search_documents_ex("人", "imh-zh", 
                              filter_query = list(book = '"上海人名錄" OR "上海總商會同人錄"'))

Example filtering by two publisher names:

histtext::search_documents_ex('"Students in China"', "proquest", 
                              filter_query = list(publisher = '"The China Press" OR "The North China Herald"'))
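When the values come from a character vector (for example, values picked from the output of list_possible_filters), the quoted OR expression used above can be built programmatically instead of typed by hand. This is only a sketch: or_filter is a hypothetical helper, not a histtext function, and it does not escape double quotes that may appear inside a value.

```r
# Hypothetical helper (not part of histtext): wrap each value in double
# quotes and join the quoted values with " OR ".
or_filter <- function(values) {
  paste0('"', values, '"', collapse = " OR ")
}

publishers <- c("The China Press", "The North China Herald")
publisher_filter <- or_filter(publishers)

# Usage (requires access to the HistText server):
# histtext::search_documents_ex('"Students in China"', "proquest",
#                               filter_query = list(publisher = publisher_filter))
```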

2.2.4.3 Combining both multiple filter fields and multiple filters per field

Example filtering by two publisher names and by the “Article” category name:

histtext::search_documents_ex('"Students in China"', "proquest", 
                              filter_query = list(publisher = '"The China Press" OR "The North China Herald"',
                                                  category = 'Article'))

2.2.5 Changing the logical operator used between filter fields

By default, when using multiple filter fields, the search_*_ex functions apply an AND between the fields, which means that a document is returned only if it satisfies all filters. This behavior can be changed by using the filter_operator argument.

Example using “AND” – the default behavior:

histtext::search_documents_ex('"TRADE CORPORATION"', "proquest",
                              search_fields = "fulltext",
                              filter_query = list(title = '"Display Ad"',
                                                  category = '"Advertisement"'),
                              filter_operator = "AND")

Example using “OR”:

histtext::search_documents_ex('"TRADE CORPORATION"', "proquest",
                              search_fields = "fulltext",
                              filter_query = list(title = '"Display Ad"',
                                                  category = '"Advertisement"'),
                              filter_operator = "OR")

2.2.6 Combining field filters and date filters

Both kinds of filters can be used together. The documents then have to match both the given dates and the other filters.

Example:

histtext::search_documents_ex('"Students in China"', "proquest", 
                              filter_query = list(publisher = '"The China Press" OR "The North China Herald"'),
                              dates = c("1933", "1936", "1941"))

Note: You may notice that list_filter_fields also reports a “date” filter field. I advise against using this field with the filter_query argument; use the dates argument instead, which is far more powerful.

2.3 Retrieving specific content from SolR

Before 1.0, the only way to retrieve the content of documents from SolR was to use the get_documents function. One limitation of this function is that it can only retrieve some predefined fields. Moreover, it sometimes outputs content that is not actually stored in SolR (e.g. titles in IMH).

Version 1.0 adds the function get_search_fields_content that returns a dataframe with the content of the user-specified search (and filter) fields.

Example:

search1 <- histtext::search_documents_ex("\"蔣介石\"", corpus="shunpao", 
                                         search_fields=c("title", "text"),
                                         search_operator = "AND")
histtext::get_search_fields_content(search1, "shunpao",
                                    search_fields = c("text", "title", "page", "num"),
                                    verbose = FALSE)
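Because the function returns a data frame, the retrieved content can be saved for later use with base R. The sketch below assumes nothing beyond that; save_content_csv is a hypothetical helper, not part of histtext, and the output file name is only an illustration.

```r
# Hypothetical helper (not part of histtext): write a data frame to a
# UTF-8 encoded CSV file without row names, and return the path invisibly.
save_content_csv <- function(content, path) {
  utils::write.csv(content, file = path, row.names = FALSE,
                   fileEncoding = "UTF-8")
  invisible(path)
}

# Usage (requires access to the HistText server):
# content <- histtext::get_search_fields_content(search1, "shunpao",
#                                                search_fields = c("text", "title"))
# save_content_csv(content, "shunpao_content.csv")
```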

2.4 Viewing a document in RStudio

It is now possible to view the content of a document inside an RStudio panel by using the view_document function. It can also highlight a specified query in the document.

Example:

histtext::view_document("SPSP191712011128", "shunpao", query = "\"蔣介石\"")

2.5 Other “behind-the-scenes” changes

  • Removed hard limit of 100,000 results
  • Big queries should have fewer timeout issues
  • Authentication changes (no influence on users of the private repository)

3 Corpora news and updates

  • Added new corpora: dongfangzz (zh), zhanggangdiary (zh), ncbras (en - partial), reports (en & fr)
  • proquest changes: added a new “category” field

4 HistText R Shiny User Interface

Currently available at: https://analytics.huma-num.fr/Jeremy.Auguste/rpackage/search/

Features:

  • Allows searching documents from a query
  • Allows retrieving documents as a CSV file
  • Implements most of the advanced features of the search_*_ex functions

5 What about the future of HistText?

  • Overhaul the NER pipeline:
    • allow the user to choose between multiple models (spacy, stanford corenlp, in-house)
    • cache previously computed results
    • make output formats consistent
  • Improve error handling
  • I’m sure you’ll all have many more feature ideas