Abstract
This manual presents the functionalities of the HistText Shiny Application built on the histtext R library. It consists of two main sections: (1) Search and Documents Retrieval (2) Natural Language Processing Tools.
The HistText interface is a companion application to the HistText R library, developed by Jeremy Auguste and enhanced by Baptiste Blouin (Ph.D., postdoctoral researchers in the ENP-China Project [https://www.enpchina.eu/]). The application relies on R Shiny.
Links to the HistText interfaces:
Each interface page provides a link to the other one (right below the page title).
The following section presents the basic operations to query the corpora and to retrieve the results as a list of documents with or without their full text.
On the main menu, use the field “Choose a corpus/collection” to select the corpus that you want to query. Corpora are listed in a scrolling menu. They are displayed under code names. Click on the “?” on the right-hand side to obtain details on each corpus.
Enter your query term(s) in the “Query” field as shown in the example below. Click on the “?” on the right-hand side to obtain details on the query syntax. The queried terms must be placed between quotation marks.
By default, the search will return a list of documents in which the queried terms appear. The results are displayed in the “Search Results” tab in groups of ten rows. The total number of results appears below the table. In the example below, we query the expression “社會調查” in the Shenbao:
The table consists of four columns:
The results window provides a search field to search in the various columns.
Boolean operators: you can also use Boolean operators to query multiple terms at the same time. By default, if you enter multiple terms separated by a space, the OR operator applies: the search returns documents that contain any of the terms.
Use the “AND” operator (always in capital letters) to retrieve documents in which the two terms appear together.
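The difference between the two operators can be illustrated with a minimal Python sketch. It does not call HistText; the function name `matches` and the three toy documents are made up for illustration, and matching is reduced to a plain substring test.

```python
def matches(text, terms, operator="OR"):
    """Return True if `text` satisfies the query under the given operator."""
    hits = [term in text for term in terms]
    # AND requires every term to be present; OR requires at least one.
    return all(hits) if operator == "AND" else any(hits)


# Toy documents (invented for this example, not taken from any corpus).
docs = {
    "doc1": "社會調查報告",
    "doc2": "社會新聞",
    "doc3": "調查結果",
}

either = [d for d, t in docs.items() if matches(t, ["社會", "調查"], "OR")]
both = [d for d, t in docs.items() if matches(t, ["社會", "調查"], "AND")]

print(either)  # all three documents contain at least one of the terms
print(both)    # only doc1 contains both terms
```

With OR the query is broad and returns more documents; with AND it is narrower, which is usually what you want when the two terms must co-occur.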
The other mode of querying a corpus, concordance, searches for a term in the context of the sentence in which it appears. Check the “Concordance” button (below “Documents”). A new field, “Context size”, appears, where you can set the total number of characters displayed, including the queried term(s). By default, the context size is set at 30. This means that if you query a four-character term such as 社會調查, the results will display 13 characters before and 13 characters after the queried term (30 - 4 = 26 characters of context, split evenly between the two sides). Note that in Chinese a sinogram counts as one character, whereas in Latin scripts (English, French, etc.) a letter counts as one character.
The output is similar to the table generated by the “search document” function, with three additional columns containing the queried term(s) (Matched) and the text before (Before) and after (After) the queried term(s). In the table below, each row no longer represents a unique document, but each occurrence of the queried term(s) in the documents. The concordance table usually contains more rows than the table of documents, since the queried term(s) may appear several times in the same document.
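The context-size arithmetic and the one-row-per-occurrence layout can be sketched in Python as follows. This is an illustration only, not the application's implementation: the function name `concordance` is hypothetical, and the even split of the remaining characters between the two sides is an assumption taken from the 30 - 4 = 26 example in this manual.

```python
def concordance(text, term, context_size=30):
    """Return one (before, matched, after) tuple per occurrence of `term`.

    Assumption: context_size is the total window length including the term,
    and the remaining characters are split evenly between both sides.
    """
    side = max((context_size - len(term)) // 2, 0)
    rows = []
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        before = text[max(0, start - side):start]
        after = text[end:end + side]
        rows.append((before, term, after))
        # Continue searching after the current occurrence.
        start = text.find(term, end)
    return rows


# Toy text: 20 filler characters on each side of the queried term.
text = "甲" * 20 + "社會調查" + "乙" * 20
for before, matched, after in concordance(text, "社會調查"):
    print(len(before), matched, len(after))  # prints: 13 社會調查 13
```

Because each occurrence produces its own row, a document in which the term appears three times contributes three rows to the concordance table.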
You can retrieve the full text of the documents that you have obtained through Documents search or Concordance search by checking the “Also Retrieve Documents Content” box.
The output is similar to the table generated by the “search document” function, with one additional column “Text” containing the full text of the document(s).
The HistText application provides advanced search functions that we describe below. These advanced search functions apply to search fields or to filters on the results. These options vary according to the queried corpus because each corpus comes with a different set of metadata.
In Advanced Options, you can select the fields that you want to query. By default, all the fields of a given corpus are selected. You can focus your query by deselecting any of the fields that you want to exclude. In the example below, the query on Dongfang zazhi (dongfangzz) applies only to the Title field.
By default, the search will retrieve documents in which the queried term(s) are present in any of the selected fields. If you check the “Must Match for All Search Fields” box, the queried term(s) must appear in all the selected fields.
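The effect of the “Must Match for All Search Fields” box can be pictured with a short Python sketch. The function `document_matches`, the field names `title` and `text`, and the sample record are all hypothetical; the application itself may match terms differently than a plain substring test.

```python
def document_matches(doc, term, fields, must_match_all=False):
    """Check whether `term` occurs in any (default) or all selected fields."""
    hits = [term in doc.get(field, "") for field in fields]
    return all(hits) if must_match_all else any(hits)


# Invented record: the term appears in the title but not in the body text.
doc = {"title": "社會調查", "text": "本文討論經濟問題"}

print(document_matches(doc, "社會調查", ["title", "text"]))
print(document_matches(doc, "社會調查", ["title", "text"], must_match_all=True))
```

The first call succeeds because one field is enough by default; the second fails because the term is missing from the `text` field.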
There are two main types of filters: “Filter by date” can be applied to any corpus; “Query filters” offer filter options that vary according to the metadata of the selected corpus.
To filter by date, click on the “Filter by date” box. You can select the date range that will apply in the two fields of “Date range”. The dates need to be expressed in the standard format of the interface (YYYY-MM-DD).
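As an illustration of the YYYY-MM-DD format, the following Python sketch checks whether a document date falls within a range. The function name `in_date_range` is hypothetical, and treating both bounds as inclusive is an assumption, not a documented behavior of the interface.

```python
from datetime import date


def in_date_range(doc_date, start, end):
    """Return True if doc_date lies between start and end (both inclusive).

    All three arguments are strings in YYYY-MM-DD format; a string in
    another format, such as 1925/05/30, raises ValueError.
    """
    day = date.fromisoformat(doc_date)
    return date.fromisoformat(start) <= day <= date.fromisoformat(end)


print(in_date_range("1925-05-30", "1925-01-01", "1925-12-31"))
print(in_date_range("1926-01-01", "1925-01-01", "1925-12-31"))
```

Parsing the strings into dates, rather than comparing them as text, makes a malformed date fail loudly instead of silently producing a wrong filter.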
To apply the other types of filters, click on the scrolling menus under “Query Filters”. The available categories vary according to each corpus. In the example below, we queried the ProQuest Chinese Newspapers Collection (CNC). Two facets are available: “Publisher” (name of the periodical) and “Category” (type of article). You can select multiple periodicals or multiple categories. You can also combine multiple facets. In the example below, we selected two periodicals (The China Critic, The China Weekly Review) and two categories of articles (Advertisements and Classified Advertisements).
You can also check the “Not” boxes to exclude the selected “Publishers” or the selected “Categories”. The query will then apply to all the other “Publishers” or “Categories”.
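The logic of combining facets and of the “Not” boxes can be sketched in Python. The function `apply_facet`, the lowercase field names, and the sample rows (including “Other Periodical” and “Other Category”) are placeholders made up for this example; they do not reflect the actual metadata of the collection.

```python
def apply_facet(rows, field, selected, negate=False):
    """Keep rows whose `field` value is in `selected`, or not in it if negate."""
    selected = set(selected)
    return [row for row in rows if (row[field] in selected) != negate]


# Invented sample rows.
rows = [
    {"publisher": "The China Critic", "category": "Advertisements"},
    {"publisher": "The China Weekly Review", "category": "Other Category"},
    {"publisher": "Other Periodical", "category": "Advertisements"},
]

periodicals = ["The China Critic", "The China Weekly Review"]

# Combining two facets: chain one filter after the other.
kept = apply_facet(rows, "publisher", periodicals)
kept = apply_facet(kept, "category", ["Advertisements"])

# "Not" box: everything except the selected periodicals.
excluded = apply_facet(rows, "publisher", periodicals, negate=True)

print(kept)
print(excluded)
```

Within one facet, the selected values act as alternatives; across facets, a row must pass every filter.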
To export only the list of documents returned by a query, click on the “Download (Search results)” button. To export the full text of the documents, first select “Also Retrieve Documents Content”, then click on the “Download (Documents content)” button. The results can be downloaded as a CSV file or a TSV file.
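Once downloaded, the two formats differ only in their delimiter (comma for CSV, tab for TSV). The Python sketch below parses the content of such a file with the standard `csv` module. The function `parse_export` and the column names `DocId` and `Title` are hypothetical; check the header row of your own export for the actual column names.

```python
import csv
import io


def parse_export(text, fmt="csv"):
    """Parse exported content into a list of dictionaries, one per row."""
    delimiter = "\t" if fmt == "tsv" else ","
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Invented TSV content with a header row and one record.
sample = "DocId\tTitle\n1\tA, B\n"
for row in parse_export(sample, "tsv"):
    print(row["DocId"], row["Title"])
```

TSV can be the more convenient choice when titles or full texts contain many commas, since tabs rarely occur inside the fields.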
When you check the “Also Retrieve Documents Stats” box while conducting a search, you gain access to a set of statistics tailored to the selected collection.
You can obtain statistics on how frequently your search term occurs in the text field. This provides a quick visualization of how often the term appears within the retrieved documents.
You can also obtain statistics on the occurrences of the query in the text fields by year.
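A per-year count of this kind can be reproduced from exported documents with a few lines of Python. This is a sketch under stated assumptions: the function `occurrences_by_year` is hypothetical, each document is assumed to carry a `date` string in YYYY-MM-DD format and a `text` field, and the sample documents are invented.

```python
from collections import Counter


def occurrences_by_year(docs, term):
    """Count non-overlapping occurrences of `term` in the text, per year."""
    counts = Counter()
    for doc in docs:
        year = doc["date"][:4]  # first four characters of YYYY-MM-DD
        counts[year] += doc["text"].count(term)
    return dict(sorted(counts.items()))


# Invented sample documents.
docs = [
    {"date": "1925-05-30", "text": "社會調查與社會調查"},
    {"date": "1925-07-01", "text": "社會調查"},
    {"date": "1926-01-01", "text": "無"},
]

print(occurrences_by_year(docs, "社會調查"))
```

Note that this counts occurrences rather than documents: a document that mentions the term twice adds two to its year.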
If the collection incorporates multiple sources, you can visualize the number of search results per source. This shows how the results are distributed among the different sources within the collection.
Additionally, you can compare the number of occurrences of the searched terms to the total number of words in the retrieved documents. This indicates how significant the searched terms are, in terms of frequency of usage, within those documents.
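The ratio described here is the number of occurrences divided by the total number of words. The Python sketch below computes it for a couple of invented English sentences. The function `occurrence_rate` is hypothetical, and splitting on whitespace is a simplifying assumption: it does not work for Chinese text, which has no spaces between words and requires a tokenizer.

```python
def occurrence_rate(texts, term):
    """Return occurrences of `term` divided by the total number of words.

    Words are obtained by lowercasing and splitting on whitespace, which is
    only a rough approximation for space-delimited languages.
    """
    tokens = [tok for text in texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


# Invented sample texts: 2 occurrences of "survey" among 8 words.
texts = ["the survey of social survey", "no match here"]
print(occurrence_rate(texts, "survey"))  # prints: 0.25
```

A rate is easier to compare across periods or sources than a raw count, because it is not inflated simply by a larger volume of text.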