1 Introduction

The HistText Interface is a companion application to the HistText R library, developed by Jeremy Auguste and enhanced by Baptiste Blouin (both Ph.D., postdoctoral researchers in the ENP-China Project [https://www.enpchina.eu/]). The application is built with R Shiny. It consists of two main sections: (1) Search and Documents Retrieval and (2) Natural Language Processing Tools.

Links to the HistText interfaces:

Each interface page provides a link to the other one (right below the page title).

2 Search functions

This section presents the basic operations for querying the corpora and retrieving the results as a list of documents, with or without their full text.

2.1 Corpus selection

On the main menu, use the field “Choose a corpus/collection” to select the corpus that you want to query. Corpora are listed in a scrolling menu. They are displayed under code names. Click on the “?” on the right-hand side to obtain details on each corpus.

2.3 Concordance

The other way of querying a corpus, concordance, searches for a term in the context of the sentence in which it appears. Select the “Concordance” button (below “Documents”). A new field, “Context size”, appears, where you can set the total number of characters displayed for each match, including the queried term(s). By default, the context size is set to 30. This means that if you query a four-character term such as 社會調查, the results will display 13 characters before and 13 characters after the queried term (30 - 4 = 26 context characters, split evenly between the two sides). Note that in Chinese a sinogram counts as one character, whereas in Latin scripts (English, French, etc.) a letter counts as one character.


The output is similar to the table generated by the “search document” function, with three additional columns containing the queried term(s) (Matched) and the text before (Before) and after (After) the queried term(s). In the concordance table, each row no longer represents a unique document but a single occurrence of the queried term(s) in the documents. The concordance table therefore usually contains more rows than the table of documents, since the queried term(s) may appear several times in the same document.
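The concordance logic described above can be illustrated with a minimal Python sketch. It is not the HistText implementation: the function name `concordance` and the `DocId` key are hypothetical, and the sketch assumes that the context size counts the matched term itself, with the remaining characters split evenly before and after the match, as in the 社會調查 example.

```python
def concordance(doc_id, text, term, context_size=30):
    """Return one row per occurrence of `term` in `text`.

    Assumption: `context_size` includes the matched term, so the
    remaining characters are split evenly before and after the match.
    """
    # Characters shown on each side, e.g. (30 - 4) // 2 = 13.
    side = max(context_size - len(term), 0) // 2
    rows = []
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        rows.append({
            "DocId": doc_id,  # hypothetical identifier column
            "Before": text[max(start - side, 0):start],
            "Matched": term,
            "After": text[end:end + side],
        })
        # Continue searching after the current match.
        start = text.find(term, end)
    return rows


# One document containing the term twice yields two rows.
sample = "甲" * 20 + "社會調查" + "乙" * 20 + "社會調查" + "丙" * 5
for row in concordance("doc1", sample, "社會調查"):
    print(row["Before"], "|", row["Matched"], "|", row["After"])
```

Because Python strings are sequences of Unicode characters, a sinogram and a Latin letter each count as one character here, which matches the counting rule stated above.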

2.4 Retrieve full text

You can retrieve the full text of the documents that you have obtained through Documents search or Concordance search by checking the “Also Retrieve Documents Content” box.

The output is similar to the table generated by the “search document” function, with one additional column “Text” containing the full text of the document(s).

2.6 Export results

To export only the list of documents returned by a query, click on the “Download (Search results)” button. To export the full text of the documents, first select “Also Retrieve Documents Content”, then click on the “Download (Documents content)” button. The results can be downloaded as a CSV file or a TSV file.
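Once a file has been downloaded, it can be loaded for further processing. The following Python sketch reads either format using only the standard library. It assumes that the export is UTF-8 encoded, that its first row holds the column names, and that the file extension reflects the format; the function name `read_export` and the file name in the usage line are hypothetical.

```python
import csv
from pathlib import Path


def read_export(path):
    """Load an exported results file as a list of dictionaries.

    Assumptions: UTF-8 encoding, a header row, and a ".tsv" extension
    for tab-separated files (anything else is read as comma-separated).
    """
    path = Path(path)
    delimiter = "\t" if path.suffix.lower() == ".tsv" else ","
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


# Hypothetical usage:
# rows = read_export("search_results.csv")
# print(len(rows), "rows loaded")
```

The TSV format can be the safer choice when titles or full texts contain many commas, although the `csv` module handles quoted commas in CSV files as well.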

3 Search stats

When you tick the “Also Retrieve Documents Stats” box while running a search, the interface displays a set of statistics tailored to the collection you have chosen.

3.1 Occurrence in the Text Fields

You can obtain statistics on how often your search term occurs in the text field. This gives a quick visual overview of the frequency of the term within the retrieved documents.

3.2 Occurrence in the Text Fields by Year

It is also possible to obtain statistics on the occurrences of the query in the text fields, broken down by year.
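As an illustration of what such a yearly breakdown involves, here is a minimal Python sketch that counts occurrences of a term per year. It is only an approximation of the idea, not the computation performed by the interface: the function name `occurrences_by_year` and the `Date` field are hypothetical, and the sketch assumes that each date is a string beginning with a four-digit year.

```python
from collections import Counter


def occurrences_by_year(documents, term):
    """Sum the occurrences of `term` in each document's text, per year.

    Assumption: every document is a dict with a "Date" string that
    starts with the year (e.g. "1925-03-01") and a "Text" string.
    """
    totals = Counter()
    for doc in documents:
        year = doc["Date"][:4]
        totals[year] += doc["Text"].count(term)
    # Sort chronologically for display or plotting.
    return dict(sorted(totals.items()))


docs = [
    {"Date": "1930-01-05", "Text": "社會調查"},
    {"Date": "1925-03-01", "Text": "社會調查與社會調查"},
]
print(occurrences_by_year(docs, "社會調查"))
```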

3.3 Results by sources

If the collection includes multiple sources, you can visualize the number of search results per source. This shows how the search results are distributed among the different sources within the collection.

3.4 Word count percentage

You can also compare the number of occurrences of the searched terms with the total number of words in the retrieved documents. This indicates how prominent the searched terms are, in terms of frequency of use, within those documents.
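The ratio behind this statistic can be sketched in a few lines of Python. The exact computation used by the interface is not documented here, so the following is an assumption-laden illustration: the function name `term_rate` is hypothetical, words are obtained by splitting on whitespace, and matching is case-insensitive. Whitespace splitting only suits space-delimited languages; Chinese text would first need a word segmenter.

```python
def term_rate(texts, term):
    """Return the share of `term` among all words, as a percentage.

    Assumption: words are whitespace-separated tokens, compared
    case-insensitively. This does not apply to unsegmented Chinese.
    """
    total_words = 0
    hits = 0
    for text in texts:
        words = text.lower().split()
        total_words += len(words)
        hits += words.count(term.lower())
    # Avoid dividing by zero when no text was retrieved.
    return 100 * hits / total_words if total_words else 0.0


rate = term_rate(["Strike in Shanghai", "The strike ended today"], "strike")
print(f"{rate:.2f}% of the words")
```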