1 Introduction

The HistText Interface is a companion application to the HistText R library, developed by Jeremy Auguste and enhanced by Baptiste Blouin (Ph.D., postdoctoral researchers in the ENP-China Project [https://www.enpchina.eu/]). The application relies on R Shiny. It consists of two main sections: (1) Search and Documents Retrieval and (2) Natural Language Processing Tools.

Links to the HistText interfaces:

Each interface page provides a link to the other one (right below the page title).

2 Search functions

The following section presents the basic operations to query the corpora and to retrieve the results as a list of documents with or without their full text.

2.1 Corpus selection

On the main menu, use the field “Choose a corpus/collection” to select the corpus that you want to query. Corpora are listed in a scrolling menu. They are displayed under code names. Click on the “?” on the right-hand side to obtain details on each corpus.

2.3 Concordance

The other mode of querying a corpus - concordance - is to search for a term in the context of the sentence in which it appears. Check the “Concordance” button (below “Documents”). A new field, “Context size”, appears, where you can set the total number of characters displayed, queried term(s) included. By default, the context size is set at 30. This means that if you query a four-character term such as 社會調查, the results will display 13 characters before and 13 characters after the queried term (30 - 4 = 26 characters, split evenly on both sides). Note that in Chinese a sinogram counts as one character, whereas in Latin scripts (English, French, etc.) a letter counts as one character.
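The context-size arithmetic can be illustrated with a minimal Python sketch. It is not HistText's actual implementation: the function name `concordance` and the even split of the remaining characters (with integer division) are assumptions drawn from the 30-character example above.

```python
def concordance(text, term, context_size=30):
    """Return one (before, matched, after) tuple per occurrence of `term`.

    Assumption: `context_size` is the total window width, term included,
    and the remaining characters are split evenly on both sides.
    """
    side = max((context_size - len(term)) // 2, 0)
    rows = []
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        before = text[max(start - side, 0):start]  # up to `side` characters before
        after = text[end:end + side]               # up to `side` characters after
        rows.append((before, term, after))
        start = text.find(term, end)               # look for the next occurrence
    return rows


sample = "a" * 20 + "社會調查" + "b" * 20
rows = concordance(sample, "社會調查")
print(len(rows[0][0]), len(rows[0][2]))  # prints "13 13"
```

Because each occurrence yields its own row, a document in which the term appears twice contributes two rows to the result.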


The output is similar to the table generated by the “search document” function, with three additional columns containing the queried term(s) (Matched) and the text before (Before) and after (After) the queried term(s). In the table below, each row no longer represents a unique document, but each occurrence of the queried term(s) in the documents. The concordance table usually contains more rows than the table of documents, since the queried term(s) may appear several times in the same document.

2.4 Retrieve full text

You can retrieve the full text of the documents that you have obtained through Documents search or Concordance search by checking the “Also Retrieve Documents Content” box.

The output is similar to the table generated by the “search document” function, with one additional column “Text” containing the full text of the document(s).

2.6 Export results

There are two ways to export the results of a query. To export only the list of documents, click on the “Download (Search results)” button. To export the full text of the documents, first select “Also Retrieve Documents Content”, then click on the “Download (Documents content)” button. The results can be downloaded as a csv or a tsv file.

3 Search stats

When you tick the “Also Retrieve Documents Stats” box while conducting a search, you get access to a set of statistics tailored to the collection you have chosen.

3.1 Occurrence in the Text Fields

This statistic shows how frequently your search term occurs in the text field, giving a quick visual overview of how often the term appears in the retrieved documents.

3.2 Occurrence in the Text Fields by Year

It is also possible to obtain statistics on the occurrences of the query in the text fields by year.

3.3 Results by sources

If the collection includes multiple sources, you can visualize the number of search results per source. This shows how the search results are distributed among the different sources within the collection.

3.4 Word count percentage

You can also compare the number of occurrences of the searched terms to the total number of words in the retrieved documents. This indicates how significant the searched terms are, in terms of frequency of use, within those documents.
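As a rough illustration of this ratio, the Python sketch below computes the percentage of words that match a search term. The function name `term_percentage` and the whitespace tokenization are assumptions made for the example. How the interface itself counts words is not documented here, and Chinese text would need a word segmenter or a character-based count.

```python
def term_percentage(texts, term):
    """Percentage of words equal to `term` across all `texts`.

    Assumption: words are separated by whitespace and matching is
    case-insensitive. This is an illustration, not HistText's own method.
    """
    total_words = sum(len(text.split()) for text in texts)
    hits = sum(text.lower().split().count(term.lower()) for text in texts)
    return 100 * hits / total_words if total_words else 0.0


texts = ["the survey was a social survey", "no match here"]
print(round(term_percentage(texts, "survey"), 2))  # 2 of 9 words
```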

3.5 Numbers of Characters in Texts

You can also visualize the number of characters in the retrieved documents. The visualization groups the documents by size, giving a quick overview of the distribution of the document sizes to be processed.
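The grouping by size can be sketched as a simple histogram. In the Python example below, the function name `length_bins` and the bin width of 500 characters are hypothetical; the bins actually used by the interface may differ.

```python
from collections import Counter


def length_bins(texts, bin_width=500):
    """Count documents per size bin; keys are the lower bound of each bin.

    Assumption: fixed-width bins of `bin_width` characters.
    """
    bins = Counter((len(text) // bin_width) * bin_width for text in texts)
    return dict(sorted(bins.items()))


docs = ["a" * 120, "b" * 480, "c" * 730]
print(length_bins(docs))  # {0: 2, 500: 1}
```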

3.6 Results by Year

You can also obtain the number of search results per year, which shows how the results are distributed over time.

3.7 Percentage by years over all results

This visualization compares, year by year, the sub-collection obtained through the search to the complete original collection. It shows what share of the entire collection the search results represent in each year.
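One plausible reading of this comparison is the percentage of the collection's documents, per year, that appear in the search results. The Python sketch below follows that reading; the function name `yearly_share` and the sample years are invented for illustration.

```python
from collections import Counter


def yearly_share(result_years, collection_years):
    """For each year of the collection, percentage of documents in the results.

    Assumption: both arguments list one publication year per document.
    """
    results = Counter(result_years)
    collection = Counter(collection_years)
    return {
        year: 100 * results[year] / total
        for year, total in sorted(collection.items())
    }


collection_years = [1920] * 4 + [1921] * 10
result_years = [1920, 1921, 1921]
print(yearly_share(result_years, collection_years))  # {1920: 25.0, 1921: 20.0}
```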

4 Word Cloud

You can also generate a word cloud based on the proximity of words within the documents retrieved for the searched terms. This visualization gives insight into the context surrounding the searched terms and can help expand the initial search by revealing related terms and concepts.

5 Embeddings

To broaden the scope of a search, you can use embeddings. After you load the embeddings by selecting “Load Embedding” and then click “Submit Neighbor”, semantically related words within the collection are identified. Clicking on any of these words adds it to the initial search with an “OR” operator, thereby expanding the search to include additional relevant terms.
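The idea behind neighbor search and query expansion can be sketched in Python as follows. The toy two-dimensional vectors, the function names, and the exact form of the expanded query string are all assumptions. The embeddings and query syntax used by HistText are not described here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbors(term, vectors, k=2):
    """Return the k words whose vectors are closest to that of `term`."""
    others = [word for word in vectors if word != term]
    others.sort(key=lambda word: cosine(vectors[term], vectors[word]), reverse=True)
    return others[:k]


def expand_query(term, selected):
    """Join the original term and the selected neighbors with OR."""
    return " OR ".join([term, *selected])


# Toy embeddings, invented for the example.
vectors = {
    "survey": [1.0, 0.1],
    "census": [0.9, 0.2],
    "inquiry": [0.8, 0.3],
    "banana": [0.0, 1.0],
}
print(neighbors("survey", vectors))          # ['census', 'inquiry']
print(expand_query("survey", ["census"]))    # survey OR census
```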

6 Search NER

You can also preview the functionalities offered by the “Natural Language Processing Tools” page, applied to the first 10 documents only. Check “Also retrieve NER” to access this functionality.

7 Natural Language Processing

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that serves to extract the names of all the real-world entities (persons, organizations, places, time, currency, etc.) mentioned in a corpus of documents. HistText applies a default NER model that varies according to the language (English or Chinese) and the characteristics (nature of the language, OCR, etc.) of the queried corpus.

The default model is based on the software spaCy and the ontology OntoNotes. This model categorizes entities into the following types:

Type          Description
PERSON        People, including fictional
ORGANIZATION  Companies, agencies, institutions, etc.
CARDINAL      Numerals that do not fall under another type
DATE          Absolute or relative dates or periods
EVENT         Named hurricanes, battles, wars, sports events, etc.
FACILITY      Buildings, airports, highways, bridges, etc.
GPE           Countries, cities, states
LAW           Named documents made into laws
LOCATION      Non-GPE locations, mountain ranges, bodies of water
MONEY         Monetary values, including unit
NORP          Nationalities or religious or political groups
ORDINAL       “first”, “second”
PERCENT       Percentage (including “%”)
PRODUCT       Vehicles, weapons, foods, etc. (Not services)
QUANTITY      Measurements, as of weight or distance
TIME          Times smaller than a day
WORK OF ART   Titles of books, songs, etc.

7.1 Input data

There are two modes of importing data, either as a csv/tsv file obtained through the document retrieval interface or by manually entering document ID(s). The csv/tsv file needs to have a DocId column (the other columns are optional). The number of documents in the file is currently capped at 200.
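Before uploading, it can be useful to check that a file meets these two requirements. The Python sketch below does so with the standard `csv` module. The function name `read_doc_ids`, the sample document IDs, and the extra `Title` column are hypothetical.

```python
import csv
import io

MAX_DOCS = 200  # cap stated in the text above


def read_doc_ids(handle, delimiter=","):
    """Return the DocId values of a csv/tsv file, checking the upload rules."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    if "DocId" not in (reader.fieldnames or []):
        raise ValueError("the file must contain a DocId column")
    doc_ids = [row["DocId"] for row in reader if row["DocId"]]
    if len(doc_ids) > MAX_DOCS:
        raise ValueError(f"{len(doc_ids)} documents found, the cap is {MAX_DOCS}")
    return doc_ids


# Hypothetical tsv content; a real file would be opened with open(path, newline="").
sample = io.StringIO("DocId\tTitle\nabc_001\tFirst\nabc_002\tSecond\n")
print(read_doc_ids(sample, delimiter="\t"))  # ['abc_001', 'abc_002']
```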

To apply NER, first select the corpus and upload the csv/tsv file.

Next, click on “Annotate”.

There are two ways to display the results: as a dataframe (a list of the entities in tabular format) or as a visualization (named entities in their context).

7.2 Dataframe

The table of results contains six columns:

  • DocID
  • Type: the type of the entity extracted with its confidence index (measuring the accuracy of the classification by the algorithm)
  • Text: the name of the entity (as given in the source)
  • Start: the position of the character immediately preceding the entity in the text
  • End: the position of the character immediately following the entity in the text
  • Confidence: score of confidence on the classification of entities (scale from -1 to 1)

You can focus on specific types of entity by using the “Filter labels” field.

You can select more than one type in the “Filter labels” field. In the example below, we filter by ORG(anization) and PER(son).

You can also refine the results using the “Minimum confidence” field. The “Search” field serves to query specific terms in all the columns of the results.
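The same filtering can be reproduced on a downloaded results file. The Python sketch below assumes rows with `Type` and `Confidence` columns as listed above. The function name `filter_entities` and the sample rows are invented for illustration, and the label values in a real export may be spelled differently.

```python
def filter_entities(rows, labels=None, min_confidence=None):
    """Keep rows whose Type is in `labels` and whose Confidence is high enough.

    A filter left as None is not applied.
    """
    kept = []
    for row in rows:
        if labels and row["Type"] not in labels:
            continue
        if min_confidence is not None and float(row["Confidence"]) < min_confidence:
            continue
        kept.append(row)
    return kept


# Hypothetical annotation results.
rows = [
    {"Type": "PERSON", "Text": "Sun Yat-sen", "Confidence": "0.95"},
    {"Type": "ORGANIZATION", "Text": "Municipal Council", "Confidence": "0.40"},
    {"Type": "GPE", "Text": "Shanghai", "Confidence": "0.88"},
]
kept = filter_entities(rows, labels={"PERSON", "ORGANIZATION"}, min_confidence=0.5)
print([row["Text"] for row in kept])  # ['Sun Yat-sen']
```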

Finally, you can download the results as a csv file.

7.3 Visualization

The Visualization tab displays the Named entities in their context with color-coded labels.

7.4 Statistics

As with the search feature, you can obtain various statistics on the results of the named entity annotation. They describe the distribution and characteristics of the entities recognized in the annotated documents.

7.4.1 Identity

This statistic gives the number of named entities of each type, showing how the identified entities are distributed across types within the annotated documents.

The “Proportion of type” feature makes it easier to compare the volumes of the various entity types by showing the relative share of each type within the annotated documents.
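Counts and proportions of this kind can also be computed from a downloaded results table. The Python sketch below assumes rows with a `Type` column; the function name `type_proportions` and the sample rows are hypothetical.

```python
from collections import Counter


def type_proportions(rows):
    """Map each entity type to (count, percentage of all entities)."""
    counts = Counter(row["Type"] for row in rows)
    total = sum(counts.values())
    return {label: (n, 100 * n / total) for label, n in counts.most_common()}


# Hypothetical annotation results.
entities = [
    {"Type": "PERSON"},
    {"Type": "GPE"},
    {"Type": "PERSON"},
    {"Type": "DATE"},
]
print(type_proportions(entities))
# {'PERSON': (2, 50.0), 'GPE': (1, 25.0), 'DATE': (1, 25.0)}
```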

7.4.2 Date vs. Count

You can access a similar distribution across entity types, organized by years.

7.4.3 Proportion of Types by DocId

You can access a similar distribution across entity types, organized by DocId.

7.4.4 End

This feature shows where each type of entity is located within the text, indicating whether certain types of entities tend to occur more frequently at the beginning or at the end of documents than others.

7.4.5 Confidence

You can also view the distribution of the model’s confidence scores on the annotated documents. A confidence score closer to 1 indicates higher certainty from the model. This distribution helps estimate how difficult the documents were for the model and gives an indication of the likely number of errors in the results.

7.4.6 Confidence vs. Type

You can view the same distribution for each entity type separately, which allows a more granular assessment of the model’s performance on the different types of entities within the annotated documents.

7.5 Manual input

It is also possible to apply NER to individual documents based on their document ID. Make sure you are querying the relevant corpus.

You can annotate multiple documents at the same time by adding DocId on separate lines.

Finally, you can download the results as a csv file.