Abstract
This manual presents the functionalities of the HistText Shiny Application built on the histtext R library. It consists of two main sections: (1) Search and Documents Retrieval (2) Natural Language Processing Tools.
The HistText Interface is a companion application to the HistText R library developed by Jeremy Auguste and enhanced by Baptiste Blouin (Ph.D., postdoctoral researchers in the ENP-China Project [https://www.enpchina.eu/]). The application relies on R Shiny.
Links to the HistText interfaces:
Each interface page provides a link to the other one (right below the page title).
The following section presents the basic operations to query the corpora and to retrieve the results as a list of documents with or without their full text.
On the main menu, use the field “Choose a corpus/collection” to select the corpus that you want to query. Corpora are listed in a scrolling menu. They are displayed under code names. Click on the “?” on the right-hand side to obtain details on each corpus.
Enter your query term(s) in the “Query” field as shown in the example below. Click on the “?” on the right-hand side to obtain details on the query syntax. The queried terms must be placed between quotation marks.
By default, the search returns a list of documents in which the queried terms appear. The results are displayed in the “Search Results” tab in groups of ten rows. The total number of results appears below the table. In the example below, we query the expression “社會調查” in the Shenbao:
The table consists of four columns:
The results window provides a search field to search in the various columns.
Boolean operators: you can also use Boolean operators to query multiple terms at the same time. By default, if you enter multiple terms separated by a white space, the OR operator applies: the search returns documents that contain any of the terms.
Use the “AND” operator (always in capital letters) to retrieve documents in which the two terms appear together.
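The two behaviours described above can be sketched in a few lines of Python. This is only an illustration of how a query string might be assembled before pasting it into the “Query” field. The helper name `build_query` is hypothetical and is not part of HistText.

```python
def build_query(terms, operator="OR"):
    """Combine search terms into a single query string.

    Each term is wrapped in quotation marks. With "OR", terms are
    separated by a white space (the default behaviour of the interface);
    with "AND", the capitalised operator is written between the terms.
    """
    operator = operator.upper()
    if operator not in ("OR", "AND"):
        raise ValueError("operator must be 'OR' or 'AND'")
    quoted = ['"{}"'.format(term) for term in terms]
    separator = " " if operator == "OR" else " AND "
    return separator.join(quoted)


# Hypothetical example terms
either = build_query(["社會調查", "社會學"])
both = build_query(["社會調查", "社會學"], operator="and")
print(either)
print(both)
```

The function upper-cases the operator because the interface expects “AND” in capital letters.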
The other mode of querying a corpus, concordance, is to search for a term in the context in which it appears. Check the “Concordance” button (below “Documents”). A new field, “Context size”, pops up, where you can set the total number of characters around the queried term(s). By default, the context size is set at 30. This means that if you query a four-character term such as 社會調查, the results will display 13 characters before and 13 characters after the queried term (30 - 4 = 26 characters of context, split evenly). Note that in Chinese a sinogram counts as one character, whereas in Latin scripts (English, French, etc.) a letter counts as one character.
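The context-size arithmetic can be illustrated with a minimal Python sketch. It is not HistText's implementation. It assumes, following the worked example above, that the context size includes the length of the queried term and that the remainder is split evenly on both sides. The function name `concordance` and the sample text are hypothetical.

```python
def concordance(text, term, context_size=30):
    """Return one row per occurrence of `term` in `text`.

    Assumption: the characters available for context are
    context_size - len(term), split evenly before and after the match.
    """
    side = max((context_size - len(term)) // 2, 0)
    rows = []
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        rows.append({
            "Before": text[max(start - side, 0):start],
            "Matched": term,
            "After": text[end:end + side],
        })
        start = text.find(term, end)
    return rows


# Hypothetical text in which the term occurs twice
sample = "甲" * 20 + "社會調查" + "乙" * 20 + "社會調查" + "丙" * 5
rows = concordance(sample, "社會調查")
for row in rows:
    print(len(row["Before"]), row["Matched"], len(row["After"]))
```

Because each occurrence yields its own row, a single document can contribute several rows to the result.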
The output is similar to the table generated by the “search document” function, with three additional columns containing the queried term(s) (Matched) and the text before (Before) and after (After) the queried term(s). In the table below, each row no longer represents a unique document, but each occurrence of the queried term(s) in the documents. The concordance table usually contains more rows than the table of documents, since the queried term(s) may appear several times in the same document.
You can retrieve the full text of the documents that you have obtained through Documents search or Concordance search by checking the “Also Retrieve Documents Content” box.
The output is similar to the table generated by the “search document” function, with one additional column “Text” containing the full text of the document(s).
The HistText application provides advanced search functions that we describe below. These advanced search functions apply to search fields or to filters on the results. The options vary according to the queried corpus because each corpus comes with a different set of metadata.
In Advanced Options, you can select the fields that you want to query. By default, all the fields of a given corpus are selected. You can focus your query by deselecting any of the fields that you want to exclude. In the example below, the query on Dongfang zazhi (dongfangzz) applies only to the Title field.
By default, the search will retrieve documents in which the queried term(s) are present in any of the selected fields. If you check the “Must Match for All Search Fields” box, the queried term(s) must appear in all the selected fields.
There are two main types of filters: “Filter by date” can be applied to any corpus; “Query filters” offer filter options that vary according to the metadata of the selected corpus.
To filter by date, click on the “Filter by date” box. You can select the date range that will apply in the two fields of “Date range”. The dates need to be expressed in the standard format of the interface (YYYY-MM-DD).
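If you prepare date ranges outside the interface, a small check such as the following Python sketch can catch formatting mistakes before you type them in. The helper `parse_date_range` is hypothetical. It only verifies the YYYY-MM-DD format and the order of the two dates.

```python
import re
from datetime import date, datetime

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_date(value):
    """Parse a date written strictly as YYYY-MM-DD."""
    if not DATE_PATTERN.fullmatch(value):
        raise ValueError("expected YYYY-MM-DD, got {!r}".format(value))
    # strptime also rejects impossible dates such as 1920-02-31
    return datetime.strptime(value, "%Y-%m-%d").date()


def parse_date_range(start, end):
    """Return (start, end) as date objects, checking their order."""
    start_date, end_date = parse_date(start), parse_date(end)
    if start_date > end_date:
        raise ValueError("the start date is later than the end date")
    return start_date, end_date


date_range = parse_date_range("1920-01-01", "1929-12-31")
print(date_range)
```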
To apply the other types of filters, click on the scrolling menus under “Query Filters”. The available categories vary according to each corpus. In the example below, we queried the ProQuest Chinese Newspapers Collection (CNC). Two facets are available: “Publisher” (name of the periodical) and “Category” (type of article). You can select multiple periodicals or multiple categories. You can also combine multiple facets. In the example below, we selected two periodicals (The China Critic, The China Weekly Review) and two categories of articles (Advertisements and Classified Advertisements).
You can also check the “Not” boxes to exclude the selected “Publishers” or the selected “Categories”. The query will then apply to all the other “Publishers” or “Categories”.
To export the results of a query, if you want to export only the list of documents, click on the “Download (Search results)” button. If you want to export the full text of the documents after selecting “Also Retrieve Documents Content”, click on the “Download (Documents content)” button. The results can be downloaded as a csv file or a tsv file.
When you tick the “Also Retrieve Documents Stats” box while conducting a search, you can access a range of statistics tailored to the selected collection.
You can obtain statistics on the frequency of your search term in the text field. This gives a quick visualization of how often the search term appears in the retrieved documents.
It is also possible to obtain statistics on the occurrence of the query in the text fields by year.
If the collection incorporates multiple sources, there is an option to visualize the search results’ quantity based on various sources. This functionality allows for a comprehensive understanding of how search results are distributed among different sources within the collection.
Additionally, there is an option to compare the rate of occurrence of the searched terms to the total count of words found in the retrieved documents. This feature facilitates a contextual understanding of the significance, in terms of frequency of usage, of the searched terms within the context of the documents.
Furthermore, there is an option to visually represent the character count within the retrieved documents. This visualization organizes the documents into groups based on their sizes, offering a rapid and informative overview of the distribution of document sizes that need to be processed.
Additionally, there is an option to gather statistical information regarding the number of search results per year. This functionality offers valuable temporal insights, allowing you to understand the distribution of search results across different years.
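The same per-year count can be reproduced from an exported result table. The Python sketch below is only an illustration. It assumes that each row carries a date string in YYYY-MM-DD format under a column named `Date`. That column name and the sample rows are hypothetical and may differ from the actual export.

```python
from collections import Counter


def results_per_year(rows, date_field="Date"):
    """Count rows per year, taking the year from the first four characters."""
    counts = Counter(
        row[date_field][:4] for row in rows if row.get(date_field)
    )
    return dict(sorted(counts.items()))


# Hypothetical search results
sample_rows = [
    {"DocId": "doc-001", "Date": "1925-03-12"},
    {"DocId": "doc-002", "Date": "1925-11-02"},
    {"DocId": "doc-003", "Date": "1931-07-30"},
    {"DocId": "doc-004", "Date": ""},  # rows without a date are skipped
]
per_year = results_per_year(sample_rows)
print(per_year)
```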
Through this visualization, you can compare, on a yearly basis, the sub-collections obtained through the search results to the complete original collection. This feature offers valuable insights into the scale at which the search results are positioned in relation to the entire collection, allowing for a comprehensive understanding of the distribution across different years.
Another feature includes the ability to generate a word cloud based on the proximity of words within the documents retrieved for the searched terms. This visualization offers insights into the context surrounding the searched terms, potentially facilitating the expansion of the initial search by revealing related terms and concepts.
To broaden the search scope, you can use embeddings. After you load the embeddings by selecting “Load Embedding” and then click “Submit Neighbor”, semantically related words within the collection are identified. Clicking on any of these new words adds it to the initial search with an “OR” operator, thereby expanding the search to include additional relevant terms.
Additionally, you have the option to get a sneak peek into the functionalities offered by the “Natural Language Processing Tools” page, specifically focusing on the first 10 documents. This preview provides a glimpse of the tools’ capabilities in the context of natural language processing applied to the initial set of documents. You need to check “Also retrieve NER” to access this functionality.
Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that serves to extract the name of all the real-world entities (persons, organizations, places, time, currency, etc.) mentioned in any corpus of documents. HistText applies a default NER model that varies according to the language (English or Chinese) and the nature (nature of the language, OCR, etc.) of the queried corpus.
The default model is based on the software spaCy and the ontology OntoNotes. This model categorizes entities into the following types:
Type | Description
---|---
PERSON | People, including fictional
ORGANIZATION | Companies, agencies, institutions, etc.
CARDINAL | Numerals that do not fall under another type
DATE | Absolute or relative dates or periods
EVENT | Named hurricanes, battles, wars, sports events, etc.
FACILITY | Buildings, airports, highways, bridges, etc.
GPE | Countries, cities, states
LAW | Named documents made into laws
LOCATION | Non-GPE locations, mountain ranges, bodies of water
MONEY | Monetary values, including unit
NORP | Nationalities or religious or political groups
ORDINAL | “first”, “second”
PERCENT | Percentage (including “%”)
PRODUCT | Vehicles, weapons, foods, etc. (Not services)
QUANTITY | Measurements, as of weight or distance
TIME | Times smaller than a day
WORK OF ART | Titles of books, songs, etc.
There are two modes of importing data, either as a csv/tsv file obtained through the document retrieval interface or by manually entering document ID(s). The csv/tsv file needs to have a DocId column (the other columns are optional). The number of documents in the file is currently capped at 200.
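Before uploading a file, you may want to check that it meets the two requirements above. The following Python sketch shows one way to do so. It is not part of HistText, the name `read_doc_ids` is hypothetical, and it assumes that the cap applies to the number of rows with a non-empty DocId.

```python
import csv
import io

MAX_DOCUMENTS = 200  # cap stated in this manual; it may change


def read_doc_ids(handle, delimiter=","):
    """Return the DocId values of a csv/tsv file, or raise ValueError."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    if not reader.fieldnames or "DocId" not in reader.fieldnames:
        raise ValueError("the file must contain a 'DocId' column")
    doc_ids = [row["DocId"] for row in reader if row["DocId"]]
    if len(doc_ids) > MAX_DOCUMENTS:
        raise ValueError(
            "{} documents found, the limit is {}".format(
                len(doc_ids), MAX_DOCUMENTS
            )
        )
    return doc_ids


# Hypothetical tsv content; real DocId values look different
sample_tsv = io.StringIO(
    "DocId\tTitle\ndoc-001\tFirst title\ndoc-002\tSecond title\n"
)
doc_ids = read_doc_ids(sample_tsv, delimiter="\t")
print(doc_ids)
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to the function, using `delimiter="\t"` for tsv files.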
To apply NER, first select the corpus and upload the csv/tsv file.
Next, click on “Annotate”.
There are two ways to display the results: as a dataframe (a list of the entities in tabular format) or as a visualization (named entities in their context).
The table of results contains six columns:
You can focus on specific types of entity by using the “Filter labels” field.
You can select more than one type in the “Filter labels” field. In the example below, we filter by ORG(anization) and PER(son).
You can also refine the results using the “Minimum confidence” field. The “Search” field serves to query specific terms in all the columns of the results.
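The same filtering can be applied to a downloaded result table. The Python sketch below combines a label filter with a minimum confidence threshold. The column names `Text`, `Type` and `Confidence`, and the sample rows, are hypothetical placeholders. Adapt them to the headers of your own export.

```python
def filter_entities(entities, labels=None, min_confidence=0.0):
    """Keep entities whose type is in `labels` (if given) and whose
    confidence is at least `min_confidence`."""
    wanted = set(labels) if labels else None
    return [
        entity for entity in entities
        if (wanted is None or entity["Type"] in wanted)
        and entity["Confidence"] >= min_confidence
    ]


# Hypothetical annotation results
sample_entities = [
    {"Text": "Shanghai", "Type": "GPE", "Confidence": 0.98},
    {"Text": "Sun Yat-sen", "Type": "PER", "Confidence": 0.91},
    {"Text": "Commercial Press", "Type": "ORG", "Confidence": 0.62},
    {"Text": "Chen", "Type": "PER", "Confidence": 0.40},
]
selected = filter_entities(
    sample_entities, labels=["ORG", "PER"], min_confidence=0.6
)
print([entity["Text"] for entity in selected])
```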
Finally, you can download the results as a csv file.
The Visualization tab displays the Named entities in their context with color-coded labels.
Much like the search feature, you can gather diverse statistics concerning the outcomes of named entity annotation. This includes detailed information about the identified named entities within the annotated documents, providing valuable insights into the distribution and characteristics of the recognized entities.
You can access statistics that specifically detail the count of named entities categorized by each type. This feature provides a breakdown of the identified entities, offering a comprehensive view of their distribution across different types within the annotated documents.
The “Proportion of type” feature is available to simplify the comparison of the volumes of various entity types. This functionality allows you to easily assess the relative distribution and prevalence of each entity type within the annotated documents, providing a clearer understanding of their proportional significance.
You can access a similar distribution across entity types, organized by years.
You can access a similar distribution across entity types, organized by DocId.
Through this feature, you can discern the placement of each entity type within the text, offering insights into whether specific types of entities tend to occur more frequently at the beginning or end of documents compared to others. This analysis provides a nuanced understanding of the distribution patterns of named entities within the annotated texts.
Additionally, you have the option to access the distribution of the model’s confidence scores on the annotated documents. A confidence score closer to 1 indicates higher certainty from the model. This feature provides valuable insights into estimating the level of difficulty the model faced and offers an indication of the potential quantity of errors in the results, aiding in a nuanced assessment of the model’s performance.
You can access analogous results, but specific to each of the entity types. This functionality provides a detailed breakdown of the model’s confidence distribution for each entity type, allowing for a more granular assessment of the model’s performance on different types of entities within the annotated documents.
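To make the idea of a confidence distribution concrete, the following Python sketch groups scores into ten equal-width bins, overall and per entity type. It illustrates the principle only and says nothing about how the application computes its plots. The field names `Type` and `Confidence` and the sample scores are hypothetical.

```python
from collections import defaultdict


def confidence_histogram(scores, bins=10):
    """Count scores in `bins` equal-width intervals covering [0, 1]."""
    counts = [0] * bins
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("confidence scores must lie between 0 and 1")
        # A score of exactly 1.0 is placed in the last bin
        index = min(int(score * bins), bins - 1)
        counts[index] += 1
    return counts


def histogram_by_type(entities, bins=10):
    """Build one histogram per entity type."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["Type"]].append(entity["Confidence"])
    return {label: confidence_histogram(scores, bins)
            for label, scores in grouped.items()}


# Hypothetical scores
sample = [
    {"Type": "PER", "Confidence": 0.95},
    {"Type": "PER", "Confidence": 0.55},
    {"Type": "ORG", "Confidence": 0.91},
    {"Type": "ORG", "Confidence": 0.12},
    {"Type": "GPE", "Confidence": 1.0},
]
overall = confidence_histogram([e["Confidence"] for e in sample])
per_type = histogram_by_type(sample)
print(overall)
print(per_type["PER"])
```

A histogram concentrated in the rightmost bins suggests that the model was mostly certain, while a flatter one points to documents that may need closer manual checking.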
It is also possible to apply NER to individual documents based on their document ID. Make sure you are querying the relevant corpus.
You can annotate multiple documents at the same time by adding DocId on separate lines.
Finally, you can download the results as a csv file.