5 Named Entity Recognition (NER)

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that extracts the names of the real-world entities (persons, organizations, places, times, currencies, etc.) mentioned in a corpus of documents. In HistText, the function ner_on_corpus applies a default NER model that varies according to the language (English or Chinese) and the nature (quality of the OCR, presence of punctuation, etc.) of the queried corpus.

The default model is based on the spaCy library and the OntoNotes ontology. This model categorizes entities into main types such as persons (PERSON), organizations (ORG), locations (LOC), facilities (FAC), geopolitical entities (GPE: countries, administrative regions, etc.), temporal entities (DATE, TIME), and numerical entities (MONEY, PERCENT, CARDINAL).

5.1 Named Entity Extraction

The function ner_on_corpus takes two arguments: the collection of documents with their full text (the output of the get_documents() function) and the targeted corpus.

Example in English (ProQuest):

docs_ner_eng <- ner_on_corpus(docs_eng, corpus = "proquest")
## 1/785
## ...
## 781/785
docs_ner_eng
## # A tibble: 79,312 × 6
##    DocId      Type     Text                     Start   End Confidence
##    <chr>      <chr>    <chr>                    <int> <int>      <dbl>
##  1 1420025538 GPE      SHANGHAI                     0     8       0.59
##  2 1420025538 ORG      ROTARY CLUB                  9    20       0.94
##  3 1420025538 GPE      China                       63    68       0.96
##  4 1420025538 ORG      the Shanghai Rotary Club   111   135       1   
##  5 1420025538 FAC      the Union Club             139   153       0.94
##  6 1420025538 DATE     June 18,                   157   165       1   
##  7 1420025538 CARDINAL 100                        188   191       0.63
##  8 1420025538 PERSON   Fitch                      214   219       0.99
##  9 1420025538 PERSON   Ilawkings                  293   302       1   
## 10 1420025538 ORG      Rotary                     316   322       0.86
## # ℹ 79,302 more rows


Example in Chinese (Shenbao):

docs_ner_zh <- ner_on_corpus(docs_ft, corpus = "shunpao")
## 1/507
## ...
## 501/507
docs_ner_zh
## # A tibble: 145,182 × 6
##    DocId            Type     Text                       Start   End Confidence
##    <chr>            <chr>    <chr>                      <int> <int>      <dbl>
##  1 SPSP193010291607 ORG      本埠                           2     4       0.53
##  2 SPSP193010291607 ORG      扶輪社                         4     7       1   
##  3 SPSP193010291607 NORP     俄童                           9    11       0.97
##  4 SPSP193010291607 TIME     本月二十四日星期五晚          19    29       1   
##  5 SPSP193010291607 FAC      戈登路大華飯店                30    37       0.94
##  6 SPSP193010291607 CARDINAL 一                            50    51       1   
##  7 SPSP193010291607 TIME     每晨九時半至十二時            74    83       1   
##  8 SPSP193010291607 FAC      南京路五十號                  84    90       0.74
##  9 SPSP194704140406 EVENT    國際扶輪社第九十七區域年會     0    13       1   
## 10 SPSP194704140406 DATE     昨日                          13    15       1   
## # ℹ 145,172 more rows


The function returns a table with six columns:

  • DocId: the unique identifier of the document
  • Type: the type of the extracted entity
  • Text: the name of the entity (as given in the source)
  • Start: the character position at which the entity begins in the text
  • End: the character position immediately after the entity ends
  • Confidence: a confidence index measuring the reliability of the classification made by the algorithm
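Because the function returns an ordinary tibble, the results can be filtered and summarized with standard dplyr verbs. A minimal sketch, assuming the docs_ner_eng table above (the 0.8 confidence cutoff is an arbitrary value chosen for illustration):

```r
library(dplyr)

# keep only person and organization mentions with a reasonably
# confident classification, then rank entities by frequency
docs_ner_eng %>%
  filter(Type %in% c("PERSON", "ORG"), Confidence >= 0.8) %>%
  count(Type, Text, sort = TRUE)
```

The same pattern applies to the Chinese results (docs_ner_zh).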

The histtext package includes alternative models tailored to the nature of the text in the different corpora. In particular, it provides specific models for texts with noisy Optical Character Recognition (OCR) and/or missing punctuation.

The function list_ner_models() lists all the models available in the package:

list_ner_models()
##  [1] "spacy_sm:en:ner"            "spacy:en:ner"              
##  [3] "spacy_sm:zh:ner"            "spacy:zh:ner"              
##  [5] "trftc_noisy_nopunct:en:ner" "trftc_noisy:en:ner"        
##  [7] "trftc_nopunct:zh:ner"       "trftc_camembert:fr:ner"    
##  [9] "trftc_person_1class:zh:ner" "trftc_person_4class:zh:ner"


There are two main families of models: spaCy and trftc (Transformer Token Classification), the latter developed by the ENP-China team (Baptiste Blouin, Jeremy Auguste) to handle issues specific to historical corpora in multiple languages.

Name                        Model  Language  Features
spacy_sm:en:ner             spaCy  English   Default model for large corpora (faster but less reliable)
spacy_sm:zh:ner             spaCy  Chinese   Default model for large corpora (faster but less reliable)
spacy:en:ner                spaCy  English   Default model for large or small corpora (slower but more reliable)
spacy:zh:ner                spaCy  Chinese   Default model for large or small corpora (slower but more reliable)
trftc_noisy_nopunct:en:ner  TrfTC  English   Model for texts with noisy OCR and missing punctuation
trftc_noisy:en:ner          TrfTC  English   Model for texts with noisy OCR
trftc_nopunct:zh:ner        TrfTC  Chinese   Model for texts with no punctuation
trftc_camembert:fr:ner      TrfTC  French    Model for texts in French (based on CamemBERT)
trftc_person_1class:zh:ner  TrfTC  Chinese   Model trained to identify persons only (a single entity class)
trftc_person_4class:zh:ner  TrfTC  Chinese   Model trained to detect persons’ names with embedded titles (王少年 for 王局長少年, 王君少年)

To find out which default model is used for a given corpus, use the function get_default_ner_model():

get_default_ner_model("proquest")
get_default_ner_model("imh-zh")


To specify the model you want to apply, use the model argument of the ner_on_corpus() function:

ner_on_corpus(docs_ft, corpus = "proquest", model = "trftc_noisy_nopunct:en:ner")
ner_on_corpus(docs_ft, corpus = "shunpao", model = "trftc_person_4class:zh:ner")

5.2 Padagraph Visualization

The data extracted through NER can be visualized as a network graph using Padagraph. To enable this function, one first needs to create a network object using libraries such as igraph or tidygraph. The following lines of code detail the successive steps that transform the results of NER into an edge list and eventually into a network object. The last step applies the function in_padagraph() to project the tidygraph object into Padagraph. In the following example, we build a two-mode network linking persons or organizations with the documents in which they appear:

# load the required libraries
library(dplyr)
library(igraph)
library(tidygraph)

# create the edge list linking documents with persons or organizations
edge_ner_eng <- docs_ner_eng %>%
  filter(Type %in% c("PERSON", "ORG")) %>% # select persons and organizations
  select(DocId, Text) %>% # retain only the relevant variables (from/to)
  filter(!Text %in% c("He","She","His","Her","Him", "Hers", "Chin", "Mrs", "Mr", "Gen", "Madame", "Madam", "he","she","his","her","him", "hers")) # remove personal pronouns and titles
# retain entities that appear at least twice
topnode_docs_eng <- edge_ner_eng %>%
  group_by(Text) %>%
  add_tally() %>%
  filter(n > 1) %>%
  distinct(Text)
# build the network with igraph/tidygraph
topedge <- edge_ner_eng %>%
  filter(Text %in% topnode_docs_eng$Text) %>%
  rename(from = DocId, to = Text)
ig <- graph_from_data_frame(d = topedge, directed = FALSE)
tg <- tidygraph::as_tbl_graph(ig) %>%
  activate(nodes) %>%
  mutate(label = name)
# project in Padagraph
tg %>% histtext::in_padagraph("RotaryNetwork")


The function get_padagraph_url() directly returns the URL for displaying the graph:

tg %>% histtext::get_padagraph_url("RotaryNetwork")

Rotary Network in ProQuest

The URL for the graph is created as a permanent link that can be reused in third-party applications or to return to the network later.
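Before projecting the graph, it can be useful to inspect it locally. A sketch using tidygraph, assuming the tg object built above:

```r
library(dplyr)
library(tidygraph)

# rank the nodes by degree to identify the best-connected
# entities and documents in the two-mode network
tg %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree()) %>%
  as_tibble() %>%
  arrange(desc(degree))
```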

The function load_in_padagraph() serves to import a tidygraph object into Padagraph. The function takes three arguments:

  • filepath: the path to a file containing a tidygraph object (created with the function save_graph() included in histtext)
  • name: the name to be given to the graph
  • show_graph: if TRUE, display the graph in the RStudio viewer

The function returns a URL that displays the tidygraph object in Padagraph.

save_graph(graph, filepath)
load_in_padagraph(filepath, name, show_graph = TRUE)

5.3 NER on external documents

HistText includes two functions to apply NER to external documents, i.e. documents that can be loaded directly into RStudio:

  1. run_ner(): to be applied to a character string
  2. ner_on_df(): to be applied to a dataframe (e.g. text extracted from a PDF document or scraped from the web and stored as a dataframe)

In addition, HistText includes a function that transforms a pre-OCRized PDF document into a dataframe.

The function run_ner() takes two arguments: the text to be analyzed (text) and the model to be used (model); use list_ner_models() to obtain the available models:

run_ner(text, model = "spacy:en:ner", verbose = FALSE)
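For instance, the function can be applied to a short character string (the sentence below is an invented example; the output should take the same tabular form as the results shown earlier):

```r
# run NER on a single character string with the default English model
sentence <- "The Shanghai Rotary Club met at the Union Club on June 18."
run_ner(sentence, model = "spacy:en:ner")
```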

The function ner_on_df() requires at least two arguments: the dataframe and the name of the column that contains the text ("Text" in the example below). The optional id_column argument links each extracted entity to its document identifier (here DocId), and the model argument selects the language model to apply (English or Chinese). In the example below, we also choose to process only the first ten rows of the collection; this slicing should be removed to process the whole collection.

ner_df <- ner_on_df(dplyr::slice(docs_eng_ft, 1:10), "Text", id_column="DocId", model = "spacy:en:ner")
## 1/10
ner_df
## # A tibble: 325 × 6
##    Type   Text                     Start   End Confidence DocId     
##    <chr>  <chr>                    <dbl> <dbl>      <dbl> <chr>     
##  1 ORG    the Shanghai Rotary Club   111   135         -1 1420025538
##  2 FAC    the Union Club             139   153         -1 1420025538
##  3 DATE   June 18                    157   164         -1 1420025538
##  4 PERSON Fitch                      214   219         -1 1420025538
##  5 PERSON Ilawkings                  293   302         -1 1420025538
##  6 PERSON Ilawkings                  375   384         -1 1420025538
##  7 LOC    Europe                     435   441         -1 1420025538
##  8 GPE    Australia                  446   455         -1 1420025538
##  9 GPE    Sydney                     515   521         -1 1420025538
## 10 ORG    Rotary Club                566   577         -1 1420025538
## # ℹ 315 more rows
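The resulting tibble can then be summarized with standard dplyr verbs, for instance to count how many entities of each type were found in each document:

```r
library(dplyr)

# number of extracted entities per document and per type
ner_df %>%
  count(DocId, Type, sort = TRUE)
```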


If applied to a PDF document, the workflow with ner_on_df() involves three steps:

  1. Load the PDF document
  2. Convert the text into a dataframe using the function load_pdf_as_df()
  3. Apply ner_on_df() to the dataframe.

In the example below, we choose to process only the first ten rows of the document; this line of code should be removed to process the whole document. To eliminate the pages that do not contain text or that contain only little text (e.g. pages with images), we set a minimum of at least 100 characters per page. The identifier argument serves to assign the name of the document (as you choose to label it) to the numbered pages (by default, pages appear simply as numbers).

dplyr::slice(docs_eng_ft, 1:10)
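The three steps of the PDF workflow can be sketched as follows. Note that this is only an illustrative sketch: the file name is invented, the exact signature of load_pdf_as_df() is assumed, and we assume the resulting text column is named "Text":

```r
library(dplyr)

# steps 1-2: load the pre-OCRized PDF and convert it to a dataframe
# ("my_document.pdf" is a placeholder file name)
pdf_df <- load_pdf_as_df("my_document.pdf")

# drop pages with fewer than 100 characters (blank pages, images)
pdf_df <- pdf_df %>% filter(nchar(Text) >= 100)

# step 3: run NER on the dataframe
ner_pdf <- ner_on_df(pdf_df, "Text", model = "spacy:en:ner")
```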