5 Named Entity Recognition (NER)
Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that extracts the names of the real-world entities (persons, organizations, places, times, currencies, etc.) mentioned in a corpus of documents. In HistText, the function ner_on_corpus() applies a default NER model that varies according to the language (English or Chinese) and the nature of the queried corpus (type of language, OCR quality, etc.).
The default model is based on the spaCy software and the OntoNotes ontology. This model categorizes entities into several main types: persons (PERSON), organizations (ORG), locations (LOC), geopolitical entities (GPE; countries, administrative regions…), temporal entities (DATE, TIME), numerical entities (MONEY, PERCENT), and miscellaneous entities (MISC).
5.1 Named Entity Extraction
The function ner_on_corpus() takes two arguments: the collection of documents with their full text (the output of the get_documents() function) and the targeted corpus.
Example in English (ProQuest):
docs_ner_eng <- ner_on_corpus(docs_eng, corpus = "proquest")
## 1/785
## 11/785
## 21/785
## ...
## 781/785
docs_ner_eng
## # A tibble: 79,312 × 6
## DocId Type Text Start End Confidence
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 1420025538 GPE SHANGHAI 0 8 0.59
## 2 1420025538 ORG ROTARY CLUB 9 20 0.94
## 3 1420025538 GPE China 63 68 0.96
## 4 1420025538 ORG the Shanghai Rotary Club 111 135 1
## 5 1420025538 FAC the Union Club 139 153 0.94
## 6 1420025538 DATE June 18, 157 165 1
## 7 1420025538 CARDINAL 100 188 191 0.63
## 8 1420025538 PERSON Fitch 214 219 0.99
## 9 1420025538 PERSON Ilawkings 293 302 1
## 10 1420025538 ORG Rotary 316 322 0.86
## # ℹ 79,302 more rows
Example in Chinese (Shenbao):
docs_ner_zh <- ner_on_corpus(docs_ft, corpus = "shunpao")
## 1/507
## 11/507
## 21/507
## ...
## 501/507
docs_ner_zh
## # A tibble: 145,182 × 6
## DocId Type Text Start End Confidence
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 SPSP193010291607 ORG 本埠 2 4 0.53
## 2 SPSP193010291607 ORG 扶輪社 4 7 1
## 3 SPSP193010291607 NORP 俄童 9 11 0.97
## 4 SPSP193010291607 TIME 本月二十四日星期五晚 19 29 1
## 5 SPSP193010291607 FAC 戈登路大華飯店 30 37 0.94
## 6 SPSP193010291607 CARDINAL 一 50 51 1
## 7 SPSP193010291607 TIME 每晨九時半至十二時 74 83 1
## 8 SPSP193010291607 FAC 南京路五十號 84 90 0.74
## 9 SPSP194704140406 EVENT 國際扶輪社第九十七區域年會 0 13 1
## 10 SPSP194704140406 DATE 昨日 13 15 1
## # ℹ 145,172 more rows
The function returns a table with six columns:
- DocId: unique identifier of the document
- Type: the type of the extracted entity
- Text: the name of the entity (as given in the source)
- Start: the zero-based character offset at which the entity begins in the text (equivalently, the number of characters preceding it)
- End: the zero-based offset of the character immediately following the entity, so that End minus Start gives the length of the entity
- Confidence: the confidence index of the classification (how confident the algorithm is in the assigned type)
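Judging from the sample output above, Start and End behave as zero-based offsets with an exclusive end. Under that assumption, an entity can be recovered from the full text with base R's substr(), which is one-based and inclusive. The sketch below uses a hypothetical text string and a hand-made table, not actual package output:

```r
# Hypothetical document text (not the actual ProQuest article)
full_text <- "SHANGHAI ROTARY CLUB"

# Hand-made entity table mimicking the Text/Start/End columns
ents <- data.frame(
  Text  = c("SHANGHAI", "ROTARY CLUB"),
  Start = c(0L, 9L),
  End   = c(8L, 20L),
  stringsAsFactors = FALSE
)

# substr() is 1-based and inclusive, hence Start + 1 and End unchanged
ents$Recovered <- substr(rep(full_text, nrow(ents)), ents$Start + 1L, ents$End)

# Entity length in characters
ents$Length <- ents$End - ents$Start
```

If Recovered does not match Text on real data, the offsets may be counted differently (for example in bytes rather than characters), so it is worth checking a few rows before relying on them.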
The histtext package includes alternative models tailored to the nature of the text in the different corpora. In particular, it provides specific models for text with noisy Optical Character Recognition (OCR) and/or missing punctuation.
The function list_ner_models() lists all the models available in the package:
list_ner_models()
## [1] "spacy_sm:en:ner" "spacy:en:ner"
## [3] "spacy_sm:zh:ner" "spacy:zh:ner"
## [5] "trftc_noisy_nopunct:en:ner" "trftc_noisy:en:ner"
## [7] "trftc_nopunct:zh:ner" "trftc_camembert:fr:ner"
## [9] "trftc_person_1class:zh:ner" "trftc_person_4class:zh:ner"
There are two main families of models: spaCy and TrfTC (Transformer Token Classification), a model specifically developed by the ENP-China team (Baptiste Blouin, Jeremy Auguste) to handle issues specific to historical corpora in multiple languages.
Name | Model | Language | Features |
---|---|---|---|
spacy_sm:en:ner | spaCy | English | Default model for large corpora (faster but less reliable) |
spacy_sm:zh:ner | spaCy | Chinese | Default model for large corpora (faster but less reliable) |
spacy:en:ner | spaCy | English | Default model for large or small corpora (slower but more reliable) |
spacy:zh:ner | spaCy | Chinese | Default model for large or small corpora (slower but more reliable) |
trftc_noisy_nopunct:en:ner | TrfTC | English | Model for texts with noisy OCR and missing punctuation |
trftc_noisy:en:ner | TrfTC | English | Model for texts with noisy OCR |
trftc_nopunct:zh:ner | TrfTC | Chinese | Model for texts with no punctuation |
trftc_camembert:fr:ner | TrfTC | French | Model for texts in French (based on CamemBERT) |
trftc_person_1class:zh:ner | TrfTC | Chinese | Model trained to identify persons in any category |
trftc_person_4class:zh:ner | TrfTC | Chinese | Model trained to detect persons’ names with embedded titles (王少年 for 王局長少年, 王君少年) |
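The model names listed above appear to follow a variant:language:task pattern. Assuming that pattern holds (it is inferred from the names, not documented here), a short base R sketch can split the names to list the models available for a given language. The vector below is copied from the list_ner_models() output; the column names are hypothetical:

```r
# Model names copied from the list_ner_models() output above
models <- c("spacy_sm:en:ner", "spacy:en:ner", "spacy_sm:zh:ner", "spacy:zh:ner",
            "trftc_noisy_nopunct:en:ner", "trftc_noisy:en:ner",
            "trftc_nopunct:zh:ner", "trftc_camembert:fr:ner",
            "trftc_person_1class:zh:ner", "trftc_person_4class:zh:ner")

# Split each name on ":" (assumed pattern: variant:language:task)
parts <- do.call(rbind, strsplit(models, ":", fixed = TRUE))

# Hypothetical column names, chosen for illustration
model_table <- data.frame(
  name     = models,
  variant  = parts[, 1],
  language = parts[, 2],
  stringsAsFactors = FALSE
)

# Models available for Chinese
zh_models <- model_table$name[model_table$language == "zh"]
```

In a live session, the hard-coded vector could be replaced with the result of list_ner_models().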
To find out which default model is used for a given corpus, use the function get_default_ner_model():
get_default_ner_model("proquest")
get_default_ner_model("imh-zh")
To specify the model you want to apply, use the model argument of the ner_on_corpus() function:
ner_on_corpus(docs_ft, corpus = "proquest", model = "trftc_noisy_nopunct:en:ner")
ner_on_corpus(docs_ft, corpus = "shunpao", model = "trftc_person_4class:zh:ner")
5.2 Padagraph Visualization
The data extracted through NER can be visualized as a network graph using Padagraph. To do so, first create a network object using libraries such as igraph or tidygraph. The following lines of code detail the successive steps to transform the NER results into an edge list and then into a network object. The last step applies the function in_padagraph() to project the tidygraph object in Padagraph. In the following example, we build a two-mode network linking persons or organizations with the documents in which they appear:
library(dplyr)
library(igraph)
library(tidygraph)
# create the edge list linking documents with persons or organizations
edge_ner_eng <- docs_ner_eng %>%
  filter(Type %in% c("PERSON", "ORG")) %>% # select persons and organizations
  select(DocId, Text) %>% # retain only the relevant variables (from/to)
  filter(!Text %in% c("He","She","His","Her","Him", "Hers", "Chin", "Mrs", "Mr", "Gen", "Madame", "Madam", "he","she","his","her","him", "hers")) # remove personal pronouns, titles and other noise
# retain entities that appeared at least twice
topnode_docs_eng <- edge_ner_eng %>%
  group_by(Text) %>%
  add_tally() %>%
  filter(n > 1) %>%
  distinct(Text)
# build the network with igraph/tidygraph
topedge <- edge_ner_eng %>%
  filter(Text %in% topnode_docs_eng$Text) %>%
  rename(from = DocId, to = Text)
ig <- graph_from_data_frame(d = topedge, directed = FALSE)
tg <- tidygraph::as_tbl_graph(ig) %>%
  activate(nodes) %>%
  mutate(label = name)
# project in padagraph
tg %>% histtext::in_padagraph("RotaryNetwork")
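The "at least twice" filter in the code above can be checked on a toy edge list before running it on real results. The sketch below reproduces that step in base R only, so it runs without dplyr; the document identifiers and entity names are hypothetical:

```r
# Toy edge list (hypothetical values) with the same columns as edge_ner_eng
edges <- data.frame(
  DocId = c("d1", "d1", "d2", "d3", "d3"),
  Text  = c("Rotary Club", "Fitch", "Rotary Club", "Fitch", "Union Club"),
  stringsAsFactors = FALSE
)

# Count mentions per entity and keep those mentioned at least twice
mentions <- table(edges$Text)
top_entities <- names(mentions[mentions > 1])

# Keep only the edges that point to a retained entity
top_edges <- edges[edges$Text %in% top_entities, ]
```

Here "Union Club" is mentioned only once and is dropped, which is the effect of filter(n > 1) in the dplyr version.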
The function get_padagraph_url() directly returns the URL for displaying the graph:
tg %>% histtext::get_padagraph_url("RotaryNetwork")
The URL for the graph is created as a permanent link that can be used in third-party applications or to return to the network.
The function load_in_padagraph() imports a tidygraph object into Padagraph. It takes three arguments:
- filepath: the path to a file that contains a tidygraph object (created with the function save_graph() included in histtext)
- name: the name to be given to the graph
- show_graph: if TRUE, show the graph in an RStudio viewer
The function returns a URL that displays the tidygraph object in Padagraph.
save_graph(graph, filepath)
load_in_padagraph(filepath, name, show_graph = TRUE)
5.3 NER on external documents
HistText includes two functions to run NER on external documents, i.e. documents that can be loaded directly into RStudio:
- run_ner(): to be applied to a character string
- ner_on_df(): to be applied to a dataframe (e.g. a text extracted from a PDF document or scraped from the web, in a dataframe format)
In addition, HistText includes a function that transforms a PDF document that has already been OCRed into a dataframe.
The function run_ner() takes two main arguments: the text to be analyzed (text) and the model to be used (model); use list_ner_models() to obtain the available models:
run_ner(text, model = "spacy:en:ner", verbose = FALSE)
The function ner_on_df() requires at least two arguments: the name of the variable that contains the text (Text) and the language model to apply (English or Chinese). In the example below, we add the id_column argument to specify the column that holds the identifier of each document or page (here DocId). We also process only the first ten rows of the dataframe; remove the dplyr::slice() call to process the whole dataframe.
ner_df <- ner_on_df(dplyr::slice(docs_eng_ft, 1:10), "Text", id_column="DocId", model = "spacy:en:ner")
## 1/10
ner_df
## # A tibble: 325 × 6
## Type Text Start End Confidence DocId
## <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 ORG the Shanghai Rotary Club 111 135 -1 1420025538
## 2 FAC the Union Club 139 153 -1 1420025538
## 3 DATE June 18 157 164 -1 1420025538
## 4 PERSON Fitch 214 219 -1 1420025538
## 5 PERSON Ilawkings 293 302 -1 1420025538
## 6 PERSON Ilawkings 375 384 -1 1420025538
## 7 LOC Europe 435 441 -1 1420025538
## 8 GPE Australia 446 455 -1 1420025538
## 9 GPE Sydney 515 521 -1 1420025538
## 10 ORG Rotary Club 566 577 -1 1420025538
## # ℹ 315 more rows
When applied to a PDF document, running ner_on_df() involves three steps:
- Upload the PDF document
- Convert the text into a dataframe (df) using the function load_pdf_as_df()
- Apply the function ner_on_df() to the dataframe.
In the example below, we process only the first ten rows of the document; remove this line of code to process the whole document. To eliminate the pages that contain no text or very little text (pages with images), we set the minimum limit to 100 characters per page. The identifier argument serves to assign the name of the document (as you choose to label it) to the numbered pages (by default, pages appear simply as numbers).
dplyr::slice(docs_eng_ft, 1:10)
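The effect of a minimum-length threshold can be illustrated in base R. This is only a sketch of the idea, not the way histtext implements it: the pages dataframe, its column names (page, Text) and the min_chars variable are hypothetical.

```r
# Hypothetical page-level dataframe, as a PDF-to-dataframe step might produce
pages <- data.frame(
  page = 1:3,
  Text = c(strrep("Rotary Club meeting. ", 10), "Plate II", ""),
  stringsAsFactors = FALSE
)

# Minimum number of characters a page must contain to be kept
min_chars <- 100

# Drop pages with no text or very little text (e.g. image-only pages)
kept <- pages[nchar(pages$Text) >= min_chars, ]
```

Only the first page survives the filter here; the caption-only page and the empty page are dropped.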