1 Chinese Word Segmentation

Usage is similar to that of the NER functions.

histtext::list_cws_models()
## [1] "trftc_shunpao:zh:cws"

Simple output (directly tokenized strings):

imh_df <- histtext::search_documents('"共產黨員"', "imh-zh")

histtext::cws_on_corpus(imh_df, "imh-zh", detailed_output = FALSE)
## 1/2

Detailed output:

histtext::cws_on_corpus(imh_df, "imh-zh", detailed_output = TRUE)
## 1/2

Segmentation can also be applied to custom data frames with histtext::cws_on_df.

2 Search Concordance on custom dataframes

2.1 Simple query

The simplest way to use the function is to pass the desired keyword directly. Please note that Solr-like queries are NOT supported here (i.e., no support for OR, AND, brackets, …).

custom_df <- tibble::tibble(DocId = c("A", "B", "C", "D", "E"),
                            Text = c("A nice little text used as an example text.",
                                     "Another wee text that will be really needed for examples.",
                                     "This is once again a piece of text that will be handy as a textual example.",
                                     "A final example with plenty of words that together build up a small wall of text that for sure will be used as a demo. We are looking forward to this text full of context.",
                                     "Text is everything. Texts are everywhere."))

histtext::search_concordance_on_df(custom_df, "text", id_column = "DocId", 
                                   context_size = 50, 
                                   case_sensitive = FALSE)
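
To make the idea of a concordance concrete, here is a minimal base R sketch of a keyword-in-context search. It is not how histtext implements search_concordance_on_df: the helper name kwic and the output columns Before, Matched and After are hypothetical and chosen only for illustration. It assumes that context_size counts characters on each side of the match.

```r
# Hypothetical helper, not part of histtext: a minimal keyword-in-context search.
kwic <- function(texts, ids, keyword, context_size = 50, ignore_case = TRUE) {
  rows <- list()
  for (i in seq_along(texts)) {
    # The keyword is used as a pattern; a plain word contains no special characters.
    hits <- gregexpr(keyword, texts[i], ignore.case = ignore_case)[[1]]
    if (hits[1] == -1) next  # no match in this document
    for (j in seq_along(hits)) {
      start <- hits[j]
      end <- start + attr(hits, "match.length")[j] - 1
      rows[[length(rows) + 1]] <- data.frame(
        DocId = ids[i],
        # Up to context_size characters on each side of the match
        Before = substr(texts[i], max(1, start - context_size), start - 1),
        Matched = substr(texts[i], start, end),
        After = substr(texts[i], end + 1, min(nchar(texts[i]), end + context_size)),
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

# Two of the example documents from custom_df
docs <- data.frame(
  DocId = c("A", "E"),
  Text = c("A nice little text used as an example text.",
           "Text is everything. Texts are everywhere."),
  stringsAsFactors = FALSE
)

result <- kwic(docs$Text, docs$DocId, "text", context_size = 10)
print(result)
```

Each row corresponds to one occurrence of the keyword, so a document that contains the keyword twice contributes two rows.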

For English text, it makes sense to specify that spaces are word separators:

histtext::search_concordance_on_df(custom_df, "text", id_column = "DocId",
                                   context_size = 50, 
                                   case_sensitive = FALSE, 
                                   space_is_word_sep = TRUE)

2.2 Regexp query

More advanced queries can be written as regular expressions by setting use_regexp = TRUE.

histtext::search_concordance_on_df(custom_df, "(text|word)s?", 
                                   id_column = "DocId",
                                   context_size = 50,
                                   use_regexp = TRUE,
                                   case_sensitive = FALSE,
                                   space_is_word_sep = TRUE)
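
Before running a regular expression through the concordance search, it can help to check what the pattern matches using base R alone. The sketch below uses sentences adapted from custom_df. It assumes Perl-style regular expressions; whether histtext's matching behaves identically in every detail is an assumption. The variant with \\b is plain regex syntax for word boundaries, shown for comparison only, and is not a description of what space_is_word_sep does.

```r
# Sentences adapted from the custom_df example above
texts <- c(
  "A final example with plenty of words.",
  "We are looking forward to this text full of context.",
  "Text is everything. Texts are everywhere."
)

# Same pattern as in the query: "text" or "word", optionally followed by "s"
pattern <- "(text|word)s?"
matches <- regmatches(texts, gregexpr(pattern, texts, ignore.case = TRUE, perl = TRUE))
print(matches)

# With word boundaries, "text" inside "context" is no longer matched
pattern_wb <- "\\b(text|word)s?\\b"
matches_wb <- regmatches(texts, gregexpr(pattern_wb, texts, ignore.case = TRUE, perl = TRUE))
print(matches_wb)
```

Without boundaries, the pattern also matches inside longer words such as "context", which may or may not be what you want.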

3 Question Answering

Question answering models are now available, mainly for biographical information extraction. These models are not fully usable and may time out on long texts, especially if you ask many questions at a time.

histtext::list_qa_models()
## [1] "trftc_biography:zh:qa" "trftc_biography:en:qa"

3.1 Basic usage

The most basic use is to ask a single question:

imh_en_df <- histtext::search_documents('"member of party"', "imh-en")

histtext::qa_on_corpus(imh_en_df, "What is his full name?", "imh-en")
## 1/9
## 2/9
## 3/9
## 4/9
## 5/9
## 6/9
## 7/9
## 8/9
## 9/9

Or multiple variants of a question:

histtext::qa_on_corpus(imh_en_df, c("What is his full name?", "What name?"), "imh-en")
## 1/9
## 2/9
## 3/9
## 4/9
## 5/9
## 6/9
## 7/9
## 8/9
## 9/9

3.2 More questions

In more advanced usage, questions can depend on previous questions:

questions <- list("name:full" = c("What is his full name?", "What name?"),
                  "education:location" = c("Where {name:full} study at?", "Where study at?"))
histtext::qa_on_corpus(imh_en_df, questions, "imh-en")
## 1/9
## 2/9
## 3/9
## 4/9
## 5/9
## 6/9
## 7/9
## 8/9
## 9/9
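
One way to read the {name:full} placeholder is as a slot that gets filled with the answer to the "name:full" question before the dependent question is asked. The base R sketch below illustrates that reading only. The helper fill_placeholders is hypothetical, the answer is made up, and histtext's actual handling of placeholders may differ.

```r
# Hypothetical helper, not part of histtext: fill {key} placeholders in a
# question template with answers obtained from earlier questions.
fill_placeholders <- function(template, answers) {
  for (key in names(answers)) {
    # fixed = TRUE treats the braces and the answer as literal text
    template <- gsub(paste0("{", key, "}"), answers[[key]], template, fixed = TRUE)
  }
  template
}

# Made-up answer, for illustration only
answers <- list("name:full" = "John Smith")

filled <- fill_placeholders("Where {name:full} study at?", answers)
print(filled)
```

Under this reading, the variant without a placeholder ("Where study at?") can serve as a fallback when no name has been found.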

You can also change the number of answers a question is allowed to return:

histtext::qa_on_corpus(imh_en_df, questions, "imh-en", max_answers = list("education:location" = 2))
## 1/9
## 2/9
## 3/9
## 4/9
## 5/9
## 6/9
## 7/9
## 8/9
## 9/9

Examples of the questions the models were trained on can be accessed with histtext::biography_questions:

histtext::biography_questions("en")
## $`name:full`
## [1] "What name?"             "What is his full name?"
## 
## $`birth:location`
## [1] "Where born?"                          
## [2] "In what location is he born?"         
## [3] "In what location is {name:full} born?"
## 
## $`birth:year`
## [1] "When born?"                     "What year is he born?"         
## [3] "What year is {name:full} born?"
## 
## $`education:location`
## [1] "Where {name:full} study at?"                               
## [2] "Where study at?"                                           
## [3] "What school, college or university did {name:full} attend?"
## 
## $`education:year`
## [1] "When {name:full} study at {#education:location}?"
## [2] "When study at {#education:location}?"            
## 
## $`position:job`
## [1] "What job position?" "What job?"         
## 
## $`position:job_location`
## [1] "Where {name:full} was {#position:job}?"
## [2] "In what location {#position:job}?"     
## [3] "Where {#position:job}?"                
## 
## $`position:job_year`
## [1] "When {name:full} was {#position:job}?"
## [2] "What year {#position:job}?"           
## [3] "When {#position:job}?"
histtext::biography_questions("zh")
## $`name:full`
## [1] "什麼名字?"
## 
## $`name:given`
## [1] "名什麼?"
## 
## $`name:art`
## [1] "號是什麼?"
## 
## $`name:courtesy`
## [1] "字是什麼?"
## 
## $`birth:location`
## [1] "哪裡出生?"   "他從哪裡來?"
## 
## $`birth:age`
## [1] "他幾歲了?"
## 
## $`education:location`
## [1] "哪裡{%zh}上學?"           "上過什麼{%zh}學校或大學?"
## [3] "什麼{%zh}教育?"          
## 
## $`education:level`
## [1] "您的學歷是多少?"
## 
## $`position:job`
## [1] "什麼工作?"

Please avoid passing the full output of this function directly to the QA functions. The number of questions will overwhelm the server, and your query will probably time out in many cases. You can, however, use a small subset of these questions.
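
Taking a small subset amounts to ordinary list indexing by name. The sketch below copies a few entries from the printed output above so that it runs without a server; with histtext available, all_questions would instead come from histtext::biography_questions("en"). Keeping "name:full" alongside "birth:year" is a cautious choice, on the assumption that a question using the {name:full} placeholder needs that key to be present.

```r
# Entries copied from the printed output of histtext::biography_questions("en")
all_questions <- list(
  "name:full" = c("What name?", "What is his full name?"),
  "birth:location" = c("Where born?",
                       "In what location is he born?",
                       "In what location is {name:full} born?"),
  "birth:year" = c("When born?",
                   "What year is he born?",
                   "What year is {name:full} born?"),
  "position:job" = c("What job position?", "What job?")
)

# Keep only the question groups that are actually needed
wanted <- c("name:full", "birth:year")
small_questions <- all_questions[wanted]
print(names(small_questions))

# Against the server (not run here):
# histtext::qa_on_corpus(imh_en_df, small_questions, "imh-en")
```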