Usage is similar to that of the NER functions. Available models are listed with:
histtext::list_cws_models()
## [1] "trftc_shunpao:zh:cws"
Simple output (directly tokenized strings):
imh_df <- histtext::search_documents('"共產黨員"', "imh-zh")
histtext::cws_on_corpus(imh_df, "imh-zh", detailed_output = FALSE)
## 1/2
Detailed output:
histtext::cws_on_corpus(imh_df, "imh-zh", detailed_output = TRUE)
## 1/2
Segmentation can also be applied to custom data frames with histtext::cws_on_df.
The simplest way to use histtext::search_concordance_on_df is to pass the wanted keyword directly. Please note that Solr-like queries are NOT supported here (i.e., no support for OR, AND, brackets, …).
custom_df <- tibble::tibble(DocId = c("A", "B", "C", "D", "E"),
                            Text = c("A nice little text used as an example text.",
                                     "Another wee text that will be really needed for examples.",
                                     "This is once again a piece of text that will be handy as a textual example.",
                                     "A final example with plenty of words that together build up a small wall of text that for sure will be used as a demo. We are looking forward to this text full of context.",
                                     "Text is everything. Texts are everywhere."))
histtext::search_concordance_on_df(custom_df, "text", id_column = "DocId",
                                   context_size = 50,
                                   case_sensitive = FALSE)
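To make the roles of context_size and case_sensitive more concrete, here is a minimal base R sketch of a keyword-in-context search over a single string. It is not the implementation used by histtext: the helper name kwic_sketch and the column names Left, Matched and Right are made up for illustration, and the keyword is treated as a regular expression.

```r
# Hypothetical helper (not part of histtext): find every occurrence of
# `keyword` in `text` and return up to `context_size` characters on each side.
kwic_sketch <- function(text, keyword, context_size = 50, case_sensitive = FALSE) {
  m <- gregexpr(keyword, text, ignore.case = !case_sensitive)[[1]]
  if (m[1] == -1) {
    # No match: return an empty result with the same columns.
    return(data.frame(Left = character(), Matched = character(),
                      Right = character(), stringsAsFactors = FALSE))
  }
  starts <- as.integer(m)
  ends <- starts + attr(m, "match.length") - 1L
  data.frame(
    Left = substring(text, pmax(1L, starts - context_size), starts - 1L),
    Matched = substring(text, starts, ends),
    Right = substring(text, ends + 1L, ends + context_size),
    stringsAsFactors = FALSE
  )
}

print(kwic_sketch("Text is everything. Texts are everywhere.", "text",
                  context_size = 5))
```

A smaller context_size shortens the Left and Right columns; setting case_sensitive = TRUE in this sketch would no longer match the capitalised "Text".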
For English it makes sense to specify that spaces are usually word separators:
histtext::search_concordance_on_df(custom_df, "text", id_column = "DocId",
                                   context_size = 50,
                                   case_sensitive = FALSE,
                                   space_is_word_sep = TRUE)
More advanced queries can be expressed as regular expressions by setting use_regexp = TRUE:
histtext::search_concordance_on_df(custom_df, "(text|word)s?",
                                   id_column = "DocId",
                                   context_size = 50,
                                   use_regexp = TRUE,
                                   case_sensitive = FALSE,
                                   space_is_word_sep = TRUE)
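To see what a pattern such as "(text|word)s?" matches, it can help to try it locally with base R before sending it to the package. The sketch below compares the plain pattern with a word-bounded variant. Using \\b here is only an analogy for whole-word matching; whether space_is_word_sep behaves the same way inside histtext is an assumption, not something confirmed by the package documentation.

```r
texts <- c("Text is everything. Texts are everywhere.",
           "We are looking forward to this text full of context.")

# Plain pattern: also matches the "text" inside "context".
plain <- regmatches(texts,
                    gregexpr("(text|word)s?", texts, ignore.case = TRUE))

# Word-bounded pattern: \\b restricts matches to whole words.
bounded <- regmatches(texts,
                      gregexpr("\\b(text|word)s?\\b", texts,
                               ignore.case = TRUE, perl = TRUE))

print(plain)
print(bounded)
```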
Question answering models are now available, mainly for extracting biographical information. These models are not yet fully usable and may time out on long texts, especially if you ask many questions at once.
histtext::list_qa_models()
## [1] "trftc_biography:zh:qa" "trftc_biography:en:qa"
The most basic use is to ask a single question:
imh_en_df <- histtext::search_documents('"member of party"', "imh-en")
histtext::qa_on_corpus(imh_en_df, "What is his full name?", "imh-en")
## 1/9
## 2/9
## 3/9
## 4/9
## 5/9
## 6/9
## 7/9
## 8/9
## 9/9
Or multiple variants of a question:
histtext::qa_on_corpus(imh_en_df, c("What is his full name?", "What name?"), "imh-en")
## 1/9
## 2/9
## 3/9
## 4/9
## 5/9
## 6/9
## 7/9
## 8/9
## 9/9
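In more advanced usage, questions can depend on the answers to previous questions. Here the placeholder {name:full} in the second question refers to the answer given to the name:full question: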
questions <- list("name:full" = c("What is his full name?", "What name?"),
                  "education:location" = c("Where {name:full} study at?", "Where study at?"))
histtext::qa_on_corpus(imh_en_df, questions, "imh-en")
## 1/9
## 2/9
## 3/9
## 4/9
## 5/9
## 6/9
## 7/9
## 8/9
## 9/9
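One plausible way to picture the dependency between questions is simple placeholder substitution: once name:full has an answer, that answer is inserted wherever {name:full} appears in later questions. The base R sketch below illustrates this idea only. The helper fill_placeholders and the answer "Jane Doe" are hypothetical, and the actual resolution done by histtext may differ.

```r
# Hypothetical helper (not part of histtext): replace each {key} placeholder
# with the answer already obtained for that key.
fill_placeholders <- function(question, answers) {
  for (key in names(answers)) {
    question <- gsub(paste0("{", key, "}"), answers[[key]], question,
                     fixed = TRUE)
  }
  question
}

# Made-up answer to the "name:full" question, for illustration only.
previous_answers <- list("name:full" = "Jane Doe")

filled <- fill_placeholders(c("Where {name:full} study at?", "Where study at?"),
                            previous_answers)
print(filled)
```

Variants without a placeholder, such as "Where study at?", pass through unchanged, which is presumably why both forms are listed side by side.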
You can also change the maximum number of answers a question is allowed to return:
histtext::qa_on_corpus(imh_en_df, questions, "imh-en", max_answers = list("education:location" = 2))
## 1/9
## 2/9
## 3/9
## 4/9
## 5/9
## 6/9
## 7/9
## 8/9
## 9/9
Examples of the questions the models were trained on can be accessed with histtext::biography_questions:
histtext::biography_questions("en")
## $`name:full`
## [1] "What name?" "What is his full name?"
##
## $`birth:location`
## [1] "Where born?"
## [2] "In what location is he born?"
## [3] "In what location is {name:full} born?"
##
## $`birth:year`
## [1] "When born?" "What year is he born?"
## [3] "What year is {name:full} born?"
##
## $`education:location`
## [1] "Where {name:full} study at?"
## [2] "Where study at?"
## [3] "What school, college or university did {name:full} attend?"
##
## $`education:year`
## [1] "When {name:full} study at {#education:location}?"
## [2] "When study at {#education:location}?"
##
## $`position:job`
## [1] "What job position?" "What job?"
##
## $`position:job_location`
## [1] "Where {name:full} was {#position:job}?"
## [2] "In what location {#position:job}?"
## [3] "Where {#position:job}?"
##
## $`position:job_year`
## [1] "When {name:full} was {#position:job}?"
## [2] "What year {#position:job}?"
## [3] "When {#position:job}?"
histtext::biography_questions("zh")
## $`name:full`
## [1] "什麼名字?"
##
## $`name:given`
## [1] "名什麼?"
##
## $`name:art`
## [1] "號是什麼?"
##
## $`name:courtesy`
## [1] "字是什麼?"
##
## $`birth:location`
## [1] "哪裡出生?" "他從哪裡來?"
##
## $`birth:age`
## [1] "他幾歲了?"
##
## $`education:location`
## [1] "哪裡{%zh}上學?" "上過什麼{%zh}學校或大學?"
## [3] "什麼{%zh}教育?"
##
## $`education:level`
## [1] "您的學歷是多少?"
##
## $`position:job`
## [1] "什麼工作?"
Please avoid passing the full output of this function directly to the QA functions. The number of questions will overwhelm the server, and your query will probably time out in many cases. You can, however, use a small subset of these questions.
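A minimal sketch of taking such a subset in base R is shown below. So that it runs without a server connection, it works on a hand-copied excerpt of the English question list shown above; in practice the list would come from histtext::biography_questions("en"). The variable names and the choice of keeping two variants per question are illustrative.

```r
# Excerpt of the English question list, copied from the output above.
all_questions <- list(
  "name:full" = c("What name?", "What is his full name?"),
  "birth:location" = c("Where born?",
                       "In what location is he born?",
                       "In what location is {name:full} born?"),
  "education:location" = c("Where {name:full} study at?",
                           "Where study at?",
                           "What school, college or university did {name:full} attend?")
)

# Keep only the fields of interest. "name:full" is kept because the
# education questions refer to it through the {name:full} placeholder.
wanted <- c("name:full", "education:location")
small_questions <- all_questions[wanted]

# Limit each field to at most two question variants.
small_questions <- lapply(small_questions, utils::head, n = 2)

str(small_questions)

# The reduced list could then be passed on, e.g.:
# histtext::qa_on_corpus(imh_en_df, small_questions, "imh-en")
```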