8 Additional features

8.1 Regular Expressions

The function extract_regexps_from_subcorpus() searches a corpus of documents for a list of regular expressions. It takes two arguments:

“corpus”: a table of documents that must include the columns DocId, Title and Text, typically the output of the “get_documents()” function (e.g. “docs_eng_ft”).
“regexps”: a table of the pattern(s) to search for, with two columns, “Regexp” and “Type”.
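
As a minimal sketch of these two input shapes, the tables can be built by hand and checked for the required columns before calling the function. The helper “has_columns()”, the object names and all cell values below are hypothetical and used for illustration only; just the column names come from the description above.

```r
# Toy inputs: contents are invented, only the column names follow the text.
toy_corpus <- data.frame(
  DocId = c("d1", "d2"),
  Title = c("First document", "Second document"),
  Text  = c("Founded in 1998.", "Nothing to find here."),
  stringsAsFactors = FALSE
)

toy_regexps <- data.frame(
  Regexp = c("[0-9]{4}", "[[:upper:]]{2,}"),
  Type   = c("year", "acronym"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the package): TRUE if every required
# column name is present in the table.
has_columns <- function(tbl, required) {
  all(required %in% names(tbl))
}

corpus_ok  <- has_columns(toy_corpus, c("DocId", "Title", "Text"))
regexps_ok <- has_columns(toy_regexps, c("Regexp", "Type"))
```

A check of this kind catches a misspelled or missing column name before the search is run.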

The function returns a three-column table: the document id in which the pattern was found, the type of pattern, and the matched term (pattern).

library(readr) # provides read_delim()
# load example table with regular expressions to search
regexpample <- read_delim("regexps.csv", delim = ";", escape_double = FALSE, trim_ws = TRUE)
regexp_output <- extract_regexps_from_subcorpus(docs_eng_ft, regexpample)
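
To make the input and output shapes concrete, the following base R sketch mimics the behaviour described above on toy data. It is not the package's implementation: the function “sketch_extract_regexps()”, the output column names (DocId, Type, Match) and all data values are assumptions made for illustration, and the real function may differ, for example in its regular expression engine or in how it handles multiple matches.

```r
# Toy inputs with the documented column names (contents invented).
toy_corpus <- data.frame(
  DocId = c("d1", "d2"),
  Title = c("First", "Second"),
  Text  = c("Founded in 1998 by the WHO.", "No match here."),
  stringsAsFactors = FALSE
)
toy_regexps <- data.frame(
  Regexp = c("[0-9]{4}", "[[:upper:]]{2,}"),
  Type   = c("year", "acronym"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in, NOT the package function: returns one row per match
# with the document id, the pattern type and the matched text.
sketch_extract_regexps <- function(corpus, regexps) {
  rows <- list()
  for (i in seq_len(nrow(corpus))) {
    for (j in seq_len(nrow(regexps))) {
      # All non-overlapping matches of pattern j in the text of document i
      found <- gregexpr(regexps$Regexp[j], corpus$Text[i])
      hits  <- regmatches(corpus$Text[i], found)[[1]]
      if (length(hits) > 0) {
        rows[[length(rows) + 1]] <- data.frame(
          DocId = corpus$DocId[i],
          Type  = regexps$Type[j],
          Match = hits,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(rows) == 0) {
    # No match anywhere: return an empty table with the same three columns
    return(data.frame(DocId = character(0), Type = character(0),
                      Match = character(0), stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

toy_output <- sketch_extract_regexps(toy_corpus, toy_regexps)
```

In this sketch, documents without any match (here “d2”) simply contribute no rows to the result.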

8.2 Data transformation

This document presents a substantial workflow for the transformation of data.