8 Additional features
8.1 Regular Expressions
The function extract_regexps_from_subcorpus() searches a corpus of documents for a list of regular expressions. It takes two arguments:
- “corpus”: consists of a table with the documents (must include DocId, Title and Text columns), typically the output of the “get_documents()” function (e.g. “docs_eng_ft”).
- “regexps”: a table listing the pattern(s) to look for, with two columns: “Regexp” (the regular expression) and “Type” (a label for that pattern).
The function returns a three-column table: the document id in which the pattern was found, the type of pattern, and the matched term (pattern).
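library(readr) # provides read_delim()
regexpample <- read_delim("regexps.csv", delim = ";", escape_double = FALSE, trim_ws = TRUE) # load example table with regular expressions to search
regexp_output <- extract_regexps_from_subcorpus(docs_eng_ft, regexpample) # search the corpus for every pattern in the table
regexp_output # inspect the matches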
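To make the expected inputs and output more concrete, here is a minimal sketch in base R. It is not the package's implementation: the helper sketch_extract_regexps(), the toy corpus, the two patterns, and the output column name "Match" are all assumptions made for illustration. Only the input column names (DocId, Title, Text, Regexp, Type) come from the description above.

```r
# Hypothetical toy corpus with the three required columns.
corpus <- data.frame(
  DocId = c("d1", "d2"),
  Title = c("First", "Second"),
  Text  = c("Contact: alice@example.org, phone 555-1234.",
            "No address here, see https://example.org/docs for more."),
  stringsAsFactors = FALSE
)

# Hypothetical pattern table with the two required columns.
regexps <- data.frame(
  Regexp = c("[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]+",
             "https?://[^[:space:],]+"),
  Type   = c("email", "url"),
  stringsAsFactors = FALSE
)

# Illustrative stand-in for extract_regexps_from_subcorpus();
# the real function may behave differently.
sketch_extract_regexps <- function(corpus, regexps) {
  rows <- list()
  for (i in seq_len(nrow(regexps))) {
    # All matches of pattern i, as one character vector per document.
    hits <- regmatches(corpus$Text, gregexpr(regexps$Regexp[i], corpus$Text))
    for (j in seq_along(hits)) {
      if (length(hits[[j]]) > 0) {
        rows[[length(rows) + 1]] <- data.frame(
          DocId = corpus$DocId[j],
          Type  = regexps$Type[i],
          Match = hits[[j]],
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(rows) == 0) {
    # No pattern matched: return an empty table with the same columns.
    return(data.frame(DocId = character(), Type = character(),
                      Match = character(), stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

sketch_output <- sketch_extract_regexps(corpus, regexps)
sketch_output
```

Each row of the result pairs a document id with the pattern type and the text that matched, which mirrors the three-column table described above. A document that matches several patterns, or one pattern several times, contributes one row per match.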
8.2 Data transformation
This document presents a substantial workflow for the transformation of data.