This tutorial continues our workflow for revisiting the role of “returned students” in China based on a systematic analysis of the press. It relies on the textfeatures package to explore the textual features of newspaper articles. The ultimate goal is to build a model for the automatic classification of articles based on shared stylistic features.
Load the dataset
We load the dataset:
Our training dataset consists of a hand-coded sample of 1102 articles (out of 2744 in our initial corpus) relating to “returned students” in the English-language press in China. (More details on the method for building the corpus here). The articles were unequally distributed across ten different periodicals, among which the British North-China Herald, the American China Weekly Review and the Chinese China Press (Dalubao), all based in Shanghai, clearly dominated. Each of the 1102 articles was manually assigned to the section under which it appeared in the periodical. We defined ten broad sections (genres). Each article falls under one and only one class. (More details on the classification method here)
Our modeling goal is to predict the genre of each article (i.e. the section under which it appeared) and its source (i.e. the periodical in which it appeared).
How were the articles distributed across periodicals and newspaper sections?
rs_genre %>%
  count(Source, sort = TRUE)
As noted above, three Shanghai-based periodicals dominated the corpus: the British North-China Herald (409 articles), the Chinese China Press (Dalubao) (290) and the American China Weekly Review (238).
For better legibility, we recode the genres as follows:
rs_genre <- rs_genre %>%
  mutate(Genre_code = fct_recode(Genre,
    "0" = "Advertisement",
    "1" = "Authored article",
    "2" = "Correspondence (Abroad)",
    "3" = "Correspondence (China)",
    "4" = "Editorial",
    "5" = "People & Events",
    "6" = "Local & General News",
    "7" = "Readers Letter",
    "8" = "Special Pages",
    "9" = "Other"
  ))
Genres (newspaper section)
rs_genre %>%
  count(Genre_code, Genre, sort = TRUE)
The classes are clearly imbalanced. The sections “Correspondence (China)” and “People & Events” accounted for almost half of the sample. “Authored article”, “Local & General News” and “Advertisement” together represented about one third. How does this vary across periodicals?
Genre by periodical
rs_genre %>%
  mutate(
    Source = fct_inorder(Source),
    Genre = fct_lump_n(Genre, 10)
  ) %>%
  count(Source, Genre_code, Genre) %>%
  # reorder the genre codes within each periodical for the faceted plot
  mutate(Genre_code = reorder_within(Genre_code, n, Source)) %>%
  ggplot(aes(n, Genre_code, fill = Source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Source, scales = "free", ncol = 2) +
  scale_y_reordered() +
  labs(title = "Genre of articles in the 'returned students' press corpus",
       subtitle = "Distribution by periodical",
       x = "Number of articles",
       y = "Section")
Advertisements dominated in two periodicals, the China Press and the Peking Gazette. Local & General News prevailed in most non-Shanghai papers (Peking Daily News, Peking Leaders, Canton Times). This section also featured prominently in the Shanghai Times and The China Press. The two top sections, “Correspondence (China)” and “People & Events”, dominated in the two major periodicals, the British North-China Herald and the American China Weekly Review. The weight of these two periodicals in the sample largely accounts for the prevalence of the two sections in the entire corpus. The missionary journal The Chinese Recorder presented a different profile, favoring opinion articles (“Authored article”) over all other genres.
Let’s create a dataset for our modeling question, and look at a few example lines.
In order to simplify, we retain only the three most prominent periodicals and lump together the less important ones:
rs_genre_featured <- rs_genre %>%
  mutate(Source_fct = fct_lump(Source, n = 3)) %>%
  select(Source_fct, Genre_code, Genre, text = Text)
rs_genre_featured %>%
  count(Source_fct, sort = TRUE)
Let’s have a look at a few example lines from the “People & Events” section:
rs_genre_featured %>%
  filter(Genre == "People & Events") %>%
  sample_n(5)
We find a variety of articles under different labels such as “Personal Notes” from the North-China Herald, “Mainly About Chinese Personages” in The China Press, or the “Men & Events” and “Who’s Who” sections in the China Weekly Review.
What are the highest log odds words from each genre of articles?
Compute the log odds ratio using the package tidylo:
rs_genre_lo <- rs_genre_featured %>%
  unnest_tokens(word, text) %>%
  count(Genre, word) %>%
  bind_log_odds(Genre, word, n)
Note: On the benefits of using a weighted log odds ratio for text analysis when the analytical question focuses on differences in frequency across sets, see Introducing tidylo.
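To make the idea of a log odds ratio more concrete, here is a minimal base R sketch on invented counts. It computes a simple smoothed log odds ratio (the odds of a word inside a genre against its odds in all other genres). This is an assumption-laden simplification: it is not the weighted estimator implemented in tidylo, and the function name `log_odds_ratio`, the smoothing constant and the toy counts are all hypothetical.

```r
# Toy word-by-genre count matrix (invented numbers, for illustration only)
counts <- matrix(
  c(30,  2,
     5, 25),
  nrow = 2, byrow = TRUE,
  dimnames = list(word = c("garage", "students"),
                  genre = c("Advertisement", "Editorial"))
)

# Smoothed log odds ratio: odds of a word within a genre divided by
# its odds in all other genres. NOT the tidylo weighted version.
log_odds_ratio <- function(counts, smooth = 0.5) {
  n_in    <- counts
  tot_in  <- matrix(colSums(counts), nrow(counts), ncol(counts), byrow = TRUE)
  n_out   <- rowSums(counts) - counts   # occurrences of the word elsewhere
  tot_out <- sum(counts) - tot_in       # all tokens outside the genre
  odds_in  <- (n_in + smooth) / (tot_in - n_in + smooth)
  odds_out <- (n_out + smooth) / (tot_out - n_out + smooth)
  log(odds_in / odds_out)
}

lo <- log_odds_ratio(counts)
lo
```

A positive value means the word is over-represented in that genre relative to the rest of the corpus; a negative value means it is under-represented.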
Visualize the differences between sections:
rs_genre_lo %>%
  group_by(Genre) %>%
  top_n(15, log_odds_weighted) %>%
  ungroup() %>%
  mutate(word = reorder(word, log_odds_weighted)) %>%
  ggplot(aes(log_odds_weighted, word, fill = Genre)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~Genre, scales = "free", ncol = 2) +
  labs(y = NULL)
Stop words (the, of, to) and numbers are the most distinctive terms in the sections “Correspondence (Abroad)”, “Editorial” and “Other”. Words referring to housing and home facilities (such as bathrooms, garage, furniture, rental, heating) and business contacts (apply, phone, box, tel) are most common in advertisements (these were mostly classified advertisements rather than commercial advertisements for consumer goods). Such words also appeared in the “Authored article” and “People & Events” sections, which suggests that some articles in these sections might have been misclassified. The remaining genres contained a richer spectrum of words, which reflects their greater sensitivity to special events or debates.
Except for the possibly misclassified articles we mentioned, these words make sense, but the counts are probably too low to build a good model with. Instead, let’s try using text features like the number of punctuation characters, number of pronouns, and so forth.
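As a rough illustration of what such text features are, the following base R sketch counts a few surface features by hand on two invented strings. The column names are modelled on the `n_` prefix used later in this tutorial, but the helper `count_matches()` and the sample texts are hypothetical, and the textfeatures package computes many more features with its own rules.

```r
# Two invented article snippets (illustration only)
texts <- c(
  "FOR RENT: garage, bath, heating. Tel. 1234!",
  "He returned from America in 1921."
)

# Count how many times a regular expression matches in each string
count_matches <- function(x, pattern) {
  hits <- gregexpr(pattern, x)
  # gregexpr() returns -1 when there is no match, so count positive positions
  vapply(hits, function(h) sum(h > 0), integer(1))
}

features <- data.frame(
  n_chars    = nchar(texts),
  n_commas   = count_matches(texts, ","),
  n_periods  = count_matches(texts, "\\."),
  n_exclaims = count_matches(texts, "!"),
  n_digits   = count_matches(texts, "[0-9]")
)
features
```

Each row describes one text, so articles can be compared on style (punctuation, digits, length) rather than on vocabulary.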
Extract text features using the package textfeatures
tf <- textfeatures(
  rs_genre_featured,   # data frame with a `text` column
  sentiment = FALSE, word_dims = 0,
  normalize = FALSE, verbose = FALSE
)
Visualize text features
tf %>%
  bind_cols(rs_genre_featured) %>%
  group_by(Genre_code, Genre) %>%
  summarise(across(starts_with("n_"), mean)) %>%
  pivot_longer(starts_with("n_"), names_to = "text_feature") %>%
  filter(value > 0.01) %>%
  mutate(text_feature = fct_reorder(text_feature, -value)) %>%
  ggplot(aes(Genre_code, value, fill = Genre)) +
  geom_col(position = "dodge", alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~text_feature, scales = "free", ncol = 4) +
  labs(title = "Text features in the 'returned students' press corpus",
       subtitle = "Distribution by genre (section)",
       x = "Section",
       y = "Mean text features per article line",
       fill = "Genre")
The plot details the text features that characterize each section. More details about the various features here.
Let’s have a closer look at the text features extracted from our corpus:
Note: We should be careful not to overstate or understate these results, especially regarding punctuation, since many errors may have occurred during the OCR and segmentation process. It is not clear, for instance, whether punctuation was systematically extracted from the original text. Conversely, the full text of some articles contains punctuation marks that did not appear in the original article (for instance, we noticed that many advertisements start with quotation marks that were absent from the original). The last two categories contained a variety of articles, which makes it difficult to identify any stylistic coherence.
We can start by loading the tidymodels metapackage, and splitting our data into training and testing sets.
library(tidymodels)
rs_genre_tf_split <- initial_split(rs_genre_featured, strata = Genre_code)
rs_genre_tf_train <- training(rs_genre_tf_split)
rs_genre_tf_test <- testing(rs_genre_tf_split)
Next, we create cross-validation resamples of the training data, to evaluate our models:
rs_genre_tf_folds <- vfold_cv(rs_genre_tf_train, strata = Genre_code)
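The logic of a stratified split can be sketched in base R on invented labels. This is only an approximation of the idea, assuming a 75/25 split within each class; rsample has its own handling of strata (for example for very small classes), and the object names below are hypothetical.

```r
set.seed(123)

# Invented, imbalanced class labels (illustration only)
genre <- factor(rep(c("a", "b", "c"), times = c(40, 20, 8)))

# Within each class, draw 75% of the row indices for training
train_idx <- unname(unlist(lapply(
  split(seq_along(genre), genre),
  function(idx) idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
)))
test_idx <- setdiff(seq_along(genre), train_idx)

table(genre[train_idx])  # class proportions are preserved in the training set
table(genre[test_idx])
```

Sampling within each class keeps rare genres represented in both the training and the testing sets, which a purely random split does not guarantee.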
Next, we preprocess our data to get it ready for modeling:
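library(textrecipes)  # provides step_textfeature()
library(themis)       # provides step_downsample()
rs_genre_tf_rec <- recipe(Genre_code ~ text, data = rs_genre_tf_train) %>%
  step_downsample(Genre_code) %>%
  step_textfeature(text) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())
rs_genre_tf_prep <- prep(rs_genre_tf_rec)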
## Data Recipe
## Inputs:
## role #variables
## outcome 1
## predictor 1
## Training data contained 828 data points and no missing data.
## Operations:
## Down-sampling based on Genre_code [trained]
## Text feature extraction for text [trained]
## Zero variance filter removed 16 items [trained]
## Centering and scaling for 11 items [trained]
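Let’s walk through the steps in this recipe. First, recipe() declares the model formula (Genre_code explained by text) and the training data. Then step_downsample() downsamples the more frequent genres, since the classes are imbalanced. step_textfeature() extracts the text features from the text column. step_zv() removes the predictors with zero variance (16 here). Finally, step_normalize() centers and scales the remaining predictors (11 here).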
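To see what downsampling and normalization do to the data, here is a small base R sketch on an invented data frame. It assumes that downsampling reduces every class to the size of the rarest one and that normalization means centering to mean 0 and scaling to standard deviation 1; the data and object names are hypothetical and this is not the recipes implementation.

```r
set.seed(42)

# Invented data: an imbalanced outcome and one numeric text feature
toy <- data.frame(
  genre   = rep(c("x", "y"), times = c(10, 4)),
  n_chars = seq(100, 1400, by = 100)
)

# Downsample: keep as many rows per class as the rarest class has
min_n <- min(table(toy$genre))
keep <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$genre),
  function(idx) idx[sample.int(length(idx), min_n)]
))
down <- toy[keep, ]

# Normalize: center and scale the predictor
down$n_chars_z <- as.numeric(scale(down$n_chars))

table(down$genre)
round(c(mean = mean(down$n_chars_z), sd = sd(down$n_chars_z)), 3)
```

Downsampling discards observations from the frequent classes, so it trades sample size for balance; this matters with a training set of fewer than a thousand articles.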
We’re mostly going to use this recipe in a workflow() so we don’t need to stress too much about whether to prep() or not. Since we are going to compute variable importance, we will need to come back to juice(rs_genre_tf_prep).
Let’s compare two different models, a random forest model and a support vector machine model. We start by creating the model specifications.
rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")
## Random Forest Model Specification (classification)
## Main Arguments:
## trees = 1000
## Computational engine: ranger
svm_spec <- svm_rbf(cost = 0.5) %>%
  set_engine("kernlab") %>%
  set_mode("classification")
## Radial Basis Function Support Vector Machine Specification (classification)
## Main Arguments:
## cost = 0.5
## Computational engine: kernlab
Next we build a tidymodels workflow(), a helper object to help manage modeling pipelines with pieces that fit together like Lego blocks. Notice that there is no model yet (Model: None).
rs_genre_tf_wf <- workflow() %>%
  add_recipe(rs_genre_tf_rec)
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: None
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
## ● step_downsample()
## ● step_textfeature()
## ● step_zv()
## ● step_normalize()
Now we can add a model, and then fit it to each of the resamples.
First, we can fit the random forest model:
(The model failed)
Second, we can fit the support vector machine model.
svm_rs <- rs_genre_tf_wf %>%
  add_model(svm_spec) %>%
  fit_resamples(
    resamples = rs_genre_tf_folds,
    metrics = metric_set(roc_auc, accuracy, sens, spec),
    # save_pred = TRUE keeps the out-of-fold predictions for later inspection
    control = control_resamples(save_pred = TRUE)
  )
We have now fit the support vector machine model to our resampled training set!
Confusion matrix for the first fold
svm_rs %>%
  collect_predictions() %>%
  filter(id == "Fold01") %>%
  conf_mat(Genre_code, .pred_class) %>%
  autoplot(type = "heatmap") +
  scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
  scale_x_discrete(labels = function(x) str_wrap(x, 20))
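For readers unfamiliar with confusion matrices, the following base R sketch builds one from invented predictions and derives accuracy and per-class sensitivity from it. The labels reuse three of our genre codes, but the predictions are made up and the object names are hypothetical; it does not reproduce the results of the model above.

```r
# Invented true and predicted genre codes (illustration only)
lev   <- c("0", "1", "5")
truth <- factor(c("0", "0", "1", "1", "5", "5"), levels = lev)
pred  <- factor(c("0", "1", "1", "1", "5", "0"), levels = lev)

# Rows are predictions, columns are true classes
cm <- table(Prediction = pred, Truth = truth)
cm

# Accuracy: share of articles on the diagonal (correctly classified)
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions divided by true class size
sensitivity <- diag(cm) / colSums(cm)

accuracy
sensitivity
```

Off-diagonal cells show which genres are confused with each other, which is often more informative than a single accuracy figure when classes are imbalanced.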
In the next tutorial, we will develop an alternative method using the package stylo.