Abstract
This document applies various text analysis techniques to the titles of periodicals listed in Crow's Newspaper Directory of China (1931-1935).
Outline:
- Simple word tokenization
- Bigrams
- Trigrams
- Multiword tokenization
- Collocations
- Biterm topic modeling
We load the original dataset containing 1060 periodicals and 47 variables drawn from the two directories of 1931 and 1935:
library(readr)
library(tidyverse) # dplyr, stringr and tibble, used throughout
library(DT)        # datatable()
crowdata <- read_delim("crowdata.csv", ";",
escape_double = FALSE, trim_ws = TRUE)
datatable(crowdata)
We select the variables we will use for the analysis of titles:
crow_title <- crowdata %>% select(Year, City_py, City_zh,
Title_eng, Title_zh,
Established, Periodicity, Language)
datatable(crow_title)
We create a unique identifier for each periodical:
crow_title_id <- rowid_to_column(crow_title) %>%
rename(id = rowid) %>%
mutate(id = paste(City_py, Periodicity, id, sep = "_"))
We pre-process the text data so that we will not have to do it during the tokenization process. By pre-processing, we mean removing function words (The, L', Les), punctuation and possessive "'s". We finally delete the blank spaces that may have been created in the process:
crow_title_id <- crow_title_id %>%
  mutate(title = str_remove_all(Title_eng, ", The")) %>%
  mutate(title = str_remove_all(title, "\\bThe\\b")) %>% # word boundaries avoid mangling words such as "Theatre"
  mutate(title = str_remove_all(title, ", L'")) %>%
  mutate(title = str_remove_all(title, "\\bLes\\b")) %>%
  mutate(title = str_remove_all(title, "\\band\\b"))
# strip possessive 's and any character that is neither alphanumeric nor blank
crow_title_id$title <- gsub("['’]s\\b|[^[:alnum:] [:blank:]]", "", crow_title_id$title)
# trim leading and trailing whitespace created by the removals
crow_title_id$title <- trimws(crow_title_id$title, which = "both")
Note that we create a new variable (title) for the pre-processed text: it is good practice to preserve the original text data (Title_eng). The new column "title" contains the clean text we will use for tokenization:
datatable(crow_title_id %>% select(id, Title_eng, title))
Since we will examine each edition separately, we split the data into two distinct samples:
crow_title_1931 <- crow_title_id %>% filter(Year == "1931")
crow_title_1935 <- crow_title_id %>% filter(Year == "1935")
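A quick check on the resulting sample sizes (the 1935 sample should contain the 702 periodicals referred to throughout the analyses below):
nrow(crow_title_1931)
nrow(crow_title_1935) # 702 periodicals in the 1935 edition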
We"re good to go !!
Since text analysis techniques were initially developed on English-language data, they have been most commonly applied to, and still usually work better with, English-language texts. We will therefore start by exploring the titles in English (Title_eng). In a second step, we will explore the titles in Chinese (Title_zh).
For the analysis of English titles, we rely on three main packages and proceed in several steps.
We start with the 1935 sample, because its larger size and wider time scope will yield more interesting results.
We load the packages:
library(tidyverse)
library(tidytext)
library(knitr)      # kable()
library(kableExtra) # kable_styling()
First we create a clean dataset of tokenized text. We remove any remaining punctuation and stop words (e.g. "the"), and we lowercase each word. The resulting table contains an additional column with the word extracted from each title. Each observation now corresponds to a word, no longer to a periodical. The dataset now contains 1900 observations, referring to the 1900 tokens extracted from the titles of the 702 periodicals. Only the first 6 rows are displayed:
data("stop_words")
crow_unigram <- crow_title_1935 %>%
unnest_tokens(output = word, input = title) %>%
anti_join(stop_words)
kable(head(crow_unigram), caption = "First 6 rows") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
id | Year | City_py | City_zh | Title_eng | Title_zh | Established | Periodicity | Language | word |
---|---|---|---|---|---|---|---|---|---|
Shanghai_Daily_2 | 1935 | Shanghai | 上海 | Sin Wan Pao | 新聞報 | 1893 | Daily | Chinese | sin |
Shanghai_Daily_2 | 1935 | Shanghai | 上海 | Sin Wan Pao | 新聞報 | 1893 | Daily | Chinese | wan |
Shanghai_Daily_2 | 1935 | Shanghai | 上海 | Sin Wan Pao | 新聞報 | 1893 | Daily | Chinese | pao |
Shanghai_Daily_5 | 1935 | Shanghai | 上海 | Shanghai Herald | 申報 | 1872 | Daily | Chinese | shanghai |
Shanghai_Daily_5 | 1935 | Shanghai | 上海 | Shanghai Herald | 申報 | 1872 | Daily | Chinese | herald |
Shanghai_Daily_6 | 1935 | Shanghai | 上海 | China Times, The | 時事新報 | 1908 | Daily | Chinese | china |
We can then count the number of words and sort them by their decreasing frequency in the corpus:
unigram_count <- crow_unigram %>%
  count(word, sort = TRUE)
unigram_count
As expected, the three most frequent words referred to the daily press and daily news (news, daily, press). They reflect the heavy weight of daily newspapers as we showed in a previous essay. The next most frequent words describe the geographical situation (China) and the increasingly commercial nature of the press in Republican China (commercial). Other recurring words reflect the multiple rhythms of the press (evening, weekly), the centrality of Shanghai as the major publishing center, and the emergence of political values associated with the Republican regime (Republican, people).
What words were most frequently associated with each category of periodical?
Let's find the ten most frequent words for each category of periodical:
top10_word <- crow_unigram %>%
group_by(Periodicity, word) %>%
tally() %>%
arrange(Periodicity, desc(n)) %>%
group_by(Periodicity) %>%
top_n(10)
top_periodicity <- top10_word %>% filter(n>1)
datatable(top10_word)
Let's visualize the results, comparing dailies, tabloids and weeklies:
top10_word %>%
filter(Periodicity %in% c("Daily", "Tabloid", "Weekly"))%>%
group_by(Periodicity) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free") +
labs(x = "Number of occurrences", y = "Word",
title = "Naming periodicals in Republican China",
subtitle = "Ten most frequent words used for dailies, tabloids and weeklies",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
The three categories emphasized the concept of news. Dailies and tabloids emphasized their daily periodicity. The words Sunday and weekly were specific to weeklies. Tabloids and weeklies emphasized their local anchorage in Shanghai, whereas general daily newspapers emphasized more abstract values (republican, people) and, like the two others, larger political entities (China). Tabloids and dailies both had morning editions, but evening editions were specific to daily newspapers. Tabloids focused on entertainment (star, radio, movie) but also shared more serious concerns (people, peace). Other terms described the specific nature of each periodical - pao for dailies (the Wade-Giles transliteration of "bao" 報, newspaper), herald for weeklies. Pictorial indicated that weeklies generally contained more illustrations than dailies. Review and critic pointed to their critical distance, in contrast to the instantaneity of daily newspapers.
Let’s now compare less frequent periodicals:
top10_word %>%
filter(Periodicity %in% c("Annual", "Monthly", "Quarterly"))%>%
filter(word %in% top_periodicity$word) %>%
group_by(Periodicity)%>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free") +
labs(x = "Number of occurrences", y = "Word",
title = "Naming periodicals in Republican China",
subtitle = "Ten most frequent words used for annuals, monthlies and quarterlies",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
Annuals and quarterlies were barely significant from a statistical point of view. Monthlies showed an interestingly wide range of terms that reflected the diversity of their content and intended audience. Besides words that referred to their periodicity (monthly, magazine, journal), their geographical situation (China, Chinese, Shanghai) and their critical distance (review), we find more specific terms delineating two main groups of monthlies, each with its specific function and public: intellectual or professional journals catering to a highly educated and specialized readership (medical, review, commerce) and popular often illustrated magazines commonly associated with women, entertainment and modernity (pictorial, magazine, modern).
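Several of the plots below force integer breaks on the x-axis through a custom integer_breaks() helper. This function is not part of ggplot2; here is a minimal sketch of one common formulation, assumed in what follows:
# assumed helper: returns a breaks function that keeps only
# unique, integer-valued breaks derived from pretty()
integer_breaks <- function(n = 5, ...) {
  function(x) {
    unique(floor(pretty(x, n, ...)))
  }
}
The helper is then passed to ggplot2 via scale_x_continuous(breaks = integer_breaks()).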
Irregular periodicals with intermediate frequency:
top10_word %>%
filter(Periodicity %in% c("Bimonthly", "Biweekly", "Semi-monthly", "Semi-weekly"))%>%
filter(word %in% top_periodicity$word)%>%
group_by(Periodicity)%>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = Periodicity)) +
geom_col(show.legend = FALSE)+
scale_x_continuous(breaks = integer_breaks())+
facet_wrap(~ Periodicity, scales = "free") +
labs(x = "Number of occurrences", y = "Word",
title = "Naming periodicals in Republican China",
subtitle = "Ten most frequent words used for intermediate periodicals",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
An alternative, widely used method for weighting terms according to their semantic contribution to a document or a group of documents is the term frequency-inverse document frequency measure (TF-IDF). The intuition is that the more often a term occurs within a document, the more it contributes to that document; at the same time, the more documents a term occurs in, the less informative it is for any single document. The product of the two measures gives the resulting weight.
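Formally, for a term t in a document (or group of documents) d within a collection of N documents, the weight computed by tidytext's bind_tf_idf() is:
$$\mathrm{tf\text{-}idf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln \frac{N}{\left|\{d' : t \in d'\}\right|}$$
where the first factor is the term frequency (tf) and the second the inverse document frequency (idf).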
Let’s compute the TF-IDF weights for the words extracted from the titles of periodicals for each category of periodicals (periodicity):
crow_unigram_periodicity_tf_idf <- crow_unigram %>%
count(Periodicity, word) %>%
bind_tf_idf(word, Periodicity, n) %>%
arrange(desc(tf_idf))
kable(head(crow_unigram_periodicity_tf_idf), caption = "TF-IDF of terms used for naming periodicals, per periodicity group (first 6 rows)") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
Periodicity | word | n | tf | idf | tf_idf |
---|---|---|---|---|---|
Bimonthly | electrical | 1 | 0.3333333 | 2.302585 | 0.7675284 |
Annual | directory | 3 | 0.2000000 | 2.302585 | 0.4605170 |
Semi-monthly | semimonthly | 5 | 0.1923077 | 2.302585 | 0.4428048 |
Bimonthly | engineering | 1 | 0.3333333 | 1.203973 | 0.4013243 |
Biweekly | steel | 1 | 0.1000000 | 2.302585 | 0.2302585 |
Weekly | weekly | 24 | 0.1355932 | 1.609438 | 0.2182289 |
The major advantage of TF-IDF compared to simple frequency is that it emphasizes the words that are specific to a given document (or group of documents), instead of the most common words overall. For instance, we see in the table that technical words such as "electrical" or "engineering" were specifically associated with bimonthlies, which reflects the highly specialized nature of their content and their readership profile. The high tf-idf of the word "directory" in relation to annuals indicates that directories as a genre were generally published on a yearly basis.
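As a worked check of the first row of the table above: "electrical" accounts for 1 of the 3 word tokens found in bimonthly titles, and it occurs in only 1 of the 10 periodicity categories, hence
$$\mathrm{tf\text{-}idf} = \frac{1}{3} \times \ln\frac{10}{1} \approx 0.3333 \times 2.3026 \approx 0.7675,$$
which matches the tf_idf value reported in the table.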
We can plot the top tf-idf words for each category:
crow_unigram_periodicity_tf_idf %>%
group_by(Periodicity) %>%
top_n(5, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free") +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the names of periodicals",
subtitle = "tf-idf by periodicity",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
Let’s focus on the three most frequent categories:
crow_unigram_periodicity_tf_idf %>%
filter(Periodicity %in% c("Daily", "Tabloid", "Weekly"))%>%
group_by(Periodicity) %>%
top_n(5, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free", ncol=3) +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the names of periodicals",
subtitle = "tf-idf in most frequent periodicals",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
The terms news, daily and press were commonly used for naming both dailies and tabloids, in the same order of importance. Evening editions were specific to daily newspapers, whereas tabloids more often appeared in the morning. The high tf-idf of the word republican suggests that dailies tended to privilege "serious" political news, whereas tabloids were more often devoted to entertainment (star, radio, movie). Weeklies featured totally different words that indicated their specific rhythmic patterns (weekly, Sunday), their emphasis on illustrations (pictorial) and their critical distance (review).
We now compare with the less frequent periodicals:
crow_unigram_periodicity_tf_idf %>%
filter(Periodicity %in% c("Annual", "Monthly", "Quarterly"))%>%
group_by(Periodicity) %>%
top_n(5, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free", ncol=3) +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the names of periodicals",
subtitle = "tf-idf in less frequent periodicals",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
As we observed earlier, the words directory and list indicated that annuals were closely related to directories as a specific genre. The word including suggests that such directories often displayed long, descriptive titles that provided additional information on their content. Other terms referred to their publisher. The words associated with monthlies related to their periodicity (monthly, quarterly), their professional or popular content (medical, magazine) and their critical distance (review). Quarterlies were more strongly associated with technical terms that emphasized their highly specialized contents (physiology, nursing, naturalist medica, economics, caduceus, agricultural).
Finally, we compare irregular periodicals:
crow_unigram_periodicity_tf_idf %>%
filter(Periodicity %in% c("Bimonthly", "Biweekly", "Semi-monthly", "Semi-weekly"))%>%
group_by(Periodicity) %>%
top_n(5, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free", ncol=2) +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the names of periodicals",
subtitle = "tf-idf in irregular periodicals",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
As previously observed, bimonthlies stressed their technical contents (electrical, engineering), whereas biweeklies emphasized political and philosophical concepts (steel, people, truth). Note that steel did not refer to the metallic material but was used in a metaphorical sense to denote political strength. Semi-monthlies appealed to their readers’ social or political sensibility (children, national defense, forum, companion). Semi-weeklies highlighted their locality and focused on popular entertainments (diamond, robin, player).
From the previous observations, we suspect that some terms were shared between different periodicals whereas others were more specific to one or two categories. For instance, we noticed that the words “daily” and “news” were shared between dailies and tabloids, or that weeklies and monthlies had the word “review” in common. What periodicals were the most similar to each other based on the words they shared? What words were most common/specific in the press lexicon?
To answer these questions, we use the widyr package to compute the pairwise correlation between categories of periodicals, based on the words they have in common. We use the ggraph and igraph packages to visualize the correlations as a network graph. For the sake of legibility, the network contains only the strongest correlations (>0.15):
library(widyr)
crow_title_cors <- crow_unigram %>%
group_by(Periodicity, word) %>%
count() %>%
pairwise_cor(Periodicity, word, n, sort = TRUE)
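Under the hood, pairwise_cor() computes Pearson correlations between the word-count vectors of each pair of categories. A minimal sketch of the equivalent computation in base R and tidyr (a hypothetical cross-check, not part of the main workflow):
# cast the counts to a periodicity-by-word matrix (missing counts = 0)
wide <- crow_unigram %>%
  count(Periodicity, word) %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0)
mat <- as.matrix(wide[, -1])
rownames(mat) <- wide$Periodicity
# Pearson correlations between the rows (categories of periodicals)
cor(t(mat))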
library(ggraph)
library(igraph)
set.seed(2017)
crow_title_cors %>%
filter(correlation > .15) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation, width = correlation)) +
geom_node_point(size = 12, color = "lightblue") +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void() +
labs(title = "Naming periodicals in Republican China",
subtitle = "Word correlation between categories of periodicals",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
The network graph confirms the strong correlation between dailies and tabloids and further reveals their lexical proximity with semi-weeklies and biweeklies. The second main cluster groups monthlies with semi-monthlies, quarterlies and, to a lesser extent, bimonthlies. Weeklies held a bridging position between the two groups. Annuals held an outlying position, which reflects their statistical insignificance. Overall, the lexical similarity of periodicals' titles reflects their temporal proximity. The two clusters opposed two major types of periodicity - daily-based (most frequent) and monthly-based (less frequent) - with intermediate frequencies (weeklies) in between.
Did the naming of periodicals depend on their language?
As we did earlier for periodicity, let's find the ten most frequent words for each language:
top10_word_lang <- crow_unigram %>%
group_by(Language, word) %>%
tally() %>%
arrange(Language, desc(n)) %>%
group_by(Language) %>%
top_n(10)
top_language <- top10_word_lang %>% filter(n>1)
datatable(top10_word_lang)
Let’s compare the words used in two major language groups, Chinese and English:
top10_word_lang %>%
filter(Language %in% c("Chinese", "English"))%>%
group_by(Language) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = Language)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Language, scales = "free", ncol=2) +
labs(x = "Frequency", y = "Word",
title = "Naming periodicals in Republican China",
subtitle = "Chinese and English periodicals (10 most frequent words)",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
News, press and daily were much more commonly used for Chinese periodicals than they were for English periodicals. English editors, by contrast, emphasized the locality (China, Chinese, Shanghai, Hongkong). The ranking also reflects differences in the nature of periodicals across languages. Dailies dominated the Chinese press, whereas the English press consisted mainly of weeklies and less frequent publications. Finally, the Chinese press placed a stronger emphasis on political values (people, republican).
top10_word_lang %>%
filter(Language %in% c("Japanese", "Russian"))%>%
group_by(Language) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = Language)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Language, scales = "free", ncol=2) +
scale_x_continuous(breaks = integer_breaks())+
labs(x = "Frequency", y = "Word",
title = "Naming periodicals in Republican China",
subtitle = "Japanese and Russian periodicals (10 most frequent words)",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
The native language strongly shaped the names of Japanese and Russian periodicals. The place of publication (Shanghai, Manchuria) was also commonly used to differentiate local editions or communities.
Other foreign languages were statistically insignificant.
Let’s compute the TF-IDF weights for the words extracted from the titles of periodicals for each language group:
crow_unigram_language_tf_idf <- crow_unigram %>%
count(Language, word) %>%
bind_tf_idf(word, Language, n) %>%
arrange(desc(tf_idf))
kable(head(crow_unigram_language_tf_idf), caption = "TF-IDF of terms used for naming periodicals, per language group (first 6 rows)") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
Language | word | n | tf | idf | tf_idf |
---|---|---|---|---|---|
Russian | zaria | 2 | 0.4000000 | 1.94591 | 0.7783641 |
Bilingual | nursing | 1 | 0.3333333 | 1.94591 | 0.6486367 |
French | de | 1 | 0.2500000 | 1.94591 | 0.4864775 |
French | le | 1 | 0.2500000 | 1.94591 | 0.4864775 |
German | deutschchinesische | 1 | 0.2000000 | 1.94591 | 0.3891820 |
German | deutsche | 1 | 0.2000000 | 1.94591 | 0.3891820 |
As we did previously for periodicity, we can examine word similarity between languages:
crow_title_cors_lang <- crow_unigram %>%
group_by(Language, word) %>%
count() %>%
pairwise_cor(Language, word, n, sort = TRUE)
set.seed(2017)
crow_title_cors_lang %>%
filter(correlation > .15) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation, width = correlation)) +
geom_node_point(size = 12, color = "lightblue") +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void() +
labs(title = "Naming periodicals in Republican China",
subtitle = "Word correlation between language",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
The strongest correlation (0.4) linked English and bilingual (Chinese-English) periodicals. The next pairs associated English and Chinese, English and French, and to a lesser extent, French-bilingual and French-Japanese (0.3). English served as a bridge between Chinese and foreign languages. German and Russian held outlying positions and were diametrically opposed to Chinese periodicals. Japanese bridged English with minority languages (German, Russian, French). We should beware of over-interpreting these results, given the statistical insignificance of all foreign languages except English. In fact, Shanghai was the only word they had in common.
Were some words more frequently used in certain cities than in others? Can we detect spatial, geographical patterns in the naming of periodicals?
For instance, we hypothesize that the differentiation between morning/evening editions, as well as Sunday editions, was specific to large urban centers such as Shanghai, Tianjin, Hongkong, Nanjing or Beijing.
datatable(crow_unigram %>%
filter(word %in% c("evening", "morning", "noon", "sunday")) %>%
group_by(City_py, City_zh, word) %>%
count() %>%
arrange(word, desc(n)))
Interestingly, evening editions were more developed in Chongqing (Sichuan) and Tianjin than in Shanghai and Nanjing. Smaller cities like Changzhou, Jinan, Taiyuan and Zhenjiang also offered at least two evening editions each. By contrast, Beijing and southern cities such as Fuzhou, Guangzhou or Hongkong offered just one edition. Morning editions appeared less popular. They were most developed in Tianjin (4), ahead of Shanghai (2). The remaining localities scattered across China offered one morning edition each. Noon editions were restricted to four cities - Chongqing, Harbin, Tianjin, Hongkong - also dispersed throughout China. Sunday editions were exclusive to three large metropolitan centers - Shanghai, Tianjin, Hongkong.
Further research is needed to analyze the spatial patterns in the naming of periodicals across provinces and cities.
Did the titles of periodicals reflect their founding period, the time during which they were established? Can we detect temporal patterns or time effects in the naming of periodicals in Republican China?
One way to proceed is to pick out the most important words for each category of periodicals or each lexical category, and to examine how they occurred over time. We can select several words simultaneously and examine how they evolved in parallel over time.
Let's examine, for instance, the words related to the nature and frequency of periodicals (the plot focuses on the period after 1900, since there were too few periodicals before then):
crow_unigram %>%
filter(word %in% c("daily", "weekly", "monthly", "journal", "magazine", "review", "pictorial", "news")) %>%
group_by(Established, word) %>%
filter(Established > 1900) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word))+
labs(title = "Naming periodicals in modern China",
subtitle = "Words indicating the genre of periodicals",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
News and daily were consistently used during the entire period. Review emerged quite early and persisted over time, though it was less commonly used than other terms. Journal appeared in the early years of the Republic and was used intermittently. It was not until the 1930s that weekly and monthly gained currency, concurrently with the words "pictorial" and "magazine". They pointed to the later emergence of popular magazines, which we highlighted earlier.
crow_unigram %>%
filter(word %in% c("news", "press")) %>%
group_by(Established, word) %>%
filter(Established > 1900) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word), position = "dodge")+
labs(title = "Naming periodicals in modern China",
subtitle = "Word associated with the press industry",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
crow_unigram %>%
filter(word %in% c("morning", "evening")) %>%
group_by(Established, word) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word), show.legend = FALSE)+
facet_wrap(~ word, scales = "free") +
scale_y_continuous(breaks = integer_breaks())+
labs(title = "Naming periodicals in modern China",
subtitle = "Time-related words",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
Morning and evening newspapers both appeared in the late imperial years and peaked in the 1930s. The two terms were barely used in between. No morning dailies were established in the early 1920s (or more accurately, no editor chose the term "morning" to name their newspaper), in contrast with two occurrences of "evening" during the same period.
crow_unigram %>%
filter(word %in% c("people", "republican", "voice", "public", "citizen", "impartial", "truth", "nation", "national")) %>%
group_by(Established, word) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word))+
labs(title = "Naming periodicals in modern China",
subtitle = "Word associated with republican values",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
crow_unigram %>%
filter(word %in% c("north", "south", "east", "eastern")) %>%
filter(Established > 1900) %>%
group_by(Established, word) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word))+
labs(title = "Naming periodicals in modern China",
subtitle = "Geo-based words",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
crow_unigram %>%
filter(word %in% c("shanghai", "tientsin", "hongkong", "szechuen", "tsingtao", "peiping", "nantung", "canton", "tsinan", "wenchow")) %>%
group_by(Established, word) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word))+
labs(title = "Naming periodicals in modern China",
subtitle = "Place names",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
crow_unigram %>%
filter(word %in% c("industrial", "commercial", "economic", "medical", "engineering","education", "movie")) %>%
group_by(Established, word) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word))+
labs(title = "Naming periodicals in modern China",
subtitle = "Field-based Words",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
An alternative method would consist in grouping years by decade, or in reusing the four-phase periodization we designed in a previous essay. We will apply this method later using the quanteda package and the log-likelihood ratio test.
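A minimal sketch of the decade-based variant (hypothetical; the four-phase periodization itself is applied in the wordcloud section below):
# aggregate word counts by decade of establishment
crow_unigram %>%
  filter(Established > 1900) %>%
  mutate(decade = Established - Established %% 10) %>%
  count(decade, word, sort = TRUE)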
Unigram tokenization offers a preliminary solution for exploring the names of periodicals, but it presents serious limitations: it cannot handle compound expressions such as "daily news" or "republican daily news". In the next sections, we will explore a few methods for handling multiword units (bigrams, trigrams, multiword tokenization) and relations between words (collocations and co-occurrences).
Wordclouds are less informative but more visually striking than conventional plots. In this section, we create a series of wordclouds to compare word frequencies depending on the periodicity, the language, and the founding period of periodicals.
Load package:
library(wordcloud2)
crow_unigram_periodicity <- crow_unigram %>%
group_by(Periodicity) %>%
count(word) %>%
mutate(n = as.numeric(n))
crow_unigram_periodicity %>%
arrange(Periodicity, desc(n))
“Daily” Wordcloud:
wc_daily <- crow_unigram_periodicity %>%
filter(Periodicity == "Daily") %>%
select(word, n)
wc_daily$Periodicity <- NULL
wordcloud2(wc_daily, size = 2)
Tabloid:
wc_tabloid<- crow_unigram_periodicity %>%
filter(Periodicity == "Tabloid") %>%
select(word, n)
wc_tabloid$Periodicity <- NULL
wordcloud2(wc_tabloid, size = 2)
Weekly:
wc_weekly<- crow_unigram_periodicity %>%
filter(Periodicity == "Weekly") %>%
select(word, n)
wc_weekly$Periodicity <- NULL
wordcloud2(wc_weekly, size = 2,
rotateRatio = 1, backgroundColor = "grey", color = "random-light")
Monthly:
wc_monthly<- crow_unigram_periodicity %>%
filter(Periodicity == "Monthly") %>%
select(word, n)
wc_monthly$Periodicity <- NULL
wordcloud2(wc_monthly, minSize = 2,
rotateRatio = 1, backgroundColor = "grey", color = "random-light", shape = 'circle')
crow_unigram_lang <- crow_unigram %>%
group_by(Language) %>%
count(word) %>%
mutate(n = as.numeric(n))
crow_unigram_lang %>%
arrange(Language, desc(n))
Wordcloud of Chinese periodicals:
wc_zh <- crow_unigram_lang%>%
filter(Language == "Chinese") %>%
select(word, n)
wc_zh$Language <- NULL
wordcloud2(wc_zh, size = 2, color = "random-light", backgroundColor = "grey", minRotation = -pi/2, maxRotation = -pi/2)
Wordcloud of English periodicals:
wc_eng <- crow_unigram_lang %>%
filter(Language == "English") %>%
select(word, n)
wc_eng$Language <- NULL
wordcloud2(wc_eng, size = 2, color = "random-light", backgroundColor = "grey")
Wordcloud of other foreign periodicals:
wc_for <- crow_unigram_lang %>%
filter(!Language %in% c("Chinese", "English")) %>%
select(word, n)
wc_for$Language <- NULL
wordcloud2(wc_for, color = "random-light", backgroundColor = "grey")
# assign each periodical to one of four phases based on its founding year
crow_unigram <- crow_unigram %>%
  mutate(Period = cut(Established,
                      breaks = c(0, 1903, 1916, 1927, 1935),
                      labels = c("1829-1903", "1904-1916", "1917-1927", "1928-1935"),
                      include.lowest = TRUE))
crow_unigram_period <- crow_unigram %>%
group_by(Period) %>%
count(word) %>%
mutate(n = as.numeric(n))
crow_unigram_period %>%
arrange(Period, desc(n))
Wordcloud for the first phase (1829-1903):
wc_tp1 <- crow_unigram_period %>%
filter(Period == "1829-1903") %>%
select(word, n)
wc_tp1$Period <- NULL
wordcloud2(wc_tp1, color = "random-light", backgroundColor = "grey")
Wordcloud for the second phase (1904-1916):
wc_tp2 <- crow_unigram_period %>%
filter(Period == "1904-1916") %>%
select(word, n)
wc_tp2$Period <- NULL
wordcloud2(wc_tp2, size = 2, color = "random-light", backgroundColor = "grey")
Wordcloud for the third phase (1917-1927):
wc_tp3 <- crow_unigram_period %>%
filter(Period == "1917-1927") %>%
select(word, n)
wc_tp3$Period <- NULL
wordcloud2(wc_tp3, size = 2, color = "random-light", backgroundColor = "grey")
Wordcloud for the last phase (1928-1935):
wc_tp4 <- crow_unigram_period %>%
filter(Period == "1928-1935") %>%
select(word, n)
wc_tp4$Period <- NULL
wordcloud2(wc_tp4, size = 2, color = "random-light", backgroundColor = "grey")
In order to handle multiword units such as “daily news”, we will tokenize the text into bigrams (two adjacent words):
crow_bigram <- crow_title_1935 %>%
unnest_tokens(output = bigram,
input = title,
token = "ngrams",
n = 2)
# separate bigrams into words for cleaning
crow_bigram_separated <- crow_bigram %>%
separate(bigram, c("word1", "word2"), sep = " ")
# remove stop words from both word positions
crow_bigram_filtered <- crow_bigram_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# reunite bigrams
crow_bigram_united <- crow_bigram_filtered %>%
drop_na(word1) %>%
drop_na(word2) %>%
unite(bigram, word1, word2, sep = " ")
kable(head(crow_bigram_united), caption = "First 6 rows") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
id | Year | City_py | City_zh | Title_eng | Title_zh | Established | Periodicity | Language | bigram |
---|---|---|---|---|---|---|---|---|---|
Shanghai_Daily_2 | 1935 | Shanghai | 上海 | Sin Wan Pao | 新聞報 | 1893 | Daily | Chinese | sin wan |
Shanghai_Daily_2 | 1935 | Shanghai | 上海 | Sin Wan Pao | 新聞報 | 1893 | Daily | Chinese | wan pao |
Shanghai_Daily_5 | 1935 | Shanghai | 上海 | Shanghai Herald | 申報 | 1872 | Daily | Chinese | shanghai herald |
Shanghai_Daily_6 | 1935 | Shanghai | 上海 | China Times, The | 時事新報 | 1908 | Daily | Chinese | china times |
Shanghai_Daily_7 | 1935 | Shanghai | 上海 | Eastern Times | 時報 | 1904 | Daily | Chinese | eastern times |
Tianjin_Daily_8 | 1935 | Tianjin | 天津 | Social Welfare | 益世報 | 1915 | Daily | Chinese | social welfare |
The resulting table contains an additional column with the bigrams extracted from each title. Each observation now corresponds to a bigram, instead of a periodical as in the original dataset or a word as in the unigram tokenization. The dataset now contains 1126 observations, referring to the 1126 bigrams extracted from the titles of the 702 periodicals (only the first 6 rows are displayed above). Missing values (NA), removed with drop_na(), correspond to short titles consisting of just one word.
We can then compute and visualize the frequencies and tf-idf as we did for unigrams:
# count bigrams
crow_bigram_count <- crow_bigram_united %>%
count(bigram, sort = TRUE)
crow_bigram_count
As expected, the most frequent bigrams are daily/news (168 occurrences) and to a lesser extent, republican/daily (27). These three words were to be found in the common trigram “republican daily news”. Less expected is the pair people/press (29) which points to the emerging notion of public opinion in the Republican press. Three other important bigrams emphasized the development of the press as an industry in itself (commercial press, commercial daily) or as a medium for informing the public about industrial and commercial issues (industrial commercial). Finally, the bigrams evening news and morning press reflect the diversification of the rhythmic patterns of the Republican press, with the differentiation between morning and evening editions.
When did these terms appear? Did the Republican-inspired titles coincide with the founding of the Republic? When did the press begin to emphasize its commercial nature and to diversify its editions? We can select the bigrams we are interested in and plot their distribution over time, as we did for unigrams:
library(plotly) # for ggplotly()
p <- crow_bigram_united %>%
filter(Established > 1900) %>%
filter(bigram %in% c("daily news", "republican daily", "people press")) %>%
group_by(Established, bigram) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = bigram))+
facet_wrap(~ bigram, scales = "free_y", nrow = 3) +
labs(title = "Naming periodicals in modern China: Periodizing the most frequent bigrams",
subtitle = "Periodizing the three most frequent bigrams",
x = "Year",
y = "Frequency",
fill = "Bigram",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
fig <- ggplotly(p)
fig
Notice that we reduced the timescale to the 20th century, since there was only one occurrence of "daily news" prior to 1900 (the North-China Daily News). The first "republican daily" was established during the first year of the Republic, in 1912. The term did not decline and even gained in popularity during the following decade: two more were established in 1922 and 1925, and their number dramatically increased under the Nationalist regime. The term "people press" appeared later (1917). Its trajectory was less continuous, with two peaks in 1927 and 1932. The words "daily news", by contrast, were less dependent on the political chronology. They appeared in the late imperial years; as already observed, the first occurrence referred to the North-China Daily News in 1850. They enjoyed a continuous yet limited presence until 1927 and peaked in the early 1930s.
p <- crow_bigram_united %>%
filter(bigram %in% c("evening news", "morning press")) %>%
group_by(Established, bigram) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = bigram))+
facet_wrap(~ bigram, scales = "free_y", nrow = 3) +
labs(title = "Naming periodicals in modern China: Periodizing rhythmic patterns",
subtitle = "Periodizing rhythmic patterns",
x = "Year",
y = "Frequency",
fill = "Bigram",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
fig <- ggplotly(p)
fig
Based on bigrams, it appears that evening editions appeared earlier than morning editions but persisted during the Nanjing decade, concurrently with the development of morning editions. This largely supports our previous analysis based on single words.
p <- crow_bigram_united %>%
filter(bigram %in% c("commercial press", "commercial daily", "industrial commercial")) %>%
group_by(Established, bigram) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = bigram))+
facet_wrap(~ bigram, scales = "free_y", nrow = 3) +
labs(title = "Naming periodicals in modern China: Periodizing the Commercialization of the press",
subtitle = "Periodizing rhythmic patterns",
x = "Year",
y = "Frequency",
fill = "Bigram",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
fig <- ggplotly(p)
fig
Editors had emphasized the commercial nature or content of their publications since the late imperial years (1907). We observe an alternation between the two major expressions. Commercial daily appeared earlier than commercial press and persisted over the entire period, except for a short eclipse between 1915 and 1924. The term "commercial press" emerged precisely during this eclipse and enjoyed a discontinuous popularity until the early 1930s. The third compound, industrial commercial, referred to the content of specialized journals rather than the commercial turn of the press in general. This accounts for its irregular occurrence over time, with two major peaks in 1915 and 1931.
TF-IDF by periodicity:
crow_bigram_periodicity_tf_idf <- crow_bigram_united %>%
count(Periodicity, bigram) %>%
bind_tf_idf(bigram, Periodicity, n) %>%
arrange(desc(tf_idf))
kable(head(crow_bigram_periodicity_tf_idf), caption = "TF-IDF of bigrams used for naming periodicals, per periodicity group (first 6 rows)") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
Periodicity | bigram | n | tf | idf | tf_idf |
---|---|---|---|---|---|
Bimonthly | electrical engineering | 1 | 1.0000000 | 2.302585 | 2.3025851 |
Biweekly | people press | 2 | 0.3333333 | 1.203973 | 0.4013243 |
Biweekly | china truth | 1 | 0.1666667 | 2.302585 | 0.3837642 |
Biweekly | foochow people | 1 | 0.1666667 | 2.302585 | 0.3837642 |
Biweekly | singling people | 1 | 0.1666667 | 2.302585 | 0.3837642 |
Biweekly | steel press | 1 | 0.1666667 | 2.302585 | 0.3837642 |
TF-IDF by language:
crow_bigram_language_tf_idf <- crow_bigram_united %>%
count(Language, bigram) %>%
bind_tf_idf(bigram, Language, n) %>%
arrange(desc(tf_idf))
kable(head(crow_bigram_language_tf_idf), caption = "TF-IDF of bigrams used for naming periodicals, per language group (first 6 rows)") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
Language | bigram | n | tf | idf | tf_idf |
---|---|---|---|---|---|
Bilingual | nursing journal | 1 | 1.0000000 | 1.94591 | 1.9459101 |
Russian | nasha zaria | 1 | 0.5000000 | 1.94591 | 0.9729551 |
Russian | shanghai zaria | 1 | 0.5000000 | 1.94591 | 0.9729551 |
French | de shanghai | 1 | 0.3333333 | 1.94591 | 0.6486367 |
French | journal de | 1 | 0.3333333 | 1.94591 | 0.6486367 |
French | shanghai le | 1 | 0.3333333 | 1.94591 | 0.6486367 |
Bigrams are useful for handling compound expressions such as "daily news", but they are restricted to adjacent words. Collocations and co-occurrences provide a more sophisticated approach for analyzing the relations between more distant words. This is of limited interest in our case, since we are dealing with very short strings of text that rarely contain more than three words, but it is still instructive to experiment with an alternative method and compare the results with those we obtained using bigrams.
What words most often co-occur in the titles of periodicals? We propose to compare two alternative methods for analyzing co-occurrences in the names of periodicals. First, we rely on tidytext and widyr to adopt a tidy approach. Second, we rely on quanteda and more sophisticated metrics.
First, we compute the most frequent pairs of words in each title. The table below sorts the pairs of words by decreasing frequency of co-occurrence:
library(tidyverse)
library(tidytext)
library(widyr)
word_pairs <- crow_unigram %>%
pairwise_count(word, id, sort = TRUE)
datatable(word_pairs)
The results are similar to those obtained with bigrams. As expected, the most frequent pairs are daily/news (168 occurrences) and, to a lesser extent, republican/news (27) and republican/daily (27). These three words were to be found in the common trigram "republican daily news". The pairs people/press (29), evening/news (20), and commercial/press and commercial/daily (15) were also among the most frequent. Notice that they display higher scores than their corresponding bigrams, since pairwise_count() counts words co-occurring anywhere within the same title, regardless of order and adjacency.
We can plot the co-occurrences as a network graph - showing only the most frequent pairs (n>2):
set.seed(2016)
word_pairs %>%
filter(n > 2) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, point.padding = unit(0.2, "lines")) +
theme_void()+
labs(title = "Word co-occurrences in the names of periodicals",
subtitle = "Most frequent pairs (n>2)",
caption = "Based on Crow's \"Newspaper Directory of China (1935)\"")
The graph is not easy to interpret as such. To make better use of the collocation approach, we propose to focus on the two most frequent words - news and press - and compare their collocates:
# Select the target terms
# Focus on "press"
press1 <- word_pairs %>%
filter(item1 == "press")
press2 <- word_pairs %>%
filter(item2 == "press")
press_cooc <- bind_rows(press1, press2)
press_cooc
# Focus on "news"
news1 <- word_pairs %>%
filter(item1 == "news")
news2 <- word_pairs %>%
filter(item2 == "news")
news_cooc <- bind_rows(news1, news2)
news_cooc
# bind rows
press_news_cooc <- bind_rows(press_cooc, news_cooc)
Let’s plot their most frequent collocates as barplots:
word_pairs %>%
filter(item1 %in% c("press", "news")) %>%
group_by(item1) %>%
top_n(10) %>%
ungroup() %>%
mutate(item2 = reorder(item2, n)) %>%
ggplot(aes(item2, n, fill = item1)) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~ item1, scales = "free") +
coord_flip()+
labs(x = NULL, y = "Number of co-occurrences",
title = "Word co-occurrences in periodicals' names",
subtitle = "Most frequent words co-occurring with 'news' and 'press'",
caption = "Based on ProQuest Crow's \"Newspaper Directory of China (1935)\"")
We further plot their collocates as a network graph:
set.seed(2021)
press_news_cooc %>%
filter(n > 2) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, point.padding = unit(0.2, "lines")) +
theme_void()+
labs(title = "Word co-occurrences in periodicals' names",
subtitle = "Most frequent collocates of 'press' and 'news' (n>2)",
caption = "Based on Crow's \"Newspaper Directory of China (1935)\"")
The network highlights the collocates they have in common at the center of the graph. Most of them referred to place names (China, Szechuen, Tsingtao, Tientsin, North). Some referred to the periodicity of periodicals (daily, morning) or to their editorial line (people, industrial, commercial). The outlying collocates are specific to one of the two target terms. For instance, adjectives such as "true" and "strong" only co-occurred with "press", whereas "impartial", "current", "social" and "public" appeared only in connection with "news". "Evening" was exclusively associated with "news", as were Republican-inspired words such as "citizen", "republican" or "national". There was no clear geographical pattern, though the collocates of "news" referred to broader geopolitical entities (North China, world, overseas, east, eastern) than those of its counterpart "press", which pointed exclusively to cities (Beijing, Harbin, Nantung, Shanghai, Yangzhou).
We can make the graph more informative by indexing the size of the node labels on word frequency and the width of edges on the strength of collocations (number of co-occurrences).
First, we tokenize the titles of periodicals and extract the collocations (in quanteda version 3 and later, textstat_collocations() lives in the quanteda.textstats package):
library(quanteda)
library(quanteda.textstats)
crow_token_colloc <- tokens(crow_title_1935$title, remove_punct = TRUE) %>%
tokens_remove(stopwords("english"))
crow_colloc <- textstat_collocations(crow_token_colloc, size = 2, min_count = 2)
We create a document-feature matrix (DFM) from the titles (tokenizing first, since passing raw text and remove arguments to dfm() is deprecated in recent quanteda versions):
crow_dfm <- crow_title_1935$title %>%
quanteda::tokens(remove_punct = TRUE) %>%
quanteda::tokens_remove(quanteda::stopwords("english")) %>%
quanteda::dfm() %>%
quanteda::dfm_trim(min_termfreq = 10, verbose = FALSE)
We define the target terms (“news” and “press”):
coocTerm1 <- "news"
coocTerm2 <- "press"
We calculate the co-occurrence statistics for each term:
# load the function for co-occurrence calculation
source("https://tm4ss.github.io/calculateCoocStatistics.R")
# calculate co-occurrence statistics
coocs1 <- calculateCoocStatistics(coocTerm1, crow_dfm, measure="LOGLIK")
coocs2 <- calculateCoocStatistics(coocTerm2, crow_dfm, measure="LOGLIK")
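The LOGLIK measure is the log-likelihood ratio statistic (Dunning's G²), computed from the observed (O) and expected (E) cell counts of the 2-by-2 co-occurrence contingency table:
$$G^2 = 2 \sum_{i} O_i \ln \frac{O_i}{E_i}$$
Higher values indicate a stronger association between the target term and its collocate.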
We inspect the results for "news" (top 20 collocates; the last value is NaN):
coocs1[1:20]
## daily press weekly shanghai evening east
## 297.89918580 142.58559843 15.91486711 13.74748603 11.64477467 9.63445351
## voice public chinese people industrial commercial
## 6.18133820 5.79876126 4.14097804 3.52938305 3.37887165 2.85573784
## china south new tientsin morning great
## 1.92293955 0.75539409 0.54814930 0.41241935 0.40148965 0.07184508
## pictorial pao
## 0.04965055 NaN
We inspect the results for "press" (top 20 collocates; the last value is NaN):
coocs2[1:20]
## news people daily new evening morning
## 142.58559843 35.04632474 21.03786754 16.10109375 8.80215448 8.09034146
## weekly commercial chinese voice great shanghai
## 6.62014968 5.65734341 4.14607533 3.75419663 2.58059024 2.44810282
## china east review tientsin public south
## 1.96205708 1.27823972 0.98213355 0.36080166 0.34585795 0.01956591
## industrial pao
## 0.01260104 NaN
We reduce the document-feature matrix to contain only the top 19 collocates of "news" (we keep only 19 collocates because, from the 20th onward, we find only missing values):
redux_dfm1 <- dfm_select(crow_dfm,
pattern = c(names(coocs1)[1:19], "news"))
redux_dfm1
## Document-feature matrix of: 702 documents, 20 features (92.8% sparse).
## features
## docs shanghai china south news weekly daily press evening new chinese
## text1 0 0 0 0 0 0 0 0 0 0
## text2 1 0 0 0 0 0 0 0 0 0
## text3 0 1 0 0 0 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 1 1 0 0 0 0 0 0
## [ reached max_ndoc ... 696 more documents, reached max_nfeat ... 10 more features ]
redux_dfm2 <- dfm_select(crow_dfm,
pattern = c(names(coocs2)[1:19], "press"))
redux_dfm2
## Document-feature matrix of: 702 documents, 20 features (92.8% sparse).
## features
## docs shanghai china south news weekly daily press evening new chinese
## text1 0 0 0 0 0 0 0 0 0 0
## text2 1 0 0 0 0 0 0 0 0 0
## text3 0 1 0 0 0 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 1 1 0 0 0 0 0 0
## [ reached max_ndoc ... 696 more documents, reached max_nfeat ... 10 more features ]
We transform the document-feature matrix into a feature-co-occurrence matrix (FCM):
tag_fcm1 <- fcm(redux_dfm1)
tag_fcm1
## Feature co-occurrence matrix of: 20 by 20 features.
## features
## features shanghai china south news weekly daily press evening new chinese
## shanghai 0 0 0 2 0 1 3 1 1 0
## china 0 0 3 10 3 8 5 2 3 0
## south 0 0 0 5 0 4 2 1 0 0
## news 0 0 0 0 1 168 2 20 19 3
## weekly 0 0 0 0 0 0 1 0 1 1
## daily 0 0 0 0 0 0 21 1 17 2
## press 0 0 0 0 0 0 0 1 26 1
## evening 0 0 0 0 0 0 0 0 3 1
## new 0 0 0 0 0 0 0 0 0 0
## chinese 0 0 0 0 0 0 0 0 0 0
## [ reached max_feat ... 10 more features, reached max_nfeat ... 10 more features ]
tag_fcm2 <- fcm(redux_dfm2)
tag_fcm2
## Feature co-occurrence matrix of: 20 by 20 features.
## features
## features shanghai china south news weekly daily press evening new chinese
## shanghai 0 0 0 2 0 1 3 1 1 0
## china 0 0 3 10 3 8 5 2 3 0
## south 0 0 0 5 0 4 2 1 0 0
## news 0 0 0 0 1 168 2 20 19 3
## weekly 0 0 0 0 0 0 1 0 1 1
## daily 0 0 0 0 0 0 21 1 17 2
## press 0 0 0 0 0 0 0 1 26 1
## evening 0 0 0 0 0 0 0 0 3 1
## new 0 0 0 0 0 0 0 0 0 0
## chinese 0 0 0 0 0 0 0 0 0 0
## [ reached max_feat ... 10 more features, reached max_nfeat ... 10 more features ]
Finally, we plot the results as a network graph:
library(quanteda.textplots)
textplot_network(tag_fcm1,
min_freq = 1,
edge_alpha = 0.1,
edge_size = 5,
edge_color = "purple",
vertex_labelsize = log(rowSums(tag_fcm1)+1)) +
labs(title = "Collocations in the names of periodicals",
subtitle = "A graph of 'news' and its most frequent collocates",
caption = "Based on Crow's \"Newspaper Directory of China (1935)\"")
library(quanteda.textplots)
textplot_network(tag_fcm2,
min_freq = 1,
edge_alpha = 0.1,
edge_size = 5,
edge_color = "purple",
vertex_labelsize = log(rowSums(tag_fcm2)+1))+
labs(title = "Collocations in the names of periodicals",
subtitle = "A graph of 'press' and its most frequent collocates",
caption = "Based on Crow's \"Newspaper Directory of China (1935)\"")
On the two graphs, the size of the labels is proportional to the frequency of words and the width of the edges is proportional to the strength of the collocations.
Let’s measure more precisely the strength of the association for each collocate.
The strongest collocates of “news” (CollStrength > 0.04):
coocdf1 <- coocs1 %>%
as.data.frame() %>%
dplyr::mutate(CollStrength = coocs1,
Term = names(coocs1)) %>%
dplyr::filter(CollStrength > 0.04)
coocdf1 %>%
select(CollStrength) %>%
arrange(desc(CollStrength))
The strongest collocates of “press” (CollStrength > 0.012):
coocdf2 <- coocs2 %>%
as.data.frame() %>%
dplyr::mutate(CollStrength = coocs2,
Term = names(coocs2)) %>%
dplyr::filter(CollStrength > 0.012)
coocdf2 %>%
select(CollStrength) %>%
arrange(desc(CollStrength))
We plot the collocates to better visualize their relative strength:
ggplot(coocdf1, aes(x = reorder(Term, CollStrength, mean), y = CollStrength)) +
geom_point() +
coord_flip() +
theme_bw() +
labs(title = "Strongest collocates of the word 'news'",
subtitle = "Collocates in the names of periodicals",
x = "Collocate",
y = "Collocation Strength")
ggplot(coocdf2, aes(x = reorder(Term, CollStrength, mean), y = CollStrength)) +
geom_point() +
coord_flip() +
theme_bw() +
labs(title = "Strongest collocates of the word 'press'",
subtitle = "Collocates in the names of periodicals",
caption = "Based on Crow's \"Newspaper Directory of China (1935)\"",
x = "Collocate",
y = "Collocation Strength")
The next sections experiment with alternative visualizations.
First, we create a distance matrix for the collocates of each target term:
library(Matrix)
coocurrences <- t(crow_dfm) %*% crow_dfm
collocates <- as.matrix(coocurrences)
coolocs1 <- c(coocdf1$Term, "news")
coolocs2 <- c(coocdf2$Term, "press")
# remove non-collocating terms with news
collocates_redux1 <- collocates[rownames(collocates) %in% coolocs1, ]
collocates_redux1 <- collocates_redux1[, colnames(collocates_redux1) %in% coolocs1]
# create distance matrix
distmtx1 <- dist(collocates_redux1)
# remove non-collocating terms with press
collocates_redux2 <- collocates[rownames(collocates) %in% coolocs2, ]
collocates_redux2 <- collocates_redux2[, colnames(collocates_redux2) %in% coolocs2]
# create distance matrix
distmtx2 <- dist(collocates_redux2)
Second, we create a hierarchical cluster object from each distance matrix:
library(cluster)
clustertexts1 <- hclust(
distmtx1,
method="ward.D2")
clustertexts2 <- hclust(
distmtx2,
method="ward.D2")
Finally we visualize the word trees:
library(ggdendro)
ggdendrogram(clustertexts1) +
ggtitle("Collocates of 'news'")
ggdendrogram(clustertexts2) +
ggtitle("Collocates of 'press'")
Except for a few distinct words (review vs pictorial), the two dendrograms are very similar to each other, which reveals the close semantic proximity between the two target terms.
Finally we rely on correspondence analysis to visualize the collocates on a biplot.
We load the packages:
library(FactoMineR)
library(factoextra)
library(explor)
We visualize the collocates of “news” on a biplot:
res.ca1 <- CA(collocates_redux1, graph = FALSE)
fviz_ca_row(res.ca1, repel = TRUE, col.row = "gray20")
We can interact with the biplot:
res1 <- explor::prepare_results(res.ca1)
explor::CA_var_plot(res1, xax = 1, yax = 2, lev_sup = FALSE, var_sup = FALSE,
var_sup_choice = , var_hide = "None", var_lab_min_contrib = 0, col_var = "Position",
symbol_var = NULL, size_var = NULL, size_range = c(10, 300), labels_size = 10,
point_size = 56, transitions = TRUE, labels_positions = NULL, xlim = c(-3.16,
3.02), ylim = c(-1.98, 4.2))
Similarly, we visualize the collocates of “press” on the biplot:
res.ca2 <- CA(collocates_redux2, graph = FALSE)
fviz_ca_row(res.ca2, repel = TRUE, col.row = "gray20")
To interact with the biplot:
res2 <- explor::prepare_results(res.ca2)
explor::CA_var_plot(res2, xax = 1, yax = 2, lev_sup = FALSE, var_sup = FALSE,
var_sup_choice = , var_hide = "None", var_lab_min_contrib = 0, col_var = "Position",
symbol_var = NULL, size_var = NULL, size_range = c(10, 300), labels_size = 10,
point_size = 56, transitions = TRUE, labels_positions = NULL, xlim = c(-1.46,
6.56), ylim = c(-4.45, 3.57))
Last, we measure the significance of the collocations, in order to identify the collocates that co-occurred with the target terms significantly rather than merely by chance:
# convert to data frame
coocdf <- as.data.frame(as.matrix(collocates))
# reduce data
diag(coocdf) <- 0
coocdf <- coocdf[which(rowSums(coocdf) > 10),]
coocdf <- coocdf[, which(colSums(coocdf) > 10)]
# extract stats
cooctb <- coocdf %>%
dplyr::mutate(Term = rownames(coocdf)) %>%
tidyr::gather(CoocTerm, TermCoocFreq,
colnames(coocdf)[1]:colnames(coocdf)[ncol(coocdf)]) %>%
dplyr::mutate(Term = factor(Term),
CoocTerm = factor(CoocTerm)) %>%
dplyr::mutate(AllFreq = sum(TermCoocFreq)) %>%
dplyr::group_by(Term) %>%
dplyr::mutate(TermFreq = sum(TermCoocFreq)) %>%
dplyr::ungroup(Term) %>%
dplyr::group_by(CoocTerm) %>%
dplyr::mutate(CoocFreq = sum(TermCoocFreq)) %>%
dplyr::arrange(Term) %>%
dplyr::mutate(a = TermCoocFreq,
b = TermFreq - a,
c = CoocFreq - a,
d = AllFreq - (a + b + c)) %>%
dplyr::mutate(NRows = nrow(coocdf))
# inspect results
datatable(head(cooctb, 100), rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")
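The columns a, b, c and d computed above form, for each pair of terms, the 2-by-2 contingency table on which the Fisher and chi-squared tests below operate:
$$\begin{array}{l|cc}
 & \text{CoocTerm present} & \text{CoocTerm absent} \\ \hline
\text{Term present} & a & b \\
\text{Term absent} & c & d
\end{array}$$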
We select the two keyterms we are interested in:
News:
cooctb_redux_news <- cooctb %>%
dplyr::filter(Term == coocTerm1)
cooctb_redux_press <- cooctb %>%
dplyr::filter(Term == coocTerm2)
cooctb_redux_news %>%
arrange(desc(CoocFreq))
Press:
cooctb_redux_press %>%
arrange(desc(CoocFreq))
Which terms are over- and under-proportionately used with “news”:
coocStatz1 <- cooctb_redux_news %>%
dplyr::rowwise() %>%
dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(a, b, c, d),
ncol = 2, byrow = T))[1]))) %>%
dplyr::mutate(x2 = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))[1]))) %>%
dplyr::mutate(phi = sqrt((x2/(a + b + c + d)))) %>%
dplyr::mutate(expected = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))$expected[1]))) %>%
dplyr::mutate(Significance = dplyr::case_when(p <= .001 ~ "p<.001",
p <= .01 ~ "p<.01",
p <= .05 ~ "p<.05",
FALSE ~ "n.s."))
# add information to the table and remove superfluous columns so that the table can be more easily parsed:
coocStatz1_clean <- coocStatz1 %>%
dplyr::ungroup() %>%
dplyr::arrange(p) %>%
dplyr::mutate(j = 1:n()) %>%
dplyr::mutate(corr05 = ((j/NRows)*0.05)) %>%
dplyr::mutate(corr01 = ((j/NRows)*0.01)) %>%
dplyr::mutate(corr001 = ((j/NRows)*0.001)) %>%
dplyr::mutate(CorrSignificance = dplyr::case_when(p <= corr001 ~ "p<.001",
p <= corr01 ~ "p<.01",
p <= corr05 ~ "p<.05",
TRUE ~ "n.s.")) %>%
dplyr::mutate(p = round(p, 6)) %>%
dplyr::mutate(x2 = round(x2, 1)) %>%
dplyr::mutate(phi = round(phi, 2)) %>%
dplyr::arrange(p) %>%
dplyr::select(-a, -b, -c, -d, -j, -NRows, -corr05, -corr01, -corr001) %>%
dplyr::mutate(Type = ifelse(expected > TermCoocFreq, "Antitype", "Type")) # Type: more frequent than expected; Antitype: less frequent
# inspect results
coocStatz1_clean %>%
arrange(p)
Which terms are over- and under-proportionately used with “press”:
coocStatz2 <- cooctb_redux_press %>%
dplyr::rowwise() %>%
dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(a, b, c, d),
ncol = 2, byrow = T))[1]))) %>%
dplyr::mutate(x2 = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))[1]))) %>%
dplyr::mutate(phi = sqrt((x2/(a + b + c + d)))) %>%
dplyr::mutate(expected = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))$expected[1]))) %>%
dplyr::mutate(Significance = dplyr::case_when(p <= .001 ~ "p<.001",
p <= .01 ~ "p<.01",
p <= .05 ~ "p<.05",
TRUE ~ "n.s."))
# add information to the table and remove superfluous columns so that the table can be parsed more easily:
coocStatz2_clean <- coocStatz2 %>%
dplyr::ungroup() %>%
dplyr::arrange(p) %>%
dplyr::mutate(j = 1:n()) %>%
dplyr::mutate(corr05 = ((j/NRows)*0.05)) %>%
dplyr::mutate(corr01 = ((j/NRows)*0.01)) %>%
dplyr::mutate(corr001 = ((j/NRows)*0.001)) %>%
dplyr::mutate(CorrSignificance = dplyr::case_when(p <= corr001 ~ "p<.001",
p <= corr01 ~ "p<.01",
p <= corr05 ~ "p<.05",
TRUE ~ "n.s.")) %>%
dplyr::mutate(p = round(p, 6)) %>%
dplyr::mutate(x2 = round(x2, 1)) %>%
dplyr::mutate(phi = round(phi, 2)) %>%
dplyr::arrange(p) %>%
dplyr::select(-a, -b, -c, -d, -j, -NRows, -corr05, -corr01, -corr001) %>%
dplyr::mutate(Type = ifelse(expected > TermCoocFreq, "Antitype", "Type"))
# inspect results
coocStatz2_clean %>%
arrange(p)
A simpler alternative is to extract collocations with the quanteda package.
We load the package:
options(stringsAsFactors = FALSE)
library(quanteda)
We create and preprocess the corpus object:
crow_title_1935_corpus <- corpus(crow_title_1935$title, docnames = crow_title_1935$id)
corpus_tokens <- crow_title_1935_corpus %>%
tokens(remove_punct = FALSE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_tolower() %>%
tokens_remove(pattern = stopwords(), padding = T)
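As an optional check, we can display the tokens of the first title; the empty strings are the pads left in place of the removed stopwords, which we drop below with tokens_remove(""):
corpus_tokens[1]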
We search for multi-word unit (MWU) candidates, setting the minimum frequency of occurrence to 2. On this basis, the algorithm detects 109 collocations. Only the 10 most frequent are displayed:
crow_mwu <- textstat_collocations(corpus_tokens, min_count = 2)
head(crow_mwu, 10) %>%
arrange(desc(count))
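Besides the raw count, textstat_collocations also returns a lambda and a z score for each candidate: higher values indicate pairs of words that occur together more often than chance would predict.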
Finally, we create a Document-Term Matrix (DTM), after removing the empty pads left by the stopword removal:
DTM <- corpus_tokens %>%
tokens_remove("") %>%
dfm()
dim(DTM)
## [1] 702 526
The DTM contains 702 documents (periodicals) and 526 distinct tokens (unigrams or word units).
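As a quick sanity check, quanteda’s topfeatures() lists the most frequent tokens in the matrix:
topfeatures(DTM, 10)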
A widely used method to weight terms according to their semantic contribution to a document is the term frequency–inverse document frequency measure (TF-IDF). The intuition is twofold: the more often a term occurs in a document, the more it contributes to that document’s content; at the same time, the more documents a term occurs in, the less informative it is for any single document. The weight is the product of both measures.
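As a toy illustration (made-up counts, not taken from our data), a term occurring twice in a title and present in 10 of the 702 periodicals would be weighted as follows:
tf <- 2 # the term occurs twice in the document
idf <- log(702 / 10) # the term occurs in 10 of the 702 documents, ~4.25
tf * idf # TF-IDF weight, ~8.5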
Let us, for instance, compute the TF-IDF weights of all terms for the “Daily” periodicity category:
# Compute IDF: log(N / n_i)
number_of_docs <- nrow(DTM)
term_in_docs <- colSums(DTM > 0)
idf <- log(number_of_docs / term_in_docs)
# Compute TF: the total frequency of each term across the daily periodicals
daily <- which(crow_title_1935$Periodicity == "Daily")
tf <- as.vector(colSums(DTM[daily, ]))
# Compute TF-IDF
tf_idf <- tf * idf
names(tf_idf) <- colnames(DTM)
The last operation appends the column names again to the resulting term-weight vector. If we now sort the TF-IDF weights in decreasing order, we get the most important terms of the daily category according to this weight:
sort(tf_idf, decreasing = T)[1:20]
## (output: the 20 terms with the highest TF-IDF weights for the Daily category)
We can instead focus on a single year, for instance 1917, known as a “significant year” in the history of journalism in China (He, 2019):
# Compute TF for the year 1917: the total frequency of each term across the periodicals established that year
title1917 <- which(crow_title_1935$Established == "1917")
tf <- as.vector(colSums(DTM[title1917, ]))
# Compute TF-IDF
tf_idf <- tf * idf
names(tf_idf) <- colnames(DTM)
sort(tf_idf, decreasing = T)[1:20]
## (top 20 terms by TF-IDF weight for 1917, including strength, universal,
## circulating, kiang, guilding, shanghailander, dairen, green, sunday,
## women, hongkong)
We can also measure the frequencies of certain terms over time, this time using decades instead of years. Frequencies per decade are plotted as line graphs to follow their trends over time. First, we determine which terms to analyze: we focus on words related to Republican values and reduce our DTM to these terms:
terms_to_observe <- c("people", "republican", "voice", "public", "citizen", "impartial", "truth", "nation", "national")
DTM_reduced <- as.matrix(DTM[, terms_to_observe])
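Had any of these terms been absent from the titles, the subsetting above would fail; a quick way to check is to list the terms missing from the DTM (the result should be empty):
terms_to_observe[!terms_to_observe %in% colnames(DTM)]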
We create a new variable for decade and we count word frequencies per decade:
crow_title_1935$decade <- paste0(substr(crow_title_1935$Established, 1, 3), "0") # e.g. "1917" -> "1910"
counts_per_decade <- aggregate(DTM_reduced, by = list(decade = crow_title_1935$decade), sum)
We plot the time series:
# give x and y values beautiful names
decades <- counts_per_decade$decade
frequencies <- counts_per_decade[, terms_to_observe]
# plot multiple frequencies
matplot(decades, frequencies, type = "l")
# add legend to the plot
l <- length(terms_to_observe)
legend('topleft', legend = terms_to_observe, col=1:l, text.col = 1:l, lty = 1:l)
Since relatively few periodicals were established before 1900, we can narrow the time window and focus on the periodicals established after 1880:
counts_per_decade2 <- counts_per_decade %>% filter(decade > 1880)
# give x and y values beautiful names
decades <- counts_per_decade2$decade
frequencies <- counts_per_decade2[, terms_to_observe]
# plot multiple frequencies
matplot(decades, frequencies, type = "l")
# add legend to the plot
l <- length(terms_to_observe)
legend('topleft', legend = terms_to_observe, col=1:l, text.col = 1:l, lty = 1:l)
## Heatmaps
The overlapping of several time series in a plot can become very confusing. Heatmaps provide an alternative for the visualization of multiple frequencies over time. In this visualization method, a time series is mapped as a row in a matrix grid. Each cell of the grid is filled with a color corresponding to the value from the time series. Thus, several time series can be displayed in parallel.
In addition, the time series can be sorted by similarity in a heatmap. In this way, similar frequency sequences with parallel shapes (heat-activated cells) can be detected more quickly. Dendrograms can be plotted alongside to visualize the degrees of similarity.
terms_to_observe <- c("people", "republican", "voice", "public", "citizen", "impartial", "truth", "nation", "national")
DTM_reduced <- as.matrix(DTM[, terms_to_observe])
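# label the rows with even establishment years only, to keep the year axis readable (same trick below)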
rownames(DTM_reduced) <- ifelse(as.integer(crow_title_1935$Established) %% 2 == 0, crow_title_1935$Established, "")
heatmap(t(DTM_reduced), scale = "row", Colv=NA, col = rev(heat.colors(256)), keep.dendro= FALSE, margins = c(5, 10))
terms_to_observe2 <- c("daily", "weekly", "monthly", "semimonthly", "journal", "magazine", "review", "pictorial", "news")
DTM_reduced2 <- as.matrix(DTM[, terms_to_observe2])
rownames(DTM_reduced2) <- ifelse(as.integer(crow_title_1935$Established) %% 2 == 0, crow_title_1935$Established, "")
heatmap(t(DTM_reduced2), scale = "row", Colv=NA, col = rev(heat.colors(256)), keep.dendro= FALSE, margins = c(5, 10))
We define the target DTM:
targetDTM <- DTM
Then we load the function calculateLogLikelihood, which scores how characteristic each term is of a target subcorpus by comparing its observed frequency there with its frequency in the rest of the corpus (a log-likelihood ratio test):
source("https://tm4ss.github.io/calculateLogLikelihood.R")
We loop over the decades to extract the keywords of each decade and save them as word clouds:
crow_decades <- unique(crow_title_1935$decade)
# make sure the output folder for the word-cloud PDFs exists
dir.create("wordclouds", showWarnings = FALSE)
for (decade in crow_decades) {
cat("Extracting terms per decade", decade, "\n")
selector_logical_idx <- crow_title_1935$decade == decade
decadeDTM <- targetDTM[selector_logical_idx, ]
termCountsTarget <- colSums(decadeDTM)
otherDTM <- targetDTM[!selector_logical_idx, ]
termCountsComparison <- colSums(otherDTM)
loglik_terms <- calculateLogLikelihood(termCountsTarget, termCountsComparison)
top100 <- sort(loglik_terms, decreasing = TRUE)[1:100]
fileName <- paste0("wordclouds/", decade, ".pdf")
pdf(fileName, width = 9, height = 7)
wordcloud::wordcloud(names(top100), top100, max.words = 100, scale = c(3, .9), colors = RColorBrewer::brewer.pal(8, "Dark2"), random.order = F)
dev.off()
}
## Extracting terms per decade 1890
## Extracting terms per decade 1870
## Extracting terms per decade 1900
## Extracting terms per decade 1910
## Extracting terms per decade 1920
## Extracting terms per decade 1930
## Extracting terms per decade NA0
## Extracting terms per decade 1850
## Extracting terms per decade 1880
## Extracting terms per decade 1860
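Note that the “NA0” group gathers the periodicals whose year of establishment is missing from the directory.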