Abstract
This document applies various text analysis techniques to the titles of periodicals listed in Crow's Newspaper Directory of China (1931-1935).
Outline:
- Simple word tokenization
- Bigrams
- Trigrams
- Multiword tokenization
- Collocations
- Biterm topic modeling
We load the original dataset containing 1060 periodicals and 47 variables drawn from the two directories of 1931 and 1935:
library(readr)
library(tidyverse) # dplyr, stringr and tibble, used throughout
library(DT)        # datatable()
crowdata <- read_delim("crowdata.csv", ";",
escape_double = FALSE, trim_ws = TRUE)
datatable(crowdata)
We select the variables we will use for the analysis of titles:
crow_title <- crowdata %>% select(Year, City_py, City_zh,
Title_eng, Title_zh,
Established, Periodicity, Language)
datatable(crow_title)
We create a unique identifier for each periodical:
crow_title_id <- rowid_to_column(crow_title) %>%
rename(id = rowid) %>%
mutate(id = paste(City_py, Periodicity, id, sep = "_"))
We pre-process the text data so that we will not have to do it during the tokenization process. By pre-processing, we mean removing function words (The, L', Les), punctuation and possessive "'s". We finally delete the blank spaces that may have been created in the process:
crow_title_id <- crow_title_id %>%
  mutate(title = str_remove_all(Title_eng, ", The")) %>%
  mutate(title = str_remove_all(title, "\\bThe\\b")) %>% # word boundaries avoid mangling words such as "Theatre"
  mutate(title = str_remove_all(title, ", L'")) %>%
  mutate(title = str_remove_all(title, "\\bLes\\b")) %>%
  mutate(title = str_remove_all(title, "\\band\\b"))
# strip possessive 's and any character that is neither alphanumeric nor blank
crow_title_id$title <- gsub("['’]s\\b|[^[:alnum:] [:blank:]]", "", crow_title_id$title)
# trim leading and trailing whitespace created by the removals
crow_title_id$title <- trimws(crow_title_id$title, which = "both")
Note that we create a new variable (title) for the pre-processed text: it is good practice to preserve the original text data (Title_eng). The new column "title" contains the clean text we will use for tokenization:
datatable(crow_title_id %>% select(id, Title_eng, title))
Since we will examine each edition separately, we split the data into two distinct samples:
crow_title_1931 <- crow_title_id %>% filter(Year == "1931")
crow_title_1935 <- crow_title_id %>% filter(Year == "1935")
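A quick check on the resulting sample sizes (the 1935 sample should contain the 702 periodicals referred to throughout the analyses below):
nrow(crow_title_1931)
nrow(crow_title_1935) # 702 periodicals in the 1935 edition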
We"re good to go !!
Since text analysis techniques were initially developed on English-language data, they have been most commonly applied to, and still usually work better with, English-language texts. We will therefore start by exploring the titles in English (Title_eng). In a second step, we will explore the titles in Chinese (Title_zh).
For the analysis of English titles, we rely on three main packages and proceed in several steps.
We start with the 1935 sample, because its larger size and wider time scope will yield more interesting results.
We load the packages:
library(tidyverse)
library(tidytext)
library(knitr)      # kable()
library(kableExtra) # kable_styling()
First we create a clean dataset of tokenized text. We remove any remaining punctuation and stop words (e.g. "the"), and we lowercase each word. The resulting table contains an additional column with the word extracted from each title. Each observation now corresponds to a word, no longer to a periodical. The dataset now contains 1900 observations, referring to the 1900 tokens extracted from the titles of the 702 periodicals. Only the first 6 rows are displayed:
data("stop_words")
crow_unigram <- crow_title_1935 %>%
unnest_tokens(output = word, input = title) %>%
anti_join(stop_words)
kable(head(crow_unigram), caption = "First 6 rows") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
id | Year | City_py | City_zh | Title_eng | Title_zh | Established | Periodicity | Language | word |
---|---|---|---|---|---|---|---|---|---|
Shanghai_Daily_2 | 1935 | Shanghai | 上海 | Sin Wan Pao | 新聞報 | 1893 | Daily | Chinese | sin |
Shanghai_Daily_2 | 1935 | Shanghai | 上海 | Sin Wan Pao | 新聞報 | 1893 | Daily | Chinese | wan |
Shanghai_Daily_2 | 1935 | Shanghai | 上海 | Sin Wan Pao | 新聞報 | 1893 | Daily | Chinese | pao |
Shanghai_Daily_5 | 1935 | Shanghai | 上海 | Shanghai Herald | 申報 | 1872 | Daily | Chinese | shanghai |
Shanghai_Daily_5 | 1935 | Shanghai | 上海 | Shanghai Herald | 申報 | 1872 | Daily | Chinese | herald |
Shanghai_Daily_6 | 1935 | Shanghai | 上海 | China Times, The | 時事新報 | 1908 | Daily | Chinese | china |
We can then count the number of words and sort them by their decreasing frequency in the corpus:
unigram_count <- crow_unigram %>%
  count(word, sort = TRUE)
unigram_count
As expected, the three most frequent words referred to the daily press and daily news (news, daily, press). They reflect the heavy weight of daily newspapers as we showed in a previous essay. The next most frequent words describe the geographical situation (China) and the increasingly commercial nature of the press in Republican China (commercial). Other recurring words reflect the multiple rhythms of the press (evening, weekly), the centrality of Shanghai as the major publishing center, and the emergence of political values associated with the Republican regime (Republican, people).
What words were most frequently associated with each category of periodical?
Let's find the ten most frequent words for each category of periodical:
top10_word <- crow_unigram %>%
group_by(Periodicity, word) %>%
tally() %>%
arrange(Periodicity, desc(n)) %>%
group_by(Periodicity) %>%
top_n(10)
top_periodicity <- top10_word %>% filter(n>1)
datatable(top10_word)
Let's visualize the results, comparing dailies, tabloids and weeklies:
top10_word %>%
filter(Periodicity %in% c("Daily", "Tabloid", "Weekly"))%>%
group_by(Periodicity) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free") +
labs(x = "Number of occurrences", y = "Word",
title = "Naming periodicals in Republican China",
subtitle = "Ten most frequent words used for dailies, tabloids and weeklies",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
The three categories emphasized the concept of news. Dailies and tabloids emphasized their daily periodicity. The words Sunday and weekly were specific to weeklies. Tabloids and weeklies emphasized their local anchorage in Shanghai, whereas general daily newspapers emphasized more abstract values (republican, people) and, like the two others, larger political entities (China). Tabloids and dailies both had morning editions, but evening editions were specific to daily newspapers. Tabloids focused on entertainment (star, radio, movie) but also shared more serious concerns (people, peace). Other terms described the specific nature of each periodical - pao for dailies (the Wade-Giles transliteration of "bao" 報, newspaper), herald for weeklies. Pictorial indicated that weeklies generally contained more illustrations than dailies. Review and critic pointed to their critical distance, in contrast to the instantaneity of daily newspapers.
Let’s now compare less frequent periodicals:
top10_word %>%
filter(Periodicity %in% c("Annual", "Monthly", "Quarterly"))%>%
filter(word %in% top_periodicity$word) %>%
group_by(Periodicity)%>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free") +
labs(x = "Number of occurrences", y = "Word",
title = "Naming periodicals in Republican China",
subtitle = "Ten most frequent words used for annuals, monthlies and quarterlies",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
Annuals and quarterlies were barely significant from a statistical point of view. Monthlies showed an interestingly wide range of terms that reflected the diversity of their content and intended audience. Besides words that referred to their periodicity (monthly, magazine, journal), their geographical situation (China, Chinese, Shanghai) and their critical distance (review), we find more specific terms delineating two main groups of monthlies, each with its specific function and public: intellectual or professional journals catering to a highly educated and specialized readership (medical, review, commerce) and popular often illustrated magazines commonly associated with women, entertainment and modernity (pictorial, magazine, modern).
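Several of the plots below force integer breaks on the x-axis through a custom integer_breaks() helper. This function is not part of ggplot2; here is a minimal sketch of one common formulation, assumed in what follows:
# assumed helper: returns a breaks function that keeps only
# unique, integer-valued breaks derived from pretty()
integer_breaks <- function(n = 5, ...) {
  function(x) {
    unique(floor(pretty(x, n, ...)))
  }
}
The helper is then passed to ggplot2 via scale_x_continuous(breaks = integer_breaks()).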
Irregular periodicals with intermediate frequency:
top10_word %>%
filter(Periodicity %in% c("Bimonthly", "Biweekly", "Semi-monthly", "Semi-weekly"))%>%
filter(word %in% top_periodicity$word)%>%
group_by(Periodicity)%>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = Periodicity)) +
geom_col(show.legend = FALSE)+
scale_x_continuous(breaks = integer_breaks())+
facet_wrap(~ Periodicity, scales = "free") +
labs(x = "Number of occurrences", y = "Word",
title = "Naming periodicals in Republican China",
subtitle = "Ten most frequent words used for intermediate periodicals",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
An alternative, widely used method for weighting terms according to their semantic contribution to a document or a group of documents is the term frequency-inverse document frequency measure (TF-IDF). The intuition is that the more often a term occurs within a document, the more it contributes to that document; at the same time, the more documents a term occurs in, the less informative it is for any single document. The product of the two measures gives the resulting weight.
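Formally, for a term t in a document (or group of documents) d within a collection of N documents, the weight computed by tidytext's bind_tf_idf() is:
$$\mathrm{tf\text{-}idf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln \frac{N}{\left|\{d' : t \in d'\}\right|}$$
where the first factor is the term frequency (tf) and the second the inverse document frequency (idf).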
Let’s compute the TF-IDF weights for the words extracted from the titles of periodicals for each category of periodicals (periodicity):
crow_unigram_periodicity_tf_idf <- crow_unigram %>%
count(Periodicity, word) %>%
bind_tf_idf(word, Periodicity, n) %>%
arrange(desc(tf_idf))
kable(head(crow_unigram_periodicity_tf_idf), caption = "TF-IDF of terms used for naming periodicals, per periodicity group (first 6 rows)") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
Periodicity | word | n | tf | idf | tf_idf |
---|---|---|---|---|---|
Bimonthly | electrical | 1 | 0.3333333 | 2.302585 | 0.7675284 |
Annual | directory | 3 | 0.2000000 | 2.302585 | 0.4605170 |
Semi-monthly | semimonthly | 5 | 0.1923077 | 2.302585 | 0.4428048 |
Bimonthly | engineering | 1 | 0.3333333 | 1.203973 | 0.4013243 |
Biweekly | steel | 1 | 0.1000000 | 2.302585 | 0.2302585 |
Weekly | weekly | 24 | 0.1355932 | 1.609438 | 0.2182289 |
The major advantage of TF-IDF compared to simple frequency is that it emphasizes the words that are specific to a given document (or group of documents), instead of the most common words overall. For instance, we see in the table that technical words such as "electrical" or "engineering" were specifically associated with bimonthlies, which reflects the highly specialized nature of their content and their readership profile. The high tf-idf of the word "directory" in relation to annuals indicates that directories as a genre were generally published on a yearly basis.
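As a worked check of the first row of the table above: "electrical" accounts for 1 of the 3 word tokens found in bimonthly titles, and it occurs in only 1 of the 10 periodicity categories, hence
$$\mathrm{tf\text{-}idf} = \frac{1}{3} \times \ln\frac{10}{1} \approx 0.3333 \times 2.3026 \approx 0.7675,$$
which matches the tf_idf value reported in the table.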
We can plot the top tf-idf words for each category:
crow_unigram_periodicity_tf_idf %>%
group_by(Periodicity) %>%
top_n(5, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free") +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the names of periodicals",
subtitle = "tf-idf by periodicity",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
Let’s focus on the three most frequent categories:
crow_unigram_periodicity_tf_idf %>%
filter(Periodicity %in% c("Daily", "Tabloid", "Weekly"))%>%
group_by(Periodicity) %>%
top_n(5, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free", ncol=3) +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the names of periodicals",
subtitle = "tf-idf in most frequent periodicals",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
The terms news, daily and press were commonly used for naming both dailies and tabloids, in the same order of importance. Evening editions were specific to daily newspapers, whereas tabloids more often appeared in the morning. The high tf-idf of the word republican suggests that dailies tended to privilege "serious" political news, whereas tabloids were more often devoted to entertainment (star, radio, movie). Weeklies featured totally different words that indicated their specific rhythmic patterns (weekly, Sunday), their emphasis on illustrations (pictorial) and their critical distance (review).
We now compare with the less frequent periodicals:
crow_unigram_periodicity_tf_idf %>%
filter(Periodicity %in% c("Annual", "Monthly", "Quarterly"))%>%
group_by(Periodicity) %>%
top_n(5, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free", ncol=3) +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the names of periodicals",
subtitle = "tf-idf in less frequent periodicals",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
As we observed earlier, the words directory and list indicated that annuals were closely related to directories as a specific genre. The word including suggests that such directories often displayed long, descriptive titles that provided additional information on their content. Other terms referred to their publisher. The words associated with monthlies related to their periodicity (monthly, quarterly), their professional or popular content (medical, magazine) and their critical distance (review). Quarterlies were more strongly associated with technical terms that emphasized their highly specialized contents (physiology, nursing, naturalist medica, economics, caduceus, agricultural).
Finally, we compare irregular periodicals:
crow_unigram_periodicity_tf_idf %>%
filter(Periodicity %in% c("Bimonthly", "Biweekly", "Semi-monthly", "Semi-weekly"))%>%
group_by(Periodicity) %>%
top_n(5, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = Periodicity)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Periodicity, scales = "free", ncol=2) +
labs(x = "tf-idf", y = "word",
title = "Highest tf-idf words in the names of periodicals",
subtitle = "tf-idf in irregular periodicals",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
As previously observed, bimonthlies stressed their technical contents (electrical, engineering), whereas biweeklies emphasized political and philosophical concepts (steel, people, truth). Note that steel did not refer to the metallic material but was used in a metaphorical sense to denote political strength. Semi-monthlies appealed to their readers’ social or political sensibility (children, national defense, forum, companion). Semi-weeklies highlighted their locality and focused on popular entertainments (diamond, robin, player).
From the previous observations, we suspect that some terms were shared between different periodicals whereas others were more specific to one or two categories. For instance, we noticed that the words “daily” and “news” were shared between dailies and tabloids, or that weeklies and monthlies had the word “review” in common. What periodicals were the most similar to each other based on the words they shared? What words were most common/specific in the press lexicon?
To answer these questions, we use the widyr package to compute the pairwise correlation between categories of periodicals, based on the words they have in common. We use the ggraph and igraph packages to visualize the correlations as a network graph. For the sake of legibility, the network contains only the strongest correlations (>0.15):
library(widyr)
crow_title_cors <- crow_unigram %>%
group_by(Periodicity, word) %>%
count() %>%
pairwise_cor(Periodicity, word, n, sort = TRUE)
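Under the hood, pairwise_cor() computes Pearson correlations between the word-count vectors of each pair of categories. A minimal sketch of the equivalent computation in base R and tidyr (a hypothetical cross-check, not part of the main workflow):
# cast the counts to a periodicity-by-word matrix (missing counts = 0)
wide <- crow_unigram %>%
  count(Periodicity, word) %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0)
mat <- as.matrix(wide[, -1])
rownames(mat) <- wide$Periodicity
# Pearson correlations between the rows (categories of periodicals)
cor(t(mat))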
library(ggraph)
library(igraph)
set.seed(2017)
crow_title_cors %>%
filter(correlation > .15) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation, width = correlation)) +
geom_node_point(size = 12, color = "lightblue") +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void() +
labs(title = "Naming periodicals in Republican China",
subtitle = "Word correlation between categories of periodicals",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
The network graph confirms the strong correlation between dailies and tabloids and further reveals their lexical proximity with semi-weeklies and biweeklies. The second main cluster groups monthlies with semi-monthlies, quarterlies and, to a lesser extent, bimonthlies. Weeklies held a bridging position between the two groups. Annuals held an outlying position, which reflects their statistical insignificance. Overall, the lexical similarity of periodicals' titles reflects their temporal proximity. The two clusters opposed two major types of periodicity - daily-based (most frequent) and monthly-based (less frequent) - with intermediate frequencies (weeklies) in between.
Did the naming of periodicals depend on their language?
As we did earlier for periodicity, let's find the ten most frequent words for each language:
top10_word_lang <- crow_unigram %>%
group_by(Language, word) %>%
tally() %>%
arrange(Language, desc(n)) %>%
group_by(Language) %>%
top_n(10)
top_language <- top10_word_lang %>% filter(n>1)
datatable(top10_word_lang)
Let’s compare the words used in two major language groups, Chinese and English:
top10_word_lang %>%
filter(Language %in% c("Chinese", "English"))%>%
group_by(Language) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = Language)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Language, scales = "free", ncol=2) +
labs(x = "Frequency", y = "Word",
title = "Naming periodicals in Republican China",
subtitle = "Chinese and English periodicals (10 most frequent words)",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
News, press and daily were much more commonly used for Chinese periodicals than they were for English periodicals. English editors, by contrast, emphasized the locality (China, Chinese, Shanghai, Hongkong). The ranking also reflects differences in the nature of periodicals across languages. Dailies dominated the Chinese press, whereas the English press consisted mainly of weeklies and less frequent publications. Finally, the Chinese press placed a stronger emphasis on political values (people, republican).
top10_word_lang %>%
filter(Language %in% c("Japanese", "Russian"))%>%
group_by(Language) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = Language)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Language, scales = "free", ncol=2) +
scale_x_continuous(breaks = integer_breaks())+
labs(x = "Frequency", y = "Word",
title = "Naming periodicals in Republican China",
subtitle = "Japanese and Russian periodicals (10 most frequent words)",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
The native language strongly shaped the names of Japanese and Russian periodicals. The place of publication (Shanghai, Manchuria) was also commonly used to differentiate local editions or communities.
Other foreign languages were statistically insignificant.
Let’s compute the TF-IDF weights for the words extracted from the titles of periodicals for each language group:
crow_unigram_language_tf_idf <- crow_unigram %>%
count(Language, word) %>%
bind_tf_idf(word, Language, n) %>%
arrange(desc(tf_idf))
kable(head(crow_unigram_language_tf_idf), caption = "TF-IDF of terms used for naming periodicals, per language group (first 6 rows)") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
Language | word | n | tf | idf | tf_idf |
---|---|---|---|---|---|
Russian | zaria | 2 | 0.4000000 | 1.94591 | 0.7783641 |
Bilingual | nursing | 1 | 0.3333333 | 1.94591 | 0.6486367 |
French | de | 1 | 0.2500000 | 1.94591 | 0.4864775 |
French | le | 1 | 0.2500000 | 1.94591 | 0.4864775 |
German | deutschchinesische | 1 | 0.2000000 | 1.94591 | 0.3891820 |
German | deutsche | 1 | 0.2000000 | 1.94591 | 0.3891820 |
As we did previously for periodicity, we can examine word similarity between languages:
crow_title_cors_lang <- crow_unigram %>%
group_by(Language, word) %>%
count() %>%
pairwise_cor(Language, word, n, sort = TRUE)
set.seed(2017)
crow_title_cors_lang %>%
filter(correlation > .15) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation, width = correlation)) +
geom_node_point(size = 12, color = "lightblue") +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void() +
labs(title = "Naming periodicals in Republican China",
subtitle = "Word correlation between language",
caption = "Based on Crow's \"Newspaper Directory of China\"(1935)")
The strongest correlation (0.4) linked English and bilingual (Chinese-English) periodicals. The next pairs associated English and Chinese, English and French, and to a lesser extent, French-bilingual and French-Japanese (0.3). English served as a bridge between Chinese and foreign languages. German and Russian held outlying positions and were diametrically opposed to Chinese periodicals. Japanese bridged English with minority languages (German, Russian, French). We should beware of over-interpreting these results, given the statistical insignificance of all foreign languages except English. In fact, Shanghai was the only word they had in common.
Were some words more frequently used in certain cities than in others? Can we detect spatial, geographical patterns in the naming of periodicals?
For instance, we hypothesize that the differentiation between morning/evening editions, as well as Sunday editions, was specific to large urban centers such as Shanghai, Tianjin, Hongkong, Nanjing or Beijing.
datatable(crow_unigram %>%
filter(word %in% c("evening", "morning", "noon", "sunday")) %>%
group_by(City_py, City_zh, word) %>%
count() %>%
arrange(word, desc(n)))
Interestingly, evening editions were more developed in Chongqing (Sichuan) and Tianjin than in Shanghai and Nanjing. Smaller cities like Changzhou, Jinan, Taiyuan and Zhenjiang also offered at least two evening editions each. By contrast, Beijing and southern cities such as Fuzhou, Guangzhou or Hongkong offered just one edition. Morning editions appeared less popular. They were most developed in Tianjin (4), ahead of Shanghai (2). The remaining localities scattered across China offered one morning edition each. Noon editions were restricted to four cities - Chongqing, Harbin, Tianjin, Hongkong - also dispersed throughout China. Sunday editions were exclusive to three large metropolitan centers - Shanghai, Tianjin, Hongkong.
Further research is needed to analyze the spatial patterns in the naming of periodicals across provinces and cities.
Did the titles of periodicals reflect their founding period, the time during which they were established? Can we detect temporal patterns or time effects in the naming of periodicals in Republican China?
One way to proceed is to pick out the most important words for each category of periodicals or each lexical category, and to examine how they occurred over time. We can select several words simultaneously and examine how they evolved in parallel over time.
Let's examine, for instance, the words related to the nature and frequency of periodicals (the plot focuses on the period after 1900, since there were too few periodicals before then):
crow_unigram %>%
filter(word %in% c("daily", "weekly", "monthly", "journal", "magazine", "review", "pictorial", "news")) %>%
group_by(Established, word) %>%
filter(Established > 1900) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word))+
labs(title = "Naming periodicals in modern China",
subtitle = "Words indicating the genre of periodicals",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
News and daily were consistently used during the entire period. Review emerged quite early and persisted over time, though it was less commonly used than other terms. Journal appeared in the early years of the Republic and was used intermittently. It was not until the 1930s that weekly and monthly gained currency, concurrently with the words "pictorial" and "magazine". They pointed to the later emergence of popular magazines, which we highlighted earlier.
crow_unigram %>%
filter(word %in% c("news", "press")) %>%
group_by(Established, word) %>%
filter(Established > 1900) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word), position = "dodge")+
labs(title = "Naming periodicals in modern China",
subtitle = "Word associated with the press industry",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
crow_unigram %>%
filter(word %in% c("morning", "evening")) %>%
group_by(Established, word) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word), show.legend = FALSE)+
facet_wrap(~ word, scales = "free") +
scale_y_continuous(breaks = integer_breaks())+
labs(title = "Naming periodicals in modern China",
subtitle = "Time-related words",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
Morning and evening newspapers both appeared in the late imperial years and peaked in the 1930s. The two terms were barely used in between. No morning dailies were established in the early 1920s (or more accurately, no editor chose the term "morning" to name their newspaper), in contrast with two occurrences of "evening" during the same period.
crow_unigram %>%
filter(word %in% c("people", "republican", "voice", "public", "citizen", "impartial", "truth", "nation", "national")) %>%
group_by(Established, word) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word))+
labs(title = "Naming periodicals in modern China",
subtitle = "Word associated with republican values",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
crow_unigram %>%
filter(word %in% c("north", "south", "east", "eastern")) %>%
filter(Established > 1900) %>%
group_by(Established, word) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word))+
labs(title = "Naming periodicals in modern China",
subtitle = "Geo-based words",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
crow_unigram %>%
filter(word %in% c("shanghai", "tientsin", "hongkong", "szechuen", "tsingtao", "peiping", "nantung", "canton", "tsinan", "wenchow")) %>%
group_by(Established, word) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word))+
labs(title = "Naming periodicals in modern China",
subtitle = "Place names",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
crow_unigram %>%
filter(word %in% c("industrial", "commercial", "economic", "medical", "engineering","education", "movie")) %>%
group_by(Established, word) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = word))+
labs(title = "Naming periodicals in modern China",
subtitle = "Field-based Words",
x = "Year",
y = "Frequency",
fill = "Word",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
An alternative method would consist in grouping years by decade, or in reusing the four-phase periodization we designed in a previous essay. We will apply this method later using the quanteda package and the log-likelihood ratio test.
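A minimal sketch of the decade-based variant (hypothetical; the four-phase periodization itself is applied in the wordcloud section below):
# aggregate word counts by decade of establishment
crow_unigram %>%
  filter(Established > 1900) %>%
  mutate(decade = Established - Established %% 10) %>%
  count(decade, word, sort = TRUE)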
Unigram tokenization offers a preliminary solution for exploring the names of periodicals, but it presents serious limitations: it cannot handle compound expressions such as "daily news" or "republican daily news". In the next sections, we will explore a few methods for handling multiword units (bigrams, trigrams, multiword tokenization) and relations between words (collocations and co-occurrences).
Wordclouds are less informative but more visually striking than conventional plots. In this section, we create a series of wordclouds to compare word frequencies depending on the periodicity, the language, and the founding period of periodicals.
Load package:
library(wordcloud2)
crow_unigram_periodicity <- crow_unigram %>%
group_by(Periodicity) %>%
count(word) %>%
mutate(n = as.numeric(n))
crow_unigram_periodicity %>%
arrange(Periodicity, desc(n))
“Daily” Wordcloud:
wc_daily <- crow_unigram_periodicity %>%
filter(Periodicity == "Daily") %>%
select(word, n)
wc_daily$Periodicity <- NULL
wordcloud2(wc_daily, size = 2)
Tabloid:
wc_tabloid<- crow_unigram_periodicity %>%
filter(Periodicity == "Tabloid") %>%
select(word, n)
wc_tabloid$Periodicity <- NULL
wordcloud2(wc_tabloid, size = 2)
Weekly:
wc_weekly<- crow_unigram_periodicity %>%
filter(Periodicity == "Weekly") %>%
select(word, n)
wc_weekly$Periodicity <- NULL
wordcloud2(wc_weekly, size = 2,
rotateRatio = 1, backgroundColor = "grey", color = "random-light")
Monthly:
wc_monthly<- crow_unigram_periodicity %>%
filter(Periodicity == "Monthly") %>%
select(word, n)
wc_monthly$Periodicity <- NULL
wordcloud2(wc_monthly, minSize = 2,
rotateRatio = 1, backgroundColor = "grey", color = "random-light", shape = 'circle')
crow_unigram_lang <- crow_unigram %>%
group_by(Language) %>%
count(word) %>%
mutate(n = as.numeric(n))
crow_unigram_lang %>%
arrange(Language, desc(n))
Wordcloud of Chinese periodicals:
wc_zh <- crow_unigram_lang%>%
filter(Language == "Chinese") %>%
select(word, n)
wc_zh$Language <- NULL
wordcloud2(wc_zh, size = 2, color = "random-light", backgroundColor = "grey", minRotation = -pi/2, maxRotation = -pi/2)
Wordcloud of English periodicals:
wc_eng <- crow_unigram_lang %>%
filter(Language == "English") %>%
select(word, n)
wc_eng$Language <- NULL
wordcloud2(wc_eng, size = 2, color = "random-light", backgroundColor = "grey")
Wordcloud of other foreign periodicals:
wc_for <- crow_unigram_lang %>%
filter(!Language %in% c("Chinese", "English")) %>%
select(word, n)
wc_for$Language <- NULL
wordcloud2(wc_for, color = "random-light", backgroundColor = "grey")
# assign each periodical to one of four phases based on its founding year
crow_unigram <- crow_unigram %>%
  mutate(Period = cut(Established,
                      breaks = c(0, 1903, 1916, 1927, 1935),
                      labels = c("1829-1903", "1904-1916", "1917-1927", "1928-1935"),
                      include.lowest = TRUE))
crow_unigram_period <- crow_unigram %>%
group_by(Period) %>%
count(word) %>%
mutate(n = as.numeric(n))
crow_unigram_period %>%
arrange(Period, desc(n))
Wordcloud for the first phase (1829-1903):
wc_tp1 <- crow_unigram_period %>%
filter(Period == "1829-1903") %>%
select(word, n)
wc_tp1$Period <- NULL
wordcloud2(wc_tp1, color = "random-light", backgroundColor = "grey")
Wordcloud for the second phase (1904-1916):
wc_tp2 <- crow_unigram_period %>%
filter(Period == "1904-1916") %>%
select(word, n)
wc_tp2$Period <- NULL
wordcloud2(wc_tp2, size = 2, color = "random-light", backgroundColor = "grey")
Wordcloud for the third phase (1917-1927):
wc_tp3 <- crow_unigram_period %>%
filter(Period == "1917-1927") %>%
select(word, n)
wc_tp3$Period <- NULL
wordcloud2(wc_tp3, size = 2, color = "random-light", backgroundColor = "grey")
Wordcloud for the last phase (1928-1935):
wc_tp4 <- crow_unigram_period %>%
filter(Period == "1928-1935") %>%
select(word, n)
wc_tp4$Period <- NULL
wordcloud2(wc_tp4, size = 2, color = "random-light", backgroundColor = "grey")
In order to handle multiword units such as “daily news”, we will tokenize the text into bigrams (two adjacent words):
crow_bigram <- crow_title_1935 %>%
unnest_tokens(output = bigram,
input = title,
token = "ngrams",
n = 2)
# separate bigrams into words for cleaning
crow_bigram_separated <- crow_bigram %>%
separate(bigram, c("word1", "word2"), sep = " ")
# remove stop words from both word positions
crow_bigram_filtered <- crow_bigram_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# reunite bigrams
crow_bigram_united <- crow_bigram_filtered %>%
drop_na(word1) %>%
drop_na(word2) %>%
unite(bigram, word1, word2, sep = " ")
kable(head(crow_bigram_united), caption = "First 6 rows") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
id | Year | City_py | City_zh | Title_eng | Title_zh | Established | Periodicity | Language | bigram |
---|---|---|---|---|---|---|---|---|---|
Shanghai_Daily_2 | 1935 | Shanghai | 上海 | Sin Wan Pao | 新聞報 | 1893 | Daily | Chinese | sin wan |
Shanghai_Daily_2 | 1935 | Shanghai | 上海 | Sin Wan Pao | 新聞報 | 1893 | Daily | Chinese | wan pao |
Shanghai_Daily_5 | 1935 | Shanghai | 上海 | Shanghai Herald | 申報 | 1872 | Daily | Chinese | shanghai herald |
Shanghai_Daily_6 | 1935 | Shanghai | 上海 | China Times, The | 時事新報 | 1908 | Daily | Chinese | china times |
Shanghai_Daily_7 | 1935 | Shanghai | 上海 | Eastern Times | 時報 | 1904 | Daily | Chinese | eastern times |
Tianjin_Daily_8 | 1935 | Tianjin | 天津 | Social Welfare | 益世報 | 1915 | Daily | Chinese | social welfare |
The resulting table contains an additional column with the bigrams extracted from each title. Each observation now corresponds to a bigram, instead of a periodical as in the original dataset or a word as in the unigram tokenization. The dataset now contains 1126 observations, referring to the 1126 bigrams extracted from the titles of the 702 periodicals (only the first 6 rows are displayed above). Missing values (NA), removed with drop_na(), correspond to short titles consisting of just one word.
We can then compute and visualize the frequencies and tf-idf as we did for unigrams:
# count bigrams
crow_bigram_count <- crow_bigram_united %>%
count(bigram, sort = TRUE)
crow_bigram_count
As expected, the most frequent bigrams are daily/news (168 occurrences) and to a lesser extent, republican/daily (27). These three words were to be found in the common trigram “republican daily news”. Less expected is the pair people/press (29) which points to the emerging notion of public opinion in the Republican press. Three other important bigrams emphasized the development of the press as an industry in itself (commercial press, commercial daily) or as a medium for informing the public about industrial and commercial issues (industrial commercial). Finally, the bigrams evening news and morning press reflect the diversification of the rhythmic patterns of the Republican press, with the differentiation between morning and evening editions.
When did these terms appear? Did the Republican-inspired titles coincide with the founding of the Republic? When did the press begin to emphasize its commercial nature and to diversify its editions? We can select the bigrams we are interested in and plot their distribution over time, as we did for unigrams:
library(plotly) # for ggplotly()
p <- crow_bigram_united %>%
filter(Established > 1900) %>%
filter(bigram %in% c("daily news", "republican daily", "people press")) %>%
group_by(Established, bigram) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = bigram))+
facet_wrap(~ bigram, scales = "free_y", nrow = 3) +
labs(title = "Naming periodicals in modern China: Periodizing the most frequent bigrams",
subtitle = "Periodizing the three most frequent bigrams",
x = "Year",
y = "Frequency",
fill = "Bigram",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
fig <- ggplotly(p)
fig
Notice that we reduced the timescale to the 20th century, since there was only one occurrence of "daily news" prior to 1900 (the North-China Daily News). The first "republican daily" was established during the first year of the Republic, in 1912. The term did not decline and even gained in popularity during the following decade: two more were established in 1922 and 1925, and their number dramatically increased under the Nationalist regime. The term "people press" appeared later (1917). Its trajectory was less continuous, with two peaks in 1927 and 1932. The words "daily news", by contrast, were less dependent on the political chronology. They appeared in the late imperial years; as already observed, the first occurrence referred to the North-China Daily News in 1850. They enjoyed a continuous yet limited presence until 1927 and peaked in the early 1930s.
p <- crow_bigram_united %>%
filter(bigram %in% c("evening news", "morning press")) %>%
group_by(Established, bigram) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = bigram))+
facet_wrap(~ bigram, scales = "free_y", nrow = 3) +
labs(title = "Naming periodicals in modern China: Periodizing rhythmic patterns",
subtitle = "Periodizing rhythmic patterns",
x = "Year",
y = "Frequency",
fill = "Bigram",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
fig <- ggplotly(p)
fig
Based on bigrams, it appears that evening editions appeared earlier than morning editions but persisted during the Nanjing decade, concurrently with the development of morning editions. This largely supports our previous analysis based on single words.
p <- crow_bigram_united %>%
filter(bigram %in% c("commercial press", "commercial daily", "industrial commercial")) %>%
group_by(Established, bigram) %>%
tally() %>%
ggplot() +
geom_col(aes(x = Established, y = n, fill = bigram))+
facet_wrap(~ bigram, scales = "free_y", nrow = 3) +
labs(title = "Naming periodicals in modern China: Periodizing the Commercialization of the press",
subtitle = "Periodizing rhythmic patterns",
x = "Year",
y = "Frequency",
fill = "Bigram",
caption = "Based on Crow's \"Newspaper Directory of China\" (1935)")
fig <- ggplotly(p)
fig
Editors had emphasized the commercial nature or content of their publications since the late imperial years (1907). We observe an alternation between the two major expressions. Commercial daily appeared earlier than commercial press and persisted over the entire period, except for a short eclipse between 1915 and 1924. The term "commercial press" emerged precisely during this eclipse and enjoyed a discontinuous popularity until the early 1930s. The third compound, industrial commercial, referred to the content of specialized journals rather than the commercial turn of the press in general. This accounts for its irregular occurrence over time, with two major peaks in 1915 and 1931.
TF-IDF by periodicity:
crow_bigram_periodicity_tf_idf <- crow_bigram_united %>%
count(Periodicity, bigram) %>%
bind_tf_idf(bigram, Periodicity, n) %>%
arrange(desc(tf_idf))
kable(head(crow_bigram_periodicity_tf_idf), caption = "TF-IDF of bigrams used for naming periodicals, per periodicity group (first 6 rows)") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
Periodicity | bigram | n | tf | idf | tf_idf |
---|---|---|---|---|---|
Bimonthly | electrical engineering | 1 | 1.0000000 | 2.302585 | 2.3025851 |
Biweekly | people press | 2 | 0.3333333 | 1.203973 | 0.4013243 |
Biweekly | china truth | 1 | 0.1666667 | 2.302585 | 0.3837642 |
Biweekly | foochow people | 1 | 0.1666667 | 2.302585 | 0.3837642 |
Biweekly | singling people | 1 | 0.1666667 | 2.302585 | 0.3837642 |
Biweekly | steel press | 1 | 0.1666667 | 2.302585 | 0.3837642 |
TF-IDF by language:
crow_bigram_language_tf_idf <- crow_bigram_united %>%
count(Language, bigram) %>%
bind_tf_idf(bigram, Language, n) %>%
arrange(desc(tf_idf))
kable(head(crow_bigram_language_tf_idf), caption = "TF-IDF of bigrams used for naming periodicals, per language group (first 6 rows)") %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "left")
Language | bigram | n | tf | idf | tf_idf |
---|---|---|---|---|---|
Bilingual | nursing journal | 1 | 1.0000000 | 1.94591 | 1.9459101 |
Russian | nasha zaria | 1 | 0.5000000 | 1.94591 | 0.9729551 |
Russian | shanghai zaria | 1 | 0.5000000 | 1.94591 | 0.9729551 |
French | de shanghai | 1 | 0.3333333 | 1.94591 | 0.6486367 |
French | journal de | 1 | 0.3333333 | 1.94591 | 0.6486367 |
French | shanghai le | 1 | 0.3333333 | 1.94591 | 0.6486367 |
Bigrams are useful for handling compound expressions such as "daily news", but they are restricted to adjacent words. Collocations and co-occurrences provide a more sophisticated approach for analyzing the relations between more distant words. This is of limited interest in our case, since we are dealing with very short strings of text that rarely contain more than three words, but it is still instructive to experiment with an alternative method and compare the results with those we obtained using bigrams.
What words most often co-occur in the titles of periodicals? We propose to compare two alternative methods for analyzing co-occurrences in the names of periodicals. First, we rely on tidytext and widyr to adopt a tidy approach. Second, we rely on quanteda and more sophisticated metrics.
First, we compute the most frequent pairs of words in each title. The table below sorts the pairs of words by decreasing frequency of co-occurrence:
library(tidyverse)
library(tidytext)
library(widyr)
word_pairs <- crow_unigram %>%
pairwise_count(word, id, sort = TRUE)
datatable(word_pairs)
The results are similar to those obtained with bigrams. As expected, the most frequent pairs are daily/news (168 occurrences) and, to a lesser extent, republican/news (27) and republican/daily (27). These three words were to be found in the common trigram "republican daily news". The pairs people/press (29), evening/news (20), and commercial/press and commercial/daily (15) were also among the most frequent. Notice that they display higher scores than their corresponding bigrams, since pairwise_count() counts words co-occurring anywhere within the same title, regardless of order and adjacency.
We can plot the co-occurrences as a network graph - showing only the most frequent pairs (n>2):
set.seed(2016)
word_pairs %>%
filter(n > 2) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, point.padding = unit(0.2, "lines")) +
theme_void()+
labs(title = "Word co-occurrences in the names of periodicals",
subtitle = "Most frequent pairs (n>2)",
caption = "Based on Crow's \"Newspaper Directory of China (1935)\"")
The graph is not easy to interpret as such. To make better use of the collocation approach, we propose to focus on the two most frequent words - news and press - and compare their collocates:
# Select the target terms
# Focus on "press"
press1 <- word_pairs %>%
filter(item1 == "press")
press2 <- word_pairs %>%
filter(item2 == "press")
press_cooc <- bind_rows(press1, press2)
press_cooc
# Focus on "news"
news1 <- word_pairs %>%
filter(item1 == "news")
news2 <- word_pairs %>%
filter(item2 == "news")
news_cooc <- bind_rows(news1, news2)
news_cooc
# bind rows
press_news_cooc <- bind_rows(press_cooc, news_cooc)
Let’s plot their most frequent collocates as barplots:
word_pairs %>%
filter(item1 %in% c("press", "news")) %>%
group_by(item1) %>%
top_n(10) %>%
ungroup() %>%
mutate(item2 = reorder(item2, n)) %>%
ggplot(aes(item2, n, fill = item1)) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~ item1, scales = "free") +
coord_flip()+
labs(x = NULL, y = "Number of co-occurrences",
title = "Word co-occurrences in periodicals' names",
subtitle = "Most frequent words co-occurring with 'news' and 'press'",
caption = "Based on ProQuest Crow's \"Newspaper Directory of China (1935)\"")
We further plot their collocates as a network graph:
set.seed(2021)
press_news_cooc %>%
filter(n > 2) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, point.padding = unit(0.2, "lines")) +
theme_void()+
labs(title = "Word co-occurrences in periodicals' names",
subtitle = "Most frequent collocates of 'press' and 'news' (n>2)",
caption = "Based on Crow's \"Newspaper Directory of China (1935)\"")
The network highlights the collocates they have in common at the center of the graph. Most of them referred to place names (China, Szechuen, Tsingtao, Tientsin, North). Some referred to the periodicity of periodicals (daily, morning) or to their editorial line (people, industrial, commercial). The outlying collocates are specific to one of the two target terms. For instance, adjectives such as "true" and "strong" only co-occurred with "press", whereas "impartial", "current", "social" and "public" appeared only in connection with "news". "Evening" was exclusively associated with "news", as were Republican-inspired words such as "citizen", "republican" or "national". There was no clear geographical pattern, though the collocates of "news" referred to broader geopolitical entities (North China, world, overseas, east, eastern) than those of its counterpart "press", which pointed exclusively to cities (Beijing, Harbin, Nantung, Shanghai, Yangzhou).
We can make the graph more informative by indexing the size of the node labels on word frequency and the width of edges on the strength of collocations (number of co-occurrences).
First, we tokenize the titles of periodicals and extract the collocations (in quanteda version 3 and later, textstat_collocations() lives in the quanteda.textstats package):
library(quanteda)
library(quanteda.textstats)
crow_token_colloc <- tokens(crow_title_1935$title, remove_punct = TRUE) %>%
tokens_remove(stopwords("english"))
crow_colloc <- textstat_collocations(crow_token_colloc, size = 2, min_count = 2)
We create a document-feature matrix (DFM) from the titles (tokenizing first, since passing raw text and remove arguments to dfm() is deprecated in recent quanteda versions):
crow_dfm <- crow_title_1935$title %>%
quanteda::tokens(remove_punct = TRUE) %>%
quanteda::tokens_remove(quanteda::stopwords("english")) %>%
quanteda::dfm() %>%
quanteda::dfm_trim(min_termfreq = 10, verbose = FALSE)
We define the target terms (“news” and “press”):
coocTerm1 <- "news"
coocTerm2 <- "press"
We calculate the co-occurrence statistics for each term:
# load the function for co-occurrence calculation
source("https://tm4ss.github.io/calculateCoocStatistics.R")
# calculate co-occurrence statistics
coocs1 <- calculateCoocStatistics(coocTerm1, crow_dfm, measure="LOGLIK")
coocs2 <- calculateCoocStatistics(coocTerm2, crow_dfm, measure="LOGLIK")
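The LOGLIK measure is the log-likelihood ratio statistic (Dunning's G²), computed from the observed (O) and expected (E) cell counts of the 2-by-2 co-occurrence contingency table:
$$G^2 = 2 \sum_{i} O_i \ln \frac{O_i}{E_i}$$
Higher values indicate a stronger association between the target term and its collocate.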
We inspect the results for "news" (top 20 collocates; the last value is NaN):
coocs1[1:20]
## daily press weekly shanghai evening east
## 297.89918580 142.58559843 15.91486711 13.74748603 11.64477467 9.63445351
## voice public chinese people industrial commercial
## 6.18133820 5.79876126 4.14097804 3.52938305 3.37887165 2.85573784
## china south new tientsin morning great
## 1.92293955 0.75539409 0.54814930 0.41241935 0.40148965 0.07184508
## pictorial pao
## 0.04965055 NaN
We inspect the results for "press" (top 20 collocates; the last value is NaN):
coocs2[1:20]
## news people daily new evening morning
## 142.58559843 35.04632474 21.03786754 16.10109375 8.80215448 8.09034146
## weekly commercial chinese voice great shanghai
## 6.62014968 5.65734341 4.14607533 3.75419663 2.58059024 2.44810282
## china east review tientsin public south
## 1.96205708 1.27823972 0.98213355 0.36080166 0.34585795 0.01956591
## industrial pao
## 0.01260104 NaN
We reduce the document-feature matrix to contain only the top 19 collocates of "news" (we keep only 19 collocates because, from the 20th onward, we find only missing values):
redux_dfm1 <- dfm_select(crow_dfm,
pattern = c(names(coocs1)[1:19], "news"))
redux_dfm1
## Document-feature matrix of: 702 documents, 20 features (92.8% sparse).
## features
## docs shanghai china south news weekly daily press evening new chinese
## text1 0 0 0 0 0 0 0 0 0 0
## text2 1 0 0 0 0 0 0 0 0 0
## text3 0 1 0 0 0 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 1 1 0 0 0 0 0 0
## [ reached max_ndoc ... 696 more documents, reached max_nfeat ... 10 more features ]
redux_dfm2 <- dfm_select(crow_dfm,
pattern = c(names(coocs2)[1:19], "press"))
redux_dfm2
## Document-feature matrix of: 702 documents, 20 features (92.8% sparse).
## features
## docs shanghai china south news weekly daily press evening new chinese
## text1 0 0 0 0 0 0 0 0 0 0
## text2 1 0 0 0 0 0 0 0 0 0
## text3 0 1 0 0 0 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 1 1 0 0 0 0 0 0
## [ reached max_ndoc ... 696 more documents, reached max_nfeat ... 10 more features ]
We transform the document-feature matrix into a feature-co-occurrence matrix (FCM):
tag_fcm1 <- fcm(redux_dfm1)
tag_fcm1
## Feature co-occurrence matrix of: 20 by 20 features.
## features
## features shanghai china south news weekly daily press evening new chinese
## shanghai 0 0 0 2 0 1 3 1 1 0
## china 0 0 3 10 3 8 5 2 3 0
## south 0 0 0 5 0 4 2 1 0 0
## news 0 0 0 0 1 168 2 20 19 3
## weekly 0 0 0 0 0 0 1 0 1 1
## daily 0 0 0 0 0 0 21 1 17 2
## press 0 0 0 0 0 0 0 1 26 1
## evening 0 0 0 0 0 0 0 0 3 1
## new 0 0 0 0 0 0 0 0 0 0
## chinese 0 0 0 0 0 0 0 0 0 0
## [ reached max_feat ... 10 more features, reached max_nfeat ... 10 more features ]
tag_fcm2 <- fcm(redux_dfm2)
tag_fcm2
## Feature co-occurrence matrix of: 20 by 20 features.
## features
## features shanghai china south news weekly daily press evening new chinese
## shanghai 0 0 0 2 0 1 3 1 1 0
## china 0 0 3 10 3 8 5 2 3 0
## south 0 0 0 5 0 4 2 1 0 0
## news 0 0 0 0 1 168 2 20 19 3
## weekly 0 0 0 0 0 0 1 0 1 1
## daily 0 0 0 0 0 0 21 1 17 2
## press 0 0 0 0 0 0 0 1 26 1
## evening 0 0 0 0 0 0 0 0 3 1
## new 0 0 0 0 0 0 0 0 0 0
## chinese 0 0 0 0 0 0 0 0 0 0
## [ reached max_feat ... 10 more features, reached max_nfeat ... 10 more features ]
Finally, we plot the results as a network graph:
library(quanteda.textplots)
textplot_network(tag_fcm1,
min_freq = 1,
edge_alpha = 0.1,
edge_size = 5,
edge_color = "purple",
vertex_labelsize = log(rowSums(tag_fcm1)+1)) +
labs(title = "Collocations in the names of periodicals",
subtitle = "A graph of 'news' and its most frequent collocates",
caption = "Based on Crow's \"Newspaper Directory of China (1935)\"")
library(quanteda.textplots)
textplot_network(tag_fcm2,
min_freq = 1,
edge_alpha = 0.1,
edge_size = 5,
edge_color = "purple",
vertex_labelsize = log(rowSums(tag_fcm2)+1))+
labs(title = "Collocations in the names of periodicals",
subtitle = "A graph of 'press' and its most frequent collocates",
caption = "Based on Crow's \"Newspaper Directory of China (1935)\"")
On the two graphs, the size of the labels is proportional to the frequency of words and the width of the edges is proportional to the strength of the collocations.
Let’s measure more precisely the strength of the association for each collocate.
The strongest collocates of “news” (CollStrength > 0.04):
coocdf1 <- coocs1 %>%
as.data.frame() %>%
dplyr::mutate(CollStrength = coocs1,
Term = names(coocs1)) %>%
dplyr::filter(CollStrength > 0.04)
coocdf1 %>%
select(CollStrength) %>%
arrange(desc(CollStrength))
The strongest collocates of “press” (CollStrength > 0.012):
coocdf2 <- coocs2 %>%
as.data.frame() %>%
dplyr::mutate(CollStrength = coocs2,
Term = names(coocs2)) %>%
dplyr::filter(CollStrength > 0.012)
coocdf2 %>%
select(CollStrength) %>%
arrange(desc(CollStrength))
We plot the collocates to better visualize their relative strength:
ggplot(coocdf1, aes(x = reorder(Term, CollStrength, mean), y = CollStrength)) +
geom_point() +
coord_flip() +
theme_bw() +
labs(title = "Strongest collocates of the word 'news'",
subtitle = "Collocates in the names of periodicals",
x = "Collocate",
y = "Collocation Strength")
ggplot(coocdf2, aes(x = reorder(Term, CollStrength, mean), y = CollStrength)) +
geom_point() +
coord_flip() +
theme_bw() +
labs(title = "Strongest collocates of the word 'press'",
subtitle = "Collocates in the names of periodicals",
caption = "Based on Crow's \"Newspaper Directory of China (1935)\"",
x = "Collocate",
y = "Collocation Strength")
The next sections experiment with alternative visualizations.
First, we create a distance matrix for the collocates of each target term:
library(Matrix)
coocurrences <- t(crow_dfm) %*% crow_dfm
collocates <- as.matrix(coocurrences)
coolocs1 <- c(coocdf1$Term, "news")
coolocs2 <- c(coocdf2$Term, "press")
# remove non-collocating terms with news
collocates_redux1 <- collocates[rownames(collocates) %in% coolocs1, ]
collocates_redux1 <- collocates_redux1[, colnames(collocates_redux1) %in% coolocs1]
# create distance matrix
distmtx1 <- dist(collocates_redux1)
# remove non-collocating terms with press
collocates_redux2 <- collocates[rownames(collocates) %in% coolocs2, ]
collocates_redux2 <- collocates_redux2[, colnames(collocates_redux2) %in% coolocs2]
# create distance matrix
distmtx2 <- dist(collocates_redux2)
Second, we create a hierarchical cluster object from each distance matrix:
library(cluster)
clustertexts1 <- hclust(
distmtx1,
method="ward.D2")
clustertexts2 <- hclust(
distmtx2,
method="ward.D2")
Finally we visualize the word trees:
library(ggdendro)
ggdendrogram(clustertexts1) +
ggtitle("Collocates of 'news'")
ggdendrogram(clustertexts2) +
ggtitle("Collocates of 'press'")
Except for a few distinct words (review vs pictorial), the two dendrograms are very similar to each other, which reveals the close semantic proximity between the two target terms.
Finally we rely on correspondence analysis to visualize the collocates on a biplot.
We load the packages:
library(FactoMineR)
library(factoextra)
library(explor)
We visualize the collocates of “news” on a biplot:
res.ca1 <- CA(collocates_redux1, graph = FALSE)
fviz_ca_row(res.ca1, repel = TRUE, col.row = "gray20")
We can interact with the biplot:
res1 <- explor::prepare_results(res.ca1)
explor::CA_var_plot(res1, xax = 1, yax = 2, lev_sup = FALSE, var_sup = FALSE,
var_sup_choice = , var_hide = "None", var_lab_min_contrib = 0, col_var = "Position",
symbol_var = NULL, size_var = NULL, size_range = c(10, 300), labels_size = 10,
point_size = 56, transitions = TRUE, labels_positions = NULL, xlim = c(-3.16,
3.02), ylim = c(-1.98, 4.2))
Similarly, we visualize the collocates of “press” on the biplot:
res.ca2 <- CA(collocates_redux2, graph = FALSE)
fviz_ca_row(res.ca2, repel = TRUE, col.row = "gray20")
To interact with the biplot:
res2 <- explor::prepare_results(res.ca2)
explor::CA_var_plot(res2, xax = 1, yax = 2, lev_sup = FALSE, var_sup = FALSE,
var_sup_choice = , var_hide = "None", var_lab_min_contrib = 0, col_var = "Position",
symbol_var = NULL, size_var = NULL, size_range = c(10, 300), labels_size = 10,
point_size = 56, transitions = TRUE, labels_positions = NULL, xlim = c(-1.46,
6.56), ylim = c(-4.45, 3.57))
Last, we measure the significance of the collocations, in order to identify the collocates that co-occurred with the target terms significantly rather than merely by chance:
# convert to data frame
coocdf <- as.data.frame(as.matrix(collocates))
# reduce data
diag(coocdf) <- 0
coocdf <- coocdf[which(rowSums(coocdf) > 10),]
coocdf <- coocdf[, which(colSums(coocdf) > 10)]
# extract stats
cooctb <- coocdf %>%
dplyr::mutate(Term = rownames(coocdf)) %>%
tidyr::gather(CoocTerm, TermCoocFreq,
colnames(coocdf)[1]:colnames(coocdf)[ncol(coocdf)]) %>%
dplyr::mutate(Term = factor(Term),
CoocTerm = factor(CoocTerm)) %>%
dplyr::mutate(AllFreq = sum(TermCoocFreq)) %>%
dplyr::group_by(Term) %>%
dplyr::mutate(TermFreq = sum(TermCoocFreq)) %>%
dplyr::ungroup(Term) %>%
dplyr::group_by(CoocTerm) %>%
dplyr::mutate(CoocFreq = sum(TermCoocFreq)) %>%
dplyr::arrange(Term) %>%
dplyr::mutate(a = TermCoocFreq,
b = TermFreq - a,
c = CoocFreq - a,
d = AllFreq - (a + b + c)) %>%
dplyr::mutate(NRows = nrow(coocdf))
# inspect results
datatable(head(cooctb, 100), rownames = FALSE, options = list(pageLength = 10, scrollX=T), filter = "none")
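The columns a, b, c and d computed above form, for each pair of terms, the 2-by-2 contingency table on which the Fisher and chi-squared tests below operate:
$$\begin{array}{l|cc}
 & \text{CoocTerm present} & \text{CoocTerm absent} \\ \hline
\text{Term present} & a & b \\
\text{Term absent} & c & d
\end{array}$$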
We select the two keyterms we are interested in:
News:
cooctb_redux_news <- cooctb %>%
dplyr::filter(Term == coocTerm1)
cooctb_redux_press <- cooctb %>%
dplyr::filter(Term == coocTerm2)
cooctb_redux_news %>%
arrange(desc(CoocFreq))
Press:
cooctb_redux_press %>%
arrange(desc(CoocFreq))
Which terms are over- and under-proportionately used with “news”:
coocStatz1 <- cooctb_redux_news %>%
dplyr::rowwise() %>%
dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(a, b, c, d),
ncol = 2, byrow = T))[1]))) %>%
dplyr::mutate(x2 = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))[1]))) %>%
dplyr::mutate(phi = sqrt((x2/(a + b + c + d)))) %>%
dplyr::mutate(expected = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))$expected[1]))) %>%
dplyr::mutate(Significance = dplyr::case_when(p <= .001 ~ "p<.001",
p <= .01 ~ "p<.01",
p <= .05 ~ "p<.05",
FALSE ~ "n.s."))
# add information to the table and remove superfluous columns so that the table can be more easily parsed:
coocStatz1_clean <- coocStatz1 %>%
dplyr::ungroup() %>%
dplyr::arrange(p) %>%
dplyr::mutate(j = 1:n()) %>%
dplyr::mutate(corr05 = ((j/NRows)*0.05)) %>%
dplyr::mutate(corr01 = ((j/NRows)*0.01)) %>%
dplyr::mutate(corr001 = ((j/NRows)*0.001)) %>%
dplyr::mutate(CorrSignificance = dplyr::case_when(p <= corr001 ~ "p<.001",
p <= corr01 ~ "p<.01",
p <= corr05 ~ "p<.05",
TRUE ~ "n.s.")) %>%
dplyr::mutate(p = round(p, 6)) %>%
dplyr::mutate(x2 = round(x2, 1)) %>%
dplyr::mutate(phi = round(phi, 2)) %>%
dplyr::arrange(p) %>%
dplyr::select(-a, -b, -c, -d, -j, -NRows, -corr05, -corr01, -corr001) %>%
dplyr::mutate(Type = ifelse(expected > TermCoocFreq, "Antitype", "Type")) # Type: more frequent than expected; Antitype: less frequent
# inspect results
coocStatz1_clean %>%
arrange(p)
Which terms are over- and under-proportionately used with “press”:
coocStatz2 <- cooctb_redux_press %>%
dplyr::rowwise() %>%
dplyr::mutate(p = as.vector(unlist(fisher.test(matrix(c(a, b, c, d),
ncol = 2, byrow = T))[1]))) %>%
dplyr::mutate(x2 = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))[1]))) %>%
dplyr::mutate(phi = sqrt((x2/(a + b + c + d)))) %>%
dplyr::mutate(expected = as.vector(unlist(chisq.test(matrix(c(a, b, c, d), ncol = 2, byrow = T))$expected[1]))) %>%
dplyr::mutate(Significance = dplyr::case_when(p <= .001 ~ "p<.001",
p <= .01 ~ "p<.01",
p <= .05 ~ "p<.05",
TRUE ~ "n.s."))
# add information to the table and remove superfluous columns so that the table can be parsed more easily:
coocStatz2_clean <- coocStatz2 %>%
dplyr::ungroup() %>%
dplyr::arrange(p) %>%
dplyr::mutate(j = 1:n()) %>%
dplyr::mutate(corr05 = ((j/NRows)*0.05)) %>%
dplyr::mutate(corr01 = ((j/NRows)*0.01)) %>%
dplyr::mutate(corr001 = ((j/NRows)*0.001)) %>%
dplyr::mutate(CorrSignificance = dplyr::case_when(p <= corr001 ~ "p<.001",
p <= corr01 ~ "p<.01",
p <= corr05 ~ "p<.05",
TRUE ~ "n.s.")) %>%
dplyr::mutate(p = round(p, 6)) %>%
dplyr::mutate(x2 = round(x2, 1)) %>%
dplyr::mutate(phi = round(phi, 2)) %>%
dplyr::arrange(p) %>%
dplyr::select(-a, -b, -c, -d, -j, -NRows, -corr05, -corr01, -corr001) %>%
dplyr::mutate(Type = ifelse(expected > TermCoocFreq, "Antitype", "Type"))
# inspect results
coocStatz2_clean %>%
arrange(p)
A simpler alternative is to extract collocations with the quanteda package.
We load the package:
options(stringsAsFactors = FALSE)
library(quanteda)
We create and preprocess the corpus object:
crow_title_1935_corpus <- corpus(crow_title_1935$title, docnames = crow_title_1935$id)
corpus_tokens <- crow_title_1935_corpus %>%
tokens(remove_punct = FALSE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_tolower() %>%
tokens_remove(pattern = stopwords(), padding = T)
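As an optional check, we can display the tokens of the first title; the empty strings are the pads left in place of the removed stopwords, which we drop below with tokens_remove(""):
corpus_tokens[1]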
We search for multi-word unit (MWU) candidates, setting the minimum frequency of occurrence to 2. On this basis, the algorithm detects 109 collocations. Only the 10 most frequent are displayed:
crow_mwu <- textstat_collocations(corpus_tokens, min_count = 2)
head(crow_mwu, 10) %>%
arrange(desc(count))
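Besides the raw count, textstat_collocations also returns a lambda and a z score for each candidate: higher values indicate pairs of words that occur together more often than chance would predict.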
Finally, we create a Document-Term Matrix (DTM), after removing the empty pads left by the stopword removal:
DTM <- corpus_tokens %>%
tokens_remove("") %>%
dfm()
dim(DTM)
## [1] 702 526
The DTM contains 702 documents (periodicals) and 526 distinct tokens (unigrams or word units).
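As a quick sanity check, quanteda’s topfeatures() lists the most frequent tokens in the matrix:
topfeatures(DTM, 10)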
A widely used method to weight terms according to their semantic contribution to a document is the term frequency–inverse document frequency measure (TF-IDF). The intuition is twofold: the more often a term occurs in a document, the more it contributes to that document’s content; at the same time, the more documents a term occurs in, the less informative it is for any single document. The weight is the product of both measures.
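As a toy illustration (made-up counts, not taken from our data), a term occurring twice in a title and present in 10 of the 702 periodicals would be weighted as follows:
tf <- 2 # the term occurs twice in the document
idf <- log(702 / 10) # the term occurs in 10 of the 702 documents, ~4.25
tf * idf # TF-IDF weight, ~8.5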
Let us, for instance, compute the TF-IDF weights of all terms for the “Daily” periodicity category:
# Compute IDF: log(N / n_i)
number_of_docs <- nrow(DTM)
term_in_docs <- colSums(DTM > 0)
idf <- log(number_of_docs / term_in_docs)
# Compute TF: the total frequency of each term across the daily periodicals
daily <- which(crow_title_1935$Periodicity == "Daily")
tf <- as.vector(colSums(DTM[daily, ]))
# Compute TF-IDF
tf_idf <- tf * idf
names(tf_idf) <- colnames(DTM)
The last operation appends the column names again to the resulting term-weight vector. If we now sort the TF-IDF weights in decreasing order, we get the most important terms of the daily category according to this weight:
sort(tf_idf, decreasing = T)[1:20]
## (output: the 20 terms with the highest TF-IDF weights for the Daily category)
We can instead focus on a single year, for instance 1917, known as a “significant year” in the history of journalism in China (He, 2019):
# Compute TF for the year 1917: the total frequency of each term across the periodicals established that year
title1917 <- which(crow_title_1935$Established == "1917")
tf <- as.vector(colSums(DTM[title1917, ]))
# Compute TF-IDF
tf_idf <- tf * idf
names(tf_idf) <- colnames(DTM)
sort(tf_idf, decreasing = T)[1:20]
## (top 20 terms by TF-IDF weight for 1917, including strength, universal,
## circulating, kiang, guilding, shanghailander, dairen, green, sunday,
## women, hongkong)
We can also measure the frequencies of certain terms over time, this time using decades instead of years. Frequencies per decade are plotted as line graphs to follow their trends over time. First, we determine which terms to analyze: we focus on words related to Republican values and reduce our DTM to these terms:
terms_to_observe <- c("people", "republican", "voice", "public", "citizen", "impartial", "truth", "nation", "national")
DTM_reduced <- as.matrix(DTM[, terms_to_observe])
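Had any of these terms been absent from the titles, the subsetting above would fail; a quick way to check is to list the terms missing from the DTM (the result should be empty):
terms_to_observe[!terms_to_observe %in% colnames(DTM)]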
We create a new variable for decade and we count word frequencies per decade:
crow_title_1935$decade <- paste0(substr(crow_title_1935$Established, 1, 3), "0") # e.g. "1917" -> "1910"
counts_per_decade <- aggregate(DTM_reduced, by = list(decade = crow_title_1935$decade), sum)
We plot the time series:
# give x and y values beautiful names
decades <- counts_per_decade$decade
frequencies <- counts_per_decade[, terms_to_observe]
# plot multiple frequencies
matplot(decades, frequencies, type = "l")
# add legend to the plot
l <- length(terms_to_observe)
legend('topleft', legend = terms_to_observe, col=1:l, text.col = 1:l, lty = 1:l)
Since relatively few periodicals were established before 1900, we can narrow the time window and focus on the periodicals established after 1880:
counts_per_decade2 <- counts_per_decade %>% filter(decade > 1880)
# give x and y values beautiful names
decades <- counts_per_decade2$decade
frequencies <- counts_per_decade2[, terms_to_observe]
# plot multiple frequencies
matplot(decades, frequencies, type = "l")
# add legend to the plot
l <- length(terms_to_observe)
legend('topleft', legend = terms_to_observe, col=1:l, text.col = 1:l, lty = 1:l)
## Heatmaps
The overlapping of several time series in a plot can become very confusing. Heatmaps provide an alternative for the visualization of multiple frequencies over time. In this visualization method, a time series is mapped as a row in a matrix grid. Each cell of the grid is filled with a color corresponding to the value from the time series. Thus, several time series can be displayed in parallel.
In addition, the time series can be sorted by similarity in a heatmap. In this way, similar frequency sequences with parallel shapes (heat-activated cells) can be detected more quickly. Dendrograms can be plotted alongside to visualize the degrees of similarity.
terms_to_observe <- c("people", "republican", "voice", "public", "citizen", "impartial", "truth", "nation", "national")
DTM_reduced <- as.matrix(DTM[, terms_to_observe])
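# label the rows with even establishment years only, to keep the year axis readable (same trick below)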
rownames(DTM_reduced) <- ifelse(as.integer(crow_title_1935$Established) %% 2 == 0, crow_title_1935$Established, "")
heatmap(t(DTM_reduced), scale = "row", Colv=NA, col = rev(heat.colors(256)), keep.dendro= FALSE, margins = c(5, 10))
terms_to_observe2 <- c("daily", "weekly", "monthly", "semimonthly", "journal", "magazine", "review", "pictorial", "news")
DTM_reduced2 <- as.matrix(DTM[, terms_to_observe2])
rownames(DTM_reduced2) <- ifelse(as.integer(crow_title_1935$Established) %% 2 == 0, crow_title_1935$Established, "")
heatmap(t(DTM_reduced2), scale = "row", Colv=NA, col = rev(heat.colors(256)), keep.dendro= FALSE, margins = c(5, 10))
We define the target DTM:
targetDTM <- DTM
Then we load the function calculateLogLikelihood, which scores how characteristic each term is of a target subcorpus by comparing its observed frequency there with its frequency in the rest of the corpus (a log-likelihood ratio test):
source("https://tm4ss.github.io/calculateLogLikelihood.R")
We loop over the decades to extract the keywords of each decade and save them as word clouds:
crow_decades <- unique(crow_title_1935$decade)
# make sure the output folder for the word-cloud PDFs exists
dir.create("wordclouds", showWarnings = FALSE)
for (decade in crow_decades) {
cat("Extracting terms per decade", decade, "\n")
selector_logical_idx <- crow_title_1935$decade == decade
decadeDTM <- targetDTM[selector_logical_idx, ]
termCountsTarget <- colSums(decadeDTM)
otherDTM <- targetDTM[!selector_logical_idx, ]
termCountsComparison <- colSums(otherDTM)
loglik_terms <- calculateLogLikelihood(termCountsTarget, termCountsComparison)
top100 <- sort(loglik_terms, decreasing = TRUE)[1:100]
fileName <- paste0("wordclouds/", decade, ".pdf")
pdf(fileName, width = 9, height = 7)
wordcloud::wordcloud(names(top100), top100, max.words = 100, scale = c(3, .9), colors = RColorBrewer::brewer.pal(8, "Dark2"), random.order = F)
dev.off()
}
## Extracting terms per decade 1890
## Extracting terms per decade 1870
## Extracting terms per decade 1900
## Extracting terms per decade 1910
## Extracting terms per decade 1920
## Extracting terms per decade 1930
## Extracting terms per decade NA0
## Extracting terms per decade 1850
## Extracting terms per decade 1880
## Extracting terms per decade 1860
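Note that the “NA0” group gathers the periodicals whose year of establishment is missing from the directory.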