Prologue

This tutorial is the continuation of our series devoted to building and exploring a newspaper corpus with the enpchina and other R packages, taking the returned students as a case study. In the previous tutorial, we learned how to create the corpus. In this new instalment, we apply basic text analysis techniques to approach the content of articles.

This tutorial proceeds in several steps. The first step consists in tokenizing the text of each article. Tokenizing refers to the process of splitting the articles into basic semantic units of various lengths (sentence, word, n-gram). In this experiment, we explore three kinds of tokens: sentences, words and bigrams (two successive words).
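To make the three kinds of tokens concrete, here is a minimal sketch on a single invented sentence pair rather than on the corpus. The `toy` table and its text are assumptions made for illustration; only `unnest_tokens()` and its `token` argument come from tidytext.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for one article (invented text, not taken from the corpus)
toy <- tibble(DocID = 1,
              Text = "Students returned from America. They found work in Shanghai.")

# One row per sentence
toy_sentences <- toy %>%
  unnest_tokens(output = sentence, input = Text, token = "sentences")

# One row per word (the default token type)
toy_words <- toy %>%
  unnest_tokens(output = word, input = Text)

# One row per pair of successive words
toy_bigrams <- toy %>%
  unnest_tokens(output = bigram, input = Text, token = "ngrams", n = 2)
```

Note that `unnest_tokens()` lowercases the text by default, and that the bigram tokenizer works on the whole text, so a pair of words may straddle two sentences.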

First we load the necessary packages:

library(tidyverse)
library(tidytext)
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

Next we load the prebuilt corpus:

library(readr)
rs_full_text <- read_csv("Data/rs_full_text.csv",
                         col_types = cols(X1 = col_skip()))

rs_docs_ft <- rs_full_text

kable(head(rs_docs_ft), caption = "First 6 documents") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 documents
DocID	Date	Title	Source	Text
1324674682	19270716	Many University Men on Shanghai Mayor’s Staff	The China Weekly Review	Many University Men on Shanghai Mayor’s Staff Gen Hwang Fu Mayor of the Chinese area at Shanghai and himself scholar of note has announced the appointment of Cabinet of ten members practically all of whom are graduates of Chinese colleges or are returned students from America or Europe The list includes the following Police commissioner Shen Pu Jen Assistant commissioner Chang Zien Chun Graduate of Nanyang College Commissioner of Revenue Shu Zien Fu banker formerly director of Rank of China in Hangchow Comniiss ion of Public Works Dr Shen Chung Ye1’ Graduate of Tung Chi and returned from Germany got his Ph in civil engineering in Germ my Commissioner of Public Utilitv Huang Bah Ziau Graduate of Tung Clri and returned student from Germany formerly er of works in Hankow Commissioner of Charity Dept Wang Han Tsc member of local gentry Commissioner of Harbor Dept Lee N’ien Tse returned student from Germany formerly commissioner of industry in Shansi Province Commissioner of Education Chnu Ching Nung American returned student formerly of Shanghai College and Kuang Hua University Commissioner of agriculture commerce and labor Pan Kong Zu tS JR formerly journalist in Shanghai member of the political committee in Shanghai Commissioner of Land Chcni Ycen Belgian returned student formerly the Chinese director of China- l rench Technical College in Shanghai 10) Commissioner of Health Dr Hu Hou-ki an American returned student who has been connected with the Rockefeller Hospital at Peking In the Department of Health there are also four other returned students who will have charge of the various divisions and laboratories
1326716896	19180629	Editorial Article 1 – No Title	The China Weekly Review	THREE American returned students two British returned students one French returned student one German returned student one graduate of the Shensi University and two Japanese returned students were elected at the first section of the long-expected elections tor the Senate held on June 20’ and 21 in Peking For the first time in the history of the Republic had so many foreign-educated Chinese voted Over four hundred registered and some three hundred cast their votes The majority of them are graduates of Western or Japanese universities They questioned the legality of the new Parliament to be convened but they parti in the election nevertheless on the ground that they should get into the new legislature which may be illegal their own representatives who may render some service to the public and at least can keep them informed of what is going on therein THOSE who were elected Senators are Tsur former president of the Tsing Hua College Ho Yen-sun secretary to Liang Shih-yi and Chen Huai-chang Confucianist who are returned students from America Lo Hung-nien of the Bank of China and Wang Shih-ching editor of the Peking Daily News returned students from England Wu Chun from Ger many Wu Ching-lien from France former Minister to Italy Ting Yung and Wei Sze-kan from Japan and Hsu Shih graduate of the Shensi University They were elected from the first section in which both the candidates for senatorship and voters must possess special literary or educational qualifications namely scholars returned students who have been back at least for three years prior to June 10, 1918, authors whose books have been recognized by the Ministry of Education The voting took place in the Hall of the House of Representa tives and the election was presided over bv Fu Tseng-hsiang Minister of Education
1326716374	19190215	Editorial Article 6 – No Title	The China Weekly Review	IF there is one class of Chinese that will gain from the coming new era in China it is the returned student If the menace of predatory Japan and the sinister effect of the baneful Sphere of Influence is removed and this country is reorganized in both government and business the returned student from the American and European university with technical and scientific education is the man who will win When the railways of China are placed on business-like basis when the currency of the country is scientifically reorganized when civil administration of the laws takes the place of military force when real representative dem ocratic form of government in China is established when modern education comes to China’s masses the returned student with his specialized education will be in demand The demand is likely to suddenly exceed the supply and China’s schools will be called upon for more and more skilled men We often wonder wheth er he thousands of returned students in this country who are now practically inarticulate be cause they have no place in Old China realize the importance of the future and are preparing to meet the responsibility when it comes to their shoulders In Peking the returned students are preparing for conference during the coming month when the following subjects are to be studied and discussed 1. The Economic and Industrial Development of China Abolition of spheres of influence Internationalization of railroads Under what terms should foreign capital be invited to assist China in this phase of her life 2. The Emancipation of China from Militarism 3. Adequate Guarantees for Free Speech and Free Press 4. Law Reform in China and the Sensible Abolition of Extraterritoriality 5. Civil Service Requirements for Government Positions 6. 
Needed Educational Reform In view of the tremendous importance of the present year for China we should like to see the various organizations of returned students and the various universities and colleges follow the example of the Peking returned students The only way to bear responsibility is to prepare for it by serious study and work In the New China ability and energy instead of watchful waiting and intrigue will count
1416388974	19190902	Correspondence	The Canton Times	Correspondence Communications under the above heading are welcomed but the editor must remain ole judge of their suitability and be doest not undertake to bold himself or this newspapers responsible for the facts or opinions presented I’o the Editor Cunlon Timis 1 The letter of Ono of the Circle 1 in your Saturday’s issue to indicate two points 0. one to crucify tho Western Rsturned students’ Union and tho other is to tho Euro-American Return- Students Association It behoves men therefore to lay out tho plain facts for tho perusal of the general 1 publio Your correspondent says that new organization will not by nny means bring the College men hete more together sincerely deplore his shortsightedness In fact ‘the Western Returned Students Union both in name and scope will bring relation between and inculcate cordial understanding among the Returned Students There is not an obstacle in the Constitution of the Union that will debar returned student from joining tho Union but every one is welcome Therefore you senior’’ or elder aro also elegible as members of the Western Returned Students Union Ho further believe it will ho best for the proposed Union to end as it is bringing whatever reforms the moro active members wish to introduce into the circle or organization through tho proper offices of the Euro-Ameri ca Returned Students Association In truth the more active of the Union have already -diagnosed consumptive symptoms in the Euro- Amcrican Returned Student Associat ion as indicated by its and during the past few yearB of its Thcroforc the same activo members of the Western Returned Students Union thought it too inadvisable to any reforms to tho Euro-American Returned Stud- Association ns it would be great folly to lay tho foundation of tho new organization 011 debris and already shows of decay Your correspondent again states As far ns it is known two of the fourteen or fifteen who attended the first meeting of 
the proposed West ern Returned Students Union last Wednesday evening were misled by press notice nnd will probably henceforth moro to do with it Whether tho said two per sons as your correspondent asserts will have anything br to do with an organization is unselfish in aims nnd open-door in policy is mat ter that concerns the said two But no one was misled by the press notice as attested by the of those present on that evening to vote for the sanction of the Constitution and its adoption as Moreover each and every one present did for the executive Committee Mr One of the Circle ’’ next t tne when you have anything to write you should think more of facts instead of rashly writing whatever to your mind nnd commit such awkward blunders However may your mis takes of the past serve you as lesson for the future Canton August 31, 1919.
1326717530	19190412	Other 1 – No Title	The China Weekly Review	ONiviiKsi rY MAY 2 11919 jSS rf- library MILLARD S REVIEW OF THE FAR EAST Published Weekly Saturday April 2tH 1919 ii ag 1 1 1 Western Returned Students and China After tke ’War By Paul Reinsch Tke Facts Regarding the Tientsin Incident Tke Hope of tke estern Returned Student By Hollington Tong s3 fe wl 1 TWENTY CENTS COPY Vol VIII No 7
1319877652	19250207	China’s Returned Students	The China Weekly Review	China’s Returned Students BY TSAO LIEN-EN SINCE Dr Yung Wing the father of all Return Students first embarked in sailing vessel at New York in 1847, and propounded after his return the time-honored policy of sending students abroad China has witnessed annually great numbers of young men and women sailing for America and other European countries either through governmental funds or private support for the furtherance of their education these numbers during recent years have been almost stupendous One would certainly be optimistic on the future of China when he sees mostly in summer the crowds of students flocking to the trans-Pacific steamships and probably would mutter to himself that the salvation of China hinges chiefly upon these young people But when he considers the service wrought by these Returned Students and the time and money spent thereof he would only heave sigh China from now on must not look to her Returned Students but her home-bred ones am not writing this as protest against all Returned Students have always high respect toward the earlier Returned Students and there sojourn abroad was very justifiable one because home education in those days was yet not developed and because they really did something for the nation Men like Wu Ting-Fang built their names of honor and did works beyond all bounds leaving behind them foot prints on the sands of time their names will always be cherished by the younger generation But modern Returned Students acquire their education abroad as an ornament of their lives instead of tools to work with and this most strongly denounce Being student have occasion to ask my classmates and schoolmates about their plans after graduation Most of them express the desire of studying abroad They ground their belief upon the fact that their social standing would be raised their earning capacity would be enlarged their oppotunities of seeking occupations would be 
more that they can render better service to China and some even go so far as to invent the ridiculous fallacy that they could get better wives Previously had the idea of going too but now condemn this thought At the outset let it be understood that am writing this in an unbiased vein and not giving vent to my grudge Returned Students are mostly from America so our attention will be drawn to America also When Chinese students go to America they are devided into two classes The one is the assiduous group who after landing at the far-off country confine themselves in their rooms burying deep in their books without keeping in touch with anything outside and win honors among their foreign classmates This group is promising so long as their knowledge goes but they learn little of foreign culture and their school life differ very little from that at home The other is the social group group of rich fools who assimilated by the bustling foreign city life indulge themselves in the folly of de luxe society The evening finds them in cinemas and midnight finds them in dancing houses And yet folks at home stretch out their necks and hope that these men would lead the people To these Returned Students the future of China is hopelessly -sealed For the latter group no more ink needs be spilt They will turn out to be intellectual vagabonds and first-class ne er-do-wells whom China should apprehend The former group is tolerably better Their scholarship and intellegence strike the foreign students with awe But the trouble with them is that they could not put their dead knowledge into practice after their return which shall prove later on After their return to China the first thing which behooves them is the grasping of the ricebowl Indeed the ricebowl is indispensable to every one and they are not to blame But their fault is they deign not to accept low posi tions always desiring to reach the top of the ladder by sudden leap which of course is impossible and this affords the main 
reason of their unemployment Most of them among the employed hold teaching positions many hang the capital ready to plunge into the political arena when opportunity offers some receiving pay envelopes from foreign hongs firms and some quite absurdly retire from their busy school and stay at home relying upon their family pensions In all they care too much for dignity wealth and fame and utterly disregard the particular goals which they were so ambitious to attain before their most coveted voyage So it is not surprise to find Returned Student specialized in mechanical engineering or analytic chemistry teaching English in Middle School or working as clerk in foreign trading company What is the use of his technical knowledge Does studying abroad pay One may hazard conjecture that his environments do not favor him But we must not forget that circumstances do not mould man but man moulds his circumstances If anything is lacking in Returned Student Enter prise is undeniably one Knowledge of course they have some even supercede their foreign colleagues But they are not bold enough to risk their fortune and establish some industry on big scale They dare not in the fear of bankruptcy gather shareholders and start some trade that would bring profit to them and credit to the nation Returned Student high sounding term rings in the minds of the populace as something amazing The public opinion has long been that Returned Students are far better than students their knowledge far superior their potentiality far greater their command of English far higher and render more service to China To those who hold this view call them naively realistic There are innumerable students coming out year by year from home universities with equal if not better education and work right at the foot of the ladder with patience perseverence and energy By way of illustration let me cite case One of the home colleges some time ago engaged professor Returned Student in Political Science with very handsome 
remuneration He gave his lectures his English was abominable his deplorable Shortly afterwards it was soon found out that the knowledge of ti professor could hardly excel that of the students The school was wise enough and avoided him the next semester This instance certainly does not show that all Returned Students are of this type but it at least testifies that not all Returned Students are of sterling quality Home Schools Preferable Time has changed China fifty years ago was different from the China of today Fifty years ago university and college in the modern sense were something unheard of science astronomy physics and chemistry although we have had it for thousands of years was ignored and unnoticed Our fathers and grandfathers studiously in the myriad volumes of antique lore Studying abroad in those days was very justifiable and thanks to those who went across the sea But repeat again time has changed Marked progress has been made in Commerce even more so in Education despite the fact that foreigners accuse us of being slow in going We are striving may it be like snail on willow tree Numerous schools and colleges are springing up from time to time that it rio longer becomes necessity to go abroad Universities with equal equipment equal courses as those abroad have been and are being established by the government private individuals and missionaries Law school Medical colleges schools of Commerce Agriculture and other schools offering special courses have all come into existence These are landmarks of better China Returned Student after the stay of years abroad is more or less ignorant of the existing conditions at home Usually it takes another number of years to him from his already acquired American ways Oft-times he disdains his old country He says his home-town in which he was born is dirty and so finds his abode in some trade ports Now the lamentable state of things is he never thinks of reconstructing it himself and let the incumbency fall upon somebody 
else When asked how to conduct certain institution he would say While was in the states people there do it in such and such way But he is entirely at loss to see that he is in China and that such and such way does not apply Hence we must think of some other way to suit our purpose Equally true are things in other respects and think am safe to conclude that studying abroad at present is not justifiable at all Special mention needs here be made of the wise appro- of the British share of Boxer Indemnity fund which newspaper indicates is going to be utilized not in sending students to England by which only the lucky few through push and pull can go but in establishing universities of equal standard as those in England in China in 1 establishing museums and public libraries and in exchanging professors of the two countries So the dispensability of going abroad is plain to every reasonable man In the future most sincerely hope that the students will not look upon going abroad as’ their ultimate destination Comparative Law School of China January 20, 1925,

As a reminder, the corpus contains 2739 unique articles spanning from 1841 to 1953, distributed across 12 periodicals. For each article, the table indicates the unique identifier of the document (DocID), the date of publication (Date), the title (Title), the periodical in which it appeared (Source), and the full text we extracted in the previous tutorial (Text).
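The figures quoted above can be recomputed from the table itself. The sketch below shows one possible way to do so on a three-row toy table; the `toy_docs` values and periodical names are invented, and the only assumption carried over from the corpus is that `Date` is stored as an eight-digit number (year, month, day).

```r
library(dplyr)

# Toy rows mimicking the corpus columns (values are illustrative only)
toy_docs <- tibble(
  DocID  = c(101, 102, 103),
  Date   = c(18410301, 19270716, 19530601),
  Source = c("Periodical A", "Periodical B", "Periodical B")
)

toy_summary <- toy_docs %>%
  # Turn the eight-digit number into a proper Date
  mutate(Date = as.Date(as.character(Date), format = "%Y%m%d")) %>%
  summarise(articles    = n_distinct(DocID),
            periodicals = n_distinct(Source),
            first_year  = format(min(Date), "%Y"),
            last_year   = format(max(Date), "%Y"))
```

Running the same `summarise()` call on `rs_docs_ft` should give the number of articles, the number of periodicals and the period covered.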

Sentences

Let’s first tokenize each article into sentences:

rs_docs_sentence <- rs_docs_ft %>% 
  unnest_tokens(output = sentence, 
                input = Text, 
                token = "sentences")

kable(head(rs_docs_sentence), caption = "First 6 sentences") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 sentences
DocID	Date	Title	Source	sentence
1324674682	19270716	Many University Men on Shanghai Mayor’s Staff	The China Weekly Review	many university men on shanghai mayor’s staff gen hwang fu mayor of the chinese area at shanghai and himself scholar of note has announced the appointment of cabinet of ten members practically all of whom are graduates of chinese colleges or are returned students from america or europe the list includes the following police commissioner shen pu jen assistant commissioner chang zien chun graduate of nanyang college commissioner of revenue shu zien fu banker formerly director of rank of china in hangchow comniiss ion of public works dr shen chung ye1’ graduate of tung chi and returned from germany got his ph in civil engineering in germ my commissioner of public utilitv huang bah ziau graduate of tung clri and returned student from germany formerly er of works in hankow commissioner of charity dept wang han tsc member of local gentry commissioner of harbor dept lee n’ien tse returned student from germany formerly commissioner of industry in shansi province commissioner of education chnu ching nung american returned student formerly of shanghai college and kuang hua university commissioner of agriculture commerce and labor pan kong zu ts jr formerly journalist in shanghai member of the political committee in shanghai commissioner of land chcni ycen belgian returned student formerly the chinese director of china- l rench technical college in shanghai 10) commissioner of health dr hu hou-ki an american returned student who has been connected with the rockefeller hospital at peking in the department of health there are also four other returned students who will have charge of the various divisions and laboratories
1326716896	19180629	Editorial Article 1 – No Title	The China Weekly Review	three american returned students two british returned students one french returned student one german returned student one graduate of the shensi university and two japanese returned students were elected at the first section of the long-expected elections tor the senate held on june 20’ and 21 in peking for the first time in the history of the republic had so many foreign-educated chinese voted over four hundred registered and some three hundred cast their votes the majority of them are graduates of western or japanese universities they questioned the legality of the new parliament to be convened but they parti in the election nevertheless on the ground that they should get into the new legislature which may be illegal their own representatives who may render some service to the public and at least can keep them informed of what is going on therein those who were elected senators are tsur former president of the tsing hua college ho yen-sun secretary to liang shih-yi and chen huai-chang confucianist who are returned students from america lo hung-nien of the bank of china and wang shih-ching editor of the peking daily news returned students from england wu chun from ger many wu ching-lien from france former minister to italy ting yung and wei sze-kan from japan and hsu shih graduate of the shensi university they were elected from the first section in which both the candidates for senatorship and voters must possess special literary or educational qualifications namely scholars returned students who have been back at least for three years prior to june 10, 1918, authors whose books have been recognized by the ministry of education the voting took place in the hall of the house of representa tives and the election was presided over bv fu tseng-hsiang minister of education
1326716374	19190215	Editorial Article 6 – No Title	The China Weekly Review	if there is one class of chinese that will gain from the coming new era in china it is the returned student if the menace of predatory japan and the sinister effect of the baneful sphere of influence is removed and this country is reorganized in both government and business the returned student from the american and european university with technical and scientific education is the man who will win when the railways of china are placed on business-like basis when the currency of the country is scientifically reorganized when civil administration of the laws takes the place of military force when real representative dem ocratic form of government in china is established when modern education comes to china’s masses the returned student with his specialized education will be in demand the demand is likely to suddenly exceed the supply and china’s schools will be called upon for more and more skilled men we often wonder wheth er he thousands of returned students in this country who are now practically inarticulate be cause they have no place in old china realize the importance of the future and are preparing to meet the responsibility when it comes to their shoulders in peking the returned students are preparing for conference during the coming month when the following subjects are to be studied and discussed 1.
1326716374	19190215	Editorial Article 6 – No Title	The China Weekly Review	the economic and industrial development of china abolition of spheres of influence internationalization of railroads under what terms should foreign capital be invited to assist china in this phase of her life 2.
1326716374	19190215	Editorial Article 6 – No Title	The China Weekly Review	the emancipation of china from militarism 3.
1326716374	19190215	Editorial Article 6 – No Title	The China Weekly Review	adequate guarantees for free speech and free press 4.

We obtain 10499 sentences distributed across the 2739 articles. The length of articles ranges from 1 to 147 sentences. Very short articles (fewer than 3 sentences) and very long ones alike point to issues in punctuation and text segmentation. In the former case, the algorithm extracted only the title or a few sentences. In the latter case, it extracted as one block several articles that should be treated separately.
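The screening described above can be sketched as follows. This is only an illustration on invented counts: the lower threshold (fewer than 3 sentences) comes from the text, whereas the upper threshold of 100 sentences is an arbitrary assumption that would need to be tuned on the real corpus.

```r
library(dplyr)

# Hypothetical sentence counts per article
toy_counts <- tibble(DocID = c(1, 2, 3, 4), n = c(1, 12, 147, 2))

# "Fewer than 3" comes from the text; 100 is an arbitrary assumption
min_sentences <- 3
max_sentences <- 100

toy_flagged <- toy_counts %>%
  mutate(flag = case_when(
    n < min_sentences ~ "too short",
    n > max_sentences ~ "too long",
    TRUE              ~ "ok"
  ))
```

Articles flagged in this way are candidates for manual inspection rather than automatic removal, since a short article may be genuine.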

rs_docs_sentence_by_doc <- rs_docs_sentence %>% 
  group_by(DocID, Title) %>% 
  count(sort= TRUE)


kable(head(rs_docs_sentence_by_doc), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
DocID	Title	n
1319842253	Canton Municipal Progress	147
1371557040	Plan To Overthrow China Shown In Noulens Expose	131
1426575399	Table of Contents 1 – No Title	123
1371469000	The Dismemberment of China	122
1320966649	Outline for the Study of Current History, Finance and Commerce of China	97
1319904038	Influences Which Have Produced Leadership in China– An Analysis of Who’s Who in China	82

rs_docs_sentence_by_doc %>%  
  ggplot(aes(x=n)) + 
  geom_histogram(alpha=0.8)+
  labs(title = "Tokenizing the \"returned students\" (RS) press corpus", 
       subtitle = "Number of sentences by article", 
       x = "Number of sentences", 
       y = "Number of articles",
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

In general, the longer the article, the less frequently it occurs in the corpus.

Sentence distribution by source:

sentence_by_source <- rs_docs_sentence %>% 
  group_by(DocID, Source, Title) %>% 
  count()

sentence_by_source %>% 
  ggplot(aes(reorder(Source, n), n, color = Source)) +
  geom_boxplot(alpha = 0.4, show.legend = TRUE) + 
  coord_flip()+ 
  scale_y_log10()+
  labs(title = "Tokenizing the \"returned students\" (RS) press corpus", 
       subtitle = "Length of articles, by periodical", 
       x = NULL, 
       y = "Number of sentences by article",
       color = "Periodical",
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

The longest articles were to be found in the three most prominent periodicals: China Weekly Review (close to 150 sentences), North-China Herald (over 120) and China Press (over 100). The shortest articles appeared in the Peking Leader (4 sentences or fewer), the Shanghai Gazette (6 or fewer), the Peking Daily News and Canton Times (8 or fewer), the Peking Gazette (fewer than 15) and the Chinese Recorder (fewer than 20).

In the histograms below, we filtered out The China Critic and The Chinese Repository because they contributed only one article each (with 8 and 7 sentences, respectively):

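# Helper returning a break function that yields whole-number axis breaks
integer_breaks <- function(n = 5, ...) {
  fxn <- function(x) {
    # pretty() proposes about n round breaks; floor() forces them to integers
    breaks <- floor(pretty(x, n, ...))
    names(breaks) <- attr(breaks, "labels")
    breaks
  }
  return(fxn)
}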

rs_docs_sentence %>% 
  group_by(DocID, Source, Title) %>%
  filter(!Source %in% c("The China Critic", "The Chinese Repository"))%>% 
  count(sort= TRUE) %>%  
  ggplot(aes(x=n)) + 
  geom_histogram(alpha=0.8)+
  scale_y_continuous(breaks = integer_breaks())+
  facet_wrap(~Source, scales = "free", ncol = 2)+
  labs(title = "Tokenizing the \"returned students\" (RS) press corpus", 
       subtitle = "Length of articles, by periodical", 
       x = "Number of sentences by article", 
       y = "Number of articles",
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

In another document, we explore how the length varied according to the genre of articles over time. For now, let’s turn from the structure of articles to their content. In the next section, we focus on the words contained in articles.

Words

As a preliminary step, we create a clean dataset of tokenized text. We remove stop words, numbers and any word with three or fewer characters in order to eliminate the most frequent OCR errors, along with a few recurring word fragments (“tion”, “tions”, “chin”, “ment”):

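# Stop-word lexicon shipped with tidytext
data("stop_words")

rs_token <- rs_docs_ft %>% 
  unnest_tokens(output = word, input = Text) %>%         # one row per word, lowercased
  anti_join(stop_words, by = "word") %>%                 # drop stop words
  filter(!str_detect(word, "[0-9]")) %>%                 # drop tokens containing digits
  filter(nchar(word) > 3) %>%                            # drop words of three or fewer characters
  filter(!word %in% c("tion", "tions", "chin", "ment"))  # drop recurring word fragments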

kable(head(rs_token), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
DocID	Date	Title	Source	word
1324674682	19270716	Many University Men on Shanghai Mayor’s Staff	The China Weekly Review	university
1324674682	19270716	Many University Men on Shanghai Mayor’s Staff	The China Weekly Review	shanghai
1324674682	19270716	Many University Men on Shanghai Mayor’s Staff	The China Weekly Review	mayor’s
1324674682	19270716	Many University Men on Shanghai Mayor’s Staff	The China Weekly Review	staff
1324674682	19270716	Many University Men on Shanghai Mayor’s Staff	The China Weekly Review	hwang
1324674682	19270716	Many University Men on Shanghai Mayor’s Staff	The China Weekly Review	mayor

A new column has been added with the words contained in articles. Each row represents one single word in a given article. Each article is therefore split into as many rows as the number of words it contains.

Frequency

We can then count the occurrences of each word and sort the words by decreasing frequency in the corpus:

rs_token_count <- rs_token %>% 
  group_by(word) %>% 
  tally() %>% 
  arrange(desc(n))

kable(head(rs_token_count), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
word	n
chinese	17038
china	16887
shanghai	7982
students	6941
government	6363
foreign	5452

The five most frequent words in our corpus are “Chinese”, “China”, “Shanghai”, “students” and “government”. If we discard expected words (China, Chinese, Shanghai, students, returned), we observe that “government”, “foreign”, “American” and “Japanese” are among the words most frequently mentioned in connection with the returned students.

We notice that plural and genitive forms of the same words (e.g. “China”/“China’s”, “day”/“days”) are considered as distinct words. It may be useful to add lemmatization or stemming in the pre-processing stage in order to reduce grammatical variety and make the counts easier to interpret.
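As a rough illustration of what such a step could look like, the sketch below strips possessive endings and then applies the Snowball stemmer from the SnowballC package. This is stemming, not lemmatization, and it is one possible approach among others rather than the method used in this tutorial; the `toy_tokens` table is invented.

```r
library(dplyr)
library(stringr)
library(SnowballC)

# Invented tokens showing genitive and plural variants
toy_tokens <- tibble(word = c("china", "china’s", "students", "student"))

toy_stems <- toy_tokens %>%
  # Strip a trailing possessive, written with a curly or a straight apostrophe
  mutate(word = str_remove(word, "[’']s$"),
         # Reduce each word to its stem with the English Snowball stemmer
         stem = wordStem(word, language = "english")) %>%
  count(stem, sort = TRUE)
```

In this toy case the four tokens collapse into two stems. On real text, stems are not always dictionary words, so it is worth checking the most frequent stems by hand before interpreting the counts.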

By day

The top five words for each day in the corpus are listed below:

top5_word <- rs_token %>% 
  group_by(Date, word) %>% 
  tally() %>% 
  arrange(Date, desc(n)) %>% 
  group_by(Date) %>% 
  top_n(5)


kable(head(top5_word), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
Date	word	n
18410301	japanese	51
18410301	dutch	41
18410301	nagasaki	26
18410301	english	25
18410301	russian	25
18720921	animal	13
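
Note that `top_n()` keeps ties, so a day can return more than five rows when several words share the fifth-highest count (“english” and “russian” are tied at 25 in the table above). The base R sketch below reproduces that ranking rule on toy counts. The first five rows come from the table, while “trade” and “ship” are invented to make the tie visible.

```r
toy <- data.frame(
  word = c("japanese", "dutch", "nagasaki", "english", "russian", "trade", "ship"),
  n    = c(51, 41, 26, 25, 25, 25, 3),  # "trade" and "ship" are invented rows
  stringsAsFactors = FALSE
)

# Rank 1 = most frequent; tied words share the same (minimum) rank
toy$rank <- rank(-toy$n, ties.method = "min")

# Keep every word ranked fifth or better, ties included
top5_with_ties <- toy[toy$rank <= 5, ]
top5_with_ties
nrow(top5_with_ties)  # 6 rows, not 5, because three words tie at 25
```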

We can for instance compare the most frequent words on the first and the last day covered by our corpus:

top5_word %>% 
  filter(Date %in% c("18410301", "19530601"))%>%
  group_by(Date) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = Date)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Date, scales = "free", ncol=2) +
  labs(x = "Number of occurrences", y = "Word", 
       title = "Word frequency in the \"returned students\" (RS) press corpus", 
       subtitle = "The most frequent words on the first and last day", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

The distribution reflects the specific content of each article. The earliest article (1841) referred to missionary work in Japan while the most recent (1953) dealt with the reorganization of Chinese colleges. Little more can be said from this random example, but it may be more useful to compare the word content of articles that addressed similar topics in different periodicals, by different authors or in different time periods.

By article

In case several articles appeared on the same day, we can examine the most frequent words in each article:

top5_word_doc <- rs_token %>% 
  group_by(DocID, Title, word) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  # regroup by article only, so that top_n() selects the top words within each article
  group_by(DocID, Title) %>% 
  top_n(5)


kable(head(top5_word_doc), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
DocID	Title	word	n
1371056455	CHINA’S COTTON INDUSTRY	cotton	225
1371557040	Plan To Overthrow China Shown In Noulens Expose	workers	206
1326713165	Table of Contents 2 – No Title	china	177
1319868663	Sir John Jordan, British Minister, Retires after 43 Years of Service in China	john	175
1319907071	Canton Contemplating War Declaration Against the Central Government	canton	157
1319868663	Sir John Jordan, British Minister, Retires after 43 Years of Service in China	china	151

Taking the same example, we compare the five most frequent words in the earliest and the most recent article:

top5_word_doc %>% 
  filter(DocID %in% c("1426569000", "1371487952"))%>%
  group_by(DocID) %>%
  top_n(5, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = DocID)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ DocID, scales = "free", ncol=2) +
  labs(x = "Number of occurrences", y = "Word", 
       title = "Word frequency in the \"returned students\" (RS) press corpus", 
       subtitle = "The five most frequent words in the earliest and latest article", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

Since there was just one article published on each of these two days, we obtain the very same results as on the previous plots based on the date of publication.

We can finally examine the most frequent words for each year:

top5_word_year <- rs_token %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  group_by(Year, word) %>% 
  tally() %>% 
  arrange(Year, desc(n)) %>% 
  group_by(Year) %>% 
  top_n(5)


kable(head(top5_word_year), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
Year	word	n
1841	japanese	51
1841	dutch	41
1841	nagasaki	26
1841	english	25
1841	russian	25
1872	animal	13

Similarly, we compare the five most frequent words for the first and last year in the corpus:

top5_word_year %>% 
  filter(Year %in% c("1841", "1953"))%>%
  group_by(Year) %>%
  top_n(5, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = Year)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Year, scales = "free", ncol=2) +
  labs(x = "Number of occurrences", y = "Word", 
       title = "Word frequency in the \"returned students\" (RS) press corpus", 
       subtitle = "The five most frequent words in the first and last year", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

Since there was just one article published for the first and last year, we obtain the very same results as on the previous plots based on articles and day of publication.

By periodical

rs_token_count_source <- rs_token %>%
  group_by(Source, word) %>%
  tally() %>%
  arrange(desc(n))

rs_token_count_source %>%
  group_by(Source) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = Source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Source, scales = "free", ncol=3) +
  labs(x = "number of occurrences", y = "word", 
       title = "Word frequency in the \"returned students\" (RS) press corpus", 
       subtitle = "The ten most frequent words, by periodical", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

Let’s focus on the most frequent words in the three major periodicals:

rs_token_count_source %>%
  filter(Source %in% c("The China Press", "The China Weekly Review", "The North China Herald"))%>% 
  group_by(Source) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = Source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Source, scales = "free", ncol=1) +
  labs(x = "number of occurrences", y = "word", 
       title = "Word frequency in the \"returned students\" (RS) press corpus", 
       subtitle = "Ten most frequent words in the three top periodicals", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

If we discard the usual suspects (China, Chinese, Shanghai, returned, students, press), we find words that were more specific to each periodical. The frequent use of the word “apply” in The China Press refers to the large number of advertisements in this periodical. The word “Nanking” suggests a greater emphasis on the central government in Nanjing. The China Weekly Review focused on the American-trained and the returned students’ activities in Guangzhou and Beijing. The North-China Herald featured more general topics (foreign, time, people).

Periodicals outside of Shanghai featured different words:

rs_token_count_source %>%
  filter(Source %in% c("The Canton Times", "The Peking Leader", "Peking Daily News", "Peking Gazette"))%>% 
  group_by(Source) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = Source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Source, scales = "free", ncol=2) +
  labs(x = "number of occurrences", y = "word", 
       title = "Word frequency in the \"returned students\" (RS) press corpus", 
       subtitle = "Ten most frequent words in non-Shanghai periodicals", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

Besides the expected words (returned, students, China, Chinese, Peking in Beijing periodicals and Canton in Guangzhou), we observe that “president”, “government”, “foreign” and “American” are the most frequent in the four periodicals. Other words are more specific to each periodical. The Canton Times, for instance, devoted more space to students’ associations and American returnees, the Peking Daily News to government affairs, while the Peking Gazette emphasized college education and the Peking Leader focused on Japan-related topics.

The most frequent words in Shanghai local papers were:

rs_token_count_source %>%
  filter(Source %in% c("The Shanghai Gazette", "The Shanghai Times"))%>% 
  group_by(Source) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = Source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Source, scales = "free", ncol=2) +
  labs(x = "number of occurrences", y = "word", 
       title = "Word frequency in the \"returned students\" (RS) press corpus", 
       subtitle = "Ten most frequent words in local Shanghai papers", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

Both periodicals devoted much space to American-returned students (American, Peking) but the Shanghai Times also emphasized British and Japanese connections. Government affairs were more central in the Shanghai Times than in the Shanghai Gazette, which focused on students’ associations (club, president) and social events (party) along with more abstract concepts (people).

Removing the highly expected words would provide a finer view of the vocabulary associated with returned students in each periodical.
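
A minimal sketch of that filtering step, in base R and on invented counts: the list `expected_words` repeats the “usual suspects” named above, and `toy_counts` is a hypothetical stand-in for `rs_token_count_source`.

```r
# Words considered too expected to be informative (taken from the discussion above)
expected_words <- c("china", "chinese", "shanghai", "returned", "students", "press")

# Invented counts standing in for rs_token_count_source
toy_counts <- data.frame(
  Source = "The China Press",
  word   = c("china", "apply", "chinese", "nanking", "students"),
  n      = c(100, 40, 90, 30, 80),
  stringsAsFactors = FALSE
)

# Drop the expected words, then sort the remainder by decreasing frequency
filtered <- toy_counts[!toy_counts$word %in% expected_words, ]
filtered <- filtered[order(-filtered$n), ]
filtered
```

With the tidyverse objects used in this tutorial, the equivalent step would be a `filter(!word %in% expected_words)` placed before `top_n()`.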

Correlation between periodicals

What periodicals were the most similar to each other based on the words they shared?

To answer the question, we compute the pairwise correlation between periodicals based on the words they have in common. For the sake of legibility, the network contains only the strongest correlations (>0.8):

library(widyr)

rs_source_cors <- rs_token %>% 
  count(Source, word) %>%
  pairwise_cor(Source, word, n, sort = TRUE)

library(ggraph)
library(igraph)
set.seed(2017)

rs_source_cors %>%
  filter(correlation > .8) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, width = correlation)) +
  geom_node_point(size = 12, color = "lightblue") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
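
To make the computation more concrete, the sketch below reproduces the idea in base R on toy data. It assumes that `pairwise_cor()` treats each periodical as a vector of word counts (zero when a word is absent) and computes Pearson’s correlation between these vectors. The periodicals A, B, C and their counts are invented.

```r
# Toy counts: three hypothetical periodicals and four words
toy <- data.frame(
  Source = c("A", "A", "A", "B", "B", "B", "C", "C"),
  word   = c("china", "students", "club", "china", "students", "club", "china", "army"),
  n      = c(10, 6, 2, 20, 12, 4, 5, 9)
)

# word x periodical matrix; combinations that never occur count as zero
m <- unclass(xtabs(n ~ word + Source, data = toy))

# Pearson correlation between columns, i.e. between periodicals
source_cor <- cor(m)
round(source_cor, 2)
```

Periodicals A and B use the same words in the same proportions and are therefore perfectly correlated, whereas C, which has a different vocabulary, would fall below the 0.8 threshold and disappear from the graph.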

The North-China Herald, the China Weekly Review and the Shanghai Times are the most similar in their textual content (especially the first two). They are next closest to the Canton Times and the Shanghai Gazette. The China Press, the Peking Daily News and the Peking Gazette are more distinctive from a semantic point of view.

The two missionary publications - Chinese Repository and Chinese Recorder - and the China Critic are too distant from the core group to be represented on the graph.

Correlation analysis suggests that the place of publication (Shanghai, Beijing, Guangzhou) and the editors’ nationality (American, British, Chinese, or mixed) alone did not account for the semantic proximity between periodicals. Article contents were also topic- or context-specific.

Words over time

We can also pick up the most important words and examine how they occurred over time. We can select several words simultaneously and examine how they evolve in parallel over time.

Let’s focus for instance on the word “American” which occurred 5016 times in the corpus and the associated word “America” (2233 occurrences):

rs_token %>%
  filter(word %in% c("american", "america")) %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  group_by(Year, word) %>% 
  tally() %>% 
  ggplot() + 
  geom_col(aes(x = Year, y = n, fill = word))+ 
  scale_x_discrete(breaks = c(1850, 1880, 1900, 1910, 1920, 1930, 1940, 1950))+ 
  labs(title = "Visualizing American-related words in the RS press corpus", 
       subtitle = "Number of occurrences over time", 
       x = "Year", 
       y = NULL,
       fill = "Word", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

The words “American” and “America” rarely appeared before WWI. They peaked immediately after the war and remained prominent until the Sino-Japanese war (1937-45). They reflect the growing influence of the United States in China and in the world more generally after the war. More specifically, they also reflect the Chinese students’ growing attraction to the United States since the establishment of the Boxer Indemnity Scholarship (1908).

We can contrast the American with the Japanese and British presence in the press:

rs_token %>%
  filter(word %in% c("japanese", "japan")) %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  group_by(Year, word) %>% 
  tally() %>% 
  ggplot() + 
  geom_col(aes(x = Year, y = n, fill = word))+ 
  scale_x_discrete(breaks = c(1850, 1880, 1900, 1910, 1920, 1930, 1940, 1950))+ 
  labs(title = "Japan-related words in the RS press corpus", 
       subtitle = "Number of occurrences over time", 
       x = "Year", 
       y = NULL,
       fill = "Word", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

The Japanese presence was more scattered over time. Japan-related words appeared in the very first article of the corpus (1841) and remained almost until the end (1952). They peaked in 1918-20 (in connection with the Paris Peace Conference and the Twenty-One Demands), in 1928 (Jinan Incident) and 1936 (growing diplomatic tensions before the outbreak of the war in 1937).

library(plotly) # ggplotly() comes from plotly; skip this line if it is already loaded

p <- rs_token %>%
  filter(word %in% c("british", "britain")) %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  group_by(Year, word) %>% 
  tally() %>% 
  ggplot() + 
  geom_col(aes(x = Year, y = n, fill = word))+ 
  scale_x_discrete(breaks = c(1850, 1880, 1900, 1910, 1920, 1930, 1940, 1950))+ 
  labs(title = "British-related words in the RS press corpus", 
       subtitle = "Number of occurrences over time", 
       x = "Year", 
       y = NULL,
       fill = "Word", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

fig <- ggplotly (p)

fig

The British presence peaked in 1919-1920 (Paris Peace Conference), 1926-27 (various topics related to Sino-British relations, including the conflict in South China, Sino-British trade or the British Boxer Indemnity emulating the American precedent) and in 1933-4 (diplomatic tensions in Southwestern China and social events involving British personalities).

As these analyses make clear, not all occurrences directly refer to the returned students. Such generic words as “American”, “Japan” or “British” often reflect the broader political context. It is better to focus on more specific words or to adopt a finer-grained approach based on concordances (words in context) or collocations.

Imagine we are interested in tracing women’s presence in the corpus:

p4 <- rs_token %>%
  filter(word %in% c("women", "woman", "girl", "girls")) %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  group_by(Year, word) %>% 
  tally() %>% 
  ggplot() + 
  geom_col(aes(x = Year, y = n, fill = word))+ 
  scale_x_discrete(breaks = c(1850, 1880, 1900, 1910, 1920, 1930, 1940, 1950))+ 
  labs(title = "Women in the RS press corpus", 
       subtitle = "Number of occurrences over time", 
       x = "Year", 
       y = NULL,
       fill = "Word", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

fig4 <- ggplotly (p4)

fig4

Women appeared as early as 1841, but it was not until WWI (1917) that they gained prominence in the corpus as they became more active in social life. The three main peaks occurred in 1919-21, in 1926-31 and in 1936. They reflect the vivid debates on women’s education (especially during the May Fourth/New Culture movement) and women’s involvement in clubs, associations and social work. Notice that the word “girl(s)” emerged later and remained less prominent than “women”. It was essentially connected with education and paralleled the creation of special schools for girls during the Republic.

Words related to the returned students’ social life rose after WWI, when the first student clubs, alumni associations and other professional organizations were established. A second, minor wave occurred in the mid-1930s (1933-37), as new associations were founded or older ones were reorganized:

p5 <- rs_token %>%
  filter(word %in% c("association", "club")) %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  group_by(Year, word) %>% 
  tally() %>% 
  ggplot() + 
  geom_col(aes(x = Year, y = n, fill = word))+ 
  scale_x_discrete(breaks = c(1850, 1880, 1900, 1910, 1920, 1930, 1940, 1950))+ 
  labs(title = "Associations and club life in the RS press corpus", 
       subtitle = "Number of occurrences over time", 
       x = "Year", 
       y = NULL,
       fill = "Word", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

fig5 <- ggplotly (p5)

fig5

The fate of the words “service”, “mission” and “duty” reflects the high (and low) expectations towards the returned students during the Republican period:

rs_token %>%
  filter(word %in% c("service", "duty", "mission")) %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  group_by(Year, word) %>% 
  tally() %>% 
  ggplot() + 
  geom_col(aes(x = Year, y = n, fill = word))+ 
  scale_x_discrete(breaks = c(1850, 1880, 1900, 1910, 1920, 1930, 1940, 1950))+ 
  labs(title = "Service and duty in the RS press corpus", 
       subtitle = "Number of occurrences over time", 
       x = "Year", 
       y = NULL,
       fill = "Word", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

We could rely on n-grams to investigate more complex phrases such as “saving the nation”. We may also focus on more specific professional fields in which the returned students were particularly influential, such as education (especially Columbia-trained educators), engineering (MIT or Cornell graduates) or journalism (Missouri):

rs_token %>%
  filter(word %in% c("education", "engineering", "journalism")) %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  group_by(Year, word) %>% 
  tally() %>% 
  ggplot() + 
  geom_col(aes(x = Year, y = n, fill = word), show.legend = FALSE)+
  facet_wrap(~ word, ncol = 3, scales = "free") +
  scale_x_discrete(breaks = c(1850, 1880, 1900, 1910, 1920, 1930, 1940, 1950))+ 
  scale_y_continuous(breaks = integer_breaks())+
  labs(title = "Three major professional outlets for returned students", 
       subtitle = "Number of occurrences over time", 
       x = "Year", 
       y = NULL,
       fill = "Word", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")
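
Note that `integer_breaks()` does not appear to be exported by ggplot2 or scales, so the chunk above presumably relies on a helper defined elsewhere in the project. The following is a minimal sketch of such a helper, offered as an assumption rather than as the original definition. It returns a function, as `scale_y_continuous(breaks = ...)` expects.

```r
# Hypothetical helper (an assumption, not the tutorial's original code):
# builds a break function that only returns whole numbers
integer_breaks <- function(n = 5, ...) {
  function(x) {
    # pretty() proposes "round" breaks over the range of x;
    # floor() and unique() then keep distinct whole numbers only
    unique(floor(pretty(x, n, ...)))
  }
}

# Example: breaks for a facet whose counts range from 0 to 3
integer_breaks()(c(0, 3))
```

This matters for the “journalism” facet in particular, where yearly counts are small and default breaks could fall on fractional values.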

Education was the most important field in the corpus and it appeared earlier than the two others. It emerged in the late years of the Qing empire, in connection with the abolition of the imperial examination system, and rose after WWI when a new educational system, largely patterned after the American model, was established. Engineering took off during WWI in connection with military needs for transportation, communications and resources. It was not until 1919 that journalism appeared in the corpus, in connection with the growing influence of Missouri-trained American and Chinese journalists. There are serious limitations to such comparisons, however. Education was a broad umbrella term with many applications, whereas journalism was too narrow and recent a term to account for the returned students’ manifold contributions in the field.

Finally, we can explore more abstract notions such as “power” and “influence”:

p7 <- rs_token %>%
  filter(word %in% c("power", "influence")) %>%
  mutate(Year = stringr::str_sub(Date,0,4)) %>% 
  group_by(Year, word) %>% 
  tally() %>% 
  ggplot() + 
  geom_col(aes(x = Year, y = n, fill = word))+ 
  scale_x_discrete(breaks = c(1850, 1880, 1900, 1910, 1920, 1930, 1940, 1950))+ 
  labs(title = "Elite-related concepts in the RS press corpus", 
       subtitle = "Number of occurrences over time", 
       x = "Year", 
       y = NULL,
       fill = "Word", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

fig7 <- ggplotly (p7)

fig7

Notions of “power” and “influence” peaked in 1920, 1926-8, and 1934. Further research is needed to determine whether they applied specifically to the returned students or to some prominent leaders (such as Chiang Kai-shek during the Northern Expedition in 1926-7). Sentiment analysis will be used to determine whether and when such influence was perceived as rather positive or negative.

In this brief exploration, word frequency over time largely reflects the uneven distribution of documents in our corpus. Most words will inevitably peak after WWI and decline during the Sino-Japanese War, while the late 19th and early 20th centuries will appear sparsely populated in comparison. There are several ways of correcting these biases (tf-idf, log likelihood ratio, keyword extraction). We will explore some of them in the next tutorials. Moreover, it is always useful to refer to the introductory tutorial, which described the corpus coverage in detail, in order to better contextualize the results every step along the way.
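
As a first approximation, and as a simpler normalization than the methods listed above (it is not used in this tutorial), raw counts can be divided by the total number of tokens per year. The sketch below uses invented counts to show how a tenfold increase in raw frequency can hide a decline in relative frequency.

```r
# Invented yearly counts: "american" versus all other tokens
toy <- data.frame(
  Year = c("1905", "1905", "1920", "1920"),
  word = c("american", "other", "american", "other"),
  n    = c(5, 95, 50, 1950),
  stringsAsFactors = FALSE
)

# Total number of tokens per year, repeated on each row of that year
toy$year_total <- ave(toy$n, toy$Year, FUN = sum)

# Share of each word in the yearly total
toy$share <- toy$n / toy$year_total

american <- toy[toy$word == "american", c("Year", "n", "share")]
american
```

In this toy case the raw count rises from 5 to 50, while the share falls from 5% to 2.5% of all tokens.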

Co-occurrences

What words most often co-occur in the same articles?

First, we count, for each pair of words, the number of articles in which both words appear. The table below sorts the pairs by decreasing frequency of co-occurrence:

word_pairs <- rs_token %>%
  pairwise_count(word, DocID, sort = TRUE)

kable(head(word_pairs), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
item1	item2	n
returned	chinese	2177
chinese	returned	2177
china	returned	2071
returned	china	2071
students	returned	1961
returned	students	1961

As expected, the most frequent pairs are “Chinese/returned”, “China/returned”, “returned/students”, “foreign/returned” and “Shanghai/returned”. The latter reflects the fact that most periodicals were based in Shanghai. The pair “government/returned” reveals the strong connection between the returned students and the government as a major sponsor in sending students abroad and as a major employer upon their return.
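
The counting logic can be sketched in base R on a toy corpus. The sketch assumes that `pairwise_count()` counts each article at most once per pair, however many times the words occur in it. The three documents below are invented.

```r
# Toy tokens: "returned" occurs twice in article 1
toy <- data.frame(
  DocID = c(1, 1, 1, 2, 2, 3, 3, 3),
  word  = c("returned", "students", "returned", "returned", "government",
            "returned", "students", "government")
)

# article x word incidence matrix: 1 if the word occurs at least once, else 0
incidence <- unclass(table(toy$DocID, toy$word) > 0) * 1

# word x word matrix: number of articles shared by each pair of words
cooc <- crossprod(incidence)
cooc
```

Here “returned” and “students” co-occur in two articles (1 and 3), and the repeated “returned” in article 1 does not inflate the count. The diagonal gives the number of articles containing each word.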

We can visualize the co-occurrences as clusters of words in a network graph - the graph shows only the most frequent pairs (n>700):

set.seed(2016)

word_pairs %>%
  filter(n > 700) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, point.padding  = unit(0.2, "lines")) +
  theme_void()+
  labs(title = "Word co-occurrences in the \"returned students\" (RS) press corpus", 
       subtitle = "Most frequent pairs (n>700)", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

We can also focus on specific words we are particularly interested in - e.g. “returned”, “students”, “american” - and explore the words that most often co-occurred with them.

Words that most often co-occur with “returned”:

returned1 <- word_pairs %>%
  filter(item1 == "returned")
returned2 <- word_pairs %>%
  filter(item2 == "returned")
returned_cooc <- bind_rows(returned1, returned2)

returned_cooc

kable(head(returned_cooc), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
item1	item2	n
returned	chinese	2177
returned	china	2071
returned	students	1961
returned	shanghai	1589
returned	time	1467
returned	foreign	1463

Words that most often co-occur with “students”:

students1 <- word_pairs %>%
  filter(item1 == "students")
students2 <- word_pairs %>%
  filter(item2 == "students")
students_cooc <- bind_rows(students1, students2)


kable(head(students_cooc), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
item1	item2	n
students	returned	1961
students	chinese	1666
students	china	1605
students	shanghai	1187
students	foreign	1151
students	government	1146

Words that most often co-occur with “american”:

american1 <- word_pairs %>%
  filter(item1 == "american")
american2 <- word_pairs %>%
  filter(item2 == "american")
american_cooc <- bind_rows(american1, american2)


kable(head(american_cooc), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
item1	item2	n
american	returned	1247
american	chinese	1130
american	china	1096
american	students	992
american	shanghai	863
american	foreign	843

Let’s visualize the ten most frequent pairs for each word:

word_pairs %>%
  filter(item1 %in% c("american", "returned", "students")) %>%
  group_by(item1) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, n)) %>%
  ggplot(aes(item2, n, fill = item1)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ item1, scales = "free") +
  coord_flip()+
  labs(x = NULL, y = "Number of co-occurrences", 
       title = "Word co-occurrences in the \"returned students\" (RS) press corpus", 
       subtitle = "Most frequent words co-occurring with \"American\", \"returned\", \"students\"", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

Most of the words overlap. The main difference lies in their order of importance, i.e. their varying frequency of co-occurrence. This is partly because the three words we are interested in often co-occurred themselves, and because co-occurrences tend to emphasize the most common words regardless of their relative importance in specific contexts.

Co-occurrences refer to words that appear together in the same articles. They are not particularly meaningful since the most frequent pairs involve the most common individual words. Correlation among words is a more meaningful measure since it indicates how often words appear together relative to how often they appear separately.

Correlation in articles

What words were the most strongly correlated in the same articles? As we did for periodicals, we compute the pairwise correlation (Pearson’s correlation or phi coefficient) between words based on how often they appear in the same document.

word_cors <- rs_token %>%
  group_by(word) %>%
  filter(n() >= 1000) %>%
  pairwise_cor(word, DocID, sort = TRUE)

kable(head(word_cors), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
item1	item2	correlation
japan	japanese	0.4703641
japanese	japan	0.4703641
christian	church	0.4474874
church	christian	0.4474874
troops	army	0.4155915
army	troops	0.4155915

The ranking is quite different from the previous one based on just counting the pairs. The most strongly correlated words are Japan/Japanese, Christian/church, troops/army and people/country.
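
To see why the two rankings differ, the phi coefficient can be computed by hand from the presence or absence of two words across articles. The sketch below uses ten invented articles and checks that the result matches Pearson’s correlation on the same 0/1 vectors.

```r
# Presence (1) or absence (0) of two words across ten hypothetical articles
japan    <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0)
japanese <- c(1, 1, 0, 0, 0, 0, 0, 1, 1, 0)

n11 <- sum(japan == 1 & japanese == 1)  # articles with both words
n10 <- sum(japan == 1 & japanese == 0)  # "japan" only
n01 <- sum(japan == 0 & japanese == 1)  # "japanese" only
n00 <- sum(japan == 0 & japanese == 0)  # neither word

# Phi coefficient from the 2 x 2 contingency table
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

phi
all.equal(phi, cor(japan, japanese))
```

The two words share only three articles, a low raw count, yet they are strongly correlated because each rarely appears without the other. Very common words show the opposite pattern: high counts but low correlation.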

Similarly, we can visualize the correlations between words as a network graph:

set.seed(2021)

word_cors %>%
  filter(correlation > .25) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "orange", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()+
  labs(title = "Word correlation in the \"returned students\" (RS) press corpus", 
       subtitle = "Pairs of words that show at least a .25 correlation of appearing within the same article", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

We can also look for the words that are more strongly correlated with the three main words we are interested in:

Words correlated with “returned”:

returned1_cor <- word_cors %>%
  filter(item1 == "returned")
returned2_cor <- word_cors %>%
  filter(item2 == "returned")
returned_cor <- bind_rows(returned1_cor, returned2_cor)


kable(head(returned_cor), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
item1	item2	correlation
returned	students	0.2916052
returned	chinese	0.1994526
returned	student	0.1724735
returned	china	0.1624101
returned	education	0.0952215
returned	association	0.0913367

Words correlated with “students”:

students1_cor <- word_cors %>%
  filter(item1 == "students")
students2_cor <- word_cors %>%
  filter(item2 == "students")
students_cor <- bind_rows(students1_cor, students2_cor)


kable(head(students_cor), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
item1	item2	correlation
students	returned	0.2916052
students	schools	0.2256330
students	education	0.2058109
students	country	0.1824118
students	union	0.1822093
students	people	0.1815354

Words correlated with “American”:

american1_cor <- word_cors %>%
  filter(item1 == "american")
american2_cor <- word_cors %>%
  filter(item2 == "american")
american_cor <- bind_rows(american1_cor, american2_cor)

kable(head(american_cor), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
item1	item2	correlation
american	america	0.2925866
american	united	0.2733689
american	foreign	0.2093022
american	university	0.1951398
american	college	0.1931899
american	association	0.1900224

Let’s plot the ten most important pairs for each word:

word_cors %>%
  filter(item1 %in% c("american", "returned", "students")) %>%
  group_by(item1) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation, fill = item1)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ item1, scales = "free") +
  coord_flip()+
  labs(x = NULL, y = "Phi coefficient (Pearson)", 
       title = "Word correlation in the \"returned students\" (RS) press corpus", 
       subtitle = "Words that were most correlated with \"American\", \"returned\", \"students\"", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

Besides synonyms (America, United), the word “American” was strongly correlated with words referring to higher education (college, university), organization (association, president, week) and international relations (British, China, foreign). Besides expected words (students, Chinese, China, student), the word “returned” was strongly correlated with words referring to more general or lower levels of education (schools, education), to associations and to the national capital (Peking). The latter may refer to the headquarters of the national government until 1927 or to the cultural center in which leading educational institutions, such as Tsinghua College (established in 1909 to prepare students departing for the United States), played a crucial role in the study abroad movement.

Besides expected (returned) and common words (education, association), the word “students” was correlated with distinctive words that include political concepts (country, people), religious organizations (Christian, union) and associative life (conference, meeting).

There is much overlap between the three words we are interested in. It may be more useful to examine bigrams (pairs of successive words) instead of single words, in order to reduce such overlap.

Bigrams

First, we tokenize the text into bigrams and we remove non-words (OCR errors, numbers) or function words:

rs_docs_bigram <- rs_docs_ft %>% 
  unnest_tokens(output = bigram, 
                input = Text, 
                token = "ngrams", 
                n = 2)

# separate bigrams into words for cleaning
rs_bigrams_separated <- rs_docs_bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# remove stop words and other non words
rs_bigrams_filtered <- rs_bigrams_separated %>% 
  rename(word = word1) %>%
  anti_join(stop_words) %>% 
  filter(!str_detect(word, '[0-9]{1,}')) %>% 
  filter(nchar(word) > 3) %>% 
  filter(!word %in% c("tion", "tions", "chin", "ment")) %>%
  rename(word1 = word) %>% 
  rename(word = word2) %>%
  anti_join(stop_words) %>% 
  filter(!str_detect(word, '[0-9]{1,}')) %>% 
  filter(nchar(word) > 3) %>% 
  filter(!word %in% c("tion", "tions", "chin", "ment")) %>%
  rename(word2 = word)   

# reunite bigrams

rs_bigrams_united <- rs_bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

# count bigrams 

rs_bigram_count <- rs_bigrams_united %>%
  count(bigram, sort = TRUE)

kable(head(rs_bigram_count), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
bigram	n
returned students	2457
returned student	1214
china press	908
chinese students	735
north china	705
foreign affairs	613

Besides expected bigrams - returned student(s), Chinese students, China Press/North China (name of newspapers), the most frequent pairs referred to national politics and diplomacy (foreign affairs, Chinese government, Chinese people, national government, central government). Next, we find “American returned” (reflecting the prominent position of the United States as the favorite destination for Chinese students abroad during the Republic), local places (Nanking Road, referring to a frequent place of meeting for students’ association or to the office address of some company in advertisements), South China (as the birth place of many returned students and the most developed region in China), industry (cotton mill) and students’ union.
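
For readers new to n-grams, the sliding window behind these counts can be sketched in base R. The toy vector reuses the lower-cased title of the first article in the corpus. The sketch only illustrates the principle and is not how `unnest_tokens()` is implemented.

```r
# Words of one title, in their original order
words <- c("many", "university", "men", "on", "shanghai", "mayor's", "staff")

# Pair each word with its successor: a sentence of k words yields k - 1 bigrams
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams
```

Bigrams such as “men on” and “on shanghai” contain a function word, which is why the cleaning step above filters both halves of each pair against the stop word list.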

What were the most frequent bigrams in each periodical?

First, we compute the tf-idf for each periodical. The tf-idf (term frequency–inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. In contrast to mere frequency, it emphasizes the words that are specific to a given document (here, a periodical) instead of the words that are most common overall:

rs_bigram_tf_idf_source <- rs_bigrams_united %>%
  count(Source, bigram) %>%
  bind_tf_idf(bigram, Source, n) %>%
  arrange(desc(tf_idf))

kable(head(rs_bigram_tf_idf_source), caption = "First 6 rows") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")

First 6 rows
Source	bigram	n	tf	idf	tf_idf
The China Critic	china invented	2	0.0108696	2.484907	0.0270099
The China Critic	peiyanjr collejre	2	0.0108696	2.484907	0.0270099
The China Critic	xanyanjr college	2	0.0108696	2.484907	0.0270099
The China Critic	construction program	2	0.0108696	1.791759	0.0194756
The Chinese Repository	dutch factory	5	0.0065359	2.484907	0.0162412
The Chinese Repository	heer doeff	5	0.0065359	2.484907	0.0162412
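
To make the three computed columns concrete: in the first row above, tf × idf = 0.0108696 × 2.484907 ≈ 0.0270099. An idf of 2.484907 equals ln(12), which would be consistent with a bigram occurring in only one of twelve periodicals, although the number of periodicals is inferred from the output rather than stated here. The sketch below recomputes tf-idf by hand on made-up counts and compares the result with `bind_tf_idf()`. The `toy_counts` table and its values are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts for three "periodicals" (not taken from the corpus)
toy_counts <- tribble(
  ~Source, ~bigram,             ~n,
  "A",     "returned students", 4,
  "A",     "cotton mill",       1,
  "B",     "returned students", 3,
  "B",     "nanking road",      2,
  "C",     "returned students", 5
)

# tf-idf as computed in the tutorial
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(bigram, Source, n)

# The same quantities by hand:
# tf  = count of the bigram / total count in that periodical
# idf = ln(number of periodicals / number of periodicals containing the bigram)
n_sources <- n_distinct(toy_counts$Source)

manual <- toy_counts %>%
  group_by(Source) %>%
  mutate(tf = n / sum(n)) %>%
  group_by(bigram) %>%
  mutate(idf = log(n_sources / n_distinct(Source))) %>%
  ungroup() %>%
  mutate(tf_idf = tf * idf)

all.equal(toy_tf_idf$tf_idf, manual$tf_idf)
```

In this toy table, "returned students" occurs in every periodical, so its idf is ln(3/3) = 0 and its tf-idf is 0 however frequent it is. This is why the corpus-wide top bigrams do not dominate the per-periodical rankings.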

Let’s focus on the three leading periodicals:

rs_bigram_tf_idf_source %>%
  filter(Source %in% c("The China Press", "The China Weekly Review", "The North China Herald"))%>% 
  group_by(Source) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, tf_idf)) %>%
  ggplot(aes(tf_idf, bigram, fill = Source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Source, scales = "free", ncol=3) +
  labs(x = "tf-idf", y = "bigram", 
       title = "Highest tf-idf bigrams in \"returned students\" (RS) press corpus", 
       subtitle = "tf-idf by periodical", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

The semantic profile of the China Press is largely shaped by its advertising contents. The China Weekly Review focused on more serious topics related to politics (marshal chang, communist party), international relations (Sino-Japanese) and industry (cotton mill). The North-China Herald, established some fifty years before its competitors, is mostly characterized by words from the late imperial period.

rs_bigram_tf_idf_source %>%
  filter(Source %in% c("The Canton Times", "The Peking Leader", "Peking Daily News", "Peking Gazette"))%>%  
  group_by(Source) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, tf_idf)) %>%
  ggplot(aes(tf_idf, bigram, fill = Source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Source, scales = "free", ncol=2) +
  labs(x = "tf-idf", y = "bigram", 
       title = "Highest tf-idf bigrams in \"returned students\" (RS) press corpus", 
       subtitle = "Highest tf-idf bigrams in non-Shanghai periodicals", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

Published in Canton, the birthplace of the first returned students’ organizations, The Canton Times emphasized students’ associations, whereas the Peking Leader focused on military affairs, diplomatic relations with Japan and Korea, and women’s education. The profile of the Peking Gazette was largely shaped by its advertisements, whereas The Peking Daily News emphasized technology along with political affairs. Due to the short lifetime of non-Shanghai periodicals, their vocabulary was more context-sensitive than that of the three major periodicals.

rs_bigram_tf_idf_source %>%
  filter(Source %in% c("The Shanghai Gazette", "The Shanghai Times"))%>% 
  group_by(Source) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, tf_idf)) %>%
  ggplot(aes(tf_idf, bigram, fill = Source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Source, scales = "free", ncol=2) +
  labs(x = "tf-idf", y = "bigram", 
       title = "Highest tf-idf bigrams in \"returned students\" (RS) press corpus", 
       subtitle = "Highest tf-idf bigrams in local Shanghai papers", 
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")

The two local newspapers appeared more eclectic in their content. The Shanghai Gazette featured a variety of themes ranging from social life in Shanghai (Hongkew recreation, students union, California Glee) to national and international politics (Japanese warships, Shantung question, congressional party). Besides its advertisements, the Shanghai Times focused on international affairs in Latin America (Mexico) and Europe (German emperor, bolshevism).
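
The three plots in this section repeat the same pipeline with a different list of periodicals. As one possible refactoring, the sketch below wraps it in a helper function. The name `plot_top_bigrams()`, its arguments and the toy table are hypothetical and not part of the original tutorial. It also swaps in `slice_max()` for `top_n()` and uses tidytext's `reorder_within()` and `scale_y_reordered()`, so that bars are ordered inside each facet even when the same bigram appears in several periodicals.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Hypothetical helper: plot the highest tf-idf bigrams for chosen periodicals
plot_top_bigrams <- function(tf_idf_tbl, sources, subtitle, top = 10, ncol = 2) {
  tf_idf_tbl %>%
    filter(Source %in% sources) %>%
    group_by(Source) %>%
    slice_max(tf_idf, n = top) %>%   # keeps ties, as top_n() does
    ungroup() %>%
    # order bigrams within each periodical rather than across the whole table
    mutate(bigram = reorder_within(bigram, tf_idf, Source)) %>%
    ggplot(aes(tf_idf, bigram, fill = Source)) +
    geom_col(show.legend = FALSE) +
    scale_y_reordered() +
    facet_wrap(~ Source, scales = "free", ncol = ncol) +
    labs(x = "tf-idf", y = "bigram", subtitle = subtitle)
}

# Toy stand-in for rs_bigram_tf_idf_source, with made-up values
toy_tf_idf <- tibble(
  Source = c("Paper A", "Paper A", "Paper B", "Paper B"),
  bigram = c("cotton mill", "nanking road", "cotton mill", "german emperor"),
  tf_idf = c(0.02, 0.01, 0.005, 0.03)
)

p <- plot_top_bigrams(toy_tf_idf, c("Paper A", "Paper B"), "Toy example")
```

With the real data, a call such as `plot_top_bigrams(rs_bigram_tf_idf_source, c("The Shanghai Gazette", "The Shanghai Times"), "Highest tf-idf bigrams in local Shanghai papers")` would be expected to produce a plot comparable to the last figure above, without its title and caption.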

We can finally visualize the corpus as a network of bigrams:

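# igraph provides graph_from_data_frame(); ggraph provides the plotting layers
library(igraph)
library(ggraph)

rs_bigrams_count2 <- rs_bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)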

# filter for only relatively common combinations (over 100)
rs_bigram_graph <- rs_bigrams_count2 %>%
  filter(n > 100) %>%
  graph_from_data_frame()

# visualize as a graph

set.seed(2017)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(rs_bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() +
  labs(title = "Common bigrams in the RS press corpus",
       caption = "Based on ProQuest \"Chinese Newspapers Collection (CNC)\"")
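
To see how the count table becomes a network, the following sketch builds a small directed graph from made-up bigram counts. The `toy_counts` table and its lower threshold are hypothetical; `graph_from_data_frame()` is used as in the code above, where the first two columns define the edges (word1 to word2) and the remaining column `n` is stored as an edge attribute.

```r
library(dplyr)
library(igraph)

# Hypothetical bigram counts (made up for illustration)
toy_counts <- tibble(
  word1 = c("returned", "returned", "chinese", "north"),
  word2 = c("students", "student", "students", "china"),
  n     = c(3, 2, 2, 1)
)

# Keep only bigrams seen more than once, then build the directed graph
toy_graph <- toy_counts %>%
  filter(n > 1) %>%
  graph_from_data_frame()

vcount(toy_graph)   # number of distinct words kept as nodes
ecount(toy_graph)   # number of bigrams kept as edges
E(toy_graph)$n      # bigram counts carried along as an edge attribute
```

Each distinct word becomes a single node, so a word shared by several bigrams ("returned", "students") acts as a hub. The edge attribute `n` is what `edge_alpha = n` maps to transparency in the plot above.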

Concluding remarks

What can we learn from this tutorial (1) about our corpus, (2) about the textual features of historical newspapers, and (3) about text analysis with R more generally?

Coverage of the returned students in the English-language press spanned a wide spectrum of topics ranging from education to national politics, international relations, industry and social life (clubs, associations).

Beyond topic- or context-sensitive words, this preliminary exploration has pointed to the eclectic nature of newspapers both in their content (topic) and structure (length of articles). We also realized the prominence of advertisements, especially in the China Press, accentuated in frequency analyses by their repetitive character. It would be useful to classify the articles so as to conduct finer analyses based on their genre (i.e. the newspaper section under which they appeared). Text classification will be the subject of another tutorial. We could then choose, for instance, to exclude advertisements, to focus on the “people” or “local news” sections (to which we can apply Named Entity Recognition for the study of social actors) or on opinion articles (to which we can apply sentiment analysis for the study of cultural perceptions).

The three major periodicals in our corpus (North-China Herald, China Weekly Review, China Press) appeared closer to one another from a semantic perspective. This proximity reflected their statistical weight and overlapping time spans as much as their editorial profile (general vs missionary, nation-wide vs local) and place of publication (Shanghai vs Canton, Peking).

Finally, the vocabulary we extracted from these newspapers reflects the modern press in the making in late imperial and early Republican China. The English press in modern China clearly participated in the global emergence of “the news” as a new item of cultural consumption for foreign and English-speaking Chinese elites, with a particular emphasis on national and international politics, but also a keen interest in local personalities’ mundane life. The most common words in the corpus signaled the rise of the press as a kind of “fourth estate” with its capacity to shape hot topics (the war with Japan, cotton trade) packaged through a standard yet flexible language that circulated between newspapers and changed over time. We will explore the patterns of text reuse and the evolving stylometry of journalism in future tutorials.

A last word on the limitations of word-based approaches to text is in order. N-gram-based, purely quantitative approaches to text (sometimes referred to as “text analytics” or “culturomics”) need to be supplemented by more qualitative methods, such as concordancing. In addition, supervised exploration based on predefined keywords needs to be supplemented by unsupervised techniques, such as topic modeling. As we will see in a future tutorial, topic modeling is useful for suggesting words that we would not have thought of at first sight. Despite these shortcomings, text analytics remains valuable in that it provides an overarching framework for contextualizing more focused analyses or the close reading of selected articles.

In the next tutorials, we will rely on the package quanteda to apply more sophisticated techniques for handling time series, variation in the size of documents or corpora (log likelihood ratio), multi-word units and collocations.

References

Borin, Lars, Devdatt Dubhashi, Markus Forsberg, Richard Johansson, Dimitrios Kokkinakis, and Pierre Nugues. “Mining Semantics for Culturomics: Towards a Knowledge-Based Approach.” In Proceedings of the 2013 International Workshop on Mining Unstructured Big Data Using Natural Language Processing, 3–10. San Francisco, California, USA: Association for Computing Machinery, 2013.

Field, Andy, Jeremy Miles, and Zoe Field. Discovering Statistics Using R. Sage, 2017.

Lansdall-Welfare T, Sudhahar S, Thompson J, Lewis J, Cristianini N, and FindMyPast Newspaper Team. “Content Analysis of 150 Years of British Periodicals.” Proceedings of the National Academy of Sciences of the United States of America 114, no. 4 (2017): 457.

Morse-Gagne, E. E. “Culturomics: Statistical Traps Muddy the Data.” Science 332, no. 6025 (2011): 35.

Schweinberger, Martin. “Text Analysis and Distant Reading Using R.” Brisbane: The University of Queensland, 2021. url: https://slcladal.github.io/coll.html (Version 2021.04.11).

Silge, Julia, and David Robinson. “Relationships between words: n-grams and correlations.” In Text Mining with R: A Tidy Approach. Sebastopol: O’Reilly, 2017. url: https://www.tidytextmining.com/index.html

The golden age of the returned students in China (2)

Text Analysis of the ‘returned students’ (RS) press corpus

Cécile Armand

2021-05-28