Prologue

This tutorial is a continuation of the series devoted to text analysis on the returned students press corpus. In the previous tutorial, we relied on the tidytext and tidyverse packages to reveal textual patterns in the corpus. In this new instalment, we use the quanteda package to perform basic text statistics on the same corpus.

This experiment is based on these two excellent tutorials devoted to text statistics and frequency analysis.

From text to a corpus object

Set global options. When working with textual data strings, it is recommended to turn R’s automatic conversion of strings into factors off:

options(stringsAsFactors = FALSE)

Load the CSV file containing the press articles related to the returned students, which we previously extracted with the enpchina package:

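library(readr)
library(dplyr)       # provides %>% and distinct()
library(knitr)       # provides kable()
library(kableExtra)  # provides kable_styling()

# read the articles, skipping the row-index column X1
rs_full_text <- read_csv("Data/rs_full_text.csv",
                         col_types = cols(X1 = col_skip()))

# drop duplicated rows
textdata <- rs_full_text %>% distinct()

kable(head(textdata), caption = "First 6 documents") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE, position = "left")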
First 6 documents
DocID Date Title Source Text
1324674682 19270716 Many University Men on Shanghai Mayor’s Staff The China Weekly Review   Many University Men on Shanghai Mayor’s Staff Gen Hwang Fu Mayor of the Chinese area at Shanghai and himself scholar of note has announced the appointment of Cabinet of ten members practically all of whom are graduates of Chinese colleges or are returned students from America or Europe The list includes the following Police commissioner Shen Pu Jen Assistant commissioner Chang Zien Chun Graduate of Nanyang College Commissioner of Revenue Shu Zien Fu banker formerly director of Rank of China in Hangchow Comniiss ion of Public Works Dr Shen Chung Ye1’ Graduate of Tung Chi and returned from Germany got his Ph in civil engineering in Germ my Commissioner of Public Utilitv Huang Bah Ziau Graduate of Tung Clri and returned student from Germany formerly er of works in Hankow Commissioner of Charity Dept Wang Han Tsc member of local gentry Commissioner of Harbor Dept Lee N’ien Tse returned student from Germany formerly commissioner of industry in Shansi Province Commissioner of Education Chnu Ching Nung American returned student formerly of Shanghai College and Kuang Hua University Commissioner of agriculture commerce and labor Pan Kong Zu tS JR formerly journalist in Shanghai member of the political committee in Shanghai Commissioner of Land Chcni Ycen Belgian returned student formerly the Chinese director of China- l rench Technical College in Shanghai 10) Commissioner of Health Dr Hu Hou-ki an American returned student who has been connected with the Rockefeller Hospital at Peking In the Department of Health there are also four other returned students who will have charge of the various divisions and laboratories
1326716896 19180629 Editorial Article 1 – No Title The China Weekly Review   THREE American returned students two British returned students one French returned student one German returned student one graduate of the Shensi University and two Japanese returned students were elected at the first section of the long-expected elections tor the Senate held on June 20’ and 21 in Peking For the first time in the history of the Republic had so many foreign-educated Chinese voted Over four hundred registered and some three hundred cast their votes The majority of them are graduates of Western or Japanese universities They questioned the legality of the new Parliament to be convened but they parti in the election nevertheless on the ground that they should get into the new legislature which may be illegal their own representatives who may render some service to the public and at least can keep them informed of what is going on therein THOSE who were elected Senators are Tsur former president of the Tsing Hua College Ho Yen-sun secretary to Liang Shih-yi and Chen Huai-chang Confucianist who are returned students from America Lo Hung-nien of the Bank of China and Wang Shih-ching editor of the Peking Daily News returned students from England Wu Chun from Ger many Wu Ching-lien from France former Minister to Italy Ting Yung and Wei Sze-kan from Japan and Hsu Shih graduate of the Shensi University They were elected from the first section in which both the candidates for senatorship and voters must possess special literary or educational qualifications namely scholars returned students who have been back at least for three years prior to June 10, 1918, authors whose books have been recognized by the Ministry of Education The voting took place in the Hall of the House of Representa tives and the election was presided over bv Fu Tseng-hsiang Minister of Education
1326716374 19190215 Editorial Article 6 – No Title The China Weekly Review   IF there is one class of Chinese that will gain from the coming new era in China it is the returned student If the menace of predatory Japan and the sinister effect of the baneful Sphere of Influence is removed and this country is reorganized in both government and business the returned student from the American and European university with technical and scientific education is the man who will win When the railways of China are placed on business-like basis when the currency of the country is scientifically reorganized when civil administration of the laws takes the place of military force when real representative dem ocratic form of government in China is established when modern education comes to China’s masses the returned student with his specialized education will be in demand The demand is likely to suddenly exceed the supply and China’s schools will be called upon for more and more skilled men We often wonder wheth er he thousands of returned students in this country who are now practically inarticulate be cause they have no place in Old China realize the importance of the future and are preparing to meet the responsibility when it comes to their shoulders In Peking the returned students are preparing for conference during the coming month when the following subjects are to be studied and discussed 1. The Economic and Industrial Development of China Abolition of spheres of influence Internationalization of railroads Under what terms should foreign capital be invited to assist China in this phase of her life 2. The Emancipation of China from Militarism 3. Adequate Guarantees for Free Speech and Free Press 4. Law Reform in China and the Sensible Abolition of Extraterritoriality 5. Civil Service Requirements for Government Positions 6. 
Needed Educational Reform In view of the tremendous importance of the present year for China we should like to see the various organizations of returned students and the various universities and colleges follow the example of the Peking returned students The only way to bear responsibility is to prepare for it by serious study and work In the New China ability and energy instead of watchful waiting and intrigue will count
1416388974 19190902 Correspondence The Canton Times   Correspondence Communications under the above heading are welcomed but the editor must remain ole judge of their suitability and be doest not undertake to bold himself or this newspapers responsible for the facts or opinions presented I’o the Editor Cunlon Timis 1 The letter of Ono of the Circle 1 in your Saturday’s issue to indicate two points 0. one to crucify tho Western Rsturned students’ Union and tho other is to tho Euro-American Return- Students Association It behoves men therefore to lay out tho plain facts for tho perusal of the general 1 publio Your correspondent says that new organization will not by nny means bring the College men hete more together sincerely deplore his shortsightedness In fact ‘the Western Returned Students Union both in name and scope will bring relation between and inculcate cordial understanding among the Returned Students There is not an obstacle in the Constitution of the Union that will debar returned student from joining tho Union but every one is welcome Therefore you senior’’ or elder aro also elegible as members of the Western Returned Students Union Ho further believe it will ho best for the proposed Union to end as it is bringing whatever reforms the moro active members wish to introduce into the circle or organization through tho proper offices of the Euro-Ameri ca Returned Students Association In truth the more active of the Union have already -diagnosed consumptive symptoms in the Euro- Amcrican Returned Student Associat ion as indicated by its and during the past few yearB of its Thcroforc the same activo members of the Western Returned Students Union thought it too inadvisable to any reforms to tho Euro-American Returned Stud- Association ns it would be great folly to lay tho foundation of tho new organization 011 debris and already shows of decay Your correspondent again states As far ns it is known two of the fourteen or fifteen who attended the first meeting of 
the proposed West ern Returned Students Union last Wednesday evening were misled by press notice nnd will probably henceforth moro to do with it Whether tho said two per sons as your correspondent asserts will have anything br to do with an organization is unselfish in aims nnd open-door in policy is mat ter that concerns the said two But no one was misled by the press notice as attested by the of those present on that evening to vote for the sanction of the Constitution and its adoption as Moreover each and every one present did for the executive Committee Mr One of the Circle ’’ next t tne when you have anything to write you should think more of facts instead of rashly writing whatever to your mind nnd commit such awkward blunders However may your mis takes of the past serve you as lesson for the future Canton August 31, 1919.
1326717530 19190412 Other 1 – No Title The China Weekly Review   ONiviiKsi rY MAY 2 11919 jSS rf- library MILLARD S REVIEW OF THE FAR EAST Published Weekly Saturday April 2tH 1919 ii ag 1 1 1 Western Returned Students and China After tke ’War By Paul Reinsch Tke Facts Regarding the Tientsin Incident Tke Hope of tke estern Returned Student By Hollington Tong s3 fe wl 1 TWENTY CENTS COPY Vol VIII No 7
1319877652 19250207 China’s Returned Students The China Weekly Review   China’s Returned Students BY TSAO LIEN-EN SINCE Dr Yung Wing the father of all Return Students first embarked in sailing vessel at New York in 1847, and propounded after his return the time-honored policy of sending students abroad China has witnessed annually great numbers of young men and women sailing for America and other European countries either through governmental funds or private support for the furtherance of their education these numbers during recent years have been almost stupendous One would certainly be optimistic on the future of China when he sees mostly in summer the crowds of students flocking to the trans-Pacific steamships and probably would mutter to himself that the salvation of China hinges chiefly upon these young people But when he considers the service wrought by these Returned Students and the time and money spent thereof he would only heave sigh China from now on must not look to her Returned Students but her home-bred ones am not writing this as protest against all Returned Students have always high respect toward the earlier Returned Students and there sojourn abroad was very justifiable one because home education in those days was yet not developed and because they really did something for the nation Men like Wu Ting-Fang built their names of honor and did works beyond all bounds leaving behind them foot prints on the sands of time their names will always be cherished by the younger generation But modern Returned Students acquire their education abroad as an ornament of their lives instead of tools to work with and this most strongly denounce Being student have occasion to ask my classmates and schoolmates about their plans after graduation Most of them express the desire of studying abroad They ground their belief upon the fact that their social standing would be raised their earning capacity would be enlarged their oppotunities of seeking occupations would be 
more that they can render better service to China and some even go so far as to invent the ridiculous fallacy that they could get better wives Previously had the idea of going too but now condemn this thought At the outset let it be understood that am writing this in an unbiased vein and not giving vent to my grudge Returned Students are mostly from America so our attention will be drawn to America also When Chinese students go to America they are devided into two classes The one is the assiduous group who after landing at the far-off country confine themselves in their rooms burying deep in their books without keeping in touch with anything outside and win honors among their foreign classmates This group is promising so long as their knowledge goes but they learn little of foreign culture and their school life differ very little from that at home The other is the social group group of rich fools who assimilated by the bustling foreign city life indulge themselves in the folly of de luxe society The evening finds them in cinemas and midnight finds them in dancing houses And yet folks at home stretch out their necks and hope that these men would lead the people To these Returned Students the future of China is hopelessly -sealed For the latter group no more ink needs be spilt They will turn out to be intellectual vagabonds and first-class ne er-do-wells whom China should apprehend The former group is tolerably better Their scholarship and intellegence strike the foreign students with awe But the trouble with them is that they could not put their dead knowledge into practice after their return which shall prove later on After their return to China the first thing which behooves them is the grasping of the ricebowl Indeed the ricebowl is indispensable to every one and they are not to blame But their fault is they deign not to accept low posi tions always desiring to reach the top of the ladder by sudden leap which of course is impossible and this affords the main 
reason of their unemployment Most of them among the employed hold teaching positions many hang the capital ready to plunge into the political arena when opportunity offers some receiving pay envelopes from foreign hongs firms and some quite absurdly retire from their busy school and stay at home relying upon their family pensions In all they care too much for dignity wealth and fame and utterly disregard the particular goals which they were so ambitious to attain before their most coveted voyage So it is not surprise to find Returned Student specialized in mechanical engineering or analytic chemistry teaching English in Middle School or working as clerk in foreign trading company What is the use of his technical knowledge Does studying abroad pay One may hazard conjecture that his environments do not favor him But we must not forget that circumstances do not mould man but man moulds his circumstances If anything is lacking in Returned Student Enter prise is undeniably one Knowledge of course they have some even supercede their foreign colleagues But they are not bold enough to risk their fortune and establish some industry on big scale They dare not in the fear of bankruptcy gather shareholders and start some trade that would bring profit to them and credit to the nation Returned Student high sounding term rings in the minds of the populace as something amazing The public opinion has long been that Returned Students are far better than students their knowledge far superior their potentiality far greater their command of English far higher and render more service to China To those who hold this view call them naively realistic There are innumerable students coming out year by year from home universities with equal if not better education and work right at the foot of the ladder with patience perseverence and energy By way of illustration let me cite case One of the home colleges some time ago engaged professor Returned Student in Political Science with very handsome 
remuneration He gave his lectures his English was abominable his deplorable Shortly afterwards it was soon found out that the knowledge of ti professor could hardly excel that of the students The school was wise enough and avoided him the next semester This instance certainly does not show that all Returned Students are of this type but it at least testifies that not all Returned Students are of sterling quality Home Schools Preferable Time has changed China fifty years ago was different from the China of today Fifty years ago university and college in the modern sense were something unheard of science astronomy physics and chemistry although we have had it for thousands of years was ignored and unnoticed Our fathers and grandfathers studiously in the myriad volumes of antique lore Studying abroad in those days was very justifiable and thanks to those who went across the sea But repeat again time has changed Marked progress has been made in Commerce even more so in Education despite the fact that foreigners accuse us of being slow in going We are striving may it be like snail on willow tree Numerous schools and colleges are springing up from time to time that it rio longer becomes necessity to go abroad Universities with equal equipment equal courses as those abroad have been and are being established by the government private individuals and missionaries Law school Medical colleges schools of Commerce Agriculture and other schools offering special courses have all come into existence These are landmarks of better China Returned Student after the stay of years abroad is more or less ignorant of the existing conditions at home Usually it takes another number of years to him from his already acquired American ways Oft-times he disdains his old country He says his home-town in which he was born is dirty and so finds his abode in some trade ports Now the lamentable state of things is he never thinks of reconstructing it himself and let the incumbency fall upon somebody 
else When asked how to conduct certain institution he would say While was in the states people there do it in such and such way But he is entirely at loss to see that he is in China and that such and such way does not apply Hence we must think of some other way to suit our purpose Equally true are things in other respects and think am safe to conclude that studying abroad at present is not justifiable at all Special mention needs here be made of the wise appro- of the British share of Boxer Indemnity fund which newspaper indicates is going to be utilized not in sending students to England by which only the lucky few through push and pull can go but in establishing universities of equal standard as those in England in China in 1 establishing museums and public libraries and in exchanging professors of the two countries So the dispensability of going abroad is plain to every reasonable man In the future most sincerely hope that the students will not look upon going abroad as’ their ultimate destination Comparative Law School of China January 20, 1925,


As a reminder, the corpus contains 2739 unique articles spanning from 1841 to 1953, distributed across 12 periodicals. For each article, the table indicates the unique identifier of the document (DocID), the date of publication (Date), the title (Title), the periodical in which it appeared (Source), and the full text we extracted in the previous tutorial (Text).

The texts are now available in a data.frame together with some metadata (document id, date, title, source). Let us first see how many documents and metadata columns we have read.

# dimensions of the data frame
dim(textdata)
## [1] 2739    5
# column names of text and metadata
colnames(textdata)
## [1] "DocID"  "Date"   "Title"  "Source" "Text"


The “textdata” data frame contains 2739 documents (rows) and 5 variables (columns), namely “DocID”, “Date”, “Title”, “Source” and “Text”.


How many articles do we have per source (periodical)? This can easily be counted with the command table, which creates a cross table of the values it is given. If we apply it to a single column of our data frame, here Source, we get the number of articles for each unique periodical.

table(textdata[, "Source"])
## 
##       Peking Daily News          Peking Gazette        The Canton Times 
##                      65                      59                     112 
##        The China Critic         The China Press The China Weekly Review 
##                       1                     719                     761 
##    The Chinese Recorder  The Chinese Repository  The North China Herald 
##                     132                       1                     740 
##       The Peking Leader    The Shanghai Gazette      The Shanghai Times 
##                      30                      31                      88
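As a small illustration of how table counts unique values, here is a minimal sketch on a made-up vector of periodical names. The vector toy_sources and its values are hypothetical and only stand in for the Source column; sorting the result is an optional extra that makes the largest sources easier to spot.

```r
# Toy vector standing in for textdata$Source (hypothetical values)
toy_sources <- c("The China Press", "The Canton Times", "The China Press",
                 "Peking Gazette", "The China Press", "The Canton Times")

# table() counts how often each unique value occurs
source_counts <- table(toy_sources)

# sort() orders the counts; decreasing = TRUE puts the largest first
sorted_counts <- sort(source_counts, decreasing = TRUE)
print(sorted_counts)
```

The same sort(table(...), decreasing = TRUE) pattern should apply unchanged to textdata[, "Source"].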


Now we want to transfer the loaded text source into a corpus object of the quanteda-package. Quanteda provides a large number of highly efficient convenience functions to process text in R [1]. First we load the package.

require(quanteda)

A corpus object is created with the corpus command. As parameter, the command gets the fulltext of the documents. In our case, this is the Text-column of the textdata-data.frame. The docnames-parameter of the corpus function defines which unique identifier is given to each text example in the input (values from other columns of the data frame could be imported as metadata to each document but we will not use them in this tutorial).

rs_corpus <- corpus(textdata$Text, docnames = textdata$DocID)
# have a look at the new corpus object
summary(rs_corpus)


A corpus object in quanteda can be indexed much like an R vector or list. With the [] brackets, we can select individual documents within a corpus. We print the text of the first document of the corpus using the texts command (in quanteda 3 and later, texts is deprecated and as.character serves the same purpose):

cat(texts(rs_corpus[1]))
##                                            Many University Men on Shanghai Mayor's Staff                 Gen Hwang Fu Mayor of the Chinese area at Shanghai and himself scholar of note has announced the appointment of Cabinet of ten members practically all of whom are graduates of Chinese colleges or are returned students from America or Europe The list includes the following                 Police commissioner Shen Pu Jen Assistant commissioner Chang Zien Chun Graduate of Nanyang College Commissioner of Revenue Shu Zien Fu banker formerly director of Rank of China in Hangchow Comniiss ion of Public Works Dr Shen Chung Ye1' Graduate of Tung Chi and returned from Germany got his Ph in civil engineering in Germ my Commissioner of Public Utilitv Huang Bah Ziau Graduate of Tung Clri and returned student from Germany formerly er of works in Hankow Commissioner of Charity Dept Wang Han Tsc member of local gentry Commissioner of Harbor Dept Lee N'ien Tse returned student from Germany formerly commissioner of industry in Shansi Province Commissioner of Education Chnu Ching Nung American returned student formerly of Shanghai College and Kuang Hua University Commissioner of agriculture commerce and labor Pan Kong Zu tS JR formerly journalist in Shanghai member of the political committee in Shanghai Commissioner of Land Chcni Ycen Belgian returned student formerly the Chinese director of China- l rench Technical College in Shanghai 10) Commissioner of Health Dr Hu Hou-ki an American returned student who has been connected with the Rockefeller Hospital at Peking In the Department of Health there are also four other returned students who will have charge of the various divisions and laboratories


The command cat prints a given character vector with correct line breaks (compare the difference of the output with the print method instead):

print(texts(rs_corpus[1]))
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       1324674682 
## "                                           Many University Men on Shanghai Mayor's Staff                 Gen Hwang Fu Mayor of the Chinese area at Shanghai and himself scholar of note has announced the appointment of Cabinet of ten members practically all of whom are graduates of Chinese colleges or are returned students from America or Europe The list includes the following                 Police commissioner Shen Pu Jen Assistant commissioner Chang Zien Chun Graduate of Nanyang College Commissioner of Revenue Shu Zien Fu banker formerly director of Rank of China in Hangchow Comniiss ion of Public Works Dr Shen Chung Ye1' Graduate of Tung Chi and returned from Germany got his Ph in civil engineering in Germ my Commissioner of Public Utilitv Huang Bah Ziau Graduate of Tung Clri and returned student from Germany formerly er of works in Hankow Commissioner of Charity Dept Wang Han Tsc member of local gentry Commissioner of Harbor Dept Lee N'ien Tse returned student from Germany formerly commissioner of industry in Shansi Province Commissioner of Education Chnu Ching Nung American returned student formerly of Shanghai College and Kuang Hua University Commissioner of agriculture commerce and labor Pan Kong Zu tS JR formerly journalist in Shanghai member of the political committee in Shanghai Commissioner of Land Chcni Ycen Belgian returned student formerly the Chinese director of China- l rench Technical College in Shanghai 10) Commissioner of Health Dr Hu Hou-ki an American returned student who has been connected with the Rockefeller Hospital at Peking In the Department of Health there are also four other returned students who will have charge of the various divisions and laboratories"

Success! We now have 2739 articles available for further analysis in a convenient quanteda corpus object.

Text statistics

A further aim of this exercise is to learn about statistical characteristics of text data. At the moment, our texts are represented as long character strings wrapped in document objects of a corpus. To analyze which word forms the texts contain, they must be tokenized. This means that all the words in the texts need to be identified and separated. Only in this way is it possible to count the frequency of individual word forms. A distinct word form is also called a “type”. Each occurrence of a type in a text is a “token”.
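To make the distinction between types and tokens concrete, here is a minimal base R sketch on an invented sentence. The sentence and the naive whitespace tokenizer are assumptions for illustration only; quanteda's own tokenizer is more sophisticated.

```r
# Toy sentence (hypothetical example, not taken from the corpus)
toy_text <- "The returned student met the other returned students"

# Lowercase, then split on whitespace: a deliberately naive tokenizer
toy_tokens <- strsplit(tolower(toy_text), "\\s+")[[1]]

# Tokens are all running words; types are the distinct word forms
n_tokens <- length(toy_tokens)
toy_types <- unique(toy_tokens)
n_types <- length(toy_types)

cat("tokens:", n_tokens, "types:", n_types, "\n")
```

Note that "student" and "students" count as two different types here, because no stemming or lemmatization is applied.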

For text mining, texts are further transformed into a numeric representation. The basic idea is that the texts can be represented as statistics about the contained words (or other content fragments such as sequences of two words). The list of every distinct word form in the entire corpus forms the vocabulary of a corpus.

For each document, we can count how often each word of the vocabulary occurs in it. By this, we get a term frequency vector for each document. The dimensionality of this term vector corresponds to the size of the vocabulary. Hence, the word vectors have the same form for each document in a corpus. Consequently, multiple term vectors representing different documents can be combined into a matrix. This data structure is called document-term matrix (DTM).
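The construction described above can be sketched by hand in base R. This is only an illustration on three invented documents with a naive whitespace tokenizer; the names toy_docs, vocabulary and toy_dtm are hypothetical, and the resulting dense matrix is not how quanteda stores a DTM internally.

```r
# Three toy documents (hypothetical, not from the corpus)
toy_docs <- c(d1 = "students returned from america",
              d2 = "students from japan",
              d3 = "returned students returned")

# Naive whitespace tokenization of each document
toy_tokens <- strsplit(toy_docs, "\\s+")

# The vocabulary is the sorted set of all distinct word forms
vocabulary <- sort(unique(unlist(toy_tokens)))

# One term frequency vector per document, all of vocabulary length
counts <- vapply(
  toy_tokens,
  function(tok) as.integer(table(factor(tok, levels = vocabulary))),
  integer(length(vocabulary))
)

# Documents as rows, vocabulary terms as columns
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(docs = names(toy_docs), features = vocabulary)

print(toy_dtm)
```

Using factor(tok, levels = vocabulary) forces every document vector to cover the full vocabulary, so words absent from a document are counted as 0 rather than dropped.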

The function dfm (document-feature matrix; quanteda treats words as features of a text-based dataset) of the quanteda package creates such a DTM. If this command is called without further parameters, the individual word forms are identified by using the tokenizer of quanteda as the word separator (see help(tokens) for details). Quanteda has 3 different word separation methods. The standard and smartest way uses word boundaries and punctuation to separate the text sources. The other methods rely on whitespace information and work significantly faster, but are not as accurate.

# Create a DTM (may take a while)
DTM <- dfm(rs_corpus)
# Show some information
DTM
## Document-feature matrix of: 2,739 documents, 115,919 features (99.6% sparse).
##             features
## docs         many university men on shanghai mayor's staff gen hwang fu
##   1324674682    1          2   1  1        6       1     1   1     1  2
##   1326716896    2          2   0  3        0       0     0   0     0  1
##   1326716374    0          1   1  1        0       0     0   0     0  0
##   1416388974    0          0   2  1        0       0     0   0     0  0
##   1326717530    0          0   0  0        0       0     0   0     0  0
##   1319877652    1          1   3  6        0       0     0   0     0  0
## [ reached max_ndoc ... 2,733 more documents, reached max_nfeat ... 115,909 more features ]


Dimensionality of the DTM (size of vocabulary):

dim(DTM)
## [1]   2739 115919


The dimensions of the DTM, 2739 rows and 115919 columns, match the number of documents in the corpus and the number of different word forms (types) of the vocabulary.

We can get a first impression of text statistics from a word list. Such a word list represents the frequency counts of all words in all documents. We can get that information easily from the DTM by summing each of its columns, i.e. adding up the counts of each word over all documents.

A so-called sparse matrix data structure is used for the document term matrix in the quanteda package (quanteda builds on the Matrix package for sparse matrices). Since most entries in a document term vector are 0, it would be very inefficient to actually store all these values. A sparse data structure instead stores only those values of a vector/matrix different from zero. The Matrix package provides arithmetic operations on sparse DTMs.
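The idea of sparse storage can be illustrated directly with the Matrix package. The sketch below uses a small invented matrix (the values and the names dense and sparse are hypothetical) and assumes the Matrix package is installed, which is the case for standard R installations.

```r
library(Matrix)  # ships with standard R installations

# A small dense matrix with mostly zero entries (toy values)
dense <- matrix(0, nrow = 3, ncol = 6)
dense[1, 2] <- 4
dense[2, 5] <- 1
dense[3, 2] <- 2

# Convert to a sparse representation: only non-zero cells are stored
sparse <- Matrix(dense, sparse = TRUE)

# Number of stored non-zero values and share of empty cells
n_nonzero <- nnzero(sparse)
sparsity <- 1 - n_nonzero / prod(dim(sparse))

# Column sums work on the sparse object just as on a dense matrix
col_totals <- colSums(sparse)

cat("non-zero cells:", n_nonzero, " sparsity:", round(sparsity, 3), "\n")
```

The sparsity computed here is the same kind of figure that quanteda reports when printing a DTM (99.6% for our corpus).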

# sum columns for word counts
freqs <- colSums(DTM)
# get vocabulary vector
words <- colnames(DTM)
# combine words and their frequencies in a data frame
wordlist <- data.frame(words, freqs)
# re-order the wordlist by decreasing frequency
wordIndexes <- order(wordlist[, "freqs"], decreasing = TRUE)
wordlist <- wordlist[wordIndexes, ]
# show the most frequent words
head(wordlist, 25)


The words in this sorted list have a ranking depending on the position in this list. If the word ranks are plotted on the x axis and all frequencies on the y axis, then the Zipf distribution is obtained. This is a typical property of language data and its distribution is similar for all languages:

plot(wordlist$freqs , type = "l", lwd=2, main = "Rank frequency Plot", xlab="Rank", ylab ="Frequency")


The distribution follows an extreme power law (very few words occur very often, very many words occur very rarely). Zipf's law says that the frequency of a word is inversely proportional to its rank r, that is, proportional to 1 / r. To make the plot more readable, both axes can be put on a logarithmic scale:

plot(wordlist$freqs , type = "l", log="xy", lwd=2, main = "Rank-Frequency Plot", xlab="log-Rank", ylab ="log-Frequency")
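On logarithmic axes, a frequency proportional to 1 / r becomes a straight line with slope -1. The following sketch checks this on an idealized, hypothetical word list (toy_rank and toy_freq are made-up names and values, not corpus data) by fitting a linear model in log-log space:

```r
# Hypothetical, idealized word list: frequency exactly proportional to 1 / rank
toy_rank <- 1:1000
toy_freq <- 5000 / toy_rank

# Fit a straight line to log-frequency against log-rank
zipf_fit <- lm(log(toy_freq) ~ log(toy_rank))

# The slope estimates the (negative) Zipf exponent
zipf_slope <- unname(coef(zipf_fit)[2])
zipf_slope
```

The same kind of fit could be tried on wordlist$freqs, with seq_along(wordlist$freqs) as the rank. Real corpora usually deviate from the ideal slope, especially at the extreme ranks.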


In the plot, two extreme ranges can be identified. Words ranked between ca. 10,000 and 115919 occur only 10 times or fewer. Words ranked below 100 occur more than 1000 times in the documents. The goal of text mining is to automatically find structures in documents, and both extreme ranges of the vocabulary are often unsuitable for this. Words which occur rarely, in very few documents, and words which occur extremely often, in almost every document, do not contribute much to the meaning of a text.

Hence, ignoring very rare / frequent words has many advantages:

* reducing the dimensionality of the vocabulary (saves memory)
* processing speed-up
* better identification of meaningful structures

To illustrate the range of ranks best to be used for analysis, we augment information in the rank frequency plot. First, we mark so-called stop words. These are words of a language that normally do not contribute to semantic information about a text. In addition, all words in the word list are identified which occur less than 10 times.

The %in% operator can be used to compare which elements of the first vector are contained in the second vector. At this point, we compare the words in the word list with a loaded stopword list (retrieved by the function stopwords of the tm package). The result of the %in% operator is a boolean vector which contains TRUE or FALSE values.

A boolean value (or a vector of boolean values) can be inverted with the ! operator (TRUE becomes FALSE and vice versa). The which command returns the indices of the entries in a boolean vector that are TRUE.

We also compute the indices of words which occur fewer than 10 times. With a union set operation, we combine both index lists. With a setdiff operation, we reduce a vector of all indices (the sequence 1:nrow(wordlist)) by removing the stopword indices and the low-frequency word indices.

With the lines function, the range of the remaining indices can be drawn into the plot:

plot(wordlist$freqs, type = "l", log="xy",lwd=2, main = "Rank-Frequency plot", xlab="Rank", ylab = "Frequency")
englishStopwords <- stopwords("en")
stopwords_idx <- which(wordlist$words %in% englishStopwords)
low_frequent_idx <- which(wordlist$freqs < 10)
insignificant_idx <- union(stopwords_idx, low_frequent_idx)
meaningful_range_idx <- setdiff(1:nrow(wordlist), insignificant_idx)
lines(meaningful_range_idx, wordlist$freqs[meaningful_range_idx], col = "green", lwd=2, type="p", pch=20)


The green range marks the range of meaningful terms for the collection.
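The index operations used above (%in%, which, union and setdiff) can be followed step by step on a small example. The words, counts and stop word list below are hypothetical and chosen only for illustration:

```r
# Hypothetical mini word list, already sorted by decreasing frequency
toy_words <- c("the", "of", "student", "shanghai", "zien")
toy_freqs <- c(500, 300, 40, 25, 1)
toy_stopwords <- c("the", "of", "and")

# Boolean vector: is each word a stop word?
is_stop <- toy_words %in% toy_stopwords

# Positions of the TRUE entries
stop_idx <- which(is_stop)

# Positions of words occurring fewer than 10 times
rare_idx <- which(toy_freqs < 10)

# All positions to ignore, and the remaining "meaningful" positions
drop_idx <- union(stop_idx, rare_idx)
keep_idx <- setdiff(seq_along(toy_words), drop_idx)

toy_words[keep_idx]
```

Here the stop words sit at the top of the list and the rare word at the bottom, so only the middle ranks are kept, which mirrors the green range in the plot.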

Frequency analysis

For the frequency analyses to come, instead of using the exact date (day) of publication, we will aggregate articles by year and decade.

Preprocessing

First, we create two more columns for the year and the decade. For the year we select a substring of the first four characters from the date column of the data frame (e.g. extracting “1927” from “19270716”). For the decade we select a substring of the first three characters and paste a 0 to it. In later parts of the exercise we can use these columns for grouping data:

textdata2 <- rs_full_text %>% distinct()

# R strings are 1-indexed: characters 1 to 4 give the year, 1 to 3 the decade prefix
textdata2$year <- substr(textdata2$Date, 1, 4)
textdata2$decade <- paste0(substr(textdata2$Date, 1, 3), "0")
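The substring logic can be checked on a few date strings in the same YYYYMMDD format as the Date column. The first value is taken from the first document shown earlier; the other two are hypothetical:

```r
# Dates as YYYYMMDD strings (the last two are made-up examples)
toy_dates <- c("19270716", "19081231", "18400105")

# Year: characters 1 to 4
toy_year <- substr(toy_dates, 1, 4)

# Decade: characters 1 to 3, followed by a "0"
toy_decade <- paste0(substr(toy_dates, 1, 3), "0")

data.frame(date = toy_dates, year = toy_year, decade = toy_decade)
```

Both new columns are character vectors, which is sufficient for grouping. They can be converted with as.integer when numeric comparisons are needed.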


Our dataframe now contains two additional columns for the year and the decade:

# dimensions of the data frame
dim(textdata2)
## [1] 2739    7
# column names of text and metadata
colnames(textdata2)
## [1] "DocID"  "Date"   "Title"  "Source" "Text"   "year"   "decade"


Then, we create a corpus object again.

This time, we apply different preprocessing steps to the corpus text. remove_punct removes all characters in the Unicode “Punctuation” [P] class, remove_numbers removes tokens that consist only of numbers, and remove_symbols removes all characters in the Unicode “Symbol” [S] class. The tokens are then transformed to lowercase (this needs to be done before any lemma replacement), and finally a set of English stop words is removed:

rs_corpus_freq <- corpus(textdata2$Text, docnames = textdata2$DocID)

# Tokenize and preprocess the corpus (the DTM itself is created further below)
corpus_tokens <- rs_corpus_freq %>% 
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% 
  tokens_tolower() %>% 
  tokens_remove(pattern = stopwords())


Effects of this preprocessing can be printed in the console (only the first document is displayed):

print(paste0("1: ", substr(paste(corpus_tokens[1],collapse = " "), 0, 400), '...'))
## [1] "1: many university men shanghai mayor's staff gen hwang fu mayor chinese area shanghai scholar note announced appointment cabinet ten members practically graduates chinese colleges returned students america europe list includes following police commissioner shen pu jen assistant commissioner chang zien chun graduate nanyang college commissioner revenue shu zien fu banker formerly director rank china ..."


We see that the text is now a sequence of text features corresponding to the selected methods.
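To see what each preprocessing step does in isolation, the same chain can be applied to a single sentence. This is a minimal sketch: the sentence is made up for illustration, and toy_text, toy_tokens and toy_result are hypothetical names.

```r
library(quanteda)

# Hypothetical sentence containing punctuation, numbers, a symbol and stop words
toy_text <- "The Mayor appointed 10 returned students in 1927, + 4 more in the Health Dept!"

# Step 1: tokenize, dropping punctuation, number-only tokens and symbols
toy_tokens <- tokens(toy_text, remove_punct = TRUE, remove_numbers = TRUE,
                     remove_symbols = TRUE)

# Step 2: lowercase all tokens
toy_tokens <- tokens_tolower(toy_tokens)

# Step 3: remove English stop words
toy_tokens <- tokens_remove(toy_tokens, pattern = stopwords("en"))

# Extract the remaining tokens of the (only) document as a character vector
toy_result <- toy_tokens[[1]]
toy_result
```

Printing toy_tokens after each step shows which tokens disappear at which stage.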

From the preprocessed corpus, we create a new DTM:

DTM2 <- corpus_tokens %>% 
  dfm() 


The resulting DTM should have 2739 rows and 109507 columns:

dim(DTM2)
## [1]   2739 109507

Time series

We now want to measure the frequencies of certain terms over time. Frequencies per decade are plotted as line graphs to follow their trends over time. First, we determine which terms to analyze and reduce our DTM to these terms:

terms_to_observe <- c("nation", "boxer", "american", "women", "education")

DTM_reduced <- as.matrix(DTM2[, terms_to_observe])


The reduced DTM contains counts for each of our 5 terms in each of the 2739 documents (rows of the reduced DTM). Since our corpus covers a time span of more than 100 years, we assume that aggregating the frequencies of single documents per decade will provide a meaningful representation of term frequencies over time.

In the following, we use textdata2$decade as a grouping parameter for the aggregate function. This function sub-selects rows from the input data (DTM_reduced) for all different decade values given in the by-parameter. Each sub-selection is processed column-wise using the function provided in the third parameter (sum):

counts_per_decade <- aggregate(DTM_reduced, by = list(decade = textdata2$decade), sum)
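The behaviour of aggregate is easier to follow on a tiny matrix. In this sketch the counts, the term columns and the decade labels are hypothetical toy values:

```r
# Hypothetical counts of two terms in three documents
toy_counts <- matrix(
  c(1, 0, 2,    # column "nation"
    0, 3, 1),   # column "women"
  ncol = 2,
  dimnames = list(NULL, c("nation", "women"))
)

# Decade of each document (same length as the number of rows)
toy_decade <- c("1910", "1920", "1920")

# Rows sharing the same decade are summed column by column
toy_per_decade <- aggregate(toy_counts, by = list(decade = toy_decade), FUN = sum)
toy_per_decade
```

The result is a data frame with one row per decade: the grouping column comes first, followed by one column of summed counts per term.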


counts_per_decade now contains sums of term frequencies per decade. The time series of a single term can be plotted with the simple plot function, and additional time series can be added with the lines function (see above). A simpler way is to use the matplot function, which can draw multiple lines in one command:

# give x and y values beautiful names
decades <- counts_per_decade$decade
frequencies <- counts_per_decade[, terms_to_observe]

# plot multiple frequencies
matplot(decades, frequencies, type = "l")

# add legend to the plot
l <- length(terms_to_observe)
legend('topleft', legend = terms_to_observe, col=1:l, text.col = 1:l, lty = 1:l)  


Since there are relatively few articles prior to 1900, we can reduce the time window and focus, for instance, on the articles published from the 1890s onwards:

counts_per_decade2 <- counts_per_decade %>% filter(decade > 1880)
# give x and y values beautiful names
decades <- counts_per_decade2$decade
frequencies <- counts_per_decade2[, terms_to_observe]

# plot multiple frequencies
matplot(decades, frequencies, type = "l")

# add legend to the plot
l <- length(terms_to_observe)
legend('topleft', legend = terms_to_observe, col=1:l, text.col = 1:l, lty = 1:l)  


Among other things, we can observe peaks in reference to the growing influence of the United States after WWI. The term education rose earlier (in the late 1880s) in relation to educational reforms in the late years of the empire, and peaked in the early years of the Republic (1911-1920). References to women peaked in the early 1920s. The term nation rose after the Boxer rebellion in 1900 but did not reach such a high peak as the three previously mentioned terms. The word Boxer emerged in the context of the Boxer rebellion but it was not until the first Remission of the Boxer Indemnity by the United States (1908) that it gained currency in the press.

The problem with this decennial representation is that there are huge variations in the number of articles per decade, from just one in the 1840s to 1010 in the 1920s. You can check with table(textdata2$decade):

table(textdata2$decade)
## 
## 1840 1870 1880 1890 1900 1910 1920 1930 1940 1950 
##    1    3    1   12   68  622 1010  944   64   14


As a result, the frequency of words essentially reflects the relative frequency of documents in the corpus. In the next tutorial, we will see how we can handle such document discrepancies.
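As a rough illustration of the problem (and not of the method used in the next tutorial), one can divide raw counts by the number of articles per decade. In this sketch the document counts are copied from the table above, while the term counts in toy_term_counts are hypothetical:

```r
# Number of articles per decade, copied from table(textdata2$decade)
docs_per_decade <- c("1900" = 68, "1910" = 622, "1920" = 1010)

# Hypothetical raw counts of one term per decade (illustrative values only)
toy_term_counts <- c("1900" = 34, "1910" = 311, "1920" = 505)

# Average number of occurrences per article in each decade
toy_rate <- toy_term_counts / docs_per_decade
toy_rate
```

In this constructed case the raw counts rise steeply while the rate per article stays constant, which is exactly the pattern that raw frequencies alone cannot distinguish from a genuine increase in usage.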

Heatmaps

The overlapping of several time series in a plot can become very confusing. Heatmaps provide an alternative for the visualization of multiple frequencies over time. In this visualization method, a time series is mapped as a row in a matrix grid. Each cell of the grid is filled with a color corresponding to the value from the time series. Thus, several time series can be displayed in parallel.

In addition, the time series can be sorted by similarity in a heatmap. In this way, similar frequency sequences with parallel shapes (heat activated cells) can be detected more quickly. Dendrograms can be plotted aside to visualize quantities of similarity.

terms_to_observe <- c("nation", "boxer", "american", "women", "education", "japan", "war", "peace", "goodwill", "industry", "banking", "economy", "business", "engineering", "service")
DTM_reduced <- as.matrix(DTM2[, terms_to_observe])
rownames(DTM_reduced) <- ifelse(as.integer(textdata2$year) %% 2 == 0, textdata2$year, "")
heatmap(t(DTM_reduced), scale = "row", Colv=NA, col = rev(heat.colors(256)), keep.dendro= FALSE, margins = c(5, 10))
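Since DTM_reduced has one row per document, each column of the heatmap above corresponds to a single article. A possible variation, sketched here on hypothetical toy counts, is to sum the counts per year first so that each column stands for one year. All names and values below are made up for illustration:

```r
# Hypothetical counts of three terms in six documents
toy_counts <- matrix(
  c(1, 0, 2, 3, 0, 1,    # column "nation"
    0, 2, 1, 0, 4, 2,    # column "women"
    5, 1, 0, 2, 1, 3),   # column "education"
  ncol = 3,
  dimnames = list(NULL, c("nation", "women", "education"))
)
toy_year <- c("1910", "1910", "1911", "1912", "1912", "1913")

# Sum the counts of all documents published in the same year
per_year <- aggregate(toy_counts, by = list(year = toy_year), FUN = sum)

# Back to a numeric matrix with years as row names
year_matrix <- as.matrix(per_year[, -1])
rownames(year_matrix) <- per_year$year

# One row per term, one column per year
heatmap(t(year_matrix), scale = "row", Colv = NA,
        col = rev(heat.colors(256)), margins = c(5, 10))
```

With yearly columns the axis labels are unique and no longer need to be thinned out, although years with very few articles will still appear pale compared to the peak years.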


In the next tutorial, we will continue to rely on quanteda to apply more sophisticated methods for handling variation in the size of documents or corpora (log likelihood ratio) and multi-word units.

References

Wiedemann, Gregor, and Andreas Niekler. 2017. “Hands-on: A Five Day Text Mining Course for Humanists and Social Scientists in R.” In Proceedings of the Workshop on Teaching NLP for Digital Humanities (Teach4DH2017), Berlin, Germany, September 12, 2017, 57–65. http://ceur-ws.org/Vol-1918/wiedemann.pdf.