Abstract
This document applies biterm topic modeling to the titles of periodicals in Crow’s Newspaper Directory of China (1935).
library(readr)
library(dplyr)     # pipes and data manipulation
library(tidytext)  # unnest_tokens() and the stop_words dataset
library(BTM)       # biterm topic model
crow_title_id <- read_csv("crow_title_id.csv",
col_types = cols(X1 = col_skip()))
crow_title_1931 <- crow_title_id %>% filter(Year == "1931")
crow_title_1935 <- crow_title_id %>% filter(Year == "1935")
Based on the pre-tokenized text (crow_unigram), we prepare a standard dataset x with two columns: the unique id of each periodical (doc_id) and the words it contains (lemma). This is the format expected by the BTM algorithm. We simply select the original variables and rename “word” to lemma and “id” to doc_id.
data("stop_words")
crow_unigram <- crow_title_1935 %>%
unnest_tokens(output = word, input = title) %>%
anti_join(stop_words)
x <- crow_unigram %>%
rename(doc_id = id) %>%
rename(lemma = word) %>%
select(doc_id, lemma)
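To make the expected input concrete, here is a minimal sketch of the two-column structure BTM consumes: one row per (document, word) pair. The rows are toy values, not the actual Crow data.

```r
# Toy version of the two-column input expected by BTM():
# one row per (doc_id, lemma) pair.
x_toy <- data.frame(
  doc_id = c("Andong_Daily_329", "Andong_Daily_329", "Anqing_Daily_359"),
  lemma  = c("daily", "news", "daily"),
  stringsAsFactors = FALSE
)
str(x_toy)  # 3 obs. of 2 variables: doc_id, lemma
```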
We build a model with 10 topics (k = 10):
set.seed(321)
model10 <- BTM(x, k = 10, beta = 0.01, iter = 1000, trace = 100)
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 1/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 101/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 201/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 301/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 401/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 501/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 601/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 701/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 801/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 901/1000
Model description:
model10
## Biterm Topic Model
## trained with 1000 Gibbs iterations, alpha: 5, beta: 0.01
## topics: 10
## size of the token vocabulary: 500
## topic distribution theta: 0.037 0.052 0.129 0.123 0.068 0.149 0.174 0.068 0.158 0.042
model10$theta
## [1] 0.03707572 0.05221932 0.12898172 0.12323760 0.06788512 0.14882507
## [7] 0.17389034 0.06788512 0.15822454 0.04177546
The model consists of 500 distinct tokens (words) distributed across 10 topics. The topic distribution theta gives the probability of each topic, i.e. how prevalent each topic is in the dataset.
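Since theta is a probability distribution, its values sum to 1, and the largest entry identifies the most prevalent topic. A quick check using the theta values printed above (rounded to three decimals):

```r
# Theta values from the model output above, rounded to three decimals;
# together they form a probability distribution over the 10 topics.
theta <- c(0.037, 0.052, 0.129, 0.123, 0.068, 0.149, 0.174, 0.068, 0.158, 0.042)
sum(theta)        # sums to 1 (up to rounding)
which.max(theta)  # topic 7 is the most prevalent in the corpus
```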
The five most frequent words for each topic, with their conditional probability (i.e. the probability of the word within the topic), in decreasing order:
terms(model10)
## [[1]]
## token probability
## 1 monthly 0.12416058
## 2 weekly 0.09496350
## 3 chin 0.06576642
## 4 movie 0.06576642
## 5 pao 0.03656934
##
## [[2]]
## token probability
## 1 pao 0.21030769
## 2 jih 0.13338462
## 3 min 0.04620513
## 4 sin 0.04620513
## 5 chung 0.04107692
##
## [[3]]
## token probability
## 1 news 0.27813906
## 2 daily 0.18202454
## 3 evening 0.04910020
## 4 commercial 0.04092025
## 5 east 0.04092025
##
## [[4]]
## token probability
## 1 daily 0.20344754
## 2 news 0.18845824
## 3 commercial 0.07496788
## 4 north 0.05141328
## 5 press 0.04927195
##
## [[5]]
## token probability
## 1 china 0.11376471
## 2 journal 0.10592157
## 3 weekly 0.09023529
## 4 medical 0.06278431
## 5 review 0.05494118
##
## [[6]]
## token probability
## 1 daily 0.29205310
## 2 news 0.29205310
## 3 republican 0.11506195
## 4 shantung 0.01771681
## 5 current 0.01063717
##
## [[7]]
## token probability
## 1 news 0.14978820
## 2 press 0.11801815
## 3 daily 0.07414523
## 4 evening 0.06355522
## 5 morning 0.04388805
##
## [[8]]
## token probability
## 1 chinese 0.09023529
## 2 journal 0.07062745
## 3 commerce 0.04317647
## 4 economic 0.04317647
## 5 hongkong 0.04317647
##
## [[9]]
## token probability
## 1 press 0.21299501
## 2 people 0.13146423
## 3 daily 0.10151414
## 4 news 0.08986689
## 5 commercial 0.04494176
##
## [[10]]
## token probability
## 1 shanghai 0.12909677
## 2 times 0.09683871
## 3 tientsin 0.05812903
## 4 review 0.04522581
## 5 sunday 0.04522581
The ten most frequent words for each topic, with their conditional probability (i.e. the probability of the word within the topic), in decreasing order:
topicterms10 <- terms(model10, top_n = 10)
topicterms10
## [[1]]
## token probability
## 1 monthly 0.12416058
## 2 weekly 0.09496350
## 3 chin 0.06576642
## 4 movie 0.06576642
## 5 pao 0.03656934
## 6 education 0.02927007
## 7 public 0.02927007
## 8 radio 0.02927007
## 9 sports 0.02927007
## 10 fen 0.02197080
##
## [[2]]
## token probability
## 1 pao 0.21030769
## 2 jih 0.13338462
## 3 min 0.04620513
## 4 sin 0.04620513
## 5 chung 0.04107692
## 6 hwa 0.03594872
## 7 shang 0.03082051
## 8 chengtu 0.02569231
## 9 shih 0.02569231
## 10 kan 0.02056410
##
## [[3]]
## token probability
## 1 news 0.27813906
## 2 daily 0.18202454
## 3 evening 0.04910020
## 4 commercial 0.04092025
## 5 east 0.04092025
## 6 industrial 0.03887526
## 7 canton 0.02660532
## 8 tsinan 0.02047035
## 9 pictorial 0.01842536
## 10 social 0.01842536
##
## [[4]]
## token probability
## 1 daily 0.20344754
## 2 news 0.18845824
## 3 commercial 0.07496788
## 4 north 0.05141328
## 5 press 0.04927195
## 6 industrial 0.04498929
## 7 china 0.03428266
## 8 eastern 0.03214133
## 9 citizen 0.02785867
## 10 ichang 0.01715203
##
## [[5]]
## token probability
## 1 china 0.11376471
## 2 journal 0.10592157
## 3 weekly 0.09023529
## 4 medical 0.06278431
## 5 review 0.05494118
## 6 chinese 0.05101961
## 7 press 0.05101961
## 8 hongkong 0.03141176
## 9 trade 0.02749020
## 10 engineering 0.02356863
##
## [[6]]
## token probability
## 1 daily 0.292053097
## 2 news 0.292053097
## 3 republican 0.115061947
## 4 shantung 0.017716814
## 5 current 0.010637168
## 6 shansi 0.010637168
## 7 impartial 0.008867257
## 8 kweichow 0.008867257
## 9 northwest 0.008867257
## 10 overseas 0.008867257
##
## [[7]]
## token probability
## 1 news 0.14978820
## 2 press 0.11801815
## 3 daily 0.07414523
## 4 evening 0.06355522
## 5 morning 0.04388805
## 6 south 0.04237519
## 7 china 0.03632375
## 8 post 0.03329803
## 9 public 0.03178517
## 10 tientsin 0.02724660
##
## [[8]]
## token probability
## 1 chinese 0.09023529
## 2 journal 0.07062745
## 3 commerce 0.04317647
## 4 economic 0.04317647
## 5 hongkong 0.04317647
## 6 monthly 0.04317647
## 7 shanghai 0.03925490
## 8 magazine 0.03533333
## 9 chamber 0.03141176
## 10 directory 0.03141176
##
## [[9]]
## token probability
## 1 press 0.21299501
## 2 people 0.13146423
## 3 daily 0.10151414
## 4 news 0.08986689
## 5 commercial 0.04494176
## 6 voice 0.02663894
## 7 nantung 0.01665557
## 8 common 0.01499168
## 9 chekiang 0.01332779
## 10 river 0.01332779
##
## [[10]]
## token probability
## 1 shanghai 0.12909677
## 2 times 0.09683871
## 3 tientsin 0.05812903
## 4 review 0.04522581
## 5 sunday 0.04522581
## 6 nichi 0.03877419
## 7 shimbun 0.03877419
## 8 peking 0.03232258
## 9 guide 0.02587097
## 10 herald 0.02587097
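terms() returns a list with one data frame (token, probability) per topic. Binding these into a single tidy table with a topic index makes the output easier to filter or plot with ggplot2. A sketch using a toy list that mimics the structure of topicterms10:

```r
# Toy list mimicking the structure returned by terms(model10, top_n = 10):
# one data frame of (token, probability) per topic.
topicterms_toy <- list(
  data.frame(token = c("monthly", "weekly"), probability = c(0.124, 0.095)),
  data.frame(token = c("pao", "jih"),       probability = c(0.210, 0.133))
)
# Prepend a topic index to each data frame, then stack them into one table.
tidy_terms <- do.call(rbind, Map(cbind,
                                 topic = seq_along(topicterms_toy),
                                 topicterms_toy))
tidy_terms  # columns: topic, token, probability
```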
Probability distribution of topics across periodicals (only the first ten rows are displayed):
scores <- predict(model10, newdata = x)
head(scores, 10)
## [,1] [,2] [,3] [,4]
## Andong_Daily_329 1.745165e-05 1.509896e-06 8.463120e-01 1.011944e-01
## Anqing_Daily_359 1.743393e-06 1.224138e-04 4.760552e-07 1.147554e-03
## Anqing_Daily_603 3.142917e-06 1.997545e-08 7.974360e-02 5.771181e-02
## Anqing_Daily_729 1.743393e-06 1.224138e-04 4.760552e-07 1.147554e-03
## Anqing_Daily_772 8.544342e-08 5.999480e-06 4.668612e-05 1.969012e-01
## Anqing_Daily_902 1.716105e-05 9.999287e-01 4.686039e-06 4.909134e-06
## Aomen_Daily_419 7.480627e-06 5.200578e-06 2.042679e-06 6.441181e-04
## Aomen_Daily_443 3.549999e-07 1.024948e-05 9.693716e-08 9.476704e-05
## Aomen_Daily_506 4.446203e-04 4.399638e-04 4.333508e-04 4.335575e-04
## Aomen_Daily_553 9.364620e-05 8.284073e-07 4.254872e-02 3.110605e-02
## [,5] [,6] [,7] [,8]
## Andong_Daily_329 1.147835e-06 1.090422e-02 7.006417e-03 2.669429e-05
## Anqing_Daily_359 1.198722e-03 4.155731e-05 2.740121e-03 9.213849e-07
## Anqing_Daily_603 1.518549e-08 8.212997e-01 2.359904e-02 1.518549e-08
## Anqing_Daily_729 1.198722e-03 4.155731e-05 2.740121e-03 9.213849e-07
## Anqing_Daily_772 5.874917e-05 2.036717e-06 1.478565e-01 4.560850e-06
## Anqing_Daily_902 9.069631e-06 4.050182e-06 3.457546e-06 9.069631e-06
## Aomen_Daily_419 3.953519e-06 1.765504e-06 1.056526e-03 3.953519e-06
## Aomen_Daily_443 9.904109e-05 3.479517e-06 2.613029e-02 1.876178e-07
## Aomen_Daily_506 4.373758e-04 4.327608e-04 2.165370e-01 8.791253e-02
## Aomen_Daily_553 6.297616e-07 8.083499e-02 6.341489e-02 6.297616e-07
## [,9] [,10]
## Andong_Daily_329 3.453430e-02 1.911802e-06
## Anqing_Daily_359 9.947450e-01 1.534634e-06
## Anqing_Daily_603 1.764259e-02 2.529254e-08
## Anqing_Daily_729 9.947450e-01 1.534634e-06
## Anqing_Daily_772 6.551241e-01 7.521215e-08
## Anqing_Daily_902 3.805575e-06 1.510613e-05
## Aomen_Daily_419 1.658878e-06 9.982733e-01
## Aomen_Daily_443 9.736612e-01 3.124911e-07
## Aomen_Daily_506 6.924861e-01 4.428023e-04
## Aomen_Daily_553 7.819986e-01 1.048914e-06
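Each row of the score matrix is a probability distribution over topics, so a hard topic assignment for each periodical is simply the column with the highest probability. A sketch on a toy matrix standing in for predict(model10, newdata = x):

```r
# Toy score matrix: rows = periodicals, columns = topics
# (stands in for the output of predict(model10, newdata = x)).
scores_toy <- matrix(c(0.70, 0.10, 0.15, 0.05,
                       0.05, 0.80, 0.10, 0.05,
                       0.10, 0.15, 0.70, 0.05),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("doc_a", "doc_b", "doc_c"), NULL))
# Dominant topic per document: index of the row-wise maximum.
main_topic <- apply(scores_toy, 1, which.max)
main_topic  # doc_a -> 1, doc_b -> 2, doc_c -> 3
```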
Make a histogram of the topic probability distribution:
hist(scores)
Load the packages:
library(textplot)
library(ggraph)
library(concaveman)
Plot the model:
plot(model10, top_n = 10,
title = "Name of periodicals in Crow's 'Newspaper Directory of China' (1935)",
subtitle = "Biterm topic model with 10 topics")
Add legend:
plot(model10, top_n = 10,
title = "Name of periodicals in Crow's 'Newspaper Directory of China' (1935)",
subtitle = "Biterm topic model with 10 topics",
labels = c("monthly/weekly", "jih pao", "evening daily news",
"commercial daily news", "journal",
"republican daily", "news/press",
"commerce economic journal", "people press",
"shanghai times")) # 10th label added (from topic 10's top terms) so labels match k = 10