Abstract
This document applies biterm topic modeling to the titles of periodicals in Crow’s Newspaper Directory of China (1935).
library(readr)
library(dplyr)     # pipes and data manipulation
library(tidytext)  # unnest_tokens() and the stop_words dataset
library(BTM)       # biterm topic model
crow_title_id <- read_csv("crow_title_id.csv",
col_types = cols(X1 = col_skip()))
crow_title_1931 <- crow_title_id %>% filter(Year == "1931")
crow_title_1935 <- crow_title_id %>% filter(Year == "1935")
Based on the pre-tokenized text (crow_unigram), we prepare a standard dataset x with two columns: the unique id of each periodical (doc_id) and the words it contains (lemma). This is the format expected by the BTM algorithm. We simply select the original variables and rename “word” to lemma and “id” to doc_id.
data("stop_words")
crow_unigram <- crow_title_1935 %>%
unnest_tokens(output = word, input = title) %>%
anti_join(stop_words)
x <- crow_unigram %>%
rename(doc_id = id) %>%
rename(lemma = word) %>%
select(doc_id, lemma)
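To make the expected input concrete, here is a minimal sketch of the two-column structure BTM consumes: one row per (document, word) pair. The rows are toy values, not the actual Crow data.

```r
# Toy version of the two-column input expected by BTM():
# one row per (doc_id, lemma) pair.
x_toy <- data.frame(
  doc_id = c("Andong_Daily_329", "Andong_Daily_329", "Anqing_Daily_359"),
  lemma  = c("daily", "news", "daily"),
  stringsAsFactors = FALSE
)
str(x_toy)  # 3 obs. of 2 variables: doc_id, lemma
```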
We build a model with 10 topics (k = 10):
set.seed(321)
model10 <- BTM(x, k = 10, beta = 0.01, iter = 1000, trace = 100)
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 1/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 101/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 201/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 301/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 401/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 501/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 601/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 701/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 801/1000
## 2021-07-07 16:27:49 Start Gibbs sampling iteration 901/1000
Model description:
model10
## Biterm Topic Model
## trained with 1000 Gibbs iterations, alpha: 5, beta: 0.01
## topics: 10
## size of the token vocabulary: 500
## topic distribution theta: 0.037 0.052 0.129 0.123 0.068 0.149 0.174 0.068 0.158 0.042
model10$theta
## [1] 0.03707572 0.05221932 0.12898172 0.12323760 0.06788512 0.14882507
## [7] 0.17389034 0.06788512 0.15822454 0.04177546
The model consists of 500 distinct tokens (words) distributed across 10 topics. The topic distribution theta gives the probability of each topic, i.e. how prevalent each topic is in the dataset.
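Since theta is a probability distribution, its values sum to 1, and the largest entry identifies the most prevalent topic. A quick check using the theta values printed above (rounded to three decimals):

```r
# Theta values from the model output above, rounded to three decimals;
# together they form a probability distribution over the 10 topics.
theta <- c(0.037, 0.052, 0.129, 0.123, 0.068, 0.149, 0.174, 0.068, 0.158, 0.042)
sum(theta)        # sums to 1 (up to rounding)
which.max(theta)  # topic 7 is the most prevalent in the corpus
```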
The five most frequent words for each topic, with their conditional probability (i.e. the probability of the word within the topic), in decreasing order:
terms(model10)
## [[1]]
## token probability
## 1 monthly 0.12416058
## 2 weekly 0.09496350
## 3 chin 0.06576642
## 4 movie 0.06576642
## 5 pao 0.03656934
##
## [[2]]
## token probability
## 1 pao 0.21030769
## 2 jih 0.13338462
## 3 min 0.04620513
## 4 sin 0.04620513
## 5 chung 0.04107692
##
## [[3]]
## token probability
## 1 news 0.27813906
## 2 daily 0.18202454
## 3 evening 0.04910020
## 4 commercial 0.04092025
## 5 east 0.04092025
##
## [[4]]
## token probability
## 1 daily 0.20344754
## 2 news 0.18845824
## 3 commercial 0.07496788
## 4 north 0.05141328
## 5 press 0.04927195
##
## [[5]]
## token probability
## 1 china 0.11376471
## 2 journal 0.10592157
## 3 weekly 0.09023529
## 4 medical 0.06278431
## 5 review 0.05494118
##
## [[6]]
## token probability
## 1 daily 0.29205310
## 2 news 0.29205310
## 3 republican 0.11506195
## 4 shantung 0.01771681
## 5 current 0.01063717
##
## [[7]]
## token probability
## 1 news 0.14978820
## 2 press 0.11801815
## 3 daily 0.07414523
## 4 evening 0.06355522
## 5 morning 0.04388805
##
## [[8]]
## token probability
## 1 chinese 0.09023529
## 2 journal 0.07062745
## 3 commerce 0.04317647
## 4 economic 0.04317647
## 5 hongkong 0.04317647
##
## [[9]]
## token probability
## 1 press 0.21299501
## 2 people 0.13146423
## 3 daily 0.10151414
## 4 news 0.08986689
## 5 commercial 0.04494176
##
## [[10]]
## token probability
## 1 shanghai 0.12909677
## 2 times 0.09683871
## 3 tientsin 0.05812903
## 4 review 0.04522581
## 5 sunday 0.04522581
The ten most frequent words for each topic, with their conditional probability (i.e. the probability of the word within the topic), in decreasing order:
topicterms10 <- terms(model10, top_n = 10)
topicterms10
## [[1]]
## token probability
## 1 monthly 0.12416058
## 2 weekly 0.09496350
## 3 chin 0.06576642
## 4 movie 0.06576642
## 5 pao 0.03656934
## 6 education 0.02927007
## 7 public 0.02927007
## 8 radio 0.02927007
## 9 sports 0.02927007
## 10 fen 0.02197080
##
## [[2]]
## token probability
## 1 pao 0.21030769
## 2 jih 0.13338462
## 3 min 0.04620513
## 4 sin 0.04620513
## 5 chung 0.04107692
## 6 hwa 0.03594872
## 7 shang 0.03082051
## 8 chengtu 0.02569231
## 9 shih 0.02569231
## 10 kan 0.02056410
##
## [[3]]
## token probability
## 1 news 0.27813906
## 2 daily 0.18202454
## 3 evening 0.04910020
## 4 commercial 0.04092025
## 5 east 0.04092025
## 6 industrial 0.03887526
## 7 canton 0.02660532
## 8 tsinan 0.02047035
## 9 pictorial 0.01842536
## 10 social 0.01842536
##
## [[4]]
## token probability
## 1 daily 0.20344754
## 2 news 0.18845824
## 3 commercial 0.07496788
## 4 north 0.05141328
## 5 press 0.04927195
## 6 industrial 0.04498929
## 7 china 0.03428266
## 8 eastern 0.03214133
## 9 citizen 0.02785867
## 10 ichang 0.01715203
##
## [[5]]
## token probability
## 1 china 0.11376471
## 2 journal 0.10592157
## 3 weekly 0.09023529
## 4 medical 0.06278431
## 5 review 0.05494118
## 6 chinese 0.05101961
## 7 press 0.05101961
## 8 hongkong 0.03141176
## 9 trade 0.02749020
## 10 engineering 0.02356863
##
## [[6]]
## token probability
## 1 daily 0.292053097
## 2 news 0.292053097
## 3 republican 0.115061947
## 4 shantung 0.017716814
## 5 current 0.010637168
## 6 shansi 0.010637168
## 7 impartial 0.008867257
## 8 kweichow 0.008867257
## 9 northwest 0.008867257
## 10 overseas 0.008867257
##
## [[7]]
## token probability
## 1 news 0.14978820
## 2 press 0.11801815
## 3 daily 0.07414523
## 4 evening 0.06355522
## 5 morning 0.04388805
## 6 south 0.04237519
## 7 china 0.03632375
## 8 post 0.03329803
## 9 public 0.03178517
## 10 tientsin 0.02724660
##
## [[8]]
## token probability
## 1 chinese 0.09023529
## 2 journal 0.07062745
## 3 commerce 0.04317647
## 4 economic 0.04317647
## 5 hongkong 0.04317647
## 6 monthly 0.04317647
## 7 shanghai 0.03925490
## 8 magazine 0.03533333
## 9 chamber 0.03141176
## 10 directory 0.03141176
##
## [[9]]
## token probability
## 1 press 0.21299501
## 2 people 0.13146423
## 3 daily 0.10151414
## 4 news 0.08986689
## 5 commercial 0.04494176
## 6 voice 0.02663894
## 7 nantung 0.01665557
## 8 common 0.01499168
## 9 chekiang 0.01332779
## 10 river 0.01332779
##
## [[10]]
## token probability
## 1 shanghai 0.12909677
## 2 times 0.09683871
## 3 tientsin 0.05812903
## 4 review 0.04522581
## 5 sunday 0.04522581
## 6 nichi 0.03877419
## 7 shimbun 0.03877419
## 8 peking 0.03232258
## 9 guide 0.02587097
## 10 herald 0.02587097
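terms() returns a list with one data frame (token, probability) per topic. Binding these into a single tidy table with a topic index makes the output easier to filter or plot with ggplot2. A sketch using a toy list that mimics the structure of topicterms10:

```r
# Toy list mimicking the structure returned by terms(model10, top_n = 10):
# one data frame of (token, probability) per topic.
topicterms_toy <- list(
  data.frame(token = c("monthly", "weekly"), probability = c(0.124, 0.095)),
  data.frame(token = c("pao", "jih"),       probability = c(0.210, 0.133))
)
# Prepend a topic index to each data frame, then stack them into one table.
tidy_terms <- do.call(rbind, Map(cbind,
                                 topic = seq_along(topicterms_toy),
                                 topicterms_toy))
tidy_terms  # columns: topic, token, probability
```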
Probability distribution of topics across periodicals (only the first ten rows are displayed):
scores <- predict(model10, newdata = x)
head(scores, 10)
## [,1] [,2] [,3] [,4]
## Andong_Daily_329 1.745165e-05 1.509896e-06 8.463120e-01 1.011944e-01
## Anqing_Daily_359 1.743393e-06 1.224138e-04 4.760552e-07 1.147554e-03
## Anqing_Daily_603 3.142917e-06 1.997545e-08 7.974360e-02 5.771181e-02
## Anqing_Daily_729 1.743393e-06 1.224138e-04 4.760552e-07 1.147554e-03
## Anqing_Daily_772 8.544342e-08 5.999480e-06 4.668612e-05 1.969012e-01
## Anqing_Daily_902 1.716105e-05 9.999287e-01 4.686039e-06 4.909134e-06
## Aomen_Daily_419 7.480627e-06 5.200578e-06 2.042679e-06 6.441181e-04
## Aomen_Daily_443 3.549999e-07 1.024948e-05 9.693716e-08 9.476704e-05
## Aomen_Daily_506 4.446203e-04 4.399638e-04 4.333508e-04 4.335575e-04
## Aomen_Daily_553 9.364620e-05 8.284073e-07 4.254872e-02 3.110605e-02
## [,5] [,6] [,7] [,8]
## Andong_Daily_329 1.147835e-06 1.090422e-02 7.006417e-03 2.669429e-05
## Anqing_Daily_359 1.198722e-03 4.155731e-05 2.740121e-03 9.213849e-07
## Anqing_Daily_603 1.518549e-08 8.212997e-01 2.359904e-02 1.518549e-08
## Anqing_Daily_729 1.198722e-03 4.155731e-05 2.740121e-03 9.213849e-07
## Anqing_Daily_772 5.874917e-05 2.036717e-06 1.478565e-01 4.560850e-06
## Anqing_Daily_902 9.069631e-06 4.050182e-06 3.457546e-06 9.069631e-06
## Aomen_Daily_419 3.953519e-06 1.765504e-06 1.056526e-03 3.953519e-06
## Aomen_Daily_443 9.904109e-05 3.479517e-06 2.613029e-02 1.876178e-07
## Aomen_Daily_506 4.373758e-04 4.327608e-04 2.165370e-01 8.791253e-02
## Aomen_Daily_553 6.297616e-07 8.083499e-02 6.341489e-02 6.297616e-07
## [,9] [,10]
## Andong_Daily_329 3.453430e-02 1.911802e-06
## Anqing_Daily_359 9.947450e-01 1.534634e-06
## Anqing_Daily_603 1.764259e-02 2.529254e-08
## Anqing_Daily_729 9.947450e-01 1.534634e-06
## Anqing_Daily_772 6.551241e-01 7.521215e-08
## Anqing_Daily_902 3.805575e-06 1.510613e-05
## Aomen_Daily_419 1.658878e-06 9.982733e-01
## Aomen_Daily_443 9.736612e-01 3.124911e-07
## Aomen_Daily_506 6.924861e-01 4.428023e-04
## Aomen_Daily_553 7.819986e-01 1.048914e-06
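Each row of the score matrix is a probability distribution over topics, so a hard topic assignment for each periodical is simply the column with the highest probability. A sketch on a toy matrix standing in for predict(model10, newdata = x):

```r
# Toy score matrix: rows = periodicals, columns = topics
# (stands in for the output of predict(model10, newdata = x)).
scores_toy <- matrix(c(0.70, 0.10, 0.15, 0.05,
                       0.05, 0.80, 0.10, 0.05,
                       0.10, 0.15, 0.70, 0.05),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("doc_a", "doc_b", "doc_c"), NULL))
# Dominant topic per document: index of the row-wise maximum.
main_topic <- apply(scores_toy, 1, which.max)
main_topic  # doc_a -> 1, doc_b -> 2, doc_c -> 3
```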
Make a histogram of the topic probability distribution:
hist(scores)
Load the packages:
library(textplot)
library(ggraph)
library(concaveman)
Plot the model:
plot(model10, top_n = 10,
title = "Name of periodicals in Crow's 'Newspaper Directory of China' (1935)",
subtitle = "Biterm topic model with 10 topics")
Add legend:
plot(model10, top_n = 10,
title = "Name of periodicals in Crow's 'Newspaper Directory of China' (1935)",
subtitle = "Biterm topic model with 10 topics",
labels = c("monthly/weekly", "jih pao", "evening daily news",
"commercial daily news", "journal",
"republican daily", "news/press",
"commerce economic journal", "people press",
"shanghai times")) # 10th label added (from topic 10's top terms) so labels match k = 10