2 Set Up

2.1 Installation and configuration

devtools::install_gitlab("enpchina/histtext-r-client", auth_token = "replace with your gitlab token")
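
To keep the token out of the script itself, the installation call can be wrapped as in the minimal sketch below. It assumes the token has been stored in an environment variable named GITLAB_PAT (for example in your .Renviron file); that variable name is an assumption about your setup, not something the package requires.

```r
# Assumption: the GitLab token is stored in the GITLAB_PAT environment variable.
token <- Sys.getenv("GITLAB_PAT", unset = "")

if (!nzchar(token)) {
  # No token available: do nothing rather than fail with an authentication error.
  message("GITLAB_PAT is not set; skipping installation.")
} else if (!requireNamespace("devtools", quietly = TRUE)) {
  message("devtools is not installed; run install.packages(\"devtools\") first.")
} else {
  # Same call as above, with the token taken from the environment.
  devtools::install_gitlab("enpchina/histtext-r-client", auth_token = token)
}
```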

Configure the package, replacing the placeholder fields with your actual server information:

histtext::set_config_file(domain = "https://rapi.enpchina.eu",
                          user = "user_info", password = "user_info_password")

If the configuration succeeded, the following command returns “OK”:

histtext::get_server_status()
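
In a script, it can be useful to stop early when the server is unreachable or the credentials are wrong. The sketch below is one possible way to do this; `check_server` is a hypothetical helper, not part of histtext, and it assumes that `get_server_status()` returns the character string "OK" on success.

```r
# Hypothetical helper (not part of histtext): returns TRUE only if the
# status function runs without error and yields the single string "OK".
check_server <- function(status_fn) {
  status <- tryCatch(status_fn(), error = function(e) NULL)
  is.character(status) && length(status) == 1 && status == "OK"
}

# Usage with the real client, guarded so the sketch also runs
# on a machine where histtext is not installed.
if (requireNamespace("histtext", quietly = TRUE)) {
  if (!check_server(histtext::get_server_status)) {
    message("HistText server not reachable; check the configuration.")
  }
}
```

Passing the status function as an argument keeps the helper independent of the server, so it can be tried out with a stand-in function first.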

Now you can load the library:

library(histtext)

2.2 Available Corpora

The function list_corpora lists all the corpora available in the Modern China Text Base created by the ENP-China Project. The corpora are stored on a Solr server. Each corpus is identified by a specific name, which is the name to pass to the search functions (see below):

histtext::list_corpora()
##  [1] "archives"             "chinajournal-pages"   "csmo-pages"          
##  [4] "dongfangzz"           "elder_workers"        "elder_workers_format"
##  [7] "imh-en"               "imh-zh"               "kmt9k"               
## [10] "ncbras"               "proquest"             "reports-en"          
## [13] "reports-fr"           "scmp-recent"          "shimingru-diary"     
## [16] "shunpao"              "shunpao-revised"      "shunpao-tok"         
## [19] "waiguozaihua"         "wikibio-en"           "wikibio-zh"          
## [22] "zhanggangdiary"
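
Since the search functions expect one of these exact names, it can help to check a name before using it. The sketch below assumes that `list_corpora()` returns a plain character vector, as the printed output suggests; to stay runnable without a server connection, it copies the names from the listing above instead of querying the server.

```r
# Names copied from the output above; with a configured client this
# vector would come from histtext::list_corpora() instead.
corpora <- c(
  "archives", "chinajournal-pages", "csmo-pages", "dongfangzz",
  "elder_workers", "elder_workers_format", "imh-en", "imh-zh", "kmt9k",
  "ncbras", "proquest", "reports-en", "reports-fr", "scmp-recent",
  "shimingru-diary", "shunpao", "shunpao-revised", "shunpao-tok",
  "waiguozaihua", "wikibio-en", "wikibio-zh", "zhanggangdiary"
)

# Exact-name check before passing a corpus to a search function.
"shunpao-revised" %in% corpora
## [1] TRUE

# All variants of a corpus that share a prefix.
grep("^shunpao", corpora, value = TRUE)
## [1] "shunpao"         "shunpao-revised" "shunpao-tok"
```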


Brief description

Periodicals:

  • shunpao: Chinese newspaper Shenbao 申報 (1872-1949): original version from the provider (GetHong)
  • shunpao-revised: Chinese newspaper Shenbao 申報 (1872-1949): corrected version by the ENP-China project (date formatting, correction of titles mixed up with text, segmentation of extra-long articles)
  • proquest: English-language periodicals from the ProQuest Chinese Newspapers Collection (CNC)
  • dongfangzz: Dongfang zazhi 東方雜誌 (1904-1948)
  • ncbras: Journal of the North China Branch of the Royal Asiatic Society (1858-1948)
  • chinajournal-pages: The China Journal (1904-1949) (access at page level)
  • csmo-pages: Chinese Students’ Monthly (1906-1931) (access at page level)
  • scmp-recent: South China Morning Post (1954-2000) (subset from the ProQuest collection)
  • cmj: China Medical Journal (1887-1949)
  • elder_workers_format: corpus of interviews of Shanghai workers (1953-1958) [not public]

Other printed sources:

Archives:

Wikipedia:

  • wikibio-en: corpus of biographies of individuals active in modern China extracted from Wikipedia (English)
  • wikibio-zh: corpus of biographies of individuals active in modern China extracted from Wikipedia (Chinese)

Diaries and Memoirs:

The content of the Modern China Text Base is expanding continuously. The presentation above may not reflect the most recent state of its collections.
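
Because the description can lag behind the server, the names described above can be compared with what the server currently reports. The sketch below is one way to do this; `missing_from_server` is a hypothetical helper, not part of histtext, and it assumes that `list_corpora()` returns a character vector of corpus names.

```r
# Corpus names taken from the brief description above.
described <- c(
  "shunpao", "shunpao-revised", "proquest", "dongfangzz", "ncbras",
  "chinajournal-pages", "csmo-pages", "scmp-recent", "cmj",
  "elder_workers_format", "wikibio-en", "wikibio-zh"
)

# Hypothetical helper: described names that are absent from the
# vector of names reported by the server.
missing_from_server <- function(available) {
  setdiff(described, available)
}

# Usage with the real client, guarded so the sketch also runs offline.
if (requireNamespace("histtext", quietly = TRUE)) {
  available <- tryCatch(histtext::list_corpora(), error = function(e) NULL)
  if (is.character(available)) print(missing_from_server(available))
}
```

Applied to the listing printed in section 2.2, this comparison returns "cmj", which is described above but does not appear in that output.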

2.2.1 Statistics

Comprehensive statistics on all the corpora have been pre-computed and are available on GitLab. For each corpus, a specific folder provides access to all the CSV tables and the visualizations (see images below).

[Figures: Statistics homepage; List of folders (ProQuest); Collection statistics for ProQuest; North China Herald corpus statistics]