1 Introduction

The Modern China Biographical Database (MCBD) constitutes a core initiative to establish a long-term publicly accessible resource for historical research in the China field. The database is the major instrument developed by the ENP-China project. ENP-China stands for Elites, Networks, and Power in Modern Urban China. It is an ERC-funded Advanced Research Grant (ERC no. 788476). From the start, however, we designed the database to serve a much broader purpose. Its object is not limited to elites; it means to include all historical actors. Its usage is not limited to the ENP-China project. It is a platform made available to the whole scholarly community of historians of modern China. Our hope is to make it the most essential biographical resource for the study of modern Chinese history.

1.1 Aim and Scope

MCBD was created to lay the groundwork for collecting massive amounts of biographical data and information on historical actors in modern Chinese history. The structure of the database reflects this intellectual ambition. This user manual describes the objectives, structure, and operation of the Modern China Biographical Database. MCBD is based on Heurist, a web database interface that provides both a backend and a frontend for relational research data in the humanities. The ENP-China project is very grateful to the Heurist team for its support, and especially to Ian Johnson for his unrelenting assistance in the development phase of MCBD.

MCBD sits at the heart of the project of breaking through the current constraints of historical research. The massive transformation of historical documentation into digital full-text format presents a formidable opportunity for historians. Except for manuscript archives, almost anything else can be turned into a searchable digital document: newspapers, periodicals, books, academic literature, not to mention the millions of pages on the Internet. The challenge is precisely to design the methods and the tools to make profitable use of the vast quantity of information at the historians’ fingertips. The digital transformation of historical sources opens the way to exploring and exploiting documents on a scale unimaginable in the past. For historical research, it means establishing new practices that will enable the production of data-rich history.

MCBD is a general-purpose database that collects a wide range of biographical information. It revolves primarily around individuals, who are at the center of the database, but also around institutions (any kind of organization: ministry, club, company, etc.), locations (any named human settlement), and events (any form of individual or collective action). In the database, we make the distinction between historical information and historical data. The latter is the refined and trimmed-down form of the former. In MCBD, most of the information is transformed into data that find their way into specific data fields. Each field contains only a single piece of data. Yet not all historical information lends itself to such reduction, and it is sometimes desirable to retain not just the one word or digit that qualifies an actor or an action, but to preserve this word in its context, in the form of a short sentence. This is also methodologically necessary for the kind of raw information that can be extracted from the press.

To address this particular challenge, MCBD includes both “data” fields that contain just a single item and “observations” that receive strings of unstructured text that document more substantially the action of an individual. This structure is designed to meet the needs of historical research on the modern period when mass printing, especially newspaper printing, became the norm. It is not conceivable to transform all the available or collected information into data. Our “observations” constitute an intermediary stage of data collection that records actions attached to a given actor to produce a more substantial biographical record. Ultimately, whenever necessary, such “observations” can be filtered down and turned into “cooked data.”
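
The distinction between single-valued “data” fields and free-text “observations” can be illustrated with a minimal Python sketch. This is not the actual MCBD or Heurist schema: the class names, field names, and sample values below are hypothetical and only show how the two kinds of content might sit side by side in one person record.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """Free-text statement about an actor, kept in its context (hypothetical)."""
    text: str                         # short sentence as found in the source
    source_id: Optional[str] = None   # e.g. an article ID, if the source has one


@dataclass
class PersonRecord:
    """Illustrative person record; not the real MCBD table layout."""
    name: str
    # "Data" fields: each one holds exactly one refined value.
    birth_year: Optional[int] = None
    birth_place: Optional[str] = None
    # "Observations": unstructured text that may be refined into data later.
    observations: list[Observation] = field(default_factory=list)

    def add_observation(self, text: str, source_id: Optional[str] = None) -> None:
        self.observations.append(Observation(text, source_id))


# Invented example values, for illustration only.
person = PersonRecord(name="Example Person", birth_year=1890)
person.add_observation("Elected chairman of a local association.", source_id="doc-001")
```

Keeping observations as a list attached to the person mirrors the idea of an intermediary stage: the sentence is preserved as collected, and a later pass could extract a position or a date from it into a dedicated data field.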

The temporal coverage of the database is 1830 to 1949: it includes all individuals born between 1800 and 1930 who were active in China during this period, regardless of their origin, nationality, and the duration of their presence in China. The year 1949 represents the current terminus ad quem of MCBD because our primary purpose is to document the history of modern China before the establishment of the People’s Republic of China. Yet the database documents the selected individuals, as much as possible, from their birth to the time of their death, beyond the 1949 divide. Furthermore, the current temporal coverage is not final. It is linked to the timeline of the ENP-China Project, but the database can and will extend incrementally to later periods through future research projects or contributions by scholars involved in the study of contemporary China.
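
The inclusion rule stated above can be written as a small predicate. The following Python sketch is only an illustration of that rule, not code used by MCBD: the function name, its parameters, and the decision to accept unknown birth years are assumptions made for the example.

```python
from typing import Optional

COVERAGE_START, COVERAGE_END = 1830, 1949   # period of activity covered
BIRTH_MIN, BIRTH_MAX = 1800, 1930           # birth-year window


def in_scope(birth_year: Optional[int], active_from: int, active_to: int) -> bool:
    """Return True if a person falls within the stated coverage.

    active_from / active_to are the first and last known years of activity
    in China (hypothetical inputs). Unknown birth years are not excluded
    here; that is a choice made for this sketch only.
    """
    born_ok = birth_year is None or BIRTH_MIN <= birth_year <= BIRTH_MAX
    # The activity interval must overlap the coverage period.
    active_ok = active_from <= COVERAGE_END and active_to >= COVERAGE_START
    return born_ok and active_ok
```

Note that the test is on overlap, not containment: a person active from 1915 to 1960 is in scope, which matches the statement that individuals are documented beyond the 1949 divide.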

MCBD is meant to serve as a tool for both qualitative and quantitative analysis. The collection and compilation of rich biographical data creates the conditions for the systematic exploration and processing of data from various perspectives and through different methods. The life of individuals is made up of the successive and finely detailed events that marked their existence, including education, life events, career, social activities, etc. These events are related to other individuals and to other institutions, and they are grounded in various locations. The data can be used for prosopography, which precisely relies on individual biographical data to study different groups in society. But the data lends itself to a wide variety of approaches and methods as well.

Because the database connects the elements of information that two or more individuals have in common (birthplace, degree from the same university, membership in the same association, participation in the same event, etc.), there are countless ways to explore the connections between individuals, institutions, locations, and events. The range of methodological approaches, used separately or in combination, includes social network analysis, spatial analysis (GIS), graphs, and all types of visualization. It may even include textual analysis with NLP (natural language processing) tools since, in the case of newspapers, MCBD maintains the connection to the full text of articles whenever each document in the source database carries its own identifiable ID (e.g. ProQuest Historical Newspapers).
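
How shared attributes turn into connections between individuals can be sketched in a few lines of Python. The example below is a hypothetical illustration, not an export format or query provided by MCBD: the person names, the attribute labels, and the function `shared_links` are all invented for the purpose.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (person, attribute) pairs; names and values are invented.
affiliations = [
    ("Person A", "university:X"),
    ("Person B", "university:X"),
    ("Person B", "association:Y"),
    ("Person C", "association:Y"),
    ("Person A", "association:Y"),
]


def shared_links(pairs):
    """Map each pair of persons to the set of attributes they share."""
    members = defaultdict(set)
    for person, attribute in pairs:
        members[attribute].add(person)

    links = defaultdict(set)
    for attribute, persons in members.items():
        # Every two persons attached to the same attribute get an edge.
        for a, b in combinations(sorted(persons), 2):
            links[(a, b)].add(attribute)
    return dict(links)


links = shared_links(affiliations)
```

The resulting dictionary is an edge list in which each edge carries the attributes that justify it, a form that could then be handed to a network analysis or visualization tool.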

In the initial stage, the data has come mostly from the documents that the ENP-China team has been processing, especially newspapers and biographical dictionaries. The Institute of Modern History (Academia Sinica) generously shared the data from its 近現代人物資訊整合系統 database. The database also collects the data produced in the course of the case studies of the ENP-China project.

The development of the database also relies on an ongoing process of curating and mining biographical-data publications from before and after 1949 (biographical dictionaries, directories, etc.), as well as the incorporation of large datasets produced by the members of the team in previous research. We are also incorporating the rich biographical data that we have extracted from Wikipedia, Baidu, and similar resources.

Whenever relevant, we hope to incorporate external datasets or data produced by external research groups or scholars, each with a clear and manifest identification of provenance. We believe in the benefits and virtues of building a collective database that includes the broadest spectrum of biographical data for modern Chinese history. No small dataset — it can be a simple spreadsheet with data on a few individuals — is negligible. We believe that “Small streams make a big river” and we genuinely welcome contributions from the scholarly community. We can provide guidelines for a smooth and seamless preparation of datasets.

MCBD is in its development stage. We plan to enrich it in the coming years until the ENP-China project completes its course. Thereafter, the database will be entrusted to the care of a major academic institution for maintenance and development.

1.2 Sources

Data in MCBD comes from six main categories of sources. Most printed sources have been digitized and processed with OCR so that data can be extracted and processed automatically or semi-automatically, even though the quality of OCR and the results of extraction may vary. The cross-compilation of documents makes it possible to resolve pending issues of imprecise data (dates) or contradictory information.

1.2.1 Biographical dictionaries

Biographical dictionaries trace the entire life of individuals. Although the quality and quantity of information may vary considerably between dictionaries and between the biographies they contain, they basically provide information on the date and place of birth and death, kinship and family relations, educational background and professional life (positions, works). While some dictionaries were carefully crafted narratives, others were just collections of factual information, chronologically arranged. The construction of MCBD relies on three major reference dictionaries that constitute the backbone of the database: Hummel’s Eminent Chinese of the Qing Period, Boorman’s Biographical Dictionary of Republican China (BDRC) and Klein & Clark’s Biographic Dictionary of Chinese Communism.1 The BDRC in particular has provided a test bench for processing biographical dictionaries.2 These reference books have been supplemented by more specific dictionaries that focused on particular groups of elites, such as Guomindang generals, overseas students or foreign businessmen in China.3

1.2.2 Directories and Who’s who

Directories and Who’s who focused on living individuals only. Basic information included their date and place of birth, their educational background, and the past and current positions they held in various institutions. Since the individuals were still living at the time of publication, such sources did not cover their entire life. Despite their incompleteness, their added value compared to dictionaries lies in the fact that they provide a highly detailed list of positions, often with their exact date, including participation in social clubs and associations. Like biographical dictionaries, some consisted of literary narratives, whereas others were just collections of factual information arranged in chronological order. Each volume amounts to hundreds of pages of dense print that contain very rich and complete information on the individuals who lived and worked in China and Asia. MCBD relies primarily on the rich collection of Who’s who held at the Institute of Modern History, Academia Sinica, Taipei.4 This collection contains more than fifty titles in Chinese, English and Japanese, covering about 133,000 individuals from the mid-Qing to the early years of the People’s Republic of China (PRC). For foreign elites specifically, the series of Asia Directory and Chronicle documents with precision who was doing what (position, activity), where (location, movement), and when (exact date, fuzzy dating).5

These two major corpora were supplemented by more focused Who’s who including professional minglu (registers) and directories of returned students published by Qinghua University and students’ associations in the United States.6

1.2.3 Newspapers and periodicals

The inclusion of newspaper data is the major innovation of MCBD. The press represented the most complete set of “observations” on everyday life in modern China. In contrast to dictionaries and directories, newspapers and periodicals provided information on a daily basis, allowing for a more finely grained level of observation. They constitute an irreplaceable material for the study of the elites in the making. Moreover, they documented a wider range of actors: not only the personal and social life of individuals, but also institutions, locations, and events. They reported on individuals’ social or political activities, association and company meetings, ceremonies, sports competitions, the coming and going of ships and passengers, etc. They documented with high precision the place, time, and nature of actions at the most elementary level. They caught actors in their tracks, from the moment they appeared “in print” to the time when they vanished or the medium itself disappeared.

Newspaper data in MCBD comes from two main corpora. The Shenbao 申報 (1872-1949) forms the backbone of the Chinese-language press. Established in Shanghai, it was one of the largest and most enduring Chinese-language newspapers in modern China, with a daily circulation of 150,000 copies in the early 1930s. Although it was based in Shanghai, its readership was truly national in scope. The last issue appeared on May 27, 1949. For the English-language press, MCBD relies on ProQuest’s Chinese Newspapers Collection (CNC). This collection consists of twelve English-language periodicals published in China from 1832 to 1953, with variations in their temporal and spatial coverage.7 While most of them were published in Shanghai, the collection also includes one newspaper from Guangzhou (Canton Times) and three from Beijing (Peking Daily News, Peking Gazette, Peking Leader). The three most important were the British North-China Herald and Daily News (1850-1941), the American China Weekly Review (1917-1953), formerly Millard’s Review of the Far East, and the Sino-American China Press (Dalubao 大陸報) (1925-1938). Although they were based in Shanghai, they reached a national readership as well. The North-China Herald enjoyed a circulation of 10,000 in the early 1930s, with an increasing number of Chinese readers.8 The corpus of English-language periodicals also comprises the South China Morning Post published in Hong Kong (1903-2001).9

1.2.4 Wikipedia

Another major contribution of MCBD is the incorporation of data from the web, especially the Chinese Wikipedia. The web represents a precious source for reconstructing the full trajectory of individuals, especially those who survived after 1949 and who did not appear in the Republican Who’s who or in more recent biographical dictionaries. Some 200,000 biographies in Chinese, English and Japanese have been identified. The biographical information they contain has been downloaded, along with their metadata (date, etc.), which makes it possible to trace the history of each entry and its modifications over time. The data extracted from Wikipedia provides a unique resource by the massive character of its population, its multilingual perspective and its unprecedented temporal and geographical scope.10

1.2.5 Miscellaneous printed sources

Memoirs and personal diaries: Memoirs and personal diaries offer a more subjective approach to biographical events, as they were experienced or remembered by the individuals themselves or their relatives. Moreover, these sources often contain details that editors of dictionaries and Who’s who may have missed or discarded. An increasing number of diaries have been published, digitized and processed with OCR since the 1980s, opening new perspectives on the personal life of the Republican elites and their social environment. The selection of diaries has been guided by ERC members’ research concerns so far (e.g. Zhang Gang’s diary used by Guo Weiting, Zhou Fohai’s diary studied by David Serfass, or American returned students’ autobiographies studied by Cécile Armand).11 Others will be added to the collection as historian colleagues join the project with their own materials.

Professional journals: The journals published by professional, academic or students’ associations (e.g. Science Society of China, Society of Chinese Engineers), learned societies (e.g. Royal Asiatic Society), and technical bureaus (e.g. Bureau of Economic Information/Foreign Trade) provide a wealth of information on their members, staff and activities (list of members and contributors, contributions, reports of meetings, related institutions). In addition, specialized journals like the Chinese Economic Journal/Bulletin (Zhongguo jingji yuekan 中國經濟月刊) contain lists of companies and related corporate data.12

Other printed sources: This category includes miscellaneous directories, such as the lists of doctoral dissertations by returned students, and other materials often discovered by accident.13

1.2.6 Archival data

In addition, the database welcomes data from non-digitized, unpublished sources (especially archives) that have been processed manually by individual researchers. Archival materials fall under two main categories: institutional archives and personal papers.

Institutional archives include primarily the National Archives of the Republic of China (Historical Archives No.2 in Nanjing, Guoshi guan in Taipei), which among other materials have preserved lists of government officials and staff of technical bureaus. The archives of clubs and associations (e.g. Rotary International) provide additional information on the social life of individuals. Their collections include rosters of members, official publications, correspondence and unpublished reports. In addition, the lists of industrial companies found at the Shanghai Municipal Archives provide an excellent starting point for the study of modern entrepreneurs in China. Although they are rare, scattered and generally difficult to locate, personal papers constitute irreplaceable sources as they contain first-hand materials such as individuals’ correspondence and unpublished diaries. As for the leading personalities of the Republican period specifically, their personal papers have been preserved in archival centers and libraries such as the Hoover Institution and Columbia University, along with transcriptions of interviews.14

1.3 Structure

MCBD is composed of eleven tables, with the Person table as the connecting hub for all tables except “Company statistics”, which connects only to the “Company” table. The structure and content of each table are described in the text below.

Fig. 1. MCBD structure schematic

