1 Introduction


The Liumei Biographical Database (LMBD) is the first biographical database fully devoted to the American-returned students (liumei xuesheng 留美學生) in modern China. It is the cornerstone of a large research project that aims to reevaluate the contribution of American-educated Chinese to the transformation of modern China and to the building of Sino-American relations since the late Qing dynasty (mid-19th century). Drawing on a wide range of multilingual sources, LMBD relies on advanced Natural Language Processing (NLP) techniques to automatically extract the relevant historical information from a growing body of digitized sources. Ultimately, the database will make it possible to reconstruct the students’ biographical trajectories and social networks from a transnational and multinational perspective. This data-analytic approach seeks to complement previous scholarship, which has focused on famous individuals and specific groups, with multidimensional analyses that were not previously possible.

Acknowledgements. This research has been funded by the Chiang Ching-kuo Foundation for International Scholarly Exchange (Project n°RG004-U-21) and has received the support of the ERC-funded project “Elites, Networks, and Power in modern China” (ENP-China). We are particularly grateful to the ENP-China team (Christian Henriot, Nora Van Den Bosch, Jeremy Auguste, Baptiste Blouin) for their help in building the database and to the Lee-Campbell research group at the Hong Kong University of Science and Technology (HKUST) for sharing their datasets. We are also very grateful to the Heurist team, and especially to Ian Johnson, for their unrelenting assistance in the development phase of LMBD.


1.1 Aim and Scope

1.1.1 Context

Between the Opium Wars (1839-) and the early years of the People’s Republic of China (PRC), over 40,000 young Chinese went to study in the United States and returned to their country to apply the knowledge they had acquired abroad. The liumei xuesheng 留美學生 (or liumei 留美) represent a unique wave of brain migration whose impact on Chinese and world history has yet to be fully examined. Because the relevant sources are scattered across the world, previous scholars have focused on famous individuals or specific groups (defined by cohort, academic or professional specialization, or university).

The Liumei Biographical Database (LMBD) is key to changing the scale of both the quantity and the quality of information mobilized for the study of this foundational brain migration. It is designed to overcome the constraints posed by the geographical dispersal of historical sources, their multilingual and heterogeneous nature, and the various levels of expertise they require. It will help establish the entire population of liumei as comprehensively as possible and enable multidimensional analyses on an unprecedented scale. This is crucial for moving beyond the group-based approach that dominates past and current scholarship and for assessing with precision the liumei’s broader influence. It is particularly important for an elite population such as the liumei, who may be important agents of change but account for only a small fraction of the general population (Campbell and Lee, 2020).

1.1.2 Scope

Any Chinese-born person or person of Chinese ancestry who enrolled in an institution of higher education in the United States between the mid-nineteenth and the mid-twentieth centuries is eligible for inclusion in LMBD. While focused on the American-educated, LMBD also includes related (non-liumei) actors. Although it is centered on persons, the biographical database also incorporates the institutions, places, and events in which they were involved during their lives. It records the available details of each individual’s birth and death (date, place), ethnicity, family, education, professional career, political affiliation, membership in clubs or associations, and creative works.

1.1.3 Technical choices

Technically, LMBD relies on Heurist, a web database interface that provides both a backend and a frontend for relational research data in the humanities. Heurist stores its data in a MySQL relational database. This structure has proved well suited to the kind of complex data we are dealing with, such as fuzzy dates, incomplete information, multiple languages, and the multiple names and transliterations of Chinese individuals. Furthermore, each piece of information can be linked to its original source(s). Because not everything can be turned into data, the database also includes a specific field, “observations”, to preserve the language of sources in their context and to handle potentially conflicting sources of information.
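
To make these data-modelling constraints concrete, the following Python sketch shows one possible way of representing a person with several name forms, a fuzzy birth date, linked sources, and a free-text observations field. It is only an illustration under our own assumptions: the class and field names (FuzzyDate, NameVariant, PersonRecord) are hypothetical and do not reproduce Heurist's internal schema or the actual LMBD tables, and the source identifier is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class FuzzyDate:
    """A date known only to within a range of years (hypothetical model)."""
    earliest_year: int
    latest_year: int

    def overlaps(self, other: "FuzzyDate") -> bool:
        # Two fuzzy dates are compatible when their year ranges intersect.
        return (self.earliest_year <= other.latest_year
                and other.earliest_year <= self.latest_year)


@dataclass(frozen=True)
class NameVariant:
    text: str         # the name as written in the source
    script: str       # e.g. "hant" (traditional characters) or "latn"
    scheme: str = ""  # transliteration scheme, e.g. "pinyin", "wade-giles"


@dataclass
class PersonRecord:
    names: List[NameVariant] = field(default_factory=list)
    birth: Optional[FuzzyDate] = None                 # None when unknown
    sources: List[str] = field(default_factory=list)  # provenance identifiers
    observations: str = ""                            # verbatim source wording

    def known_as(self, text: str) -> bool:
        # Case-insensitive lookup across all recorded name forms.
        return any(v.text.casefold() == text.casefold() for v in self.names)


# Illustrative record; the source identifier is a placeholder.
person = PersonRecord(
    names=[
        NameVariant("胡適", "hant"),
        NameVariant("Hu Shi", "latn", "pinyin"),
        NameVariant("Hu Shih", "latn", "wade-giles"),
    ],
    birth=FuzzyDate(1891, 1891),
    sources=["placeholder-source-id"],
)
print(person.known_as("hu shih"))
print(FuzzyDate(1890, 1892).overlaps(person.birth))
```

Keeping the birth date as a range rather than a single value lets two sources that disagree by a year remain compatible until a curator decides between them.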

Heurist provides a robust yet flexible infrastructure backed by a responsive team. It offers a user-friendly interface to query the database and useful built-in functions to visualize and export the data. Heurist has been successfully used by several similar projects, including the Modern China Biographical Database (MCBD) developed by the ENP-China project, and the Chinese Engineers Relational Database (CERD) developed by Thorben Pelzer at the University of Leipzig.

While LMBD will maintain its autonomy and specificity, it will also be linked to companion databases (MCBD, CERD) through a system of common identifiers. LMBD is hosted by the French Very Large Infrastructure Huma-Num to ensure its long-term maintenance and accessibility in compliance with the FAIR principles.

LMBD is in its development stage. We plan to enrich it in the coming years until the project completes its course. Thereafter, the database will be entrusted to the care of a major academic institution for maintenance and development.

1.1.4 Intended audience

LMBD aims to serve a broad community of scholars interested in modern China, Sino-American relations, and transnational circulations.

We also hope to incorporate external datasets or data produced by research groups or scholars, each with a clear and explicit identification of provenance. We believe in the benefits of building a collective database that includes the broadest spectrum of biographical data for modern Chinese history. No dataset is too small to matter, even a simple spreadsheet with data on a few individuals. We believe that “Small streams make a big river” and we genuinely welcome contributions from the scholarly community. We can provide guidelines for preparing datasets.

1.2 Sources & Methodology

1.2.1 Sources

LMBD draws from a wide range of sources in both Chinese and English. The relevant sources can be grouped into six main categories:

  1. Students’ directories, who’s who publications and biographical dictionaries.
  2. Students’ journals and general periodicals.
  3. Digital-born materials (e.g., Wikipedia, Baidu, Geni, university websites).
  4. Datasets shared by partner projects.
  5. Archival materials (personal papers, archives of universities, students’ associations, clubs, and think tanks such as the China Institute of America).
  6. Second-hand literature.

The detailed list of sources can be found in the References section.


1. Directories and who’s who publications record individuals’ birth dates and places, ethnicity, family background, education, professional career, and membership in associations. The Principal Investigator (PI) has assembled an extensive collection that includes the reference Who’s Who of American Returned Students (Youmei tongxue lu 游美同學錄) published by Tsing Hua College in 1917, and the series of directories published by the two most important organizations of Chinese students in the United States, the Chinese Students’ Alliance (CSA) and the Chinese Students’ Christian Association (CSCA). In addition, LMBD takes advantage of the rich collection of who’s who publications digitized and generously shared by the Institute of Modern History at the Academia Sinica (Taipei).

2. Students’ journals and periodicals record the students’ lives in the United States, their business and professional activities upon their return, meetings of clubs and associations, contemporary debates on modernization and other crucial issues, and the broader political context in China and the world. The PI has collected four major students’ journals: the bilingual (Chinese-English) World Chinese Students’ Journal (1906–1913), published by the World Chinese Students’ Federation (Huanqiu Zhongguo xuesheng hui 寰球中國學生會), the first and largest organization of returned students in China; the Chinese Students’ Annual/Quarterly (Liumei xuesheng jibao 留美學生季報, 1914–1928) in Chinese and the Chinese Students’ Monthly (1906–1931) in English, both published by the CSA; and the Chinese Students’ Christian Journal (1910–1919), published by the CSCA. The collection also includes the semi-monthly publication China and America (Huamei xiejin 華美協進), published by the China Institute of America (1948–1949). In addition, we can take advantage of the vast press corpora created by the ENP-China project, which include the leading Chinese newspaper Shenbao 申報 (1872–1949), the monthly magazine Dongfang zazhi 東方雜誌 (1904–1948), and the ProQuest Chinese Historical Newspapers Collection, which comprises a dozen English-language periodicals published in Shanghai, Beijing, and Hong Kong from the mid-nineteenth century to the twenty-first.

3. Knowledge bases such as Wikipedia and Baidu are crucial for identifying individuals who did not appear in contemporary who’s who publications, and for tracking their trajectories after 1949. LMBD primarily relies on a collection of 300,000 biographies retrieved from Wikipedia in English and Chinese (Blouin et al., 2021). Other knowledge bases (e.g., Baidu, Geni, Family Search) will also be incorporated in the course of the project.

4. Datasets created by partner projects, particularly the Lee-Campbell Research Group at The Hong Kong University of Science and Technology (HKUST), include: (i) the China University Student Datasets (CUSD), which are based on university archives and other archival and published materials and record over 400,000 students, including more than 10,000 in the United States; (ii) the China Government Employee Datasets (CGED), which are based on various ministries’ archives and document more than 370,000 civil and military officials between 1900 and 1949; (iii) the China Professional Occupation Datasets (CPOD), which are compiled from professional who’s who directories and ministry and university archives and include some 50,000 professionals (accountants, doctors, lawyers, university faculty) since 1905 (Campbell et al., 2020); (iv) the list of 2,782 doctoral dissertations completed by Chinese students in 113 American universities between 1905 and 1960, compiled by Yuan Tong-li in 1961.

5. Archives, personal papers and diaries introduce a more subjective and qualitative perspective on the students’ experiences. In addition to university archives, we rely more specifically on the personal papers of important liumei and American diplomats held at Columbia University and the Hoover Institution at Stanford University. Archival materials are compiled manually on the basis of the case studies pursued in the project. Any external contribution is welcome as well.

6. Second-hand literature. We also take advantage of any biographical information available in previous scholarship that can be converted into tabular data, such as the lists of members of the Chinese Educational Mission (1872-1881) compiled by La Fargue (1941) and Rhoads (2011).

1.2.2 Methodology

We process the data in three phases, each of which adds a further layer of information:

  1. Phase 1 (July 2022-January 2023): We start from lists of names and related attributes drawn from structured sources (directories) and datasets.
  2. Phase 2 (January-September 2023): We complete this basic layer with more complex information retrieved from less structured sources (who’s who, biographical dictionaries) and external knowledge bases (e.g., Wikipedia, Baidu, Geni).
  3. Phase 3 (September 2023-July 2024): Finally, we enrich the two previous layers with information drawn from unstructured sources (students’ journals, general newspapers). This final step aims to provide a finer level of granularity and to retrieve information on lesser-known students, who may have been active in their time but left few traces in contemporary who’s who publications and posthumous sources.
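
The layered logic of these three phases can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the project's actual pipeline: the function names (add_layer, conflicts), the attribute names, and the sample values are all hypothetical. The point it illustrates is that each layer adds values together with their source, so that disagreements between layers are kept visible instead of being overwritten.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

# attribute -> list of (value, source) pairs; conflicting values sit side by side
Record = DefaultDict[str, List[Tuple[str, str]]]


def add_layer(record: Record, attributes: Dict[str, str], source: str) -> Record:
    """Add one layer of attributes, keeping the provenance of every value."""
    for key, value in attributes.items():
        pair = (value, source)
        if pair not in record[key]:
            record[key].append(pair)
    return record


def conflicts(record: Record) -> Dict[str, List[str]]:
    """Return the attributes on which the layers disagree."""
    found = {}
    for key, pairs in record.items():
        values = sorted({value for value, _ in pairs})
        if len(values) > 1:
            found[key] = values
    return found


# Made-up sample values, one call per phase.
record: Record = defaultdict(list)
add_layer(record, {"name": "Example Student", "university": "Columbia University"},
          "directory (phase 1)")
add_layer(record, {"birth_year": "1890", "university": "Columbia University"},
          "who's who (phase 2)")
add_layer(record, {"birth_year": "1891"}, "newspaper (phase 3)")
print(conflicts(record))
```

In this sketch the university is confirmed by two sources and therefore does not count as a conflict, whereas the two birth years are flagged for manual review.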


We found it more efficient to proceed on a source-by-source basis to reduce the risk of inconsistencies and duplications. We rely as much as possible on natural language processing (NLP) to automatically extract the relevant information from full-text digitized sources. More specifically, we rely on four families of tools:

  1. Named Entity Recognition, Classification and Linking (NERCL) is commonly used for identifying the names of persons, organizations, locations, dates, creative works, and more complex entities such as titles and positions.
  2. Question Answering (QA) helps handle unstructured sources, particularly for retrieving sequential information from biographical sources (educational curricula, professional careers).
  3. Event detection is employed more specifically for retrieving information with a high degree of granularity and complexity (who did what, when, where, with whom) from newspapers and other unstructured sources (Blouin, 2022).
  4. More conventional rule-based methods are useful to extract information based on strictly defined ontologies and repetitive patterns (Key Word In Context, pattern matching, regular expressions).
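
As an illustration of the fourth family of tools only, the Python sketch below applies a regular expression to a directory-style entry and returns Key Word In Context (KWIC) snippets from running text. The pattern, the function names (extract_degrees, kwic), and the sample strings are our own assumptions, made up for the example; they do not reproduce the rules actually used for LMBD, and real directories would require patterns fitted to each source.

```python
import re
from typing import List, Tuple

# Hypothetical pattern for entries written as "degree, school, year".
DEGREE_RE = re.compile(
    r"(?P<degree>B\.A\.|B\.S\.|M\.A\.|M\.S\.|Ph\.D\.),\s*"
    r"(?P<school>[A-Z][A-Za-z .&]+?),\s*"
    r"(?P<year>(?:18|19)\d{2})"
)


def extract_degrees(text: str) -> List[Tuple[str, str, int]]:
    """Return (degree, school, year) triples found by the pattern."""
    return [
        (m["degree"], m["school"].strip(), int(m["year"]))
        for m in DEGREE_RE.finditer(text)
    ]


def kwic(text: str, keyword: str, width: int = 30) -> List[str]:
    """Key Word In Context: each hit with `width` characters on either side."""
    hits = []
    for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        hits.append(text[start:end])
    return hits


# Made-up sample strings.
entry = "B.S., Cornell, 1912; M.A., Columbia, 1914."
print(extract_degrees(entry))

sentence = "He returned to Shanghai in 1915 and joined the club."
print(kwic(sentence, "shanghai", width=5))
```

Rules of this kind are cheap to write and easy to audit, but they only fit sources with highly repetitive layouts, which is why they complement rather than replace the statistical tools listed above.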


Every data point created through automatic methods is manually checked and validated before being entered into LMBD. Sources that are not computer-readable (archival materials, low-quality reproductions of printed sources) are curated manually. Each time we import new data into the database, we identify and disambiguate duplicate records using Heurist’s built-in functions.
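
The general idea behind flagging duplicate candidates can be illustrated with the short Python sketch below. It does not reproduce Heurist's built-in functions, whose internals are not described here; it is a generic technique written under our own assumptions, with hypothetical function names (name_key, duplicate_candidates) and placeholder record identifiers. Romanized names are reduced to a key that ignores case, punctuation, diacritics, and word order, and records sharing a key are grouped for manual review.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def name_key(name: str) -> str:
    """Normalize a romanized name: drop diacritics, punctuation, case, word order."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    tokens = re.findall(r"[a-z]+", stripped.casefold())
    return " ".join(sorted(tokens))


def duplicate_candidates(records: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group record ids whose names share a key; groups still need manual review."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record_id, name in records:
        groups[name_key(name)].append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Placeholder record identifiers with different spellings of the same name.
records = [
    ("r1", "Koo, V. K. Wellington"),
    ("r2", "V.K. Wellington Koo"),
    ("r3", "Hu Shih"),
]
print(duplicate_candidates(records))
```

A key of this kind only proposes candidates: identical romanizations can hide different persons, and different transliteration schemes can hide the same one, so the final decision remains with the curator.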

1.2.3 Current Status

As of January 2023, the following datasets have been processed and are gradually being input into the database:

In the coming months, we will focus on students’ directories and who’s who publications. Periodicals will be handled later in phase 3 (late 2023-early 2024).

1.3 Structure


LMBD is composed of twelve tables, with the Person table as the connecting hub for all tables except the “Company statistics” table, which connects only to the “Company” table. The structure and content of each table are described in the next sections.

Fig. 1. LMBD structure schematic