Linking Methods

Our linking methods aim to make as few false matches as possible (minimizing type I errors) while recovering as many true matches as possible (minimizing type II errors).

Figure 1 shows a comparison of different linking methods according to their type I and type II errors.

Figure 1. Chart comparing type I and type II linking errors resulting from different linking methods.

This figure refers to a series of linking methods currently used or soon to be added to this project (e.g., the ABE method, machine learning method, and EM method). Details of these linking methods can be found in the paper “Automated Linking of Historical Data” by Ran Abramitzky, Leah Platt Boustan, Katherine Eriksson, James J. Feigenbaum, and Santiago Pérez.

Each method involves a tradeoff between the number of matches made and the accuracy of those matches, that is, between the true positive rate (TPR, the share of true matches that a method finds) and the positive predictive value (PPV, the share of a method's matches that are correct). Methods with a lower PPV produce a higher share of mismatches. Mismatches arise from challenges such as transcription and enumeration errors, mortality [1], under-enumeration, common names, and international migration between census years. The figure also documents that mismatches occur in linked datasets created by human linkers. Because the weight placed on sample size versus accuracy may differ with the research question, we urge users to familiarize themselves with the methods and select the linking algorithm that best fits their research design.
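The tradeoff between match rate and accuracy can be illustrated with a minimal sketch. It assumes a hand-verified set of true links is available for comparison, and it uses the standard definitions of TPR (true positives divided by all true links) and PPV (true positives divided by all links made). The function name `link_quality` and the record identifiers are hypothetical and are not part of the project's published code.

```python
def link_quality(predicted_links, true_links):
    """Return PPV and TPR for a set of predicted links.

    Each link is a (record_id_in_census_A, record_id_in_census_B) pair.
    All names and data here are illustrative assumptions.
    """
    predicted = set(predicted_links)
    truth = set(true_links)
    true_pos = len(predicted & truth)  # links that are both made and correct

    # PPV: share of the links made that are correct.
    ppv = true_pos / len(predicted) if predicted else float("nan")
    # TPR: share of the true links that were found.
    tpr = true_pos / len(truth) if truth else float("nan")
    return {"ppv": ppv, "tpr": tpr}


# Hypothetical ground truth: four people appear in both censuses.
true_links = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}

# A conservative method links few records, all of them correctly.
conservative = {("a1", "b1"), ("a2", "b2")}

# A permissive method links more records but makes two false matches.
permissive = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"),
              ("a4", "b9"), ("a5", "b5")}

for name, links in [("conservative", conservative), ("permissive", permissive)]:
    q = link_quality(links, true_links)
    print(f"{name}: PPV={q['ppv']:.2f} TPR={q['tpr']:.2f}")
```

In this toy example the conservative method has a higher PPV but a lower TPR than the permissive one, which is the kind of tradeoff a researcher weighs when choosing between a larger and a more accurate linked sample.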

Code and documentation for implementing each of these methods can be downloaded from our data page or found at this website. The code is also available as a GitHub repository.


[1] Missed links due to mortality are a larger problem when linking between two censuses conducted many years apart (e.g., matching from the 1850 Census to the 1940 Census), so links created over a long time span should be used with caution.