Introduction
Inexpensive computing and the mass digitization of records have enabled the emergence of innovative “big data” approaches to history and the social sciences (Bloothooft et al. 2015; Fourie 2016; Gutmann et al. 2018; Maxwell-Stewart 2016). A second wave of studies reduces reliance on a single source by linking together multiple series with “machine learning” techniques adapted from computing and information science (Feigenbaum 2018; Ruggles et al. 2018). The linking of independently generated data using sophisticated algorithms constitutes an important new source for the study of social mobility, intergenerational and early life influences on adult health, and many other topics. Methodologies used for the systematic linking of records have been carefully developed over the past 50 years (Christen 2012; Feigenbaum 2016; Fellegi and Sunter 1969; Ferrie 1996; Winkler 2006).
Census data afford a useful example. Historical census data, while not perfect, offer the most comprehensive and unbiased representation of many past populations (Hacker 2013; Thorvaldsen 2017). A well-designed census enumeration characterizes many aspects of a population in a representative manner. The most common method for identifying the same person in successive censuses is to identify one and only one record in both years with the same name, sex, birth year, birthplace, and a consistent marital status (Goeken et al. 2011). Records are connected using these time-invariant characteristics to minimize the risk of biasing the linked sample. If we were to rely on a characteristic that might change, for example occupation, the linked data would underrepresent those who changed occupation. Limiting the match criteria to time-invariant characteristics avoids this problem (Ruggles 2006).
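The unique-match rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the field names, and the function `link_unique` are hypothetical, matching is exact (no tolerance for spelling or age variation), and the marital status check is omitted.

```python
from collections import defaultdict

# Hypothetical field names for the time-invariant characteristics.
KEYS = ("name", "sex", "birth_year", "birthplace")

def link_unique(records_a, records_b, keys=KEYS):
    """Return (id_a, id_b) pairs whose key occurs exactly once in each census."""
    def index(records):
        groups = defaultdict(list)
        for rec in records:
            groups[tuple(rec[k] for k in keys)].append(rec["id"])
        return groups

    idx_a, idx_b = index(records_a), index(records_b)
    links = []
    for key, ids_a in idx_a.items():
        ids_b = idx_b.get(key, [])
        # Keep the link only when it is one-to-one in both directions.
        if len(ids_a) == 1 and len(ids_b) == 1:
            links.append((ids_a[0], ids_b[0]))
    return links

# Toy data: two indistinguishable Mary Barns records block a unique link.
census_1871 = [
    {"id": "a1", "name": "mary barns", "sex": "F", "birth_year": 1860, "birthplace": "Ontario"},
    {"id": "a2", "name": "mary barns", "sex": "F", "birth_year": 1860, "birthplace": "Ontario"},
    {"id": "a3", "name": "john ross", "sex": "M", "birth_year": 1850, "birthplace": "Quebec"},
]
census_1881 = [
    {"id": "b1", "name": "mary barns", "sex": "F", "birth_year": 1860, "birthplace": "Ontario"},
    {"id": "b2", "name": "john ross", "sex": "M", "birth_year": 1850, "birthplace": "Quebec"},
]
links = link_unique(census_1871, census_1881)
```

Because occupation and other mutable fields never enter the key, a change in those fields cannot cause a link to be missed.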
Three potential problems are associated with any implementation of this approach. There is a risk of “false positive” matching, or incorrectly connecting records of different people with near-identical characteristics. A second potential weakness is that the number of unique and exact matches can be small. A low rate of unique matching does not handicap studies of the entire population of a large country such as the United States, but research focusing on smaller countries or on particular subgroups may be constrained by the small number of linked records. A third potential weakness is unrepresentativeness. Are the linked records a balanced representation of the population from which they are drawn?
The three problems are interconnected, insofar as a solution for one problem may aggravate another. For example, relaxing the criteria for identification of the same name, sex, and birthplace will expand the number of links, but it also increases the incidence of false positive links and of records linked multiply (rather than uniquely). Alternatively, we might expand the number of linked records if we match using additional characteristics that change over time. Here, the risk is that our linked sample will overrepresent people whose characteristics, while mutable, do not in fact change. In this article we focus on the second trade-off. Can the benefit of a larger sample exceed the cost of lost representativeness? Our answer is a “qualified yes.”
Our Approach
We explore the selection bias generated by systematically linking “full count” census records of the entire Canadian population collected at 10-year intervals: 3.5 million records in 1871 rising to 5.4 million in 1901. First, we identify the same person in multiple enumerations using a small set of time-invariant individual characteristics (birth year and place, sex, name) to minimize bias from the linking process (Antonie et al. 2014, 2015). Our method, which uses an initial set of known or true links and the classification of all possible matches with support vector machine software, is broadly representative of a commonly used approach to historical record linkage (Christen 2012).
With this approach we can identify a unique match in 1881 for less than one-fifth of the 1871 Canadian population. More than one-half of the 1871 records are linked multiply. By this, we mean the 1871 record is matched to more than one 1881 record or it is one of several 1871 records matched to a single 1881 record. Thus, more than half of the 1871 records cannot be used for longitudinal analysis because of an inability to identify which of the multiple matches is the right one. A possible next step would be to use additional information to select the correct match from among the set of multiple or potential links.
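The distinction between unique and multiple links can be made concrete with a short sketch. The input here is assumed to be a list of candidate pairs already accepted by some classifier; the identifiers and the function name `classify_links` are hypothetical.

```python
from collections import Counter

def classify_links(candidate_pairs):
    """Label each (id_1871, id_1881) pair as 'unique' or 'multiple'.

    A pair is 'multiple' if its 1871 record matches more than one 1881
    record, or if its 1881 record is matched by more than one 1871 record.
    """
    count_a = Counter(a for a, _ in candidate_pairs)
    count_b = Counter(b for _, b in candidate_pairs)
    return {
        (a, b): "unique" if count_a[a] == 1 and count_b[b] == 1 else "multiple"
        for a, b in candidate_pairs
    }

pairs = [
    ("a1", "b1"), ("a2", "b1"),  # two 1871 records matched to one 1881 record
    ("a3", "b2"),                # one-to-one
    ("a4", "b3"), ("a4", "b4"),  # one 1871 record matched to two 1881 records
]
labels = classify_links(pairs)
```

Only the pairs labelled `unique` are usable for longitudinal analysis without further information.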
We take this step of introducing broader criteria that will recover additional unique links among records that are matched multiple times. The example of Mary Barns in table 1 illustrates our inability to determine the correct link whenever more than one person (in either census year) shares a common age, birthplace, name, and sex. The only way to discriminate among multiple potential links is to rely on additional information. We describe this process with a term borrowed from computing science: disambiguation (ibid.).
We follow other researchers in the choice of family coresidence as a criterion for disambiguation (Fu et al. 2011, 2014). For example, there might be many records for a “Mary Barns” with the same age and birthplace, but only one of them lived in the same household as a sister named Anastasia. As long as Mary and Anastasia remained in the same family over the decade, we can disambiguate among the multiple matches. A generalization of this method, described in detail elsewhere (Richards 2013), relies on the Jaccard similarity measure. It roughly doubles the size of the linked sample with no change in the risk of false positive or mistaken matching (table 2).
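A minimal sketch of disambiguation by household similarity follows. It is not the procedure documented in Richards (2013); it assumes households are compared only as sets of coresident first names, and the `min_score` threshold and the rule that ties remain unresolved are illustrative assumptions.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 for two empty sets."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def disambiguate(household_a, candidates, min_score=0.5):
    """Choose among multiple candidate links using coresident names.

    household_a: set of coresident names for the earlier record.
    candidates:  dict mapping candidate id -> set of coresident names.
    Returns the best candidate id, or None if the best score is below
    min_score or is tied (both rules are assumptions of this sketch).
    """
    scored = sorted(
        ((jaccard(household_a, members), cid) for cid, members in candidates.items()),
        reverse=True,
    )
    if not scored or scored[0][0] < min_score:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None
    return scored[0][1]

# Mary Barns in 1871 lived with Anastasia and John.
mary_1871 = {"anastasia", "john"}
mary_1881_candidates = {
    "b1": {"anastasia", "john", "ellen"},  # score 2/3
    "b2": {"peter", "ann"},                # score 0
}
chosen = disambiguate(mary_1871, mary_1881_candidates)
```

The sketch also shows where the bias enters: a candidate can be chosen only if enough of the household persists from one census to the next.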
The disadvantage of disambiguation with family coresidence is that it creates a bias to families or portions of families that remain together. We can follow Mary and Anastasia from 1871 to 1881 if and only if they are the kind of sisters who continued to live together over 10 years. It will not be possible to use this method for families in which members do not remain together. Consequently, linked data that have been disambiguated in this way will overrepresent people in families that maintain stable patterns of coresidence. Our goal in this article is to assess the nature and extent of this deviation from representativeness.
The problem of representativeness plagues all longitudinal data because not everybody survives over time, and death tends to be selective. Modern longitudinal samples obtained from repeat surveys of an initially representative population typically lose representativeness through selective attrition originating with the death or disappearance of some subjects. The linking of records from historical or administrative sources shares this problem, and in addition has other complications that are peculiar to the historical source. As is well known, even if we restrict the historical link criteria to a small set of time-invariant characteristics, we still expect to be more successful in linking historical people with uncommon names and the kind of people who report more precisely to the census enumerator (Antonie et al. 2015; Bailey et al. 2020b; Ferrie 1996).
Representativeness and Disambiguation with Family Data
Our question in this article is the extent to which representativeness diminishes through disambiguation. During the late nineteenth century the Canadian government enumerated its population at 10-year intervals, at roughly the same time of year (April). Thus, we are able to consider changes in the population between 1871 and 1881, 1881 and 1891, and 1891 and 1901. In each case we compare the population at the beginning of the decade to the subset of people linked with time-invariant individual characteristics and to the set of links expanded through disambiguation.
As expected, disambiguation increases the number of unique links; the link rate changes roughly from 15 percent of records to nearly 30 percent (table 2). Any increase in sample size is welcome for a population as small as Canada, but the potential cost in terms of lost representativeness remains to be assessed.
Both methods underrepresent women relative to men in each of the three decadal intervals (table 3). On this point there is no difference between the basic and extended methods. Both methods overrepresent married people. Unexpectedly, this bias is smaller for the disambiguated data. Both methods also underrepresent Catholics, and again the disambiguated data are slightly closer to the full population, in each of the three decades.
These examples suggest that disambiguation with family coresidence is less damaging for representativeness than we might have expected. Indeed, the tendency to overrepresent married people is reduced through disambiguation. This effect is even clearer if men and women are examined separately (not shown). At first glance, then, the consequences for representativeness of increasing sample size through disambiguation with family coresidence are remarkably benign. Admittedly, we examine only visible characteristics. Even if disambiguation with family coresidence information does not change the composition of the population in terms of visible characteristics (gender, marital status, religion, and so on), the linked sample still might be different from the population at large in other characteristics that interact with time-varying information used in the linkage. For instance, those who lived with other family members could be less mobile, more risk averse, and less healthy. We have no ability to assess the impact of either linking strategy on characteristics not recorded by the census.
To investigate more closely we turn to a subset of the same data for which additional information is available. More complete information for each person is available in a randomly selected 5 percent of the 1871 and 1891 records. Here we examine only those links that fall within the rich 5 percent samples. Subject to this limitation we are able to consider a broader range of characteristics associated with the propensity to find a unique link. We begin with average link rates by select characteristics, and then report a multivariate logistic regression that identifies the contribution of several characteristics simultaneously to the odds of establishing a link.
In figure 1 substantial variation in the link rate by age is apparent for 1871–81. The pattern for 1881–91 is similar. The most conspicuous effect with the original linked data is an underrepresentation of children and young adults. The adolescent and young adult propensity to reinvent themselves as they leave home reduces the ability of both methods to identify them in the following census. Name changing at marriage for women is a particularly important influence although following young men, who do not change names at marriage, from one census to another is also a challenge. The lower rates of linking for people at ages 30, 40, 45, and 50 years reflect age heaping, which aggravates the problem of multiple links.
The effect of disambiguation varies considerably by age. Unsurprisingly, it is most efficacious at stages of the life course in which pairs or groups of people are more likely to remain together over the decade. Disambiguation has its biggest impact on link rates for young children, many of whom are still with their parents 10 years later. After disambiguation, the youngest children have the highest rate of linking in the population. The rate for adolescents improves but to a lesser extent, as expected. Disambiguation does not appear to reduce the effect of age heaping on the linking of people at select ages.
Additional detail in table 4 shows that with time invariant criteria we link 19 percent in New Brunswick during the 1870s against only 13 percent of the Quebec records. In the following decade, link rates range from 10 percent in western Canada to 25 percent in the small easternmost provinces. In addition to variation by gender, marital status, and religion, link rates differ by ethnicity and literacy. French-Canadians and those who did not read and write are harder to link. Again, we see that linking with time-invariant individual characteristics does not select from the population in a perfectly representative manner. Some of the biases are substantial.
Disambiguation broadly reproduces these biases for gender, ethnicity, religion, and birthplace. Bias by age increases with disambiguation especially for women. In contrast, a priori considerations would not predict the apparent changes in representativeness for province, marital status, and literacy. Overall, there is no obvious generalization about the impact of disambiguation on selection bias. Both strategies for linking, in both periods, deviate from representativeness in complicated ways, and the marginal impact of disambiguation is complex.
Some of these effects may be interconnected. It is worth examining whether the patterns of bias survive a multivariate analysis that considers multiple effects simultaneously. We report in table 5 and table 6 the association of different characteristics with the odds of finding a unique link. We estimate multinomial logit regressions with membership in the “individual only” and “disambiguated” samples against the baseline of the full population. We estimate separately by decade and by region (because of the differences identified in the preceding text). Coefficients are reported as odds ratios. A coefficient of 1.0 indicates no effect on the odds of being linked. A coefficient smaller than 1.0 indicates a characteristic that reduces the odds of a record being linked; a coefficient larger than 1.0 indicates a characteristic that increases them.
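For readers less familiar with this presentation, the relation between a logit coefficient and the reported odds ratio can be written as follows. The notation is an assumption introduced here for illustration: $j$ indexes a linked sample, $0$ the baseline category, $x_i$ the characteristics of record $i$, and $k$ a single characteristic.

```latex
\[
\frac{\Pr(y_i = j \mid x_i)}{\Pr(y_i = 0 \mid x_i)} = \exp\left(x_i^{\top}\beta_j\right),
\qquad
\mathrm{OR}_{jk} = \exp\left(\beta_{jk}\right)
\]
```

An odds ratio of 1.0 therefore corresponds to $\beta_{jk} = 0$, that is, a characteristic with no association with the odds of being linked.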
* Odds ratio moves .2 or more away from 1.0: increase in bias.
** Odds ratio moves .2 or more towards 1.0: decrease in bias.
The results provide additional detail about the pattern of demographic selections noted previously. Using time-invariant individual information, being male and Canadian born tends to increase the likelihood of being linked. The young, singles, French, and illiterate are less likely to be linked. The elderly are also unlikely to be linked, presumably because many do not survive into the next census enumeration. Having a high-status occupation reduces slightly the odds of being linked, although the effect generally is not significant. There is some variation in these effects by province, especially during the second decade.
Again, by comparing the first and second columns, we are able to consider whether disambiguation exacerbates or reduces the pattern of selection biases evident in the original linking with time-invariant individual characteristics. No simple test statistic permits a straightforward test of the hypothesis that disambiguation increases or diminishes bias. Accordingly, in table 5 and table 6 we identify with a single asterisk the coefficients that move 0.2 or more away from 1.0, indicating an increase in bias due to disambiguation. The 0.2 threshold, or a roughly 20 percent change in the odds ratio, does not derive from formal statistical reasoning; rather it is a heuristic measure of differences that seem large enough to matter (in the spirit of Ziliak and McCloskey 2004). Coefficients that converge toward 1.0 by a similar magnitude, signifying that bias diminished as a result of disambiguation, are reported with two asterisks.
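The asterisk heuristic can be expressed as a small function. This sketch reflects one reading of the rule: it compares the distance of each odds ratio from 1.0, so an odds ratio that crosses 1.0 is judged only by how far it ends up from 1.0. That treatment, and the function name `flag_change`, are assumptions.

```python
def flag_change(or_individual, or_disambiguated, threshold=0.2):
    """Flag a change in bias between the two linking methods.

    Returns '*'  if the odds ratio moves at least `threshold` away from 1.0,
            '**' if it moves at least `threshold` toward 1.0,
            ''   otherwise.
    """
    before = abs(or_individual - 1.0)    # bias with time-invariant linking
    after = abs(or_disambiguated - 1.0)  # bias after disambiguation
    shift = after - before
    if shift >= threshold:
        return "*"
    if shift <= -threshold:
        return "**"
    return ""

# Illustrative values only; they are not taken from table 5 or table 6.
flags = [
    flag_change(1.1, 1.5),  # further from 1.0
    flag_change(0.5, 0.9),  # closer to 1.0
    flag_change(0.8, 0.9),  # change smaller than the threshold
]
```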
By this metric, a majority of the coefficients do not change as a result of disambiguation. Only 9 of the 21 rows in table 5 and 10 of the 40 rows in table 6 see a change in the estimated coefficient larger than 0.2. Of those that do change, disambiguation makes it even more likely to link the Canadian born, that is, there is an increase in the overrepresentation of those born locally. In contrast, the underrepresentation of singles is reversed for the 1880s (not for the 1870s). Underrepresentation of those reporting French ethnicity diminishes in Quebec (where they are a majority of the population) and increases in Ontario (where the French are a minority). Overall, some biases are magnified while others are diminished as a result of disambiguation. The ability to link younger children and the middle aged benefits the most.
In summary, disambiguation does not change the extent of bias for a majority of the comparisons. Where we do see the changes, the effects are rather diverse, and the patterns differ by province. There is no obvious basis for a generalization along the lines of increasing or diminishing the problem of selection bias in linked data as a result of disambiguation.
Observations
Several observations emerge from this brief review of representativeness after linking Canadian census records in a conventional way and then disambiguating with coresident family members. Variations on the particular technique used for linking would produce slightly different results, but the patterns are unlikely to differ qualitatively.
1. Even the most parsimonious linking may inadvertently generate a selection bias that would prejudice the testing of some hypotheses. This is because people who can be followed from one census to another are somewhat atypical even if the census is a perfect representation of the population and the criteria for linking are unbiased.
2. The patterns of bias are complicated and not easily predicted. A number of factors can be seen to make it easier or harder to establish unique links from one census to another. Some of this selectivity originates with characteristics and imperfections in the census. Such influences would include, at a minimum, the size of population sharing a characteristic (e.g., birthplace), age at leaving home, and any nonrandom imprecision with which information is reported.
3. Disambiguation of multiple links increases sample size markedly. While the additional observations are useful, there is a cost in terms of added bias. Fortunately, the marginal increase in selectivity is less severe than anticipated. In some important respects, selectivity is diminished. Disambiguation also helps to reduce the rate of false positive errors (not reported here) without a marked aggravation of selection bias, especially for adults.
4. Reweighting is a useful strategy to mitigate the effect of nonrepresentative linking (Bailey et al. 2020a). Disambiguation is helpful in this regard because it expands the number of observations in each cell and thereby enables more precise parameter estimates.
An example illustrates the final point. As mentioned already, we have linked all records in the 1871 Canadian enumeration to all records in 1881, all 1881 records to 1891, and all 1891 to 1901. The use of time invariant characteristics yields more than half a million linked records in each interval, and more than a million records after disambiguation (table 2). Of course, they are not the same people in each decade. The set of people who can be found in each of the four enumerations allowing them to be followed over a full life course is much smaller.
We use these fully linked records in a separate paper to study social mobility, by comparing the occupations of fathers of boys who in 1871 had not yet entered the labor market against the sons’ occupations as adults in 1901 (Antonie et al. 2020). We ignore women because their occupational reporting was inconsistent in this period. In 1901 the sons were roughly the same age, on average, as their fathers had been in 1871. Thus, we compare father and son at similar points in the life cycle. These restrictions are desirable for a study of social mobility, but the sample is reduced to 12,315 records using links established with time invariant characteristics and 22,357 records if we also rely on disambiguation.
Both sets of records are small. The descriptive review confirms that neither set of records is fully representative and that both require reweighting in any analysis. Happily, our focus on a precisely defined demographic group removes some sources of nonrepresentativeness. Given the biases identified in the preceding text, we reweight by province, French ethnicity, and whether or not the individual has left his province or country of birth by 1901. The question we now ask is whether the cell sizes suffice for a credible reweighting.
In table 7 we report the distribution of cell sizes for boys who can be followed over 30 years, using the two methods, with cells defined by province (in 1871), French ethnicity, and mover versus stayer. Some cells have a large number of records, for example boys in each province who report the dominant ethnicity of the province and do not move. Other cells have as few as two records. The median cell size is 149 records with time-invariant linking rising to 246 records after disambiguation. The number of cells with fewer than 100 records falls from six to three through disambiguation.
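Cell-based reweighting of this kind can be sketched as follows. This is a generic post-stratification weight (population share divided by linked-sample share), offered as an illustration rather than as the procedure used for table 7; the cell labels, the toy counts, and the function `cell_weights` are hypothetical.

```python
from collections import Counter

def cell_weights(population_cells, linked_cells):
    """Return a weight per cell: population share / linked-sample share.

    Each argument is a list with one cell label per record. Cells that are
    present in the population but absent from the linked sample cannot be
    recovered by reweighting and receive no weight.
    """
    pop = Counter(population_cells)
    linked = Counter(linked_cells)
    n_pop = sum(pop.values())
    n_linked = sum(linked.values())
    return {
        cell: (pop[cell] / n_pop) / (count / n_linked)
        for cell, count in linked.items()
    }

# Cells defined by (province in 1871, ethnicity, mover or stayer).
ontario = ("Ontario", "non-French", "stayer")
quebec = ("Quebec", "French", "stayer")

population = [ontario] * 4 + [quebec] * 4  # equal shares in the population
linked = [ontario] * 3 + [quebec] * 1      # Quebec cell underrepresented

weights = cell_weights(population, linked)
```

In the toy data the single linked Quebec record receives a weight of 2.0, which illustrates why very small cells are a concern: the fewer the records in a cell, the more any estimate for that cell depends on each one.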
The usefulness of reweighting is limited by the standard errors implied by the original cell sizes. There is no obvious generalization about the size needed to obtain parameter estimates sufficiently precise for hypothesis testing. An appropriate threshold size will depend on the variability of the underlying data and the research question being examined. The burden of small sample size obviously weighs more heavily for research that might focus on the smaller provinces of New Brunswick and Nova Scotia. Even for Ontario, however, if we wish to compare the mobility of French-origin men with some other group, we will be driven to reweight cells with a relatively small number of observations. The larger, disambiguated sample diminishes this problem, although it does not eliminate it.
In our example, even though we begin the analysis using the entire Canadian population and we limit the reweighting to a small number of categories, the number of linked records is small enough under either linking strategy to reduce the precision of parameter estimates for a number of cells. Many other applications will have fewer records and/or more complicated reweighting than our example. For all of them, as for us, the problem of undersized cells is more severe if we are limited to records linked with time-invariant criteria. The obvious conclusion is that, if we are going to have to reweight anyway to mitigate the effects of selective linking, then a larger sample achieved through disambiguation is preferable.
Concluding Comments
In this article we have examined Canadian census records. Would the same conclusions emerge from a similar treatment of US and British census records? Of interest would be any differences in enumeration practice that increased or diminished the incidence of people reporting the same name, age, and birthplace. In fact, census practice in the three countries was broadly similar (Dillon 1997, 2000). Broadly similar demographic detail was requested in each country. The enumeration principles and operating procedures of the Canadian census were heavily influenced by British and American practice.
To the extent that national censuses differed in ways that might affect linkage outcomes, the differences are likely to be small. Detail in the British census may have been more precise because the enumeration was de facto rather than de jure. The de jure principle of enumeration used in North America permitted the recording of information for people absent at the time of enumeration and possibly having departed several months earlier. Indeed, information about some individuals was reported by people who were not family members and not closely familiar with them. Thus, the North American enumerations may harbor a greater incidence of imprecision and error, which in turn could lead to an increased rate of multiple records.
The granularity with which information was reported also influences the incidence of duplication using name, age, and birthplace. In Britain, despite some variation in the reporting of birthplace, there was a strong tendency to identify local communities and individual parishes, which for the most part had smaller populations than the states and provinces used for reporting in the North American censuses. The North American enumerations also include a higher proportion of foreign born, for whom birthplace typically was reported as an entire country. For all these reasons, the North American census is likely to have been less precise, and strategies for disambiguation more important, than for British data.
The small Canadian population makes disambiguation more important than it is for national-level research in a large country such as the United States. A wide range of Canadian analyses will be possible with disambiguated but not with standard linked data, simply because of the constraint of sample size. Even research about the United States, if it targets states, regions, or subpopulations defined in some other way, would find it useful to increase the size of linked samples through some kind of disambiguation.
A strategy for reweighting ameliorates the concern for bias on characteristics visible in the census, but it does not help with any bias in characteristics not recorded in the census. For example, disambiguation inevitably is less helpful for people at an age for which new households are being formed. Disambiguation raises sample size by a smaller proportion for this group. More importantly, those who can be disambiguated with continuous coresidence will differ from other young people in ways that are invisible to the researcher, to the extent that early marriage is selective on characteristics not reported in the census.
Disambiguation may be less useful for research questions focusing on young people than it is for analysis targeting very young children or mature adults, whose sample mass can be expanded to a greater extent and with limited additional selection bias. More generally, we cannot expect to investigate family composition after disambiguating with family coresidence, just as social mobility cannot be examined if occupation is used to link or to disambiguate the data, and spatial mobility cannot be examined credibly if remaining in one place is a criterion for establishing a match. Recognition of these biases will limit some kinds of longitudinal research, although it may also suggest new research possibilities.
Clearly, there is no single preferred method of constructing longitudinal data for all research questions. Rather, individual investigators will benefit from a choice of link strategy that relies on criteria most appropriate for their own research projects. If care is taken to match link criteria with the hypotheses being examined, disambiguation will deliver larger samples and enable research that otherwise would not be possible. Fortunately, the advance of computational power makes custom linking by individual researchers a realistic possibility.
Acknowledgments
We gratefully acknowledge financial support of this research from the Canadian Foundation for Innovation, the Natural Sciences and Engineering Research Council of Canada and the Social Sciences and Humanities Research Council of Canada. The paper has benefited considerably from comments by participants at the 2017 European Historical Economics Society conference; specialized workshops at Northwestern University (2019), the University of Guelph (2018), and Cambridge University (2016); and the anonymous referees of this journal.