1 Introduction
Within American politics, racial identification is a key variable across a wide range of subfields, though individual measures of racial identification are not always readily available. To address the lack of data on racial identification, past research first turned to ecological inference (see King 1997; Robinson 1950). Marked improvements followed in the form of “Bayesian improved surname geocoding” (BISG), which uses an individual’s name and location to impute that individual’s race (Elliott et al. 2008).
In their 2016 Political Analysis article, Imai and Khanna (2016) extended the utility of BISG: their R package wru made the method easier to use for a wider field of scholarship while simultaneously reducing imputation error and bias in estimating an individual’s racial identification. Their framework has been applied to a variety of research questions where individual-level racial data are difficult to obtain (Alvarez, Katz, and Kim 2020; Edwards, Esposito, and Lee 2018; Signorella 2020). Table 1 highlights that variety and further demonstrates the range of geocoding and nongeocoding methods used to implement BISG.
While imputing race is more accessible than ever, two notable limitations remain unaddressed. First, the trade-offs in accuracy when employing varying levels of geography to estimate racial identification using wru are unclear. Second, geocoding, which involves turning address data into coordinates (Swift, Goldberg, and Wilson 2008), can be costly in both time and resources and results in substantial missing data (Amos and McDonald 2020). Barriers to estimation arising from geocoding costs are most notable when using the voter file format employed by Imai and Khanna (2016) at the census tract level and below.
We address these issues by extending wru in two ways. First, we introduce a new level of geography in the form of ZIP codes as a means to conduct BISG without geocoding. Second, we conduct BISG for every level of geography available in wru, as well as for ZIP codes, to test the accuracy of racial estimates across geographic levels. We test the accuracy of each level using the voter file of Georgia, one of seven states whose voter files include the self-reported race of registered voters, by examining accuracy differences by race for a combination of geocoded units (tracts and blocks) and nongeocoded units (surname alone, county, and ZIP code). Overall, we find that the accuracy of racial identification estimates is effectively indistinguishable between census tracts and ZIP codes, with ZIP codes emerging as the preferred alternative given that they reduce data missingness and avoid geocoding altogether. However, when estimating Hispanic, Asian, and “other” races, census block level estimates are preferred.
This letter contributes to existing work utilizing BISG in two ways. First, we clarify the trade-offs of using various levels of geography, such as census blocks, census tracts, and counties, when estimating racial identification using BISG. Second, we extend the accessibility of BISG to researchers through the introduction of ZIP codes in BISG estimation. The evidence presented here demonstrates that ZIP codes, as the smallest unit of publicly known geography, can meet existing levels of accuracy without the added need for geocoding. This allows researchers to simplify the estimation of racial identification by using nongeocoding methods first and more costly geocoding methods only when necessary.
2 Validating Different Levels of Geography
To clarify the trade-offs in accuracy when estimating racial identification with BISG across a variety of geographies, we predict the probability that an individual is of a given race using five approaches. Specifically, we conduct a BISG analysis using the nongeocoded methods of surname only, county, and ZIP code and the geocoded methods of census tract and census block. To test the accuracy of each method, we predict the race of each voter in the Georgia voter file and validate our prediction using the self-reported racial identification data found in the same source. The Georgia voter file contains 7,346,219 records, with 3,123,112 unique addresses (Clark, Curiel, and Steelman 2021).
We introduce ZIP codes to BISG analyses given that they are the smallest unit of publicly known geography and therefore do not need to be geocoded for use. There are approximately 30,000 ZIP codes in the United States. While there is some variance in the population of individual ZIP codes, the overall population distribution is strikingly similar to the population distribution of census tracts (Curiel and Steelman 2018). The Georgia voter file includes a ZIP code for each registered voter in the state. Although it is not uncommon for states to also include each voter’s county, counties have a much higher degree of population variance and significantly higher levels of racial segregation when compared to ZIP codes (Nall 2018; Nemerever and Rogers 2021).
For the purposes of our analysis, we create an R package called zipWRUext, which takes the framework of wru and supplements the existing structure to work with ZIP code level census and American Community Survey (ACS) data. This extension calculates the joint probability of racial identification given the ZIP code and surname of an individual. Furthermore, this package allows users to specify a given year from 2010 to 2018 with either census or ACS data to improve the imputation of an individual’s racial identification to align with the user’s research context. For the purpose of this analysis, we employ estimates for ZIP codes using 2010 census and 2018 ACS data. Missing data were restricted to the 2,065 registered voters for whom no ZIP code was available and represent 0.03% of the data.
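The joint-probability step described above can be illustrated with a minimal Python sketch. It assumes the standard BISG form of Bayes' rule, in which surname and geography are treated as conditionally independent given race, so that P(race | surname, ZIP) is proportional to P(race | surname) × P(ZIP | race). The function name and all probabilities below are hypothetical illustrations; they are not taken from zipWRUext, wru, or any census table.

```python
RACES = ("white", "black", "hispanic", "asian", "other")


def bisg_posterior(p_race_given_surname, p_zip_given_race):
    """Combine surname and ZIP code information with Bayes' rule.

    Both arguments are dicts keyed by race. The result is a dict of
    posterior probabilities that sums to 1.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_zip_given_race[race]
        for race in RACES
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no race has positive probability for this record")
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical inputs: a surname that is mostly held by White individuals,
# observed in a ZIP code where Black residents are relatively concentrated.
surname_prior = {"white": 0.60, "black": 0.30, "hispanic": 0.05,
                 "asian": 0.03, "other": 0.02}
zip_likelihood = {"white": 0.001, "black": 0.004, "hispanic": 0.001,
                  "asian": 0.001, "other": 0.001}

posterior = bisg_posterior(surname_prior, zip_likelihood)
for race in RACES:
    print(f"{race:>8}: {posterior[race]:.3f}")
```

In this made-up case the geographic information outweighs the surname prior, which is the kind of adjustment that a finer geography is meant to supply.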
The package wru includes two geocoding alternatives: census tracts and blocks. Regardless of which geographic unit is used, both incur the same standard costs in time and money to geocode. Amos and McDonald (2020) demonstrate the most recent advances in proper geocoding for spatial audit purposes and find that even powerful computers can take several hours to complete a full geocoding process and will still result in a missing rate of 1–3 percentage points (6). Swift et al. (2008) report that the five best geocoders at the time of writing, including ESRI, feature a missing rate of around 5%.
Although geocoded data can produce more accurate imputations of racial identification than nongeocoded alternatives, users continue to incur costs related to geocoding and must eventually use nongeocoded data to locate individuals who cannot be geocoded. In this letter, we used the ESRI 2013 street address and postal address geocoders to locate Georgia addresses, a process that took several hours for the 3,123,112 unique addresses. This method was unable to geocode 4.6% of addresses. In the end, the Georgia voter file placed voters in 3,113 unique census tracts and 156,301 census blocks. The two computationally demanding processes are the geocoding of addresses and the overlaying of those addresses onto census geographic data, followed by the importing of the census demographics at the block level. Overall, the geocoding process took 7.28 hr, as specified in Tables 3 and 4 in the Supplementary Material.
Using these geocoded and nongeocoded data, we use the wru package to impute the racial identification of all Georgia voters into the categories of White, Black, Hispanic, Asian, and other. We then bootstrap 10,000 draws, with samples of 1,000 in each draw, to determine the accuracy of each method relative to self-reported racial identification. For each draw, we store the actual number of individuals of each race in addition to the sum of the race probability estimates for each BISG level. Finally, we calculate the absolute difference between these BISG estimates and the actual counts to report the distributional difference as percent differences. This allows us to create a distribution of uncertainty as opposed to a static state-level population estimate.
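The bootstrap comparison can be sketched as follows. This is a minimal Python illustration rather than the authors' replication code: it assumes sampling with replacement, represents each voter as a hypothetical pair of self-reported race and a dict of imputed probabilities, and uses toy one-hot predictions so that the expected error is zero.

```python
import random
import statistics

RACES = ("white", "black", "hispanic", "asian", "other")


def bootstrap_abs_errors(records, n_draws=10_000, sample_size=1_000, seed=0):
    """Return {race: [absolute error per draw]} for one BISG method.

    records: list of (self_reported_race, {race: imputed_probability}).
    Each draw compares the actual count of a race in the sample with the
    sum of the imputed probabilities for that race.
    """
    rng = random.Random(seed)
    errors = {race: [] for race in RACES}
    for _ in range(n_draws):
        # Assumption: draws are taken with replacement.
        sample = rng.choices(records, k=sample_size)
        for race in RACES:
            actual = sum(1 for reported, _ in sample if reported == race)
            predicted = sum(probs[race] for _, probs in sample)
            errors[race].append(abs(predicted - actual))
    return errors


def one_hot(race):
    """Toy 'perfect' imputation that puts all probability on one race."""
    return {r: 1.0 if r == race else 0.0 for r in RACES}


# Hypothetical toy data: 60 White, 30 Black, and 10 Hispanic voters.
toy_records = (
    [("white", one_hot("white"))] * 60
    + [("black", one_hot("black"))] * 30
    + [("hispanic", one_hot("hispanic"))] * 10
)

# Small draw counts keep the demonstration fast.
errors = bootstrap_abs_errors(toy_records, n_draws=50, sample_size=100, seed=1)
medians = {race: statistics.median(values) for race, values in errors.items()}
print(medians)
```

With real imputed probabilities in place of the one-hot toy values, the per-draw errors form the distribution of uncertainty that the text contrasts with a single state-level estimate.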
3 Results
We report the detailed results of the 10,000 bootstraps in Table 3 in the Supplementary Material. Overall, ZIP codes, tracts, and blocks produce the most accurate results. Estimates for ZIP codes and census tracts are virtually identical when imputing White and Black racial identification, with the exceptions being marginal and not substantive. In addition, census blocks are the most useful geocoded geographic unit when imputing Asian and Hispanic racial identification.
Figure 1 visualizes the differences in accuracy for blocks, ZIP codes, and counties, using the absolute difference in count out of 1,000 from the reported race across the 10,000 bootstraps. By race, we see that the three plotted methods are highly clustered when estimating White racial identification, although counties are the least accurate. Blocks and ZIP codes reach similar levels of accuracy when estimating Black racial identification, with counties again underperforming. We also find that ZIP codes outperform counties when estimating Asian and Hispanic identification, although blocks continue to be more accurate in both cases. Within the bootstrapped data, blocks and ZIP codes are substantively the same for estimating White and Black racial identification.
ZIP codes and census blocks produce very similar levels of accuracy even though ZIP codes do not require complex geocoding. The median differences between ZIP codes and blocks for White and Black racial identification are $-1.22$ and 1.54 per 1,000 draws, respectively. However, the difference in medians between ZIP codes and blocks reaches 9.21 and 23.35 per 1,000 draws for Asian and Hispanic racial identification, respectively. As a comparison to the next level of precision used in BISG, tracts, the median differences between ZIP codes and tracts for White and Black racial identification are $-0.15$ and 0.12 per 1,000 draws, respectively. For Asian and Hispanic racial identification, the difference in medians between ZIP codes and tracts amounts to 0.05 and 4.81 per 1,000 draws, respectively. Therefore, blocks, tracts, and ZIP codes are effectively equivalent in accuracy for estimating White and Black populations. For Asian and Hispanic populations, block estimates are more accurate than those of both tracts and ZIP codes, with the latter two being substantively equivalent.
Given the cost in geocoding and processing time, which amounted to 7.28 hr in this analysis, the gains in reducing the median difference in error per hour are 1.27 and 3.21 for Asian and Hispanic racial identification, respectively. Such gains can be seen as substantive and worth the added investment of geocoding. For White and Black racial identification, the rates are $-0.17$ and 0.21. The gains for tracts over ZIP codes, in turn, are effectively zero despite the additional hours necessary to geocode.
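The per-hour rates follow from dividing each ZIP-versus-block median difference by the 7.28 hr of geocoding time. The short Python check below reproduces that arithmetic from the figures reported in the text; the variable names are illustrative only.

```python
GEOCODING_HOURS = 7.28

# Median difference between ZIP code and block estimates, per 1,000 draws,
# as reported in the text.
median_diff_zip_vs_block = {
    "White": -1.22,
    "Black": 1.54,
    "Asian": 9.21,
    "Hispanic": 23.35,
}

# Reduction in median error gained per hour spent on geocoding.
gain_per_hour = {
    race: round(diff / GEOCODING_HOURS, 2)
    for race, diff in median_diff_zip_vs_block.items()
}

for race, gain in gain_per_hour.items():
    print(f"{race:>8}: {gain:+.2f} per hour")
```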
4 Discussion of Applications
Our findings provide a path forward for the use of BISG in estimating racial group identification in a variety of new applications. Through our analysis of the Georgia voter file, we are able to confirm that the wru implementation of Imai and Khanna (2016) continues to perform exceedingly well, and the extension provided here will further its applicability. This letter can serve as a guide for future researchers trying to distinguish between cases where geocoding is necessary and cases where nongeocoding alternatives, such as ZIP codes, can be employed without sacrificing accuracy.
It is important to note two potential challenges that can characterize this type of analysis. First, modifiable areal unit problems (MAUPs) can introduce noise into any analysis that relies on geographic shape files (Duque, Laniado, and Polo 2018). As a result, the accuracy findings presented in this letter can be considered more conservative estimates. Furthermore, the noise introduced by MAUPs in the case of Georgia may not apply in the same way to other geographic contexts. Second, racial identification as a fluid concept can introduce error when using self-reported racial identification, especially among Asian and Hispanic individuals (Masuoka 2006; Masuoka, Ramanathan, and Junn 2019). Special attention should be paid when computing racial identification to ensure that imputations reflect the lived experience of the subjects being studied and account for the trichotomy of race, ethnicity, and nationality (Masuoka 2006).
Future work should increase the transparency of how analysts implement BISG when discussing their geocoding process, rationale, and geographic units. While we were able to identify the primary level of BISG geocoding in most articles citing Imai and Khanna (2016), it was often unclear what geographic unit was used and how missing data were handled. As noted in this letter, the trade-offs of using geocoded versus nongeocoded imputation methods come with significant costs to data missingness and accuracy. Transparency must be a central tenet of any research utilizing these methods.
ZIP codes are the superior alternative to other nongeocoded BISG processes. Furthermore, ZIP code estimates are effectively on par with estimates derived using census tracts without the added costs associated with geocoding. Should researchers find themselves in a situation where geocoding their data is either necessary or preferred, census blocks should be used as opposed to census tracts given their higher accuracy when estimating racial identification. County and surname-only estimates are error prone and should only be used when no other alternative is viable. However, we recognize that context matters; in cases where researchers are only estimating the racial identification of White and Black individuals, ZIP code estimates are effectively indistinguishable from block-level estimates while allowing researchers to avoid geocoding and spatial overlay costs.
We recommend that researchers incorporate ZIP codes into future BISG research. Researchers should use more computationally costly geocoded alternatives only when required; surname-only and county-level analyses should be used only when all other alternatives have been examined. Such sequential BISG predictions can robustly reduce estimation error when imputing the racial identification of individuals.
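One way to read this sequential recommendation is as a per-record fallback rule. The Python sketch below is a hypothetical illustration of that reading, not part of wru or zipWRUext; the field names and the `need_block_precision` flag are assumptions introduced for the example.

```python
def choose_bisg_level(record, need_block_precision=False):
    """Pick the geographic level to use for one record.

    record: dict that may contain "block", "zip", and "county" keys.
    need_block_precision: hypothetical flag for analyses, such as those
    estimating Asian or Hispanic identification, where geocoding pays off.
    """
    if need_block_precision and record.get("block"):
        return "block"    # geocoded, used only when required
    if record.get("zip"):
        return "zip"      # default nongeocoded choice
    if record.get("county"):
        return "county"   # fallback when no ZIP code is available
    return "surname"      # last resort


# Hypothetical voter records.
voters = [
    {"surname": "Smith", "zip": "30303", "county": "Fulton", "block": "1001"},
    {"surname": "Garcia", "zip": "30601", "county": "Clarke"},
    {"surname": "Nguyen", "county": "Gwinnett"},
    {"surname": "Jones"},
]

levels = [choose_bisg_level(v, need_block_precision=True) for v in voters]
print(levels)
```

Records that could not be geocoded fall back to ZIP codes rather than being dropped, which mirrors the point that nongeocoded data are eventually needed for individuals who cannot be geocoded.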
Acknowledgments
We would like to thank the two anonymous reviewers and the Editor, Jeff Gill, for their thoughtful comments and discussion.
Data Availability Statement
The replication materials for this paper can be found on the Harvard Dataverse at Clark et al. (2021). For privacy reasons, personal identifying information is redacted from the replication materials.
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2021.31.